Application Performance Monitoring (APM) with Elasticsearch, Elastic Stack (ELK Stack)


When you think about typical data streams
for monitoring applications, or for running an operational system around applications
that end up being deployed, logs are a very common data source, and a relatively passive one.  The application spews out logs depending
on what the developer wrote, and you end up capturing that data and being able to monitor it. Metrics are another relatively passive source:
there are APIs that the systems expose, whether it is NGINX or something else, and you end
up pinging them and being able to get their health.  But there is another data set that is as
valuable as the other two. And this data set is one we started to see emerge,
basically thanks to you, the community, as it grew and grew. We started to see APM-type data being
stored in Elasticsearch.  I think one of the best examples that I
can give is an open source project like Open Zipkin: one of the most popular
outputs for it was Elasticsearch. And we really took notice, and we started
to say, okay, it is really interesting what’s going on. We are working really hard to make
Elasticsearch a great engine, not only for text or for unstructured data, but for numbers,
for many different reasons that we have talked about historically, and maybe APM is an actual
place for us to go to. Almost eight or nine months ago, we joined
forces with a company in Copenhagen called Opbeat. Opbeat has developed a SaaS service for APM.  We are very excited about what the team
has done, especially the use of our technology to achieve some of the operational
aspects of running it.  And over the past few months we have worked
pretty hard on taking all of that technology and making it a feature in our
stack that you can go and download and run yourself.  This feature went GA about a month ago,
and I’d love to show you where we stand when it comes to APM data in the Elastic Stack. For that, I’d like to welcome Ron on stage. Ron? (Applause).

>>RON COHEN: Thanks, Shay. Hello, everyone.  My name is Ron Cohen, and I’m the tech lead
for the APM group at Elastic, and I’m extremely excited to be here today to show you what
APM on the Elastic Stack looks like. As Shay mentioned, it became generally available
with version 6.2 of the Elastic Stack, which got released just this month. When you open up Kibana, you will see this
sweet new APM tab on the left side. And, when you press it, it will immediately
show you how to get started with instructions right in Kibana.  And the way it works is, you have these
APM agents that you install into your application, just like you install any other package or
module or dependency.  And they instrument your application automatically
in order to give you this very rich, application-level performance information that helps you debug
your applications.  The APM agents talk to an APM server that
you run centrally, and that puts the data into Elasticsearch, and you can then consume
it using this UI I will show you in a second. So let’s actually try to set it up. So you download and install the APM server,
and then you run it.  So let’s try to do that. There we go. The APM server is running locally on my machine. Yep.  The next thing I need to do is, I need to
install an APM agent into my application. And let’s try it with a Node.js application
that I have here locally. So, all I need to do is install the Elastic
APM node module, like this. There we go. And then I need to put in a few lines of code
that configure the module that I just installed, like so. Because I have the APM server running on
localhost here on the laptop, I don’t really need that.  But here I put in a service name. So I will just give the service a name, let’s
call it opbeans-node, like this.  And then I need to re-deploy my application,
and you should see the data starting to stream into the UI.  Here we go. Now I should probably mention that getting
this automatic instrumentation right is actually very, very tricky, so we worked really hard
to make it simple.  All right. Let’s see if the data is coming in, yep.  So here I see the application that I just
set up.  I can dig into it here. So here I see three different tabs: The first
tab shows me information about incoming requests to my application.  So that’s Web Requests.  The second tab will show me any background
jobs that run in my application, and the third tab shows errors that are collected in my
application.  So we are also automatically collecting
errors that happen. So right out of the box, we give you response
times, we show you the average, the 95th percentile and 99th percentile response times, we show
you how many requests per minute your application is serving, split into the different HTTP
status codes, so you’ve got your 200s, 300s, 400s, and 500s here.  And then we have the table of endpoints
in your application. For each endpoint I can show you the average
response time, the 95th percentile, and how many requests per minute that particular
endpoint is serving.  We can dig into it, and here I have the
same two graphs filtered down for this particular endpoint. And then here I have a response time distribution
graph.  This is useful in order for you to see what
the distribution of your response times is. What typically happens is you have a bunch
of requests that are served really quickly, and then you have this long tail of requests
that take longer to serve.  And the idea is that I can press these different
buckets, and I will get a sample from that particular bucket. So a sample for which the response time fell
within that bucket that I selected.  So let’s pick a slow one, that one, for
example. So, looking at the sample, I can see the URL
that was requested, I can see the results, I can see the timeline here, which is really
interesting, because the timeline shows me what my application was doing trying to respond
to that incoming request. So the orange here is SQL queries, but we
will also show you any templates that were rendered, any cache calls, any calls to external
services that happened during the call to this particular endpoint. And it looks like the sheer number of SQL
queries is what is slowing down this particular — (laughter). The good thing is, I can open up and get more
information about what is actually going on. So each individual SQL query only takes about
eight milliseconds here, but because there are so many of them, it adds up.  You can see the actual SQL query that was
executed, and you can see the stack trace from where in your code this particular SQL
query was executed, and that makes it really easy for you now to go in and start optimizing.  So clearly there’s a SQL query in a loop
here somewhere that I need to go and fix. I can dig into more information about this
particular sample, like which server responded and information about the process. But I can also see the user that made
this particular request.  And, if I want to, I can tag individual
transactions. For example, with the customer ID, or something
similar. That makes it very easy for me to go in and
find particular samples for particular users. And now is a good time to remind you that
this is all just documents in Elasticsearch.  The UI that we are looking at is really
just showing you documents in Elasticsearch. There you go. And so you see, if you want, you can always
dig into the actual data in the Discover tab.  And that also means you can create your
own dashboards, you can start to correlate it with other types of data. I also wanted to show you the errors. So, because we get this very rich performance
data from your application, we get really rich information about errors that happen
as well.  And what we do is we group errors together,
so different instances of the same error get grouped together.  And we sort the groups by which one had the most
recent occurrence. And this makes it very easy for you to get
a good overview of what the errors are that are happening in your application. I can dig into a particular error here.  I can see the message, I can see the occurrence,
the frequency of this particular error, and here’s a particular occurrence of the error,
and I get a really nice stack trace that shows me the actual code that was executed. I will get more information, just like you
saw before, about the process and the service and, again, I can put user information in.  I can see which users experienced this error,
and I can tag the different transactions so I can see the customers, for example, that
experience particular errors. That’s it, thank you.

>>SHAY BANON: Thank you very much, Ron. (Applause).
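
A note on the numbers shown in the demo: the response-time figures Ron describes (the average, the 95th and 99th percentiles, and the response time distribution buckets) can be illustrated with a small TypeScript sketch. This is not Elastic APM or Kibana code. The function names, the nearest-rank percentile definition, the 100 ms bucket width, and the sample durations are all assumptions made for illustration only.

```typescript
// Illustrative sketch only: hypothetical helpers, not part of Elastic APM or Kibana.

interface ResponseTimeSummary {
  average: number;
  p95: number;
  p99: number;
}

// Nearest-rank percentile over an ascending-sorted array.
// (One of several common percentile definitions; chosen here for simplicity.)
function percentile(sortedMs: number[], p: number): number {
  if (sortedMs.length === 0) {
    throw new Error("no samples");
  }
  const rank = Math.ceil((p / 100) * sortedMs.length);
  return sortedMs[Math.max(rank, 1) - 1];
}

// Average plus the two percentiles mentioned in the talk.
function summarise(durationsMs: number[]): ResponseTimeSummary {
  if (durationsMs.length === 0) {
    throw new Error("no samples");
  }
  const sorted = [...durationsMs].sort((a, b) => a - b);
  const total = sorted.reduce((sum, d) => sum + d, 0);
  return {
    average: total / sorted.length,
    p95: percentile(sorted, 95),
    p99: percentile(sorted, 99),
  };
}

// Count samples per fixed-width bucket.
// Index i covers durations in [i * bucketWidthMs, (i + 1) * bucketWidthMs).
function distribution(durationsMs: number[], bucketWidthMs: number): number[] {
  const counts: number[] = [];
  for (const d of durationsMs) {
    const index = Math.floor(d / bucketWidthMs);
    while (counts.length <= index) {
      counts.push(0);
    }
    counts[index] += 1;
  }
  return counts;
}

// Made-up response times in milliseconds: mostly fast, with one slow request.
const samplesMs = [12, 15, 18, 20, 22, 25, 30, 35, 40, 950];
const summary = summarise(samplesMs);
const histogram = distribution(samplesMs, 100);

console.log(summary);
console.log(histogram);
```

With these made-up samples, nine of the ten requests finish in 40 ms or less and land in the first bucket, while the single 950 ms request sits alone in the last bucket and pulls the average up to about 117 ms. That is the long-tail shape Ron describes, and picking a sample from the slow bucket corresponds to the step in the demo where he selects a slow request to inspect.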
