Discover Implementation Best Practices, Monitoring & Diagnostics tools for your Azure applications


So I’m Bhavya Nag,
I work as a product marketing manager in the Azure
platform group and today I’m gonna talk to
you guys about Azure. So just a quick show of hands,
who here already has workloads and applications up and
running on Azure? Right, so a good number of you. So for those of you who already have
stuff up and running in Azure, this session will give you a good sense
of whether you have the right design principles already in place for
high availability and scalability. And also it will give you a sense
of some of the new tools and services we have for
end-to-end monitoring on Azure, and across your entire IT deployment, which might span Azure and
other clouds, or on-prem deployment as well. And for those of you who don’t yet
have workloads up and running in Azure, this will give you
a sense of the best practices of how you wanna implement and
manage your environment. So let’s get going. So here is a brief overview of
the stuff I’m gonna talk about. So I’m gonna briefly cover the different phases in the life
cycle of a typical Azure customer. That’s you guys. Then I’m going to deep dive into
two phases of the life cycle, the implement phase and
the manage phase. And as I mentioned,
as part of my presentation, I have a bunch of demos for some new
tools that we have, especially for monitoring and diagnostics. And also just to keep it interactive
I have a few quiz questions peppered around my presentation. And I have some cool Azure goodies
that you guys can take home with you, so keep an eye out for
those, all right. So this is what the typical life cycle of a customer
looks like in Azure. You start by discovering
the platform, learn more about what it can do for
you. I’m guessing all of
you guys are here so you already passed that stage. Then you envision what
your Azure workloads and applications are going to look like. And then you get into the stages
which are going to be featured in this session, implement and manage. And overarching all of
this is Azure Support. And we wanna make sure that we
deliver the best customer experience to you guys across all
phases of the lifecycle. So let’s do a double click into
the implement and manage phases and a brief overview of the kind of
stuff I’m gonna talk about and cover in this session. So when we talk about
the implement phase, the first thing I’ll talk about
is the balance of responsibility. So what that means is that from
Microsoft’s perspective we try to make sure that we deliver to you
the most reliable cloud platform. But having said that, service disruptions will
occur from time to time. And that’s true of any service. And there are some things that you
guys need to do as well to make sure that you’re setup for
success and make sure your applications
are resilient and scalable. Then I’m gonna talk about
some best practices and recommendations to help you guys
actually achieve that goal. And then I’m gonna talk a little
bit about our compliance story and the value proposition that
offers to you as customers. Then we’re gonna switch gears and
go into the last phase which is managing your Azure workloads and
your Azure applications. Here I’m gonna talk about some new
tools and services we have that make it very easy for you guys, to do
proactive and reactive monitoring, set up diagnostics, alerts,
notifications, all the good stuff. And I’m gonna also talk
about how you can leverage some self help tools that
we’ve enabled on the platform. And of course we’re gonna talk about
how to reach out to support when all else fails and
you have to reach out to get help. All right, with that let’s
dive into the implement phase. So I’m gonna begin by describing
some cloud service realities. So as I already
mentioned very briefly, even though from Microsoft’s
perspective, it’s our commitment to make sure that we deliver to
you a reliable cloud platform. From time to time
service disruptions and service impacting incidents like
outages are bound to happen. And that’s true of any service,
it’ll be true for any cloud provider. And there are a variety of
reasons why that would happen. So one of them is
software will have bugs. Of course we try to mitigate
this by making sure that we have rigorous change management
and safe deployment practices. We roll out any change in a staged
fashion in what we call rings of deployment which I’m gonna talk to
you guys about a little bit later. Secondly, hardware will fail. So we have tons of hardware that’s powering our Azure services
sitting all across the world. And just an industry-level statistic
here, according to research, 3% of any given hardware
installation will fail in a year. So what we do at Microsoft
to protect against this is engineer the software to
protect against this. And how we do that is we have
a bunch of machine learning and AI services which
are monitoring all of our Azure infrastructure across the world. And anytime we have a predicted
fault we have the processes, and tools, and people in place to
mitigate the impact of that. And lastly,
humans will make mistakes. And the answer to that of
course is investing in automation as much as possible. And that’s part of our
staged roll out story. And I’m gonna talk to you guys
a little bit about that as well. With that, let’s talk about
the balance of responsibility. So it used to be that in
the on premise world you were responsible for everything. Right from data governance and rights management,
to client endpoints, all the way down to physical infrastructure
like managing the physical host, the network environment, and
the actual data center, right? Now as you start using Microsoft’s
IaaS, PaaS, and even our SaaS offerings,
like Office 365, that balance of
responsibility gets more and more weighted towards Microsoft,
right? But having said that, there
are still a bunch of things that you guys need to do in order to
make sure that your applications are resilient, that you have high
availability and scalability. And that you are protected from
platform level outages and disruptions that might occur. We want to bring you guys
into the conversation so I want to talk to you guys about
how you can set yourselves up for success and
some best practices around that. All right, so here are some
high level principles to think about when you are designing for
the cloud. So first you wanna make sure that
you don’t have a single point of failure which can bring down
your entire application. So an example of that is having a
single instance virtual machine for running your entire application. So a good practice in that case
is to make sure you have multiple virtual machines in
an availability set and I’m gonna talk more
about what that entails. Secondly you wanna make sure that
the components of your application are loosely coupled. So make sure that you don’t have
a single monolithic application in a single virtual machine. But you wanna make sure that you
have different tiers in your application. So an example of that is
if you have a web app, then you want to think about
having like three different tiers. So like an application tier, a logic
tier, and a data tier, right? And you wanna make sure that each
of these tiers is set up for high availability individually so that
the entire application is resilient. And I’m gonna talk more
about that as well. Thirdly, and this is a key point of
differentiation when you compare things to the on-premise world. Build for scale out, not scale up. So it used to be in the on-prem
world, the solution to growing your application would be to just
throw more hardware at it, right? So you went ahead and set up a data
center and you took good care of it. You put in the best physical drives, the best network card,
redundant power supplies. You took good care
of the data center. And once your application scaled, and you were nearing the limits of
what the data center could do for you, the solution was to rip and
replace. Go in for a bigger data center. Now, the problems with
that approach are twofold. Firstly, costs don’t scale linearly
when you’re scaling up like that. If you have a data center
that’s twice the size, your costs are much more than
twice as compared to before. And secondly your applications
are still brittle, right. You still have that
single point of failure. So, in Azure,
you have a much more intelligent way of scaling out as
opposed to scaling up. So you can like build
automated scaling for that particular process
that needs to scale. And I’ll talk about how
you can achieve that. Process centrally, deliver locally. Now, if you have your data split
across multiple geographies in the world,
it can be a fairly complex process to get all of those bits and
put them all together. With Azure it’s a lot easier for you to manage all of that
from a central location. If you do need to do multi-region
deployments for whatever reason, we give you the capabilities
to achieve that. But hat would entail
higher complexity, and I’m going to talk about those
considerations as well. Lastly, automate,
automate, automate. People will make mistakes. Another industry level stat here, over 50% of data center level faults
are caused by human error, right. So it’s very easy to like think
about, hey, every time I need to make a change,
I’m just gonna make it, push it to the server and
I’m gonna do this every time. It’s fast, it’s easy,
it’s a bad practice. So whenever you have an operation
that has any amount of manual intervention, think about automating that. And I’ll give you some
examples of how you can do it. For example, for scaling with
virtual machine scale sets. All right, with that,
let’s talk about some tools and resources that you can leverage to
design for high availability and scalability in the cloud. So the first one is
availability sets. So this is how you
achieve high availability from the most basic
standpoint in Azure. So instead of having a single
virtual machine powering your entire application, you want to make
sure that you have at least two virtual machines that you put
inside of an availability set. Now what that does is it offers you
protection from faults and updates. Let’s talk about each
of those one by one. So faults, so when you have
two virtual machines that you put inside of an availability set, it makes sure that it puts
them in separate fault domains. And what that means is we make sure
that they have independent power and networking infrastructure
powering those machines. So anytime there is a physical
fault that might bring down one virtual machine,
the other one still stays up and your application is resilient. The second thing is update domains, so again, every time you
have two virtual machines or more in one availability set, we
from Microsoft make sure that every time we push out an update
to the platform, we won’t push it out at the same
time to both virtual machines. So if the update causes some kind
of disruption to your application, it will still be resilient because
at least some virtual machines will still be up and running. You also wanna make sure that
availability sets are tied to a role in your application. And what I mean by that is if
you have a multi-tier application, let’s say a web app with a web tier,
logic tier, data tier. You wanna make sure that you
put each of those tiers, all the virtual machines
powering each of those tiers, into a different availability set. And that would make your entire
application resilient because at least one virtual machine
in each tier would be up, regardless of any faults that
disrupt some of them. And you want to make sure that
you put load balancers in front of every tier to make sure that
your network traffic is routed appropriately. Also keep in mind that having at
least two virtual machines inside of an availability set
is a requirement for having a financially backed three nines five (99.95%) SLA. So if you just have one virtual
machine powering your application, you don’t get to take
advantage of that. Now, having said that, a lot of our
customers, when we talked to you guys, told us that it was near impossible
for them to migrate to Azure, simply because in their on-premise
environment they had a bunch of applications that were running on
single instance virtual machines. And in many cases it was too
costly or too complex for them to rearchitect those apps to a multiple
virtual machine environment. So, we’ve heard you guys, and we’ve recently made live our
Single Instance Virtual Machine SLA, which is financially
backed with three nines. The only requirement here
is to make sure that your Single Instance Virtual Machines
are tied only to premium storage accounts, and
not standard storage accounts. So, the premium storage
accounts are SSD. And if you guys have
mission critical apps, as a good practice you should
use premium storage anyway. The improvements that you get
in terms of performance and durability are orders of magnitude higher than
what you get with standard storage, and it’s all about the IOPS,
right? So you want to make sure that
you use premium storage for your mission critical apps. And the price differential is
like more than compensated for by the performance and
durability improvements. We are the first global
public scale cloud to offer a single instance
virtual machine SLA; our top two competitors do
not offer this as of now. But keep in mind that this is
still not a replacement for high availability, right. So you don’t get to take
advantage of the fault domains and update domains that you
get with availability sets. And also keep in mind that
your SLA is slightly lower, you don’t have three nines five (99.95%), but just three nines (99.9%). All right, let’s talk about scaling
out in an intelligent way. What you want to do for that,
is use virtual machine scale sets. So what this means is, that you have
the stamp of a virtual machine, and you can use the image of that stamp
to automatically create more VMs as your application scales. And then what you can do is you
can write scripts that say hey, if my network traffic
exceeds a certain threshold, deploy more VMs so
I can support the application and it doesn’t go down with
the increased load. And then when that network
traffic goes down you can throw away the additional VMs and
go back down. And all of this can be done easily
and in an automated fashion. And you can set up virtual
machine scale sets either using Azure Resource Manager templates or you can do it from
the Azure portal as well. Right, now let’s talk
about deployment scale and multi-region deployments. Now everything that
we’ve talked about so far, availability sets and
virtual machine scale sets, they make your applications
scalable and highly available. Unfortunately, if there is a geo
level event, let’s say an act of God or some kind of geopolitical event
that brings down an entire region, then your application may
still not be resilient. And if you have needs for
resilience beyond that, then you want to think about
multi-region deployments. And how you will actually go about
doing it is making sure that you take advantage of paired regions. So paired regions give you
isolation from faults. And what that means is that for
every pair that we have, we make sure that those data centers
are at least 400 miles apart. The only exception to that is
Japan because it wasn’t physically possible for us to separate our
data centers by 400 miles but it’s true in every other case. So what that means is that
typically every pair of those data centers will be separated
by different climate zones. In many instances they are in
different flood plains and even in different tectonic
zones in some instances. So you’re protected from
acts of God and earthquakes, those kinds of things. Another thing that we offer
with paired regions is something called sequential updates. So it’s kind of like update
domain but at a geo-level. So we make sure that if we push
out updates, we won’t push them out to both regions, in the same
pair at the same time, so that your applications don’t go down,
because of an update-related fault. And then data residency. So for
compliance reasons, some of you may want, or may have to, make sure that your data resides in
say Germany, or China or whatever. And in those instances you can
still take advantage of our multi region deployments and
have data residency. Okay, all right, so quick quiz
question, even though I already gave you a brief preview of this slide
but we’ll go ahead with it anyway. Can anyone here tell
me how many regions is Azure present in across the world? Yeah, all right, just make
sure you collect your koozie, it’s like an actual beer sweater,
after the presentation, it’s perfect for this weather. All right,
Another thing you guys wanna do is make sure that you don’t
have a single point of failure. Now, even if you’re
using availability sets, storage can potentially still
be a single point of failure. And what I mean by that is, if all
of the virtual machines that you’ve put in a single availability set are
linked to a single storage account, then unfortunately
a storage-level fault can mean that your entire
application goes down. Right, so as a best practice you
want to make sure you do two things. Firstly, make sure that not all
virtual machines in one availability set are linked to a single storage
account. So if you have to share storage accounts, make
sure you’re doing it across some virtual machines that might be in
different tiers of your application, in different availability sets, and not all virtual machines in one availability set share
a single storage account. And secondly like I mentioned
before, especially for your mission critical apps,
make sure you use premium storage. And, as I mentioned before, the performance and
durability improvements that you get are orders of magnitude higher
compared to standard storage. Now, having said that we realize
that that is not an ideal scenario for you guys, you don’t want to
be in a situation where you’ve configured for high availability for
your virtual machines, but then a storage-level fault
wipes out everything. So we’ve launched a feature
called managed disks, it’s in preview right now, and it’s going
to be generally available soon, so look out for that announcement. And what that does is, it basically partitions your storage
account into multiple fault domains, kind of like what happens with
availability sets anyway. So this will ensure that a
storage level fault will not bring down your entire availability set. Another thing that you guys should
think about doing is using CDNs. That stands for
content delivery networks. And this is especially true for applications where you might be
transmitting a large amount of data. Especially video and audio content. So what CDNs do basically is
they make sure that there is a local cache of your data
in sites all across the world. So if you use Azure CDN, you have the option of using
either Verizon or Akamai. And both of them have thousands
of sites across the world, right? And so the data is stored in
multiple locations and your customers can access it from a place
that’s closer to where they’re from. And so that, of course, gives huge
advantages in terms of latency. There’s also some interesting
security benefits, like protection from DDoS
attacks and origin obfuscation. Those are not the main
features of my presentation. I wanna talk about high
availability and scalability. And how CDNs help you achieve
that is when your customers are pinging your applications,
they’re actually getting the data first from the CDNs and only
then do they ping your machines. So your machines are able to scale
easier and cater to a larger load than they would have been able
to if there weren’t any CDN. I wanna talk to you guys about
the resiliency spectrum. So all of the things I’ve
talked about, basically, involve making choices, right? Do you wanna have a single instance
virtual machine deployment or multiple virtual machines? Do you want to have standard
storage or premium storage? Do you want to have CDNs or
not, right? And we realized that it might be
overwhelming to think about all of these choices. So how you should think about is, where do you need to lie along
the resiliency spectrum? And, as you think about that,
there are three key factors. So first is RPO and RTO. RPO stands for
recovery point objective. RTO stands for
recovery time objective. These are fancy terms, but
basically what it comes down to is, you wanna think about how many down
minutes can you afford to have for your applications in any
given time frame, right? So, that’s the first factor. Second factor, of course, is cost. As you go from single instance
virtual machines to multiple virtual machines. Or as you go from single-region
deployments to multi-region deployments, your cost
will obviously go up. So that’s another consideration
that you guys should think about. And lastly, complexity, of course. As you make these changes, your application is bound
to get more complex. So you wanna think about how much
complexity are you comfortable managing across your
IT organization. And then, considering all these
factors, it boils down to an internal conversation that you
guys have among your teams or with your external stakeholders and
decide on where you want to be. Now with that, let me transition to a brief
video from our resiliency team.
>>Keeping applications up and running is of utmost importance
to any IT operation. There’s time, money, customer
satisfaction, or even individual or public safety at stake.
>>Microsoft is focused heavily
on making sure that customers can improve their uptime. We have a dedicated resiliency team
that spends 100% of our time working on,
how do we make that better? When you build on top of Azure,
you get the benefit of all the engineering knowledge that
Microsoft’s Cloud has brought to bear for every customer and
every customer’s needs. And as people are moving their
mission critical applications onto Azure, uptime is their business.>>So Office Timeline is really
a project visualization tool. And we started out as
an on-prem application. Over time, what we’ve realized is if
we wanna really scale the business, we have to move into the Cloud. So we started out with one
app service and one database. And since that time, we’ve
organically grown to about a hundred different resources in Azure. Any amount of downtime
means revenue loss for us. And I can’t remember the last
time I’ve had to worry about reliability or scalability.>>We really care about making
sure that your uptime increases. There’s a sign that
sits in my office and it says, we only win when
the customer does, and it’s true.
>>I would encourage everyone to
take a look at azure.com/resiliency, to get the expert guidance and best practices needed to
build a resilient Cloud. [MUSIC]
>>All right, so make sure you guys
check out azure.com/resiliency. All of the best practices I’ve
shared so far are great canonical examples, but
they’re not a comprehensive list. So there’s a bunch of other
stuff that you guys can do, especially in the network layer and
with other services. And if you go to
azure.com/resiliency, it’s gonna give you
a complete checklist, and best practices and
recommendations for how you can architect your
Azure applications for high availability and resiliency. All right, with that, let’s switch
to our compliance portfolio. So as Jason Zander mentioned in
the keynote presentation yesterday, we have the largest compliance
portfolio in the industry. So this makes it very easy for
you guys to get up and running. Especially if you have compliance
requirements that are specific to certain industry verticals, or
generic compliance requirements around safety and
security of your applications. Now, the great value proposition for
you is that, if you were to go ahead and build and obtain
these certifications yourselves. It would be a huge exercise
in terms of complexity and how time consuming it would be and
the effort it would require. So by renting from us, you can
have that right out of the box, instead of building it from scratch. And we continue to invest in
our compliance portfolio. So the most recent one
we have is ISO 22301. So this is a premier standard for
business continuity. And basically what this means
is that we have the people, the processes, and the tools in place to make sure that
we can prevent, mitigate and recover from events that might cause service
impacting outages to our platform. All right, let me talk to you
guys about a new feature in the Azure Portal that we launched
sometime last year in preview. So this is Public Preview,
which means that all of you guys can actually access it in your
Azure portal right now. So this gives you actionable
recommendations to improve resource availability, security and
gives you some recommendations to improve performance and
manage costs better as well. And the great thing is, it’s
specific to your Azure resources in the regions and environments and subscriptions that you have
configured in your Azure portal. So think about it as your
personalized Cloud consultant. And as I mentioned before,
it’s launched in Public Preview, and it continues to get
better as we get more and more data from the resources
that are running on all your applications and
environments across the globe. So here’s a snapshot of
how it works in action. So here’s an example of some cost
recommendations I’m getting to lower the cost of my applications. So here, as you can see,
it’s telling me that I have multiple SQL databases as
part of my Azure environment. And I can potentially save money by
combining them into an Elastic Pool. So an analogy of that is,
if there’s four of you who are part of a family and you
have individual cellular plans, it obviously is more cost effective for
you guys to switch to a family plan. And that’s pretty much what this is,
right? And the cool thing is
it gives you an idea of the estimated monthly
cost savings as well. So make sure to check out this
feature and see if you get some insights into how you can improve
across availability, security, performance, and
even lower your costs. Right, so I briefly talked about
our staged deployment program. So we have what we call
the Azure Canary Program. This is our
Early Updates Access Program. And basically,
any change that we make on the Azure platform goes through
a rigorous, staged rollout process. So across rings of deployment. So the first one is our
internal DEV/TEST environment, where we make sure that everything’s
working fine, nothing’s breaking. And once we are confident of that, we roll it out to our
internal stage environment. And once we are comfortable
there as well, then we sort of roll it out to
what we call the Canary Ring. And this is where we would like
to invite you guys to join and take part. So think of the Canary Ring
as your pre-production environment where you
can get early access to tools and features and
functionalities, which are going to be generally
available in the upcoming weeks. So as an action item, you guys
can each send an email to that email address on the top right,
[email protected] And we’ll tell you more about the
Canary Program and how you can join. And, of course, once we have
feedback from you guys who are in the Canary Program and
we’ve implemented the changes, which we might have to do in order
to make sure everything’s up and running properly, we roll it
out to a worldwide deployment. And there, again, we have
a flighting process, where we make sure we are doing it in clusters, or
individual fault domains, individual update domains, and then individual
geos for paired regions as well. Right, so everything I’ve talked
about so far has dealt with individual services in Azure and
the underlying technology. But we also wanna have
a conversation with you about some out-of-the-box solutions
that we offer. So think about these as being closely
tied to business outcomes, which start from basic clouds and
IOs and go up to more differentiated solutions with advanced analytics or
with things like doing predictive maintenance
across your IoT deployment. Basically, what this gives you is
a bunch of reference architectures, preconfigured solutions,
Azure Resource Manager templates. So you can get up and
running right away. So instead of having a bunch of Lego
blocks that you need to figure out how to assemble, we give you a nice,
packaged box with illustrations and a step-by-step manual. So you can build a truck or
a helicopter or whatever you want. So make sure to check out
the solutions webpage, we keep adding to our
list of solutions. It’s a great way for you guys to
start off if there are specific business scenarios that you need to
build your Azure applications for. Right, so let’s talk about
the summary of some of the implementation best practices
that we’ve been talking about. Make sure you configure
multiple virtual machines in an availability set. That’s a core requirement for availing of our financially
backed three nines five (99.95%) SLA. Make sure that each application tier
is in a separate availability set. Make sure your entire
application is resilient, and put a load balancer in front of
each tier to make sure network traffic is routed appropriately. Take advantage of fault domains and
update domains. And if you have very
stringent resiliency and high availability requirements,
think about how you wanna deploy across multiple regions, taking
advantage of our paired regions. And as I mentioned before,
check out the Azure Resiliency page, azure.com/resiliency. All of the examples I talked about
are great to start off with but by no means comprehensive. So on the resiliency website you’ll
get a comprehensive checklist and reference architectures for
how you wanna architect your applications and
your Azure resources for resiliency. All right, with that,
let’s move to the manage phase. So now you’ve built your
Azure applications, you have some workloads up and
running. Now you wanna make sure that
you have a way of managing your applications end-to-end. Your internal stakeholders and your
customers increasingly have very low tolerance for issues with resiliency
and with bugs in services. People expect services to pretty
much work like electricity and people get very unhappy when
they flick a light switch and the bulb does not turn on, right. Think about a time when you
accessed your banking website, and you logged in and you saw this
big banner on top that said, hey, we’re gonna be down for
planned maintenance on Sunday. And your first reaction to that is,
what? Are you kidding me? You’re a bank, you’re supposed
to be up all the time, right? And this is increasingly true. People expect services to be up and
running all the time. And the mechanics of how you
ensure that in the cloud are completely different. So long gone are the days when you
would log into the specific machine that was running your website and
you would try to remotely diagnose and troubleshoot what’s
going on with it. In the era of the cloud you need new
tools and new services to be able to manage your applications
in a more resilient way. And I’m gonna talk to you guys about one such way that we launched
recently, it’s called Azure Monitor. So again, this is also a new service
that’s in preview right now, but again this is a public preview, so you guys have access to it in
your Azure portal right now. So what this gives you is built-in
monitoring capabilities for all your Azure resources
across all your subscriptions. So it’s a single entry point where
you have all of your monitoring. It gives you out of
the box metrics and logs. You don’t have to
configure anything, you don’t have to configure a storage
account or anything, it’s just there. You can also set up alerts and
notifications so you can take actions even when you’re not sitting
in front of your Azure monitor. And you can create a single
dashboard that you can customize and share with other
people in your team. And of course we are a platform,
so we provide APIs for third party integration. So if you’re already using
an incident management software or an APM monitoring solution or
chat apps, what have you, we provide first class integration
through APIs into those. And I’m gonna talk
about that as well. So, here’s the screenshot of what
the monitor experience looks like. So, you click into monitor
on your Azure portal and it opens up this blade which gives
you everything that you wanna monitor in your Azure environment. So, the first thing
is activity logs. And that’s basically a log of operations which have
been made on your Azure resources. So what are the virtual machines
that were stopped or started, who made those operations, where the SQL database is
that were deleted by someone? So you wanna know that because if
there’s something that goes wrong in your Azure environment it’s most
probably because of a change that someone made. So you wanna know what were
the changes that were made? Who made them? And what were the other things that
are going on at the same time? So that’s activity logs. It also gives you metrics on
what’s happening with your Azure resources right now. So what does my CPU or RAM usage look like,
how fast is my SQL database going. All that kind of stuff. Diagnostic logs for
when things are going wrong. You want a quick and easy way to
be able to search diagnostic logs. And we have enabled first class
integration with Operations Management Suite, OMS. So you can not only look at
diagnostic logs in Azure, but if you have enterprise-wide IT
deployments that span, let’s say, Azure and on-prem, or even multiple
public clouds like Azure and AWS, you can monitor all of those
logs in one single location. And I’m gonna show you guys
how that works as well. Lastly you wanna make sure you can
setup alerts and notifications, so like I said before you’re not
sitting in front of your Azure monitor all the time. So you need to have a way to make
sure you can get notified by email, SMS or webhooks that send it
to other third party apps, and we provide a way for you to do that. The SMS and webhooks capabilities
are currently in the middle of preview, so you might not be
able to access them right now. But we are working on making
sure that they are generally available very, very shortly. Okay, with that,
let’s switch to a demo. Okay, so this is my Azure portal. Let me just refresh that to make
sure that the network connection hasn’t dropped out. Okay, so
I’m gonna go into Monitor. And just like you saw in the
screenshot it’s like a single blade which has monitoring set
up across multiple layers. So what do you want to do when
you think about monitoring is you wanna think about
three levels of monitoring. So the first level is platform level
monitoring where you get logs and metrics for
your Azure resources, right? So virtual machines being stopped or
started or RAM utilization, CPU usage,
all of that kind of stuff. That’s at the platform level. What you might also wanna do is
think about a simple scenario where you have something like a WordPress
website running on Azure. There, not only do you wanna
look at what’s happening with your platform in the virtual
machines and your SQL Server, but you also wanna get
application level insights. Things like how many people
are viewing blog post one versus blog post two. How many people are going
to the about me page? And we also have first class
integration to app insights that lets you achieve that, as well. The third thing you wanna do is
monitoring on a global scale. So as I mentioned before, if you
have an IT deployment across Azure and on-prem or across Azure and
AWS you wanna make sure you have one single space to monitor
all those logs. So, those three scenarios are
what I’m gonna demo for you. So beginning with
the platform level logs, the first thing we have
here is activity log. So here you can filter
by resource type, so if you wanna look at only virtual
machines, you can do that. If you wanna look at
specific operations, you can filter to that as well and you can filter to the specific
time span you wanna look at. What we also have up here is
something called quick insights. So if you’re new to Azure and
you know very little about hey, what does a resource group mean, what
are the different resource types, we give you quick access to what
we think are important things for you to focus on. So let’s say I see here
that I’ve had 11 errors. So I can click into that and quickly get a sense of
what those errors were. And if I click into them,
I can see a brief summary. And I can also get an adjacent view
that I can copy paste to an e-mail that I’ll send to my IT pro,
who can help debug what is going on. And that’s activity logs. And one thing I forgot to mention,
these activity logs are available to you without you having
to configure anything. You don’t have to set
up a storage account. All these logs
are available to you and you get a duration
of the past 90 days. If you need your logs to
be stored for over 90 days, let’s say you have a
compliance requirement, you can export them to your storage
account that you’ve configured. Let’s see. So here you can configure storage
to a specific storage account, so you have logs for longer than
the default 90 day period that we store them for you anyway. And you can also export
them to an event hub. So let’s say you wanna do
some realtime streaming. You can push them from an event
hub onto Azure Stream Analytics. And you can use something like
Power BI to look at your activity logs in real time. So all of those are things
that are enabled for you right from within the portal. Next thing I wanna
talk about is Metrics. So, as I mentioned before, this is stuff like what’s my CPU
utilization on my virtual machines, RAM usage, how many requests
are coming to my web apps. So here I’ve set up this demo web application which is basically
a loan processing application. It’s a fairly simple app,
you put in your details and you can create a loan application. I won’t go into the details
of this application, but let’s see if I can get some
metrics for this right away. So let me go to the resource group
where that web application lives. I’m gonna select web apps
as my resource types and I’m gonna go into the specific web
app that I wanna take a look at. So now this, right off the bat, this gives you a list
of available metrics. And these are context specific, so it only gives you metrics
which are specific to web apps. And I can select a few of them, so let’s say I wanna look at requests,
I wanna look at server errors. And it gives you a nice
chart that if you wanted, you can pin to your dashboard. I’m not gonna do that right now
because I’ve already pinned a bunch of stuff to my dashboard,
which I’m gonna show you guys later. But again, these metrics
are available to you at a granularity of one minute. And you have retention
up to 30 days. And again all of this is
right off the bat. You don’t have to do
anything to configure this. It’s just there when
you go into Monitor. Another thing you can do is you can
look at alerts that you have set for your resources. Here’s an alert that I
have already configured. Let’s look into how that works. So think about alerts in
terms of three things. So there’s a source,
there’s criteria and there’s an action that
you wanna undertake. So the source is coming
from my metrics, the criteria is in my web app. If the average response time on
the server goes beyond a certain threshold, I wanna be notified. And I can specify emails, I can specify, A webhook if I’m
using a third party application for my performance monitoring on Azure,
and I can get those alerts. All right,
let’s go into something else. All right, so that’s as far as
platform monitoring is concerned. I’m gonna also talk to you guys
about application level monitoring, so what’s happening on the
individual web pages for example, what are my page view load times and
so on. So again, we have first class
integration with App Insights. Let’s take a look
at how that works. I’m gonna go into my
loan application. And let’s see if we
can do a live stream. Wow, there’s a countdown. Very exciting. All right.
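Editor’s note: the alert walkthrough a moment ago broke an alert into three parts: a source, a criteria, and an action. The following is a minimal Python sketch of that idea only. It does not call any Azure API, and the class name, field names, metric name, and threshold are all hypothetical, chosen purely for illustration.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Callable, List


@dataclass
class AlertRule:
    """Hypothetical model of an alert: a source, a criterion, and an action."""
    metric_name: str                 # source: which metric to watch
    threshold: float                 # criterion: fire when the average exceeds this
    action: Callable[[str], None]    # action: what to do when the rule fires

    def evaluate(self, samples: List[float]) -> bool:
        """Run the action and return True if the average sample exceeds the threshold."""
        if not samples:
            return False
        average = mean(samples)
        if average > self.threshold:
            self.action(f"{self.metric_name} averaged {average:.2f}, above {self.threshold}")
            return True
        return False


sent: List[str] = []  # stands in for an email or webhook call

rule = AlertRule("AverageResponseTime", threshold=2.0, action=sent.append)
rule.evaluate([0.5, 0.7, 0.6])  # average is below the threshold, nothing is sent
rule.evaluate([2.5, 3.0, 3.5])  # average is above the threshold, one message is sent
```

Keeping the action as a plain callable mirrors the point made in the talk that the same criteria can notify a person by email or hand off to a third-party tool through a webhook.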
Let’s see if I can do a split screen to see how this
live stream works. Okay. There you go. Right, so I have my application on
the right side of the screen and the live stream on the left part. Let’s just reload this a few
times and see what happens. Yeah, so every time I’m reloading
I can see a live stream of the incoming requests. It’s extremely satisfying
to keep doing this. It’s useful
data, but of course you wanna do more than just look at a live
stream of your requests. And so
here if you go into Analytics, it opens up
the Application Insights page. And here you can write queries based
on the Application Insights SDK. So I can look at stuff like
page views, for example. Let’s look at just ten of them. Then I can look at these logs. So the great thing is all of
this is integrated right from within the Azure portal, inside
the monitor experience, right? And you can set up
custom metrics as well, which might not be
available off the bat. And you can look into those. And there’s first class integration
as long as your application is built on ASP.NET using the MVC framework. If you’re using something else, you might have to do a little
bit of instrumentation but you can still look into
Application Insights for app level monitoring beyond what you
already have at the platform level. Okay, let’s look at one
more thing over here. I wanna look at diagnostic logs. So, every resource that
you have in your Azure environment is already
emitting a bunch of logs and you can like disable or
enable them at different points. So let’s see, let’s go to my
LoanApp, which is a logic app, and I’ve enabled a diagnostic here,
so let’s see. So let’s go into
Diagnostic settings. So what you can do with
your diagnostic logs is you can either archive them
to a storage account, let’s say you have
compliance requirements. For having an archive of
diagnostic logs, you can do that. You can stream them to
an event hub and again you can use something like Power BI
to have stream analytics on it. Or what I think is the most
interesting thing is, you can send it to Log Analytics, which is a deep integration with
Operations Management Suite, OMS. So let’s actually take a look
at that and see how that works. So when I click on that, it actually opens up the specific
OMS workspace, right. And then what I can do is I
can go into my OMS workspace. Let’s see. Okay, let me just go
a different route. Right, so
I just click on Log Search and that led me to an OMS workspace
that I’ve already configured. And when I click on Log Search
over here, it opens up my OMS workspace directly inside
my Azure portal, right? So let me just type the simplest
of queries here, star. And it tells me I have 52K logs and
it took like a second to load those. For those of you who attended
the keynote session yesterday, you would’ve seen that the person who
presented has 60 million or so logs. I just wanted to make sure
my demo works for sure. So I made sure I had much
fewer logs in my workspace. But the basic idea is it’s very
scalable for millions of logs, you can load it in a second or
a few seconds. And here it tells you all the
different types of logs you have. So in my OMS workspace,
I only have Azure logs set up. But if you had,
let’s say, a private cloud and VMware, or logs coming in from AWS, you could potentially send them
to the same OMS workspace and you could look at them in
the same environment, right. And you can filter, let’s say I
wanna only look at diagnostic logs, so then it’s gonna filter
it down to 90 results from the bigger number I had before. I can go into these, and it gives me a bunch of other
things I can filter on. Now, as a developer or
an IT pro, you might not know the internal protocols
of every single Azure resource. So for example, I have no
idea what DIRECTION_S means. I have no clue. So what we do is make
sure that it’s easy for you to get started with
Log Search analytics. And how that happens is,
apart from this Log Search, where you can go into things and
find them when you know exactly what you’re looking for,
we also offer solution packs. So let me give you
an example of that. So instead of Log Search,
if you click on Overview, it’s gonna take me into
the solution packs I configured. So I’ve configured the solution
pack for activity logs. And this is something that’s
provided by the Operations Management Suite team. We’re talking to a bunch of
customers and figuring out hey, what are the most common
scenarios for charts and graphs that people want when
they’re looking at activity logs? Look at how pretty this looks. You can drill into those, so
basically it’s kind of like, instead of having an Excel file
with like thousands of rows and columns, which might be like
hard to graph and make sense of, we’ve sort of created like
nice charts and graphs for you to look at. And then you can pin those
individually to your dashboard, and let’s see what it looks like when everything is
pinned to the dashboard. So here
is a single dashboard I have, so let me quickly talk through it. On this column, I have
everything on the platform side. So the earlier graph I had made
on HTTP requests and errors, I’ve pinned to my dashboard and I can see what’s happening
on my platform level. Here on the same dashboard, I have stuff coming in from
Application Insights, right? So things like page view load times,
failed requests, server response times. And then, again, on the same dashboard,
I have my OMS solution packs. So things like security and
audit, system update assessment, which just gives me overview of how
many of my machines need updates, SQL and VM monitoring,
which right now I don’t have set up. And for those of you that attended
the keynote session yesterday, I also have a bunch of stuff
coming in from Security Center. Security is obviously important, even though I’m not
focusing on that right now. So you have everything, right,
from your platform metrics to your application level metrics to
your enterprise-wide IT metrics. Spanning Azure and your on-prem or your multiple public
cloud environments. So let’s switch back to the slides. So here you have the same
kind of dashboard but in a much cooler black theme. I wish I had configured that. It would have looked much cooler
than the blue color palette I have right now. So the same thing,
you have Platform Monitoring, Application Monitoring,
OMS Solutions and potentially Security Center,
all of it in a single dashboard. And what you can do is,
you can edit this dashboard, so you can look at specifically
the resources and the diagnostics that
you are interested in. You can even share them with
different people in your team. So everyone doesn’t have
to start from scratch, you can just share
it among your team. And it’s just like any
other resource on Azure, it’s completely sharable, just
like you share any other resource. All right, so that’s it for
Azure Monitor. So just some high level highlights. So Azure Monitor gives you easy
discovery with just a single entry point for
your end-to-end monitoring, across platform application and
enterprise-wide IT deployments. For platform level resource metrics, you have logs coming in at 1 minute
granularity and 30 day retention. And all of this is set up for
you right out of the box. You don’t need to set
up a storage account, you don’t need to do anything. It’s just there. And you can set up a single
dashboard to monitor everything, as I have shown in
the previous slide. It has deep integration with
Application Insights for monitoring custom metrics and
application level metrics, and with OMS for log analytics across
your entire IT deployment. And of course, we are a platform, so
it directly integrates with the rich ecosystem of partner
monitoring tools. So if you’re already using
incident management software, APM monitoring solutions, ChatOps, we have a bunch of them that you
can directly integrate with. So here are some of our partners in
the space that we integrate with. And let’s hear from a customer
who’s actually used one of these. I think it’s Datadog, and how they’re using Datadog
in conjunction with Azure.>>My name’s John Keane. I’m the CTO for Allrecipes and
Meredith Digital food groups. Allrecipes is a leading digital
food brand in the world, and we serve a community of
80 million home cooks. Our 24 sites receive over a billion
visits every year from these home cooks, who are looking to both
solve an immediate task, like to construct a meal, or to be inspired
to find something new to cook. Whatever that is, they come to us at these moments
when they need us to help both deliver on the task for them, but
also inspire them in some way. These moments of need
tend to create a traffic pattern that is very,
very spiky across the year. The day before Thanksgiving is
our canonical example of that, where we see a month’s worth
of traffic in a single day. So we had a legacy environment
that over the years acquired many different patchwork solutions that
worked together, but it wasn’t the way that you would construct
it if you were building it new. We needed a partner in
our hosting experience that would allow us to scale and
compress, and would self-heal, and would react,
and handle the ups and downs of our traffic on a daily,
weekly, monthly basis. Datadog really integrates
very tightly with Azure’s monitoring APIs and allows
us to see things that are going on in the Azure environment in
real-time and respond to events. On the other side, we also get
access to a Cortana Analytics and Cognitive Insights platform that
really allows us to help drive some improved recommendations and
behavior for our customers. Datadog allowed us to create
this single pane of glass in our build environment that would
show us build over build, and show them how they’re performing. If there’s alerts that are
triggered, it’ll trigger right away. It’ll show up on Slack,
so the team can react. Likewise, if somebody’s looking
at a dashboard or graph, they’ll see instantly. Azure’s partnership with
Datadog enables us to realize the full potential of Azure, as well as Datadog within
our production environment. I mean, in the end, this is really
about serving the needs of these 85 million cooks that I
mentioned earlier, worldwide. And I think now with this
partnership between the two companies, it’ll make it even
easier as we go forward.>>Cool. All right, so let’s say now that
you’ve set up your monitoring, either through the first party Azure
Monitor solution that we have, or through third party software
that you might already be using. And now you’ve figured out there’s
something wrong with one of your resources or
one of your applications. How can you sort of diagnose
that in a self-help sort of way, without having to reach out
to Azure Support every time? So, for that what we’ve done is
we’ve built an experience called Diagnose and Solve Problems. So right from within the Azure
portal, you can go into diagnose and solve problems. And what it’s gonna do is, it’s gonna give you a curated
list of some common problems. And this curation is based on actual
support cases, which you’ve had in the recent past, let’s say over
the last three years or so. And it’s gonna give you
an actionable step by step way to try to resolve that problem. And of course if you’re not
able to resolve the problem, we give you a way inside the portal
to create a support ticket. And I’m gonna show you guys
how that works as well. Another thing that’s available
within the Azure portal is something called resource health. Think about this as an individual
heartbeat for your Azure resources. So at a high level, we tell you if
your Azure resource is available or unavailable, and
if it’s unavailable, we try our best to tell
you why it’s unavailable. And of course, like with everything,
it’s accessible through an API, so you can integrate resource health
metrics into your third party monitoring apps as well. Then we also have targeted
communications within the Azure portal. This is different from
what you might be used to seeing on the Azure
public status dashboard. It gives you a list of outages and disruptions across geographies and
services. This is a signed-in experience
that’s available within your Azure portal, so it’s specific
to your Azure environment, to the services that
you have configured and the regions that you
have deployed them in. And you can filter by region,
events and different date ranges. And we also publish post mortem
reports on when things go wrong, why they go wrong and what we are
doing to mitigate them and so on. So, let’s do a quick demo
of diagnostics as well. So let’s see. Let me take a look
at virtual machines. So let’s say that my diagnostic
logs told me that here, there’s something wrong with
one of my virtual machines. So let’s take a look at them. Let’s see, this is a Windows virtual machine
that’s part of my loan application. And I can click here on Diagnose and
solve problems. And when I do that, It’s gonna
load a bunch of common problems, which are based off of actual
support tickets that our customer support team has got from customers,
right? So, stuff like hey, I can’t connect
to my Windows VM, can’t restart or resize an existing VM. VM is slow, SQL related issues
with virtual machines and so on. And when you click into these,
you get a step by step instruction of how you can
potentially solve for this. And this is a great way for you to
like try to resolve problems before having to engage with Azure Support. Having said that, we do provide
a way for you to create support tickets automatically, and
let’s see how you can do that. Another thing I talked about
was resource health, so right now I see it’s available,
so that’s great. Let’s click into More details to
find out if at any point in the last two weeks, this resource was
unavailable, click on View History. Looks like within the last 14 days,
this resource was available, so that’s great. If for some reason,
the resource health check told me, the machine is unavailable,
we provide you an easy way from within the portal to
contact Azure support. So when you do that,
let’s see what comes up. So here’s a simple way for you to create a Azure support
ticket from within the portal. And keep in mind, any time an Azure
resource in your environment fails a health check, we offer a 24
by 7 break-fix free of cost. So that’s just included as part
of your Azure subscription. So here you can tell us if
it’s mission critical or not a critical production issue. You can tell us your
preferred contact method, either through email or phone. So let’s say it’s
a mission critical app, you probably want someone to
phone you right away, and then you can create
a support ticket right away. Let me go back to my
virtual machines and open up a Linux virtual machine,
because Microsoft loves Linux. And see what happens when
I click on Diagnose and solve problems over here. Aha, it’s context aware. It knows this is a Linux
virtual machine. So all of the common problems and steps that I get here will
be specific to Linux, right? And you can create a support
ticket in the same way. So let’s go back to our presentation. That’s a great segue to talk about
our support offerings, right? So, as I mentioned,
included in your Azure subscription is 24 by 7 by 365 or
in the case of last year, 366, since it was a leap year,
included support, right? So if at any point a resource
fails a health check, you can create a support ticket and
we’ll look into that for you. Beyond that is a plan
called Developer Support. Now how you should think about that
is let’s say you’re just starting off building applications in Azure,
right? And you have some questions, there are some things you
need technical help with. You can leverage our
Developer Support plan and we’ll have those questions
answered for you. So things like, hey, how do I build
my application for availability, scalability, the stuff we
talked about earlier, but how do you do that in a way that’s
specific to your environment? Now beyond that, anytime you have production
environments in Azure, you wanna make sure that you have a support
plan beyond Developer Support. And here we have three
tiers of support, Standard, Pro Direct and
Premier support. And in a case where you have
a mission critical workload. Let’s say you’re launching
a new product and you’re expecting a huge spike in
traffic, we really encourage you to go for our Azure Rapid Response
support plan. We have a dedicated rapid response
team that’s ideally geared up towards dealing with those massive
launch types of scenarios. So that’s an overview of
our support offerings. And of course, tweet us! Our handle for
support is @AzureSupport. We are very, very active on Twitter. A typical response time is within
six minutes of a customer tweet. We field any questions that
you might have related to your Azure environment on
this Twitter handle and we route it to the appropriate
engineering team. We also publish postmortems
on our Twitter account. So any time there is an outage or
a service impacting incident, a detailed postmortem of what
happened, what went wrong, what steps we took to
mitigate all of that. All of that will be published
on our Twitter feed. So subscribe to @AzureSupport. Great, so that brings me towards
the end of the presentation, so action items for
you guys going out of this meeting. Make sure to check out
the azure.com/resiliency webpage. All of the stuff that I talked about
in terms of high availability and scalability is a great
starting point, but it’s not a comprehensive list. And you can get more detailed
reference architectures and ARM templates and
checklists on the resiliency page. Also check out documentation for
Azure Monitor, so I know I covered a lot
of stuff in the demo and you might wanna take
a look at that again. And also check out third-party
partner integrations that I talked about, especially if you are already using
a tool that you wanna leverage. Okay, time for
a couple more quiz questions. So can anyone tell me, on Azure
Monitor for platform level metrics, what’s the time granularity
that we offer to you guys? Yeah?>>One minute.>>Okay, cool. One more? Can anyone tell me,
on Azure Monitor for Platform level metrics, how many
days of data retention do you get right out of the box without
configuring anything?>>30 days.>>Cool, looks like all of you guys
were paying attention, awesome.>>[LAUGH]
>>[LAUGH] Great. All right, so
we still have a few minutes. Any questions? Yes?>>90 days out of
the box [INAUDIBLE]>>Yeah, yeah, that’s a good point. So there’s two different things. So there’s activity logs and
there’s metrics. For activity logs,
you get 90 days out of the box. For metrics, which are more
continuous like CPU utilization and RAM usage, it’s 30 days. So there is that distinction. Good catch. Yes?>>You were talking about Azure.>>Yes.
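Editor’s note: to make the distinction in that answer concrete, here is a small Python sketch of the two default retention windows quoted in the talk (90 days for activity logs, 30 days for metrics). It is only an illustration of the arithmetic, not an Azure API; the function names and the sample date are hypothetical, and the figures are the ones stated in this session, which may have changed since.

```python
from datetime import datetime, timedelta

# Default retention windows as quoted in the talk (preview-era figures).
RETENTION_DAYS = {"activity_log": 90, "metric": 30}


def is_retained(kind: str, recorded_at: datetime, now: datetime) -> bool:
    """Return True if a record of this kind is still inside its default window."""
    return now - recorded_at <= timedelta(days=RETENTION_DAYS[kind])


def needs_export(kind: str, required_days: int) -> bool:
    """Return True if a requirement outlives the default window, so the data
    would have to be exported (for example to a storage account) to be kept."""
    return required_days > RETENTION_DAYS[kind]


now = datetime(2017, 3, 1)  # arbitrary sample date
forty_five_days_ago = now - timedelta(days=45)

activity_kept = is_retained("activity_log", forty_five_days_ago, now)
metric_kept = is_retained("metric", forty_five_days_ago, now)
```

With these figures, a 45-day-old activity log entry is still inside its window while a 45-day-old metric sample is not, which is the distinction the questioner picked up on.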
>>I wanna create Azure [INAUDIBLE] share with
someone who doesn’t have access?>>Yes.>>Is that taken care of or
do I have to go into each resource?>>Yes, so
it’s tightly integrated with RBAC, role-based access control. So if those people do not have
access to those resources, they will not be able to
see the logs for them. So they would need to have
access to them of course. Any other questions? Sure.>>[INAUDIBLE].>>Yeah, so
we don’t have a mobile app. What we have is, like I mentioned,
you can set up alerts and notifications. And in private preview, we actually
have SMS capabilities as well, which will go live shortly. But we don’t have a mobile app
right now, you can get emails and SMS alerts and webhooks. But you could use potentially one
of our third party providers. I’m not sure if any of
them have mobile apps, but that’s one way you could
look into doing that. All right, thanks a lot guys.
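Editor’s note: the last answer mentions three notification channels: email, SMS, and webhooks that hand alerts to third-party tools. The Python sketch below shows one way such a fan-out could look. The payload field names, rule name, and resource name are hypothetical and are not the Azure alert schema; nothing here sends a real message.

```python
import json
from typing import Dict, List


def build_webhook_body(rule_name: str, resource: str, value: float) -> str:
    """Serialize a hypothetical alert payload (illustrative field names only)."""
    return json.dumps({"rule": rule_name, "resource": resource, "value": value}, sort_keys=True)


def route_alert(body: str, channels: List[str]) -> Dict[str, str]:
    """Turn one alert payload into one message per requested channel."""
    alert = json.loads(body)
    summary = f"{alert['rule']} fired on {alert['resource']} (value {alert['value']})"
    messages: Dict[str, str] = {}
    for channel in channels:
        if channel == "webhook":
            messages[channel] = body      # third-party tools receive the raw JSON
        elif channel in ("email", "sms"):
            messages[channel] = summary   # people receive a readable one-liner
        else:
            raise ValueError(f"unknown channel: {channel}")
    return messages


body = build_webhook_body("HighResponseTime", "loan-web-app", 3.2)
messages = route_alert(body, ["email", "webhook"])
```

Sending the raw JSON to the webhook channel and a short summary to people reflects the split described in the talk between human notifications and integrations with incident management or chat tools.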
