Proactive Monitoring, Share your thoughts and Best Practices

conversation
bestpractices
implementationideas

#1

Hello All,

I am a fairly recent New Relic user who has grown into using it a lot day to day, but it has now come to a point where my brain starts to freeze and gives me ALERTS like: am I doing the right thing, or is there a better approach that others are taking when it comes to proactive monitoring?

We are all working crazy hours to deliver our best, but how do we leverage all the awesome tool stacks on the market (NR included) to get to a point where we have a doctor-patient relationship with our apps (the app being the patient and SRE/DevOps/App Support being the doctor)? Instead of us trying to find the problem, the app tells us how it is feeling and what is going on with it, and makes sure we are informed before it is in deep stress, including the infra it is running on.

What do I do, or how do I set things up, so that I can sleep worry-free and wake up only if things are really not good, and can have dinner with family without looking at my phone (emails, SMS, etc.)?

I would like the people in this community to start sharing their thoughts and experience, or maybe the way they have set things up to achieve everything I mentioned above.

Currently I am using NR (Synthetics, Alerts) and the ELK Stack to make my life easy.

NR keeps a check on the apps and infra, and my scripts keep checking the systems every few minutes.

ELK is helping me detect things at the log level.

What are you doing!!


#2

We currently have APM, Browser, Infra, Alerts and Synthetics, and use them in a similar way to you. We currently have a gap on the log file front, so we will need to work that out.

Alerts provide us critical business process checks along with generalised alerts such as looking for 500 status codes or specific exceptions which provide us early visibility on communication type issues such as firewall changes which block services.

One thing that is great is that we have been able to spot several issues which were not due to our applications but to external factors, such as when Microsoft changed their SSL server chain and we didn’t have their IP addresses whitelisted. We were able to see specific exceptions coming from our apps, which meant I could set an NRQL alert to look for that specific exception. With the runbook URL added to the alert, we now have a couple of areas we can check.

I use an Insights dashboard with the Synthetic results to provide a management view on service availability, page performance and page performance comparisons.

As with everyone, I just don’t get the time I would like to fiddle and tweak to ensure I have the correct levels, so it is only when I notice something, or an alert condition I have set is too noisy, that I play around until I think I have the Goldilocks level.

As with yourself, I am interested in learning from the community.


#3

Hey @stefan_garnham

thank you for starting this conversation.

First, for the log files, I am using Logstash with the free Elastic and Kibana mix for now. It helps me greatly to manage my logs and even do analytics on them. Something I highly recommend.

Regarding what you mentioned “Alerts provide us critical business process checks along with generalised alerts such as looking for 500 status codes or specific exceptions which provide us early visibility on communication type issues such as firewall changes which block services.”

If you can, please share more of the use cases you rely on to make sure the web business is intact, because this is something that I have not used so far, so it would be really helpful.

Also, on which layers have you installed NR agents? Do you get complete browser-to-app-server details? I still need to work on the browser piece; it's something I have not looked into yet.

Also, I wanted to know what the best config is for Java apps to get complete details from the apps without adding performance overhead.

thanks


#4

Hi All,

We use NR alongside many others to offer a complete suite of monitoring solutions delivering value to the entire enterprise (with thousands of apps, hosts, systems, etc.) for nearly every use case. To do this most effectively, we approach monitoring with a general long-term vision/strategy in mind to help focus our efforts in a singular ever-improving direction. The ideal end-state in most cases is incorporating automation & self-healing mechanics into as many operational tasks as possible to ultimately deliver a “self-healing or self-maintaining” system.

Getting to that state is easier said than done, but a good first step can be to consider your monitoring footprint in terms of maturity levels. Then based on current status of monitoring in your org, you can outline & take incremental steps to reach the next level of maturity on a gradual but regular basis. For example, a basic first step could be ensuring you have thorough (preferably total) monitoring penetration across all systems. If you’re beyond this, the next could be: ensuring all alert thresholds are valid & actionable from an operations standpoint. If they’re not, reconsider why the alert exists in the first place & how it can be improved or perhaps simply disabled.

If you’re getting awoken late at night due to alerts, consider why this may be & seek out ways to not be an integral part of the resolution process at that moment.

  • For example, if your presence is just to decipher the alert, consider putting other teams through New Relic training so they can become self-sufficient. Another approach is to document within a runbook (surely you have these for each & every alert policy - right? ;)) the known errors that then correspond to a resolution. In this case, you’re front-loading the work of interpreting the event so you don’t have to be awoken later on.

  • Unfortunately if you wear many hats and you also have to resolve the issue in the moment, then it may require the development of automations to facilitate base-level troubleshooting & error handling (ideally total issue resolution).

As it relates to total monitoring footprint, NR is best complemented by monitoring solutions that handle logs (generally Splunk / Logstash) & network monitoring. Neighboring technologies to monitoring involve configuration management & security vulnerability/compliance tools. Somewhere between all of these fields is the happy medium that works for the vast majority out there.

Best of luck finding what works easiest for your org!


#5

Hi @jlove

Thank you for the detailed share; it was a good read and gives a high-level sense of how one should approach this.

So, in your experience so far, can you share some best practice examples that you implemented yourself?

Something that we can learn from. Particular use cases are good too.

thanks


#6

Great conversation everyone! @MKhanna - thanks for kicking this one off!


#7

@MKhanna,

Happy to share - my apologies for the delayed response.

A few tips based on experiences I’ve encountered personally:

Regarding APM,

  • Exhibit caution when alerting against error rate, specifically on low-throughput applications (if the volume of requests per minute (RPM) is low, a single error could cause large spikes in error rate, which is common in dev/stage environments).
  • Filter out non-errors so they don’t inflate your app’s error rate. Work systematically to reduce the error counts where possible.
  • If you can afford the additional costs, empower app teams with data earlier in the SDLC by monitoring the performance of your staging/pre-prod environments to better gauge the impact of new releases.
  • When deploying APM or Infrastructure agents, consider who will own the process of updating the newly deployed agent. In many cases, once an agent is deployed, the thought of updating it is never considered thereafter, even though newer agent versions may offer better performance or additional data.
  • Consider calibrating the Apdex T value for your app to better align with the goals of your organization or app team. The default values can sometimes be considered “too loose” for highly performant apps, so in such a circumstance you would generally want to lower the Apdex T value to a more acceptable threshold.
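
To make the error-rate and Apdex T points above concrete, here is a minimal Python sketch of the arithmetic involved. It is only an illustration and not New Relic's implementation: the function names, the 5% threshold and the 50-request minimum are hypothetical, and the Apdex calculation follows the standard published definition (satisfied at or below T, tolerating up to 4T, frustrated beyond that).

```python
def error_rate(errors, requests):
    """Return errors as a percentage of requests (0.0 when there is no traffic)."""
    if requests == 0:
        return 0.0
    return 100.0 * errors / requests


def should_alert(errors, requests, threshold_pct=5.0, min_requests=50):
    """Alert only when the error rate is high AND there is enough traffic to trust it.

    threshold_pct and min_requests are hypothetical values chosen for illustration.
    """
    if requests < min_requests:
        return False  # too few requests: one error would dominate the percentage
    return error_rate(errors, requests) > threshold_pct


def apdex(response_times, t):
    """Standard Apdex score: (satisfied + tolerating / 2) / total samples."""
    if not response_times:
        raise ValueError("need at least one response time")
    satisfied = sum(1 for r in response_times if r <= t)
    tolerating = sum(1 for r in response_times if t < r <= 4 * t)
    return (satisfied + tolerating / 2) / len(response_times)


if __name__ == "__main__":
    # One error in four requests looks alarming as a percentage...
    print(error_rate(1, 4))
    # ...but the traffic guard keeps it from paging anyone.
    print(should_alert(1, 4))
    # The same response times score lower once T is tightened.
    samples = [0.2, 0.3, 0.6, 1.5, 3.0]  # seconds, made-up data
    print(apdex(samples, 0.5), apdex(samples, 0.25))
```

The traffic guard is one possible way to quiet low-throughput environments; alerting on an absolute error count is another. The Apdex comparison shows why lowering T makes the score stricter for apps that are expected to be fast.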

Regarding Browser,

  • If you can implement Browser monitoring, then by all means do so. The data here is fairly straightforward & invaluable.
  • I highly recommend coupling the data gathered here with Insights dashboards to help with prioritization of frontend bug fixes/features (for example, x% of our site visitors are running Chrome/Firefox/IE version y).

Regarding Synthetics,

  • Generally speaking, I wouldn’t recommend using code builders, as our experience with them has been that they are fairly cumbersome & time-consuming to use. Instead, consider establishing a “baseline” script that can be replicated quickly without needing to debug the functionality within the script itself. This way you can focus on identifying the best XPath selectors (we use the Chrome addon “xPather” to help). Not sure where to start with this? Here are some functions we use: transaction start/stop timers, user input (clicking/typing), element validation, frame/window manipulation, user login, & mouse hovering.
  • From a monitoring-quality standpoint, consider whether your monitor should check that general frontend elements are loading (are all expected elements (links, divs, etc.) present?) OR instead validate whether backend actions went through successfully (order/payment processing, data lookups, etc.). The idea here is to be considerate of what you’re actually checking.
  • It’s possible to run multiple checks from a single synthetic, then alert using NRQL queries. Won’t go into details as to how, but it’s definitely possible :slight_smile: .
  • Consider using Synthetics against competitors… :wink:

Regarding Insights,

  • Consider keeping a list of interesting queries with an accompanying screenshot of the widget it generates. This can become very handy once you have a sufficiently large number of NRQL widgets/use cases.
  • Consider using the “Dark Theme” for Insights Dashboards. It looks slick!
  • Don’t have a way to show the data how you’d like? You can always pull down data through an API call, massage it as needed, then send it back into NR through another API call & voilà.
  • Consider starting your NRQL queries with a FROM instead of SELECT. This will allow you to more easily write the rest of the query.
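
As a rough illustration of the FROM-first tip, and of the kind of "count a specific exception" query mentioned earlier in this thread, here is a small Python sketch that only assembles an NRQL string. The helper name `nrql_count` and the exception class `ExampleSslException` are hypothetical, and it assumes your account reports errors to the `TransactionError` event type with an `error.class` attribute, so check your own data before relying on it.

```python
def nrql_count(event_type, where=None, since="1 hour ago"):
    """Build a FROM-first NRQL count query as a plain string.

    Hypothetical helper for illustration; it does not call any New Relic API.
    """
    clauses = []
    for attribute, value in (where or {}).items():
        if "'" in value:
            # Quote escaping is deliberately out of scope for this sketch.
            raise ValueError("quote characters are not handled in this sketch")
        clauses.append(f"{attribute} = '{value}'")
    query = f"FROM {event_type} SELECT count(*)"
    if clauses:
        query += " WHERE " + " AND ".join(clauses)
    return f"{query} SINCE {since}"


if __name__ == "__main__":
    # 'ExampleSslException' is a made-up class name standing in for a real one.
    print(nrql_count("TransactionError", {"error.class": "ExampleSslException"}))
```

Putting the event type first means the query reads in the order you tend to think about it: where the data lives, what to compute, then how to filter it.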

Regarding Alerting,

  • If the people reacting to the alerts can’t immediately make heads or tails as to what it means, then you’re doing alerting wrong. To avoid this, try to name your failure conditions & alert policies in the most useful way possible to the people who are responding to the alerts. For example, an alert policy may be named after a specific application, with failure conditions relating to subcomponents of said app. Moreover, if I named the alert policy “Oracle ERP #1” a failure condition may be named, “High CPU% on HOSTNAME (IP or FQDN)” or “Server Unresponsive - HOSTNAME (IP or FQDN)”. When done this way, responders know exactly where to investigate while in the moment.
  • When possible, try to implement/use standardized runbooks for each & every failure condition. If you’re already taking the time to deploy monitors, you may as well document what remediating actions need to take place after the monitor’s failure conditions are triggered rather than waiting for the sh*t to hit the fan to do it in the moment at 3am before the end of the fiscal quarter…

There are other one-off tips I could point to; however, the ones I’ve provided are probably the most general ones I could think of at this time.

For a good general reference on performance monitoring concepts, I’d recommend doing a once-over of Google’s Web Fundamentals guide on Performance, especially the “User-Centric Performance Metrics” section, which covers ‘mapping metrics to user experience’.

Hope that helps!

JL


#8

Thanks @jlove

Interesting read, @hross hope there is a way to get more people to share their perspectives too.

thanks


#9

You bet! We’ll feature this convo with new Explorers starting next week @MKhanna. There are some real gems in here thanks to @jlove, @stefan_garnham, and you!


#10

@MKhanna: