Alerts Troubleshooting Framework False Alerts

This guide walks through troubleshooting both false negatives and false positives for Alerts violations. It focuses mainly on NRQL and NRQL-centric alert conditions (Infrastructure Metrics is one example of NRQL-centric), but you can apply it to any alert condition type, as long as you can write queries that show the underlying data. If you are still having issues after following this guide, please reach out to New Relic Global Technical Support, who can investigate further. Let them know you followed this guide; it will help them avoid re-checking things you have already covered.

  1. You will need either an incident link (for a false positive) or a link to an alert condition plus a time window when the metric breached the threshold (for a false negative). In both cases, also gather any facet information (which app, which server, that sort of thing).

    • The incident will show you when the violation happened and which alert condition was responsible. If you’re dealing with a false negative, you should already have a link to the condition and a relatively short timeframe (1-2 hours) in which a violation should have opened.
  2. Construct a query showing the data, and scope the time window so that you can set TIMESERIES to the condition's aggregation window (e.g. TIMESERIES 1 minute for a 1-minute aggregation window). Remember that the system can show a maximum of 366 buckets, so a 1-minute TIMESERIES can cover at most about 6 hours.

    • Pro-tip: use epoch timestamps for this (or any method of absolute time), rather than relative time, so that you can share a link and anyone can see the data that you are seeing.
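
As a rough illustration of step 2, here is a minimal Python sketch that assembles such a query string from absolute timestamps and refuses to build one that would exceed the 366-bucket limit. The helper name `build_query`, the `SystemSample` / `cpuPercent` / `hostname` example values, and the use of epoch seconds in SINCE and UNTIL are assumptions made for the example; swap in the query from your own alert condition.

```python
from datetime import datetime, timezone

MAX_BUCKETS = 366  # TIMESERIES bucket limit mentioned in step 2


def build_query(select, source, start, end, window_minutes=1, where=None):
    """Build an NRQL string over an absolute (epoch) time window.

    Hypothetical helper: it only formats a string, it does not call any API.
    """
    since = int(start.timestamp())  # epoch seconds (assumed accepted by SINCE/UNTIL)
    until = int(end.timestamp())
    buckets = (until - since) / (window_minutes * 60)
    if buckets > MAX_BUCKETS:
        raise ValueError(
            f"{buckets:.0f} buckets requested, limit is {MAX_BUCKETS}; shorten the window"
        )
    parts = [f"SELECT {select}", f"FROM {source}"]
    if where:
        parts.append(f"WHERE {where}")
    parts.append(f"SINCE {since} UNTIL {until}")
    unit = "minute" if window_minutes == 1 else "minutes"
    parts.append(f"TIMESERIES {window_minutes} {unit}")
    return " ".join(parts)


# Example: a 2-hour window around a suspected false negative (illustrative values).
query = build_query(
    "average(cpuPercent)",
    "SystemSample",
    datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc),
    datetime(2024, 1, 1, 14, 0, tzinfo=timezone.utc),
    window_minutes=1,
    where="hostname = 'web-01'",
)
print(query)
```

Because the window is absolute, anyone who opens the resulting query sees the same 120 buckets, regardless of when they run it.
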
  3. Evaluate whether the data supports a violation opening or not opening. If the data clearly shows that the violation was incorrect, or that the lack of a violation was incorrect, you can exit this framework and reach out to New Relic Support. Give them as much detail as possible.

    • Remember that "for at least X minutes" thresholds need X consecutive minutes of breaching data to open a violation, while "at least once in X minutes" thresholds need only a single breaching aggregated data point.
    • For "sum of..." thresholds, you will need to calculate the sum manually, since there is no way to display this using NRQL queries. Remember that the sum works on a rolling X minutes, where X is the threshold time window.
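
To make the three threshold types in step 3 concrete, here is a minimal Python sketch that checks a list of per-minute values copied from the TIMESERIES results. It assumes a 1-minute aggregation window (one value per minute), an "above" comparison, and that empty buckets are represented as `None`; the function names are made up for the example, and treating `None` as 0 in the rolling sum is a simplification for the sketch, not a statement about how the alerts pipeline behaves.

```python
def breaches(value, threshold):
    """A point breaches only if it has a value and that value is above the threshold."""
    return value is not None and value > threshold


def for_at_least(points, threshold, minutes):
    """True if `minutes` consecutive points all breach the threshold."""
    run = 0
    for value in points:
        run = run + 1 if breaches(value, threshold) else 0
        if run >= minutes:
            return True
    return False


def at_least_once(points, threshold):
    """True if any single aggregated point breaches the threshold."""
    return any(breaches(value, threshold) for value in points)


def rolling_sums(points, minutes):
    """Sum of each rolling window of `minutes` points (None counted as 0 here)."""
    return [
        sum(value or 0 for value in points[i - minutes + 1 : i + 1])
        for i in range(minutes - 1, len(points))
    ]


cpu = [70, 92, 95, 88, 91, 93, 96, 60]  # illustrative per-minute values
print(for_at_least(cpu, 90, 3))   # True: 91, 93, 96 are three consecutive breaches
print(for_at_least(cpu, 90, 4))   # False: the 88 breaks the first run
print(at_least_once(cpu, 90))     # True
print(rolling_sums([1, 0, 2, 3, 0], 3))  # [3, 5, 5]
```

For a "sum of..." threshold, compare each value returned by `rolling_sums` against the threshold, with `minutes` set to the threshold time window.
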
  4. For false negatives, or for violations that fail to close: take the query you used in step 2 and check the results for NULL values. Take a look at this article to learn the query order of operations, so you’ll know when to expect them: Relic Solution: How Can I Figure Out When To Use Gap Filling and Loss of Signal?

    • Unplanned NULL values can cause a false negative or keep a violation from closing on time. If that is what happened, look at using Loss of Signal or (possibly) Gap Filling, depending on the shape of the data.
    • NULL values should never cause a false positive.
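
When the time window has a few hundred buckets, scanning for NULLs by eye is error-prone. The following minimal Python sketch locates the gaps, assuming you have copied the TIMESERIES values into a list where an empty bucket is `None`; the function name `null_runs` is hypothetical.

```python
def null_runs(buckets):
    """Return (start_index, length) for each run of consecutive None values."""
    runs = []
    start = None
    for i, value in enumerate(buckets):
        if value is None:
            if start is None:
                start = i  # a new gap begins here
        elif start is not None:
            runs.append((start, i - start))  # the gap ended at the previous bucket
            start = None
    if start is not None:
        runs.append((start, len(buckets) - start))  # gap runs to the end of the window
    return runs


values = [5, None, None, 7, None]  # illustrative per-bucket results
print(null_runs(values))  # [(1, 2), (4, 1)]
```

The shape of the output hints at which setting to look at first: short, scattered gaps between real data points are the kind of shape Gap Filling is aimed at, while a gap that runs to the end of the window looks more like a lost signal.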

