I’d like to spend a few minutes sharing a bit of insight with you all about how data makes its way into Insights, and how you can use this knowledge to your advantage when setting up NRQL Alert Conditions.
From your system to Insights
New Relic offers a spread of monitoring tools that all work differently, but they have one thing in common: the data has to reach New Relic and be processed before you can access it. Most of that data shows up in your New Relic UI quickly: if you’re looking in APM, Infrastructure, Browser, Mobile or Synthetics, the data you’re seeing is near real time. However, when that data is ingested into Insights, it has to go through several processing steps to get integrated into a vast cluster of databases, which is what lets you freely query it using NRQL.
In practical terms, this means that if you set up an alert condition targeting one of those products, the condition already expects a certain amount of data latency while we go about processing the data. When setting up a NRQL Alert Condition, however, the amount of time it takes to ingest the data becomes very important, because you control the Evaluation offset yourself.
How the alerts evaluation system works
New Relic’s alerts evaluation system looks at 1-minute slices. Even if you have a threshold along the lines of [...] for at least 5 minutes, the alerts evaluation system is still only looking at 1-minute slices and building its own 5-minute picture of the data.
In the case of alert conditions scoped to some non-Insights products (the Java APM agent, for instance), this 1-minute slice runs essentially from 60 SECONDS AGO until NOW. Even where there is some necessary harvest and aggregation time, the alerts evaluation system expects that ingest delay, and it is built into the condition to keep data precision high. However, because of this necessary ingest time, when you set up an alert condition scoped to Insights (by using a NRQL Alert Condition), you have to account for that ingest latency manually. If you looked at Insights data from 60 SECONDS AGO until NOW, you would in many cases see a flat line: the data either hasn’t been sent by your systems yet or hasn’t had a chance to be processed!
Putting it all together
When you set up a NRQL Alert Condition, one of the settings you can change (under Advanced settings) is Evaluation offset. This tells the alerts evaluation system which 1-minute slice of data to stream through its rule-checking system. We recommend setting this to 3 minutes, meaning the system will look at the data from 3 MINUTES AGO until 2 MINUTES AGO. In most cases, this is a long enough delay for the data to make its way into Insights from our language agents (which harvest on a 60-second cycle), so the alerts evaluation system will see it. However, in rare cases (or if your Evaluation offset is set to under 3 minutes), you may see an alert violation open when it shouldn’t have, or fail to open when it should have, because incomplete data was available during that 1-minute window.
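To make the window arithmetic concrete, here is a small Python sketch. This is not New Relic code; the function name and the example timestamps are purely illustrative, but it captures the slice selection described above:

```python
from datetime import datetime, timedelta

def evaluation_window(now, offset_minutes=3):
    """Return (start, end) of the 1-minute slice the alerts evaluation
    system examines, given an Evaluation offset in minutes."""
    start = now - timedelta(minutes=offset_minutes)
    return start, start + timedelta(minutes=1)

# Evaluating at 10:00 with the recommended 3-minute offset examines
# the slice from 09:57 until 09:58 (3 MINUTES AGO until 2 MINUTES AGO).
start, end = evaluation_window(datetime(2023, 1, 1, 10, 0), offset_minutes=3)
print(start.time(), end.time())  # 09:57:00 09:58:00
```

Note that with `offset_minutes=1` you get the 60 SECONDS AGO until NOW window that non-Insights conditions effectively use, which is exactly where Insights data often hasn’t landed yet.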
In the case of a false positive, this looks like an alert incident whose graph differs from the graph you’ll see if you run the same query in Insights. The data came in later than the alerts evaluation system expected and was then backdated. The graph shown in the incident is what the alerts evaluation system saw (ignoring latent data), while the graph you see in Insights includes the latent (backfilled) data, since NRDB can accommodate late-arriving data while alerts has to make decisions in real time.
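As a toy illustration of why the two graphs can differ (a hypothetical simulation, not New Relic internals), suppose three events belong to one evaluated minute but one of them reaches NRDB after the evaluation pass has already run. Counting only what had arrived at evaluation time reproduces the incident graph; counting everything reproduces the Insights graph:

```python
from datetime import datetime, timedelta

# Each event records the minute it belongs to (event_time) and the
# moment it actually reached NRDB (arrival). All times are made up.
events = [
    {"event_time": datetime(2023, 1, 1, 9, 57, 10), "arrival": datetime(2023, 1, 1, 9, 58, 0)},
    {"event_time": datetime(2023, 1, 1, 9, 57, 20), "arrival": datetime(2023, 1, 1, 9, 59, 30)},
    {"event_time": datetime(2023, 1, 1, 9, 57, 40), "arrival": datetime(2023, 1, 1, 10, 1, 0)},  # late!
]

window_start = datetime(2023, 1, 1, 9, 57)
window_end = window_start + timedelta(minutes=1)
evaluated_at = datetime(2023, 1, 1, 10, 0)  # 3-minute offset

in_window = [e for e in events if window_start <= e["event_time"] < window_end]

# What the alerts evaluation system saw: only data that had arrived.
seen_by_alerts = sum(1 for e in in_window if e["arrival"] <= evaluated_at)

# What Insights shows later: the late event has been backfilled.
seen_in_insights = len(in_window)

print(seen_by_alerts, seen_in_insights)  # 2 3
```

The gap between those two counts is the gap between the incident graph and the Insights graph; a larger Evaluation offset moves `evaluated_at` later, giving late events time to arrive.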
Whether this results in false negatives or false positives for you, a good first step is to increase the Evaluation offset. You can increase this setting to as long as 20 minutes, which is almost always enough to catch any latent data coming into Insights.
One way to check whether you should increase your Evaluation offset is to look for a drop-off in data point values on the right side of the graph in your condition previews.
Cloud Integration alerts – a practical example
In the case of Cloud Integrations in Infrastructure, this data latency is exacerbated by the polling frequency and by our reliance on CloudWatch to send us timely data: if we are only polling AWS every 5 minutes (for example), then using the default Evaluation offset will result in the alerts evaluation system seeing a constant stream of NULL values. The reason is that, when we poll a cloud service such as AWS, the service itself is still processing the data from right now, so we get data from X minutes ago. For example (using a 5-minute polling frequency), if we send a request to AWS for data at 10:00, we actually get data from 9:50-9:55 (or even earlier), as that is the data that has finished processing by 10:00. That data then has to go through the ingest pipeline into Insights (as described above), so it winds up falling completely outside of the Evaluation offset window.
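To see how the delays stack up, here is a back-of-the-envelope budget in Python. The polling and CloudWatch figures come from the example above; the ingest figure is an assumption for illustration, not a guarantee:

```python
# Rough, illustrative latency budget for a CloudWatch-backed metric.
polling_interval_min = 5   # how often New Relic polls AWS (example above)
cloudwatch_lag_min = 5     # data AWS returns is at least this old (example above)
ingest_pipeline_min = 2    # assumed time to process into Insights

# Worst case: a data point lands just after a poll, so it waits a full
# polling interval before it is even requested.
worst_case_age_min = polling_interval_min + cloudwatch_lag_min + ingest_pipeline_min
print(worst_case_age_min)  # 12
```

With the newest data point potentially 12 minutes old before it is queryable, a 3-minute Evaluation offset never sees it, which is why the window fills with NULLs.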
The best way to set up an alert condition in this case is to use the Integrations alert type in Infrastructure, then scope to the particular service and metric you’d like to alert on. Integrations alert conditions expect this data latency, so they will behave properly and alert you when the data violates the thresholds you set up.
I hope this helps you understand a bit better what’s going on behind the scenes with your NRQL Alert Conditions, and why Evaluation offset is so important to making sure your conditions work the way they should. Happy alerting!