Relic Solution: Simulate streaming alert evaluation in the Query Builder

If you’ve used NRQL, the Query Builder, or even a NRQL alert condition’s preview chart to investigate an alert incident, you may have noticed some discrepancies between your data at rest and the data as it was evaluated by the streaming alerts platform (which, for the record, is represented under the NrAiSignal data type).

In fact, if you’ve spent any time with the Alerts product at all, you’ve probably noticed the following disclaimer under the preview chart:

There are a few things that lead us to include this disclaimer:

  1. The aggregation method you select will affect which particular data points are submitted for evaluation, and an improperly configured aggregation method (or events that are considerably time-skewed) may cause data points to be dropped from evaluation despite their appearance in an at-rest query.
  2. There isn’t a comprehensive NRQL-based solution to simulate gap filling.
  3. The streaming alerts platform relies on an order of operations that isn’t present when running a query against at-rest data.

While the behavior outlined in the first point is too complex to represent in an at-rest query, the gap filling behavior in the second point can now be simulated through the Query Builder’s “Null values” configuration panel.

The behavior outlined in the third point is a little more complex, but with conditional statements now available in NRQL, we can simulate it as well with a little finessing.

Order-of-operations

For any given aggregation window, our streaming alerts platform may choose not to execute the SELECT statement in a query, resulting in a loss of signal.

Specifically, the SELECT statement is not executed when there is no data matching the conditions set under the FROM and WHERE clauses.

As an example, let’s consider an alert condition targeting a Synthetics monitor that only runs once every 15 minutes. In this scenario, the account also has monitors running once every minute, although these monitors are not being evaluated by the alert condition in question.

Let’s say this is the query that’s in use:

SELECT count(*) FROM SyntheticCheck WHERE monitorName = '15min monitor' AND result = 'SUCCESS'

In an at-rest query, for any window where the monitor is A) not reporting a successful result or B) not reporting any result at all, the result of the count(*) operation will always be zero. This is due to the behavior of the count() aggregator function, which always returns zero in the absence of data.
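To see this at-rest behavior for yourself, you could chart the same query over time in the Query Builder. This is only a sketch: the SINCE and TIMESERIES values here are illustrative assumptions, not part of the alert condition.

```nrql
SELECT count(*)
FROM SyntheticCheck
WHERE monitorName = '15min monitor' AND result = 'SUCCESS'
SINCE 1 hour ago TIMESERIES 1 minute
```

With a monitor that only runs once every 15 minutes, most of the 1-minute buckets contain no matching events, and each of those buckets should be charted as 0.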

In a streaming query, count(*) would also return zero for such a window, but only if the SELECT statement were actually executed. Because the SELECT statement is executed only when events match the FROM and WHERE clauses, this query can never run count(*) over a window that contains zero matching events, so it never produces a zero.

The filter() workaround

Our documentation suggests using the filter() function to work around this. Here’s another example query to consider:

SELECT filter(count(*), WHERE monitorName = '15min monitor' AND result = 'SUCCESS') FROM SyntheticCheck

Let’s also say that the condition has a window duration equal to 1 minute. In this scenario, our streaming alerts platform will check whether any events match FROM SyntheticCheck before deciding whether to execute the SELECT statement, and because we know the account has a monitor running every minute, the operation SELECT filter(count(*), WHERE monitorName = '15min monitor' AND result = 'SUCCESS') will be executed every minute. This is what allows us to finally see a true value of 0 from the result of a count() function.

Visualizing the above scenarios

Of course, because an at-rest query has no concept of the Alerts order-of-operations, an absence of data will return a 0 in both scenarios. It can be very useful to simulate the output of the streaming alerts engine for a given query, especially if your query is complex and especially if you don’t have an alert condition currently set up (leaving you unable to use NrAiSignal for analysis).

With the recent release of the if() function, it’s now possible to write a conditional statement that simulates the order-of-operations. Here’s a basic template you can use:

{WITH clause} FROM ({FROM clause} {SELECT statement} AS 'queryResult', count(*) AS 'countResult' {WHERE clause} {FACET clause} TIMESERIES {Window duration} {SLIDE BY clause}) SELECT latest(if(countResult > 0, queryResult, NULL)) {FACET clause} SINCE 6 hours ago TIMESERIES {Window duration, or slide-by duration if sliding query}

In summary, this is a nested query* with two aggregator functions on the inner query: one that’s the original aggregator function used by the alert condition (queryResult), and one that’s a simple count(*) used to determine whether any events are matching (countResult).

If countResult is equal to 0, the outer query outputs NULL for that window. But if countResult is greater than 0, the outer query outputs queryResult, thereby simulating the decision our streaming alerts platform makes about whether to execute the SELECT statement.**
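As a rough sketch, here is how the template might be filled in for the two '15min monitor' queries from earlier. It assumes a 1-minute window duration and leaves out the WITH, FACET, and SLIDE BY placeholders, since neither query uses them. First, the version that filters in the WHERE clause:

```nrql
FROM (
  FROM SyntheticCheck
  SELECT count(*) AS 'queryResult', count(*) AS 'countResult'
  WHERE monitorName = '15min monitor' AND result = 'SUCCESS'
  TIMESERIES 1 minute
)
SELECT latest(if(countResult > 0, queryResult, NULL))
SINCE 6 hours ago TIMESERIES 1 minute
```

Here queryResult and countResult are the same count over the same filtered events, so any window without a successful check should come out as NULL rather than 0. The filter() version moves the conditions out of the WHERE clause:

```nrql
FROM (
  FROM SyntheticCheck
  SELECT filter(count(*), WHERE monitorName = '15min monitor' AND result = 'SUCCESS') AS 'queryResult', count(*) AS 'countResult'
  TIMESERIES 1 minute
)
SELECT latest(if(countResult > 0, queryResult, NULL))
SINCE 6 hours ago TIMESERIES 1 minute
```

Because countResult now counts every SyntheticCheck event in the window, including those from the monitors that run every minute, queryResult should pass through as a real 0 in the windows where the 15-minute monitor has no successful result.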

Putting it all together

Now let’s consider a slightly more complex alert condition. Its query is the following:

FROM SyntheticCheck SELECT uniqueCount(location) WHERE monitorName = '5min monitor' AND result = 'SUCCESS'

…it has a 1-minute window duration, and gap filling is set to “last known value.” The preview chart or the Query Builder output of your query is going to look like this by default:

…but as we know from our journey into the order-of-operations, evaluation isn’t going to look like a contiguous line with drops to 0 — it’s going to look like a series of 1s broken up by NULLs. So now if we break the query up into the template I provided earlier, we end up with a scatter-plot-like chart that’s a more accurate representation of streaming alert evaluation:
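For reference, the template filled in for this condition might look like the following. This is a sketch built from the query and the 1-minute window duration described above; the SINCE 6 hours ago range is simply carried over from the template and can be adjusted to the period you want to inspect.

```nrql
FROM (
  FROM SyntheticCheck
  SELECT uniqueCount(location) AS 'queryResult', count(*) AS 'countResult'
  WHERE monitorName = '5min monitor' AND result = 'SUCCESS'
  TIMESERIES 1 minute
)
SELECT latest(if(countResult > 0, queryResult, NULL))
SINCE 6 hours ago TIMESERIES 1 minute
```

The condition has no FACET or SLIDE BY clause, so those placeholders are omitted, and the outer TIMESERIES uses the window duration.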

And to take things a step further, we can select “Preserve the last value” in the “Null values” configuration panel to simulate our gap filling configuration:

…and we’re now left with a much more accurate window into evaluation, one that you should find lines up closely with NrAiSignal, barring any data that’s dropped from evaluation due to time skew.

I hope that helps demystify some of the more obscure concepts of evaluation! Let me know what you think in the comments.


* If your alert condition’s query itself is a nested query, the order of operations will apply to every level of the query — but note that loss of signal thresholds are incompatible with alert conditions that use nested aggregation.

** The template doesn’t support subqueries, but guess what — neither does Alerts!
