Relic Solution: Extending the functionality of NRQL alert conditions

Let’s peek under the hood for a moment

Let’s talk for a moment about how New Relic Alerts evaluates your data. Once per evaluation window (which can be configured from 30 seconds all the way to 2 hours), every alert condition looks at the data stream it’s given and evaluates it numerically against the condition’s threshold. One window later, the data is again evaluated. Each window is evaluated discretely (on a pass/fail scale), without regard to any data before or after that single window*.

If your threshold has a duration longer than one evaluation window, the evaluation system builds a model of the data (e.g. “for at least 15 minutes” with a 1-minute evaluation window will keep track of each minute’s pass/fail result for a rolling 15 minutes). Once the alerts evaluation system gets enough fail results in a row, it will open a violation.

Keep in mind that each window is evaluated discretely. The evaluation system does not look at any of the windows in the past, other than to develop the pass/fail model over a rolling time window*.
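
To make the rolling pass/fail model concrete, here is a minimal Python sketch of the idea. It is not New Relic's actual implementation. The function name, the sample values, and the choice of "above the threshold" as the failing direction are all assumptions made for illustration.

```python
from collections import deque


def first_violation_window(values, threshold, windows_required):
    """Return the index of the window at which a violation would open, or None.

    Each value is judged on its own (pass/fail) against the threshold. Only
    the pass/fail results are remembered, for a rolling span of
    `windows_required` windows.
    """
    recent = deque(maxlen=windows_required)  # rolling pass/fail history
    for index, value in enumerate(values):
        # Assumption: a window fails when its value is above the threshold.
        recent.append(value > threshold)
        if len(recent) == windows_required and all(recent):
            return index
    return None


# One value per 1-minute window, threshold of 5, "for at least 3 minutes".
per_minute = [2, 7, 8, 3, 6, 9, 7, 1]
print(first_violation_window(per_minute, threshold=5, windows_required=3))  # prints 6
```

Note that the two failing windows at the start (7 and 8) do not open a violation, because the passing window that follows (3) breaks the run.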

One window at a time, got it

If you think about this for a moment, you might see how NRQL queries using percentile or stddev are a lot less useful than they seem, when used in an alert condition. After all, if you calculate the standard deviation over 24 hours, that can be meaningful. But stddev(duration), or percentile(duration,95) calculated over only 60 seconds is less meaningful.

Whoa! So … how do I set up an alert condition to monitor standard deviation over the past 24 hours?

Since the alert evaluation system only looks at a single, discrete window at a time, but NRQL queries in general are much more flexible, you just need to figure out a way to wrangle a standard NRQL query (which can perform functions over longer periods than a single alerts evaluation window) into an alert condition. Here is one way you can accomplish that.

  1. Set up a cron job to run a script once per minute (since the alerts evaluation system expects to see a data point every minute). Alternatively, you can use a Synthetics Scripted API Test Monitor to run a script for you once each minute.
  2. In the script, use the Query API or Nerdgraph to run the exact NRQL query you want. As an example, SELECT stddev(someAttribute) FROM SomeEventType SINCE 24 hours ago.
  3. The script should then parse the JSON that is returned and extract the value or values that are important.
  4. Next, the script should re-package the important values as a JSON object.
  5. Finally, the script would use the Insights Event API to insert the JSON as a custom event.
  6. Once the cron job and script are up and running, set up a NRQL alert condition to monitor the attribute in the custom event that is of interest.
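
Steps 3 and 4 can be sketched in Python roughly as follows. This is only a sketch under stated assumptions: it assumes the query response has a top-level "results" list whose first object holds exactly one numeric value, and the function name, the sample response body, and the attribute name stddev24h are hypothetical. Check the JSON your own query returns before relying on this shape.

```python
import json


def build_custom_event(query_response, event_type, attribute_name):
    """Turn a single-value query response into a custom event payload."""
    # Assumption: the value of interest lives in the first "results" object.
    first_result = query_response["results"][0]
    # Take the only numeric value in the result, whatever its key is called.
    numbers = [
        value for value in first_result.values()
        if isinstance(value, (int, float)) and not isinstance(value, bool)
    ]
    if len(numbers) != 1:
        raise ValueError("expected exactly one numeric value in the result")
    return {"eventType": event_type, attribute_name: numbers[0]}


# Hypothetical response body, shaped the way this sketch assumes.
sample_body = '{"results": [{"standardDeviation": 0.42}]}'
event = build_custom_event(json.loads(sample_body), "MyCustomEvent", "stddev24h")
print(json.dumps(event))
```

The resulting dictionary is what the script would send in step 5, and the attribute name you choose here is the one the NRQL alert condition in step 6 would query.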

With this method, you get exactly the value you are looking for inserted as a custom event into Insights once per minute (or whatever time frame you’d like), which allows the alerts evaluation system to evaluate, for example, a 24-hour calculation of standard deviation, or the 95th percentile over the past 12 hours – anything you can write a standard NRQL query in Insights for, you can now set up an alert condition to monitor!

I hope this helps to better understand how the alerts evaluation system works, as well as providing a way to expand its functionality. Let us know if you come up with other ways to do this!


*The exception to this model is Sliding Windows Aggregation (SWA), which will evaluate 1 window’s worth of data every X seconds, and you determine the value of X using the slide-by interval. Read more about how SWA works in this documentation.

This is awesome! I’ve actually been working on a script to this effect for a while. I polished it up and posted it below in the hopes that it can help others who go the scripting route. My script uses Python 3 and the Requests library.

I made mine for a use case where someone might want to alert on the rate of change of an average (i.e. alert whenever the rate of change is greater than X amount). Here’s the query I used as an example:

SELECT average(duration)/1000, stddev(duration)/1000 from SyntheticCheck since 8 minutes ago until 5 minutes ago COMPARE WITH 13 minutes ago where monitorName = 'Test Monitor'

My SINCE clause is a bit complicated because I had to build in my own evaluation offset to ensure latency wouldn’t prevent me from getting complete results. Using a longer time period (like 24 hours) would probably render that moot.

Script:

#!/usr/bin/env python3
import requests
import urllib.parse

#This is the file where all your account/query/API key info is.
import request_config

#This bit builds our Query API call.
account_id = request_config.stuff['account_id']
nrql_query = urllib.parse.quote_plus(request_config.stuff['query'])

q_headers = {'X-Query-Key': request_config.stuff['query_key'], 'Content-Type': 'application/json'}

def build_url():
	q_url = 'https://insights-api.newrelic.com/v1/accounts/{}/query?nrql={}'.format(account_id, nrql_query)
	return q_url

#This is where we make the Query API call.
r = requests.get(build_url(), headers=q_headers)
#Stop here if the Query API returned an error status.
r.raise_for_status()

#This is where we parse the content of the call.
results = r.json()

#This is where we do math.
avg = results['current']['results'][0]
prev_avg = results['previous']['results'][0]

diff = avg['result'] - prev_avg['result']

#This bit builds our Insert API call.
i_url = 'https://insights-collector.newrelic.com/v1/accounts/{}/events'.format(account_id)
events = {
	"eventType": "MyCustomEvent",
	"latestAverage": avg['result'],
	"delta": diff
}
headers = {
	"X-Insert-Key": request_config.stuff['insert_key']
}

#This is where we make the Insert API call.
r = requests.post(i_url, headers = headers, json = events)

print(r.status_code)

And here is the request_config file where you put in your account, API, and query information:

stuff = {"account_id" : "[Your_Account_ID]",
		"query" : "SELECT function() FROM EventType WHERE attribute = 'value' SINCE 24 hours ago",
		"query_key" : "[Your_Query_API_Key]",
		"insert_key" : "[Your_Insert_API_Key]"}

Fun fact: you may have noticed my query uses average and standard deviation functions. My next goal was to tweak the math so that I also calculate whether the current average is within one deviation of the previous average.
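
One way that check could look, as a sketch rather than a finished part of the script above: the function name and the numbers are made up for illustration, and which entry of the query results holds the standard deviation is something you would need to confirm against the actual response.

```python
def within_one_deviation(current_avg, previous_avg, previous_stddev):
    """True if current_avg is no more than one stddev away from previous_avg."""
    return abs(current_avg - previous_avg) <= previous_stddev


# Made-up numbers for illustration only.
print(within_one_deviation(1.30, 1.10, 0.25))  # difference of about 0.20, inside 0.25
print(within_one_deviation(1.60, 1.10, 0.25))  # difference of about 0.50, outside 0.25
```

The boolean result could be added to the custom event as another attribute, so the alert condition can key off it.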

Hope this helps! I’d love to see other use cases people have scripted.

Hi,
It didn’t work for me. I want to set up an alert for when a backup file is NOT created within 24 hours on Windows, using the nri-flex integration. I set up the task, as you mentioned, running every 5 minutes, and my PowerShell script is below:

$directory | Get-ChildItem | Where-Object { $_.PsIsContainer -eq $false -and $_.Name -like $filePattern } | Measure-Object | Select-Object -Property @{ expression = { $directory }; name = "directoryName" },@{ expression = { $filePattern }; name = "filePattern" },@{ expression = { $_.Count }; name = "fileCount" } | ConvertTo-Json

as well as a query:
PS C:\Program Files\New Relic\newrelic-infra\custom-integrations> Invoke-WebRequest -Uri https://insights-api.newrelic.com/v1/accounts/2240461/query?nrql=SELECT%20*%20FROM%20fileTestlookup%20SINCE%2010%20minutes%20AGO -Headers @{"X-Query-Key"="xxxxxxxxxxxxxxx"} -ContentType "application/json" -Method GET

but nothing is happening. Can you give me a push in the right direction?

Thank you in advance

I got it working for a 3-minute period. Now testing for 24 hours. :crossed_fingers:

Hey @ababichenko, Let us know how you get on with it :slight_smile:

Hey, yes, it’s working. It wasn’t that bad once I learned more about it. Thank you all :slight_smile:

Fantastic! Thanks for confirming @ababichenko