Alerting when disk will fill up within X days

Hi all,

I am trying to create a NRQL alert for a disk that is going to fill up within a given number of days. I have a query that returns a number, but I don’t see any value when I use the same query in a NRQL alert condition.

The following query applies to RabbitMQ data scraped from Prometheus:

SELECT
   ifthen(predictLinear(rabbitmq_disk_space_available_bytes, 864000 SECONDS) < 0, -10 / (1 + predictLinear(rabbitmq_disk_space_available_bytes, 864000 SECONDS) / latest(rabbitmq_disk_space_available_bytes) ))
FROM Metric FACET podName

It has been mostly converted from a Prometheus alert, but with a change to the else part.

Because of the ifthen, the query returns nothing unless the current values predict that the disk will fill up within 10 days.
For a disk that is going to fill up, I get a number, which right now is 4.03. I created a NRQL alert with a static condition, using 9 for warning and 4 for critical.

If I remove the ifthen and use a query like:

SELECT
    -10 / (1 + predictLinear(rabbitmq_disk_space_available_bytes, 864000 SECONDS) / latest(rabbitmq_disk_space_available_bytes) )
FROM Metric FACET podName

I get data for the alert, but I can’t build a useful alert from it.

Hi @caiotiago, sorry you have been waiting a while for a response from our community. I’m going to bring this back to the attention of our support team. Thanks for your patience!

Neal Mc

@caiotiago Following up to see if you were able to create this NRQL alert? We rely on community members to respond in this category and so far it seems like this issue has stumped them.

Hi, yes, I was able to create an alert.

I created an alert with this condition for RabbitMQ:

SELECT
ifthen(predictLinear(rabbitmq_disk_space_available_bytes, 864000 SECONDS) < 0, 10 / (1 - predictLinear(rabbitmq_disk_space_available_bytes, 864000 SECONDS) / latest(rabbitmq_disk_space_available_bytes) ))
FROM Metric where clusterName = 'prod' FACET podName

and this for Kubernetes Nodes:

SELECT
ifthen(predictLinear(k8s.node.fsAvailableBytes, 864000 SECONDS) < 0, 10 / (1 - predictLinear(k8s.node.fsAvailableBytes, 864000 SECONDS) / latest(k8s.node.fsAvailableBytes) ))
FROM Metric FACET host

I had to fine-tune the condition thresholds to avoid false positives; otherwise a few hundred MB of disk usage within a few minutes would trigger the alert. The warning and critical threshold durations are 30 and 60 minutes respectively, and the warning and critical values are 9 and 2 respectively. The Kubernetes one has never fired in production, but the RabbitMQ one is similar and it triggered properly. Both are doing fine in my tests.

A brief explanation:

ifthen(
    predictLinear(rabbitmq_disk_space_available_bytes, 864000 SECONDS) < 0,

That part is a plain “if the predicted available disk space ten days from now would be negative”. It is required for the alert to be useful, because otherwise the query would return negative and zero values as well, and I can’t create a “lower than” alert on New Relic that would ignore the negative values.

 10 / (1 - predictLinear(rabbitmq_disk_space_available_bytes, 864000 SECONDS) / latest(rabbitmq_disk_space_available_bytes) ))

predictLinear gives the predicted available space. 1 - predicted/latest gives the rate, that is, how much space would be consumed over the 10 days as a multiple of the currently available space. 10 / rate then gives the number of days until the disk would be full.
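For anyone following the arithmetic, here is a sketch of where that expression comes from, assuming consumption stays linear over the whole window. The symbols are my own labels, not anything from New Relic: A is the latest available bytes, P is the predicted available bytes at the end of the window, H is the window length in days (10 here), r is the consumption per day and T is the number of days until the disk is full.

```latex
\begin{align*}
r &= \frac{A - P}{H} \\
T &= \frac{A}{r} = \frac{A\,H}{A - P} = \frac{H}{1 - P/A} \\
T &= \frac{10}{1 - P/A} \qquad \text{for } H = 10
\end{align*}
```

Whenever P < 0 and A > 0, the denominator is greater than 1, so T falls strictly between 0 and 10 days.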

As suggested in another post, you should still have a static alert for a given threshold, because the alert here will fire only if the disk usage is increasing. It was a valid use case for us because we had a situation where RabbitMQ disk usage would slowly increase until the disks were full.

So, if the available free space were 10GB and predictLinear for 10 days from now returned -40GB, the return value would be 2:
10 / (1 - (-40 / 10)) = 10 / (1 + 4) = 10 / 5 = 2

That corresponds to 5GB per day: over 2 days, 10GB is consumed, which was the correct prediction.

If the available free space were 30GB and predictLinear returned -1GB, we would have:

10 / (1 - (-1/30)) = 10 / (1 + 1/30) = 9.67741935484 days

That corresponds to 3.1GB per day: 9.67741935484 * 3.1 = 30GB.
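If you want to sanity-check the numbers outside New Relic, here is a minimal Python sketch of the same arithmetic. It is not NRQL and does not call any New Relic API; the function name days_until_full and its parameters are made up for illustration, and returning None stands in for the query returning nothing when the prediction is not negative.

```python
from typing import Optional


def days_until_full(latest_free: float,
                    predicted_free: float,
                    horizon_days: float = 10.0) -> Optional[float]:
    """Estimate days until the disk is full, assuming linear consumption.

    latest_free    -- currently available space (any unit, must be > 0)
    predicted_free -- predicted available space after horizon_days (same unit)
    Returns None when the prediction is not negative, mirroring the
    condition used in the alert query.
    """
    if latest_free <= 0:
        raise ValueError("latest_free must be positive")
    if predicted_free >= 0:
        return None  # disk is not predicted to fill within the horizon
    return horizon_days / (1 - predicted_free / latest_free)


# The two worked examples from this post (values in GB).
print(days_until_full(10, -40))  # 10GB free, -40GB predicted
print(days_until_full(30, -1))   # 30GB free, -1GB predicted
print(days_until_full(30, 5))    # prediction still positive, so None
```

The first call reproduces the 2-day result and the second the roughly 9.68-day result above.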

That is great, and THANK YOU for sharing! I’m sure other community members will find this very helpful!

Just a note: I cannot find any reference to the ifthen function, and I doubt this query works (anymore). The NRI query builder returns an error. Can somebody confirm?

This is not consistent when using TIMESERIES, which is implicitly used in alert conditions.
I also cannot find any reference to ifthen anywhere. Alerting on the disk fill rate is essential, so it would be good to have some clarification on this. Alerts based on absolute free disk space, such as percent or bytes free, are not usable because they mean different things for a 100GB disk and a 100TB disk: 10% is 10GB in the first case and 10TB in the second, which is not small. I would prefer a fill rate in these cases, so that I can correlate it with the time needed to remediate the issue.
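To make the point about fixed percentage thresholds concrete, here is a small Python sketch. The fill rate of 5GB per day is a made-up number for illustration only, and days_to_fill is a hypothetical helper, not a New Relic function.

```python
def days_to_fill(free_gb: float, fill_rate_gb_per_day: float) -> float:
    """Days until the free space is used up at a constant fill rate."""
    if fill_rate_gb_per_day <= 0:
        raise ValueError("fill rate must be positive")
    return free_gb / fill_rate_gb_per_day


# Both disks sit at the same "10% free" threshold.
small_disk_free_gb = 10.0      # 10% of a 100GB disk
large_disk_free_gb = 10_000.0  # 10% of a 100TB disk (decimal units)

# Hypothetical fill rate, identical for both disks.
rate = 5.0  # GB per day

print(days_to_fill(small_disk_free_gb, rate))  # days left on the 100GB disk
print(days_to_fill(large_disk_free_gb, rate))  # days left on the 100TB disk
```

At the same percentage, the small disk leaves about 2 days to react while the large one leaves about 2000, which is why a time-to-fill value is easier to act on than percent free.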