Drive/Volume Failure Alert Needed

Hello, we need a volume-failure alert for when a volume is no longer responding, such as when a drive dies. We also need to know which host it is or was connected to.

Since the New Relic Infrastructure agent is always querying volumes to see if their capacity changes, would there be a way to alert us when the agent can’t get capacity data anymore, regardless of whether the drive removal was intentional? Ideally we’d only get an alert if the agent knew the OS was still running.


Hey @bgaudreault! I see you also have a private support case going with a Support Engineer. (ticket #354048)

Let us know what helped you get all sorted—I am sure it will be something we can all learn from in the future. Thanks!

@bgaudreault
As there is no out-of-the-box solution, you will first need to set up custom attributes as described here. Create a custom attribute called something like mountCount. On each host with 5 mount points, edit the config file so that the value of mountCount is 5. Do the same for each group of hosts with an identical number of volumes, so that every host has a mountCount custom attribute whose value is the number of volumes on that host.

You will then be able to use something like the following query in a condition in your policy. A single policy can hold all the conditions, with each condition scoped to the group of hosts that share the same mountCount value. Here is an example for the group of hosts with 5 mount points:

SELECT uniqueCount(entityAndMountPoint) FROM StorageSample where mountCount = '5' FACET hostname

You could then set the threshold to watch for this count to go below 5, along the lines of "Query returns a value below 5 for at least 5 minutes". The "for at least 5 minutes" part makes sure the count stays below 5 for 5 full consecutive minutes, so that ephemeral network glitches do not result in a false-positive violation.
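
To make the comparison behind that condition concrete, here is a minimal Python sketch of the same idea: count the distinct mount points seen per host and flag any host that reports fewer than expected. It is only an illustration of the logic, not how New Relic evaluates the query. The sample rows, the host names, and the function name hosts_missing_mounts are hypothetical; only hostname, entityAndMountPoint, and the idea of an expected mountCount per host come from the query above.

```python
from collections import defaultdict


def hosts_missing_mounts(samples, expected_by_host):
    """Return {hostname: (seen, expected)} for hosts reporting too few mounts.

    samples: iterable of dicts with "hostname" and "entityAndMountPoint" keys
             (hypothetical stand-ins for StorageSample rows).
    expected_by_host: the mountCount value configured for each host.
    """
    seen = defaultdict(set)
    for sample in samples:
        # A set gives the same result as uniqueCount(): duplicates count once.
        seen[sample["hostname"]].add(sample["entityAndMountPoint"])
    return {
        host: (len(seen[host]), expected)
        for host, expected in expected_by_host.items()
        if len(seen[host]) < expected
    }


# Hypothetical data: web-1 reports both of its mounts, db-1 only one of two.
samples = [
    {"hostname": "web-1", "entityAndMountPoint": "1:/"},
    {"hostname": "web-1", "entityAndMountPoint": "1:/boot"},
    {"hostname": "web-1", "entityAndMountPoint": "1:/"},
    {"hostname": "db-1", "entityAndMountPoint": "2:/"},
]
expected = {"web-1": 2, "db-1": 2}
result = hosts_missing_mounts(samples, expected)
print(result)  # {'db-1': (1, 2)}
```

In this sketch a host that sends no samples at all is also flagged, with a seen count of 0.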

I realize this is quite a bit of work. Ultimately, being able to create “Disk Not Reporting” alert conditions out-of-the-box is an excellent idea for a new feature and I have submitted a Feature Request for this. If you have an opportunity, you might also want to visit our Feature Ideas section of Explorer’s Hub where you can add your use case for the proposed feature as a Feature Idea. This will allow other New Relic users to discover your idea and participate in the conversation by adding their use cases, possible workarounds, and even their vote.


Can you help with the custom attribute?

Here’s my file /etc/newrelic-infra.yml

enable_process_metrics: true
status_server_enabled: true
status_server_port: 18003
license_key: KEY

mount_count: 5

Is it supposed to look like that?

Hi @stanislav.kuchuk :wave:

I hope you are doing well.

Custom attributes can be set as shown here in the sample config file.

In your case, I would write it like this :point_down:

enable_process_metrics: true
status_server_enabled: true
status_server_port: 18003
license_key: KEY
custom_attributes:
   mount_count: 5

I hope this will be helpful to you.

Please do not hesitate to contact me in case of any additional queries or issues. I will be happy to help you.

I hope you have a wonderful day.


Thank you.

I added it to the file and restarted the New Relic agent, but I don’t see it in the New Relic web UI.

I’m running this in the query builder:
SELECT * FROM FileSystemSample

Hi @stanislav.kuchuk :wave:

I think you will need to check the storage sample.

SELECT * FROM StorageSample

Please do not hesitate to contact me in case of any additional queries or issues. I will be happy to help you.

I see only data like

/
/boot

but I need to see the mounted volume:

Device: //las1cifs1.riotgames.com/las1brepo
Mounted on: /las1cifs1/las1brepo

Why is it not showing up?

Any help here?
Do you have any ideas?

Hi @stanislav.kuchuk,

I hope you are doing well!

If your server uses disks with file systems other than the supported file systems, StorageSample events will not be generated for those disks.

Please do not hesitate to contact me in case of any additional queries or issues. I will be happy to help you.

Thanks,
Jay Vadera

Please suggest a workaround.

Hi :wave:

You can make use of our really cool integration, nri-flex.

Flex can take any input using data source APIs, process it through functions, and send metric samples to New Relic as if they came from an integration.

For a quick introduction to Flex, read our blog post. You can also have a look at the 200+ example integrations!

As for your case, we have this example for you, which is very easy to set up. All you have to do is follow the step-by-step instructions to obtain the required metrics.

This example shows how to collect disk metrics from file systems not natively supported by New Relic using the df command in Linux.
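
To give a feel for what such a df-based collector does with the command output, here is a small Python sketch that turns df-style text into one record per mounted file system. This is not Flex itself. The column names in HEADER, the function name parse_df, the cifs type, and all byte values in SAMPLE are assumptions made for illustration; only the device and mount point are taken from this thread.

```python
# Hypothetical field names, one per column of `df -PT -B1` style output.
HEADER = [
    "fs", "fsType", "capacityBytes", "usedBytes",
    "availableBytes", "usedPerc", "mountedOn",
]


def parse_df(output):
    """Parse df-style text into a list of dicts, one per file system."""
    rows = []
    for line in output.strip().splitlines()[1:]:  # skip the header row
        # Limit the split so a mount path containing spaces stays intact.
        fields = line.strip().split(None, len(HEADER) - 1)
        if len(fields) != len(HEADER):
            continue  # ignore malformed lines
        row = dict(zip(HEADER, fields))
        for key in ("capacityBytes", "usedBytes", "availableBytes"):
            row[key] = int(row[key])
        row["usedPerc"] = float(row["usedPerc"].rstrip("%"))
        rows.append(row)
    return rows


# Invented sample output; the sizes are not real measurements.
SAMPLE = """\
Filesystem                          Type 1-blocks   Used      Available Capacity Mounted on
//las1cifs1.riotgames.com/las1brepo cifs 1000000000 250000000 750000000 25%      /las1cifs1/las1brepo
"""

rows = parse_df(SAMPLE)
for row in rows:
    print(row["mountedOn"], row["fsType"], row["usedPerc"])
```

Each resulting record carries a mountedOn value, which is the kind of field an alert query could later filter on to watch one specific volume.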

I hope this helps you in getting the desired metrics.

Please do not hesitate to contact me in case of any additional queries or issues. I will be happy to help you.

Thanks,
Jay Vadera

I’m familiar with nri-flex.
I used https://github.com/newrelic/nri-flex/blob/master/docs/basic-tutorial.md#YourfirstFlexintegration

My query is
FROM FileSystemSample SELECT *

See the attachment.

The problem is creating the alert.

How do I create an alert that fires if this mounted volume is not present on the host for 3-5 minutes?

Hi @stanislav.kuchuk :wave:

Will this work?

select count(*) from FileSystemSample

If the count is 0 for 3-5 minutes, it means that it’s not reporting, correct?

Is that what you are looking for?
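
As a rough illustration of the "count is 0 for 3-5 minutes" idea, here is a minimal Python sketch that checks whether the most recent one-minute windows are all empty. It only models the rule; it is not how New Relic evaluates alert conditions. The function name missing_for and the per-minute counts are hypothetical.

```python
def missing_for(counts_per_minute, minutes):
    """Return True if the last `minutes` one-minute windows all had 0 samples.

    counts_per_minute: sample counts per minute, oldest first
                       (hypothetical stand-in for count(*) per window).
    """
    if minutes <= 0 or len(counts_per_minute) < minutes:
        return False  # not enough history to decide
    return all(count == 0 for count in counts_per_minute[-minutes:])


# The volume stops reporting and stays silent for three minutes.
print(missing_for([4, 4, 0, 0, 0], 3))

# A single empty minute followed by recovery does not qualify.
print(missing_for([4, 0, 4, 0, 0], 3))
```

Requiring several consecutive empty windows is what keeps a single missed report from raising an incident.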

Yeah, but how do I create an alert of that kind?

Hi :wave:

You can create an NRQL-based alert.

Please follow this document for creating that.

I hope this will be helpful to you.

Please do not hesitate to contact me in case of any additional queries or issues. I will be happy to help you.

I hope you have a wonderful day.

It’s not working for some reason. I need help.
https://one.newrelic.com/nrai/alerts-classic/policies?account=2340385&state=7e9ba002-89c5-367b-1383-0ac8f61346d4

Why can’t I upload attachments?

Hi @stanislav.kuchuk :wave:

When I impersonated your account, I could see that the alert condition was set successfully. Were you able to set it up?

I also saw an incident raised for the same condition when the count was 0 for more than one minute.

Can you share what’s not working?