Alert on network metrics

It would be helpful to alert when network errors are happening. It would also be helpful to know when the network interface is reaching a certain percentage of total bandwidth.

@Trevor_Dearham - The agent logs some basic network error counts, which you can see on the network tab of the server agent screen. You can create a custom metric alert using the metric name System/Network/{InterfaceName}/All/errors/sec

You can also create custom metric alerts on received and transmitted network stats using the metric names System/Network/{InterfaceName}/Received/bytes/sec and System/Network/{InterfaceName}/Transmitted/bytes/sec
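Because these metrics are reported in bytes per second, a "percentage of total bandwidth" alert needs the percentage converted into a bytes/sec threshold first. The following is a minimal sketch of that arithmetic; the function name and the assumption that the link speed is known in megabits per second are illustrative, not part of the agent.

```python
def bandwidth_threshold_bytes_per_sec(link_speed_mbps: float, percent: float) -> float:
    """Return the bytes/sec rate that equals `percent` of the link capacity.

    link_speed_mbps is the nominal link speed in megabits per second
    (assumed to be known by the reader, e.g. 1000 for gigabit Ethernet).
    """
    if not 0 <= percent <= 100:
        raise ValueError("percent must be between 0 and 100")
    bits_per_sec = link_speed_mbps * 1_000_000  # megabits -> bits
    bytes_per_sec = bits_per_sec / 8            # bits -> bytes
    return bytes_per_sec * percent / 100


# 80% of a 1 Gbit/s link, usable as the alert threshold for a bytes/sec metric
print(bandwidth_threshold_bytes_per_sec(1000, 80))  # 100000000.0
```

The result would be entered as the threshold on the Received or Transmitted bytes/sec metric for the interface in question.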

In addition, if you want to aggregate all network stats, you can use All in place of the interface name and create the alert on the aggregate stats.

@sdelight I raised a ticket about the value reported by this metric, and they stated that it “reports the number of times we check for errors which is currently 3 times per minute”.

From verbose logging I see that Server Monitor was reporting the following:

"name": "System/Network/All/All/errors/sec"

I assume that one or more of the zeros are the actual number of network errors, but the 3 prevents this metric from being used to just alert on the number of network errors.

@Trevor_Dearham - This would be the expected metric data for the server agent with a value of zero. The first number is the count of the data points within the duration, in this case 3 because the Windows and Linux server agents query once every 20 seconds. The other 5 values correspond to total errors, min, max, exclusive time (which doesn’t correspond to anything for network errors) and sum of squares.
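To illustrate how a count of 3 can sit alongside a total of zero errors, here is a rough sketch of how such summary values could be built from three 20-second samples. The class and function names are hypothetical and this is not the agent's actual code; exclusive time is left out because, as noted above, it has no meaning for network errors.

```python
import math
from dataclasses import dataclass


@dataclass
class MetricStats:
    """Hypothetical container for the summary values described above."""
    count: int             # number of samples taken in the period
    total: float           # sum of the sampled values (total errors)
    minimum: float
    maximum: float
    sum_of_squares: float

    def average(self) -> float:
        # The count is only the divisor; it is never reported as the error value.
        return self.total / self.count if self.count else 0.0

    def std_dev(self) -> float:
        if self.count == 0:
            return 0.0
        mean = self.average()
        variance = max(self.sum_of_squares / self.count - mean ** 2, 0.0)
        return math.sqrt(variance)


def from_samples(samples: list[float]) -> MetricStats:
    """Summarise one reporting period's samples (assumes at least one sample)."""
    return MetricStats(
        count=len(samples),
        total=sum(samples),
        minimum=min(samples),
        maximum=max(samples),
        sum_of_squares=sum(s * s for s in samples),
    )


quiet = from_samples([0, 0, 0])   # three checks, no errors
noisy = from_samples([0, 6, 0])   # three checks, six errors in one of them
print(quiet.count, quiet.average())  # 3 0.0
print(noisy.count, noisy.average())  # 3 2.0
```

In both cases the count is 3, but only the total (and the values derived from it) reflects the errors that occurred.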

The count value shouldn’t cause any issues with alerting off the metric, as the alert uses all of the provided values to calculate the average and the other statistics accurately. This is similar to the model we use for all metrics within New Relic. It’s certainly possible that we have a bug in the new Alerts system (it is in beta after all), but I would need a specific incident to investigate. Is it possible for you to send me a link to an incident that you feel fired incorrectly?

@sdelight Thank you for the explanation of the values. As this is using the custom metric value, there would ideally need to be a way to select which value in the array to use for the alert, so that the total number of errors could be selected in this case and the other values could be ignored.

This is the incident I raised.

@Trevor_Dearham - The way our metric system works is that you select a value function that evaluates the data submitted by your agent, rather than interacting directly with the submitted metric data. We use the submitted data to calculate the available value functions. Those value functions are evaluated for the selected time period to determine whether a violation has occurred. Since not all metrics are reported in exactly the same manner, the descriptions break down a little bit, but for the network error rate, the “an average” value function should match the errors graph from the server network graph.
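As a rough illustration of the idea, the sketch below evaluates an average over a time window from per-period (count, total) pairs and compares it with a threshold. This is only one plausible reading of how an "average" value function might behave; the function names, the data shape, and the strict greater-than comparison are assumptions, not New Relic's implementation.

```python
def average_over_window(data_points: list[tuple[int, float]]) -> float:
    """Average across a window of (count, total) pairs, one pair per period."""
    total_count = sum(count for count, _ in data_points)
    total_value = sum(total for _, total in data_points)
    return total_value / total_count if total_count else 0.0


def violates(data_points: list[tuple[int, float]], threshold: float) -> bool:
    """Return True when the window average exceeds the threshold (assumed rule)."""
    return average_over_window(data_points) > threshold


# Five one-minute periods, each sampled 3 times; errors appear in two of them.
window = [(3, 0), (3, 0), (3, 9), (3, 0), (3, 3)]
print(average_over_window(window))  # 0.8
print(violates(window, 0.5))        # True
print(violates([(3, 0)] * 5, 0.5))  # False
```

Under this reading, a window in which every period has a count of 3 and a total of 0 averages to zero and cannot trigger a violation on its own.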

@sdelight Thank you. I think that is what I had before, which caused Incident 59 on my account, but I’ve set it up again and it seems to be working now.

Glad to hear it @Trevor_Dearham. Thanks for letting us know.