
Feature Idea Poll: Agent Not Connected OR Server Not Responding?

feature-idea

#13

@keagan_peet unless each of your servers is publicly accessible, standard Synthetics monitors would not be able to tell if your servers are up. We do have a private locations feature that would allow you to run monitors within your network, though. That would just require each server to have an HTTP end-point for the private minion to check against.

Another way to solve for the hung JVM is to alert on the number of reporting app instances for a given application. In this case you would basically say, alert me if fewer than 10 JVMs are reporting data to NR. Would something like that solve the problem?
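
To make the instance-count idea concrete, here is a minimal sketch in Python. It is not New Relic's implementation or API: `InstanceCountCondition`, `evaluate`, and the app names are hypothetical, and the counts are assumed to come from wherever you collect "reporting instances" per application.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List


@dataclass
class InstanceCountCondition:
    """Hypothetical condition: alert when too few instances are reporting."""
    app_name: str
    min_instances: int = 1  # assumed default for a non-clustered app

    def violated(self, reporting_instances: int) -> bool:
        return reporting_instances < self.min_instances


def evaluate(conditions: Iterable[InstanceCountCondition],
             counts: Dict[str, int]) -> List[str]:
    """Return the apps whose reporting-instance count is below the threshold.

    An app that is missing from `counts` is treated as having zero
    reporting instances, so a completely silent app still violates.
    """
    return [c.app_name for c in conditions
            if c.violated(counts.get(c.app_name, 0))]
```

With a threshold of 10 for one app, a count of 9 would violate and a count of 10 would not. Treating a missing app as zero is a deliberate choice here, since "no data at all" is exactly the case this thread is worried about.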


#14

Thanks, Nate. When I referred to ping monitors in Synthetics, I meant the internal minions we have.

I think that method of alerting on app instances could work, and we can test it. Do you have an example of what that metric is called that you can share? (other than searching for it in the legacy dashboards metric picker)


#15

Would like to alert on server not responding (ping) and be able to turn that alert on or off per alert policy.


#16

Hi Nate, great poll to run. Personally, I would like to get alerted when the agent stops reporting so that I can act accordingly. We run a number of plugins, and being unable to alert when, for example, a MySQL instance goes down is not good.
I appreciate your argument that the agent not responding might be different from a site being down. However, in that scenario I would like the opportunity to tweak my agent’s settings to try to prevent such events going forward, as opposed to having nothing at all, as is currently the case.
If I have a Java agent installed on my WebSphere server and that server falls over, the least I would expect from a monitoring tool is an alert when it falls over. However, that’s not the case.
I feel we can get around the ping option using Synthetics, so I’m not too concerned on that front.


#17

Frankly, I’m fascinated with the frequent mention of this use case for “servers not responding to pings” being a metric that people want to see in an environment where every server has an agent on it. I’m also confused as to why New Relic seems to think they’re related projects to work on (while still trying to determine the relative prioritization of them?) a year into being told “agent not responding” is coming soon.

What kind of use case is there for “server not responding to pings”? Can someone who wants that please explain your environment? ICMP has been turned off on the production servers I’ve run at every job I’ve had for 20+ years (at least from arbitrary Internet hosts, as New Relic’s monitoring service would be). I can’t wrap my brain around an example where this check would be possible, let alone useful.

And even if it WERE useful, there are many states a server can get into where it still responds to pings but is effectively offline: the OOM killer took out something critical; userland is shut down but the kernel can’t restart for some reason; the server is hung on an I/O operation to a device that’s disappeared; etc. Most of these would trigger the “agent not responding” check but not the ping one.


#18

Don’t get hung up on the word “ping”; it is used in the general sense (not ICMP). This type of monitoring is an active check that can issue a request to a host or application instance using any number of protocols and assess the response as good or bad.

One common pattern I’ve run across where this would be ideal is hosts or app instances that expose a health check end-point, which runs some health check code and returns anything from a simple 200 OK to a serialized response that can be parsed and its contents evaluated.

These end-points are not exposed to the public internet and require on-prem presence to access.
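
To illustrate the pattern, here is a minimal sketch in Python using only the standard library. It is an assumption-laden example, not how Synthetics or a private minion works: the `/health` path, the JSON shape (`{"status": "ok"}`), and the names `HealthHandler` and `check` are all hypothetical.

```python
import json
import threading
import urllib.error
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer


class HealthHandler(BaseHTTPRequestHandler):
    """Hypothetical health check end-point returning a small JSON document."""

    def do_GET(self):
        if self.path != "/health":
            self.send_error(404)
            return
        body = json.dumps({"status": "ok", "db": "up"}).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep the example quiet


def check(url: str, timeout: float = 5.0) -> bool:
    """Active check: good only if we get a 200 whose JSON says status == "ok".

    Timeouts, refused connections, HTTP errors, and unparseable bodies
    all count as a bad response.
    """
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            if resp.status != 200:
                return False
            payload = json.load(resp)
            return isinstance(payload, dict) and payload.get("status") == "ok"
    except (urllib.error.URLError, OSError, ValueError):
        return False


if __name__ == "__main__":
    # Port 0 asks the OS for any free port; the server runs in a background thread.
    server = HTTPServer(("127.0.0.1", 0), HealthHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    port = server.server_address[1]
    print(check(f"http://127.0.0.1:{port}/health"))
    server.shutdown()
```

The checker has to run somewhere that can reach the end-point, which is why this kind of check needs an on-prem presence when the end-point is not public.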

This is a much different use case than triggering on an agent not reporting to New Relic, and it was what prompted the poll. Does that make sense?

The comments and questions on this thread so far have been super valuable. Thanks, everyone, for participating!


#19

We are going to start using an old SiteScope server to do up/down monitoring for our servers because of this fatal flaw in New Relic. I think the New Relic community has made it clear that 15 months is too long to be talking about this. Please give us a firm yes/no on whether this feature will be enabled and when.


#20

Hi @jeff.hailer

We’re going to check into this for you and will get back to you some time early next week.


#21

You have a system daemon running on the system, so there’s no way it will have no data to send (CPU, memory, etc.). When that data is missing, report it as downtime: it means something is wrong with the server, and we (I) want to know.


#22

That’s kind of how the legacy server not reporting feature works now. There are a lot of problems inherent in that mechanism, though. One we see pretty commonly is that transient network issues between our infrastructure and yours can lead to data not being sent up by the agent or not being received in a timely manner. This can set off a cascade of false positive alarms for customers, and when you’ve got thousands of servers, owned by dozens of teams, in a handful of countries, you’re starting to talk about getting a lot of people out of bed.

We want to try and ensure that we’re only notifying you when there really is a problem with your infrastructure so that you and your teams can trust that it’s worth getting out of bed in the middle of the night. We also know how hard noisy alarms can be for on-call engineers and we’re trying to not be the app that cried wolf. We’re working hard to find an intelligent, scalable solution that walks that line without having to burden an ops team with too much manual configuration.
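
To show the trade-off in code, here is a minimal sketch in Python of a "not reporting" check that tolerates a few missed reports before flagging a host. It is only an illustration, not the legacy feature or anything New Relic has built: `hosts_not_reporting`, the 60-second interval, and the five-interval tolerance are all assumed values.

```python
from typing import Dict, List


def hosts_not_reporting(last_seen: Dict[str, float],
                        now: float,
                        report_interval: float = 60.0,
                        missed_intervals: int = 5) -> List[str]:
    """Return hosts whose last report is older than the tolerated gap.

    `last_seen` maps a host name to the timestamp (in seconds) of its most
    recent report. A host is flagged only after more than
    `missed_intervals * report_interval` seconds of silence, so a short
    network blip that drops one or two reports does not page anyone.
    """
    threshold = report_interval * missed_intervals
    return sorted(host for host, ts in last_seen.items()
                  if now - ts > threshold)
```

A larger `missed_intervals` means fewer false alarms but slower detection of a server that is really down. Exposing that number as a per-policy setting is one way to let each team pick its own point on that line.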

While we work on that, the legacy server not reporting functionality is still available. You’re still able to notify PagerDuty, Slack, HipChat etc as well as using email.

https://docs.newrelic.com/docs/alerts/alert-policies/configuring-alerts/server-policies#downtime

If anyone doesn’t see the feature available for them please reach out to Support through a ticket or by replying to one of the legacy rollback threads/starting a new thread to request a rollback.

Please include a link to your account with your post so we can quickly get you taken care of.


#23

I’d find having it in the new alerts, with the caveat that it might yield false positives and with users able to turn it on or off, better than not having it at all, even if it’s just a temporary alert module, considering how long the intelligent one is taking to implement. I’d rather be able to choose to be false-alarmed than have servers down and be unaware of it.

I got the legacy alerts re-enabled by the support team, which sorts it out for the moment; still, I feel that other approaches might have been better than the one you’ve taken. If it matters… Looking forward to the unicorn downtime alert module!


#24

One we see pretty commonly is that transient network issues between our infrastructure and yours can lead to data not being sent up by the agent or not being received in a timely manner. This can set off a cascade of false positive alarms for customers, and when you’ve got thousands of servers, owned by dozens of teams, in a handful of countries, you’re starting to talk about getting a lot of people out of bed.

Your customers’ apps running in The Cloud rely on dozens of external services: Payment, social media, file hosting, weather APIs, geographic IP lookups, email blasting, and on and on and on.

If there is a network connectivity problem between their servers and yours, then there’s a problem connecting to other services. The lack of data from the agent might be the very first detectable sign that something’s wrong, and I’d rather get woken up for nothing once or twice a year than find out three hours later that our code for retrying credit card authorization doesn’t actually retry and we’ve lost revenue.

On Legacy Alerts, there have been (knock on wood) something like three big false alarm storms in the past two years. Pretty sure all of them were during North American business hours, too. If that’s the failure rate in this hypothetical new service, I’d be a pretty damn happy customer.

As it stands now, with no ability to trigger this kind of alert, and an increasingly-abandoned legacy product, I am an incredibly unhappy one.


#25

I agree with the poll and most arguments here. There should be an easy way to alert if plugin/agent has not been reporting metrics — why otherwise bother so much with metrics and alerts if we don’t know if there is any data to create an alert in the first place?


#26

@hrob Thanks for taking the time and offering your feedback! We love to hear it.


#27

I’m not seeing this in the voting area, but I vote for bringing back the downtime alerts you had in the old system. We have scheduled weekend reboots, and until we have a way to set a downtime window we will receive multiple tickets every Saturday, because there is no way to account for this with your current alerts.
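
For what it’s worth, the behavior being asked for could be sketched roughly like this in Python. This is purely illustrative and not an existing Alerts feature: `WeeklyWindow`, `should_notify`, and the Saturday 02:00 to 04:00 window are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, time
from typing import Iterable


@dataclass(frozen=True)
class WeeklyWindow:
    """Hypothetical recurring maintenance window within a single day."""
    weekday: int  # datetime.weekday() convention: Monday is 0, Saturday is 5
    start: time
    end: time     # assumed to be later than `start` on the same day

    def contains(self, moment: datetime) -> bool:
        return (moment.weekday() == self.weekday
                and self.start <= moment.time() < self.end)


def should_notify(violation_at: datetime,
                  windows: Iterable[WeeklyWindow]) -> bool:
    """Suppress the notification if the violation falls inside any window."""
    return not any(w.contains(violation_at) for w in windows)
```

With a window of Saturday 02:00 to 04:00, a "server not reporting" violation at 03:00 on a Saturday would be suppressed, while the same violation at 05:00 or on a Sunday would still notify.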


#28

Hey there @jdwhitmoyer - do you mean Downtime alerts, originating from our Availability monitor?

https://docs.newrelic.com/docs/alerts/alert-policies/downtime-alerts/downtime-alert-settings

If so, this is separate from the conversation above about our Alerts versus legacy alerting systems. The new and improved replacement for our old “pingers” is the Synthetics Ping monitor, which is still free (for the first 50 URLs for Lite subscriptions), includes a lot more flexibility, and serves the same purpose as our Availability monitor.

https://docs.newrelic.com/docs/synthetics/new-relic-synthetics/getting-started/new-relic-synthetics

Hope that helps =)


#29

Hi, we hit a scenario last week where we failed to act on app downtime: the server was not reporting to New Relic, so we got no alerts about a lack of disk space.

What’s the status on having some kind of alert/notification when the server stops reporting? This is crucial! Monitoring server metrics is useless if the agent stops reporting without us knowing about it!!


#30

Is there any news on this?

I would like to monitor that my applications are simply reporting back. At minimum, the number of expected instances should be configurable (defaulting to one for non-clustered systems).

Background:
We run a replicated environment, but a client once accidentally shut down all the instances in a cloud environment. We didn’t get any alerts. Our infrastructure runs in a private network (and we don’t use Synthetics private locations).

I’m currently looking into other products just to solve this case.


#31

Hi @room9 and @tsoikkeli,

This is still being looked into by our developers. I do not have a timeline for when we should expect a feature for this. More information on server not reporting can be found on this SNR post.

@tsoikkeli would it be possible to whitelist the public Synthetics minions IPs for this network? You can find the list of IPs here. Synthetics does sound like it would have helped notify you of these instances shutting down.


#33

Sums it up for me. Give us the option to alert on it, please. If we find it’s not useful, we can switch it off…