Detecting AWS AutoScaling Groups thrashing


I would like to configure an alert in NR to detect and notify when an AWS AutoScaling Group is thrashing, i.e., continuously creating new short-lived instances because the instances fail to start up properly for some reason (e.g., a failing UserData script) or the Load Balancer healthcheck fails to reach them (e.g., due to a misconfigured Security Group).

Up to now I’ve been able to visualise ASG activity (i.e., spawning of new instances) by querying the “InfrastructureEvent” event type:

SELECT count(*) FROM InfrastructureEvent FACET `provider.autoScalingGroupName` WHERE changeType = 'added' SINCE 3 hours ago TIMESERIES UNTIL now

But I haven’t been able to set up an alert from that query, as Alerts don’t seem to accept InfrastructureEvent as a data source.

Any idea how to achieve this?

Thanks a lot, any help greatly appreciated!

Hi @c.roche

Although we generally don’t recommend using InfrastructureEvent for Alerts (mainly because it reports irregularly), you could probably swing this sort of alert condition by using…

  1. An NRQL alert condition with a query similar to the one you showed (minus the SINCE, UNTIL, and TIMESERIES clauses)
  2. A sum of query results threshold, so that you can keep a running total of how many hits the query generates over the course of an hour or two

This will essentially count the number of InfrastructureEvent events that match your WHERE clause. A violation will only be raised if the sum of the results over an hour or two (depending on how you configure it) goes over a certain numeric limit.
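As a rough sketch, the condition query could be the original dashboard query with the time-range clauses stripped out, since the alert condition supplies its own evaluation window. This assumes the attribute names from the question (`changeType` and `provider.autoScalingGroupName`) are what your account actually reports; the threshold value and the length of the sum window are set in the condition's settings rather than in the query, and are left for you to choose.

```sql
SELECT count(*)
FROM InfrastructureEvent
WHERE changeType = 'added'
FACET `provider.autoScalingGroupName`
```

Keeping the FACET should let each AutoScaling Group be evaluated separately, so one thrashing group is not masked by quiet ones.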

One other thing: count(*) can never return a value of 0, so you will need to include a Loss of Signal setting on your condition so that, when things go back to normal and you’re not getting any hits on your query, any open violations can close. Take a look at this article, which goes into detail on why you’ll need to do that and how that feature works.

This is about as much help as I can provide from a support standpoint. If you have more questions around implementation of this alert condition, our awesome community members should be able to help out! :smiley:
