How to query for number of EC2 instances?

How can I write a query for the total number of EC2 instances in an AWS account?

I know that I can do something like this to count unique instance IDs:

FROM ComputeSample SELECT uniqueCount(ec2InstanceId) WHERE provider=‘Ec2Instance’ and awsAccountId=‘446233199186’ AND … (whatever conditions identify the instances I want to count)

However, that’s subtly different from how many instances there are, and in a way that can become very misleading. It’s roughly equivalent if you only have long-lived instances, but if you have something like an active cluster that’s autoscaling frequently, or a worker pool where short-lived instances are being launched frequently, then the numbers you get from a query like that are highly dependent on the aggregation window size (aka “bucket size”, for one meaning of that term).

For example, if I have one new instance launching every 5 minutes which lasts 2-3 minutes and is then terminated, then if I do TIMESERIES 5 minutes I’ll get a line at 1, with an occasional jump to 2 if two instances happen to overlap into the same bucket. But if I do TIMESERIES 1 hour, the line will be at 12-13! I never have more than 1 instance active at a time, but that chart would show 12 pretty solidly.

So what I’m looking for is a way to count how many instances are currently there at any given point in time, and then I can use last() on that to graph it across any duration and aggregation window / bucket size I want, and get consistent and sensible results. Anyone know how to do that?

4 Likes

I asked New Relic support and our company’s contacts at NR to look into this, they ran it by a few people, and I learned from that experience that people seem to have a hard time understanding what I’m saying in this post. Adding two screenshots here that will hopefully illustrate the problem and make it clearer.

Over the course of a few hours, I manually simulated autoscaling behavior in a sandbox account by launching new EC2 instances and delete them every few minutes, in an AWS account that had 5 long-running instances. So, at any give time over this period of a few hours, there were between 5-10 instances - the 5 long-running ones, and between 0-5 short-lived test instances that I would create and terminate within about 10 minutes each.

I set the polling interval on our NR AWS EC2 integration for this account to 5 minutes, and ran two identical queries like the sample query in my original post. One is “TIMESERIES 6 MINUTES”, the finest resolution that makes sense with a 5 minute polling interval; the other is “TIMESERIES 30 MINUTES”.


You can see in the 6 MINUTES chart that there are peaks around 17 or 18 instances, even though there was never any time when there were actually more than 10 instances in the account. This is because if I terminate some instances and launch new ones within a <6 minute period, the integration will sometimes poll when the older instances are still there and record their IDs, and poll again when the new instances are there and record their IDs. Even though the older and newer instances were not up at the same time, they have different IDs, so counting IDs across the 6 minute interval will show a sum of all of them.

You can see in the 30 MINUTES chart how the problem is sensitive to aggregation window size (aka “bucket” size), and gets worse the longer a period you look at. On this chart, there are peaks up to 24 instances, more than double the peak number of instances that were ever present in the account at the same time. The longer a span of time is counted in each interval, the more unique IDs of short-lived instances are counted. Over a 30 minute window, if I create an instance and delete it after a few minutes, then several minutes later create another and delete it after a few minutes, and then do the same a few minutes later, all three IDs will be counted and the chart will show 3 even though there were only ever 0 or 1 instances present at any given time.

Not only does this give very misleading charts if you’re trying to do something like track the size of an autoscaling group (to see how large it got), but it’s especially confusing and misleading because the numbers can change dramatically with different TIMESERIES resolution.

For example, if I do “SINCE 8 DAYS AGO TIMESERIES 2 HOURS” the chart peaks at 48 instances!!!

What I am looking for is a way to chart, not the cumulative sum of how many instances ever existed, but the actual number of instances that are present at a given time. I want to be able to do a TIMESERIES of that.

1 Like

Also, I should clarify that I’m not looking for an explanation of why that query does not do what I want. I understand why, and explained it above.

What I’m looking for is a query that does do what I want. The reason I explained why the query above (and any other queries of a similar form) does not do it, is because when I ask this question people usually respond with some variation of that query. I want to make it clear why that query, and other versions of it, is actually not the answer.

I believe there may be a way to do it using subqueries, and something like last() or max() on the uniqueCount, to ensure that it is only ever showing counts of instances IDs from single polling samples, never adding together collections of instances IDs from multiple polling samples. I just haven’t hit on the right way to do that.

1 Like