Limit size of Kubernetes metrics (140GB per day now)

I just installed New Relic on one of our test Kubernetes clusters and it’s awesome as always (used APM before). But when I checked the usage stats yesterday I noticed we had blown through our quota and then some in just one day (actually 565GB). :scream: According to the usage dashboard most of this was metrics (70%).

What would be a “normal” volume of data ingested from a Kubernetes cluster? I guess we have quite a few pods (330), but 140GB per day seems excessive? We are running AKS on Azure.

Any way for me to diagnose this further?

The K8s(Volume|Container|Pod|Replica|Deployment|Endpoint)Sample event types each have roughly 50k entries over a 30-minute window, so they seem to be the biggest contributors at least.
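For reference, I counted the events per sample type with a query along these lines (the exact event type names are just the ones I saw in my account; adjust the FROM list to whatever yours shows):

```sql
-- Raw event count per Kubernetes sample type, last 30 minutes
SELECT count(*)
FROM K8sVolumeSample, K8sContainerSample, K8sPodSample, K8sDeploymentSample, K8sEndpointSample
SINCE 30 minutes ago
FACET eventType()
```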

Is there a sensible default for limiting metrics from a Kubernetes cluster? Does it make sense to only accept metrics from the “kube-system” namespace? Will that still give the basic functionality that the Kubernetes dashboard and navigator provide?


Disabling the logging and Prometheus integrations helped. I got a bit sidetracked because “upgrading” the Helm chart with new values (prometheus.enabled, logging.enabled) didn’t seem to take effect properly (no data coming in at all).

So when I did a helm uninstall + install with the right settings, volumes became much more manageable. Still too early to say how much per day, but promising.
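For reference, the reinstall was roughly this (release name, namespace, and chart name here are assumptions based on the standard newrelic/nri-bundle chart; substitute your own):

```shell
# Remove the existing release, then reinstall with logging and
# the Prometheus integration disabled from the start
helm uninstall newrelic-bundle -n newrelic

helm install newrelic-bundle newrelic/nri-bundle \
  -n newrelic \
  --set global.licenseKey=<your-license-key> \
  --set global.cluster=<your-cluster-name> \
  --set prometheus.enabled=false \
  --set logging.enabled=false
```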

Now I just have to find out why we have such insane amounts of other metrics. :wink:

Glad to hear that you resolved your issue, @anderssv.ztl. Please let us know if you run into any other issues or have any further questions.


Things are looking better, but we are still ingesting about 35GB per day. The only things enabled through the Helm chart bundle are infrastructure and ksm.

I’m not super fluent in NRQL, but any tips on how to analyze which events are using most of our quota? Are there any signals that can be filtered away? Or is there a way to set the resolution of the signals coming in? What would a “normal” daily size be for a Kubernetes cluster?
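The closest thing I have found myself so far is bytecountestimate(), which the data ingest docs seem to use for rough sizing (so treat the numbers as estimates, not exact quota usage):

```sql
-- Rough ingest per event type over the last day, in GB
SELECT bytecountestimate()/10e8 AS 'GB'
FROM ProcessSample, NetworkSample, K8sContainerSample, K8sPodSample
FACET eventType()
SINCE 1 day ago
```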

For our case with Kubernetes I am pretty sure we could live well with a 25% resolution or something.
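If there is a built-in knob for this, the chart’s lowDataMode value looks like the closest thing (assuming a reasonably recent newrelic/nri-bundle chart; I haven’t verified exactly how much it cuts):

```yaml
# values.yaml for the newrelic/nri-bundle chart
# (assumption: a recent chart version that supports lowDataMode)
global:
  lowDataMode: true  # lowers sampling frequency and drops some attributes
```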

From browsing around a bit in Insights I can see that the ProcessSample and NetworkSample events are quite large. In ProcessSample, the following query shows a lot of events where the command is “pause”:

SELECT count(commandName) FROM ProcessSample WHERE commandName = 'pause' SINCE 60 MINUTES AGO COMPARE WITH 1 WEEK AGO TIMESERIES
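Since ProcessSample and NetworkSample come from the infrastructure agent itself, the agent’s sample-rate settings look like the right lever for thinning these out (the exact values below are just examples; verify names and defaults against the agent configuration docs before relying on this):

```yaml
# newrelic-infra.yml (infrastructure agent configuration)
metrics_process_sample_rate: 60   # seconds between ProcessSample events
metrics_network_sample_rate: 60   # seconds between NetworkSample events
# enable_process_metrics: false   # alternatively, turn ProcessSample off entirely
```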