How to drop ALL data ingest based on a monthly cap?

I am trying to create a Drop Data filter to allow me to define a monthly cap and drop everything possible when this condition is reached.

I am having trouble defining an NRQL query to satisfy this, though. I am trying to wrap my head around the syntax, but right now I am not even sure this can be expressed in a query.

Can someone confirm if this is possible and, if so, how to achieve this?

Thanks!

Hi @ricardo.trindade

Thanks for reaching out and congrats on your first post in the community.

It's not always possible to drop all ingest, to my understanding. However, there are many things you can drop, and several methods you can use to better manage your data ingest. Please see the links below:

  1. Control data ingest costs with ingest drill-down | New Relic Documentation

  2. Understand and manage data ingest | New Relic Documentation

  3. Drop data using NerdGraph | New Relic Documentation

Please do let me know if you find this helpful, or should you have any additional questions.

Thanks for the answer. The links are useful, but I already had them prior to opening this topic.

So my question then becomes, is there any way to mitigate exponential cost because of a crazy application log error? I am currently migrating away from Azure Log Analytics and bringing everything to New Relic. So far so good, but there is a certain condition in our apps and sometimes a log goes crazy (millions of lines per second) and there is a risk of several thousand dollars being spent if this happens and we do not detect it.

Alerts can be set up (just like in Azure), but if this happens outside of business hours, it can be a hell of a cost for us to simply absorb. In Azure, I set a daily cap which protects us from major loss, and the next day our monitoring stack is back. In New Relic, I see no option to mitigate this type of issue.

So, any ideas which could help us avoid potentially infinite costs on the platform? This can be a blocker for migrating to New Relic.

@ricardo.trindade

I think the best step here is to send us a permalink to the logs where you are seeing this occur. Only New Relics will be able to access this link in your New Relic Account.

It will give us more insight as to what exactly is happening to cause the log to go crazy.

Maybe I was not clear. The crazy logs are an issue of our own applications, which at the moment cannot be dealt with. Or think of an unknown condition which can, at any time, generate tons of undesired stdout output which will be forwarded to NR. Sending this to you will not generate any extra insight (the problem is not even happening right now, so I have nothing to send to you).

So, rephrasing the question: if a program generates tons of undesirable (and not known beforehand) logs, how can I prevent the NR logs cost from going to the stratosphere? I thought of using a drop rule which would fire after a daily ingest number (so exactly like a daily cap), but I could not write this as an acceptable NRQL query.

For the known log, all the lines have the same syntax, so I could drop them using a drop filter, but I need some of them to reach the platform so I can alert on the condition (which usually requires an app restart to fix). I also could not find anything in NRQL which would allow me to drop 99% of these lines and register only 1% of them (using a function like “random”, for example).

Actually, for the known issue I was able to sample based on the message field plus part of the event timestamp (milliseconds), so for this type of error I was able to sample 0.1% of the logs, which solves the issue nicely.
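The exact rule is not shown in this thread, so what follows only illustrates the arithmetic behind sampling on the millisecond part of a timestamp, written in Python rather than NRQL. The function name `keep_event` and the modulus of 1000 are assumptions chosen to match the 0.1% figure.

```python
def keep_event(timestamp_ms: int, keep_one_in: int = 1000) -> bool:
    """Keep an event only when its millisecond timestamp is a multiple of keep_one_in."""
    return timestamp_ms % keep_one_in == 0


# Simulate one event per millisecond over 1,000 seconds.
timestamps = range(1_000_000)
kept = sum(1 for ts in timestamps if keep_event(ts))
kept_fraction = kept / len(timestamps)
print(f"kept {kept} of {len(timestamps)} events ({kept_fraction:.1%})")
```

This only approximates the target rate when events are spread roughly evenly across milliseconds; a burst that lands on a few distinct timestamps would be sampled unevenly.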

Now I just need something to cover unexpected repeated messages which could explode my cost exponentially, using anything which resembles a daily ingest cap. Any suggestions are appreciated.

Hello @ricardo.trindade.

I believe you may be able to utilize muting rules in order to limit the amount of repeated messages or notifications you are receiving: Muting rules: Suppress notifications | New Relic Documentation.

Please let me know if this was not helpful in reaching your goal and I will gladly provide more assistance. I hope to hear from you soon, have a great day!

Muting rules do nothing about the ingest cost of the repeated messages. I need a solution to avoid costs, not to avoid alerts.

Have you tried Drop rules → Drop data using NerdGraph | New Relic Documentation

I already said drop rules do not work for my issue. If I don’t know which app or which condition could trigger millions of lines to reach my instance unexpectedly, what rule can I use to drop these lines (and just these lines)? See the problem?

I need some kind of daily cap. If there is a way to express a daily usage cap in NRQL, I could use a drop rule. Otherwise, this does not fit the issue.

Hi @ricardo.trindade

Thanks for reaching back out. I can see that our support engineering team is currently working on this. They will reach out here with their findings and guidance soon.

Should you have any additional updates or questions, please do reach out!

Hi @ricardo.trindade - There currently isn’t a way to trigger a daily cap. You might be able to set up a NRQL alert condition with a Webhook notification to an endpoint that triggers a script which restarts the app. Another idea might be to include runbook instructions in the alert notification explaining how to resolve the runaway logging.

As for the NRQL query, you won’t be able to use the query from the Manage Your Data page, since those metrics are reported once a day. If you are looking for real-time metrics from your logs, you’ll need to come up with a NRQL condition which measures rate of change (RoC). This isn’t something NRQL alert conditions do “out of the box”. However, if you can find a way to calculate RoC yourself, you might be able to send it up as part of the data sent to New Relic. Alternatively, you can use a cron job to calculate RoC and send that number up as a custom event. This article goes into some detail on how to do that.
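A minimal sketch of the cron-job idea, assuming the job already has the log-line counts for two consecutive windows. The event type `LogIngestRoC` and the attribute `linesPerMinuteDelta` are hypothetical names; only the `eventType` key is something a New Relic custom event actually requires. Sending the payload (for example, to the Event API) is left out.

```python
import json
import time


def rate_of_change(previous_count: int, current_count: int, interval_minutes: float) -> float:
    """Change in log-line count between two consecutive windows, per minute."""
    if interval_minutes <= 0:
        raise ValueError("interval_minutes must be positive")
    return (current_count - previous_count) / interval_minutes


def build_custom_event(roc: float, timestamp: int) -> dict:
    # "LogIngestRoC" and "linesPerMinuteDelta" are hypothetical names.
    return {"eventType": "LogIngestRoC", "linesPerMinuteDelta": roc, "timestamp": timestamp}


# Example: 12,000 lines in the previous minute, 72,000 in the current one.
roc = rate_of_change(previous_count=12_000, current_count=72_000, interval_minutes=1)
event = build_custom_event(roc, timestamp=int(time.time()))
print(json.dumps([event]))  # custom events are sent as a JSON array
```

An alert condition could then watch the custom attribute directly instead of trying to derive the rate of change inside NRQL.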

Another idea: if RoC over a single minute would indicate a problem, you can use the stddev() function in NRQL. This returns a higher number the more the values differ from the mean. You could then couple this with a Sliding Windows Aggregation Alert so that multiple positive values would be added to a rolling sum. If the rolling sum got too high, it would indicate a spiking RoC and the alert condition would then open a violation.
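One possible reading of this suggestion, sketched offline in Python rather than as an actual alert condition: compute a standard deviation per minute, then sum the last few values and compare against a threshold. The window size, the threshold, and the use of population standard deviation (`pstdev`) are all assumptions; how NRQL's stddev() and sliding windows behave on real data should be checked in the product.

```python
from collections import deque
from statistics import pstdev


def rolling_stddev_alerts(per_minute_samples, window=3, threshold=300.0):
    """Return (minute_index, rolling_sum, breached) for each minute.

    per_minute_samples: list of lists; each inner list holds the values
    observed during one minute (for example, log lines per second).
    """
    recent = deque(maxlen=window)  # last `window` per-minute stddev values
    results = []
    for minute, samples in enumerate(per_minute_samples):
        recent.append(pstdev(samples))
        rolling_sum = sum(recent)
        results.append((minute, rolling_sum, rolling_sum > threshold))
    return results


steady = [10, 10, 10, 10]
spiky = [10, 500, 10, 500]
results = rolling_stddev_alerts([steady, steady, spiky, spiky, steady])
for minute, rolling_sum, breached in results:
    print(minute, round(rolling_sum, 1), breached)
```

A single noisy minute stays below the threshold here, while two consecutive noisy minutes push the rolling sum over it, which is the behavior the rolling-sum idea is after.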

I hope this helps.

Wow, now that is useful insight! I will evaluate all your suggestions and see if I can make some of them work.

I will report back in case of success or not, so other customers can benefit from this analysis.
Thank you so much for all the insights!

Hey @ricardo.trindade,

We really appreciate that. Please also reach out if you have further questions.

Take care!

Finally I was able to make something work. Here was my solution…

Using the information from dkoyano’s topic, I was able to create a synthetic API monitor which:

  • Uses the NerdGraph API to run an NRQL query to check the last 24 hours of ingest;
  • Uses the Metrics API to ingest this value into a new metric, allowing me to alert in case it exceeds the expected value;
  • Uses the NerdGraph API to check whether there is an NRQL drop rule active or not;
  • In case we exceeded the daily ingest cap and there is no drop rule active, makes one last NerdGraph call to create one, dropping all data from the Log table;
  • In case we do not exceed the daily ingest cap, there is a drop rule active, and we are below the lower bound to reactivate the ingest, makes one last NerdGraph call to delete the current drop rule.
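The monitor script itself is not posted here, so the following is only a sketch of the decision logic the list describes, written in Python for illustration (a real Synthetics scripted API monitor would be JavaScript). The function names, the metric name, and the 50 GB / 40 GB bounds are hypothetical; the NerdGraph and Metric API calls are left out, and only the gauge payload layout is shown.

```python
def build_gauge_payload(name: str, value: float, timestamp_ms: int) -> list:
    """Body for a single gauge data point, in the Metric API's JSON layout."""
    return [{"metrics": [{"name": name, "type": "gauge",
                          "value": value, "timestamp": timestamp_ms}]}]


def decide_action(ingest_gb: float, cap_gb: float,
                  resume_below_gb: float, drop_rule_active: bool) -> str:
    """Decide what the monitor should do on this run."""
    if resume_below_gb > cap_gb:
        raise ValueError("resume_below_gb must not exceed cap_gb")
    # Cap reached and nothing is being dropped yet: start dropping.
    if ingest_gb >= cap_gb and not drop_rule_active:
        return "create_drop_rule"
    # Back under the lower bound while still dropping: resume ingest.
    if ingest_gb < resume_below_gb and drop_rule_active:
        return "delete_drop_rule"
    return "no_change"


CAP_GB = 50.0           # hypothetical daily cap
RESUME_BELOW_GB = 40.0  # hypothetical lower bound for resuming ingest
runs = [(10.0, False), (55.0, False), (55.0, True), (45.0, True), (30.0, True)]
actions = [decide_action(gb, CAP_GB, RESUME_BELOW_GB, active) for gb, active in runs]
for (gb, active), action in zip(runs, actions):
    print(gb, active, action)

payload = build_gauge_payload("custom.ingest.last24h.gb", 55.0, 1_700_000_000_000)
```

Keeping the resume bound below the cap avoids creating and deleting the drop rule on alternate runs when ingest hovers near the limit.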

I am still validating the last corner cases, but it seems to cover exactly what I need, giving me some peace of mind in case an unexpected software bug or a malicious agent tries to run up a gigantic NR bill by generating endless logs.

I would still like to suggest that some kind of daily cap be added to the product, since I am using my limited synthetic runs to achieve the same result, but at least I am happy I could achieve the desired effect. I am replacing Azure Log Analytics and App Insights with New Relic, and although NR is leaps and bounds a better product, this is the one feature it is missing.

Thanks for all the help!
