Time window of error count in new "Error Analytics"

Please paste the permalink to the page in question below:

https://rpm.newrelic.com/accounts/283012/applications/2635821/filterable_errors#/table?top_facet=transactionUiName&primary_facet=error.class&barchart=barchart&filters=[{"key"%3A"error.class"%2C"value"%3A"de.komoot.hcn.server.service.user.UserAlreadyExistsException"%2C"like"%3Afalse}]&_k=3pefhm

Please share your question/describe your issue below. Include any screenshots that may help us understand your question:

What is the time aggregation window of the error count? It obviously changes with the total time window, but without knowing whether it is per minute, per second, or something else, the error count is pretty useless to me.

Jan

Hi Jan - First, thanks for the permalink!

As you point out, the whole page is scoped to the time window you selected. So, the bar chart on the left shows total counts for that window of those errors. The top 5 chart doesn’t do any aggregation by minute or second - the peaks and valleys in the chart show the number of errors at that exact point in time. So too, with the traces in the chart below.

The only time you won’t see every error in the counts for that time window is if your application is generating more than 100 errors per minute. If that were the case, you would see a message telling you so (and you would probably have a pretty significant problem in your app: that’s a lot of errors, and we’ve found that less than 1% of applications actually hit that mark).

Does that answer your question? If not, let me know and I’ll respond.

Hi Sean,

Thanks for the insights. Unfortunately, I have trouble working out the math of how this works. You say there is no aggregation (sum of errors per minute/second/…) in the right graph, right?

So if my application has two errors, one at time x and one at time x + 1 microsecond, I would see a chart showing a flat line with a single spike with the value 1, right? No aggregation (sum) of the two errors, as you said?

@scarpenter

I also noticed that the y-axis values change when I zoom in or out of the graph. That doesn’t really fit your explanation (“The top 5 chart doesn’t do any aggregation by minute or second - the peaks and valleys in the chart show the number of errors at that exact point in time”), but rather suggests that there is some kind of aggregation over a timeframe (and that this timeframe changes with the size of the total time window).

Jan

Ah, I see what you mean now from your examples. In theory, there are two points where aggregation could happen: 1. collection, and 2. display. As I pointed out, we do not aggregate the error events at collection for the Error analytics page. We collect up to the cap of 100/min; once we hit it, we don’t aggregate, we simply stop collecting new events until the next minute. Each event has a timestamp, so we know when it was collected and can order the events correctly.
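As a rough illustration of the collection behaviour described above, here is a minimal Python sketch. It is not New Relic's implementation: the function name `collect` and the choice to align the cap to clock minutes are assumptions made for the example. Only the cap of 100 events per minute comes from this thread.

```python
from collections import defaultdict

CAP_PER_MINUTE = 100  # cap mentioned in this thread


def collect(timestamps, cap=CAP_PER_MINUTE):
    """Keep at most `cap` events per minute and drop the rest.

    `timestamps` are seconds since some epoch. Aligning the cap to
    clock minutes is an assumption made for this sketch.
    """
    kept = []
    per_minute = defaultdict(int)
    for ts in sorted(timestamps):
        minute = int(ts // 60)
        if per_minute[minute] < cap:
            per_minute[minute] += 1
            kept.append(ts)  # stored individually, timestamp intact
    return kept


# 150 errors in the first minute, 5 in the next one
events = [i * 0.1 for i in range(150)] + [60.0 + i for i in range(5)]
print(len(collect(events)))  # 100 kept from minute 0, plus 5 from minute 1
```

The point of the sketch is that events over the cap are dropped rather than summed, so every stored event still carries its own timestamp.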

But you are asking about the display logic. The x-axis of the graph will not display a window of time smaller than 3 minutes. This is an artifact of the UI control we use to render the visual, not of the underlying data we store. The x-axis does not show data at the microsecond level of detail, so in your example of two errors occurring in subsequent microseconds, the graph would show both errors at the same second (a count of two).
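To make the display side concrete, here is a small Python sketch of counting events per display bucket. The helper `counts_per_bucket` is hypothetical and only illustrates the idea; the one-second bucket matches the example in this reply.

```python
from collections import Counter


def counts_per_bucket(timestamps, bucket_seconds=1.0):
    """Count events per display bucket.

    Keys are bucket start times in seconds; values are event counts.
    """
    return Counter(
        int(ts // bucket_seconds) * bucket_seconds for ts in timestamps
    )


# Two errors one microsecond apart land in the same one-second bucket.
x = 1000.0
buckets = counts_per_bucket([x, x + 1e-6])
print(buckets[1000.0])
```

Both events are stored separately, but at one-second resolution they are drawn as a single point with a count of two.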

If you want the exact timestamps, you have a couple of ways to see them. You could click one of the corresponding error traces in the table below the chart. Or you could click the “View in Insights” chart seatbelt item that appears when you mouse over the chart. That will open Insights in a new window and execute the exact query used to generate the graph you are looking at. You can use that to explore the data at the level of precision you are looking for.

@scarpenter thanks for the detailed explanation!

So what I take away from your response is that:

  • you collect up to 100 errors/min
  • aggregate to a time window >= 3 minutes.

Do I understand you correctly that I cannot find out how large the aggregation window is? So if the chart shows “50”, the window can be anything larger than 3 minutes, and I can’t tell whether I have 50 errors in 3 minutes or 50 errors in 3 hours?

For this, I would be better off using the “Errors” tab, which also shows the error rate (which is super important for me).

Hi @jan - thanks for your curiosity about this! I followed up with @scarpenter and our engineers to make sure we get you the most helpful information. You’re absolutely right, there is a challenge right now with the tooltips we display on the error count graph in Error analytics, in that the time “bucket” width of the highlighted point is not displayed. This is something we are investigating currently and I’ll be happy to pass your feedback into our system to make sure it’s considered in our prioritization.

As far as the time bucket width, this graph is actually capable of displaying quite small intervals of time due to the way we now collect these errors. For example, for a 30 minute window the width of the bucket will be 15 seconds. So when you are at very small windows, the bucket could be as little as 1 second, while at larger windows (like 7 days) the bucket could be close to 90 minutes. With this amount of variation your point about the tooltip is very important, so we really appreciate that feedback!
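The two examples above (15 seconds for a 30 minute window, close to 90 minutes for 7 days) are both consistent with roughly 120 buckets per chart. The Python sketch below uses that figure to estimate the bucket width. Note that the value 120 and the function name are inferred for illustration only and are not a confirmed detail of the product; the tooltip remains the authoritative source.

```python
def estimated_bucket_seconds(window_seconds, target_buckets=120, min_bucket=1.0):
    """Estimate the bucket width for a given chart window.

    `target_buckets=120` is an assumption inferred from the examples
    in this thread; `min_bucket` reflects the 1 second lower bound.
    """
    return max(min_bucket, window_seconds / target_buckets)


print(estimated_bucket_seconds(30 * 60))                # 30 minute window, in seconds
print(estimated_bucket_seconds(7 * 24 * 3600) / 60.0)   # 7 day window, in minutes
```

Under this assumption a 7 day window gives 84 minutes per bucket, which matches "close to 90 minutes".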

The error rate graph is shown under the main graph on this page as well, and does have the more standard time bucketing behavior. I hope this is helpful in comparing the rate to the count.

Hey @jan - thanks again for your feedback on this! The new iteration of Error Analytics that we released today shows the bucket width in the tooltip - check it out!

Hey alexis, thanks for the update! It adds the necessary information to
relate the error count to the time frame, great! I still have to do the
math myself to normalize it to errors per minute, but at least I have the
necessary information now.

Thanks,

Jan
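For anyone doing the same normalization, the arithmetic is a one-liner. The sketch below is plain Python with a hypothetical helper name; the inputs are the count and the bucket width read from the tooltip.

```python
def errors_per_minute(count, bucket_seconds):
    """Normalize a bucket's error count to errors per minute."""
    if bucket_seconds <= 0:
        raise ValueError("bucket_seconds must be positive")
    return count * 60.0 / bucket_seconds


# The same count of 50 means very different rates at different bucket widths.
print(errors_per_minute(50, 15))       # 15 second bucket
print(errors_per_minute(50, 84 * 60))  # 84 minute bucket
```

A count of 50 in a 15 second bucket is 200 errors per minute, while the same count in an 84 minute bucket is about 0.6 errors per minute.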
