Signal Lost Alerts with AWS MSK (Managed Kafka) integration

Hello, we are using the AWS Managed Kafka (MSK) integration in our infrastructure monitoring (see link).

We use this to monitor the CPU usage of our brokers among other things.
For that we use an NRQL alert. We also set a “signal lost” threshold to be notified when the broker or the integration is not working.

However, we are receiving signal lost alerts fairly often without a real outage on AWS's or (as far as we can tell) New Relic's side.

Is there any way to debug those “signal lost” alerts, or is there some guidance for the New Relic Managed Kafka on AWS integration regarding this problem?

Hi @sergej.herbert - Take a look at this article, which discusses when to use Loss of Signal. If you include a link to the alert condition, we can do a breakdown of what the alert condition is or is not doing and possibly find a way to get it to behave better. In case you have concerns, note that the link won’t be usable by anyone who is not authorized on your account.

I am sorry for replying so late (I need to figure out how to get e-mails about answers…).

We are using an NRQL alert like this:
SELECT average(provider.cpuIdle.Average) FROM AwsMskBrokerSample WHERE provider.clusterName = 'production-cluster' FACET provider.brokerId

And here is the link in case you need more details: https://one.nr/0M8jqanAGjl

Hey @sergej.herbert,

It looks like this condition is using data from AWS, which has a longer polling delay for data than the default 3 minutes when New Relic collects and processes data. Without adjusting the evaluation offset of this alert condition, latent data will start being dropped because it doesn’t match the current timeslice being evaluated, which is seen as a “Loss of Signal” to our evaluators.
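
To illustrate the mechanism described above, here is a toy model in Python. It is only a sketch under simplifying assumptions, not New Relic's actual evaluator: each one-minute window is evaluated exactly once, at `window minute + offset`, and any point that arrives after that moment is dropped. The names `Point`, `evaluated_windows`, and `empty_windows`, as well as the 8-minute arrival delay, are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class Point:
    event_minute: int    # minute the metric value describes
    arrival_minute: int  # minute the point reaches the evaluator


def evaluated_windows(points, offset_minutes, last_minute):
    """Return {window_minute: number of points visible at evaluation time}.

    Assumption: window m is evaluated once, at minute m + offset_minutes.
    Points for window m that arrive later than that are dropped.
    """
    seen = {}
    for minute in range(last_minute + 1):
        evaluated_at = minute + offset_minutes
        seen[minute] = sum(
            1
            for p in points
            if p.event_minute == minute and p.arrival_minute <= evaluated_at
        )
    return seen


def empty_windows(points, offset_minutes, last_minute):
    """Windows that looked empty to the evaluator (a potential signal loss)."""
    counts = evaluated_windows(points, offset_minutes, last_minute)
    return [minute for minute, count in counts.items() if count == 0]


# Hypothetical stream: one point per minute, each arriving 8 minutes late.
points = [Point(event_minute=m, arrival_minute=m + 8) for m in range(10)]

print(empty_windows(points, offset_minutes=3, last_minute=9))
print(empty_windows(points, offset_minutes=8, last_minute=9))
```

In this model, a 3-minute offset leaves every window empty at evaluation time even though no data was actually missing, while an 8-minute offset leaves none empty.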

It looks like that integration has a polling delay minimum of 5 minutes, seen here: Amazon Managed Kafka (MSK) integration | New Relic Documentation

Using New Relic’s internal processing delay of up to 3 minutes, plus the 5 minute polling delay from AWS, I suggest starting your Alert Condition’s Evaluation Offset at 8 minutes and seeing whether you notice a reduction in Signal Lost alerts (since more data will be available for evaluation). Here’s a document that explains Evaluation Offset: Create NRQL alert conditions | New Relic Documentation
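
The arithmetic behind that starting value can be written out as a small Python sketch. The function name `suggested_evaluation_offset` is hypothetical, and the two inputs are simply the figures quoted in this thread (up to 3 minutes of processing delay, a 5 minute minimum polling delay for the MSK integration); if your polling interval is configured differently, substitute your own value.

```python
def suggested_evaluation_offset(processing_delay_minutes, polling_delay_minutes):
    """Starting point for the evaluation offset: the two delays added together.

    This is a rule of thumb taken from the reply above, not an official formula.
    """
    return processing_delay_minutes + polling_delay_minutes


# Figures quoted in this thread (assumptions for any other setup).
NEW_RELIC_PROCESSING_DELAY_MIN = 3  # "up to 3 minutes"
MSK_POLLING_DELAY_MIN = 5           # integration's minimum polling delay

offset = suggested_evaluation_offset(
    NEW_RELIC_PROCESSING_DELAY_MIN, MSK_POLLING_DELAY_MIN
)
print(f"Suggested starting evaluation offset: {offset} minutes")
```

Treat the result as a starting value to tune from: if Signal Lost alerts persist, the data may be arriving later than these figures suggest.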

Let us know if you have any questions about this.

Cheers!

Thanks. We have not had those signal lost alerts for a while now, and I will try adjusting this value when they pop up again!

Let us know how it works out @sergej.herbert!