Minions issue: large number of job failures, monitors getting mixed responses

Hi,
We have recently upgraded our minions to version 3.0.28, and we are on Docker version 17.12.1-ce. We are getting mixed responses from the minions, and we can also see a large number of jobs queued for each minion. Is there a specific number of monitors that should be assigned to each minion? I am asking because I want to know whether there could be a load balancing issue on the minions. Currently there are around 550 monitors running on 2 private minions.

If you’re seeing the queue start to build, it’s generally advised to scale up or scale out, with scaling out being the preference in almost all cases. It sounds like your minion and Docker versions are pretty much up to date, so I don’t imagine those are causing any issues. However, depending on the monitor types and configuration, 500 monitors can be a lot to run on two minions. I’ve asked someone who knows more of the details to comment as soon as they have a chance, but I can tell you that a system that just barely meets the requirements can run about two “heavy jobs” (i.e. simple browser, scripted browser, or API test monitors) at a time before running out of resources.


Hi @deyp02,

Thanks for posting your question here in Explorers Hub! I’m on the CPM support team at New Relic and happy to take a closer look. @babbott is correct that scaling out is a best practice. In other words, adding more hosts all pointing to the same private location (PL). This will not only load balance the PL, but also provide some level of redundancy and failover protection should one host go down.

In terms of assessing job demand and whether the host has enough resources to meet that demand, there are several factors including:

  • How many non-ping monitors are there? We mostly care about non-ping monitors since they take the most resources to run on the minion.
  • What is the average monitor frequency across all non-ping monitors?
  • How much memory does the host have available to Docker?
  • How many cpu cores are accessible to Docker? A quick docker info can provide you with those details.
  • Also important is knowing the capacity of your disk system in terms of maximum sustained operations per second (IOPS). Disk write is heaviest for scripted browser monitors since they save a large Chrome result after each job.

Here is a formula you can use to calculate job demand per heavy worker thread (each heavy worker handles a non-ping job in a runner container that gets created and destroyed for each job):

number of non-ping monitors / (t * number of heavy workers * replicas or hosts), where t = avg monitor frequency

Note that api test monitors use fewer resources than simple and scripted browser monitors.

For example, let’s say you have:

non-ping monitors     : 100
avg monitor frequency : 10 minutes
heavy workers         : 4 (typically equal to num_cpu)
hosts                 : 1

This gives you an estimate of how many non-ping jobs each heavy worker needs to process each minute: in this case, 2.5 jobs per heavy worker per minute.
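
If it helps, here is that same calculation as a small Python sketch (the values simply mirror the example above; swap in your own numbers):

# Rough job demand per heavy worker, per minute
non_ping_monitors = 100
avg_frequency_min = 10   # average monitor frequency, in minutes
heavy_workers     = 4    # typically equal to the number of cpu cores
hosts             = 1

jobs_per_worker_per_min = non_ping_monitors / (avg_frequency_min * heavy_workers * hosts)
print(jobs_per_worker_per_min)   # 2.5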

Note: If the average monitor frequency were to shorten to 1-minute, this would impact the resources required to process that additional job demand by an order of magnitude! 10x the memory, 10x the cpu cores, and 10x the IOPS.

This is just a rough estimate of job demand to gauge what size each host ought to be. Assume that each job takes 30 seconds to process (conservatively). That means each heavy worker could process about 2 heavy (non-ping) jobs per minute. Since the estimated demand of 2.5 jobs per heavy worker per minute exceeds that capacity, in this conservative case you’d need more cpu cores, which translates to more heavy workers, which translates to more memory at 2.5 GiB (gibibytes) per cpu core, the recommended level @babbott referenced above in relation to the requirements.
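
As a rough sizing sketch in Python (assuming the conservative 30 seconds per job and the 2.5 GiB per cpu core guideline above, with the same hypothetical numbers as the example):

import math

non_ping_monitors = 100
avg_frequency_min = 10
hosts             = 1

total_jobs_per_min  = non_ping_monitors / avg_frequency_min   # 10 heavy jobs per minute across the PL
jobs_per_worker_cap = 60 / 30                                  # about 2 heavy jobs per worker per minute at 30s each
workers_needed      = math.ceil(total_jobs_per_min / (jobs_per_worker_cap * hosts))
memory_needed_gib   = workers_needed * 2.5                     # 2.5 GiB per cpu core / heavy worker

print(workers_needed, memory_needed_gib)   # 5 cpu cores (heavy workers) and 12.5 GiB per host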

Here’s a query to see how many monitors are set to a PL by type of monitor:

SELECT uniqueCount(monitorId) FROM SyntheticCheck WHERE location = 'YOUR_PL' FACET type SINCE 2 days ago

This query will calculate the average monitor frequency across all non-ping monitors for a specific account:

SELECT 60*24*2*uniqueCount(monitorId)/count(monitorId) as 'Avg Frequency (minutes)' FROM SyntheticCheck WHERE location = 'YOUR_PL' and type != 'SIMPLE' SINCE 2 days ago

Note that if the timeframe is changed from 2 days, make sure to also update the 60*24*2 to reflect the number of minutes in that timeframe.
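
Here is that arithmetic as a quick Python sketch (the check counts are made-up values for illustration):

days_in_timeframe    = 2
minutes_in_timeframe = 60 * 24 * days_in_timeframe   # 2880 for 2 days, 10080 for 7 days

unique_non_ping_monitors = 100     # uniqueCount(monitorId) from the query
total_non_ping_checks    = 28800   # count(monitorId) from the query

avg_frequency_min = minutes_in_timeframe * unique_non_ping_monitors / total_non_ping_checks
print(avg_frequency_min)   # 10.0 minutes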

My colleague likes to think of it in terms of the rate at which jobs are getting added to the PL queue (publish rate) versus the rate at which jobs are being pulled off the queue by the minion (consumption rate).

If the publish rate exceeds the consumption rate, the queue will grow. The rate at which the queue grows can give you an idea about how much “extra” capacity the PL needs from hosts with minions running on them.
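
As a minimal Python sketch of that relationship (the rates here are purely hypothetical):

publish_rate_per_min     = 12.0   # jobs added to the PL queue each minute
consumption_rate_per_min = 10.0   # jobs the minions pull off the queue each minute

queue_growth_per_min = publish_rate_per_min - consumption_rate_per_min
print(queue_growth_per_min)   # 2.0 jobs/minute of extra demand beyond current capacity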

To see actual values for how much the publish rate exceeds the current consumption rate, i.e. how much demand is beyond the minion’s ability to process jobs off the queue:

SELECT derivative(checksPending, 1 minute) as 'queue growth rate (per minute)' FROM SyntheticsPrivateLocationStatus where name = 'YOUR_PL' SINCE 2 days ago TIMESERIES

To see actual values for consumption rate:

SELECT rate(uniqueCount(jobId), 1 minute) FROM SyntheticRequest WHERE type != 'SIMPLE' FACET location SINCE 2 weeks ago TIMESERIES

Also important are queries that can highlight performance issues and errors:

Assess memory usage

SELECT latest(minionPhysicalMemoryUsedPercentage) from SyntheticsPrivateMinion FACET minionLocation SINCE 2 weeks ago TIMESERIES Max

Assess Internal Engine Errors

SELECT 100*latest(minionJobsInternalEngineError)/latest(minionJobsReceived) as 'IEE rate (%)' from SyntheticsPrivateMinion FACET minionLocation SINCE 2 weeks ago TIMESERIES Max

Assess job failure rate

SELECT 100*latest(minionJobsFailed)/latest(minionJobsReceived) as 'job failure rate (%)' from SyntheticsPrivateMinion FACET minionLocation since 2 weeks ago TIMESERIES MAX

The ratio of failed jobs to total jobs gives a measure of how significant failed jobs are to the minion over time. This query does not identify why the failures happened: they could be scripting errors, actual endpoint failures, or CPM issues (leading to alerts and false positives).

Monitors with failing checks contribute to retries on the queue, which require additional resources to process. So it’s best to try to fix monitors with script errors to reduce the demand on the minion.

Looking forward to hearing back on your host’s resources and how many jobs each heavy worker needs to process per minute.


We have resolved the issue. @kmullaney, the queries you provided helped a lot in understanding the problem. It turned out we needed more minions; the disk space and memory of the 2 minions we already had were fine. There are around 600 tests running from those two minions, mostly at a 5-minute interval, which was a huge load for two minions to handle. We have added 6 more minions to balance the load. After adding them and load balancing, currently no jobs are failing or getting stuck in the queue.
