Relic Solution: Scaling and Rightsizing for the CPM

The containerized private minion (CPM) requires some careful capacity considerations. For one, scaling out is a best practice: add more hosts that all point to the same private location (PL). This not only load balances the PL, but also provides redundancy and failover protection should one host go down.

Rightsizing the host or Kubernetes cluster means the CPM will have the resources it needs to run reliably over the long term. To start, we need to assess job demand and whether the host has enough resources to meet that demand. Several factors are at play, including:

  • How many non-ping monitors are there? We mostly care about non-ping monitors since they take the most resources to run on the minion.
  • What is the average monitor frequency across all non-ping monitors?
  • How much memory does the host have available to Docker?
  • How many CPU cores are accessible to Docker? Running docker info will show both the memory and the CPU count.
  • What is the capacity of your disk system in terms of maximum sustained I/O operations per second (IOPS)? Disk writes are heaviest for scripted browser monitors since they save a large Chrome result after each job.

Here is a formula you can use to calculate jobs per heavy worker thread per minute. Each heavy worker handles a non-ping job in a “runner container” that gets created and destroyed for each job:

number of non-ping monitors / (t * number of heavy workers * replicas or hosts), where t = avg monitor frequency

Note that API test monitors use fewer resources than simple and scripted browser monitors.

For example, let’s say you have:

non-ping monitors     : 100
avg monitor frequency : 10 minutes
heavy workers         : 4 (typically equal to num_cpu)
hosts                 : 1

This gives you an estimate of how many non-ping jobs each heavy worker needs to process each minute. In this case, 100 / (10 * 4 * 1) = 2.5 jobs per heavy worker per minute.
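The formula can also be written as a small Python function, which makes it easy to try different inputs. This is only a sketch of the arithmetic described above; the function and parameter names are made up for illustration and are not part of the CPM.

```python
def jobs_per_worker_per_minute(non_ping_monitors, avg_frequency_minutes,
                               heavy_workers, hosts):
    """Estimate the non-ping jobs each heavy worker must process per minute."""
    if avg_frequency_minutes <= 0 or heavy_workers <= 0 or hosts <= 0:
        raise ValueError("frequency, heavy workers, and hosts must be positive")
    return non_ping_monitors / (avg_frequency_minutes * heavy_workers * hosts)


# The example from the text: 100 monitors, 10-minute frequency, 4 workers, 1 host.
print(jobs_per_worker_per_minute(100, 10, 4, 1))  # 2.5

# Adding a second host with the same number of heavy workers halves the load.
print(jobs_per_worker_per_minute(100, 10, 4, 2))  # 1.25
```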

Note: If the average monitor frequency were shortened to 1 minute, job demand would increase by an order of magnitude, and so would the resources required to process it: 10x the memory, 10x the CPU cores, and 10x the IOPS.

This is just a rough estimate of job demand to gauge what size each host ought to be. Assume, conservatively, that each job takes 30 seconds to process. That means each heavy worker could process 2 heavy (non-ping) jobs per minute, which falls short of the 2.5 jobs per minute estimated above. In this conservative case, we’d need to add CPU cores, which translates to more heavy workers, which in turn translates to more memory at 2.5 GiB (gibibytes) per CPU core, our recommended level as mentioned in our requirements.
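As a rough sketch of that sizing step, the Python below turns job demand into a heavy worker count and a memory figure. It assumes one heavy worker per CPU core (the text says this is typical), the conservative 30 seconds per job, and 2.5 GiB per core, with no extra headroom. The function names are hypothetical.

```python
import math


def required_heavy_workers(non_ping_monitors, avg_frequency_minutes,
                           job_seconds=30.0):
    """Heavy workers needed across the whole PL to keep up with job demand."""
    jobs_per_minute = non_ping_monitors / avg_frequency_minutes
    jobs_per_worker_per_minute = 60.0 / job_seconds
    return math.ceil(jobs_per_minute / jobs_per_worker_per_minute)


def required_memory_gib(cpu_cores, gib_per_core=2.5):
    """Memory at the recommended 2.5 GiB per CPU core."""
    return cpu_cores * gib_per_core


# 100 monitors every 10 minutes is 10 jobs per minute; at 2 jobs per worker
# per minute, that needs 5 heavy workers instead of the 4 in the example.
workers = required_heavy_workers(100, 10)
print(workers, required_memory_gib(workers))  # 5 12.5
```

The result is a total for the private location, so it can be met by one larger host or split across several hosts.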

Here’s a query to see how many monitors are set to a PL by type of monitor:

SELECT uniqueCount(monitorId) FROM SyntheticCheck WHERE location = 'YOUR_PL' FACET type SINCE 2 days ago

This query will calculate the average monitor frequency across all non-ping monitors for a specific PL:

SELECT 60*24*2*uniqueCount(monitorId)/count(monitorId) as 'Avg Frequency (minutes)' FROM SyntheticCheck WHERE location = 'YOUR_PL' and type != 'SIMPLE' SINCE 2 days ago

Note that if you change the timeframe from 2 days, you also need to update the 60*24*2 to reflect the number of minutes in the new timeframe. This query will not work as a timeseries because the result gets skewed by the number of intervals in the chart.
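The arithmetic behind the query is the number of minutes in the window multiplied by the number of unique monitors, divided by the total number of checks. The Python sketch below reproduces it so the window constant is easier to reason about. The function name and the check counts are invented for illustration.

```python
def avg_frequency_minutes(window_days, unique_monitors, total_checks):
    """Average minutes between checks per monitor over the query window."""
    window_minutes = 60 * 24 * window_days  # 2 days gives the 60*24*2 constant
    return window_minutes * unique_monitors / total_checks


# Illustrative numbers: 100 monitors that ran 28,800 checks in 2 days.
print(avg_frequency_minutes(2, 100, 28800))  # 10.0

# The same monitors over 7 days: the constant becomes 60*24*7.
print(avg_frequency_minutes(7, 100, 100800))  # 10.0
```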

My colleague likes to think of it in terms of the rate at which jobs are getting added to the PL queue (publish rate) versus the rate at which jobs are being pulled off the queue by the minion (consumption rate).

If the publish rate exceeds the consumption rate, the queue will grow. The rate at which the queue grows can give you an idea about how much “extra” capacity the PL needs from hosts with minions running on them.
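One way to turn queue growth into a capacity estimate is sketched below in Python. It assumes the conservative 30 seconds per job used earlier and that every extra job needs a heavy worker. The function names and the sample rates are hypothetical, not values taken from a real PL.

```python
import math


def queue_growth_per_minute(publish_rate, consumption_rate):
    """Jobs added to the queue per minute beyond what the minions pull off."""
    return publish_rate - consumption_rate


def extra_heavy_workers(growth_per_minute, job_seconds=30.0):
    """Additional heavy workers needed to stop the queue from growing."""
    if growth_per_minute <= 0:
        return 0  # the queue is stable or draining
    return math.ceil(growth_per_minute * job_seconds / 60.0)


# Illustrative rates: 10 jobs published per minute, 8 consumed per minute.
growth = queue_growth_per_minute(10, 8)
print(growth, extra_heavy_workers(growth))  # 2 1
```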

To see how far the publish rate exceeds the current consumption rate, i.e. how much demand is beyond the minion’s ability to process jobs off the queue:

SELECT derivative(checksPending, 1 minute) AS 'queue growth rate (per minute)' FROM SyntheticsPrivateLocationStatus WHERE name = 'YOUR_PL' SINCE 2 days ago TIMESERIES

To see actual values for consumption rate:

SELECT rate(uniqueCount(jobId), 1 minute) FROM SyntheticRequest WHERE type != 'SIMPLE' FACET location SINCE 2 weeks ago TIMESERIES

Also important are queries that can highlight performance issues and errors:

Assess memory usage

SELECT latest(minionPhysicalMemoryUsedPercentage) FROM SyntheticsPrivateMinion FACET minionLocation SINCE 2 weeks ago TIMESERIES MAX

Assess Internal Engine Errors

SELECT 100*latest(minionJobsInternalEngineError)/latest(minionJobsReceived) AS 'IEE rate (%)' FROM SyntheticsPrivateMinion FACET minionLocation SINCE 2 weeks ago TIMESERIES MAX

Assess job failure rate

SELECT 100*latest(minionJobsFailed)/latest(minionJobsReceived) AS 'job failure rate (%)' FROM SyntheticsPrivateMinion FACET minionLocation SINCE 2 weeks ago TIMESERIES MAX

The ratio of failed jobs to total jobs provides a measure of how significant failed jobs are to the minion over time. This query does not identify why the failures happened. They could be scripting errors, actual endpoint failures, or CPM issues (leading to alerts and false positives).

Monitors with failing checks contribute to retries on the queue, which require additional resources to process. It’s best to fix monitors with script errors to reduce the demand on the minion.