Relic Solution: Scaling and Rightsizing for the CPM

The CPM requires careful capacity planning. First scale up: make sure each host has enough CPU, memory, and disk I/O to run well. Then scale out: add more hosts, all pointing to the same private location (PL). This not only balances load across the PL, but also provides redundancy and failover protection should one host go down.

The CPM Troubleshooting Framework details how to assess demand.

Each CPM instance supports a maximum of 50 heavy workers and 1250 lightweight workers. See the MINION_HEAVY_WORKERS and MINION_LIGHTWEIGHT_WORKERS environment variables for more details.

The rate at which those workers can process monitor jobs depends on a couple of factors:

  • number of monitors
  • type of monitors
  • monitor frequency
  • number of worker threads per CPM
  • number of hosts running a CPM
  • average job completion time
  • job timeout seconds
  • percentage of jobs that time out

Intuitively, more monitors need more worker threads to process them. Similarly, a shorter monitor frequency (1-minute vs 10-minute) will generate more jobs and need more worker threads.

Monitor type matters because ping jobs don’t contribute much to the resource needs of the minion, though there still need to be enough lightweight worker threads to process the volume of ping jobs per minute (if there are any). Scripted browser jobs take the longest on average, so they require the most CPU, memory, and disk I/O.

What may not be intuitive from the list above is that anything causing jobs to take longer, or to occupy a worker thread for longer, will also require more worker threads. For example, increasing the job timeout from the default of 180 seconds to something higher like 300 means that any job that times out occupies a worker thread for a longer period. That is time that could have been spent processing the next job, so more workers are needed both for longer timeouts and as the percentage of jobs that time out increases.
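As an illustrative sketch of that effect (the job rate, average duration, and timeout percentage below are hypothetical inputs; only the 180-second default timeout comes from the CPM itself):

```python
def workers_needed(jobs_per_minute, avg_job_seconds, timeout_seconds, timeout_pct):
    # Average seconds a worker is occupied per job: jobs that time out
    # hold their worker for the full timeout period.
    avg_occupancy = (1 - timeout_pct) * avg_job_seconds + timeout_pct * timeout_seconds
    # Workers required to sustain the job rate (Little's law).
    return jobs_per_minute * avg_occupancy / 60

# Hypothetical load: 10 jobs/minute, 30 s average runtime, 5% timeout rate.
print(workers_needed(10, 30, 180, 0.05))  # default 180 s timeout -> ~6.25 workers
print(workers_needed(10, 30, 300, 0.05))  # raising to 300 s      -> ~7.25 workers
```

Raising the timeout from 180 to 300 seconds adds a full worker of demand even though nothing else about the workload changed.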

See this Google sheet to help calculate how many workers are needed for the inputs listed above. This will help you assess the required size of each host (scaling up) and how many hosts you need (scaling out).

Some things to consider when planning for a CPM install:

  • How much memory does the host have available to Docker or to a Kubernetes node?
  • Are there other workloads running alongside? We recommend a dedicated host for the Docker CPM.
  • How many CPU cores are accessible to Docker? A quick docker info can provide those details.
  • If running the Kubernetes CPM in a multi-tenant cluster, setting resource quotas on each namespace will help to ensure enough resources are allocated to the CPM’s namespace.
  • What is the capacity of the disk system in terms of maximum sustained operations per second (IOPS)? Disk write is heaviest for scripted browser monitors since they save a large performance log file from Chrome after each job completes.

See this article on rightsizing for the Kubernetes CPM, especially if a node has insufficient resources, health checks are failing, the minion pod is restarting, and/or the PVC is requesting access mode ReadWriteOnce (RWO).

The number of CPM instances is determined by how many hosts or nodes you want to run the CPM on. Each instance helps to balance the load across the CPMs and provide some failover protection. This is a best practice.

The CPM is a container orchestrator, so it’s best to run only one CPM per host or node to avoid resource contention issues. It will scale to maximize the resources available to it up to the heavy and lightweight worker values.

For the Docker CPM:

The number of worker threads is set by the MINION_HEAVY_WORKERS and MINION_LIGHTWEIGHT_WORKERS environment variables, or derived from the CPU core count on the host if those variables are not set (num_cpu heavy workers, 25*num_cpu lightweight workers). The CPM utilizes the hardware provided to it, so more CPU cores yield more worker threads. The key consideration is that each heavy worker thread should be provided 2.5 GiB of memory to adequately support the CPM.
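A minimal sketch of the defaults described above (the caps are the documented per-instance maximums; the function itself is illustrative, not CPM source code):

```python
def docker_worker_defaults(num_cpu):
    # Defaults when MINION_HEAVY_WORKERS / MINION_LIGHTWEIGHT_WORKERS are
    # not set: heavy = num_cpu, lightweight = 25 * num_cpu, capped at the
    # per-instance maximums of 50 and 1250.
    heavy = min(num_cpu, 50)
    lightweight = min(25 * num_cpu, 1250)
    # Recommended memory: 2.5 GiB per heavy worker.
    memory_gib = heavy * 2.5
    return heavy, lightweight, memory_gib

print(docker_worker_defaults(4))   # (4, 100, 10.0)
print(docker_worker_defaults(64))  # caps apply: (50, 1250, 125.0)
```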

For the Kubernetes CPM:

The number of worker threads is set by the heavyWorkers and lightweightWorkers values in our Helm chart. They default to 2 and 50 respectively. These values can be increased provided that the node has sufficient resources. If provided a PVC that requests access mode ReadWriteMany (RWX), multiple nodes can be utilized which helps to reduce the resources required by any one node.

We have a formula to help estimate how many jobs each worker thread can process per minute. A general rule of thumb is to keep heavy jobs per worker per minute to 2 or fewer. This leaves room for jobs that soft fail and need to be retried (up to 2 more times), and for jobs that time out at 180 seconds (the default timeout value).

Here is a formula you can use to calculate jobs per heavy worker thread per minute. Each heavy worker handles a non-ping job in a “runner container” that gets created and destroyed for each job:

number of non-ping monitors / (avg monitor frequency * number of heavy workers * hosts)

Note: Scripted API monitors use fewer resources than simple and scripted browser monitors.

For example, let’s say you have:

non-ping monitors     : 100
avg monitor frequency : 10 minutes
heavy workers         : 4 (typically equal to num_cpu)
hosts                 : 1

This gives an estimate of how many non-ping jobs each heavy worker needs to process each minute: in this case, 100 / (10 * 4 * 1) = 2.5 jobs per heavy worker per minute.
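The formula can be transcribed directly (nothing here is assumed beyond the inputs listed above):

```python
def jobs_per_heavy_worker_per_minute(non_ping_monitors, avg_frequency_min,
                                     heavy_workers, hosts):
    # number of non-ping monitors / (avg monitor frequency * heavy workers * hosts)
    return non_ping_monitors / (avg_frequency_min * heavy_workers * hosts)

# The example inputs: 100 monitors at a 10-minute average frequency,
# 4 heavy workers, 1 host.
print(jobs_per_heavy_worker_per_minute(100, 10, 4, 1))  # 2.5
```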

Note: If the average monitor frequency were shortened from 10 minutes to 1 minute, the resources required to process jobs would increase by an order of magnitude: 10x the memory, 10x the CPU cores, and 10x the IOPS.

This is just a rough estimate of job demand to gauge what size each host ought to be. Assume, conservatively, that each job takes 30 seconds to process. That means each heavy worker can process at most 2 jobs per minute (60 / 30).

In this conservative case, demand (2.5 jobs per worker per minute) exceeds that capacity, so we’d need to add CPU cores, which translates to more heavy workers, which in turn translates to more memory at 2.5 GiB (gibibytes) per CPU core, the recommended level mentioned in our requirements.
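Continuing the example, a sketch of rightsizing to the 2-jobs-per-worker-per-minute rule of thumb (the 2.5 GiB per heavy worker figure comes from the requirements above; the function name is illustrative):

```python
import math

def rightsize(non_ping_monitors, avg_frequency_min, hosts,
              max_jobs_per_worker_per_min=2, gib_per_heavy_worker=2.5):
    # Total non-ping jobs generated per minute across the PL.
    jobs_per_minute = non_ping_monitors / avg_frequency_min
    # Heavy workers per host so no worker exceeds the rule of thumb.
    workers = math.ceil(jobs_per_minute / (max_jobs_per_worker_per_min * hosts))
    # One heavy worker per CPU core, 2.5 GiB of memory per heavy worker.
    return workers, workers * gib_per_heavy_worker

print(rightsize(100, 10, hosts=1))  # (5, 12.5): 5 cores and 12.5 GiB per host
```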

This query shows jobs that are timing out, to help assess the job timeout rate:

SELECT 100*average(minionJobsTimedout)/latest(minionJobsReceived) as 'job timeout rate (%)' FROM SyntheticsPrivateMinion FACET minionLocation TIMESERIES MAX SINCE 2 weeks ago

Here’s a query to see how many monitors are set to a PL by type of monitor:

SELECT uniqueCount(monitorId) FROM SyntheticCheck WHERE location = '' FACET type SINCE 2 days ago

This query will calculate the average monitor frequency across all non-ping monitors for a specific account:

SELECT 60*24*2*uniqueCount(monitorId)/count(monitorId) as 'Avg Frequency (minutes)' FROM SyntheticCheck WHERE location = '' and type != 'SIMPLE' SINCE 2 days ago

Note that if the timeframe is changed from 2 days, make sure to also update the 60*24*2 to reflect the number of minutes in that timeframe. This query will not work as a time series because it will get skewed by the number of intervals in the chart.
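The arithmetic behind that query can be sketched as follows (the monitor and check counts below are made-up inputs):

```python
def avg_frequency_minutes(window_minutes, unique_monitors, total_checks):
    # Each monitor runs roughly total_checks / unique_monitors times in the
    # window, so the average interval is the window length divided by that.
    return window_minutes * unique_monitors / total_checks

# 2 days = 60 * 24 * 2 minutes; 3 monitors producing 864 checks total.
print(avg_frequency_minutes(60 * 24 * 2, 3, 864))  # 10.0 minutes
```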

Think of it in terms of the rate at which jobs are getting added to the PL queue (publish rate) versus the rate at which jobs are being pulled off the queue by the minion (consumption rate).

If the publish rate exceeds the consumption rate, the queue will grow. The rate at which the queue grows can give you an idea about how much “extra” capacity the PL needs from hosts with minions running on them.
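A sketch of that reasoning, reusing the 2-jobs-per-worker-per-minute rule of thumb (the rates here are hypothetical):

```python
import math

def queue_growth_per_minute(publish_rate, consumption_rate):
    # Positive growth means the minion is falling behind.
    return publish_rate - consumption_rate

def extra_heavy_workers(growth_rate, max_jobs_per_worker_per_min=2):
    # Workers needed just to absorb the queue growth.
    return math.ceil(growth_rate / max_jobs_per_worker_per_min)

print(queue_growth_per_minute(12, 10))  # queue grows by 2 jobs/minute
print(extra_heavy_workers(2))           # 1 additional heavy worker needed
```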

To see actual values for the publish rate minus the consumption rate, i.e., how many extra jobs per minute the minion needs to process to keep up:

SELECT derivative(checksPending, 1 minute) as 'queue growth rate (per minute)' FROM SyntheticsPrivateLocationStatus where name = '' SINCE 2 days ago TIMESERIES

To see actual values for consumption rate:

SELECT rate(uniqueCount(jobId), 1 minute) FROM SyntheticRequest WHERE type != 'SIMPLE' FACET location SINCE 2 weeks ago TIMESERIES

Also important are queries that can highlight performance issues and errors:

Assess memory usage

SELECT latest(minionPhysicalMemoryUsedPercentage) FROM SyntheticsPrivateMinion FACET minionLocation SINCE 2 weeks ago TIMESERIES MAX

Assess Internal Engine Errors

SELECT 100*average(minionJobsInternalEngineError)/latest(minionJobsReceived) as 'IEE rate (%)' FROM SyntheticsPrivateMinion FACET minionLocation SINCE 2 weeks ago TIMESERIES MAX

Assess job failure rate

SELECT 100*average(minionJobsFailed)/latest(minionJobsReceived) as 'job failure rate (%)' FROM SyntheticsPrivateMinion FACET minionLocation SINCE 2 weeks ago TIMESERIES MAX

The rate of job failures versus total jobs can provide a measure of how significant failed jobs are to the minion over time. This query does not identify why the failures happened: they could be scripting errors, actual endpoint failures, or CPM issues (leading to alerts and false positives).

Monitors with failing checks contribute to retries on the queue, which require additional resources to process. So it’s best to try to fix monitors with script errors to reduce the demand on the minion.