Relic Solution: Scaling and Rightsizing for the CPM

The CPM requires some careful capacity planning. First, scale up (vertically): make sure each host is running well with enough CPU, memory, and disk I/O. Then scale out (horizontally) by adding more hosts that all point to the same private location (PL). This will not only load balance the PL, but also provide redundancy and failover protection should one host go down.

The CPM Troubleshooting Framework details how to assess demand.

There is a maximum of 50 heavy workers and 1250 lightweight workers per CPM instance. See the MINION_HEAVY_WORKERS and MINION_LIGHTWEIGHT_WORKERS environment variables for more details.

The rate at which those workers can process monitor jobs depends on several factors:

  • number of monitors
  • type of monitors
  • monitor frequency
  • number of worker threads per CPM
  • number of hosts running a CPM
  • average duration for successful jobs
  • average duration for failed jobs
  • average duration for jobs that time out
  • percentage of jobs that time out
  • percentage of jobs that fail

Intuitively, more monitors need more worker threads to process them. Similarly, a shorter monitor frequency (1-minute vs 10-minute) will generate more jobs and need more worker threads.

Monitor type matters because ping jobs contribute little to the resource needs of the minion, although there must still be enough threads to process the number of ping jobs per minute (if there are any). Scripted browser jobs take the longest amount of time on average and so require the most resources in terms of CPU, memory, and disk I/O.

What may not be intuitive from the above list is that anything causing jobs to take longer, or to occupy a worker thread for a longer amount of time, will also require more worker threads. For example, increasing the job timeout from the default of 180 seconds to something higher like 300 means that any job that times out will occupy a worker thread for a longer period of time. That is time that could have been spent processing the next job, so more workers will be needed for longer timeouts or if the percentage of jobs that time out increases.

Even one non-ping monitor that times out regularly can significantly impact the performance of an otherwise healthy CPM if set to run at a frequency less than the timeout duration.

The risk of setting a 1-minute monitor frequency for non-ping monitors

If a non-ping job takes 180s to time out, but it is scheduled with a frequency of 1-minute, the queue will always grow because more jobs will be scheduled than can complete. This will lead to poor performance for the CPM since each minute another worker will become occupied, but only released after 3 minutes. Be cautious about setting a monitor to anything less than a 5-minute frequency due to the potential for timeouts.

         remaining seconds to job completion
minute  w1   w2   w3   w4   w5   w6   w7   w8
0       180
1       120  180
2       60   120  180
3       0    60   120  180
4       180  0    60   120  180
5       120  180  0    60   120  180
6       60   120  180  0    60   120  180
7       0    60   120  180  0    60   120  180

The above table highlights that with a single monitor timing out at 180 seconds and set to a 1-minute frequency, it will occupy 6 out of 8 heavy workers by minute 7.
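The table's steady state can be reproduced with a short simulation. This is a simplified model with one assumption not stated above: each timed-out job is retried once on the next minute boundary (which is what produces the two interleaved 180/120/60/0 cycles in the table — the retry behavior here is an assumption of this sketch, not a guarantee about the CPM):

```python
TIMEOUT_MIN = 3   # 180 s default timeout, expressed in minutes
RETRY_GAP = 1     # assumed: a timed-out job is retried on the next minute
RETRIES = 1       # assumed: one retry per timed-out job

def busy_workers(minute):
    """Count workers occupied (mid-job) at a given minute when one job
    is scheduled every minute, every job runs to TIMEOUT_MIN, and each
    timed-out job is retried once."""
    busy = 0
    for start in range(minute + 1):              # one new job per minute
        for attempt in range(1 + RETRIES):
            a_start = start + attempt * (TIMEOUT_MIN + RETRY_GAP)
            if a_start <= minute < a_start + TIMEOUT_MIN:
                busy += 1
    return busy

print(busy_workers(7))  # 6
```

By minute 7 the model matches the table: six workers are mid-job and two have just been released, so a single badly behaved monitor ties up six of eight heavy workers.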

Synthetics Job Manager (SJM)

If you’re looking to calculate parallelism and completions for the new Synthetics Job Manager (SJM) in Kubernetes, see this Google sheet.

Calculating Heavy Workers for the CPM

See this Google sheet to assist in calculating how many workers will be needed for the inputs listed above. This will help you to assess the required size of each host (scaling up) and how many hosts you need (scaling out).

Ignoring timeouts, this simpler spreadsheet can provide a rough estimate of the request and limit resources for the Kubernetes CPM.

For the K8s spreadsheet, the numbers will be a conservative estimate. For example, scripted API monitors actually request 1.25 GiB of memory with a limit of 2.5 GiB, whereas scripted browser monitors request 2 GiB with a limit of 3 GiB. To err on the side of caution, we can assume all non-ping “runner” jobs are of the scripted browser type.
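As a quick sanity check on that conservative assumption, the scripted-browser figures quoted above can be multiplied out by heavy worker count. This sketch assumes (hypothetically) one runner container per heavy worker at peak:

```python
# Per-job memory figures from the text (GiB); the conservative estimate
# treats every non-ping "runner" job as a scripted browser job.
BROWSER_REQUEST_GIB = 2.0
BROWSER_LIMIT_GIB = 3.0

def runner_memory(heavy_workers):
    """Rough aggregate memory request/limit for runner containers,
    assuming one runner per heavy worker at peak."""
    return (heavy_workers * BROWSER_REQUEST_GIB,
            heavy_workers * BROWSER_LIMIT_GIB)

request, limit = runner_memory(2)  # Helm chart default of 2 heavy workers
print(request, limit)  # 4.0 6.0
```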

The results depend on the approximate number per minute and average duration of ping and non-ping (runner) jobs that will be run on a private location. To help calculate those values, the following NRQL queries will come in handy.

Jobs Running

FROM SyntheticsPrivateMinion SELECT max(minionJobsRunning) AS 'concurrent', clamp_min(derivative(minionJobsFinished,1 minute),0) AS 'finished per minute' WHERE minionLocation = 'YOUR_PRIVATE_LOCATION' TIMESERIES AUTO SINCE 2 days ago

Avg Duration by Type

FROM SyntheticCheck SELECT average((nr.internalQueueDuration+nr.executionDuration)/1e3) as 'avg job duration (s)' WHERE location = 'YOUR_PRIVATE_LOCATION' FACET type TIMESERIES AUTO SINCE 2 weeks ago

Note: The SyntheticsPrivateMinion event type is for the private location as a whole, whereas the SyntheticCheck event type will only give you results for one New Relic account and any monitors that exist therein. It’s not yet possible to get a holistic view of all monitors across all accounts unless you write in to New Relic Support and request this kind of analysis.

Some things to consider when planning for a CPM install:

  • How much memory does the host have available to Docker or to a Kubernetes node?
  • Are there other projects running alongside? We recommend a dedicated host for the Docker CPM, and a dedicated node for the Kubernetes CPM to avoid conflicts with other pods.
  • How many CPU cores are accessible to Docker? A quick docker info can provide you with those details.
  • If running the Kubernetes CPM in a multi-tenant cluster, setting resource quotas on each namespace will help to ensure enough resources are allocated to the CPM’s namespace.
  • What is the capacity of the disk system in terms of maximum sustained operations per second (IOPS)? Disk write is heaviest for scripted browser monitors since they save a large performance log file from Chrome after each job completes.

See this article on rightsizing for the Kubernetes CPM, especially if a node has insufficient resources, health checks are failing, the minion pod is restarting, and/or the PVC is requesting access mode ReadWriteOnce (RWO).

The number of CPM instances is determined by how many hosts or nodes you want to run the CPM on. Each instance helps to balance the load across the CPMs and provide some failover protection. This is a best practice.

The CPM is a container orchestrator, so it’s best to run only one CPM per host or node to avoid resource contention issues. It will scale to maximize the resources available to it up to the heavy and lightweight worker values.

Docker CPM

The number of worker threads is set by the MINION_HEAVY_WORKERS and MINION_LIGHTWEIGHT_WORKERS environment variables, or by the CPU core count on the host if the environment variables are not set (num_cpu for heavy workers, 25*num_cpu for lightweight workers). The CPM will utilize the available hardware provided to it such that more CPU cores will yield more worker threads. The consideration here is that each heavy worker thread should be provided 2.5 GiB of memory to adequately support the CPM.
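Those defaults can be sketched in a few lines, using only the num_cpu-based behavior and the 2.5 GiB per heavy worker recommendation described above:

```python
MEM_PER_HEAVY_WORKER_GIB = 2.5  # recommended memory per heavy worker

def docker_cpm_defaults(num_cpu):
    """Default worker counts when MINION_HEAVY_WORKERS and
    MINION_LIGHTWEIGHT_WORKERS are unset, plus the memory the host
    should provide to support those heavy workers."""
    heavy = num_cpu
    lightweight = 25 * num_cpu
    memory_gib = heavy * MEM_PER_HEAVY_WORKER_GIB
    return heavy, lightweight, memory_gib

print(docker_cpm_defaults(4))  # (4, 100, 10.0)
```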

Kubernetes CPM

The number of worker threads is set by the heavyWorkers and lightweightWorkers values in our Helm chart. They default to 2 and 50 respectively. These values can be increased provided that the node has sufficient resources. If provided a PVC that requests access mode ReadWriteMany (RWX), multiple nodes can be utilized which helps to reduce the resources required by any one node.

We have a formula to help get a general idea of how many jobs each worker thread can process per minute. A general rule of thumb is to keep heavy jobs per worker per minute to 2 or less. This leaves room for jobs that soft fail and need to be retried (up to 2 more times), and jobs that time out at 180 seconds (the default timeout value).

Here is a formula you can use to calculate jobs per heavy worker thread per minute. Each heavy worker handles a non-ping job in a “runner container” that gets created and destroyed for each job:

number of non-ping monitors / (avg monitor frequency * number of heavy workers * hosts)

Note: Scripted API monitors use fewer resources than simple and scripted browser monitors.

For example, let’s say you have:

non-ping monitors     : 100
avg monitor frequency : 10 minutes
heavy workers         : 4 (typically equal to num_cpu)
hosts                 : 1

This would give you an estimate of how many non-ping jobs each heavy worker needs to process each minute, in this case 2.5 jobs per heavy worker per minute.

Note: If the average monitor frequency were shortened from 10 minutes to 1 minute, this would impact the resources required to process jobs by an order of magnitude! 10x the memory, 10x the CPU cores, and 10x the IOPS.
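The example and the 10x note can be checked with a direct translation of the formula above (the function name is illustrative):

```python
def jobs_per_heavy_worker_per_min(non_ping_monitors, avg_freq_min,
                                  heavy_workers, hosts):
    """Non-ping jobs each heavy worker must process per minute to keep
    up: monitors / (frequency * heavy workers * hosts)."""
    return non_ping_monitors / (avg_freq_min * heavy_workers * hosts)

print(jobs_per_heavy_worker_per_min(100, 10, 4, 1))  # 2.5
# Shortening the average frequency from 10 minutes to 1 minute is 10x:
print(jobs_per_heavy_worker_per_min(100, 1, 4, 1))   # 25.0
```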

This is just a rough estimate of job demand to gauge what size each host ought to be. Conservatively assume that each job takes 30 seconds to process. That means each heavy worker could process 2 jobs per minute, short of the 2.5 required in the example above.

In this conservative case, we’d need to add additional CPU cores, which translates to more heavy workers, which translates to more memory at 2.5 GiB (gibibytes) per CPU core, our recommended level as mentioned in our requirements.
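Putting the total demand and the conservative 30-second duration together, a hypothetical rightsizing helper might look like this (per-worker capacity of 60 / average duration is the simplifying assumption):

```python
import math

MEM_PER_HEAVY_WORKER_GIB = 2.5  # recommended memory per heavy worker

def rightsize(total_jobs_per_min, avg_job_duration_s):
    """Heavy workers (roughly CPU cores) and memory needed so each
    worker stays at or under its per-minute capacity."""
    capacity_per_worker = 60 / avg_job_duration_s   # jobs/worker/minute
    workers = math.ceil(total_jobs_per_min / capacity_per_worker)
    return workers, workers * MEM_PER_HEAVY_WORKER_GIB

# 100 monitors at a 10-minute frequency = 10 non-ping jobs per minute;
# assume 30 s per job as in the text.
print(rightsize(10, 30))  # (5, 12.5)
```

In other words, the example host would need a fifth core (and heavy worker) plus the matching 2.5 GiB of memory.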

Job Timeout Rate

This query shows jobs that are timing out.

FROM SyntheticsPrivateMinion SELECT 100*average(minionJobsTimedout)/latest(minionJobsReceived) as 'job timeout rate' FACET minionLocation TIMESERIES MAX SINCE 2 weeks ago

Unique Monitors by Type

See how many monitors are set to a PL by type of monitor.

FROM SyntheticCheck SELECT uniqueCount(monitorId) WHERE location = 'YOUR_PRIVATE_LOCATION' FACET type SINCE 2 days ago

Average Monitor Frequency (non-ping)

Calculate the average monitor frequency across all non-ping monitors for a specific account.

FROM SyntheticCheck SELECT uniqueCount(monitorId)/rate(uniqueCount(id), 1 minute) AS 'avg job frequency (minutes)' WHERE type != 'SIMPLE' FACET location SINCE 2 days ago

Queue Growth

If the queue is growing, we can assume the CPM is not performing well enough. This could be due to:

  • More jobs being scheduled to run at the PL than it can handle given the current level of resources and number of CPM instances
  • Internal Engine Errors or other configuration issues leading to poor performance (firewall blocking, /tmp directory permissions, AWS gp2 burst balance used up on disk IO, network congestion, etc.).

The queue will grow whenever the publish rate (the rate at which jobs are added to the PL queue) exceeds the consumption rate (the rate at which the minion pulls jobs off the queue).

The rate at which the queue grows can give you an idea about how much “extra” capacity the PL needs from host machines or VMs with minions running on them.
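As a rough sketch of that “extra” capacity (the per-worker throughput of 60 / average duration is our assumption, not something the CPM reports), the growth rate from the query below can be converted into additional heavy workers:

```python
import math

def extra_heavy_workers(queue_growth_per_min, avg_job_duration_s):
    """Extra heavy workers needed to absorb queue growth, assuming each
    worker can process 60 / avg_job_duration_s jobs per minute."""
    jobs_per_worker_per_min = 60 / avg_job_duration_s
    return math.ceil(queue_growth_per_min / jobs_per_worker_per_min)

# e.g. a queue growing by 6 jobs/minute with 30 s average job duration
print(extra_heavy_workers(6, 30))  # 3
```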

Rate of Queue Growth

How many extra jobs does the minion need to process to keep up?

FROM SyntheticsPrivateLocationStatus SELECT clamp_min(derivative(checksPending,1 minute),0) AS 'queue growth rate' FACET name TIMESERIES MAX SINCE 2 weeks ago

Consumption Rate

FROM SyntheticRequest SELECT rate(uniqueCount(jobId), 1 minute) WHERE type != 'SIMPLE' FACET location SINCE 2 weeks ago TIMESERIES

Memory Usage

It’s important to keep an eye on resources, since resource exhaustion can impact performance. Memory is more of a factor for the minion than CPU, due to Chrome.

FROM SyntheticsPrivateMinion SELECT latest(minionPhysicalMemoryUsedPercentage) FACET minionLocation SINCE 2 weeks ago TIMESERIES MAX

Internal Engine Errors

FROM SyntheticsPrivateMinion SELECT 100*average(minionJobsInternalEngineError)/latest(minionJobsReceived) as 'IEE rate' FACET minionLocation SINCE 2 weeks ago TIMESERIES MAX

Job Failure Rate

FROM SyntheticsPrivateMinion SELECT 100*average(minionJobsFailed)/latest(minionJobsReceived) as 'job failure rate' FACET minionLocation SINCE 2 weeks ago TIMESERIES MAX

The rate of job failures vs total jobs can provide a measure of how significant failed jobs are to the minion over time. This query does not identify why the failures happened: they could be scripting errors, actual endpoint failures, or CPM issues (leading to alerts and false positives).

Monitors with failing checks contribute to retries on the queue, which require additional resources to process. So it’s best to try to fix monitors with script errors to reduce the demand on the minion.

To see these and other queries, check out our Private Minion dashboard JSON which you can copy and import into a new dashboard to get started with some useful charts for assessing your CPMs and Private Locations.