General Containerized Private Minion (CPM) Troubleshooting Framework

Troubleshooting steps to address common issues faced when installing and configuring the CPM. This framework covers general considerations; there are separate frameworks for Internal Engine Errors, Docker CPM specifics, and Kubernetes CPM specifics.

General issues that affect both the Docker and Kubernetes Containerized Private Minions (CPMs) include network and proxy configuration, checks pending, the host environment, permissions on mounts, monitors failing due to script errors, job demand versus the supply of workers, capacity planning and resource constraints, architectural design, failover, and load balancing.

Information Gathering

Private Location Specs

Gathering information on the specs of the minions running a private location can be useful for troubleshooting. The General Specs Query below can help to do that.

Quick things to check for potential issues:

  • There should be only 1 minion ID per host.
  • Used memory should not exceed 60% of the total on a healthy system.
  • For the Docker Containerized Private Minion (CPM), total memory should be at least 2.5 GiB times the number of CPU cores.
  • For the Kubernetes CPM, total memory should be at least 3 Gibibytes (Gi) x the number of heavy workers + 3Gi for a healthcheck pod + 1.6Gi for the minion pod.
  • For the Kubernetes CPM, the total amount of milliCPU should be at least 1000m x the number of heavy workers + 1000m for a healthcheck pod + 750m for the minion pod (see the sizing sketch after this list).
  • The number of workers should equal the number of CPU cores on the Docker CPM, and defaults to two heavy workers per replica set on the K8s CPM. Resource allocation between the two CPMs is quite different: the Docker CPM has free rein over whatever resources are available on the host, while the K8s CPM is precisely controlled via request and limit values per pod.
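
The Kubernetes figures above can be turned into a quick sizing calculation. This is a minimal sketch, assuming the default resource figures listed in this checklist; adjust HEAVY_WORKERS to match your deployment.

# minimal K8s CPM sizing sketch based on the figures above
HEAVY_WORKERS=2   # default heavy workers per replica set

# memory: 3Gi per heavy worker + 3Gi for the healthcheck pod + 1.6Gi for the minion pod
MEM_GI=$(awk -v w="$HEAVY_WORKERS" 'BEGIN { print 3*w + 3 + 1.6 }')

# CPU: 1000m per heavy worker + 1000m for the healthcheck pod + 750m for the minion pod
CPU_M=$(( 1000 * HEAVY_WORKERS + 1000 + 750 ))

echo "Minimum per replica: ${MEM_GI}Gi memory, ${CPU_M}m CPU"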

General Specs Query

SELECT uniqueCount(minionId),latest(minionWorkers),latest(minionProcessors),latest(minionPhysicalMemoryUsedBytes/(1024*1024*1024)) as 'used memory (Gi)',latest(minionPhysicalMemoryTotalBytes/(1024*1024*1024)) as 'total memory (Gi)',latest(timestamp) FROM SyntheticsPrivateMinion SINCE 5 minutes ago FACET minionBuildNumber,minionLocation,minionOsVersion,minionDockerVer,minionIpv4,minionHostname LIMIT 100

Monitors

Gather a list of monitors and their associated monitorIds to associate with debug logs. This query will also show a count of how many private locations each monitor is set to run on. Note that if the New Relic account where the private location lives is not the same as where monitors have been created, then you’ll need to run this query from the monitors’ account. It’s currently not possible to run queries on SyntheticCheck that span all accounts where monitors exist. If that data is needed, please contact New Relic Support.

Monitor with IDs Query

SELECT uniqueCount(location) FROM SyntheticCheck FACET monitorName,monitorId SINCE 1 day ago LIMIT 100

Checks Pending

This is near the top of the list because it is an important measure of how the minion is performing. If checks are piling up on the external queue, you know something isn’t right. For example, if a monitor is scheduled to run frequently, say every minute, but it takes the minion more than a minute to process its jobs, then the minion will never catch up and the queue will grow indefinitely. Either the check will need to run less frequently or more resources will need to be given to the minion to reduce the average job duration so it completes before the next check runs.

Checks pending is also a great metric to alert on for catching when the CPM is failing.

Checks Pending Query

SELECT average(checksPending) FROM SyntheticsPrivateLocationStatus WHERE name = 'YOUR_PRIVATE_LOCATION' SINCE 1 week ago TIMESERIES max

It will be useful to assess the rate at which the queue is growing to estimate how much more capacity the CPM needs to pull or “consume” those jobs from the queue.

Queue Growth Rate Query

SELECT derivative(checksPending, 1 minute) as 'queue growth rate (per minute)' FROM SyntheticsPrivateLocationStatus WHERE name = 'YOUR_PRIVATE_LOCATION' SINCE 2 days ago TIMESERIES

Assessing Job Demand (and Supply)

1) Count Unique Monitors

It is useful to gauge job demand by counting how many monitors are assigned to a particular private location, by type, and their average frequency. In other words, how many heavy jobs per minute will the CPM need to process? If the monitors span multiple sub-accounts but all point to the same private location, contact New Relic Support; we’re happy to provide this data on request.

SELECT uniqueCount(monitorId) FROM SyntheticCheck WHERE location = 'YOUR_PRIVATE_LOCATION' FACET type SINCE 2 days ago

2) Calculate Average Monitor Frequency

Calculate the average monitor frequency, t, for heavy jobs. Why not include lightweight jobs? They don’t significantly impact resource usage.

SELECT 60*24*2*uniqueCount(monitorId)/count(monitorId) as 'Avg Frequency (minutes)' FROM SyntheticCheck WHERE location = 'YOUR_PRIVATE_LOCATION' and type != 'SIMPLE' SINCE 2 days ago

The formula in the SELECT statement divides the unique count of monitors by the total count of monitor jobs that occur in the 2-day time period. This works because each job record carries exactly one monitorId, so count(monitorId) equals the total number of jobs. Scaled by the number of minutes in the window, this gives the rate at which monitors are running: the avg monitor frequency. If the job count increases over a fixed time window, the avg frequency will decrease, meaning a faster frequency. For example, going from a 10-minute frequency to a 5-minute frequency is moving to a faster frequency.

There’s also some math to solve for the frequency t in minutes:

t / (2 days)                = uniqueCount(monitorId) / count(monitorId)
t / ((60 * 24 * 2) minutes) = uniqueCount(monitorId) / count(monitorId)
t = (60 * 24 * 2) minutes * uniqueCount(monitorId) / count(monitorId)
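
As a worked example with hypothetical numbers (30 unique monitors producing 8,640 heavy jobs over the 2-day window), the same arithmetic in a shell:

# hypothetical inputs: 30 unique monitors, 8640 heavy jobs over 2 days
UNIQUE_MONITORS=30
TOTAL_JOBS=8640

# t = minutes in the window * uniqueCount(monitorId) / count(monitorId)
echo "avg frequency: $(( 60*24*2 * UNIQUE_MONITORS / TOTAL_JOBS )) minutes"
# -> avg frequency: 10 minutes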

Notes:

  • This is based on supply-side data coming from the active minion, so if the minion is not performing well, this number will be skewed to the high side, meaning a longer avg frequency than actual. It’s best to round this t value down for step 3 below.
  • If changing the time period from 2 days ago to 3 days ago, adjust the multiplier to 60 minutes in an hour x 24 hours in a day x 3 days so the number of minutes matches the new time interval.
  • It’s not useful to view this on a timeseries chart because the chart skews the y-axis by the number of aggregated intervals in the window.

3) Assess Job Demand

Calculate the job demand in units of jobs per heavy worker per minute. Demand is greatly affected by monitor frequency: a 1-minute frequency will require an order of magnitude (10x) more CPU, memory, and disk IO than a 10-minute frequency.

number of non-ping monitors / (t * number of heavy workers * number of replicas or hosts), where t = avg monitor frequency in minutes

As an example, let’s assume there are 28 non-ping monitors, 12 heavy workers, 1 host, and the avg frequency is 12 minutes. Plugging in the values:

28 / (12 * 12 * 1) ≈ 0.2 jobs per heavy worker per minute

This equates to 1 job per heavy worker every 5 minutes, a low demand for the resources available. This minion should perform very well with room for future job demand growth.
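
The same arithmetic as a quick shell sketch, using the example values above:

# example inputs from above
MONITORS=28   # non-ping monitors
WORKERS=12    # heavy workers per host
HOSTS=1
FREQ=12       # avg monitor frequency t, in minutes

awk -v m=$MONITORS -v t=$FREQ -v w=$WORKERS -v h=$HOSTS \
    'BEGIN { printf "demand: %.2f jobs per heavy worker per minute\n", m/(t*w*h) }'
# -> demand: 0.19 jobs per heavy worker per minute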

4) Assess Consumption Rate

On the supply-side, how many jobs per minute is the minion actually processing, and is it meeting demand?

SELECT rate(uniqueCount(jobId), 1 minute) FROM SyntheticRequest WHERE type != 'SIMPLE' FACET location SINCE 2 days ago

It’s useful to view this as a line chart to see how the job rate changes over time.

SELECT rate(uniqueCount(jobId), 1 minute) FROM SyntheticRequest WHERE type != 'SIMPLE' FACET location SINCE 2 weeks ago TIMESERIES

Another way to think of this is from the perspective of the external queue (checks pending). Jobs get added to the queue at some rate, a “publish rate”. The minion then pulls jobs off the queue as it processes them at a “consumption rate”. If the rate at which the minion is consuming jobs is lower than the rate at which jobs are being published to the queue, then the queue will grow at a rate equal to the difference. In that case, more heavy workers, more CPU cores, more memory, and a disk with sufficient IOPS will likely be the answer.
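
A minimal sketch of that relationship, with hypothetical rates:

# hypothetical rates; the consumption rate comes from the queries above,
# the publish rate from your monitor count and frequencies
PUBLISH_RATE=10   # jobs added to the queue per minute
CONSUME_RATE=8    # jobs the minion completes per minute

GROWTH=$(( PUBLISH_RATE - CONSUME_RATE ))
echo "queue grows by ${GROWTH} jobs per minute"       # 2 jobs/min
echo "backlog after 1 hour: $(( GROWTH * 60 )) jobs"  # 120 jobs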

Assessing Performance

Memory

How has memory usage varied over time? Is it growing? Are you using the Docker CPM? Is it being affected by a memory leak?

SELECT latest(minionPhysicalMemoryUsedPercentage) from SyntheticsPrivateMinion FACET minionLocation SINCE 2 weeks ago TIMESERIES Max

Let’s focus on memory usage per hostname, which in the case of the Docker CPM will give us the minion containers and in the case of the K8s CPM will give us the minion pods (one for each replicaSet).

SELECT latest(minionPhysicalMemoryUsedPercentage) FROM SyntheticsPrivateMinion WHERE minionLocation = 'YOUR_PRIVATE_LOCATION' FACET minionHostname SINCE 2 weeks ago TIMESERIES Max

CPU

How about CPU usage? Is it spiking during periods of high job demand? Let’s look at each location as a whole.

SELECT latest(minionProcessorsUsagePercentage) FROM SyntheticsPrivateMinion FACET minionLocation SINCE 2 weeks ago TIMESERIES Max

Now let’s focus on each hostname for a particular location.

SELECT latest(minionProcessorsUsagePercentage) FROM SyntheticsPrivateMinion WHERE minionLocation = 'YOUR_PRIVATE_LOCATION' FACET minionHostname SINCE 2 weeks ago TIMESERIES Max

Internal Engine Errors

How about Internal Engine Errors (IEE): are any locations experiencing minion errors? If so, it would be good to capture debug-level logs while the issue occurs. It’s useful to look at the rate of IEE vs total jobs to determine how significant the IEE issue is for the CPM.

SELECT 100*latest(minionJobsInternalEngineError)/latest(minionJobsReceived) as 'IEE rate (%)' from SyntheticsPrivateMinion FACET minionLocation SINCE 2 weeks ago TIMESERIES Max

Job Failures

Similarly, the rate of job failures vs total jobs can provide a measure of how significant failed jobs are to the CPM over time. This query does not identify why the failures happened. They could be scripting errors, actual endpoint failures, or CPM issues.

Monitors with failing checks contribute to retries on the queue, which require additional resources to process. So it’s best to try to fix monitors with script errors to reduce the demand on the minion.

Job Failure Query

SELECT 100*latest(minionJobsFailed)/latest(minionJobsReceived) as 'job failure rate (%)' from SyntheticsPrivateMinion FACET minionLocation SINCE 2 weeks ago TIMESERIES MAX

If a monitor has a high failure count and a short frequency, like 1-minute, it should be addressed quickly. That one monitor could be taking up a significant percentage of the minion’s resources just to process failures over and over.

Are the failures due to an endpoint in a monitor script that needs a proxy? This query will help to identify the monitor errors. If an error is identified, that means the minion was able to process the job, record the error, and send the result back to New Relic. Typically that is an indication that the minion is operating normally and is not the source of the error.

Monitor Error Query

SELECT count(result),max(timestamp) FROM SyntheticCheck WHERE result = 'FAILED' AND type != 'SIMPLE' FACET error,monitorName,location LIMIT 10

It is also useful to query the SyntheticRequest event type to see the HTTP status codes returned by scripts.

Response Code Query

SELECT count(responseCode),max(timestamp) FROM SyntheticRequest WHERE responseCode not in (200,301,302) AND responseStatus != '' AND location = 'YOUR_PRIVATE_LOCATION' FACET monitorName,responseStatus,responseCode LIMIT 10

Durations

The portion of a job’s lifecycle that occupies the minion starts with the internal queue duration: the time from when the job first reaches the minion until it starts executing. The job then executes and its results are posted back to New Relic. If the internal queue duration is higher than the execution duration, you know there’s a problem with jobs queuing at the minion. If the execution duration is too long, jobs may time out. That’s a good indication that performance should be improved.

SELECT average(nr.internalQueueDuration/1e3),average(nr.executionDuration/1e3) FROM SyntheticCheck WHERE type != 'SIMPLE' FACET location SINCE 2 days ago

You may find it useful to know the job durations by type (BROWSER, SCRIPTED BROWSER, API TEST), result (SUCCESS, FAILED), and monitor to check if any single monitor is influencing the avg durations more than others. Look at how job duration has changed as monitors were added over time. If duration is increasing along with monitor count, that’s a sign the CPM is struggling to cope with the additional demand.

SELECT average(nr.internalQueueDuration/1e3)+average(nr.executionDuration/1e3) as 'avg job duration (s)' FROM SyntheticCheck WHERE type != 'SIMPLE' FACET location,type SINCE 2 days ago

SELECT average(nr.internalQueueDuration/1e3)+average(nr.executionDuration/1e3) as 'avg job duration (s)' FROM SyntheticCheck WHERE type != 'SIMPLE' FACET location,result SINCE 2 days ago

SELECT average(nr.internalQueueDuration/1e3)+average(nr.executionDuration/1e3) as 'avg job duration (s)' FROM SyntheticCheck WHERE location = 'YOUR_PRIVATE_LOCATION' AND type != 'SIMPLE' FACET monitorName SINCE 2 weeks ago TIMESERIES AUTO

A key indicator to determine if the CPM is handling an increase in demand is to compare the execution duration, the internal queue duration, and the monitor count to observe how all three change relative to each other over time. If the internal queue duration rises toward or above the execution duration as monitors increase, you know the CPM is struggling to meet the demand and jobs are queuing at the minion.

SELECT average(nr.internalQueueDuration/1e3),average(nr.executionDuration/1e3),uniqueCount(monitorId) FROM SyntheticCheck WHERE location = 'YOUR_PRIVATE_LOCATION' AND type != 'SIMPLE' SINCE 2 weeks ago TIMESERIES AUTO

Outdated Monitor Runtimes

Which monitors are using old runtimes and could be upgraded?

SELECT latest(result) FROM SyntheticCheck WHERE location = 'YOUR_PRIVATE_LOCATION' AND nr.apiVersion != '0.6.0' FACET monitorName,nr.apiVersion,type LIMIT 100

Network Troubleshooting

Firewall

The synthetics-horde endpoint needs to be added to your firewall’s allow list.

For US region accounts: https://synthetics-horde.nr-data.net/
For EU region accounts: https://synthetics-horde.eu01.nr-data.net/

https://docs.newrelic.com/docs/using-new-relic/cross-product-functions/install-configure/networks#synthetics-private

Proxy

Is a proxy needed to access outside networks? If so, the minion will need to be configured with some environment variables. We check for specific environment variables and propagate those values to runner containers and pods. Note that these environment variables are for the minion to communicate with synthetics-horde: to pull jobs from the queue and send job results back to New Relic. If the monitor itself needs to use a proxy to access an endpoint in the script, that needs to be set in the script; it will not use the environment variables below.

MINION_API_PROXY                          Format: "host:port".
MINION_API_PROXY_AUTH                     Format: "username:password" - Supports HTTP Basic Auth plus additional authentication protocols supported by Chrome.
MINION_API_PROXY_SELF_SIGNED_CERT         Acceptable values: true, 1, or yes (any case).

https://docs.newrelic.com/docs/synthetics/synthetic-monitoring/private-locations/containerized-private-minion-cpm-configuration#docker-env-config
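
For example, a docker run sketch that passes the proxy variables to the minion; the proxy host, port, and credentials below are placeholders, so substitute your own:

# hypothetical proxy values shown for illustration
# only set MINION_API_PROXY_SELF_SIGNED_CERT if your proxy uses a self-signed certificate
docker run -d --restart unless-stopped \
  -e MINION_PRIVATE_LOCATION_KEY=YOUR_PRIVATE_LOCATION_KEY \
  -e MINION_API_PROXY="proxy.example.com:8888" \
  -e MINION_API_PROXY_AUTH="proxyuser:proxypass" \
  -v /tmp:/tmp:rw \
  -v /var/run/docker.sock:/var/run/docker.sock:rw \
  quay.io/newrelic/synthetics-minion:latest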

synthetics.minionApiProxy                 Format: "host:port".
synthetics.minionApiProxyAuth             Format: "username:password" - Supports HTTP Basic Auth plus additional authentication protocols supported by Chrome.
synthetics.minionApiProxySelfSignedCert   Acceptable values: true, 1, or yes (any case).

https://docs.newrelic.com/docs/synthetics/synthetic-monitoring/private-locations/containerized-private-minion-cpm-configuration#kubernetes-env-config
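
For the Kubernetes CPM, these values are typically supplied through Helm. A hedged sketch, assuming a release named synthetics-minion installed from the New Relic chart; substitute your own release name, chart repo, namespace, and proxy values:

# hypothetical release/chart/namespace names and proxy values
helm upgrade synthetics-minion YOUR_CHART_REPO/synthetics-minion -n YOUR_NAMESPACE \
  --reuse-values \
  --set synthetics.minionApiProxy="proxy.example.com:8888" \
  --set synthetics.minionApiProxyAuth="proxyuser:proxypass"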

Docker Network Troubleshooting

Test connection to outside from the host first.

# check that the host adapter and docker0 bridge have valid IP addresses
ip a

# curl CPM endpoint from host
curl -G https://synthetics-horde.nr-data.net/synthetics/api/v1/ping

Test if the minion and runner containers can access our synthetics-horde endpoint, with a proxy if needed for outside traffic.

# check if docker has access to outside from inside docker container
docker run -it -v /var/run/docker.sock:/var/run/docker.sock:rw docker

# should return Hello from Synthetics Horde if you have access through your firewall
curl -G https://synthetics-horde.nr-data.net/synthetics/api/v1/ping
curl -G https://synthetics-horde.eu01.nr-data.net/synthetics/api/v1/ping

# see more details, should return HTTP/1.1 200 OK
curl -Gvvv https://synthetics-horde.nr-data.net/synthetics/api/v1/ping

# should return HTTP/2 200
curl -Ivvv https://quay.io/

# with proxy
curl -U username:password -x proxy.com:port -I https://quay.io/
curl -U username:password -x proxy.com:port -v https://synthetics-horde.nr-data.net/synthetics/api/v1/ping

If name resolution fails because DNS servers can’t be reached, try the following from the host:

# check if using systemd / network manager
systemd-resolve --status

# check if using dnsmasq
ps -e | grep dnsmasq

# check name resolution file
cat /etc/resolv.conf

# check hosts file
cat /etc/hosts

# if using dnsmasq, set docker dns rule
cat /etc/NetworkManager/dnsmasq.d/docker-bridge.conf
listen-address=172.17.0.1

# set dnsmasq as dns server in docker daemon
cat /etc/docker/daemon.json
{
  "dns": [
    "172.17.0.1",
     "another server",
     "another server"
  ]
}
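
After changing the dnsmasq rule or /etc/docker/daemon.json, restart the affected services so the new DNS settings take effect. A sketch for systemd-based hosts:

# apply the dnsmasq rule and the daemon.json change on systemd hosts
sudo systemctl restart NetworkManager
sudo systemctl restart docker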

If the network is unreliable, test over a longer time period:

# test HTTP GET with header to horde over many requests with max time set to 60 seconds same as the minion
set -B; success=0; for total in {1..100}; do curl -m 60 -IG https://synthetics-horde.nr-data.net/synthetics/api/v1/ping 2>/dev/null | grep '200 OK' && success=$((success+1)); echo "$success/$total, failure rate = $(( 100 * (total-success) / total ))%"; done;

# simulate failures due to timeouts by setting max time to something really low like 0.4 seconds
set -B; success=0; for total in {1..100}; do curl -m 0.4 -IG https://synthetics-horde.nr-data.net/synthetics/api/v1/ping 2>/dev/null | grep '200 OK' && success=$((success+1)); echo "$success/$total, failure rate = $(( 100 * (total-success) / total ))%"; done;

Admin endpoints

These endpoints work the same whether hitting the Docker CPM or K8s CPM. The status check tests a variety of things to determine the health of the CPM. If one of the health checks fails, details will be reported in the debug logs, and the minion pod should restart due to a livenessProbe failure on the K8s CPM.

curl http://localhost:8080/status/check
curl http://localhost:8080/status
curl http://localhost:8180/ping # should return "pong" if the minion is running
curl http://localhost:8180/healthcheck?pretty=true
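
On the Kubernetes CPM these ports live on the minion pod, so one option is to port-forward them to your workstation first. A hedged sketch; the pod and namespace names below are placeholders (find the minion pod with kubectl get pods):

# hypothetical pod/namespace names
kubectl port-forward pod/YOUR_MINION_POD 8080:8080 8180:8180 -n YOUR_NAMESPACE &
curl http://localhost:8080/status/check
curl http://localhost:8180/healthcheck?pretty=true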

Redundancy and load balancing

Hosts

For the Docker CPM, having more than one host allows for some level of redundancy. If one host fails, the others can handle the load. To add more hosts, for example AWS EC2 instances, each instance would run a separate CPM but use the same private location key.

Of course, this means that each host needs enough spare capacity to take on the load of a failed host. In a two-host system, each host must be able to carry the full load, so capacity is doubled; in a three-host system, the failed host’s share is split between the remaining two, so each host needs proportionally less headroom, and so on.

For load balancing, each additional host or node helps share the private location’s job queue. Each CPM is hungry and will feed itself as many jobs from the queue as it can fit in its mouth.

Nodes

For the K8s CPM, the same holds true except that additional consideration needs to be made to how to structure the cluster. If the cluster is operating with a block-level storage system like AWS EBS, then the persistent volume claim (PVC) access mode will be ReadWriteOnce (RWO). This means that all pods will get scheduled to run on the same node.

For redundancy, additional namespaces could be created to run multiple instances of the K8s CPM on different nodes in the cluster.

If the storage system is a more traditional file system type, like AWS EFS, then the persistent volume claim access mode will be ReadWriteMany. This will allow Kubernetes to schedule pods to run on any node that has capacity. In this case, redundancy can be increased by incrementing the replicaSet count.
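
When the storage class does allow ReadWriteMany, a hedged sketch of raising the replica count via Helm; the release, chart repo, and value key here are assumptions, so check your chart’s values.yaml for the exact key:

# hypothetical release/chart names and value key; verify against your chart's values.yaml
helm upgrade synthetics-minion YOUR_CHART_REPO/synthetics-minion -n YOUR_NAMESPACE \
  --reuse-values --set replicaCount=2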

Cronjobs and Docker Prune

Weekly Restarts

Weekly restarts may be required if your minion is experiencing memory growth over time, i.e. a memory leak. There are several factors that can affect the CPM’s susceptibility to a memory leak. It is most common on RHEL and can occur with both the Docker CPM and the Kubernetes CPM. Updating Docker, Kubernetes, and the Linux kernel can help by incorporating memory leak patches found in recent releases of those systems.

Note that applying the often recommended nokmem workaround can lead to other undesirable issues like frozen pods or inaccessible hosts.

For now, these are the recommended workarounds:

  1. Update the minion to the latest version. We’ve added some memory and network pruning that occurs on minion start.
    https://docs.newrelic.com/docs/synthetics/new-relic-synthetics/private-locations/install-containerized-private-minions-cpms#install
  2. Set up a CRON job to perform a weekly restart of the minion. This will help to mitigate the issue until a more permanent solution can be found.
    1. For Docker, the CRON job would include a docker prune and then docker run.
    2. For Kubernetes, the cronjob would include a helm upgrade to restart the minion.

Example CRON job for the Docker CPM:

#!/bin/bash
#
# Verify only root is running script
if [[ $EUID -ne 0 ]]; then
   echo "This script must be run as root"
   exit 1
fi
# stop all synthetics-minion containers
docker stop $(docker ps | grep "synthetics-minion" | awk '{ print $1 }') 2>/dev/null
# wait for any running processes to finish
sleep 120
# prune containers, images, and networks not in use
docker system prune -af
sleep 120
docker system prune -af
sleep 60
# start new private minions to support monitoring activities
docker run -d --restart unless-stopped -e MINION_PRIVATE_LOCATION_KEY=YOUR_PRIVATE_LOCATION_KEY -v /tmp:/tmp:rw -v /var/run/docker.sock:/var/run/docker.sock:rw quay.io/newrelic/synthetics-minion:latest
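
To schedule the script, a standard crontab entry works. A hedged example, assuming the script above is saved as /opt/cpm-restart.sh and should run as root every Sunday at 03:00; adjust the path and schedule to your needs:

# weekly restart of the Docker CPM; path and schedule are placeholders
0 3 * * 0 /opt/cpm-restart.sh >> /var/log/cpm-restart.log 2>&1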

Docker cleanup

As part of the CRON job, it’s a good idea to clean up Docker with a docker system prune -a, which will prune any stopped containers, unused images, and orphaned networks (the script above adds -f to skip the confirmation prompt). Be careful: if other containers exist besides the CPM and are in a stopped state, they will also get pruned. If your host runs additional projects alongside the Docker CPM, which is not recommended, make sure the system prune doesn’t also prune them. It’s okay to prune everything related to the CPM since the minion will get recreated and its images re-downloaded on the next docker run.

Rightsizing

Internal Queue Duration

If the internal queue duration is higher than the execution duration, or if the queue is growing without obvious signs of job failures caused by the minion, assess demand to determine what size each host or node needs to be to meet that demand.

SELECT average(nr.internalQueueDuration/1e3),average(nr.executionDuration/1e3),uniqueCount(monitorId) FROM SyntheticCheck WHERE location = 'YOUR_PRIVATE_LOCATION' AND type != 'SIMPLE' SINCE 2 weeks ago TIMESERIES AUTO

Memory Utilization

If the queue is growing and memory is underutilized, say around 20%, then increase the heavy worker count until memory usage approaches 30-40%. This will allow the minion to process more jobs simultaneously, so longer-running scripts have less of an impact on overall performance.

The Docker CPM can use up to 2.5 GiB per runner container (each managed by a heavy worker thread on the minion container). If a host has 4 CPU cores, the CPM will default to 4 heavy workers. The general requirement is to allocate at least 2.5 GiB of memory per core, which would be 10 GiB in this case. Assuming each job takes approximately 5 seconds to complete, the CPM could process at most 48 non-ping jobs per minute (60 seconds in a minute / 5 seconds per job = 12 jobs per heavy worker per minute x 4 heavy workers).

If jobs take longer than 5 seconds to complete, then the number of jobs the CPM could process per minute would decline.
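
That throughput estimate is easy to recompute for your own average job duration. A minimal sketch using the numbers above:

HEAVY_WORKERS=4
AVG_JOB_SECONDS=5   # pull the real value from the job duration queries above

echo "max throughput: $(( 60 / AVG_JOB_SECONDS * HEAVY_WORKERS )) non-ping jobs per minute"
# -> 48 jobs per minute; a 10-second average job would halve this to 24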

CPU Utilization

Also consider CPU utilization. For the Docker CPM, it is best not to exceed 2.5 heavy workers per CPU core. For example, if your host has 2 CPU cores and enough job demand to consistently fill 5 heavy workers, then those 2 CPU cores will likely be maxed out. Estimate 15-20% processor usage per heavy worker across 2 cores.

For the K8s CPM, estimate 1 CPU core per heavy worker, which is the pod limit we set on the node. Kubernetes is more rigid in defining precise resources compared to the Docker CPM running on a typical Linux host.

Input/Output Operations Per Second (IOPS)

Both the Docker and K8s CPMs require a decent amount of write throughput. Estimate somewhere between 10-20 Input/Output Operations Per Second (IOPS) per non-ping monitor set to 1-minute frequency.

A common issue can arise when using an AWS EC2 instance or EC2 node group for EKS with the gp2 storage type. The base IOPS for gp2 is 100. Assuming 16 non-ping monitors set to 1-minute frequency, IOPS will hover between 200 and 300.

This will quickly use up the burst balance since anything above the base level of 100 will require burst credits to be spent. Once the burst balance reaches zero, the minion will experience very high IOWait and severely degraded performance. The job queue will most certainly start to grow in this scenario.

An easy way to see snapshots in time of %IOWait is with iostat 3. If you’d like to measure an avg tps, use iostat -d 1 2.
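
For reference, those two commands:

# print CPU (%iowait) and device stats every 3 seconds until interrupted (Ctrl+C)
iostat 3

# print device stats twice: the first report is the average since boot, the second a 1-second sample
iostat -d 1 2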

The best practice here is to analyze the average IOPS of your CPM, especially on write operations, to determine the volume size needed to meet the required disk operations per second.

Using AWS EC2 as an example, the gp2 storage type allocates 3 IOPS/Gigabyte, so a 100 Gigabyte volume would yield a sufficient base level of 300 IOPS without reaching into the burst balance.
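
A minimal sizing sketch based on that rule of thumb; the monitor count and per-monitor IOPS estimate below are placeholders:

MONITORS=16          # non-ping monitors at 1-minute frequency
IOPS_PER_MONITOR=15  # midpoint of the 10-20 IOPS estimate above
GP2_IOPS_PER_GB=3

NEEDED_IOPS=$(( MONITORS * IOPS_PER_MONITOR ))
echo "needed: ${NEEDED_IOPS} IOPS -> gp2 volume of at least $(( NEEDED_IOPS / GP2_IOPS_PER_GB )) GB"
# -> needed: 240 IOPS -> gp2 volume of at least 80 GB (round up, e.g. 100 GB)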