Containerized Private Minion (CPM) Troubleshooting Framework: Internal Error Codes

This Framework covers common CPM errors, explaining their typical underlying causes and resolutions. Other Troubleshooting Frameworks may go into greater detail on specific resolutions.

Internal Engine Error Codes

If the minion experiences an Internal Engine Error (IEE), it typically means an issue affected the minion's normal operation, and any monitor failures that occur due to an IEE should not be attributed to a problem with the endpoint or script. For example, if an IEE causes a monitor check to fail and trigger an alert, that alert is a false positive: it does not mean the endpoint is down or that whatever the script monitors was actually affected.

This is why it’s important to resolve IEE errors so that monitor check results can be relied upon.

Checks that fail on an IEE are recorded in the SyntheticsPrivateMinion event type under the minionJobsInternalEngineError attribute. This attribute is a running count per minion; unfortunately, it does not include the specific error code or message.

To see the count of IEEs per minion since the last minion restart:

SELECT minionJobsInternalEngineError FROM SyntheticsPrivateMinion WHERE minionLocation = '<location>'

Code: 3 DOCKER_UNABLE_TO_CREATE_CONTAINER

This indicates that there was an issue when the CPM POSTed a request to the Docker daemon to create a runner container based on the synthetics-minion-runner image. When this error occurs, you will often also see the error message returned from the Docker daemon, which points to one of the cases below:

synthetics-minion-runner image is missing

This is the most common case. A scenario can occur where the synthetics-minion-runner Docker image is no longer available on the host. The Docker daemon returns a 404 (image missing) when the minion attempts to create the container:

! Caused by: com.spotify.docker.client.exceptions.DockerRequestException: Request error: POST unix://localhost:80/v1.35/containers/create?name=synthetics-minion-runner-container-LZwH5S5N: 404, body: {"message":"No such image: synthetics-minion-runner:2.2.11"}

This often happens if some form of regular garbage collection is being run against Docker on the host. Note that Docker does not perform garbage collection on its own, so this usually takes the form of a routine docker system prune or docker image prune on a CRON job, or some other utility performing the same task. These commands remove Docker images on the host that are not currently in use by a container. Since runner containers are ephemeral and run on demand, such a command can run at a moment when no runner container is active and delete the runner image.

This can be resolved by restarting the CPM, which on startup automatically detects that the runner image does not exist on the host and re-downloads it.
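
A quick way to confirm this case from the host (a sketch; the CPM container name "synthetics-minion" is an assumption based on how the minion was started):

# Check whether the runner image is still present on the host
docker images | grep synthetics-minion-runner

# If it is missing, restart the CPM container so it re-downloads the runner image on startup
docker restart synthetics-minion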

No disk space

Sometimes this error occurs because the host is out of disk space. This is especially likely if the host has a small disk (for example, 10 GB) and the CPM has been updated a few times, leaving multiple versions of the synthetics-minion-runner image on disk at 2-3 GB each. Docker responds with a 500 status code and the message no space left on device:

INTERNAL ENGINE ERROR - code: 3 => DOCKER_UNABLE_TO_CREATE_CONTAINER [Request error: POST unix://localhost:80/containers/cf1a8acab7ebfa41b2dfb719a559625c359f2f697018b23e19161472bf18bd65/start: 500, body: {"message":"mkdir /var/run/docker/libcontainerd/cf1a8acab7ebfa41b2dfb719a559625c359f2f697018b23e19161472bf18bd65: no space left on device"}
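
To confirm a disk space issue, check the host's overall usage and how much of it Docker is consuming (a sketch using standard commands; <old-tag> is a placeholder for a superseded image version):

# Overall filesystem usage on the host
df -h

# Breakdown of space used by Docker images, containers, and volumes
docker system df

# Reclaim space by removing superseded runner image versions by tag
docker rmi synthetics-minion-runner:<old-tag>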

Other scenarios

This error type can occur if some system resource is exhausted at the moment the runner container is created. This can get into arcane areas of Linux systems, at which point it is worth considering:

  • Is this resource exhaustion a potential Docker or Linux kernel bug?
  • Could it be inode exhaustion, an open-files limit, a memory leak, or a permissions issue caused by SELinux?
  • Check the Docker daemon logs for clues (additional diagnostic commands are sketched below).
# check user limits
ulimit -a

# count the number of file handles currently open for all processes (without duplicates):
sudo lsof | awk '{print $9}' | sort | uniq | wc -l
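
A few more diagnostics that map to the list above (a sketch; it assumes a systemd-based host with SELinux tooling installed):

# Check inode usage per filesystem (inode exhaustion can look like a full disk even when df -h shows free space)
df -i

# Check whether SELinux is enforcing, which can cause permissions issues when creating containers
getenforce

# Review the Docker daemon logs around the time of the failure
sudo journalctl -u docker --since "1 hour ago"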

Code: 22 CHROME_TRAFFIC_FILTER_EXT_FAILED_START

Browser monitor checks use a Chrome instance with a traffic filter extension. This extension lets us enforce hostname blacklists, apply custom HTTP headers, and wait for pending HTTP requests during a check. The filter has a 90-second timeout to start; if that timeout is exceeded, this error is thrown. The most common scenario is some activity, such as a non-browser HTTP request, that occupies the script for more than 90 seconds before traffic is initiated in the browser, or while HTTP traffic in the browser is idle. An example use case is fetching a resource from an API before proceeding with the WebDriver script. You can mock this scenario with the following script:

$browser.get('https://newrelic.com')
  .then(function () {
    // Hold the script idle for 95 seconds with no browser HTTP traffic,
    // exceeding the traffic filter extension's 90-second startup timeout.
    return new Promise(function (resolve) {
      setTimeout(resolve, 95000);
    });
  })
  .then(function () {
    console.log('hello');
  });

Code: 28 CHROME_FAILED_TO_PROGRESS

Chrome is memory intensive, and if there is memory contention on the host, the Chrome browser instance in a check can lock up. Seeing this occasionally in minion logs is tolerable (the check is retried anyway), but if it happens chronically, it is best to start investigating memory utilization. A few key things to consider:

  • Are the minimum memory-to-core requirements being met? The Docker CPM requires 2.5 GiB of memory per core present on the host, and the K8s CPM requires 3 GiB of memory per heavy worker (a quick check is sketched below).
  • Are multiple instances of the minion running on the host? In that scenario each CPM spins up heavy job workers for each core on the host and they contend for memory.
  • Are any other services running on the host? Anything else running on the host can contend with the CPM for memory; ideally the host should be dedicated solely to the CPM.
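
A quick way to sanity-check the Docker CPM memory requirement on a host (a sketch using standard Linux tools):

# Number of cores the CPM will spin up heavy job workers for
nproc

# Total memory on the host; it should be at least 2.5 GiB per core for the Docker CPM
free -h

# Snapshot of current memory usage per container while checks are running
docker stats --no-stream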

Code: 31 DOCKER_UNABLE_TO_CREATE_NETWORK

This is probably the most common error we see reported, so much so that we call it out in our documentation here:
https://docs.newrelic.com/docs/synthetics/new-relic-synthetics/private-locations/install-containerized-private-minions-cpms#networks

To summarize what is documented there: every time a runner container is created, a new bridge network is created and attached to it. This helps achieve some network isolation during checks. After the check completes, the minion makes a best-effort attempt to remove that bridge network, and retries if the removal fails. In some cases these bridge networks can become orphaned, and since each network is assigned a specific IPv4 range, this can eventually use up all the IPv4 address space available to Docker.

A common scenario where a runner network gets orphaned is when the CPM exits while a check is still running in a runner, so it never gets the chance to clean the network up. You can spot orphaned networks by running docker network ls while the CPM is not running:

9b97680884cb        synthetics-minion-network-28d77731-e023-4fb8-bd5a-ffc8922d8767   bridge              local
aed08c32a95c        synthetics-minion-network-179bb248-3097-429a-88dd-2588616d4691   bridge              local
ea410d549606        synthetics-minion-network-f93180bf-49b8-48c6-b4b9-34c9c75914d8   bridge              local

These can easily be cleaned up by running docker network prune, and it is a good idea to run this as a CRON job on the host to guard against recurrence. Since version 2.2.15, the CPM also cleans up dangling synthetics-minion networks at boot.
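
A sketch of the cleanup (the cron schedule and the docker binary path are assumptions; adjust for the host):

# One-off cleanup while the CPM is stopped; -f skips the confirmation prompt,
# and only networks with no attached containers are removed
docker network prune -f

# Example cron entry to prune unused networks nightly at 03:00
0 3 * * * /usr/bin/docker network prune -f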

Code: 39 UNABLE_TO_RUN

This means the Kubernetes CPM was not able to provision enough resources on the node/cluster to schedule a Runner pod.
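
A starting point for diagnosing this from the cluster side (a sketch; <namespace> and the pod name are placeholders for your CPM installation):

# Find runner pods stuck in Pending
kubectl get pods -n <namespace>

# The Events section shows the scheduler's reason, e.g. insufficient cpu or memory
kubectl describe pod <pending-runner-pod> -n <namespace>

# Compare each node's allocatable resources against what is already requested
kubectl describe nodes | grep -A 8 "Allocated resources"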