Containerized Private Minion (CPM) Troubleshooting Framework: Internal Error Codes

This Framework covers common CPM internal errors, explaining their typical underlying causes and resolutions. Other Troubleshooting Frameworks may go into greater detail on likely resolutions.

Internal Engine Error Codes

If the minion experiences an Internal Engine Error (IEE), it typically means an issue affected the minion's normal operation, and any monitor failures caused by an IEE should not be attributed to a problem with the endpoint or script. For example, if an IEE causes a monitor check to fail and trigger an alert, that alert is a false positive: it does not mean the endpoint is down or that whatever the script monitors was actually affected.

This is why it’s important to resolve IEEs, so that monitor check results can be relied upon.

Checks that fail on an IEE are recorded in the SyntheticsPrivateMinion event type under the minionJobsInternalEngineError attribute. This attribute records a running count per minion; unfortunately, it does not include the error code or message.

To see the running count of IEEs per minion (since the last minion restart):

SELECT minionJobsInternalEngineError FROM SyntheticsPrivateMinion WHERE minionLocation = <location>

Code: 1 MONITOR_TO_RUNNER_TYPE_MISMATCH

Code: 2 DOCKER_ERROR_RESPONSE

Code: 3 DOCKER_UNABLE_TO_CREATE_CONTAINER

This indicates that the CPM had an issue POST-ing a request to Docker to create a runner container based on the synthetics-minion-runner image. When this error occurs, you’ll often see the error message returned from the Docker daemon alongside it. Common scenarios are described below.

synthetics-minion-runner image is missing

This is the most common case. Sometimes the synthetics-minion-runner Docker image is no longer available on the host. The Docker daemon returns a 404 (image missing) when the minion attempts to create the container.

! Caused by: com.spotify.docker.client.exceptions.DockerRequestException: Request error: POST unix://localhost:80/v1.35/containers/create?name=synthetics-minion-runner-container-LZwH5S5N: 404, body: {"message":"No such image: synthetics-minion-runner:2.2.11"}

This often happens if the host has some regular garbage collection configured for Docker. Note that Docker doesn’t perform its own garbage collection, so this usually takes the form of a routine docker system prune or docker image prune on a CRON, or some other utility performing the same task. These commands remove Docker images on the host that are not in use by a container. Since runner containers are ephemeral and run on demand, such a command can run while there is no active runner container and delete the runner image.

This can be resolved by restarting the CPM, which on startup will automatically detect that the runner image doesn’t exist on the host and re-download it.
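
A quick way to confirm and recover from this state is to check for the image and restart the CPM; a minimal sketch, assuming the CPM container was started with the name synthetics-minion (adjust to your container name):

# check whether the runner image is still present on the host
docker images synthetics-minion-runner

# restart the CPM container; on startup it detects the missing runner image and re-downloads it
docker restart synthetics-minion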

No disk space

Sometimes this can occur if the host is out of disk space. This is especially likely if the host has a small disk (for example 10 GB) and the CPM has been updated a few times, since each update pulls a new version of the synthetics-minion-runner image, which can be 2-3 GB. Docker will respond with a 500 response code and the message no space left on device.

INTERNAL ENGINE ERROR - code: 3 => DOCKER_UNABLE_TO_CREATE_CONTAINER [Request error: POST unix://localhost:80/containers/cf1a8acab7ebfa41b2dfb719a559625c359f2f697018b23e19161472bf18bd65/start: 500, body: {"message":"mkdir /var/run/docker/libcontainerd/cf1a8acab7ebfa41b2dfb719a559625c359f2f697018b23e19161472bf18bd65: no space left on device"}
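
If disk space is the suspect, host and Docker disk usage can be checked with standard commands; a minimal sketch (run on the CPM host):

# overall filesystem usage on the host
df -h

# how much space Docker is using for images, containers, and volumes
docker system df

# list runner image versions on the host; old, unused versions can be removed with docker rmi
docker images synthetics-minion-runner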

Other scenarios

This error type can occur if some system resource is exhausted at the moment of runner container creation. This can step into arcane areas of Linux systems, at which point it is good to consider:

  • Is this resource exhaustion a potential Docker or Linux kernel bug?
  • Could it be inode exhaustion, an open files limit, memory leaks, or permissions issues caused by SELinux, etc.?
  • Check Docker daemon logs for clues.

# check user limits
ulimit -a

# count the number of file handles currently open for all processes (without duplicates):
sudo lsof | awk '{print $9}' | sort | uniq | wc -l
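
If inode exhaustion is suspected, inode usage per filesystem can be checked as well (a standard Linux command, not CPM-specific):

# check inode usage per filesystem
df -i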

Code: 4 DOCKER_FAILED_TO_COPY_FILE

Code: 5 DOCKER_IMAGE_INFO_RETRIEVAL_ERROR

Code: 6 DOCKER_IMAGE_NOT_FOUND_ERROR

Code: 7 DOCKER_FAILED_TO_WAIT_CONTAINER_TERMINATION

Code: 8 DOCKER_FAILED_TO_CAPTURE_LOGS

Code: 9 DOCKER_CONTAINER_INFO_RETRIEVAL_ERROR

Code: 10 DOCKER_CONTAINER_NOT_FOUND_ERROR

Code: 12 WEBDRIVER_SERVER_TIMEOUT

Code: 13 MINION_JOB_RESOURCE_HTTP_ERROR

Code: 14 UNKNOWN

Code: 15 MALFORMED_JOB

Code: 16 NO_SUITABLE_RUNNER_AVAILABLE

Code: 17 UNABLE_TO_BASE64_ENCODE_SCRIPT

Code: 18 SCRIPT_MISSING

Code: 19 WEBDRIVER_SERVER_FAILED_START

Code: 20 UNABLE_TO_INIT_JOB_SHARED_VOLUME

Code: 21 UNABLE_TO_SAVE_JOB_SHARED_VOLUME_RESULTS

Code: 22 CHROME_TRAFFIC_FILTER_EXT_FAILED_START

Browser monitor checks use a Chrome instance that has a traffic filter extension. This helps us enforce blacklists of hostnames, apply custom HTTP headers, and wait for pending HTTP requests in a check. The filter has a 90-second timeout to start, and this error is thrown if that timeout is exceeded. The most common scenario is some activity, such as a non-browser-based HTTP request, occupying the script for more than 90 seconds before traffic is initiated in the browser, or while HTTP traffic in the browser is idle. An example use case is fetching a resource from an API before proceeding with the webdriver script. You can mock this scenario with the following script:

// navigate once, then keep the script idle outside the browser for 95 seconds,
// which exceeds the traffic filter extension's 90-second startup timeout
$browser.get('https://newrelic.com')
  .then(function () {
    setTimeout(timeoutTest, 95000);
  });

function timeoutTest() {
  console.log('hello');
}

Code: 23 CHROME_SESSION_DELETED

Code: 24 CHROME_NO_SUCH_SESSION

Code: 25 JOB_STUCK

Code: 26 CHROME_CRASHED

Code: 27 CHROME_FAILED_TO_START

Code: 28 CHROME_FAILED_TO_PROGRESS

Chrome is memory expensive, and if there is memory contention on the host it is possible that the Chrome browser instance in a check will lock up. Seeing this occasionally in the minion logs is tolerable (the check will be retried anyway), but if it is happening chronically it is best to start investigating memory utilization (a few commands for this are sketched after the list below). A few key things to consider:

  • Are the minimum memory-to-core requirements being met? We require 2.5 GiB of memory per core present on the host for the Docker CPM and 3 GiB of memory per heavy worker for the K8s CPM.
  • Are there multiple instances of the minion running on the host? In this scenario each CPM will spin up heavy job workers for each core on the host and contend for memory.
  • Are there any other services running on the host? Is there anything else that could be contending for memory? Ideally the host the CPM runs on should be dedicated to the CPM.
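
A few standard commands can help confirm whether memory is the bottleneck; a minimal sketch (run on the CPM host, none of these are CPM-specific):

# total and available memory on the host
free -h

# number of cores, which determines how many heavy job workers the CPM starts
nproc

# current memory usage per running container, including the CPM and any active runners
docker stats --no-stream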

Code: 29 HTTP_CONTEXT_MISSING

Code: 30 DOCKER_CONNECTION_FAILED

Code: 31 DOCKER_UNABLE_TO_CREATE_NETWORK

This is probably the most common error we see reported, so much so that we call it out in our documentation here:
https://docs.newrelic.com/docs/synthetics/new-relic-synthetics/private-locations/install-containerized-private-minions-cpms#networks

To summarize what we have there: every time a runner container is created, a new bridge network is created and attached to it. This helps achieve some network isolation during checks. After the check is completed, there is a best-effort attempt to remove that bridge network, and if that attempt fails the minion retries the removal. In some cases these bridge networks can get orphaned, and since each network is assigned a specific IPv4 range, this can eventually use up all of the IPv4 address space available to Docker.

A common scenario where a runner network can get orphaned is when the CPM exits while a check is still being performed in a runner, so it doesn’t have a chance to clean the network up. You can see orphaned networks by running docker network ls while the CPM is not running:

9b97680884cb        synthetics-minion-network-28d77731-e023-4fb8-bd5a-ffc8922d8767   bridge              local
aed08c32a95c        synthetics-minion-network-179bb248-3097-429a-88dd-2588616d4691   bridge              local
ea410d549606        synthetics-minion-network-f93180bf-49b8-48c6-b4b9-34c9c75914d8   bridge              local

These can easily be cleaned up by running docker network prune, and it is a good idea to have this running as a CRON job on the host to guard against the problem. Since version 2.2.15, the CPM also cleans up dangling synthetics-minion networks at boot.
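
A minimal sketch of such a CRON entry (the schedule below is only an example, and it assumes the cron daemon can run docker as root via /etc/cron.d):

# /etc/cron.d/docker-network-prune (example)
# prune unused Docker networks nightly at 02:00; -f skips the confirmation prompt
0 2 * * * root /usr/bin/docker network prune -f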

Code: 32 DOCKER_UNABLE_TO_CONNECT_TO_NETWORK

Code: 33 DOCKER_UNABLE_TO_INSPECT_NETWORK

Code: 34 DOCKER_UNABLE_TO_RETRIEVE_NETWORK_CONTAINER

Code: 35 MONITOR_TO_RUNNER_API_VERSION_MISMATCH

Code: 36 UNABLE_TO_INIT_CUSTOM_MODULES

Code: 37 UNABLE_TO_PERSIST_JOB

Code: 38 KUBERNETES_UNABLE_TO_WAIT_FOR_RUNNER

Code: 39 UNABLE_TO_RUN_JOB

This means the Kubernetes CPM was not able to provision enough resources on the node/cluster to schedule a Runner pod.
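
To see why a runner pod could not be scheduled, the pending pod's events usually spell out the missing resource; a minimal sketch, assuming the CPM is installed in a namespace called newrelic (adjust to your install):

# list pods in the CPM's namespace and look for runner pods stuck in Pending
kubectl get pods -n newrelic

# the Events section will show messages such as "Insufficient memory" or "Insufficient cpu"
kubectl describe pod <pending-runner-pod-name> -n newrelic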

Code: 40 UNABLE_TO_CREATE_HARPROXY
