Containerized Private Minion (CPM) Troubleshooting Framework Docker Specific

Troubleshooting the Docker CPM

This framework covers common errors that occur when installing or configuring the CPM on Docker, along with their usual causes and resolutions.

The Diagnostics CLI works for the Docker CPM and can automate some troubleshooting and information gathering.

Documentation for reference:

Docker CPM install steps
https://docs.newrelic.com/docs/synthetics/synthetic-monitoring/private-locations/install-containerized-private-minions-cpms#docker-update

Docker CPM requirements
https://docs.newrelic.com/docs/synthetics/synthetic-monitoring/private-locations/install-containerized-private-minions-cpms#docker-requirements

Docker CPM environment variables that can be passed at launch
https://docs.newrelic.com/docs/synthetics/synthetic-monitoring/private-locations/containerized-private-minion-cpm-configuration#docker-env-config

Clarifying Questions

When troubleshooting the Docker CPM, ask yourself some basic questions about the environment and script requirements.

  • What are the specifications of the host? Is it a VM or AWS EC2 instance?
  • How much memory in total is available to the CPM? free -h
  • How many CPUs are recognized by Docker? docker info
  • Which storage driver is being used by Docker? Is it overlay2?
  • How does Docker connect to the host’s network? Is it using the Docker bridge network?
  • Which Docker security options are in play? SELinux requires some careful considerations.
  • Has the https://synthetics-horde.nr-data.net/ endpoint been added to the firewall’s allow list?
  • Does the host use a proxy to connect out to the Internet?
  • Does the host use a proxy to connect to internal resources?
  • Does the endpoint in the script use a proxy?
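
Several of the questions above can be answered in one pass from the minion host. The following is a minimal sketch, not an official tool: the function names cpm_env_summary and horde_reachable are hypothetical, and the script assumes curl is installed. The docker info template fields (NCPU, Driver, SecurityOptions) are read from the local Docker daemon.

```shell
# Sketch: summarize host memory and Docker settings relevant to the CPM.
# cpm_env_summary and horde_reachable are hypothetical helper names.

cpm_env_summary() {
  # Total, used, and available memory on the host
  if command -v free >/dev/null 2>&1; then
    free -h
  else
    echo "free: not available on this host"
  fi

  # CPU count, storage driver, and security options as seen by Docker
  if command -v docker >/dev/null 2>&1; then
    docker info --format 'CPUs: {{.NCPU}}
Storage driver: {{.Driver}}
Security options: {{.SecurityOptions}}' || echo "docker: daemon not reachable"
  else
    echo "docker: not installed or not on PATH"
  fi
}

horde_reachable() {
  # Any HTTP status means the endpoint was reached; a connection, TLS,
  # or timeout failure makes curl return non-zero.
  # curl honors the https_proxy environment variable if one is set.
  curl -sS -o /dev/null --max-time 10 -w 'HTTP %{http_code}\n' \
    https://synthetics-horde.nr-data.net/
}
```

Run cpm_env_summary first, then horde_reachable. If horde_reachable fails on a host that requires a proxy, export https_proxy and try again before looking at the firewall allow list.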

Docker Info

If you are working with New Relic Support, please collect the following information:

  1. Your docker run command.
  2. Output from the following commands:
docker info > docker-info.yaml
docker inspect YOUR_CONTAINER_NAME > docker-inspect.json
docker events -f 'container=YOUR_CONTAINER_NAME' --format '{{json .}}' --since '60m' --until '0m' > docker-events.json
docker stats --no-stream > docker-stats.txt
df -h
df -i
free -hw
dmesg | grep oom-killer
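
The commands above can be wrapped so that all output lands in one directory and one archive. This is a sketch under stated assumptions: collect_support_bundle is a hypothetical helper name, the output file names for df, free, and dmesg are illustrative, and command errors are written to a separate errors.log so they do not corrupt the JSON files.

```shell
# Sketch: gather the Docker and host details listed above into one archive.
# Usage: collect_support_bundle YOUR_CONTAINER_NAME [OUTPUT_DIR]
collect_support_bundle() {
  name="$1"
  out="${2:-cpm-support-$(date +%Y%m%d-%H%M%S)}"
  err="$out/errors.log"
  mkdir -p "$out" || return 1

  # Each command may fail on its own without stopping the others.
  docker info > "$out/docker-info.yaml" 2>> "$err"
  docker inspect "$name" > "$out/docker-inspect.json" 2>> "$err"
  docker events -f "container=$name" --format '{{json .}}' \
    --since '60m' --until '0m' > "$out/docker-events.json" 2>> "$err"
  docker stats --no-stream > "$out/docker-stats.txt" 2>> "$err"
  df -h > "$out/df-h.txt" 2>> "$err"
  df -i > "$out/df-i.txt" 2>> "$err"
  free -hw > "$out/free.txt" 2>> "$err"
  dmesg 2>> "$err" | grep oom-killer > "$out/oom-killer.txt"

  # Pack the directory and print the archive path on success.
  tar -czf "$out.tar.gz" -C "$(dirname "$out")" "$(basename "$out")" \
    && echo "$out.tar.gz"
}
```

Review errors.log before sending the archive. An empty oom-killer.txt means no matching kernel messages were found.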

Debug Logs

Restart the minion with debug logging enabled:

docker run -e MINION_PRIVATE_LOCATION_KEY=YOUR_PRIVATE_LOCATION_KEY -e MINION_LOG_LEVEL=DEBUG -v /tmp:/tmp:rw -v /var/run/docker.sock:/var/run/docker.sock:rw quay.io/newrelic/synthetics-minion:latest

Then run any problematic monitor scripts a few times to generate some useful log lines and send the information to the Support Engineer you are working with.

docker logs $(docker ps -aq --filter "label=name=synthetics-minion") > minion-debug.log

https://docs.newrelic.com/docs/synthetics/new-relic-synthetics/private-locations/containerized-private-minion-cpm-maintenance-monitoring#monitor-docker-debug-logs

Alternatively, run the Diagnostics CLI, which collects the logs and some Docker information for you.

  1. Download the latest version of New Relic Diagnostics here:
    https://download.newrelic.com/nrdiag/nrdiag_latest.zip
  2. Unzip to any location on the minion host.
  3. Run the OS-appropriate binary with:
./nrdiag -s minion -a ATTACHMENT_KEY

Afterwards, remove the -e MINION_LOG_LEVEL=DEBUG option from the docker run command and restart the minion.

Common Operational Issues

Docker Container Stops Randomly

Make sure a Docker CPM instance is running on each host (one per host). If it has stopped, restart it with:

docker run -e MINION_PRIVATE_LOCATION_KEY=YOUR_PRIVATE_LOCATION_KEY -v /tmp:/tmp:rw -v /var/run/docker.sock:/var/run/docker.sock:rw quay.io/newrelic/synthetics-minion:latest

  1. If the Docker instance continues to stop without warning, make sure each host has at least 2.5 GiB of memory per CPU core.
  2. Consider adding a restart policy to your docker run command like docker run --restart unless-stopped ...
  3. Check for Docker issues in the system log:
# run with debug mode enabled
dockerd -D

# or add to docker daemon config
vi /etc/docker/daemon.json

# add the following code snippet
{
  "debug": true,
  "log-level": "debug"
}

# restart the docker service
systemctl restart docker

# RHEL, Oracle Linux
grep docker /var/log/messages

# Debian
grep docker /var/log/daemon.log

# Ubuntu 16.04+, CentOS
journalctl -u docker.service

# check for oom-killer in action
dmesg | grep oom-killer
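
Alongside the system log, the stopped container's own state records how it exited. The sketch below interprets two fields that docker inspect reports, State.OOMKilled and State.ExitCode. The helper name explain_exit and its messages are hypothetical, and the readings of exit codes 137 and 143 follow the general 128 + signal number convention rather than anything specific to the CPM.

```shell
# Sketch: interpret how a stopped container exited.
# Usage: explain_exit OOM_KILLED EXIT_CODE
explain_exit() {
  oom_killed="$1"
  exit_code="$2"
  if [ "$oom_killed" = "true" ]; then
    echo "OOM kill recorded by Docker"
  elif [ "$exit_code" = "137" ]; then
    echo "SIGKILL (128 + 9)"
  elif [ "$exit_code" = "143" ]; then
    echo "SIGTERM (128 + 15)"
  elif [ "$exit_code" = "0" ]; then
    echo "clean exit"
  else
    echo "exit code $exit_code"
  fi
}

# Example call against a real container (prints "true 137" style input):
#   explain_exit $(docker inspect \
#     --format '{{.State.OOMKilled}} {{.State.ExitCode}}' YOUR_CONTAINER_NAME)
```

An OOM kill or an unexplained SIGKILL points back to the memory guideline in step 1. A SIGTERM more likely means that docker stop or a host shutdown ended the container.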

Arguments Passed at Launch

Other than environment variables, arguments passed to the CPM in the docker run command do not get passed on to the runner containers, which the CPM creates to process lightweight jobs. Docker has no concept of “inheritance” for containers, and we don’t copy the configuration from the docker run command to the runner containers. The only configuration shared between them is what is set in the Docker daemon and the environment variables set at launch with the -e option.

Unless a run option applies only to the minion container (for example, --restart unless-stopped), it may not be passed to the runner containers created by the minion container. This is the case for options like docker run --privileged, and it can lead to permissions issues when processing lightweight jobs. The --privileged option is often used in combination with SELinux and the :z mount option on /tmp and docker.sock. This typically leads to issues that may not manifest for a few days, for example, the /tmp directory filling up because job results are not deleted, or IEE Code: 31, where the CPM stops processing jobs due to IP address exhaustion from orphaned networks. The :z mount option should generally be avoided.
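
A quick way to catch these options before launch is to scan the docker run command for them. The following is an illustrative sketch only: lint_run_cmd is a hypothetical helper, and it checks plain text, so it only recognizes --privileged and mount options written as :z, :Z, or in a comma-separated list such as :rw,z.

```shell
# Sketch: warn about run options that are not passed to runner containers.
# Usage: lint_run_cmd "FULL DOCKER RUN COMMAND"
# Returns 0 when nothing is flagged, 1 when at least one warning is printed.
lint_run_cmd() {
  lint_cmd="$1"
  lint_status=0

  case "$lint_cmd" in
    *--privileged*)
      echo "warning: --privileged may not reach the runner containers"
      lint_status=1
      ;;
  esac

  # Match z or Z as a mount option: after ':' or ',' and before a space,
  # a comma, or the end of the command.
  if printf '%s\n' "$lint_cmd" | grep -Eq '[:,][zZ]([[:space:],]|$)'; then
    echo "warning: :z or :Z mount option found; these should generally be avoided"
    lint_status=1
  fi

  return "$lint_status"
}
```

For example, lint_run_cmd "docker run --privileged -v /tmp:/tmp:z ..." prints both warnings, while the standard command with :rw mounts prints nothing.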

Maintenance

To keep the Docker CPM running smoothly, weekly restarts and/or host reboots are recommended. Memory usage tends to climb over time along with disk I/O until the container crashes, Docker quits, the host freezes, or the disk runs out of burst balance (as can happen on an AWS EC2 instance with gp2 EBS volumes).

A cron job can run docker system prune -a during the weekly restart to help keep Docker clean. This removes all stopped containers, all unused networks, and all images without an associated container, so take care if the host also has stopped containers that are unrelated to the CPM.

#!/bin/bash
# Verify only root is running script
if [[ $EUID -ne 0 ]]; then
  echo "This script must be run as root"
  exit 1
fi
# stop all synthetics-minion containers
docker stop $(docker ps | grep "synthetics-minion" | awk '{ print $1 }') 2>/dev/null
# wait for any running processes to finish
sleep 120
# prune containers, images, and networks not in use
docker system prune -af
sleep 120
docker system prune -af
sleep 60
# start new private minions to support monitoring activities
docker run -d --restart=unless-stopped -e MINION_PRIVATE_LOCATION_KEY=YOUR_PRIVATE_LOCATION_KEY -v /tmp:/tmp:rw -v /var/run/docker.sock:/var/run/docker.sock:rw quay.io/newrelic/synthetics-minion:latest
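
To schedule the script above weekly, a system cron entry can be used. The sketch below only builds and prints the entry. The script path, log path, and the Sunday 03:00 schedule are assumptions chosen for illustration. Files in /etc/cron.d take a user field before the command, which is why root appears in the line.

```shell
# Sketch: build a weekly cron entry for the restart script.
# Hypothetical path: assumes the script above was saved here and made executable.
SCRIPT=/usr/local/sbin/cpm-weekly-restart.sh

# /etc/cron.d format: minute hour day-of-month month day-of-week user command
# 03:00 every Sunday, run as root, with output appended to a log file.
CRON_LINE="0 3 * * 0 root $SCRIPT >> /var/log/cpm-weekly-restart.log 2>&1"
echo "$CRON_LINE"

# To install (as root):
#   echo "$CRON_LINE" > /etc/cron.d/cpm-weekly-restart
```

Pick a time when a few minutes of missed monitor checks is acceptable, since the script stops the minion and waits several minutes before starting a new one.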

SELinux

The Docker CPM has not been well tested with SELinux, so it is recommended to run with SELinux disabled on both the host and the Docker daemon. The output of docker info should not list SELinux under “Security Options”. Also, the output of docker inspect YOUR_CONTAINER_NAME should have empty values for “MountLabel” and “ProcessLabel”.

The Docker CPM has not been well tested with Docker mount options :z and :Z. They may work to modify permissions for the minion container, but the runner containers that get created by the minion container may not get passed those same permissions.
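
The two checks described at the start of this section can be scripted. This is a minimal sketch with hypothetical helper names (has_selinux, labels_empty). It assumes the security options are passed in as the text printed by docker info, and that the two labels are printed with a | separator as in the usage comments.

```shell
# Sketch: check for SELinux in Docker's security options and container labels.

# has_selinux SECURITY_OPTIONS_TEXT -> succeeds if SELinux is listed
has_selinux() {
  case "$1" in
    *selinux*) return 0 ;;
    *) return 1 ;;
  esac
}

# labels_empty "MOUNTLABEL|PROCESSLABEL" -> succeeds if both labels are empty
labels_empty() {
  [ "$1" = "|" ]
}

# Usage against a live daemon and container:
#   has_selinux "$(docker info --format '{{.SecurityOptions}}')" \
#     && echo "SELinux is enabled in the Docker daemon"
#   labels_empty "$(docker inspect \
#     --format '{{.MountLabel}}|{{.ProcessLabel}}' YOUR_CONTAINER_NAME)" \
#     || echo "container has SELinux labels set"
```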

Storage and Volumes

It’s best to use the overlay2 storage driver for Docker to avoid the issue of inode exhaustion with older storage drivers like devicemapper.

To allow the Docker CPM full access to volume mounts, it’s recommended to chown 1000:3729 /tmp on the host. The runner containers that get created from the minion container look for gid 3729.

# disk free
df -h
# inode consumption
df -i
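
Inode usage can also be checked against a threshold so that it can be monitored between restarts. The sketch below parses the single-line-per-filesystem output of df -Pi. The helper names and the 90% threshold in the usage comment are assumptions, and /var/lib/docker is only Docker's default data directory, so adjust the path if the daemon uses a different one.

```shell
# Sketch: warn when inode usage crosses a threshold.

# inode_use_pct: read `df -Pi PATH` output on stdin and print the
# IUse% value of the first filesystem without the percent sign.
inode_use_pct() {
  awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}

# check_inodes THRESHOLD: read `df -Pi PATH` output on stdin.
# Returns 0 below the threshold, 1 at or above it, 2 if usage is not numeric.
check_inodes() {
  pct="$(inode_use_pct)"
  case "$pct" in
    ''|*[!0-9]*) echo "inode usage unknown"; return 2 ;;
  esac
  if [ "$pct" -ge "$1" ]; then
    echo "inode usage ${pct}% is at or above ${1}%"
    return 1
  fi
  echo "inode usage ${pct}% is below ${1}%"
}

# Usage:
#   df -Pi /var/lib/docker | check_inodes 90
#   docker info --format '{{.Driver}}'   # should print overlay2
```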

Memory

Having enough memory plays a critical role in minion stability. Each runner container can use 1.5-2.5 GB of memory since it runs a full Chrome instance.

# check docker stats
docker stats
# check total, used, available, and swap memory
free -hw
# shows OutOfMemory-killer at work
dmesg | grep oom-killer
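
The guideline of at least 2.5 GiB of memory per CPU core can be turned into a quick check. This is a sketch with hypothetical helper names. It works in MiB (2.5 GiB = 2560 MiB) and assumes the core count and total memory are supplied by the caller, for example from nproc and free -m.

```shell
# Sketch: compare host memory with the 2.5 GiB per CPU core guideline.

# required_mem_mib CORES -> minimum memory in MiB (2560 MiB per core)
required_mem_mib() {
  echo $(( $1 * 2560 ))
}

# mem_ok CORES TOTAL_MIB -> succeeds if total memory meets the guideline
mem_ok() {
  [ "$2" -ge "$(required_mem_mib "$1")" ]
}

# Usage on the minion host:
#   cores="$(nproc)"
#   total="$(free -m | awk '/^Mem:/ { print $2 }')"
#   mem_ok "$cores" "$total" || echo "need at least $(required_mem_mib "$cores") MiB"
```

Note that free reports slightly less than the installed RAM because the kernel reserves some of it, so a host sized at exactly 2.5 GiB per core may fail this check by a small margin.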