Containerized Private Minion (CPM) Troubleshooting Framework: Kubernetes Specific

Troubleshooting the Kubernetes CPM

Note: the Diagnostics CLI currently only works for the Docker CPM.

For reference:

K8s CPM install steps

K8s CPM requirements

K8s CPM environment variables that can be passed at launch

K8s clarifying questions

When troubleshooting the K8s CPM, ask yourself some basic questions about the environment and script requirements.

  • What are the specifications of the cluster? Is it a VM, an AWS EC2 node group with autoscaling, or a managed Kubernetes service like EKS, GKE, or AKS?
  • How much memory is available to the CPM on the node? kubectl describe node YOUR_NODE
  • How many milliCPU are available?
  • Is your node running any other pods or projects?
  • Is your node overprovisioned on requests or limits? Ideally at least 50% of capacity is free so K8s pods can be added as needed.
  • Which storage class is being used by the cluster and is it set as the default storage class?
  • Do you have an existing persistent volume (PV) and persistent volume claim (PVC), or do you plan on letting the K8s CPM create one on demand (via the default storage class)?
  • Which container runtime is being used by your cluster? Docker, containerd, CRI-O?
  • Has the endpoint been added to the firewall’s allow list?
  • Does the cluster need a proxy to connect out to the Internet?
  • Does the cluster use a proxy to connect to internal resources?
  • Does the endpoint in the script use a proxy?
  • Are you using Helm3?
  • Is kubectl installed and the right version to match the server?

K8s info

# cluster info
kubectl version --short
kubectl api-versions
kubectl cluster-info

# get helm version
helm version --short

# see applied values from helm chart
helm get all YOUR_CPM_NAME -n YOUR_NAMESPACE > helm-all.yaml

# gets
kubectl get --raw='/readyz?verbose'
kubectl get all -n YOUR_NAMESPACE
kubectl get all -o json -n YOUR_NAMESPACE > k8s-all.json
kubectl get events --field-selector type=Warning -n YOUR_NAMESPACE
kubectl get events -o json -n YOUR_NAMESPACE > k8s-events.json
kubectl get statefulset YOUR_CPM_NAME -o wide -w -n YOUR_NAMESPACE

# describes
kubectl describe svc YOUR_SERVICE
kubectl describe sc YOUR_STORAGECLASS
kubectl describe node YOUR_NODE
kubectl describe pv YOUR_PV
kubectl describe namespaces
kubectl describe role synthetics-minion-role -n YOUR_NAMESPACE
kubectl describe pvc YOUR_PVC -n YOUR_NAMESPACE
kubectl describe quota -n YOUR_NAMESPACE
kubectl describe pod YOUR_POD -n YOUR_NAMESPACE

# list role-based access control (RBAC) permissions for your user
kubectl auth can-i --list -n YOUR_NAMESPACE

# get your sa
kubectl get sa -n YOUR_NAMESPACE

# list role-based access control (RBAC) permissions for your sa or the default sa if that is the one being used
kubectl auth can-i --list --as=system:serviceaccount:default:default -n default
kubectl auth can-i --list --as=system:serviceaccount:YOUR_NAMESPACE:YOUR_SA -n YOUR_NAMESPACE

K8s debug logs

Next let’s collect minion logs at DEBUG level.

# use helm install for a new cpm, helm upgrade for existing
helm upgrade YOUR_CPM_NAME YOUR_REPO_NAME/synthetics-minion -n YOUR_NAMESPACE --set synthetics.privateLocationKey=YOUR_PRIVATE_LOCATION_KEY --set synthetics.minionLogLevel=DEBUG

# save logs to file
kubectl logs YOUR_POD -n YOUR_NAMESPACE > k8s-minion-debug.log

Try to collect logs while the issue is occurring, then attach them to this ticket.

Afterwards, remove the --set synthetics.minionLogLevel=DEBUG and restart the minion.

Common operational issues

K8s resources

It is common to underestimate the resources the K8s CPM needs to run stably under Kubernetes. At a minimum it requires:

  • 1 minion pod
  • 2 heavy workers (runner pods)
  • 1 healthcheck pod (also a runner pod)

That’s four pods that may all need to run on one node at the same time. In total, that could demand as much as 3750m (3.75 CPU) and 10.6 GiB of memory.
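
As a quick sanity check, you can compare those totals against the node’s allocatable capacity (from kubectl describe node YOUR_NODE). The allocatable figures in this sketch are illustrative, not real cluster values:

```shell
# Documented peak demand for the four CPM pods
needed_mcpu=3750        # 3.75 CPU in milliCPU
needed_mem_mi=10854     # ~10.6 GiB expressed in MiB

# Illustrative allocatable figures; in practice read these from the
# Allocatable section of `kubectl describe node YOUR_NODE`
alloc_mcpu=4000
alloc_mem_mi=16384

if [ "$alloc_mcpu" -ge "$needed_mcpu" ] && [ "$alloc_mem_mi" -ge "$needed_mem_mi" ]; then
  echo "node can fit the CPM pods"
else
  echo "node is undersized for the CPM"
fi
```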

Persistent Volume (PV) and Persistent Volume Claim (PVC)

If no static persistent volume exists, we use the default storage class and persistent volume claim to generate the persistent volume on the fly. If no default storage class is set, you can do so with:

kubectl get storageclass
kubectl patch storageclass STORAGECLASS_NAME -p '{"metadata": {"annotations":{"storageclass.kubernetes.io/is-default-class":"true"}}}'

If a storage class other than the default needs to be used, set it by overriding the value in our helm chart.

# get default values
helm show values YOUR_REPO/synthetics-minion > values.yaml

# set your claimName if you've manually created a persistent volume with a non-default storage class
# or leave claimName blank and set the storageClass name to see if the CPM can automatically create the persistent volume claim
# from the storage class provided (will need a provisioner defined in the storage class)
  # The name of the persistent volume claim to use
  # If undefined or not set Statefulset will dynamically create a persistent volume claim for each replica
  # claimName: ""

  ## Override the StorageClass for VolumeClaimTemplates (relevant only if claimName is undefined or empty)
  ## If defined and claimName is empty, storageClassName: <storageClass>, i.e. the PVC can be bound to PVs
  ## having the <storageClass> storage class.
  ## If set to "-", storageClassName: "", i.e. the PVC can be bound to PVs that have no class (dynamic
  ## provisioning is disabled)
  ## If undefined (the default) or set to null, no storageClassName is set, i.e. the PVC can be bound to PVs
  ## having the default storage class.
  ## For more details see
  # storageClass: "-"

  # Access mode to be defined for the persistent volume claim (relevant only if claimName is undefined or empty )
  accessMode: ReadWriteMany

# override default helm chart values
helm upgrade YOUR_CPM YOUR_REPO/synthetics-minion -f values.yaml -n YOUR_NAMESPACE

If the storage class has no provisioner, the persistent volume will need to be created manually from a YAML manifest.
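
As a rough sketch, a manually created PV and matching PVC might look like the following. The names, capacity, and NFS backend here are placeholders (any ReadWriteMany-capable backend works); if you use this approach, the PVC name is what you would set as claimName in the chart values:

```yaml
apiVersion: v1
kind: PersistentVolume
metadata:
  name: synthetics-minion-pv        # placeholder name
spec:
  capacity:
    storage: 10Gi
  accessModes:
    - ReadWriteMany
  persistentVolumeReclaimPolicy: Retain
  storageClassName: manual
  nfs:                              # NFS shown as one RWX-capable example
    server: YOUR_NFS_SERVER
    path: /exports/synthetics
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: synthetics-minion-pvc       # placeholder name
  namespace: YOUR_NAMESPACE
spec:
  accessModes:
    - ReadWriteMany
  storageClassName: manual
  resources:
    requests:
      storage: 10Gi
```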

Kubernetes jobs

Are any cronjobs running that might create an excess of jobs for the kube-apiserver to handle? This could slow down processing of jobs from the queue.

kubectl get cj -n YOUR_NAMESPACE

Are jobs getting created and destroyed when finished or are they sticking around as orphaned jobs? Orphaned jobs could be an indication of resource exhaustion leading to pods getting killed by the K8s scheduler.

kubectl get jobs -n YOUR_NAMESPACE


Helm versions

If using Helm2, it is still possible to get the K8s CPM to work by first using Helm3 to render the statefulset template from our helm charts.

# 1. update the helm repo with helm3
helm3 repo update

# 2. grab the helm values to override
helm3 show values YOUR_REPO_NAME/synthetics-minion > values.yaml

# 3. make the template from your values
helm3 template YOUR_CPM_NAME YOUR_REPO_NAME/synthetics-minion -f values.yaml -n YOUR_NAMESPACE > template.yaml

# 4. apply the template or run with helm2
kubectl apply -f template.yaml

Kubectl versions

It’s a good idea to match the client version to the server version.

kubectl version --short

Service account and RBAC

In Kubernetes RBAC, a subject can be a user, a group, or a service account.

A role binding grants the permissions defined in a role to a user or set of users. It holds a list of subjects (users, groups, or service accounts), and a reference to the role being granted. A RoleBinding grants permissions within a specific namespace…

Our role binding uses the default service account in the default namespace. If your default service account is locked down and doesn’t have permission to execute the role binding, then you may run into permissions issues that prevent the role’s rules from being applied.

Our role, synthetics-minion-role, requires certain rules to be defined.

  - apiGroups: [""]
    resources: ["pods"]
    verbs: ["get", "watch", "list"]
  - apiGroups: [""]
    resources: ["pods/log", "persistentvolumeclaims"]
    verbs: ["get", "list"]
  - apiGroups: ["batch"]
    resources: ["jobs"]
    verbs: ["get", "watch", "list", "create", "delete", "patch", "update"]
  - apiGroups: ["batch"]
    resources: ["jobs/log"]
    verbs: ["get", "list"]
  - apiGroups: [""]
    resources: ["events"]
    verbs: ["create", "update", "patch"]
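
For reference, a RoleBinding that grants this role to a service account might look like the sketch below. The binding name is a placeholder; check the actual object with kubectl get rolebinding -n YOUR_NAMESPACE -o yaml:

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: synthetics-minion-rolebinding   # placeholder name
  namespace: YOUR_NAMESPACE
subjects:
  - kind: ServiceAccount
    name: default                       # or YOUR_SERVICE_ACCOUNT
    namespace: YOUR_NAMESPACE
roleRef:
  kind: Role
  name: synthetics-minion-role
  apiGroup: rbac.authorization.k8s.io
```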

You may need to do one of the following:

Option 1) Delete the default service account and let K8s recreate it with a new secret.

Option 2) Configure the helm chart to create a new service account in your namespace (that hopefully has the right permission level).

# get values
helm show values YOUR_REPO/synthetics-minion > values.yaml

# modify values.yaml
  # Specifies whether a service account should be created
  create: true

# override default values
helm upgrade YOUR_CPM YOUR_REPO/synthetics-minion -f values.yaml -n YOUR_NS

Option 3) Create your own service account and assign it to the CPM.

apiVersion: v1
kind: ServiceAccount
metadata:
  name: synthetics-minion
  namespace: newrelic
  labels:
    helm.sh/chart: synthetics-minion-1.0.17
    app.kubernetes.io/name: synthetics-minion
    app.kubernetes.io/instance: synthetics-minion
    app.kubernetes.io/version: "3.0.28"
    app.kubernetes.io/managed-by: Helm

# apply template
kubectl apply -f sa.yaml -n YOUR_NAMESPACE

List out the permissions for your user and the service account:

kubectl auth can-i --list -n YOUR_NAMESPACE
kubectl auth can-i --list --as=system:serviceaccount:default:default -n default
kubectl auth can-i --list --as=system:serviceaccount:YOUR_NAMESPACE:YOUR_SERVICE_ACCOUNT -n YOUR_NAMESPACE

Test what permissions your service account has after creating:

# permissions
kubectl config view

# can your account do everything
kubectl auth can-i "*" "*"

# can your account create jobs
kubectl auth can-i create jobs

# can the service account do everything
kubectl auth can-i "*" "*" --as=system:serviceaccount:YOUR_NAMESPACE:YOUR_SERVICE_ACCOUNT -n YOUR_NAMESPACE

# can the service account create jobs
kubectl auth can-i create jobs --as=system:serviceaccount:YOUR_NAMESPACE:YOUR_SERVICE_ACCOUNT -n YOUR_NAMESPACE

K8s timeout values

In situations where job durations exceed the default timeout of 180 seconds, and failing healthchecks are causing liveness probe failures and pod restarts, there are a couple of environment variables and settings worth noting:

# get statefulset from helm chart
helm template YOUR_CPM YOUR_REPO/synthetics-minion -f values.yaml -n YOUR_NS > template.yaml

# set the minion check timeout to something higher
        - name: synthetics-minion
          image: ""
          imagePullPolicy: IfNotPresent
          env:
            - name: MINION_CHECK_TIMEOUT
              value: "180"

# set livenessProbe timeoutSeconds to something greater than the average job duration, or equal to the minion check timeout
          livenessProbe:
            httpGet:
              path: /status/check
              port: http
            initialDelaySeconds: 600
            periodSeconds: 300
            timeoutSeconds: 60

# apply the new template
kubectl apply -f template.yaml -n YOUR_NS

Root minion access

Should you need to add a package like cURL to the minion pod, you’ll need root permissions: set runAsUser: 0 under securityContext: in the sts. The default user is 2379, which doesn’t allow modification of the Alpine Linux container.
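
In the statefulset’s pod template, that change looks roughly like this (a sketch with the surrounding fields elided):

```yaml
spec:
  template:
    spec:
      containers:
        - name: synthetics-minion
          securityContext:
            runAsUser: 0   # run as root; the default user is 2379
```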

Use typical Alpine commands like:

apk add curl

Newer builds of the CPM, starting with 3.0.31, use Ubuntu 18.04 as the base image with curl already installed.

K8s healthchecks

Health checks run every 5 minutes after an initial delay of 10 minutes. There are many health checks, and each occurrence is recorded in the debug logs. For example, if a request to Horde does not receive a response within 1 minute, the PrivateMinionHordeReachabilityHealthCheck will time out. These requirements are defined by the “livenessProbe” in the sts configuration.

To manually trigger a healthcheck:

curl http://MINION_POD_IP:8080/status/check

For a pretty version:

curl http://MINION_POD_IP:8180/healthcheck?pretty=true

Because the liveness probe triggers the health checks to run, a Horde health check timeout is exactly what would cause the liveness probe to fail, which in turn triggers a pod restart. Our pod restart policy is set to “Always”, so the pod should always restart if this probe fails. For troubleshooting purposes, it is possible to set the restart policy to “Never”. This prevents pods from being terminated even if the liveness probe fails, providing an opportunity to inspect them and determine a cause. However, it can also lead to instability, since K8s would not be able to terminate pods that stop responding, so don’t run in this mode for longer than needed.

If a pod restarts, it should be apparent in the kubectl describe pod YOUR_POD -n YOUR_NS under the pod events section. You should also see a pod restart count with kubectl get pods -n YOUR_NS.

K8s observability

Getting observability into the clusters, nodes, pods, containers, processes, events, alerts, and logs will be huge for maintaining the long-term stability of the K8s CPM.

Kubernetes integration

The Kubernetes Cluster Explorer and associated Kubernetes Dashboard will be a good starting point for identifying issues quickly.

We have an automated installer that will pull in details about the cluster.

It’s also possible to install using Helm.

Once the integration is working, the following queries will be useful for monitoring the health of the CPM in terms of memory utilization on the node, in containers, and by process.

  1. Probably the most important query: it shows all the relevant commands (processes) running for the minion. Note that the minion itself uses java, but the runner containers use node. So if memory is leaking, for example, from one of these processes, we should see it over time with the below NRQL query.
SELECT average(memoryVirtualSizeBytes)+average(memoryResidentSizeBytes) FROM ProcessSample FACET commandName WHERE containerName LIKE '%synthetics-minion%' SINCE 2 weeks ago TIMESERIES AUTO
  2. Here’s a query showing memory utilization for the minion.
SELECT average(memoryUtilization) AS 'Gi of Mem' FROM K8sContainerSample WHERE containerName LIKE '%synthetics-minion%' FACET clusterName SINCE 2 weeks ago TIMESERIES AUTO
  3. This query shows mapped memory for the minion and all runner containers/pods.
SELECT average(containerMemoryMappedFileBytes) FROM K8sContainerSample FACET containerName,podName WHERE containerName LIKE '%synthetics-minion%' SINCE 2 weeks ago TIMESERIES AUTO
  4. This one shows memory utilization on the node over time.
SELECT 100*average(memoryUsedBytes)/average(capacityMemoryBytes) AS 'used mem percent' FROM K8sNodeSample FACET clusterName,nodeName SINCE 2 weeks ago TIMESERIES

Modified to suit your needs, these queries could make for some useful NRQL alerts.

Kubernetes events

The events will be useful for seeing if there are any issues with the operation of the CPM like liveness probe failures or other warning events that could clue you into potential issues.

Kubernetes logs

Adding logs will allow you to filter from the events to a specific set of logs that stream from both the minion pod and runner pods, yielding a combined view from both which is quite useful.

One tip is to remove the last random bit including the last hyphen in the runner pod name when filtering logs. This way log lines will match from both the minion logs as well as the runner pod logs.
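
With shell parameter expansion, trimming the last hyphen-delimited segment looks like this (the pod name below is hypothetical):

```shell
# Hypothetical runner pod name; real names end in a random suffix
pod="synthetics-minion-runner-x7k2p"

# Strip everything after the last hyphen to get a filter string that
# matches both the minion pod and its runner pods in the log UI
filter="${pod%-*}"
echo "$filter"
```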
