INTERNAL ENGINE ERROR - code: 39

Hi, I am getting this error:
│ synthetics-minion 2022-05-17 13:10:51,008 - Failed Job 2026306/f6e3929b-ef71-48d6-ba17-2b85f76ae008/b1441cad-4a4c-44a3-b8ed-4806ecdd4464 - INTERNAL ENGINE ERROR - code: 39

Any ideas on how to fix it?

I was reading that it could be related to resources. However, the nodes (the minion is on an OpenShift cluster) have enough resources available.
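(For reference, I'm judging available resources from the node allocation summary, e.g. with a check along these lines, where <node-name> is a placeholder for one of my nodes.)

oc describe node <node-name> | grep -A 10 "Allocated resources"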

Hey there @marlon940721,

Welcome to the community!

You are correct that code: 39 is related to resources. You can find more on these error codes here: Containerized Private Minion (CPM) Troubleshooting Framework Internal Error Codes. However, I am not sure why you are receiving this error if you do have enough resources available. I am looping in one of our engineers to look this over and provide further insight.

Please let us know if you need anything else in the meantime; we will be more than happy to help. I hope you have a great day!

Hi @michaelfrederick,

It should be resources. However, the minion was working fine on the OpenShift cluster at versions 4.6 and 4.7. After it was upgraded to version 4.8, it started to display that error and all jobs are failing.

Hey @marlon940721,

Thanks for your post here about the IEE code: 39 you are seeing with OpenShift 4.8.

You’re right that the typical cause of code 39 is an insufficient memory or CPU request on the node where the runner or healthcheck pod is trying to be scheduled. However, OpenShift presents some challenges in the form of a securityContext applied to the runner or healthcheck pod spec (YAML) which blocks our node commands from executing on the runner and healthcheck pods.

In debug logs from the minion, you may see the following lines:

Here we try to run node --version.

/opt/runtimes/run.sh: line 43: /opt/deps/nodejs/node-v10.15.0-linux-x64/bin/node: Operation not permitted

We then log the result of running the script with node main.js ${EXTRA_ARGS} > ${NR_SCRIPT_LOG_FILEPATH} 2>&1, saving the output to the script log file path at /opt/shared/script.js, which is mapped to /tmp on the PV.

/opt/runtimes/4.0.0/main.sh: line 81: /opt/deps/nodejs/node-v10.15.0-linux-x64/bin/node: Operation not permitted
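For context, the kind of injected securityContext that produces this looks roughly like the fragment below on the runner or healthcheck pod. This is illustrative only, not the exact spec OpenShift applies:

securityContext:
  capabilities:
    drop:
    - ALL                 # dropping all capabilities blocks the node binary
  runAsUser: 1000650000   # example UID assigned by the cluster policy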

Replication Test:

This scenario can be replicated by changing the capabilities in the PSP to drop ALL, which will cause the node command to fail with Operation not permitted.

$ oc run -it --rm --restart=Never runner-pod-test --image=quay.io/newrelic/synthetics-minion-runner:3.0.62 -- /bin/bash
If you don't see a command prompt, try pressing enter.
_apt@runner-pod-test:/opt/user$ cd /opt/deps/nodejs/node-v10.15.0-linux-x64/bin
_apt@runner-pod-test:/opt/deps/nodejs/node-v10.15.0-linux-x64/bin$ ./node --version
bash: ./node: Operation not permitted

Possible fix:

Edit your PSP, if you have one:

oc edit psp privileged

Change this:

requiredDropCapabilities:
- ALL

To this:

allowedCapabilities:
- '*'
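In context, those keys sit under the PSP's spec. An illustrative, trimmed view of the edited object, with field placement taken from the standard policy/v1beta1 schema:

apiVersion: policy/v1beta1
kind: PodSecurityPolicy
metadata:
  name: privileged
spec:
  allowedCapabilities:
  - '*'
  # requiredDropCapabilities (previously "- ALL") removed

After saving, re-running the replication test above should show ./node --version succeeding instead of Operation not permitted.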

Additional details:

Since you are using OpenShift, the securityContext may be getting applied not by a PSP but by an OpenShift policy. To confirm the offending security policy, it would be useful to collect the YAML output from a runner pod or healthcheck pod. Rerun the command below until you can capture the output from an ephemeral healthcheck or runner pod.

kubectl get po -n newrelic -l synthetics-minion-deployable-selector -o yaml

You will likely discover some securityContext rules, and that the runner or healthcheck pod is using the default service account. This is a known bug in the runner pod spec that causes it not to inherit the service account created for the minion.
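As an illustration only (names and values will differ on your cluster), the relevant parts of that output might look like:

spec:
  serviceAccountName: default          # not the service account created for the minion
  containers:
  - name: synthetics-minion-runner     # container name is an example
    securityContext:
      capabilities:
        drop:
        - ALL
      runAsUser: 1000650000            # example UID injected by cluster policy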

If you applied the anyuid policy to the service account created for the minion, it likely didn’t get applied to the default service account. There could be UID or GID restrictions preventing the pod from properly accessing the /tmp mount.

You could try applying the anyuid policy to the default sa, but I'm not certain whether the default sa can be modified.
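If you do want to try that route, granting an SCC to a service account generally looks like this (namespace taken from the kubectl command above):

oc adm policy add-scc-to-user anyuid -z default -n newrelic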

Alternatively, you could attempt to set the UID and GID for the runner and healthcheck pods by modifying our Helm chart in certain places.
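As a rough sketch only, assuming the chart exposes a pod-level securityContext override (the value paths here are hypothetical; check the chart's actual values.yaml):

# values.yaml fragment -- key names are hypothetical
podSecurityContext:
  runAsUser: 1000       # example UID permitted by your SCC
  runAsGroup: 3729      # 3729 is the GID discussed in the next paragraph
  fsGroup: 3729         # so the /tmp PV mount is group-accessible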

One potential issue with the above solution is that we hard-code the GID of 3729 into the runner pod spec, which is not user-configurable. This is the group that runner pods and healthchecks will try to use when accessing the /tmp mount. Our engineering team is aware of the issue and hopefully a fix will be released soon.

For OpenShift 4.8:

Kubernetes 1.21 will require CPM >= 3.0.61 due to the BoundServiceAccountTokenVolume feature gate, which creates a "projected" volume rather than a secret-based volume.
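For comparison, the two token-volume styles look like this in a pod spec. The projected form is standard Kubernetes behavior behind that feature gate, shown here only to illustrate the difference:

# Secret-based service account token volume (the older style):
volumes:
- name: default-token-abcde        # example name
  secret:
    secretName: default-token-abcde

# Projected volume created under BoundServiceAccountTokenVolume:
volumes:
- name: kube-api-access-abcde      # example name
  projected:
    sources:
    - serviceAccountToken:
        expirationSeconds: 3607
        path: token
    # (plus configMap ca.crt and downwardAPI namespace sources)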

OpenShift uses the CRI-O container runtime, which seems to work with the CPM there. However, I've run into issues when attempting to run the CPM in minikube with CRI-O. It's possible the OpenShift implementation is somehow more backwards-compatible with Dockershim.