Hey @marlon940721,
Thanks for your post about the IEE code 39 you are seeing with OpenShift 4.8.

You're right that the typical cause of code 39 is an insufficient memory or CPU request on the node where the runner or healthcheck pod is trying to be scheduled. However, OpenShift presents an additional challenge: a securityContext applied to the runner or healthcheck pod spec (YAML) can block our `node` commands from executing on the runner and healthcheck pods.
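For context, a restrictive securityContext of the kind OpenShift's restricted policies can inject is enough to break the bundled `node` binary. This is a hypothetical illustration, not pulled from your cluster:

```yaml
# Hypothetical container securityContext of the kind a restrictive
# policy can inject; dropping all capabilities and forcing an arbitrary
# UID can prevent the bundled node binary from executing.
securityContext:
  runAsNonRoot: true
  allowPrivilegeEscalation: false
  capabilities:
    drop:
      - ALL
```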
In debug logs from the minion, you may see the following lines. Here we try to run `node --version`:

```
/opt/runtimes/run.sh: line 43: /opt/deps/nodejs/node-v10.15.0-linux-x64/bin/node: Operation not permitted
```
We then log the results of trying to run the script with `node main.js ${EXTRA_ARGS} > ${NR_SCRIPT_LOG_FILEPATH} 2>&1`, and save the output to the script log file path at `/opt/shared/script.js`, which is mapped to `/tmp` on the PV:

```
/opt/runtimes/4.0.0/main.sh: line 81: /opt/deps/nodejs/node-v10.15.0-linux-x64/bin/node: Operation not permitted
```
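As an aside, the `> ${NR_SCRIPT_LOG_FILEPATH} 2>&1` redirect means the exec error above lands in the script log file rather than on the console. A minimal sketch of that pattern, using stand-in names rather than the actual runner script:

```shell
# Stand-in for ${NR_SCRIPT_LOG_FILEPATH}; the real path comes from the runner.
LOGFILE=$(mktemp)

# Both stdout and stderr are redirected into the log file, so a failed
# exec ("Operation not permitted") would be captured there too.
sh -c 'echo "script output"; echo "simulated exec error" >&2' > "$LOGFILE" 2>&1

cat "$LOGFILE"
```

This is why the error only shows up once you pull the script log off the `/tmp` mount.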
Replication Test:

This scenario can be replicated by changing the capabilities in the PSP to drop ALL, which will cause the `node` command to fail with `Operation not permitted`.
```
$ oc run -it --rm --restart=Never runner-pod-test --image=quay.io/newrelic/synthetics-minion-runner:3.0.62 -- /bin/bash
If you don't see a command prompt, try pressing enter.
_apt@runner-pod-test:/opt/user$ cd /opt/deps/nodejs/node-v10.15.0-linux-x64/bin
_apt@runner-pod-test:/opt/deps/nodejs/node-v10.15.0-linux-x64/bin$ ./node --version
bash: ./node: Operation not permitted
```
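If you want to confirm from inside a pod that capabilities really were dropped, you can inspect the process capability mask in `/proc`. This works in any Linux shell with no extra tooling assumed:

```shell
# CapEff is the effective capability bitmask for the current process.
# An all-zero mask (0000000000000000) means every capability was
# dropped, which is what a "drop: ALL" policy produces.
grep CapEff /proc/self/status
```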
Possible fix:

Edit your PSP, if you have one:

```
oc edit psp privileged
```

Change this:

```yaml
requiredDropCapabilities:
  - ALL
```

To this:

```yaml
allowedCapabilities:
  - '*'
```
Additional details:
Since you are using OpenShift, the securityContext may be getting applied not by a PSP but by an OpenShift policy. To identify the offending security policy, it would be useful to collect the YAML output from a runner pod or healthcheck pod. Rerun the below command until you can capture the output from an ephemeral healthcheck or runner pod.

```
kubectl get po -n newrelic -l synthetics-minion-deployable-selector -o yaml
```
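To narrow down that potentially long output, you could pipe it through a filter for the relevant fields. This assumes standard `kubectl` and `grep`; adjust the context lines as needed:

```shell
kubectl get po -n newrelic -l synthetics-minion-deployable-selector -o yaml \
  | grep -E -A 3 'securityContext|serviceAccountName'
```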
You will likely discover some securityContext rules, and that the runner or healthcheck pod is using the `default` service account. This is a known bug in the runner pod spec: it does not inherit the service account that was created for the minion.

If you applied the anyuid policy to the service account created for the minion, it likely didn't get applied to the `default` service account. There could be UID or GID restrictions preventing the pod from properly accessing the `/tmp` mount.
You could try applying the anyuid policy to the `default` service account, though I'm not certain the `default` service account can be modified.
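If you do want to try it, the standard way to grant an SCC to a service account on OpenShift is `oc adm policy` (shown here for the `default` service account in the `newrelic` namespace; adjust the namespace to match your install):

```shell
oc adm policy add-scc-to-user anyuid -z default -n newrelic
```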
Alternatively, you could attempt to set the UID and GID for the runner and healthcheck pods by modifying our Helm chart in certain places.
One potential issue with the above solution is that we hard-code a GID of 3729 into the runner pod spec, which is not user-configurable. This is the group that runner and healthcheck pods will try to use when accessing the `/tmp` mount. Our engineering team is aware of the issue, and hopefully a fix will be released soon.
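As a hypothetical stopgap until that fix lands, you could try aligning the pod's volume group with the hard-coded GID via `fsGroup`, so the `/tmp` PV mount is group-accessible. The key is standard Kubernetes; whether your policy and our chart let you set it is an assumption:

```yaml
# Illustrative pod-level securityContext; fsGroup makes mounted volumes
# group-owned by 3729, matching the GID hard-coded in the runner pod spec.
securityContext:
  fsGroup: 3729
```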
For OpenShift 4.8:
Kubernetes 1.21 will require CPM >= 3.0.61 due to the BoundServiceAccountTokenVolume feature gate, which creates a "projected" volume rather than a secret-based volume.

OpenShift uses the CRI-O container runtime, which seems to work with the CPM on OpenShift. However, I've run into issues when attempting to run the CPM in minikube with CRI-O. It's possible the OpenShift implementation is more backwards-compatible with Dockershim somehow.