Kubernetes integration breaking GKE cluster


I’m setting up the Kubernetes integration in my GKE clusters as described here: https://docs.newrelic.com/docs/integrations/kubernetes-integration/installation/kubernetes-integration-install-configure. I’m using the installer to generate the manifest files.
Immediately after applying the manifest, the cluster can no longer schedule new pods: I can’t deploy anything, and auto-scaling stops working.

kubectl describe deployment my-deployment output:

  Type            Status  Reason
  ----            ------  ------
  Progressing     True    NewReplicaSetCreated
  Available       False   MinimumReplicasUnavailable
  ReplicaFailure  True    FailedCreate

I’m also seeing these errors in the web UI (GKE - Workloads):

  • request did not complete within requested timeout
  • context deadline exceeded

I narrowed it down to the admission webhook (nri-bundle-nri-metadata-injection) timing out. Decreasing the timeout value to 10 seconds works around the issue, but I suspect the webhook still isn’t actually working: it takes exactly 10 seconds for a new pod to appear, so it’s probably still timing out and failurePolicy: Ignore is what saves us.
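For reference, the timeout and failure policy can be read straight from the webhook configuration. This is a sketch assuming the configuration name matches the one from my generated manifests; adjust if yours differs:

```shell
# Show the webhook's timeoutSeconds and failurePolicy
# (configuration name taken from the nri-bundle manifests)
kubectl get mutatingwebhookconfiguration nri-bundle-nri-metadata-injection \
  -o jsonpath='{.webhooks[0].timeoutSeconds}{" "}{.webhooks[0].failurePolicy}{"\n"}'
```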

I tried disabling firewall rules and network policies, which didn’t help. I also tried multiple clusters and got the same results.

How do I check if the webhook is working properly?
Are there some logs worth looking into?


Hi Ivan,

I have located documentation on the admission webhook that explains the Kubernetes and network requirements necessary for it to work. It also includes some troubleshooting steps which may be useful while investigating this issue.

For Kubernetes to speak to our MutatingAdmissionWebhook, the master node (or the API server container, depending on how the cluster is set up) must be allowed egress for HTTPS traffic on port 443 to pods on all of the other nodes in the cluster. This may require specific configuration depending on how the infrastructure is set up (on-premises, AWS, Google Cloud, etc.).
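On Google Cloud, one way to check what egress the master currently has is to list the firewall rules on the cluster's VPC network. This is a sketch with a placeholder network name ("my-network"):

```shell
# List firewall rules on the cluster's VPC network, showing
# which source ranges may reach the nodes and on which ports
gcloud compute firewall-rules list \
  --filter="network=my-network" \
  --format="table(name,sourceRanges.list(),allowed)"
```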

If that isn’t helpful, a good next step would be to generate verbose logs for the Kubernetes integration which should give you a better insight into what’s going on.
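If the integration was installed via the standard manifests, one way to get verbose output from the infrastructure agent is to set its verbose flag on the DaemonSet. The DaemonSet name and namespace below are assumptions; check your own manifests:

```shell
# Turn on verbose logging for the infrastructure agent DaemonSet
# ("newrelic-infra" and the default namespace are placeholders)
kubectl set env daemonset/newrelic-infra NRIA_VERBOSE=1

# Tail the agent logs from the matching pods
kubectl logs -l name=newrelic-infra --tail=100
```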

Hope that is helpful! Let me know if you run into any issues :slight_smile:


Hey @kmcginley.

Thanks for pointing me in the right direction. After inspecting apiserver logs I figured it out.

The issue is with firewall rules and can be easily reproduced in a fresh private cluster, so it’s not specific to my setup.
Here’s what’s happening:

  • The New Relic webhook server listens on port 8443.
  • There’s a ClusterIP service that exposes it on port 443.
  • Even though the MutatingWebhookConfiguration points to the service, the GKE apiserver talks to the pod directly, because the apiserver runs with the --enable-aggregator-routing=true flag. I’m seeing these errors in the apiserver logs: Post https://nri-bundle-nri-metadata-injection.newrelic.svc:443/mutate?timeout=30s: dial tcp i/o timeout. More details: https://github.com/kubernetes/kubernetes/issues/79739
  • The default firewall rule only allows connections from the GKE master on ports 443 and 10250.
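The port mismatch above is easy to confirm by looking at the service and its endpoints (service name and namespace taken from the error message in the apiserver logs):

```shell
# Show the service port -> targetPort mapping (expected: 443 -> 8443)
kubectl get service nri-bundle-nri-metadata-injection -n newrelic \
  -o jsonpath='{.spec.ports[0].port}{" -> "}{.spec.ports[0].targetPort}{"\n"}'

# The endpoints show the pod IP:port the apiserver actually dials
# when aggregator routing is enabled
kubectl get endpoints nri-bundle-nri-metadata-injection -n newrelic
```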

So the solution was adding a firewall rule that allows TCP connections from the master IP range to the nodes on port 8443. Port 443 alone is not enough.
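For anyone hitting the same thing, here’s a sketch of the fix with gcloud. The cluster name, zone, network, target tag, and CIDR are all placeholders for your own values:

```shell
# Look up the master's IP range for a private cluster
gcloud container clusters describe my-cluster --zone us-central1-a \
  --format="value(privateClusterConfig.masterIpv4CidrBlock)"

# Allow the master to reach the webhook pods on port 8443
gcloud compute firewall-rules create allow-master-to-webhook \
  --network my-network \
  --source-ranges 172.16.0.0/28 \
  --target-tags my-node-tag \
  --allow tcp:8443
```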

It’s probably worth adding this to the GKE section of the docs: it affects every private cluster and it’s pretty hard to debug.


@ivan_bsites Thank you so much for sharing such a detailed solution with the community! I will pass your feedback about a GKE section on to the documentation team. :slightly_smiling_face: