OpenTelemetry Troubleshooting Framework - Troubleshooting

In this doc, learn how to start troubleshooting your OpenTelemetry service. Some commonly encountered errors are discussed, along with how to resolve them.

Missing Data

If you are not seeing your OpenTelemetry service in NR1, first check the following before investigating further:

  • The OTLP endpoint that you configured must:
    • match one of our documented endpoints, be properly formatted, and include the default port for OTLP/gRPC, 4317, or the default port for OTLP/HTTP, 4318. Port 443 is also supported for either transport. Please note the specific endpoint for FedRAMP compliance, if applicable.
    • be region-specific. For example, if you are based in Europe and configure your exporter to send data to our US endpoint, data will fail to be exported.
    • include the appropriate signal path if a signal-specific environment variable is being used. For example, OTEL_EXPORTER_OTLP_TRACES_ENDPOINT=https://otlp.nr-data.net:4318/v1/traces.
  • The outbound traffic is not restricted by a firewall. Our Networks document explains domains and network blocks that you may need to explicitly allow.
  • The client is configured to use TLS 1.2 or higher and the request includes the api-key header with a valid New Relic account (ingest) license key.
  • Requests include valid protobuf payloads and use gRPC or HTTP transport, preferably with gzip compression enabled. Request payloads should be no larger than 1MB (10^6 bytes). Sending JSON-encoded payloads is not supported at this time.
  • Client output and logs do not indicate 4xx or 5xx response codes are being returned.
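The endpoint checks in the list above can be sketched as a small validation helper. This is a minimal illustration, not an official tool: the function name check_otlp_endpoint is hypothetical, a missing port is treated as the implicit HTTPS port 443, and the /v1/metrics and /v1/logs paths are assumed from the standard OTLP/HTTP paths (only /v1/traces appears in the example above). It does not check that the host matches a documented or region-appropriate endpoint.

```python
from urllib.parse import urlparse

# Ports named in the checklist; 443 is supported for either transport.
ALLOWED_PORTS = {443, 4317, 4318}
# Signal paths expected when a signal-specific endpoint variable is used.
SIGNAL_PATHS = {"traces": "/v1/traces", "metrics": "/v1/metrics", "logs": "/v1/logs"}


def check_otlp_endpoint(url, signal=None):
    """Return a list of problems found in an OTLP endpoint URL (empty if none)."""
    problems = []
    parsed = urlparse(url)
    if parsed.scheme != "https":
        problems.append("scheme should be https (TLS 1.2 or higher is required)")
    if not parsed.hostname:
        problems.append("missing host")
    try:
        port = parsed.port  # raises ValueError for a malformed port
    except ValueError:
        problems.append("port is not a valid number")
    else:
        # A missing port is assumed to mean the implicit HTTPS port, 443.
        if port is not None and port not in ALLOWED_PORTS:
            problems.append(f"unexpected port {port}; expected 443, 4317, or 4318")
    if signal is not None and not parsed.path.endswith(SIGNAL_PATHS[signal]):
        problems.append(f"path should end with {SIGNAL_PATHS[signal]}")
    return problems


print(check_otlp_endpoint("https://otlp.nr-data.net:4318/v1/traces", signal="traces"))  # []
```

Passing signal="traces" mirrors the case where OTEL_EXPORTER_OTLP_TRACES_ENDPOINT is set; omit it when checking a general endpoint.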

If you have confirmed the above and are still not seeing your OpenTelemetry service in NR1, enable logging in your SDK or Collector to find out what is going on.

If you see your OpenTelemetry service in NR1, but are not seeing data or believe you are missing data, a good next step is to query for NrIntegrationError events, or NRIEs (see below).

Commonly encountered NrIntegrationError events

When our ingestion pipelines have a problem processing your data, New Relic creates NrIntegrationError (NRIE) events in your account to let you know. To see whether any NRIEs are being reported to your account, and how many of each, use the query below (expand or narrow the time window as appropriate):

SELECT count(*) FROM NrIntegrationError FACET message SINCE 3 days ago LIMIT MAX

One or more OTLP metric data point(s) were dropped due to unsupported AggregationTemporality.

New Relic only supports delta aggregation temporality for metrics at the moment. The above NRIE indicates that your metrics SDK is still using the default cumulative temporality instead of delta. You can configure delta temporality programmatically, or via the OTEL_EXPORTER_OTLP_METRICS_TEMPORALITY_PREFERENCE environment variable (set to delta) as described here.
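To make the difference between the two temporalities concrete: a cumulative data point reports the running total since the instrument started, while a delta data point reports only the change since the previous export. The sketch below is illustration only. The function name cumulative_to_delta is hypothetical, and it assumes a monotonically increasing counter that starts at zero and never resets. In practice the SDK handles this for you once delta temporality is configured.

```python
def cumulative_to_delta(cumulative_values):
    """Convert successive cumulative readings into per-interval deltas."""
    deltas = []
    previous = 0  # assumption: the counter starts at zero and never resets
    for value in cumulative_values:
        deltas.append(value - previous)
        previous = value
    return deltas


# Four successive exports of a request counter.
print(cumulative_to_delta([5, 12, 12, 20]))  # [5, 7, 0, 8]
```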

One or more OTLP data point(s) was dropped because the length of an attribute value was over the limit.

Your data is being rejected because one or more of your attributes exceed New Relic’s limits on attribute lengths (see: metric attribute limits and event attribute limits):

  • Length of attribute name: 255 characters maximum
  • Length of attribute value: 4096 characters maximum

To resolve this:

  • If a span attribute is the offender, you can use span limits environment variables* to configure the maximum length(s).
  • If a resource attribute is the offender, you can set OTEL_RESOURCE_ATTRIBUTES=<offending-attribute>=unset to override it.
  • Determining whether the offending attribute is a span attribute or a resource attribute can be surprisingly difficult. To narrow it down, we recommend exporting the span to a local Jaeger instance. An OpenTelemetry resource is equivalent to a Jaeger process, so any tags attached to a Jaeger process are resource attributes, while any tags in a Jaeger span are span attributes. Checking the Jaeger process tags therefore identifies the resource attributes.

*Not all language SDKs support these. If yours does not, we recommend that you open a GitHub issue in the respective language SDK repo so that the OpenTelemetry community can resolve it.
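If you can get the attributes into a dictionary (for example, from a debug or console exporter), a quick scan against the limits above can point to the offender. This is a minimal sketch: the function name find_oversized_attributes is hypothetical, and it assumes that only string values count toward the value length limit.

```python
MAX_NAME_LENGTH = 255    # attribute name limit from the list above
MAX_VALUE_LENGTH = 4096  # attribute value limit from the list above


def find_oversized_attributes(attributes):
    """Return {attribute name: reason} for attributes that exceed the limits."""
    offenders = {}
    for name, value in attributes.items():
        if len(name) > MAX_NAME_LENGTH:
            offenders[name] = f"name is {len(name)} characters"
        elif isinstance(value, str) and len(value) > MAX_VALUE_LENGTH:
            offenders[name] = f"value is {len(value)} characters"
    return offenders


attributes = {"service.name": "checkout", "http.url": "x" * 5000}
print(find_oversized_attributes(attributes))  # {'http.url': 'value is 5000 characters'}
```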

Common application log errors

413 errors

If you see errors such as "unexpected HTTP status code received from server: 413 (Request Entity Too Large)", your request exceeds the maximum allowed payload size of 1MB (10^6 bytes). We recommend doing the following to resolve this error:

  • Enable gzip compression explicitly for all OTLP exporters
  • Configure a batch processor, either directly in your SDK or in your Collector config

To maximize the amount of data you can send per request, we recommend enabling compression in every OTLP exporter you are using. It is best to set compression explicitly to gzip. By default, the OpenTelemetry SDKs and Collector send one (1) data point per request, and with these defaults it is likely your account will be rate limited. All OpenTelemetry SDKs and the Collector provide a BatchProcessor, which batches data points in memory so that each request can carry more than one (1) data point.

For the batch processor config, we recommend setting send_batch_size to 1000, as this is roughly equivalent to what our APM agents do. A batch size of 1000 is not guaranteed to stay under our 1MB ingest limit, but it makes it far more likely. For example, a single span can vary greatly in size depending on the number of attributes and their lengths, so 1000 very large spans may still exceed the 1MB limit. You may also be interested in these internal metrics emitted by the Collector.

If you decide to also use the batch processor setting send_batch_max_size, which controls whether larger batches get split up, we recommend setting that to 1000 as well.
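The effect of these two recommendations (cap the batch size, compress the payload) can be sketched outside the Collector as follows. This is illustrative only and is not how the batch processor is implemented. The helper names make_batches and fits_payload_limit are hypothetical, and the sketch assumes the 1MB limit applies to the compressed bytes sent on the wire.

```python
import gzip

MAX_PAYLOAD_BYTES = 1_000_000  # 1MB (10^6 bytes) limit from the text above


def make_batches(items, send_batch_max_size=1000):
    """Split items into consecutive batches of at most send_batch_max_size."""
    return [
        items[start:start + send_batch_max_size]
        for start in range(0, len(items), send_batch_max_size)
    ]


def fits_payload_limit(payload, compress=True):
    """Check a serialized payload (bytes) against the 1MB limit."""
    size = len(gzip.compress(payload)) if compress else len(payload)
    return size <= MAX_PAYLOAD_BYTES


batches = make_batches(list(range(2500)))
print([len(batch) for batch in batches])  # [1000, 1000, 500]
```

A payload that fails the check even after compression is a sign that the batch size should be lowered or that individual items carry very large attributes.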

429 errors

New Relic places limits on the amount of data customers can send, query, and store, both to keep our systems available and to protect them from unintended use. Currently, no NRIEs are created for span rate limiting, so the main clue is the presence of 429 (Too Many Requests) errors in your application logs. We place a limit on the number of ingested requests per minute (RPM) per data type. When this limit is reached, we stop accepting data and return a 429 status code for the remainder of that minute.
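Because the 429 responses last only until the end of the current minute, a client that sends its own requests can wait and retry rather than drop the data. The sketch below shows one way to do that with exponential backoff. It is an assumption-laden illustration: send is a hypothetical callable that returns an HTTP status code, the delay values are arbitrary, and OpenTelemetry exporters typically ship their own retry logic, which you should prefer where available.

```python
import time


def send_with_retry(send, payload, max_attempts=4, base_delay=1.0, sleep=time.sleep):
    """Call send(payload) until it returns a non-429 status or attempts run out."""
    delay = base_delay
    status = None
    for attempt in range(max_attempts):
        status = send(payload)
        if status != 429:
            return status
        if attempt < max_attempts - 1:
            sleep(delay)  # wait before the next attempt
            delay *= 2    # double the wait each time
    return status  # still 429 after the final attempt
```

Since the limit resets each minute, a real client would likely let the total wait time reach at least 60 seconds before giving up. Reducing the request rate with batching remains the more durable fix.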

PERMISSION_DENIED or gRPC status code 7

If you see a Failed to export spans error with PERMISSION_DENIED, code 7, or both, this could indicate an issue with your OTLP endpoint. Recheck the endpoint requirements listed under Missing Data to confirm your OTLP endpoint is correctly configured before proceeding.
