Infinite Tracing Troubleshooting Framework

Infinite Tracing

Configuring Infinite Tracing

  1. Confirm that your agent version meets the minimum requirements

  2. Set up the trace observer. Only one trace observer per region per account hierarchy is allowed (a config sketch for pointing an agent at it follows this list).

    1. Trace observer region - A single trace observer is sufficient for data ingestion, but you may choose to create one trace observer per region to keep data egress costs minimal.
    2. If you are in the EU, you must use US trace observers.
  3. Check proxies

    • Proxy configuration is only required if a proxy sits between your service and the trace observer. Infinite Tracing streams spans directly to the trace observer, not to the collector that receives all other data reported by New Relic agents.
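
For reference, the snippet below is a minimal sketch of pointing an agent at a trace observer. It assumes the Java agent's newrelic.yml keys; other agents expose equivalent settings and environment variables, so check your agent's documentation for the exact names. YOUR_TRACE_OBSERVER_HOST is a placeholder.

    # Sketch of newrelic.yml settings (assumed Java agent keys; verify against your agent's docs)
    common: &default_settings
      distributed_tracing:
        enabled: true                        # Infinite Tracing builds on distributed tracing
      infinite_tracing:
        trace_observer:
          host: YOUR_TRACE_OBSERVER_HOST     # placeholder: the host from your trace observer setup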

Errors Installing or Enabling

  1. Unable to load application with an error like: LoadError: Error loading shared library ld-linux-x86-64.so

  2. Enabling Infinite Tracing when running the agent in an Alpine Linux Docker container causes the following exception (see the Dockerfile sketch after this list): SpanStreamingService: Error creating gRPC channel to endpoint ... . (attempt 0) - Exception: NewRelic.Agent.Core.DataTransport.GrpcWrapperException: Unable to create new gRPC Channel ---> System.IO.IOException: Error loading native library "/app/newrelic/libgrpc_csharp_ext.x64.so
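
Both errors above indicate that the agent's native gRPC library expects glibc, which Alpine's musl libc does not provide. The Dockerfile lines below are a hedged sketch of one common mitigation, installing a glibc compatibility layer; whether this is sufficient depends on the agent and its version, so treat it as a starting point rather than the supported fix.

    # Sketch: add a glibc compatibility layer on an Alpine base image
    # (package names are Alpine's; effectiveness varies by agent and agent version)
    FROM alpine:3.18
    RUN apk add --no-cache libc6-compat gcompat
    # ... install your application and the New Relic agent as usual ...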

Check Configuration

FedRAMP currently prevents distributed trace reporting, so confirm the agent configuration does not include the FedRAMP endpoint gov-collector.newrelic.com (a quick search sketch follows).
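
A quick way to check is to search the agent configuration files for the FedRAMP endpoint; the one-liner below is a sketch, with the config path as a placeholder.

    # Sketch: replace /path/to/agent/config with your agent's configuration directory
    grep -R "gov-collector.newrelic.com" /path/to/agent/config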

  1. Confirm you have completed all required steps

    1. Confirm the Trace Observer is set up
    2. Check your agent configuration on the ‘Environment’ page in the APM UI
  2. Confirm the agent version is correct and meets the minimum requirements

  3. (Owners, Admins, and Add-on Manager users only) Go to the Agent Initialization page and search for the distributed tracing configuration setting to confirm it is enabled and that the trace observer endpoints are correct

PHP

Requires the additional configuration setting newrelic.span_events_enabled = true
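
A minimal newrelic.ini sketch is shown below. The span_events setting is the one required above; the other key names are assumptions about the PHP agent's INI settings and should be verified against the PHP agent documentation.

    ; Sketch of newrelic.ini settings (keys other than newrelic.span_events_enabled are assumed)
    newrelic.distributed_tracing_enabled = true
    newrelic.span_events_enabled = true
    newrelic.infinite_tracing.trace_observer.host = "YOUR_TRACE_OBSERVER_HOST"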

Ruby

Requires the additional gem newrelic-infinite_tracing (https://rubygems.org/gems/newrelic-infinite_tracing)
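
For reference, a Gemfile sketch; newrelic_rpm is the standard Ruby agent gem, and the second gem is the one linked above.

    # Gemfile sketch: the Infinite Tracing gem sits alongside the Ruby agent gem
    gem 'newrelic_rpm'
    gem 'newrelic-infinite_tracing'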

Test the Trace Observer

  1. Send a Sample Request to the Trace Observer (a curl sketch follows this list)

  2. After sending, navigate to the ‘Traces’ UI page and search: Find traces where spans contain: trace.id = 123456
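
If you need a concrete starting point, the curl call below is a sketch of a sample request. It assumes the trace observer's /trace/v1 endpoint and the header names used in New Relic's test examples; YOUR_TRACE_OBSERVER_HOST and YOUR_LICENSE_KEY are placeholders, and the exact payload shape should be checked against the current docs.

    # Sketch: send one test span with trace.id 123456 to the trace observer
    curl -i -X POST "https://YOUR_TRACE_OBSERVER_HOST:443/trace/v1" \
      -H "Content-Type: application/json" \
      -H "Api-Key: YOUR_LICENSE_KEY" \
      -H "Data-Format: newrelic" \
      -H "Data-Format-Version: 1" \
      -d '[{
            "common": { "attributes": { "service.name": "Test Service A", "host": "host123.example.com" } },
            "spans": [ { "trace.id": "123456", "id": "ABC", "attributes": { "duration.ms": 12.53, "name": "/home" } } ]
          }]'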

If there are no results:

  1. Check that the time window set in the upper right corner of the UI includes the time you sent the sample trace
  2. Confirm your API key is correct
  3. Confirm the trace observer host endpoint is correct
  4. Query your account for any NRIntegrationError

SELECT * FROM NRIntegrationError SINCE <general time range of when you sent the sample trace> LIMIT MAX

  5. Check for any network issues, proxies, or firewalls that may have prevented the sample trace from sending

Check whether any traces are reaching the trace observer

  1. Enable [trace observer monitoring](https://docs.newrelic.com/docs/understand-dependencies/distributed-tracing/infinite-tracing/infinite-tracing-configure-trace-observer-monitoring)

WARNING: If you enable this feature, you’ll see a small additional monthly charge.

  2. Query your trace observer data

FROM Metric SELECT sum(monitoring.trace.opened.session.count) AS 'Traces seen' WHERE account = <account id> TIMESERIES SINCE <general time range of when you sent the sample trace>

Missing Traces

  1. Check whether there are any traces which should have been sampled by one of the sampling algorithms (a span-count query sketch follows this list)

  2. Check the throughput of your application. The Duration sampler only keeps duration outliers, and the Random sampler defaults to sampling only 1 out of every 100 traces, so a very low-throughput service may see only sporadic data sampled by these filters

  3. When testing, or during times you want as much data as possible (such as a deployment), it may be useful to set the Random sampler to 100%, which results in all traces being kept

  4. Check your attribute filters to ensure they are not dropping traces you want to keep and are correctly capturing the traces you do want.

  5. Confirm the time window. The default Data Retention period for Trace data is 8 days.
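
One way to check whether any sampled traces arrived (point 1 above) is to query the Span data that did arrive. The NRQL below is a sketch that counts distinct trace IDs and spans kept for a given entity; the entity GUID and time range are placeholders.

    FROM Span
    SELECT uniqueCount(trace.id) AS 'Traces kept', count(*) AS 'Spans kept'
    WHERE entity.guid = '<add entity>'
    TIMESERIES SINCE <time window>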

Missing Spans and Fragmented Traces

Fragmented Traces - Are traces consistently fragmented at the same point?

  1. Trace configuration conflicts: Confirm all services have Infinite Tracing enabled

  2. Missing traces for monitored browser apps and mobile apps. Because Infinite Tracing isn’t yet available for browser or mobile monitoring, spans from these services won’t show up in the trace waterfall when they make requests to Infinite Tracing-enabled services.

  3. Very long traces may be truncated; currently the UI only shows 10,000 spans

  4. Is the APM Agent correctly reporting and sending spans?

  • Query Supportability Metrics

    SELECT filter(count(newrelic.timeslice.value), WHERE metricTimesliceName = 'Supportability/InfiniteTracing/Span/Seen') as 'Spans Seen',
           filter(count(newrelic.timeslice.value), WHERE metricTimesliceName = 'Supportability/InfiniteTracing/Span/Sent') as 'Spans Sent',
           filter(count(newrelic.timeslice.value), WHERE metricTimesliceName = 'Supportability/InfiniteTracing/Span/Dropped') as 'Spans Dropped',
           filter(max(newrelic.timeslice.value), WHERE metricTimesliceName = 'Supportability/InfiniteTracing/Span/QueueCapacity') as 'Queue Capacity',
           filter(max(newrelic.timeslice.value), WHERE metricTimesliceName = 'Supportability/InfiniteTracing/Span/QueueSize') as 'Queue Size',
           filter(count(newrelic.timeslice.value), WHERE metricTimesliceName = 'Supportability/InfiniteTracing/Span/Response/Error') as 'Error Count'
    FROM Metric
    WHERE entity.guid = '<add entity>'
    TIMESERIES LIMIT MAX SINCE <time window>

    1. Spans Seen should be very close to the value of Spans Sent; this indicates the agent is recording spans and attempting to send them. Some discrepancy is expected due to the different harvest cycles of metrics and spans, and to temporary network issues causing dropped or delayed spans (a percentage-sent query sketch follows this list).
    2. Queue Size reports how many spans are in the queue waiting to be sent to the trace observer. If Queue Size equals Queue Capacity, spans are not being sent as fast as they are being created, which can indicate a network error or slowdown, or a gRPC error. Check for any ‘Error Count’ metrics and check the agent logs for gRPC errors. Queue Size and Queue Capacity holding the same value for extended periods is often accompanied by a spike in Spans Dropped.
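
To make the comparison in point 1 easier to read, the sketch below divides Spans Sent by Spans Seen using the same supportability metrics as the query above; the entity GUID and time range are placeholders.

    FROM Metric
    SELECT filter(count(newrelic.timeslice.value), WHERE metricTimesliceName = 'Supportability/InfiniteTracing/Span/Sent')
           / filter(count(newrelic.timeslice.value), WHERE metricTimesliceName = 'Supportability/InfiniteTracing/Span/Seen') * 100 as 'Percent of spans sent'
    WHERE entity.guid = '<add entity>'
    TIMESERIES SINCE <time window>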

Slow Network

  1. If you suspect a slow network, one prone to network issues, or that your service generates spans faster than the agent can send them during high-traffic times (exceeding the queue capacity), it may be worthwhile to increase the queue size to prevent spans from dropping (see the config sketch after this list). Note that this increases the memory overhead needed to store the queued spans.

  2. For gRPC errors, check whether the status code is expected: https://developers.google.com/maps-booking/reference/grpc-api/status_codes

  3. An error like “Received RST_STREAM with error code 0” is expected when reconnecting, either at startup or after a period of being idle
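
The snippet below is a sketch of raising the queue size, assuming the Java agent's newrelic.yml keys; the setting name, location, and default differ by agent, so confirm against your agent's configuration documentation before relying on it.

    # Sketch (assumed Java agent keys; names and defaults vary by agent)
    infinite_tracing:
      trace_observer:
        host: YOUR_TRACE_OBSERVER_HOST
      span_events:
        queue_size: 50000    # a larger queue trades memory for fewer dropped spans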

    Node.js Agent only

SELECT filter(count(newrelic.timeslice.value), WHERE metricTimesliceName = 'Supportability/InfiniteTracing/Span/Seen') as 'Spans Seen',
       filter(count(newrelic.timeslice.value), WHERE metricTimesliceName = 'Supportability/InfiniteTracing/Span/Sent') as 'Spans Sent',
       filter(count(newrelic.timeslice.value), WHERE metricTimesliceName = 'Supportability/InfiniteTracing/Span/Dropped') as 'Spans Dropped',
       filter(max(newrelic.timeslice.value), WHERE metricTimesliceName = 'Supportability/InfiniteTracing/Span/QueueCapacity') as 'Queue Capacity',
       filter(max(newrelic.timeslice.value), WHERE metricTimesliceName = 'Supportability/InfiniteTracing/Span/QueueSize') as 'Queue Size',
       filter(average(newrelic.timeslice.value), WHERE metricTimesliceName = 'Supportability/InfiniteTracing/Drain/Duration') as 'Drain Duration',
       filter(count(newrelic.timeslice.value), WHERE metricTimesliceName = 'Supportability/InfiniteTracing/Drain/Duration') as 'Drain Count',
       filter(count(newrelic.timeslice.value), WHERE metricTimesliceName = 'Supportability/InfiniteTracing/Span/Response/Error') as 'Error Count'
FROM Metric
WHERE entity.guid = '<add entity>'
TIMESERIES LIMIT MAX SINCE <time window>

The Node.js agent has additional drain metrics that record how long the queue takes to empty; a long drain duration can indicate a network issue or a gRPC error.