Distributed Tracing Troubleshooting Framework

Getting Started

Enabling Distributed Tracing for APM Agents

  1. Confirm that your agent meets the minimum requirements

  2. Configure the agent

  3. Restart your application

  4. Check the configuration on the Environment page

    1. Confirm the agent version is correct and meets the minimum requirements
    2. (Owners, Admins and Add-On Manager users only) Go to ‘Agent Initialization’ and search for the Distributed Tracing configuration setting to confirm it is enabled. Once the application has traffic again, the verification query below can confirm spans are reporting.
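As a quick sanity check after restarting (a minimal sketch; it assumes your APM spans carry the appName attribute, and you should substitute your own service name), confirm that Span events have started to report:

SELECT count(*) FROM Span WHERE appName = '<add your service name>' SINCE 30 minutes ago TIMESERIES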

Enabling Distributed Tracing for Browser

  1. Check that your Browser agent meets the requirements
  2. Go to your Browser application’s ‘Settings’ > ‘Application settings’ and turn on the Distributed tracing toggle.

  1. Optional - Enable cross-origin resource sharing (CORS)

    1. If the request you want traced starts on one domain and goes to an endpoint on a different domain, it is a cross-origin request, and you need to enable CORS for Browser distributed tracing to generate traces for it.

    2. Check if requests are cross-origin:


SELECT count(*) FROM AjaxRequest WHERE appName LIKE '<add your service name>' LIMIT 100 FACET pageUrl, requestUrl

To check if a specific request is cross-origin, specify the ‘requestUrl’ in your query


SELECT count(*) FROM AjaxRequest WHERE appName LIKE '<add your service name>' AND requestUrl = '<add the specific request>' LIMIT 100 FACET pageUrl

If the pageUrl and the requestUrl have different domains, then this is a cross-origin request; enable CORS for it, then redeploy.

No Traces Appearing

  1. Query Span Data

SELECT count(*) FROM Span WHERE entity.guid = '<add your entity.guid>' SINCE 7 days ago TIMESERIES

How to find your service’s entity.guid

  1. Open your service in New Relic and view its metadata (for example, from the entity’s summary or tags view)
  2. Hover over the entity.guid value and click the clipboard icon to copy it
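If you prefer to query for it (an illustrative query; it assumes the service is reporting Transaction events), the entity.guid is also available as an event attribute:

SELECT uniques(entity.guid) FROM Transaction WHERE appName = '<add your service name>' SINCE 1 day ago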

  1. Why use entity.guid? Using entity.guid ensures that you have the correct service even if you have multiple services with the same name, rename your service, have multiple accounts, etc.

  2. Check the time window of your query (a narrower check is sketched below)

  3. Span data should be reporting within minutes if Distributed Tracing is enabled and the service has had traffic in the last 8 days
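If the 7-day query above returns data but traces still seem to be missing, narrowing the window confirms whether spans are arriving right now (the same query as above, just with a shorter window):

SELECT count(*) FROM Span WHERE entity.guid = '<add your entity.guid>' SINCE 30 minutes ago TIMESERIES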

APM Agents - Transactions not automatically instrumented

  1. Check the APM Transaction Page to confirm Transactions are being reported

    1. Confirm by querying your service

    SELECT count(*) FROM Transaction WHERE appName = '<add your service name>' SINCE 7 days ago TIMESERIES

  2. If no Transactions are reporting, work through the Installation Troubleshooting Framework for the specific APM agent

  3. Service doesn’t use HTTP - If a service doesn’t communicate via HTTP, the New Relic agent won’t send distributed tracing headers. This may be the case for some non-web applications or message queues. To remedy this, use the distributed tracing APIs to instrument either the calling or called application. (A quick check for non-web work is sketched below.)
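As a rough check (an illustrative query, not an exhaustive test), facet Transactions by transactionType; anything reported as non-web work (background jobs, message consumers, etc.) may need the distributed tracing APIs to propagate trace context:

SELECT count(*) FROM Transaction WHERE appName = '<add your service name>' FACET transactionType SINCE 1 day ago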

Browser Agent

  1. Confirm AjaxRequest event data is reporting

    1. Check the AJAX UI page to confirm AJAX requests are reporting
    2. Confirm by querying your Browser service

    SELECT count(*) FROM AjaxRequest WHERE appName = '<add your Browser service name>' SINCE 7 days ago TIMESERIES

  2. Confirm AJAX requests and Browser interactions are occurring on your service

  3. If AJAX requests are occurring but not reported, see the Browser Installation Troubleshooting Guide.

  4. Confirm the New Relic headers are being injected (for example, check outgoing AJAX requests in your browser’s developer tools for newrelic, traceparent, and tracestate request headers)

Troubleshooting Missing, Fragmented or Incorrect Data

How Sampling Works in Standard Head Based Distributed Tracing

There are three potential sampling rounds a span has to make it through before reaching New Relic.

Head Based Adaptive Sampling

Head-based sampling means that the decision to sample a transaction and collect a full trace is made at the start of the trace. The first service monitored by New Relic, called the trace origin, randomly selects 10 traces per minute to be sampled. APM agents use the throughput from the last minute to set the sampling rate so that those 10 transactions are spread as evenly as possible across the one-minute period:

https://docs.newrelic.com/docs/understand-dependencies/distributed-tracing/get-started/how-new-relic-distributed-tracing-works/#trace-origin-sampling

The transactions sampled for a full trace are given a ‘true’ value for the ‘sampled’ attribute, which propagates downstream and signals all other APM agents the trace touches to collect spans; those spans are also marked as ‘sampled’.
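To see how a service’s transactions are being marked, you can facet on the ‘sampled’ attribute (an illustrative query; the attribute is only present when distributed tracing is enabled):

SELECT count(*) FROM Transaction WHERE appName = '<add your service name>' FACET sampled SINCE 1 hour ago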

Span Limits and Priority Sampling

Each APM agent has a span limit of 1000 spans per minute per instance of the agent. This is a hard limit and cannot be configured. The reasoning behind this limit:

In many distributed systems, the average microservice may generate 10 to 20 spans per request. In those cases, the agent span limit can accommodate all spans chosen, and that service will have full detail in a trace.

But this limit can easily be reached if the service is a popular downstream service receiving traces marked for sampling from multiple applications, or if a significant amount of instrumented work is being done. This is where Priority Sampling is used. Priority Sampling assigns a random value between 0.0 and 1.0 to the ‘priority’ attribute on the Transaction and any associated events, such as TransactionError events and Spans, so that sampling is evenly distributed across the harvest cycle. When Transactions are marked as ‘sampled’, their ‘priority’ value is raised as well to increase their chance of being kept. Once the data limit is reached, the agent uses the priority value to begin sampling out spans. This ensures random sampling, and it means that if a Span (or Transaction or TransactionError) event is kept, it is very likely all associated events will be kept too, because all events created from the same unit of work share a priority value even though each event type has an independent sampling pool.
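To inspect these values in your own data (an illustrative query; both attributes are reported by APM agents with distributed tracing enabled, though availability can vary by agent version):

SELECT average(priority), max(priority) FROM Span WHERE entity.guid = '<add your entity.guid>' FACET sampled SINCE 1 hour ago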

Because these harvest limits cannot be adjusted, when the limit is reached it is possible for a trace marked as ‘sampled’ by adaptive sampling to still be sampled out by priority sampling.

Trace Rate Limiting

If the above sampling methods still result in too much trace data, we may limit incoming data by sampling traces after they’re received. By making this decision at the trace level, it avoids fragmenting traces (accepting only part of a trace).

https://docs.newrelic.com/docs/understand-dependencies/distributed-tracing/get-started/how-new-relic-distributed-tracing-works/#span-rate-limiting

This means that a trace marked as ‘sampled’ with high ‘priority’ value spans can still be sampled out, and that 10 traces will not always be reported per minute per instance.

Lambda Sampling

Lambda monitoring does not use the same sampling method as other agents; instead, 10% of invocations are sampled to generate spans:

https://docs.newrelic.com/docs/serverless-function-monitoring/aws-lambda-monitoring/ui-data/understand-lambda-data-structure/#data-structure

Browser Sampling

While the Browser agent does not sample spans, there is a default account limit of 10K Browser spans per minute.
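To see how close the account is to that limit, and which Browser apps contribute most, a per-minute rate query can help (an illustrative query, reusing the browserApp.name attribute from the Browser ingest query later in this guide):

SELECT rate(count(*), 1 minute) FROM Span WHERE browserApp.name IS NOT NULL FACET browserApp.name SINCE 1 hour ago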

Spans Missing

Fragmented Traces, Orphaned Spans and Orphaned Traces

Applications not fully instrumented

  1. Check traces - if the same spans are consistently orphaned or traces are consistently fragmented in the same place, this is likely an instrumentation issue

  2. Check your agent installation and configuration

  3. If an application isn’t automatically instrumented, Custom instrumentation may be required

Spans have not completed or are sent late

  1. Spans must be sent within the last twenty minutes to be captured in a trace index
  2. Spans will not send until the segment of work they represent has completed

Spans have been sampled out

Agents are limited to collecting 1000 Spans per instance of the agent per 60 second harvest cycle

Query APM Spans

  • Query supportability metrics to see Spans seen, sent and dropped
SELECT filter(count(newrelic.timeslice.value), WHERE metricTimesliceName = ('Supportability/SpanEvent/TotalEventsSeen')) as 'Spans Seen', filter(count(newrelic.timeslice.value), WHERE metricTimesliceName = ('Supportability/SpanEvent/TotalEventsSent')) as 'Spans Sent', filter(count(newrelic.timeslice.value), WHERE metricTimesliceName = ('Supportability/SpanEvent/Errors/Dropped')) as 'Spans Dropped' FROM Metric WHERE (entity.guid = '<add entity>') TIMESERIES LIMIT MAX SINCE <time window>
  • Query Span counts - Use the Environment UI Page to confirm the number of agents running on the application’s hosts
SELECT count(*) FROM Span WHERE entity.guid = '<add your entity.guid>' FACET host SINCE 7 days ago TIMESERIES
  • Check Service Maps - if the service is a popular downstream service, spans or whole traces being sampled out is much more common (the query below estimates how many upstream callers are sending it sampled traces)
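A rough way to gauge how many upstream services are sending sampled traces to a downstream service (an illustrative query; it assumes the parent.* distributed tracing attributes are present on Transaction events):

SELECT uniqueCount(parent.app) FROM Transaction WHERE appName = '<add your service name>' AND parent.app IS NOT NULL SINCE 1 day ago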

Browser Ingest Limits

The Browser agent does not sample spans, but there is an account limit of approximately 10000 spans per minute per account.

  1. Query Browser Spans
SELECT count(*) FROM Span WHERE browserApp.name IS NOT NULL SINCE 10 minutes AGO TIMESERIES 1 minute

Connection between apps missing

Upgrade all involved agents to the latest version

Intermediary is missing or isn’t passing trace context

Proxies can often cause this issue. Compare the headers going into the intermediary with the headers coming out (for example, the newrelic, traceparent, and tracestate headers), or test with the proxy removed.

Check if the upstream service has created Distributed Trace Payloads

SELECT count(newrelic.timeslice.value) FROM Metric WHERE (entity.guid = '<add entity>') AND metricTimesliceName LIKE 'Supportability/DistributedTrace/CreatePayload/%' FACET metricTimesliceName TIMESERIES LIMIT MAX SINCE <time window>

Or for Agents that support W3C Trace Context, use the query:

SELECT count(newrelic.timeslice.value) FROM Metric WHERE (entity.guid = '<add entity>') AND metricTimesliceName LIKE 'Supportability/TraceContext/Create/%' FACET metricTimesliceName TIMESERIES LIMIT MAX SINCE <time window>

Check if the downstream service is receiving Distributed Trace Payloads

SELECT count(newrelic.timeslice.value) FROM Metric WHERE (entity.guid = '<add entity>') AND metricTimesliceName LIKE 'Supportability/DistributedTrace/AcceptPayload/%' FACET metricTimesliceName TIMESERIES LIMIT MAX SINCE <time window>

Or for Agents that support W3C Trace Context, use the query:

SELECT count(newrelic.timeslice.value) FROM Metric WHERE (entity.guid = '<add entity>') AND metricTimesliceName LIKE 'Supportability/TraceContext/Accept/%' FACET metricTimesliceName TIMESERIES LIMIT MAX SINCE <time window>
  1. Missing spans due to service not having distributed tracing enabled

  2. Stitching together spans from mixed sources

    • Zipkin, for example, uses its own B3 propagation headers rather than W3C Trace Context headers

Odd-Looking Data

  1. Unaccounted for Time

    1. Typically, a client span that is longer than the server span, or large differences in duration, are due to things like network latency, DNS resolution delay, a load balancer, or queue time that are not instrumented by New Relic agents
  2. Clock Skew

    1. New Relic agents rely on the time set in headers, so if a header carries the wrong time, it will result in large gaps of unaccounted-for time in traces.

      1. Check your server time using NTP
      2. If you have followed the guide to report request queueing, check that the timestamp in the headers is accurate
  3. Compare Trace Data to other reported data to confirm the accuracy

    1. Transaction Traces have a similar level of detail to Distributed tracing for in-process work but all timing and instrumentation occurs in the same agent. If you are missing fragments of a Distributed Trace, check Transaction Traces for similar traces.

      1. If the work is consistently instrumented in Transaction Traces but not always in Distributed Traces, then the agent is instrumenting correctly but likely sampling is causing fragments to be dropped.
      2. If segments of work are always reported in Transaction Traces but never in Distributed Traces, check the last section of the Distributed Trace recorded. If there are external calls occurring before the trace fragments, there is likely an issue with headers being passed properly. For APM agents, query supportability metrics to confirm headers are being both created and accepted; if the downstream service is also monitored by New Relic, run the same query on any downstream services:
SELECT filter(count(newrelic.timeslice.value), WHERE metricTimesliceName = ('Supportability/DistributedTrace/AcceptPayload/Success')) as 'Traces Successfully accepted', filter(count(newrelic.timeslice.value), WHERE metricTimesliceName = ('Supportability/DistributedTrace/AcceptPayload/Exception')) as 'Exception occurred ', filter(count(newrelic.timeslice.value), WHERE metricTimesliceName LIKE 'Supportability/DistributedTrace/AcceptPayload/Ignored%') as 'Ignored', filter(max(newrelic.timeslice.value), WHERE metricTimesliceName = ('Supportability/DistributedTrace/AcceptPayload/ParseException')) as 'Parse Exception', filter(max(newrelic.timeslice.value), WHERE metricTimesliceName = ('Supportability/DistributedTrace/CreatePayload/Success')) as 'Create Success', filter(average(newrelic.timeslice.value), WHERE metricTimesliceName = ('Supportability/DistributedTrace/CreatePayload/Exception')) as 'Create Exception' FROM Metric WHERE (entity.guid = '<add entity>') TIMESERIES LIMIT MAX SINCE 7 days ago

Or for Agents that support W3C Trace Context, use the query:

SELECT filter(count(newrelic.timeslice.value), WHERE metricTimesliceName = ('Supportability/TraceContext/Accept/Success')) as 'Accept Success', filter(count(newrelic.timeslice.value), WHERE metricTimesliceName = ('Supportability/TraceContext/Accept/Exception')) as 'Accept Exception', filter(count(newrelic.timeslice.value), WHERE metricTimesliceName LIKE 'Supportability/DistributedTrace/AcceptPayload/Ignored%') as 'Ignored', filter(max(newrelic.timeslice.value), WHERE metricTimesliceName = ('Supportability/DistributedTrace/AcceptPayload/ParseException')) as 'Parse Exception', filter(max(newrelic.timeslice.value), WHERE metricTimesliceName = ('Supportability/DistributedTrace/CreatePayload/Success')) as 'Create Success', filter(average(newrelic.timeslice.value), WHERE metricTimesliceName = ('Supportability/DistributedTrace/CreatePayload/Exception')) as 'Create Exception' FROM Metric WHERE (entity.guid = '<add entity>') TIMESERIES LIMIT MAX SINCE 7 days ago

Common Questions

Why can I not find a Distributed Trace that matches the traceID for my Transaction Trace?

Short answer: sampling! Despite the very similar names, Transaction Traces and Distributed Traces are fairly different. Transaction Traces provide extra detail about the flow of work within a single transaction, where each unit of work is represented by a segment rather than a Span. While a Transaction Trace can include external requests, all measurement and reporting is done within the service. Transaction events are sampled for full Transaction Traces based on their duration, after the transaction completes.

In contrast, standard Distributed Tracing samples at the start, or “head”, of the trace. Because the trace origin makes a sampling decision at random before the Distributed Trace flows through all of the services involved in fulfilling a request, duration does not influence sampling decisions, and any overlap between a Transaction Trace and a Distributed Trace being collected for the same work is entirely coincidental. Even if a transaction is included in both a sampled trace and a Transaction Trace, it is still possible for the Distributed Trace to be sampled out by one of the other two methods.
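One quick way to confirm whether any spans were kept for a given transaction’s trace is to look the trace up by its ID (Transaction events expose it as traceId, Span events as trace.id; an illustrative query):

SELECT count(*) FROM Span WHERE trace.id = '<add the traceId from your Transaction>' SINCE 1 day ago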

Why do so few Transactions marked as ‘sampled’ have Distributed Traces?

Once again, sampling. Seeing ‘sampled: true’ does give the impression that the trace should be reported, but it is best to remember this is set at the trace origin as a signal to all agents involved in the trace to collect trace data, not at the end of the trace after Priority Sampling and Trace Rate Limiting have kicked in.

There are a couple of common causes for this. If your service is a popular downstream service, each upstream agent instance calling it can send up to 10 traces per minute marked as ‘sampled’, while your service is also sampling its own transactions for traces. This can easily exceed the 1000-spans-per-minute limit, and since the agent favors whole traces, it will drop whole traces rather than report a larger number of fragmented traces.

If a lot of work is being done in a single trace, sampling limits can again be reached, leading to whole traces being sampled out. A single trace can involve multiple transactions.
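To get a sense of how large a single trace is, count its spans and the number of entities involved (an illustrative query); large counts make hitting per-agent span limits more likely:

SELECT count(*), uniqueCount(entity.guid) FROM Span WHERE trace.id = '<add a trace id>' SINCE 1 hour ago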


Thanks Gabriel, this is a very helpful troubleshooting guide. Great work!
