Lambda Troubleshooting Framework - General Knowledge Part 2

Part 1 of this guide is here

The goal here is to help identify where the focus of troubleshooting efforts should begin.

New Relic Agents

For layers that include agents (Node.js, Python, Java, and Go), instrumenting your function is drop-in easy. For .NET functions, our OpenTracing wrapper is still needed, which is more of a manual process. Work is being done to make .NET function instrumentation drop-in easy as well.

Java Instrumentation

Our new Java layer is here. The goal is to make Java instrumentation drop-in easy like the current state for Node.js and Python.

Node.js Instrumentation

To add custom instrumentation to your Node.js function using our Node.js layer, instantiate a newrelic object and use the Node Agent API.

Python Instrumentation

To add custom instrumentation to your Python function using our Python layer, initialize and then instantiate a newrelic object, and then use the Python Agent API.

Go Instrumentation

Go Agent Repo, API, and Guide

Go Agent Examples

OpenTracing Agents

Java OpenTracing Wrapper

New Relic OpenTracing Agent

The following repositories define what we would call an agent, used to instrument your Java function.

Examples

Why we use OpenTracing instead of the Java agent for Lambda monitoring

One of our Java agent engineers explained it best:

AWS charges based on Lambda function execution time. Because of this, most people use lambdas in a manner where they only execute for a short period of time (maybe milliseconds or seconds) to manage costs. When you introduce the Java agent into the environment it might add several seconds to the Lambda startup time just to initialize. The overhead impact is increased on lambda cold starts.

That’s why we opted for a solution with less overhead based on OpenTracing.

.NET OpenTracing Wrapper

New Relic OpenTracing Agent

The following repository defines what is analogous to an agent, used to instrument your .NET function (.NET Core or C#).

Code notes:

Wrapper:
- SpanExtensions
  - SetException is where we add an Exception to the Log with specific attributes.
  - This does not use Tags.Errors.Set
- LambdaWrapper
  - AfterWrappedMethod is where we call SpanExtensions.SetException
Tracer:
- DataCollector
  - TransformSpans sends each span to Errors to check whether an error Log entry exists
  - PreparePayload is where the traces, event, and spans get serialized into JSON.
- Errors
  - Errors is where we look for Log entries with the error attributes we set previously.
  - We build the ErrorEvent here
  - We build the ErrorTrace here, with a stack trace if possible
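
The flow in the code notes above can be pictured with a small model. This is only an illustrative Python sketch, not the .NET wrapper's code: the Span class, the function names, and the shape of the resulting event are hypothetical. The log field names (event, error.kind, message, stack) follow OpenTracing's conventional error log fields, and whether the wrapper uses exactly these is an assumption.

```python
import traceback
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Span:
    """Hypothetical stand-in for a span that carries log entries."""
    name: str
    logs: List[Dict[str, str]] = field(default_factory=list)


def set_exception(span: Span, exc: BaseException) -> None:
    """Wrapper side: record the exception as a log entry on the span."""
    span.logs.append({
        "event": "error",
        "error.kind": type(exc).__name__,
        "message": str(exc),
        "stack": "".join(
            traceback.format_exception(type(exc), exc, exc.__traceback__)
        ),
    })


def build_error_event(span: Span) -> Optional[Dict[str, str]]:
    """Tracer side: look for an error log entry and build an event from it."""
    for entry in span.logs:
        if entry.get("event") == "error":
            return {
                "type": "ErrorEvent",  # hypothetical event shape
                "span": span.name,
                "error.class": entry["error.kind"],
                "error.message": entry["message"],
                "stack": entry["stack"],
            }
    return None  # no error was logged on this span


span = Span("handler")
try:
    raise ValueError("boom")
except ValueError as exc:
    set_exception(span, exc)
event = build_error_event(span)
```

The point of the split is that the wrapper only writes log entries, and the tracer decides later, at serialization time, whether a span carries an error.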

Current state:

  • Logs in Context is not supported for .NET Lambda functions: there is no GetLinkingMetadata equivalent like the one in the agent, and there is no way to send that data even if it could be captured.
  • The .NET Lambda Tracer is not based on the .NET Agent.
  • Not all OpenTracing features are implemented or used by the LambdaTracer.
  • There is no NoticeError equivalent in the LambdaTracer.
  • Our SpanExtensions are currently scoped as internal rather than public, which prevents calling our existing exception logging functions such as SetException.

Examples

It’s a good idea to try implementing our .NET example function to get familiar with all the components.

Distributed Tracing in Lambda

See our distributed tracing example.

Node and Python

Agents in the Extension layer have been designed to work in common use cases, like API Gateway, and so they automatically check for and handle any New Relic or W3C Trace Context headers which contain the traceId.

Go

Distributed tracing in Go is a manual process for the most part and uses the Go agent. AWS requires a custom Go runtime for use with the Extension.

Java and .NET

Java and .NET implementations for distributed tracing are similar enough that one could use the Java example and convert it to .NET.

Check out our examples:

.NET uses the OpenTracing implementation, so it doesn’t automatically look for the newrelic header. You’ll need to manually instrument the inject and extract methods for API Gateway. If our newrelic custom header is attached to the outgoing request, it must be passed from one entity to the next so that trace context is maintained along the way.

You’ll need to:

  • add the custom header
  • let that custom header through API Gateway

Note: If API Gateway is stripping out the ‘newrelic’ header, API Gateway can be configured to allow custom headers to pass through.

Incoming Headers

Agents accept both newrelic and W3C Trace Context headers and are designed to work with AWS API Gateway and AWS ALB. OpenTracing should be used with newrelic headers and does not automatically work with API Gateway or ALB.

W3C Trace Context

Here’s an example of how the W3C Trace Context headers, as found in something like event.request.headers, can be broken out and understood:

traceparent: '00-123456789abcdef123456789abcdef12-123456789abcdef-01',

traceparent fields:
version: 00
traceId: 123456789abcdef123456789abcdef12
parentId: 123456789abcdef
flags (sampled): 01

tracestate: '1234567@nr=0-1-7654321-123456789-abcdef123456789----1607034919662',

tracestate fields:
trust key: 1234567
version: 0
parentType: 1
accountId: 7654321
appId: 123456789
spanId: abcdef123456789
transactionId: (empty)
sampled: (empty)
priority: (empty)
timestamp: 1607034919662

The transactionId, sampled, and priority fields are not required. It seems normal for them to be missing when the headers come from a browser.
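
When checking headers by hand, the same breakdown can be done with a few lines of Python. This is a minimal sketch, not agent code: the function names and dictionary keys are made up for illustration, and the field order of the New Relic tracestate entry is taken from the breakdown above.

```python
from typing import Dict


def parse_traceparent(value: str) -> Dict[str, str]:
    """Split a traceparent header into its four dash-separated fields."""
    version, trace_id, parent_id, flags = value.split("-")
    return {
        "version": version,
        "traceId": trace_id,
        "parentId": parent_id,
        "flags": flags,
    }


# Field order of the New Relic tracestate entry, as listed in the breakdown.
NR_TRACESTATE_FIELDS = (
    "version", "parentType", "accountId", "appId", "spanId",
    "transactionId", "sampled", "priority", "timestamp",
)


def parse_nr_tracestate(value: str) -> Dict[str, str]:
    """Find the '<trust key>@nr=' entry in tracestate and split its payload."""
    for entry in value.split(","):
        key, _, payload = entry.strip().partition("=")
        if key.endswith("@nr"):
            # Optional fields that were not sent come back as empty strings.
            fields = dict(zip(NR_TRACESTATE_FIELDS, payload.split("-")))
            fields["trustKey"] = key[: -len("@nr")]
            return fields
    raise ValueError("no New Relic entry found in tracestate")


tp = parse_traceparent("00-123456789abcdef123456789abcdef12-123456789abcdef-01")
ts = parse_nr_tracestate(
    "1234567@nr=0-1-7654321-123456789-abcdef123456789----1607034919662"
)
print(tp["traceId"], ts["accountId"], ts["timestamp"])
```

Comparing the traceId from traceparent across each hop is usually the quickest way to confirm whether trace context survived a request.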

New Relic Header

{
  "v":[0,1],                 // version [major, minor]
  "d":{
    "ty":"App",              // type
    "ac":"1234567",          // account id
    "ap":"123456789",        // app id
    "tr":"246813579abcdef",  // trace id
    "pr":1.42749,            // priority
    "sa":true,               // sampled
    "ti":1605197585495,      // time
    "tk":"1234567",          // trusted account key
    "tx":"abcdef987654321",  // transaction guid
    "id":"139742685abcdef"   // span guid
  }
}
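
To inspect a newrelic header captured from a request, something like the following can help. It is a sketch under assumptions: the source shows only the decoded payload, so treating the header value as base64-encoded JSON is an assumption here, and the function names and descriptive key names are made up for illustration. The comments in the listing above are annotations, so the sample below omits them to stay valid JSON.

```python
import base64
import json
from typing import Any, Dict


def decode_newrelic_header(value: str) -> Dict[str, Any]:
    """Return the payload dict from a newrelic header value.

    Accepts raw JSON, or base64-encoded JSON (assumed transport encoding).
    """
    text = value.strip()
    if not text.startswith("{"):
        # Restore any stripped padding before decoding.
        text = base64.b64decode(text + "=" * (-len(text) % 4)).decode("utf-8")
    return json.loads(text)


def summarize(payload: Dict[str, Any]) -> Dict[str, Any]:
    """Map the short keys onto the descriptive names from the listing."""
    d = payload["d"]
    return {
        "version": payload["v"],
        "type": d["ty"],
        "accountId": d["ac"],
        "appId": d["ap"],
        "traceId": d["tr"],
        "priority": d.get("pr"),
        "sampled": d.get("sa"),
        "timestamp": d["ti"],
        "trustedAccountKey": d.get("tk"),
        "transactionGuid": d.get("tx"),
        "spanGuid": d.get("id"),
    }


sample = {"v": [0, 1], "d": {"ty": "App", "ac": "1234567", "ap": "123456789",
          "tr": "246813579abcdef", "pr": 1.42749, "sa": True,
          "ti": 1605197585495, "tk": "1234567",
          "tx": "abcdef987654321", "id": "139742685abcdef"}}
encoded = base64.b64encode(json.dumps(sample).encode("utf-8")).decode("ascii")
info = summarize(decode_newrelic_header(encoded))
print(info["traceId"], info["sampled"])
```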

Outgoing Headers

Trace context (tracestate and traceparent) must be maintained and passed along to any outgoing requests.
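
As a rough illustration of what maintaining trace context means for a hand-built outgoing request, the sketch below keeps the incoming trace id and flags, passes tracestate through unchanged, and sets the parent id to the current span's id, as the W3C Trace Context specification describes. The function name and the way the span id is generated are assumptions for the example; an agent also rewrites its own tracestate entry, which is not modelled here.

```python
import secrets
from typing import Dict, Mapping, Optional


def outgoing_trace_headers(
    incoming: Mapping[str, str], span_id: Optional[str] = None
) -> Dict[str, str]:
    """Build trace headers for an outgoing request from the incoming ones."""
    # HTTP header names are case-insensitive, so normalize before lookup.
    lowered = {name.lower(): value for name, value in incoming.items()}
    headers: Dict[str, str] = {}
    if "traceparent" in lowered:
        version, trace_id, _parent_id, flags = lowered["traceparent"].split("-")
        # The caller's span becomes the parent of the downstream work.
        span_id = span_id or secrets.token_hex(8)  # 16 hex characters
        headers["traceparent"] = "-".join((version, trace_id, span_id, flags))
    if "tracestate" in lowered:
        headers["tracestate"] = lowered["tracestate"]
    return headers


incoming = {
    "Traceparent": "00-123456789abcdef123456789abcdef12-123456789abcdef-01",
    "Tracestate": "1234567@nr=0-1-7654321-123456789-abcdef123456789----1607034919662",
}
outgoing = outgoing_trace_headers(incoming, span_id="aaaaaaaaaaaaaaaa")
print(outgoing["traceparent"])
```

If the trace id in the outgoing traceparent differs from the incoming one, the downstream service will start a new trace instead of joining the existing one.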

Linking Logs to Traces (Logs in Context)

The agent API or OpenTracing API needs to be used to link logs to traces by traceId and spanId. This is not yet implemented for Lambda and may not show up in the Lambda UI. Once complete, this feature will be handled by our Extension in the layer, which currently just sends function logs to New Relic without linking them to traces.
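
Conceptually, linking a log line to a trace means stamping the line with the ids of the trace and span that were active when it was written. The sketch below shows that idea with Python's standard logging module only. It does not use the agent API, the formatter class is hypothetical, and the attribute names trace.id and span.id are assumed here rather than taken from this guide.

```python
import json
import logging


class TraceLinkingFormatter(logging.Formatter):
    """Emit each record as JSON, carrying trace.id and span.id when present."""

    def format(self, record: logging.LogRecord) -> str:
        entry = {
            "message": record.getMessage(),
            "level": record.levelname,
            "timestamp": int(record.created * 1000),  # milliseconds
        }
        # Values passed through logging's `extra` become record attributes.
        for attr, key in (("trace_id", "trace.id"), ("span_id", "span.id")):
            value = getattr(record, attr, None)
            if value is not None:
                entry[key] = value
        return json.dumps(entry)


handler = logging.StreamHandler()
handler.setFormatter(TraceLinkingFormatter())
logger = logging.getLogger("linking_demo")
logger.addHandler(handler)
logger.setLevel(logging.INFO)
logger.info(
    "order processed",
    extra={"trace_id": "123456789abcdef123456789abcdef12",
           "span_id": "abcdef123456789"},
)
```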

Examples from the Go Agent API:

Understanding CloudWatch Logs

Platform Logs

We have a log server that:

  1. Subscribes to platform logs via the AWS Logs API to get the START, END, and REPORT metrics.
  2. Provides an easy way for us to forward function logs to New Relic.

The logger looks for a sandbox URI that acts like localhost. It’s a name that resolves within the Lambda container that we listen on to receive metrics from the Lambda platform.
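
When reading CloudWatch directly, the REPORT line for an invocation can be pulled apart with a regular expression. This sketch parses the text form of the line as it appears in CloudWatch, which is not necessarily how the log server receives it from the Logs API. The function name and result keys are made up, and the numbers in the sample line are invented for the example.

```python
import re
from typing import Dict, Optional, Union

# Fields on a REPORT line are separated by tabs; \s+ also tolerates spaces.
REPORT_PATTERN = re.compile(
    r"REPORT RequestId: (?P<request_id>\S+)\s+"
    r"Duration: (?P<duration_ms>[\d.]+) ms\s+"
    r"Billed Duration: (?P<billed_ms>[\d.]+) ms\s+"
    r"Memory Size: (?P<memory_mb>\d+) MB\s+"
    r"Max Memory Used: (?P<max_memory_mb>\d+) MB"
)


def parse_report_line(line: str) -> Optional[Dict[str, Union[str, float, int]]]:
    """Return the metrics from a REPORT line, or None if it is another line."""
    match = REPORT_PATTERN.search(line)
    if match is None:
        return None
    return {
        "request_id": match["request_id"],
        "duration_ms": float(match["duration_ms"]),
        "billed_ms": float(match["billed_ms"]),
        "memory_mb": int(match["memory_mb"]),
        "max_memory_mb": int(match["max_memory_mb"]),
    }


sample_line = (
    "REPORT RequestId: 12345678-1234-4321-abcd-123456789abc\t"
    "Duration: 102.25 ms\tBilled Duration: 103 ms\t"
    "Memory Size: 128 MB\tMax Memory Used: 55 MB"
)
report = parse_report_line(sample_line)
print(report)
```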

Occasionally, you may see a line that says:

[NR_EXT] Failed to add platform log for request 12345678-1234-4321-abcd-123456789abc

This happens when platformLog.Content is nil, and the message can be safely ignored.

One of our Lambda engineers describes the logic behind it this way:

That’s essentially expected.
When we do a harvest because it’s the first invocation, or the first in a while, we remove telemetry that hasn’t been decorated with its platform logs from the buffer.
Later on (during the next invocation, pretty much always) the platform logs arrive, and we can’t find a telemetry payload to add them to.
This avoids the scenario where telemetry on single/occasional invocations could be delayed.

Better to send what we have quickly, even if no platform logs can be included for that one-off invocation.
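
The behaviour described above can be pictured with a toy model. This is not the Extension's code: the class and method names are hypothetical, and the model only captures the ordering problem, where a harvest empties the buffer before the platform log for that request arrives.

```python
from typing import Dict, List, Optional


class TelemetryBuffer:
    """Toy model: telemetry waits in a buffer for its platform log."""

    def __init__(self) -> None:
        self._pending: Dict[str, dict] = {}

    def add_telemetry(self, request_id: str, payload: dict) -> None:
        self._pending[request_id] = {"telemetry": payload, "platform_log": None}

    def add_platform_log(self, request_id: str, content: Optional[str]) -> bool:
        """Attach a platform log; return False when it cannot be attached."""
        entry = self._pending.get(request_id)
        if entry is None or content is None:
            print(f"[NR_EXT] Failed to add platform log for request {request_id}")
            return False
        entry["platform_log"] = content
        return True

    def harvest(self) -> List[dict]:
        """Send everything now, decorated or not, and empty the buffer."""
        batch = list(self._pending.values())
        self._pending.clear()
        return batch


buffer = TelemetryBuffer()
buffer.add_telemetry("req-1", {"duration_ms": 12})
sent = buffer.harvest()                       # first invocation: send at once
late = buffer.add_platform_log("req-1", "REPORT ...")  # arrives too late
```

In this model the telemetry for req-1 is still delivered; only the platform log decoration is lost, which matches the trade-off described in the quote.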