Lambda Troubleshooting Framework - General Knowledge

Part 1

(Part 2 in next comment)


The goal here is to help identify where the focus of troubleshooting efforts should begin.

Lifecycle of an Invocation

The lifecycle of our Extension includes several phases: Extension init, Runtime init, Function init, Invoke, and Shutdown.

Some of the phases are asynchronous, and a couple spend a small amount of time blocking the invocation with synchronous calls. Specifically, from one of our Lambda Monitoring Engineers:

There are two synchronous phases to the extension’s execution, where it actually blocks the progress of the invocation. We spend less than 1ms receiving telemetry from the agent at the end of the invocation, and storing it in a buffer.

Every 7 seconds we send accumulated telemetry by making an HTTP call to the NR ingest service. This takes on the order of 70ms in us-east-1. In AWS regions further from the ingest service, it may take longer.

If logging is enabled, we use a somewhat different approach: we send logs as they arrive, but we don’t block the invocation lifecycle for it.

All of that to say, the Extension adds a few milliseconds to the invocation. However, that pales in comparison to the time it takes for the actual instrumentation of code to happen, which varies depending on what is being instrumented, how much code there is, what the function is doing, and whether objects like database clients are initialized in a global scope or a nested scope. Nested inits get re-instrumented on each and every invocation, versus just once at cold start for objects in the global scope.
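For example, here is a minimal Python sketch of the difference (the DynamoDB client is just an illustrative dependency):

import boto3

# Global scope: the client is created (and instrumented) once, at cold start.
dynamodb = boto3.client("dynamodb")

def handler(event, context):
    # A client created here instead would be re-created and re-instrumented
    # on every invocation, adding latency each time:
    #   dynamodb = boto3.client("dynamodb")  # avoid this pattern
    return dynamodb.list_tables()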

It’s important to note:

The extension process is just a transport for the telemetry and (optionally) logs.

The extension doesn’t instrument your lambda. The agent in our layer (or the OpenTracing SDK) does that, and it’s way more work than just sending telemetry.

Details on harvests

  • The first invocation's telemetry is sent quickly, but without some metrics.
  • Subsequent payloads are sent in a batch every 7 seconds.
  • Functions invoked in a burst still send telemetry every 7 seconds. If the burst ends before the 7-second mark, the remaining invocations won't be sent until AWS shuts down the container, so it's best to send another invocation after 7 seconds to flush the batch.
  • Invocation time: the time between sending the payload and receiving a response.
  • There are environment variables to fine-tune the batch timing per function.
# default ripe harvest: every 7 sec
NEW_RELIC_HARVEST_RIPE_MILLIS: 7000

# default rot harvest: every 12 sec
NEW_RELIC_HARVEST_ROT_MILLIS: 12000

Architecture and Design

See our architecture doc.

“Advanced Lambda Monitoring” is actually two products, one built on top of the other. The “base” is simply the Infrastructure AWS integration that gets CloudWatch metrics. The “full package” involves adding APM agent instrumentation code to Lambda functions, and then choosing one of two ways to ship invocation payloads and logs to us:

  1. Extension method: the New Relic Lambda Extension built into our layers bypasses CloudWatch and sends function logs and invocation payloads to us directly.
  2. Legacy CloudWatch method: subscribe those functions to our newrelic-log-ingestion function, which packages everything together and sends it up to New Relic.

Our endpoints for function logs and invocation payloads are:

InfraEndpointEU - https://cloud-collector.eu01.nr-data.net/aws/lambda/v1
InfraEndpointUS - https://cloud-collector.newrelic.com/aws/lambda/v1
LogEndpointEU   - https://log-api.eu.newrelic.com/log/v1
LogEndpointUS   - https://log-api.newrelic.com/log/v1

Whichever method is used to send us invocation telemetry, instrumentation of the function code is done the same way, via our layers, and can be implemented in several ways depending on the language.

NRQL Types

See our docs for a description of New Relic’s backend.

ServerlessSample

The following query will show a list of function names from the linked AWS account and a count of invocations. If names are showing, the integration is working.

SELECT latest(provider.invocations.Sum) FROM ServerlessSample WHERE dataSourceName = 'Lambda' FACET provider.functionName SINCE 1 day ago LIMIT 100

AwsLambdaInvocation

See a description of Lambda monitoring data types in our docs.

The following query determines whether a function is listed as instrumented or not in the Lambda UI:

SELECT count(*) FROM AwsLambdaInvocation WHERE entityGuid = 'YOUR_ENTITY_GUID' SINCE 30 days ago LIMIT 1

In order for AwsLambdaInvocation to be populated with data, we look for one of two log lines.

Here’s how to decode the payloads to see what is getting instrumented prior to sending to New Relic.
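As a rough sketch, assuming the payload is the base64-encoded, gzip-compressed JSON that the agent emits in its NR_LAMBDA_MONITORING log line, it can be decoded like this:

import base64
import gzip
import json

def decode_nr_payload(encoded: str) -> dict:
    # Reverse the agent's encoding: base64 -> gzip -> JSON.
    return json.loads(gzip.decompress(base64.b64decode(encoded)))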

AwsLambdaInvocationError

AwsLambdaInvocationError events provide detail regarding an error that occurred during an invocation, and they provide more detail than a trace would. We display this information about invocation errors in the Lambda nerdlet in the New Relic One Explorer.

One of our Lambda engineers describes it this way:

There are cases where the Node agent may ignore an error; noticeError() overrides this, forcing the agent to gather details about a specific error.
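The Python agent offers the equivalent notice_error() (older agent versions call it record_exception()). A minimal sketch, with risky_operation standing in as a hypothetical piece of failing work:

import newrelic.agent

def risky_operation():
    raise ValueError("something went wrong")  # hypothetical failing work

def handler(event, context):
    try:
        risky_operation()
    except Exception:
        # Force the agent to gather details for this error even if it
        # would otherwise be ignored.
        newrelic.agent.notice_error()
        raise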

Here is an example NRQL query to get error details:

SELECT * FROM AwsLambdaInvocationError SINCE 1 week ago WHERE aws.lambda.arn = 'YOUR_FUNCTION_ARN'

Span

Span events and metrics will also get recorded. This query shows a count of span events by AWS integration (linked account name).

SELECT count(*) FROM Span FACET providerAccountName SINCE 1 week ago

The Span type is especially useful for seeing distributed traces. The following query will highlight any entities connected by the trace, like an app with an APM agent installed, a browser app with the Browser agent, and one or more Lambda functions. If no parent app is listed, that is the root span. Sort by timestamp to see the flow of the trace.

SELECT count(*),latest(name),max(timestamp) FROM Span WHERE traceId = 'YOUR_TRACE_ID' FACET appName,provider.functionName,parent.app SINCE 1 week ago LIMIT MAX

This query will count spans between a browser app with the Browser agent installed and an instrumented Lambda function. It can be helpful to see if an approximately equal number of span events are making it from one entity to another.

SELECT filter(count(*), where browserApp.name = 'YOUR_BROWSER_APP') as 'Browser', filter(count(*), where provider.functionName = 'YOUR_LAMBDA_FUNCTION') as 'Lambda' FROM Span SINCE 1 week ago

One to Many

The typical design is to use one New Relic account linked with many AWS accounts, each with its own name. The name can be specified with the --linked-account-name or -n parameter in the following CLI command:

newrelic-lambda integrations install -a NR_ACCOUNT_ID -r REGION -n LINKED_ACCOUNT_NAME -k YOUR_USER_API_KEY --no-aws-permissions-check

The above command will also save your New Relic license key in the AWS Secrets Manager unless the --disable-license-key-secret parameter is specified.

Many to One

It is possible to use many New Relic accounts with one AWS account. However, our New Relic Lambda CLI has not yet been designed for this scenario. The following manual steps outline the process:

Setting up the Integrations

  1. The New Relic accounts should be related in some way: either one is a main account and the other is its sub-account, or the two New Relic accounts are sub-accounts of the same main account. Note that when setting up each Integration, you can use the same integration role with the managed policy ReadOnlyAccess if one already exists in your AWS account, at which point the trusted identity for the role will show both New Relic account IDs.
  2. The different Lambdas would authenticate with the accounts’ different license keys. It might be cleanest to do this with the NEW_RELIC_LICENSE_KEY environment variable rather than using the Secrets Manager, but either method is possible.

If using the AWS Secrets Manager to store your New Relic license key

  1. You’ll need a different secret name and ID for each New Relic account. You’ll also need a dedicated secrets access policy, which would need to be attached to each function’s execution role. Here are the parameters for creating the license key secret. Here is the IAM access policy needed by the function to retrieve the secret.
  2. Set an environment variable on each Lambda function to point it to your specific secret id: NEW_RELIC_LICENSE_KEY_SECRET: YOUR_SECRET_ID

Using the Legacy CloudWatch Method to Send Payloads

If not using our Extension method to send payloads to New Relic, our legacy CloudWatch method requires that special consideration be given to the newrelic-log-ingestion Lambda function.

  1. The Extension can be disabled on each of your Lambda functions with: NEW_RELIC_LAMBDA_EXTENSION_ENABLED: false.
  2. One newrelic-log-ingestion function will be needed for each New Relic account linked. Each newrelic-log-ingestion function would be assigned the NEW_RELIC_LICENSE_KEY for the account you want to point it to. See this doc for further details.
  3. Similarly, when setting up the function’s log subscription filter, you’d specify which log ingestion function gets triggered on a log event.
  4. The New Relic Lambda CLI and AWS Deployment App won’t work to add multiple newrelic-log-ingestion functions to one AWS account since they check for an existing function and quit if one is found.

To deploy multiple ingestion functions, you could do one of the following:

  • deploy the newrelic-log-ingestion app from the AWS Serverless Application Repository manually, giving each deployment a distinct function name, or
  • build and deploy the function from its source repository yourself, once per linked New Relic account.

Our Extension

The lifecycle of the Extension, including its synchronous phases, is described above under “Lifecycle of an Invocation”.

Request-Response Latency

When a harvest fires, the Extension’s telemetry send happens synchronously and blocks delivery of that invocation’s response (on the order of 70ms from us-east-1, longer in regions further from the ingest service).

AWS Roles and Policies

Required by All Integrations

All integrations need at least these permissions on the integration role in the linked AWS account and region.

CloudWatch

cloudwatch:GetMetricStatistics
cloudwatch:ListMetrics
cloudwatch:GetMetricData

Config API

config:BatchGetResourceConfig
config:ListDiscoveredResources

Resource Tagging API

tag:GetResources

Integration Role

The very broad ReadOnlyAccess policy is the default “AWS managed policy” we use when no other policy is specified.

The integration role is specified with the --integration-arn parameter when using the New Relic Lambda CLI. For example:

newrelic-lambda integrations install --nr-account-id <NR_ACCOUNT_ID> --nr-api-key <KEY> --integration-arn arn:aws:iam::<AWS_ACCOUNT_ID>:role/NewRelicLambdaIntegrationRole_<NR_ACCOUNT_ID>

If specifying your own integration role, these are the bare minimum permissions:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "cloudwatch:GetMetricStatistics",
                "cloudwatch:ListMetrics",
                "cloudwatch:GetMetricData",
                "lambda:GetAccountSettings",
                "lambda:ListFunctions",
                "lambda:ListAliases",
                "lambda:ListTags",
                "lambda:ListEventSourceMappings"
            ],
            "Resource": "*"
        }
    ]
}

Specifying both the --integration-arn and --role-name parameters on the newrelic-lambda integrations install command will allow an AWS user without CAPABILITY_IAM to complete the integration.

The role needs a specific trust relationship and condition. Under the “Trust relationships” tab:

  • add account “754728514883” as a trusted entity
  • add your New Relic account ID as a “StringEquals sts:ExternalId YOUR_NEW_RELIC_ID” condition
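Assuming those two settings, the resulting trust policy on the role should look roughly like this sketch (substitute your own New Relic account ID):

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {
                "AWS": "arn:aws:iam::754728514883:root"
            },
            "Action": "sts:AssumeRole",
            "Condition": {
                "StringEquals": {
                    "sts:ExternalId": "YOUR_NEW_RELIC_ACCOUNT_ID"
                }
            }
        }
    ]
}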

Note: the permissions on the integration role are not the same as the function’s execution role. For the purposes of the integrations install command, the --role-name option supplies the CLI with the execution role for use by the newrelic-log-ingestion function. It is not the same as the integration role, which is used only for the integration between New Relic and the AWS account. The same execution role can be used by many Lambda functions and is typically attached to the function automatically by AWS when the function is created.

Execution Role

arn:aws:iam::<AWS_ACCOUNT_ID>:role/<YOUR_EXECUTION_ROLE_NAME>

The AWSLambdaBasicExecutionRole policy is the default “AWS managed policy” we use when no other policy is specified.

The execution role is specified with the --role-name parameter when using the New Relic Lambda CLI with the integrations install command as described in the Integration Role section. The execution role can be used for multiple functions, including the newrelic-log-ingestion function.

If specifying your own execution role, these are the bare minimum permissions:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "logs:CreateLogGroup",
                "logs:CreateLogStream",
                "logs:PutLogEvents"
            ],
            "Resource": "*"
        }
    ]
}

Secrets Manager Role

arn:aws:secretsmanager:<AWS_REGION>:<AWS_ACCOUNT_ID>:secret:NEW_RELIC_LICENSE_KEY-<RANDOM>

The secrets manager role is created by default unless the --disable-license-key-secret parameter is specified, in which case the NEW_RELIC_LICENSE_KEY environment variable should be set on the function. AWS Secrets Manager access requires these permissions in a role attached to the function:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "secretsmanager:GetSecretValue"
            ],
            "Resource": "arn:aws:secretsmanager:<AWS_REGION>:<AWS_ACCOUNT_ID>:secret:NEW_RELIC_LICENSE_KEY-<RANDOM>"
        }
    ]
}
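Our Extension retrieves the secret for you; purely as an illustration of what that permission allows, here is a hypothetical boto3 sketch (it assumes the CLI stored the key as JSON under a "LicenseKey" field):

import json
import boto3

# Requires secretsmanager:GetSecretValue on the secret's ARN (see policy above).
client = boto3.client("secretsmanager")
secret = client.get_secret_value(SecretId="NEW_RELIC_LICENSE_KEY")
license_key = json.loads(secret["SecretString"])["LicenseKey"]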

Layers

Our layers include the Extension, which collects invocation payloads, function logs, and platform logs and sends them to New Relic. The Node.js and Python layers additionally include agents for instrumenting and handling distributed tracing headers.

See our docs for more.

Regions

We publish our layers to the following regions: af-south-1, ap-east-1, ap-northeast-1, ap-northeast-2, ap-south-1, ap-southeast-1, ap-southeast-2, ca-central-1, eu-central-1, eu-north-1, eu-south-1, eu-west-1, eu-west-2, eu-west-3, me-south-1, sa-east-1, us-east-1, us-east-2, us-west-1, us-west-2.

GovCloud

Note: We don’t yet publish layers to AWS GovCloud regions, so a layer ARN isn’t available there.

If you need to add our layer to your AWS GovCloud region, you’ll need to manually download our layer as a zip file, then publish the zip to your GovCloud region. This has to be done once per region.

Here’s how to do it:

  1. Download our layer zip for your function’s runtime, for example Python 3.7. Make sure your AWS profile is configured to match the region of the layer ARN.
aws lambda get-layer-version --layer-name arn:aws:lambda:us-west-2:451483290750:layer:NewRelicPython37 --version-number 35 | jq -r .Content.Location | xargs curl -o python-layer.zip
  2. Publish our zipped layer to your region.
aws lambda publish-layer-version --layer-name NewRelicPython37 --description "New Relic Lambda Python 3.7 layer" --compatible-runtimes python3.7 --zip-file fileb://python-layer.zip
  3. You can then add the layer to your function.

Either:

  • select it from the list of runtime compatible layers, or
  • copy the layer arn that was output to your console from the publish command

Layer Versions

See our docs for a description of our layers. You can find the latest layer versions here.

Java Layer

Our new Java layer can be used with Java functions to provide auto instrumentation.

Extension Layer

The Extension layer can be used with .NET and Go functions. It doesn’t include any instrumentation logic; it includes only the Extension for sending invocation telemetry to New Relic (bypassing CloudWatch).

If the function experiences a timeout, verify that all async functions are returning and not leading to unhandled exceptions. Confirm you’ve completed our handler setup. Our Extension processes each payload synchronously and must wait for either:

  • function execution to complete
  • function timeout

OpenTracing should be used to implement instrumentation logic for .NET functions utilizing our Extension layer; otherwise a timeout will occur. See our OpenTracing Agent for .NET as a reference.

Node.js and Python Layers

Note: Node 8 and Python 3.6 do not support the AWS Lambda Extensions API, so they default to using our legacy newrelic-log-ingestion function to send invocation payloads and logs.

The Node.js and Python layers include the Extension as well as instrumentation logic in the form of the Node.js and Python New Relic Agents.

Handlers

Handlers in our Node.js and Python layers: in the AWS Console, the handler in the function’s Runtime settings should be set to the handler in our layer. This kicks things off in the layer first. The layer then looks for the NEW_RELIC_LAMBDA_HANDLER environment variable, which points to the function’s actual handler. In this way the layer starts first in order to instrument the function. For example:
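A sketch for a Python function whose real handler lives at app.lambda_handler (the Node.js layer’s wrapper is named newrelic-lambda-wrapper.handler instead):

# Runtime settings handler: points AWS at the wrapper in our layer
Handler: newrelic_lambda_wrapper.handler

# Environment variable: points the wrapper at your actual handler
NEW_RELIC_LAMBDA_HANDLER: app.lambda_handler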

Part 2

(Part 1 in above comment)

New Relic Agents

For layers that include agents (Node.js, Python, Java, and Go), the work of instrumenting your function is drop-in easy. For .NET functions, our OpenTracing wrapper is still needed, which is more of a manual process. Work is being done to make .NET function instrumentation drop-in easy as well.

Java Instrumentation

Our new Java layer is here. The goal is to make Java instrumentation as drop-in easy as it currently is for Node.js and Python.

Node.js Instrumentation

To add custom instrumentation to your Node.js function using our Node.js layer, instantiate a newrelic object and use the Node Agent API.

Python Instrumentation

To add custom instrumentation to your Python function using our Python layer, import the newrelic.agent module and use the Python Agent API. For example:
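A minimal sketch (the attribute and trace APIs shown here vary slightly across agent versions; lookup_user is a hypothetical helper):

import newrelic.agent

@newrelic.agent.function_trace()  # records this helper as a span on the trace
def lookup_user(user_id):
    return {"id": user_id}  # stand-in for real work

def handler(event, context):
    # Attach a custom attribute to this invocation's transaction.
    newrelic.agent.add_custom_attribute("userId", event.get("userId", "unknown"))
    return lookup_user(event.get("userId", "unknown"))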

Go Instrumentation

Go Agent Repo, API, and Guide

Go Agent Examples

OpenTracing Agents

Java OpenTracing Wrapper

New Relic OpenTracing Agent

The following repositories define what we would call an agent, used to instrument your Java function.

Examples

Why we use OpenTracing instead of the Java agent for Lambda monitoring

One of our Java agent engineers explained it best:

AWS charges based on Lambda function execution time. Because of this, most people use lambdas in a manner where they only execute for a short period of time (maybe milliseconds or seconds) to manage costs. When you introduce the Java agent into the environment it might add several seconds to the Lambda startup time just to initialize. The overhead impact is increased on lambda cold starts.

That’s why we opted for a solution with less overhead based on OpenTracing.

.NET OpenTracing Wrapper

New Relic OpenTracing Agent

The following repository defines what is analogous to an agent, used to instrument your .NET function (.NET Core or C#).

Code notes:

Wrapper:
- SpanExtensions
  - SetException is where we add an Exception to the Log with specific attributes.
  - This does not use Tags.Errors.Set
- LambdaWrapper
  - AfterWrappedMethod is where we call SpanExtensions.SetException
Tracer:
- DataCollector
  - TransformSpans is what sends each span over to Errors to see if a log entry for an error exists.
  - PreparePayload is where the traces, events, and spans get serialized into JSON.
- Errors
  - Errors is where we look for log entries with the error attributes we set previously.
  - We build the ErrorEvent here.
  - We build the ErrorTrace here, with a stack trace if possible.

Current state:

  • Logs in Context is not supported for .NET Lambda: there is no GetLinkingMetadata equivalent like there is in the .NET agent, nor is there a way to send that data up even if it could be captured.
  • The .NET Lambda Tracer is not based on the .NET Agent.
  • Not all OpenTracing features are implemented or used by the LambdaTracer.
  • There is no NoticeError equivalent in the LambdaTracer.
  • Our SpanExtensions are currently scoped to internal, not public, which prevents calling into our existing exception-logging functions like SetException.

Examples

It’s a good idea to try implementing our .NET example function to get familiar with all the components.

Distributed Tracing in Lambda

See our distributed tracing example.

Node and Python

The agents in our Node.js and Python layers have been designed to work in common use cases, like API Gateway, and so they automatically check for and handle any New Relic or W3C Trace Context headers, which contain the traceId.

Go

This is a manual process for the most part, using the Go agent. AWS requires the use of a custom runtime for Go functions used with the Extension.

Java and .NET

Java and .NET implementations for distributed tracing are similar enough that one could use the Java example and convert it to .NET.

Check out our examples:

.NET uses the OpenTracing implementation, so it doesn’t automatically look for the newrelic header. You’ll need to manually instrument inject and extract methods for API Gateway. If our newrelic custom header is attached to the outgoing request, the newrelic header will need to be passed from one entity to the next while maintaining trace context along the way.

You’ll need to:

  • add the custom header
  • let that custom header through API Gateway

Note: If API Gateway is stripping out the ‘newrelic’ header, API Gateway can be configured to allow custom headers to pass through.

Incoming Headers

Agents accept both newrelic and W3C Trace Context headers and are designed to work with AWS API Gateway and AWS ALB. OpenTracing should be used with newrelic headers and does not automatically work with API Gateway or ALB.

W3C Trace Context

Here’s an example of how the W3C Trace Context can be broken out and understood from a header like event.request.headers:

traceparent: '00-123456789abcdef123456789abcdef12-123456789abcdef-01'

— traceparent —
version:         00
traceId:         123456789abcdef123456789abcdef12
parentId:        123456789abcdef
flags (sampled): 01

tracestate: '1234567@nr=0-1-7654321-123456789-abcdef123456789----1607034919662'

— tracestate —
trust key:     1234567
version:       0
parentType:    1
accountId:     7654321
appId:         123456789
spanId:        abcdef123456789
transactionId: (empty)
sampled:       (empty)
priority:      (empty)
timestamp:     1607034919662

transactionId, sampled, and priority are not required; it’s normal for them to be omitted when the trace originates in a browser.
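A minimal Python sketch of splitting a traceparent header into those fields:

def parse_traceparent(header: str) -> dict:
    version, trace_id, parent_id, flags = header.split("-")
    return {
        "version": version,
        "traceId": trace_id,
        "parentId": parent_id,
        "sampled": flags == "01",  # strictly, sampled is the low bit of flags
    }

print(parse_traceparent("00-123456789abcdef123456789abcdef12-123456789abcdef-01"))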

New Relic Header

{
  "v":[0,1],                 // version [major, minor]
  "d":{
    "ty":"App",              // type
    "ac":"1234567",          // account id
    "ap":"123456789",        // app id
    "tr":"246813579abcdef",  // trace id
    "pr":1.42749,            // priority
    "sa":true,               // sampled
    "ti":1605197585495,      // time
    "tk":"1234567",          // trusted account key
    "tx":"abcdef987654321",  // transaction guid
    "id":"139742685abcdef"   // span guid
  }
}

Outgoing Headers

Trace context (tracestate and traceparent) must be maintained and passed along to any outgoing requests.
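The Node.js and Python agents do this for you in common cases; where you’re propagating manually (for example with OpenTracing), a rough Python sketch of copying the incoming headers onto an outgoing call looks like this:

import urllib.request

TRACE_HEADERS = {"traceparent", "tracestate", "newrelic"}

def call_downstream(event, url):
    # Copy trace-context headers from the incoming event onto the
    # outgoing request so the distributed trace stays connected.
    headers = {k: v for k, v in (event.get("headers") or {}).items()
               if k.lower() in TRACE_HEADERS}
    request = urllib.request.Request(url, headers=headers)
    return urllib.request.urlopen(request)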

Linking Logs to Traces (Logs in Context)

The agent API or OpenTracing API needs to be used to link logs to traces by traceId and spanId. This is not yet implemented for Lambda and may not show up in the Lambda UI. Once complete, this feature will be handled by our Extension in the layer, which currently just sends function logs to New Relic without linking them to traces.

Examples from the Go Agent API:

Understanding CloudWatch Logs

Platform Logs

We have a log server that:

  1. Subscribes to platform logs via the AWS Logs API to get the START, END, and REPORT metrics.
  2. Provides an easy way for us to forward function logs to New Relic.

The logger looks for a sandbox URI that acts like localhost. It’s a name that resolves within the Lambda container, which we listen on to receive metrics from the Lambda platform.

Occasionally, you may see a line that says:

[NR_EXT] Failed to add platform log for request 12345678-1234-4321-abcd-123456789abc

This happens when the platformLog.Content is nil and can be safely ignored.

One of our Lambda engineers describes the logic behind it this way:

That’s essentially expected.
When we do a harvest because it’s the first invocation, or the first in a while, we remove telemetry that hasn’t been decorated with its platform logs from the buffer.
Later on (during the next invocation, pretty much always) the platform logs arrive, and we can’t find a telemetry payload to add them to.
This avoids the scenario where telemetry on single/occasional invocations could be delayed.

Better to send what we have quickly, even if no platform logs can be included for that one-off invocation.