Lambda Troubleshooting Framework - Troubleshooting Lambda Part 2

Part 1 of this guide is here

Environment Variables

Log Ingestion Function

The following environment variables are required for sending invocation payloads to New Relic:

LICENSE_KEY: YOUR_LICENSE_KEY
INFRA_ENABLED: true
LOGGING_ENABLED: true

A subscription filter (with an optional filter pattern) links the log output from your function invocations to the newrelic-log-ingestion function, which then forwards invocation payloads to our cloud-collector endpoint and, optionally, logs to the log-api endpoint.
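
To verify that the subscription filter is actually attached, you can list the filters on the function's CloudWatch log group. Here is a minimal sketch using the AWS SDK for JavaScript v3; the region and log group name are placeholders you would substitute:

// check-subscription-filter.js
// Lists the subscription filters on a Lambda function's log group so you can
// verify that one of them targets the newrelic-log-ingestion function.
const {
  CloudWatchLogsClient,
  DescribeSubscriptionFiltersCommand,
} = require("@aws-sdk/client-cloudwatch-logs");

const client = new CloudWatchLogsClient({ region: "us-east-1" }); // placeholder region

async function main() {
  const { subscriptionFilters } = await client.send(
    new DescribeSubscriptionFiltersCommand({
      logGroupName: "/aws/lambda/YOUR_FUNCTION_NAME", // placeholder log group
    })
  );
  for (const filter of subscriptionFilters || []) {
    console.log(filter.filterName, filter.filterPattern, filter.destinationArn);
  }
}

main().catch(console.error);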

Lambda Extension

The following environment variables can be used to configure the operation of the Extension in the layer that wraps your function.

NEW_RELIC_LAMBDA_EXTENSION_ENABLED: true
NEW_RELIC_LICENSE_KEY: YOUR_LICENSE_KEY
NEW_RELIC_LICENSE_KEY_SECRET: YOUR_SECRET_ID
NEW_RELIC_TELEMETRY_ENDPOINT: TELEMETRY_ENDPOINT_URI
NEW_RELIC_LOG_ENDPOINT: LOG_ENDPOINT_URI
NEW_RELIC_HARVEST_RIPE_MILLIS: 7_000
NEW_RELIC_HARVEST_ROT_MILLIS: 12_000
NEW_RELIC_EXTENSION_SEND_FUNCTION_LOGS: true
NEW_RELIC_EXTENSION_LOG_LEVEL: DEBUG

The Extension can output more detailed logs to CloudWatch by setting the environment variable NEW_RELIC_EXTENSION_LOG_LEVEL: DEBUG. The Extension does not send its own CloudWatch logs to New Relic; it only sends function logs. So you will not see “extension logs” that start with [NR_EXT] in New Relic, but they can be viewed in CloudWatch to assist in troubleshooting the Extension.

Node.js and Python Handler

NEW_RELIC_LAMBDA_HANDLER needs to point to your actual function handler; without it, the handler in our layer won't know which handler in your function to execute. These environment variables are documented in our legacy Lambda monitoring doc, which details the manual installation steps for Node.js and for Python.

NEW_RELIC_LAMBDA_HANDLER: YOUR_ACTUAL_HANDLER (required)
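
For example, with the Node.js layer the function's handler setting typically points at the layer's wrapper (newrelic-lambda-wrapper.handler) while NEW_RELIC_LAMBDA_HANDLER points back at your own code. A minimal sketch, assuming your handler is exported from index.js:

// index.js: your function code stays a normal handler.
// Function handler setting:   newrelic-lambda-wrapper.handler  (the layer's wrapper)
// NEW_RELIC_LAMBDA_HANDLER:   index.handler                    (this file and export)
exports.handler = async (event, context) => {
  // The wrapper in the layer loads and invokes this handler using the path above.
  return { statusCode: 200, body: "hello world!" };
};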

Distributed Tracing

Most languages need the following environment variables to be set on the function to enable distributed tracing.

NEW_RELIC_ACCOUNT_ID: YOUR_NEW_RELIC_ACCOUNT_ID
NEW_RELIC_TRUSTED_ACCOUNT_KEY: your parent account ID (or NEW_RELIC_ACCOUNT_ID if there is no parent account)

Node.js and Python additionally need the following environment variable to enable distributed tracing:

NEW_RELIC_DISTRIBUTED_TRACING_ENABLED: true

Java additionally needs the following environment variable to enable distributed tracing:

NEW_RELIC_PRIMARY_APPLICATION_ID: YOUR_NEW_RELIC_ACCOUNT_ID

.NET additionally needs the following environment variable to enable distributed tracing for SQS or SNS:

NEW_RELIC_USE_DT_WRAPPER: true

Node.js

These Node.js environment variables can be set on the Lambda function in the AWS Console or via SAM templates, CloudFormation templates, Terraform templates, etc. The Node.js agent logs can be set to trace level to get more detail in CloudWatch on how the agent is instrumenting your function.

NEW_RELIC_APP_NAME: YOUR_APP_NAME
NEW_RELIC_NO_CONFIG_FILE: true
NEW_RELIC_LOG_ENABLED: true
NEW_RELIC_LOG_LEVEL: trace
NEW_RELIC_LOG: stdout

Note: The Node.js agent automatically detects when it is running in an AWS environment, so it is recommended NOT to set the NEW_RELIC_SERVERLESS_MODE_ENABLED environment variable.
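
If you're not sure which of these variables the runtime actually sees, a quick troubleshooting handler (a sketch, not part of our instrumentation) can log them to CloudWatch, redacting anything that looks like a key or secret:

// Logs the New Relic-related environment variables visible to the runtime.
exports.handler = async () => {
  const nrVars = {};
  for (const [key, value] of Object.entries(process.env)) {
    if (!key.startsWith("NEW_RELIC")) continue;
    nrVars[key] = /KEY|SECRET/.test(key) ? "<redacted>" : value;
  }
  console.log("New Relic environment variables:", JSON.stringify(nrVars, null, 2));
  return "ok";
};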

Python

The following Python environment variables relate to Lambda functions. The Python agent logs can be set to debug level and recorded in CloudWatch by sending them to stderr.

NEW_RELIC_SERVERLESS_MODE_ENABLED: true
NEW_RELIC_LOG_LEVEL: debug
NEW_RELIC_LOG: stderr

Note: The Python agent does not automatically detect when it is running in an AWS environment, so it IS recommended to set the NEW_RELIC_SERVERLESS_MODE_ENABLED environment variable.

Java

Optionally, to enable debug logging for our OpenTracing Java Wrapper, set the following environment variable on the function:

NEW_RELIC_DEBUG: true

Go

Our Go agent doesn't require any environment variables to send agent logs to CloudWatch. Instead, logging is configured in code with a Go agent logger.

Common Invocation Issues

Integration Errors

If integration errors are occurring, you may notice missing invocations. You can check for a missing invocation by looking up its aws.requestId, which is visible in the function's CloudWatch logs.

SELECT * FROM Span WHERE aws.requestId = 'YOUR_REQUEST_ID' SINCE 1 day ago

In our internal downstream consumers of Lambda telemetry, if we receive a payload that has a problem (invalid JSON, etc.), we drop the request and create an NrIntegrationError event. This allows you to see the type of error and the UUID of the request as a way to debug what happened.

SELECT * FROM NrIntegrationError SINCE 1 week ago

For Infrastructure-related account errors for a particular integration name (linked account name), like ServiceAccessDenied:

SELECT count(*) as 'Number of errors', max(timestamp) as 'Last seen' FROM IntegrationError WHERE providerAccountName = 'YOUR_INTEGRATION_NAME' FACET dataSourceName, method, error SINCE 1 day ago

New Relic Support can additionally look up errors not exposed externally, for issues like lambdaIntegrationDisabled. If you are missing invocations but don't see anything reported in NrIntegrationError or IntegrationError, please contact New Relic Support and request a check of LambdaLogConsumerError in staging for AWS Lambda.

Lambda Timeouts

See Common Issues and Solutions.

Extension release 2.0.1 addresses some common timeout issues that have plagued earlier releases:

  1. Our Extension should no longer cause Lambda function timeouts when there is an HTTP timeout within the Extension, i.e. a failure to connect to our cloud-collector ingest endpoint. HTTP timeouts within the Extension are still possible, but they are now handled gracefully. The Extension now adheres to the Lambda function's deadlines whenever possible, caching payloads to be retried on the next invocation if necessary.
  2. In this way we can also avoid timeouts due to a misconfigured Extension.
  3. We now handle the case where a Lambda function sends a SIGKILL, shutting down gracefully rather than causing a timeout.

If your function is experiencing a timeout, it is recommended to update the layer to the latest version, which includes the latest Extension.

If a timeout still occurs after updating the layer:

  1. Try to narrow down the possible causes of the timeout. It’s a good idea to test with one of our example functions.

  2. If the timeout still happens with the example, we can rule out an issue with the function’s code. In this case it would more likely be an environmental issue or an issue with the New Relic layer itself.

  3. Check the function's timeout setting. The default value may be set too low for your function; one way to inspect the configuration is sketched after this list.
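
Here is a sketch of checking the timeout, memory, handler, and attached layers with the AWS SDK for JavaScript v3; the function name and region are placeholders:

// inspect-function-config.js
// Prints the configuration values most relevant to timeout troubleshooting.
const {
  LambdaClient,
  GetFunctionConfigurationCommand,
} = require("@aws-sdk/client-lambda");

const client = new LambdaClient({ region: "us-east-1" }); // placeholder region

async function main() {
  const cfg = await client.send(
    new GetFunctionConfigurationCommand({ FunctionName: "YOUR_FUNCTION_NAME" })
  );
  console.log("Timeout (s): ", cfg.Timeout);
  console.log("Memory (MB): ", cfg.MemorySize);
  console.log("Handler:     ", cfg.Handler);
  console.log("Layers:      ", (cfg.Layers || []).map((layer) => layer.Arn));
}

main().catch(console.error);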

Import Module Errors

After following the manual instructions to add our layer to your function, you may encounter the following errors.

[ERROR] Runtime.ImportModuleError: Unable to import module 'newrelic_lambda_wrapper': No module named 'newrelic_lambda_wrapper'
[ERROR] Runtime.ImportModuleError: Unable to import module 'lambda_function': No module named 'aiohttp'

The MODULE_NOT_FOUND error (Node.js) or ImportModuleError (Python) will occur if the layer has not been added to your function.
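
For Node.js functions, a quick check dropped into your own handler can confirm whether the agent shipped in the layer is resolvable at all (this sketch assumes the layer places the newrelic package on the Node.js module path):

// Logs whether the newrelic agent provided by the layer can be resolved.
// If the layer is not attached, require.resolve throws MODULE_NOT_FOUND.
exports.handler = async () => {
  try {
    console.log("newrelic agent resolved at:", require.resolve("newrelic"));
  } catch (err) {
    console.log("newrelic agent not found (is the layer attached?):", err.code);
  }
  return "ok";
};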

Console.log

Depending on the language of your function, there are different ways that you can log output to CloudWatch.

  • In the case of a timeout, it would be good to know precisely where the timeout is occurring in the invocation lifecycle.

  • In the case of excessive overhead after adding the layer, logging resource usage along the way can help determine what is contributing to the overhead.

Logging the start and end of each call within the function to the console can provide more context.

Node.js

// The options below assume the New Relic agent and fs can be required
const newrelic = require("newrelic")
const fs = require("fs")

// Option 1: catch uncaught exceptions and log the transaction state
process.on("uncaughtException", function (err) {
  console.log(err)
  console.log(newrelic.getTransaction())
  newrelic.noticeError(err)
})

// Option 2: with more detail about the origin of the exception
process.on("uncaughtException", (err, origin) => {
  fs.writeSync(process.stderr.fd, `Caught exception: ${err}\n` + `Exception origin: ${origin}`)
  console.log(newrelic.getTransaction())
  newrelic.noticeError(err, { origin })
})

// Option 3: with detail on resource (memory) usage
// myFunction refers to whichever function you want to tag the measurement with
const used = process.memoryUsage();
for (let key in used) {
  console.log(`function name: ${myFunction.name}, key: ${key}, value: ${Math.round(used[key] / 1024 / 1024 * 100) / 100} MB`);
}
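
A fourth option, useful for narrowing down timeouts, is to log timing checkpoints using the remaining execution time that Lambda exposes on the context object. A minimal sketch; doWork is a placeholder for your own logic:

// Option 4: log timing checkpoints to see where time is spent before a timeout
const doWork = async (event) => {
  // placeholder for your own logic (downstream HTTP calls, database queries, etc.)
};

exports.handler = async (event, context) => {
  console.log(`start: ${context.getRemainingTimeInMillis()} ms remaining`);
  await doWork(event);
  console.log(`end: ${context.getRemainingTimeInMillis()} ms remaining`);
  return "ok";
};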

.NET

using System.Diagnostics;

// Log the name of the calling method to show where you are in the invocation.
StackTrace stackTrace = new StackTrace();
Console.WriteLine(stackTrace.GetFrame(1).GetMethod().Name);

What To Do With CloudWatch Logs

CloudWatch logs can give us an idea of how the invocation is working. Of interest are log lines about invocation duration and timeouts, memory usage, function errors, agent logs that show what is being instrumented, and the status of the requests that send function logs and the invocation payload to New Relic.

Verify Common Log Entries

Extension method

Our AWS Lambda Extension outputs log lines like these for a cold start (with the log level set to DEBUG):

2021-02-05T13:15:11.309-08:00 START RequestId: 2c6ce1cf-c1f0-4591-b717-774acffabd85 Version: $LATEST
2021-02-05T13:15:11.354-08:00 LOGS Name: cloudwatch_lambda_agent State: Subscribed Types: [platform]
2021-02-05T13:15:11.543-08:00 [NR_EXT] New Relic Lambda Extension starting up
2021-02-05T13:15:11.544-08:00 [NR_EXT] Starting log server.
2021-02-05T13:15:11.545-08:00 LOGS Name: newrelic-lambda-extension State: Subscribed Types: [platform]
2021-02-05T13:15:11.545-08:00 EXTENSION Name: cloudwatch_lambda_agent State: Ready Events: [INVOKE,SHUTDOWN]
2021-02-05T13:15:11.545-08:00 EXTENSION Name: newrelic-lambda-extension State: Ready Events: [INVOKE,SHUTDOWN]
2021-02-05T13:15:11.547-08:00 hello world!
2021-02-05T13:15:12.098-08:00 [NR_EXT] Sent 1/1 New Relic payload batches with 1 log events successfully in 476.101ms (475ms to transmit 1.0kB).
2021-02-05T13:15:12.099-08:00 END RequestId: 2c6ce1cf-c1f0-4591-b717-774acffabd85
2021-02-05T13:15:12.099-08:00 REPORT RequestId: 2c6ce1cf-c1f0-4591-b717-774acffabd85 Duration: 552.75 ms Billed Duration: 917 ms Memory Size: 128 MB Max Memory Used: 77 MB Init Duration: 363.83 ms

And like this for subsequent invocations:

2021-02-05T13:15:12.920-08:00 START RequestId: 7c9c9601-8756-4010-82b5-9cb70025a85a Version: $LATEST
2021-02-05T13:15:12.924-08:00 hello world!
2021-02-05T13:15:13.043-08:00 END RequestId: 7c9c9601-8756-4010-82b5-9cb70025a85a
2021-02-05T13:15:13.043-08:00 REPORT RequestId: 7c9c9601-8756-4010-82b5-9cb70025a85a Duration: 119.71 ms Billed Duration: 120 ms Memory Size: 128 MB Max Memory Used: 79 MB

Legacy CloudWatch method

The newrelic-log-ingestion function has log lines like:

2021-02-05T12:35:38.508-08:00 START RequestId: a3eb03ab-689d-4f55-b631-cc57d87fd186 Version: $LATEST
2021-02-05T12:35:38.787-08:00 Log entry sent. Response code: 202. url: https://cloud-collector.newrelic.com/aws/v1
2021-02-05T12:35:38.806-08:00 Log entry sent. Response code: 202. url: https://log-api.newrelic.com/log/v1
2021-02-05T12:35:38.820-08:00 END RequestId: a3eb03ab-689d-4f55-b631-cc57d87fd186

The function's logs in CloudWatch will have a line containing NR_LAMBDA_MONITORING, which includes the base64-encoded payload and looks like:

| 2021-02-17 00:14:58.867 | START RequestId: 1234abcd-1234-abcd-5678-123456789abc Version: $LATEST
| 2021-02-17 00:14:58.949 | [2,"NR_LAMBDA_MONITORING",{"protocol_version":17,"execution_environment":"AWS_Lambda_python3.6","agent_version":"5.16.1.146","arn":"arn:aws:lambda:us-west-2:123456789:function:the-function-name","function_version":"7"},"abcd1234"]
| 2021-02-17 00:14:59.016 | END RequestId: 1234abcd-1234-abcd-5678-123456789abc
| 2021-02-17 00:14:59.016 | REPORT RequestId: 1234abcd-1234-abcd-5678-123456789abc Duration: 147.07 ms Billed Duration: 148 ms Memory Size: 1280 MB Max Memory Used: 219 MB

See how we’ve implemented the collection of the NR_LAMBDA_MONITORING payload for each function language.

Decode the Payload

Sometimes it can be useful to see what is included in the payload sent to New Relic. The payload can be accessed in CloudWatch and details the instrumentation of your function, including: metadata, metric_data, transaction_sample_data, span_event_data, analytic_event_data, errors, traceIds, etc.

When sending via CloudWatch with NEW_RELIC_LAMBDA_EXTENSION_ENABLED: false, an NR_LAMBDA_MONITORING payload will be seen in the CloudWatch logs for the function after each invocation. The value can be decoded for inspection with:

# Mac only with pbpaste
pbpaste | base64 -D | gunzip | jq . > payload.json; code payload.json

# Linux: base64 uses -d (replace pbpaste with your clipboard paste command, e.g. xclip -o)
pbpaste | base64 -d | gunzip | jq . > payload.json; code payload.json
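
If pbpaste and jq aren't available, the same decoding can be done with a small Node.js script. This is a sketch that assumes you have saved the full NR_LAMBDA_MONITORING JSON array from CloudWatch to a file; the last element of that array is the base64-encoded, gzipped payload:

// decode-nr-payload.js
// Usage: node decode-nr-payload.js payload-line.txt
const fs = require("fs");
const zlib = require("zlib");

const line = fs.readFileSync(process.argv[2], "utf8").trim();
const parts = JSON.parse(line);
const encoded = parts[parts.length - 1]; // base64-encoded, gzipped payload
const decoded = zlib.gunzipSync(Buffer.from(encoded, "base64")).toString("utf8");

// Pretty-print the decoded payload for inspection.
console.log(JSON.stringify(JSON.parse(decoded), null, 2));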

When using the Extension to send agent telemetry, the telemetry bytes (viewable in CloudWatch with the environment variable NEW_RELIC_EXTENSION_LOG_LEVEL: DEBUG) can be decoded with:

# Mac
cnt=$(pbpaste | base64 -D | jq length); pbpaste | base64 -D | jq -r .[$cnt-1] | base64 -D | gunzip | jq . > /tmp/payload.json; pbpaste | base64 -D | jq ".[$cnt-1] = $(cat /tmp/payload.json)" > payload.json; code payload.json

# Linux (base64 uses -d; replace pbpaste with your clipboard paste command)
cnt=$(pbpaste | base64 -d | jq length); pbpaste | base64 -d | jq -r .[$cnt-1] | base64 -d | gunzip | jq . > /tmp/payload.json; pbpaste | base64 -d | jq ".[$cnt-1] = $(cat /tmp/payload.json)" > payload.json; code payload.json

Troubleshooting Distributed Tracing

Explore our distributed tracing example.

Verify Spans and Parents

A root span is an entry span with no parent. If no root span exists, the trace will be orphaned; in other words, if every entry span has a parent, then no root span exists and the trace is considered orphaned.

The best way to find the spans for your trace is a NRQL query. The service map can also be used, but NRQL queries are the most accurate way to see each span and the entities involved in a trace, and queries can be added to a dashboard to show more than a service map can provide. Everything at New Relic revolves around NRDB, so NRQL is your friend.

Visualize DT Spans with NRQL

If you have the traceId, this will find all spans and associated entities, assuming there are APM apps and Lambda functions included in the trace. Sort by timestamp to see how the trace is being passed and received.

SELECT count(*),latest(nr.entityType),latest(aws.lambda.eventSource.eventType),latest(name),max(timestamp) FROM Span WHERE traceId = '123456789abcdef' FACET entity.name,parent.app SINCE 1 week ago LIMIT MAX

Headers

Determine what attributes are in the header:

console.log(event.request.headers)

The incoming request event will show the headers passed in from an upstream entity; exactly where they appear in the event depends on the event source.
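
For example, with an API Gateway proxy integration the headers live directly on the event, and the distributed tracing context arrives in the newrelic and/or W3C traceparent/tracestate headers. A minimal sketch (the event shape is an assumption; adjust it for your event source):

// Log all incoming headers, plus the distributed tracing headers specifically.
exports.handler = async (event) => {
  // Normalize header names to lower case since casing varies by event source.
  const headers = Object.fromEntries(
    Object.entries(event.headers || {}).map(([key, value]) => [key.toLowerCase(), value])
  );
  console.log("all headers:", JSON.stringify(headers));
  console.log("newrelic header:", headers.newrelic);
  console.log("traceparent:", headers.traceparent, "tracestate:", headers.tracestate);
  return { statusCode: 200, body: "ok" };
};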

Diagramming the Flow of Spans

For distributed tracing, it's often helpful to diagram what's happening and where root spans should originate. It helps to understand how many entities/functions are involved, what calls what, and whether a particular component needs special configuration, such as API Gateway allowing custom headers to pass through. The example below provides a format that works well in support tickets.

SQS
-> ALB
-> Lambda
    name1,
    Java OpenTracing implements TracingRequestStreamHandler,
    runtime version x,
    using Extension layer,
    shows as instrumented in New Relic UI
-> API Gateway
-> Lambda
    name2,
    Java OpenTracing implements StreamLambdaTracing,
    runtime version y,
    using NR_LAMBDA_MONITORING payload,
    not showing as instrumented in New Relic UI, just basic CloudWatch metrics

Serverless Plugin Errors

Invalid Credentials

This is not a New Relic error, but you may encounter it when using our plugin upon sls deploy. It happens when the Serverless Framework can’t find your AWS profile.

  ServerlessError: The security token included in the request is invalid.
      at ..\npm\node_modules\serverless\lib\plugins\aws\provider.js:1481:27
      at processTicksAndRejections (internal/process/task_queues.js:93:5)

By default, the Serverless Framework uses the credentials configured in the “default” profile. If you have a custom profile defined, make sure to tell the Serverless Framework to use it:

sls config credentials --profile YOUR_PROFILE_NAME --provider aws --key YOUR_KEY --secret YOUR_SECRET
export AWS_PROFILE="YOUR_PROFILE_NAME"

Hitting the EU Endpoint with a US License Key Bug

When performing the serverless deployment, you might see this error:

$ npx serverless@1.34.0 deploy -v --stage=dev --force
Serverless: Plugins: ["serverless-newrelic-lambda-layers"]
Serverless: Adding NewRelic layer to image-usages-forEdition
Serverless: Packaging service...
Serverless: Excluding development dependencies...
Fetch Error --------------------------------------------
invalid json response body at https://api.eu.newrelic.com/graphql reason: Unexpected token < in JSON at position 0

This bug has been fixed in v0.4.2 of the serverless-newrelic-lambda-layers plugin.

No Integration Found

------------------------------------------------------------
Serverless Deploy Errors:
Serverless: No New Relic AWS Lambda integration found for this New Relic linked account and aws account.
Serverless: Enabling New Relic integration for linked account: New Relic Lambda Integration - 123456 and aws account: ************.
Serverless: Error while creating the New Relic AWS Lambda cloud integration: Error: data.cloudLinkAccount missing in response.
------------------------------------------------------------

Errors in "newrelic-log-ingest lambda":

[NR_EXT] Failed to retrieve license key AccessDeniedException: User: arn:aws:sts::123456789:assumed-role/newrelic-log-ingestion-production-eu-west-1-lambdaRole/newrelic-log-ingestion is not authorized to perform: secretsmanager:GetSecretValue on resource: arn:aws:secretsmanager:eu-west-1:123456789:secret:NEW_RELIC_LICENSE_KEY-ABC123

Run the integrations install command to set up the AWS Secrets Manager secret and the role permission to read it:

newrelic-lambda integrations install -a NR_ACCOUNT_ID -r REGION -n LINKED_ACCOUNT_NAME -k YOUR_USER_API_KEY --no-aws-permissions-check

Alternatively, your CloudFormation template can grant the function permission to call secretsmanager:GetSecretValue on the license key secret:

          - Effect: Allow
            Action:
              - 'secretsmanager:GetSecretValue'
            Resource: !Ref LicenseKeySecret