Now in beta: Track your service level objectives

We’re happy to announce a new capability to track service levels for your applications on New Relic One!

This capability is available in beta to all full platform users. You can use it to create service level indicators and objectives for any entity type, and get a compact view of their compliance that the whole team can review together.

When you use it from an APM service, New Relic can automatically suggest typical service levels, adjusted to the last few weeks of data. You can also create your own service level indicators based on any NRDB event.
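For example, a custom latency SLI could be built from two NRQL counts over Transaction events, where the first query defines the valid events and the second the good ones (the application name and the 0.25 s threshold below are just placeholders):

SELECT count(*) FROM Transaction WHERE appName = 'Your Application'
SELECT count(*) FROM Transaction WHERE appName = 'Your Application' AND duration < 0.25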

Learn all about New Relic Service Levels on docs.newrelic.com.

Please remember that our Global Technical Support team does not offer direct support for beta releases. So don’t hesitate to post your questions and suggestions here on the Explorer’s Hub, or use the “Help us improve” button on New Relic One.

Cheers!

8 Likes

Do you have alerting capabilities on SLO compliance and error budget burn rates on the roadmap?
Any plans to make this feature (Service Levels) available to basic users in the future?

Thank you!

1 Like

Hello @gygabyte, thanks for the questions!

Alerts are on the roadmap, indeed.

May I ask your opinion on the SLO compliance alert? We’re not sure that it should follow the typical incident flow, so perhaps it could be a report you get once a day when you’re out of compliance? How do you envision it? Who would be the recipients?

Regarding the visibility to basic users, that’s not in the plans for the moment.

Cheers!

Thanks for your reply.

I am not too familiar with the incident concept in NR, but it seems to me that an SLO that is not met should be treated as a major event. Typically in SRE, the SLO is a critical KPI. The resolution of one of these incidents could take many forms (establishing a new SLO, an enhancement/fix ticket, etc.)… not the typical outage situation, I would agree.

Ideally, though, there should also be a capability to alert before going out of compliance, i.e. a threshold treated as a warning. I'm not sure exactly how that would translate into NR's current monitoring/alerting capabilities.

2 Likes

I'm trying to create Service Levels in a workload via Terraform. When I run terraform apply, I get the following error for a workload I just created:
Error: Could not validate account access to the following guids:
323xxxx:69746:MzIzODM2MnxOUjF8V09SS0xPQUxxxxxxx: Invalid entity guid
The first number is the account ID, which is valid.
The last field is the entity GUID of the workload, which is also valid.
I'm not sure what the number 69746 represents. It seems like a permission issue, but I can create these manually through the GUI without any problem.

Hello @Jerry.Johnson, thanks for reaching out and for using Service Levels! :slightly_smiling_face:

Based on the message, I suspect that the issue could be due to one of two factors.

The first could be that the GUID of the Workload you are trying to create the SLI for is not correct. Could you please confirm that in your Terraform resource the guid argument only contains the GUID of the Workload itself? From the message you shared, it seems that “guid” is a combination of an account ID, the SLI ID and the Workload GUID, separated by colons.

Please, see the official New Relic Terraform documentation for more details: registry.terraform.io
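For illustration, here is a minimal newrelic_service_level sketch (attribute names as in the provider documentation; the GUID, account ID, queries and threshold are placeholders). Note that the guid argument holds only the Workload's entity GUID:

resource "newrelic_service_level" "workload_latency" {
  guid        = "WORKLOAD_ENTITY_GUID" # only the Workload's entity GUID, nothing else
  name        = "Workload latency"
  description = "Proportion of requests served faster than the threshold"

  events {
    account_id = 1234567 # the account the Workload lives in (placeholder)
    valid_events {
      from = "Transaction"
    }
    good_events {
      from  = "Transaction"
      where = "duration < 0.25"
    }
  }

  objective {
    target = 95.0
    time_window {
      rolling {
        count = 7
        unit  = "DAY"
      }
    }
  }
}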

The other reason, as you pointed out, could be that the New Relic API key that you are using to create the Terraform resources doesn't have access to the account where the Workload lives. I'd suggest double-checking that if you get a chance. More details on docs.newrelic.com

Please, don’t hesitate to reach out if you require any further assistance.

2 Likes

Thanks for the quick response. In fact, I was using the ID instead of the GUID, so the issue is resolved.

1 Like

Hi,
Is there a way to reflect service level performance in the workload status? I.e., if SLOs aren't being met, can the workload status be changed to ‘disrupted’? I can't find a way of selecting Service Levels in the workload status calculation.
Cheers.

1 Like

Hello @steve.ohara , thank you for this question!

Right now the workload status can’t be derived from related SLOs. But this is something that makes total sense, and we definitely want to add this capability to workloads in the upcoming months. At that point, the status of a workload will be determined by
a) the related SLOs,
b) the roll up of child entities’ status, or
c) the static status set manually by the owner.

Here’s a question for you: when you think of the workload status coming from SLO data, would you expect that status to show the current SLO attainment (meaning, “is the service burning error budget too quickly now?”), or would you expect to see the compliance over the whole period?

Cheers

1 Like

Thanks,
I can see reasons for both, but I think the current SLO attainment would be most useful. If it were tied to the whole period, the workload status would be affected for a long time after any issues were fixed; e.g. with an SLO of 99% and a period of 7 days, an issue with the service could make the workload show as ‘disrupted’ for a week after less than 2 hours of not meeting the SLO. While the workload hasn't been meeting the linked SLO, anyone looking at it on the 7th day would think there's still an issue.
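(For reference: a 99% SLO over 7 days leaves roughly 0.01 × 168 h ≈ 1.7 h of error budget, which is where the “less than 2 hours” figure comes from.)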
Cheers

1 Like

Unlike custom dashboards, workloads are more streamlined for a specific real-time monitoring use-case. As a result, I would think that the current SLO attainment would be more relevant here.

1 Like

Thanks for sharing @steve.ohara and @Rishav, your thoughts are 100% aligned with our current thinking. Our approach is that the health status of a workload (or any other entity type) should indicate if there are ongoing issues, and we’ll find other ways to show whether the SLOs are out of compliance for the period.

This feature has a lot of potential. The docs discuss best practices for queries using the 95th percentile, but the actual queries in SLM don't allow it. They only appear to let you do count(*), which is not very useful for things like latency.
https://docs.newrelic.com/docs/service-level-management/create-slm/

Maybe I’m missing a way to create fully custom queries?

Hi, @stewart.b: According to the docs:

A latency SLI measures the proportion of valid requests that were served faster than the threshold established as a good experience.

To calculate the proportion of valid vs. invalid requests, use count(*). To calculate the 95th percentile (the threshold for your application), use the query builder:

To select an appropriate value for the duration condition, one typical practice is to select the 95 percentile duration of the responses for the last 7 or 15 days. Find this duration threshold using the query builder, and use it to determine what you consider to be good events for your SLI
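For example, assuming a Transaction-based SLI (substitute your own application name), the query-builder query might look like:

SELECT percentile(duration, 95) FROM Transaction WHERE appName = 'Your Application' SINCE 7 days ago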

Take the result of that query back to the service level tool, and plug it into the duration field.
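For instance, if that query returned roughly 0.25 seconds (a hypothetical value), the good-events condition generated by the tool would end up along the lines of:

SELECT count(*) FROM Transaction WHERE appName = 'Your Application' AND duration < 0.25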

Thanks @philweber - I get it now. This number will change over time, as business patterns cause deviations in the measurement (e.g. month end). Ideally it would allow for a dynamic percentile calculation in the query, but this will suffice for the beta. :slight_smile:

The challenge is that NRQL does not allow functions in a WHERE clause, so it’s impossible to say, WHERE duration < percentile(duration, 95). Unless/until that capability is added, we are stuck calculating the percentile elsewhere and plugging in a static value.

1 Like

Hello @stewart.b , thanks for sharing your thoughts on the beta, and thanks @philweber for clarifying how to use the current percentile!

Indeed, we know it's not easy to decide what a “good event” is (we experience that with our own services!), so we suggest using the current p95 as a baseline for the initial configuration, in accordance with a 95% SLO target.

For each service, only your teams will know if this baseline experience is good enough for your users, or if you actually need to improve the reliability/performance of any of them. In that case, you can set more ambitious conditions for good events, and/or more ambitious targets for your SLOs.

As you said, this is an iterative process, and SLIs/SLOs need to be revisited as services mature, throughput varies, new user flows are released, etc. So we will definitely consider the idea of helping recalculate a new baseline while editing an SLI.

Thanks!

Hi @philweber and @hpujol, thanks for your input, much appreciated.

If you wouldn't mind, could you clarify how to set an SLO at different percentile levels, like 95% or 99%? Or, if this is not supported at the moment, could a feature request be opened to enable this functionality, please?

Hi, @Rishav: As noted above, you must execute an NRQL query against your data to determine a static threshold value. So, to set the SLO to the 95th percentile, you would execute the following query in the query builder:

SELECT percentile(duration, 95) 
FROM Transaction 
WHERE appName = 'Your Application' 
SINCE 7 days ago

To get the 99th percentile, replace the SELECT with:

SELECT percentile(duration, 99)

Take the result of the above query and enter it into the duration field of your SLO, as shown above.

1 Like

Thanks, @philweber, though I don’t get how to implement this in the context of the SLO. Would you mind sharing a screenshot of the percentile query as a functional SLO, please?