
Calling Developers and SREs!

rfb
research-study-announcement

#1

:tada::tada: Calling all developers and SREs! :tada: :tada:

Are you on an on-call rotation? Do you sometimes get alerted at 2 am about production problems and have to troubleshoot the issue? If so, we would like to exclusively invite you to participate in a research study to learn more about how you troubleshoot and solve complex production problems today.

Your feedback will help us develop the next generation of modern software! Interested? Comment below or private message me, and I will follow up with you!


#2

I'm none of what you're looking for, but this kind of information would be invaluable to me. If I can help in any way, please let me know.

Also, I like SWAG.


#3

I work on an oncall rotation supporting applications that run 24x7 and have stringent availability SLA requirements.


#4

Here is a detailed sneak peek into our oncall rotation structure.

We have an 8-member team split across two geographic locations in different time zones (4 at each site).

Each site provides 12x7 support for the applications and infrastructure.
Each site has a primary oncall and a secondary oncall.
A pager group is defined for each site, with details of the rota.
Each member is on call for one week at a time.

Primary Oncall Responsibilities :

1 Acknowledge alerts, monitoring-script and job failures, and incidents.
2 Work on Incidents/Changes
3 Loop in secondary oncall
4 Work on low-priority Incidents/Service Requests (if volume is high, assign them to the secondary)

Secondary Oncall Responsibilities :

1 Assist Primary oncall as and when requested.
2 Work on Low Priority Incidents/ Service Requests if volumes are high for the day.

Pros -->
1 Pre-planned rota that accounts for vacation and other leave.
2 No single colleague is over-burdened.
3 Beneficial in case of major incidents with more hands on deck
4 Better co-ordination.

Cons -->

1 Oncall rotation frequency is high.
2 A bit strenuous when volumes are high.
3 Extended oncall if multiple team members are on leave.

Oncall hrs distribution -->

7 AM to 7 PM --> Onsite Team (more Incident / Service Request related work)

7 PM to 7 AM --> Offsite Team (more Change related work)
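
To make the split concrete, here is a minimal Python sketch of the hours distribution above. It assumes both windows are expressed in one shared reference clock and that the handover happens exactly at 7 AM and 7 PM; the function name `covering_team` is just for illustration and is not part of any tool we use.

```python
from datetime import time

# Assumption: both windows are expressed in the same reference clock.
ONSITE_START = time(7, 0)   # 7 AM
ONSITE_END = time(19, 0)    # 7 PM

def covering_team(reference_time: time) -> str:
    """Return which team is on call at the given reference-clock time."""
    # 7 AM (inclusive) to 7 PM (exclusive) belongs to the onsite team;
    # everything else, including the overnight wrap, is the offsite team.
    if ONSITE_START <= reference_time < ONSITE_END:
        return "onsite"
    return "offsite"
```

For example, `covering_team(time(2, 0))` falls in the overnight window and is routed to the offsite team.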

The oncall person can hand over their duties to another available team member in case of personal work, appointments, or exigencies.

In case of any incident, monitoring-script alert, or job failure, the paging application sends a message to the primary oncall, who needs to acknowledge it so that the page doesn't go to the secondary oncall or the manager.
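
The paging flow can be sketched roughly like this in Python. This is only an illustration, not how our paging application is implemented: the order secondary-then-manager is my assumption, and the `acknowledges` callback is a hypothetical stand-in for "this person acknowledged the page in time".

```python
from typing import Callable, Optional

# Assumed escalation order: the primary is paged first, then the
# secondary, then the manager.
ESCALATION_CHAIN = ["primary", "secondary", "manager"]

def page_until_acknowledged(
    alert: str,
    acknowledges: Callable[[str, str], bool],
) -> Optional[str]:
    """Page each role in order and return the first one that acknowledges.

    `acknowledges(role, alert)` is a hypothetical callback that returns
    True if that role acknowledged the page in time.
    """
    for role in ESCALATION_CHAIN:
        if acknowledges(role, alert):
            return role  # acknowledged, so the page goes no further
    return None  # nobody acknowledged the page
```

If the primary acknowledges, the loop stops at the first step and nobody else is paged.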

On receiving a page, the primary oncall acts on the alert and takes whatever action is appropriate to resolve the issue. If the issue is major or would breach the SLA, the escalation matrix is informed in advance so that management can make the appropriate decisions to avoid penalties.
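
As a rough sketch of that decision in Python: the rule below treats "would breach SLA" as "the estimated time to resolve is longer than the SLA time remaining". That interpretation and all the names are assumptions for illustration; in practice the oncall person makes this call by judgment.

```python
from datetime import timedelta

def should_inform_escalation_matrix(
    is_major: bool,
    estimated_resolution: timedelta,
    sla_remaining: timedelta,
) -> bool:
    """Decide whether to inform the escalation matrix in advance.

    Assumption: an issue "would breach SLA" when the estimated time to
    resolve it exceeds the time left before the SLA deadline.
    """
    would_breach_sla = estimated_resolution > sla_remaining
    return is_major or would_breach_sla
```

A minor issue expected to take one hour with four hours of SLA left would not trigger escalation under this rule, while any major issue would.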

I am not sure if it's the optimal model, but it has been working for us for the past 5+ years.