Here is a detailed sneak peek into our oncall rotation structure.
We have an 8-member team split across two geographic locations in different time zones (4 at each site).
Each site provides 12x7 support (12 hours a day, 7 days a week) for the Applications/Infrastructure.
Each site has a primary oncall and a secondary oncall.
A pager group is defined for each site, with details of the rota.
Each member is oncall for a week at a time.
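A weekly rota for one site could be sketched roughly as below. This is only an illustration: the member names, the start date, and the rule that this week's secondary becomes next week's primary are assumptions, not a description of how our rota is actually generated.

```python
from datetime import date, timedelta


def build_rota(members, start_monday, weeks):
    """Return a list of (week_start, primary, secondary) tuples.

    Assumption: members rotate round-robin, and the secondary for a
    week becomes the primary for the following week.
    """
    rota = []
    for week in range(weeks):
        primary = members[week % len(members)]
        secondary = members[(week + 1) % len(members)]
        week_start = start_monday + timedelta(weeks=week)
        rota.append((week_start, primary, secondary))
    return rota


# Hypothetical names for the 4 members at one site.
site_members = ["alice", "bob", "carol", "dave"]
rota = build_rota(site_members, date(2024, 1, 1), 4)
for week_start, primary, secondary in rota:
    print(week_start, "primary:", primary, "secondary:", secondary)
```

With 4 members and two roles per week, this scheme puts each person on duty in 2 out of every 4 weeks.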
Primary Oncall Responsibilities :
1 Acknowledge alerts from monitoring scripts, job failures and incidents.
2 Work on Incidents/Changes.
3 Loop in the secondary oncall.
4 Work on low priority Incidents/Service Requests (if volume is high, assign them to the secondary).
Secondary Oncall Responsibilities :
1 Assist the primary oncall as and when requested.
2 Work on low priority Incidents/Service Requests if volumes are high for the day.
Advantages :
1 The rota is planned in advance, taking vacation and other leave into account.
2 No single colleague is over-burdened.
3 More hands on deck in case of major incidents.
4 Better coordination.
Disadvantages :
1 Oncall rotation frequency is high.
2 A bit strenuous when volumes are high.
3 Oncall gets extended if multiple team members are on leave.
Oncall hours distribution :
7 AM to 7 PM --> Onsite team (more Incident / Service Request related work)
7 PM to 7 AM --> Offsite team (more Change related work)
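The split above amounts to a simple time-of-day lookup. A minimal sketch, assuming the times are expressed in the onsite team's local time zone and that the 7 PM boundary itself belongs to the offsite team (the function name is made up for illustration):

```python
from datetime import time

ONSITE_START = time(7, 0)   # 7 AM
ONSITE_END = time(19, 0)    # 7 PM


def team_on_call(local_time):
    """Return which team covers the given time of day.

    Assumption: local_time is in the onsite team's time zone.
    """
    if ONSITE_START <= local_time < ONSITE_END:
        return "onsite"
    return "offsite"


print(team_on_call(time(10, 30)))
print(team_on_call(time(23, 15)))
```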
The oncall person can hand over duties to another available team member in case of personal work, appointments or exigencies.
In case of any incident, monitoring script alert or job failure, the paging application sends a message to the primary oncall, who needs to acknowledge it so that the page doesn’t go on to the secondary oncall or the manager.
On receiving a page, the primary oncall acts on the alert and takes whatever action is appropriate to resolve the issue. If the issue is major or would breach the SLA, the escalation matrix is informed in advance so that management can take the decisions needed to avoid penalties.
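The paging chain described above can be simulated in a few lines. This is a sketch under assumptions: the order secondary-then-manager and the idea of an acknowledgement timeout are my reading of how such paging tools typically behave, and `route_page` is a hypothetical helper, not part of any paging application.

```python
# Assumed escalation order; the real order is set in the paging tool.
ESCALATION_CHAIN = ["primary", "secondary", "manager"]


def route_page(acknowledged_by):
    """Return the list of roles that get paged, in order.

    acknowledged_by: set of roles that would acknowledge the page
    before the (assumed) timeout. Paging stops at the first role
    that acknowledges.
    """
    paged = []
    for role in ESCALATION_CHAIN:
        paged.append(role)
        if role in acknowledged_by:
            break
    return paged


print(route_page({"primary"}))    # primary acknowledges, nobody else is paged
print(route_page({"secondary"}))  # primary misses the page, secondary picks it up
print(route_page(set()))          # nobody acknowledges, page reaches the manager
```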
I am not sure if it’s an optimal model, but it has been working for us for the past 5+ years.