There certainly is a proliferation of *Ops: DevOps, ChatOps, GitOps, DataOps, ModelOps, MLOps. But there is one Ops concept that is quite special: HugOps. HugOps is a way to show empathy and appreciation for the people who operate the service: your SysAdmins, Site Reliability Engineers (SREs), Production Engineers, and Support Center staff.
While physical hugging may not be possible because of #covid19, or because you simply don’t feel comfortable hugging people, showing appreciation is not only possible; it is a must. Research has shown that Community Care (sharing the burdens together) can effect change and help people manage stress better. A tweet or Slack message with the #HugOps hashtag can go a long way in showing your empathy with the people in the fire.
A stressful calling
Working in operations is a stressful job. Businesses and customers depend on the services offered, and the costs of downtime are rising. Aberdeen estimates that while five years ago an outage cost about $260,000 per hour, costs now are likely greater than $1 million per hour. “Slow is the new Down.” Even if there isn’t an outage, slowness can affect the bottom line. A delay in website load time can hurt the conversion rate; mobile site visitors will leave a page that takes longer than three seconds to load, for example.
The importance of services is ever increasing, and so are their reliability requirements. Moving from 3 9s of availability (99.9%) to 4 9s (99.99%) cuts the allowed downtime of the service from roughly 43 minutes to roughly four minutes per month. Four minutes! A well-engineered operations function incorporates modern operations approaches (like SRE) on top of a reliable architecture. But still, bad things can happen. Carrying the weight of handling an incident is certainly an enormous source of stress.
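The arithmetic behind those downtime figures can be sketched in a few lines. This is a minimal illustration, assuming a 30-day month; the function name `allowed_downtime_minutes` is made up for this example.

```python
# Assumption: a 30-day month, i.e. 43,200 minutes.
MINUTES_PER_MONTH = 30 * 24 * 60


def allowed_downtime_minutes(availability_percent: float) -> float:
    """Return the monthly downtime allowance, in minutes, for an availability target."""
    unavailable_fraction = 1 - availability_percent / 100
    return MINUTES_PER_MONTH * unavailable_fraction


for target in (99.9, 99.99):
    minutes = allowed_downtime_minutes(target)
    print(f"{target}% availability allows about {minutes:.1f} minutes of downtime per month")
```

Each additional 9 divides the allowance by ten, which is why every step up in availability is so much harder to operate.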
It is important to recognize the different aspects of stress in this job. As with other emergency responders, three aspects of stress can be distinguished (see “Stress Management for Emergency Responders”):
- Day-to-day stress – Getting ready to respond to an interruption (powering up the computer, connecting to the system), coordinating daily life, etc.
- Critical incident stress – Performing the incident response for an incident in flight.
- Cumulative, chronic stress – Results from an accumulation of various stresses inherent in the job: repeating incident patterns, feeling helpless, feeling alone, etc.
As you can see, even small doses of stress add up, leading towards a risk of chronic stress.
A way out
We must find ways to tackle the impact of stress in this discipline. Social support from friends and family can help people get through stressful times. This is where #HugOps comes into play. Rather than just putting additional pressure on top of a stressful job, the technical community can show empathy and support to the people in the fire. By sharing the burdens and vitalizing our community, we can help the operations team cope with stress better and faster.
Having strong social ties helps us get through stressful times and lowers anxiety.
A culture of collaboration and support
Beyond the typical approaches of #HugOps (Tweets, sending food / sweets / swag), below are some thoughts on applying the objective of HugOps in the enterprise.
The main motivation is two-fold. First, operations is a team sport and a responsibility of everyone across the software development life-cycle, not just the team labelled ‘Operations.’ Second, there is psychological safety: the belief that you won’t be punished when you make a mistake (see “High-Performing Teams Need Psychological Safety”).
- Prevent burnout before (preparation), during, and after (stress relief) an incident:
  - Improve the on-call schedule and policy (e.g., no more than 2 incidents per shift).
  - Prepare for the job through exercises like “Wheel of Misfortune.”
  - Measure workload and staff appropriately. Don’t treat ops just as a cost play.
  - Provide the necessary training and technology to be well prepared for responding to incidents.
  - Offer relaxation and mindfulness services. Learn from other emergency response teams (such as emergency medical technicians, firefighters, and police).
- Collaboration. In the true spirit of DevOps, collaborate with your operations colleagues:
  - As a developer, actively support the operations team during the incident when they need help.
  - After the incident, actively participate in the blameless (!) post-incident review, treating the incident as a learning opportunity for operations as well as development.
  - Continuously improve the service by implementing the resulting non-functional actions, which in turn reduces technical debt.
  - Apply concepts such as ChatOps to enable collaboration across organizational boundaries.
- Shift the responsibility left. Don’t throw your operations colleagues under the bus by writing rogue code that lacks aspects such as reliability or observability:
  - Build better reliability and manageability into the software: twelve-factor, build for reliability, build to manage.
  - Instrument the code to provide richer observability.
  - Follow secure coding practices.
  - Apply modern principles like chaos engineering to validate the robustness of the service.
- Get in front of it:
  - Learn from incidents. If you don’t spend time analyzing the conditions that must exist for an incident to take place, you won’t learn how to remove or recover from these conditions in the future. Help each other learn.
  - Shift from reacting to avoiding: automation, observability, shift left, etc.
  - Give people time and tools to improve the service and the incident response. “Sharpening the Axe” has come to mean taking action to make yourself better at your job, both long term and for the task at hand: take a moment and think about how to go about the task in a smarter way.
  - When giving people time, do this in a mindful way, for instance through an on-call rotation scheme that differentiates between “interruptible” and “non-interruptible” work times.
  - Hold production readiness reviews to govern what gets into production. An interesting aspect of SRE is that the SRE team has the right to refuse support if the service doesn’t meet the requirements, switching to a “you build it – you run it” model.
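The on-call policy from the list above (no more than 2 incidents per shift) is easy to monitor once incidents are recorded per shift. The following is only a sketch under stated assumptions: the `Incident` record, its fields, and the sample data are hypothetical and not tied to any particular paging tool.

```python
from collections import Counter
from dataclasses import dataclass

# Example limit from the policy in the list above; tune it to your own team.
MAX_INCIDENTS_PER_SHIFT = 2


@dataclass(frozen=True)
class Incident:
    """A handled page. All fields are hypothetical, for illustration only."""
    shift_id: str
    responder: str


def overloaded_shifts(incidents, limit=MAX_INCIDENTS_PER_SHIFT):
    """Return {(shift_id, responder): incident_count} for shifts above the limit."""
    counts = Counter((i.shift_id, i.responder) for i in incidents)
    return {shift: n for shift, n in counts.items() if n > limit}


# Made-up sample data: one quiet shift and one shift with three incidents.
incidents = [
    Incident("mon-night", "alex"),
    Incident("tue-night", "sam"),
    Incident("tue-night", "sam"),
    Incident("tue-night", "sam"),
]

for (shift_id, responder), n in overloaded_shifts(incidents).items():
    print(f"{responder} handled {n} incidents during {shift_id}: time to rebalance")
```

A report like this is a prompt for a conversation about workload and staffing, not a performance metric for the responder.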
Not every service needs 5 9s of availability: Product Owners need to clearly negotiate and articulate the reliability targets of the service. Once these service level objectives (SLOs) are defined, appropriate measures need to be taken in architecture, implementation, and operations to support these targets.
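One common way in SRE practice to make an SLO actionable is an error budget: the share of requests that may fail before the target is missed. The article itself does not prescribe this technique, so treat the sketch below as an illustrative assumption; the function name and the request counts are made up.

```python
def error_budget_remaining(slo_percent: float, total_requests: int, failed_requests: int) -> float:
    """Return the unspent fraction of the error budget (negative means overspent)."""
    allowed_failures = total_requests * (1 - slo_percent / 100)
    if allowed_failures <= 0:
        raise ValueError("an SLO of 100% (or no traffic) leaves no error budget to track")
    return 1 - failed_requests / allowed_failures


# Made-up numbers: a 99.9% SLO over one million requests tolerates about 1,000 failures.
remaining = error_budget_remaining(slo_percent=99.9, total_requests=1_000_000, failed_requests=250)
print(f"Error budget remaining: {remaining:.0%}")
```

While budget remains, the team can ship changes with confidence; when it runs out, that is a signal to prioritize reliability work rather than to push harder on the people on call.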
Operations is a stressful job. The Community Care practices described in this article help reduce that stress by making teams better at preparing for, responding to, and learning from unforeseen scenarios. Sharing the burdens together, and expressing that visibly through #HugOps, creates a support system that removes the stigma of burdening others.
Stress and anxiety may still exist in your workspace, but there are simple ways to reduce the pressure you feel. These tips often involve getting your mind away from the source of stress. Self-awareness, exercise, stress-reduction techniques (such as mindfulness, meditation), music and hobbies (such as woodwork, photography, knitting) can all work to relieve anxiety — and they will improve your overall work-life balance as well. SAMHSA has some great resources on individual stress management planning for disaster response staff members.
People typically only notice the operations role when something is not working; successful operations tends to go unnoticed. Safety management should move from ensuring that ‘as few things as possible go wrong’ (so-called Safety-I) to ensuring that ‘as many things as possible go right.’ This perspective is called Safety-II, and it relates to the system’s ability to succeed under varying conditions (see “From Safety-I to Safety-II”). Applying these concepts will be the topic of a future blog post.
Links and References
#hugops in Practice: Operationalizing Empathy. David Shackelford (PagerDuty). DevOpsDays 2015. https://legacy.devopsdays.org/events/2015-detroit/proposals/hugops and https://www.pagerduty.com/blog/hugops-in-practice/
HugOps is the best ops. On empathy and site reliability engineering. James Governor (RedMonk). September 18, 2017. https://redmonk.com/jgovernor/2017/09/18/hugops-is-the-best-ops-on-empathy-and-site-reliability-engineering/
HugOps for Humans. From self-care to selfless-care. Nitya Narasimhan. October 04, 2019. https://speakerdeck.com/nitya/hugops-for-humans-self-care-to-selfless-care
Stress Management for Emergency Responders. Understanding Responder Stress. Dr. Leslie Snider (Antares Foundation). https://www2c.cdc.gov/podcasts/media/pdf/AntaresPgm1.pdf
Disaster Responder Stress Management. Substance Abuse and Mental Health Services Administration (SAMHSA). U.S. Department of Health & Human Services. https://www.samhsa.gov/dtac/dbhis-collections/disaster-response-template-toolkit/disaster-responder-stress-management
High-Performing Teams Need Psychological Safety. Here’s How to Create It. Laura Delizonna. Harvard Business Review. August 24, 2017. https://hbr.org/2017/08/high-performing-teams-need-psychological-safety-heres-how-to-create-it
From Safety-I to Safety-II: A Whitepaper. Erik Hollnagel (University of Southern Denmark). Robert L. Wears (University of Florida Health Science Center). Jeffrey Braithwaite (Australian Institute of Health Innovation). 2015. https://www.england.nhs.uk/signuptosafety/wp-content/uploads/sites/16/2015/10/safety-1-safety-2-whte-papr.pdf