Over the years we have provided many enhancements that reduce the events generated by the IT landscape to the genuine ones: best practices are applied, monitors become smarter through baselining, events are correlated through patterns and topology information, analytics are applied to event streams, and tickets are automatically generated and escalated. But at the end of the day, people still have to perform manual tasks on receipt of an event or incident. That is changing…
At Interconnect, we are announcing a beta for runbook automation as a new capability in our IT Service Management Portfolio.
Runbook Automation will allow clients to automatically respond to incidents based on defined procedures. There are multiple aspects that motivate the need for automating responses:
- Cost Savings – Reduction of manual effort.
Many clients have performed labor arbitrage by shifting call-center and Level-1 support tasks to low-cost regions. But the pressure to save costs continues, and new approaches to cost reduction have to be found. Automating repetitive tasks is one of them. Even if you don’t automate the entire resolution of a problem, automating just the repetitive tasks within it can already yield significant savings.
- Time Savings – Reduction of the overall time spent.
A slight but important variation of the first aspect. The execution of a workflow is frequently characterized by wait time, as the activity is handed over to a different person for a specific task. The task may take only minutes, but the flow is put on hold until that task is performed. Especially in incident management, automating the hand-over can significantly improve the MTTR (mean time to repair).
- Quality Improvement – consistent execution of defined procedures.
As the tasks are now automated, the defined steps are executed with no variation. This not only improves the quality of this one activity; it also removes the risk of a ripple effect down the road. How often is the consequence of a slight variation noticed only weeks or months after the fact, resulting in increased effort? Automation drives consistency, removing the risk of configuration drift.
- Risk Reduction – fewer opportunities for operator error.
Risk is another element addressed by automation. As the runbook is now very prescriptive, the risk of operators making mistakes is reduced. Full automation can reduce that risk even more drastically, as a direct interface for operators is no longer needed.
When consulting with clients, we have seen that while the desire is full end-to-end automation, the best practice still follows the Pareto principle. The last 20% takes a lot of effort, as all combinations and variations have to be considered and implemented. Frequently, the firm can gain more value by automating only 80% of the flow and then moving on to the next use case. And yes, there is low-hanging fruit. Our approach to Runbook Automation will respect this. In fact, we intend to support three types of runbooks:
- Manual Runbooks: The system provides prescriptive instructions on which tasks to perform. While the steps are performed manually, the system will track that the right steps are provided and executed in the correct sequence.
- Semi-automated Runbooks: Some of the steps in the runbook can be performed automatically, at the click of a button.
- Fully-automated Runbooks: A complete automation of all steps end-to-end.
These types of runbooks also support a staged adoption by clients. First, the organization defines the steps to be performed for a given instance. “Manual Runbooks” allow these clients to document the procedure and control adherence to it during execution. The second step is to automate the most repetitive steps, or the more complex steps. This is where “Semi-automated Runbooks” help. And finally, for the cases where full automation is required or motivated, and once clients have gained confidence and control, “Fully-automated Runbooks” will take over.
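To make the three types concrete, here is a minimal sketch in Python. It is not the product's data model or API; the names `Runbook`, `Step`, `RunbookType` and the `confirm` callback are hypothetical and chosen only for illustration. The idea it shows is that a runbook is an ordered list of steps, that each step either carries an automated action or is left to the operator, and that the type of the runbook follows from how many steps are automated. Execution is tracked in sequence in both cases.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Callable, List, Optional


class RunbookType(Enum):
    MANUAL = "manual"
    SEMI_AUTOMATED = "semi-automated"
    FULLY_AUTOMATED = "fully-automated"


@dataclass
class Step:
    instruction: str                             # text shown to the operator
    action: Optional[Callable[[], str]] = None   # present only for automated steps


@dataclass
class Runbook:
    # Hypothetical model for illustration, not the product's schema.
    name: str
    steps: List[Step]
    log: List[str] = field(default_factory=list)  # execution record, in order

    @property
    def kind(self) -> RunbookType:
        """Derive the runbook type from the share of automated steps."""
        automated = sum(1 for step in self.steps if step.action is not None)
        if automated == 0:
            return RunbookType.MANUAL
        if automated == len(self.steps):
            return RunbookType.FULLY_AUTOMATED
        return RunbookType.SEMI_AUTOMATED

    def run(self, confirm: Callable[[Step], bool]) -> bool:
        """Execute steps strictly in order; stop at the first unconfirmed manual step."""
        for number, step in enumerate(self.steps, start=1):
            if step.action is not None:
                result = step.action()
                self.log.append(f"{number} auto: {step.instruction} -> {result}")
            elif confirm(step):
                self.log.append(f"{number} manual: {step.instruction} -> confirmed")
            else:
                self.log.append(f"{number} manual: {step.instruction} -> not confirmed")
                return False
        return True


# Example: two automated steps and one manual step give a semi-automated runbook.
restart = Runbook(
    name="restart-web-server",
    steps=[
        Step("Check that the service is down", action=lambda: "service down"),
        Step("Notify the application owner"),
        Step("Restart the service", action=lambda: "restarted"),
    ],
)
print(restart.kind.value)  # prints: semi-automated
restart.run(confirm=lambda step: True)  # operator confirms every manual step
```

Moving a runbook from manual to semi-automated to fully-automated then amounts to attaching actions to more of its steps, which mirrors the staged adoption described above.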
The Runbook Automation service aims for natural integration with existing solutions. This could be Event Management (like Netcool OMNIbus), where runbooks become actions that operators can take. Likewise, helpdesk tools (like IBM Control Desk) can leverage these runbooks when performing Incident Management or Problem Management (as suggested by ITIL, the IT Infrastructure Library and de-facto standard for IT Service Management). And then there is a suite of more collaborative tools that support the initial problem determination and resolution process, frequently before an incident is raised. The latter can also be seen at clients that don’t focus on ITIL.
In closing, I want to highlight one other important factor in runbook automation. Don’t limit your thinking to resolving issues (“auto-correcting”). There is also significant value in performing problem determination (“auto-checking”), for instance verifying connectivity to dependent components or taking a snapshot of the system state. Just imagine the effort involved in taking a snapshot of what is going on: you need to locate the system (hostname, FQDN, IP address), get the credentials for the system, log on to a jump server and from there to the server, and then run the necessary commands (process list, memory and CPU utilization, version information, etc.). These steps can easily take 5 minutes to perform. If you multiply these 5 minutes by the number of times these steps are executed each year, you see the potential of automation in this area. The system could perform these tasks immediately on receipt of an event and provide the operator with the output of these commands. There are further positive side effects, such as the consistent, documented and timely invocation of these commands right after detection.
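As a rough illustration of “auto-checking”, the sketch below collects a small snapshot on the local machine and estimates the yearly effort that the manual version would consume. It is a simplified sketch under stated assumptions, not the product's implementation: the function names `collect_snapshot`, `command_check` and `annual_hours_saved` are hypothetical, the checks run locally rather than through a jump server, the `ps -e` command assumes a POSIX host, and the figure of 12,000 runs per year is an invented number used only to show the arithmetic.

```python
import platform
import socket
import subprocess
import time
from typing import Callable, Dict, List


def command_check(argv: List[str], timeout_s: float = 10.0) -> Callable[[], str]:
    """Wrap an operating-system command as a check that returns its standard output."""
    def run() -> str:
        done = subprocess.run(
            argv, capture_output=True, text=True, timeout=timeout_s, check=True
        )
        return done.stdout.strip()
    return run


def collect_snapshot(checks: Dict[str, Callable[[], str]]) -> Dict[str, str]:
    """Run every check and record its output; a failing check is recorded, not raised."""
    snapshot = {"collected_at": time.strftime("%Y-%m-%dT%H:%M:%S%z")}
    for name, check in checks.items():
        try:
            snapshot[name] = check()
        except Exception as error:  # keep going so one broken check does not hide the rest
            snapshot[name] = f"check failed: {error}"
    return snapshot


def annual_hours_saved(minutes_per_run: float, runs_per_year: int) -> float:
    """Manual effort per year, in hours, that an automated snapshot would replace."""
    return minutes_per_run * runs_per_year / 60.0


# Hypothetical set of checks; a real runbook would add memory, CPU and version commands.
DEFAULT_CHECKS: Dict[str, Callable[[], str]] = {
    "hostname": socket.gethostname,
    "os_version": platform.platform,
    "process_list": command_check(["ps", "-e"]),  # assumes a POSIX host
}

if __name__ == "__main__":
    for name, output in collect_snapshot(DEFAULT_CHECKS).items():
        print(f"{name}: {output[:60]}")
    # 5 minutes per snapshot, 12,000 snapshots per year (assumed volume)
    print(annual_hours_saved(5, 12_000))  # prints: 1000.0
```

Because every check is timestamped and a failure is recorded instead of aborting the run, the operator receives the same documented set of outputs for every event, which is the consistency benefit described above.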