Modern application development is being hit by a storm: the world is moving towards Microservices, and many companies are transforming even their existing monolithic applications towards a Microservices-based architecture. The Microservices architecture provides many advantages: agility and speed in releasing new functionality, flexibility in changing the implementation, independence between functional units, scalability, etc. However, these benefits come at a price in terms of management. The service management solution needs to deal with these dynamics, dependencies, and complexities to assure that the application is available and performing. Unless the management solution also shifts its paradigm, the application may behave worse than one built in the traditional fashion.
New approaches exist to provide a more resilient environment. Examples are implicit redundancy through application high availability and scalability, and fault tolerance using concepts like the circuit breaker pattern. While these concepts reduce the direct impact of an outage, they don’t relieve the DevOps team from detecting the incident and responding to it with a sense of urgency.
This post describes five key aspects of managing Microservices. These principles will assist the operations team in tackling the new “beast” that is Microservices. They also help developers think about the operational facets of their applications, as both developers and operations share a common goal: services that are robust and of high quality.
Aspect 1 – Operations
There are typical operational activities that need to be performed in a production environment: making sure that applications have the right capacity to cope with the load, are compliant with corporate or governmental policies, etc.
Some aspects of these tasks are simpler to achieve with the right support in place. In capacity management, for instance, Microservices can support elasticity, where infrastructures automatically scale up and down depending on the usage. The 12-factor app manifesto describes a methodology for building applications that can scale up without significant changes to tooling, architecture, or development practices.
While Microservices don’t have to be implemented using containers, we do see a huge benefit in leveraging both technologies together. Using containers, everything required for the software to run is packaged into isolated, lightweight bundles, making deployment and operations much simpler. Kubernetes is an open-source system for automating deployment, scaling, and management of containerized applications. Kubernetes provides functionality for most of the operational tasks:
- Placement of workload based on resource requirements
- Automated rollouts and rollbacks
- Service discovery and load balancing
- Horizontal Scaling
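As an illustrative sketch of how these tasks surface in practice, a minimal Kubernetes Deployment manifest might look as follows. All names and values here are hypothetical; the resource requests drive placement, the replica count drives horizontal scaling, and the rolling-update strategy drives automated rollouts (and rollbacks via `kubectl rollout undo`):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: catalog-service        # hypothetical Microservice
spec:
  replicas: 3                  # baseline for horizontal scaling
  selector:
    matchLabels:
      app: catalog-service
  strategy:
    type: RollingUpdate        # automated rollouts and rollbacks
  template:
    metadata:
      labels:
        app: catalog-service   # label used for service discovery / load balancing
    spec:
      containers:
      - name: catalog-service
        image: registry.example.com/catalog-service:1.0.0
        resources:
          requests:            # placement based on resource requirements
            cpu: "250m"
            memory: "128Mi"
          limits:
            cpu: "500m"
            memory: "256Mi"
```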
As you can see, many of the typical operational tasks can be taken care of through Kubernetes, which enables you to focus on additional operational activities.
Google, IBM, and Lyft recently announced Istio, an open platform to connect, manage, and secure Microservices. Istio provides an easy way to create a network of deployed services with load balancing, service-to-service authentication, monitoring, and more, without requiring any changes in service code. You add Istio support to services by deploying a special sidecar proxy throughout your environment that intercepts all network communication between Microservices; it is configured and managed using Istio’s control plane functionality. More information on Istio can be found at istio.io.
Checking compliance is one of those tasks: for instance, checking compliance with security advisories, corporate policies, or standards enforced by industry or government. It is best practice to perform these checks already during the development and testing stages. But many policies keep changing (a good example is a new security exposure becoming public), therefore it is recommended to perform these checks in production as well.
Another example is backup &amp; archiving. In order to protect from disasters and to meet regulatory needs, backups have to be performed regularly, based on RPO (recovery point objective, the maximum targeted period in which data might be lost) and RTO (recovery time objective, the targeted duration of time and a service level within which a business process must be restored after a disaster) objectives. A good Microservices design will externalize storage-related activities to explicit persistency services, so these tasks can be limited to the services that deal with persistency. Needless to say, you have to verify that the backups are consistent and usable.
Aspect 2 – Monitoring
We believe each Microservice needs to be monitored. Before thinking of, let alone implementing, a monitoring solution, it is important to define what to monitor. A guiding principle should be the experience of the user of this service, which could be a human (for front-facing services), or a system (for back-end services). Therefore the key metrics typically are availability, performance / response time / latency, and error rate. Synthetic transactions are performed – ideally from multiple locations – to ensure that relevant functions of each service are “exercised” and these key metrics are evaluated against expected results.
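As a minimal sketch of such a synthetic transaction (the service call and the latency threshold are hypothetical stand-ins), one check exercises a function of the service, measures latency, and evaluates availability and SLO conformance against expected results:

```python
import time

def check_service(call, slo_latency_ms=500):
    """Run one synthetic transaction and record the key metrics."""
    start = time.monotonic()
    try:
        call()                       # in practice: an HTTP request to the service
        error = False
    except Exception:
        error = True                 # failed calls count against the error rate
    latency_ms = (time.monotonic() - start) * 1000
    return {
        "available": not error,
        "latency_ms": latency_ms,
        "within_slo": (not error) and latency_ms <= slo_latency_ms,
    }

# Stub standing in for a real front-facing or back-end service call
result = check_service(lambda: time.sleep(0.01))
print(result["available"], result["within_slo"])
```

In a real deployment this check would run on a schedule from multiple locations, and the resulting metrics would feed the error-rate and latency dashboards.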
These metrics are drastically different from those of a typical monitoring solution looking at CPU, memory, and disk space. These parameters may still be monitored; however, with the move towards cloud-based operating models (IaaS, PaaS), their relevance for application owners will keep diminishing.
A best practice is to expose a HealthCheck API for each Microservice. As the developers know best what the critical resources and checks for their services are, it is only logical that they implement this HealthCheck themselves.
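A minimal sketch of such a HealthCheck, with hypothetical resource probes supplied by the developer, could aggregate the status of each critical dependency into one report:

```python
def health_check(checks):
    """Aggregate per-resource probes into one health report.

    `checks` maps a resource name to a zero-argument function that
    returns True when the resource is healthy; a raised exception
    counts as unhealthy.
    """
    details = {}
    for name, probe in checks.items():
        try:
            details[name] = bool(probe())
        except Exception:
            details[name] = False
    status = "UP" if all(details.values()) else "DOWN"
    return {"status": status, "details": details}

# Hypothetical checks for one service's critical resources
report = health_check({
    "database": lambda: True,      # e.g. run SELECT 1 against the DB
    "message_queue": lambda: True, # e.g. ping the broker
})
print(report["status"])  # → UP
```

An HTTP endpoint (e.g. `/health`) would simply serialize this report, so the monitoring system can poll it.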
Having mentioned Kubernetes in the Operations section above, we will mention Prometheus as an open-source monitoring framework that works hand-in-hand with Kubernetes. They provide some level of HealthCheck API support natively, easing the path for developers to take advantage of it.
Another important element to monitor is application logs, as they provide visibility into the behavior of a running app. A typical use case is the parsing and investigation of logs during the diagnostic analysis of an incident. Critical alerts may be exposed in logs as well, so a monitoring solution should look for these patterns and alert the operations team. Most of the time, these logs are streams; for instance, per the 12-factor app methodology, running processes write their event stream, unbuffered, to stdout. Although Microservices are only loosely coupled, they do depend on each other when it comes to providing their logic. They are developed in a polyglot fashion, and their execution is distributed and highly dynamic, so procedures need to be in place to aggregate logs in a central place and perform search and analysis from there.
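Following the 12-factor guidance above, a sketch of a service writing its event stream as structured lines to stdout (so an external aggregator can collect, search, and match alert patterns) might look like this; the service name and fields are hypothetical:

```python
import datetime
import json
import sys

def log_event(service, level, message, **fields):
    """Write one structured log line, unbuffered, to stdout.

    Structured (JSON) lines make central aggregation and pattern
    matching straightforward; the record is returned for reuse.
    """
    record = {
        "ts": datetime.datetime.now(datetime.timezone.utc).isoformat(),
        "service": service,
        "level": level,
        "message": message,
        **fields,
    }
    sys.stdout.write(json.dumps(record) + "\n")
    sys.stdout.flush()  # unbuffered, per 12-factor
    return record

# Hypothetical event an alerting rule could match on level == "ERROR"
log_event("order-service", "ERROR", "payment gateway timeout", order_id="42")
```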
A technique that helps stitch traces together is the use of correlation identifiers. Using correlation ids not only helps to identify the execution path for a given transaction; it also supports visualization that adds the context of time (for instance latency), the hierarchy of the services involved, and the serial or parallel nature of the process/task execution. OpenTracing is a vendor-neutral open standard for distributed tracing.
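As a small sketch of correlation-id propagation (all service names are hypothetical): the id is generated at the edge service and passed along every downstream call, so the recorded hops can later be stitched into the execution path for one transaction:

```python
import uuid

TRACE = []  # stand-in for a central trace store

def record_hop(service, correlation_id, downstream=()):
    """Record this hop, then propagate the same id to dependencies."""
    TRACE.append((correlation_id, service))
    for call in downstream:
        call(correlation_id)

def checkout(correlation_id=None):
    # Edge service: create the id if the caller did not supply one
    correlation_id = correlation_id or str(uuid.uuid4())
    record_hop("checkout", correlation_id, downstream=[payment, inventory])
    return correlation_id

def payment(correlation_id):
    record_hop("payment", correlation_id)

def inventory(correlation_id):
    record_hop("inventory", correlation_id)

cid = checkout()
path = [svc for c, svc in TRACE if c == cid]
print(path)  # → ['checkout', 'payment', 'inventory']
```

In a real system the id would travel in a request header (as standards like OpenTracing define), and each hop would also carry timestamps so latency per service can be visualized.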
Aspect 3 – Eventing and Alerting
The monitoring solution will detect problems with the services, but there can be plenty of them. In this meshed architecture, services depend on each other, so degraded performance in one service may result in cascading failures in each of the dependent services as well. In order to avoid people chasing symptoms rather than causes, an event management system will integrate alerts from various feeds – service monitoring, log monitoring, infrastructure monitoring – and attempt to correlate these events. In order to do so, topology information and deployment state information (i.e. how many instances of a particular service are currently running) is required.
As this information is changing rapidly, the data needs to be gathered at the time of the correlation. Traditional approaches like Configuration Management Databases (CMDBs) have a high risk of showing incomplete or stale information. This dynamic data, for instance topology information, should be retrieved directly from the (container) management system at the time of the detection and used for correlation.
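As an illustrative sketch (the topology and alerts are hypothetical), event correlation can walk the dependency graph, freshly retrieved from the container management system, to separate the likely cause from cascading symptoms:

```python
def correlate(alerts, depends_on):
    """Return the alerting services whose own dependencies are healthy.

    `depends_on` maps service -> list of services it calls. An alerting
    service whose dependency is also alerting is treated as a symptom;
    only alerts with no alerting dependency are surfaced as causes.
    """
    alerting = set(alerts)
    causes = []
    for svc in alerts:
        if not any(dep in alerting for dep in depends_on.get(svc, [])):
            causes.append(svc)
    return causes

# Topology as it would be read from the management system at detection time
topology = {
    "frontend": ["orders"],
    "orders": ["database"],
    "database": [],
}

# Cascading failure: all three services alert, but only one is the cause
print(correlate(["frontend", "orders", "database"], topology))  # → ['database']
```

Real correlation engines weigh many more signals (deployment state, instance counts, timing), but the principle of consulting live topology at correlation time is the same.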
The result of this correlation is a set of actionable alerts. Each alert should be associated with a runbook, so that the First Responder team knows how to respond and what mitigation action to perform. Ideally, these runbooks are codified in the form of scripts, so that the event management system can automate the execution and only surface new problems to a human.
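A sketch of codified runbooks, with hypothetical alert types and actions: known alerts trigger an automated mitigation script, while unknown ones are surfaced to a human:

```python
def restart_instance(alert):
    """Hypothetical codified runbook: restart the affected instance."""
    return f"restarted {alert['service']}"

RUNBOOKS = {
    "container_oom": restart_instance,  # alert type -> mitigation script
}

def handle_alert(alert, notify_human):
    """Automate known runbooks; escalate only new problems to a person."""
    runbook = RUNBOOKS.get(alert["type"])
    if runbook is not None:
        return runbook(alert)
    return notify_human(alert)

print(handle_alert({"type": "container_oom", "service": "orders"},
                   notify_human=lambda a: "paged first responder"))
# → restarted orders
print(handle_alert({"type": "unknown_spike", "service": "orders"},
                   notify_human=lambda a: "paged first responder"))
# → paged first responder
```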
We don’t want First Responders to stare at consoles. We rather notify them immediately upon receipt of a new alert. Notification can be done through various channels, for instance email, text message, or an alert in an instant messaging system. The notification system also chases people if a response is not taken within a defined SLA.
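The chasing logic can be sketched as follows (channels and SLA values are hypothetical): each time the acknowledgement SLA is missed, notification escalates to the next channel:

```python
def escalate(ack_after_s, sla_s=300, channels=("chat", "email", "text")):
    """Return every channel notified before the alert was acknowledged.

    The first channel is always notified; each missed SLA window
    escalates to the next channel in the list.
    """
    notified = [channels[0]]
    for channel in channels[1:]:
        if ack_after_s <= sla_s * len(notified):
            break  # acknowledged within the current window
        notified.append(channel)
    return notified

print(escalate(ack_after_s=100))  # → ['chat']
print(escalate(ack_after_s=700))  # → ['chat', 'email', 'text']
```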
Aspect 4 – Collaboration
Once the First Responders are notified about the incident, they will start with the diagnostics. The first step is to isolate the component at fault. Once it is isolated, the investigation continues to see what exactly happened and what can be done to restore the service as quickly as possible.
In an architecture where many services depend on each other, it is often required for many people to collaborate. As one of the key concepts of Microservices is to support multiple languages and even multiple platforms, the need to interact with subject matter experts, including developers, is only increasing. The term ChatOps describes this process, where people use an instant messaging communication platform to collaborate amongst SMEs. Through the ChatOps platform, all interaction is logged at a central place, and one can browse through this log to see what actions have been performed.
ChatOps is not limited to humans interacting with each other. Using bot technology, DevOps and Service Management tools can be integrated. Examples are a monitoring system that pushes a chart showing the response time distribution over the last 24 hours, or a deployment system that informs about the recent deployment tasks.
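A minimal ChatOps bot sketch (the commands and tool integrations are hypothetical): chat messages are dispatched to tool integrations, and every interaction is logged centrally so the diagnostic trail can be reviewed later:

```python
CHAT_LOG = []  # central log of all interactions, humans and bot alike

COMMANDS = {
    # hypothetical integrations with DevOps / Service Management tools
    "!latency": lambda arg: f"response-time chart for {arg} (last 24h)",
    "!deploys": lambda arg: f"recent deployments of {arg}",
}

def on_message(user, text):
    """Log every message; if it is a bot command, run the integration."""
    CHAT_LOG.append((user, text))
    cmd, _, arg = text.partition(" ")
    handler = COMMANDS.get(cmd)
    if handler is None:
        return None  # ordinary human-to-human chat
    reply = handler(arg)
    CHAT_LOG.append(("bot", reply))
    return reply

print(on_message("alice", "!latency orders-service"))
# → response-time chart for orders-service (last 24h)
```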
Another important aspect to expedite the restoration is improved visibility through dashboards. As the Microservices-based application is dynamic in nature – continued deployment, auto-scaling, dynamic instantiation, circuit breakers, etc. – having an accurate understanding of the application is a challenge. Dashboards help to visualize the topology, deployment activities and state, as well as the operational state showing availability and performance aspects. A dashboard should also visualize the key service indicators, from a user perspective.
Aspect 5 – Root-Cause Analysis
Through collaboration, the operations team eventually identifies the right mitigation and restores the service. So far the focus was to restore the service as quickly as possible. In order to prevent the incident from reappearing, the root cause has to be assessed. We recommend following the 5 Whys approach, as this method helps to go beyond the surface in identifying the issue that was ultimately responsible for the incident. It is absolutely key that this investigation is conducted in a blameless culture – only with this approach are people willing to share their insight and enable others to learn from the experience.
Once the root-cause is known, appropriate steps need to be performed to address it. This could range from changes to the application or its architecture, changes to the infrastructure or changes to the management system. Following the agile approach, these changes are put into the backlog and will be prioritized at the next sprint.
There continues to be a challenge that functional enhancements tend to be prioritized higher than these outage-related changes. For these changes to actually get implemented, companies take different approaches. Some companies make operations the responsibility of the DevOps teams. In this model, developers now have an intrinsic interest in addressing these reliability issues. Another approach is to establish a Site Reliability Engineering (SRE) practice. This team is empowered to address reliability issues by spending at least 50% of their time on engineering work. Examples are reducing toil through automation, or assisting the development team in implementing outage-related changes.
Balancing short-term tactical improvements with longer-term strategic implementations is an act that needs to be carefully managed.
This was a quick introduction to some of the management aspects of a Microservices landscape. If you want to learn more, please visit the Service Management section in the IBM Architecture Center. In addition to a Reference Architecture, you can also find implementations for managing Microservices using the Netflix OSS pattern as well as containerized Microservices using Kubernetes.
To request a briefing on Cloud Service Management and Operations, please contact me at firstname.lastname@example.org