So we migrated to the cloud. We listed our products, reorganized, and restructured teams. We broke down the monolith and built microservices. We have been running on Kubernetes for the past few years, and we automated our builds and deployments with CI/CD pipelines. We have a lot to celebrate. Let me list some of the things that seem routine now but were very different a few years ago.
- We deploy during business hours. This used to be done at night, when there was not much traffic to our apps and services, because deployments could impact customers and, if something went wrong, cause outages. Now Kubernetes rolling updates make deployment a smooth process with no downtime for our apps and services (see the sketch after this list).
- Automated build and deployment pipelines let us deploy multiple times a day, versus once every two weeks, or even once a month.
- The ability to roll back automatically and quickly also gives us more confidence in our deployments.
- Disaster recovery capability and redundant data centers increase availability. Even if something goes wrong during a deployment to one data center, we can switch to the other.
- Small, incremental changes and deployments help developers keep a clear mental model of what changed and make impact analysis easier.
- Emergency requirements can be handled quickly by pushing out changes as soon as development work is done, without waiting on a deployment schedule or a manual deployment process. (We do not get these often, but there are good examples, such as government mandates during the pandemic.)
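To make the first bullet concrete, here is a minimal sketch of the rolling-update settings that enable zero-downtime deploys. The service name, image, and replica counts are illustrative, not our actual manifests.

```yaml
# Minimal sketch: a Deployment configured so a rollout never removes a
# serving pod before its replacement is ready. Names and values are
# illustrative only.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: checkin-api
spec:
  replicas: 6
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 0   # keep all existing pods serving during the rollout
      maxSurge: 1         # bring up one extra pod at a time
  selector:
    matchLabels:
      app: checkin-api
  template:
    metadata:
      labels:
        app: checkin-api
    spec:
      containers:
        - name: checkin-api
          image: checkin-api:1.0.0   # placeholder image tag
          readinessProbe:            # traffic only shifts to pods that pass this check
            httpGet:
              path: /health
              port: 8080
```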
These are all good and we love them. But now let's look at the challenges. The ability to build and deploy fast gives us a high deployment frequency. With a microservice architecture and many interconnected pieces, a small change on one side can impact another part of the system; a seemingly harmless change can cause the other side to collapse. And how much testing is enough? Unit tests, automated integration tests, automated functional tests: we do all of that, and we keep doing it, by the way. Software is just so critical. We cannot afford an outage of even a few minutes; it interrupts the business and impacts our customers.
(There is a saying that if you have never caused an outage, you are not a real software engineer.)
Okay. So we keep pushing changes to the system, and we can do that frequently. But how can we make the system more stable? Sure, we try to eliminate mistakes at every possible failure point: clear requirements, pair programming, code review, impact analysis, automated testing, ad hoc manual testing, acceptance tests, an automated CI/CD pipeline, an automated rollback pipeline, disaster recovery capabilities, rich monitoring and alerting, better communication and knowledge sharing, empowered teams, and so on. Is that enough? We know it never is; each of these always has room to improve. Let's be honest: software bugs can still happen, and we need to be prepared for the system to fail. My team has been on a journey to stabilize the system and enhance the overall process. We want to share our journey and invite the community to join us in finding ways to adopt best practices, patterns, and principles.
A quick word about my team: Passenger Checkin at American Airlines. As an airline, my company helps about half a million passengers travel each day, which is also the daily check-in count for my team. Our services get called about 1,000 times per minute. So suppose some failure occurs, whether from a seemingly harmless bug that slipped through all the checks or from an upstream code change that affected the passenger check-in flow. Then for every minute the issue is present, we have roughly 1,000 failed passenger check-ins, and five minutes means 5,000.
This is what we believe and what we expect: issues need to be self-reported; waiting for customers to report issues is not acceptable. Early detection is key to fixing an issue. A quick rollback of the deployment, or switching traffic over to the other data center, can be the resolution. Even though each of these takes only a few minutes, we want to improve on them to avoid impacting those 5,000 passengers. Travel can be stressful until you reach your destination; imagine how much more stressful it is for passengers when something in the software they depend on is not working.
We looked at the different aspects of our response to an incident: detection and communication. That means an automatic alerting system, telemetry and observability across the system, and time spent setting up all kinds of alerts. After a deployment, detection should not rely on a developer checking each dashboard; we have many APIs, and some errors may never show up on a dashboard.
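As one example, below is a hedged sketch of the kind of alert we mean, written in the style of a Prometheus alerting rule. The metric name, labels, and threshold are assumptions for illustration; our actual telemetry stack and rules differ in the details.

```yaml
# Illustrative Prometheus-style alerting rule: page when the 5xx rate of a
# hypothetical checkin-api exceeds 1% for two minutes after a deploy.
groups:
  - name: checkin-api-alerts
    rules:
      - alert: HighErrorRateAfterDeploy
        expr: |
          sum(rate(http_requests_total{app="checkin-api", code=~"5.."}[5m]))
            /
          sum(rate(http_requests_total{app="checkin-api"}[5m])) > 0.01
        for: 2m
        labels:
          severity: page
        annotations:
          summary: "checkin-api 5xx error rate above 1% for 2 minutes"
```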
The other thing we are trying to strengthen is impact analysis, from the start of development work and onward, so that there is a good strategy for what to check, test, and communicate. Since our APIs are called by multiple clients, any contract-breaking change will certainly break those clients. But even a seemingly harmless change, such as returning one value for a field instead of another (an additional enum value, for example), can also break clients.
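To illustrate, here is a hypothetical OpenAPI fragment, not our real contract: adding a new enum value looks backward compatible on paper, yet a client that handles the old values exhaustively can break the moment it receives the new one.

```yaml
# Hypothetical response schema fragment for illustration only.
components:
  schemas:
    CheckinStatus:
      type: string
      enum:
        - CHECKED_IN
        - NOT_CHECKED_IN
        - STANDBY   # newly added value; a client that only expects the first
                    # two values may fail when it receives this one
```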
The main strategy we are pursuing is to use Kubernetes deployment templates and Istio, together with GitHub Actions, to make incremental rollout of any change a routine process. The goal is to reduce the blast radius of any change: monitor in production and detect failures before they have a big impact on users. Concretely, the strategy is Istio weighted routing for canary deployments, with the whole flow automated.
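Here is a minimal sketch of what Istio weighted routing looks like for such a canary, assuming two subsets ("stable" and "canary") on a service named checkin-api. The names and weights are illustrative; the automation is what adjusts the weights step by step.

```yaml
# Illustrative Istio canary routing: 90% of traffic to the stable subset,
# 10% to the canary subset. The pipeline raises the canary weight in steps.
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: checkin-api
spec:
  host: checkin-api
  subsets:
    - name: stable
      labels:
        version: stable
    - name: canary
      labels:
        version: canary
---
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: checkin-api
spec:
  hosts:
    - checkin-api
  http:
    - route:
        - destination:
            host: checkin-api
            subset: stable
          weight: 90
        - destination:
            host: checkin-api
            subset: canary
          weight: 10
```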
During our research and experiments, we considered a few options. One was to do it at the Deployment/pod level: have the Service's selector match pods from both the old and the new Deployment, then scale the pods up for the new Deployment and down for the old one, so traffic gradually moves from old to new. The benefits are that both Deployments stay active, traffic can be shifted back almost instantly if needed, and the approach is simple because everything happens at the pod and Deployment layer. The drawback is that you cannot precisely control the rollout percentage; it is entirely determined by the ratio of pods running the new version to pods running the old one. A sketch of this option follows.
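The sketch below, with illustrative names, shows one Service selecting a shared label and two Deployments whose replica counts determine the traffic split.

```yaml
# Illustrative pod-level canary: the Service matches pods from both
# Deployments, so the rough traffic share for the new version is
# new replicas / (new replicas + old replicas).
apiVersion: v1
kind: Service
metadata:
  name: checkin-api
spec:
  selector:
    app: checkin-api        # matches pods from both Deployments below
  ports:
    - port: 80
      targetPort: 8080
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: checkin-api-old
spec:
  replicas: 8               # scale down as the new version proves out
  selector:
    matchLabels:
      app: checkin-api
      track: old
  template:
    metadata:
      labels:
        app: checkin-api
        track: old
    spec:
      containers:
        - name: checkin-api
          image: checkin-api:1.0.0
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: checkin-api-new
spec:
  replicas: 2               # ~20% of pods, so roughly 20% of traffic
  selector:
    matchLabels:
      app: checkin-api
      track: new
  template:
    metadata:
      labels:
        app: checkin-api
        track: new
    spec:
      containers:
        - name: checkin-api
          image: checkin-api:1.1.0
```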
As for routing traffic, most of our teams already use Istio as their service mesh solution and rely on it to route traffic to our APIs, so introducing weighted routing is just one step further. Automation is the key: once we figured out how to do it, we automated it into the pipeline. A reusable workflow makes the solution scalable and lets many teams adopt it easily.
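Here is a minimal sketch of what such a reusable GitHub Actions workflow could look like. The workflow name, inputs, and the weight-setting script are assumptions for illustration; cluster authentication and the validation gates between steps are omitted.

```yaml
# Illustrative reusable workflow: a caller passes the service name and the
# desired canary weight, and a hypothetical script applies the new weights.
name: canary-rollout
on:
  workflow_call:
    inputs:
      service:
        required: true
        type: string
      canary-weight:
        required: true
        type: number

jobs:
  shift-traffic:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Update Istio weights
        run: |
          # Hypothetical helper that templates the VirtualService with the
          # requested canary weight and applies it to the cluster.
          ./scripts/set-canary-weight.sh "${{ inputs.service }}" "${{ inputs.canary-weight }}"
```

A team's own pipeline can then call this workflow for each step of the rollout (for example, 10%, 25%, 50%, 100%), pausing between steps to check the alerts described earlier.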