(First in our Troubleshooting Series)
We all love troubleshooting. Well, maybe not so much. In fact, it’s a pain in the butt and rarely fun for anyone, least of all the business folks whose system is down, losing revenue and reputation. And yet, astonishingly little has been done to improve this situation in the last decade. In fact, one can argue that dynamic distributed systems, Docker, Lambda, etc. are making it harder to keep things running and fix them when they break.
Yes, we are getting more & better metrics, logging is getting better organized, and we have decent APM tools. But all these just give us more data, often with less understanding, remarkably little context, and not much help in getting things fixed, faster. This is not really progress.
Given our decade of MSP experience running hundreds of the world’s largest systems (up to hundreds of millions of users each), we are very opinionated about troubleshooting, especially across dozens of technologies, architectures, clouds, and industries. We’ve seen a lot and would love to share how we approach troubleshooting.
Great troubleshooting combines great tools and great processes, including training, dry-runs, and good management. Many people talk about process and how to manage incidents, so this post focuses on what’s missing from industry tools - even today’s most modern toolsets fall woefully short of what we all really need in this brave new dynamic cloud and operations world.
To us, several key things are missing: System Models, History, Context, and Expert Systems.
Let’s take them in order.
System Models - Ops teams tend to think they understand their systems, but as they scale, rotate team members, and launch new products, this is rarely the case. DevOps makes this much worse, with microservices, daily deployments, Docker, Lambda, etc. And even if the team knows, their tools do not, so everyone and everything is left guessing about what is where, how it connects, and what depends on what. Not to mention how it’s all configured: no one ever seems to know that, nor the config history, differences, etc. (even if it's mostly stored in git as infrastructure-as-code).
What we really need are system models, i.e. how the system is structured, how it communicates, and what its dependencies are. Some tools try to tease this out by algorithms, while others like APM just show you the data. We prefer to couple discovery with just asking the people who built it, and then model the system so we know how it all fits together. Of course some things are dynamic and we want to detect those, too, but if the bulk of the system is determined in advance, things get much easier.
A real model includes all the ‘objects’ in the system: servers (physical or virtual), containers, operating systems, all the services like Apache or MySQL, plus all the cloud pieces like Subnets, IPs, and Security Groups, and then the managed and serverless services such as RDS and Lambda, along with third-party or external services. And a perfect model would have all the detailed config info for all these things, too; not just pretty diagrams.
On top of that, you need to include both the dependencies and something we never see, the Minimum Service List (MSL). The MSL defines how many instances of a service you need to have running. The idea is derived from the Minimum Equipment List (MEL) that defines what an aircraft needs to fly safely. The MSL tells us, for example, that we only need one of the three web servers we have, but that we do need the one and only MySQL write DB we have. This is very powerful info for alerting and escalating, especially at 3am.
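To make the MSL idea concrete, here is a minimal sketch in Python. The `Service` class, its field names, and the three status labels are our own illustrative assumptions, not part of any existing tool. It only shows how a per-service minimum could turn "an instance is down" into a decision about whether to wake someone up.

```python
from dataclasses import dataclass


@dataclass
class Service:
    """One modeled service (hypothetical structure, for illustration only)."""
    name: str
    total: int        # instances we normally run
    running: int      # instances currently healthy
    minimum: int      # MSL entry: instances required to stay in service
    depends_on: tuple = ()


def msl_status(service):
    """Classify a service against its Minimum Service List entry."""
    if service.running < service.minimum:
        return "critical"   # below the MSL: escalate now
    if service.running < service.total:
        return "degraded"   # lost redundancy, but still at or above the MSL
    return "ok"


# Two of three web servers are down, but the MSL only requires one.
web = Service("web", total=3, running=1, minimum=1, depends_on=("mysql-write",))
# The single MySQL write DB is down, and the MSL requires it.
db = Service("mysql-write", total=1, running=0, minimum=1)

print(web.name, msl_status(web))
print(db.name, msl_status(db))
```

With a model like this, losing two web servers at 3am can wait for the morning, while losing the only write DB cannot.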
When you model this way, it simplifies a lot of things and makes everything else far more accurate, easier to automate, communicate, and remediate.
History - Part of knowing your system is knowing its history. For most alerting or incident systems, this means showing you when the same alert happened before, and perhaps some info on how problems were resolved. But that’s not enough. You really need to know several types of history, including:
- Expanded Alert History - This includes any alerts and incidents that happened on this server/VM, on this service (such as MySQL), on other MySQL services and servers, and on the system in general. This hierarchy is important, especially if these alerts/issues were recent, as they provide useful input to the troubleshooting engineer in a complicated system. It often may not matter, but it surely can, and it’s important to be able to explore this as an inverted funnel, going backwards in time and ever larger in scope.
Real Alert History should also classify each alert along several axes: noisy vs. rare, hard vs. soft, and proactive vs. reactive. Noisy vs. rare is obvious. Hard alerts are certain, such as MySQL not running or out-of-RAM alerts. Soft alerts are much less certain, such as “server unreachable” or “no data received”, which may not be a real problem. Proactive alerts are things like low disk space, i.e. not yet a problem, while high 5xx errors on Apache are reactive, as bad things have already happened. More advanced systems can also indicate whether alerts are due to fixed thresholds or anomaly detection; the latter can be quite complex and is often best looked at in graph form with alert limits.
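As a rough illustration of the classification above, the sketch below tags an alert along the three axes and derives a triage hint. The `AlertClass` name, the hint strings, and the mapping from class to hint are assumptions made up for this example; a real system would tune them to its own alerts.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AlertClass:
    """Hypothetical classification of one alert type."""
    noisy: bool       # fires often (True) vs. rarely (False)
    hard: bool        # certain, e.g. process not running, vs. soft, e.g. no data received
    proactive: bool   # warns before impact vs. reports impact that already happened


def triage_hint(c):
    """Turn the classification into a first suggestion for the engineer."""
    if c.hard and not c.proactive:
        return "act now"             # certain, and damage is already happening
    if c.hard and c.proactive:
        return "schedule fix"        # certain, but not yet hurting
    if c.noisy:
        return "verify, likely noise"
    return "verify first"            # soft and rare: could be real


mysql_down = AlertClass(noisy=False, hard=True, proactive=False)
low_disk = AlertClass(noisy=False, hard=True, proactive=True)
no_data = AlertClass(noisy=True, hard=False, proactive=False)

for name, c in [("mysql_down", mysql_down), ("low_disk", low_disk), ("no_data", no_data)]:
    print(name, "->", triage_hint(c))
```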
- Event History - You need to know what’s going on, especially what has changed: reboots, logins, code deployments, config changes, and anything else that usually affects your systems, as these are often a source of problems. Few systems have any of this, but capturing and showing any part of these events can help a lot, especially when tied to the ‘who’ of the change, so you can find out even more.
We’ve not seen any good event trackers, but pushing as much to Slack as you can is a poor man’s way to get started, especially if you tag the events like #reboot, #deploy, etc. and keep the channel clear of comments and other noise. A great tracker, though, would build a Facebook-like timeline of everything that goes on in every system, including CloudTrail-like feeds, key syslog events, alerts, and as much else as you can get, with filter controls.
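A tagged timeline with filters does not need much machinery to prototype. The sketch below is one possible shape, with a hypothetical `Event` record and `recent_events` helper; the tags mirror the #reboot and #deploy style mentioned above.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Event:
    """One timeline entry (illustrative fields only)."""
    when: datetime
    tag: str       # e.g. "reboot", "deploy", "config", "login"
    target: str    # server or service affected
    who: str       # person or automation that made the change


def recent_events(events, incident_time, window, tags=None):
    """Return events in the window before the incident, newest first."""
    start = incident_time - window
    hits = [
        e for e in events
        if start <= e.when <= incident_time and (tags is None or e.tag in tags)
    ]
    return sorted(hits, key=lambda e: e.when, reverse=True)


incident = datetime(2024, 1, 10, 3, 0)
timeline = [
    Event(datetime(2024, 1, 9, 9, 0), "deploy", "web", "alice"),
    Event(datetime(2024, 1, 10, 2, 40), "config", "mysql-write", "bob"),
    Event(datetime(2024, 1, 10, 2, 55), "reboot", "web1", "autoscaler"),
]

for e in recent_events(timeline, incident, timedelta(hours=1)):
    print(e.when.isoformat(), f"#{e.tag}", e.target, e.who)
```

Here the one-hour window surfaces the reboot and the config change, each with its ‘who’, and leaves out the previous day's deploy.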
- Config History - Most teams have no idea how their systems are configured in detail, especially now that things are a mix of resources configured in code via Infrastructure-as-Code and long-running ‘pet’ servers which no one even remembers building. Nor do they have any idea how the configurations have changed over time, by whom, and why. Infrastructure-as-Code can help with this, but the history is very hard to see or access across services, or at scale.
They also have no way to determine if servers’ configs have drifted from the original design, or from each other in things like web farms. This lack of info can lead to endless problems, especially when hunting for (and changing) things during troubleshooting.
So a good system would have CMDB info at your fingertips for every part of every system, ideally pulled periodically from the real running systems, e.g. the OS, AWS APIs, config file parsers, etc., plus any Infrastructure-as-Code files such as Terraform or CloudFormation and base config files you might be running through Puppet, Chef, or Ansible.
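Drift between servers in a farm is one of the easier checks to sketch once configs are collected. The example below assumes each host's settings have already been parsed into a flat dictionary of strings (how you collect them is out of scope here), and the `find_drift` helper is a hypothetical name. It flags any host whose value for a setting differs from the most common value across the farm.

```python
from collections import Counter


def find_drift(configs):
    """configs maps host -> {setting: value}.

    Returns host -> {setting: (actual_value, majority_value)} for every
    setting where that host disagrees with the most common value.
    """
    keys = set().union(*(c.keys() for c in configs.values()))
    drift = {}
    for key in sorted(keys):
        values = [c.get(key) for c in configs.values()]
        majority, _ = Counter(values).most_common(1)[0]
        for host, c in configs.items():
            if c.get(key) != majority:
                drift.setdefault(host, {})[key] = (c.get(key), majority)
    return drift


# Example values are invented for illustration.
farm = {
    "web1": {"MaxClients": "150", "KeepAlive": "On"},
    "web2": {"MaxClients": "150", "KeepAlive": "On"},
    "web3": {"MaxClients": "256", "KeepAlive": "On"},
}

print(find_drift(farm))
```

Majority vote is only a stand-in for "as originally designed". Comparing against the Infrastructure-as-Code definition, where one exists, would be the stronger check.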
Context - What else has been going on with the system recently (and now) that could impact this incident? Examples include changes in basic load levels (requests/second), types of load (HTTP request size, which changes in DDoS attacks), and especially latency levels across the stack. Using the dependency info from the models above, tools can show the lowest-level services experiencing higher latency, so you know your slow web results are due to slow ELK lookups, which are probably due to high JVM heap usage. Of course, higher error levels in any part of the system also feed this type of analysis.
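The "lowest-level slow service" idea can be sketched as a walk down the dependency model. In the example below, the service names, the latency ratios (current latency divided by typical latency), and the 1.5 threshold are all invented for illustration, and the dependency graph is assumed to have no cycles.

```python
def slow_roots(service, depends_on, latency_ratio, threshold=1.5):
    """Return the deepest slow services reachable from `service`.

    depends_on: service -> list of services it calls (assumed acyclic).
    latency_ratio: service -> current latency / typical latency.
    """
    if latency_ratio.get(service, 1.0) < threshold:
        return set()            # this service looks normal; stop here
    deeper = set()
    for dep in depends_on.get(service, ()):
        deeper |= slow_roots(dep, depends_on, latency_ratio, threshold)
    # If no dependency is slow, this service is the lowest slow layer.
    return deeper or {service}


depends_on = {"web": ["api"], "api": ["search", "mysql"]}
latency_ratio = {"web": 2.0, "api": 2.2, "search": 3.0, "mysql": 1.0}

print(slow_roots("web", depends_on, latency_ratio))
```

Starting from the slow web tier, the walk ends at the search layer and skips MySQL, which is exactly the pointer an engineer wants before opening a dozen dashboards.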
Some of the better APM tools have some of this real-time traffic, e.g. New Relic can show SQL traffic to MySQL with some stats and filters. However, it's usually a challenge to see this across a system, especially over time and with any deviations from typical.
Expert Systems - Even if we know everything, we can’t know everything. We especially don’t have the time during urgent troubleshooting to check dozens of various facts and issues, to hunt for various config settings, nor quickly consider all the outlier possibilities.
But good Expert Systems can do all of this, including producing the very important Exclusion list: the possible causes the system can rule out so you don’t waste time on them (a serious problem afflicting junior and/or sleepy engineers).
Expert Systems can look at dozens of complex factors and rank possible causes, usually including lots of things you wouldn’t find or think about for quite a while. They can also provide next steps to move the incident forward, suggest temporary workarounds, etc. And when safe, they can quickly run auto-healing or auto-remediation (e.g. removing servers from load balancers).
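At its simplest, ranking causes and building an Exclusion list can be sketched as a set of rules scored against the current facts. Everything in the example below (the rule names, the fact keys, the scores, and the `diagnose` helper) is a made-up illustration of the shape of such a system, not a description of any real product.

```python
def diagnose(facts, rules):
    """Score each candidate cause; a score of zero or less rules it out.

    Returns (ranked_causes, excluded_causes).
    """
    ranked, excluded = [], []
    for cause, rule in rules.items():
        score = rule(facts)
        if score <= 0:
            excluded.append(cause)      # goes on the Exclusion list
        else:
            ranked.append((score, cause))
    ranked.sort(reverse=True)           # most likely cause first
    return [cause for _, cause in ranked], excluded


# Hypothetical rules: each maps the facts to a rough likelihood score.
rules = {
    "disk full": lambda f: 0.9 if f["disk_used_pct"] >= 95 else 0.0,
    "recent deploy": lambda f: 0.7 if f["minutes_since_deploy"] <= 30 else 0.0,
    "traffic spike": lambda f: min(f["requests_ratio"] - 1.0, 1.0),
}
facts = {"disk_used_pct": 40, "minutes_since_deploy": 12, "requests_ratio": 1.3}

likely, ruled_out = diagnose(facts, rules)
print("check first:", likely)
print("ruled out:", ruled_out)
```

Even a toy like this shows why the Exclusion list matters: the engineer is told up front not to spend time on disk space.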
More advanced Expert Systems can even suggest additional observations or tests, which can be done manually or run automatically to get more context and information, such as capturing a process list or the top IPs in an HTTP log right after an alert happens. These really help boost troubleshooting speed by providing timely and useful info when it’s needed. Ideally they are always a step ahead of you, and of course learn and improve over time based on actual results and causes.
Finally, all this should be supplied to engineering teams in a single dashboard with a master set of Facts tailored to this alert/incident, as what you need to know changes based on the problem you are trying to solve. So ideally you can see all the facts, plus the Expert System output, and links off to more detailed histories, data, etc. on tabs or related pages.
There’s a lot more, of course, but having the above would be a huge benefit to everyone involved in operations and reliability engineering.