How to effectively troubleshoot production issues
Troubleshooting is a critical skill for SREs, Application Support, and Production Services teams. With the growing adoption of cloud and microservices architectures, distributed systems have become more complex than ever, and troubleshooting major outages and incidents in production has become harder as a result. The good news is that, according to Google, troubleshooting is a skill that can be both learned and taught.
I was recently reading the "Effective Troubleshooting" chapter of the Google SRE book. For my own learning and understanding, I noted down some of the important points from this chapter and wanted to write them up.
The Google SRE book describes the following process for effectively troubleshooting a production problem.
Problem Report.
Every problem starts with a problem report, which might be an automated alert or a customer/user reporting the problem. Ideally, there should be a tool to store this problem information, such as a bug-tracking tool. This tool can also be a good starting point at which problem reporters can try self-diagnosing or self-repairing common issues on their own.
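As a rough illustration (not something prescribed by the book), a problem report stored in such a tool might capture at least the fields below; the field names and example values are my own assumptions:

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class ProblemReport:
    """Minimal sketch of a problem report stored in a bug-tracking tool."""
    summary: str                  # expected vs. actual behaviour
    reported_by: str              # alerting system or the affected user
    affected_service: str
    severity: str                 # e.g. "SEV1", "SEV2", ...
    steps_to_reproduce: str = ""  # how to trigger the problem, if known
    created_at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))

# Hypothetical example entry
report = ProblemReport(
    summary="Checkout API returns HTTP 500; expected HTTP 200",
    reported_by="automated alert: checkout-availability",
    affected_service="checkout-api",
    severity="SEV2",
)
print(report.summary)
```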
Triage.
Once you receive the problem report/incident, the next step is to figure out the severity of the problem, and your response should be appropriate to that severity. Your first instinct in a major outage may be to start troubleshooting and try to find a root cause as quickly as possible. IGNORE THAT INSTINCT! Instead, your action should be to make the system work for your customers as soon as possible. This may entail emergency options, such as diverting traffic from a broken cluster to a healthy one; stopping the bleeding should be your first priority. You aren't helping your users if the system dies while you're root-causing. Of course, an emphasis on rapid triage doesn't preclude taking steps to preserve evidence of what's going wrong, such as logs, to help with subsequent root-cause analysis.
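As a hedged sketch of what "stopping the bleeding" could look like in code, the snippet below drains traffic from a broken cluster by re-weighting a load balancer. The `LoadBalancerClient` class is a made-up stand-in for whatever traffic-management API your platform actually exposes:

```python
class LoadBalancerClient:
    """Assumed stand-in for your platform's traffic-management API."""
    def set_backend_weight(self, cluster: str, weight: int) -> None:
        print(f"setting weight of {cluster} to {weight}")

def divert_traffic(lb: LoadBalancerClient, broken: str, healthy: str) -> None:
    """Stop the bleeding: send traffic to the healthy cluster, drain the broken one."""
    lb.set_backend_weight(broken, weight=0)     # stop sending new requests
    lb.set_backend_weight(healthy, weight=100)  # absorb the diverted load
    # Leave the broken cluster running but drained, so its logs and state
    # remain available as evidence for root-cause analysis.

divert_traffic(LoadBalancerClient(), "cluster-a", "cluster-b")
```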
Examine.
We need to examine the different components of the system to see whether or not they are behaving correctly. Ideally, the monitoring system's metrics are a good place to start figuring out what's wrong. These metrics can be an effective way to understand the behavior of specific pieces of a system and find correlations that might suggest where the problem began. Logging is another invaluable tool; we may need to analyze system logs across one or many processes. Tracing requests through the whole stack provides a very powerful way to understand how a system is working. Another technique is exposing the current state of the system. Google has its own toolbox for this, but in my experience this can also be done using various APM tools. These tools can show us the current state of the system and how different components or servers communicate with each other through an architecture diagram. This workflow diagram shows the endpoints of each component of the system, which helps in identifying RPCs, database calls, error rates, latency in RPC calls, and so on. Finally, you may need to ask users to repeat their task once again so you can track the requests and responses flowing through the different components of your system.
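As a small, hedged illustration of tracing a request across the stack: if every service logs a shared request ID, you can stitch a request's path back together from the logs. The structured log format and the `request_id` field below are assumptions for the sake of the example, not something the SRE book prescribes:

```python
import json

# Assumed structured logs: one JSON object per line, each carrying a request_id.
raw_logs = [
    '{"ts": "2023-05-01T10:00:01Z", "service": "frontend", "request_id": "req-42", "status": 200, "latency_ms": 12}',
    '{"ts": "2023-05-01T10:00:01Z", "service": "checkout", "request_id": "req-42", "status": 500, "latency_ms": 950}',
    '{"ts": "2023-05-01T10:00:02Z", "service": "frontend", "request_id": "req-43", "status": 200, "latency_ms": 10}',
]

def trace_request(lines, request_id):
    """Collect every log entry for one request, ordered by timestamp."""
    entries = [json.loads(line) for line in lines]
    return sorted((e for e in entries if e["request_id"] == request_id),
                  key=lambda e: e["ts"])

for entry in trace_request(raw_logs, "req-42"):
    print(entry["service"], entry["status"], f'{entry["latency_ms"]} ms')
```

Here the slow, failing hop (checkout, 500, 950 ms) stands out immediately once the entries are grouped by request.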
Diagnose.
Dividing and conquering is a very useful general-purpose technique. In a multi-layer system, it's often best to start systematically from one end of the stack and work toward the other end, examining each component in turn. We can look at the connections between components, or equivalently at the data flowing between them, to determine whether a given component is working properly. Having a solid, reproducible test case makes debugging much faster, and it may be possible to test the case in a non-production environment. Recent changes to a system can be a productive place to start identifying what's going wrong; well-defined systems should have effective production logging to track new version deployments and configuration changes at all layers of the stack.
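The divide-and-conquer idea can be sketched as a simple bisection over an ordered chain of components: check the midpoint, then keep narrowing toward the failing half. The component names and the `is_healthy` check below are purely illustrative assumptions:

```python
def first_broken_component(components, is_healthy):
    """Bisect an ordered chain of components to find the first unhealthy one.

    Assumes a linear data flow: everything upstream of the fault reports
    healthy, everything from the fault onward may look broken. That is
    what makes half-splitting valid here.
    """
    lo, hi = 0, len(components) - 1
    while lo < hi:
        mid = (lo + hi) // 2
        if is_healthy(components[mid]):
            lo = mid + 1   # the fault is further down the stack
        else:
            hi = mid       # the fault is here or earlier
    return components[lo]

# Illustrative chain: a request passes through these layers in order.
chain = ["load-balancer", "frontend", "auth", "checkout", "database"]
healthy = {"load-balancer", "frontend", "auth"}   # pretend checkout is broken
print(first_broken_component(chain, lambda c: c in healthy))  # -> "checkout"
```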
Test and Treat.
Once you come up with a short list of possible causes, it's time to find out which factor is at the root of the actual problem. Using the experimental method, we can try to rule in or rule out each hypothesis. Take clear notes of what ideas you had, which tests you ran, and the results you saw. Negative results should not be ignored or discounted.
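A lightweight way to keep those notes, sketched below with field names of my own choosing, is to record each hypothesis alongside the test you ran and its outcome, including the negative results:

```python
from dataclasses import dataclass

@dataclass
class HypothesisTest:
    """One entry in a troubleshooting log: idea, experiment, and outcome."""
    hypothesis: str
    test: str
    result: str
    ruled_out: bool

log = [
    HypothesisTest(
        hypothesis="Checkout errors were caused by last night's config push",
        test="Rolled back the config in staging and replayed traffic",
        result="Errors persisted after rollback",
        ruled_out=True,   # a negative result is still worth recording
    ),
    HypothesisTest(
        hypothesis="Database connection pool is exhausted",
        test="Checked pool metrics during the error spike",
        result="Pool at 100% utilisation, queries queueing",
        ruled_out=False,
    ),
]

for entry in log:
    status = "ruled out" if entry.ruled_out else "still plausible"
    print(f"[{status}] {entry.hypothesis}")
```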
Cure.
Ideally, you have now narrowed the set of possible causes down to one. Next, we would like to prove that it is the actual cause. This can be difficult to do in production systems; often, we can only find probable causal factors, because systems are complex and it is quite likely that there are multiple factors, each of which individually is not the cause but which taken jointly are. Having a non-production environment can mitigate these challenges, since reproducing the problem in a live production system is often impractical.
Once you have found the factors that caused the problem, it's time to write up notes on what went wrong with the system, how you tracked down the problem, how you fixed it, and how to prevent it from happening again. In other words, you need to write the postmortem report.
Reference — https://sre.google/sre-book/effective-troubleshooting/