Brian is a Site Reliability Engineer at Twitter, where he supports internal platform security and Direct Messaging. He has also worked at Pinterest and Facebook on deployment, monitoring, and remediation tooling, primarily using Python.
Using a real production war story, this talk will highlight some of the thinking, techniques, and approaches behind troubleshooting production Python at scale.
Problems with single hosts are challenging enough. Scaling up to hundreds or thousands of running hosts only multiplies the problems. However, production issues at scale can also be much easier to troubleshoot and remediate than issues on smaller installations.
Services written in Python can be more prone to certain problems, and they lend themselves to certain solutions as well. In this talk, we will explore a real production issue or two around a Python application to highlight some sound techniques and approaches to handling services at scale.
While working through the narrative of the problem, we will explore some specific how-tos, from simple ones such as reading logs to more complicated ones pertaining to the overall state of the runtime environment.
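As a rough illustration of the simple end of that range, reading logs at scale often starts with counting which exception types appear across many log lines. The sketch below is not taken from the talk: the function name `count_exceptions`, the regular expression, and the sample log lines are all assumptions made for illustration, and real log formats will likely need a different pattern.

```python
import re
from collections import Counter

# Assumed pattern: the last line of a Python traceback, such as
# "KeyError: 'user_id'". Only names ending in Error or Exception match.
EXC_LINE = re.compile(r"^(\w[\w.]*(?:Error|Exception))(?::\s|$)")


def count_exceptions(lines):
    """Count exception type names found at the start of log lines."""
    counts = Counter()
    for line in lines:
        match = EXC_LINE.match(line)
        if match:
            counts[match.group(1)] += 1
    return counts


# Hypothetical log lines, e.g. gathered from several hosts.
sample = [
    "INFO request ok",
    "Traceback (most recent call last):",
    '  File "app.py", line 10, in handle',
    "KeyError: 'user_id'",
    "Traceback (most recent call last):",
    "TimeoutError: timed out",
    "KeyError: 'session'",
]

for name, count in count_exceptions(sample).most_common():
    print(name, count)
```

Sorting by frequency makes it easier to see whether one failure dominates across hosts or whether several unrelated problems are occurring at once.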