As a full-stack web developer, I attended SRECon to expand my thinking about the reliability and observability of the services I develop. Here are my top 5 takeaways:
Casey Rosenthal’s talk, “The success in SRE is silent,” reminded us that while nobody thanks you for the incident that didn’t happen, you can still evaluate how the people around you are learning. First, check their reaction to the changes: thumbs up or thumbs down. Eventually, they will be able to tell that they’ve learned something. After that, you may notice shifts in behavior, such as asking for help setting up a monitor on Slack (where before, they might not have added a monitor). Finally, look for results: new things making it to production, such as the new monitor.
Alper Selcuk shared Microsoft’s response to the massive expansion in the use of Microsoft Teams within education at the beginning of the pandemic. One of their techniques for avoiding service blackouts was brownouts, such as no longer displaying the cursor locations of other users on a shared document, preloading fewer events on the calendar, and decreasing the quality of videos on conference calls. This allowed Microsoft to keep the services online while increasing capacity and optimizing the service for the new load level. What brownouts could be applied to your service if it were to experience a sudden increase in demand?
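To make the brownout idea concrete, here is a minimal sketch of load-based degradation. It is not Microsoft's implementation: the level names, load thresholds, and setting values are all hypothetical, chosen only to mirror the three examples above (cursor locations, calendar preloading, and video quality).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BrownoutLevel:
    """One degradation level and the feature settings it implies."""
    name: str
    min_load: float                 # fraction of capacity at which this level activates
    show_remote_cursors: bool       # display other users' cursors in shared documents
    calendar_events_preloaded: int  # how many calendar events to preload
    video_quality: str              # quality of conference-call video


# Hypothetical thresholds and settings, ordered from least to most degraded.
LEVELS = [
    BrownoutLevel("normal", 0.0, True, 50, "high"),
    BrownoutLevel("reduced", 0.7, True, 20, "medium"),
    BrownoutLevel("brownout", 0.9, False, 5, "low"),
]


def select_level(load: float) -> BrownoutLevel:
    """Return the most degraded level whose threshold the current load meets."""
    chosen = LEVELS[0]
    for level in LEVELS:
        if load >= level.min_load:
            chosen = level
    return chosen


if __name__ == "__main__":
    for load in (0.5, 0.75, 0.95):
        level = select_level(load)
        print(f"load={load:.2f} -> {level.name}, video={level.video_quality}")
```

Keeping the levels in a single table makes the trade-offs easy to review ahead of time, so nobody has to decide what to switch off in the middle of a demand spike.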
Victor Lei applied his skydiving experience to disaster recovery. In skydiving, there is a specific altitude at which you stop trying to fix your main parachute and decide what to do next. Then there is another altitude at which the skydiver automatically fails over to their backup parachute. Timeboxing is a technique for limiting the time spent testing a new idea or optimization, but it’s easy to lose track of time during a disaster. I’d like to see more guidelines for how long the on-call engineer should try to fix a problem before failing over to the backup or calling in additional support.
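One way such a guideline could be written down is as a pair of "decision altitudes" measured in elapsed incident time. The sketch below is only an illustration: the 15- and 30-minute thresholds and the action wording are assumptions of mine, not numbers from the talk.

```python
from datetime import timedelta

# Hypothetical decision points, analogous to the two altitudes in skydiving.
STOP_FIXING_AFTER = timedelta(minutes=15)  # stop fixing the "main parachute"
FAIL_OVER_AFTER = timedelta(minutes=30)    # fail over without further debate


def next_action(elapsed: timedelta) -> str:
    """Map time spent on an incident to the action the guideline calls for."""
    if elapsed >= FAIL_OVER_AFTER:
        return "fail over to the backup"
    if elapsed >= STOP_FIXING_AFTER:
        return "stop fixing and decide: escalate or prepare to fail over"
    return "keep troubleshooting"


if __name__ == "__main__":
    for minutes in (5, 20, 45):
        print(f"{minutes} min: {next_action(timedelta(minutes=minutes))}")
```

The point of fixing the thresholds in advance is that the on-call engineer only has to check the clock, not re-argue the decision under stress.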
Mattie Toia discussed emergent organizational failure. One point was forgetting how hard prioritization is, which can be helped by collaborating on mental models and making sharing and communication easy. Another was using incentives to replace dedication when the organization instead needs to demonstrate trust through actions. At the center of all five points was trust: how to build it, and recognizing that each member of the organization is complex and has their own views of the world and the organization.
Christina Yakomin explained how to use the scientific method to test the resilience of systems.
I look forward to helping each project I’m on continue to grow in features, reliability, and observability to weather the good times and the bad.
SRECon is an open-access conference. Videos of all the talks will be available for free from USENIX in the coming weeks.