
As a full-stack web developer, I attended SRECon to expand my thinking about the reliability and observability of the services I develop. Here are my top 5 takeaways:
Casey Rosenthal’s talk titled “The success in SRE is silent” reminded us that while nobody thanks you for the incident that didn’t happen, you can still evaluate how the people around you are learning. First, check their reaction to the changes: thumbs up or thumbs down. Then, over time, they will be able to gauge that they’ve learned something. After that, you may notice shifts in behavior, such as someone asking on Slack for help setting up a monitor where before they might not have added one at all. Finally, look for results: new things making it to production, such as that new monitor.
Alper Selcuk shared Microsoft’s response to the massive expansion in the use of Microsoft Teams within education at the beginning of the pandemic. One of their techniques for avoiding service blackouts was brownouts, such as no longer displaying the cursor locations of other users on a shared document, preloading fewer events on the calendar, and decreasing the quality of videos on conference calls. This allowed Microsoft to keep the services online while increasing capacity and optimizing the service for the new load level. What brownouts could be applied to your service if it were to experience a sudden increase in demand?
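As one way to think about that question, here is a minimal sketch of a brownout switch in Python. It is not how Microsoft implemented theirs; the `FeatureSettings` fields, the preload counts, and the 0.8 threshold are all hypothetical values chosen only to mirror the three examples from the talk.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FeatureSettings:
    """Hypothetical knobs mirroring the brownout examples from the talk."""
    show_remote_cursors: bool
    calendar_events_preloaded: int
    video_quality: str


# Assumed values for illustration only; a real service would tune these.
NORMAL = FeatureSettings(show_remote_cursors=True,
                         calendar_events_preloaded=50,
                         video_quality="high")
BROWNOUT = FeatureSettings(show_remote_cursors=False,
                           calendar_events_preloaded=10,
                           video_quality="low")


def settings_for_load(load_ratio: float,
                      brownout_threshold: float = 0.8) -> FeatureSettings:
    """Pick feature settings from the current load.

    load_ratio is current load divided by capacity (0.0 means idle,
    1.0 means at capacity). At or above the threshold, the service sheds
    nice-to-have features instead of risking a full outage.
    """
    if load_ratio < 0:
        raise ValueError("load_ratio must not be negative")
    return BROWNOUT if load_ratio >= brownout_threshold else NORMAL
```

The point of the design is that the degraded mode is decided ahead of time, so nobody has to choose which features to drop in the middle of a traffic spike.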
Victor Lei applied his skydiving experience to disaster recovery. In skydiving, there is a specific altitude at which you stop trying to fix your main parachute and decide what is next. Then, there is another altitude where the skydiver automatically fails over to their backup parachute. Timeboxing is a technique for limiting the time spent testing a new idea or optimization, but it’s easy to lose track of time during a disaster. I’d like to see more guidelines for how long the on-call engineer should try to fix a problem before failing over to the backup or calling in additional support.
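A guideline like that could be as simple as two pre-agreed time limits, the incident-response version of decision altitudes. The sketch below is only an illustration under that assumption: the `Action` names and the 15 and 30 minute budgets are hypothetical, not figures from the talk.

```python
from datetime import datetime, timedelta
from enum import Enum


class Action(Enum):
    KEEP_FIXING = "keep trying to fix the primary"
    FAIL_OVER = "stop fixing and fail over to the backup"
    ESCALATE = "call in additional support"


def next_action(started_at: datetime,
                now: datetime,
                fix_budget: timedelta = timedelta(minutes=15),
                failover_budget: timedelta = timedelta(minutes=30)) -> Action:
    """Return what the on-call engineer should be doing at time `now`.

    Both budgets are measured from the start of the incident and are
    assumed values; a team would agree on its own limits beforehand.
    """
    elapsed = now - started_at
    if elapsed < fix_budget:
        return Action.KEEP_FIXING
    if elapsed < failover_budget:
        return Action.FAIL_OVER
    return Action.ESCALATE
```

Because the limits are written down before the incident, the engineer only has to check the clock rather than judge, under stress, whether one more attempt is worth it.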
Mattie Toia discussed emergent organizational failure. One point was forgetting how hard prioritization is, which can be helped by collaborating on mental models and making sharing and communication easy. Another was using incentives as a substitute for dedication, when what the organization needs is to demonstrate trust through its actions. At the center of all five points was trust: how to build it, and recognizing that each member of the organization is complex and has their own views of the world and the organization.
Christina Yakomin explained how to use the scientific method to test the resilience of systems.
I look forward to helping each project I’m on continue to grow in features, reliability, and observability to weather the good times and the bad.
SRECon is an open-access conference. Videos of all the talks will be available for free from USENIX in the coming weeks.