Fast & Simple - Observing Code & Infra Deployments at Honeycomb.io

#work #tech #yow-september-2021

YOW! September 2021 Recap

Jessica Kerr and Ian Smith

Credit has to be given to Jessica and Ian here: Liz Fong-Jones was originally scheduled to take this talk slot but fell ill at short notice, so these two champs stepped in with their own talk and did a pretty good job of it.

Jessica started by stating that Continuous Delivery needs to be considered an investment, not just in the software that you make, but in the software that you build around it. This means:

Telemetry
Tests
Continuous Integration
Feature Flags
Code Review
Observation in Production

At its heart, the research from the DORA (DevOps Research and Assessment) group is about feedback loops, which is what all of the above represent. Even code review, often derided as a slow friction point in a continuous delivery world, is a valuable feedback loop that can be made quick - as long as the change is small enough.

Jessica’s advice was to start with lead time: focus on getting the time it takes to ship a change to production lower than it currently is. Once you bring this down, the other measures begin improving too - Mean Time To Recover, availability, and so on. Her other major piece of advice for teams was:

“Fix the duct tape”

That is, the bits of “glue” holding deployments or pipelines together that haven’t been touched in a while. The importance of this duct tape is only really appreciated when it no longer holds!
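The lead-time advice above is easier to act on once the number is actually being measured. Here is a minimal sketch in Python, assuming you can pull a commit timestamp and a production deploy timestamp for each change out of your own pipeline. The function names and the choice of the median as the summary are illustrative assumptions, not something shown in the talk.

```python
from datetime import datetime, timedelta
from statistics import median


def lead_time(committed_at: datetime, deployed_at: datetime) -> timedelta:
    """Time between a change being committed and it running in production."""
    if deployed_at < committed_at:
        raise ValueError("a change cannot be deployed before it is committed")
    return deployed_at - committed_at


def median_lead_time(changes: list[tuple[datetime, datetime]]) -> timedelta:
    """Median lead time over (committed_at, deployed_at) pairs.

    The median is used here (an assumption) so that one unusually slow
    change does not dominate the number the team is trying to bring down.
    """
    if not changes:
        raise ValueError("need at least one change to compute a lead time")
    seconds = [lead_time(c, d).total_seconds() for c, d in changes]
    return timedelta(seconds=median(seconds))
```

Tracking this one number over time gives a baseline to compare against, which is all that "lower than what you currently have" needs.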

Moving a little more into how Honeycomb do things internally, Ian Smith took over the talk and spoke about the importance of repeatable infrastructure, all driven through code.

“If I want to make a change to our servers, I want to be able to diff and see that change in my browser, then hit a button and just have it happen.”

The importance of this repeatable, stable, reliable infrastructure driven via code is best seen in what it enables you to do with the products that you create - you can go faster on stable infra. You can manage risk, take bets, and iterate when things don’t work out. As a company whose products are designed to provide customers with observability, it’s important that they eat their own dog food.

“Honeycomb… runs on Honeycomb”

On that subject, alerting is an incredibly important part of Honeycomb’s infra, but Smith cautioned that a degree of nuance is required. Having this vast amount of data, metrics, and telemetry available means that you could literally alert on everything if you wanted, but that would only result in fatigue and “the paging system that cried wolf”.

“You should use SLOs to drive nuanced alerting. For some events, maybe a ping on Slack is all that you need. For others, you’re going to need to pull a company-wide fire alarm. Knowing the difference here is what’s important.”
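
One common way to turn an SLO into this kind of graded alerting is to look at how fast the error budget is being burned. The sketch below is in Python and is only an illustration of that idea, not Honeycomb's implementation. The `Slo` class, the function names, and the two thresholds are all hypothetical and would need tuning for a real service.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Slo:
    """A service level objective, e.g. target=0.999 for 99.9% success."""
    target: float

    @property
    def error_budget(self) -> float:
        # Fraction of events that are allowed to fail.
        return 1.0 - self.target


def burn_rate(slo: Slo, total_events: int, failed_events: int) -> float:
    """How fast the error budget is being consumed in the observed window.

    Around 1.0 means failures arrive at about the rate the SLO allows;
    higher values mean the budget will run out early.
    """
    if total_events == 0:
        return 0.0
    observed_error_rate = failed_events / total_events
    return observed_error_rate / slo.error_budget


def choose_alert(rate: float,
                 slack_threshold: float = 2.0,   # assumed value
                 page_threshold: float = 10.0) -> str:  # assumed value
    """Map a burn rate to an alert severity: 'none', 'slack', or 'page'."""
    if rate >= page_threshold:
        return "page"
    if rate >= slack_threshold:
        return "slack"
    return "none"
```

With a 99.9% target, a window with 50 failures out of 10,000 events burns budget at roughly five times the sustainable rate, which under these assumed thresholds earns a Slack ping rather than a page.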