The DevOps Handbook – The Technical Practices of Feedback

1:51:12
 
Share
 

Manage episode 268760541 series 27319
By Allen Underwood, Michael Outlaw, and Joseph Zack. Discovered by Player FM and our community — copyright is owned by the publisher, not Player FM, and audio is streamed directly from their servers. Hit the Subscribe button to track updates in Player FM, or paste the feed URL into other podcast apps.

It’s all about telemetry and feedback as we continue learning from The DevOps Handbook, while Joe knows his versions, Michael might have gone crazy if he didn’t find it, and Allen has more than enough muscles.

For those that use their podcast player to read these show notes, did you know that you can find them at https://www.codingblocks.net/episode138? Well, you can. And now you know, and knowing is half the battle.

Sponsors
  • Datadog – Sign up today for a free 14 day trial and get a free Datadog t-shirt after your first dashboard.
  • Secure Code Warrior – Start gamifying your organization’s security posture today, score 5,000 points, and get a free Secure Code Warrior t-shirt.
Survey Says Which one?

Take the survey at: https://www.codingblocks.net/episode138.

News
  • We give a heartfelt thank you in our best announcer voice to everyone that left us a new review!
    • iTunes: TomJerry24, Adam Korynta
    • Stitcher: VirtualShinKicker, babbansen, Felixcited
  • Cost of a Data Breach Report 2020 (IBM)
  • Garmin Risks Repeat Attack If It Paid $10 Million Ransom (Forbes)
  • Almost 4,000 databases wiped in ‘Meow’ attacks (WeLiveSecurity.com)
The Second Way: The Principles of Feedback Implementing the technical practices of the Second Way
  • Provides fast and continuous feedback from operations to development.
  • Allows us to find and fix problems earlier on the software development life cycle.
Create Telemetry to Enable Seeing and Solving Problems
  • Identifying what causes problems can be difficult to pinpoint: was it the code, was it networking, was it something else?
  • Use a disciplined approach to identifying the problems, don’t just reboot servers.
  • The only way to do this effectively is to always be generating telemetry.
    • Needs to be in our applications and deployment pipelines.
    • More metrics provide the confidence to change things.
  • Companies that track telemetry are 168 times faster at resolving incidents than companies that don’t, per the 2015 State of DevOps Report (Puppet).
    • The two things that contributed to this increased MTTR ability was operations using source control and proactive monitoring (i.e. telemetry).
Create Centralized Telemetry Infrastructure
  • Must create a comprehensive set of telemetry from application metrics to operational metrics so you can see how the system operates as a whole.
    • Data collection at the business logic, application, and environmental layers via events, logs and metrics.
    • Event router that stores events and metrics.
      • This enables visualization, trending, alerting, and anomaly detection.
      • Transforms logs into metrics, grouping by known elements.
    • Need to collect telemetry from our deployment pipelines, for metrics like:
      • How many unit tests failed?
      • How long it takes to build and execute tests?
      • Static code analysis.
  • Telemetry should be easily accessible via APIs.
  • The telemetry data should be usable without the application that produced the logs
Create Application Logging Telemetry that Helps Production
  • Dev and Ops need to be creating telemetry as part of their daily work for new and old services.
Should at least be familiar with the standard log levels
  • Debug – extremely verbose, logs just about everything that happens in an application, typically disabled in production unless diagnosing a problem.
  • Info – typically action based logging, either actions initiated by the system or user, such as saving an order.
  • Warn – something you log when it looks like there might be a problem, such as a slow database call.
  • Error – the actual error that occurs in a system.
  • Fatal – logs when something has to exit and why.
Using the appropriate log level is more important than you think
  • Low toner is not an Error. You wouldn’t want to be paged about low toner while sleeping!
  • Examples of some things that should be logged:
    • Authentication events,
    • System and data access,
    • System and app changes,
    • Data operations (CRUD),
    • Invalid input,
    • Resource utilization,
    • Health and availability,
    • Startups and shutdowns,
    • Faults and errors,
    • Circuit breaker trips,
    • Delays,
    • Backup success and failure
Use Telemetry to Guide Problem Solving
  • Lack of telemetry has some negative issues:
    • People use it to avoid being blamed for problems, which can be due to a political atmosphere and SUPER counter-productive.
  • Telemetry allows for scientific methods of problem solving to be used.
    • This approach leads to faster MTTR and a much better relationship between Dev and Ops.
Enable Creation of Production Metrics as Part of Daily Work
  • This needs to be easy, one-line implementations.
  • Use data to generate graphs, and then overlay those graphs with production changes to see if anything changed significantly.
    • This gives you the confidence to make changes.
Create Self-Service Access to Telemetry and Information Radiators
  • Make the data available to anyone in the value stream without having to jump through hoops to get it, be they part of Development, Operations, Product Management, or Infosec, etc.
  • Information radiators are displays which are placed in highly visible locations so everyone can see the information quickly.
    • Nothing to hide from visitors OR from the team itself.
Resources We Like
  • The DevOps Handbook: How to Create World-Class Agility, Reliability, and Security in Technology Organizations (Amazon)
  • The Phoenix Project: A Novel about IT, DevOps, and Helping Your Business Win (Amazon)
  • The Unicorn Project: A Novel about Developers, Digital Disruption, and Thriving in the Age of Data (Amazon)
  • 2015 State of DevOps Report (Puppet)
  • StatsD (GitHub)
  • Graphite (graphiteapp.org)
  • Grafana (grafana.com)
  • The Twelve-Factor App (12factor.net)
    • The Twelve-Factor App: Codebase, Dependencies, and Config (episode 32)
    • The Twelve-Factor App: Backing Services, Building and Releasing, Stateless Processes (episode 33)
    • The Twelve-Factor App: Port Binding, Concurrency, and Disposability (episode 35)
    • The Twelve Factor App: Dev/Prod Parity, Logs, and Admin Processes (episode 36)
  • Break Up With IE8 (breakupwithie8.com)
Tip of the Week
  • Bookmarks for VS Code (GitHub, Visual Studio Marketplace)
  • Pwn your zsh! (ohmyz.sh)
    • Companion cheetsheet (GitHub)
  • Use Docker BuildKit’s experimental features to enable and use build caches (GitHub)
  • Disable all of your VS Code extensions and then re-enable just the ones you need using CTRL+SHIFT+P. (code.visualstudio.com)
  • Color code your environments in Datagrip! Right click on the server and select Color Settings. Use green for local and red for everything else to easily differentiate between the two. Can be applied at the server and/or DB levels. For example, color your default local postgres database orange. This color coding will be applied to both the navigation tree and the open file editors (i.e. tabs).

142 episodes