The DevOps Handbook - Cheat Sheet V1.0
quality closer to the source Design with Conway’s Law in Mind
Without alignment on incentives and goals, Development & IT Operations Don’t hand off work to other teams, minimize approvals, right-size Enable market-orientated teams Continuously build, test, and integrate our code
will be at odds with each other. documentation, and make changes in small batches.
Optimize for speed and embed the functional engineers and skills (Ops, Step towards continuous delivery by automatically building and testing in a
Enable optimizing for downstream teams QA, Infosec etc) into each service team production like environment, when code is checked-in to version control.
Design software with architecture, performance, stability, testability, Test, operations, and security as everyone’s job, every day Build a fast and reliable automated validation test suite
configurability, and security prioritized into the work.
Establish shared goals on quality, availability, and security that are the Automate all layers of the testing – balancing the test pyramid across unit,
3rd Way: Continuous Learning & Experimentation responsibility of everyone in the development process. acceptance, integration, and functional testing.
Enable an organizational learning & safety culture Enable every team member to be a generalist Catch errors as early in our automated testing as possible
Adopt a generative (Westrum) culture where failure leads to inquiry, and Focus on establishing teams with generalist skills, providing opportunities for Establish an “ideal test pyramid” where we aim to detect issues as early
information, including risks, is freely shared. all engineers to learn the skills necessary to build and run systems and as fast as possible (i.e. unit tests)
Institutionalize the improvement of daily work Fund not projects, but services and products Ensure tests run quickly (in parallel, if necessary)
Pay down technical debt, fix defects, refactor and improve problematic Fund long-lived teams that focus on the achievement of organizational and Automate the commencement and running of tests (from source check-in),
Narrow the gap between the concept of Development and Operations – areas of the code – the ‘boy scout rule’ of leaving code better than before customer outcomes such as revenue, value, or adoption rather than waiting for manual approval or trigger from developers
1st Way: The Principles of Flow Leaders reinforce a learning culture Keep team sizes small Integrate performance testing into our test suite
Leaders create iterative, short term target conditions – and empower Use the “two pizza” rule – where teams are small enough that they can be Write automated performance tests that validate across the entire
Make your work visible teams to experiment in order to solve for it. fed with two pizzas, ideally around 7 plus or minus 2. application stack as part of the deployment pipeline.
Use a Kanban board to show your entire workstream, making it visible to
all stakeholders to drive central prioritization of work
Selecting which value stream to start with Integrate operations into the daily work of development Integration of non-functional requirements testing
Consider both systems of record and engagement Create shared services to increase developer productivity Tests should include validation of system attributes we care about –
Limit work in process (WIP) supported applications, compilers, OS, and any other dependencies.
Optimise your value stream to maximise flow – focusing both on quality Create a set of centralized platforms and tooling that enable dev –
Establish WIP limits at each stage of the Kanban board to limit multi- and speed to create a robust and fast flow of value automated environments, testing, and common version control Establish Andon cord for when deployment pipelines break
tasking – measure lead times through the board
Start with the most sympathetic & innovative groups Embed Ops engineers into our service teams When test failure occurs – ensure there is shared responsibility for all to
Reduce batch sizes react and address the failure before continuing further work.
Find teams that already believe in DevOps, focusing on creating success Ensure the operational skills are within the service teams, either by
Set WIP limits on your Kanban board to reduce batch sizes by limiting the
amount of in-flight work – the optimum batch size will be the lowest total
with those groups to build a coalition of change embedding DevOps, or training and empowering the development team Enable and practice continuous integration
cost of delivery when considering transaction and holding costs Assign an ops liaison to each service team
Expand DevOps across the organization Use small batch development
Reduce the number of handoffs Find innovators/early adopters, build a critical mass & silent majority, then Build operational skills and awareness into teams by assigning an ops Merge early and often – by providing many small merges, as opposed to
once widely adopted – you can focus on the holdouts. liaison to each development team building up large and infrequent merges.
Automate as much as possible in the development process – reorganizing
development teams to have all capabilities required to develop, test, Understand the work in our value stream Integrate ops into dev rituals
release, and maintain their code in production
Adopt trunk-based development practices
Create a value stream map to see the work Have the ops engineers attend development team ceremonies, Institutionalize that developers need to check-in their code to trunk at
Continually identify and address your bottlenecks participating to improve the operational supportability of development least once per day to limit the batch size of changes.
No one person can know all the work that must be performed to create
Continually identify and remove the most significant bottleneck impacting value for the customer – visualize this publicly for all to see Make relevant ops work visible on shared Kanban boards Automate and enable low-risk releases
your speed of delivery – creating change tolerant architectures and
automation through development & release. Create a dedicated transformation team Create a shared Kanban board that gives operations and development Automate the deployment process (code, test, and infra.)
visibility of what work is flowing into production shortly.
Eliminate hardships and waste in the value stream Assign dedicated resources to the DevOps transformation who are Automate all steps across the deployment processes, minimizing the
generalists and respected – create space for them to focus Create the foundations of your Development Pipeline manual effort required through the process to create repeatability
Look for partially done work, extra processes/features, task switching,
waiting, motion, manual work, and heroics – and optimize to remove these Establish a shared goal Enable on demand creation of all environments Enable automated self-service deployments
2nd Way: The Principles of Feedback Create a north star for the transformation team – relentlessly Establish automated tools for configuration, OS, environments, and Create a code promotion process that can be performed by Dev or Ops
communicate it to reinforce the vision and goal to the business deployment to allow dev teams to establish environments on demand without manual intervention to build, test, and deploy the software
Design a safe system of work
Keep our improvement planning horizons short Create our single repository of truth for the entire system Integrate code deployment into the deployment pipeline
Manage complex work, swarm on problems, transfer knowledge through
the organization, and grow leaders with these values Be adaptive in planning improvements, work in short iterations of change, Have all application code, scripts, schemas, env creation tools, containers, Ensure packages are suitable for PRD deployment, see env readiness at a
measure outcomes, and incorporate past learnings in new initiatives tests, and other technical artefacts in a common source control location. glance, automated deploy, and record and test automatically.
See problems as they occur
Reserve time for NFR and technical debt Make infrastructure easier to rebuild than repair Decouple deployments from releases
Create fast feedback and fast-forward loops via creation of automated
builds, integration, and test processes. Dedicate effort for addressing non-functional requirements and technical Establish immutable infrastructure where manual changes to PRD are not Employ environment based or application based release patterns to
debt – ideally 20-30% of time as a rule of thumb allowed – on the construction/de-construction via automated processes. decouple deployment from customer release.
Swarm & solve problems to build new knowledge
Use tools to reinforce desired behaviour Done for dev teams includes running in a PRD like env Leverage patterns to improve speed and ease of deploy
Fix problems as they occur – and build a psychologically safe environment
for people to raise concerns real time. Use common backlogs and tools between Dev & Ops teams Ensure development teams demonstrate code in a production-like Implement feature toggles or dark launches to control visibility of changes
environment as part of their definition of done.
Architect for low risk releases Have developers follow work downstream Decrease incident tolerances to find weaker failure signals Integrate security into defect tracking and post-mortems
Architect to enable productivity, testability, and safety Have the developers directly observe the UX of their software on real users Standardization alone cannot prevent software issues – continually Track all security issues in the same work tracking system as that which
– understanding any challenges users are facing. experiment and discover to find new software risks. Dev and Ops are using – include post-mortem learnings into this
Establish a loosely-coupled architecture with well-defined interfaces which
enforce how services connect with one another. Have Devs initially self-manage their production service Redefine failure and encourage calculated risk-taking Integrate security controls into source code and services
Select the best architecture for your needs Dev teams have a Launch Readiness Review with Ops on their early life You need to fail faster and more often, identifying it as a learning Centralize a set of pre-validated security blessed libraries that are
services – then self-manage those until operational stability and a Hand-off opportunity and applying the necessary correction to prevent recurrence maintained and pulled in real-time during the CI/CD pipeline.
Monolithic architectures are fine for early life companies, but may not Readiness Review is completed.
scale – establish a loosely coupled architecture and adaptable design. Inject production failures to enable resilience and learning Integrate security into your deployment pipeline
Integrate A/B Testing into Our Daily Work
Use the strangler pattern to safely evolve Deliberately create failure scenarios in production – Implement a ‘Chaos Create security tests that run as part of the deployment pipeline for every
Integrate A/B testing into your feature testing Monkey’ to test the resilience of your production systems. committed change.
To decommission legacy software – place it behind an API where it remains
unchanged, then gradually replace it with the desired architecture. Release two versions of your product, diverting a number of users to the Institute game days to rehearse failures Ensure security of the application
control (“A”) or the treatment (“B”) – applying statistical analysis of results
Create Telemetry to Enable Seeing and Solving Problems Regularly simulate failure - This tests the fault resistance of your software Tests should include static & dynamic analysis, dependency scanning, and
Integrate A/B testing into your release in a wide variety of scenarios to identify and address latent defects code integrity and signing checks – and be aligned with OWASP guidelines
Create centralized telemetry infrastructure
Integrate feature toggles into new releases, and leverage them to control Convert Local Discoveries into Global Improvements Ensure security of your software supply chain
Centralize logging, transform the logging into valuable metrics, then apply the percentage of users who experience the treatment version.
statistical analysis to identify patterns to trigger actionable events Use chat to automate and capture org. knowledge Ensure all packages and dependencies used are up to date, and meet the
Integrate A/B testing into your feature planning same security tests required of your platform as a whole.
Create application logging telemetry that helps production Document and share observations of system and testing health
Use the feature hypothesis: We Believe (action), will result in (result), we automatically via a shared chat location that is transparent to all Ensure security of the environment
Ensure every feature is instrumented and providing telemetry, and create will have confidence to proceed when we see (measure)
logging hierarchies for both non-functional and feature attributes. Automate standardized processes in software for re-use Establish known good states of environments – automating the monitoring
Create Review and Coord. Processes to Increase Quality of all production instances against those good states.
Use telemetry to guide problem solving Capture knowledge and documentation of services in source control,
Avoid the dangers of change approval processes making information available for everyone to search and use. Integrate information security into production telemetry
Leverage the telemetry to provide fact based problem solving - using the
scientific method to create and test hypothesis to obtain learning. Change controls can create negative impacts – be mindful that more Create a single, shared source code repository Provide security telemetry via the same tools that Dev, QA, and Operations
controls added means more rigid processes, and less adaptability. are using to give everyone vision of security performance.
Enable creation of production metrics as part of daily work Establish a central shared source repository that stores all tools/
Ensure you don't "Overly control" changes libraries/infrastructure/config/source for deploying all environments Create security telemetry in your applications
Create central and easy to use infrastructure and libraries so that it is easy
for development & operations to create telemetry for all new functionality. You cannot reliably predict successful changes with words - use control Spread knowledge through docs and CoP Establish telemetry into your applications to identify insecure practices or
methods that resemble peer review & reduce reliance on external bodies behaviours in the system operation – and flag appropriate alert levels
Enable self-service to telemetry and information radiators Develop tests that are self documenting of the code – showing engineers
Enable coordination and scheduling of changes working examples of how to use the system. Create security telemetry in your environment
Provide mechanisms so all teams can get access to production telemetry
easily, without needing production access or privileged accounts. Create loosely-coupled architecture to avoid release dependencies – Design for operations through codified NFR Establish telemetry into your environments to monitor changes to OS,
enabling independent deployment of services by teams. security, config, infrastructure, or XSS/SQLi attempts & server errors
Find and fill any telemetry gaps Establish standard NFRs that set a baseline that all new
Enable peer review of changes services must achieve in order to enable operational objectives. Protect your deployment pipeline
Create telemetry at all levels of the application stack, for all environments,
and throughout the entire deployment pipeline. Ensure all code is reviewed prior to release – keeping the size of changes Build reusable operations user stories into development Harden CI/CD process, review all changes in version control, instrument to
small to streamline review & release practices. detect suspicious API calls, isolate CI processes running.
Analyse Telemetry to Anticipate Problems and Hit Goals Relentlessly automate every step of the deployment process – Supporting
Avoid manual testing and change freezes Ops improvements with Engineering effort in automation and tooling Protecting the Deployment Pipeline
Use mean and standard deviations to detect problems
Automate and integrate testing into your daily work, ensuring a flow of Ensure technology choices help achieve org. goals Integrate security and compliance into change approval
Create alerts that look for outliers from the mean using a standard changes into production with high release frequency
deviation where data sets are bell curved in nature Select technology standards that allow for fast deployment, common Leverage ITIL’s standard/normal/urgent change classifications and
Enable pair programming to improve changes learning and skill, and ease of understanding and maintenance. incorporate security assessment into those to meet compliance needs
Instrument and alert on undesired outcomes
Spread knowledge and develop in small testable batches through pair Reserve Time to Create Org. Learning and Improvement Re-categorize the lower risk changes as standard changes
Identify the lead indicators of outages, and instrument to alert on those to programming, and practices like TDD/BDD
create pro-active early detection systems. Institutionalize rituals to pay down technical debt Categorize and record all changes, focusing on moving changes with
Fearlessly cut bureaucratic processes patterns of high success and low MTTR to be ‘standard’ changes
No standard deviation on telemetry that’s not bell curved Regularly schedule improvement blitzes/hack weeks focusing on enabling
Relentlessly reduce the effort required for engineers to perform work and the team to pay back technical debt and improve their means of delivery Reduce reliance on separation of duty
Where normal operation can’t be described by the bell curve – don’t use deliver it to the customer with light controls, and high automation.
the standard deviation as it will create over or under alerting Enable everyone to teach and learn Use controls like pair programming, continuous inspection, code reviews
Enable and Inject Learning into Daily Work and others as the primary sources of control over separation of duty.
Leverage anomaly detection for non-bell curve Dedicate regular time for learning and teaching – being committed to
Establish a just, learning culture prevent it being deprioritized for other operational work. Ensure docs and proof for auditors and compliance officers
Establish patterns in your telemetry, and leverage smoothing, period
patterns, and seasonality to your data where it is not described by a bell curve. Build a culture that embraces failure as a trigger for inquiry and learning, Share your experiences from conferences Work with auditors in the control design process - sending all telemetry to
and not of scapegoating and blame centralized systems for auditor access and auditing.
Enable Feedback So Dev and Ops Can Safely Deploy Code Apply and experiment with the learnings you obtain from conferences –
Schedule blameless post-mortem meetings after accidents fostering the relationships you build for continuous learning from peers Inspired by the Clean Code Cheat Sheet developed by Urs Enzler
Use telemetry to make deployments safer from bbv software services (www.bbv.ch)
When failures occur, bring all stakeholders together to understand the Create internal consulting and coaches to spread practices
Actively monitor the metrics associated with your feature during timeline of events, identify root cause, and capture blameless learnings
deployment - overlaying metrics with code deployment patterns for insight Allocate specific resources focused on improvement without constraint Tribute to ‘The DevOps Handbook’ published by:
Publish our post-mortems as widely as possible Kim, G., Humble, J., Debois, P., Willis, J. (2016), IT Revolution Press
Dev shares pager rotation duties with Ops Information Security as Everyone’s Job, Every Day
Make the findings and actions of post-mortems transparent to all, all the This work by Trevor de Vroome (2020) with support from
Make problems visible to Developers by having them be responsible for way through to the customer, if possible. The goal is to spread the Integrate sec into development iteration demonstrations Whiteboard People (www.whiteboardpeople.com) is licensed under
handling of operational incidents – by implementing and making them knowledge, so others can learn from it.
responsible for pager duties of priority incidents. Incorporate security into the acceptance criteria and DoD for your stories a Creative Commons Attribution 4.0 International License.
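
Worked example for "Use mean and standard deviations to detect problems": the sketch below is one possible way to alert on a new telemetry reading that lies far from the mean of a recent baseline. It is an illustration only, not a method from the handbook. The function name is_outlier, the 3-sigma default threshold, and the latency figures are all assumptions, and the approach only makes sense where the metric is roughly bell curved.

```python
from statistics import mean, stdev


def is_outlier(baseline, value, threshold=3.0):
    """Return True if `value` is more than `threshold` sample standard
    deviations away from the mean of `baseline`.

    Hypothetical helper: assumes `baseline` is roughly bell curved and
    contains at least two readings.
    """
    if len(baseline) < 2:
        raise ValueError("need at least two baseline readings")
    mu = mean(baseline)
    sigma = stdev(baseline)  # sample standard deviation
    if sigma == 0:
        # A perfectly flat baseline: any different value is unusual.
        return value != mu
    return abs(value - mu) > threshold * sigma


# Made-up request latencies in milliseconds from a recent window.
baseline_ms = [102, 98, 101, 99, 100, 103, 97, 100, 101, 99, 100]

for reading in (104, 250):
    if is_outlier(baseline_ms, reading):
        print(f"ALERT: latency {reading} ms is outside the normal range")
    else:
        print(f"ok: latency {reading} ms")
```

Comparing each new reading against a baseline that excludes it keeps one extreme value from inflating the standard deviation and hiding itself. For metrics that are not bell curved, this check will over- or under-alert, as the cheat sheet warns.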