Four Guidelines for Valuable Documentation

📃 We’ve written a lot of documentation for a lot of projects. We’ve also read a lot of documentation for a lot of projects and had mixed experiences with what it taught us. Across that work, we’ve found four guidelines that make documentation easy to write and valuable to readers. Hopefully they save you some time and some frustration!

All four come from one principle:

Documentation exists to help users with generic experience learn your specific system.

Generic experience is a prerequisite. Documentation isn’t a substitute for knowing the basics of the tooling your project uses; it’s a quick way for knowledgeable readers to learn the specific ways your project uses those tools.

Don’t Write Click-by-Click Instructions

❌ This is way too much detail:

  1. Go to
  2. Click Log Groups on the left
  3. Type “widgets-dev-async-processor” in the search box
  4. Click the magnifying glass icon
  5. Find the “widgets-dev-async-processor” in the search results
  6. Click “widgets-dev-async-processor”
  7. Click the first stream in the list
  8. Read the log entries

It’s frustratingly tedious for experienced users. Users who are so new that they need this level of detail are unlikely to get much from the logs it helps them find.

This will also go out of date as soon as the CloudWatch UI changes. You won’t always notice when it changes, and even if you do it’s easy to forget to update your docs.

Use simple text directions instead:

Open the widgets-dev-async-processor Log Group in the AWS CloudWatch web console.

That’s easy to read, tells the reader what they need and where to find it, and won’t go out of date until you change how your logs are stored.
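
For readers with generic AWS experience, that text direction also translates directly into the SDK. Here’s a minimal sketch using boto3’s CloudWatch Logs client; the client setup in the comment and the event limit are my assumptions, not part of the original direction:

```python
# Sketch: fetch recent events from the log group named in the direction above.
# Assumes a boto3 CloudWatch Logs client, created elsewhere with e.g.:
#   import boto3
#   client = boto3.client("logs")

def recent_log_events(logs_client, group_name, limit=20):
    """Return the most recent event messages from a CloudWatch Log Group."""
    response = logs_client.filter_log_events(
        logGroupName=group_name,
        limit=limit,
    )
    return [event["message"] for event in response.get("events", [])]
```

Because the client is passed in as a parameter, the snippet is also easy to exercise with a stub instead of a live AWS account.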

Limit Use of Screenshots

🔍 Searches can’t see into images, so anything captured in a screenshot won’t show up in search results. Similarly, readers can’t copy/paste from images.

Also, like click-by-click instructions, screenshots are tedious for experienced readers, unhelpful to new users trying to understand the system, and impractical to keep up to date.

Most of the time, simple text directions like the ones given above are more usable.

Link Instead of Duplicating

Duplicated docs always diverge. Here’s a common example:

Infrastructure code and application code live in different repos. Engineers on both teams need to export AWS credentials into their environment variables: infra engineers to run terraform, app engineers to query DynamoDB tables. Trying to make it easy for everybody to find what they need, someone documents the steps in each repo. Later, the way users get their credentials changes. The engineer making that change only works on terraform and rarely uses the app repo. They forget to update its instructions. A new engineer joins the app team, follows those (outdated) instructions, and gets access errors. There’s churn while they diagnose.

It’s better to document the steps in one repo and link 🔗 to those steps from the other. Then, everyone is looking at the same document, not just the same steps. It’s easy to update all docs because there’s only one doc. Readers know they’re looking at the most current doc because there’s only one doc.

This is also true for upstream docs. For example, if it’s already covered in HashiCorp’s excellent terraform documentation, just link to it. A copy will go out of date. Always link to the specific sections of pages that cover the details your readers need. Don’t send them to the header page and force them to search.

Keep a Small Set of Accurate Documentation

If you write too many docs, they’ll eventually rot. You’ll forget to update some. You won’t have time to update others. Users will read those docs and do the wrong thing. Errors are inevitable. It’s better to have a small set of accurate docs than a large set of questionable ones. Only write as many docs as it’s practical to maintain.

Writing docs can be a lot of work. Sometimes they just cause more errors. Hopefully, these guidelines will make your docs easier to write and more valuable to your readers.

Happy documenting!

Operating Ops

Need more than just this article? We’re available to consult.

You might also want to check out these related articles:

A Checklist for Submitting Pull Requests


Reviewing code is hard, especially because reviewers tend to inherit some responsibility for problems the code causes later. That can lead to churn while they try to develop confidence that new submissions are ready to merge.

I submit a lot of code for review, so I’ve been through a lot of that churn. Over the years I’ve found a few things that help make it easier for my reviewers to develop confidence in my submissions, so I decided to write a checklist. ✔️

The code I write lives in diverse repos governed by diverse requirements. A lot of the items in my checklist are there to help make sure I don’t mix up the issues I’m working on or the requirements of the repos I’m working in.

This isn’t a guide on writing good code. You can spend a lifetime on that topic. This is a quick checklist I use to avoid common mistakes.

This is written for Pull Requests submitted in git repos hosted on GitHub, but most of its steps are portable to other platforms (e.g. Perforce). It assumes common project features, like a contributing guide. Adjust as needed.

The Checklist

Immediately before submitting:

  1. Reread the issue.
  2. Merge the latest changes from the target branch (e.g. master).
  3. Reread the diff line by line.
  4. Rerun all tests. If the project doesn’t have automated tests, you can still:
    • Run static analysis tools on every file you changed.
    • Manually exercise new functionality.
    • Manually exercise existing functionality to make sure it hasn’t changed.
  5. Check if any documentation needs to be updated to reflect your changes.
  6. Check the rendering of any markup files (e.g. in the GitHub UI).
    • There are remarkable differences in how markup files render on different platforms, so it’s important to check them in the UI where they’ll live.
  7. Reread the project’s contributing guide.
  8. Write a description that:
    1. Links to the issue it addresses.
    2. Gives a plain English summary of the change.
    3. Explains decisions you had to make. Like:
      • Why you didn’t clean up that one piece of messy code.
      • How you chose the libraries you used.
      • Why you expanded an existing module instead of writing a new one.
      • How you chose the directory and file names you did.
      • Why you put your changes in this repo, instead of that other one.
    4. Lists all the tests you ran. Include relevant output or screenshots from manual tests.
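
Steps 2 and 4 are mechanical enough to script. Here’s a minimal sketch; the target branch (origin/master) and the test command (make test) are placeholders you’d swap for your project’s own:

```python
import subprocess

# Placeholder commands for "merge the latest changes" (step 2) and
# "rerun all tests" (step 4). Swap in your project's branch and test runner.
PRE_SUBMIT_STEPS = [
    ["git", "fetch", "origin"],
    ["git", "merge", "origin/master"],
    ["make", "test"],
]

def run_steps(steps, runner=subprocess.run):
    """Run each step in order; return the first failing command, or None."""
    for command in steps:
        if runner(command).returncode != 0:
            return command
    return None
```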

There’s no perfect way to submit code for review. That’s why we still need humans to do it. The creativity and diligence of the engineer doing the work are more important than this checklist. Still, I’ve found that these reminders help me get code through review more easily.

Happy contributing!



How To Work From Home


With the spread of the recent coronavirus, major companies like Microsoft have started asking their teams to work from home to limit exposure. I find remote work is often the most productive, but the transition can be messy. I’ve been doing it for years, and I’ve learned some lessons I thought I’d share.

✔️ Lesson #1: keep the ticket tracker up to date

One of the hardest parts of managing a remote team is keeping confidence that you know what’s going on. Good managers keep a keen ear on their team, but when their team goes remote their ears aren’t enough. The best way to help them is to make the ticket tracker (like Jira or Trello) the source of truth for the team’s work. My rule is this:

If the boss needs to know the status of your work, all they should need to do is refresh the ticket board.

🗣 Lesson #2: use chat software for discussions but don’t use it to communicate requirements

It’s easy to take everything to your chat app as soon as you leave your shared office space, but I find that leads to dropped tasks. If something has to get done, it should be a ticket in the ticket tracker, or at least an email that copies the project’s Project Manager and Manager. A ping in a chat channel isn’t enough. It’s too easy to scroll past something critical. My rule is this:

If something gets missed and it was only posted in chat, the miss belongs to the person who posted.

👥 Lesson #3: replace check-in meetings (like Scrum stand-ups) with a tool

If there’s noise in your home workspace, like a roommate coming home or a kid leaving for school, a pair of headphones is enough to mask it and help you focus. On a call, that background noise disrupts everything. Regular check-in meetings heighten competition for your workspace, but they’re also repetitive, so they’re easy to replace with a text-based system. I’ve used Status Hero for this. It sends reminders, teammates can tag each other if they need help, and it tracks blockers. It instantly reduces the call schedule. It’s a quick win when you’re trying to bootstrap a newly-remote team.

📫 Lesson #4: use Inbox Zero

Unanswered emails are one of the biggest frustrations I face as a remote worker. Things get missed and chat channels get polluted with “hey did you see that email…” messages. I try not to create the same frustration for others.

Everybody gets a lot of email. Checking it is hard, even for diligent workers. Inbox Zero teaches you to process email instead of checking it. If you’re on a thread with an action item, create a ticket and archive the message. If someone sends you a link, bookmark it and archive the message. Instead of letting messages build up into a database of everything, process them into wherever they go and then get rid of them. Then unanswered messages don’t get lost in the pile.

Enjoy skipping your commute!



Grooming Large Backlogs

This morning, a buddy was wrangling a Trello board. It was packed with tasks and it was getting hard to plan work. Not long after, I got an email:

What’s your recommendation for a Very Large Backlog?

I see a lot of Very Large Backlogs in DevOps. Everybody is moving to the cloud. Everybody is automating. Everybody is paying down technical debt. There’s a lot to do and a lot of it shows up in the backlog. My buddy doesn’t work in DevOps, but VLBs are common in my world and I have a few recommendations that’ll help in any field.

How you deal with this depends on the tasks in your backlog. There’s no magic method. But, I have seen several recurring patterns in big backlogs. I’ve also found some ways to deal with them:

  • Tons of small, lower-priority tasks. I’ll often bump these up in priority just to get them out of the backlog. One of my mantras is, “You have to do a sufficient amount of the little stuff every week to keep the business running.” If I’ve got a bunch of cards for “rotate password on [service]” and “set up new account for [person]” and “email instructions for [task] to [coworker]”, I’ll just sit down for three hours and smash through them. It’s better to spend time doing the work than to spend time tracking it.
  • Tasks that are vague or not actionable until far into the future. These are good candidates to convert to a bulleted list in a markdown file. I keep a fiddle directory with a file where I track these. It’s a workspace for me to think about stuff and experiment. Like, “Consider switching from Google Apps to Office 365”. It’s something I want to remember to think about later but it’s not really an actionable task that needs to be scheduled.
  • Planning tasks. I try not to put these in the backlog. Things like, “write a plan for how I’m going to write the code/copy/proposal for [task].” There are a lot of sneaky variations on this pattern, so I go through my backlogs and read each task carefully to look for it. I usually just delete them. If a task fits in one sprint, I consider my approach when I start the task. Orienting myself and planning my implementation is part of the development process, not something I do separately. Any task that’s so large I can’t plan it in-line isn’t a task. It’s a project and it needs a whole other process.
  • Tasks that just never seem to get prioritized. Backlogs often bloat with tasks that’ll never be high-value enough to prioritize. Like, “define code style standards for all mah code.” Or, “figure out how to install that JIRA plugin we talked about in standup that one time.” There’s only so much magic you can do to tame a large backlog. Too much work is too much work. One of the best things you can do is sort by value and drop tasks from the bottom. Put a note in the card and close it (don’t delete it, you want it to show up in searches so nobody adds it again). If the backlog is too big to complete, your options are to hire workers or cancel work. Hiring workers is slow and expensive. Canceling low-value work is fast and cheap. (You can also look for ways to work more efficiently but those are usually gains of just a few percent and won’t overcome a large backlog.)
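
The “sort by value and drop tasks from the bottom” move can be sketched in a few lines. The task format and the 60% cutoff here are made-up examples for illustration, not numbers from any real backlog:

```python
# Sketch: keep the highest-value fraction of a backlog and close the rest
# with a note, so closed tasks still show up in searches.

def trim_backlog(backlog, keep_fraction=0.6):
    """Each task is a dict with "title" and "value" keys."""
    ranked = sorted(backlog, key=lambda task: task["value"], reverse=True)
    cutoff = int(len(ranked) * keep_fraction)
    kept, dropped = ranked[:cutoff], ranked[cutoff:]
    for task in dropped:
        task["status"] = "closed"
        task["note"] = "Closed during grooming: too low-value to schedule."
    return kept, dropped
```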

There is a lot more to backlog management than fixing these patterns, but these can help a lot in shrinking your backlog down to something manageable.

Happy grooming!



The New Project Manager’s Glossary: Cloud and DevOps

I often meet Project Managers who are new to the cloud or DevOps or sometimes new to software altogether. There’s plenty of jargon in these spaces, and often definitions are hard to find. Quite a few folks have asked me to help define the jargon, so I decided to write it up.

This is an opinionated list. It’s also a simplification. It summarizes what I’ve personally learned in my years in these spaces. There are other definitions, but these should get you close enough to work within the context of conversations.

This list starts with the boring terms that you’re most likely to already know and builds them up into the more esoteric ones. Sort of. There’s a lot of interconnection. If you see a term you don’t know, try looking farther down in the list.

To make the examples easier to follow, imagine you work for The Golf-Stuff Company. You make a golfing website where golf enthusiasts can buy golf stuff. The product is the golfing website. The customers are the golfers. My definitions are written around this example case.

Code: A synonym of software and of program. You “write” or “develop” code/software/programs. Code is the informal term, software is the formal one. Program is an older word that nobody says anymore.

Development: The process of writing code. A synonym of coding and of programming. Coding is the informal term, developing is the more formal one. Programming is still in use, but it’s less common. “Coders developing software” means the same thing as “software engineers writing code” means the same thing as “coders coding”. Technically those are the same as “programming programs”, but nobody would say it that way.

Application Development: The same as development, but specifically the development of the golfing website. This distinction matters because DevOps engineers are also developers who write software, but their software never gets used by customers.

End User: The customers who actually use the final product. Golfers who buy golf stuff from your golfing website. They’re the people at the “end” of the whole system of technology that makes the website work.

Server: A computer that runs the golfing website. Similar to a laptop running Netflix. Fundamentally, servers are the same type of thing as laptops, they’re just used for different purposes.

Compute Resource: A server, but in the cloud. This is one of the biggest simplifications in this list, but it’s good enough to get the context of most conversations. Engineers mostly say “server” even when they technically mean “compute resource”. See “serverless” below.

Infrastructure: A bunch of servers all hooked together. Infrastructure includes all the connecting bits (like the networks that they use to communicate). Individual servers aren’t good for much without the infrastructure they live in. Modern golfing websites run on complex infrastructures, not on individual servers. Infrastructure comes in endless varieties.

Serverless: Technically a better way to say this is “serverless platform”, but a lot of people just say “serverless”. A type of compute resource that doesn’t require you to manage your own servers. That reduces the amount of deployment automation that DevOps engineers have to write. Today, not all products are compatible with serverless. Serverless platforms are services sold as part of clouds, and each one is different. If your application works in one serverless platform it may not work in another. It’s common to say “going serverless” when you mean “assigning our application developers to make our product compatible with Amazon’s lambda serverless platform (because we’re tired of managing servers)”.

Containers: Containers allow engineers to create mini-servers for their products that can be easily started and stopped on whatever infrastructure needs them. This simplifies deploying the same product to different infrastructures (e.g. you might sell it as a product that multiple customers would each want to run in their own infrastructure). It can also simplify adding and removing capacity because it’s easy to add and remove more copies of the same container.

I’m going to pause the list here and note that servers, compute resources, serverless platforms, and containers are all interconnected concepts that can combine and overlap in endless varieties. A lot of the work done by DevOps engineers today is around deciding which patterns of these to use.

Deployment: The golfing website runs on infrastructure. To run, it has to be deployed. Code has to be copied over, configuration entered, commands run. Similar to how you have to install the Netflix app on a laptop before you can stream video. Together, the outcome of these actions is the deployment.

Deployment Automation: Software that deploys other software to infrastructure. It’s cheaper and more reliable to build a tool to deploy your product than to let an error-prone human do it by hand. Today, most golfing websites have two major components: the actual product code and the deployment automation code that manages its infrastructure.

Deployment Pipeline: Tooling built around deployment automation that delivers the golfing website to infrastructure. Like any software, deployment automation has to actually run somewhere (e.g. on compute resources). The deployment pipeline is that somewhere. You might ask, “what runs the deployment pipeline?” A fair question with no easy answer. This is a chicken-and-egg situation and the implementations vary a lot. Typically the pipeline and the deployment automation are part of the same code, but that’s not something that matters much outside of an engineer’s world.

Build Pipeline: This is beyond the scope of a cloud/DevOps list, but it’s worth distinguishing from deployment pipelines. Build pipelines are the tools that deliver the golfing website code to deployment automation. They’ll do things like run tests to see if there are bugs, do some formatting to make it easier to deploy, etc.

Build: A packaged version of the golfing website that’s ready to deploy. Typically this is the output of a build pipeline. It’s possible to deploy software that hasn’t been “built”, but that’s generally considered a bad practice. The details here vary a lot, but it’s usually good enough to know that a build is the outcome of application development and is also the thing that is deployed to infrastructure.

Release: A version of the golfing website. There is usually a “build” of a “release”. The distinction rarely matters in non-technical conversations. This can also be a verb: “we’re going to release the latest version of the golfing website on Thursday”.

The Cloud: A misnomer. There isn’t a cloud, there are many clouds. Clouds are products owned by corporations. Clouds provide infrastructure where you can run golfing websites. Each cloud is different, and if you build a product on one it won’t (easily) work on another. Typically clouds allow you to increase and decrease what you use (and pay for) day to day. Historically, you’d have to buy enough servers to handle your most busy day even if that meant a bunch of it sat idle on your least busy day. Clouds have grown far beyond just that one benefit, they provide all kinds of ancillary services, but at the core their value is on-demand pricing. You pay for what you’re using right now, not what you might need to use tomorrow.

AWS: Amazon Web Services. A cloud. Owned by Amazon. Distinct from Amazon.com, which is an e-commerce product that is deployed to AWS. If someone says they’re going to “the cloud”, they likely mean AWS. At the time of writing, AWS had the largest market share of all the clouds.

Azure: A cloud. Owned by Microsoft.

Google Cloud: A cloud. Owned by Google. Distinct from the Google search engine.

Application Developer: An engineer who writes the golfing website code.

System Administrator: Also called a sysadmin. An engineer who manually deploys the golfing website to infrastructure. These roles have been mostly replaced by DevOps.

Operator: A technician who monitors running infrastructure and responds if there are problems (so if golfers report that they can’t get to the golfing site, an operator will be the first person to do something about it). In environments without automation, operators are also typically responsible for deploying code to infrastructure. Increasingly these roles are being replaced by automation developed by DevOps Engineers.

DevOps Engineer: An engineer who writes deployment automation. So if you want your golfing website deployed to the AWS cloud, you’d need a DevOps engineer to write automation to do that. DevOps roles often include other responsibilities, but this is the core.

SRE: Site Reliability Engineer. Usually this is the same role as DevOps engineer, just under a different name. ⬅️ This definition will start fights with a lot of people. I recommend never saying this. It’s enough to know that SREs typically have very similar jobs to DevOps engineers.

I hope this helped! Happy project managing,



A Book from 2017: Stretch Goals and Prescriptions

Happy New Year!

Today’s post is a little outside my usual DevOps geekery, but it’s been an influence on my work and my career choices this year, so I wanted to share it.

For the record, I have zero connections to 3M.

In my teens, I noticed that whenever I bought something with the 3M logo it was noticeably better than the other brands. I didn’t know what 3M was, but this pattern kept repeating and I started to always choose them. Years later, deep inside a career in technology, I was still choosing 3M. I started to ask myself how they did it. Why were all their products better than everyone else’s?

I didn’t know anyone at 3M, so I found a book. The 3M Way to Innovation: Balancing People and Profit.


Balance? At work? And still better than everyone else? Bring it on.

The book approaches 3M through their innovations. They built hugely successful product lines in everything from sandpaper to projectors, and it turns out other companies have long looked to them as the top standard for the innovation that drives such diverse success. As I worked through the book, one thing really stuck with me: 3M’s definition of Stretch Goals.

I’ve seen a lot of managers ask their teams what can be accomplished in the next unit of time (sprint, quarter, etc.). Often, the team replies with a list that’s shorter than the manager would like. The manager then over-assigns the team by adding items as “stretch goals”. If the team works hard enough and accomplishes enough, they’ll have time to stretch themselves to meet these goals. The outcome I usually see is pressure for teams to work longer hours (with no extra pay) so they can deliver more product (at no extra cost to the company).

This book described 3M’s stretch goals very differently, which I’ll summarize in my own words because it’s characterized throughout the book and there’s no single quote that I think captures it. 3M sets these goals to stretch an aspect of the business that’s needed for it to remain a top competitor, and they’re deliberately ambitious. For example, one that 3M actually used: 30% of annual sales should come from products introduced in the last four years. Goals like these drive innovation because they’re too big to meet with the company’s current practices.

The key difference is that 3M isn’t trying to stretch the capacity of individuals. They’re not trying to increase Scrum points by pushing everyone to work late. They’re setting targets for the company that are impossible to meet unless the teams find new ways to work. They’re driving change by looking for things that can only be done with new approaches; things that can’t be done just by working longer hours. And after they set these goals, they send deeply committed managers out into the trenches to help their teams find and implement these changes. Most of the book is about what happens in those trenches. I highly recommend it.

There’s one other thing from the book I want to highlight: the process of innovation doesn’t simplify into management practices you can choose off a menu. There’s more magic to it than that. It takes skilled leaders and a delicate combination of freedom and pressure to build a company where the best engineers can do their best work, and trying to reduce that to a prescription doesn’t work. Here’s a quote from Dick Lidstad, one of the 3M leaders interviewed for the book, talking about staff from other companies who come to 3M looking to learn some of the innovation practices so they can implement them in their own teams:

They want to take away one or two things that will help them to innovate. … We say that maintaining a climate in which innovation flourishes may be the single biggest factor overall. As the conversation winds down, it becomes clear that what they want is something that is easily transferable. They want specific practices or policies, and get frustrated because they’d like to go away with a clear prescription.

I heard truth in that quote. Despite being a believer in the value of tools like Scrum, which are supposed to foster creativity and innovation, I’ve spent a lot of my career held back by the overhead of process that’s good in principle but applied with too little care to be effective. Ever spent an entire day in Scrum ceremonies? There’s more value in the experience of 3M’s teams overall than there is in any list of process.

This book was written in 2000, but not only has 3M stock continued to perform well, I also found many parallels between the stories this author tells and my own experience in the modern tech world. It’s heavy with references and first-hand interviews, and I think it’s a valuable read for anyone in tech today.

If you read it, let me know what you think!



Hygiene Checklist for Paid Subscriptions

One day I get a text from the illimitable Kai Davis. He’s had a Bad Moment.

Adam. I have terrible OpSec.

A former user had deleted a bunch of files. Luckily, he was able to recover.

Teach me how to OpSec.

No worries buddy. I got you.

Kai is a power user, and in today’s Internet that means he subscribes to two dozen hosted services. How do you manage two dozen services and keep any kind of sanity? I do it with checklists (⬅️ read this book).

Before I show them to you, we need to cover one of the Big Important Things from Mr. Gawande’s book. Kai already knows how to manage his services. He just needs to make sure he hasn’t forgotten something important like disabling access for former users.

I wrote Kai two checklists: one to use monthly to make sure nothing gets missed, and one to use when setting up new services to reduce the monthly work. I assume he has a master spreadsheet listing all his services. Kai’s Bad Moment falls under OpSec, but I didn’t limit these lists to that category.

Hopefully, these help you as well.

The Monthly Checklist

  • Can I cancel this service?
  • Should I delete users?
  • Should I change shared passwords?
  • Should I un-share anything?
  • Should I force-disconnect any devices?
  • Is the domain name about to expire?
  • Is the credit card about to expire?
  • Am I paying for more than I use?
  • Should I cancel auto-renewal?
  • Are there any messages from the provider in my account?
  • Is the last backup bigger than the one before it?
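
Two of these questions (domain and credit-card expiry) are easy to check automatically if the master spreadsheet is exported as CSV. A sketch, with made-up column names:

```python
# Sketch: flag services whose domain or card expires soon. The column names
# ("service", "domain_expires", "card_expires") are hypothetical.
import csv
from datetime import date, timedelta

def load_master_spreadsheet(path):
    """Load the master spreadsheet exported as CSV."""
    with open(path, newline="") as handle:
        return list(csv.DictReader(handle))

def expiring_soon(rows, today, within_days=45):
    """Return (service, field) pairs whose ISO expiry date is within the window."""
    horizon = today + timedelta(days=within_days)
    flagged = []
    for row in rows:
        for field in ("domain_expires", "card_expires"):
            if date.fromisoformat(row[field]) <= horizon:
                flagged.append((row["service"], field))
    return flagged
```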

The Setup Checklist

  • Add row to master spreadsheet.
  • Save URL, account ID, username, password, email address, and secret questions in 1Password.
  • Sign up for paperless everything.
  • Enter phone number and mailing address into account profile.
  • Review privacy settings.
  • Enable MFA.
  • Send hardcopy of MFA backup codes offsite.
  • Set up recurring billing.
  • Set alarm to manually check the first auto-bill.
  • Set alarm to revisit billing choices.
  • Set schedule for backups.
  • Check that backups contain the really important data.
  • Create a user for my assistant.
  • Confirm my assistant has logged in.

Some Notes


  • Can I cancel this service? I always ask “can I”, not “should I”. There’s always a reason to keep it, but I want a reason to nuke it.
  • Am I paying for more than I use? I look at current usage, not predicted usage. The number is often not actionable, but it’s a good lens.


  • Save URL, account ID, username, password, email address, and secret questions in 1Password. The URL matters because 1Password will use it to warn you about known vulnerabilities that require a password change. The email address and username may seem redundant, but having both has saved me a bunch of times. Same with secret questions.
  • Enter phone number and mailing address into account profile. These make recovery and support calls easier.
  • Review privacy settings. Remember, Kai already knows how to manage his services. He knows how to pick good privacy settings. But privacy settings are often hidden and it’s easy to forget them when signing up.
  • Enable MFA. I know it sucks, but the security landscape gets worse every day. Use this for anything expensive or private.
  • Send hardcopy of MFA backup codes offsite. I have watched people spend months on account recovery when their phones die and they lose their Google Auth.
  • Set alarm to manually check the first auto-bill. This saves me all the time. All. The. Time.
  • Set alarm to revisit billing choices. This has saved me thousands of dollars.
  • Set schedule for backups. Even if it’s an alarm to do a manual backup once a month.

Stay safe!

