A Checklist for Submitting Pull Requests

Hello!

Reviewing code is hard, especially because reviewers tend to inherit some responsibility for problems the code causes later. That can lead to churn while they try to develop confidence that new submissions are ready to merge.

I submit a lot of code for review, so I’ve been through a lot of that churn. Over the years I’ve found a few things that help make it easier for my reviewers to develop confidence in my submissions, so I decided to write a checklist. ✔️

The code I write lives in diverse repos governed by diverse requirements. A lot of the items in my checklist are there to help make sure I don’t mix up the issues I’m working on or the requirements of the repos I’m working in.

This isn’t a guide on writing good code. You can spend a lifetime on that topic. This is a quick checklist I use to avoid common mistakes.

This is written for Pull Requests submitted in git repos hosted on GitHub, but most of its steps are portable to other platforms (e.g. Perforce). It assumes common project features, like a contributing guide. Adjust as needed.

The Checklist

Immediately before submitting:

  1. Reread the issue.
  2. Merge the latest changes from the target branch (e.g. master).
  3. Reread the diff line by line.
  4. Rerun all tests. If the project doesn’t have automated tests, you can still:
    • Run static analysis tools on every file you changed (see the sketch after this list).
    • Manually exercise new functionality.
    • Manually exercise existing functionality to make sure it hasn’t changed.
  5. Check if any documentation needs to be updated to reflect your changes.
  6. Check the rendering of any markup files (e.g. README.md) in the GitHub UI.
    • There are remarkable differences in how markup files render on different platforms, so it’s important to check them in the UI where they’ll live.
  7. Reread the project’s contributing guide.
  8. Write a description that:
    1. Links to the issue it addresses.
    2. Gives a plain English summary of the change.
    3. Explains decisions you had to make. Like:
      • Why you didn’t clean up that one piece of messy code.
      • How you chose the libraries you used.
      • Why you expanded an existing module instead of writing a new one.
      • How you chose the directory and file names you did.
      • Why you put your changes in this repo, instead of that other one.
    4. Lists all the tests you ran. Include relevant output or screenshots from manual tests.
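
To make item 4 concrete, here’s a minimal sketch of the kind of helper I might run before submitting. The target branch name and the linter are assumptions (master and pyflakes here); swap in whatever your repo actually uses.

# lint_changed.py: a hypothetical pre-submission helper. It lints every
# Python file that changed relative to the target branch (assumed to be
# master) so static analysis covers exactly the files in the diff.
import subprocess

changed = subprocess.run(
    ["git", "diff", "--name-only", "master...HEAD"],
    capture_output=True, text=True, check=True,
).stdout.splitlines()

python_files = [f for f in changed if f.endswith(".py")]
if python_files:
    # check=True makes the helper fail loudly if the linter finds problems.
    subprocess.run(["pyflakes", *python_files], check=True)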

There’s no perfect way to submit code for review. That’s why we still need humans to do it. The creativity and diligence of the engineer doing the work are more important than this checklist. Still, I’ve found that these reminders help me get code through review more easily.

Happy contributing!

Adam


Tox: Testing Multiple Python Versions with Pyenv

Hello!

I use Python’s tox to orchestrate a lot of my tests. It lets you set a list of versions in a tox.ini file (in the same directory as your setup.py), like this:

[tox]
envlist = py37, py38
 
[testenv]
allowlist_externals = echo
commands = echo "success"

Then you run the tox command; it’ll create a venv for each version and run your tests in each of those environments. It’s an easy way to ensure your code works across all the versions of Python you want to support.
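
To make that concrete, here’s a tiny, hypothetical test module you could run in each environment (you’d replace the echo command above with a real test runner like pytest to pick it up):

# test_version.py: a hypothetical test that tox would run once per environment.
import sys

def test_runs_on_a_supported_version():
    # Each tox environment runs this under its own interpreter, so the check
    # below executes once on 3.7 and once on 3.8.
    assert sys.version_info[:2] in [(3, 7), (3, 8)]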

But, if I install tox into a 3.8 environment and run the tox command in the directory where I created the tox.ini above, I get this:

tox
GLOB sdist-make: /Users/adam/Local/fiddle/setup.py
py37 create: /Users/adam/Local/fiddle/.tox/py37
ERROR: InterpreterNotFound: python3.7
py38 create: /Users/adam/Local/fiddle/.tox/py38
py38 inst: /Users/adam/Local/fiddle/.tox/.tmp/package/1/example-0.0.0.zip
py38 installed: example @ file:///Users/adam/Local/fiddle/.tox/.tmp/package/1/example-0.0.0.zip
py38 run-test-pre: PYTHONHASHSEED='2325607949'
py38 run-test: commands[0] | echo success
success
___________________________________________________________________________ summary ____________________________________________________________________________
ERROR:  py37: InterpreterNotFound: python3.7
  py38: commands succeeded

It found the 3.8 interpreter I ran it with, but it couldn’t find 3.7.

pyenv can get you past this. It’s a utility for installing and switching between multiple Python versions. I use it on OS X (its installation instructions will get you set up, if you’re not already). Here’s how it looks when I have Python 3.6, 3.7, and 3.8 installed, and I’m using 3.8:

pyenv versions
  system
  3.6.11
  3.7.9
* 3.8.5 (set by /Users/adam/.pyenv/version)

Just having those versions installed isn’t enough, though. You still get the error from tox about missing versions. You have to specifically enable each version:

pyenv local 3.8.5 3.7.9
pyenv versions
  system
  3.6.11
* 3.7.9 (set by /Users/adam/Local/fiddle/.python-version)
* 3.8.5 (set by /Users/adam/Local/fiddle/.python-version)

This will create a .python-version file in the current directory that sets your Python versions. pyenv will read that file whenever you’re in that directory. You can also set versions that’ll be picked up in any folder with the pyenv global command.

Now, tox will pick up both versions:

tox
GLOB sdist-make: /Users/adam/Local/fiddle/setup.py
py37 inst-nodeps: /Users/adam/Local/fiddle/.tox/.tmp/package/1/example-0.0.0.zip
py37 installed: example @ file:///Users/adam/Local/fiddle/.tox/.tmp/package/1/example-0.0.0.zip
py37 run-test-pre: PYTHONHASHSEED='1664367937'
py37 run-test: commands[0] | echo success
success
py38 inst-nodeps: /Users/adam/Local/fiddle/.tox/.tmp/package/1/example-0.0.0.zip
py38 installed: example @ file:///Users/adam/Local/fiddle/.tox/.tmp/package/1/example-0.0.0.zip
py38 run-test-pre: PYTHONHASHSEED='1664367937'
py38 run-test: commands[0] | echo success
success
___________________________________________________________________________ summary ____________________________________________________________________________
  py37: commands succeeded
  py38: commands succeeded
  congratulations :)

That’s it! Now you can run your tests in as many versions of Python as you need.

Happy testing,

Adam


Terratest Good Practices: Table-Driven Tests

Hello!

Terratest is a common way to run integration tests against terraform modules. I use it on many of the modules I develop. If you haven’t used it before, check out its quickstart for an example of how it works.

For simple cases, the pattern in that quickstart is all you need. But, bigger modules mean more tests and pretty soon you can end up swimming in all the cases you have to define. Go has a tool to help: table-driven tests. Here’s what you need to get them set up for terratest (Dave Cheney also has a great article on them if you want to go deeper).

First, let’s look at a couple simple tests that aren’t table-driven:

package test

import (
	"testing"

	"github.com/gruntwork-io/terratest/modules/terraform"
	"github.com/stretchr/testify/assert"
)

func TestOutputsExample(t *testing.T) {
	terraformOptions := &terraform.Options{
		TerraformDir: ".",
	}
	defer terraform.Destroy(t, terraformOptions)
	terraform.InitAndApply(t, terraformOptions)

	one := terraform.Output(t, terraformOptions, "One")
	assert.Equal(t, "First.", one)
	two := terraform.Output(t, terraformOptions, "Two")
	assert.Equal(t, "Second.", two)
}

Easy. Just repeat the calls to terraform.Output and assert.Equal for each output you want to check. Not a problem, unless you have dozens or hundreds of tests. Then you end up with a lot of duplication.

You can de-duplicate the repeated calls by defining your test cases in a slice of structs (the “table”) and then looping over them. Like this:

package test

import (
	"testing"

	"github.com/gruntwork-io/terratest/modules/terraform"
	"github.com/stretchr/testify/assert"
)

func TestOutputsTableDrivenExample(t *testing.T) {
	terraformOptions := &terraform.Options{
		TerraformDir: ".",
	}
	defer terraform.Destroy(t, terraformOptions)
	terraform.InitAndApply(t, terraformOptions)

	outputTests := []struct {
		outputName    string
		expectedValue string
	}{
		{"One", "First."},
		{"Two", "Second."},
	}

	for _, testCase := range outputTests {
		outputValue := terraform.Output(t, terraformOptions, testCase.outputName)
		assert.Equal(t, testCase.expectedValue, outputValue)
	}
}

Now, there’s just one statement each for terraform.Output and assert.Equal. With only two tests it actually takes a bit more code to use a table, but once you have a lot of tests it’ll save you.

That’s it! That’s all table-driven tests are. Just a routine practice in Go that works as well in terratest as anywhere.

Happy testing,

Adam


CloudFormation: cfn-lint Pre-Commit Git Hook

Hello!

There are a few ways to validate CloudFormation templates, but I mostly use cfn-lint. I run it in my build pipelines, but I also like to run it locally so I don’t have to wait for a build to find out I made a typo. I work on a lot of templates, though, and it’s easy to forget to run it. I still end up finding out about simple errors after builds fail. Good news! Git can remember to do it for me.

I keep all my templates in git repos, and git has hooks to run scripts on actions like commits. They’re covered in the core docs and Atlassian also has a pretty good guide.

My goal is to keep bad code out of the git history, so I want commits to fail if there are linting errors. For that we need a .git/hooks/pre-commit file:

#!/usr/bin/env bash
 
find . -type f -name "*\.yaml" | xargs cfn-lint

It must be executable: chmod u+x .git/hooks/pre-commit

I bodged together a repo with some bad templates in it so you can see the output:

.
└── templates
    ├── app1
    │   └── working.yaml
    └── app2
        └── broken.yaml

There’s a deliberate syntax error in broken.yaml. If I try to commit:

git commit -m "Test cfn-lint pre-commit hook."
E0000 Template needs to be an object.
./templates/app2/broken.yaml:1:1

… and there’s no commit. My changes are still just staged:

git status
On branch master
Changes to be committed:
  (use "git reset HEAD <file>..." to unstage)
 
    new file:   templates/app1/working.yaml
    new file:   templates/app2/broken.yaml

A few notes:

  • You can ignore hook errors with the --no-verify flag. I use this when I need to break down my work into iterations of partially-working commits, then I go back and squash them in an interactive rebase.
  • These hooks are unique to your clone; they don’t automatically copy to other clones. That’s fine for my cases because this is just my personal development process; the build jobs are the real gateway.
  • cfn-lint doesn’t automatically descend into subdirectories, which is why I do a find.
  • This only works if the hook script exits non-zero when there are errors. My find/xargs pattern does, but there are lots of patterns that don’t. If you use a for/do/done loop, for example, you can end up masking the return codes and the hook won’t block commits. If you modify the pattern for searching, make sure you test the failure cases (see the sketch after this list for a version that propagates the exit code explicitly).
  • I always use the .yaml extension for my templates, and typically they’re the only YAML files in my repo. If your case is more complex than that, you’ll need to modify the pattern for searching.
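
If you’d rather write the hook in Python, here’s a minimal sketch of the same idea. It isn’t part of my original setup, but it makes the exit-code handling from the notes above explicit:

#!/usr/bin/env python3
# Hypothetical .git/hooks/pre-commit alternative: collect the YAML templates,
# run cfn-lint on them, and exit with cfn-lint's return code so a failing
# lint still blocks the commit.
import pathlib
import subprocess
import sys

templates = [str(path) for path in pathlib.Path(".").rglob("*.yaml")]
if templates:
    result = subprocess.run(["cfn-lint", *templates])
    sys.exit(result.returncode)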

Happy automating!

Adam


Best Practices For Backups: Restore Regularly

Hello!

Here’s one of my Golden Rules of Infrastructure:

If you haven’t restored from your backups, you don’t have backups.

It’s common to set up backups, check that the files or images exist and have sizes and contents that look right, then call the job done and move on. I’ve seen this leave plenty of folks with backups they can’t realistically restore from.

The Australian Cyber Security Centre says basically the same thing in their Essential Eight Maturity Model. Their top maturity level for backups has this requirement:

Full restoration of backups is tested at least once when initially implemented and each time fundamental information technology infrastructure changes occur.

100% agree. That’s the best practice I follow.

Here are some examples of failure cases that I’ve seen go undetected in environments with untested backups:

  • Filesystem tools like tar don’t always follow symbolic links automatically. That means things like tar -cf backup.tar ./important_folder/ won’t give you all the files you expect if important_folder symlinks to files in other directories. I’ve seen plenty of backups missing critical files that were hidden behind symlinks.
  • Restoring from AWS RDS snapshots creates a new database instance. The new instance won’t be managed by your automation; you have to connect it yourself. If config like endpoints and setup like network access are managed dynamically, which of course they should be, it can take a bunch of hacking to get the new DB integrated in a way that isn’t reset the next time deploy automation runs (when it’ll re-look up dynamic values for the old DB). Today’s apps usually have tight SLAs and you won’t have time to develop this process in the middle of an incident. You’ll end up compensating customers for too much downtime.

It’s not enough to examine backup files; you have to go through the exercise of restoring from them and validating that the restored system passes all the same tests you’d run before deploying a new production environment.
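
To make the RDS example concrete, here’s a rough boto3 sketch of a restore drill. The snapshot and instance names are placeholders; the point is that the restore creates a brand new instance that you then have to validate and wire up yourself.

# restore_drill.py: a hypothetical restore exercise; identifiers are made up.
import boto3

rds = boto3.client("rds")

# Restoring from a snapshot always creates a NEW instance. Nothing about this
# call reconnects the new database to your app or your automation.
rds.restore_db_instance_from_db_snapshot(
    DBInstanceIdentifier="awesome-db-restore-drill",
    DBSnapshotIdentifier="awesome-db-nightly-snapshot",
)

# Wait for it to come up, then run the same checks you'd run on a new
# production database: connectivity, schema, row counts, app smoke tests.
waiter = rds.get_waiter("db_instance_available")
waiter.wait(DBInstanceIdentifier="awesome-db-restore-drill")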

Happy automating!

Adam


PowerShell DSC In Vagrant

I work mostly on Apple Mac OS X. I’ve also been writing a lot of Windows automation, and that means PowerShell DSC. PSDSC doesn’t work on OS X yet, and even once it does I won’t be able to test Windows-only resources. To test my configurations, I use Vagrant boxes. It took a little fiddling to get this set up, so I’m posting the code here in case you’re in the same situation.

This assumes you already have Vagrant working. I use the VirtualBox provider and I install everything with homebrew.

Today, Vagrant doesn’t have a native PowerShell DSC provisioner. I use the shell provisioner to run my Headless PSDSC script, which compiles and runs a configuration (headless_dsc.ps1):

Param(
    [Parameter(Mandatory=$false)]
    [String[]]
    $InputXmlFile
)
  
if ($InputXmlFile) {
    [XML]$Inputs = Get-Content -Path $InputXmlFile
}
  
$NodeData = @{
    NodeName = "localhost"
    Message = "This was hardcoded in the node data."
    Inputs = $Inputs.inputs
}
$ConfigurationData = @{AllNodes = @($NodeData)}
  
Configuration Headless {
    Import-DscResource -ModuleName PSDesiredStateConfiguration
  
    Node 'localhost' {
        Log Message {
            Message = $AllNodes.Message
        }
        if ($AllNodes.Inputs) {
            Log Input {
                Message = $AllNodes.Inputs.Message
            }
        }
    }
}
  
Headless -ConfigurationData $ConfigurationData
Start-DscConfiguration -Wait -Force -Verbose -Path .\Headless\

And here’s an input file to provide the example messages (input.xml):

<Inputs>
    <Message>This message came from the XML inputs.</Message>
</Inputs>

Now all you need is a Vagrantfile in the same directory:

Vagrant.configure("2") do |config|
  config.vm.box = "StefanScherer/windows_2019"
  config.vm.provision "file" do |file|
    file.source = "input.xml"
    file.destination = "input.xml"
  end
  config.vm.provision "shell" do |shell|
    shell.path = "headless_dsc.ps1"
    shell.args = ["-InputXmlFile", "input.xml"]
 
    # https://github.com/hashicorp/vagrant/issues/9138#issuecomment-444408251
    shell.privileged = false
  end
end

To create the box and run PSDSC:

vagrant up

If you’ve made changes and you want to recompile and rerun the configuration:

vagrant provision

If you need to connect to the instance and run commands, check out vagrant rdp. You’ll need to install Remote Desktop from the App Store. There’s a funky bug: if Remote Desktop isn’t open, the first rdp command will open the app but won’t start a session. Just run it again and you’ll get a session window.

The provisioner uploads the script and XML file on every run even though the current directory is mounted as a synced folder on the box by default. I could have used an inline shell script to call the script directly via that sync, but sometimes I disable the sync for testing and it’s convenient for the provisioner to keep working.

Without shell.privileged = false, I got errors like this:

default: (10,8):UserId:
default: At line:72 char:1
default: + $folder.RegisterTaskDefinition($task_name, $task, 6, $username, $pass ...
default: + ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
default:     + CategoryInfo          : OperationStopped: (:) [], ArgumentException
default:     + FullyQualifiedErrorId : System.ArgumentException
default: The system cannot find the file specified. (Exception from HRESULT: 0x80070002)
default: At line:74 char:1
default: + $registered_task = $folder.GetTask("\$task_name")
default: + ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
default:     + CategoryInfo          : OperationStopped: (:) [], FileNotFoundException
default:     + FullyQualifiedErrorId : System.IO.FileNotFoundException
default: You cannot call a method on a null-valued expression.
default: At line:75 char:1
default: + $registered_task.Run($null) | Out-Null
default: + ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
default:     + CategoryInfo          : InvalidOperation: (:) [], RuntimeException
default:     + FullyQualifiedErrorId : InvokeMethodOnNull

The whole process froze and I had to hit ctrl+c twice to kill it. It’s a known bug; I left a link to it in a comment in the Vagrantfile, and I recommend keeping that comment in your code so the setting isn’t confusing later.

Happy automating!

Adam


CloudFormation: Limited-Privilege IAM Policies With cfn-nag

Hello!

This article is about security testing in CloudFormation; if you’re looking for functional testing, check out this one instead.

When you write IAM policies, you should grant the smallest set of permissions that work. So, looking at this policy defined in a CloudFormation resource:

DemoPolicy:
  Type: AWS::IAM::Policy
  Properties:
    PolicyDocument:
      Version: '2012-10-17'
      Statement:
      - Effect: Allow
        Action:
          - ec2:DescribeInstances
        Resource: '*'
    PolicyName: star-required

The Resource: '*' looks wrong. It grants permission to call DescribeInstances on all resources. It would be better if that were limited to specific resources. It’s easy to see in a code snippet like this one, but in a full template that defines a dozen policies it’s easy to miss.

Enter cfn-nag, a Ruby gem designed to detect insecure patterns in CloudFormation templates. If we run it on the template that contains the resource above, we get a warning (cfn_nag_scan is the CLI for the cfn-nag gem):

cfn_nag_scan --input-path ./demo_failing.yaml
------------------------------------------------------------
./demo_failing.yaml
------------------------------------------------------------------------------------------------------------------------
| WARN W12
|
| Resources: ["DemoPolicy"]
| Line Numbers: [5]
|
| IAM policy should not allow * resource
 
Failures count: 0
Warnings count: 1

Great! It auto-detects the '*' (star).

Here’s the problem: in this case star is required. Not all IAM actions support resource-level permissions. Some of them, like ec2:DescribeInstances, require star as their resource. You can confirm this in the IAM documentation. Like most AWS services, EC2 has an Actions, Resources, and Condition Keys for Amazon EC2 page. Its Actions Defined by Amazon EC2 section describes when star is required.

So, we need to silence the warning for this resource. Fortunately, cfn-nag supports this. Here’s the modified CF:

DemoPolicy:
  Type: AWS::IAM::Policy
  Metadata:
    cfn_nag:
      rules_to_suppress:
        - id: W12
          reason: |
            At time of writing, ec2:DescribeInstances required '*' as its resource.
            https://docs.aws.amazon.com/IAM/latest/UserGuide/list_amazonec2.html#amazonec2-actions-as-permissions
  Properties:
    PolicyDocument:
      Version: '2012-10-17'
      Statement:
      - Effect: Allow
        Action:
          - ec2:DescribeInstances
        Resource: '*'
    PolicyName: star-required

Now, cfn-nag runs without warning:

cfn_nag_scan --input-path ./demo_working.yaml 
------------------------------------------------------------
./demo_working.yaml
------------------------------------------------------------
Failures count: 0
Warnings count: 0

I recommend always including an explanation and a link to AWS documentation when you suppress a failure on a star resource. That way it’s clear to future readers (and auditors) why you granted all those permissions.

There’s a way to globally suppress warnings about star resources, but I recommend suppressing only on specific resources. That way, as you add new policies, you’ll still get warnings for cases where it is possible to limit the resources the permissions apply to.

Happy securing!

Adam


Python DevOps Code Error Checking: Lint with Pyflakes

Hello!

For those unfamiliar with linting (static analysis), read Dan Bader’s introduction.

There are several linters for Python, but when I’m doing DevOps I use Pyflakes. I love the opening sentence of its design principles:

Pyflakes makes a simple promise: it will never complain about style, and it will try very, very hard to never emit false positives.

I’m not generally rigid about style. And, when I enforce it, I use the code review process and not a static analysis tool. The Python interpreter doesn’t care about style. Style is for humans; humans are the best tools to analyze it. Linters turn what should be a human process into something robotic.

Style is especially hard to enforce in DevOps, where you’re often working with a mix of languages and frameworks and tools that all have different style conventions. For example, lots of folks use Chef in AWS. Chef is a Ruby framework. They also need lambda helper functions, but lambda doesn’t support Ruby so they write those functions in Python and now half their code is Ruby and half is Python. And that’s if you ignore all the HCL in their terraform modules… You can go insane trying to configure your linters to keep up with the variation.

More than that, in DevOps you’re not usually writing much code. A helper lambda function. A little boto3 script. If you’re writing lots of code, you’re probably making a mistake and you should be looking for an existing tool that already implements whatever you’re doing (terraform, troposphere, Ansible, Salt, paramiko, whatever).

Pyflakes is great because it catches syntax errors before execution time but won’t suck you in to The Bog of Style Sorrow. It’ll quickly tell you if you misplaced a quote mark, and then it exits. So if you do this:

bad_variable = 'Oops I forgot to close the string.

You get an error:

pyflakes test.py
test.py:1:51: EOL while scanning string literal
bad_variable = 'Oops I forgot to close the string.
                                                  ^

You also get some handy stuff like checking for unused imports. So if you do this:

import logging
good_variable = 'Huzzah! I remembered to close the string.'

You also get an error:

pyflakes test.py
test.py:1: 'logging' imported but unused

Crucially, Pyflakes will pass if you do this:

list_style_one = ['a', 'b']
list_style_two = [ 'a', 'b' ]

It’s a little funky to do both those patterns right next to each other, and if I were writing that code myself I’d fix it, but I don’t want my linter to error. The code works fine and I can read it easily. I prefer consistency, but not to the point that I want a robot to generate build failures.

I recommend running Pyflakes on all your Python DevOps code because it’s a quick win. Pretty much anything it errors on you should fix before you try to use the code, and it’s usually faster to run Pyflakes than to deploy a new version of the code and see if it works. I like things that are fast. 😁
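
Pyflakes also exposes a small Python API, so you can wire it into a build script instead of shelling out to the CLI. Here’s a minimal sketch; treat the details as an assumption and check the pyflakes docs for your version.

# lint_all.py: a hypothetical build step using the pyflakes API.
import sys

from pyflakes.api import checkRecursive
from pyflakes.reporter import Reporter

# checkRecursive walks the given paths, reports problems, and returns a count.
warning_count = checkRecursive(["."], Reporter(sys.stdout, sys.stderr))
sys.exit(1 if warning_count else 0)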

Happy automating!

Adam


Boto3 Best Practices: Assert to Stop Silent Failures

Good morning!

A variation of this article was given as a lightning talk at the San Diego Python Meetup.

This article covers a pattern I use to increase my confidence that my infrastructure code is working. It turns silent errors into loud ones. I’ve handled plenty of code that runs without errors but still ends up doing the wrong thing, so I’m never really sure if it’s safe to go to sleep at night. I don’t like that. I want silence to be a real indicator that everything is fine. Like The Zen of Python says:

Errors should never pass silently.

It’s easy to bake assumptions into boto code that turn into silent errors. Imagine you have an EBS volume called awesome-stuff and you need to snapshot it for backups. You might write something like this:

import datetime
 
import boto3
 
ec2 = boto3.resource('ec2')
volume_filters = [{'Name': 'tag:Name', 'Values': ['awesome-stuff']}]
volumes = list(ec2.volumes.filter(Filters=volume_filters))
volume = volumes[0]
now = datetime.datetime.now(datetime.timezone.utc).strftime("%Y-%m-%dT%H-%M-%S%Z")
volume.create_snapshot(Description=f'awesome-stuff-backup-{now}')

Simple enough. We know our volume is named awesome-stuff, so we look up volumes with that name. There should only be one, so we snapshot the first item in that list. I’ve seen this pattern all over the boto code I’ve read.

What if there are two volumes called “awesome-stuff”? That could easily happen. Another admin makes a copy and tags it the same way. An unrelated project in the same account creates a volume with the same name because awesome-stuff isn’t super unique. It’s very possible to have two volumes with the same name, and you should assume it’ll happen. When it does, this script will run without errors. It will create a snapshot, too, but only of one volume. There is no luck in operations, so you can be 100% certain it will snapshot the wrong one. You will have zero backups but you won’t know it.

There’s an easy pattern to avoid this. First, let me show you Python’s assert statement:

awesome_list = ['a', 'b']
assert len(awesome_list) == 1

We’re telling Python we expect awesome_list to contain one item. If we run this, it errors:

Traceback (most recent call last):
  File "error.py", line 2, in <module>
    assert len(awesome_list) == 1
AssertionError

This is a sane message. Anyone reading it can see we expected there to be exactly one object in awesome_list but there wasn’t.

Back to boto. Let’s add an assert to our backup script:

import datetime

import boto3

ec2 = boto3.resource('ec2')
volume_filters = [{'Name': 'tag:Name', 'Values': ['awesome-stuff']}]
volumes = list(ec2.volumes.filter(Filters=volume_filters))
assert len(volumes) == 1
volume = volumes[0]
now = datetime.datetime.now(datetime.timezone.utc).strftime("%Y-%m-%dT%H-%M-%S%Z")
volume.create_snapshot(Description=f'awesome-stuff-backup-{now}')

Now, if there are two awesome-stuff volumes, our script will error:

Traceback (most recent call last):
  File "test.py", line 8, in <module>
    assert len(volumes) == 1
AssertionError

Boom. That’s all you have to do. Now the script either does what we expect (backs up our awesome stuff) or it fails with a clear message. We know we don’t have any backups yet and we need to take action. Because we assert that there should be exactly one volume, this even covers us for the cases where that volume has been renamed or there’s a typo in our filters.
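
One optional tweak (my addition, not part of the original script): assert accepts a message, which makes the failure self-explanatory when it shows up in logs.

assert len(volumes) == 1, (
    f"Expected exactly 1 volume named awesome-stuff, found {len(volumes)}."
)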

Here’s a good practice to follow in all of your code:

If your code assumes something, assert that the assumption is true so you’ll get a clear, early failure message if it isn’t.

If you’re interested in further reading or more sources for this practice, check out Jim Shore’s Fail Fast article.

In general, these are called logic errors: problems with the way the code thinks (its “logic”). Often they won’t even cause errors; they’ll just create behavior you didn’t expect and that might be harmful. Writing code that’s resilient to these kinds of flaws will take your infrastructure to the next level. It won’t just seem like it’s working; you’ll have confidence that it’s working.

Happy automating!

Adam


DevOps FMEA Shortlist

Failure Modes and Effects Analysis (FMEA) is a formal practice of breaking individual pieces of your system and watching how each failure affects the system as a whole (this is the definition I’ve seen used most in tech). Stop one server in a pool and capacity drops. No problem. Stop one server in a pool and users can’t log in anymore? Problem. FMEA helps ensure that downstream effects of failures aren’t bigger than you expect.

Real FMEA is a Big Deal and is often out of reach. But, even doing small FMEA during routine development can save your life. I love a good list, and over the years I’ve assembled one from real world failures that had bigger consequences than expected. Testing those failures has saved me many times, so I thought I’d share the list.

Some failures might look the same, like rebooting and terminating, but the differences matter. I’ve seen a lot of infrastructures go down because deployment automation started app processes with nohup and didn’t create an init config to ensure they restarted on reboot.

Failure → Expected Outcome

  • Clocks are slow (e.g. ntpd died yesterday) → Alerts to operators who can respond
  • Config that prevents the app from starting is deployed → Clear failure message from deploys (but no outage in the running system)
  • External service goes down (e.g. PyPI) → Clear failure message from deploys (but no outage in the running system)
  • Database is out of space → Alerts to operators who can respond
  • App process is killed by the Linux OOM → Alerts to operators and a short drop in capacity that recovers automatically
  • That singleton host you haven’t gotten rid of terminates → Brief outage that recovers automatically
  • App host reboots → Short drop in capacity that recovers automatically
  • App host terminates → Short drop in capacity that recovers automatically
  • App service stops on a host → Short drop in capacity that recovers automatically
  • All app hosts terminate → Brief outage that recovers automatically
  • Worker host reboots → Short spike in queue length
  • Worker host terminates → Short spike in queue length
  • Worker service stops on a host → Short spike in queue length
  • All worker hosts terminate → Slightly longer spike in queue length
  • Cache host reboots → Temporarily increased response times, some users are logged out
  • Cache host terminates → Temporarily increased response times, some users are logged out
  • Cache service stops on a host → Temporarily increased response times, some users are logged out

There’s a lot more you’d have to break to fully exercise your system; this is just a lightweight list of the failures I’ve found most valuable to test. I usually don’t run every simulation when developing every feature, just the ones closely connected to what I’m building. It’s a good practice to simulate several failures at once. What happens if an external service like PyPI is down and a host is terminated?
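
As an illustration, here’s a rough sketch of how I might simulate one row of the list above (“App host terminates”) with boto3 in a test environment. The role tag is an assumption; use whatever identifies your app hosts.

# simulate_app_host_termination.py: a hypothetical failure simulation.
# Only run something like this against a test environment.
import random

import boto3

ec2 = boto3.client("ec2")
reservations = ec2.describe_instances(
    Filters=[
        {"Name": "tag:role", "Values": ["app"]},
        {"Name": "instance-state-name", "Values": ["running"]},
    ]
)["Reservations"]
instances = [i for r in reservations for i in r["Instances"]]

# Terminate one at random, then watch the system: the expected outcome is a
# short drop in capacity that recovers automatically, and nothing worse.
victim = random.choice(instances)["InstanceId"]
ec2.terminate_instances(InstanceIds=[victim])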

If you’re already thinking of modifications you’d need to make for this list to work for you, that’s good! Maybe you don’t depend on time sync (although you likely do, so many base components do that it’s hard to avoid). Maybe you have a legacy queue system that has unique failure cases (like that time the instance type changed which changed the processor count which made it run out of licenses). Maybe you don’t depend on singletons (huzzah!). Adjust as needed.

Happy automating!

Adam
