How to Paginate in boto3: Use Collections Instead

Hello!

When working with boto3, you’ll often find yourself looping. Like if you wanted to get the names of all the objects in an S3 bucket, you might do this:

import boto3
 
s3 = boto3.client('s3')
 
response = s3.list_objects_v2(Bucket='my-bucket')
for object in response['Contents']:
    print(object['Key'])

But, methods like list_objects_v2 have limits on how many objects they’ll return in one call (up to 1000 in this case). If you reach that limit, or if you know you eventually will, the solution used to be pagination. Like this:

import boto3
 
s3 = boto3.client('s3')
 
paginator = s3.get_paginator('list_objects_v2')
pages = paginator.paginate(Bucket='my-bucket')
 
for page in pages:
    for object in page['Contents']:
        print(object['Key'])

I always forget how to do this. I also feel like it clutters up my code with API implementation details that don’t have anything to do with the objects I’m trying to list.

There’s a better way! Boto3 has semi-new things called collections, and they are awesome:

import boto3
 
s3 = boto3.resource('s3')
bucket = s3.Bucket('my-bucket')
objects = bucket.objects.all()
 
for object in objects:
    print(object.key)

If they look familiar, it’s probably because they’re modeled after the QuerySets in Django’s ORM. They work like an object-oriented interface to a database. It’s convenient to think about AWS like that when you’re writing code: it’s a database of cloud resources. You query the resources you want to interact with and read their properties (e.g. object.key like we did above) or call their methods.

You can do more than list, too. For example, in S3 you can empty a bucket in one line (this works even if there are pages and pages of objects in the bucket):

import boto3
 
s3 = boto3.resource('s3')
bucket = s3.Bucket('my-bucket')
bucket.objects.all().delete()

Boom šŸ’„. One line, no loop. Use wisely.

I recommend collections whenever you need to iterate. I’ve found the code is easier to read and their usage is easier to remember than paginators. Some notes:

  • This is just an introduction, collections can do a lot more. Check out filtering. It’s excellent.
  • Collections aren’t available for every resource (yet). Sometimes you have to fall back to a paginator.
  • There are cases where using a collection can result in more API calls than you expect. Most of the time this isn’t a problem, but if you’re seeing performance problems you might want to dig into the nuances in the docs.

Hopefully, this helps simplify your life with the AWS API.

Happy automating!

Adam

Need more than just this article? We’re available to consult.

You might also want to check out these related articles:

Boto3 Best Practices: Assert to Stop Silent Failures

Good morning!

A variation of this article was given as a lightning talk at the San Diego Python Meetup.

This article covers a pattern I use to increase my confidence that my infrastructure code is working. It turns silent errors into loud ones. I’ve handled plenty of code that runs without errors but still ends up doing the wrong thing, so I’m never really sure if it’s safe to go to sleep at night. I don’t like that. I want silence to be a real indicator that everything is fine. Like The Zen of Python says:

Errors should never pass silently.

It’s easy to bake assumptions into boto code that’ll create silent errors. Imagine you have an EBS volume called awesome-stuff and you need to snapshot it for backups. You might write something like this:

import datetime
 
import boto3
 
ec2 = boto3.resource('ec2')
volume_filters = [{'Name': 'tag:Name', 'Values': ['awesome-stuff']}]
volumes = list(ec2.volumes.filter(Filters=volume_filters))
volume = volumes[0]
now = datetime.datetime.now(datetime.timezone.utc).strftime("%Y-%m-%dT%H-%M-%S%Z")
volume.create_snapshot(Description=f'awesome-stuff-backup-{now}')

Simple enough. We know our volume is named awesome-stuff, so we look up volumes with that name. There should only be one, so we snapshot the first item in that list. I’ve seen this pattern all over the boto code I’ve read.

What if there are two volumes called “awesome-stuff”? That could easily happen. Another admin makes a copy and tags it the same way. An unrelated project in the same account creates a volume with the same name because awesome-stuff isn’t super unique. It’s very possible to have two volumes with the same name, and you should assume it’ll happen. When it does, this script will run without errors. It will create a snapshot, too, but only of one volume. There is no luck in operations, so you can be 100% certain it will snapshot the wrong one. You will have zero backups but you won’t know it.

There’s an easy pattern to avoid this. First, let me show you Python’s assert statement:

awesome_list = ['a', 'b']
assert len(awesome_list) == 1

We’re telling Python we expect awesome_list to contain one item. If we run this, it errors:

Traceback (most recent call last):
    File "error.py", line 2, in <module>
assert len(awesome_list) == 1
AssertionError

This is a sane message. Anyone reading it can see we expected there to be exactly one object in awesome_list but there wasn’t.

Back to boto. Let’s add an assert to our backup script:

import datetime

import boto3

ec2 = boto3.resource('ec2')
volume_filters = [{'Name': 'tag:Name', 'Values': ['awesome-stuff']}]
volumes = list(ec2.volumes.filter(Filters=volume_filters))
assert len(volumes) == 1
volume = volumes[0]
now = datetime.datetime.now(datetime.timezone.utc).strftime("%Y-%m-%dT%H-%M-%S%Z")
volume.create_snapshot(Description=f'awesome-stuff-backup-{now}')

Now, if there are two awesome-stuff volumes, our script will error:

Traceback (most recent call last):
    File "test.py", line 8, in <module>
assert len(volumes) == 1
AssertionError

Boom. That’s all you have to do. Now the script either does what we expect (backs up our awesome stuff) or it fails with a clear message. We know we don’t have any backups yet and we need to take action. Because we assert that there should be exactly one volume, this even covers us for the cases where that volume has been renamed or there’s a typo in our filters.

Here’s a good practice to follow in all of your code:

If your code assumes something, assert that the assumption is true so you’ll get a clear, early failure message if it isn’t.
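
You can go one step further and attach a message to the assert so the failure explains itself. Here’s a sketch using a stand-in list instead of a real boto3 result:

```python
# Stand-in for the volumes list a real script would get from boto3.
volumes = ['vol-1', 'vol-2']

try:
    assert len(volumes) == 1, f'expected exactly 1 awesome-stuff volume, found {len(volumes)}'
except AssertionError as error:
    print(error)  # expected exactly 1 awesome-stuff volume, found 2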

If you’re interested in further reading or more sources for this practice, check out Jim Shore’s Fail Fast article.

In general, these are called logic errors: problems with the way the code thinks (its “logic”). Often they won’t even cause errors; they’ll just create behavior you didn’t expect, and that behavior might be harmful. Writing code that’s resilient to these kinds of flaws will take your infrastructure to the next level. It won’t just seem like it’s working; you’ll have confidence that it’s working.

Happy automating!

Adam


Python boto3 Logging

Hello!

If you’re writing a lambda function, check out this article instead.

The best way to log output from boto3 is with Python’s logging library. The core docs have a nice tutorial.

If you use print() statements for output, all you’ll get from boto is what you capture and print yourself. But, boto does a lot of internal logging that we can capture for free.

Good libraries, like boto, use Python’s logging library internally. If you set up a logger using the same library, it will automatically capture boto’s logs along with your own.
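
Here’s a minimal sketch of that mechanism, using a made-up library logger name ('somelib') in place of boto3’s:

```python
import logging

captured = []

class ListHandler(logging.Handler):
    """Collects log messages in a list so we can see what the root logger receives."""
    def emit(self, record):
        captured.append(record.getMessage())

root = logging.getLogger()
root.setLevel(logging.INFO)
root.addHandler(ListHandler())

# A library logs through its own named logger...
logging.getLogger('somelib').info('internal library event')

# ...and the message propagates up to the root logger's handlers.
print(captured)  # ['internal library event']
```

Boto’s loggers work the same way: they propagate up to the root logger, so any handler you attach there sees boto’s output too.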

Here’s how I set up logging. This is a demo script, in the real world you’d parameterize the inputs, etc.

import logging
import boto3
 
if __name__ == '__main__':
    logging.basicConfig(
        level=logging.INFO,
        format='%(asctime)s %(levelname)s %(message)s'
    )
    logger = logging.getLogger()
    logger.debug('The script is starting.')
    logger.info('Connecting to EC2...')
    ec2 = boto3.client('ec2')

That’s it! The basicConfig() function sets up the root logger for you. We’ve told it what amount of output to show (the level) and to show the event time and level on each output line. The logging library docs have more info on what levels and formatting are available.

If you set the level to INFO, it’ll output anything logged with .info() (or higher) by your code and boto’s internal code. You won’t see our 'The script is starting.' line because anything logged at the DEBUG level will be excluded.

2019-08-18 07:59:20,123 INFO Connecting to EC2...
Traceback (most recent call last):
  File "demo.py", line 11, in <module>
    ec2 = boto3.client('ec2')
  File "/Users/adam/opt/env3/lib/python3.7/site-packages/boto3/__init__.py", line 91, in client
    return _get_default_session().client(*args, **kwargs)
  File "/Users/adam/opt/env3/lib/python3.7/site-packages/boto3/session.py", line 263, in client
    aws_session_token=aws_session_token, config=config)
  File "/Users/adam/opt/env3/lib/python3.7/site-packages/botocore/session.py", line 838, in create_client
    client_config=config, api_version=api_version)
  File "/Users/adam/opt/env3/lib/python3.7/site-packages/botocore/client.py", line 86, in create_client
    verify, credentials, scoped_config, client_config, endpoint_bridge)
  File "/Users/adam/opt/env3/lib/python3.7/site-packages/botocore/client.py", line 328, in _get_client_args
    verify, credentials, scoped_config, client_config, endpoint_bridge)
  File "/Users/adam/opt/env3/lib/python3.7/site-packages/botocore/args.py", line 47, in get_client_args
    endpoint_url, is_secure, scoped_config)
  File "/Users/adam/opt/env3/lib/python3.7/site-packages/botocore/args.py", line 117, in compute_client_args
    service_name, region_name, endpoint_url, is_secure)
  File "/Users/adam/opt/env3/lib/python3.7/site-packages/botocore/client.py", line 402, in resolve
    service_name, region_name)
  File "/Users/adam/opt/env3/lib/python3.7/site-packages/botocore/regions.py", line 122, in construct_endpoint
    partition, service_name, region_name)
  File "/Users/adam/opt/env3/lib/python3.7/site-packages/botocore/regions.py", line 135, in _endpoint_for_partition
    raise NoRegionError()
botocore.exceptions.NoRegionError: You must specify a region.

If you change the level to DEBUG, you’ll get everything:

2019-08-18 08:28:06,189 DEBUG The script is starting.
2019-08-18 08:28:06,190 INFO Connecting to EC2...
2019-08-18 08:28:06,190 DEBUG Changing event name from creating-client-class.iot-data to creating-client-class.iot-data-plane
2019-08-18 08:28:06,193 DEBUG Changing event name from before-call.apigateway to before-call.api-gateway
2019-08-18 08:28:06,193 DEBUG Changing event name from request-created.machinelearning.Predict to request-created.machine-learning.Predict
2019-08-18 08:28:06,194 DEBUG Changing event name from before-parameter-build.autoscaling.CreateLaunchConfiguration to before-parameter-build.auto-scaling.CreateLaunchConfiguration
2019-08-18 08:28:06,195 DEBUG Changing event name from before-parameter-build.route53 to before-parameter-build.route-53
2019-08-18 08:28:06,195 DEBUG Changing event name from request-created.cloudsearchdomain.Search to request-created.cloudsearch-domain.Search
2019-08-18 08:28:06,195 DEBUG Changing event name from docs.*.autoscaling.CreateLaunchConfiguration.complete-section to docs.*.auto-scaling.CreateLaunchConfiguration.complete-section
2019-08-18 08:28:06,197 DEBUG Changing event name from before-parameter-build.logs.CreateExportTask to before-parameter-build.cloudwatch-logs.CreateExportTask
2019-08-18 08:28:06,197 DEBUG Changing event name from docs.*.logs.CreateExportTask.complete-section to docs.*.cloudwatch-logs.CreateExportTask.complete-section
2019-08-18 08:28:06,197 DEBUG Changing event name from before-parameter-build.cloudsearchdomain.Search to before-parameter-build.cloudsearch-domain.Search
2019-08-18 08:28:06,197 DEBUG Changing event name from docs.*.cloudsearchdomain.Search.complete-section to docs.*.cloudsearch-domain.Search.complete-section
2019-08-18 08:28:06,211 DEBUG Looking for credentials via: env
2019-08-18 08:28:06,211 DEBUG Looking for credentials via: assume-role
2019-08-18 08:28:06,211 DEBUG Looking for credentials via: shared-credentials-file
2019-08-18 08:28:06,212 DEBUG Looking for credentials via: custom-process
2019-08-18 08:28:06,212 DEBUG Looking for credentials via: config-file
2019-08-18 08:28:06,212 DEBUG Looking for credentials via: ec2-credentials-file
2019-08-18 08:28:06,212 DEBUG Looking for credentials via: boto-config
2019-08-18 08:28:06,212 DEBUG Looking for credentials via: container-role
2019-08-18 08:28:06,212 DEBUG Looking for credentials via: iam-role
2019-08-18 08:28:06,213 DEBUG Converted retries value: False -> Retry(total=False, connect=None, read=None, redirect=0, status=None)
2019-08-18 08:28:06,213 DEBUG Starting new HTTP connection (1): 169.254.169.254:80
2019-08-18 08:28:07,215 DEBUG Caught retryable HTTP exception while making metadata service request to http://169.254.169.254/latest/meta-data/iam/security-credentials/: Connect timeout on endpoint URL: "http://169.254.169.254/latest/meta-data/iam/security-credentials/"
Traceback (most recent call last):
  File "/Users/adam/opt/env3/lib/python3.7/site-packages/urllib3/connection.py", line 159, in _new_conn
    (self._dns_host, self.port), self.timeout, **extra_kw)
  File "/Users/adam/opt/env3/lib/python3.7/site-packages/urllib3/util/connection.py", line 80, in create_connection
    raise err
  File "/Users/adam/opt/env3/lib/python3.7/site-packages/urllib3/util/connection.py", line 70, in create_connection
    sock.connect(sa)
socket.timeout: timed out
 
During handling of the above exception, another exception occurred:
 
Traceback (most recent call last):
  File "/Users/adam/opt/env3/lib/python3.7/site-packages/botocore/httpsession.py", line 258, in send
    decode_content=False,
  File "/Users/adam/opt/env3/lib/python3.7/site-packages/urllib3/connectionpool.py", line 638, in urlopen
    _stacktrace=sys.exc_info()[2])
  File "/Users/adam/opt/env3/lib/python3.7/site-packages/urllib3/util/retry.py", line 343, in increment
    raise six.reraise(type(error), error, _stacktrace)
  File "/Users/adam/opt/env3/lib/python3.7/site-packages/urllib3/packages/six.py", line 686, in reraise
    raise value
  File "/Users/adam/opt/env3/lib/python3.7/site-packages/urllib3/connectionpool.py", line 600, in urlopen
    chunked=chunked)
  File "/Users/adam/opt/env3/lib/python3.7/site-packages/urllib3/connectionpool.py", line 354, in _make_request
    conn.request(method, url, **httplib_request_kw)
  File "/Users/adam/.pyenv/versions/3.7.2/lib/python3.7/http/client.py", line 1229, in request
    self._send_request(method, url, body, headers, encode_chunked)
  File "/Users/adam/opt/env3/lib/python3.7/site-packages/botocore/awsrequest.py", line 125, in _send_request
    method, url, body, headers, *args, **kwargs)
  File "/Users/adam/.pyenv/versions/3.7.2/lib/python3.7/http/client.py", line 1275, in _send_request
    self.endheaders(body, encode_chunked=encode_chunked)
  File "/Users/adam/.pyenv/versions/3.7.2/lib/python3.7/http/client.py", line 1224, in endheaders
    self._send_output(message_body, encode_chunked=encode_chunked)
  File "/Users/adam/opt/env3/lib/python3.7/site-packages/botocore/awsrequest.py", line 152, in _send_output
    self.send(msg)
  File "/Users/adam/opt/env3/lib/python3.7/site-packages/botocore/awsrequest.py", line 236, in send
    return super(AWSConnection, self).send(str)
  File "/Users/adam/.pyenv/versions/3.7.2/lib/python3.7/http/client.py", line 956, in send
    self.connect()
  File "/Users/adam/opt/env3/lib/python3.7/site-packages/urllib3/connection.py", line 181, in connect
    conn = self._new_conn()
  File "/Users/adam/opt/env3/lib/python3.7/site-packages/urllib3/connection.py", line 164, in _new_conn
    (self.host, self.timeout))
urllib3.exceptions.ConnectTimeoutError: (<botocore.awsrequest.AWSHTTPConnection object at 0x1045a1f98>, 'Connection to 169.254.169.254 timed out. (connect timeout=1)')
 
During handling of the above exception, another exception occurred:
 
Traceback (most recent call last):
  File "/Users/adam/opt/env3/lib/python3.7/site-packages/botocore/utils.py", line 303, in _get_request
    response = self._session.send(request.prepare())
  File "/Users/adam/opt/env3/lib/python3.7/site-packages/botocore/httpsession.py", line 282, in send
    raise ConnectTimeoutError(endpoint_url=request.url, error=e)
botocore.exceptions.ConnectTimeoutError: Connect timeout on endpoint URL: "http://169.254.169.254/latest/meta-data/iam/security-credentials/"
2019-08-18 08:28:07,219 DEBUG Max number of attempts exceeded (1) when attempting to retrieve data from metadata service.
2019-08-18 08:28:07,219 DEBUG Loading JSON file: /Users/adam/opt/env3/lib/python3.7/site-packages/botocore/data/endpoints.json
2019-08-18 08:28:07,224 DEBUG Event choose-service-name: calling handler <function handle_service_name_alias at 0x1044b29d8>
2019-08-18 08:28:07,235 DEBUG Loading JSON file: /Users/adam/opt/env3/lib/python3.7/site-packages/botocore/data/ec2/2016-11-15/service-2.json
2019-08-18 08:28:07,258 DEBUG Event creating-client-class.ec2: calling handler <function add_generate_presigned_url at 0x104474510>
Traceback (most recent call last):
  File "demo.py", line 12, in <module>
    ec2 = boto3.client('ec2')
  File "/Users/adam/opt/env3/lib/python3.7/site-packages/boto3/__init__.py", line 91, in client
    return _get_default_session().client(*args, **kwargs)
  File "/Users/adam/opt/env3/lib/python3.7/site-packages/boto3/session.py", line 263, in client
    aws_session_token=aws_session_token, config=config)
  File "/Users/adam/opt/env3/lib/python3.7/site-packages/botocore/session.py", line 838, in create_client
    client_config=config, api_version=api_version)
  File "/Users/adam/opt/env3/lib/python3.7/site-packages/botocore/client.py", line 86, in create_client
    verify, credentials, scoped_config, client_config, endpoint_bridge)
  File "/Users/adam/opt/env3/lib/python3.7/site-packages/botocore/client.py", line 328, in _get_client_args
    verify, credentials, scoped_config, client_config, endpoint_bridge)
  File "/Users/adam/opt/env3/lib/python3.7/site-packages/botocore/args.py", line 47, in get_client_args
    endpoint_url, is_secure, scoped_config)
  File "/Users/adam/opt/env3/lib/python3.7/site-packages/botocore/args.py", line 117, in compute_client_args
    service_name, region_name, endpoint_url, is_secure)
  File "/Users/adam/opt/env3/lib/python3.7/site-packages/botocore/client.py", line 402, in resolve
    service_name, region_name)
  File "/Users/adam/opt/env3/lib/python3.7/site-packages/botocore/regions.py", line 122, in construct_endpoint
    partition, service_name, region_name)
  File "/Users/adam/opt/env3/lib/python3.7/site-packages/botocore/regions.py", line 135, in _endpoint_for_partition
    raise NoRegionError()
botocore.exceptions.NoRegionError: You must specify a region.

See how it started saying where it found AWS credentials? Imagine you’re trying to figure out why your script worked locally but didn’t work on an EC2 instance; knowing where it found keys is huge. Maybe there are some hardcoded ones you didn’t know about that it’s picking up instead of the IAM role you attached to the instance. In DEBUG mode that’s easy to figure out. With print you’d have to hack out these details yourself.
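
The level doesn’t have to be all-or-nothing, either. Loggers are named hierarchically, so you can run your own code at DEBUG while quieting boto’s internals. The logger names below are the ones boto3 and its dependencies log under:

```python
import logging

logging.basicConfig(
    level=logging.DEBUG,
    format='%(asctime)s %(levelname)s %(name)s %(message)s'
)

# Keep your own DEBUG output, but only show warnings and errors from boto.
for name in ('boto3', 'botocore', 'urllib3'):
    logging.getLogger(name).setLevel(logging.WARNING)

logging.getLogger(__name__).debug('This still shows; boto chatter does not.')
```

Adding %(name)s to the format is handy here: it shows which logger each line came from, so you can see exactly which names to quiet.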

This is great for simple scripts, but for something you’re going to run in production I recommend this pattern.

Happy automating!

Adam
