Let’s say you need to update lots of keys in Amazon S3. If you have many objects in your bucket, doing this one key at a time is quite slow. Of course, as a Python developer, you’re using the nifty boto library. We can make updating all of your keys much, much faster by fanning the work out to a pool of worker processes!
In this example, I will enable caching for all of the objects in my bucket.
from multiprocessing import Pool

import boto

conn = boto.connect_s3()
bucket = conn.get_bucket('my_bucket_foo')

cache_control = {'Cache-Control': 'no-transform,public,max-age=300,s-maxage=900'}

def update(key):
    # Re-fetch the key so its content type and existing metadata are populated.
    k = bucket.get_key(key.name)
    cache_control.update({'Content-Type': k.content_type})
    k.metadata.update(cache_control)
    # Copy the key onto itself to rewrite its metadata in place.
    key.copy(k.bucket.name,
             k.name,
             k.metadata,
             preserve_acl=True)
    print(k.name)

pool = Pool(processes=100)
pool.map(update, bucket.list())
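Once the pool finishes, you can spot-check a single object to confirm the new header stuck. This is just a quick sanity-check sketch; the key name below is a placeholder for one of your own objects:

k = bucket.get_key('path/to/some/object')
print(k.cache_control)   # should show no-transform,public,max-age=300,s-maxage=900
print(k.content_type)    # the original content type should be preserved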
In the next example, I will enable public access to all of the objects in my bucket.
from multiprocessing import Pool

import boto

all_users = 'http://acs.amazonaws.com/groups/global/AllUsers'

conn = boto.connect_s3()
bucket = conn.get_bucket('my_bucket_foo')

def update(key):
    acl = key.get_acl()
    # Only touch keys that do not already have an AllUsers grant.
    if not any(grant.uri == all_users for grant in acl.acl.grants):
        key.make_public()
        print(key.name)

pool = Pool(processes=100)
pool.map(update, bucket.list())
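To confirm an object is now public, you can dump its ACL and look for the AllUsers grant. Again, a quick sanity-check sketch with a placeholder key name:

k = bucket.get_key('path/to/some/object')
for grant in k.get_acl().acl.grants:
    print(grant.permission, grant.uri)   # expect a READ grant for the AllUsers URI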
If you’re running this on Windows, a slight change is necessary: multiprocessing spawns fresh interpreter processes there, so the pool must be created under a __main__ guard, with freeze_support() (imported alongside Pool) called first:

from multiprocessing import Pool, freeze_support

if __name__ == '__main__':
    freeze_support()
    pool = Pool(processes=100)
    pool.map(update, bucket.list())