Understanding Python's Collections Module

Daniel Morales
Feb 16, 2021

Understanding Python's Collections Module

Feb 16, 2021 13 minutes read

The Python collections module has different specialized data types that function as containers and can be used to replace the general purpose Python containers (`dict`, `tuple`, `list` and `set`). We will study the following parts of this module:

- `ChainMap`
- `defaultdict`
- `deque`

There is a submodule of collections called abc or Abstract Base Classes. These will not be covered in this post, let's start with the ChainMap container!

ChainMap

A ChainMap is a class that provides the ability to link multiple mappings together so that they end up as a single unit. If you look at the documentation, you will notice that it accepts `**maps*`, which means that a ChainMap will accept any number of mappings or dictionaries and convert them into a single view that you can update. Let's see an example so you can see how it works:

from collections import ChainMap
car_parts = {'hood': 500, 'engine': 5000, 'front_door': 750}
car_options = {'A/C': 1000, 'Turbo': 2500, 'rollbar': 300}
car_accessories = {'cover': 100, 'hood_ornament': 150, 'seat_cover': 99}
car_pricing = ChainMap(car_accessories, car_options, car_parts)

car_pricing
>> ChainMap({'cover': 100, 'hood_ornament': 150, 'seat_cover': 99}, {'A/C': 1000, 'Turbo': 2500, 'rollbar': 300}, {'hood': 500, 'engine': 5000, 'front_door': 750})
car_pricing['hood']
>> 500

Here we import ChainMap from our collections module. Next we create three dictionaries. Then we create an instance of our ChainMap by passing it the three dictionaries we just created.

Finally, we try to access one of the keys of our ChainMap. When we do this, the ChainMap will go through each map to see if that key exists and has a value. If it does, then the ChainMap will return the first value it finds that matches that key.

This is especially useful if you want to set default values. Suppose we want to create an application that has some default values. The application will also know the operating system environment variables. If there is an environment variable that matches one of the keys we have as a default in our application, the environment will override our default value. In addition, let's assume that we can pass arguments to our application. 

These arguments take precedence over the environment and the defaults. This is one place where a ChainMap can really come in handy. Let's look at a simple example that is based on one from the Python documentation:

Note: do not run this code from Jupyter Notebook, but from your favorite IDE and calling it from a terminal. this command `python chain_map.py -u daniel`

import argparse
import os

from collections import ChainMap

def main():
    app_defaults = {'username':'admin', 'password':'admin'}

    parser = argparse.ArgumentParser()
    parser.add_argument('-u', '--username')
    parser.add_argument('-p', '--password')
    args = parser.parse_args()

    command_line_arguments = {key:value for key, value in vars(args).items() if value}

    chain = ChainMap(command_line_arguments, os.environ, app_defaults)
    print(chain['username'])

if __name__ == '__main__':
    main()
    os.environ['username'] = 'test'
    main()

➜  python python3 post.py -u daniel       
daniel
daniel

Let's break this down a bit. Here we import the Python `argparse` module along with the `os` module. We also import `ChainMap`.

Next we have a simple function that has some defaults. I have seen these defaults used for some popular routers. We then set up our argument parser and tell it how to handle certain command line options. You'll notice that argparse doesn't provide a way to get a dictionary object from its arguments, so we use a dict comprehension to extract what we need. 

The other interesting piece here is the use of Python's built-in vars. If you called it without a vars argument it would behave like Python's built-in locales. But if you pass it an object, then vars is the equivalent of the `__dict__` property of object. In other words, vars(args) is equal to `args.__dict__`. 

Finally we create our ChainMap by passing our command line arguments (if any), then the environment variables and finally the default values.

At the end of the code, we try calling our function, then setting an environment variable and calling it again. Try it and you will see that it prints admin and then tests as expected. Now let's try calling the script with a command line argument:

python chain_map.py -u daniel

When I run this on my machine, it returns daniel twice. This is because our command line argument overrides everything else. It doesn't matter what we set the environment to because our ChainMap will look at the command line arguments first before anything else. If you try it without the `-u daniel` it will run the actual arguments, in my case `"admin" "test"`.

Now that you know how to use ChainMaps, we can move on to Counter!

Counter

The collections module also provides us with a small tool that allows us to perform convenient and fast counting. This tool is called `Counter`. You can run it with most iterables. Let's try it with a string

from collections import Counter

Counter('superfluous')
>> 
Counter({'s': 2, 'u': 3, 'p': 1, 'e': 1, 'r': 1, 'f': 1, 'l': 1, 'o': 1})counter = Counter('superfluous')
counter['u']
>> 3

In this example, we import `Counter` from `collections` and pass it a string. This returns a Counter object which is a subclass of the Python dictionary. We then run the same command but assign it to the counter variable so that we can access the dictionary more easily. In this case, we have seen that the letter `"u"` appears three times in the example string.

The counter provides some methods that you may be interested in. For example, you can call elements which will get an iterator over the elements that are in the dictionary, but in an arbitrary order. This function can be considered as an "encoder", since the output in this case is an encoded version of the string.

list(counter.elements())
>> ['s', 's', 'u', 'u', 'u', 'p', 'e', 'r', 'f', 'l', 'o']
Another useful method is most_common. You can ask the Counter what are the most common elements by passing a number that represents what are the most recurring `"n"` elements:

counter.most_common(2)
[('u', 3), ('s', 2)]

Here we just asked our Counter which were the two most recurring items. As you can see, it produced a list of tuples that tells us that `"u"` occurred 3 times and `"s"` occurred twice.

The other method I want to cover is the subtract method. The `subtract` method accepts an iterable or a mapping and uses that argument to subtract. It's a little easier to explain if you see some code:

counter_one = Counter('superfluous')
counter_one
>> Counter({'s': 2, 'u': 3, 'p': 1, 'e': 1, 'r': 1, 'f': 1, 'l': 1, 'o': 1})

counter_two = Counter('super')
counter_one.subtract(counter_two)
counter_one
>> Counter({'s': 1, 'u': 2, 'p': 0, 'e': 0, 'r': 0, 'f': 1, 'l': 1, 'o': 1})
So here we recreate our first counter and print it out so we know what's in it. Thus we create our second Counter object. Finally we subtract the second counter from the first. If you look closely at the output at the end, you will notice that the number of letters in five of the elements has been decreased by one.

As I mentioned at the beginning of this section, you can use Counter against any iterable or mapping, so you don't have to use only strings. You can also pass tuples, dictionaries and lists to it.

Try it on your own to see how it works with those other data types. Now we are ready to move on to `defaultdict`!

`defaultdict`

The collections module has a handy tool called `defaultdict`. The `defaultdict` is a subclass of the Python dict that accepts a `default_factory` as its main argument. The `default_factory` is usually a Python data type, such as int or a list, but you can also use a function or a lambda. Let's start by creating a regular Python dictionary that counts the number of times each word is used in a sentence:

sentence = "The red for jumped over the fence and ran to the zoo for food"
words = sentence.split(' ')
words
>> ['The',
 'red',
 'for',
 'jumped',
 'over',
 'the',
 'fence',
 'and',
 'ran',
 'to',
 'the',
 'zoo',
 'for',
 'food']

reg_dict = {}
for word in words:
    if word in reg_dict:
        reg_dict[word] += 1
    else:
        reg_dict[word] = 1

print(reg_dict)
>> {'The': 1, 'red': 1, 'for': 2, 'jumped': 1, 'over': 1, 'the': 2, 'fence': 1, 'and': 1, 'ran': 1, 'to': 1, 'zoo': 1, 'food': 1}

Now let's try to do the same with defaultdict!

from collections import defaultdict
sentence = "The red for jumped over the fence and ran to the zoo for food"
words = sentence.split(' ')

d = defaultdict(int)
for word in words:
    d[word] += 1

print(d)
>> defaultdict(<class 'int'>, {'The': 1, 'red': 1, 'for': 2, 'jumped': 1, 'over': 1, 'the': 2, 'fence': 1, 'and': 1, 'ran': 1, 'to': 1, 'zoo': 1, 'food': 1})

You will notice right away that the code is much simpler. The defaultdict will automatically assign zero as a value to any key it doesn't already have in it. We add one to make it make more sense and it will also increment if the word appears multiple times in the sentence.

Now let's try using a Python list type as our `default_factory`. We will start with a regular dictionary, as before.

my_list = [(1234, 100.23), (345, 10.45), (1234, 75.00), (345, 222.66), (678, 300.25), (1234, 35.67)]

reg_dict = {}
for acct_num, value in my_list:
    if acct_num in reg_dict:
        reg_dict[acct_num].append(value)
    else:
        reg_dict[acct_num] = [value]

If you run this code, you should get output similar to the following:

print(reg_dict)
>> {1234: [100.23, 75.0, 35.67], 345: [10.45, 222.66], 678: [300.25]}
Now let's reimplement this code using defaultdict:

from collections import defaultdict

my_list = [(1234, 100.23), (345, 10.45), (1234, 75.00), (345, 222.66), (678, 300.25), (1234, 35.67)]

d = defaultdict(list)
for acct_num, value in my_list:
    d[acct_num].append(value)

Again, this eliminates the if/else conditional logic and makes the code easier to follow. Here is the output of the above code:

print(d)
>> defaultdict(<class 'list'>, {1234: [100.23, 75.0, 35.67], 345: [10.45, 222.66], 678: [300.25]})

This is a very good thing! Let's try using a `lambda` also as our `default_factory`!

from collections import defaultdict
animal = defaultdict(lambda: "Monkey")
animal
>> defaultdict(<function __main__.<lambda>()>, {})
animal['Sam'] = 'Tiger'
print (animal['Nick'])
>> Monkey
animal
>> defaultdict(<function __main__.<lambda>()>, {'Sam': 'Tiger', 'Nick': 'Monkey'})

Here we create a `defaultdict` that will assign `Monkey` as default value to any key. The first key is set to `Tiger`, and the next key is not set. If you print the second key, you will see that it has 'Monkey' assigned to it. 

In case you haven't noticed yet, it is basically impossible to cause a KeyError as long as you set the `default_factory` to something that makes sense. The documentation mentions that if you set the `default_factory` to `None`, then you will receive a KeyError.

Let's see how that works:

from collections import defaultdict
x = defaultdict(None)
x['Mike']

---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
<ipython-input-30-d21c3702d01d> in <module>
      1 from collections import defaultdict
      2 x = defaultdict(None)
----> 3 x['Mike']

KeyError: 'Mike'

In this case, we just created a `defaultdict` with an error. It can no longer assign a default value to our key, so it throws a `KeyError` instead. Of course, since it is a subclass of `dict`, we can simply set the key to some value and it will work. But that defeats the purpose of `defaultdict`.

`deque`

According to the Python documentation, deques "is a generalization of stacks and queues (stacks and queues)". It is pronounced `deck`, which is short for `double-ended queue`. They are a replacement container for the Python list. Deques are thread-safe and allow you to efficiently add and remove data from memory from either side of the deque. 

A list is optimized for fast fixed-length operations. Full details can be found in the Python documentation. A deque accepts a `maxlen` argument that sets the limits for the deque. Otherwise, the deque will grow to an arbitrary size. When a bounded deque is full, any new elements added will cause the same number of elements to come out the other end.

As a general rule, if you need to add or remove elements quickly, use a deque. If you need quick random access use a list. Let's take a moment to see how a deque can be created and used.

from collections import deque
import string

d = deque(string.ascii_lowercase)

for letter in d:
    print(letter)

>> a
b
c
d
e
f
g
h
i
j
k
l
m
n
o
p
q
r
s
t
u
v
w
x
y
z
Here we import the deque from our collections module and we also import the strings module. To create an instance of a deque, we need to pass it an iterable. In this case, we pass `string.ascii_lowerercase`, which returns a list of all the lowercase letters of the alphabet. Finally, we loop over our deque and print each element. Now let's look at some of the methods that deque has.

d.append('bye')
d
>> deque(['a',
       'b',
       'c',
       'd',
       'e',
       'f',
       'g',
       'h',
       'i',
       'j',
       'k',
       'l',
       'm',
       'n',
       'o',
       'p',
       'q',
       'r',
       's',
       't',
       'u',
       'v',
       'w',
       'x',
       'y',
       'z',
       'bye'])

d.appendleft('hello')
d
>> deque(['hello',
       'a',
       'b',
       'c',
       'd',
       'e',
       'f',
       'g',
       'h',
       'i',
       'j',
       'k',
       'l',
       'm',
       'n',
       'o',
       'p',
       'q',
       'r',
       's',
       't',
       'u',
       'v',
       'w',
       'x',
       'y',
       'z',
       'bye'])

d.rotate(1)
d
>> deque(['bye',
       'hello',
       'a',
       'b',
       'c',
       'd',
       'e',
       'f',
       'g',
       'h',
       'i',
       'j',
       'k',
       'l',
       'm',
       'n',
       'o',
       'p',
       'q',
       'r',
       's',
       't',
       'u',
       'v',
       'w',
       'x',
       'y',
       'z'])

Let's break this down a bit. First we add a chain to the right end of the deque. Then we add another string to the left side of the deque. Finally, we call `rotate` on our deque and pass it a one, which causes it to rotate once to the right. 

In other words, it makes an element rotate from the far right and in front. You can pass it a negative number to make the deque rotate to the left instead. 

Let's end this section with an example based on some of the Python documentation

from collections import deque
def get_last(filename, n=5):
    """
    Returns the last n lines from the file
    """
    try:
        with open(filename) as f:
            return deque(f, n)
    except OSError:
        print("Error opening file: {}".format(filename))
        raise

This code works much like the Linux tail program does. Here we pass a `filename` to our script along with the `n` number of lines we want it to return. 

The deque is limited to whatever number we pass as `n`. This means that once the deque is full, when new lines are read and added to the deque, the oldest lines are pulled from the other end and discarded. 

I have also wrapped the file opening with a simple exception handler because it is very easy to pass a malformed path. This will catch files that do not exist.

Conclusion

We've covered a lot of ground in this post. You learned how to use a defaultdict and a Count. We also learned about a subclass of the Python list, the deque. Finally, we saw how to use them to perform various activities. I hope you found each of these collections interesting. They may be of great use to you in your day-to-day work.
Join our private community in Discord

Keep up to date by participating in our global community of data scientists and AI enthusiasts. We discuss the latest developments in data science competitions, new techniques for solving complex challenges, AI and machine learning models, and much more!