In the previous solution to the word count problem, we used the get method of the vanilla dict class to supply a default value, which spared us the conditional checks and error handling that would otherwise clutter the code.
But who says we should only use the vanilla dict class? Python comes with a rich standard library, part of its "batteries included" philosophy. One module in the standard library is collections, which contains quite a few interesting data structures. In this post, we will take a look at two of them: defaultdict and Counter.
defaultdict
defaultdict allows us to create a dictionary which has a default value. If the key exists, the dict returns the corresponding value as usual. However, if the key does not exist, it uses a factory function to create the default value and sets that value into the dictionary for that key.
Here is an example of defaultdict in action:
from collections import defaultdict

def count_words(line):
    words = line.split()
    counts = defaultdict(int)
    for word in words:
        counts[word] = counts[word] + 1
    return counts
In line 5, we create the defaultdict, passing in int as the factory function. If we index a key which doesn't exist, it will call int() (which returns 0) to set the default value.
In line 7, we use counts[word] as normal. If the key exists, its value will be incremented. If it doesn't exist, int() will be called and zero will be set as the value for the key, and then it will be incremented.
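To make this concrete, here is a small sketch that runs the function on a sample sentence. The sentence is made up purely for illustration, and the loop is written directly over line.split() for brevity.

```python
from collections import defaultdict

def count_words(line):
    counts = defaultdict(int)
    for word in line.split():
        # A missing word starts at int() == 0, then gets incremented
        counts[word] = counts[word] + 1
    return counts

# Sample input chosen for illustration only
counts = count_words("the cat saw the dog")
print(counts["the"])   # 2
print(counts["cat"])   # 1
print(dict(counts))    # {'the': 2, 'cat': 1, 'saw': 1, 'dog': 1}
```

Converting to a plain dict at the end is optional; it is done here only to make the printed output easier to read.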
How does defaultdict differ from using the get method with a default value? The get method does not modify the dict in any way. If the key is missing, it just returns the default value but does not change the dict itself. defaultdict, on the other hand, sets the value for the missing key in the dict, so if you check that key later, it will be present with that value.
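The difference described above can be seen in a short sketch. The key "apple" and the variable names are arbitrary choices for the example.

```python
from collections import defaultdict

# Plain dict with get: the default is returned, nothing is stored
plain = {}
plain_value = plain.get("apple", 0)
print(plain_value)        # 0
print("apple" in plain)   # False

# defaultdict: indexing a missing key stores the default in the dict
dd = defaultdict(int)
dd_value = dd["apple"]
print(dd_value)           # 0
print("apple" in dd)      # True
print(len(dd))            # 1
```

One practical consequence is that merely reading a missing key from a defaultdict makes the dict grow, which is worth keeping in mind if you later iterate over its keys.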
Counter
Would you believe that our word count problem can be solved in one line?
Yes!! This is because the standard library already has built-in functionality for counting. The data structure we want is Counter, and it is part of the collections module. It does exactly what we have been doing manually so far: count the number of times an item appears 😄
As a bonus, Counter can also return the data sorted by count, in case you want to know the top 5 most common words or anything like that.
In general, Counter implements a multi-set data structure. Discussing everything that can be done with a multi-set is for another day. For now, here is the one-line solution to the word count problem:
from collections import Counter

def count_words(line):
    return Counter(line.split())
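As a quick illustration of the sorted-by-count feature mentioned earlier, here is a minimal sketch using Counter's most_common method. The sample sentence is invented for the example.

```python
from collections import Counter

# Sample input chosen for illustration only
line = "the cat and the dog and the bird"
counts = Counter(line.split())

# The two most frequent words, highest count first
top_two = counts.most_common(2)
print(top_two)             # [('the', 3), ('and', 2)]

# A missing key reports a count of 0 and, unlike defaultdict,
# is not added to the Counter
print(counts["fish"])      # 0
print("fish" in counts)    # False
```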
Many times we find that the problem we want to solve is already solved within Python's vast standard library. That's the beauty of Python 🎉
In the final article of this series, we will look at implementing the bonus functionality.