Python, like languages, contains several built-in collections. Let’s review the various types of collection and how can you also create your own collections.
Iterable collections
The first type of collections are iterable types, which means they can be used in a “for elt in collection” loop. Pretty much all the collections in Python are iterable. You can create your own iterable types by implementing the following:
- The type must implement an __iter__() method that returns an iterator
- That iterator must implement an __iter__() method that returns itself
- That iterator must also implement a __next__() method that returns the next value
Here is a simple example of a custom iterable type:
>>> class MyIter(object): ... def __init__(self, max): ... self.max = max ... self.number = 0 ... def __iter__(self): ... return self ... def __next__(self): ... if self.number >= self.max: ... raise StopIteration ... self.number += 1 ... return self.number ... >>> class MyRange(object): ... def __init__(self, max): ... self.max = max ... def __iter__(self): ... return new MyIter(self.max) ... >>> obj = MyRange(4) >>> for nb in obj: ... print(nb) ... 1 2 3 4
An iterator that supports key/value iterations (e.g. “for key, value in my_collection”) works the same way, except __next__() returns a tuple (key, value) instead of just the value.
Supporting slices
The slice notation “collection[start:end:step]” allows to describe parts of a collection (e.g. my_collection[3:10]). It order to implement this notation, a type must define the following methods:
- __getitem__(self, slice) to return a subcollection (e.g. “my_collection[3:4]”)
- __setitem__(self, slice, elt) when one or more elements of the collection are replaced (e.g. “my_collection[3:4] = elt”). This is for mutable collections only.
- __delitem__(self, slice) when one or more elements of the collection are removed (e.g. “del my_collection[3:4]”). This is for mutable collections only.
The slice argument is either an integer if the traditional index notation is used (“my_collection[3]”) or a slice object when the slice notation is used. For example, “obj[1:10]” calls “obj.__getitem__(slice(1, 10, None))” under the hood.
Sequences
Python provide several built-in sequences that implement the following operations:
- The use of in and not in (e.g. “2 in [1, 2, 3]”, “2 not in [1, 2, 3]”), both implemented by defining __contains__(elt)
- Addition (e.g. [1, 2, 3] + [4, 5, 6]), implemented by defining __add__(sequence)
- Multiplication (e.g. [1, 2, 3] * 3), implemented by defining __mul__(number)
- Slices (as seen previously)
- The use of len(s), implemented by defining __len__()
- The use of max(s) and min(s), both implemented by providing an iterator
- s.index(idx) that returns the element at position idx
- s.count(elt) that returns the number of occurrences elt is in the sequence
Mutable sequences, on top of that, implement modifications through the [] or del operators, append(), clear(), copy(), insert(), extend(), insert(), pop(), remove() and reverse()
The most popular sequences in Python are lists, dictionaries, sets, bytes, bytearrays, memoryviews, tuples, ranges, strings or generators. Of these, lists, set and dictionaries are the only mutable sequences.
Lists
Lists are arrays of elements. And by arrays I mean a pre-allocated contiguous structure. This has several implication. The upsides is that the memory footprint impact is limited. Also, random access is very fast.
There are however some downsides. The first is that if the program tries to append an element in an array which is full, Python needs to allocate a new, larger one and copy all the elements from the old array on to the new one. In order to minimize this, lists have two sizes: the one that len() returns, and the actual allocated size. If the two are equal at initialization, the allocation size increases faster than the “official” size. The idea is that if the addition of a new element requires to rebuild the array, let’s rebuild it with more than one extra element, as chances are that other elements are going to be added.
Another consequence is that a list is designed to add elements at the end rather than at the beginning of the list. Consider the following code which is adding new elements at the beginning of the list:
>>> def add1(arr, nb): ... r = range(nb) ... t1 = time.time() ... for _ in r: ... a.insert(0, 1) ... t2 = time.time() ... print(t2-t1) ... >>> add1([], 1000) 0.0019998550415039062 >>> add1([], 10000) 0.06000399589538574 >>> add1([], 100000) 4.750272035598755 >>> add1([], 200000) 18.93508291244507
As we can see, the time to insert elements at the beginning of the list explodes as the number of elements increases. Inserting 20,000 objects takes 4 times as long as inserting 10,000 objects. The reason is because, for each insertion, Python needs to copy all the elements, shifting them of one place. This starts to be costly when each and every insertion requires to copy over 10,000 objects. By comparison, adding a million or even 10 million objects using append() is much faster:
>>> def add2(arr, nb): ... r = range(nb) ... t1 = time.time() ... for _ in r: ... arr.append(1) ... t2 = time.time() ... print(t2-t1) ... >>> add2([], 1000000) 0.16200900077819824 >>> add2([], 10000000) 1.3950800895690918
Tuples
Tuples can be seen as immutable lists. They are also allocated as an array of elements which makes sense considering the main downside of an array is when it mutates.
Interestingly, CPython may sometimes reuse tuples (like the empty tuple) to avoid duplicates, or even “resize” them instead of deaallocating a tuple and allocating a new one.
Ranges
With Python 3, ranges are always dynamically generated. The only thing that a range stores is the start point, the end point and the step. This is a change with Python 2 where ranges were actually pre-computing the numbers (exception for xrange which behaved like Python 3’s range). Which means that allocating a range of 2 elements or 2 million isn’t any faster (going through the range is another story).
Range are immutable in the sense that they cannot be modified after they are created. When a slice is applied to a range (e.g. range(1, 10)[2:3]), a new range gets created. Note that Python does NOT go through an iterator to find the start and end of the new range, but rather computes the new values. It is using a similar algorithm when calling “24 in range(1, 100, 4)”.