Collections

Python, like languages, contains several built-in collections. Let’s review the various types of collection and how can you also create your own collections.

Iterable collections

The first type of collections are iterable types, which means they can be used in a “for elt in collection” loop. Pretty much all the collections in Python are iterable. You can create your own iterable types by implementing the following:

The type must implement an __iter__() method that returns an iterator
That iterator must implement an __iter__() method that returns itself
That iterator must also implement a __next__() method that returns the next value

Here is a simple example of a custom iterable type:

>>> class MyIter(object):
...     def __init__(self, max):
...             self.max = max
...             self.number = 0
...     def __iter__(self):
...             return self
...     def __next__(self):
...             if self.number >= self.max:
...                     raise StopIteration
...             self.number += 1
...             return self.number
...
>>> class MyRange(object):
...     def __init__(self, max):
...             self.max = max
...     def __iter__(self):
...             return new MyIter(self.max)
...
>>> obj = MyRange(4)
>>> for nb in obj:
...     print(nb)
...
1
2
3
4

An iterator that supports key/value iterations (e.g. “for key, value in my_collection”) works the same way, except __next__() returns a tuple (key, value) instead of just the value.

Supporting slices

The slice notation “collection[start:end:step]” allows to describe parts of a collection (e.g. my_collection[3:10]). It order to implement this notation, a type must define the following methods:

__getitem__(self, slice) to return a subcollection (e.g. “my_collection[3:4]”)
__setitem__(self, slice, elt) when one or more elements of the collection are replaced (e.g. “my_collection[3:4] = elt”). This is for mutable collections only.
__delitem__(self, slice) when one or more elements of the collection are removed (e.g. “del my_collection[3:4]”). This is for mutable collections only.

The slice argument is either an integer if the traditional index notation is used (“my_collection[3]”) or a slice object when the slice notation is used. For example, “obj[1:10]” calls “obj.__getitem__(slice(1, 10, None))” under the hood.

Sequences

Python provide several built-in sequences that implement the following operations:

The use of in and not in (e.g. “2 in [1, 2, 3]”, “2 not in [1, 2, 3]”), both implemented by defining __contains__(elt)
Addition (e.g. [1, 2, 3] + [4, 5, 6]), implemented by defining __add__(sequence)
Multiplication (e.g. [1, 2, 3] * 3), implemented by defining __mul__(number)
Slices (as seen previously)
The use of len(s), implemented by defining __len__()
The use of max(s) and min(s), both implemented by providing an iterator
s.index(idx) that returns the element at position idx
s.count(elt) that returns the number of occurrences elt is in the sequence

Mutable sequences, on top of that, implement modifications through the [] or del operators, append(), clear(), copy(), insert(), extend(), insert(), pop(), remove() and reverse()

The most popular sequences in Python are lists, dictionaries, sets, bytes, bytearrays, memoryviews, tuples, ranges, strings or generators. Of these, lists, set and dictionaries are the only mutable sequences.

Lists

Lists are arrays of elements. And by arrays I mean a pre-allocated contiguous structure. This has several implication. The upsides is that the memory footprint impact is limited. Also, random access is very fast.

There are however some downsides. The first is that if the program tries to append an element in an array which is full, Python needs to allocate a new, larger one and copy all the elements from the old array on to the new one. In order to minimize this, lists have two sizes: the one that len() returns, and the actual allocated size. If the two are equal at initialization, the allocation size increases faster than the “official” size. The idea is that if the addition of a new element requires to rebuild the array, let’s rebuild it with more than one extra element, as chances are that other elements are going to be added.

Another consequence is that a list is designed to add elements at the end rather than at the beginning of the list. Consider the following code which is adding new elements at the beginning of the list:

>>> def add1(arr, nb):
...     r = range(nb)
...     t1 = time.time()
...     for _ in r:
...         a.insert(0, 1)
...     t2 = time.time()
...     print(t2-t1)
...
>>> add1([], 1000)
0.0019998550415039062
>>> add1([], 10000)
0.06000399589538574
>>> add1([], 100000)
4.750272035598755
>>> add1([], 200000)
18.93508291244507

As we can see, the time to insert elements at the beginning of the list explodes as the number of elements increases. Inserting 20,000 objects takes 4 times as long as inserting 10,000 objects. The reason is because, for each insertion, Python needs to copy all the elements, shifting them of one place. This starts to be costly when each and every insertion requires to copy over 10,000 objects. By comparison, adding a million or even 10 million objects using append() is much faster:

>>> def add2(arr, nb):
...     r = range(nb)
...     t1 = time.time()
...     for _ in r:
...         arr.append(1)
...     t2 = time.time()
...     print(t2-t1)
...
>>> add2([], 1000000)
0.16200900077819824
>>> add2([], 10000000)
1.3950800895690918

Tuples

Tuples can be seen as immutable lists. They are also allocated as an array of elements which makes sense considering the main downside of an array is when it mutates.

Interestingly, CPython may sometimes reuse tuples (like the empty tuple) to avoid duplicates, or even “resize” them instead of deaallocating a tuple and allocating a new one.

Ranges

With Python 3, ranges are always dynamically generated. The only thing that a range stores is the start point, the end point and the step. This is a change with Python 2 where ranges were actually pre-computing the numbers (exception for xrange which behaved like Python 3’s range). Which means that allocating a range of 2 elements or 2 million isn’t any faster (going through the range is another story).

Range are immutable in the sense that they cannot be modified after they are created. When a slice is applied to a range (e.g. range(1, 10)[2:3]), a new range gets created. Note that Python does NOT go through an iterator to find the start and end of the new range, but rather computes the new values. It is using a similar algorithm when calling “24 in range(1, 100, 4)”.

	lpoulain on The Garbage Collector
	stef1996 on The Garbage Collector
	lpoulain on The Garbage Collector
	stef1996 on The Garbage Collector
	Python Garbage Colle… on The Garbage Collector

Yet Another Python Internals Blog

Collections

Iterable collections

Supporting slices

Sequences

Lists

Tuples

Ranges

Leave a comment Cancel reply

Iterable collections

Supporting slices

Sequences

Lists

Tuples

Ranges

Partager:

Related

Leave a comment Cancel reply