Fastest way to generate a random-like unique string with random length in Python 3

I know how to create a random string, like:

''.join(secrets.choice(string.ascii_uppercase + string.digits) for _ in range(N))

However, there should be no duplicates, so I am currently just checking if the key already exists in a list, as shown in the following code:

import secrets
import string
import numpy as np


amount_of_keys = 40000

keys = []

for i in range(0,amount_of_keys):
    N = np.random.randint(12,20)
    n_key = ''.join(secrets.choice(string.ascii_uppercase + string.digits) for _ in range(N))
    if n_key not in keys:
        keys.append(n_key)

This is okay for a small number of keys like 40000; however, the approach does not scale well as the number of keys grows. So I am wondering if there is a faster way to get to the result for even more keys, like 999999.

Solution:

Basic improvements, sets and local names

Use a set, not a list; testing for uniqueness is then much faster, because set membership testing takes constant time independent of the set size, while lists take O(N) linear time. Use a set comprehension to produce a batch of keys at a time, to avoid having to look up and call the set.add() method in a loop; properly random, larger keys have a very small chance of producing duplicates anyway.
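A quick micro-benchmark (sizes and key values here are ours, separate from the timings below) illustrates the gap between the two membership tests:

```python
# Compare membership testing in a list vs. a set holding the same 10,000 keys.
from timeit import timeit

items = [str(i) for i in range(10000)]
as_list = items
as_set = set(items)

t_list = timeit(lambda: '9999' in as_list, number=1000)  # linear scan
t_set = timeit(lambda: '9999' in as_set, number=1000)    # hash lookup
print(t_set < t_list)  # the set lookup wins comfortably
```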

Because this is done in a tight loop, it is worth your while optimising away all name lookups as much as possible:

import secrets
import string
import numpy as np
from functools import partial

def produce_amount_keys(amount_of_keys, _randint=np.random.randint):
    keys = set()
    pickchar = partial(secrets.choice, string.ascii_uppercase + string.digits)
    while len(keys) < amount_of_keys:
        keys |= {''.join([pickchar() for _ in range(_randint(12, 20))]) for _ in range(amount_of_keys - len(keys))}
    return keys

The _randint keyword argument binds the np.random.randint name to a local in the function, which is faster to reference than a global, especially when attribute lookups are involved.

The pickchar() partial avoids looking up attributes on modules or more locals; it is a single callable that has all the references in place, so is faster to execute, especially when called in a loop.

The while loop keeps iterating only if there were duplicates produced. We produce enough keys in a single set comprehension to fill the remainder if there are no duplicates.

Timings for that first improvement

For 100 items, the difference is not that big:

>>> timeit('p(100)', 'from __main__ import produce_amount_keys_list as p', number=1000)
8.720592894009314
>>> timeit('p(100)', 'from __main__ import produce_amount_keys_set as p', number=1000)
7.680242831003852

but when you start scaling this up, you’ll notice that the O(N) membership test cost against a list really drags your version down:

>>> timeit('p(10000)', 'from __main__ import produce_amount_keys_list as p', number=10)
15.46253142200294
>>> timeit('p(10000)', 'from __main__ import produce_amount_keys_set as p', number=10)
8.047800761007238

My version is already almost twice as fast at 10k items; 40k items can be run 10 times in about 32 seconds:

>>> timeit('p(40000)', 'from __main__ import produce_amount_keys_list as p', number=10)
138.84072386901244
>>> timeit('p(40000)', 'from __main__ import produce_amount_keys_set as p', number=10)
32.40720253501786

The list version took over 2 minutes, more than ten times as long.

Numpy’s random.choice function, not cryptographically strong

You can make this faster still by forgoing the secrets module and using np.random.choice() instead; this won’t produce cryptographic-level randomness, however, but picking a random character this way is twice as fast:

def produce_amount_keys(amount_of_keys, _randint=np.random.randint):
    keys = set()
    pickchar = partial(
        np.random.choice,
        np.array(list(string.ascii_uppercase + string.digits)))
    while len(keys) < amount_of_keys:
        keys |= {''.join([pickchar() for _ in range(_randint(12, 20))]) for _ in range(amount_of_keys - len(keys))}
    return keys

This makes a huge difference, now 10 times 40k keys can be produced in just 16 seconds:

>>> timeit('p(40000)', 'from __main__ import produce_amount_keys_npchoice as p', number=10)
15.632006907981122

Further tweaks with the itertools module and a generator

We can also take the unique_everseen() function from the Recipes section of the itertools documentation to have it take care of the uniqueness, then use an infinite generator and the itertools.islice() function to limit the results to just the number we want:

# additional imports
from itertools import islice, repeat

# assumption: unique_everseen defined or imported

def produce_amount_keys(amount_of_keys):
    pickchar = partial(
        np.random.choice,
        np.array(list(string.ascii_uppercase + string.digits)))
    def gen_keys(_range=range, _randint=np.random.randint):
        while True:
            yield ''.join([pickchar() for _ in _range(_randint(12, 20))])
    return list(islice(unique_everseen(gen_keys()), amount_of_keys))

This is slightly faster still, but only marginally so:

>>> timeit('p(40000)', 'from __main__ import produce_amount_keys_itertools as p', number=10)
14.698191125993617

os.urandom() bytes and a different method of producing strings

Next, we can follow up on Adam Barnes’s idea of using UUID4 (which is basically just a wrapper around os.urandom()) and Base64. But by case-folding the Base64 output and replacing 2 characters with randomly picked ones, his method severely limits the entropy in those strings; you won’t produce the full range of unique values possible (Base64 packs 15 random bytes into 20 characters, but 20 characters drawn from a 36-character alphabet can represent only 36 ** 20 values; (256 ** 15) / (36 ** 20) is roughly 99437, so about 16.6 bits of entropy are lost even before his replacements throw away more).

The Base64 encoding uses both upper and lower case characters and digits but also adds the - and / characters (or + and _ for the URL-safe variant). For only uppercase letters and digits, you’d have to uppercase the output and map those extra two characters to other random characters, a process that throws away a large amount of entropy from the random data provided by os.urandom(). Instead of using Base64, you could also use the Base32 encoding, which uses uppercase letters and the digits 2 through 8, so produces strings with 32 ** n possibilities versus 36 ** n. However, this can speed things up further from the above attempts:

import os
import base64
import math

def produce_amount_keys(amount_of_keys):
    def gen_keys(_urandom=os.urandom, _encode=base64.b32encode, _randint=np.random.randint, _factor=math.log(256, 32)):
        while True:
            count = _randint(12, 20)
            # (count + 1) / _factor gives us the number of bytes needed
            # to produce *at least* count encoded characters
            yield _encode(_urandom(int((count + 1) / _factor)))[:count].decode('ascii')
    return list(islice(unique_everseen(gen_keys()), amount_of_keys))

This is really fast:

>>> timeit('p(40000)', 'from __main__ import produce_amount_keys_b32 as p', number=10)
4.266283575998386

40k keys, 10 times, in just over 4 seconds; more than 30 times as fast as the original list-based version. The speed of using os.urandom() as a source is undeniable.

This is cryptographically strong again; os.urandom() produces bytes suitable for cryptographic use. On the other hand, we reduced the number of possible strings produced by more than 90% (((36 ** 20) - (32 ** 20)) / (36 ** 20) * 100 is 90.5), as we are no longer using the 0, 1, 8 and 9 digits in the outputs.
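That 90% figure is easy to verify directly:

```python
# Fraction of 20-character uppercase+digit strings that a 32-character
# (Base32) alphabet can never produce:
lost = ((36 ** 20) - (32 ** 20)) / (36 ** 20) * 100
print(round(lost, 1))  # 90.5
```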

So we should use the urandom() trick to produce a proper Base36 encoding; we’ll have to produce our own b36encode() function:

import string
import math

def b36encode(b, _range=range, _c=(string.ascii_uppercase + string.digits).encode()):
    """Encode a bytes value to Base36 (uppercase ASCII and digits)

    This isn't too friendly on memory because we convert the whole bytes
    object to an int, but for smaller inputs this should be fine.
    """
    b_int = int.from_bytes(b, 'big')
    length = len(b) and int(math.log(256 ** len(b), 36))
    count = b_int and int(math.log(b_int, 36))
    return bytes(_c[(b_int // 36 ** i) % 36] for i in _range(count, -1, -1)).rjust(length, b'A')

and use that:

def produce_amount_keys(amount_of_keys):
    def gen_keys(_urandom=os.urandom, _encode=b36encode, _randint=np.random.randint, _factor=math.log(256, 36)):
        while True:
            count = _randint(12, 20)
            # (count + 1) / _factor gives us the number of bytes needed
            # to produce *at least* count encoded characters
            yield _encode(_urandom(int((count + 1) / _factor)))[:count].decode('ascii')
    return list(islice(unique_everseen(gen_keys()), amount_of_keys))

This is still fast enough, and above all produces the full range of 36 uppercase letters and digits:

>>> timeit('p(40000)', 'from __main__ import produce_amount_keys_b36 as p', number=10)
7.3695860790030565

Granted, the base32 version is almost twice as fast as this one (thanks to an efficient Python implementation using a table) but using a custom Base36 encoder is still twice the speed of the non-cryptographically secure numpy.random.choice() version! In my book, that’s a good result.

Note, the Base64 version is faster, but I personally don’t trust that it throws away so much randomness from the os.urandom() source.

Encode array of integers into unique int

I have a fixed amount of int arrays of the form:

[a,b,c,d,e]

for example:

[2,2,1,1,2]

where a and b can be ints from 0 to 2, c and d can be 0 or 1, and e can be ints from 0 to 2.

Therefore there are 3 * 3 * 2 * 2 * 3 = 108 possible arrays of this form.

I would like to assign to each of those arrays a unique integer code from 0 to 107.

I am stuck; I thought of adding up the numbers in the array, but two arrays such as:

[0,0,0,0,1] and [1,0,0,0,0]

would both add to 1.

Any suggestion?

Thank you.

Solution:

You could use np.ravel_multi_index:

>>> np.ravel_multi_index([1, 2, 0, 1, 2], (3, 3, 2, 2, 3))
65

Validation:

>>> {np.ravel_multi_index(j, (3, 3, 2, 2, 3)) for j in itertools.product(*map(range, (3,3,2,2,3)))} == set(range(np.prod((3, 3, 2, 2, 3))))
True

Going back the other way:

>>> np.unravel_index(65, (3, 3, 2, 2, 3))
(1, 2, 0, 1, 2)

SED or AWK script to replace multiple text

I am trying to do the following with a sed script, but it’s taking too much time; it looks like I’m doing something wrong.

Scenario:
I’ve student records (> 1 million) in students.txt.
In this file, on each line, the first 10 characters are the student ID, the next 10 characters are the contact number, and so on.

students.txt

10000000019234567890XXX…
10000000029325788532YYY…
.
.
.
10010000008766443367ZZZZ…

I have another file (encrypted_contact_numbers.txt) which has all the phone numbers and the corresponding encrypted phone numbers, as below:

encrypted_contact_numbers.txt

Phone_Number, Encrypted_Phone_Number

9234567890, 1122334455
9325788532, 4466742178
.
.
.
8766443367, 2964267747

I wanted to replace all the contact numbers (11th–20th position) in students.txt with the corresponding encrypted phone number from encrypted_contact_numbers.txt.

Expected Output:

10000000011122334455XXX…
10000000024466742178YYY…
.
.
.
10010000002964267747ZZZZ…

I am using the below sed script to do this operation. It is working fine but too slowly.

Approach 1:

while read -r pattern replacement; do   
    sed -i "s/$pattern/$replacement/" students.txt
done < encrypted_contact_numbers.txt

Approach 2:

sed 's| *\([^ ]*\) *\([^ ]*\).*|s/\1/\2/g|' <encrypted_contact_numbers.txt |
sed -f- students.txt > outfile.txt

Is there any way to process this huge file quickly?

Solution:

The following awk may help you with the same:

awk '
FNR==NR{
  sub(/ +$/,"");
  a[$1]=$2;
  next
}
(substr($0,11,10) in a){
  print substr($0,1,10) a[substr($0,11,10)] substr($0,21)
}
' FS=", " encrypted_contact_number.txt students.txt

Output will be as follows. The FNR==NR block runs only while reading the first file (encrypted_contact_number.txt): it trims trailing spaces and stores each encrypted number in the array a, keyed by the plain phone number. The second block then, for each line of students.txt whose characters 11 to 20 have an entry in a, prints the line with that span replaced; note that lines whose contact number has no mapping are not printed.

10000000011122334455XXX...
10000000024466742178YYY...
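Since the rest of this collection is Python, here is a sketch of the same approach in Python: build the lookup table once, then splice the replacement into each record. The function name is ours, and unlike the awk above, lines whose contact number has no mapping are passed through unchanged rather than dropped.

```python
def encrypt_contacts(students_path, mapping_path, out_path):
    """Replace characters 11-20 of each student record via a lookup table."""
    mapping = {}
    with open(mapping_path) as enc:
        next(enc)  # skip the "Phone_Number, Encrypted_Phone_Number" header
        for line in enc:
            if line.strip():
                phone, encrypted = (field.strip() for field in line.split(','))
                mapping[phone] = encrypted
    with open(students_path) as src, open(out_path, 'w') as out:
        for line in src:
            phone = line[10:20]
            if phone in mapping:
                line = line[:10] + mapping[phone] + line[20:]
            out.write(line)
```

A single pass over each file keeps this linear in the input size, instead of one sed pass per phone number.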

Elegant way of finding unique string pairs from dictionary values in a list of dictionaries

I have a list of dictionaries:

    some_list = [
                  {
                     'item1': 'a',
                     'item2': ['1', '2']
                  },
                  {
                     'item1': 'b',
                     'item2': ['1', '2']
                  },
                  {
                     'item1': 'a',
                     'item2': ['1']
                  }
                ]

I would like to get:

['a 1', 'a 2', 'b 1', 'b 2'], where each value from item1 is paired with each value from item2 for each dictionary, and then only the unique strings are kept.

I can think of an obvious way to do it, namely:

  1. iterate through some_list;
  2. for each dictionary, get each['item1'] and each['item2'];
  3. for each member of each['item2'], build each['item1'] + ' ' + member;
  4. now, make the list into a set and I have my unique values.

I am wondering if there is a more elegant way to do it using a comprehension.

Solution:

If order doesn’t matter, then you can easily translate your logic into a set comprehension (and convert to a list if you’d like):

In [1]: some_list = [
   ...:                   {
   ...:                      'item1':'a',
   ...:                      'item2':['1', '2']
   ...:                   },
   ...:                   {
   ...:                      'item1':'b',
   ...:                      'item2':['1', '2']
   ...:                   },
   ...:                   {
   ...:                      'item1':'a',
   ...:                      'item2':['1']
   ...:                   }
   ...:                 ]

In [2]: {f"{d['item1']} {v}" for d in some_list for v in d['item2']}
Out[2]: {'a 1', 'a 2', 'b 1', 'b 2'}

What is the difference between using loc and using just square brackets to filter for columns in Pandas/Python?

I’ve noticed three methods of selecting a column in a Pandas DataFrame:

First method of selecting a column using loc:

df_new = df.loc[:, 'col1']

Second method – seems simpler and faster:

df_new = df['col1']

Third method – most convenient:

df_new = df.col1

Is there a difference between these three methods? I don’t think so, in which case I’d rather use the third method.

I’m mostly curious as to why there appear to be three methods for doing the same thing.

Solution:

If you are selecting a single column, a list of columns, or a slice of rows, then there is no difference. However, [] does not allow you to select a single row, a list of rows, or a slice of columns. More importantly, if your selection involves both rows and columns, then assignment becomes problematic.

df[1:3]['A'] = 5

This selects rows 1 and 2, then selects column 'A' of the returned object and assigns the value 5 to it. The problem is that the returned object might be a copy, so this may not change the actual DataFrame; pandas raises a SettingWithCopyWarning for it. The correct way to do this assignment is:

df.loc[1:3, 'A'] = 5

With .loc, you are guaranteed to modify the original DataFrame. It also allows you to slice columns (df.loc[:, 'C':'F']), select a single row (df.loc[5]), and select a list of rows (df.loc[[1, 2, 5]]).

Also note that these two were not included in the API at the same time. .loc was added much later as a more powerful and explicit indexer. See unutbu’s answer for more detail.


Note: Getting columns with [] vs . is a completely different topic. . is only there for convenience. It only allows accessing columns whose names are valid Python identifiers (i.e. they cannot contain spaces or start with a digit…). It cannot be used when the name conflicts with Series/DataFrame methods. It also cannot be used for non-existing columns (i.e. the assignment df.a = 1 won’t create a column a if none exists; it merely sets an attribute on the DataFrame). Other than that, . and [] are the same.
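A small illustrative sketch of the assignment difference (the data is invented for the example):

```python
import pandas as pd

df = pd.DataFrame({'A': [0, 0, 0, 0], 'B': [1, 1, 1, 1]})

# Chained indexing may assign into a temporary copy and trigger
# SettingWithCopyWarning:
#   df[1:3]['A'] = 5

# A single .loc call modifies df itself; note that .loc slices by label
# and is inclusive, so 1:2 covers rows 1 and 2 here:
df.loc[1:2, 'A'] = 5
print(df['A'].tolist())  # [0, 5, 5, 0]
```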

Unicode subscripts and superscripts in identifiers, why does Python consider XU == Xᵘ == Xᵤ?

Python allows unicode identifiers. I defined Xᵘ = 42, expecting XU and Xᵤ to result in a NameError. But in reality, when I define Xᵘ, Python (silently?) turns Xᵘ into Xu, which strikes me as somewhat of an unpythonic thing to do. Why is this happening?

>>> Xᵘ = 42
>>> print((Xu, Xᵘ, Xᵤ))
(42, 42, 42)

Solution:

Python converts all identifiers to their NFKC normal form; from the Identifiers section of the reference documentation:

All identifiers are converted into the normal form NFKC while parsing; comparison of identifiers is based on NFKC.

The NFKC form of both the super and subscript characters is the lowercase u:

>>> import unicodedata
>>> unicodedata.normalize('NFKC', 'Xᵘ Xᵤ')
'Xu Xu'

So in the end, all you have is a single identifier, Xu:

>>> import dis
>>> dis.dis(compile('Xᵘ = 42\nprint((Xu, Xᵘ, Xᵤ))', '', 'exec'))
  1           0 LOAD_CONST               0 (42)
              2 STORE_NAME               0 (Xu)

  2           4 LOAD_NAME                1 (print)
              6 LOAD_NAME                0 (Xu)
              8 LOAD_NAME                0 (Xu)
             10 LOAD_NAME                0 (Xu)
             12 BUILD_TUPLE              3
             14 CALL_FUNCTION            1
             16 POP_TOP
             18 LOAD_CONST               1 (None)
             20 RETURN_VALUE

The above disassembly of the compiled bytecode shows that the identifiers were normalised during compilation; this happens while parsing, when the identifiers are normalised as the AST (Abstract Syntax Tree) is created, which the compiler then uses to produce bytecode.

Identifiers are normalised to avoid many potential ‘look-alike’ bugs, where you could otherwise end up using both ﬁnd() (using the U+FB01 LATIN SMALL LIGATURE FI character followed by the ASCII nd characters) and find() and wonder why your code has a bug.
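The ligature example can be checked directly with the same normalisation call:

```python
import unicodedata

# U+FB01 (LATIN SMALL LIGATURE FI) folds to the two ASCII letters "fi"
# under NFKC, so both spellings name the same identifier:
print(unicodedata.normalize('NFKC', '\ufb01nd'))  # find
```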

Conditional operations on a list of dictionaries in Python

I’d like to sum the ages of all people in this list of dictionaries, based on a condition on another key (e.g. Gender == 'male'):

list_of_dicts= [  
              {"Name": "Ethan", "Gender": "male", "Age": 11},
              {"Name": "Nathan", "Gender": "male", "Age": 6},
              {"Name": "Sophie", "Gender": "female", "Age": 14},
              {"Name": "Patrick", "Gender": "male", "Age": 11}
]

The following code accomplishes it, but I wonder if there is a more Pythonic/compact way to do it? Maybe something like an SQL query for list of dictionaries?

total_male_age = 0

for dict in list_of_dicts: 
    if dict.get("Gender") == "male":
        total_male_age = total_male_age + dict.get("Age")  


#total_male_age  = 28

Solution:

How’s this?

>>> sum(d['Age'] for d in list_of_dicts if d['Gender'] == 'male')
28

Here you are technically calling sum on a generator expression: one that filters down to dictionaries where Gender equals "male".

PEP 289 provides an example for some further reading.

To find a product rather than a sum:

import numpy as np
np.prod([d['Age'] for d in list_of_dicts if d['Gender'] == 'male'])

And if you want to find a product while staying within Python’s standard library:

from functools import reduce
from operator import mul
reduce(mul, (d['Age'] for d in list_of_dicts if d['Gender'] == 'male'), 1)
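If you are on Python 3.8 or newer, math.prod covers the product case with the standard library alone:

```python
# math.prod (Python 3.8+) consumes the same generator expression directly.
from math import prod

list_of_dicts = [
    {"Name": "Ethan", "Gender": "male", "Age": 11},
    {"Name": "Nathan", "Gender": "male", "Age": 6},
    {"Name": "Sophie", "Gender": "female", "Age": 14},
    {"Name": "Patrick", "Gender": "male", "Age": 11},
]

total = prod(d['Age'] for d in list_of_dicts if d['Gender'] == 'male')
print(total)  # 11 * 6 * 11 == 726
```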

How do I open a binary matrix and convert it into a 2D array or a dataframe?

I have a binary matrix in a txt file that looks as follows:

0011011000
1011011000
0011011000
0011011010
1011011000
1011011000
0011011000
1011011000
0100100101
1011011000

I want to make this into a 2D array or a dataframe where there is one number per column and the rows are as shown. I’ve tried using numpy and pandas, but the output has only one column that contains the whole number. I want to be able to call an entire column as a number.

One of the codes I’ve tried is:

with open("a1data1.txt") as myfile:
    dat1=myfile.read().split('\n')
dat1=pd.DataFrame(dat1)

Solution:

After you read your txt file into a DataFrame (dat1 above), the following will fix it:

pd.DataFrame(dat1[0].apply(list).values.tolist())
Out[846]: 
   0  1  2  3  4  5  6  7  8  9
0  0  0  1  1  0  1  1  0  0  0
1  1  0  1  1  0  1  1  0  0  0
2  0  0  1  1  0  1  1  0  0  0
3  0  0  1  1  0  1  1  0  1  0
4  1  0  1  1  0  1  1  0  0  0
5  1  0  1  1  0  1  1  0  0  0
6  0  0  1  1  0  1  1  0  0  0
7  1  0  1  1  0  1  1  0  0  0
8  0  1  0  0  1  0  0  1  0  1
9  1  0  1  1  0  1  1  0  0  0
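An alternative worth noting: numpy's genfromtxt accepts an integer delimiter for fixed-width fields, so delimiter=1 splits every line into one integer column per character. A sketch, with an in-memory sample standing in for the file:

```python
import io
import numpy as np

# Stand-in for open("a1data1.txt"); pass the filename in real use.
sample = "0011011000\n1011011000\n0100100101\n"

# delimiter=1 means fixed-width fields of one character each.
arr = np.genfromtxt(io.StringIO(sample), delimiter=1, dtype=int)
print(arr.shape)   # (3, 10)
print(arr[:, 2])   # an entire column, already as integers
```

Because dtype=int is applied during parsing, the columns come back numeric without a separate astype step.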

String to Bytes Python

I have this issue and I can’t figure out how to solve it. I have this string:

data = '\xc4\xb7\x86\x17\xcd'

When I tried to encode it:

data.encode()

I get this result:

b'\xc3\x84\xc2\xb7\xc2\x86\x17\xc3\x8d'

I only want:

b'\xc4\xb7\x86\x17\xcd'

Does anyone know the reason and how to fix this? The string is already stored in a variable, so I can’t add the literal b in front of it.

Solution:

You can use 'raw_unicode_escape' as your encoding:

In [14]: bytes(data, 'raw_unicode_escape')
Out[14]: b'\xc4\xb7\x86\x17\xcd'

As mentioned in the comments, you can also pass the encoding directly to the encode method of your string.

In [15]: data.encode("raw_unicode_escape")
Out[15]: b'\xc4\xb7\x86\x17\xcd'
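As an aside, 'latin-1' gives the same result for this particular string, since it maps code points 0 through 255 straight to single bytes:

```python
# latin-1 round-trips any string whose characters are all <= U+00FF.
data = '\xc4\xb7\x86\x17\xcd'
encoded = data.encode('latin-1')
print(encoded)  # b'\xc4\xb7\x86\x17\xcd'
```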

iterating through the list elements inside the dictionary

I have a dictionary dict = {1:['cup','pen'],2:['one','two'],3:['five']} in which I have to iterate through the first elements of all the keys at first, and then the second elements.
I have written it the way below, but that iterates through list index 0 and then index 1 for each key in turn.
Can anyone help me modify the code so that it iterates over all the lists' first elements at first, and then the second elements?

dict = {1:['cup','pen'],2:['one','two'],3:['five']}
for v in dict.values():
    for i in range(len(v)):
        print v[i]

Output of above code:

cup
pen
one
two
five

but want it in this below way :

cup
one
five
pen
two

Solution:

Using the zipped-transpose idiom, zip up to the longest list with itertools.zip_longest. Then iterate over each item in the sub-lists, filtering out None values.

from itertools import zip_longest

for i in zip_longest(*data.values()):
    for j in filter(None, i):
        print(j)

cup
one
five
pen
two

Here, data is your input dictionary. Don’t use dict to name variables, it shadows the builtin class with the same name.

Note that this order is not guaranteed on python versions below python-3.6, since dictionaries are not ordered in these older versions of python.

Final note, for python-2.x, the appropriate function to be used is itertools.izip_longest.


The output of zip_longest looks like this:

print(list(zip_longest(*data.values())))
[('cup', 'one', 'five'), ('pen', 'two', None)]

These None values are inserted in place of missing values due to the ‘zip longest’ behaviour of the function. These Nones are then removed with filter.


You can get creative using itertools.chain and a print call, accomplishing this in one line:

from itertools import chain
print(*filter(None, chain.from_iterable(zip_longest(*data.values()))), sep='\n')

cup
one
five
pen
two

Re-written a little more clearly:

print(
   *filter(                                 # `filter` out None (iterable unpacking)       
        None, 
        chain.from_iterable(                # flatten an iterable of iterables
               zip_longest(*data.values())  # zip-longest transpose
       )
    ), 
    sep='\n'                                # separate unpacked arguments with newline
)

One more option, for a homework submission without zip_longest: this just cycles over the keys, popping one element at a time from each list.

while any(bool(v) for v in data.values()):
    for k in data:
        try:
            print(data[k].pop(0))
        except IndexError:
            pass

cup
one
five
pen
two

This empties data of its contents, so you may want to make a copy of your data beforehand.
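A side note on filter(None, ...) in the solutions above: it would also drop legitimate falsy items such as '' or 0. A private sentinel avoids that, and this variant leaves data untouched as well:

```python
from itertools import chain, zip_longest

data = {1: ['cup', 'pen'], 2: ['one', 'two'], 3: ['five']}

# A sentinel object can never collide with real data, unlike None.
_missing = object()
result = [item
          for item in chain.from_iterable(
              zip_longest(*data.values(), fillvalue=_missing))
          if item is not _missing]
print(result)  # ['cup', 'one', 'five', 'pen', 'two']
```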