Get common elements of a majority of lists in Python

Given 4 lists, I want to get elements that are common to 3 or more lists.

a = [1, 2, 3, 4]
b = [1, 2, 3, 4, 5]
c = [1, 3, 4, 5, 6]
d = [1, 2, 6, 7]

Hence, the output should be [1, 2, 3, 4].

My current code is as follows.

result1 = set(a) & set(b) & set(c)
result2 = set(b) & set(c) & set(d)
result3 = set(c) & set(d) & set(a)
result4 = set(d) & set(a) & set(b)

final_result = list(result1)+list(result2)+list(result3)+list(result4)
print(set(final_result))

It works fine and gives the desired output. However, I am interested in knowing if there is an easier way of doing this in Python, i.e.: are there any built-in functions for this?

Solution:

Using a Counter, you can do this like:

Code:

a = [1, 2, 3, 4]
b = [1, 2, 3, 4, 5]
c = [1, 3, 4, 5, 6]
d = [1, 2, 6, 7]

from collections import Counter

counts = Counter(sum([list(set(i)) for i in (a, b, c, d)], []))
print(counts)

more_than_three = [i for i, c in counts.items() if c >= 3]
print(more_than_three)

Results:

Counter({1: 4, 2: 3, 3: 3, 4: 3, 5: 2, 6: 2, 7: 1})

[1, 2, 3, 4]
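An equivalent, and arguably more readable, way to build the same Counter is itertools.chain.from_iterable, which avoids the sum(..., []) list-flattening trick:

```python
from collections import Counter
from itertools import chain

a = [1, 2, 3, 4]
b = [1, 2, 3, 4, 5]
c = [1, 3, 4, 5, 6]
d = [1, 2, 6, 7]

# Deduplicate each list first so a value repeated within one list
# is only counted once, then count occurrences across all lists.
counts = Counter(chain.from_iterable(set(l) for l in (a, b, c, d)))
in_majority = [value for value, count in counts.items() if count >= 3]
print(in_majority)  # [1, 2, 3, 4]
```

chain.from_iterable lazily concatenates the sets, so no intermediate flattened list is built.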

How can I compare strings in a tuple that is inside a list?

I have this list of tuples [(amount, name)]:

[(214.05, 'Charlie'), (153.57, 'Ben'),(213.88, 'Charlie')]

I am trying to compare the tuples by name, and when two tuples share the same name, I want to add their amounts together.

The output would go into another list with the same structure [(amount,name)].

I managed to extract the name part with this:

for i in range(0, len(spendList)):
    print(spendList[i][1])

The output:

Charlie
Ben
Charlie

How can I compare the names with each other?

Solution:

One way to do these sorts of operations is to use dict.setdefault() like:

Code:

data = [(214.05, 'Charlie'), (153.57, 'Ben'), (213.88, 'Charlie')]
summed = {}
for amount, name in data:
    summed.setdefault(name, []).append(amount)
summed = [(sum(amounts), name) for name, amounts in summed.items()]
print(summed)

How does this work?

  1. Start by defining a dict object to accumulate the amounts for each name.

    summed = {}
    
  2. Step through every pair of amounts and names:

    for amount, name in data:
    
  3. Using the fact that keys which hash the same end up in the same dict slot, and the dict method setdefault() to make sure the dict has an empty list available for every name we come across, build a list of amounts for each name:

    summed.setdefault(name, []).append(amount)
    

    This creates a dict of lists like:

    {'Charlie': [214.05, 213.88], 'Ben': [153.57]}
    
  4. Finally using a comprehension we can sum() up all of the items with the same name.

    summed = [(sum(amounts), name) for name, amounts in summed.items()]
    

Results:

[(427.93, 'Charlie'), (153.57, 'Ben')]
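If you only need the totals and not the intermediate lists of amounts, a defaultdict(float) accumulates them directly; a minimal sketch (the rounding is just to keep float noise out of the result):

```python
from collections import defaultdict

data = [(214.05, 'Charlie'), (153.57, 'Ben'), (213.88, 'Charlie')]

totals = defaultdict(float)  # missing names start at 0.0
for amount, name in data:
    totals[name] += amount

summed = [(round(total, 2), name) for name, total in totals.items()]
print(summed)  # [(427.93, 'Charlie'), (153.57, 'Ben')]
```

For money amounts, decimal.Decimal would avoid the floating-point rounding issue entirely.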

Python: return False if no element in a list starts with 'p='

having a list like

lst = ['hello', 'stack', 'overflow', 'friends']

how can I do something like:

if there is no element in lst starting with 'p=', return False; otherwise return True

?

I was thinking of something like:

for i in lst:
    if i.startswith('p='):
        return True

but I can't add the return False inside the loop, or the function returns at the first element.

Solution:

This builds a list recording whether each element of lst satisfies your condition, then any() computes the logical or of those results:

any([x.startswith("p=") for x in lst])
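Passing any() a generator expression instead of a list lets it stop scanning at the first match; wrapped in a function, it reads much like the loop in the question:

```python
lst = ['hello', 'stack', 'overflow', 'friends']

def has_p_prefix(items):
    # any() with a generator expression short-circuits: it stops
    # consuming items as soon as one startswith() call returns True.
    return any(s.startswith('p=') for s in items)

print(has_p_prefix(lst))            # False
print(has_p_prefix(lst + ['p=5']))  # True
```

The short-circuiting matters mainly for long lists, where building the full intermediate list would be wasted work.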

How to pass a list by reference?

I’m trying to implement a function ‘add’ that combines the list L1 with L2 into L3:

def add(L1,L2,L3):
    L3 = L1 + L2

L3 = []
add([1],[0],L3)
print L3

The code above produces an empty list as a result instead of [1, 0], which means that L3 wasn't passed by reference.

How to pass L3 by reference?

Solution:

Lists are already passed by reference, in that all Python names are references, and list objects are mutable. Use slice assignment instead of normal assignment.

def add(L1, L2, L3):
    L3[:] = L1 + L2

However, this isn’t a good way to write a function. You should simply return the combined list.

def add(L1, L2):
    return L1 + L2

L3 = add(L1, L2)
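A minimal sketch contrasting the two assignments: slice assignment mutates the list object the caller passed in, while plain assignment only rebinds the function's local name:

```python
def add_inplace(L1, L2, L3):
    # Slice assignment writes into the existing list object,
    # so the caller sees the change.
    L3[:] = L1 + L2

def add_rebind(L1, L2, L3):
    # Plain assignment rebinds the local name L3 to a new list;
    # the caller's list is untouched.
    L3 = L1 + L2

result = []
add_inplace([1], [0], result)
print(result)   # [1, 0]

result2 = []
add_rebind([1], [0], result2)
print(result2)  # []
```

This is why returning the new list is the cleaner design: it makes the data flow explicit instead of relying on mutation.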

Recursive and random grouping a list

I’m trying to write a function which creates a dichotomously grouped list; for example, if my input is the following:

[a, b, c, d, e, f, g, h]

I want to choose a random integer which will split it into smaller sublists, again and again recursively, until each sublist has length at most two, like:

[[a, [b, c]], [d, [[e, f], [g, h]]]]

This is what I’ve got so far, but it still gives me
TypeError: list indices must be integers or slices, not list:

def split_list(l):
    if l.__len__() > 2:
        pivot = np.random.random_integers(0, l.__len__() - 1)
        print(pivot, l)
        l = [RandomTree.split_list(l[:pivot])][RandomTree.split_list(l[pivot:])]
    return l

I got stuck and I would be very thankful for any advice.

Solution:

Here’s a solution that will not make one-element lists, as per your example:

import random

def splitlist(l, minlen=2):
    if len(l) <= minlen:  # if the list is 2 or smaller,
        return l if len(l) > 1 else l[0]  # return the list, or its only element
    x = random.randint(1, len(l)-1)  # choose a random split
    return [splitlist(l[:x], minlen), splitlist(l[x:], minlen)]

Usage example:

>>> splitlist(list(range(8)))
[[0, [1, [2, [3, 4]]]], [[5, 6], 7]]
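Since the splits are random, each call produces a different nesting. One way to sanity-check any result is to flatten it back and compare with the input (flatten here is a hypothetical helper, not part of the answer above):

```python
import random

def splitlist(l, minlen=2):
    if len(l) <= minlen:  # if the list is 2 or smaller,
        return l if len(l) > 1 else l[0]  # return the list, or its only element
    x = random.randint(1, len(l) - 1)  # choose a random split point
    return [splitlist(l[:x], minlen), splitlist(l[x:], minlen)]

def flatten(nested):
    # Recursively flatten the nested structure back into a flat list.
    # Leaves may be bare elements (from one-element splits) or pairs.
    if not isinstance(nested, list):
        return [nested]
    return [leaf for item in nested for leaf in flatten(item)]

grouped = splitlist(list(range(8)))
print(grouped)  # e.g. [[0, [1, [2, [3, 4]]]], [[5, 6], 7]]
assert flatten(grouped) == list(range(8))
```

Whatever random splits are chosen, flattening the grouped result must reproduce the original order, which makes this a useful invariant to test.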

How to tell if a python module is intended to be python 2 or python 3?

Python 2 and Python 3 have subtle differences which mean that it is not possible to look at a Python module and know for certain, just from automatic code analysis, whether it will work identically on Python 2 and Python 3. (Right? That seems to be the answer to Is it possible to check if python sourcecode was written only for one version (python 2 or python 3) )

Therefore, I suppose there must be some convention by which a developer can annotate a file to explicitly indicate that it is intended to be compatible with Python 2, Python 3, or both, so that this annotation can be read by developers, checked automatically, etc..

What is this convention?

I don’t see different file extensions, like .py2 vs .py3. I don’t see any global variable declaration intended to act as metadata. But it seems like something must exist, beyond ad hoc comments in code and readme files. So what is it?

Solution:

Unfortunately, there is not really any official way to specify it. However, a common way that’s widely used is to specify the required Python version in the distribution’s metadata.

You may see a line in the setup.py file (or the setup.cfg file, for modern versions of setuptools) declaring the python_requires option using the PEP 440 syntax. See also PEP 345 (Metadata for Python Software Packages), specifically the sections about environment markers and the Requires-Python metadata. Using these markers will prevent pip from downloading or installing a distribution under an incorrect Python interpreter version.

For older packages, it’s usually just mentioned in the docs or README file, or using trove classifiers. This is often listed on the PyPI and/or github landing page.
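As an illustrative sketch, a setup.py declaring both python_requires and trove classifiers might look like the following (the package name and versions are placeholders, not a real distribution):

```python
# setup.py -- illustrative sketch; name and versions are placeholders.
from setuptools import setup

setup(
    name='example-package',
    version='0.1',
    # PEP 440 specifier; pip refuses to install on older interpreters.
    python_requires='>=3.6',
    classifiers=[
        # Trove classifiers advertise supported versions on PyPI.
        'Programming Language :: Python :: 3',
        'Programming Language :: Python :: 3.6',
    ],
)
```

python_requires is enforced by pip at install time, while the classifiers are purely informational metadata shown on the PyPI page.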

Pandas: drop_duplicates().fillna(0) not filling zeroes

I’m trying to drop duplicates from a series and then fill the NaNs with 0. Should be pretty straightforward, but when I chain the two functions together, I still get NaN where I would expect 0.0.

df = pd.DataFrame({'a':[1,1,2,3,3,4], 'b':[10,20,30,40,50,60]})
df['a'] = df['a'].drop_duplicates().fillna(0)

yields:

     a   b
0  1.0  10
1  NaN  20
2  2.0  30
3  3.0  40
4  NaN  50
5  4.0  60

Whereas:

df['a'] = df['a'].drop_duplicates()
df['a'] = df['a'].fillna(0)

works fine, but seems unnecessarily verbose.

I’m on pandas 0.20.3.

Solution:

The problem is that drop_duplicates removes some values, and the subsequent index alignment creates NaN values. So reindex helps here:

df['a'] = df['a'].drop_duplicates().reindex(df.index, fill_value=0)
print (df)
   a   b
0  1  10
1  0  20
2  2  30
3  3  40
4  0  50
5  4  60

Or use duplicated with loc:

df.loc[df['a'].duplicated(), 'a'] = 0
print (df)
   a   b
0  1  10
1  0  20
2  2  30
3  3  40
4  0  50
5  4  60
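Another single-expression option is a sketch using Series.mask, which replaces values wherever the condition holds, so no intermediate NaNs appear at all:

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 1, 2, 3, 3, 4], 'b': [10, 20, 30, 40, 50, 60]})

# mask() keeps values where the condition is False and replaces the
# rest, so duplicated entries in 'a' become 0 directly.
df['a'] = df['a'].mask(df['a'].duplicated(), 0)
print(df['a'].tolist())  # [1, 0, 2, 3, 0, 4]
```

Because no NaN is ever introduced, the column also keeps its integer dtype instead of being upcast to float.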

Python: redundancy of `re.match(pattern)` as `re.search('^' + pattern)` does the same

While reading the docs, I found out that the whole difference between re.match() and re.search() is that re.match() starts checking only from the beginning of the string.

>>> import re
>>> a = 'abcde'
>>> re.match(r'b', a)
>>> re.search(r'b', a)
<_sre.SRE_Match object at 0xffe25c98>
>>> re.search(r'^b', a)
>>>

Is there anything I am misunderstanding, or is there no difference at all between re.search('^' + pattern) and re.match(pattern)?

Is it a good practice to only use re.search()?

Solution:

You should take a look at Python’s search() vs. match() documentation, which clearly mentions the other difference:

Note however that in MULTILINE mode match() only matches at the beginning of the string, whereas using search() with a regular expression beginning with ‘^’ will match at the beginning of each line.

>>> re.match('X', 'A\nB\nX', re.MULTILINE)   # No match
>>> re.search('^X', 'A\nB\nX', re.MULTILINE) # Match
<_sre.SRE_Match object; span=(4, 5), match='X'>

The first difference (for future readers) being:

Python offers two different primitive operations based on regular expressions: re.match() checks for a match only at the beginning of the string, while re.search() checks for a match anywhere in the string (this is what Perl does by default).

For example:

>>> re.match("c", "abcdef")  # No match
>>> re.search("c", "abcdef") # Match
<_sre.SRE_Match object; span=(2, 3), match='c'>

Regular expressions beginning with ‘^’ can be used with search() to restrict the match to the beginning of the string:

>>> re.match("c", "abcdef")   # No match
>>> re.search("^c", "abcdef") # No match
>>> re.search("^a", "abcdef") # Match
<_sre.SRE_Match object; span=(0, 1), match='a'>

How to update a column for list of values in Pandas Dataframe

I have a dataframe, and I want to update a value in one of its columns for a specific set of rows. How can this be done? I have around 20,000 records to update.

Sample Input

Id  Registered
345     Y
678     N
987     N
435     N
2345    Y
123     N
679     N

I want to update the Registered column to Y for a given set of Id numbers. For example, I want to change the Registered column to Y for Ids 678, 124, and 435. How can this be done for a large list?

Solution:

You can use a mask generated with isin to index df‘s Registered column and set values accordingly.

ids = [678, 124, 435]
df.loc[df.Id.isin(ids), 'Registered'] = 'Y'
df

     Id Registered
0   345          Y
1   678          Y
2   987          N
3   435          Y
4  2345          Y
5   123          N
6   679          N
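To see what the indexing is doing, isin builds a boolean mask that .loc then uses for row selection; a sketch rebuilding the question’s sample data:

```python
import pandas as pd

df = pd.DataFrame({'Id': [345, 678, 987, 435, 2345, 123, 679],
                   'Registered': ['Y', 'N', 'N', 'N', 'Y', 'N', 'N']})
ids = [678, 124, 435]  # 124 has no matching row and is simply ignored

mask = df['Id'].isin(ids)  # boolean Series, True where Id is in ids
print(mask.tolist())  # [False, True, False, True, False, False, False]

df.loc[mask, 'Registered'] = 'Y'
print(df['Registered'].tolist())  # ['Y', 'Y', 'N', 'Y', 'Y', 'N', 'N']
```

Because the whole update is one vectorized operation, it scales well to tens of thousands of ids; passing a set instead of a list to isin can speed up very large lookups.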

Is there a way to avoid typing the dataframe name, brackets, and quotes when creating a new column in a Python/Pandas dataframe?

Suppose I had a Python/Pandas dataframe called df1 with columns a and b, each with only one record (a = 1 and b = 2). I want to create a third column, c, whose value equals a + b or 3.

Using Pandas, I’d write:

df1['c'] = df1['a'] + df1['b'] 

I’d prefer just to write something simpler and easier to read, like the following:

with df1:
    c = a + b

SAS allows this simpler syntax in its “data step”. I would love it if Python/Pandas had something similar.

Thanks a lot!
Sean

Solution:

Short answer: no. pandas is constrained by Python’s syntax rules. The expression c = a + b requires a, b, and c to be names in the global namespace, and it is not a good idea for a library to modify the global namespace like that (what if you already have those names? What happens if there is a conflict?). That rules out the “no quotes” part.

With quotes, you have some options. For adding a new column, you can use eval:

df.eval('c = a + b')

The eval method basically evaluates the expression passed as a string. In this case, it adds a new column to a copy of the original DataFrame. eval is quite limited, though; see the docs for its usage and limitations.

For adding a new column, another option is assign. It is designed to add new columns on the fly but since it allows callables, you can also write things like:

very_long_data_frame_name.assign(new_column=lambda x: x['col1'] + x['col2'])

This is an alternative to the following:

very_long_data_frame_name['col1'] + very_long_data_frame_name['col2']

pandas also adds column names as attributes to the DataFrame if the column name is a valid Python identifier. That allows using the dot notation as juanpa.arrivillaga also mentioned:

df1['c'] = df1.a + df1.b

Note that for non-existing columns you still have to use the brackets (see the left-hand side of the assignment). If you already have a column named c, you can use df1.c on the left side too.

Similar to eval, there is a query method for selection. It doesn’t add a new column but queries the DataFrame by parsing the string passed to it. The string, again, should be a valid Python expression.
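A short sketch of both string-expression methods on the single-row frame from the question:

```python
import pandas as pd

df1 = pd.DataFrame({'a': [1], 'b': [2]})

# eval() parses the string expression; by default it returns a new
# DataFrame with the extra column rather than modifying df1 in place.
df2 = df1.eval('c = a + b')
print(df2['c'].tolist())         # [3]

# query() uses the same expression engine, but for row selection.
print(len(df2.query('c == 3')))  # 1
```

Both methods accept column names as bare identifiers inside the string, which is exactly the quote-free style the question was after, at the cost of putting the whole expression in a string.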