Finding a float in a python list

I am surprised by this Python behavior and can't understand why: I am not able to find 0.3 in a Python array.

>>> import numpy as np
>>> Lambdas = np.arange(0.0, 1.05, 0.05)
>>> print(Lambdas)
[0.   0.05 0.1  0.15 0.2  0.25 0.3  0.35 0.4  0.45 0.5  0.55 0.6  0.65
 0.7  0.75 0.8  0.85 0.9  0.95 1.  ]
>>> print(0.3 in Lambdas)
False
>>> print(0.30 in Lambdas)
False
>>> print(0.1 in Lambdas)
True
>>> print(0.4 in Lambdas)
True
>>> print(1 in Lambdas)
True
>>> print(1.0 in Lambdas)
True

Solution:

According to http://0.30000000000000004.com/

Your language isn’t broken, it’s doing floating point math. Computers can only natively store integers, so they need some way of
representing decimal numbers. This representation comes with some
degree of inaccuracy. That’s why, more often than not, .1 + .2 != .3.

Why does this happen? It's actually pretty simple. A base 10 system (like ours) can only cleanly express fractions whose denominators contain only prime factors of the base. The prime factors of 10 are 2 and 5, so 1/2, 1/4, 1/5, 1/8, and 1/10 can all be expressed cleanly. In contrast, 1/3, 1/6, and 1/7 are all repeating decimals because their denominators contain a prime factor of 3 or 7. In binary (base 2), the only prime factor is 2, so only fractions whose denominators are powers of 2 terminate: 1/2, 1/4, and 1/8 are expressed cleanly, while 1/5 and 1/10 are repeating. So 0.1 and 0.2 (1/10 and 1/5), while clean decimals in base 10, are repeating fractions in the base 2 system the computer operates in. When you do math on these repeating fractions, you end up with leftovers that carry over when the computer's base 2 (binary) number is converted into a more human-readable base 10 number.
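For the original question, the practical fix is to avoid exact equality when testing membership in a float array. A minimal sketch using NumPy's tolerance-based comparison:

```python
import numpy as np

Lambdas = np.arange(0.0, 1.05, 0.05)

# Exact membership fails: the stored value near 0.3 is not exactly 0.3
print(0.3 in Lambdas)                   # False

# Compare within a tolerance instead of exactly
print(np.isclose(Lambdas, 0.3).any())   # True

# Or round the array once before testing membership
print(0.3 in np.round(Lambdas, 2))      # True
```

`np.isclose` checks each element against the target within a small relative and absolute tolerance, which is the idiomatic way to compare floats.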

In Python, should a regex pattern always start with ^ and end with $?

I have noticed that most regex patterns start with ^ and end with $. However, if these anchors are not provided, the pattern still works as intended most of the time. So my question is: is adding them just good practice to make sure the whole string is matched?

The reason I ask is that I am building a regex tester website using Django where users can store their regex objects and then test them out. If this is always a good idea, I would write a function that takes the user's input pattern and makes sure it starts with ^ and ends with $. Something like this:

def standardize_pattern(self):
    pattern = self.pattern
    if len(self.pattern) > 0:
        if not self.pattern[0] == '^':
            pattern = '^' + pattern
        if not self.pattern[-1] == '$':
            pattern = pattern + '$'
    else:
        pattern = '^$'
    self.pattern = pattern

Any explanation is appreciated.

Solution:

No, start and end anchors are not just a matter of good practice. Sometimes they are required, sometimes they are unnecessary, and sometimes they are simply good to add as an extra constraint to make the expression safer, especially when the expression is used for validation.

If you want to capture or search for something inside a string, you usually do not use them.

For your goal, they do seem to be necessary.
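To make the difference concrete, here is a small sketch (the three-digit pattern and sample inputs are made up for illustration): without anchors, `re.search` happily matches a substring anywhere; with `^...$` (or `re.fullmatch`, which anchors implicitly) the entire string must conform, which is what validation usually needs:

```python
import re

pattern = r'\d{3}'  # three digits

# Searching: finds the digits inside a larger string
found = bool(re.search(pattern, 'abc123def'))      # True

# Validation with anchors: the entire string must be three digits
valid = bool(re.search(r'^\d{3}$', 'abc123def'))   # False
valid_ok = bool(re.search(r'^\d{3}$', '123'))      # True

# re.fullmatch anchors implicitly, no ^/$ needed
full = bool(re.fullmatch(pattern, '123'))          # True

print(found, valid, valid_ok, full)
```

Note that for a regex *tester*, silently adding anchors would change the meaning of the user's pattern, so `re.fullmatch` as a separate "validate" mode may be the safer design.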

Remove punctuations with regular expression

I tried the following, but it did not remove all of the punctuation:

s = '白云区H(52)077楼盘'

''.join(re.findall(u'([\u4e00-\u9fff0-9a-zA-Z]|(?<=[0-9])[^\u4e00-\u9fff0-9a-zA-Z]+(?=[0-9]))', s))

But I got 白云区H52)077楼盘 instead of 白云区H52077楼盘

What is the correct approach?

Thanks.

Solution:

In my understanding, you could do:

import re

print(re.sub(r'[^\w\s]', '', s))

Which outputs:

白云区H52077楼盘
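Alternatively, if you only want to keep the CJK and alphanumeric characters, you can simplify the original `findall` approach by dropping the lookaround branch entirely and keeping just the character class:

```python
import re

s = '白云区H(52)077楼盘'

# Keep only CJK ideographs, digits and ASCII letters; drop everything else
result = ''.join(re.findall(r'[\u4e00-\u9fff0-9a-zA-Z]', s))
print(result)  # 白云区H52077楼盘
```

This avoids the asymmetry in the original pattern, where `)` was kept because it sat between two digits but `(` was not.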

Retrieve definition for parenthesized abbreviation, based on letter count

I need to retrieve the definition of an acronym based on the number of letters enclosed in parentheses. For the data I’m dealing with, the number of letters in parentheses corresponds to the number of words to retrieve. I know this isn’t a reliable method for getting abbreviations, but in my case it will be. For example:

String = 'Although family health history (FHH) is commonly accepted as an important risk factor for common, chronic diseases, it is rarely considered by a nurse practitioner (NP).'

Desired output: family health history (FHH), nurse practitioner (NP)

I know how to extract parentheses from a string, but after that I am stuck. Any help is appreciated.

import re

a = ('Although family health history (FHH) is commonly accepted as an '
     'important risk factor for common, chronic diseases, it is rarely '
     'considered by a nurse practitioner (NP).')

x2 = re.findall(r'(\(.*?\))', a)

for x in x2:
    length = len(x)
    print(x, length)

Solution:

Use the regex match to find the position of the start of the match. Then use Python string indexing to get the substring leading up to the match, split that substring into words, and take the last n words, where n is the length of the abbreviation.

import re
s = 'Although family health history (FHH) is commonly accepted as an important risk factor for common, chronic diseases, it is rarely considered by a nurse practitioner (NP).'


for match in re.finditer(r"\((.*?)\)", s):
    start_index = match.start()
    abbr = match.group(1)
    size = len(abbr)
    words = s[:start_index].split()[-size:]
    definition = " ".join(words)

    print(abbr, definition)

This prints:

FHH family health history
NP nurse practitioner
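If you want the output in the exact "definition (ABBR)" form from the question, collect the pairs and join them:

```python
import re

s = ('Although family health history (FHH) is commonly accepted as an '
     'important risk factor for common, chronic diseases, it is rarely '
     'considered by a nurse practitioner (NP).')

pairs = []
for match in re.finditer(r"\((.*?)\)", s):
    abbr = match.group(1)
    # The n words immediately before the parentheses, n = len(abbr)
    words = s[:match.start()].split()[-len(abbr):]
    pairs.append(" ".join(words) + " (" + abbr + ")")

print(", ".join(pairs))
# family health history (FHH), nurse practitioner (NP)
```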

How to convert list of list into structured dict, Python3

I have a list of lists whose content should be read and stored in a structured dictionary.

my_list = [
    ['1', 'a1', 'b1'],
    ['',  'a2', 'b2'],
    ['',  'a3', 'b3'],
    ['2', 'c1', 'd1'],
    ['',  'c2', 'd2']]

The 1st, 2nd, 3rd columns in each row represents 'id', 'attr1', 'attr2'. If 'id' in a row is not empty, a new object starts with this 'id'. In the example above, there are two objects. The object with 'id' being '1' has 3 elements in both 'attr1' and 'attr2'; while the object with 'id' being '2' has 2 elements in both 'attr1' and 'attr2'. In my real application, there can be more objects, and each object can have an arbitrary number of elements.

For this particular example, the outcome should be

my_dict = {
    'id': ['1', '2'],
    'attr1': [['a1', 'a2', 'a3'], ['c1', 'c2']],
    'attr2': [['b1', 'b2', 'b3'], ['d1', 'd2']]}
Could you please show me how to write generic and efficient code to achieve this?

Thanks!

Solution:

Just build the appropriate dict in a loop with the right conditions:

d = {f: [] for f in ('id', 'attr1', 'attr2')}

for id, attr1, attr2 in my_list:
    if id:
        d['id'].append(id)
        d['attr1'].append([])
        d['attr2'].append([])
    d['attr1'][-1].append(attr1)
    d['attr2'][-1].append(attr2)
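Running the loop on the sample input produces the dictionary from the question:

```python
my_list = [
    ['1', 'a1', 'b1'],
    ['',  'a2', 'b2'],
    ['',  'a3', 'b3'],
    ['2', 'c1', 'd1'],
    ['',  'c2', 'd2']]

d = {f: [] for f in ('id', 'attr1', 'attr2')}

for id, attr1, attr2 in my_list:
    if id:  # a non-empty id starts a new object
        d['id'].append(id)
        d['attr1'].append([])
        d['attr2'].append([])
    d['attr1'][-1].append(attr1)
    d['attr2'][-1].append(attr2)

print(d['id'])     # ['1', '2']
print(d['attr1'])  # [['a1', 'a2', 'a3'], ['c1', 'c2']]
print(d['attr2'])  # [['b1', 'b2', 'b3'], ['d1', 'd2']]
```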

"or" boolean in an inline "if" statement

I have a program that begins by parsing several arguments, one of which is a “verbose” flag. However, I also have a “simulate” flag, which I would like to automatically flip the verbose flag to “True” if it is on.

Right now I have this working:

if args.verbose or simulate:  
    verbose = True

How can I get this onto one line? I was expecting to be able to do something like:

verbose = True if args.verbose or simulate

or like:

verbose = True if (args.verbose or simulate)

While searching here, I found a solution that fits on one line:

verbose = (False, True)[args.verbose or simulate]

However, I find that solution to be much less readable than the others that I was hoping would work. Is this possible, and I’m just missing something? Or is it not possible to use an “or” between two checks for “True” like this in one line?

Solution:

The problem isn’t with or, it’s that you need an else clause to specify what the value should be if the if statement fails. Otherwise, what is getting assigned if the condition is false?

verbose = True if args.verbose or simulate else False

There’s no need for the if at all, though. It’s even simpler if you just assign the result of the test to verbose directly:

verbose = args.verbose or simulate
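One caveat with the last form: `or` returns one of its operands, not necessarily a `bool`. With `argparse` `store_true` flags both operands are already booleans, but if either value could merely be truthy, wrap the expression in `bool()`:

```python
simulate = 1  # truthy, but not a bool

# 'or' returns the first truthy operand unchanged
verbose = False or simulate
print(verbose)  # 1

# bool() guarantees a real True/False
verbose = bool(False or simulate)
print(verbose)  # True
```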

How to return the most frequent letters in a string and order them based on their frequency count

I have this string: s = "china construction bank". I want to create a function that returns the 3 most frequent characters, ordered by their frequency of appearance together with the number of times they appear; if 2 characters appear the same number of times, they should be ordered alphabetically. I also want to print each character on a separate line.

I have built this code by now:

from collections import Counter
def ordered_letters(s, n=3):
    ctr = Counter(c for c in s if c.isalpha())
    top = ''.join(sorted(x[0] for x in ctr.most_common(n)))
    print(top[0], '\n', top[1], '\n', top[2])

This code applied to the above string will yield:

a 
c 
n

But this is not really what I want; what I would like as output is:

1st most frequent: 'n'. Appearances: 4
2nd most frequent: 'c'. Appearances: 3
3rd most frequent: 'a'. Appearances: 2

I'm stuck at the part where I have to print the characters that have the same frequency in alphabetical order. How could I do this?

Thank you very much in advance

Solution:

You can use heapq.nlargest with a custom sort key. The negated ord of the letter, -ord(x[0]), serves as a secondary key so that ties in frequency are broken in ascending alphabetical order. Using a heap queue is better than sorted here, as there's no need to sort all items in your Counter object.

from collections import Counter
from heapq import nlargest

def ordered_letters(s, n=3):
    ctr = Counter(c.lower() for c in s if c.isalpha())

    def sort_key(x):
        return (x[1], -ord(x[0]))

    for idx, (letter, count) in enumerate(nlargest(n, ctr.items(), key=sort_key), 1):
        print('#', idx, 'Most frequent:', letter, '.', 'Appearances:', count)

ordered_letters("china construction bank")

# 1 Most frequent: n . Appearances: 4
# 2 Most frequent: c . Appearances: 3
# 3 Most frequent: a . Appearances: 2
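If you want the exact wording from the question ("1st most frequent: 'n'. Appearances: 4"), add ordinal suffixes to the loop:

```python
from collections import Counter
from heapq import nlargest

def ordered_letters(s, n=3):
    ctr = Counter(c.lower() for c in s if c.isalpha())
    suffix = {1: 'st', 2: 'nd', 3: 'rd'}  # enough for small n
    key = lambda x: (x[1], -ord(x[0]))    # frequency, then alphabetical on ties
    for i, (letter, count) in enumerate(nlargest(n, ctr.items(), key=key), 1):
        print("%d%s most frequent: '%s'. Appearances: %d"
              % (i, suffix.get(i, 'th'), letter, count))

ordered_letters("china construction bank")
# 1st most frequent: 'n'. Appearances: 4
# 2nd most frequent: 'c'. Appearances: 3
# 3rd most frequent: 'a'. Appearances: 2
```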

Python / Get unique tokens from a file, with an exception

I want to find the number of unique tokens in a file. For this purpose I wrote the below code:

splittedWords = open('output.txt', encoding='windows-1252').read().lower().split()
uniqueValues = set(splittedWords)

print(uniqueValues)

The output.txt file is like this:

Türkiye+Noun ,+Punc terörizm+Noun+Gen ve+Conj kitle+Noun imha+Noun silah+Noun+A3pl+P3sg+Gen küresel+Adj düzey+Noun+Loc olus+Verb+Caus+PastPart+P3sg tehdit+Noun+Gen boyut+Noun+P3sg karsi+Adj+P3sg+Loc ,+Punc tüm+Det ülke+Noun+A3pl+Gen yay+Verb+Pass+Inf2+Gen önle+Verb+Pass+Inf2+P3sg hedef+Noun+A3pl+P3sg+Acc paylas+Verb+PastPart+P3pl ,+Punc daha+Noun güven+Noun+With ve+Conj istikrar+Noun+With bir+Num dünya+Noun düzen+Noun+P3sg için+PostpPCGen birlik+Noun+Loc çaba+Noun göster+Verb+PastPart+P3pl bir+Num asama+Noun+Dat gel+Verb+Pass+Inf2+P3sg+Acc samimi+Adj ol+Verb+ByDoingSo arzula+Verb+Prog2+Cop .+Punc 
Ab+Noun ile+PostpPCNom gümrük+Noun Alan+Noun+P3sg+Loc+Rel kurumsal+Adj iliski+Noun+A3pl 
club+Noun toplanti+Noun+A3pl+P3sg 
Türkiye+Noun+Gen -+Punc At+Noun gümrük+Noun isbirlik+Noun+P3sg komite+Noun+P3sg ,+Punc Ankara+Noun Anlasma+Noun+P3sg+Gen 6+Num madde+Noun+P3sg uyar+Verb+When ortaklik+Noun rejim+Noun+P3sg+Gen uygula+Verb+Pass+Inf2+P3sg+Acc ve+Conj gelis+Verb+Inf2+P3sg+Acc sagla+Verb+Inf1 üzere+PostpPCNom ortaklik+Noun Konsey+Noun+P3sg+Gen 2+Num /+Punc 69+Num sayili+Adj karar+Noun+P3sg ile+Conj teknik+Noun komite+Noun mahiyet+Noun+P3sg+Loc kur+Verb+Pass+Narr+Cop .+Punc 
nispi+Adj 
nisbi+Adj 
görece+Adj+With 
izafi+Adj 
obur+Adj 

With this code I can get unique tokens like Türkiye+Noun and Türkiye+Noun+Gen. But I want only the part before the first + sign, e.g. Türkiye, so that Türkiye+Noun and Türkiye+Noun+Gen are treated as one and the same unique token. I think I need to write a regex for this purpose.

Solution:

It seems the word you want is always the first in a list of '+'-joined parts, so split each whitespace-separated token at + and take the 0th piece:

text = """Türkiye+Noun ,+Punc terörizm+Noun+Gen ve+Conj kitle+Noun imha+Noun silah+Noun+A3pl+P3sg+Gen küresel+Adj düzey+Noun+Loc olus+Verb+Caus+PastPart+P3sg tehdit+Noun+Gen boyut+Noun+P3sg karsi+Adj+P3sg+Loc ,+Punc tüm+Det ülke+Noun+A3pl+Gen yay+Verb+Pass+Inf2+Gen önle+Verb+Pass+Inf2+P3sg hedef+Noun+A3pl+P3sg+Acc paylas+Verb+PastPart+P3pl ,+Punc daha+Noun güven+Noun+With ve+Conj istikrar+Noun+With bir+Num dünya+Noun düzen+Noun+P3sg için+PostpPCGen birlik+Noun+Loc çaba+Noun göster+Verb+PastPart+P3pl bir+Num asama+Noun+Dat gel+Verb+Pass+Inf2+P3sg+Acc samimi+Adj ol+Verb+ByDoingSo arzula+Verb+Prog2+Cop .+Punc 
Ab+Noun ile+PostpPCNom gümrük+Noun Alan+Noun+P3sg+Loc+Rel kurumsal+Adj iliski+Noun+A3pl 
club+Noun toplanti+Noun+A3pl+P3sg 
Türkiye+Noun+Gen -+Punc At+Noun gümrük+Noun isbirlik+Noun+P3sg komite+Noun+P3sg ,+Punc Ankara+Noun Anlasma+Noun+P3sg+Gen 6+Num madde+Noun+P3sg uyar+Verb+When ortaklik+Noun rejim+Noun+P3sg+Gen uygula+Verb+Pass+Inf2+P3sg+Acc ve+Conj gelis+Verb+Inf2+P3sg+Acc sagla+Verb+Inf1 üzere+PostpPCNom ortaklik+Noun Konsey+Noun+P3sg+Gen 2+Num /+Punc 69+Num sayili+Adj karar+Noun+P3sg ile+Conj teknik+Noun komite+Noun mahiyet+Noun+P3sg+Loc kur+Verb+Pass+Narr+Cop .+Punc 
nispi+Adj 
nisbi+Adj 
görece+Adj+With 
izafi+Adj 
obur+Adj """

splittedWords = text.lower().split()  # split() already handles newlines
uniqueValues = set(s.split("+")[0] for s in splittedWords)

print(uniqueValues)

Output:

{'imha', 'çaba', 'ülke', 'arzula', 'terörizm', 'olus', 'daha', 'istikrar', 'küresel', 
 'sagla', 'önle', 'üzere', 'nisbi', 'türkiye', 'gelis', 'bir', 'karar', 'hedef', '2', 
 've', 'silah', 'kur', 'alan', 'club', 'boyut', '-', 'anlasma', 'iliski', 
 'izafi', 'kurumsal', 'karsi', 'ankara', 'ortaklik', 'obur', 'kitle', 'güven', 
 'uygula', 'ol', 'düzey', 'konsey', 'teknik', 'rejim', 'komite', 'gümrük', 'samimi', 
  'gel', 'yay', 'toplanti', '.', 'asama', 'mahiyet', 'ab', '69', 'için', 
 'paylas', '6', '/', 'nispi', 'dünya', 'at', 'sayili', 'görece', 'isbirlik', 'birlik', 
 ',', 'tüm', 'ile', 'düzen', 'uyar', 'göster', 'tehdit', 'madde'}

You might need to do some additional cleanup to remove things like

',' '6' '/'

Split, then remove anything that's just numbers or punctuation:

from string import digits, punctuation

remove=set(digits+punctuation)

splittedWords = text.lower().split()
uniqueValues = set(s.split("+")[0] for s in splittedWords)

# remove from the set anything that consists only of numbers or punctuation
uniqueValues -= set(x for x in uniqueValues if all(c in remove for c in x))
print(uniqueValues)
print(uniqueValues)

to get it as:

{'teknik', 'yay', 'göster','hedef', 'terörizm', 'ortaklik','ile', 'daha', 'ol', 'istikrar', 
 'paylas', 'nispi', 'üzere', 'sagla', 'tüm', 'önle', 'asama', 'uygula', 'güven', 'kur', 
 'türkiye', 'gel', 'dünya', 'gelis', 'sayili', 'ab', 'club', 'küresel', 'imha', 'çaba', 
 'olus', 'iliski', 'izafi', 'mahiyet', 've', 'düzey', 'anlasma', 'tehdit', 'bir', 'düzen', 
 'obur', 'samimi', 'boyut', 'ülke', 'arzula', 'rejim', 'gümrük', 'karar', 'at', 'karsi', 
 'nisbi', 'isbirlik', 'alan', 'toplanti', 'ankara', 'birlik', 'kurumsal', 'için', 'kitle', 
 'komite', 'silah', 'görece', 'uyar', 'madde', 'konsey'} 

Pandas: replace numpy.nan cell with maximum of non-nan adjacent cells

test case:

df = pd.DataFrame([[np.nan, 2, np.nan, 0],
                    [3, 4, np.nan, 1],
                    [np.nan, np.nan, np.nan, 5],
                    [np.nan, 3, np.nan, 4]],
                    columns=list('ABCD'))

Each NaN should be replaced with the maximum of its non-NaN adjacent cells, where A[i+1, j], A[i-1, j], A[i, j+1], A[i, j-1] are the entries adjacent to A[i, j].

In so many words, this:

     A    B   C  D
0  NaN  2.0 NaN  0
1  3.0  4.0 NaN  1
2  NaN  NaN NaN  5
3  NaN  3.0 NaN  4

should become this:

     A    B   C  D
0  3.0  2.0 2.0  0.0
1  3.0  4.0 4.0  1.0
2  3.0  4.0 5.0  5.0
3  3.0  3.0 4.0  4.0

Solution:

You can take a rolling max of size 3 along each axis (so each window covers a cell and its two neighbours in that direction), then combine the two results elementwise and use that to fill in the missing values of the original.

df1 = df.rolling(3, center=True, min_periods=1).max().fillna(-np.inf)
df2 = df.T.rolling(3, center=True, min_periods=1).max().T.fillna(-np.inf)
fill = df1.where(df1 > df2).fillna(df2)
df.fillna(fill)

Output

     A    B    C  D
0  3.0  2.0  2.0  0
1  3.0  4.0  4.0  1
2  3.0  4.0  5.0  5
3  3.0  3.0  4.0  4
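An alternative sketch that maps more directly onto the "adjacent cells" definition: shift the frame one step in each of the four directions, stack the shifted copies, and take the elementwise max (which skips NaN):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[np.nan, 2, np.nan, 0],
                   [3, 4, np.nan, 1],
                   [np.nan, np.nan, np.nan, 5],
                   [np.nan, 3, np.nan, 4]],
                  columns=list('ABCD'))

# Four shifted copies: the up, down, left and right neighbour of each cell
neighbours = pd.concat([df.shift(1), df.shift(-1),
                        df.shift(1, axis=1), df.shift(-1, axis=1)])

# Elementwise max over the stacked copies (NaNs ignored), then fill
fill = neighbours.groupby(level=0).max()
result = df.fillna(fill)
print(result)
```

`pd.concat` leaves the four copies with duplicate row labels, so `groupby(level=0).max()` reduces them back to one row per label, taking the max of the available neighbours.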

Invalid syntax during reading of csv file in python

I am trying to read a file using csv.reader in python. I am new to Python and am using Python 2.7.15.

The example that I am trying to recreate comes from the "Reading CSV Files With csv" section of this page. This is the code:

import csv

with open('employee_birthday.txt') as csv_file:
    csv_reader = csv.reader(csv_file, delimiter=',')
    line_count = 0
    for row in csv_reader:
        if line_count == 0:
            print(f'Column names are {", ".join(row)}')
            line_count += 1
        else:
            print(f'\t{row[0]} works in the {row[1]} department, and was born in {row[2]}.')
            line_count += 1
    print(f'Processed {line_count} lines.')

During execution of the code, I am getting the following errors:

File "sidd_test2.py", line 11
  print(f'Column names are {", ".join(row)}')
                                         ^
SyntaxError: invalid syntax 

What am I doing wrong? How can I avoid this error? I will appreciate any help.

Solution:

Because f in front of strings (f-strings) is only supported in Python 3.6 and above, and you are running Python 2.7. Try this instead:

print('Column names are',", ".join(row))

Or:

print('Column names are %s'%", ".join(row))

Or:

print('Column names are {}'.format(", ".join(row)))