Weird behaviour of non-ASCII Python identifiers

I have learnt from PEP 3131 that non-ASCII identifiers are supported in Python, though they're not considered best practice.

However, I get this strange behaviour, where my 𝜏 identifier (U+1D70F) seems to be automatically converted to τ (U+03C4).

class Base(object):
    def __init__(self):
        self.𝜏 = 5 # defined with U+1D70F

a = Base()
print(a.𝜏)     # 5             # (U+1D70F)
print(a.τ)     # 5 as well     # (U+03C4) ? another way to access it?
d = a.__dict__ # {'τ':  5}     # (U+03C4) ? seems converted
print(d['τ'])  # 5             # (U+03C4) ? consistent with the conversion
print(d['𝜏'])  # KeyError: '𝜏' # (U+1D70F) ?! unexpected!

Is that expected behaviour? Why does this silent conversion occur? Does it have anything to do with NFKC normalization? I thought that was only for canonically ordering Unicode character sequences.

Solution:

Per the documentation on identifiers:

All identifiers are converted into the normal form NFKC while parsing;
comparison of identifiers is based on NFKC.

You can see that U+03C4 is the appropriate result using unicodedata:

>>> import unicodedata
>>> unicodedata.normalize('NFKC', '𝜏')
'τ'

However, this conversion doesn’t apply to string literals, like the one you’re using as a dictionary key. The lookup therefore searches for the unconverted character in a dictionary that only contains the converted one.

self.𝜏 = 5  # implicitly converted to "self.τ = 5"
a.𝜏  # implicitly converted to "a.τ"
d['𝜏']  # not converted

You can see similar problems with e.g. string literals used with getattr:

>>> getattr(a, '𝜏')
Traceback (most recent call last):
  File "python", line 1, in <module>
AttributeError: 'Base' object has no attribute '𝜏'
>>> getattr(a, unicodedata.normalize('NFKC', '𝜏'))
5
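If you need to look up such attributes from runtime strings, one option is to apply the same normalization yourself before the lookup. A minimal sketch (the helper name `getattr_normalized` is my own, not a standard API):

```python
import unicodedata

class Base(object):
    def __init__(self):
        self.𝜏 = 5  # the parser stores this under the NFKC form, U+03C4

def getattr_normalized(obj, name):
    # Normalize the lookup key the same way the parser normalizes identifiers
    return getattr(obj, unicodedata.normalize('NFKC', name))

a = Base()
print(getattr_normalized(a, '\U0001d70f'))  # 5, even though we passed U+1D70F
```

This way both spellings of the identifier resolve to the same attribute.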

Python Pandas – from data frame create an array or matrix for multiplication

I found this prior post and it gets me close.
how-to-convert-a-pandas-dataframe-subset-of-columns-and-rows-into-a-numpy-array

But instead of making a single array (or matrix) of two columns based on the value in a third, I need to iterate through the data frame and create a 3×3 array (or matrix) from columns ‘b’ through ‘j’ for each correctly matching value in ‘a’.

import pandas as pd

dft = pd.DataFrame({'a': ['NW', 'NW', 'SL', 'T'],
                    'b': [1, 2, 3, 4],
                    'c': [5, 6, 7, 8],
                    'd': [11, 12, 13, 14],
                    'e': [9, 10, 11, 12],
                    'f': [4, 3, 2, 1],
                    'g': [15, 14, 13, 12],
                    'h': [13, 14, 15, 16],
                    'i': [5, 4, 3, 2],
                    'j': [9, 8, 7, 6]})

print(dft)
    a  b  c   d   e  f   g   h  i  j
0  NW  1  5  11   9  4  15  13  5  9
1  NW  2  6  12  10  3  14  14  4  8
2  SL  3  7  13  11  2  13  15  3  7
3   T  4  8  14  12  1  12  16  2  6

What I want is 2 separate arrays, one for each 'NW' row:

     [[ 1  5 11]
      [ 9  4 15]
      [13  5  9]]

     [[ 2  6 12]
      [10  3 14]
      [14  4  8]]

I have tried the following and received a really ugly error. The code is an attempt based on the original post.

    dft.loc[dft['a'] == 'NW',['b', 'c', 'd'], ['e', 'f', 'g'], ['h', 'i', 'j']].values

Here is the error:

IndexingError                             Traceback (most recent call last)
<ipython-input-1> in <module>()
----> 1 dft.loc[dft['a'] == 'NW',['b', 'c', 'd'], ['e', 'f', 'g'], ['h', 'i', 'j']].values

D:\Applications\Anaconda\lib\site-packages\pandas\core\indexing.py in __getitem__(self, key)
   1323             except (KeyError, IndexError):
   1324                 pass
-> 1325             return self._getitem_tuple(key)
   1326         else:
   1327             key = com._apply_if_callable(key, self.obj)

D:\Applications\Anaconda\lib\site-packages\pandas\core\indexing.py in _getitem_tuple(self, tup)
    839
    840         # no multi-index, so validate all of the indexers
--> 841         self._has_valid_tuple(tup)
    842
    843         # ugly hack for GH #836

D:\Applications\Anaconda\lib\site-packages\pandas\core\indexing.py in _has_valid_tuple(self, key)
    186         for i, k in enumerate(key):
    187             if i >= self.obj.ndim:
--> 188                 raise IndexingError('Too many indexers')
    189             if not self._has_valid_type(k, i):
    190                 raise ValueError("Location based indexing can only have [%s] "

IndexingError: Too many indexers

Thoughts? I am so close, yet tantalizingly far.

  • Also, I have no clue how to format the error output, so any help cleaning that up would be appreciated.

Solution:

You can do this without a loop:

a = dft.loc[dft['a'] == 'NW', 'b':'j']
n = a.shape[0]
new_a = a.values.reshape(n, 3, 3)

You get

array([[[ 1,  5, 11],
        [ 9,  4, 15],
        [13,  5,  9]],

       [[ 2,  6, 12],
        [10,  3, 14],
        [14,  4,  8]]])
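If you want one stack of 3×3 arrays for every value of 'a' at once (not just 'NW'), a groupby-based sketch along the same lines — the dictionary comprehension is my own suggestion, not from the question:

```python
import pandas as pd

dft = pd.DataFrame({'a': ['NW', 'NW', 'SL', 'T'],
                    'b': [1, 2, 3, 4], 'c': [5, 6, 7, 8],
                    'd': [11, 12, 13, 14], 'e': [9, 10, 11, 12],
                    'f': [4, 3, 2, 1], 'g': [15, 14, 13, 12],
                    'h': [13, 14, 15, 16], 'i': [5, 4, 3, 2],
                    'j': [9, 8, 7, 6]})

# One (n_rows, 3, 3) array per label in column 'a'
arrays = {key: grp.loc[:, 'b':'j'].values.reshape(-1, 3, 3)
          for key, grp in dft.groupby('a')}
print(arrays['NW'][0])  # the first 3x3 block for 'NW'
```

`arrays['NW']` then has shape (2, 3, 3), matching the two arrays shown above.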

Python + Selenium WebDriver, click on submit button

I am working on a web scraper where I need to click the submit button to trigger a JavaScript script that loads a page. I am not able to identify the right element and perform the click() function. I think this has something to do with the aria-hidden="true" attribute. Can you please let me know how we can achieve this?

Here is the page_source section for the button:

<button class="btn btn-primary" aria-hidden="true">Search</button>

Here is the code that I have:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import time

driver = webdriver.PhantomJS(executable_path='C:/Users/50178/Documents/learn/pracpy/phantomjs-2.1.1-windows/phantomjs-2.1.1-windows/bin/phantomjs.exe')
#driver = webdriver.Chrome(executable_path='C:/Users/50178/Documents/learn/pracpy/ChromeDriver/chromedriver.exe')
driver.get("http://www16.co.hennepin.mn.us/cfrs/search.do")
driver.implicitly_wait(5)
driver.find_element_by_css_selector("input[type='radio']").click()
print(driver.page_source)
print (driver.find_element_by_xpath('.//div[@class="btn btn-primary"]'))
driver.close()

Solution:

"btn btn-primary" is the class of a button element, not a div. Try

//button[@class="btn btn-primary"]

or

//button[.="Search"]

Transform pandas dataframe to another layout

I have a dataframe that looks like this:

  column1  column2  column3
0       A    0.020     0.76
1       B    0.045     1.30
2       C    0.230     0.32
3       D    0.130     0.67

I would like to modify this dataframe structure to make it look like this:

column1  newCol 
A        column2    0.020
         column3    0.760
B        column2    0.045
         column3    1.300
C        column2    0.230
         column3    0.320
D        column2    0.130
         column3    0.670
Name: value, dtype: float64

Where column1, column2, column3, newCol are the names for the columns
A, B, C, D are unique values for rows

My problem is that I don’t know how to convert column1 and column2 from columns to rows in the new dataframe.

Solution:

Use melt + set_index + sort_index

df.melt('column1', var_name='newCol')\
  .set_index(['column1', 'newCol'])\
  .sort_index().value

column1  newCol 
A        column2    0.020
         column3    0.760
B        column2    0.045
         column3    1.300
C        column2    0.230
         column3    0.320
D        column2    0.130
         column3    0.670
Name: value, dtype: float64

Works with v0.20 and above. For older versions, use pd.melt instead.
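An equivalent route, in case you prefer index-based reshaping, is set_index + stack, which pivots the remaining columns into an inner index level. A sketch (the index-level rename is needed because stack does not know the 'newCol' name):

```python
import pandas as pd

df = pd.DataFrame({'column1': ['A', 'B', 'C', 'D'],
                   'column2': [0.020, 0.045, 0.230, 0.130],
                   'column3': [0.76, 1.30, 0.32, 0.67]})

# stack() moves column2/column3 into an inner index level
s = df.set_index('column1').stack()
s.index.names = ['column1', 'newCol']
s.name = 'value'
print(s)
```

The result is the same MultiIndexed Series shown above.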

Python: Counting occurrence of List element within List

I’m trying to count the occurrences of elements within a list, when those elements are themselves lists. The order is also important.

[PSEUDOCODE]

list = [ ['a', 'b', 'c'], ['d', 'e', 'f'], ['a', 'b', 'c'], ['c', 'b', 'a'] ]
print( count(list) )


> { ['a', 'b', 'c'] : 2, ['d', 'e', 'f']: 1, ['c', 'b', 'a']: 1 }

One important factor is that ['a', 'b', 'c'] != ['c', 'b', 'a']

I have tried:

from collections import Counter
print( Counter([tuple(x) for x in list]) )
print( [[x, list.count(x)] for x in set(list)] )

Which both resulted in ['a', 'b', 'c'] = ['c', 'b', 'a'], one thing I didn’t want.

I also tried:

from collections import Counter
print( Counter( list ) )

Which only resulted in an error, since lists can’t be used as keys in dicts.

Is there a way to do this?

Solution:

You can’t have a list as a dict key because dictionaries only allow hashable objects as keys. Hence you first need to convert your sub-lists to tuples. Then you can use collections.Counter to get the count of each tuple:

>>> from collections import Counter
>>> my_list = [ ['a', 'b', 'c'], ['d', 'e', 'f'], ['a', 'b', 'c'], ['c', 'b', 'a'] ]

#            v to type-cast each sub-list to tuple
>>> Counter(tuple(item) for item in my_list)
Counter({('a', 'b', 'c'): 2, ('d', 'e', 'f'): 1, ('c', 'b', 'a'): 1})
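If you want the counts keyed by the original list form again (e.g. for display), you can convert the tuple keys back on the way out — a small sketch:

```python
from collections import Counter

my_list = [['a', 'b', 'c'], ['d', 'e', 'f'], ['a', 'b', 'c'], ['c', 'b', 'a']]
counts = Counter(tuple(item) for item in my_list)

# Rebuild (list, count) pairs; Counter preserves first-seen order on Python 3.7+
pairs = [(list(key), n) for key, n in counts.items()]
print(pairs)  # [(['a', 'b', 'c'], 2), (['d', 'e', 'f'], 1), (['c', 'b', 'a'], 1)]
```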

Why do I get an Attribute Error when using pandas apply?

How should I convert NaN values into a categorical value based on a condition? I am getting an error while trying to convert the NaN values.

category        gender  sub-category  title
health&beauty   NaN     makeup        lipbalm
health&beauty   women   makeup        lipstick
NaN             NaN     NaN           lipgloss

My DataFrame looks like this, and my function to convert NaN values in gender to a categorical value looks like:

def impute_gender(cols):
    category=cols[0]
    sub_category=cols[2]
    gender=cols[1]
    title=cols[3]
    if title.str.contains('Lip') and gender.isnull==True:
        return 'women'
df[['category','gender','sub_category','title']].apply(impute_gender,axis=1)

If I run the code I am getting error

----> 7     if title.str.contains('Lip') and gender.isnull()==True:
      8         print(gender)
      9 

AttributeError: ("'str' object has no attribute 'str'", 'occurred at index category')

Complete Dataset –https://github.com/lakshmipriya04/py-sample

Solution:

Some things to note here:

  1. If you’re using only two columns, calling apply over 4 columns is wasteful
  2. Calling apply is wasteful in general, because it is slow and offers no vectorisation benefits to you
  3. In apply, you’re dealing with scalars, so you do not use the .str accessor as you would on a pd.Series object. For a plain string, the pythonic check is "lip" in title.
  4. gender.isnull is completely wrong; gender is a scalar here and has no isnull attribute (pd.isnull(gender) would work on a scalar)

Option 1
np.where

import numpy as np

m = df.gender.isnull() & df.title.str.contains('lip')
df['gender'] = np.where(m, 'women', df.gender)

df
        category gender sub-category     title
0  health&beauty  women       makeup   lipbalm
1  health&beauty  women       makeup  lipstick
2            NaN  women          NaN  lipgloss

Which is not only fast, but simpler as well. If you’re worried about case sensitivity, you can make your contains check case insensitive –

import re
m = df.gender.isnull() & df.title.str.contains('lip', flags=re.IGNORECASE)

Option 2
Another alternative is using pd.Series.mask/pd.Series.where

df['gender'] = df.gender.mask(m, 'women')

Or,

df['gender'] = df.gender.where(~m, 'women')

df
        category gender sub-category     title
0  health&beauty  women       makeup   lipbalm
1  health&beauty  women       makeup  lipstick
2            NaN  women          NaN  lipgloss

mask replaces the values where the condition is True; where is its mirror image and keeps the values where the condition is True, which is why it takes ~m above.
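End to end, with the rows from the question as a minimal reproduction:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'category': ['health&beauty', 'health&beauty', np.nan],
                   'gender': [np.nan, 'women', np.nan],
                   'sub-category': ['makeup', 'makeup', np.nan],
                   'title': ['lipbalm', 'lipstick', 'lipgloss']})

# Fill gender with 'women' wherever it is missing and the title mentions 'lip'
m = df.gender.isnull() & df.title.str.contains('lip')
df['gender'] = np.where(m, 'women', df.gender)
print(df.gender.tolist())  # ['women', 'women', 'women']
```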

Split a string and save them to a list with Python

I have a string into which I inserted a space at every position, saving each result to a list. Now I want to split those spaced strings and collect all the pieces in one flat list, but when I do this I end up with multiple lists nested inside:

This is the code I'm working on:

var = 'sans'
# Adding a space at each position to see whether that would generate other words
res = [var[:i] + ' ' + var[i:] for i in range(len(var))]
cor = [res[i].split() for i in range(len(res))]

And this is the output I'm getting:

>>> cor
[['sans'], ['s', 'ans'], ['sa', 'ns'], ['san', 's']]

What I'm expecting:

>>> cor
['sans', 's', 'ans', 'sa', 'ns', 'san', 's']

I'm new to Python; I don't know what I'm missing.

Thanks

Solution:

An alternative approach:

cor = " ".join(res).split()

Output:

['sans', 's', 'ans', 'sa', 'ns', 'san', 's']

Explanation

" ".join(res) will join the individual strings in res with a space in between them. Then calling .split() will split this string on whitespace back into a list.

EDIT: A second approach that doesn’t involve the intermediate variable res, although this one isn’t quite as easy on the eyes:

cor = [var[:i//2+1] if i % 2 == 1 else var[i//2:] for i in range(2*len(var)-1)]

Basically you flip between building substrings from the front and the back.
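A third, more general option is flattening with itertools.chain, which works for any list of lists:

```python
from itertools import chain

var = 'sans'
res = [var[:i] + ' ' + var[i:] for i in range(len(var))]

# Split each spaced string, then chain all resulting pieces into one flat list
cor = list(chain.from_iterable(s.split() for s in res))
print(cor)  # ['sans', 's', 'ans', 'sa', 'ns', 'san', 's']
```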

Python – How to check if socket is still connected

I have the following code, which is self-explanatory:

s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s.connect((host, port))
s.send(b"some data")
# don't close socket just yet...
# do some other stuff with the data (normal string operations)
if s.stillconnected() is true:
    s.send(b"some more data")
if s.stillconnected() is false:
    # recreate the socket and reconnect
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    s.connect((host, port))
    s.send(b"some more data")
s.close()

How do I implement s.stillconnected()? I do not wish to recreate the socket blindly.

Solution:

If the server connection is no longer alive, calling the send method will raise an exception, so you can use a try/except block to attempt to send data, catch the exception if it is raised, and reestablish the connection:

try:
    s.send(b"some more data")
except OSError:
    # recreate the socket and reconnect
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    s.connect((host, port))
    s.send(b"some more data")

EDIT: As per @Jean-Paul Calderone’s comments, consider using the sendall method instead of send: sendall either transmits all of the data or raises an error, whereas the lower-level send does not guarantee that all of the data is transmitted. Alternatively, use a higher-level module, such as an HTTP library, that handles socket lifecycles for you.
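If you'd rather not repeat the try/except at every call site, the pattern can be wrapped in a small helper. A sketch — send_with_reconnect is a name I've made up, not a socket API:

```python
import socket

def send_with_reconnect(sock, data, addr):
    """Send data; on failure, reconnect once and retry.

    addr is a (host, port) tuple. Returns the (possibly new) socket.
    """
    try:
        sock.sendall(data)
    except OSError:
        sock.close()
        sock = socket.create_connection(addr)
        sock.sendall(data)
    return sock
```

Note this retries exactly once; if the connection is lost again during the retry, the exception propagates to the caller.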

Python __dict__

The attribute __dict__ is supposed to contain user-defined attributes. But if we print the __dict__ of an empty class, we would also get:

__module__
__dict__ 
__weakref__
__doc__

These are prepopulated by Python in the __dict__ attribute according to the class object type.

Now, __base__ and __class__ are also Python-defined attributes of a class object, but they are not included in __dict__.

Is there any rule that specifies which dunder attributes are included in an object's __dict__ and which are not?

Solution:

The attribute __dict__ is supposed to contain user-defined attributes.

No, the __dict__ contains the dynamic attributes of an object. Those are not the only attributes an object can have, however; the type of the object is usually also consulted to find attributes.

For example, the methods on a class can be found as attributes on an instance too. Many such attributes are descriptor objects and are bound to the object when looked up. This is the job of the __getattribute__ method all classes inherit from object; attributes on an object are resolved via type(object).__getattribute__(attribute_name), at which point the descriptors on the type as well as attributes directly set on the object (in the __dict__ mapping) are considered.

The __bases__ attribute of a class is provided by the class's metatype, which is type by default; it is a descriptor:

>>> class Foo:
...     pass
...
>>> Foo.__bases__
(<class 'object'>,)
>>> type.__dict__['__bases__']
<attribute '__bases__' of 'type' objects>
>>> type.__dict__['__bases__'].__get__(Foo, type)
(<class 'object'>,)

__dict__ just happens to be a place to store attributes that can have any valid string name. For classes that includes several standard attributes set when the class is created (__module__ and __doc__), and others that are there as descriptors for instances of a class (__dict__ and __weakref__). The latter must be added to the class, because a class itself also has those attributes, taken from type, again as descriptors.

So why is __bases__ a descriptor, but __doc__ is not? The latter is writeable, while __bases__ is not. The Python core developers use descriptors to restrict what can be set, or when setting a value requires additional work (like clearing caches, etc.).
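You can verify the distinction directly — __doc__ lives in the class __dict__, while __bases__ comes from a descriptor on type:

```python
class Foo:
    """Original docstring."""

# __doc__ is a plain entry in the class namespace
assert '__doc__' in vars(Foo)

# __bases__ is not; it is served by a descriptor found on type
assert '__bases__' not in vars(Foo)
assert '__bases__' in vars(type)

# Being a plain entry, __doc__ is freely writable
Foo.__doc__ = 'Replaced docstring.'
print(Foo.__doc__)  # Replaced docstring.
```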

Swapping two characters in a string and store the generated strings in a list in Python

I want to swap every two adjacent characters in a string and store the outputs in a list (so I can check later whether each string exists in the dictionary).

I have seen some code that swaps all the characters at once, but that is not what I'm looking for.

For example:

var = 'abcde'

Expected output:

['bacde','acbde','abdce','abced']

How can I do this in Python?

Solution:

You can use the list comprehension below to achieve this:

>>> var = 'abcde'

#                         v To reverse the substring
>>> [var[:i]+var[i:i+2][::-1]+var[i+2:] for i in range(len(var)-1)]
['bacde', 'acbde', 'abdce', 'abced']
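The same idea as an explicit function, if the slicing one-liner is hard to read — adjacent_swaps is a name made up for illustration:

```python
def adjacent_swaps(s):
    # Swap each neighbouring pair of characters, producing one string per swap
    out = []
    for i in range(len(s) - 1):
        chars = list(s)
        chars[i], chars[i + 1] = chars[i + 1], chars[i]
        out.append(''.join(chars))
    return out

print(adjacent_swaps('abcde'))  # ['bacde', 'acbde', 'abdce', 'abced']
```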