Why is the result of chr(0x24) + chr(0x84) different in Python 2 and 3

I was using Python to solve the Protostar challenges from exploit-exercises, and I was surprised by the different output of this code under Python 3.

payload = chr(0x24) + chr(0x84)
print (payload)

In terminal:

$ python exploit-stack3.py | xxd
00000000: 2484 0a                                  $..
$ python3 exploit-stack3.py | xxd
00000000: 24c2 840a                                $...

Could someone please explain where the c2 is coming from?

Solution:

It’s coming from encoding the character as UTF-8: in Python 3, str is a sequence of Unicode code points, and print encodes it (here as UTF-8) on the way to stdout, so U+0084 becomes the two bytes c2 84.

>>> '\x84'.encode('utf-8')
b'\xc2\x84'
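To get the Python 2 behavior back in Python 3 — emitting the literal bytes 24 84 with no encoding step — one option is to build the payload as bytes and write it to the binary stdout buffer. A minimal sketch:

```python
import sys

# Build the payload as bytes rather than str; writing to the binary
# buffer bypasses the text layer's UTF-8 encoding.
payload = bytes([0x24, 0x84])      # b'$\x84'
sys.stdout.buffer.write(payload)
```

Piping this script through xxd should show `2484` with no `c2` inserted.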

How to sum N columns in python?

I have a pandas DataFrame and I’d like to sum N of its columns. The df might look like this:

A B C D ... X

1 4 2 6     3
2 3 1 2     2 
3 1 1 2     4
4 2 3 5 ... 1

I’d like to get a df like this:

A Z

1 14
2 8
3 8
4 11

A is a regular column, not the index.

Solution:

Use join with a new Series created by summing all columns except A:

df = df[['A']].join(df.drop('A', axis=1).sum(axis=1).rename('Z'))

Or extract column A first with pop:

df = df.pop('A').to_frame().join(df.sum(axis=1).rename('Z'))

If you want to select columns by position, use iloc:

df = df.iloc[:, [0]].join(df.iloc[:, 1:].sum(axis=1).rename('Z'))

print (df)
   A   Z
0  1  15
1  2   8
2  3   8
3  4  11
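As a quick end-to-end check, the first variant can be run against the sample data (the column names past D are illustrative; only X is shown explicitly in the question):

```python
import pandas as pd

# Reconstruct the sample frame from the question.
df = pd.DataFrame({'A': [1, 2, 3, 4],
                   'B': [4, 3, 1, 2],
                   'C': [2, 1, 1, 3],
                   'D': [6, 2, 2, 5],
                   'X': [3, 2, 4, 1]})

# Keep A, sum everything else row-wise into Z.
df = df[['A']].join(df.drop('A', axis=1).sum(axis=1).rename('Z'))
print(df)
```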

Make code less complex and more readable

I need to rewrite my simple code. I’m getting simple strings as below:

  • Distrib ABC 1-2-x
  • Distrib ABC DEF 1-2-x
  • Distrib ABC DEF GHI 1-2-x

I .split() all the words after “Distrib ” and have to fulfill the following conditions:

  1. If string[0] is text && string[1] is an integer, then join only these to get “ABC/1”

  2. If string[0] is text && string[1] is text, join only them to get “ABC/DEF”

  3. If string[0] is text && string[1] is text && string[2] is text, join them all to get “ABC/DEF/GHI”

I wrote some simple code to do this, but I’m really interested in how to make it less complex and more readable 😉

import re

def main_execute():
    input_text = "Distrib ABC 1-2-x"
    #input_text = "Distrib ABC DEF 1-2-x"
    #input_text = "Distrib ABC DEF GHI 1-2-x"

    print(str(input_text))
    load_data = re.search(r'\s[A-Z]*.[A-Z]*.[A-Z]+ [0-9]', input_text).group()
    print("Extracted string: " + load_data)

    words_array = load_data.split()

    if re.match('[0-9]', words_array[1]):
        print("Joined string: "
              + words_array[0]
              + "/"
              + words_array[1])
    elif re.match('[A-Z]', words_array[0]) and re.match('[A-Z]', words_array[1]) and re.match('[0-9]', words_array[2]):
        print("Joined string: "
              + words_array[0]
              + "/"
              + words_array[1])
    elif re.match('[A-Z]', words_array[0]) and re.match('[A-Z]', words_array[1]) and re.match('[A-Z]', words_array[2]) and re.match('[0-9]', words_array[3]):
        print("Joined string: "
              + words_array[0]
              + "/"
              + words_array[1]
              + "/"
              + words_array[2])


if __name__ == "__main__":
    main_execute()

Solution:

This can be vastly simplified to

import re

data = """
Distrib ABC 1-2-x
Distrib ABC DEF 1-2-x
Distrib ABC DEF GHI 1-2-x
"""

rx = re.compile(r'Distrib (\w+) (\w+)\s*((?:(?!\d)\w)+)?')

results = ["/".join([n for n in m.groups() if n]) for m in rx.finditer(data)]
print(results)

Which yields

['ABC/1', 'ABC/DEF', 'ABC/DEF/GHI']

See a demo for the expression on regex101.com.


Another approach, as proposed by @Wiktor, could be

Distrib (\w+) (\w+)\s*([^\W\d]+)?

The part [^\W\d]+ says: match characters that are not non-word characters and not digits (the double negation is no mistake — it leaves exactly the letters and the underscore), as many as possible.
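Swapping that pattern into the same driver code gives identical results:

```python
import re

data = """
Distrib ABC 1-2-x
Distrib ABC DEF 1-2-x
Distrib ABC DEF GHI 1-2-x
"""

# @Wiktor's alternative pattern: the optional third group matches
# word characters that are not digits.
rx = re.compile(r'Distrib (\w+) (\w+)\s*([^\W\d]+)?')

results = ["/".join(g for g in m.groups() if g) for m in rx.finditer(data)]
print(results)
```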

multiplying lists of lists with different lengths

I have:

a = [[1,2],[3,4],[7,10]]
b = [[8,6],[1,9],[2,1],[8,8]]

I want to multiply (pairwise) every element of a with every element of b:

1*8+2*6+1*1+2*9+.....+1*8+2*8+3*8+4*6+......+7*8+10*8

Here’s my code so far:

def f(a, b):
    new = [x for x in a or x in b]
    newer = []
    for tuple1, tuple2 in new:
        newer.append(map(lambda s,t: s*t, new, new))
    return sum(newer)

So my plan of attack was to get all the lists into one list and then multiply everything together. I have seen a lambda used for multiplying lists pairwise, but I can’t get it to work for the single list.

Solution:

That kind of combination is called the Cartesian product. I’d use itertools.product for this, which can easily cope with more than 2 lists, if you want.

Firstly, here’s a short demo that shows how to get all of the pairs and how to use tuple assignment to grab the individual elements of the pairs of sublists.

from itertools import product

a = [[1,2],[3,4],[7,10]]
b = [[8,6],[1,9],[2,1],[8,8]]

for (u0, u1), (v0, v1) in product(a, b):
    print(u0, u1, v0, v1)

output

1 2 8 6
1 2 1 9
1 2 2 1
1 2 8 8
3 4 8 6
3 4 1 9
3 4 2 1
3 4 8 8
7 10 8 6
7 10 1 9
7 10 2 1
7 10 8 8

And here’s how to find the sum of products that you want.

total = sum(u0 * v0 + u1 * v1 for (u0, u1), (v0, v1) in product(a, b))
print(total)

output

593

Here’s an alternative approach, using the distributive property, as mentioned by Prune.

Firstly, here’s an unreadable list comprehension version. 😉

a = [[1,2],[3,4],[7,10]]
b = [[8,6],[1,9],[2,1],[8,8]]

print(sum([u*v for u,v in zip(*[[sum(t) for t in zip(*u)] for u in (a, b)])]))

output

593

How it works

By the distributive law, the sum you want from the given input data can be written as

(1 + 3 + 7) * (8 + 1 + 2 + 8) + (6 + 9 + 1 + 8) * (2 + 4 + 10)

We can re-arrange the data to produce that expression as follows.

# Use zip to transpose each of the a & b lists.
for u in (a, b):
    for t in zip(*u):
        print(t)

output

(1, 3, 7)
(2, 4, 10)
(8, 1, 2, 8)
(6, 9, 1, 8)

Now we modify that slightly to get the sums of those lists

# Use zip to transpose each of the a & b lists and compute the partial sums.
partial_sums = []
for u in (a, b):
    c = []
    for t in zip(*u):
        c.append(sum(t))
    partial_sums.append(c)
print(partial_sums)        

output

[[11, 16], [19, 24]]

Now we just need to multiply the corresponding items of those lists, and add those products together to get the final sum. Once again, we use zip to perform the transposition.

total = 0
for u, v in zip(*partial_sums):
    print(u, v)
    total += u * v
print(total)        

output

11 19
16 24
593
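The same distributive shortcut can be written very compactly with NumPy, assuming NumPy is available: sum each list column-wise, then take the dot product of the two partial-sum vectors.

```python
import numpy as np

a = [[1, 2], [3, 4], [7, 10]]
b = [[8, 6], [1, 9], [2, 1], [8, 8]]

# Column sums: [11, 16] and [19, 24]; their dot product is the total.
total = int(np.array(a).sum(axis=0) @ np.array(b).sum(axis=0))
print(total)  # 593
```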

Pandas DatetimeIndex with numpy.maximum gives error

I’m experiencing an error that is possibly a bug in pandas (v. 0.22 on Windows, Python version 3.6.3), or rather in its interaction with NumPy (v. 1.14), but I wonder if I’m missing something more profound.

Here’s the issue: if I have two Datetimeindex objects of the same length and I use np.maximum between them, the output is as expected:

import pandas as pd
import numpy as np
v1 = pd.DatetimeIndex(['2016-01-01', '2018-01-02', '2018-01-03'])
v2 = pd.DatetimeIndex(['2017-01-01', '2017-01-02', '2019-01-03'])
np.maximum(v1, v2)

returns the elementwise maximum:

DatetimeIndex(['2017-01-01', '2018-01-02', '2019-01-03'], dtype='datetime64[ns]', freq=None)

However, if I try to only use one element of the two, I get an error:

np.maximum(v1, v2[0])

pandas/_libs/tslib.pyx in pandas._libs.tslib._Timestamp.__richcmp__()

TypeError: Cannot compare type 'Timestamp' with type 'int'

Two workarounds that work, but both are rather nasty to write, are either to use slicing or to explicitly convert to pydatetime:

np.maximum(v1, v2[:1])

DatetimeIndex(['2017-01-01', '2018-01-02', '2018-01-03'], dtype='datetime64[ns]', freq=None)

or:

np.maximum(v1.to_pydatetime(), v2[0].to_pydatetime())

array([datetime.datetime(2017, 1, 1, 0, 0),
datetime.datetime(2018, 1, 2, 0, 0),
datetime.datetime(2018, 1, 3, 0, 0)], dtype=object)

The first workaround is actually quite weird, because doing v2 - v1[0] works correctly, while v2 - v1[:1] gives an error (rather as expected this time, since the two resulting time series have unaligned indices).

Solution:

One solution is to convert to a pd.Series and then use pd.Series.clip — clipping with only a lower bound is an elementwise maximum:

pd.Series(v1).clip(v2[0])

# 0   2017-01-01
# 1   2018-01-02
# 2   2018-01-03
# dtype: datetime64[ns]
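Another workaround, assuming you are happy to drop to the underlying datetime64 arrays (which sidesteps the Timestamp boxing entirely), is:

```python
import numpy as np
import pandas as pd

v1 = pd.DatetimeIndex(['2016-01-01', '2018-01-02', '2018-01-03'])
v2 = pd.DatetimeIndex(['2017-01-01', '2017-01-02', '2019-01-03'])

# np.maximum on raw datetime64[ns] arrays broadcasts a scalar cleanly,
# without boxing each element as a Timestamp.
result = pd.DatetimeIndex(np.maximum(v1.values, v2.values[0]))
print(result)
```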

python pandas groupby to identify rows

I used to clean data using SAS but I would like to switch to Python.

I have a large dataset which was scraped from some filings (HTML) but includes some noisy information that I would like to get rid of.

Basically, I need to flag the rows that come after a row where a condition is True (a group may contain multiple Trues or none at all; if there are several, I want the rows after the last one).

Raw data:

<table>
  <tr>
    <td>Report_ID</td>
    <td>Table_ID</td>
    <td>Group_ID</td>
    <td>Item_ID</td>
    <td>Flag_old</td>
  </tr>
  <tr>
    <td>A</td>
    <td>1</td>
    <td>1</td>
    <td>item1</td>
    <td>0</td>
  </tr>
  <tr>
    <td>A</td>
    <td>1</td>
    <td>1</td>
    <td>item2</td>
    <td>0</td>
  </tr>
  <tr>
    <td>A</td>
    <td>1</td>
    <td>1</td>
    <td>item3</td>
    <td>1</td>
  </tr>
  <tr>
    <td>A</td>
    <td>1</td>
    <td>1</td>
    <td>item4</td>
    <td>0</td>
  </tr>
  <tr>
    <td>A</td>
    <td>1</td>
    <td>1</td>
    <td>item5</td>
    <td>0</td>
  </tr>
  <tr>
    <td>A</td>
    <td>1</td>
    <td>2</td>
    <td>item1</td>
    <td>1</td>
  </tr>
    <tr>
    <td>A</td>
    <td>1</td>
    <td>2</td>
    <td>item2</td>
    <td>0</td>
  </tr>
    <tr>
    <td>A</td>
    <td>1</td>
    <td>2</td>
    <td>item3</td>
    <td>1</td>
  </tr>
    <tr>
    <td>A</td>
    <td>1</td>
    <td>2</td>
    <td>item4</td>
    <td>0</td>
  </tr>
        <tr>
    <td>A</td>
    <td>1</td>
    <td>3</td>
    <td>item1</td>
    <td>0</td>
  </tr>
    <tr>
    <td>A</td>
    <td>1</td>
    <td>3</td>
    <td>item2</td>
    <td>0</td>
  </tr>
    <tr>
    <td>A</td>
    <td>1</td>
    <td>3</td>
    <td>item3</td>
    <td>0</td>
  </tr>
    <tr>
    <td>A</td>
    <td>1</td>
    <td>3</td>
    <td>item4</td>
    <td>0</td>
  </tr>
</table>

Expected data:

<table>
  <tr>
    <td>Report_ID</td>
    <td>Table_ID</td>
    <td>Group_ID</td>
    <td>Item_ID</td>
    <td>Flag_old</td>
    <td>Flag_new</td>
  </tr>
  <tr>
    <td>A</td>
    <td>1</td>
    <td>1</td>
    <td>item1</td>
    <td>0</td>
    <td>0</td>    
  </tr>
    <tr>
    <td>A</td>
    <td>1</td>
    <td>1</td>
    <td>item2</td>
    <td>0</td>
    <td>0</td>
  </tr>
    <tr>
    <td>A</td>
    <td>1</td>
    <td>1</td>
    <td>item3</td>
    <td>1</td>
    <td>0</td>
  </tr>
    <tr>
    <td>A</td>
    <td>1</td>
    <td>1</td>
    <td>item4</td>
    <td>0</td>
    <td>1</td>
    </tr>
        <tr>
    <td>A</td>
    <td>1</td>
    <td>1</td>
    <td>item5</td>
    <td>0</td>
    <td>1</td>
    </tr>
  <tr>
    <td>A</td>
    <td>1</td>
    <td>2</td>
    <td>item1</td>
    <td>1</td>
    <td>0</td>
  </tr>
    <tr>
    <td>A</td>
    <td>1</td>
    <td>2</td>
    <td>item2</td>
    <td>0</td>
    <td>0</td>
  </tr>
    <tr>
    <td>A</td>
    <td>1</td>
    <td>2</td>
    <td>item3</td>
    <td>1</td>
    <td>0</td>
  </tr>
    <tr>
    <td>A</td>
    <td>1</td>
    <td>2</td>
    <td>item4</td>
    <td>0</td>
    <td>1</td>
  </tr>
        <tr>
    <td>A</td>
    <td>1</td>
    <td>3</td>
    <td>item1</td>
    <td>0</td>
    <td>0</td>
  </tr>
    <tr>
    <td>A</td>
    <td>1</td>
    <td>3</td>
    <td>item2</td>
    <td>0</td>
    <td>0</td>
  </tr>
    <tr>
    <td>A</td>
    <td>1</td>
    <td>3</td>
    <td>item3</td>
    <td>0</td>
    <td>0</td>
  </tr>
    <tr>
    <td>A</td>
    <td>1</td>
    <td>3</td>
    <td>item4</td>
    <td>0</td>
    <td>0</td>
  </tr>
</table>

As you can see from the above, I want to identify the rows below the last row with Flag_old == 1 in each group.

Given the structure of the data, I first used groupby to segment the dataframe. I was thinking of defining a function to select rows, applying it to the groupby object, and then creating a new column for the whole dataframe marking these rows of noisy data.

def lastline(series):
    return max(series[series.values == 1].index)

df['lastline'] = df.groupby('id').apply(lastline(df['flag']))

but I got 'int' object is not callable error.

Could you please advise me how to do this properly? I have been struggling with this for a few days now… Many thanks.

Solution:

I think you need a custom function with transform to return the new column:

def f(x):
    #cumulative sum of the flags, shifted one row down
    a = x.cumsum().shift()
    #rows after the last 1 are where a equals its max;
    #the (a != 0) check removes groups that contain no 1 at all
    #astype converts the booleans to 0/1
    return ((a == a.max()) & (a != 0)).astype(int)

df['Flag_new'] = df.groupby('Group_ID')['Flag_old'].transform(f)
print (df)
   Report_ID  Table_ID  Group_ID Item_ID  Flag_old  Flag_new
0          A         1         1   item1         0         0
1          A         1         1   item2         0         0
2          A         1         1   item3         1         0
3          A         1         1   item4         0         1
4          A         1         1   item5         0         1
5          A         1         2   item1         1         0
6          A         1         2   item2         0         0
7          A         1         2   item3         1         0
8          A         1         2   item4         0         1
9          A         1         3   item1         0         0
10         A         1         3   item2         0         0
11         A         1         3   item3         0         0
12         A         1         3   item4         0         0
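To see the transform in action, here is a self-contained rerun on just the sample flags (the other ID columns are omitted for brevity):

```python
import pandas as pd

# Rebuild the sample frame: three groups with flags as in the question.
df = pd.DataFrame({
    'Group_ID': [1, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3],
    'Flag_old': [0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0],
})

def f(x):
    a = x.cumsum().shift()
    return ((a == a.max()) & (a != 0)).astype(int)

# Flag_new is 1 only for rows after the last 1 in each group.
df['Flag_new'] = df.groupby('Group_ID')['Flag_old'].transform(f)
print(df)
```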

Use multiple output streams in Python?

What I want to do is create multiple output streams in a Python function and refer to them as 1, 2, 3, …:
In test.py:

def main():
  ...
  print >>fd1, 'words1'
  print >>fd2, 'words2'
  print >>fd3, 'words3'
  ...

Then redirect the streams when running it:

python test.py 1>1.txt 2>2.txt 3>3.txt

The content of these files:

1.txt ->  words1
2.txt ->  words2
3.txt ->  words3

The question is, how to create those fd1, fd2, fd3?

Solution:

On Linux, the file handles that you want exist in /proc/self/fd/. For example:

with open('/proc/self/fd/1', 'w') as fd1, open('/proc/self/fd/2', 'w') as fd2, open('/proc/self/fd/3', 'w') as fd3:
   print >>fd1, 'words1'
   print >>fd2, 'words2'
   print >>fd3, 'words3'

On some other unices, you may find similar file handles under /dev/fd.

Now, you can run your command and verify that the output files are as desired:

$ python test.py 1>1.txt 2>2.txt 3>3.txt
$ cat 1.txt
words1
$ cat 2.txt
words2
$ cat 3.txt
words3

Limitations on number of open file descriptors

The OS places limits on the maximum number of open file descriptors that a process may have. For a discussion of this see “Limits on the number of file descriptors”.

When using bash’s numbered file descriptors, the restrictions are much tighter. Under bash, only file descriptors up to 9 are reserved for the user. The use of higher numbers may cause conflict with bash’s internal use. From man bash:

Redirections using file descriptors greater than 9 should be used with
care, as they may conflict with file descriptors the shell uses
internally.

If, as per the comments, you want to write to hundreds of files, then don’t use shell redirection or the numbered descriptors in /proc/self/fd. Instead, use Python’s open, e.g. open('255.txt', 'w'), directly on each output file that you want.
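In Python 3 the same idea is usually expressed with os.fdopen, which wraps any raw descriptor number in a file object. A minimal sketch, using a pipe as a stand-in for a descriptor the shell would open for you via a redirection such as 3>file:

```python
import os

# In real use, w would be a descriptor number the shell opened for you
# (e.g. 3 after running the script with 3>file).
r, w = os.pipe()

stream = os.fdopen(w, 'w')   # wrap the raw descriptor as a file object
print('words3', file=stream)
stream.close()

captured = os.read(r, 100).decode()
os.close(r)
print(captured)
```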

Splitting strings based on multiple delimiters does not yield consistent result

I have a file type with many rows containing information as follows:

  P087 = ( 4.000000000000000E+001,-6.250000000000000E-001 )
  P088 = ( 4.000000000000000E+001, 0.000000000000000E+000 )

I’m reading this file line by line using

fo = open(FileName, 'r')
for line in fo:
    #do stuff to line

I’d like to see how to split each line to give lists as follows:

[87, 40.0,-0.625]
[88, 40.0, 0.0]

I tried splitting using python‘s regular .split() method but it doesn’t split the lines consistently, yielding varying list lengths for each line.

I also investigated re.split() using stuff like re.split([ = ( ]|,) but that didn’t work either. I’m also not a big regular expression user (though I know they are very powerful) which explains why I’m having a hard time finding the right one.

I guess I need to delimit the lines by ' = ( ' and ',' though I’m really not sure how to do it such that the resulting lists are consistent. Any help would be much appreciated.

Thanks

Solution:

This should do it:

import re

for line in fo:
    parts = re.match(r'\s*P(\d+)\s*=\s*[(]\s*([^ ,]*)[ ,]+([^ ,]*)[ )]*', line).groups()
    print([int(parts[0]), float(parts[1]), float(parts[2])])

The re.match is used to extract the important parts, then each is parsed to the appropriate type to be printed.
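If you’d rather avoid regular expressions altogether, str.partition and strip can do the same job on this fixed layout. A sketch, assuming every line follows the `Pxxx = ( a,b )` shape shown above:

```python
line = "  P087 = ( 4.000000000000000E+001,-6.250000000000000E-001 )"

# Split at '=', then strip the parentheses and whitespace from the
# right-hand side and split the two numbers at the comma.
name, _, rest = line.partition('=')
x, y = rest.strip(' ()\n').split(',')

result = [int(name.strip().lstrip('P')), float(x), float(y)]
print(result)  # [87, 40.0, -0.625]
```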

Accessing a class instance in a library from two separate scripts in a project

I searched all over and could not come up with a reasonable search query to produce helpful results. I’ll try to explain this with a simple example (that is tested).

Suppose I have some small custom Python library that contains just the following private class and public instance of it:

#!/usr/bin/env python

class _MyClass(object):
    def __init__(self):
        self.val = "Default"

my_instance = _MyClass()

Now, I also have two other Python files (‘file_a’ and ‘file_b’) that will end up importing this instance from my library, as seen below.

The full code in ‘file_a’:

#!/usr/bin/env python

from my_lib import my_instance

my_instance.val = "File A was here!"
import file_b
file_b.check_val()

The full code in ‘file_b’:

#!/usr/bin/env python

from my_lib import my_instance

def check_val():
    print "From 'file_b', my_instance.val is: {}".format(my_instance.val)

The resulting output, if I only execute ‘file_a’ within a directory that also contains ‘file_b’ and ‘my_lib’, is this:

From 'file_b', my_instance.val is: File A was here!

Can someone explain to me how ‘file_b’ is able to access the same exact instance as ‘file_a’ in my example? Does this have to do with how the value being set in ‘file_a’ is global?

By the way, I do know I can just make ‘MyClass’ public again and instantiate it whenever a unique instance is needed in either ‘file_a’ or ‘file_b’, but the main reason I am posting this question is to wrap my head around this specific concept.

Solution:

There are two things you need to understand here:

1. Module caching

Python caches module imports to improve performance; this happens even when you do from foo import bar. The module object is stored in sys.modules.

Hence, in your case both file_a and file_b access the same module object my_lib, and therefore the same instance my_instance.

2. References

In Python, assignment binds a name to an object — it adds a new reference to the same object rather than copying it. This is true for imports as well.

from my_lib import my_instance

is basically

import my_lib
my_instance = my_lib.my_instance
del my_lib

When we modify this instance in file_a, we are modifying the one instance owned by my_lib, so file_b sees the change too.
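The caching can also be demonstrated in a single script by planting a module in sys.modules by hand (the module and attribute names below are hypothetical, mirroring the question’s setup):

```python
import sys
import types

# Build my_lib as an in-memory module and register it in the cache.
my_lib = types.ModuleType('my_lib')


class _MyClass(object):
    def __init__(self):
        self.val = 'Default'


my_lib.my_instance = _MyClass()
sys.modules['my_lib'] = my_lib

from my_lib import my_instance             # "file_a" imports it...
my_instance.val = 'File A was here!'

from my_lib import my_instance as again    # ..."file_b"'s import hits the cache
print(again is my_instance, again.val)
```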


You can modify file_a and file_b to verify this.

file_a:

#!/usr/bin/env python

from my_lib import my_instance

my_instance.val = "File A was here!"

print "Inside file_a"
import sys
print id(sys.modules['my_lib']), sys.modules['my_lib'].my_instance, my_instance

import file_b
file_b.check_val()

file_b:

#!/usr/bin/env python

from my_lib import my_instance

print "Inside file_b"
import sys
print id(sys.modules['my_lib']), sys.modules['my_lib'].my_instance, my_instance


def check_val():
    print "From 'file_b', my_instance.val is: {}".format(my_instance.val)

Output (check that the object IDs match):

>>> %run file_a.py

Inside file_a
4396461816 <my_lib._MyClass object at 0x106158ad0> <my_lib._MyClass object at 0x106158ad0>
Inside file_b
4396461816 <my_lib._MyClass object at 0x106158ad0> <my_lib._MyClass object at 0x106158ad0>
From 'file_b', my_instance.val is: File A was here!

Changing multiple column names

Let’s say I have a data frame with such column names:

['a','b','c','d','e','f','g'] 

And I would like to change the names from ‘c’ through ‘f’ (actually, add strings around each column name), so that the whole data frame’s column names look like this:

['a','b','var_c_equal','var_d_equal','var_e_equal','var_f_equal','g']

Well, first I wrote a call that renames all columns with the strings I want:

df.rename(columns=lambda x: 'or_'+x+'_no', inplace=True)

But now I really want to understand how to implement something like this:

df.loc[:,'c':'f'].rename(columns=lambda x: 'var_'+x+'_equal', inplace=True)

Solution:

One way is to use a dictionary instead of an anonymous function. The first two variations below assume the columns you need to rename are contiguous.

Contiguous columns by position

d = {k: 'var_'+k+'_equal' for k in df.columns[2:6]}
df = df.rename(columns=d)

Contiguous columns by name

If you need to calculate the numerical indices:

cols = df.columns.get_loc
d = {k: 'var_'+k+'_equal' for k in df.columns[cols('c'):cols('f')+1]}
df = df.rename(columns=d)

Specifically identified columns

If you want to provide the columns explicitly:

d = {k: 'var_'+k+'_equal' for k in 'cdef'}
df = df.rename(columns=d)
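A quick end-to-end check of the positional variant on the column names from the question:

```python
import pandas as pd

# An empty frame is enough to demonstrate the renaming.
df = pd.DataFrame(columns=list('abcdefg'))

# Rename the 3rd through 6th columns ('c' to 'f').
d = {k: 'var_' + k + '_equal' for k in df.columns[2:6]}
df = df.rename(columns=d)
print(list(df.columns))
```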