Select only regex match from a continuous string

I want to use this regex

r"Summe\d+\W\d+"

to match this string

150,90‡50,90‡8,13‡Summe50,90•50,90•8,13•Kreditkartenzahlung

but I want to only filter out this specific part

Summe50,90

I can select the entire string with this regex but I’m not sure how to filter out only the matching part

Here is the function it is in, where I am trying to get the amount from a PDF:

    def get_amount(url):
      data = requests.get(url)
      with open('/Users/derricdonehoo/code/derric-d/price-processor/exmpl.pdf', 'wb') as f:
        f.write(data.content)

      pdfFileObj = open('exmpl.pdf', 'rb')
      pdfReader = PyPDF2.PdfFileReader(pdfFileObj)

      pageObj = pdfReader.getPage(0)
      text = pageObj.extractText().split()

      regex = re.compile(r"Summe\d+\W\d+")

      matches = list(filter(regex.search, text))
      for i in range(len(matches)):
        matchString = '\n'.join(matches)


      print(matchString)

As described above, I would like guidance on how best to extract just the matching portion of this string. Ideally it would work with a varying number of characters on either side, but that’s not a priority.

thanks!!

Solution:

Your regex is correct, but after searching you still have to pull the matched text out of the result:

  import re

  regex = re.compile(r"Summe\d+\W\d+")
  text = ["150,90‡50,90‡8,13‡Summe50,90•50,90•8,13•Kreditkartenzahlung"]

  matches = []
  for t in text:
    m = regex.search(t)  # Match object, or None if nothing matched
    if m:
      matches.append(m.group(0))  # group(0) is the matched text only

  print(matches)

re.search returns a Match object on success, None on failure, and that object contains all the information about your matching regex. To get the whole match you call Match.group().
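
If only the number after "Summe" is needed, a capturing group is one way to get it. The following is a minimal sketch; the names AMOUNT and extract_amounts are made up for illustration and are not part of the original code.

```python
import re

# Same pattern as above, with parentheses around the numeric part
# so it can be retrieved separately as group(1).
AMOUNT = re.compile(r"Summe(\d+\W\d+)")


def extract_amounts(tokens):
    """Return the captured amount from every token that contains a match."""
    amounts = []
    for token in tokens:
        m = AMOUNT.search(token)
        if m:
            amounts.append(m.group(1))  # only the captured digits and separator
    return amounts


text = ["150,90‡50,90‡8,13‡Summe50,90•50,90•8,13•Kreditkartenzahlung"]
amounts = extract_amounts(text)
print(amounts)
```

Because the pattern uses \d+ on both sides of the separator, amounts with more or fewer digits should match as well.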

Excel can't read xlsxwriter created formulas SUMIF

I’m trying to create an Excel file containing some formulas using the code below:

import xlsxwriter

workbook = xlsxwriter.Workbook('test.xlsx')
worksheet = workbook.add_worksheet()

worksheet.write('A1', 1)
worksheet.write('A2', 2)
worksheet.write('A3', 3)
worksheet.write('B1', '+')
worksheet.write('B2', '-')
worksheet.write('B3', '+')
formula = '=((SUMIF(B{f}:B{la},"+",{cl}{f}:{cl}{la})-'\
         + 'SUMIF(B{f}:B{la},"-",{cl}{f}:{cl}{la}))'\
         + '/MAX(SUMIF(B{f}:B{la},"/",{cl}{f}:{cl}{la}),1)'
print(formula.format(f=1, la=3, cl='A'))
# =((SUMIF(B1:B3,"+",A1:A3)-SUMIF(B1:B3,"-",A1:A3))/MAX(SUMIF(B1:B3,"/",A1:A3),1)

worksheet.write_formula('B5', formula.format(f=1, la=3, cl='A'))
workbook.close()

When opening this file in Microsoft Excel, we get the following:

[Image: Excel error]

The cell containing 0 is the one with the formula.

When I use LibreOffice to open the same file, I get the correct value.

Solution:

Your formula opens with two parentheses, but only one of them is ever closed, so the parentheses are unbalanced. There should be only one at the start of the formula:

formula = '=(SUMIF(B{f}:B{la},"+",{cl}{f}:{cl}{la})-'\
         + 'SUMIF(B{f}:B{la},"-",{cl}{f}:{cl}{la}))'\
         + '/MAX(SUMIF(B{f}:B{la},"/",{cl}{f}:{cl}{la}),1)'
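
A quick sanity check before writing a formula can catch this kind of mistake. The sketch below is only an illustration: parens_balanced is a hypothetical helper, not part of xlsxwriter, and it checks nothing beyond parenthesis nesting outside double-quoted strings.

```python
def parens_balanced(formula):
    """Return True if parentheses outside double-quoted strings are balanced."""
    depth = 0
    in_string = False
    for ch in formula:
        if ch == '"':
            in_string = not in_string  # toggle on every double quote
        elif not in_string:
            if ch == '(':
                depth += 1
            elif ch == ')':
                depth -= 1
                if depth < 0:  # a ')' with no matching '('
                    return False
    return depth == 0 and not in_string


broken = ('=((SUMIF(B{f}:B{la},"+",{cl}{f}:{cl}{la})-'
          'SUMIF(B{f}:B{la},"-",{cl}{f}:{cl}{la}))'
          '/MAX(SUMIF(B{f}:B{la},"/",{cl}{f}:{cl}{la}),1)')
fixed = broken.replace('=((', '=(', 1)

broken_ok = parens_balanced(broken.format(f=1, la=3, cl='A'))
fixed_ok = parens_balanced(fixed.format(f=1, la=3, cl='A'))
print(broken_ok, fixed_ok)
```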

Remove string element in a list of strings if the first characters match with another string element in the list

I want to lookup and compare efficiently the string elements in a list and then remove those which are parts of other string elements in the list (with the same beginning point)

list1 = [ 'a boy ran' , 'green apples are worse' , 'a boy ran towards the mill' ,  ' this is another sentence ' , 'a boy ran towards the mill and fell',.....]

I intend to get a list which looks like this:

list2 = [  'green apples are worse' , ' this is another sentence ' , 'a boy ran towards the mill and fell',.....]

In other words, I want to keep the longest string element from those elements which start with the same first characters.

Solution:

As suggested by John Coleman in the comments, you can first sort the sentences and then compare consecutive sentences. If a sentence is a prefix of any other sentence, then the sentence right after it in the sorted list also starts with it, so comparing consecutive pairs is enough. To preserve the original order, you can keep the surviving sentences in a set for fast lookup and then filter the original list against it.

list1 = ['a boy ran', 'green apples are worse', 
         'a boy ran towards the mill', ' this is another sentence ',
         'a boy ran towards the mill and fell']                                                                

srtd = sorted(list1)
filtered = set(list1)
for a, b in zip(srtd, srtd[1:]):
    if b.startswith(a):
        filtered.remove(a)

list2 = [x for x in list1 if x in filtered]                                     

Afterwards, list2 is the following:

['green apples are worse',
 ' this is another sentence ',
 'a boy ran towards the mill and fell']

With O(nlogn) this is considerably faster than comparing all pairs of sentences in O(n²), but if the list is not too long, the much simpler solution by Vicrobot will work just as well.
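
The same idea can be wrapped in a function. This is a sketch rather than a drop-in replacement: drop_prefixes is a hypothetical name, and it removes duplicates before sorting and uses discard instead of remove, on the assumption that the input may contain repeated sentences.

```python
def drop_prefixes(sentences):
    """Remove every sentence that is a prefix of another, keeping input order."""
    ordered = sorted(set(sentences))  # unique sentences in sorted order
    keep = set(ordered)
    for a, b in zip(ordered, ordered[1:]):
        if b.startswith(a):
            keep.discard(a)  # a is a prefix of its sorted successor
    return [s for s in sentences if s in keep]


list1 = ['a boy ran', 'green apples are worse',
         'a boy ran towards the mill', ' this is another sentence ',
         'a boy ran towards the mill and fell']
list2 = drop_prefixes(list1)
print(list2)
```

Note that a surviving sentence that occurs several times in the input is kept every time it occurs.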

Pandas- pivoting column into (conditional) aggregated string

Let’s say I have the following data set, turned into a dataframe:

data = [
    ['Job 1', datetime.date(2019, 6, 9), 'Jim', 'Tom'],
    ['Job 1', datetime.date(2019, 6, 9), 'Bill', 'Tom'],
    ['Job 1', datetime.date(2019, 6, 9), 'Tom', 'Tom'],
    ['Job 1', datetime.date(2019, 6, 10), 'Bill', None],
    ['Job 2', datetime.date(2019,6,10), 'Tom', 'Tom']
]
df = pd.DataFrame(data, columns=['Job', 'Date', 'Employee', 'Manager'])

This yields a dataframe that looks like:

     Job        Date Employee Manager
0  Job 1  2019-06-09      Jim     Tom
1  Job 1  2019-06-09     Bill     Tom
2  Job 1  2019-06-09      Tom     Tom
3  Job 1  2019-06-10     Bill    None
4  Job 2  2019-06-10      Tom     Tom

What I am trying to generate is a pivot on each unique Job/Date combo, with a column for Manager, and a column for a string with comma separated, non-manager employees. A couple of things to assume:

  1. All employee names are unique (I’ll actually be using unique employee ids rather than names), and Managers are also “employees”, so there will never be a case with an employee and a manager sharing the same name/id, but being different individuals.
  2. A work crew can have a manager, or not (see row with id 3, for an example without)
  3. A manager will always also be listed as an employee (see row with id 2 or 4)
  4. A job could have a manager, with no additional employees (see row id 4)

I’d like the resulting dataframe to look like:

     Job        Date  Manager     Employees
0  Job 1  2019-06-09      Tom     Jim, Bill
1  Job 1  2019-06-10     None          Bill
2  Job 2  2019-06-10      Tom          None

Which leads to my questions:

  1. Is there a way to do a ‘,’.join like aggregation in a pandas pivot?
  2. Is there a way to make this aggregation conditional (exclude the name/id in the manager column)

I suspect 1) is possible, and 2) might be more difficult. If 2) is a no, I can get around it in other ways later in my code.

Solution:

Group to aggregate, then fix the Employees by removing the Manager and setting to None where appropriate. Since the employees are unique, sets will work nicely here to remove the Manager.

s = df.groupby(['Job', 'Date']).agg({'Manager': 'first', 'Employee': lambda x: set(x)})
s['Employee'] = [', '.join(x.difference({y})) for x,y in zip(s.Employee, s.Manager)]
s['Employee'] = s.Employee.replace({'': None})

                 Manager   Employee
Job   Date                         
Job 1 2019-06-09     Tom  Jim, Bill
      2019-06-10    None       Bill
Job 2 2019-06-10     Tom       None
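
Because sets are unordered, the order of the names in the joined string above is not guaranteed. If the original row order matters, one alternative is to loop over the groups and build the rows by hand. The sketch below assumes the manager is the same on every row of a Job/Date group; the names rows and result are made up for illustration.

```python
import datetime

import pandas as pd

data = [
    ['Job 1', datetime.date(2019, 6, 9), 'Jim', 'Tom'],
    ['Job 1', datetime.date(2019, 6, 9), 'Bill', 'Tom'],
    ['Job 1', datetime.date(2019, 6, 9), 'Tom', 'Tom'],
    ['Job 1', datetime.date(2019, 6, 10), 'Bill', None],
    ['Job 2', datetime.date(2019, 6, 10), 'Tom', 'Tom'],
]
df = pd.DataFrame(data, columns=['Job', 'Date', 'Employee', 'Manager'])

rows = []
for (job, date), group in df.groupby(['Job', 'Date']):
    # Assumption: the manager is constant within a group, so the first row is enough.
    manager = group['Manager'].iloc[0]
    # Keep non-manager employees in their original row order.
    others = [name for name in group['Employee'] if name != manager]
    rows.append({'Job': job, 'Date': date, 'Manager': manager,
                 'Employees': ', '.join(others) or None})

result = pd.DataFrame(rows)
print(result)
```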

Write multiple Excel files for each value of certain column Python Pandas

Consider the following dataframe:

data = {'Col_A': [3, 2, 1, 0], 'Col_B': ['a', 'b', 'a', 'b']}
df = pd.DataFrame.from_dict(data)

I can create a dataframe a and write it to Excel as follows:

a = df[df.Col_B == 'a']
a
a.to_excel(excel_writer = r'F:\Desktop\output.xlsx', index = False)

I am looking for a way to write an Excel file, for each value of column B. In reality, there are hundreds of values for Col_B. Is there a way to loop through this?

Any help is greatly appreciated!

Regards,

M.

Solution:

If you need a separate Excel file for each group, loop over the groupby object:

for i, a in df.groupby('Col_B'):
    # raw f-string so the backslashes in the Windows path are taken literally
    a.to_excel(rf'F:\Desktop\output_{i}.xlsx', index = False)

If you need a separate sheet for each group in one Excel file, use ExcelWriter:

with pd.ExcelWriter('output.xlsx') as writer:
    for i, a in df.groupby('Col_B'):
        a.to_excel(writer, sheet_name=i, index = False)

Extra newline in output when adding strings

I have a text file something like,

3 forwhomthebelltolls
-6 verycomplexnumber

The question is, if the integer K is positive, take the first K characters and put them at the end of the string. If it’s negative, take the last K characters and put them at the front of the string. Like this:

whomthebelltollsfor
numberverycomplex

My code is

file = open("tex.txt","r")

for line in file:
    K, sentence = int(line.split(" ")[0]), line.split(" ")[1]
    new_sentence = sentence[K:] + sentence[:K]
    print(new_sentence)

however it prints the values like:

whomthebelltolls
for
numberverycomplex

I do not understand why this happens. I am just concatenating two strings, but when printed, the second part ends up on the next line.

Solution:

The actual contents of the file:

3 forwhomthebelltolls\n
-6 verycomplexnumber

When you iterate through the file with for line in file, each line includes the newline character at the end, and you never remove it. So sentence[K:] evaluates to whomthebelltolls\n and sentence[:K] evaluates to for, making the whole string whomthebelltolls\nfor. The newline in the middle of the string is printed as a line break.

To fix this, strip the string first:

for line in file:
    K, sentence = int(line.split(" ")[0]), line.split(" ")[1].strip()
    ...
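
Put together, the fix might look like the sketch below. It reads from an in-memory io.StringIO instead of tex.txt so that it runs on its own, and the function name rotate is made up for illustration.

```python
import io


def rotate(line):
    """Rotate the sentence on one 'K sentence' line by K characters."""
    k_text, sentence = line.split(maxsplit=1)
    sentence = sentence.strip()  # drop the trailing newline before slicing
    k = int(k_text)
    # For positive k the first k characters move to the end;
    # for negative k the last |k| characters move to the front.
    return sentence[k:] + sentence[:k]


# Stand-in for the text file from the question.
sample = io.StringIO("3 forwhomthebelltolls\n-6 verycomplexnumber\n")
results = [rotate(line) for line in sample]
print(results)
```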

How to permute values of pandas column?

I have the following DataFrame df:

Datetime           Supply   Price
2019-02-01 12:00   10       2.0
2019-02-01 12:00   10       1.0
2019-02-01 12:00   0        5.0
2019-02-01 12:00   10       1.0
2019-02-01 12:00   0        2.0
2019-02-01 12:00   10       4.0
2019-02-01 12:00   0        5.0

The sum of Supply is 40. I need to permute the Supply values so that the 10s are assigned to the rows with the highest Price, while the 0s fall on the rows with the lowest Price.

This is the expected result:

Datetime           Supply   Price
2019-02-01 12:00   10       2.0
2019-02-01 12:00   0        1.0
2019-02-01 12:00   10       5.0
2019-02-01 12:00   0        1.0
2019-02-01 12:00   0        2.0
2019-02-01 12:00   10       4.0
2019-02-01 12:00   10       5.0

Any clues how to do it?

Solution:

argsort

  • Multiply by negative one as a convenient way to switch the sort
  • Use argsort to track the positions of where to drop my values
  • Create b to house my permuted values
  • Populate b with a sorted version of Supply
  • Assign back to df

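import numpy as np

# Positions of the rows, ordered from highest Price to lowest
a = df.Price.mul(-1).to_numpy().argsort()
b = np.empty_like(df.Supply)

# Place the largest Supply values at the highest-priced positions
b[a] = df.Supply.sort_values(ascending=False)

df.loc[:, 'Supply'] = b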

df

           Datetime  Supply  Price
0  2019-02-01 12:00      10    2.0
1  2019-02-01 12:00       0    1.0
2  2019-02-01 12:00      10    5.0
3  2019-02-01 12:00       0    1.0
4  2019-02-01 12:00       0    2.0
5  2019-02-01 12:00      10    4.0
6  2019-02-01 12:00      10    5.0

There is room to optimize this code but the general idea is there.
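
To see what the index bookkeeping does, here is the same idea in plain Python lists, without NumPy or pandas. It is only an illustration of the technique; the variable names are made up, and sorted() is stable, so rows with equal Price keep their original relative order.

```python
price = [2.0, 1.0, 5.0, 1.0, 2.0, 4.0, 5.0]
supply = [10, 10, 0, 10, 0, 10, 0]

# Row positions ordered from highest price to lowest (the role of argsort).
order = sorted(range(len(price)), key=lambda i: -price[i])

# Supply values from largest to smallest.
ranked_supply = sorted(supply, reverse=True)

# Drop the largest supply into the highest-priced position, and so on.
permuted = [0] * len(supply)
for position, value in zip(order, ranked_supply):
    permuted[position] = value

print(permuted)
```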

Extra space added on reverse string function

I’m trying to figure out how to reverse a string using Python without using the [::-1] solution.

My code seems to work fine for several test cases, but it adds an extra space for one instance and I can’t figure out why.

def reverse(s):
    r = list(s)
    start, end = 0, len(s) - 1
    x = end//2
    for i in range(x):
        r[start], r[end] = r[end], r[start]
        start += 1
        end -= 1

    print(''.join(r))


reverse('A man, a plan, a canal: Panama')

# returns 'amanaP :lanac  a,nalp a ,nam A'
# note the double space ^^ - don't know why


reverse('a monkey named fred, had a banana')

# returns 'ananab a dah ,derf deman yeknom a'


reverse('Able was I ere I saw Elba')

# returns 'ablE was I ere I saw elbA'

Solution:

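The loop runs one iteration too few for strings of even length. end is len(s) - 1, so for the 30-character string end//2 is 14, but 15 swaps are needed. The middle pair is never swapped, which is why the two middle characters stay in their original order. For odd lengths the two expressions give the same value, so the other test cases pass.

Change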

    x = end//2

to

    x = len(s)//2
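
With that change applied, the whole function could look like the sketch below. It returns the result instead of printing it, which is a small departure from the original made so that the output can be checked.

```python
def reverse(s):
    """Reverse a string by swapping characters from both ends toward the middle."""
    chars = list(s)
    start, end = 0, len(chars) - 1
    # len // 2 swaps cover every pair; for odd lengths the middle
    # character stays where it is.
    for _ in range(len(chars) // 2):
        chars[start], chars[end] = chars[end], chars[start]
        start += 1
        end -= 1
    return ''.join(chars)


print(reverse('A man, a plan, a canal: Panama'))
```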

Python how can I manually end an infinite while loop that's collecting data, without ending the code and not using KeyboardInterrupt?

In my code I have a “while True:” loop that needs to run for a varying amount of time while collecting live data (3-5 hours). Since the time is not predetermined I need to manually end the while loop without terminating the script, so that it may continue to the next body of code in the script.

I do not want to use “input()” at the end of the loop, because then I would have to manually tell it to continue every time it finishes an iteration. I am collecting live data down to the half second, so this is not practical.

Also I do not want to use keyboard interrupt, have had issues with it. Are there any other solutions? All I have seen is try/except with “keyboardinterrupt”

def datacollect(): ...   # placeholder: collects live data
def datacypher(): ...    # placeholder: acts on the collected data

while True:
    #Insert code that collects data here
    datacollect()

#end the while loop and continue on
#this is where i need help

datacypher()
print('Yay it worked, thanks for the help')

I expect to end the loop manually and then continue onto the code that acts upon the collected data.

If you need more details or have problem with my wording, let me know. I have only asked one question before. I am learning.

Solution:

How about adding a key listener in a second thread? After you press Enter, you’ll manually move the script to the next stage by means of a shared bool. The second thread shouldn’t slow down the process since it blocks on input().

from threading import Thread
from time import sleep

done = False

def listen_for_enter_key_press():
    global done
    input()
    done = True

listener = Thread(target=listen_for_enter_key_press)
listener.start()

while not done:
    print('working..')
    sleep(1)

listener.join()

print('Yay it worked, thanks for the help')
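
A variation on the same idea uses threading.Event instead of a global bool. This is only a sketch: collect_until_stopped, wait_for_stop and interval are hypothetical names, and the listener is a daemon thread so it cannot keep the program alive on its own.

```python
import threading


def collect_until_stopped(collect, wait_for_stop=input, interval=0.5):
    """Call collect() repeatedly until wait_for_stop() returns.

    wait_for_stop defaults to input, so pressing Enter ends the loop.
    Returns the number of times collect() was called.
    """
    stop = threading.Event()

    def listener():
        wait_for_stop()  # blocks until Enter is pressed (by default)
        stop.set()

    threading.Thread(target=listener, daemon=True).start()

    rounds = 0
    while not stop.is_set():
        collect()
        rounds += 1
        stop.wait(interval)  # pause between samples, waking early once stopped
    return rounds
```

In the question's script this would be called as collect_until_stopped(datacollect), and datacypher() would follow once it returns.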

How to use random to choose colors

I’m trying to make a little program to learn more, but I’m stuck when it comes to using random.

Here is an example of what I’m working from: https://trinket.io/python/3338c95430

I’ve tried random.randrange, random.choice, and random.random, but each one raises an error saying that random has no function named randrange, choice, or random.

import turtle, math, random, time

wn = turtle.Screen()
wn.bgcolor('grey')
Rocket = turtle.Turtle()
Rocket.speed(0)
Rocket.color('red') ## this is what i want to randomize
rotate=int(90)

def drawCircles(t,size):
    for i in range(15):
        t.circle(size)
        size=size-10
def drawSpecial(t,size,repeat):
    for i in range(repeat):
        drawCircles(t,size)
        t.right(360/repeat)
drawSpecial(Rocket,100,10)

Eventually I would like to randomize more things, such as size and placement, but for now I’m just focusing on color.

Solution:

Without using additional imports it’s fairly simple:

turtle.colormode(255) # sets the color mode to RGB

R = random.randrange(0, 256, 100) # last value optional (step) 
B = random.randrange(0, 256)
G = random.randrange(0, 256)

# using step allows more control if needed
# for example the value of `100` would return `0`, `100` or `200` only

Rocket.color(R, G, B) ## randomized from values above

[Image: drawing produced with the randomized values (200, 255, 23)]

EDIT: Regarding “would i just change the turtle.colormode() to Rocket.colormode() for the next one?”

The way I would recommend doing it would be to create a function:

def drawColor():

    turtle.colormode(255)

    R = random.randrange(0, 256)
    B = random.randrange(0, 256)
    G = random.randrange(0, 256)

    return R, G, B

Rocket.color(drawColor())

This way you can call drawColor() anytime you want a new color.
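
The color-picking part can also be kept separate from turtle, which makes it easy to try out without opening a window. The following is a minimal sketch; random_rgb and its parameters are made-up names.

```python
import random


def random_rgb(low=0, high=256, step=1, rng=random):
    """Return an (R, G, B) tuple of random integers taken from range(low, high, step)."""
    return tuple(rng.randrange(low, high, step) for _ in range(3))


color = random_rgb()            # any value from 0 to 255 per component
coarse = random_rgb(step=100)   # each component is 0, 100 or 200
print(color, coarse)
```

After turtle.colormode(255) has been called once, the tuple can be passed straight to Rocket.color(random_rgb()).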

Now that you have the basics of randomizing colors for your drawing you can get quite creative with the values for some awesome looking results (adjust integers to your liking):

#!/usr/bin/python

import turtle, math, random, time

def drawColor(a, b, o):
    turtle.colormode(255)
    R = random.randrange(a, b, o) # last value is step (optional)
    B = random.randrange(a, b, o)
    G = random.randrange(a, b, o)
    # print(R, G, B)
    return R, G, B

def drawRocket(offset):
    Rocket = turtle.Turtle()
    Rocket.speed(0)
    Rocket.color(drawColor(20, 100, 1)) # random color, each component from range(20, 100)
    rotate=int(random.randrange(90))
    drawSpecial(Rocket,random.randrange(0, 10), offset)

def drawCircles(t,size):
    for i in range(30):
        t.circle(size)
        size = size-20

def drawSpecial(t,size,repeat):
    for i in range(repeat):
        drawCircles(t,size)
        t.right(360/repeat)

def drawMain(x, y):
    wn = turtle.Screen()
    wn.bgcolor(drawColor(0, 20, 2))

    for i in range(3):
        drawRocket(x)
        x+=y
        # print(x)

drawMain(2, 10)
input("Press ENTER to exit")