NumPy: why does np.linalg.eig and np.linalg.svd give different V values of SVD?

I am learning SVD by following this MIT course.

the Matrix is constructed as

C = np.matrix([[5,5],[-1,7]])
C
matrix([[ 5,  5],
        [-1,  7]])

the lecturer gives the V as

(image: the lecturer's V matrix)

this is close to

w, v = np.linalg.eig(C.T*C)
v
matrix([[-0.9486833 , -0.31622777],
        [ 0.31622777, -0.9486833 ]])

but np.linalg.svd(C) gives a different output

u, s, vh = np.linalg.svd(C)
vh
matrix([[ 0.31622777,  0.9486833 ],
        [ 0.9486833 , -0.31622777]])

it seems that vh has the elements of V rearranged. Is that acceptable?

Did I do and understand this correctly?

Solution:

For linalg.eig your Eigenvalues are stored in w. These are:

>>> w
array([20., 80.])

For your singular value decomposition you can get your Eigenvalues by squaring your singular values (C has maximum rank so everything is easy here):

>>> s**2
array([80., 20.])

As you can see their order is flipped.

From the linalg.eig documentation:

The eigenvalues are not necessarily ordered

From the linalg.svd documentation:

Vector(s) with the singular values, within each vector sorted in descending order. …

In general, routines that give you Eigenvalues and Eigenvectors do not necessarily "sort" them the way you might want. So it is always important to make sure you have the Eigenvector for the Eigenvalue you want. If you need them sorted (e.g. by Eigenvalue magnitude) you can always do this yourself (see here: sort eigenvalues and associated eigenvectors after using numpy.linalg.eig in python).

Finally note that the rows of vh contain the Eigenvectors, whereas in v it's the columns.

So that means that e.g.:

>>> v[:,0].flatten()
matrix([[-0.9486833 ,  0.31622777]])
>>> vh[1,:].flatten()
matrix([[ 0.9486833 , -0.31622777]])

both give you the Eigenvector for the Eigenvalue 20. (One is the negative of the other, which is fine: Eigenvectors are only determined up to a scalar factor.)
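To line the two routines up yourself, you can reorder the eig output by descending eigenvalue. A short sketch, using np.array instead of np.matrix (np.matrix is discouraged in current NumPy):

```python
import numpy as np

C = np.array([[5, 5], [-1, 7]])

w, v = np.linalg.eig(C.T @ C)

# reorder the eigenpairs by descending eigenvalue, matching svd's convention
order = np.argsort(w)[::-1]
w, v = w[order], v[:, order]

u, s, vh = np.linalg.svd(C)

# eigenvalues of C.T @ C are the squared singular values of C
print(np.allclose(w, s**2))  # True
```

After the reordering, column i of v and row i of vh agree up to sign.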

json.loads() returns a string

Why is json.loads() returning a string? Here’s is my code:

import json

d = """{
    "reference": "123432",
    "business_date": "2019-06-18",
    "final_price": 40,
    "products": [
        {
            "quantity": 4,
            "original_price": 10,
            "final_price": 40,
        }
    ]
}"""

j = json.loads(json.dumps(d))
print(type(j))

Output:

<class 'str'>

Shouldn't it return a JSON object? What change is required here?

Solution:

Two points:

  1. You have a trailing comma in your products object: "final_price": 40, should be "final_price": 40 (without the comma)
  2. j should be json.loads(d). Your json.dumps(d) call serializes the string d into a JSON string literal, so json.loads just decodes it back to the same str.

Output

<class 'dict'>

EDIT

The reasons why you cannot have a trailing comma in a JSON object are explained in this post: Can you use a trailing comma in a JSON object?

Unfortunately the JSON specification does not allow a trailing comma. There are a few browsers that will allow it, but generally you need to worry about all browsers.
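Putting both fixes together (trailing comma removed, json.dumps dropped):

```python
import json

d = """{
    "reference": "123432",
    "business_date": "2019-06-18",
    "final_price": 40,
    "products": [
        {
            "quantity": 4,
            "original_price": 10,
            "final_price": 40
        }
    ]
}"""

j = json.loads(d)
print(type(j))                        # <class 'dict'>
print(j["products"][0]["quantity"])   # 4
```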

Select only regex match from a continuous string

I want to use this regex

r"Summe\d+\W\d+"

to match this string

150,90‡50,90‡8,13‡Summe50,90•50,90•8,13•Kreditkartenzahlung

but I want to only filter out this specific part

Summe50,90

I can select the entire string with this regex but I’m not sure how to filter out only the matching part

here is the function it is in, where I am trying to get the amount from a PDF:

    def get_amount(url):
      data = requests.get(url)
      with open('/Users/derricdonehoo/code/derric-d/price-processor/exmpl.pdf', 'wb') as f:
        f.write(data.content)

      pdfFileObj = open('exmpl.pdf', 'rb')
      pdfReader = PyPDF2.PdfFileReader(pdfFileObj)

      pageObj = pdfReader.getPage(0)
      text = pageObj.extractText().split()

      regex = re.compile(r"Summe\d+\W\d+")

      matches = list(filter(regex.search, text))
      for i in range(len(matches)):
        matchString = '\n'.join(matches)


      print(matchString)

As described above, I would like guidance on how best to extract just the matching portion of the string. Preferably with varying lengths of characters on either side, but that's not a priority.

thanks!!

Solution:

Your regex is correct, but after searching you must extract the matched text from the result:

  regex = re.compile(r"Summe\d+\W\d+")
  text = ["150,90‡50,90‡8,13‡Summe50,90•50,90•8,13•Kreditkartenzahlung"]

  matches = []
  for t in text:
    m = regex.search(t)
    if m:
      matches.append(m.group(0))

  print(matches)

re.search returns a Match object on success, None on failure, and that object contains all the information about your matching regex. To get the whole match you call Match.group().
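If you only ever need the matched substrings rather than the Match objects, re.findall does the search-and-extract in one step:

```python
import re

text = "150,90‡50,90‡8,13‡Summe50,90•50,90•8,13•Kreditkartenzahlung"

# findall returns only the matched substrings, not Match objects
matches = re.findall(r"Summe\d+\W\d+", text)
print(matches)  # ['Summe50,90']
```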

Excel can't read xlsxwriter created formulas SUMIF

I’m trying to create an excel containing some formulas using the code below

import xlsxwriter

workbook = xlsxwriter.Workbook('test.xlsx')
worksheet = workbook.add_worksheet()

worksheet.write('A1', 1)
worksheet.write('A2', 2)
worksheet.write('A3', 3)
worksheet.write('B1', '+')
worksheet.write('B2', '-')
worksheet.write('B3', '+')
formula = '=((SUMIF(B{f}:B{la},"+",{cl}{f}:{cl}{la})-'\
         + 'SUMIF(B{f}:B{la},"-",{cl}{f}:{cl}{la}))'\
         + '/MAX(SUMIF(B{f}:B{la},"/",{cl}{f}:{cl}{la}),1)'
print(formula.format(f=1, la=3, cl='A'))
# =((SUMIF(B1:B3,"+",A1:A3)-SUMIF(B1:B3,"-",A1:A3))/MAX(SUMIF(B1:B3,"/",A1:A3),1)

worksheet.write_formula('B5', formula.format(f=1, la=3, cl='A'))
workbook.close()

When opening this file using Microsoft excel we get the following:

(screenshot of the Excel error)

the cell containing 0 is the one with the formula.

When I use Libreoffice to open the same file I get the correct value.

Solution:

You used two opening parentheses at the start of your formula where there should only be one. It should be:

formula = '=(SUMIF(B{f}:B{la},"+",{cl}{f}:{cl}{la})-'\
         + 'SUMIF(B{f}:B{la},"-",{cl}{f}:{cl}{la}))'\
         + '/MAX(SUMIF(B{f}:B{la},"/",{cl}{f}:{cl}{la}),1)'
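Since LibreOffice silently tolerated the unbalanced formula, a cheap guard is to check that the generated formula string has balanced parentheses before writing it. This is only a sanity check, not a full formula validator:

```python
formula = ('=(SUMIF(B{f}:B{la},"+",{cl}{f}:{cl}{la})-'
           'SUMIF(B{f}:B{la},"-",{cl}{f}:{cl}{la}))'
           '/MAX(SUMIF(B{f}:B{la},"/",{cl}{f}:{cl}{la}),1)')

cell_formula = formula.format(f=1, la=3, cl='A')

# catch unbalanced parentheses before the file ever reaches Excel
assert cell_formula.count('(') == cell_formula.count(')'), cell_formula
print(cell_formula)
# =(SUMIF(B1:B3,"+",A1:A3)-SUMIF(B1:B3,"-",A1:A3))/MAX(SUMIF(B1:B3,"/",A1:A3),1)
```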

Remove string element in a list of strings if the first characters match with another string element in the list

I want to efficiently look up and compare the string elements in a list and then remove those which are prefixes of other string elements in the list (i.e. which share the same beginning).

list1 = [ 'a boy ran' , 'green apples are worse' , 'a boy ran towards the mill' ,  ' this is another sentence ' , 'a boy ran towards the mill and fell',.....]

I intend to get a list which looks like this:

list2 = [  'green apples are worse' , ' this is another sentence ' , 'a boy ran towards the mill and fell',.....]

In other words, I want to keep the longest string element from those elements which start with the same first characters.

Solution:

As suggested by John Coleman in the comments, you can first sort the sentences and then compare consecutive ones. If one sentence is a prefix of another, it will appear directly before it in the sorted list, so comparing consecutive sentences is enough. To preserve the original order, you can use a set for quickly looking up the filtered elements.

list1 = ['a boy ran', 'green apples are worse', 
         'a boy ran towards the mill', ' this is another sentence ',
         'a boy ran towards the mill and fell']                                                                

srtd = sorted(list1)
filtered = set(list1)
for a, b in zip(srtd, srtd[1:]):
    if b.startswith(a):
        filtered.remove(a)

list2 = [x for x in list1 if x in filtered]                                     

Afterwards, list2 is the following:

['green apples are worse',
 ' this is another sentence ',
 'a boy ran towards the mill and fell']

At O(n log n) this is considerably faster than comparing all pairs of sentences in O(n²), but if the list is not too long, the much simpler solution by Vicrobot will work just as well.
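For comparison, a straightforward O(n²) pairwise version might look like this sketch (my own illustration, not necessarily identical to the referenced answer):

```python
list1 = ['a boy ran', 'green apples are worse',
         'a boy ran towards the mill', ' this is another sentence ',
         'a boy ran towards the mill and fell']

# keep a sentence only if no *other* sentence extends it
list2 = [s for s in list1
         if not any(t != s and t.startswith(s) for t in list1)]

print(list2)
# ['green apples are worse', ' this is another sentence ', 'a boy ran towards the mill and fell']
```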

Pandas- pivoting column into (conditional) aggregated string

Lets say I have the following data set, turned into a dataframe:

data = [
    ['Job 1', datetime.date(2019, 6, 9), 'Jim', 'Tom'],
    ['Job 1', datetime.date(2019, 6, 9), 'Bill', 'Tom'],
    ['Job 1', datetime.date(2019, 6, 9), 'Tom', 'Tom'],
    ['Job 1', datetime.date(2019, 6, 10), 'Bill', None],
    ['Job 2', datetime.date(2019,6,10), 'Tom', 'Tom']
]
df = pd.DataFrame(data, columns=['Job', 'Date', 'Employee', 'Manager'])

This yields a dataframe that looks like:

     Job        Date Employee Manager
0  Job 1  2019-06-09      Jim     Tom
1  Job 1  2019-06-09     Bill     Tom
2  Job 1  2019-06-09      Tom     Tom
3  Job 1  2019-06-10     Bill    None
4  Job 2  2019-06-10      Tom     Tom

What I am trying to generate is a pivot on each unique Job/Date combo, with a column for Manager, and a column containing a comma-separated string of the non-manager employees. A couple of things to assume:

  1. All employee names are unique (I’ll actually be using unique employee ids rather than names), and Managers are also “employees”, so there will never be a case with an employee and a manager sharing the same name/id, but being different individuals.
  2. A work crew can have a manager, or not (see row with id 3, for an example without)
  3. A manager will always also be listed as an employee (see row with id 2 or 4)
  4. A job could have a manager, with no additional employees (see row id 4)

I’d like the resulting dataframe to look like:

     Job        Date  Manager     Employees
0  Job 1  2019-06-09      Tom     Jim, Bill
1  Job 1  2019-06-10     None          Bill
2  Job 2  2019-06-10      Tom          None

Which leads to my questions:

  1. Is there a way to do a ‘,’.join like aggregation in a pandas pivot?
  2. Is there a way to make this aggregation conditional (exclude the name/id in the manager column)

I suspect 1) is possible, and 2) might be more difficult. If 2) is a no, I can get around it in other ways later in my code.

Solution:

Group to aggregate, then fix the Employees column by removing the Manager and setting it to None where appropriate. Since the employees are unique, sets work nicely here to remove the Manager (though note that set iteration order is arbitrary, so the joined names may come out in any order).

s = df.groupby(['Job', 'Date']).agg({'Manager': 'first', 'Employee': lambda x: set(x)})
s['Employee'] = [', '.join(x.difference({y})) for x,y in zip(s.Employee, s.Manager)]
s['Employee'] = s.Employee.replace({'': None})

                 Manager   Employee
Job   Date                         
Job 1 2019-06-09     Tom  Jim, Bill
      2019-06-10    None       Bill
Job 2 2019-06-10     Tom       None
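Because of that arbitrary set ordering, the joined string can differ between runs. A self-contained variant that sorts the names and resets the index for deterministic output (sorting yields 'Bill, Jim' rather than the 'Jim, Bill' shown in the question, so use a different dedup strategy if encounter order matters):

```python
import datetime

import pandas as pd

data = [
    ['Job 1', datetime.date(2019, 6, 9), 'Jim', 'Tom'],
    ['Job 1', datetime.date(2019, 6, 9), 'Bill', 'Tom'],
    ['Job 1', datetime.date(2019, 6, 9), 'Tom', 'Tom'],
    ['Job 1', datetime.date(2019, 6, 10), 'Bill', None],
    ['Job 2', datetime.date(2019, 6, 10), 'Tom', 'Tom'],
]
df = pd.DataFrame(data, columns=['Job', 'Date', 'Employee', 'Manager'])

s = df.groupby(['Job', 'Date']).agg({'Manager': 'first', 'Employee': set})
# drop the manager from each crew, sort for a stable order,
# and fall back to None for crews that only contained the manager
s['Employee'] = [', '.join(sorted(crew - {mgr})) or None
                 for crew, mgr in zip(s.Employee, s.Manager)]
result = s.reset_index()
print(result)
```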

Write multiple Excel files for each value of certain column Python Pandas

Consider the following dataframe:

data = {'Col_A': [3, 2, 1, 0], 'Col_B': ['a', 'b', 'a', 'b']}
df = pd.DataFrame.from_dict(data)

I can create a dataframe a and write it to Excel as follows:

a = df[df.Col_B == 'a']
a.to_excel(excel_writer = r'F:\Desktop\output.xlsx', index = False)

I am looking for a way to write an Excel file, for each value of column B. In reality, there are hundreds of values for Col_B. Is there a way to loop through this?

Any help is greatly appreciated!

Regards,

M.

Solution:

If you need a separate Excel file for each group, loop over the groupby object (note the raw string, so the backslashes in the Windows path are not treated as escape sequences):

for i, a in df.groupby('Col_B'):
    a.to_excel(rf'F:\Desktop\output_{i}.xlsx', index=False)

If you need a separate sheet for each group in one Excel file, use ExcelWriter:

with pd.ExcelWriter('output.xlsx') as writer:
    for i, a in df.groupby('Col_B'):
        a.to_excel(writer, sheet_name=i, index=False)

Extra newline in output when adding strings

I have a text file something like,

3 forwhomthebelltolls
-6 verycomplexnumber

The question is, if the integer K is positive, take the first K characters and put them at the end of the string. If it’s negative, take the last K characters and put them at the front of the string. Like this:

whomthebelltollsfor
numberverycomplex

My code is

file = open("tex.txt","r")

for line in file:
    K, sentence = int(line.split(" ")[0]), line.split(" ")[1]
    new_sentence = sentence[K:] + sentence[:K]
    print(new_sentence)

however it prints the values like:

whomthebelltolls
for
numberverycomplex

I do not understand why this happens. I am just adding two strings however in printing part the second part that I add goes down.

Solution:

The actual contents of the file:

3 forwhomthebelltolls\n
-6 verycomplexnumber

When you iterate through the file with for line in file, you're taking the entire line, including the newline character at the end. You're never removing that. And so, sentence[K:] resolves to whomthebelltolls\n, and sentence[:K] resolves to for. The entire string is whomthebelltolls\nfor. Since there's a newline in the middle of the string, that gets printed.

To fix this, strip the string first:

for line in file:
    K, sentence = int(line.split(" ")[0]), line.split(" ")[1].strip()
    ...
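A minimal demonstration, with the file contents simulated as a list of lines so it runs without tex.txt:

```python
lines = ["3 forwhomthebelltolls\n", "-6 verycomplexnumber"]

results = []
for line in lines:
    k_str, sentence = line.split(" ")
    K = int(k_str)
    sentence = sentence.strip()  # drop the trailing newline
    results.append(sentence[K:] + sentence[:K])

print(results)  # ['whomthebelltollsfor', 'numberverycomplex']
```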

How to permute values of pandas column?

I have the following DataFrame df:

Datetime           Supply   Price
2019-02-01 12:00   10       2.0
2019-02-01 12:00   10       1.0
2019-02-01 12:00   0        5.0
2019-02-01 12:00   10       1.0
2019-02-01 12:00   0        2.0
2019-02-01 12:00   10       4.0
2019-02-01 12:00   0        5.0

The sum of Supply is 40. I need to permute the Supply values so that the 10s are assigned to the higher values of Price, while the 0s end up at the lower values of Price.

This is the expected result:

Datetime           Supply   Price
2019-02-01 12:00   10       2.0
2019-02-01 12:00   0        1.0
2019-02-01 12:00   10       5.0
2019-02-01 12:00   0        1.0
2019-02-01 12:00   0        2.0
2019-02-01 12:00   10       4.0
2019-02-01 12:00   10       5.0

Any clues how to do it?

Solution:

argsort

  • Multiply by negative one as a convenient way to switch the sort
  • Use argsort to track the positions of where to drop my values
  • Create b to house my permuted values
  • Populate b with a sorted version of Supply
  • Assign back to df

a = df.Price.mul(-1).to_numpy().argsort()
b = np.empty_like(df.Supply)

b[a] = df.Supply.sort_values(ascending=False)

df.loc[:, 'Supply'] = b

df

           Datetime  Supply  Price
0  2019-02-01 12:00      10    2.0
1  2019-02-01 12:00       0    1.0
2  2019-02-01 12:00      10    5.0
3  2019-02-01 12:00       0    1.0
4  2019-02-01 12:00       0    2.0
5  2019-02-01 12:00      10    4.0
6  2019-02-01 12:00      10    5.0

There is room to optimize this code but the general idea is there.
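A self-contained version of the same argsort idea, with the example data inlined, the Datetime column omitted for brevity, and kind='stable' added so that ties keep their original order (those choices are mine, not part of the answer above):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Supply': [10, 10, 0, 10, 0, 10, 0],
    'Price':  [2.0, 1.0, 5.0, 1.0, 2.0, 4.0, 5.0],
})

# positions of the prices in descending order (stable, so ties keep input order)
order = np.argsort(-df['Price'].to_numpy(), kind='stable')

# scatter the descending-sorted supplies into those positions
permuted = np.empty_like(df['Supply'].to_numpy())
permuted[order] = np.sort(df['Supply'].to_numpy())[::-1]

df['Supply'] = permuted
print(df['Supply'].tolist())  # [10, 0, 10, 0, 0, 10, 10]
```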

Extra space added on reverse string function

I’m trying to figure out how to reverse a string using Python without using the [::-1] solution.

My code seems to work fine for several test cases, but it adds an extra space for one instance and I can’t figure out why.

def reverse(s):
    r = list(s)
    start, end = 0, len(s) - 1
    x = end//2
    for i in range(x):
        r[start], r[end] = r[end], r[start]
        start += 1
        end -= 1

    print(''.join(r))


reverse('A man, a plan, a canal: Panama')

# returns 'amanaP :lanac  a,nalp a ,nam A'
# note the double space ^^ - don't know why


reverse('a monkey named fred, had a banana')

# returns 'ananab a dah ,derf deman yeknom a'


reverse('Able was I ere I saw Elba')

# returns 'ablE was I ere I saw elbA'

Solution:

Change

    x = end//2

to

    x = len(s)//2

For an even-length string, end = len(s) - 1 is odd, so end//2 comes up one swap short and the middle pair of characters is never exchanged. That unswapped pair (a space and an 'a' in the Panama example) is what shows up as the double space. Odd-length strings are unaffected because their middle character stays in place anyway.
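With that fix applied (and returning the string instead of printing it, my change for easier checking), the function becomes:

```python
def reverse(s):
    r = list(s)
    start, end = 0, len(s) - 1
    # len(s)//2 swaps covers every pair, for both even and odd lengths
    for _ in range(len(s) // 2):
        r[start], r[end] = r[end], r[start]
        start += 1
        end -= 1
    return ''.join(r)

print(reverse('A man, a plan, a canal: Panama'))
# amanaP :lanac a ,nalp a ,nam A
```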