Correct hungarian support for rectangular matrices#4
Correct hungarian support for rectangular matrices#4WeatherGod wants to merge 15 commits intoGaelVaroquaux:hungarianfrom
Conversation
…ices. Also enabled the tests. NOTE: Only tested on rectangular matrices of shape nxm such that m > n. Tests need to be expanded to test m < n.
… rows. All assignments are made, but the algorithm wants to keep going because there are some rows left.
…f the cost function is zero-length. The result in such a situation should be an empty array.
…, we do it by padding the array. Also modified the hungarian test to also test the transpose of the test arrays to see if identical results occur.
|
Heh, oddly enough... my tracking results are better with the code from zerohungarian branch. Weird... |
|
Hey, Sorry, that pull request had slipped under my radar. I was about to merge it, but I noticed you last remark, and I am not sure what to do about it. Could you comment a bit more on it: I don't see how supporting non square matrix should affect the performance of a hungarian algorithm. But maybe I understood you wrong. |
|
No, it is not a performance issue. I am not 100% sure that the results are correct. I will be doing further analysis today. |
OK, so you are saying that the square matrices are taking a performance
Tahnks, keep me poster. |
No... I am not talking at all about performance (in terms of speed or memory usage). I am only talking about the correctness of the algorithm output. I need to find some more test cases to see how the algorithm is right or wrong. |
|
Ok, I finally figured it out. It was a combination of issues. First, we don't need to pad the cost matrix. One could also transpose it when m > n and then just make sure to switch the columns of the results before returning it. Second (and most importantly!), the functional method "hungarian()" strips out important information when the matrix is rectangular. When the input cost matrix is square, we know that there will be an association for each row. However, when m > n, there can only be n associations, which means some rows are not accounted for in the results list from H.compute(). Because hungarian() strips the first column from the results list, we typically loop over the enumeration of results, accidentally assuming that the results are for the first n rows. Therefore, I recommend not stripping the first column from the results list in hungarian(). Also, if desired, I can make a new pull request that implements the rectangular hungarian solver without cost padding. |
OK, that's good to know.
Yes, that seems the right thing to do.
Fair enough. Could you just make sure that find_permutation still works.
That would be great. Thanks |
Update tutorial to match the current master API + typo. Thanks Lars.
* initial commit * used random class * fixed failing testcases, reverted __init__.py * fixed failing testcases #2 - passed rng as parameter to ParameterSampler class - changed seed from 0 to 42 (as original) * fixed failing testcases #2 - passed rng as parameter to SparseRandomProjection class * fixed failing testcases #4 - passed rng as parameter to GaussianRandomProjection class * fixed failing test case because of flake 8
* ENH Adds files * ENH Adds permutation importance * RFC Better names * STY Flake8 * ENH: Adds inspect module * DOC Adds pre_dispatch * DOC Adds permutation importance example * Trigger CI * BLD Adds inspect to configuration * RFC Update to only inspect fitted model * RFC Removes parameters * ENH: Adds pandas support * STY Flake8 * DOC Adds new permutation importance example * ENH Renames module to model_inspection * DOC Fix links * DOC Fixes image link * DOC Fixes image link * DOC Spelling * DOC * TST Fix keyword * Rework RF Imp vs Perm Imp example (#4) * WIP * WIP * WIP * DOC Adds multcollinear features example * WIP * DOC: Clean up docs * TST Adds tests for strings * STY Indent correction * WIP * ENH Uses check_X_y * TST Adds test with strings * STY Fix * TST Adds column transformer to test * CLN Address comments * CLN Removes import * TST Adds test with nan * CLN Removes import * ENH Parallel * DOC comments * ENH Better handling of pandas * ENH Clear checking of pandas dataframe * STY Formatting * ENH Copies in parallel helper * DOC Adds comments * BUG Fix copying * BUG Fix for pandas * BUG Fix for pandas * REV * BLD Trigger CI * BUG Fix * BUG Fix * TST Does this work * BUG Fixes test * BUG Fixes test * BUG Fix * BUG Fix * BUG Fix * STY Fix * TST Fix * TST Fix segfault * CLN Address comments * CLN Address comments * ENH Returns a bunch * STY Flake8 * CLN Renames bunch key * DOC Updates api * DOC Updates api * TST Adds permutation test with linear_regression * DOC update * DOC Fix label cutoff * CLN Address comments * TST Adds test for random_state effect * DOC Adds permutation importance * DOC Adds ogrisel suggestion * DOC Address guillaumes comments * DOC Address andreas comments * DOC Update
Additional testing revealed that my previous square2rect branch was wrong in the case of m > n. Therefore, this branch fixes that mistake by going the route of padding the array and recording the original matrix shape.
Also modified the hungarian test to also test the transpose of the test arrays to see if identical results occur.
As a consequence of this new approach, we do not need some of the code that was introduced by the zerohungarian branch (but we do still need to keep some of it).