Skip to content

Conversation

@taldcroft
Copy link
Member

Closes #7728

Should be used as the base for #9287.

@mhvk
Copy link
Contributor

mhvk commented Oct 8, 2019

I like the idea, but first a general comment: could we not more simply rely on the presence of argsort? Time.argsort already exists and does, I think, what you want (if not, it is a bug!). One could then perhaps an info.argsort as a possible fall-back for columns that do not implement argsort directly, but I think I would stick with that name.

@taldcroft
Copy link
Member Author

The problem is if you have (e.g. as in the test) both Time and other key columns, it becomes inconvenient to lexically sort all the columns. Just being able to argsort the Time column in isolation is not enough. The basis of the join operation is making a pure ndarray structured array corresponding to the key columns that sorts properly. Of course there are ways to do this basic operation differently, but I am not sure any of them are simpler.

@mhvk
Copy link
Contributor

mhvk commented Oct 8, 2019

Ah, yes, I see how it helps in that sense. I'll try to have a more detailed look later.

raise tme


def _get_join_sort_idxs(out_descrs, keys, len_left, len_right, left, right):
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking again I think that out_descrs isn't needed, and len_left and len_right could be dropped.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed.

sort_right = {} # sortable ndarray from right table

for key in keys:
# TODO: shape here seems out of place. Can a key column have a non-trivial shape?
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This shape is not actually used now.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wondered about this - it seems it should be impossible to sort on anything but 1-dimensional columns.

Copy link
Member

@bsipocz bsipocz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It all looks and very clever to me. Take my approval with a pinch of salt, as I'm not sure how serious something need to be in here so I would spot it.

# a new structured array that represents the lexical ordering of those
# key columns. This structured array is then argsort'ed. The trick here
# is that some columns (e.g. Time) may need to be expanded into multiple
# columns for ordering here.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for these detailed comments, it really helps a lot now and will in the future, too

Return a list of arrays which can be lexically sorted to represent
the order of the parent column.
:return: list of ndarray
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is super nit picky, and don't need to be changed right now, but longer term it would be nice to use numpydoc format. If you still use pycharm, the docstring format can be changed in preferences-->tools-->python integrated tools

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed, done. Thanks for the hint in setting this preference. I had this set before but I think I accidentally blew away my .idea directory at some point with git clean -fxd, so the docstring format got lost.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

got bitten by that, too, now I only have git clean -dfx -e .idea -e notes in my history and stick to use it from there rather than typing it in :)

Copy link
Contributor

@mhvk mhvk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks very nice! Only small comments - the main one is your own one that the new sorting function's signature can be simplified.

It does seem to me that columns with non-trivial shape cannot be sorted without further information - in which case your new get_sortable_columns is just the right place to put that information. So, it would seem to make sense to insist on that. (From quick experiments, it seems argsort just takes only the first element)

For Column this is just the column itself.
:return: list of Column
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Like @bsipocz noted, should be something like

Returns
-------
arrays : list of Column

Though I think it is fine to just delete it too, since the title already states what it is.

sort_right = {} # sortable ndarray from right table

for key in keys:
# TODO: shape here seems out of place. Can a key column have a non-trivial shape?
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wondered about this - it seems it should be impossible to sort on anything but 1-dimensional columns.


if len(left_sort_cols) != len(right_sort_cols):
# Should never happen because cols are screened beforehand for compatibility
raise ValueError('unexpected mismatch in sort cols lengths')
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe use RuntimeError for things that should never happen (still good to check).

Alternatively, I think it makes sense to use assert ..., "<reason>" here, so that it does not get executed with python -O.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I went with RuntimeError. I don't actually understand why you would want to not execute this test via python -O. I must just be missing something.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think it matter much, since the test is far cheaper than the rest, but I think assert is reasonable since you write right about that this "Should never happen", i.e., you have already tested in earlier code and thus here you simply 'assert' that the earlier checks were done. (I like this use of assert - it asserts assumptions made in the code that should hold, but that, given that bugs do happen, may be wrong any way - I think when you optimize for speed, it is fine to go with the assumptions...)

raise tme


def _get_join_sort_idxs(out_descrs, keys, len_left, len_right, left, right):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed.

Copy link
Member Author

@taldcroft taldcroft left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I have addressed all comments.

@taldcroft taldcroft requested a review from mhvk October 11, 2019 11:58
Copy link
Contributor

@mhvk mhvk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All looks OK now, nice! merging....

@mhvk mhvk merged commit 04baa3a into astropy:master Oct 11, 2019
@taldcroft taldcroft deleted the table-sortable branch December 10, 2019 23:30
eerovaher added a commit to eerovaher/astropy that referenced this pull request Aug 9, 2021
This commit removes documentation that has been made obsolete by pull
requests astropy#9340 and astropy#11127.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Allow mixins as key columns in Table join

3 participants