Auto-expand string width for Column insert #9559

Merged
taldcroft merged 3 commits into astropy:master from taldcroft:table-insert
Nov 11, 2019
Conversation

@taldcroft (Member) commented Nov 7, 2019

This changes Column insert so that if you insert a string value that is longer than the available width, the column dtype is auto-expanded.

Example:

In [2]: c = Column(['a', 'b'])

In [3]: c
Out[3]: 
<Column dtype='str1' length=2>
a
b

In [4]: c.insert(1, 'longer!')
Out[4]: 
<Column dtype='str7' length=3>
      a
longer!
      b

cc: @mhvk @astrofrog @bsipocz
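For contrast, a minimal numpy-only sketch of the pre-existing pitfall: plain `np.insert` casts the inserted value to the array's fixed-width dtype and silently truncates it, which is the behavior this PR avoids for Column.

```python
import numpy as np

# Plain numpy silently truncates when inserting into a fixed-width
# string array: the value is cast to the array's '<U1' dtype.
c = np.array(['a', 'b'])                 # dtype '<U1'
truncated = np.insert(c, 1, 'longer!')   # 'longer!' becomes 'l'
assert list(truncated) == ['a', 'l', 'b']
```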

new_dtype = self.dtype.kind + str(values_str_len)

# Explicitly convert to dtype of this column. Needed because numpy 1.7
# enforces safe casting by default, so . This isn't the case for 1.6 or 1.8+.
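A numpy-only sketch of the widening step in the quoted `new_dtype` line above (not the actual astropy implementation): build a wider dtype from the column's kind plus the longest inserted string, cast, then insert.

```python
import numpy as np

c = np.array(['a', 'b'])                    # dtype '<U1'
value = 'longer!'
# Widen the dtype before inserting so nothing is truncated:
new_dtype = c.dtype.kind + str(len(value))  # 'U7'
out = np.insert(c.astype(new_dtype), 1, value)
assert list(out) == ['a', 'longer!', 'b']
assert out.dtype == np.dtype('U7')
```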
Member

We don't support old numpies, so no need to think about how to work around <1.16

Member Author

This was just a copy/paste, but is removed now, good catch!

@bsipocz (Member) commented Nov 7, 2019

@taldcroft - a quick turnaround for 4.0 sounds good to me (one can see this as a bugfix)

@mhvk (Contributor) commented Nov 7, 2019

Yes, I like this idea, and since we already have the unicode sandwich and thus are deviating from numpy anyway, I think it is fine to try to make things a bit more user-friendly; I think it would be rare indeed that a user passes in a long string and only wants the first characters stored!

@pllim pllim added this to the v4.1 milestone Nov 7, 2019
@astrofrog (Member) commented:
I'm on board with this, but wondering whether we should consider emitting a warning in this case?

# Explicitly convert to dtype of this column. Needed because numpy 1.7
# enforces safe casting by default, so . This isn't the case for 1.6 or 1.8+.
values = np.asarray(values, dtype=new_dtype)
data = np.insert(self.astype(new_dtype), obj, values, axis=axis)
Contributor

Add copy=False inside the astype, otherwise you make an extra copy even when the dtype is unchanged.

Member Author

Different implementation so this does not apply now.

@bsipocz bsipocz modified the milestones: v4.1, v4.0 Nov 7, 2019
@taldcroft (Member Author) commented:
> I'm on board with this, but wondering whether we should consider emitting a warning in this case?

I'm open to other opinions, but my inclination is no warning. Pandas does not warn in this case (of course it handles strings as objects anyway), and I think that 99.9% of the time users just want this to work silently.

My original use case was an empty ECSV table with a string column that I then appended rows to. This should just work!

accord with the existing unit.

-    >>> t.add_row([-8, -9, 10 * u.cm / u.s, 11])
+    >>> t.add_row([-8, 'string', 10 * u.cm / u.s, 10])
Member Author

I think this was a bad behavior previously, where it accepted an int -9 into a string column and silently converted it. No longer: inserting the wrong type into a string column now raises an exception.
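A minimal numpy-only sketch of the stricter behavior described here. The `insert_str` helper name is hypothetical (the real logic lives in astropy's `Column.insert`): non-string values are rejected outright, and the dtype is widened when needed.

```python
import numpy as np

def insert_str(arr, index, value):
    """Hypothetical helper: reject non-string values, widen dtype as needed."""
    if not isinstance(value, str):
        raise TypeError('cannot insert non-string value into a string column')
    cur_width = arr.dtype.itemsize // 4     # '<U' dtypes store 4 bytes/char
    width = max(cur_width, len(value))
    return np.insert(arr.astype(f'U{width}'), index, value)
```

For example, `insert_str(np.array(['7', '8', '9']), 3, '10')` appends `'10'` without truncation, while passing `-9` raises a TypeError instead of silently converting.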

Contributor

That seems better indeed. Probably best to add a test for it...

     assert np.all(t['a'] == np.array([1, 2, 3, 4]))
     assert np.allclose(t['b'], np.array([4.0, 5.1, 6.2, 7.2]))
-    assert np.all(t['c'] == np.array(['7', '8', '9', '1']))
+    assert np.all(t['c'] == np.array(['7', '8', '9', '10']))
Member Author

The original was truncating.

@taldcroft taldcroft changed the title from "WIP: auto-expand string width for column insert [skip ci]" to "Auto-expand string width for Column insert" Nov 8, 2019
@taldcroft taldcroft requested a review from mhvk November 8, 2019 13:42
@mhvk (Contributor) left a review comment:

@taldcroft - looks good modulo the use of np.shape instead of np.broadcast(..).shape, and possibly an extra test. I'll approve now since I'm travelling and may not be able to look at this again in the next few days.

         mask = False
     else:
-        mask = np.zeros(values.shape, dtype=bool)
+        mask = np.zeros(np.broadcast(values).shape, dtype=bool)
Contributor

I'd do np.shape(values)
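A quick illustration of why `np.shape(values)` is the safer spelling here: scalars and plain lists have no `.shape` attribute, but `np.shape` handles them.

```python
import numpy as np

# np.shape() works on scalars and plain Python lists, which have no
# .shape attribute of their own:
assert np.shape(5) == ()
assert np.shape([1, 2, 3]) == (3,)

# A scalar value therefore yields a 0-d mask rather than an AttributeError:
mask = np.zeros(np.shape('longer!'), dtype=bool)
assert mask.shape == ()
```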

@mhvk (Contributor) commented Nov 10, 2019

p.s. I'm also in favour of not warning.

@taldcroft taldcroft merged commit 8fc0ca6 into astropy:master Nov 11, 2019
@taldcroft taldcroft deleted the table-insert branch November 11, 2019 21:37
bsipocz pushed a commit that referenced this pull request Nov 18, 2019
Auto-expand string width for Column insert