Skip to content

ENH: Add np.strings namespace#25463

Merged
ngoldbaum merged 13 commits intonumpy:mainfrom
lysnikolaou:np-strings-namespace
Jan 9, 2024
Merged

ENH: Add np.strings namespace#25463
ngoldbaum merged 13 commits intonumpy:mainfrom
lysnikolaou:np-strings-namespace

Conversation

@lysnikolaou
Copy link
Member

No description provided.

@ngoldbaum
Copy link
Member

Nice, thanks for taking this on!

I think we should probably copy over the chararray tests to make new tests for these ufuncs, adapting the tests where there are any behavior differences.

@lysnikolaou
Copy link
Member Author

I think we should probably copy over the chararray tests to make new tests for these ufuncs, adapting the tests where there are any behavior differences.

Yup, I'm writing tests now, that's why I opened it as a draft. I'm writing new tests though that make a bit more sense, since the defchararray ones also try to test the rstrip logic etc, and it might take some more time (I'm also gonna be out next week).

@ngoldbaum
Copy link
Member

Yeah, just wanted to raise it in case you hadn't thought about it yet. Happy holidays!

@lysnikolaou lysnikolaou marked this pull request as ready for review January 3, 2024 11:53
@lysnikolaou
Copy link
Member Author

I'd say let's try and get this merged. I'll backfill replace afterwards, since it's too buggy 😬, and the tests/fixes would be better off in a separate PR.

@lysnikolaou
Copy link
Member Author

Also, if someone has any advice on how to get test coverage for C/C++ code, that'd be really helpful, since I can't find how to do it.

@mattip
Copy link
Member

mattip commented Jan 3, 2024

We used to use gcov to generate coverage in testing, and upload it to https://about.codecov.io/. Maybe we lost this in the move to meson? In any case, gcov is the tool that produces the database of code coverage in C/C++

@lysnikolaou
Copy link
Member Author

We used to use gcov to generate coverage in testing, and upload it to https://about.codecov.io/. Maybe we lost this in the move to meson? In any case, gcov is the tool that produces the database of code coverage in C/C++

Thanks! This helped a lot.

@lysnikolaou lysnikolaou force-pushed the np-strings-namespace branch from 17b5693 to fbd0f0c Compare January 3, 2024 15:10
@lysnikolaou lysnikolaou force-pushed the np-strings-namespace branch from fbd0f0c to 14add39 Compare January 3, 2024 15:23
@ngoldbaum
Copy link
Member

Can we also get a barebones docs API page for the namespace? Take a look at #25507 where I created an np.dtypes docs page with an autogenerated API listing. If we feel like reviewing that one too, I'd appreciate it!

@ngoldbaum
Copy link
Member

The test failure is coming from using wchar_t, which is 16 bits, not 32 bits.

Instead of using wmemchr, you can do a for loop:

diff --git a/numpy/_core/src/umath/string_buffer.h b/numpy/_core/src/umath/string_buffer.h
--
index 2004b04f4..635254900 100644
--- a/numpy/_core/src/umath/string_buffer.h
+++ b/numpy/_core/src/umath/string_buffer.h
@@ -158,7 +158,11 @@ struct Buffer {
newbuf.buf = (char *) memchr(buf, ch, len);
break;
case ENCODING::UTF32:
-            newbuf.buf = (char *) wmemchr((wchar_t *) buf, ch, len);
+            for (int i=0; i<len; i++) {
+                if (newbuf[i] == ch) {
+                    newbuf += i;
+                }
+            }
break;
}
return newbuf;

@lysnikolaou
Copy link
Member Author

I'm on arm64 macOS, so I can't build the docs due to matplotlib/matplotlib#27445. Is there a way to circumvent that or do we build the docs in CI somewhere?

@lysnikolaou lysnikolaou force-pushed the np-strings-namespace branch from a7ea092 to e0fe2ce Compare January 8, 2024 11:40
@ngoldbaum
Copy link
Member

Is there a way to circumvent that

Yeah, you should be able to manually download, rename the wheel file, and install the renamed wheel. The wheel is fine, pip just doesn't like the filename.

do we build the docs in CI somewhere

Circleci does the docs build. Click the "details" link on the ci/circleci: build artifact check on the CI checks listing on github and that'll take you straight to the HTML docs build in your browser. You can look at the build log on the other circleci job.

@lysnikolaou
Copy link
Member Author

Thanks @ngoldbaum for the info! Everything looks fine from my side, so another review would be great, so that we can merge this.

@ngoldbaum
Copy link
Member

I forgot to mention this earlier, it needs a release note. I think I'll likely edit it later but something to start with would be nice.

Copy link
Member

@ngoldbaum ngoldbaum left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The type annotations are greek to me but everything else looks OK besides the comments I left. If it gets a release note I think this is ready but maybe @BvB93 could look over the new type annotations?

@lysnikolaou lysnikolaou requested a review from ngoldbaum January 9, 2024 11:14
@ngoldbaum
Copy link
Member

Doctests are failing:

oc/source/reference/routines.strings.rst
-----------------------------------------

File "/tmp/tmp17a_xmo6/doc/source/reference/routines.strings.rst", line 26, in doc/source/reference/routines.strings.rst
Failed example:
    np.strings.endswith('hello', 'lo')
Expected:
    Traceback (most recent call last):
    File "<stdin>", line 1, in <module>
    TypeError: endswith() takes from 4 to 5 positional arguments but 2 were given
Got:
    Traceback (most recent call last):
      File "/home/circleci/.pyenv/versions/3.11.4/lib/python3.11/doctest.py", line 1351, in __run
        exec(compile(example.source, filename, "single",
      File "<doctest doc/source/reference/routines.strings.rst[2]>", line 1, in <module>
        np.strings.endswith('hello', 'lo')
    TypeError: endswith() takes from 4 to 5 positional arguments but 2 were given

You can run the doctests locally with:

$ python tools/refguide_check.py --doctests
$ python tools/refguide_check.py --rst

The latter would catch the error in the tests.

Also can we get a release note too?

@lysnikolaou lysnikolaou force-pushed the np-strings-namespace branch from bf1e311 to 7f85895 Compare January 9, 2024 15:17
@ngoldbaum
Copy link
Member

Pulling this in, thanks @lysnikolaou. If we catch issues with the typing later we can fix in a followup. I'd really like to have the extensive tests available in main for my work on implementing these ufuncs in stringdtype.

No need to ping the mailing list, I think, this has been proposed as part of NEP 55 for quite a while and no one has objected to the new namespace in any of the NEP 55 review.

I'll be expanding on the docs added here in the stringdtype PR.

@ngoldbaum ngoldbaum merged commit 6e3b923 into numpy:main Jan 9, 2024
@lysnikolaou
Copy link
Member Author

Thanks for writing the release note @ngoldbaum!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants