Allow to used Collation based sorting of strings#6229
Allow to used Collation based sorting of strings#6229chrisbra wants to merge 7 commits intovim:masterfrom
Conversation
|
I can see that collation-ordered sorting can be useful, but for the readdir() function I thought the question was for case-ignored sorting. So README, Readme and readme sort together. And I would expect at least some users wanting byte-order sorting (like we do nearly everywhere currently). |
Yes, that's what I believe |
|
Christian wrote:
> So README, Readme and readme sort together.
Yes, that's what I believe `:lang collate de_DE` would cause. If you
do not want this, use `:lang collate C`
It's generally not a good idea to have the result of a function depend
on a global variable. It tends to break plugins.
A generic, but possibly complicated solution, is to add an argument to
specify the sorting:
none - byte order
icase - current case-folding comparison
collate - using language from environment
collate:{lang} - collattion using specific language
If we introduce collation here, then no doubt will it be requested where
we currently have a choice between ignore-case and byte order.
…--
** Hello and Welcome to the Psychiatric Hotline **
If you are obsessive-compulsive, please press 1 repeatedly.
If you are co-dependent, please ask someone to press 2.
If you have multiple personalities, please press 3, 4, 5 and 6.
If you are paranoid-delusional, we know who you are and what you want
- just stay on the line so we can trace the call.
If you are schizophrenic, listen carefully and a little voice will
tell you which number to press next.
If you are manic-depressive, it doesn't matter which number you press
- no one will answer.
If you suffer from panic attacks, push every button you can find.
If you are sane, please hold on - we have the rest of humanity on the
other line and they desparately want to ask you a few questions.
/// Bram Moolenaar -- Bram@Moolenaar.net -- http://www.Moolenaar.net \\\
/// sponsor Vim, vote for features -- http://www.Vim.org/sponsor/ \\\
\\\ an exciting new programming language -- http://www.Zimbu.org ///
\\\ help me help AIDS victims -- http://ICCF-Holland.org ///
|
|
Okay, will rework. |
How about: Returning the result without sorting may have speed advantage. |
Yes, this corresponds with my original request and seems like the least likely to cause problems. I was about to alter my feature request after realizing that I was asking for a set of conditions that are jointly unsatisfiable. I wanted the results of readdirex() to (i) match those of Vimscript glob(), and (ii) match what one would get by sorting using Vimscript '>?' I also realized that the latter is probably what I really want for my plugin. Thus an optional flag to sort in that fashion is most welcome! (I don't know whether that is what is meant by "icase - current case-folding comparison"). |
|
On 2020-06-10, K.Takata wrote:
A generic, but possibly complicated solution, is to add an argument to
specify the sorting:
none - byte order
icase - current case-folding comparison
collate - using language from environment
collate:{lang} - collattion using specific language
How about:
none - unsorted
case - byte order (default)
icase - current case-folding comparison
As I understand it, Vim's current case-folding comparison does
a stationary sort. So, if locale shows LC_COLLATE="en_US.UTF-8",
and the OS happened to put README, Readme and readme in the
directory in this order, for example,
Readme
readme
README
Vim's "sort i" would leave them in that order, but Vim's glob("*")
would sort them as
README
Readme
readme
and both ":!sort" and ":!echo *" would sort them as
readme
Readme
README
I still don't think that a purely case-insensitive sort is what
should be used for readdirex() or readdir() because the results will
be in an order dependent upon the order in which the OS happened to
put them in the directory. Two directories containing the same
files could case-insensitively sort differently.
Making a case-insensitive file name comparison can be useful, but
for sorting a directory, we should pick a set of collations or
unsorted. Call one dictionary order or case-folded, but not
case-insensitive.
Regards,
Gary
|
|
Whatever the sorting order for readdirex() is, it should be possible to use Vimscript to return to that sorting order after first sorting by (e.g., filesize or time). That's why I have landed on "the same as >?". So unless the category Sketch of example: let foo be the result of calling readdirex(). It's a list of dicts. It has some sorting order to it, where what's sorted "on" are the filename entries in each dict. I might want to use Vimscript to sort foo by the "time" entry. And then having inspected that, I might want to use Vimscript return to sorting foo by filename. This last step ought to put foo back in the order it was in when it was first created by readdirex(). Or rather, the optional arg for readdirex() that I'm requesting should be able to take a value that allows for this putting-back. IIUC collation order will not allow this, because IIUC Now one might say "well, just sort foo by I used to think that Vimscript's glob() put things in my desired order (on macOS at least), but I now think I was wrong about that. |
|
On 2020-06-10, chdiza wrote:
Whatever the sorting order for readdirex() is, it should be possible to use
Vimscript to return to that sorting order after first sorting by (e.g.,
filesize or time). That's why I have landed on "the same as >?".
So unless the category expr5 has comparison operators that will follow
"collation", I do not want "collation order" as the only alternative to the
current "case sensitive" ordering.
Sketch of example: let foo be the result of calling readdirex(). It's a list of
dicts. It has some sorting order to it, where what's sorted "on" are the
filename entries in each dict. I might want to use Vimscript to sort foo by the
"time" entry. And then having inspected that, I might want to use Vimscript
return to sorting foo by filename. This last step ought to put foo back in the
order it was in when it was first created by readdirex(). Or rather, the
optional arg for readdirex() that I'm requesting should be able to take a value
that allows for this putting-back.
Exactly! You want a sorting method that sorts a randomly-ordered
list with repeatable, predictable results.
Here's an example. I have a directory that contains three files.
If I read the directory unsorted at all, using either "ls -U" or
"find", I get this list in this order.
readme
README
Readme
If I apply Vim's case-insensitive sort, ":sort i", to that list,
I get the same list, because Vim's :sort is stationary.
readme
README
Readme
Now let me get the files sorted by time by reading the directory
using "ls -t".
README
readme
Readme
If I apply Vim's case-insensitive sort to that list I get the
following.
README
readme
Readme
Note that the list order is unchanged, as expected. Note also that
this order is different from what it was the first time I applied
Vim's case-insensitive sort to the "raw" file list. There is no
way to get back to Vim's case-insensitive sort of the original "raw"
list order using any sorting method other than one specifically
crafted to do that.
If you want all the same letters to sort together, and have
repeatable results so that you can put a list back in that order,
you need to choose a collating order, putting either each of the
lower-case letters before its upper-case equivalent or vice versa.
That is, either
aAbBcC...qQrRsS...
or
AaBbCc...QqRrSs...
IIUC collation order will not allow this, because IIUC >? and friends don't
follow collation order.
See above.
Now one might say "well, just sort foo by >? right after it's created but
before you display any of its contents." But that puts me right back where I
was when I opened the GH issue: the time needed to perform that initial sorting
cancels out the time saved by using readdirex() instead of glob() in the first
place.
I agree that it would be much better to put the sorting in the
C code of the readdirex() function.
I used to think that Vimscript's glob() put things in my desired order (on
macOS at least), but I now think I was wrong about that.
I looked briefly at the glob() function but haven't figured it out
yet. I thought it used the shell, but it doesn't.
Regards,
Gary
|
|
reworked. |
runtime/doc/eval.txt
Outdated
| |locale| | ||
| Other values are silently ignored. | ||
|
|
||
| For example, to get a list of alle files in the current |
There was a problem hiding this comment.
Typo: "all", not "alle". Also, one line below: "entries", not "entires".
|
Hm, looks like there are some sorting failures on some CI builds. Not sure how to fix those. I especially find the S390 results strange. |
|
On 2020-06-11, Christian Brabandt wrote:
reworked.
Looks good. That should keep everybody happy.
I noticed a couple of typos.
runtime/doc/mlang.txt:49:
collation order is pringed. Technical: LC_COLLATE.
^^^^^^^
"pringed" should be "printed".
src/testdir/test_functions.vim:1946:
" Skipt tests on Mac OS X (does not allow several files with different caseing)
^^^^^ ^^^^^^^
"Skipt" should be "Skip".
"caseing" should be "casing".
Also, in the new :help entry for readdirex(), it is not clear what
is meant by "case sensitive" here (which is also misspelled
"senstive"):
sort How to sort the result returned from the system.
Valid values are:
none do not sort (fastest method)
case sort case senstive
icase sort case insensitive
collate sort using collation order of your
|locale|
Other values are silently ignored.
It would be more clear if it said something like "according to the
byte value of each character", or 'using the collation order of the
"POSIX" or "C" locales'. (See the strcoll(3) man page.)
Regards,
Gary
|
src/filepath.c
Outdated
| } | ||
|
|
||
| if (dict_find(tv->vval.v_dict, (char_u *)"sort", -1) != NULL) | ||
| compare = dict_get_string(tv->vval.v_dict, (char_u *)"sort", FALSE); |
There was a problem hiding this comment.
If the key "sort" is not found, the result is unpredictable.
| compare = dict_get_string(tv->vval.v_dict, (char_u *)"sort", FALSE); | |
| compare = dict_get_string(tv->vval.v_dict, (char_u *)"sort", FALSE); | |
| else | |
| { | |
| *cmp = READDIR_SORT_BYTE; | |
| return OK; | |
| } |
Or, should be an error?
There was a problem hiding this comment.
and make missing 'sort' key an error
There was a problem hiding this comment.
Currently, the optional dict only has the 'sort' key, making it mandatory may make sense.
But if another keys are added later, making it mandatory may not make sense.
The tests assume that the order in which the files are created are the order in which |
|
Thanks @k-takata and Gary for the reviews, I have adjusted it.
@jamessan indeed the test assumes this. Any suggestions on how to improve the test? Or should we skip test for s390? BTW: What feature does need to be checked for that architecture? It is not completely obvious to me if we want to adjust tests specifically for that architecture. |
Isn't the issue fixed by my suggested change? |
|
The CI tests aren't complete yet, but I assume some will still fail, because the newly added test depends on the order in which the files are written. And it looks like this can't be guaranteed everywhere. |
It has nothing to do with the architecture. You're just getting lucky on the other systems. The order of the items returned by It's completely valid for XFS, btrfs, and ext4 to return the files in different orders than each other even if the same steps were taken to create the files. |
|
@jamessan Understood and I did not mean to hint that the sorting order is dependent on the architecture. Sorry for that. I did realize that the sorting order depends on the underlying filesystem. I guess what I actually did want to say where 2 questions:
|
I would say that the However, you can probably use that data to determine how the rest of the test assertions should look. Although it's still not guaranteed, I think it's far less likely that the ordering will change among different
There shouldn't be any reason to do that. There's nothing unique about an s390 system that's relevant to this test. The particulars of the Travis setup for CI might be relevant, but that has no bearing on any other s390 system. If my above suggestion doesn't work, then maybe it would be worthwhile to change .travis.yml so it skips this test on s390. However, I would argue that's just hiding a flaky test that's going to cause problems downstream and cause people to waste time investigating false-negative test failures.
Right, because there's nothing different about how Vim is built. It's just a different architecture. There's nothing in Vim that let's you distinguish between i386, amd64, mips, m68k, etc. |
Agree. It depends on its underlying filesystem, not on OS. |
src/testdir/test_functions.vim
Outdated
| call writefile(['2'], 'Xdir2/Readme.txt') | ||
| call writefile(['3'], 'Xdir2/readme.txt') | ||
|
|
||
| " Skip tests on Mac OS X (does not allow several files with different caseing) |
|
The Implementation looks OK now, thanks. |
Maybe do the test differently if has('ebcdic')? (zArchitecture, but not Linux on Z, usually uses EBCDIC encoding IIUC.) |
src/testdir/test_functions.vim
Outdated
| let files = readdirex('Xdir2', 1, #{sort: 'none'})->map({-> v:val.name}) | ||
| " not possible to assert the order in which the files appear, so just make | ||
| " sure it matches in any order. | ||
| call assert_match('^\c\(readme\C\.txt\)\{3\}$', join(files, ''), 'unsorted') |
There was a problem hiding this comment.
How about comparing after sorting?
| call assert_match('^\c\(readme\C\.txt\)\{3\}$', join(files, ''), 'unsorted') | |
| call assert_match(['README.txt', 'Readme.txt', 'readme.txt'], sort(files), 'unsorted') |
There was a problem hiding this comment.
And here I am, thinking I was overly clever using assert_match() :/
Thanks, good suggestion 👍
src/testdir/test_functions.vim
Outdated
|
|
||
| " 4) sort by ignoring case | ||
| let files = readdirex('Xdir2', 1, #{sort: 'icase'})->map({-> v:val.name}) | ||
| call assert_equal(['README.txt', 'readme.txt', 'Readme.txt'], files, 'sort by icase') |
There was a problem hiding this comment.
I don't think the order of the files can be determined here.
'README.txt', 'readme.txt' and 'Readme.txt' have the same priority, then they can be sorted in any order.
Maybe, a good additional test is e.g. using 'a.txt', 'b.txt' and 'Z.txt':
case -> 'Z.txt', 'a.txt', 'b.txt'
icase -> 'a.txt', 'b.txt', 'Z.txt'
This test can be also executed on Windows and macOS.
There was a problem hiding this comment.
Yeah, but on EBCDIC systems, in byte-value sorting, lowercase comes before uppercase and digits come last, (a-i 0x81-0x89, j-r 0x91-0x99, s-z 0xA2-0xA9, A-I 0xC1-0xC9, J-R 0xD1-0xD9, S-Z 0xE2-0xE9, digits 0xF0-0xF9, with punctuation characters in-between), so you may need to test for has('ebcdic') and if true compare A.txt, B.txt and z.txt (and if sorted on non-case-folded byte values, z will come before A).
Best regards,
Tony.
src/testdir/test_functions.vim
Outdated
| lang collate de_DE | ||
| let files = readdirex('Xdir2', 1, #{sort: 'collate'})->map({-> v:val.name}) | ||
| call assert_equal(['readme.txt', 'Readme.txt', 'README.txt'], files, 'sort by de_DE collation') | ||
| catch |
There was a problem hiding this comment.
How about adding a skip message?
| catch | |
| catch | |
| throw 'Skipped: de_DE collation is not available' |
src/vim.h
Outdated
| #ifdef HAVE_STRCOLL | ||
| # define STRCOLL(d, s) strcoll((char *)(d), (char *)(s)) | ||
| # else | ||
| # define STRCOLL(d, s) strcmp((char *)(d), (char *)(s)) | ||
| # endif |
There was a problem hiding this comment.
| #ifdef HAVE_STRCOLL | |
| # define STRCOLL(d, s) strcoll((char *)(d), (char *)(s)) | |
| # else | |
| # define STRCOLL(d, s) strcmp((char *)(d), (char *)(s)) | |
| # endif | |
| #ifdef HAVE_STRCOLL | |
| # define STRCOLL(d, s) strcoll((char *)(d), (char *)(s)) | |
| #else | |
| # define STRCOLL(d, s) strcmp((char *)(d), (char *)(s)) | |
| #endif |
src/testdir/test_functions.vim
Outdated
| " Skip tests on Mac OS X (does not allow several files with different casing) | ||
| if !(has("osxdarwin") || has("osx") || has("macunix")) |
There was a problem hiding this comment.
Same for Cygwin.
| " Skip tests on Mac OS X (does not allow several files with different casing) | |
| if !(has("osxdarwin") || has("osx") || has("macunix")) | |
| " Skip tests on Mac OS X and Cygwin (does not allow several files with different casing) | |
| if !(has("osxdarwin") || has("osx") || has("macunix") || has("win32unix")) |
Maybe it's better to add a skip message?
src/filepath.c
Outdated
| semsg(_(e_no_dict_key), "sort"); | ||
| return FAIL; |
There was a problem hiding this comment.
I'm still wondering if this error is really needed. #6229 (comment)
There was a problem hiding this comment.
I don't mind either way. I leave it to @brammool, the error message also needs a proper number. I am on the fence, but I slightly prefer, if Vim would error if I had a typo like readdirex(..., 1, #{sorta: 'case'}.
src/vim.h
Outdated
| # define READDIR_SORT_NONE 0 // do not sort | ||
| # define READDIR_SORT_BYTE 1 // sort by byte order (strcmp), default | ||
| # define READDIR_SORT_IC 2 // sort ignoring case (strcasecmp) | ||
| # define READDIR_SORT_COLLATE 3 // sort according to collation (strcoll) |
There was a problem hiding this comment.
| # define READDIR_SORT_NONE 0 // do not sort | |
| # define READDIR_SORT_BYTE 1 // sort by byte order (strcmp), default | |
| # define READDIR_SORT_IC 2 // sort ignoring case (strcasecmp) | |
| # define READDIR_SORT_COLLATE 3 // sort according to collation (strcoll) | |
| #define READDIR_SORT_NONE 0 // do not sort | |
| #define READDIR_SORT_BYTE 1 // sort by byte order (strcmp), default | |
| #define READDIR_SORT_IC 2 // sort ignoring case (strcasecmp) | |
| #define READDIR_SORT_COLLATE 3 // sort according to collation (strcoll) |
There was a problem hiding this comment.
Opps, where did the space come from...
I was thinking of something like below. This does pass on Travis. diff --git a/src/testdir/test_functions.vim b/src/testdir/test_functions.vim
index 2316ac01d13..2463e3e5f67 100644
--- a/src/testdir/test_functions.vim
+++ b/src/testdir/test_functions.vim
@@ -1942,21 +1942,23 @@ func Test_readdirex_sort()
" 1) default
let files = readdirex('Xdir2')->map({-> v:val.name})
+ let default = copy(files)
call assert_equal(['README.txt', 'Readme.txt', 'readme.txt'], files, 'sort using default')
" 2) no sorting
let files = readdirex('Xdir2', 1, #{sort: 'none'})->map({-> v:val.name})
+ let unsorted = copy(files)
" not possible to assert the order in which the files appear, so just make
" sure it matches in any order.
call assert_match('^\c\(readme\C\.txt\)\{3\}$', join(files, ''), 'unsorted')
" 3) sort by case (same as default)
let files = readdirex('Xdir2', 1, #{sort: 'case'})->map({-> v:val.name})
- call assert_equal(['README.txt', 'Readme.txt', 'readme.txt'], files, 'sort by case')
+ call assert_equal(default, files, 'sort by case')
" 4) sort by ignoring case
let files = readdirex('Xdir2', 1, #{sort: 'icase'})->map({-> v:val.name})
- call assert_equal(['README.txt', 'readme.txt', 'Readme.txt'], files, 'sort by icase')
+ call assert_equal(unsorted->sort('i'), files, 'sort by icase')
" 5) Default Collation
let collate = v:collate |
|
thanks everybody, will rework and apply the suggestions later today. |
After some discussion on github, it was decided to rework how sorting
should work for the readdirex function call.
The readdirex function now takes an optional argument. Extract from the
help:
>The optional {dict} argument allows for further custom
>values. Currently this is used to specify if and how sorting
>should be performed. The dict can have the following members:
>
> sort How to sort the result returned from the system.
> Valid values are:
> "none" do not sort (fastest method)
> "casa" sort case senstive
> "icase" sort case insensitive
> "collate" sort using collation order of your |locale|
> Other values are silently ignored.
>
>By default, the entries will be sorted using STRCMP, but one can also
>using strcasecmp and strcoll (using the current locales collation
>order).
Add some tests to verify the behaviour.
Adjust code according to reviews.
|
okay reworked. |
runtime/doc/mlang.txt
Outdated
| With the "time" argument the language used for | ||
| strftime() is printed. Technical: LC_TIME. | ||
| With the "collate" argument the language used for | ||
| collation order is printed. Technical: LC_COLLATE. |
There was a problem hiding this comment.
This line uses spaces for indentation whereas nearby lines use tabs (inconsistent).
There was a problem hiding this comment.
thanks, fixed it as well.
also test directly for Cygwin and Mac OS X and skip the test directly
|
This line needs to be updated: https://github.com/vim/vim/pull/6229/files#diff-bea881dfa9626bab7141337b0fcdb23eR7921 |
Which one? It's not clear from that link. |
Codecov Report
@@ Coverage Diff @@
## master #6229 +/- ##
==========================================
- Coverage 87.48% 87.47% -0.01%
==========================================
Files 143 143
Lines 158424 158475 +51
==========================================
+ Hits 138590 138626 +36
- Misses 19834 19849 +15
Continue to review full report at Codecov.
|
|
all CI pass now |
|
I can squash into a single commit and force-push, if this is helpful. |
|
I suppose it's OK to include now. Thanks for fine-tuning it. |
|
Hi Christian,
On Tue, Jun 16, 2020 at 1:57 AM Christian Brabandt < ***@***.***> wrote:
I can squash into a single commit and force-push, if this is helpful.
It will be good to add some more tests for the error cases in the new
functions.
These show up as missing in the code coverage report:
https://codecov.io/gh/vim/vim/commit/84cf6bd81bec93b49166cd48fccc7087fdbaa6fc
For example, when the argument is not a dict or the "sort" key is not
present.
Regards,
Yegappan
|
|
Thanks, done: #6278 |
|
Hi Christian,
On Tue, Jun 16, 2020 at 1:08 PM Christian Brabandt < ***@***.***> wrote:
Thanks, done: #6278 <#6278>
Thanks for adding the tests.
After adding a lot of tests we have reached 87.5% code coverage recently.
The goal is to reach 90%. To maintain this, we need to make sure that the
newly introduced code doesn't reduce the code coverage.
Regards,
Yegappan
|
|
I think this todo item is no longer needed: Line 282 in caf73dc |
|
I'll update the todo list |
allow to use collation based sorting of strings.
Currently, we use
strcmp, which does a case sensitive comparison (almost) everywhere (we also use strcasecmp). For collation based sorting, we need to switch using strcoll(). I don't know whether this is available everywhere, so add a configure check and in case it is not available, fall back to strcmp. Apparently, on windows, we at least usestrcoll()for generating the prototypes, so it should work there already.For now, only apply the collation based sorting only for the
readdirex()vimscript function, as asked for in #6214. I guess if this is useful and works well, we can move to strcoll() function for other functions as well.Also, since it is now possible to use
strcoll(), make it possible to query and set the collation based locale, using:lang collateas well as:echo v:collate.Add a couple of tests to verify the behaviour.