Skip to content

Allow to used Collation based sorting of strings#6229

Closed
chrisbra wants to merge 7 commits intovim:masterfrom
chrisbra:collation
Closed

Allow to used Collation based sorting of strings#6229
chrisbra wants to merge 7 commits intovim:masterfrom
chrisbra:collation

Conversation

@chrisbra
Copy link
Member

@chrisbra chrisbra commented Jun 9, 2020

allow to use collation based sorting of strings.

Currently, we use strcmp, which does a case sensitive comparison (almost) everywhere (we also use strcasecmp). For collation based sorting, we need to switch using strcoll(). I don't know whether this is available everywhere, so add a configure check and in case it is not available, fall back to strcmp. Apparently, on windows, we at least use strcoll() for generating the prototypes, so it should work there already.

For now, only apply the collation based sorting only for the readdirex() vimscript function, as asked for in #6214. I guess if this is useful and works well, we can move to strcoll() function for other functions as well.

Also, since it is now possible to use strcoll(), make it possible to query and set the collation based locale, using :lang collate as well as :echo v:collate.

Add a couple of tests to verify the behaviour.

@chrisbra chrisbra changed the title Collation Allow to used Collation based sorting of strings Jun 10, 2020
@brammool
Copy link
Contributor

I can see that collation-ordered sorting can be useful, but for the readdir() function I thought the question was for case-ignored sorting. So README, Readme and readme sort together. And I would expect at least some users wanting byte-order sorting (like we do nearly everywhere currently).

@chrisbra
Copy link
Member Author

So README, Readme and readme sort together.

Yes, that's what I believe :lang collate de_DE would cause. If you do not want this, use :lang collate C

@brammool
Copy link
Contributor

brammool commented Jun 10, 2020 via email

@chrisbra
Copy link
Member Author

Okay, will rework.

@k-takata
Copy link
Member

A generic, but possibly complicated solution, is to add an argument to specify the sorting:
none - byte order
icase - current case-folding comparison
collate - using language from environment
collate:{lang} - collattion using specific language

How about:

none - unsorted
case - byte order (default)
icase - current case-folding comparison
...

Returning the result without sorting may have speed advantage.

@chdiza
Copy link

chdiza commented Jun 10, 2020

A generic, but possibly complicated solution, is to add an argument to specify the sorting

Yes, this corresponds with my original request and seems like the least likely to cause problems.

I was about to alter my feature request after realizing that I was asking for a set of conditions that are jointly unsatisfiable. I wanted the results of readdirex() to (i) match those of Vimscript glob(), and (ii) match what one would get by sorting using Vimscript '>?'

I also realized that the latter is probably what I really want for my plugin.

Thus an optional flag to sort in that fashion is most welcome! (I don't know whether that is what is meant by "icase - current case-folding comparison").

@vim-ml
Copy link

vim-ml commented Jun 10, 2020 via email

@chdiza
Copy link

chdiza commented Jun 10, 2020

Whatever the sorting order for readdirex() is, it should be possible to use Vimscript to return to that sorting order after first sorting by (e.g., filesize or time). That's why I have landed on "the same as >?".

So unless the category expr5 has comparison operators that will follow "collation", I do not want "collation order" as the only alternative to the current "case sensitive" ordering.

Sketch of example: let foo be the result of calling readdirex(). It's a list of dicts. It has some sorting order to it, where what's sorted "on" are the filename entries in each dict. I might want to use Vimscript to sort foo by the "time" entry. And then having inspected that, I might want to use Vimscript return to sorting foo by filename. This last step ought to put foo back in the order it was in when it was first created by readdirex(). Or rather, the optional arg for readdirex() that I'm requesting should be able to take a value that allows for this putting-back.

IIUC collation order will not allow this, because IIUC >? and friends don't follow collation order.

Now one might say "well, just sort foo by >? right after it's created but before you display any of its contents." But that puts me right back where I was when I opened the GH issue: the time needed to perform that initial sorting cancels out the time saved by using readdirex() instead of glob() in the first place.

I used to think that Vimscript's glob() put things in my desired order (on macOS at least), but I now think I was wrong about that.

@vim-ml
Copy link

vim-ml commented Jun 10, 2020 via email

@chrisbra
Copy link
Member Author

reworked.

|locale|
Other values are silently ignored.

For example, to get a list of alle files in the current
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Typo: "all", not "alle". Also, one line below: "entries", not "entires".

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks fixed.

@chrisbra
Copy link
Member Author

Hm, looks like there are some sorting failures on some CI builds. Not sure how to fix those. I especially find the S390 results strange.

@vim-ml
Copy link

vim-ml commented Jun 11, 2020 via email

src/filepath.c Outdated
}

if (dict_find(tv->vval.v_dict, (char_u *)"sort", -1) != NULL)
compare = dict_get_string(tv->vval.v_dict, (char_u *)"sort", FALSE);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the key "sort" is not found, the result is unpredictable.

Suggested change
compare = dict_get_string(tv->vval.v_dict, (char_u *)"sort", FALSE);
compare = dict_get_string(tv->vval.v_dict, (char_u *)"sort", FALSE);
else
{
*cmp = READDIR_SORT_BYTE;
return OK;
}

Or, should be an error?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I change it.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

and make missing 'sort' key an error

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Currently, the optional dict only has the 'sort' key, making it mandatory may make sense.
But if another keys are added later, making it mandatory may not make sense.

@jamessan
Copy link
Contributor

I especially find the S390 results strange.

The tests assume that the order in which the files are created are the order in which readdir() will return them. There are no guarantees about ordering. It's up to the individual filesystem implementation.

@chrisbra
Copy link
Member Author

Thanks @k-takata and Gary for the reviews, I have adjusted it.

The tests assume that the order in which the files are created are the order in which readdir() will return them. There are no guarantees about ordering. It's up to the individual filesystem implementation.

@jamessan indeed the test assumes this. Any suggestions on how to improve the test? Or should we skip test for s390? BTW: What feature does need to be checked for that architecture? It is not completely obvious to me if we want to adjust tests specifically for that architecture.

@k-takata
Copy link
Member

@jamessan indeed the test assumes this. Any suggestions on how to improve the test?

Isn't the issue fixed by my suggested change?

@chrisbra
Copy link
Member Author

The CI tests aren't complete yet, but I assume some will still fail, because the newly added test depends on the order in which the files are written. And it looks like this can't be guaranteed everywhere.

@jamessan
Copy link
Contributor

@jamessan indeed the test assumes this. Any suggestions on how to improve the test? Or should we skip test for s390?

It has nothing to do with the architecture. You're just getting lucky on the other systems. The order of the items returned by readdir() is unspecified, so there shouldn't be any assertion on a specific ordering.

It's completely valid for XFS, btrfs, and ext4 to return the files in different orders than each other even if the same steps were taken to create the files.

@chrisbra
Copy link
Member Author

@jamessan Understood and I did not mean to hint that the sorting order is dependent on the architecture. Sorry for that. I did realize that the sorting order depends on the underlying filesystem. I guess what I actually did want to say where 2 questions:

  1. So we are lucky, that the sorting in all builds except for s390 built works, I'd like to keep the test and I'd like to have at least some test that makes sure this feature works. So even so the test works by coincidence only, what would be a better and reliable way to test? I suppose your answer to this question is: There is no way, so don't bother?
  2. How can I detect from within Vim that we are running on s390? So even if would like to disable this test for the S390 built, I have no idea how to test that from Vimscript. It's not ebcdic, so I was just curious how to determine that we are running on S390? I did not see anything obvious at :h feature-list.

@jamessan
Copy link
Contributor

jamessan commented Jun 12, 2020

  1. So even so the test works by coincidence only, what would be a better and reliable way to test? I suppose your answer to this question is: There is no way, so don't bother?

I would say that the #{sort: 'none'} case can't be tested. This is exactly what's unspecified.

However, you can probably use that data to determine how the rest of the test assertions should look. Although it's still not guaranteed, I think it's far less likely that the ordering will change among different readdir() calls when there are no intervening changes to the filesystem.

  1. How can I detect from within Vim that we are running on s390?

There shouldn't be any reason to do that. There's nothing unique about an s390 system that's relevant to this test. The particulars of the Travis setup for CI might be relevant, but that has no bearing on any other s390 system.

If my above suggestion doesn't work, then maybe it would be worthwhile to change .travis.yml so it skips this test on s390. However, I would argue that's just hiding a flaky test that's going to cause problems downstream and cause people to waste time investigating false-negative test failures.

It's not ebcdic, so I was just curious how to determine that we are running on S390?

Right, because there's nothing different about how Vim is built. It's just a different architecture. There's nothing in Vim that let's you distinguish between i386, amd64, mips, m68k, etc.

@k-takata
Copy link
Member

I would say that the #{sort: 'none'} case can't be tested.

Agree. It depends on its underlying filesystem, not on OS.

call writefile(['2'], 'Xdir2/Readme.txt')
call writefile(['3'], 'Xdir2/readme.txt')

" Skip tests on Mac OS X (does not allow several files with different caseing)
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Spelling: caseing -> casing

@brammool
Copy link
Contributor

The Implementation looks OK now, thanks.
We should have a test that also works on Mac, using different file names. Also, the test still fails on s390.

@tonymec
Copy link

tonymec commented Jun 14, 2020

The Implementation looks OK now, thanks.
We should have a test that also works on Mac, using different file names. Also, the test still fails on s390.

Maybe do the test differently if has('ebcdic')? (zArchitecture, but not Linux on Z, usually uses EBCDIC encoding IIUC.)

let files = readdirex('Xdir2', 1, #{sort: 'none'})->map({-> v:val.name})
" not possible to assert the order in which the files appear, so just make
" sure it matches in any order.
call assert_match('^\c\(readme\C\.txt\)\{3\}$', join(files, ''), 'unsorted')
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about comparing after sorting?

Suggested change
call assert_match('^\c\(readme\C\.txt\)\{3\}$', join(files, ''), 'unsorted')
call assert_match(['README.txt', 'Readme.txt', 'readme.txt'], sort(files), 'unsorted')

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And here I am, thinking I was overly clever using assert_match() :/

Thanks, good suggestion 👍


" 4) sort by ignoring case
let files = readdirex('Xdir2', 1, #{sort: 'icase'})->map({-> v:val.name})
call assert_equal(['README.txt', 'readme.txt', 'Readme.txt'], files, 'sort by icase')
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think the order of the files can be determined here.
'README.txt', 'readme.txt' and 'Readme.txt' have the same priority, then they can be sorted in any order.

Maybe, a good additional test is e.g. using 'a.txt', 'b.txt' and 'Z.txt':
case -> 'Z.txt', 'a.txt', 'b.txt'
icase -> 'a.txt', 'b.txt', 'Z.txt'
This test can be also executed on Windows and macOS.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, but on EBCDIC systems, in byte-value sorting, lowercase comes before uppercase and digits come last, (a-i 0x81-0x89, j-r 0x91-0x99, s-z 0xA2-0xA9, A-I 0xC1-0xC9, J-R 0xD1-0xD9, S-Z 0xE2-0xE9, digits 0xF0-0xF9, with punctuation characters in-between), so you may need to test for has('ebcdic') and if true compare A.txt, B.txt and z.txt (and if sorted on non-case-folded byte values, z will come before A).

Best regards,
Tony.

lang collate de_DE
let files = readdirex('Xdir2', 1, #{sort: 'collate'})->map({-> v:val.name})
call assert_equal(['readme.txt', 'Readme.txt', 'README.txt'], files, 'sort by de_DE collation')
catch
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about adding a skip message?

Suggested change
catch
catch
throw 'Skipped: de_DE collation is not available'

src/vim.h Outdated
Comment on lines +1602 to +1606
#ifdef HAVE_STRCOLL
# define STRCOLL(d, s) strcoll((char *)(d), (char *)(s))
# else
# define STRCOLL(d, s) strcmp((char *)(d), (char *)(s))
# endif
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
#ifdef HAVE_STRCOLL
# define STRCOLL(d, s) strcoll((char *)(d), (char *)(s))
# else
# define STRCOLL(d, s) strcmp((char *)(d), (char *)(s))
# endif
#ifdef HAVE_STRCOLL
# define STRCOLL(d, s) strcoll((char *)(d), (char *)(s))
#else
# define STRCOLL(d, s) strcmp((char *)(d), (char *)(s))
#endif

Comment on lines +1935 to +1936
" Skip tests on Mac OS X (does not allow several files with different casing)
if !(has("osxdarwin") || has("osx") || has("macunix"))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same for Cygwin.

Suggested change
" Skip tests on Mac OS X (does not allow several files with different casing)
if !(has("osxdarwin") || has("osx") || has("macunix"))
" Skip tests on Mac OS X and Cygwin (does not allow several files with different casing)
if !(has("osxdarwin") || has("osx") || has("macunix") || has("win32unix"))

Maybe it's better to add a skip message?

src/filepath.c Outdated
Comment on lines +1487 to +1488
semsg(_(e_no_dict_key), "sort");
return FAIL;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm still wondering if this error is really needed. #6229 (comment)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't mind either way. I leave it to @brammool, the error message also needs a proper number. I am on the fence, but I slightly prefer, if Vim would error if I had a typo like readdirex(..., 1, #{sorta: 'case'}.

src/vim.h Outdated
Comment on lines +2679 to +2682
# define READDIR_SORT_NONE 0 // do not sort
# define READDIR_SORT_BYTE 1 // sort by byte order (strcmp), default
# define READDIR_SORT_IC 2 // sort ignoring case (strcasecmp)
# define READDIR_SORT_COLLATE 3 // sort according to collation (strcoll)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
# define READDIR_SORT_NONE 0 // do not sort
# define READDIR_SORT_BYTE 1 // sort by byte order (strcmp), default
# define READDIR_SORT_IC 2 // sort ignoring case (strcasecmp)
# define READDIR_SORT_COLLATE 3 // sort according to collation (strcoll)
#define READDIR_SORT_NONE 0 // do not sort
#define READDIR_SORT_BYTE 1 // sort by byte order (strcmp), default
#define READDIR_SORT_IC 2 // sort ignoring case (strcasecmp)
#define READDIR_SORT_COLLATE 3 // sort according to collation (strcoll)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Opps, where did the space come from...

@jamessan
Copy link
Contributor

I would say that the #{sort: 'none'} case can't be tested. This is exactly what's unspecified.

However, you can probably use that data to determine how the rest of the test assertions should look.

I was thinking of something like below. This does pass on Travis.

diff --git a/src/testdir/test_functions.vim b/src/testdir/test_functions.vim
index 2316ac01d13..2463e3e5f67 100644
--- a/src/testdir/test_functions.vim
+++ b/src/testdir/test_functions.vim
@@ -1942,21 +1942,23 @@ func Test_readdirex_sort()
 
     " 1) default
     let files = readdirex('Xdir2')->map({-> v:val.name})
+    let default = copy(files)
     call assert_equal(['README.txt', 'Readme.txt', 'readme.txt'], files, 'sort using default')
 
     " 2) no sorting
     let files = readdirex('Xdir2', 1, #{sort: 'none'})->map({-> v:val.name})
+    let unsorted = copy(files)
     " not possible to assert the order in which the files appear, so just make
     " sure it matches in any order.
     call assert_match('^\c\(readme\C\.txt\)\{3\}$', join(files, ''), 'unsorted')
 
     " 3) sort by case (same as default)
     let files = readdirex('Xdir2', 1, #{sort: 'case'})->map({-> v:val.name})
-    call assert_equal(['README.txt', 'Readme.txt', 'readme.txt'], files, 'sort by case')
+    call assert_equal(default, files, 'sort by case')
 
     " 4) sort by ignoring case
     let files = readdirex('Xdir2', 1, #{sort: 'icase'})->map({-> v:val.name})
-    call assert_equal(['README.txt', 'readme.txt', 'Readme.txt'], files, 'sort by icase')
+    call assert_equal(unsorted->sort('i'), files, 'sort by icase')
 
     " 5) Default Collation
     let collate = v:collate

@chrisbra
Copy link
Member Author

thanks everybody, will rework and apply the suggestions later today.

After some discussion on github, it was decided to rework how sorting
should work for the readdirex function call.

The readdirex function now takes an optional argument. Extract from the
help:

>The optional {dict} argument allows for further custom
>values. Currently this is used to specify if and how sorting
>should be performed. The dict can have the following members:
>
>    sort    How to sort the result returned from the system.
>            Valid values are:
>                "none"	do not sort (fastest method)
>                "casa"	sort case senstive
>                "icase" sort case insensitive
>                "collate" sort using collation order of your |locale|
>            Other values are silently ignored.
>
>By default, the entries will be sorted using STRCMP, but one can also
>using strcasecmp and strcoll (using the current locales collation
>order).

Add some tests to verify the behaviour.

Adjust code according to reviews.
@chrisbra
Copy link
Member Author

okay reworked.

With the "time" argument the language used for
strftime() is printed. Technical: LC_TIME.
With the "collate" argument the language used for
collation order is printed. Technical: LC_COLLATE.
Copy link
Member

@dpelle dpelle Jun 15, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This line uses spaces for indentation whereas nearby lines use tabs (inconsistent).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks, fixed it as well.

@k-takata
Copy link
Member

@chrisbra
Copy link
Member Author

This line needs to be updated: https://github.com/vim/vim/pull/6229/files#diff-bea881dfa9626bab7141337b0fcdb23eR7921

Which one? It's not clear from that link.

@k-takata
Copy link
Member

Oh, sorry. The description of readdir():

The list will be sorted (case sensitive).

It should be

 The list will by default be sorted by name (case sensitive), 
 see below for changing the sorting. 

like readdirex().

And readdirex() should be also updated?

vim/runtime/doc/eval.txt

Lines 7978 to 7979 in 3c741f5

The list will by default be sorted by name (case sensitive),
see below for changing the sorting.

Not "below" now, I think.

@codecov
Copy link

codecov bot commented Jun 16, 2020

Codecov Report

Merging #6229 into master will decrease coverage by 0.00%.
The diff coverage is 87.27%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #6229      +/-   ##
==========================================
- Coverage   87.48%   87.47%   -0.01%     
==========================================
  Files         143      143              
  Lines      158424   158475      +51     
==========================================
+ Hits       138590   138626      +36     
- Misses      19834    19849      +15     
Impacted Files Coverage Δ
src/evalfunc.c 96.37% <ø> (ø)
src/evalvars.c 95.98% <ø> (ø)
src/filepath.c 86.23% <77.77%> (-0.19%) ⬇️
src/fileio.c 84.85% <93.75%> (+0.04%) ⬆️
src/cmdexpand.c 92.96% <100.00%> (+<0.01%) ⬆️
src/ex_cmds2.c 90.85% <100.00%> (+0.14%) ⬆️
src/gui.c 63.37% <0.00%> (-0.51%) ⬇️
src/mbyte.c 79.62% <0.00%> (-0.25%) ⬇️
src/sign.c 94.77% <0.00%> (-0.18%) ⬇️
src/window.c 89.55% <0.00%> (-0.11%) ⬇️
... and 10 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update b340bae...51bdd1c. Read the comment docs.

@chrisbra
Copy link
Member Author

all CI pass now

@chrisbra
Copy link
Member Author

I can squash into a single commit and force-push, if this is helpful.

@brammool
Copy link
Contributor

I suppose it's OK to include now. Thanks for fine-tuning it.

@brammool brammool closed this in 84cf6bd Jun 16, 2020
@vim-ml
Copy link

vim-ml commented Jun 16, 2020 via email

@chrisbra chrisbra deleted the collation branch June 16, 2020 19:55
@chrisbra
Copy link
Member Author

Thanks, done: #6278

@vim-ml
Copy link

vim-ml commented Jun 16, 2020 via email

@lacygoill
Copy link

I think this todo item is no longer needed:

Patch to use collation based sorting. (Christian Brabandt, #6229)

@brammool
Copy link
Contributor

I'll update the todo list

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

10 participants