Skip to content

String.split#626

Merged
alainfrisch merged 2 commits intoocaml:trunkfrom
alainfrisch:string_split
Jul 9, 2016
Merged

String.split#626
alainfrisch merged 2 commits intoocaml:trunkfrom
alainfrisch:string_split

Conversation

@alainfrisch
Copy link
Copy Markdown
Contributor

Not wanting to reopen old wounds (#10), but String.split would be a really useful addition to the stdlib.

I propose to add a form where the separator is a single character. Compared to a version with a string delimiter, the advantages of this simpler version are:

  • A self-explanatory type: no need to introduce labelled arguments to disambiguate between two string arguments.
  • A total function that never fails, no special support for the empty separator.
  • Higher-level specification: no need to specify the matching algorithm to specify the behavior when the separator overlaps with itself (nor to introduce two variants for left-to-right and right-to-left traversal).

It's worth noting that both JaneStreet Core and ExtLib/Batteries expose a similar feature. ExtLib's [String.nsplit] has a string separator, but I could not find a single use of it with a long separator ( https://github.com/search?l=ocaml&p=4&q=%22String.nsplit%22&ref=searchresults&type=Code&utf8=%E2%9C%93 ).

Also, a similar function is implemented multiple times in the compiler code base:

  • Misc.Stdlib.String.split.
  • Misc.split.
  • Dumpobj.read_primitive_table
  • Dll.split
  • Primitives.split_string (in debugger), although this seems to be dead code

Not talking about ocamldoc's split_string interesting implementation (it supports multiple character separator, but this shows what people can actually write...).

See also http://rosettacode.org/wiki/Tokenize_a_string#OCaml : several inefficient and uselessly complex implementations.

@mshinwell
Copy link
Copy Markdown
Contributor

mshinwell commented Jun 20, 2016

We already have an implementation of this function in Misc.Stdlib.String.split in the compiler distribution.
(This is not an argument against adding it to the stdlib. My point being that the implementations should be checked to determine which is superior.)

@alainfrisch
Copy link
Copy Markdown
Contributor Author

We already have an implementation of this function in Misc.Stdlib.String.split in the compiler distribution.

Yes, I mentioned it. We also have Misc.split and many more.

@dbuenzli
Copy link
Copy Markdown
Contributor

I have to admit that since I wrote Astring I no longer care about String's largely broken and hopeless design but I'd just mention that having the separator as a string makes it compose well with String.concat and allows to split UTF-8 strings at a given Unicode scalar value.

In any case I will not further comment or discuss this.

@alainfrisch
Copy link
Copy Markdown
Contributor Author

Have you ever seen some code splitting a UTF-8 string at given Unicode non-ASCII scalar value?

@dbuenzli
Copy link
Copy Markdown
Contributor

Have you ever seen some code splitting a UTF-8 string at given Unicode non-ASCII scalar value?

Yes. I did quite some one-shot legacy database migration programs in which fields had to be denormalized. It happened at least once that I had to cut on U+2014 (—, EM DASH) and another time on U+00B6 (¶, PILCROW SIGN). Besides I also had to deal line by line with files I knew they were encoded for Windows.

@bobzhang
Copy link
Copy Markdown
Member

how about such interface:

val split : ?keep_empty:bool -> ?at_most:int -> (char -> bool) -> string -> string list 

@alainfrisch
Copy link
Copy Markdown
Contributor Author

at_most seems rather ad hoc (we don't have such an argument e.g. on List.find_all), and would complexify the implementation (which currently traverses the string right-to-left) and the specification. What would be the use case? Another common operation is to split a string in 2 on a delimiter, but this is better served by another function.

keep_empty: this is a simple postprocessing on the result (|> List.filter ((<>) ""). A lot of other "simple" postprocessing actions (such as trimming whitespaces on each item) could be useful, but it's better done through composition than piling extra optional arguments.

The char predicate instead of the explicit char is interesting, as it covers other variants of the function (as Core's split_on_chars, and passing a function callback should be more efficient than a list of characters). What do other people think?

@dbuenzli
Copy link
Copy Markdown
Contributor

dbuenzli commented Jun 20, 2016

The char predicate instead of the explicit char is interesting, as it covers other variants of the function (as Core's split_on_chars, and passing a function callback should be more efficient than a list of characters). What do other people think?

This is basically Astring.String.fields which I would distinguish from split (cuts in Astring). In any case if you go down the line of char -> bool predicates, Char.equal should be added.

@dbuenzli
Copy link
Copy Markdown
Contributor

@alainfrisch Note that in general if you'd like to improve String I'd like to encourage you to have a look at Astring. It is the result of a long and painful design process that involved reviewing the String APIs of major programming languages trying to find the right balance between minimality and expressiveness while keeping in mind that this is neither a regexp nor a combinator parsing library.

@bobzhang
Copy link
Copy Markdown
Member

bobzhang commented Jun 20, 2016

in other languages, python/nodejs have an optional parameter for maximum split, for example:

> "xxyxxyx".split('y')
[ 'xx', 'xx', 'x' ]
> "xxyxxyx".split('y',2)
[ 'xx', 'xx' ]

I think when we design the library API, maybe we can just tweak based on the design of other industry languages

@c-cube
Copy link
Copy Markdown
Contributor

c-cube commented Jun 20, 2016

@bobzhang it would be better to return some kind of iterator, imho: you can then filter (to remove empty chunks), count, take at most n elements... Much more compositional.

@bobzhang
Copy link
Copy Markdown
Member

@c-cube which is better design is quite subjective, in my opinion, following the established API brings less surprise, btw, composition is not free...

@alainfrisch
Copy link
Copy Markdown
Contributor Author

I might have missed the intended meaning of at_most. @c-cube's note implied that it would simply truncate the resulting list. I interpreted it rather at "stopping the split after N instances of the separator has been found" (so the last chunk can still contain the separator).

@bobzhang
Copy link
Copy Markdown
Member

even Java has such interface:

public String[] split(String regex,
             int limit)
Splits this string around matches of the given regular expression.

@trefis
Copy link
Copy Markdown
Contributor

trefis commented Jun 20, 2016

even Java has such interface

Oh well, clearly we have no choice then but to have it in OCaml as well.

@xavierleroy
Copy link
Copy Markdown
Contributor

The Java interface is interesting not because it has a limit parameter, but because it splits on a regular expression. It is my deep belief that splitting on a single character is useless and most actual uses of string splitting need a regexp instead, e.g. "split on whitespace" [ \t]+ or "split on comma followed by whitespace" ,[ \t]*. So, to me, this functionality belongs in a regexp library, not in String. The ancient Str library has it. I don't know about more modern regexp libraries, but if they don't have "split" they should.

@alainfrisch
Copy link
Copy Markdown
Contributor Author

most actual uses of string splitting need a regexp instead

If this was the case, I'd buy the argument that it is not worth providing a simpler function for the case of a character delimiter. But is this really the case? Just in the core distribution, we have multiple implementations of split-on-char. And out of 19 uses of Str.split in ocamldoc, only 4 of them use a regexp. A very quick look at all Str.split on Github suggests that the case of splitting on one character is the most common; other cases include splitting on multiple possible characters (one variant suggested by @bobzhang), or on a "+" iteration (i.e. composing with List.filter ((<>) ""). Other cases are extremely rare (and probably buggy). The case for non-regexp split functions is also supported by the existence of such functions in Core, Batteries and even @dbuenzli's AString.

I don't see the point of forcing regexps on users (a sub-language to learn -- and it's not like there were a single standard syntax of regexps -- with its own escaping rules; a dependency to a library; and probably much slower), when one can provide a 10-line function that covers many useful cases.

@alainfrisch
Copy link
Copy Markdown
Contributor Author

So, concerning at_most, the two semantics exist:

  • In Javascript, the second argument of .split truncates the result: "a b c d".split(" ", 2) returns [ "a", "b" ]. Remaining characters are discarded.
  • In Java or C#, the equivalent would return ["a","b c d"]. Same in Python (although the integer passed corresponds to the maximum number of matched occurrences of the delimiter; the returned list can have one more element).

@alainfrisch
Copy link
Copy Markdown
Contributor Author

"+" iteration (i.e. composing with List.filter ((<>) "")

Answering to myself: of course, this is not strictly speaking true, since splitting on e.g. [\t]+ can leave empty chunks in the result, but only possibly the first and/or last one (which sounds rather irregular); I guess most code that splits on such a regexp would be perfectly fine -- if not better -- if empty chunks were removed.

@xavierleroy
Copy link
Copy Markdown
Contributor

You raise another thorny issue: what to do with occurrences of the delimiter at the beginning or end of the string? When splitting on whitespace, it makes sense to ignore both. When splitting on "," (CSV-style), it makes sense to ignore them at the end but not at the beginning. I'm sure there are situations where all delimiters must be honored. For an example, here is how Perl handles things: http://perldoc.perl.org/functions/split.html

(See, it's hard to design library functions...)

@bluddy
Copy link
Copy Markdown
Contributor

bluddy commented Jun 21, 2016

@xavierleroy: Agreed. And ideally we want to be consistent with other similar decisions made in the stdlib design to make it easier to remember how things work (ie. unlike C's standard library).

@alainfrisch
Copy link
Copy Markdown
Contributor Author

You raise another thorny issue: what to do with occurrences of the delimiter at the beginning or end of the string?

I don't see that as a thorny issue. The proposed function has a clear semantics, specified by very simple invariants: (i) the size of the result list is 1 + the number of occurrences of the separator character in the string (or alternatively : (i') no chunk in the result contains the separator character); (ii) concatenating the result list using the separator gives back the original string. I don't see why we should pile other orthogonal features -- filtering (removing leading/trailing/all empty chunks; truncating the list) or post-processing (trimming whitespaces) -- when they can easily be implemented outside this function.

@bobzhang
Copy link
Copy Markdown
Member

@alainfrisch I like Java/C# semantics of at_most

@mshinwell
Copy link
Copy Markdown
Contributor

I must say, I agree with @alainfrisch here. I use the function having that semantics on a pretty frequent basis and it is quite frustrating to be without it.

@ygrek
Copy link
Copy Markdown
Contributor

ygrek commented Jun 24, 2016

FTR, extlib has

# String.split;;
- : string -> string -> string * string = <fun>
# String.nsplit;;
- : string -> string -> string list = <fun>

and for speed purposes we also have

# nsplitc;;
- : string -> char -> string list = <fun>
# nsplitc_rev;;
- : string -> char -> string list = <fun>

For the frequency of usage nsplitc wins over split which wins over nsplit.

@mshinwell
Copy link
Copy Markdown
Contributor

Those [split] and [nsplit] functions have bad interfaces. Labelled arguments, for example, should be used to avoid confusion as to what the arguments mean.

I maintain that the functions splitting on a single character are still sufficiently useful to justify their inclusion in the stdlib.

@alainfrisch
Copy link
Copy Markdown
Contributor Author

@ygrek Thanks for the statistics. I could not find nsplitc in Extlib. Are you referring to https://github.com/ahrefs/devkit?

@c-cube Considering that we are apparently not going to have standard iterators in the stdlib soon, do you have an opinion on the current proposal (returning a list)?

@xavierleroy Do you confirm your opposition to the feature ("useless, splitting on a regexp is the most common need")? Clearly a regexp-based version will not be added to the stdlib, so the question is really whether it's better to add a more limited version or nothing (which will likely lead to users reimplementing the simple version -- I doubt people would bring in a regexp engine for the simple case, except perhaps out of laziness if they already depend on such a library and don't care about performance).

@c-cube
Copy link
Copy Markdown
Contributor

c-cube commented Jun 28, 2016

I don't have a strong opinion, I've been using an overlay over string for years anyway. Iterators are better if you want to do things like counting or mapping (especially if the iterator returns slices instead of copies), but a simple list-based API will do in the common cases. I think it's useful to have even a char-splitting case, even if just to split on lines.

@paurkedal
Copy link
Copy Markdown

paurkedal commented Jul 8, 2016

@paurkedal keep_empty has a default value which is false or true, so it's fine for users to ignore that. can you elaborate c, limit means the output can be at most of length limit, why is it ambiguous?

Well, one thing I like about the OCaml standard library is the lack of arbitrary additions. Keeping the library minimal but sufficient makes it easier to find and remember what is needed for each task. On your second question cf Alain's reply.

@dbuenzli
Copy link
Copy Markdown
Contributor

dbuenzli commented Jul 8, 2016

@alainfrisch I'm not sure you understand correctly what Astring.String.fields does. It doesn't split on characters that satisfy the predicate, it allows to define separators in terms of consecutive characters for which the predicate is true. In the particular case of Char.Ascii.is_white (the default), this will tokenize on white space. I stole that one (and the name) from SML.

Btw. I think that this discussion is hopeless. People always want different things for splitting functions and you won't be able to satisfy them all. A funny answer to that problem is Haskell's Data.List.Split which defines a few combinators in terms of which all splitting variants can be defined...

The approach I took in Astring is to simply provide only a few simple ones. Programmers that end up using complex splitting strategies are often like people who use regexps for parsing non-regular languages, they should not be doing this.

@alainfrisch
Copy link
Copy Markdown
Contributor Author

alainfrisch commented Jul 8, 2016

I'm not sure you understand correctly what Astring.String.fields does.

Quite frankly, indeed, I cannot really make sense of:

fields ~empty ~is_sep s is the list of (possibly empty) substrings made of bytes that are not separated by a byte for which is_sep is true.

I don't see how this specifies the function's behavior (this would allow it to add arbitrary amounts of empty strings in the result -- they are all made of bytes that are not separated etc etc).

it allows to define separators in terms of consecutive characters for which the predicate is true

So a sequence of consecutive characters that satisfy the predicate count as one separator (i.e. does not produce empty intermediate strings)? Again, I don't see how this is specified in the text. And assuming this is the case, the only empty strings that could remain (and be discarded by ~empty:false) would be at the beginning or the end?

The approach I took in Astring is to simply provide only a few simple ones.

You were talking about API design, so I'm genuinely interested to hear about your design methodology, and how you reached the conclusions that lead to the design of AString, for instance. You mentioned looking at what other languages do, which is certainly a useful piece of information (although not likely to lead to the cleaner design). The limit argument, for instance, is rather wide-spread in other languages (with various different semantics), apparently more than the "drop empty chunks" feature. Why did you keep the latter and not the former?

@alainfrisch
Copy link
Copy Markdown
Contributor Author

I stole that one (and the name) from SML.

Just checked, in SML, fields (as opposed to tokens) splits on invididual character delimiters, so consecutive such characters would produce empty chunks, which is exactly what I understood initially. As far as I understand the SML doc, tokens can be expressed as fields followed by removing empty chunks.

@dbuenzli
Copy link
Copy Markdown
Contributor

dbuenzli commented Jul 8, 2016

You were talking about API design, so I'm genuinely interested to hear about your design methodology, and how you reached the conclusions that lead to the design of AString, for instance.

I can only tell you about the design goals. The actual design choices I tend to forget them once the work is done (I have bad memory). Though through discussion I can recover them, see below.

So the design goal was to devise a minimal set of composable functions to provide simple, index-free string processing while keeping in mind that this is neither a regexp nor a combinator parser library ---hence purposedly of limited nature so that you are not tempted to use the wrong tool for the wrong job.

The limit argument, for instance, is rather wide-spread in other languages (with various different semantics), apparently more than the "drop empty chunks" feature. Why did you keep the latter and not the former?

This was already partially answered in this message. empty is here in fact to provide the tokenization feature. The documentation of Astring.String.fields is wrong... see below

Just checked, in SML, fields (as opposed to tokens) splits on invididual character delimiters, so consecutive such characters would produce empty chunks, which is exactly what I understood initially. As far as I understand the SML doc, tokens can be expressed as fields followed by removing empty chunks.

Indeed and it seems that this is what Astring.String.fields does despite what its documentation says... (I guess it's a left over a previous misguided design that I forgot to update). See the test suite.

So to sum-up here could be one design rationale:

  • No limit because it's going to be a small number. In that case better turn to Astring.String.cut or
    Astring.String.span.
  • empty in the case of Astring.String.fields to be able to use the function to devise simple tokenizers without having to filter the list afterwards.

EDIT: I updated the doc in fact it was not wrong, it was only confusing to myself.

@alainfrisch
Copy link
Copy Markdown
Contributor Author

Basically, your argument against limit is that the gain in terms of performance (even with substrings, you need to allocate at each step) is usually not worth the additional complexity in the API, while for empty it is. The choice to include one but not the other is based on your intuition about respective usage scenarios for those functions. I respect this intuition, but I also observe that other people disagree (they did not ask for limit just for fun) and other languages made the opposite choice. Designing an API is full of such decisions, but then I don't see why "designing a string API" would be simpler than discussion this specific function as you suggested earlier. "Designing AString" required to make some non-canonical choices and if this had been done through a community-driven process, it would not have been easier than the process we observe here. In other words, this is more about "single-man opinionated design" vs "collaborative design" than about "designing an API" vs "designing a single function".

@alainfrisch
Copy link
Copy Markdown
Contributor Author

alainfrisch commented Jul 8, 2016

So, let me summarize the current state of the discussion (and please don't hesitate to comment if I forgot an important point):

  • Everyone (except @xavierleroy) who participated to this discussion agree that a function to "split strings on delimiters" would be a nice addition to the stdlib.
  • Single-char vs string delimiters: no strong opponent to the single-char version (even @dbuenzli's AString has function that splits on single-byte characters).
  • Fixed char delimited vs char predicate: some support for the predicate version, although concerns are expressed that it would make the function less easy (and also slower) for common cases. Core's solution is to provide both variants, which is perhaps the best direction in my opinion. Is anyone against this approach?
  • limit or not: several people insisted that it is a useful addition, and many other languages support this feature on their "canonical" string splittting function (which is a good indication that this is a useful-enough feature), nobody argued against that "starting from the left" is the most important case. @dbuenzli maintains that this is a corner case use better left to the user. I'm personally undecided as whether this should go in the stdlib together with this current PR or wait later, with a slight preference for inclusion.
  • drop_empty or not: this can be trivially implemented by post-processing, but @bobzhang and @dbuenzli both insist that the feature is useful. It is available in some other languages, but not so many. Interaction with limit is not completely "canonical". My personal preference would be to leave this outside the current PR and see if there is enough demand later.

@bluddy
Copy link
Copy Markdown
Contributor

bluddy commented Jul 8, 2016

@alainfrisch I agree that a simple version would be nice to have. One problem with OCaml's optional arguments is function-feature-bloat. Every function can be turned into an amalgamation of many different ones, going against the general philosophy of keeping functions simple and doing one thing. This is aggravated by the lack of function overloading in OCaml.

Let's make the simple version, and add optional arguments later as needed.

@dbuenzli
Copy link
Copy Markdown
Contributor

dbuenzli commented Jul 9, 2016

Designing an API is full of such decisions, but then I don't see why "designing a string API" would be simpler than discussion this specific function as you suggested earlier.

It wouldn't be simpler, you'd have more choices to perform. However you would be able to justify the lack or presence of features for a function in terms of other existing function -- which in your case you don't have in the API at the moment.

In other words, this is more about "single-man opinionated design" vs "collaborative design" than about "designing an API" vs "designing a single function".

No, see above.

I respect this intuition, but I also observe that other people disagree (they did not ask for limit just for fun) and other languages made the opposite choice.

Maybe not for fun but likely because they saw it in other APIs. It's not because it's there elsewhere that it's a useful or pertinent idea (you doubtful conclusion). Lot of design happens by looking at other design, which is normal and fine, but you have to question these designs otherwise you just end up copying errors or doubtful choices. Also API design is much like UI design you have to listen to users, but not trust them too much in knowing what they actually want ("faster horses").

Single-char vs string delimiters: no strong opponent to the single-char version (even @dbuenzli's AString has function that splits on single-byte characters).

Note: in a different way, since it uses a predicate to determine separators.

drop_empty or not: this can be trivially implemented by post-processing, but @bobzhang and @dbuenzli both insist that the feature is useful.

I do not insist, you saw it in astring and asked me about it. I don't think it is essential for cuts (separator is a string), I think it is useful for fields (separator is a byte determined by a predicate) since this give you a cheap tokenization function (e.g. split on white space). I think the reason why I have it in cuts is to be consistent with fields.

So rather than having my thought misrepresented I'll simply say that my opinion on this is that something that has the signatures of both #10 and #13 should be merged (if we can have labels in string.mli #10 should add a label for the separator). This would be useful additions that would be consistent with the current design of the string module.

@alainfrisch
Copy link
Copy Markdown
Contributor Author

I've spend some more time browsing (with Github search) uses of string splitting functions provided by existing OCaml string libraries, and this confirmed that the case of a fixed single-byte separator is common enough to justify a dedicated function in the stdlib.

I also still believe that splitting on a fixed multi-byte separator would be useful only in very limited cases. Since this form does not have an obvious "best" implementation (a very short separator would be better dealt with the naive algorithm, but a longer one would benefit from more complex algorithms), is not total (fails on empty separator) and is more difficult to specify in a non-algorithmic way (because of the case of self-overlapping separators), I still prefer the single-byte version.

The discussion has derived around possible extensions of this function (the limit argument, dropping empty elements, single-byte predicate instead of a fixed byte), but considering how frequent the simplest case is and how difficult it is to make progress on this topic, I prefer to move forward without these extensions; they can be introduced later, either with extra optional arguments or with new functions (which would not supersede the simple one).

@alainfrisch alainfrisch merged commit 0efbba1 into ocaml:trunk Jul 9, 2016
@gasche
Copy link
Copy Markdown
Member

gasche commented Jul 9, 2016

Would it be possible to name this function split_char or split_on_char rather than split? Batteries has split : string -> by:string -> string * string and I would rather avoid having to deal with the name conflict.

Re. other libraries: Batteries has {,r}split : string -> by:string -> string * string and nsplit : string -> by:string -> string list. Core uses split : string -> on:char -> string list and {l,r}split2_exn : string -> on:char -> string * string. CCString uses Split.list : by:string -> substring list, and AString uses cut : ?rev:bool -> sep:string -> string -> (string * string) option and cuts : ?rev:bool -> ?empty:bool -> sep:string -> string -> string list.

@alainfrisch
Copy link
Copy Markdown
Contributor Author

I'm fine with split_on_char. Waiting until Monday in case someone comes up with a better idea.

@gasche
Copy link
Copy Markdown
Member

gasche commented Jul 11, 2016

Thanks!

@lefessan
Copy link
Copy Markdown
Contributor

I am personally for String.split, and I really dislike String.split_on_char. Why not String.split_on_a_separator_that_is_a_char_to_obtain_a_list_of_strings ? The stdlib is supposed to provide the shortest names for the most common functions, we shouldn't care about compatibility with Batteries, especially when Batteries has been warned for years that it shouldn't include String or advertize itself as a strict superset of the stdlib (something that it should never have promised to its users), because doing so is known to create incompatibilities with every new version of OCaml.

@hcarty
Copy link
Copy Markdown
Member

hcarty commented Jul 11, 2016

One reason to use split_on_char that has nothing to do with other libraries is because of the discussion in this thread regarding splits on char vs string vs predicate. If there's a reasonable possibility that other non-character-based split_* functions may be added to String then it makes sense to have the name be somewhat explicit.

@gasche
Copy link
Copy Markdown
Member

gasche commented Jul 11, 2016

Also the name nicely suggests that the separator character is the first parameter; there would be a type error in terms of mistake, but it's still nice to guess without having to look back at the signature.

@Fourchaux
Copy link
Copy Markdown
Contributor

Fourchaux commented Jul 11, 2016

@lefessan : about your Batteries "warning/ OCaml incompatibility" criticism :
it seems rather that this function should exist in the Stdlib for YEARS and therefore Batteries (and many, many others) would never had to propose this kind of basic and very useful functions.

camlspotter pushed a commit to camlspotter/ocaml that referenced this pull request Oct 17, 2017
camlspotter pushed a commit to camlspotter/ocaml that referenced this pull request Oct 17, 2017
yallop pushed a commit to yallop/ocaml that referenced this pull request Mar 22, 2018
EduardoRFS pushed a commit to esy-ocaml/ocaml that referenced this pull request Dec 17, 2021
stedolan pushed a commit to stedolan/ocaml that referenced this pull request May 24, 2022
)

Co-authored-by: Luke Maurer <Luke.Maurer@alumni.carleton.edu>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.