|
We already have an implementation of this function in |
Yes, I mentioned it. We also have |
|
I have to admit that since I wrote In any case I will not further comment or discuss this. |
|
Have you ever seen code that splits a UTF-8 string at a given non-ASCII Unicode scalar value? |
Yes. I wrote quite a few one-shot legacy database migration programs in which fields had to be denormalized. It happened at least once that I had to cut on U+2014 (—, EM DASH) and another time on U+00B6 (¶, PILCROW SIGN). Besides, I also had to process, line by line, files that I knew were encoded for Windows. |
|
How about an interface like this: val split : ?keep_empty:bool -> ?at_most:int -> (char -> bool) -> string -> string list |
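A minimal sketch of how a function with this signature might behave. The thread does not specify the semantics, so everything here is an assumption: the name `split_pred` is hypothetical, `keep_empty` defaults to `true`, and `at_most` caps the number of chunks while leaving the unsplit remainder in the last chunk (other conventions, such as truncating the list, are equally plausible).

```ocaml
(* Hypothetical sketch of the proposed interface; not an existing stdlib function.
   Assumptions: [keep_empty] defaults to true; [at_most] is the maximum number of
   chunks returned, the last one holding the unsplit remainder. *)
let split_pred ?(keep_empty = true) ?at_most (is_sep : char -> bool) (s : string)
    : string list =
  let len = String.length s in
  (* A chunk is kept unless it is empty and empty chunks were not requested. *)
  let wanted chunk = keep_empty || chunk <> "" in
  (* With [at_most = m], stop splitting once m - 1 chunks have been emitted. *)
  let limit_reached n =
    match at_most with Some m -> n >= m - 1 | None -> false
  in
  (* [start] is where the current chunk begins; [n] counts emitted chunks. *)
  let rec go acc n start i =
    if i >= len || limit_reached n then
      let last = String.sub s start (len - start) in
      List.rev (if wanted last then last :: acc else acc)
    else if is_sep s.[i] then
      let chunk = String.sub s start (i - start) in
      if wanted chunk then go (chunk :: acc) (n + 1) (i + 1) (i + 1)
      else go acc n (i + 1) (i + 1)
    else go acc n start (i + 1)
  in
  go [] 0 0 0

let is_blank c = c = ' ' || c = '\t'
let words = split_pred ~keep_empty:false is_blank " a  b\t"
let two = split_pred ~at_most:2 (fun c -> c = 'y') "xxyxxyx"
```

Under these assumptions `words` holds the two non-empty fields and `two` holds `"xx"` followed by the untouched remainder `"xxyx"`.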
|
The char predicate instead of the explicit char is interesting, as it covers other variants of the function (as Core's |
This is basically |
|
@alainfrisch Note that in general if you'd like to improve |
|
In other languages, python/nodejs have an optional parameter for the maximum number of splits, for example:
> "xxyxxyx".split('y')
[ 'xx', 'xx', 'x' ]
> "xxyxxyx".split('y',2)
[ 'xx', 'xx' ]
I think that when we design the library API, we could simply base it on the design used by other mainstream languages. |
|
@bobzhang it would be better to return some kind of iterator, imho: you can then filter (to remove empty chunks), count, take at most |
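A rough sketch of the iterator idea, assuming an OCaml version whose standard library provides `Seq` and `String.index_from_opt`. The name `split_seq` is hypothetical and only splits on a single character; it illustrates how filtering and counting can be layered on top of a lazy result.

```ocaml
(* Hypothetical helper: lazily yields the chunks of [s] separated by [sep].
   Nothing is computed until the sequence is consumed. *)
let split_seq (sep : char) (s : string) : string Seq.t =
  let len = String.length s in
  let rec from start () =
    if start > len then Seq.Nil
    else
      (* Position of the next separator, or the end of the string. *)
      let stop =
        match String.index_from_opt s start sep with
        | Some i -> i
        | None -> len
      in
      Seq.Cons (String.sub s start (stop - start), from (stop + 1))
  in
  from 0

(* Dropping empty chunks is an ordinary filter on the sequence. *)
let non_empty =
  split_seq ',' "a,,b," |> Seq.filter (fun c -> c <> "") |> List.of_seq

(* Counting chunks does not require building a list. *)
let count = Seq.fold_left (fun n _ -> n + 1) 0 (split_seq ',' "a,,b,")
```

The trade-off is that callers who just want a list need one extra `List.of_seq`.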
|
@c-cube |
|
I might have missed the intended meaning of |
|
Even Java has such an interface:
public String[] split(String regex, int limit)
Splits this string around matches of the given regular expression. |
Oh well, clearly we have no choice then but to have it in OCaml as well. |
|
The Java interface is interesting not because it has a |
If this was the case, I'd buy the argument that it is not worth providing a simpler function for the case of a character delimiter. But is this really the case? Just in the core distribution, we have multiple implementations of split-on-char. And out of 19 uses of I don't see the point of forcing regexps on users (a sub-language to learn -- and it's not like there were a single standard syntax of regexps -- with its own escaping rules; a dependency to a library; and probably much slower), when one can provide a 10-line function that covers many useful cases. |
|
So, concerning
|
Answering to myself: of course, this is not strictly speaking true, since splitting on e.g. |
|
You raise another thorny issue: what to do with occurrences of the delimiter at the beginning or end of the string? When splitting on whitespace, it makes sense to ignore both. When splitting on "," (CSV-style), it makes sense to ignore them at the end but not at the beginning. I'm sure there are situations where all delimiters must be honored. For an example, here is how Perl handles things: http://perldoc.perl.org/functions/split.html (See, it's hard to design library functions...) |
|
@xavierleroy: Agreed. And ideally we want to be consistent with other similar decisions made in the stdlib design to make it easier to remember how things work (i.e. unlike C's standard library). |
I don't see that as a thorny issue. The proposed function has clear semantics, specified by very simple invariants: (i) the size of the result list is 1 + the number of occurrences of the separator character in the string (or alternatively: (i') no chunk in the result contains the separator character); (ii) concatenating the result list using the separator gives back the original string. I don't see why we should pile on other orthogonal features -- filtering (removing leading/trailing/all empty chunks; truncating the list) or post-processing (trimming whitespace) -- when they can easily be implemented outside this function. |
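The two invariants above can be sketched and checked as follows. This is a naive illustration, not the implementation under review: `split_on_char_sketch`, `count_char` and `check_invariants` are hypothetical names introduced here.

```ocaml
(* Hypothetical naive splitter on a single character, scanning right to left
   so that the result list is built in order without reversing. *)
let split_on_char_sketch (sep : char) (s : string) : string list =
  (* [stop] is the index just past the end of the current chunk. *)
  let rec go acc stop i =
    if i < 0 then String.sub s 0 stop :: acc
    else if s.[i] = sep then
      go (String.sub s (i + 1) (stop - i - 1) :: acc) i (i - 1)
    else go acc stop (i - 1)
  in
  go [] (String.length s) (String.length s - 1)

(* Number of occurrences of [c] in [s]. *)
let count_char (c : char) (s : string) : int =
  let n = ref 0 in
  String.iter (fun x -> if x = c then incr n) s;
  !n

(* Invariant (i): length is 1 + number of separators.
   Invariant (ii): concatenating with the separator restores the input. *)
let check_invariants (sep : char) (s : string) : bool =
  let chunks = split_on_char_sketch sep s in
  List.length chunks = 1 + count_char sep s
  && String.concat (String.make 1 sep) chunks = s

(* The orthogonal features stay outside the function. *)
let fields = split_on_char_sketch ',' " a ,, b "
let trimmed = List.map String.trim fields
let without_empty = List.filter (fun c -> c <> "") trimmed
```

Filtering and trimming are then plain `List.filter` and `List.map` calls on the result, as the comment suggests.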
|
@alainfrisch I like Java/C# semantics of |
|
I must say, I agree with @alainfrisch here. I use the function having that semantics on a pretty frequent basis and it is quite frustrating to be without it. |
|
FTR, extlib has and for speed purposes we also have For the frequency of usage |
|
Those [split] and [nsplit] functions have bad interfaces. Labelled arguments, for example, should be used to avoid confusion as to what the arguments mean. I maintain that the functions splitting on a single character are still sufficiently useful to justify their inclusion in the stdlib. |
|
@ygrek Thanks for the statistics. I could not find @c-cube Considering that we are apparently not going to have standard iterators in the stdlib soon, do you have an opinion on the current proposal (returning a list)? @xavierleroy Do you confirm your opposition to the feature ("useless, splitting on a regexp is the most common need")? Clearly a regexp-based version will not be added to the stdlib, so the question is really whether it's better to add a more limited version or nothing (which will likely lead to users reimplementing the simple version -- I doubt people would bring in a regexp engine for the simple case, except perhaps out of laziness if they already depend on such a library and don't care about performance). |
|
I don't have a strong opinion, I've been using an overlay over string for years anyway. Iterators are better if you want to do things like counting or mapping (especially if the iterator returns slices instead of copies), but a simple list-based API will do in the common cases. I think it's useful to have even a char-splitting case, even if just to split on lines. |
Well, one thing I like about the OCaml standard library is the lack of arbitrary additions. Keeping the library minimal but sufficient makes it easier to find and remember what is needed for each task. On your second question cf Alain's reply. |
|
@alainfrisch I'm not sure you understand correctly what Btw. I think that this discussion is hopeless. People always want different things for splitting functions and you won't be able to satisfy them all. A funny answer to that problem is Haskell's The approach I took in |
Quite frankly, indeed, I cannot really make sense of:
I don't see how this specifies the function's behavior (this would allow it to add arbitrary amounts of empty strings in the result -- they are all made of bytes that are not separated etc etc).
So a sequence of consecutive characters that satisfy the predicate counts as one separator (i.e. does not produce empty intermediate strings)? Again, I don't see how this is specified in the text. And assuming this is the case, the only empty strings that could remain (and be discarded by
You were talking about API design, so I'm genuinely interested to hear about your design methodology, and how you reached the conclusions that led to the design of AString, for instance. You mentioned looking at what other languages do, which is certainly a useful piece of information (although not likely to lead to the cleanest design). The limit argument, for instance, is rather widespread in other languages (with various different semantics), apparently more than the "drop empty chunks" feature. Why did you keep the latter and not the former? |
Just checked, in SML, |
I can only tell you about the design goals. I tend to forget the actual design choices once the work is done (I have a bad memory), though I can recover them through discussion, see below. So the design goal was to devise a minimal set of composable functions to provide simple, index-free string processing while keeping in mind that this is neither a regexp nor a combinator parser library, hence purposely of limited nature so that you are not tempted to use the wrong tool for the wrong job.
This was already partially answered in this message.
Indeed and it seems that this is what So to sum-up here could be one design rationale:
EDIT: I updated the doc in fact it was not wrong, it was only confusing to myself. |
|
Basically, your argument against |
|
So, let me summarize the current state of the discussion (and please don't hesitate to comment if I forgot an important point):
|
|
@alainfrisch I agree that a simple version would be nice to have. One problem with OCaml's optional arguments is function-feature-bloat. Every function can be turned into an amalgamation of many different ones, going against the general philosophy of keeping functions simple and doing one thing. This is aggravated by the lack of function overloading in OCaml. Let's make the simple version, and add optional arguments later as needed. |
It wouldn't be simpler, you'd have more choices to perform. However you would be able to justify the lack or presence of features for a function in terms of other existing function -- which in your case you don't have in the API at the moment.
No, see above.
Maybe not for fun but likely because they saw it in other APIs. It's not because it's there elsewhere that it's a useful or pertinent idea (your doubtful conclusion). A lot of design happens by looking at other designs, which is normal and fine, but you have to question these designs, otherwise you just end up copying errors or doubtful choices. Also, API design is much like UI design: you have to listen to users, but not trust them too much in knowing what they actually want ("faster horses").
Note: in a different way, since it uses a predicate to determine separators.
I do not insist, you saw it in astring and asked me about it. I don't think it is essential for So rather than having my thought misrepresented I'll simply say that my opinion on this is that something that has the signatures of both #10 and #13 should be merged (if we can have labels in |
|
I've spent some more time browsing (with Github search) uses of string splitting functions provided by existing OCaml string libraries, and this confirmed that the case of a fixed single-byte separator is common enough to justify a dedicated function in the stdlib. I also still believe that splitting on a fixed multi-byte separator would be useful only in very limited cases. Since this form does not have an obvious "best" implementation (a very short separator would be better handled by the naive algorithm, but a longer one would benefit from more complex algorithms), is not total (it fails on an empty separator) and is more difficult to specify in a non-algorithmic way (because of the case of self-overlapping separators), I still prefer the single-byte version. The discussion has drifted toward possible extensions of this function (the limit argument, dropping empty elements, a single-byte predicate instead of a fixed byte), but considering how frequent the simplest case is and how difficult it is to make progress on this topic, I prefer to move forward without these extensions; they can be introduced later, either with extra optional arguments or with new functions (which would not supersede the simple one). |
|
Would it be possible to name this function Re. other libraries: Batteries has |
|
I'm fine with |
|
Thanks! |
|
I am personally for |
|
One reason to use |
|
Also, the name nicely suggests that the separator character is the first parameter; a mistake would produce a type error anyway, but it's still nice to be able to guess without having to look back at the signature. |
|
@lefessan : about your Batteries "warning/ OCaml incompatibility" criticism : |
String.split
Add Obj.drop_continuation
Not wanting to reopen old wounds (#10), but String.split would be a really useful addition to the stdlib.
I propose to add a form where the separator is a single character. Compared to a version with a string delimiter, the advantages of this simpler version are:
It's worth noting that both JaneStreet Core and ExtLib/Batteries expose a similar feature. ExtLib's [String.nsplit] has a string separator, but I could not find a single use of it with a long separator ( https://github.com/search?l=ocaml&p=4&q=%22String.nsplit%22&ref=searchresults&type=Code&utf8=%E2%9C%93 ).
Also, a similar function is implemented multiple times in the compiler code base:
Not talking about ocamldoc's split_string interesting implementation (it supports multiple character separator, but this shows what people can actually write...).
See also http://rosettacode.org/wiki/Tokenize_a_string#OCaml : several inefficient and uselessly complex implementations.