Skip to content

make string/bytes distinguishable in the underlying compiler implementation#596

Merged
damiendoligez merged 8 commits intoocaml:trunkfrom
rescript-lang:safe_string
Aug 8, 2016
Merged

make string/bytes distinguishable in the underlying compiler implementation#596
damiendoligez merged 8 commits intoocaml:trunkfrom
rescript-lang:safe_string

Conversation

@bobzhang
Copy link
Copy Markdown
Member

This is one part of the patch required for bucklescript compiler, but I think it would be generally helpful to other backends.

The patch will be composed of such components:

  1. All the lambda primtives (Pstring_, Pbytes_)
  2. Runtime c function (for example, caml_blit_[bytes|string]
  3. Adapt the stdlib to pick the right primitives in string/bytes module
  4. Propagate the primitives in native backend, for this revision, string|bytes will be treated in the same way in native backend

It is quite a large patch, but the patch itself is not complex

@bobzhang
Copy link
Copy Markdown
Member Author

@damiendoligez are you happy with this direction?
@chambart would you have a look at the propagation in middle_end/asmcomp? Thanks

@bobzhang bobzhang changed the title [WIP] make string/bytes distinguishable in the underlying compiler implementation make string/bytes distinguishable in the underlying compiler implementation May 25, 2016
@bobzhang
Copy link
Copy Markdown
Member Author

bobzhang commented May 25, 2016

ready for review.

two minor issues:

external caml_create_string : int -> bytes
external caml_fill_string : bytes -> int -> int -> char -> unit

The name is a bit misleading, however, if we rename it into caml_create_bytes, it might break backward compatiblity, @alainfrisch what do you think?

comp_init env sz decl_size
end
| Lprim((Pidentity | Popaque), [arg]) ->
| Lprim((Pidentity | Popaque | Pbytes_to_string | Pbytes_of_string), [arg]) ->
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do you treat Pbytes_to_string & Pbytes_of_string specially ?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this revision does do any change to the code gen, for both native and bytecode gen, but it is a step to propagate it into bytecode or native backend, you can send another PR after this is merged

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok. Then I don't understand why you have some changes in byterun/io.c and byterun/str.c

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the goal is propagating the information while not impacting the semantics of code generator for this revision(it already touched 25 files and needs bootstrapping), once we change the code generation, it might introduce some subtle bugs if it is not complete.
As I said, it is a good step to propagate it into bytecode/native backend once you have confidence that it is complete, but I won't do it in this revision

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok. Then I don't understand why you have some changes in byterun/io.c and byterun/str.c

I understand that you don't want to change the code generation, but in that case you probably don't want to change the runtime either. that would make this revision touch 23 files only

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

those two primitives are used in Pervasives.unsafe_output, Pervasives.unsafe_output_string, String.unsafe_blit, Bytes.unsafe_blit
The changes are minimal

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can see there are used.
It just feels inconsistent to only partially add primitives for bytes.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

btw, what about external equal : t -> t -> bool = "caml_string_equal" used in bytes.ml ?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good point, I missed it, will add it -- this is the reason I don't want to change the codegen for this revision, in case I missed something

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't modify the runtime in this feature, and you won't miss anything.
Anyway, it does not really matter.

@damiendoligez
Copy link
Copy Markdown
Member

I'm OK with this idea, but I'd like to hear the opinions of back-end experts (@hhugo, @mshinwell @chambart, anyone else).

One remark: I don't understand why you keep Pstringsets and Pstringsetu : the whole point of the bytes/string distinction is to get rid of them, and indeed they are not used in the standard library and not reachable by user code.

@bobzhang
Copy link
Copy Markdown
Member Author

@damiendoligez I will remove Pstringsets and Pstringsetu

@bobzhang
Copy link
Copy Markdown
Member Author

@damiendoligez removing Pstringsets will break some backward compatibility, there might be some code using %string_safe_set, are you okay with that?

@damiendoligez
Copy link
Copy Markdown
Member

@damiendoligez removing Pstringsets will break some backward compatibility, there might be some code using %string_safe_set, are you okay with that?

Since it's provided as Bytes.set (and String.set) there's no excuse for using %string_safe_set directly, so I'm OK with breaking such code.

Or you might want to translate it into Pbytessets instead. In any case, a back-end that separates immutable string from mutables bytes will have to fail on Pstringsets (or map it to Pbytessets instead).

@bobzhang
Copy link
Copy Markdown
Member Author

bobzhang commented Jun 2, 2016

@damiendoligez removed Pstringset(s|u) per suggestions

@bobzhang
Copy link
Copy Markdown
Member Author

ping @damiendoligez , is it good for merge?

stdlib/bytes.ml Outdated
external set : bytes -> int -> char -> unit = "%string_safe_set"
external get : bytes -> int -> char = "%bytes_safe_get"
external set : bytes -> int -> char -> unit = "%bytes_safe_set"
external create : int -> bytes = "caml_create_string"
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

external create : int -> bytes = "caml_create_bytes" ?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

see my comments above:

external caml_create_string : int -> bytes
external caml_fill_string : bytes -> int -> int -> char -> unit

indeed, the name is a bit misleading, either we stick to the name for better backward compatibility or we change the c primitive names or we add a duplication, I am happy with either, @hhugo @damiendoligez what do you think

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you've started partially duplicating string primitives : caml_ml_output_bytes caml_blit_bytes caml_caml_bytes_equal.
I would duplicate all or none.

@bobzhang
Copy link
Copy Markdown
Member Author

@hhugo replaced caml_fill_string, caml_create_string per suggestions

@bobzhang
Copy link
Copy Markdown
Member Author

ping @damiendoligez @gasche ? I would say this is a safe internal change, let me know anything I can do to help it get merged, thanks!

byterun/str.c Outdated
{
mlsize_t size = Long_val(len);
if (size > Bsize_wsize (Max_wosize) - 1){
caml_invalid_argument("String.create");
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"Bytes.create"?

@bobzhang
Copy link
Copy Markdown
Member Author

bobzhang commented Aug 3, 2016

Hi all, I addressed Alain's comments and applied Hugo's patch. (I am not very confident to apply other patches in bytegen, but it is a good step in this direction)
Please review again, if everyone is happy, should I do the rebase against 4.04 or trunk?

@damiendoligez damiendoligez merged commit be65331 into ocaml:trunk Aug 8, 2016
@damiendoligez
Copy link
Copy Markdown
Member

Reviewed, slightly modified, merged, and cherry-picked to 4.04.

@mshinwell
Copy link
Copy Markdown
Contributor

I thought this was going to be deferred until 4.05?

@damiendoligez
Copy link
Copy Markdown
Member

It is quite a low-risk change, useful in the short term for some users and in the long term for js_of_ocaml.

@gasche
Copy link
Copy Markdown
Member

gasche commented Aug 12, 2016

This PR breaks the compilation of Flowcaml, that uses

external set : ('a, 'b) t -> 'b int -> 'a char -{'b ||}-> unit 
               with 'b < 'a
= "%string_safe_set"

in a module (the weird types are FlowCaml types, the point is the external name) and include String as the implementation.

opam-builder report: http://opam.ocamlpro.com/builder/html/flowcaml/flowcaml.1.07/c9f316bcefc4331699e41ea036cce35c

One may want to debate whether this is a problematic breakage or not (possibly not). I do wonder whether keeping the legacy primitive names for the String functions, and using the new names for the Bytes function would be acceptable for the backends.

Note that this PR does not seem to have a proper Changes entry (it should be marked as compatibility-breaking), the only change entry in the PR is

+* GPR#442, caml_fill_string, and caml_create_string is deprecated 
+  and will be removed in the future, please use caml_fill_bytes 
+  and caml_create_bytes for migration 
+  (Hongbo Zhang, review by Damien Doligez, Alain Frisch, and Hugo Heuzard)

which does not list this PR number and does not mention the other primitive name changes.

@bobzhang
Copy link
Copy Markdown
Member Author

bobzhang commented Aug 12, 2016

hi, @gasche I did discuss this issue with @damiendoligez before, shall I create a new PR to enhance Changes entry?
Btw, I thought * means breaking change, did I miss anything?

@gasche
Copy link
Copy Markdown
Member

gasche commented Aug 17, 2016

Another package that is broken by this change is rml.

opam-builder report:
http://opam.ocamlpro.com/builder/html/rml/rml.1.09.04/c52b6f5e8a4b821200e5ae80b2e54cc7

@gasche
Copy link
Copy Markdown
Member

gasche commented Aug 17, 2016

wget is also broken by this change

opam-builder report: http://opam.ocamlpro.com/builder/html/wget/wget.0.1.0/b51a913249637bfbe31a754953ba2135

@gasche
Copy link
Copy Markdown
Member

gasche commented Aug 17, 2016

@damiendoligez, @bobzhang I'm getting concerned with the number of unrelated packages hit by this issue (Batteries was also hit, of course, but that is more to be expected).

Is there anything wrong with keeping the old primitive names caml_string_set and others for mutable string operations, and rewriting them into the corresponding Bytes operation in the compiler code? If I understand correctly this would still be reasonable for alternative backend (there is no confusion on which types these operations should have in a -safe-string world), and it would remove this source of breakage.

@gasche
Copy link
Copy Markdown
Member

gasche commented Aug 17, 2016

If you were to both agree that this is a reasonable compromise, I could send a PR against the ocaml codebase reverting this part of the change (or @bobzhang might want to implement it directly), to be integrated in a second beta version for 4.04.

@bobzhang
Copy link
Copy Markdown
Member Author

@gasche If we are moving in the direction of -safe-string, we are going to break the existing code any way, I would see people to fix their code sooner than later.
see the signature https://github.com/samoht/ocaml-wget/blob/master/src/stringext.mli#L19

 external set : string -> int -> char -> unit = "%string_safe_set"

This signature does not make sense under -safe-string, right? How is it different that fix the code in 4.04 vs 4.05?

@gasche
Copy link
Copy Markdown
Member

gasche commented Aug 18, 2016

What you say is not unreasonable. I still think it's a pity that software that has been working for years is broken by a change that is not, in itself, necessary, and doesn't come with the advantages of making -safe-string the default, but I guess that at this point in release time in may be too late to change back anyway so maybe we should go with it.

@bobzhang
Copy link
Copy Markdown
Member Author

@gasche That's my argument but I don't have strong opinions here, I can send a PR which treats %string_safe_set the same as %bytes_safe_set so it will not break other people's code. My concerns are that this actually will make the transition from -unsafe-string to -safe-string slower. I think the consensus is that in the long term that we should only support -safe-string . @damiendoligez what do you think?

@bobot
Copy link
Copy Markdown
Contributor

bobot commented Aug 18, 2016

I think the problem is that a compiler primitive is used outside the standard library. It gives a cent to @mshinwell idea in #724 :

However I am interested in pursuing a little this idea of having a flag [(set by default)] to disallow % primitives, and then just having a normal interface. Are there any opinions on that?

I would prefer if we could not break working code, but have a transition plan ("safe-string" and "no-primitive-outside-stdlib") with the release where the defaults will be changed known in advance.

@gasche
Copy link
Copy Markdown
Member

gasche commented Aug 18, 2016

One of the reasons why I'm ready to concede this specific point is that the packages that are broken by this change appear to be maintained at a very low frequency (FlowCaml hasn't changed in eight years), so it's not clear that a well-defined, well-announced migration path would actually do a big difference compared to the current breakage.

(Well-handled transitions have the property that maintainers fix the issue before it starts breaking code. Low-frequency maintainers only fix issues when they break code, so transitions don't make much of a difference.)

@damiendoligez
Copy link
Copy Markdown
Member

@bobzhang Is is possible to simply make %string_safe_set an alias for %bytes_safe_set, so that the compilers would generate exactly the same code for both? This way, the compatibility with old code is maintained, but we can use the new name in the stdlib.

This means the legacy code won't be compatible with bucklescript or future versions of js_of_ocaml, but in a world where the back-ends need to know the difference between string and bytes, it isn't possible to mutate a string, so both primitives should operate on bytes anyway.

@bobzhang
Copy link
Copy Markdown
Member Author

Is is possible to simply make %string_safe_set an alias for %bytes_safe_set, so that the compilers would generate exactly the same code for both?

@damiendoligez Yes, I will send a pull request

@bobzhang bobzhang deleted the safe_string branch October 12, 2016 01:36
@dra27 dra27 mentioned this pull request Dec 13, 2016
EduardoRFS pushed a commit to esy-ocaml/ocaml that referenced this pull request Jul 23, 2021
systhreads: do st_thread_id after initializing the thread descriptor
stedolan pushed a commit to stedolan/ocaml that referenced this pull request May 24, 2022
* Export Cfg_with_layout.print_dot

* ocamlformat
EmileTrotignon pushed a commit to EmileTrotignon/ocaml that referenced this pull request Jan 12, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

7 participants