Skip to content

Attach documentation comments to Parsetree#149

Closed
lpw25 wants to merge 17 commits intoocaml:4.02from
lpw25:4.02+new-doc
Closed

Attach documentation comments to Parsetree#149
lpw25 wants to merge 17 commits intoocaml:4.02from
lpw25:4.02+new-doc

Conversation

@lpw25
Copy link
Copy Markdown
Contributor

@lpw25 lpw25 commented Mar 5, 2015

This patch is a new version of #51. There are a number of differences between this patch and the old one:

  • It handles documentation comments directly in the lexer and parser
  • It does not attempt to parse the contents of documentation comments
  • It does not attempt to be fully backward compatible with OCamldoc

The rules for attaching comments to nodes are as follows:

  • Documentation comments that immediately follow a constructor or field definition (i.e. no blank lines between the definition and the comment) are attached to that constructor or field definition. Only one comment can be associated with each constructor or field definition.
  • Documentation comments that immediately precede or follow an item (i.e. structure item, signature item, class field, class type field) are attached to that item. Comments that are attached to more than one item are considered ambiguous and are ignored. Only one preceding and one following comment can be associated with each item.
  • Documentation comments in structures, signatures, classes or class types that are not attached to anything are floating comments (the equivalent of [@@@ ...] attributes).

The main differences between these rules and the existing ocamldoc rules are:

  • OCamldoc allows comments that are attached to two definitions:
type t
(** t or s ? *)
type s
  • OCamldoc treats the first comment in a file specially
  • OCamldoc does not support following comments on structure items

This patch has only been tested very briefly, and needs more testing before merging, but I thought that I would make a pull request to get early feedback (and maybe some help testing).

Note that the change in approach means that this patch will not work for people using camlp4, there would need to be a corresponding patch for camlp4. Alternatively, the old version of the patch could be used to create a tool that would generate .cmt(i) files containing documentation comments from sources using camlp4.

@lpw25
Copy link
Copy Markdown
Contributor Author

lpw25 commented Mar 5, 2015

Note that this patch does not add the -doc command-line option that #51 added -- instead documentation comments are handled unconditionally.

Documentation comments in bad positions cause warning 50, which is off by default.

@alainfrisch
Copy link
Copy Markdown
Contributor

Thanks for preparing this new patch and for explaining the rules.

Is it a fair description of this new approach to say that it adds an intermediate layer between the lexer and the parser? This layer is a kind of state machine that monitors new lines and doc comments in the stream of tokens produced by the lexer, removes them from the stream seen by the parser and keeps relevant information "on the side" for later use by parsing actions. Inserting comments in the Parsetree is done more locally than with the previous approach, but otherwise, the idea is similar: doc comments are collected by the lexer but not directly considered as part of the yacc grammar (but inserted in the Parsetree by actions -- for the new version -- or by a postprocessing pass -- in the previous patch).

Documentation comments that immediately follow a constructor or field definition

Can you clarify what it means exactly? For a record field, can the comment appear after the ";"? after the } for the final field? For a constructor, can the comment for a constructor appear after the "|" that follows its declaration?

How is the ambiguity resolved for the case of a doc comment after the last constructor in a sum type? Is this case treated by:

Comments that are attached to more than one item are considered ambiguous and are ignored.

How is the following interpreted:

  type t = A (** ... *)
     (** ... *)

Formally, the first doc comments follows immediately the constructor but also the type definition, so according to the rules, it should be attached to both (and thus ignored?).

@lpw25
Copy link
Copy Markdown
Contributor Author

lpw25 commented Mar 5, 2015

Is it a fair description of this new approach to say that it adds an intermediate layer between the lexer and the parser?

It does not add the layer -- the layer was already there for storing comments in cmt files. It extends the layer to store information about the relationship between newlines and comments. Otherwise that is a fair description of what is going on.

removes them from the stream seen by the parser and keeps relevant information "on the side" for later use by parsing actions

It is useful to think of the comment information as being passed to the parser in the same way that locations are given to the parser. Unfortunately, ocamlyacc does not let you change the information which is passed in this way so we have to use a side-channel instead.

Inserting comments in the Parsetree is done more locally than with the previous approach, but otherwise, the idea is similar

Actually, they are quite different. The old patch did not collect comments in the lexer, it collected them by re-lexing the source using a specialized second lexer.

Can you clarify what it means exactly? For a record field, can the comment appear after the ";"? after the } for the final field? For a constructor, can the comment for a constructor appear after the "|" that follows its declaration?

The ; is considered part of the previous record field, the | is considered part of the next variant constructor. So comments should be after the ; and before the |.

How is the ambiguity resolved for the case of a doc comment after the last constructor in a sum type? Is this case treated by:

The rules can be considered as given in order of precedence. First we try to attach comments to fields and constructors, then we try to attach comments to items, then we try to treat them as floating comments.

@dbuenzli
Copy link
Copy Markdown
Contributor

dbuenzli commented Mar 5, 2015

The rules can be considered as given in order of precedence. First we try to attach comments to fields and constructors, then we try to attach comments to items, then we try to treat them as floating comments.

Just to be sure I understand

type t = A | B
(** The type for bla *) 

will attach the comment to B and not t. If we want to attach to t in that case we should do

type t = A | B (** *) 
(** The type for bla *)

Which is fine by me, it seems to me that APIs exposing regular variants are not that widspread and the "error" can be corrected on a case by case basis.

@dbuenzli
Copy link
Copy Markdown
Contributor

dbuenzli commented Mar 5, 2015

(actually IIRC this behaviour the one implemented by ocamldoc).

@lpw25
Copy link
Copy Markdown
Contributor Author

lpw25 commented Mar 5, 2015

Just to be sure I understand

type t = A | B
(** The type for bla *) 

will attach the comment to B and not t. If we want to attach to t in that case we should do

type t = A | B (** *) 
(** The type for bla *)

Yes. Another option is:

(** The type for bla *)
type t = A|B

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please be a bit more careful with the copyright headers. I know it's annoying to have to pay attention to them, but it's not that hard.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I forgot about those. I've fixed the headers of comments.ml and comments.mli in a new commit.

@alainfrisch
Copy link
Copy Markdown
Contributor

It does not add the layer -- the layer was already there for storing comments in cmt files.

The layer existed only to collect comments and store them in the .cmt file; there was no "logic" in it, and it did not impact the Parsetree (correct me if I'm wrong). Now it is a non-trivial component which defines part of the mapping from the source code to the Parsetree. I'm fine with the approach, but let's not hide the fact that it increases significantly the complexity of this mapping and the ways it could be documented (even semi-formally).

I still believe the future of the ecosystem would be brighter if we could converge -- after a migration period to a more rigid system -- (different syntaxes, stricter placement rules) that could be recognized by minor additions to the existing yacc grammar (as would be achieved by providing a lighter syntax for [@@doc "..."] and for [@doc "..."]), and which would remain whitespace- (or newline-) independent. In addition to reducing the complexity of the compiler and simplifying the documentation, this would simplify the life of other tools that operate on the concrete syntax (alternative parsers such as Merlin, caml4/camlp5/fan, editor modes, perhaps some refactoring tools).

But I've to recognize that I seem to be the only one concerned by the extra complexity added by your proposal... And that being said, I find the current version much more acceptable than the previous one.

@lpw25
Copy link
Copy Markdown
Contributor Author

lpw25 commented Mar 6, 2015

I still believe the future of the ecosystem would be brighter if we could converge -- after a migration period to a more rigid system

I think that is a reasonable position to take, but I think that it would still be very useful to be able to improve the tooling for existing code bases using the existing syntax -- hence the need for this patch.

@alainfrisch
Copy link
Copy Markdown
Contributor

but I think that it would still be very useful to be able to improve the tooling for existing code bases using the existing syntax

Then what about introducing the more rigid system now? We could support both modes in the compiler (and mark the legacy mode as deprecated at some point), and also provide a tool to automate the transition (the tool could recognize doc comments using the same code as the current ocamldoc, and patch the source just enough to remove those comments and re-insert them at the proper place and with the proper syntax, and perhaps adapt whitespaces/newlines with some heuristics).

@lpw25
Copy link
Copy Markdown
Contributor Author

lpw25 commented Mar 6, 2015

Then what about introducing the more rigid system now?

I only said that I thought your position was reasonable -- not that I agreed with it. I am not convinced that it is worth the trouble of adding a new syntax just to avoid using comments for documentation. Why not just use [@@doc .. ] if you dislike using comments? Personally, I quite like documentation in comments, so I am not particularly keen on deprecating a reasonable and widely-used system just to avoid it.

Besides, agreeing on a new syntax takes time, and I would very much like to include this patch in 4.02.2 so that codoc can be used as soon as possible.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we factorize the very similar logic in the three cases above?

@gasche
Copy link
Copy Markdown
Member

gasche commented Mar 10, 2015

Forgetting the debate about the observable behavior for a while, I had a closer look at the implementation.

I'm not fond of the way the parser semantic actions are changed by this patch. I find the changes to the lexer extremely reasonable, but the parser change look somewhat invasive and a bit fragile (it's easy to forget to do the right thing when changing the grammar, and thus I would be worried that it could complicate grammar maintenance).

I must say that on this specific aspect I liked the approach of #51 better: traverse the constructed AST after-the-fact to implement the node-specific attachment rules. In some sense it is more declarative, while the current implementation is side-effect-based (and the effects are interleaved with the parsing actions).

I don't know whether the move from declarative to imperative attachment is related to your change in attachment rules: did the new approach become easier to implement? I see that the new patch also has better ways to warn the user when the comments are not at the right place; are those easier with the more imperative style?

@lpw25
Copy link
Copy Markdown
Contributor Author

lpw25 commented Mar 10, 2015

'm not fond of the way the parser semantic actions are changed by this patch. I find the changes to the lexer extremely reasonable, but the parser change look somewhat invasive and a bit fragile (it's easy to forget to do the right thing when changing the grammar, and thus I would be worried that it could complicate grammar maintenance).

I must say that on this specific aspect I liked the approach of #51 better: traverse the constructed AST after-the-fact to implement the node-specific attachment rules. In some sense it is more declarative, while the current implementation is side-effect-based (and the effects are interleaved with the parsing actions).

Firstly, it is important to note that this patch makes only cosmetic changes the parser grammar (I changed some left-recursion to right-recursion to improve the consistency of the code). All important parser changes are to the parser actions.

These actions appear imperative, but are in essence declarative. They take the exact same approach as that used to attach locations to the AST. Unfortunately, you can't pass arbitrary information through ocamlyacc as location information so instead it goes via a side-channel (a table mapping locations to the comment information).

I don't know whether the move from declarative to imperative attachment is related to your change in attachment rules: did the new approach become easier to implement? I see that the new patch also has better ways to warn the user when the comments are not at the right place; are those easier with the more imperative style?

The new approach is -- in my opinion -- vastly superior to the old one because it is implemented in terms of the concrete syntax rather than the abstract syntax. This makes it much easier to understand and maintain.

@gasche
Copy link
Copy Markdown
Member

gasche commented Mar 10, 2015

The kind of issues I'm thinking of is for example in this code, you have to remember to use post_doc_constructor in one case and post_doc_last_constructor in the other by reasoning on the surrounding parsing context, and that is messy. It would be nice if we could call the same function in both places, we could then move the call as a wrapper around the semantic action of constructor_declaration directly, and remove the redundancy.

Similarly, I wonder whether we could abstract away on this pattern where you have either just pre_doc_foo or sometimes pre_doc_foo followed by post_doc_foo.

Because it is implemented in terms of the concrete syntax rather than the abstract syntax

That is a good point, thanks.

@lpw25
Copy link
Copy Markdown
Contributor Author

lpw25 commented Mar 10, 2015

The kind of issues I'm thinking of is for example in this code, you have to remember to use post_doc_constructor in one case and post_doc_last_constructor in the other by reasoning on the surrounding parsing context, and that is messy.

That one is a bit unfortunate. There is essentially one remaining corner case with the new rules -- comments after the last constructor of a variant definition. This code handles that case, so I would expect it to be a little annoying.

Are there other examples which look hard to maintain? I suspect there are, but I reckon that most of them can be simplified with some thought (unlike the above example which seems inherently awkward).

@alainfrisch
Copy link
Copy Markdown
Contributor

I propose to evolve "docstrings" into more rigid placement rules, which would integrate more smoothly with the language grammar. A first proposal has been posted there:

http://caml.inria.fr/mantis/view.php?id=6811

The new mode would be optional, disabled by default for some OCaml releases, then enabled by default, and finally made mandatory at some point. In the Mantis ticket, I suggest two important migration paths:

  • Support this new mode in ocamldoc soon so that people can start switching to it without changing their current tool.
  • When the new mode is disabled, add some support for ocamldoc rules in the compiler so that people can start using codoc before adapting their entire code base. If it is agreed that the new mode should or might become the default at some point, it makes sense to aim at minimal interaction between the legacy-supporting pass and the actual parser. In this sense, the approach taken originally (in Support parsing documentation and storing it in attributes #51 , without parsing the comment contents) is more promising.

@dbuenzli
Copy link
Copy Markdown
Contributor

@alainfrisch As I told in mantis I'm not sure this is a very good idea. When I'm playing with code/designing I move things a lot around, comment out some parts, uncomment part of it back, etc. and when I do that I don't think I want to be bothered about rigid documentation placing rules that would make my parsing fail.

@alainfrisch
Copy link
Copy Markdown
Contributor

@dbuenzli: see Mantis for a discussion on this topic of error recovery to turn errors caused by badly located doc comments into warnings instead of fatal errors, even in strict-doc mode.

That said, I'm not sure how much code refactoring would actually suffer in practice from fatal errors caused by badly located doc comments (we are not talking about arbitrary comments, just doc comments, which typically don't occur within function bodies). It should be quite easy to fix such an error, and it's better done during refactoring than after the fact. At worst, users can always turn off strict-doc during an intensive refactoring if they are annoyed by error. But I agree that if we could get only warnings and not a fatal error on the first badly located doc comment, it's nice, even though this might not justify a much increased complexity in the system.

@dbuenzli
Copy link
Copy Markdown
Contributor

That said, I'm not sure how much code refactoring would actually suffer in practice from fatal errors caused by badly located doc comments (we are not talking about arbitrary comments, just doc comments, which typically don't occur within function bodies).

I don't know either, but I think this should be really experimented with to prove that it won't be annoying. For example I guess this would fail:

type t =  [ A (** doc a *) (* | B *) (** doc b *) | C (** doc c *) ]

I'm not sure I want to have to bother here that I need to comment the B case with its own docstring.

But I agree that if we could get only warnings and not a fatal error on the first badly located doc comment, it's nice, even though this might not justify a much increased complexity in the system.

I tend to compile with the compiler's default warning levels (except for a few exceptional sources tweaking warning levels is annoying, you want to use the compiler's defaults) and any triggered warning always gets all my attention (non exhaustive pattern matches are sadly a warning) so this would still be noise to me. Also IIRC the compiler doesn't report errors before warnings so when you use things like emacs' compilation mode you have to jump over all the warnings before getting to the error which is annoying. In general I'm suspicious of the whole warning idea and not really in favour of adding more of them.

@damiendoligez
Copy link
Copy Markdown
Member

except for a few exceptional sources tweaking warning levels is annoying, you want to use the compiler's defaults
I don't really get this. You just set your preferred warnings in your Makefile once, then forget about it. What's so annoying about it?

non exhaustive pattern matches are sadly a warning
This design mistake dates back to the early 1980's...

In general I'm suspicious of the whole warning idea and not really in favour of adding more of them.
It's hard to please everyone, but with warnings at least we can give them the option of being pleased.

@lpw25
Copy link
Copy Markdown
Contributor Author

lpw25 commented Mar 12, 2015

In response to feedback that the patch was too heavy on changes to parser.mly, I have updated the patch.

This update changes the failure behaviour of ambiguous comments like:

type t
(** foo *)
type s

Previously this would raise a warning and ignore the comment. Now this raises a warning and attaches the comment to both items.

This changes allows us to remove all context-dependent aspects of the attachment rules, significantly reducing the complexity of the code in parser.mly.

I think that the code in parser.mly should now be simple to read and maintain for anyone familiar with how the existing parser handles locations and attributes.

This updated patch has only had cursory testing, so it probably still needs a bit of work.

@avsm
Copy link
Copy Markdown
Member

avsm commented Mar 12, 2015

@lpw25, the updated patch seems to have a merge conflict against the 4.02 branch

@lpw25
Copy link
Copy Markdown
Contributor Author

lpw25 commented Mar 12, 2015

@lpw25, the updated patch seems to have a merge conflict against the 4.02 branch

Fixed (I think)

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Isn't this dead code?

@dbuenzli
Copy link
Copy Markdown
Contributor

I don't really get this. You just set your preferred warnings in your Makefile once, then forget about it. What's so annoying about it?

That's precisely the problem, I can't forget about it since I will have to put that preference it in each of my makefiles. Another annoyance is that warning preferences changes the experience of the language from one person to the other and lead to discussions like this.

@alainfrisch
Copy link
Copy Markdown
Contributor

@dbuenzli For the discussion of my "alternative" proposal, I suggest to continue on Mantis (the discussion about warnings also concern Leo's proposal). (I use quotes around "alternative", because supporting both proposals would provide a smooth migration strategy.)

but I think this should be really experimented with to prove that it won't be annoying.

Yes, clearly, this needs some experimentation.

type t = [ A (** doc a *) (* | B *) (** doc b *) | C (** doc c *) ]

This would work "by chance", with two doc comments attached to "A".

and any triggered warning always gets all

Doesn't it seem a good idea to get some feedback from the compiler for badly located doc comments? This can be annoying during refactoring, but especially if we go for stricter rules, it makes sense to have the compiler enforce them (at least as warnings).

the compiler doesn't report errors before warnings so when you use things like emacs' compilation mode you have to jump over all the warnings before getting to the error which is annoying

It should not be very difficult to collect warnings and report them after type-checking (in case of success or error).

bactrian pushed a commit that referenced this pull request Mar 26, 2015
git-svn-id: http://caml.inria.fr/svn/ocaml/trunk@15968 f963ae5c-01c2-4b8c-9fe0-0dff7051ff02
@lpw25
Copy link
Copy Markdown
Contributor Author

lpw25 commented Mar 26, 2015

I ran some quick tests of the performance of the patch by essentially doing time opam install core_kernel with the 4.02 branch and the patch. The results are:

   4.02      |   4.02+new-doc
-------------|------------------
  2m31.428s  |  2m31.394s
  2m31.581s  |  2m30.799s
  2m31.205s  |  2m31.252s

These are obviously not very accurate, but they indicate that the patch does not cause any major regressions in terms of compilation speed.

nojb pushed a commit to nojb/ocaml that referenced this pull request Apr 12, 2015
git-svn-id: http://caml.inria.fr/svn/ocaml/trunk@15968 f963ae5c-01c2-4b8c-9fe0-0dff7051ff02
bactrian pushed a commit that referenced this pull request Apr 18, 2015
git-svn-id: http://caml.inria.fr/svn/ocaml/trunk@15968 f963ae5c-01c2-4b8c-9fe0-0dff7051ff02
@lpw25
Copy link
Copy Markdown
Contributor Author

lpw25 commented May 11, 2015

Merged in 4.02 branch

@kit-ty-kate
Copy link
Copy Markdown
Member

Mmh I don't see it in the commits yet. Is there any problems ?

@lpw25
Copy link
Copy Markdown
Contributor Author

lpw25 commented May 12, 2015

I'm not on the list of known commiters for the svn-to-git transformation. Perhaps this is breaking it.

@avsm might know how to fix it.

@avsm
Copy link
Copy Markdown
Member

avsm commented May 12, 2015

@lpw25 wasn't in the authors.txt; fixed in ocaml/ocaml.org-scripts@76854df

@yallop
Copy link
Copy Markdown
Member

yallop commented May 12, 2015

Merged in 4.02 branch

What's the plan for merging this in trunk?

@damiendoligez
Copy link
Copy Markdown
Member

As for all changes: by default I will merge it after the 4.02.2 release, but if you need it sooner ask @lpw25 to merge it now (and make my life easier in the process...)

@alainfrisch
Copy link
Copy Markdown
Contributor

If this can make your life easier, it seems fair to ask @lpw25 to do the merge sooner than later, since he insisted to push the change to 4.02.2. Merging in trunk will also reduce risks of future conflicts and give larger visibility to the change (for instance for people who track trunk, as LexiFi does).

@lpw25
Copy link
Copy Markdown
Contributor Author

lpw25 commented May 12, 2015

I'll try and find time to push it to trunk this week

@bobot
Copy link
Copy Markdown
Contributor

bobot commented Jun 17, 2015

Ping?

@lpw25
Copy link
Copy Markdown
Contributor Author

lpw25 commented Jun 18, 2015

Ping?

This slipped down my list of things to do. I hope to find time in the next week or so.

@damiendoligez
Copy link
Copy Markdown
Member

@lpw25 I'll wait for that before I merge 4.02.2 into trunk.

@lpw25
Copy link
Copy Markdown
Contributor Author

lpw25 commented Jun 28, 2015

Committed to trunk (revision 16189).

@lpw25 lpw25 closed this Jun 28, 2015
stedolan added a commit to stedolan/ocaml that referenced this pull request Aug 3, 2017
The old domain termination logic was roughly:
  - turn off incoming interrupts
  - empty minor heap / mark stack
  - quit

This had a subtle race condition: after incoming interrupts are
turned off, this domain no longer participates in STW heap cycling.
So, should a major GC cycle end just after we turn off incoming
interrupts, the GC colours can flip as we're emptying the mark
stack, causing segfaults / corruption.

The fix is to have domain termination logic like this:
   loop {
     - empty minor heap / mark stack
     - handle incoming interrupts
   } until no more incoming interrupts
   - turn off incoming interrupts
   - quit
mshinwell added a commit to mshinwell/ocaml that referenced this pull request Jun 1, 2020
lthls pushed a commit to lthls/ocaml that referenced this pull request Sep 23, 2020
lthls pushed a commit to lthls/ocaml that referenced this pull request Sep 23, 2020
lthls pushed a commit to lthls/ocaml that referenced this pull request Sep 24, 2020
stedolan pushed a commit to stedolan/ocaml that referenced this pull request Aug 10, 2021
stedolan pushed a commit to stedolan/ocaml that referenced this pull request Oct 5, 2021
EmileTrotignon pushed a commit to EmileTrotignon/ocaml that referenced this pull request Jan 12, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.