Skip to content

Support parsing documentation and storing it in attributes#51

Closed
lpw25 wants to merge 4 commits intoocaml:trunkfrom
ocamllabs:bin-doc
Closed

Support parsing documentation and storing it in attributes#51
lpw25 wants to merge 4 commits intoocaml:trunkfrom
ocamllabs:bin-doc

Conversation

@lpw25
Copy link
Copy Markdown
Contributor

@lpw25 lpw25 commented May 5, 2014

This patch adds a simplified reimplementation of OCamlDoc parsing to the compiler. This is used to provide a -doc argument to ocamlc and ocamlopt, which causes the compiler to parse documentation and store it in the AST as attributes. This means that documentation can be retrieved directly from .cmt files, enabling tools to easily obtain accurate documentation information.

The documentation is stored as attributes using a new kind of payload PDoc which containts a Documentation.t (based on OCamlDoc's Odoc_types.info type).

@alainfrisch
Copy link
Copy Markdown
Contributor

It's an interesting approach, which will hopefully yield a much more streamlined implementation of ocamldoc and related tools. But I'm a little bit concerned by the fact that the parsing of ocamldoc comments is now part of the compiler itself. I can see the benefits in doing so (in particular, getting warnings earlier), but it adds a significant amount of code and introduce two parsing modes (which is bound to create complexity when using preprocessors such as camlp4/camlp5).

Two variants I can think of:

  • Keep the parsing logic as an external tool invoked by -pp. This is a little bit heavy (extra process invocations) and does not play well with other preprocessors. But it gives a non-intrusive way to experiment and make progress on the proposed approach (and in particular implement the next ocamldoc).
  • Keep the logic to insert documentation comments in the compiler, but do not parse their content and represent them using existing forms of payload; leave the parsing logic of documentation content to ocamldoc/external tools. This makes the patch much lighter and does not hard-code syntactic choices made by ocamldoc (it does not prevent someone from experimenting with a tool using a different syntax). Also, I find it nice that the "documentation comments" form becomes just an alternative syntax; this increases the chance of having good roundtripping properties (from Parsetree to syntax with -dsource and back to Parsetree) and plays better with other front-ends (even if camlp4/camlp5 does not support -doc, it is always possible to use the direct form).

@bobot
Copy link
Copy Markdown
Contributor

bobot commented Nov 11, 2014

This PR have been discussed during the last developers' meeting, where I had the pleasure to be. To sum-up:

  • Nice to not parse the comment in the compiler (just verbatim)
  • But should use a simpler attachment rule to the expr node. More like attributes, not like ocamldoc. It disallows it to be a droppin replacement for ocamldoc but it is a good occasion for a simpler semantic.

@bobot
Copy link
Copy Markdown
Contributor

bobot commented Nov 12, 2014

In my last comment I was not very precise: it would be better to not parse the comment in the compiler.

@jordwalke
Copy link
Copy Markdown

Looks cool, @lpw25. My question is, is there some way this can be used to allow the pretty printer -dsource to print comments along with the source code? Currently, the comments aren't even forwarded to the pretty printer. Though they are collected by the lexer and could easily be forwarded to the pretty printer, once the pretty printer gets them, it's definitely not trivial to print the comments along with the AST. I suspected that comments were parsed into attributes in the actual OCaml AST, then that would at least make pretty printing a much easier task, since it's likely that the same rules that govern breaks/lists of attributes would also apply to comments as well. One downside I could think of is that there are some bizarre places to write comments that aren't valid attribute locations. I don't know if this would be an issue in practice, or there is perhaps some compromise taken for those few cases (parses into the nearest attribute location etc).

@jordwalke
Copy link
Copy Markdown

@lpw25 I just noticed that this pull request also includes code that interleaves comments into an OCaml AST as attributes - just like my previous comment requested. I wish I had known about this diff before I had just spent considerable time doing the exact same thing! In my case, it was to print the comments in the pretty printer. Does (or will) this decorated AST get forwarded to the pretty printer so that comments can be printed?

@gasche
Copy link
Copy Markdown
Member

gasche commented Feb 24, 2015

Could we have user-visible support for comments as attribute nodes? I think that free-form comments are a nuisance and that Emacs' and Python's docstrings are the right way to go. I already told Leo that I thought a good approach to "new ocamldoc" would be to encourage people to migrate to documentation-comment-as-attributes, and see whathever parsing of existing comments as merely legacy support to help people adopt the new strategy. In particular, if we printed back comment that could be attached to AST nodes, I would vote for them to be printed in the form of attributes; the right way.

@jordwalke
Copy link
Copy Markdown

@gasche:

  1. Would the syntax for these comments look like traditional OCaml comments (* *), or more like attributes?
  2. I have also found that freeform comments are not very accessible. An accessible editing experience means creating automated formatting tools that manage every kind of white space formatting such as newlines, wrapping, spaces (at some configurable column width). Some people with mobility disabilities or injuries benefit from this automated tooling which prevents them from having to strain to meticulously manage white space continuously (why not have a computer automate that for all of us anyways?) This is actually feasible, and not at all difficult, except for one thing that I have found - freeform comments! I have found that they make it difficult to write the automated tooling that deterministically maps an AST to a formatted pretty printing, because it isn't always clear which line/token a comment should be tied to. Modeling specific valid comment locations in the grammar has this nice side effect: If it's unambiguous how it should be parsed, then it's unambiguous how it should be printed when it comes time to wrap at column widths.

@lpw25
Copy link
Copy Markdown
Contributor Author

lpw25 commented Feb 24, 2015

Does (or will) this decorated AST get forwarded to the pretty printer so that comments can be printed?

The next version of this patch will attach comments to the AST as attributes as part of regular parsing, so they will be available to the pretty printer. It will probably print them as attributes because that is easier than printing the correct whitespace for comments.

Note that all this only applies to documentation comments ((** ... *)), regular comments are ignored just like always.

Could we have user-visible support for comments as attribute nodes?

This patch and the new version support attributes of the form [@@doc "foo"] as documentation. I personally dislike attributes for documentation, since documentation is more similar to comments than the structured input of other attributes, but the support is there for users that prefer it.

I would vote for them to be printed in the form of attributes; the right way.

I disagree that this is the right way to print them. It is however the easier way to print them, so that's what it will probably do.

I have found that they make it difficult to write the automated tooling that deterministically maps an AST to a formatted pretty printing, because it isn't always clear which line/token a comment should be tied to.

Note that attaching regular comments to the AST is very difficult and will not be attempted as part of this work. Documentation comments are designed to be attached to nodes, and so there are only certain places where they make sense. Regular comments are designed to allow commenting out any section of code -- so they must be supported everywhere in the input.

Modeling specific valid comment locations in the grammar has this nice side effect: If it's unambiguous how it should be parsed, then it's unambiguous how it should be printed when it comes time to wrap at column widths.

Tools for manipulating OCaml source code would be better built from custom parsers (e.g. ocp-indent, merlin). The OCaml parser can only handle OCaml code that is valid, which is generally not the case when writing code.

The mapping between the OCaml parser and printer is not even close to one-to-one. Many constructs are de-sugared during parsing and cannot be reliably re-sugared. The syntax tree is an Abstract Syntax Tree not a concrete syntax tree: it removes many details that would be required to unambiguously print it.

@gasche
Copy link
Copy Markdown
Member

gasche commented Feb 24, 2015

Tools for manipulating OCaml source code would be better built from custom parsers (e.g. ocp-indent, merlin). The OCaml parser can only handle OCaml code that is valid, which is generally not the case when writing code.

Disagreed. Specifying indentation options in a format ocp-indent (and having the pretty-printer inserting line breaks at the right time) almost allows us to reproduce the feat of go-fix without the unsolved headaches of correctly handling whitespace. The only standing issues is floating comments, and I would be more than willing to drop those to gain the ease of tooling -- being able to print code that was transformed by a typedtree -> astree pass, in a format the user accepts as the new version of the code.

The mapping between the OCaml parser and printer is not even close to one-to-one. Many constructs are de-sugared during parsing and cannot be reliably re-sugared.

We could add more concrete information in the AST, or resugar them in a canonical way (the most compact/elegant if it is possible to determine what it is). People would be happy if the identity transformation improved their code (instead of the garbage-in-practice camlp4 outputs) and that could become a convention of style -- just as was done in the Go community with go fmt.

Of course, I'm not making you in particular responsible for such a large change, and I would already be happy to have reliable support of documentation comments as attributes.

@lpw25
Copy link
Copy Markdown
Contributor Author

lpw25 commented Feb 24, 2015

People would be happy if the identity transformation improved their code (instead of the garbage-in-practice camlp4 outputs) and that could become a convention of style -- just as was done in the Go community with go fmt.

Such a transform cannot improve someone's code. They chose a specific form because it better represented what they were doing. The reason we accept multiple forms of things is because different places are better served by different forms. The same goes for whitespace. We treat whitespace as insignificant because different indentations are better suited for different situations.

We could add more concrete information in the AST

The only way to do such tools properly is to add all concrete information to the AST (making it a concrete syntax tree) including all white space and comments. I believe the project Roslyn compilers do this kind of thing. This allows them to provide very good editor support (along the lines of Merlin). However, it is a huge undertaking that would need large changes to the lexer, the parser, and parts of the type checker.

@jordwalke
Copy link
Copy Markdown

The syntax tree is an Abstract Syntax Tree not a concrete syntax tree: it removes many details that would be required to unambiguously print it.

My only claim was that it makes it easier to create a deterministic mapping from an AST to formatted output. We don't need to print the exact same text as what was input (modulo white space differences) because there is still utility in having a deterministic mapping. Of course, if we add more and more information to the AST then it approaches the ability to deterministically format only white space, but even without that, each project can determine how to format particular expressions that would be currently ambiguous.

My parser might parse both:

let add a b = a + b
let add = fun a b -> a + b

But each project may choose their own individual preferences for how to express a binding of that form. For example, in my project, I might choose to have the canonical formatter configured to print all occurrences of the later form as let a b = a + b. Regardless of how much is eventually added to the AST, I would love it if every project I contributed to had repo-level formatting configuration that will perform all formatting according to the maintainer's conventions. It sounds like a good idea, but the effects are even better than it sounds - a repo that includes the required formatting config (and a way to automate the formatting) allows me to have my own formatting conventions inside of my editor, while still collaborating (and sending pull requests) with the maintainer's conventions. I'm sure you can see where this is going!

Such a transform cannot improve someone's code. They chose a specific form because it better represented what they were doing.

For things that aren't yet represented in the AST, yet are frequently used to convey meaning, I agree - but for things that are style preferences, I'm suggesting that a deterministic formatter allows the best of both worlds - not only a way to automate the formatting (accessibility etc) according to some convention, but for the individual writing the code to use their personal preference instead of the convention.

All of this is way beyond this particular diff, but I'm very interested in the topic, so I'd be glad to open up another discussion elsewhere.

@alainfrisch
Copy link
Copy Markdown
Contributor

After reading the OCaml Platform blog, I'm glad to see that: (i) it is now planned to move the parsing logic into the external tool and represent documentation comments as unparsed text in the AST; (ii) it is now considered acceptable to be more restrictive concerning the placement of documentation comments than what ocamldoc currently support. For (ii), if backward compatibility with ocamldoc is not a requirement, I would really suggest considering using exactly the same rules as for attributes, esp. now that people are working to move attributes on constructors after the argument (instead of just after the constructor name). This would allow to keep all the logic in the main parser and described by a proper whitespace-insensitive grammar. This seems both simpler and more robust w.r.t. to manual or automated code refactoring.

@lpw25
Copy link
Copy Markdown
Contributor Author

lpw25 commented Feb 25, 2015

For (ii), if backward compatibility with ocamldoc is not a requirement, I would really suggest considering using exactly the same rules as for attributes

Some breakage with OCamldoc is acceptable, but not much. It is really only dropping support for the very ambiguous cases, such as:

type t
(** Which type is this for? *)
type s

The attribute syntax is very restrictive, and I see no real benefit in using it for things which are not attributes. (For example, I always write comments before the thing they refer to, and would rather not have to rewrite them).

Note that the new implementation does keep the logic in the lexer/parser and leaves the grammar white-space insensitive. This allows for the comment association to depend on white-space without the grammar of the rest of the language ever being affected by white-space. This also allows us to give warnings for incorrect documentation comments rather than the full errors which would be required if they were handled by the parser.

Basically, using (almost) the existing placement is no more complicated to implement than using the attribute placement (under the assumption that incorrect placement should be a warning not an error).

@alainfrisch
Copy link
Copy Markdown
Contributor

I still find it largely preferable to have documentation text parsed by the same technology (i.e. the grammar implemented by the yacc parser), not another pass. The idea that what goes into the .cmt files (except location information) depend on white-space seem counter-intuitive to me. Yes, it might be inconvenient to impose a specific before/after style, but it can also be argued that it improves readability of third-party code to support a single style (in addition to technical benefits of staying white-space independent).

@lpw25
Copy link
Copy Markdown
Contributor Author

lpw25 commented Feb 25, 2015

but it can also be argued that it improves readability of third-party code to support a single style

I don't think that the rules for attributes are any easier to read than the rules for comments. Anyway, if you prefer that style you can simply use:

type t
[@@doc "The doc for type t"]

This parses the same as the version using comments, and has the benefit of clearly being an attribute -- so users should expect the attribute attachment rules.

@lpw25
Copy link
Copy Markdown
Contributor Author

lpw25 commented Feb 25, 2015

The idea that what goes into the .cmt files (except location information)

I think this is a key point. Lexical information such as locations and comments are handled by the lexer not the parser.

@alainfrisch
Copy link
Copy Markdown
Contributor

My argument was about the value of imposing a more uniform style, so just saying that one can just prefer one form over the other doesn't really address this point. If you get used to writing comments before declarations, you will be confused when you browse code written with comments after the declaration (or need to modify such code).

I think this is a key point. Lexical information such as locations and comments are handled by the lexer not the parser.

This is indeed a key point: I don't see why documentation text should be considered as "lexical information" (which tend to have weak roundtrip properties), and not part of the real grammar.

@lpw25
Copy link
Copy Markdown
Contributor Author

lpw25 commented Feb 25, 2015

This is indeed a key point: I don't see why documentation text should be considered as "lexical information" (which tend to have weak roundtrip properties), and not part of the real grammar.

You can't have it both ways. Either documentation is part of the language and should be parsed fully (including the contents of the doc string) or it is auxillary to the language and is therefore lexical information.

Anyway this is all academic, we need good support for the thousands (millions?) of lines of code that use the existing documentation format. The new patch provides this and the implementation is no more complicated than using the attribute rules for attachment of comments (since you should not give an error for an improperly placed comment).

@Drup
Copy link
Copy Markdown
Contributor

Drup commented Feb 25, 2015

@alainfrisch Comments after/before types in .mli may be a matter of preferences, but it's not the case in .ml. I remind you that ocamldoc supports a pseudo-literate programming mode when you document the implementation. Putting the docstring after the implementation (which means: potentially more than one screen away) seems like a terrible idea.

On the other hand, maybe it's an argument for http://caml.inria.fr/mantis/view.php?id=6586&nbn=9, to have a docstring-like syntax. :)

@alainfrisch
Copy link
Copy Markdown
Contributor

Either documentation is part of the language and should be parsed fully (including the contents of the doc string) or it is auxillary to the language and is therefore lexical information.

I don't follow the reasoning. The contents of string literals are not parsed, but the positions where such strings are allowed in source code is specified by the grammar. Why is this different for documentation strings?

(since you should not give an error for an improperly placed comment).

Again, that's a matter of choice (and backward compatibility) to consider documentation strings as being comments or not.

Let's face it: ocamldoc rules are completely broken (and I don't know at this point how much you're going to fix them). Consider for instance an mli made of:

type t =
  | A (** X *)
  | B (** Y *)
(** Z *)

Not even mentioning that "X" is also considered to be the "head title" for this module, the fact that Y is attached to B and Z to t, but if you omit (** Y *), then Z gets attached to B -- and I don't think one can add a documentation on t without adding a blank one on B -- is rather ugly, isn't it?

What's your view on such cases?

@alainfrisch
Copy link
Copy Markdown
Contributor

I remind you that ocamldoc supports a pseudo-literate programming mode when you document the implementation.

Do people actually use this feature? (This is a genuine question, not a rhetorical one.)

@lpw25
Copy link
Copy Markdown
Contributor Author

lpw25 commented Feb 26, 2015

type t =
  | A (** X *)
  | B (** Y *)
(** Z *)

This is the one ambiguity that is left in the new version. It is a bit hard to avoid, and not that hard to understand, so I left it in. Personally, I always put the type's comment first so I never hit this problem.

@lpw25
Copy link
Copy Markdown
Contributor Author

lpw25 commented Feb 26, 2015

Let's face it: ocamldoc rules are completely broken

Other than the above ambiguity, they are now very simple. Comments are attached to the thing they are next to (i.e. no blank lines between the comment and the thing). If a comment is not attached to anything, then it is a floating comment (the equivalent of [@@@foo] attributes).

@Drup
Copy link
Copy Markdown
Contributor

Drup commented Feb 26, 2015

Do people actually use this feature? (This is a genuine question, not a rhetorical one.)

I would, if the generated html was less god damn horrible. (and I tend to comment my code as if it was working).

@dbuenzli
Copy link
Copy Markdown
Contributor

Personally, I always put the type's comment first so I never hit this problem.

I always put them after for usability reasons: if you scan a file for definitions and their documentation your eyes and scrolling only have to follow one natural (reading) direction, your eyes typically anchor on the val, types and modules keywords and then look below for the documentation. If you put them before you have, once you found an anchor, to go back up and possibly scroll back to see the full documentation of the item, performing this on each anchor is much slower.

If the only ambiguity with comments after that's left is the one you show I don't really see it as problematic, the following rule seems simple enough: if no case is documented, the last comment is for the type, if there is a case that is documented you are forced to document the last case if you want to document the type.

If it was possible to literate the mlis and make that accessible through a single click with codoc (and even turn that to a full code browsing experience) I would certainly do it in certain cases. For example React's implementation has quite a good deal of information about how it works internally, but I never used that with ocamldoc.

@jordwalke
Copy link
Copy Markdown

I thought I'd just double check, but is the vision to (at least) allow doc comments that appear just before the item being documented? I understand why it's sometimes better to have it after, but I think there's value in having it be done the way it is done in just about every other language/community (before the item). There's enough unfamiliar things about OCaml that we don't need to go creating more. The flexibility to do either seems great too.

@dbuenzli
Copy link
Copy Markdown
Contributor

I always put them after for usability reasons:

One more argument for after is that there is then no order discrepancy between generated documentation (all generators I know always put docstrings after definitions) and your sources. As such the documentation only becomes a mere pretty-printed and hyperlinked version of your sources. Which is again better from a usability point of view: it's less disorienting when you switch from one view to the other.

@jordwalke Arguments like we should do it like the other languages are moot. We are neither here to make things like the others do nor differently from the others, we are here to do things so that they are the best.

@jordwalke
Copy link
Copy Markdown

@jordwalke Arguments like we should do it like the other languages are moot. We are neither here to make things like the others do nor differently from the others, we are here to do things so that they are the best.

I was actually saying the reverse: In cases where the difference in quality is moot, do it like other languages. If all else is equal, why not do what people are most familiar with?

I'm sure you agree with that - but disagree on what is "best" in this case - which is reasonable. I could be convinced that comments after elements are superior - but it also seems like the kind of superficial thing that has room for well-reasoned disagreement. My question was simply: will this diff will allow both forms?

@damiendoligez
Copy link
Copy Markdown
Member

I think it's time to refocus on the patch at hand.

Can we agree on the following points?

  1. the code for parsing the inside of the doc comments does not belong in the compiler
  2. whatever the best placement might be, we have to be mostly compatible with ocamldoc, so we just can't force the doc comments to be in the same place as attributes.

If you strongly disagree, speak up. If you think the above is not enough to specify the next version of the patch, speak up. If you think this change should not go into 4.02.2, speak up.

@alainfrisch
Copy link
Copy Markdown
Contributor

It's difficult to focus on the patch without also talking about codoc. This seems to be a great project, which will bring a lot of improvements over ocamldoc. It also seems to be a great time to revisit some choices made by ocamldoc, and considering the central role of a documentation tool in the community, I wished there were some more open discussion about what to revisit or not.

My concern is not just about the placement of documentation text, but also on the choice of which elements can be annotated (e.g. we could discuss allowing documentation on arguments of n-ary constructor, on function arguments, on sub type expressions) and on whether all these forms should share the same concrete syntax (if they have to, this either introduce ambiguity or more complex placement rules). And even for the decision between allowing documentation only before declaration or only after or either before and after is not consensual. For people spending more time browsing source .mli files than generated documentation, having stricter rules would help, as Daniel mentioned.

Maintaining compatibility with ocamldoc is a valuable goal, but one could also try to design something better and implement a compatibility mode for a transition period (and/or automated migration tools).

I still feel uncomfortable with the idea that adding or removing whitespace to a source file can change the interpretation of the source code by the compiler (except for concrete locations), and that this interpretation can no longer be described by a yacc grammar.

Nevertheless, if there is enough support for the current proposal, can we at least get a description of the (new, slighlty less ambiguous) placement rules/algorithm (e.g. the piece of text that would go into the language manual)?

@damiendoligez
Copy link
Copy Markdown
Member

To all: Alain makes some very good points, and if we want to integrate this proposal we need two things:

  1. A clear migration path for a change of placement rules in case the syntax discussion eventually produces a consensus.
  2. Documentation.

@alainfrisch :

I still feel uncomfortable with the idea that adding or removing whitespace to a source file can change the interpretation of the source code by the compiler (except for concrete locations), and that this interpretation can no longer be described by a yacc grammar.

I don't have a problem with that: these are still comments, and they don't influence the generated code in any way. In a sense, this new option is just a helper for external tools, not a part of the compilation process. But I see your point about yacc-parsability: if the language can't be described with yacc, it had better be well documented and easy to understand, and the implementation should be straightforward.

@lpw25
Copy link
Copy Markdown
Contributor Author

lpw25 commented Mar 5, 2015

I suggest moving this discussion to #149, which is the pull request for the new version of the patch. The pull request includes a description of the new comment attachment rules.

@gasche
Copy link
Copy Markdown
Member

gasche commented Mar 5, 2015

(Should I close this PR then?)

@lpw25
Copy link
Copy Markdown
Contributor Author

lpw25 commented Mar 5, 2015

(Should I close this PR then?)

Good point

@lpw25 lpw25 closed this Mar 5, 2015
anmolsahoo25 pushed a commit to anmolsahoo25/ocaml that referenced this pull request Aug 25, 2020
Disable interrupts in major_gc slices to avoid high pause times
poechsel pushed a commit to poechsel/ocaml that referenced this pull request Jul 2, 2021
Remove the `fun_tailrec_entry_point_label` field from `Cfg.t`, and explicitly store the destination label in the `Self` constructor of `Cfg_intf.tail_call_operation`.
EmileTrotignon pushed a commit to EmileTrotignon/ocaml that referenced this pull request Jan 12, 2024
EmileTrotignon pushed a commit to EmileTrotignon/ocaml that referenced this pull request Jan 12, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

8 participants