Proposal: Replacing the PromQL Parser with a generated parser

# Replacing the PromQL Parser with a Generated Parser

## Motivation

### Current State of the Parser

Currently, the PromQL query engine utilizes a handwritten, recursive-descent parser, that was written five years ago and hasn't seen large changes since then.
It seems that none of the active members of the Prometheus project has a deep understanding of the parser code and is willing to maintain it or review complex PRs on it (based on the discussion in <https://github.com/prometheus/prometheus/pull/6061>).

There also seem to be some hidden features and edge cases that nobody is aware of (see <https://github.com/prometheus/prometheus/pull/6198> and <https://github.com/prometheus/prometheus/issues/6065>).

The code seems to be far more complex than necessary and would require significant refactoring before continuing to develop new features on it.

### Using the Parser for the Upcoming Language Server 

The upcoming PromQL language Server (see [proposal](https://github.com/slrtbtfs/promql-lsp/blob/master/doc/proposals/2019_promql-language-server.md)) requires a parser in order to analyze PromQL queries.

For purposes of compatibility, maintainability, and code reuse, it is intended for the language server to use the same PromQL parse as Prometheus.
To enable that use case, some enhancements to the existing parser have been [proposed](https://github.com/slrtbtfs/promql-lsp/blob/master/doc/proposals/2019_promql-parser-improvements.md).
These mainly concern the _output_ of the parser and have already been partly implemented.
However, some of the features described there, such as error recovery, would be difficult to implement in the existing parser.

#### Producing a Deeper Syntax Tree

Consider the following PromQL query:

    metric{label=="value", foo~="ba."}[5m] offset 3h

The corresponding syntax tree produced by the current parser looks something like the following:

    |---- MatrixSelector: metric{label=="value", foo~="ba."}[5m] offset 3h

This syntax tree is sufficient if the only goal is evaluating PromQL queries.
In fact, the PromQL query engine expects this simplified format.

However, it is insufficient for more involved use cases, such as: auto-completion that works almost anywhere in the code, generating helpful error messages, showing function signatures and documentation while editing queries, etc.
Instead, a syntax tree structured as follows would be more appropriate:

    |---- OffsetModifier: metric{label=="value", foo~="ba."}[5m] offset 3h
    . . . |---- MatrixSelector: metric{label=="value", foo~="ba."}[5m]
    . . . . . . |---- VectorSelector: metric{label=="value", foo~="ba."}
    . . . . . . . . . |---- LabelMatcher: label=="value"
    . . . . . . . . . . . . |---- Label: label
    . . . . . . . . . . . . |---- Value: "value"
    . . . . . . . . . |---- LabelMatcher: foo~="ba."
    . . . . . . . . . . . . |---- Label: foo 
    . . . . . . . . . . . . |---- Value: "ba."

### Advantages of a Generated Parser

#### It is easier to maintain

The complex task of recognizing the PromQL syntax is performed by the generated code.
Humans only have to deal with a comparably readable formal grammar and the code that builds up a syntax tree from already well structured input.

#### It explicitly documents the formal grammar of PromQL

A formal grammar provides a human- and machine-readable definition of what is a correct PromQL query.
This would eliminate edge cases and unknown features such as the ones described above.

#### It is possible to handle incomplete expressions

Consider the following incomplete PromQL input from a user:

    some_metric{la

By adding syntax rules for incomplete expressions, the parser could figure out that it got an incomplete vector selector with a certain structure.
A language server could then use this information to offer completions for all labels that occur together with `some_metric` and start with `la`.

#### It is easier to provide helpful error messages

All that needs to be done in order to add new error messages, is to add a grammar rule that recognizes this specific error class and add some code that handles this error when building up the syntax tree.

## Proposal

This document proposes replacing the current PromQL parser with one that is generated by [goyacc](https://godoc.org/golang.org/x/tools/cmd/goyacc).
Goyacc was chosen because yacc-generated parsers have been successfully used for decades.
Further, of the few alternatives for generating go parsers, none seem to be on par with goyacc.

The new parser should able to output two syntax tree formats:

* a deeper tree, where almost every token is assigned a node, designed to be used by applications that require additional context, such as the language server; and

* a simplified tree that looks almost like the current one, intended to be used by the PromQL query engine and any other application that does not need the additional context.

The parser should able to parse incomplete expressions.

The changes proposed in a [previous proposal](https://github.com/slrtbtfs/promql-lsp/blob/master/doc/proposals/2019_promql-language-server.md) remain valid.
The changes to the syntax tree and lexer that have already been implemented can be still used with the new parser.
The remaining changes, i.e. incomplete expression parsing and having an error list, will be implemented along with the newly proposed parser.

## Architecture

### lexer

Yacc relies on receiving a stream of tokens from  a lexer.
The new parser will continue to use the logic of the existing lexer.
The lexer output will be modified to satisfy the interfaces required by goyacc.
To be consistent with the usual yacc conventions, token names will be capitalized.

### yacc grammar

The core of the parser is a formal grammar of PromQL, which is the input for the parser generator and provides a precise and explicit definition of the PromQL syntax.
An extra rule must be declared for every class of error and incomplete expressions that the parser should handle.

The grammar is able to recognize all valid PromQL queries as well as common cases of incomplete or otherwise incorrect queries.

As an example, the part of the grammar recognizing an instant vector selector will look roughly like:

    selector:
        IDENTIFIER                          {$$ = newVectorSelector(nil, $1}
      | IDENTIFIER LBRACE          RBRACE   {$$ = newVectorSelector(nil, $1, $2, nil, $3)}
      | IDENTIFIER LBRACE matchers RBRACE   {$$ = newVectorSelector(nil, nil, $1, $2, $3}
      |            LBRACE          RBRACE   {$$ = newVectorSelector(nil, nil, $1, nil, $2)}
      |            LBRACE matchers RBRACE   {$$ = newVectorSelector(nil, $1, $2, $3,  $4)}
      | IDENTIFIER LBRACE matchers anything {$$ = newVectorSelector([]ErrCodes{
                                                                            ErrNoClosingBrace,
                                                                        }, $1, $2, $3,  $4)}
      |            LBRACE matchers anything {$$ = newVectorSelector([]ErrCodes{
                                                                            ErrNoClosingBrace,
                                                                        }, nil, $1, $2, $3)}

    matchers:
        matchers COMMA matcher              {$$ = newMatchers(nil, $1, $2, $3)}
      | matcher                             {$$ = newMatchers(nil, $1)}

    matcher:
        IDENTIFIER matchOp STRING           {$$ = newMatcher(nil, $1, $2, $3)}
      | IDENTIFIER matchOp anything         {$$ = newMatcher([]ErrCodes{
                                                                        ErrWrongType,
                                                                }, $1, $2, $3)}
      | IDENTIFIER anything                 {$$ = newMatcher([]ErrCodes{
                                                                        ErrNoMatchOp,
                                                                }, $1, $2)}

    matchOp:
        EQ                                  {$$ = $1}
      | NEQ                                 {$$ = $1}
      | REGEQ                               {$$ = $1}
      | REGNEQ                              {$$ = $1}

    
    anything:
        // All valid PromQL Syntax Elements
        ...
      | EOF                                 {$$ = $1}
      | error                               {$$ = $1}

### Building up the Abstract Syntax Tree (AST)

Every match in the grammar calls a function of the following type or simply returns the token if no special action is required:

    func ([]ErrCodes, ...*Expr) *Expr

Tokens implement the Expr interface, so no special distinction has to be made here.

These functions gradually build up a syntax tree and perform all error and type checking along the way, using the hints provided by the given `ErrCodes`

### The Expr Interface

All Syntax tree nodes must satisfy the `Expr` interface and implement at least the following functions:

    type Expr interface {
        // A string representation
        String()     string
        // Position
        Pos()        token.Pos
        EndPos()     token.Pos
        // Errors in the respective subtree
        Errors()     []*ParseErr
        // The Type of the Expr
        Type()       ExprType
        // Simplifies the subtree to allow fast Evaluation by the PromQL engine
        Simplify()   *Expr
        // Reports if the subtree is already simplified. If yes, the subtree is 
        // skipped by Simplify() for performance reasons
        Simplified() bool
    }

### AST Simplifying and Backwards Compatibility

The syntax tree produced by the new parser will provide an accurate and detailed representation of the query.
However, when evaluating queries in the PromQL query engine, it is better to have a less complex syntax tree.
In fact, the PromQL query engine expects matrix and vector selectors to be represented as a single node rather and as a complex tree.

In order to avoid completely breaking the query engine and any other existing users of the current PromQL parser, the `Expr` interface will have a function `Simplify()` that recurses through the subtree, tries to reduce the depth of the tree and changes matrix and vector selector to the format that the query engine currently expects. 

To produce this simplified, backwards-compatible syntax tree format, it suffices to call `Simplify()` on the root node.

It is possible to implement additional optimizations like pre-evaluation of scalar expressions in the `Simplify()` method.


### Scope of the change

The parser will be completely rewritten.

The output format and token naming conventions in the lexer will be changed to fit the conventions and interfaces of yacc.
Most of the actual logic will remain unchanged.

The changes to the query engine will be minimized by the AST Simplification mechanism.
The changes will be confined to some name changes in types and interfaces.

### Possible Regressions

By Rice's theorem, it may be impossible be certain whether the new parser accepts exactly the same syntax as the old one.

Less academically, it is possible that there exist edge cases in the current parser that the new parser does not handle because no one knows about them.
It is likely impossible to completely mitigate the risk of accidentally introducing a breaking incompatibility of this type.
This is a consequence of the fact that no formal definition of PromQL exists yet.

In order to reduce this risk, a large sampling of queries that work with the current parser should be collected and used to test the new parser.

---
Source of truth for this proposal: <https://github.com/slrtbtfs/promql-lsp/blob/master/doc/proposals/2019_generated%20promql%20parser.md>

---

**TL;DR:** This document proposes rewriting the current parser and replacing it with a generated one. This should make it more maintainable and easier to implement some features needed for the upcoming PromQL language server.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Proposal: Replacing the PromQL Parser with a generated parser #6256

Replacing the PromQL Parser with a Generated Parser

Motivation

Current State of the Parser

Using the Parser for the Upcoming Language Server

Producing a Deeper Syntax Tree

Advantages of a Generated Parser

It is easier to maintain

It explicitly documents the formal grammar of PromQL

It is possible to handle incomplete expressions

It is easier to provide helpful error messages

Proposal

Architecture

lexer

yacc grammar

Building up the Abstract Syntax Tree (AST)

The Expr Interface

AST Simplifying and Backwards Compatibility

Scope of the change

Possible Regressions

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Proposal: Replacing the PromQL Parser with a generated parser #6256

Description

Replacing the PromQL Parser with a Generated Parser

Motivation

Current State of the Parser

Using the Parser for the Upcoming Language Server

Producing a Deeper Syntax Tree

Advantages of a Generated Parser

It is easier to maintain

It explicitly documents the formal grammar of PromQL

It is possible to handle incomplete expressions

It is easier to provide helpful error messages

Proposal

Architecture

lexer

yacc grammar

Building up the Abstract Syntax Tree (AST)

The Expr Interface

AST Simplifying and Backwards Compatibility

Scope of the change

Possible Regressions

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions