
Improve source location support #232

Closed
hsiaosiyuan0 wants to merge 15 commits into quickjs-ng:master from hsiaosiyuan0:feat/column

Conversation

@hsiaosiyuan0

Hi, this PR tries to improve source location support.

Correct source locations are useful, I think, not only for better error reports but also for a future debugger to support breakpoints, so I made the changes below; review requested.

Unicode

handle Unicode characters included in strings, identifiers, comments and regexp literals.

let a = "😀" + 1;

current token stream is:

1:1 ident: 'let'
1:4 ident: 'a'
1:6 token: '='
1:8 string: '😀'
1:15 token: '+'
1:17 number: 1
1:17 token: ';'
2:1 eof

the loc of + is miscalculated; the correct one is 1:13
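The drift comes from the 4-byte UTF-8 encoding of 😀. A tiny standalone sketch (illustrative, not part of the patch) contrasts a naive byte-based column with a code-point column:

```c
#include <assert.h>
#include <string.h>

/* 1-based column if we naively count bytes. */
static int byte_column(const char *line, const char *tok)
{
    return (int)(tok - line) + 1;
}

/* 1-based column counting code points: skip UTF-8 continuation
   bytes, which match the bit pattern 0b10xxxxxx. */
static int codepoint_column(const char *line, const char *tok)
{
    int col = 1;
    for (const char *p = line; p < tok; p++) {
        if (((unsigned char)*p & 0xC0) != 0x80)
            col++;
    }
    return col;
}
```

For the + in the line above, the helpers report byte column 16 but code-point column 13; the 3-column drift is exactly the emoji's 3 extra bytes.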

Leading spaces

the source location is miscalculated if the line starts with spaces.

  let a = "😀" + 1;

current token stream is:

1:2 ident: 'let'
1:6 ident: 'a'
1:8 token: '='
1:10 string: '😀'
1:17 token: '+'
1:19 number: 1
1:19 token: ';'
2:1 eof

the loc of let should be 1:3 to match editors' behavior.

Regression tests

some tests under the tests directory cover the changes in this PR.

the tests are based on:

  1. two new options, DUMP_QJS_TOKEN and DUMP_QJS_BYTECODE, which tell qjs
     to print the token stream and bytecode stream to stdout

  2. the test runner run-test.js, which asserts that the expected snapshots are present in stdout.

Algorithm

Things start from next_token:

static __exception int next_token(JSParseState *s)
{
 // ...
 redo:
    s->token.line_num = s->line_num;
    s->token.col_num = s->col_num;         // 1
    s->token.ptr = p;                      // 2
    c = *p;
}
  1. redo starts tokenizing a new token, so s->col_num must already hold the correct value when it is assigned; computing it is the job of the previous next_token call

  2. supposing we are in that previous next_token call, we need to advance the column number correctly for the next token

For example:

"😀" + 1

next_token is called to tokenize "😀":

  • its start column s->col_num was computed by the previous next_token call, so a simple assignment suffices at point 1

  • at point 2, s->token.ptr records the start byte offset of the "😀" token

then the process runs to js_parse_string to parse the string literal:

static __exception int next_token(JSParseState *s)
{
    case '\"':
        if (js_parse_string(s, c, TRUE, p + 1, &s->token, &p))
            goto fail;
        break; // 1
}

if the string is well formed, the process reaches the break at point 1, which leads to the end of next_token:

static __exception int next_token(JSParseState *s)
{
    // ...
    advance_col(s, p, s->token.ptr); 
    s->buf_ptr = p;

#ifdef DUMP_TOKEN
    dump_token(s, &s->token);
#endif
}

the advance_col macro performs the column calculation:

#define advance_col(s, cur_ofst, last_ofst)                                    \
    do {                                                                       \
        s->col_num += (intptr_t)cur_ofst - (intptr_t)last_ofst - s->over_cols; \
        s->over_cols = 0;                                                      \
    } while(0)

the first part, (intptr_t)cur_ofst - (intptr_t)last_ofst, computes the byte delta; that is not the column delta when multi-byte characters are involved, so s->over_cols was introduced as an adjustment.

s->over_cols is computed as follows:

static __exception int js_parse_string(JSParseState *s, int sep,
                                       BOOL do_throw, const uint8_t *p,
                                       JSToken *token, const uint8_t **pp)
{
            case '\n':
                /* ignore escaped newline sequence */
                p++;
                if (sep != '`') {
                    s->line_num++;
                    s->col_num = 1;
                    s->over_cols = 0;                                       // 1
                }
                continue;
            // ...
            default:
                if (c >= '0' && c <= '9') {
                    // ...
                } else if (c >= 0x80) {
                    const uint8_t *p_next;
                    c = unicode_from_utf8(p, UTF8_CHAR_LEN_MAX, &p_next);
                    s->over_cols += (intptr_t)p_next - (intptr_t)p - 1;     // 2
                    if (c > 0x10FFFF) {
                        goto invalid_utf8;
                    }
                }
}
  • at point 1, s->over_cols is reset to zero when a new line starts; otherwise

  • at point 2, (intptr_t)p_next - (intptr_t)p computes the number of bytes of the Unicode code point read into c; 1 is subtracted since the first byte is already counted in the byte delta
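As an illustration of the adjustment at point 2: the extra bytes per sequence (the amount added to s->over_cols) can be derived from the UTF-8 lead byte alone. A standalone sketch with assumed names, not the patch's actual helpers:

```c
#include <assert.h>
#include <stddef.h>

/* Byte length of a UTF-8 sequence, decided by its lead byte. */
static int utf8_seq_len(unsigned char lead)
{
    if (lead < 0x80)           return 1;
    if ((lead & 0xE0) == 0xC0) return 2;
    if ((lead & 0xF0) == 0xE0) return 3;
    return 4; /* 0xF0..0xF4 lead bytes */
}

/* Extra bytes beyond one column per code point, mirroring s->over_cols. */
static int over_cols_for(const unsigned char *p, size_t n)
{
    int over = 0;
    size_t i = 0;
    while (i < n) {
        int len = utf8_seq_len(p[i]);
        over += len - 1; /* a 4-byte 😀 adds 3 */
        i += len;
    }
    return over;
}
```

For the 4-byte 😀 the sketch reports 3, so the byte delta minus over_cols yields the code-point delta that advance_col needs.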

@hsiaosiyuan0
Author

Put another way, it's important to count code points, not graphemes. Surrogate pairs count as two code points.

I am not sure what this means, so I made a test with the TypeScript compiler and this patch. The code below:

let a = '😀' + b

produces this ES5 output:

"use strict";
var a = '😀' + b;
//# sourceMappingURL=data:application/json;base64,eyJ2ZXJzaW9uIjozLCJmaWxlIjoiaW5wdXQuanMiLCJzb3VyY2VSb290IjoiIiwic291cmNlcyI6WyJpbnB1dC50c3giXSwibmFtZXMiOltdLCJtYXBwaW5ncyI6IjtBQUFBLElBQUksQ0FBQyxHQUFHLElBQUksR0FBRyxDQUFDLENBQUEifQ==

the base64 decoded sourcemap is:

{"version":3,"file":"input.js","sourceRoot":"","sources":["input.tsx"],"names":[],"mappings":";AAAA,IAAI,CAAC,GAAG,IAAI,GAAG,CAAC,CAAA"}

then running the ES5 output with this patch gives this error message:

ReferenceError: 'b' is not defined
    at <eval> (tests/test_language.js:2:15)

then using the ScriptMapper tool to resolve the original source location:

(image: ScriptMapper output)

please point out any mistakes

@bnoordhuis
Contributor

I believe the column number is off by one in your example of var a = '😀' + b; because of the following:

  1. Column numbers start from one, and

  2. The 😀 character is made up of two surrogates, U+D83D and U+DE00, i.e.:

qjs > "var a = '😀' + b;".split("")
[ "v", "a", "r", " ", "a", " ", "=", " ", "'", "\ud83d", "\ude00", "'", " ", "+", " ", "b", ";" ]

Note how 'b' is at position 15 in the array. Arrays count from 0 but columns from 1 so the column number should be 15 + 1 = 16.

15 is correct visually because that's what people see in their editor, but it's off by one for tools/libraries that parse stack traces. It's for such tools that I added the column number in the first place.

As a reference point, the column number is 16 in node too.
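The code-unit arithmetic above can be sketched like this (an illustrative helper, not QuickJS or V8 code):

```c
#include <assert.h>
#include <stdint.h>

/* UTF-16 code units occupied by one code point: anything above U+FFFF is
   encoded as a surrogate pair and takes two units (two "columns" in
   V8/Node's counting); everything else takes one. */
static int utf16_units(uint32_t cp)
{
    return cp > 0xFFFF ? 2 : 1;
}
```

In var a = '😀' + b; there are 14 code points before b, but 😀 contributes 2 units, so b sits after 15 code units, at column 16.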

@bnoordhuis
Contributor

Something else: is there a way to make the diff for quickjs.c a little smaller? I've looked at the changes briefly and it feels like it should be possible to achieve what you're trying to do with fewer lines of code.

Correct me if I'm wrong but you're essentially trying to make the tokenizer track column number across UTF-8-sequences, right?

@hsiaosiyuan0
Author

hsiaosiyuan0 commented Dec 22, 2023

Thank you, I understand your example: it counts columns based on UTF-16 surrogate pairs, rather than the Unicode code points this patch uses. What confuses me is why this is required; I think the key is which tools/libraries you're using to parse stack traces. Do you have more details about those tools, like their names or links to code?

Counting columns based on Unicode code points is easy to understand, I think, and it can work with source maps coming from upstream tools like transpilers; the TypeScript example above tries to show this.

@hsiaosiyuan0
Author

Something else: is there a way to make the diff for quickjs.c a little smaller? I've looked at the changes briefly and it feels like it should be possible to achieve what you're trying to do with fewer lines of code.

Could you please give me some suggestions as a starting point?

Some details about this patch:

c = unicode_from_utf8(p - 1, UTF8_CHAR_LEN_MAX, &p_next);
s->over_cols += (intptr_t)p_next - (intptr_t)(p - 1) - 1;

It computes s->over_cols right after unicode_from_utf8 to avoid counting code points twice, so other places don't need to take care of Unicode.

loc = LOC(s->token.line_num, s->token.col_num);

It records the location of the leading token of an expression or statement, since the bytecode stream is not always produced in source-code order.

s->loc = loc;
// ...

static void emit_op(JSParseState *s, uint8_t val)
{
    JSFunctionDef *fd = s->cur_func;
    DynBuf *bc = &fd->byte_code;

    if (s->loc != 0) {
        dbuf_putc(bc, OP_source_loc);
        dbuf_put_u64(bc, s->loc);
        s->loc = 0;
    }

    fd->last_opcode_pos = bc->size;
    dbuf_putc(bc, val);
}

manually setting s->loc controls whether a location is produced for the next emit_op call

@bnoordhuis
Contributor

I was just about to step away from my PC so I'm afraid I don't have time to go into great detail right now but what I think you need to do is locate the few places where col_num is actually assigned (looks like s->token.col_num = s->mark - s->eol;) and make sure those take UTF-8 sequences into account.

You can keep it simple if you want and simply scan from s->eol to s->mark again to see if there are non-ASCII characters. The common case is that there aren't so the inefficiency of scanning the string twice is likely negligible.
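That second pass could look roughly like this (a minimal sketch; the names eol and mark are borrowed from the comment above, the helper itself is hypothetical):

```c
#include <assert.h>

/* Hypothetical second pass: after a token has been scanned, rescan from the
   start of the line (eol) to the token (mark), counting only bytes that are
   not UTF-8 continuation bytes, so the column is in code points, not bytes. */
static int column_of(const unsigned char *eol, const unsigned char *mark)
{
    int col = 1; /* columns start at 1 */
    for (const unsigned char *p = eol; p < mark; p++) {
        if ((*p & 0xC0) != 0x80)
            col++;
    }
    return col;
}
```

In the common all-ASCII case the loop degenerates to a plain pointer subtraction's worth of work, which is why the double scan is usually cheap.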

@hsiaosiyuan0
Author

hsiaosiyuan0 commented Dec 22, 2023

As a reference point, the column number is 16 in node too.

Thank you for this reference; I confirmed it on d8. That's because V8 uses a UTF-16 stream for parsing, I guess, whereas qjs uses UTF-8 sequences.

there aren't so the inefficiency of scanning the string twice is likely negligible

Maybe that's not true, since every token would require a scan from the line-start offset to get its column; it could be worse if the JS source is a multi-megabyte minified bundle.

@bnoordhuis
Contributor

The reason I'm not overly worried about performance is because V8 does something similar, and if performance-wise that's good enough for them, it sure as heck is good enough for us. :-)

What V8 does is this:

  1. store source locations as byte offsets (not line:column pairs)
  2. store newline locations in one big array, also as byte offsets
  3. binary-search source locations in that array; presto, there's your line number
  4. scan from the newline's location to the source location to get the column number

I'm reasonably sure they picked that approach because it's memory efficient while not being super expensive CPU-wise. It's just a linear scan and does very little per byte processed (unlike the tokenizer that does a lot of work per byte.)
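Steps 1-3 above can be sketched as follows (an illustrative reconstruction of the idea, not actual V8 code):

```c
#include <assert.h>
#include <stddef.h>

/* Given the byte offsets of all newlines in the source (sorted ascending),
   binary-search a source offset to find its 1-based line number. The column
   then follows from a short scan starting at the preceding newline. */
static int line_of(const size_t *newlines, size_t n, size_t ofs)
{
    size_t lo = 0, hi = n;
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (newlines[mid] < ofs)
            lo = mid + 1;
        else
            hi = mid;
    }
    return (int)lo + 1; /* newlines strictly before ofs, plus one */
}
```

The appeal of the layout is that per source location only one byte offset is stored, and the newline array is shared by every location in the file.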

@hsiaosiyuan0
Author

I took your tips and looked into V8's source code, and found the information below; please correct me if I'm wrong.

The location used in V8's parsing stage is not a line:column pair that can be reported directly to developers as source-location info; instead, it is a begin_pos:end_pos pair of buffer offsets for the token, which can be used to access the source code under the token.

For reporting source-code locations, V8 introduces an algorithm that transforms those offsets into an eye-friendly location; running that algorithm is deferred until the location is actually needed, such as when building stack traces.

The V8 algorithm has these benefits:

  • it avoids recording both the token's source range and its line:column (two pieces of location info), reducing the memory footprint

  • begin_pos:end_pos is independent of the code-unit width; it looks like V8 supports a one-byte stream for UTF-8 and a two-byte stream for UTF-16

  • line:column is computed lazily; computing it while parsing is performance sensitive, I think. This is more a property of the algorithm itself than a benefit

back to qjs:

  1. maybe it won't support source streams with various code-unit widths in the near future

  2. line is already used in qjs instead of begin_pos; qjs's token has another field, JSToken::ptr, which records the begin_pos

so adding a new field JSToken::col_num to pair with JSToken::line_num is easy to understand, I think. We could still imitate V8's algorithm by saving line:end_pos and deferring the end_pos-to-column calculation as V8 does, but I think that would increase complexity.

@bnoordhuis
Contributor

Yes, you've pretty much summarized it. I was initially tempted to copy V8's approach but it's more complex and it would have made the pull request even bigger than it already was.

I suggested doing a second simple forward scan in #232 (comment). To keep the overhead linear, cache the result from the previous scan so you don't start from s->eol every time.

In other words, count the non-ASCII characters since the last token and add that count to the current token's column number.
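The cached variant could be sketched like this (state layout and names are assumptions, not the PR's actual code): the scanner remembers where the previous scan stopped, so over a whole line the total work stays linear.

```c
#include <assert.h>

/* Hypothetical incremental column tracker: `last` is where the previous
   scan ended and `col` is the 1-based column at that point. Each call only
   scans the bytes since the previous token. */
typedef struct {
    const unsigned char *last;
    int col;
} ColCache;

static int column_at(ColCache *c, const unsigned char *mark)
{
    for (; c->last < mark; c->last++) {
        if ((*c->last & 0xC0) != 0x80) /* skip UTF-8 continuation bytes */
            c->col++;
    }
    return c->col;
}
```

The cache would be reset to {line_start, 1} at every newline; successive calls then only touch bytes they have not seen before.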

@hsiaosiyuan0
Author

hsiaosiyuan0 commented Dec 27, 2023

I implemented the scan-twice strategy in my fork before this CL; to compare the performance difference I moved it into the branch feat/column-twice-scan. The regression tests for this CL also pass there.

The benchmark uses this CL and the scan-twice version to execute test.js, a big JS file consisting of famous JS frameworks (acorn does the same benchmark).

For this CL:

# hyperfine --runs 20 './build/qjs test.js'
Benchmark 1: ./build/qjs  test.js
  Time (mean ± σ):     286.8 ms ±   8.5 ms    [User: 272.8 ms, System: 12.0 ms]
  Range (min … max):   272.5 ms … 311.2 ms    20 runs

for the scan-twice version:

# hyperfine --runs 20 './build/qjs test.js'
Benchmark 1: ./build/qjs  test.js
  Time (mean ± σ):     309.4 ms ±   8.3 ms    [User: 294.9 ms, System: 12.2 ms]
  Range (min … max):   299.3 ms … 333.5 ms    20 runs

scan-twice needs fewer changes, but loses ≈7% mean performance.

@bnoordhuis
Contributor

7% on a 4 MB file is acceptable to me. That's negligible or downright immeasurable on most real-world inputs.

@saghul
Contributor

saghul commented Dec 27, 2023

Sounds pretty good indeed!

@hsiaosiyuan0
Author

switched to the scan-twice strategy, please take a look.

@TooTallNate
Contributor

TooTallNate commented Dec 28, 2023

This branch improves the issue I reported in #236.

input

function r(){return e()}function e(){throw new Error("bad")}r()
//                  ↑↑                     ↑   ↑            ↑↑
// QuickJS          |22                    |   48           |62
// Node.js          21                     44               61

quickjs

Error: bad
    at e (min.js:1:48)
    at r (min.js:1:22)
    at <eval> (min.js:1:62)

node

Error: bad
    at e (/Users/nrajlich/Code/quickjs-ng/quickjs/min.js:1:44)
    at r (/Users/nrajlich/Code/quickjs-ng/quickjs/min.js:1:21)
    at Object.<anonymous> (/Users/nrajlich/Code/quickjs-ng/quickjs/min.js:1:61)

One thing that seems off to me is the 22 and 62 reported by quickjs, which points to the opening ( of the e() and r() function invocations. Node.js points to the name of the function rather than the opening parens. I've marked the column numbers that QuickJS and Node are reporting in the input file above, for a visual comparison.

@hsiaosiyuan0
Author

for e(), this CL treats ( as the call site, because the opcode under e is used to get the object bound to e; however, this behavior can be changed based on our discussion.

@bnoordhuis
Contributor

It's not clear to me why the source location has to be recorded in so many places. It's supposed to sort of fall out naturally inside the tokenizer.

My suggestion of scanning twice should be a fairly minimal change to the tokenizer: once the token has been scanned, scan the bytes again to correct for multi-byte sequences.

Comment thread on quickjs.c (outdated)
@hsiaosiyuan0
Author

hsiaosiyuan0 commented Dec 29, 2023

setting s->loc adjusts the location for the next opcode emission, for example:

let b = 1;
let c = 2;
let a = b + c;

this CL will produce bytecode stream:

check_define_var b,128
check_define_var c,128
check_define_var a,128
define_var b,130
define_var c,130
define_var a,130
push_i32 1
source_loc 1:9
put_var_init b
push_i32 2
source_loc 2:9
put_var_init c
source_loc 3:9
get_var b
source_loc 3:13
get_var c
add
source_loc 3:9
put_var_init a
get_loc 0: "<ret>"
return

source_loc 3:9 would not exist without s->loc being set here:

static __exception int js_parse_var(JSParseState *s, int parse_flags, int tok,
                                    BOOL export_flag)
{
    // ...
                    s->loc = loc;  // <-- here
                    emit_op(s, (tok == TOK_CONST || tok == TOK_LET) ?
                        OP_scope_put_var_init : OP_scope_put_var);

the reason is that when OP_scope_put_var_init is emitted, the lexer has already moved to the end of the var declaration.

source_loc 3:9 is required for putting a breakpoint there when debugging; the screenshot below is taken from Node.js (see the breakpoint before b):

(image: Node.js debugger showing a breakpoint before b)
