
Improve source location support #232

Closed
hsiaosiyuan0 wants to merge 15 commits into quickjs-ng:master from hsiaosiyuan0:feat/column

Conversation

@hsiaosiyuan0

Hi, this PR tries to improve source location support.

Correct source locations are useful, I think, not only for better error reports but also for a future debugger to support breakpoints, so I made the changes below; review requested.

Unicode

handle Unicode characters included in strings, identifiers, comments and regexp literals.

let a = "😀" + 1;

current token stream is:

1:1 ident: 'let'
1:4 ident: 'a'
1:6 token: '='
1:8 string: '😀'
1:15 token: '+'
1:17 number: 1
1:17 token: ';'
2:1 eof

the loc of + is miscalculated; the correct one is 1:13
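The drift comes from the 4-byte UTF-8 encoding of 😀. A tiny standalone sketch (illustrative, not part of the patch) contrasts a naive byte-based column with a code-point column:

```c
#include <assert.h>
#include <string.h>

/* 1-based column if we naively count bytes. */
static int byte_column(const char *line, const char *tok)
{
    return (int)(tok - line) + 1;
}

/* 1-based column counting code points: skip UTF-8 continuation
   bytes, which match the bit pattern 0b10xxxxxx. */
static int codepoint_column(const char *line, const char *tok)
{
    int col = 1;
    for (const char *p = line; p < tok; p++) {
        if (((unsigned char)*p & 0xC0) != 0x80)
            col++;
    }
    return col;
}
```

For the + in the line above, the helpers report byte column 16 but code-point column 13; the 3-column drift is exactly the emoji's 3 extra bytes.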

Leading spaces

the source location is miscalculated if the line starts with spaces.

  let a = "😀" + 1;

current token stream is:

1:2 ident: 'let'
1:6 ident: 'a'
1:8 token: '='
1:10 string: '😀'
1:17 token: '+'
1:19 number: 1
1:19 token: ';'
2:1 eof

the loc of let should be 1:3 to match editors' behavior.

Regression tests

some tests under the tests directory cover the changes in this PR.

the tests are based on:

  1. two new options, DUMP_QJS_TOKEN and DUMP_QJS_BYTECODE, which tell qjs
     to print the token stream and bytecode stream to stdout

  2. the test runner run-test.js, which asserts that the expected snapshots are present in stdout.

Algorithm

Things start from next_token:

static __exception int next_token(JSParseState *s)
{
 // ...
 redo:
    s->token.line_num = s->line_num;
    s->token.col_num = s->col_num;         // 1
    s->token.ptr = p;                      // 2
    c = *p;
}
  1. redo starts tokenizing a new token, so s->col_num must already hold the correct value when it is assigned; computing it is the job of the previous next_token call

  2. supposing we are in that previous next_token call, we need to advance the column number correctly for the next token

For example:

"😀" + 1

next_token is called to tokenize "😀":

  • its start column s->col_num was computed by the previous next_token call, so a simple assignment suffices at point 1

  • at point 2, s->token.ptr records the start byte offset of the "😀" token

then the process runs to js_parse_string to parse the string literal:

static __exception int next_token(JSParseState *s)
{
    case '\"':
        if (js_parse_string(s, c, TRUE, p + 1, &s->token, &p))
            goto fail;
        break; // 1
}

if the string is well formed, the process reaches the break at point 1, which leads to the end of next_token:

static __exception int next_token(JSParseState *s)
{
    // ...
    advance_col(s, p, s->token.ptr); 
    s->buf_ptr = p;

#ifdef DUMP_TOKEN
    dump_token(s, &s->token);
#endif
}

the advance_col macro performs the column calculation:

#define advance_col(s, cur_ofst, last_ofst)                                    \
    do {                                                                       \
        s->col_num += (intptr_t)cur_ofst - (intptr_t)last_ofst - s->over_cols; \
        s->over_cols = 0;                                                      \
    } while(0)

the first part, (intptr_t)cur_ofst - (intptr_t)last_ofst, computes the byte delta; that is not the column delta when multi-byte characters are involved, so s->over_cols was introduced as an adjustment.

s->over_cols is computed as follows:

static __exception int js_parse_string(JSParseState *s, int sep,
                                       BOOL do_throw, const uint8_t *p,
                                       JSToken *token, const uint8_t **pp)
{
            case '\n':
                /* ignore escaped newline sequence */
                p++;
                if (sep != '`') {
                    s->line_num++;
                    s->col_num = 1;
                    s->over_cols = 0;                                       // 1
                }
                continue;
            // ...
            default:
                if (c >= '0' && c <= '9') {
                    // ...
                } else if (c >= 0x80) {
                    const uint8_t *p_next;
                    c = unicode_from_utf8(p, UTF8_CHAR_LEN_MAX, &p_next);
                    s->over_cols += (intptr_t)p_next - (intptr_t)p - 1;     // 2
                    if (c > 0x10FFFF) {
                        goto invalid_utf8;
                    }
                }
}
  • at point 1, s->over_cols is reset to zero when a new line starts; otherwise

  • at point 2, (intptr_t)p_next - (intptr_t)p computes the number of bytes of the Unicode code point read into c; 1 is subtracted since the first byte is already counted in the byte delta
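As an illustration of the adjustment at point 2: the extra bytes per sequence (the amount added to s->over_cols) can be derived from the UTF-8 lead byte alone. A standalone sketch with assumed names, not the patch's actual helpers:

```c
#include <assert.h>
#include <stddef.h>

/* Byte length of a UTF-8 sequence, decided by its lead byte. */
static int utf8_seq_len(unsigned char lead)
{
    if (lead < 0x80)           return 1;
    if ((lead & 0xE0) == 0xC0) return 2;
    if ((lead & 0xF0) == 0xE0) return 3;
    return 4; /* 0xF0..0xF4 lead bytes */
}

/* Extra bytes beyond one column per code point, mirroring s->over_cols. */
static int over_cols_for(const unsigned char *p, size_t n)
{
    int over = 0;
    size_t i = 0;
    while (i < n) {
        int len = utf8_seq_len(p[i]);
        over += len - 1; /* a 4-byte 😀 adds 3 */
        i += len;
    }
    return over;
}
```

For the 4-byte 😀 the sketch reports 3, so the byte delta minus over_cols yields the code-point delta that advance_col needs.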

@hsiaosiyuan0
Author

Put another way, it's important to count code points, not graphemes. Surrogate pairs count as two code points.

I am not sure what this means, so I made a test with the TypeScript compiler and this patch. The code below:

let a = '😀' + b

produces this ES5 output:

"use strict";
var a = '😀' + b;
//# sourceMappingURL=data:application/json;base64,eyJ2ZXJzaW9uIjozLCJmaWxlIjoiaW5wdXQuanMiLCJzb3VyY2VSb290IjoiIiwic291cmNlcyI6WyJpbnB1dC50c3giXSwibmFtZXMiOltdLCJtYXBwaW5ncyI6IjtBQUFBLElBQUksQ0FBQyxHQUFHLElBQUksR0FBRyxDQUFDLENBQUEifQ==

the base64 decoded sourcemap is:

{"version":3,"file":"input.js","sourceRoot":"","sources":["input.tsx"],"names":[],"mappings":";AAAA,IAAI,CAAC,GAAG,IAAI,GAAG,CAAC,CAAA"}

then running the ES5 output with this patch gives this error message:

ReferenceError: 'b' is not defined
    at <eval> (tests/test_language.js:2:15)

then using the ScriptMapper tool to resolve the original source location:

(image: ScriptMapper output)

please point out any mistakes

@bnoordhuis
Contributor

I believe the column number is off by one in your example of var a = '😀' + b; because of the following:

  1. Column numbers start from one, and

  2. The 😀 character is made up of two surrogates, U+D83D and U+DE00, i.e.:

qjs > "var a = '😀' + b;".split("")
[ "v", "a", "r", " ", "a", " ", "=", " ", "'", "\ud83d", "\ude00", "'", " ", "+", " ", "b", ";" ]

Note how 'b' is at position 15 in the array. Arrays count from 0 but columns from 1 so the column number should be 15 + 1 = 16.

15 is correct visually because that's what people see in their editor, but it's off by one for tools/libraries that parse stack traces. It's for such tools that I added the column number in the first place.

As a reference point, the column number is 16 in node too.
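The code-unit arithmetic above can be sketched like this (an illustrative helper, not QuickJS or V8 code):

```c
#include <assert.h>
#include <stdint.h>

/* UTF-16 code units occupied by one code point: anything above U+FFFF is
   encoded as a surrogate pair and takes two units (two "columns" in
   V8/Node's counting); everything else takes one. */
static int utf16_units(uint32_t cp)
{
    return cp > 0xFFFF ? 2 : 1;
}
```

In var a = '😀' + b; there are 14 code points before b, but 😀 contributes 2 units, so b sits after 15 code units, at column 16.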

@bnoordhuis
Contributor

Something else: is there a way to make the diff for quickjs.c a little smaller? I've looked at the changes briefly and it feels like it should be possible to achieve what you're trying to do with fewer lines of code.

Correct me if I'm wrong but you're essentially trying to make the tokenizer track column number across UTF-8-sequences, right?

@hsiaosiyuan0
Author

hsiaosiyuan0 commented Dec 22, 2023

Thank you, I understand your example: it counts columns based on UTF-16 surrogate pairs, rather than the Unicode code points this patch uses. What confuses me is why this is required; I think the key is which tools/libraries you're using to parse stack traces. Do you have more details about those tools, like their names or links to code?

Counting columns based on Unicode code points is easy to understand, I think, and it can work with source maps coming from upstream tools like transpilers; the TypeScript example above tries to show this.

@hsiaosiyuan0
Author

Something else: is there a way to make the diff for quickjs.c a little smaller? I've looked at the changes briefly and it feels like it should be possible to achieve what you're trying to do with fewer lines of code.

Could you please give me some suggestions as a starting point?

Some details about this patch:

c = unicode_from_utf8(p - 1, UTF8_CHAR_LEN_MAX, &p_next);
s->over_cols += (intptr_t)p_next - (intptr_t)(p - 1) - 1;

It computes s->over_cols right after unicode_from_utf8 to avoid counting code points twice, so other places don't need to take care of Unicode.

loc = LOC(s->token.line_num, s->token.col_num);

It records the location of the leading token of an expression or statement, since the bytecode stream is not always produced in source-code order.

s->loc = loc;
// ...

static void emit_op(JSParseState *s, uint8_t val)
{
    JSFunctionDef *fd = s->cur_func;
    DynBuf *bc = &fd->byte_code;

    if (s->loc != 0) {
        dbuf_putc(bc, OP_source_loc);
        dbuf_put_u64(bc, s->loc);
        s->loc = 0;
    }

    fd->last_opcode_pos = bc->size;
    dbuf_putc(bc, val);
}

manually setting s->loc controls whether a location is produced for the next emit_op call

@bnoordhuis
Contributor

I was just about to step away from my PC so I'm afraid I don't have time to go into great detail right now but what I think you need to do is locate the few places where col_num is actually assigned (looks like s->token.col_num = s->mark - s->eol;) and make sure those take UTF-8 sequences into account.

You can keep it simple if you want and simply scan from s->eol to s->mark again to see if there are non-ASCII characters. The common case is that there aren't so the inefficiency of scanning the string twice is likely negligible.
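That second pass could look roughly like this (a minimal sketch; the names eol and mark are borrowed from the comment above, the helper itself is hypothetical):

```c
#include <assert.h>

/* Hypothetical second pass: after a token has been scanned, rescan from the
   start of the line (eol) to the token (mark), counting only bytes that are
   not UTF-8 continuation bytes, so the column is in code points, not bytes. */
static int column_of(const unsigned char *eol, const unsigned char *mark)
{
    int col = 1; /* columns start at 1 */
    for (const unsigned char *p = eol; p < mark; p++) {
        if ((*p & 0xC0) != 0x80)
            col++;
    }
    return col;
}
```

In the common all-ASCII case the loop degenerates to a plain pointer subtraction's worth of work, which is why the double scan is usually cheap.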

@hsiaosiyuan0
Author

hsiaosiyuan0 commented Dec 22, 2023

As a reference point, the column number is 16 in node too.

Thank you for this reference; I confirmed it on d8. That's because V8 uses a UTF-16 stream for parsing, I guess, whereas qjs uses UTF-8 sequences.

there aren't so the inefficiency of scanning the string twice is likely negligible

Maybe that's not true, since every token would require a scan from the line-start offset to get its column; it could be worse if the JS source is a multi-megabyte minified bundle.

@bnoordhuis
Contributor

The reason I'm not overly worried about performance is because V8 does something similar, and if performance-wise that's good enough for them, it sure as heck is good enough for us. :-)

What V8 does is this:

  1. store source locations as byte offsets (not line:column pairs)
  2. store newline locations in one big array, also as byte offsets
  3. binary-search source locations in that array; presto, there's your line number
  4. scan from the newline's location to the source location to get the column number

I'm reasonably sure they picked that approach because it's memory efficient while not being super expensive CPU-wise. It's just a linear scan and does very little per byte processed (unlike the tokenizer that does a lot of work per byte.)
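Steps 1-3 above can be sketched as follows (an illustrative reconstruction of the idea, not actual V8 code):

```c
#include <assert.h>
#include <stddef.h>

/* Given the byte offsets of all newlines in the source (sorted ascending),
   binary-search a source offset to find its 1-based line number. The column
   then follows from a short scan starting at the preceding newline. */
static int line_of(const size_t *newlines, size_t n, size_t ofs)
{
    size_t lo = 0, hi = n;
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (newlines[mid] < ofs)
            lo = mid + 1;
        else
            hi = mid;
    }
    return (int)lo + 1; /* newlines strictly before ofs, plus one */
}
```

The appeal of the layout is that per source location only one byte offset is stored, and the newline array is shared by every location in the file.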

@hsiaosiyuan0
Author

I took your tips and looked into V8's source code, and found the information below; please correct me if I'm wrong.

The location used in V8's parsing stage is not a line:column pair that can be reported directly to developers as source-location info; instead, it is a begin_pos:end_pos pair of buffer offsets for the token, which can be used to access the source code under the token.

For reporting source-code locations, V8 introduces an algorithm that transforms those offsets into an eye-friendly location; running that algorithm is deferred until the location is actually needed, such as when building stack traces.

The V8 algorithm has these benefits:

  • it avoids recording both the token's source range and its line:column (two pieces of location info), reducing the memory footprint

  • begin_pos:end_pos is independent of the code-unit width; it looks like V8 supports a one-byte stream for UTF-8 and a two-byte stream for UTF-16

  • line:column is computed lazily; computing it while parsing is performance sensitive, I think. This is more a property of the algorithm itself than a benefit

back to qjs:

  1. maybe it won't support source streams with various code-unit widths in the near future

  2. line is already used in qjs instead of begin_pos; qjs's token has another field, JSToken::ptr, which records the begin_pos

so adding a new field JSToken::col_num to pair with JSToken::line_num is easy to understand, I think. We could still imitate V8's algorithm by saving line:end_pos and deferring the end_pos-to-column calculation as V8 does, but I think that would increase complexity.

@bnoordhuis
Contributor

Yes, you've pretty much summarized it. I was initially tempted to copy V8's approach but it's more complex and it would have made the pull request even bigger than it already was.

I suggested doing a second simple forward scan in #232 (comment). To keep the overhead linear, cache the result from the previous scan so you don't start from s->eol every time.

In other words, count the non-ASCII characters since the last token and add that count to the current token's column number.
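The cached variant could be sketched like this (state layout and names are assumptions, not the PR's actual code): the scanner remembers where the previous scan stopped, so over a whole line the total work stays linear.

```c
#include <assert.h>

/* Hypothetical incremental column tracker: `last` is where the previous
   scan ended and `col` is the 1-based column at that point. Each call only
   scans the bytes since the previous token. */
typedef struct {
    const unsigned char *last;
    int col;
} ColCache;

static int column_at(ColCache *c, const unsigned char *mark)
{
    for (; c->last < mark; c->last++) {
        if ((*c->last & 0xC0) != 0x80) /* skip UTF-8 continuation bytes */
            c->col++;
    }
    return c->col;
}
```

The cache would be reset to {line_start, 1} at every newline; successive calls then only touch bytes they have not seen before.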

@hsiaosiyuan0
Author

hsiaosiyuan0 commented Dec 27, 2023

I implemented the scan-twice strategy in my fork before this CL; to compare the performance difference I moved it into the branch feat/column-twice-scan. The regression tests for this CL also pass there.

The benchmark uses this CL and the scan-twice version to execute test.js, a big JS file consisting of famous JS frameworks (acorn does the same benchmark).

For this CL:

# hyperfine --runs 20 './build/qjs test.js'
Benchmark 1: ./build/qjs  test.js
  Time (mean ± σ):     286.8 ms ±   8.5 ms    [User: 272.8 ms, System: 12.0 ms]
  Range (min … max):   272.5 ms … 311.2 ms    20 runs

for the scan-twice version:

# hyperfine --runs 20 './build/qjs test.js'
Benchmark 1: ./build/qjs  test.js
  Time (mean ± σ):     309.4 ms ±   8.3 ms    [User: 294.9 ms, System: 12.2 ms]
  Range (min … max):   299.3 ms … 333.5 ms    20 runs

scan-twice needs fewer changes, but loses ≈7% mean performance.

@bnoordhuis
Contributor

7% on a 4 MB file is acceptable to me. That's negligible or downright immeasurable on most real-world inputs.

@saghul
Contributor

saghul commented Dec 27, 2023

Sounds pretty good indeed!

@hsiaosiyuan0
Author

switched to the scan-twice strategy, please take a look.

@TooTallNate
Contributor

TooTallNate commented Dec 28, 2023

This branch improves the issue I reported in #236.

input

function r(){return e()}function e(){throw new Error("bad")}r()
//                  ↑↑                     ↑   ↑            ↑↑
// QuickJS          |22                    |   48           |62
// Node.js          21                     44               61

quickjs

Error: bad
    at e (min.js:1:48)
    at r (min.js:1:22)
    at <eval> (min.js:1:62)

node

Error: bad
    at e (/Users/nrajlich/Code/quickjs-ng/quickjs/min.js:1:44)
    at r (/Users/nrajlich/Code/quickjs-ng/quickjs/min.js:1:21)
    at Object.<anonymous> (/Users/nrajlich/Code/quickjs-ng/quickjs/min.js:1:61)

One thing that seems off to me is the 22 and 62 reported by quickjs, which points to the opening ( of the e() and r() function invocations. Node.js points to the name of the function rather than the opening parens. I've marked the column numbers that QuickJS and Node are reporting in the input file above, for a visual comparison.

@hsiaosiyuan0
Author

for e(), this CL treats ( as the call site, because the opcode under e is used to get the object bound to e; however, this behavior can be changed based on our discussion.

@bnoordhuis
Contributor

It's not clear to me why the source location has to be recorded in so many places. It's supposed to sort of fall out naturally inside the tokenizer.

My suggestion of scanning twice should be a fairly minimal change to the tokenizer: once the token has been scanned, scan the bytes again to correct for multi-byte sequences.

Comment thread on quickjs.c (outdated)
@hsiaosiyuan0
Author

hsiaosiyuan0 commented Dec 29, 2023

setting s->loc adjusts the location for the next opcode emission, for example:

let b = 1;
let c = 2;
let a = b + c;

this CL will produce bytecode stream:

check_define_var b,128
check_define_var c,128
check_define_var a,128
define_var b,130
define_var c,130
define_var a,130
push_i32 1
source_loc 1:9
put_var_init b
push_i32 2
source_loc 2:9
put_var_init c
source_loc 3:9
get_var b
source_loc 3:13
get_var c
add
source_loc 3:9
put_var_init a
get_loc 0: "<ret>"
return

source_loc 3:9 would not exist without s->loc being set here:

static __exception int js_parse_var(JSParseState *s, int parse_flags, int tok,
                                    BOOL export_flag)
{
    // ...
                    s->loc = loc;  // <-- here
                    emit_op(s, (tok == TOK_CONST || tok == TOK_LET) ?
                        OP_scope_put_var_init : OP_scope_put_var);

the reason is that when OP_scope_put_var_init is emitted, the lexer has already moved to the end of the var declaration.

source_loc 3:9 is required for putting a breakpoint there when debugging; the screenshot below is taken from Node.js (see the breakpoint before b):

(image: Node.js debugger showing a breakpoint before b)
