Improve source location support #232

hsiaosiyuan0 wants to merge 15 commits into quickjs-ng:master from hsiaosiyuan0:feat/column
Conversation
…there is no os.pipe support on it. 2. Add source location support for unary expressions and update the related column tests. 3. Fix a ubsan error in find_line_num when the function does not have pc2line info; this can happen when there are only variable declarations made with `var` and without initializers.
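As a rough illustration of that third fix (a sketch under my own assumptions; `find_line_num` exists in quickjs, but this guard and the parameter names are paraphrased, not the actual patch):

```c
#include <stdint.h>
#include <stddef.h>

/* Sketch: if a function carries no pc2line info, decoding the empty/NULL
 * table would be undefined behavior (what ubsan reported); fall back to
 * the function's own definition line instead. */
static int find_line_num_sketch(const uint8_t *pc2line_buf, size_t pc2line_len,
                                int def_line_num)
{
    if (pc2line_buf == NULL || pc2line_len == 0)
        return def_line_num;
    /* ... otherwise walk the pc2line table to find the line for the pc ... */
    return def_line_num;
}
```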
I am not sure what this means, so I made a test with the TypeScript compiler and this patch. The code below:

```js
let a = '😀' + b
```

will produce this ES5 output:

```js
"use strict";
var a = '😀' + b;
//# sourceMappingURL=data:application/json;base64,eyJ2ZXJzaW9uIjozLCJmaWxlIjoiaW5wdXQuanMiLCJzb3VyY2VSb290IjoiIiwic291cmNlcyI6WyJpbnB1dC50c3giXSwibmFtZXMiOltdLCJtYXBwaW5ncyI6IjtBQUFBLElBQUksQ0FBQyxHQUFHLElBQUksR0FBRyxDQUFDLENBQUEifQ==
```

The base64-decoded sourcemap is:

```json
{"version":3,"file":"input.js","sourceRoot":"","sources":["input.tsx"],"names":[],"mappings":";AAAA,IAAI,CAAC,GAAG,IAAI,GAFG,CAAC,CAAA"}
```

Then, executing the ES5 output with this patch gives this error message:

```
ReferenceError: 'b' is not defined
    at <eval> (tests/test_language.js:2:15)
```

I then used the tool ScriptMapper to resolve the original source location. Please point out the mistakes.
---
I believe the column number is off by one in your example of …

Note how 'b' is at position 15 in the array. Arrays count from 0 but columns from 1, so the column number should be 15 + 1 = 16. 15 is correct visually because that's what people see in their editor, but it's off by one for tools/libraries that parse stack traces. It's for such tools that I added the column number in the first place. As a reference point, the column number is 16 in node too.
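To make the arithmetic concrete, here is a small illustrative program (my own sketch, not code from the patch or from V8/Node) that counts positions in UTF-16 code units, the way JS strings do: `'😀'` is one codepoint but two code units, so `b` in `var a = '😀' + b;` lands at 0-based position 15, 1-based column 16:

```c
#include <stdio.h>

/* How many UTF-16 code units a codepoint occupies: codepoints above
 * the BMP (> U+FFFF) need a surrogate pair, i.e. two code units. */
static int utf16_width(unsigned cp) { return cp > 0xFFFF ? 2 : 1; }

int main(void) {
    /* var a = '😀' + b;  -- the codepoints up to and including 'b' */
    unsigned cps[] = { 'v','a','r',' ','a',' ','=',' ','\'',
                       0x1F600 /* 😀 */, '\'', ' ', '+', ' ', 'b' };
    int n = (int)(sizeof(cps) / sizeof(cps[0]));
    int pos = 0; /* 0-based index of 'b' in UTF-16 code units */
    for (int i = 0; i < n - 1; i++)
        pos += utf16_width(cps[i]);
    printf("'b': 0-based position %d, 1-based column %d\n", pos, pos + 1);
    /* prints: 'b': 0-based position 15, 1-based column 16 */
    return 0;
}
```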
---
Something else: is there a way to make the diff for quickjs.c a little smaller? I've looked at the changes briefly and it feels like it should be possible to achieve what you're trying to do with fewer lines of code. Correct me if I'm wrong, but you're essentially trying to make the tokenizer track the column number across UTF-8 sequences, right?
---
Thank you, I understand your example: it counts columns based on UTF-16 surrogate pairs, rather than on Unicode codepoints as this patch does. What confuses me is why this is required; I think the key is which tools/libraries you're using to parse stack traces. Do you have more details about those tools, such as their names or links to code? Counting columns based on Unicode codepoints is easy to understand, I think, and it can work with the sourcemaps that come from upstream tools like transpilers, which the TypeScript example above tries to show.
Could you please give me some suggestions as a starting point? Some details about this patch:

```c
c = unicode_from_utf8(p - 1, UTF8_CHAR_LEN_MAX, &p_next);
s->over_cols += (intptr_t)p_next - (intptr_t)(p - 1) - 1;
```

It computes the surplus bytes of a multi-byte codepoint. Then:

```c
loc = LOC(s->token.line_num, s->token.col_num);
```

records the location of the leading token of an expression or statement, since the bytecode stream is not always produced in source-code order.

```c
s->loc = loc;
// ...

static void emit_op(JSParseState *s, uint8_t val)
{
    JSFunctionDef *fd = s->cur_func;
    DynBuf *bc = &fd->byte_code;

    if (s->loc != 0) {
        dbuf_putc(bc, OP_source_loc);
        dbuf_put_u64(bc, s->loc);
        s->loc = 0;
    }

    fd->last_opcode_pos = bc->size;
    dbuf_putc(bc, val);
}
```

Manually set `s->loc` before emitting where needed.
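For reference, one plausible way to pack the 64-bit value that `dbuf_put_u64` writes above; the actual `LOC()` macro in the patch may differ, so treat this as an assumption:

```c
#include <stdint.h>

/* Hypothetical layout: line number in the high 32 bits, column in the
 * low 32 bits, so a single u64 can ride along in the bytecode stream. */
#define LOC_PACK(line, col) (((uint64_t)(uint32_t)(line) << 32) | (uint32_t)(col))
#define LOC_LINE(loc)       ((uint32_t)((loc) >> 32))
#define LOC_COL(loc)        ((uint32_t)(loc))
```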
---
I was just about to step away from my PC so I'm afraid I don't have time to go into great detail right now, but what I think you need to do is locate the few places where col_num is actually assigned (looks like …). You can keep it simple if you want and simply scan from …
Thank you for this reference; I confirmed it on d8. That's because V8 uses a UTF-16 stream for parsing, I guess, whereas qjs uses UTF-8 sequences.

Maybe that isn't viable, since every token would be required to scan from the line-beginning offset to get its column; it may be worse if the JS source is a multi-megabyte minified bundle.
---
The reason I'm not overly worried about performance is that V8 does something similar, and if performance-wise that's good enough for them, it sure as heck is good enough for us. :-) What V8 does is this: …

I'm reasonably sure they picked that approach because it's memory-efficient while not being super expensive CPU-wise. It's just a linear scan and does very little per byte processed (unlike the tokenizer, which does a lot of work per byte).
---
I took your tips and looked into V8's source code, then found the information below; please correct me if I'm wrong. The location used in V8's parsing stage is not the … For reporting source code locations, V8 introduces an algorithm that transforms the location into an eye-friendly one, and running that algorithm is delayed until the source code location is actually needed, such as when building stack traces. The V8 algorithm has these benefits: …

Back to qjs: …

So add a new field …
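If I read it right, the lazy part can be sketched like this (my illustration, not V8's or quickjs's actual code): during parsing only byte offsets are stored; when a stack trace is built, each offset is mapped to line/column using a table of line-start offsets built by one linear scan, plus a binary search per lookup:

```c
#include <stddef.h>

/* Resolve a byte offset to a 1-based (line, column) pair.
 * line_starts[i] is the byte offset where line i begins (line 0 first);
 * building it is a single linear scan for '\n' over the source.
 * Assumes nlines >= 1. */
static void resolve_loc(const size_t *line_starts, size_t nlines,
                        size_t offset, int *line, int *col)
{
    size_t lo = 0, hi = nlines - 1;
    while (lo < hi) { /* binary search: last line start <= offset */
        size_t mid = (lo + hi + 1) / 2;
        if (line_starts[mid] <= offset)
            lo = mid;
        else
            hi = mid - 1;
    }
    *line = (int)lo + 1;
    *col  = (int)(offset - line_starts[lo]) + 1;
}
```

Note the column here counts bytes from the line start; the UTF-16/codepoint correction discussed above would still apply on top.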
---
Yes, you've pretty much summarized it. I was initially tempted to copy V8's approach but it's more complex and it would have made the pull request even bigger than it already was. I suggested doing a second simple forward scan in #232 (comment). To keep the overhead linear, cache the result from the previous scan so you don't start from the beginning every time. In other words, count the non-ASCII characters since the last token and add that count to the current token's column number.
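A minimal sketch of that correction pass, under my own assumptions (the branch's actual implementation may differ): after a token has been scanned, walk its raw bytes once more and compute the column delta in UTF-16 code units:

```c
#include <stdint.h>

/* Column delta, in UTF-16 code units, for the UTF-8 bytes in [p, end).
 * Continuation bytes (10xxxxxx) add nothing; a 4-byte lead (0xF0..)
 * encodes a codepoint above the BMP, which is a surrogate pair in
 * UTF-16 and therefore counts as two columns. */
static int utf16_col_delta(const uint8_t *p, const uint8_t *end)
{
    int cols = 0;
    for (; p < end; p++) {
        if ((*p & 0xC0) != 0x80)          /* lead byte (or plain ASCII) */
            cols += (*p >= 0xF0) ? 2 : 1; /* 4-byte seq => 2 code units */
    }
    return cols;
}
```

The caller would keep the `end` of the previous scan and pass it as the next `p`, which is what keeps the total work linear.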
---
For the scan-twice strategy, I had done it in my fork before this CL; to compare the performance difference between the two, I moved it into the branch feat/column-twice-scan, where the regression tests for this CL also pass. The benchmark uses this CL and the scan-twice version to execute test.js, which is a big JS file consisting of famous JS frameworks (acorn does the same benchmark). For this CL:

```
# hyperfine --runs 20 './build/qjs test.js'
Benchmark 1: ./build/qjs test.js
  Time (mean ± σ):     286.8 ms ±   8.5 ms    [User: 272.8 ms, System: 12.0 ms]
  Range (min … max):   272.5 ms … 311.2 ms    20 runs
```

For the scan-twice version:

```
# hyperfine --runs 20 './build/qjs test.js'
Benchmark 1: ./build/qjs test.js
  Time (mean ± σ):     309.4 ms ±   8.3 ms    [User: 294.9 ms, System: 12.2 ms]
  Range (min … max):   299.3 ms … 333.5 ms    20 runs
```

Scan-twice reduces the amount of changes, but loses ≈7% mean performance.
---
7% on a 4 MB file is acceptable to me. That's negligible or downright immeasurable on most real-world inputs.
---
Sounds pretty good indeed!
---
Switched to the scan-twice strategy; please review.
---
This branch improves the issue I reported in #236. Input:

```js
function r(){return e()}function e(){throw new Error("bad")}r()
// ↑↑ ↑ ↑ ↑↑
// QuickJS |22 | 48 |62
// Node.js 21 44 61
```

(The collapsed quickjs and node stack-trace outputs are omitted here.)

One thing that seems off to me is the …
---
For …
bnoordhuis left a comment:
It's not clear to me why the source location has to be recorded in so many places. It's supposed to sort of fall out naturally inside the tokenizer.
My suggestion of scanning twice should be a fairly minimal change to the tokenizer: once the token has been scanned, scan the bytes again to correct for multi-byte sequences.
---
For the code below:

```js
let b = 1;
let c = 2;
let a = b + c;
```

this CL will produce the bytecode stream:

```
check_define_var b,128
check_define_var c,128
check_define_var a,128
define_var b,130
define_var c,130
define_var a,130
push_i32 1
source_loc 1:9
put_var_init b
push_i32 2
source_loc 2:9
put_var_init c
source_loc 3:9
get_var b
source_loc 3:13
get_var c
add
source_loc 3:9
put_var_init a
get_loc 0: "<ret>"
return
```

```c
static __exception int js_parse_var(JSParseState *s, int parse_flags, int tok,
                                    BOOL export_flag)
{
    // ...
    s->loc = loc; // <-- here
    emit_op(s, (tok == TOK_CONST || tok == TOK_LET) ?
            OP_scope_put_var_init : OP_scope_put_var);
    // ...
}
```

The reason is that when emitting …
---


Hi, this PR tries to improve the source location support.

Source location correctness is useful, I think, not only for better error reports but also for a later debugger to support breakpoints, so I made the changes below; requesting review.
Unicode

Handle Unicode characters included in strings, identifiers, comments and regexp literals.

The current token stream is: …

The loc of `+` is miscalculated; the correct one is 1:13.

Leading spaces
The source location is miscalculated if the line starts with spaces.

The current token stream is: …

The loc of `let` should be 1:3 to comply with editors' behavior.

Regression tests
Some tests were added under the `tests` directory to cover the changes in this PR. The tests are based on:

- two new options, `DUMP_QJS_TOKEN` and `DUMP_QJS_BYTECODE`, to tell qjs to print the token streams and bytecode streams to stdout
- the test runner `run-test.js`, which then asserts that the expected snapshots are present in stdout

Algorithm
Things start from `next_token`: `redo` starts to tokenize a new token, so `s->col_num` must already be correctly computed and ready to assign; that is done by the previous `next_token` call.

Suppose we are in that previous `next_token` call: we need to advance the column number correctly for the next token. For example, `next_token` is called to tokenize `"😀"`:

- Its start column `s->col_num` was computed by the previous `next_token`, so a simple assignment is performed at point 1.
- At point 2, `s->token.ptr` records the start byte offset for the `"😀"` token.
- The process then runs into `js_parse_string` to parse the string literal: if the string is well formed, the process reaches the `break` at point 1, which leads to the end of `next_token`.
- There, the `advance_col` macro performs the column calculation: the first part, `(intptr_t)cur_ofst - (intptr_t)last_ofst`, computes the byte delta. That cannot serve as the column delta if there are multi-byte characters in it, so `s->over_cols` was introduced to make an adjustment.
- `s->over_cols` is computed like below (see the sketch after this list): reset `s->over_cols` to zero when a new line starts; otherwise use `(intptr_t)p_next - (intptr_t)p` to compute the number of bytes of the Unicode codepoint read into `c`, and subtract 1 since the first byte is already included in the byte delta.
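A minimal self-contained sketch of this bookkeeping; the names mirror the description above, but this is an illustration, not the patch's actual code:

```c
#include <stdint.h>

typedef struct {
    int col_num;   /* column of the current token (1-based)           */
    int over_cols; /* surplus bytes consumed by multi-byte codepoints */
} ColState;

/* Called whenever a codepoint spanning `nbytes` bytes is read:
 * accumulate the bytes beyond the first, since the byte delta
 * already accounts for one byte per codepoint. */
static void note_codepoint(ColState *s, int nbytes)
{
    s->over_cols += nbytes - 1;
}

/* Called at the end of a token: advance the column by the byte delta,
 * corrected by over_cols so each codepoint counts as one column. */
static void advance_col(ColState *s, const uint8_t *last_ofst,
                        const uint8_t *cur_ofst)
{
    s->col_num += (int)(cur_ofst - last_ofst) - s->over_cols;
    s->over_cols = 0;
}
```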