Use the TOML stdlib in code loading#36018
Conversation
|
Does this make Pkg operations faster? Thinking of #27414 (comment) |
If you mean loading packages, then I don't think it was bottlenecked on reading the TOML files, even though it did it quite inefficiently. |
|
Any measurements on how this affects code loading speed? One thought I’d had for that was to specify what values you’d like to extract from a file and only return that data instead of constructing the whole data structure. So if you’re just looking for a single value, it could be done with no allocation at all. |
Yeah, it is possible, but the time to fully parse my pretty big v1.4 Manifest.toml file is ~300 us (including reading it from disc). Compared to the cost of actually loading a package that is negligible. I don't think this really needs to be optimized further. |
|
I was thinking about the issue of parsing the same file over and over. But maybe caching is better for that. |
I think it is good to completely avoid opening and reading through the file which a cache should help with. Also, it is simpler in the sense that you don't need a second version of the parser that does something more akin lazy parsing. |
3f9ecfd to
741cff8
Compare
base/toml_parser.jl
Outdated
| # [a.b.c.d] doesn't "define" the table [a] | ||
| # so keys can later be added to [a], therefore | ||
| # we need to keep track of what tables are | ||
| # actualyl "defined |
There was a problem hiding this comment.
| # actualyl "defined | |
| # actually defined |
base/toml_parser.jl
Outdated
| @eval macro $(Symbol("try"))(expr) | ||
| :( | ||
| v = $(esc(expr)); | ||
| v isa $ParserError && return v; | ||
| v; | ||
| ) | ||
| end |
There was a problem hiding this comment.
| @eval macro $(Symbol("try"))(expr) | |
| :( | |
| v = $(esc(expr)); | |
| v isa $ParserError && return v; | |
| v; | |
| ) | |
| end | |
| @eval macro $(:var"try")(expr) | |
| return quote | |
| v = $(esc(expr)) | |
| v isa ParserError && return v | |
| v | |
| end | |
| end |
base/toml_parser.jl
Outdated
| function recurse_dict!(l::Parser, d::Dict, dotted_keys::AbstractVector{String}, check=true)::Err{TOMLDict} | ||
| for i in 1:length(dotted_keys) | ||
| key = dotted_keys[i] | ||
| d = get!(() -> TOMLDict(), d, key) |
There was a problem hiding this comment.
| d = get!(() -> TOMLDict(), d, key) | |
| d = get!(TOMLDict, d, key) |
base/toml_parser.jl
Outdated
|
|
||
| isvalid_hex(c::Char) = isdigit(c) || ('a' <= c <= 'f') || ('A' <= c <= 'F') | ||
| isvalid_oct(c::Char) = '0' <= c <= '7' | ||
| isvalid_binary(c::Char) = '0' <= c <= '2' |
There was a problem hiding this comment.
| isvalid_binary(c::Char) = '0' <= c <= '2' | |
| isvalid_binary(c::Char) = '0' <= c <= '1' |
base/toml_parser.jl
Outdated
| contains_underscore |= read_underscore | ||
| end | ||
| if !ok_end_value(peek(l)) | ||
| error() |
base/toml_parser.jl
Outdated
| subs = take_substring(l) | ||
| # Need to pass a AbstractString to `parse` so materialize it in case it | ||
| # contains underscore. | ||
| # vvvvvvv <- this looksl like a dude flipping the bird |
There was a problem hiding this comment.
| # vvvvvvv <- this looksl like a dude flipping the bird |
| e isa Base.OverflowError && return(ParserError(ErrOverflowError)) | ||
| error("internal parser error: did not correctly discredit $(repr(s)) as an int") | ||
| end | ||
| end |
There was a problem hiding this comment.
| end | |
| return v | |
| end |
| set_marker!(l) | ||
| @try accept_two(l, isdigit) | ||
| day = parse_int(l, false) | ||
| # Verify the real range in the constructor below |
There was a problem hiding this comment.
This isn't true--as stated at the top of the file, no validity checking is done
There was a problem hiding this comment.
it is true when you have the Dates stdlib available:
julia> Date(1989, 02, 29)
ERROR: ArgumentError: Day: 29 out of range (1:28)
base/toml_parser.jl
Outdated
| if ok_end_value(peek(l)) | ||
| if (read_space = accept(l, ' ')) | ||
| if !isdigit(peek(l)) | ||
| return Date(year, month, day) |
There was a problem hiding this comment.
Do you intend for this to throw errors (whereas everything else returns them)?
There was a problem hiding this comment.
Should be a try catch here as well.
base/toml_parser.jl
Outdated
| accept_two(l, f::F) where {F} = accept_n(l, 2, f) || return(ParserError(ErrParsingDateTime)) | ||
| function parse_datetime(l)::Err{Union{DateTime, Date}} | ||
| # Year has already been eaten when we reach here | ||
| year = parse_int(l, false) |
There was a problem hiding this comment.
| year = parse_int(l, false) | |
| year = parse_int(l, false)::Int |
base/loading.jl
Outdated
| entry = first(d[name]) | ||
| if found_name | ||
| uuid = get(entry, "uuid", nothing) | ||
| return PkgId(UUID(uuid), name) |
There was a problem hiding this comment.
ERROR: MethodError: Cannot convert an object of type Nothing to an object of type UInt128
base/loading.jl
Outdated
| error("expected a single entry for $(repr(name)) in $(repr(project_file))") | ||
| end | ||
| entry = first(d[name]) | ||
| if found_name |
| elseif deps isa Dict | ||
| for (dep, uuid) in deps | ||
| if dep === name | ||
| return PkgId(UUID(uuid), name) |
There was a problem hiding this comment.
uuid might not be of the proper type if the file is malformed
base/loading.jl
Outdated
| return nothing | ||
| found_where || return nothing | ||
| found_name || return PkgId(name) | ||
| # Only here is deps was not a dict which mean we have a unique name for the dep |
There was a problem hiding this comment.
| # Only here is deps was not a dict which mean we have a unique name for the dep | |
| # Only here if deps was not a dict which mean we have a unique name for the dep |
base/toml_parser.jl
Outdated
| # In case we do not have the Dates stdlib available | ||
| # we parse DateTime into these internal structs, | ||
| # note that these do not do any argument checking | ||
| struct Date | ||
| year::Int | ||
| month::Int | ||
| day::Int | ||
| end | ||
| struct Time | ||
| hour::Int | ||
| minute::Int | ||
| second::Int | ||
| ms::Int | ||
| end | ||
| struct DateTime | ||
| date::Date | ||
| time::Time | ||
| end | ||
| DateTime(y, m, d, h, mi, s, ms) = | ||
| DateTime(Date(y,m,d), Time(h, mi, s, ms)) |
There was a problem hiding this comment.
Would be so much better if we could either use delayed definitions here (Requires.jl?) or make these a parser error (unsupported) rather than re-implementing incompatible structs with the same names.
There was a problem hiding this comment.
We can't make it a parser error (we need to support arbitrary TOML files) and we want code loading to work without Dates. So this was the best I could come up with.
|
Added the docs and tests labels since the new toml_parser.jl file will need both. Otherwise, most looks pretty straight-forward here and should be in pretty good shape to merge. |
base/loading.jl
Outdated
| p::TOML.Parser | ||
| d::Dict{String, Dict{String, Any}} | ||
| end | ||
| TOMLCache() = TOMLCache(TOML.Parser(), Dict()) |
There was a problem hiding this comment.
I guess it doesn't matter, but maybe:
| TOMLCache() = TOMLCache(TOML.Parser(), Dict()) | |
| TOMLCache() = TOMLCache(TOML.Parser(), Dict{String,Dict{String,Any}}()) |
?
|
Thanks for pointing out the better alternative. It seems to be identical to the difference between x::T = yand x = y::TI'll switch to the style you demonstrated for future commits. |
b1d6fc8 to
240a190
Compare
|
I've rebased this and changed the target branch to https://github.com/JuliaLang/julia/tree/kc/load_toml for a simpler review. |
|
@timholy If you want to do some invalidation sanity checks, you could use this branch together with JuliaLang/Pkg.jl#1984 and see how things look. |
240a190 to
47119ee
Compare
47119ee to
031448d
Compare
|
This is ready to go but it breaks Revise and I don't want to break people's workflow so I'll wait a bit with merging. It should be a fairly small change to https://github.com/timholy/Revise.jl/blob/8336b10a5e7d0e4b1bb0726cf426469c8f866d2e/src/pkgs.jl#L459-L488 @timholy. I didn't really know what the function does so I had a hard time updating it myself. |
… Base TOML parser
031448d to
c601fbf
Compare
|
Revise has been updated to deal with this PR now. |
… Base TOML parser (JuliaLang#36018)
… Base TOML parser (JuliaLang#36018)
The current way TOML files are parsed in Base during code loading has a few issues:
Therefore, it would make sense to try:
This PR starts this process by putting a newly written TOML parser into Base and re-implements code loading to take advantage of it. The parser is available as a package with proper API, docs, tests, etc (https://github.com/KristofferC/TOMLX.jl) but the only thing this PR adds is the parser part (with a more low-level API) since that is what is needed in Base and I wanted to make this PR have as little "noise" as possible.
It has a significant number of tests (https://github.com/KristofferC/TOMLX.jl/blob/master/test/readme.jl, https://github.com/KristofferC/TOMLX.jl/blob/master/test/toml_test.jl) and I have benchmarked it pretty thoroughly.
For this PR, I think it makes sense to mostly comment on the code loading parts (@StefanKarpinski for review). For the TOML parsing itself, perhaps it makes sense to focus that discussion to the https://github.com/KristofferC/TOMLX.jl/ repo?.
After this step is done, we update Pkg to start using the parser here for it's parsing.
Then we can try to finally make a TOML stdlib and have Pkg depend on that. This will remove the last bundled dependency for Pkg.
There is a TODO in this PR and that is that I right now use a global cache object to store the
filename => Dictmapping in code loading which gets empited inBase.require. This should be moved to a function local cache that gets created inBase.requireand threaded through all functions. I wanted to put up this PR anyway for people to get a chance to review things as soon as possible.One possible question is why this parser is not just the one that currently lives in the
extfolder in Pkg. The reason for this is that theext- parser (forked from https://github.com/JuliaLang/TOML.jl) was written a quite long time ago for an old spec of TOML and requires some significant updates. I was also a bit unhappy with the performance of it (it allocates scratch buffers a bit too much and makes copies of strings etc) so I felt a clean implementation might lead to a better end result.Fixes #27414.