[ATfE] Support command line options in LLVM libc #705

Merged
voltur01 merged 7 commits into arm:arm-software from voltur01:add_libc_cmd_line_opts
Feb 10, 2026

@voltur01
Contributor

Add support in LLVM libc startup code to get command line options (argc/argv) from semihosting.

Provide an empty handler that can be overridden for no-host environment.

Semihosting passes the command line options as one string, so basic logic to split it into individual arguments is implemented:

  • Dummy program name is provided as argv[0].
  • Arguments are split by whitespace.
  • Quoted text is copied as-is: "a b c " or 'a b c ' will keep all spaces. An unclosed quote runs to the end of the provided line.
  • Escape sequences: \\, \', \" and \ followed by a space produce \, ', " and a space respectively.

@voltur01 voltur01 requested a review from a team as a code owner January 30, 2026 15:00
// - Dummy program name is provided as argv[0].
// - Arguments are split by whitespace.
// - Quoted text is copied as-is: "a b c " or 'a b c ' will keep all spaces.
// Not closed quote will run till the end of the provided line.
Contributor

This is not just a code comment, it's a specification for the syntax expected by the program you end up compiling. Someone who wants to run the program with a particular complicated argv list will need to understand these rules in order to unparse their intended argument list so that it will be parsed into the right thing. So it should live in the documentation, not just here.

Also, I think it's worth considering these rules at least a little bit, because changing them later will break existing command lines. Why these rules? They don't exactly match any existing scheme:

  • POSIX sh uses both single and double quotes, but treats them differently (e.g. \ inside single quotes isn't special, whereas inside double quotes it is)
  • Windows (really MSVC crt0) command line splitting doesn't treat ' as a quote character at all, and treats \\ as an escaped literal \ only in some circumstances

Contributor Author

Good point, thanks! I will document in llvmlibc.md when we conclude the logic.

I thought about shell-like behavior; however, I had a really difficult time trying to pass anything like that through QEMU: most of the special symbols are handled by the shell that invokes QEMU before they reach QEMU or the emulated program.

So I decided that a better reference would be existing embedded libraries:

So Arm libc was the bar for the proposed implementation.

Happy to consider if there is any other good reference or self-consistent model of what it makes sense to handle.

Contributor Author

I documented the behavior, which comes down to three simple statements: arguments are split by whitespace; if you need a space, use quotes; if you need a quotation mark, escape it.

Contributor

I think if we're going to support two quote characters, they ought to do something different, to justify having both of them at all. Each one imposes inconvenience on the user, by requiring them to quote it if they want to use it literally, so each one ought to provide some benefit in return for that inconvenience.

How about taking one step further towards POSIX sh syntax, by saying that \ is not special inside single quotes, although it is inside double quotes? That means you can use single quotes for a situation where you have a lot of backslashes and don't want to have to escape them all:

myprogram "c:\\foo\\bar\\baz\\quux"      # ugh, what a pain
myprogram 'c:\foo\bar\baz\quux'          # ah, much nicer

This means that if you want a literal single quote inside your single-quoted word it's a bit awkward: you have to close the single quotes, then write a \', then reopen them:

myprogram 'here'\''s a single quote'

But POSIX sh users know this already and it will be familiar to them. And Windows users will probably prefer to use " in any case, and then ' inside that doesn't need escaping at all.
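The single-quote idiom described in this comment can also be applied in the other direction, to "unparse" an argument before it is passed to the program. The following is a minimal host-side sketch, assuming the splitter follows the POSIX-like rule proposed here (backslash is not special inside single quotes). The helper name `quote_arg` is hypothetical and is not part of this PR.

```cpp
#include <string>

// Hypothetical helper (not from the PR): wraps one argument in single quotes
// so that a splitter following the proposed POSIX-like rules recovers it
// unchanged. Only the single quote itself needs special treatment.
std::string quote_arg(const std::string &arg) {
  std::string out = "'";
  for (char c : arg) {
    if (c == '\'')
      out += "'\\''"; // close the quotes, emit an escaped ', reopen the quotes
    else
      out += c;       // everything else, including backslashes, is literal
  }
  out += "'";
  return out;
}
```

With this sketch, `quote_arg("here's a single quote")` produces `'here'\''s a single quote'`, the same form as the example in the comment, and a Windows-style path such as `c:\foo\bar` only gains the surrounding quotes.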

Contributor Author

Thank you for the idea. I removed the estimation function and simplified the escaping logic: single quotes now do not treat \ as a special escape symbol, and in general they support escaping usual shell symbols like $ and ?. However, this is not very useful, since semihosting or embedded targets are not supposed to interpret them anyway.

return 0;

// Provide a dummy program name
argv[0] = "program";
Contributor

ACfE's handling of the string returned from SYS_GET_CMDLINE considers it to include the program name, but here you're considering it not to. So running semihosting binaries compiled by both toolchains, you'd have to pass them different command line strings.

The specification of SYS_GET_CMDLINE is vague about the meaning of the command line string. Nonetheless, given that there's an existing convention, wouldn't it be better to stick to it?

(Also, argv is supposed to contain a list of writable strings, so it's not advisable to put a string literal directly into it.)

Contributor Author

Hm, looks like there is a difference in behavior between ACfE and picolibc.

I would prefer not to have the dummy name, indeed; however, that means that by default (without any args passed to QEMU) argc will always be 0, which is different from hosted behavior where the program name is always (?) available.

If this seems a better choice, happy to remove the dummy name.

Contributor Author

I removed the dummy program name.

exit(main(0, 0));

char cmdline[256];
const char *argv[8];
Contributor

If we're about to call main, then that presumably means the rest of libc has already been set up, including the heap? So we could malloc the space for the command line, and then malloc the space for argv too – perhaps by walking along the command line string twice, first deciding how many argument pointers you need, then generating them on the second pass.

That would avoid at least one of these arbitrary limits (I think 7 arguments other than the command name is too few even for a 256-byte limit on the command line itself – I was dealing with a benchmark program just this week which has a lot of short command line options). It also shifts large blocks of memory from the stack to the heap, which I think is preferable, since stacks are often small.

Contributor Author

Good point, let me see if we always have the heap, if yes, then it may be a better strategy, agreed.

Contributor

Same point, but for helper functions like is_space - could we use the llvmlibc ones at this point? And maybe strtok?

I agree with using malloc for the arguments array - we've been bitten by a fixed length buffer used for this exact purpose.
We should probably consider error handling - exit with a nice error message if the semihosting or malloc fails? Otherwise the program runs not as the user intended...

Contributor Author

I added use of malloc() for memory. Unfortunately, I cannot find any API to get the actual size of the command line from semihosting, so I kept one hardcoded buffer size.

I replaced local is_space() with C standard one, however strtok() seems to be less compelling because of the need to handle quotations and escape sequences.

// Found start of an argument
++argc;

// Skip non-whitespace until next whitespace separator
Contributor

Surely this will get the wrong answer, because it's not paying attention to whether the whitespace is in quotes? I don't really understand why this new piece of code is separate from parse_cmdline_buf at all, let alone separate and completely different.

Wouldn't it be better to use the same code for both passes, with writing of output conditionalized? Then you could trust them to agree with each other.

Contributor Author

Yes, this function overestimates by using very simple logic that does not track quotes.

The problem with parse_cmdline_buf is that it modifies cmdline in place to create null-terminated arguments, thus if run a second time over the same cmdline buffer it would stop after the first, now null-terminated, argument.

So there are a few trade-offs here between how complex the parsing logic is, using 2x memory for the buffer to avoid updating in place, and using a bit more memory for argv. I guess most of the time quotes will not be used, e.g. in testing.
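To make the overestimation concrete, here is a minimal sketch of a quote-unaware counter. It is not the estimation function from the PR (which was later removed); `estimate_argc` is a hypothetical name, and the sketch simply counts runs of non-whitespace characters.

```cpp
#include <cassert>
#include <cctype>

// Hypothetical estimator: counts runs of non-whitespace characters and
// ignores quotes and escapes entirely. Quoted or escaped whitespace splits
// one real argument into several runs, so the result can be too high.
int estimate_argc(const char *s) {
  assert(s != nullptr);
  int count = 0;
  bool in_token = false;
  for (; *s != '\0'; ++s) {
    const bool space = std::isspace(static_cast<unsigned char>(*s)) != 0;
    if (!space && !in_token)
      ++count; // first character of a new run
    in_token = !space;
  }
  return count;
}
```

For the command line `a "b c" d` this returns 4, whereas a quote-aware parser would find 3 arguments. Since every real argument starts at an unquoted non-whitespace character, the estimate should never be too low, which is what makes it usable for sizing argv at the cost of some unused slots.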

Contributor Author

Let me give a try at implementing single quotes, otherwise stick with only one type of quotes.

Contributor

Yes, by "writing of output conditionalized" I meant conditionalizing both writing to the argv array and overwriting the output string.

Although another possibility is that pass 1 processes the output string in place to leave the arguments dequoted and packed together at the start of the string separated by '\0' (which requires no allocation because it guarantees to use at most as much space as the original string did), and it returns the total number of arguments it found. Then pass 2 just has to find that many NUL-terminated strings and write pointers to them into the argv array. Then the two passes split the work between them, without having to duplicate the same analysis as each other.
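The second pass described here is small enough to sketch directly. The function name `fill_argv` is hypothetical, and the sketch assumes pass 1 has already left `argc` NUL-terminated strings back to back at the start of the buffer and that `argv` has room for `argc + 1` pointers.

```cpp
#include <cassert>
#include <cstring>

// Sketch of "pass 2" (hypothetical name): `packed` holds `argc` strings, each
// terminated by '\0', stored one after another. Writes a pointer to each
// string into argv, followed by the terminating nullptr.
void fill_argv(char *packed, int argc, char **argv) {
  assert(packed != nullptr && argv != nullptr && argc >= 0);
  char *p = packed;
  for (int i = 0; i < argc; ++i) {
    argv[i] = p;
    p += std::strlen(p) + 1; // skip this string and its terminator
  }
  argv[argc] = nullptr;
}
```

No quote or escape handling is needed at this stage, because all of that analysis has already happened in pass 1.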

statham-arm previously approved these changes Feb 5, 2026
Contributor

@statham-arm statham-arm left a comment

LGTM. I've made two further suggestions, but I don't think they affect correctness, so I'll leave it up to you to decide whether you like them.

return -1;

int argc = 0;
char *p = buf;
Contributor

Style suggestion:

Suggested change
- char *p = buf;
+ const char *p = buf;
+ char *w = buf;

If you make p a const char *, then the compiler would catch accidental writes through what's supposed to be the read pointer. That means you can't initialize w straight from p in the loop handling each argument, so instead you could initialize it to the start of buf here.

(You would also have to increment w when writing the trailing NUL on line 181; assign argv[argc] = w rather than = p; and skip_spaces would now take a const char *&.)

This way, you end up with the arguments packed at the start of the buffer, instead of having dead space between them. But that makes no difference to what the strings are.

(This is also the basis of my suggestion yesterday about not even doing the parsing twice. If you did this and didn't pass argv to this function at all, then you could have a much simpler pass 2 that just looks for '\0' and sets the argv pointers.)

return argc;
}

return argc + 1; // Extra slot for the terminating nullptr in argv
Contributor

Style suggestion: changing the semantics of the return value depending on which pass you're in is confusing! It even confused me when I pasted this function into a quick standalone test program.

Perhaps it would be better to just return argc unconditionally, and move the +1 to where the malloc happens.
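One way this could look, as a sketch only: the parser returns the plain argument count, and the extra slot is added where the array is allocated. The helper name `alloc_argv` is hypothetical and does not appear in the PR.

```cpp
#include <cassert>
#include <cstdlib>

// Hypothetical helper: allocates argv for `argc` arguments plus the
// terminating nullptr. The "+ 1" lives here, at the allocation site, so the
// argument count keeps a single meaning everywhere else.
char **alloc_argv(int argc) {
  assert(argc >= 0);
  char **argv = static_cast<char **>(
      std::malloc((static_cast<std::size_t>(argc) + 1) * sizeof(char *)));
  if (argv != nullptr)
    argv[argc] = nullptr;
  return argv; // nullptr if malloc failed; the caller should report that
}
```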

Add support in LLVM libc startup code to get command line options (argc/argv) from semihosting.

Provide an empty handler that can be overridden for no-host environment.

Semihosting passes the command line options as one string, thus basic logic to split it into individual arguments implemented:
- Dummy program name is provided as argv[0].
- Arguments are split by whitespace.
- Quoted text is copied as-is: "a b c " or 'a b c ' will keep all spaces.
  Not closed quote will run till the end of the provided line.
- Escape sequences: \\, \' , \" and \  to put \, ', " and space respectively.
This needs a separate follow-up commit to move the list of functions into a separate section and explain each of them.
- Docs added
- Dummy program name removed
- malloc() used for buffers
- error reporting through the debug printing added
- local is_space() replaced with C lib isspace()
- Uses of NULL replaced with nullptr
@voltur01 voltur01 force-pushed the add_libc_cmd_line_opts branch from 2a3ecf0 to 327c46c Compare February 9, 2026 14:30
Contributor

@statham-arm statham-arm left a comment

I like the revised organisation. But then, I would say that 🙂

@voltur01
Contributor Author

Thank you for the reviews and good ideas to shape it!

@voltur01 voltur01 merged commit 75faad2 into arm:arm-software Feb 10, 2026
2 checks passed
voltur01 added a commit to voltur01/arm-toolchain that referenced this pull request Feb 20, 2026
Add support in LLVM libc startup code to get command line options
(argc/argv) from semihosting.

Provide an empty handler that can be overridden for no-host environment.

Semihosting passes the command line options as one string, thus basic
logic to split it into individual arguments implemented:
- Arguments are split by whitespace.
- Quoted text is copied as-is: "a b c " or 'a b c ' will keep all
spaces. Not closed quote will run till the end of the provided line.
- Escape sequences: \ escapes (copies) the next symbol, unless
  inside single quotes or at the end of the string.
voltur01 added a commit that referenced this pull request Feb 20, 2026
This cherry-picks changes relevant to llvmlibc-support from current
arm-software into the 22.x release branch:

12a4f2c [ATfE] Document _platform_debug_putc() (#723)
75faad2 [ATfE] Support command line options in LLVM libc (#705)
2ae2b6c [ATfE] Fix libc v7-R no-fpregs attribute build failure
(#720)
bc8a58b [ATfE] Provide LLVM libc handlers for hardware and runtime
setup (#706)
852e751 [ATfE] Provide debug output handler in LLVM libc crt0
9be5d72 (origin/add_unhosted_exit, add_unhosted_exit) Simplified
implementation of exit to avoid issues with privileged assembly
instructions
983f37c [ATfE] Provide nohost init and exit in llvmlibc startup
code
6859f67 [ATfE] Replace call to abort with __llvm_libc_exit in libc
startup code (#678)
a284205 (origin/update_exit_comment, update_exit_comment) [ATfE]
Update comment about handling cleanup for exit()
fb865b8 (origin/remove_abort_redefinition,
remove_abort_redefinition) [ATfE] Replace call to abort with
__llvm_libc_exit in libc startup code
28c43df (origin/use_sys_readc_for_stdin) Merge branch
'arm-software' into use_sys_readc_for_stdin
53e062f (use_sys_readc_for_stdin) Added a comment with rationale
and TODO
bbf19b6 (origin/add_semihosting_abort, add_semihosting_abort) Mark
internal semihosting function as static
52d1d90 Provide semihosting_call_exit() and add TODO for exit()
99c608d Replace exit() with direct call to __llvm_libc_exit()
07fe14b [ATfE] Use semihosting SYS_READC for stdin with llvm libc
ab08655 [ATfE] Redirect libc abort() to semihosting exit()