Skip to content

Replace --log-errors-to-tty with --report-errors-to-stderr#3428

Merged
sporksmith merged 7 commits intoshadow:mainfrom
sporksmith:errs-stderr
Oct 23, 2024
Merged

Replace --log-errors-to-tty with --report-errors-to-stderr#3428
sporksmith merged 7 commits intoshadow:mainfrom
sporksmith:errs-stderr

Conversation

@sporksmith
Copy link
Copy Markdown
Contributor

@sporksmith sporksmith commented Oct 22, 2024

Trying to detect whether stdout and stderr are terminals and deduplicating based on that can be tricky, e.g. in CI environments..

Instead:

  • Make the stderr version more of a human-consumable message instead of the raw log line, which also makes the duplication easier to understand in the case where stdout and stderr are merged.
  • Disable this feature in our cmake tests, where stdout and and stderr are merged.

New output on a process failure in the unmerged case (where stdout is written to a file and stderr is the terminal):

** Starting Shadow 3.2.0
Error: process 'testnode.test_sleep.1000' exited with status Normal(101); expected end state was exited: 0 but was exited: 101
Error: Failed to run the simulation
Error:
Error: Caused by:
Error:     1 managed processes in unexpected final state
** Shadow did not complete successfully: Failed to run the simulation
**   1 managed processes in unexpected final state
** See the log for details

The merged case is a little ugly, but at least it's harder to mistake the duplicated message on stderr for a second distinct log line:

...
00:00:00.016603 [507267:shadow-worker] 00:00:01.000000000 [DEBUG] [testnode:11.0.0.1] [process.rs:1486] [shadow_rs::host::process] process 'testnode.test_sleep.1000' has completed or is otherwise no longer running
Error: process 'testnode.test_sleep.1000' exited with status Normal(101); expected end state was exited: 0 but was exited: 101
00:00:00.016660 [507267:shadow-worker] 00:00:01.000000000 [ERROR] [testnode:11.0.0.1] [process.rs:1571] [shadow_rs::host::process] process 'testnode.test_sleep.1000' exited with status Normal(101); expected end state was exited: 0 but was exited: 101
00:00:00.016710 [507267:shadow-worker] 00:00:01.000000000 [TRACE] [testnode:11.0.0.1] [process.rs:750] [shadow_rs::host::process] Not notifying parent of exit: no signal specified
...

Progress on #3426.

@sporksmith sporksmith self-assigned this Oct 22, 2024
@github-actions github-actions bot added Component: Testing Unit and integration tests and frameworks Component: Main Composing the core Shadow executable Component: Build Build/install tools and dependencies labels Oct 22, 2024
@github-actions github-actions bot added the Component: Documentation In-repository documentation, under docs/ label Oct 22, 2024
@sporksmith sporksmith changed the title Errs stderr Don't try to deduplicate across stdout and stderr Oct 22, 2024
@sporksmith sporksmith changed the title Don't try to deduplicate across stdout and stderr Replace --log-errors-to-tty with --report-errors-to-stderr Oct 22, 2024
@sporksmith sporksmith marked this pull request as ready for review October 22, 2024 15:34
@sporksmith
Copy link
Copy Markdown
Contributor Author

cc @robgjansen to join in on the UX bikeshedding ;)

@stevenengler
Copy link
Copy Markdown
Contributor

stevenengler commented Oct 22, 2024

My primary shadow use case is running it in the terminal without any file redirection, so I think the duplicated error messages aren't the greatest, but the new message formatting looks good and I think makes that much better. My main concern is that they'll be interleaved at a non-line boundary. Since stderr isn't buffered (and even the buffered stdout is unlikely to be written in a single write syscall), I don't see why we wouldn't occasionally get things like:

00:00:00.016603 [507267:shadow-worker] 00:00:01.000000000 [DEBUG] [testnode:11.0.0.1] [process.rs:1486] [shadow_rs::host::process] process 'testnode.test_sleep.1000' has completed or is otherwise no longer running
Error: process 'testnode.test_sleep.1000' exited with status 00:00:00.016660 [507267:shadow-worker] 00:00:01.000000000 [ERROR] [testnode:11.0.0.1] [shadow_rs::host::process] process 'testnode.test_sleep.1000' exited with status Normal(101); expected end state was exited: 0 but was exited: 101
00:00:00.016710 [507267:shadow-worker] 00:00:01.000000000 [TRACE] [testnode:11.0.0.1] [process.rs:750] [shadow_rs::host::process] Not notifying parent of exit: no signal specified
Normal(101); expected end state was exited: 0 but was exited: 101

And since stdout is buffered and delayed, we might get:

00:00:00.016603 [507267:shadow-worker] 00:00:01.000000000 [DEBUG] [testnode:11.0.0.1] [process.rs:1486] [shadow_rs::host::process] process 'testnode.test_sleep.1000' has completed or is otherwise no longer running
Error: process 'testnode.test_sleep.1000' exited with status Normal(101); expected end state was exited: 0 but was exited: 101
00:00:00.016603 [507267:shadow-worker] 00:00:01.000000000 [DEBUG] [testnode:11.0.0.1] [process.rs:1486] [...]
00:00:00.016604 [507267:shadow-worker] 00:00:01.000000000 [DEBUG] [testnode:11.0.0.1] [process.rs:1486] [...]
00:00:00.016605 [507267:shadow-worker] 00:00:01.000000000 [DEBUG] [testnode:11.0.0.1] [process.rs:1486] [...]
00:00:00.016606 [507267:shadow-worker] 00:00:01.000000000 [DEBUG] [testnode:11.0.0.1] [process.rs:1486] [...]
00:00:00.016607 [507267:shadow-worker] 00:00:01.000000000 [DEBUG] [testnode:11.0.0.1] [process.rs:1486] [...]
[...]
00:00:01.046660 [507267:shadow-worker] 00:00:01.000000000 [ERROR] [testnode:11.0.0.1] [process.rs:1571] [shadow_rs::host::process] process 'testnode.test_sleep.1000' exited with status Normal(101); expected end state was exited: 0 but was exited: 101
00:00:01.046710 [507267:shadow-worker] 00:00:01.000000000 [TRACE] [testnode:11.0.0.1] [process.rs:750] [shadow_rs::host::process] Not notifying parent of exit: no signal specified

which makes it seem like the error occurs much earlier than it did relative to the logs.

@sporksmith
Copy link
Copy Markdown
Contributor Author

My primary shadow use case is running it in the terminal without any file redirection

Huh; I almost never do this. If this is a common use-case for others though, another idea is to turn this config option into a ternary of true/false/auto, where the auto option is the default and does either the old terminal-detection behavior or uses fstat to try to detect if stderr and stdout is the same destination. My inclination is to stick with my current simple and predictable approach, but I'm willing to be outvoted.

Since stderr isn't buffered, I don't see why we wouldn't occasionally get things like:

I'll think about this a bit and see if I can make it less likely with a bit more explicit flushing on error messages....

@sporksmith
Copy link
Copy Markdown
Contributor Author

Since stderr isn't buffered, I don't see why we wouldn't occasionally get things like:

I added a stdout flush before we write to stderr. Also an explicit stderr flush before we write to stdout again, though there was already an implicit one.

I think this addresses the mid-line interleaving.

I also moved the stdout write to always be before the stderr write, so we won't be notified about an error on stderr before it was actually logged to stdout.

@robgjansen
Copy link
Copy Markdown
Member

OK you summoned me, so here is my take :)

My main use case is to always redirect stdout to a log file, because I don't want to lose that information to a terminal buffer. I want that log file to contain all log messages. I think this is how things work now and after this PR.

Then, I can see how it would be useful to duplicate errors to stderr, so that those are not missed by users. And I could see how that would be useful even if running in a terminal and logging stdout to a file. IIUC, when running in a terminal, those errors would not currently be printed to stderr, but after this PR they will be. So I think this PR is an improvement.

But for Steve's use case, errors injected mid-line into stdout sure would be annoying.

Maybe a good situation would be to have two log streams. The user could set the log level independently on each, and could set the destination (file, stderr, stdout) independently on each. With that, you could set one stream as the small firehose: a sort of progress indicator about how the simulation is going (similar to Tor's notice log level), and the other stream as the big firehose: containing all messages. Or you could configure them as you wish, and print them where you wish. That's a bit more work though. Also, I don't think necessarily solves Steve's problem of wanting the failed processes printed out again at the end of the simulation, does it?

@sporksmith
Copy link
Copy Markdown
Contributor Author

My main use case is to always redirect stdout to a log file, because I don't want to lose that information to a terminal buffer. I want that log file to contain all log messages.

Same

I think this is how things work now and after this PR.

Right

Then, I can see how it would be useful to duplicate errors to stderr, so that those are not missed by users. And I could see how that would be useful even if running in a terminal and logging stdout to a file. IIUC, when running in a terminal, those errors would not currently be printed to stderr, but after this PR they will be. So I think this PR is an improvement.

Prior to this PR what we tried to do was duplicate error-level logs to stderr if stdout is not a terminal, but stderr is a terminal. i.e. when stdout is being logged to a file, and stderr wasn't, we write to stderr.

But terminal detection is tricky, especially in CI web pseudo-terminal situations. In the Tor CI I think we end up deciding that neither stdout nor stderr is a terminal, so we don't mirror errors to stderr, so we don't get to see the errors in the main job output.

But for Steve's use case, errors injected mid-line into stdout sure would be annoying.

I think I've ensured now that there won't be mid-line interleaving, but yes the merged output might still be a bit ugly/annoying. It can still be explicitly disabled though with --report-errors-to-stderr=false. (Conversely before this PR there was no way to force the write to stderr in the case where shadow decides that it isn't a terminal)

Maybe a good situation would be to have two log streams. The user could set the log level independently on each, and could set the destination (file, stderr, stdout) independently on each. With that, you could set one stream as the small firehose: a sort of progress indicator about how the simulation is going (similar to Tor's notice log level), and the other stream as the big firehose: containing all messages. Or you could configure them as you wish, and print them where you wish. That's a bit more work though.

I could see this being potentially useful, but it implies using the full logging format for both outputs, which reintroduces the issue of full-on-duplication in the case where stdout and stderr are merged. (vs. this MR changes the stderr version to be more of a human-readable message than a "log line")

Also, I don't think necessarily solves Steve's problem of wanting the failed processes printed out again at the end of the simulation, does it?

Oh good point, I missed that #3426 wanted both for them to be printed to stderr and repeated in the summary at the end. This PR only addresses the first part of printing them to stderr. I think I'd rather keep this PR to making that happen; if we decide to also print them again at the end that could be a separate PR.

@sporksmith
Copy link
Copy Markdown
Contributor Author

and could set the destination (file, stderr, stdout) independently on each.

Oh, I missed this part. So I guess that would probably ensure that we weren't accidentally writing the same log line multiple times to the same destination. But I think it'd be more difficult to have a nice default behavior for the use-case of redirecting stdout to a file but not stderr; the user would have to explicitly configure the stderr logger to get their summary. (Or else we go back to trying to be clever and figure out the right thing dynamically)

@robgjansen
Copy link
Copy Markdown
Member

robgjansen commented Oct 22, 2024

But I think it'd be more difficult to have a nice default behavior for the use-case of redirecting stdout to a file but not stderr; the user would have to explicitly configure the stderr logger to get their summary.

I think the default would be:

Stream 1:

  • log-level=debug
  • destination=file

Stream 2:

  • log-level=info (or maybe just warn, but in both cases to include ERROR and WARN messages)
  • destination=stderr

But in my world the log messages are formatted as log lines in both of the above, which is arguably more useful than the "simpler" messages with the ip address, process, filename:line, etc. prefix removed as we do for the stderr messages now.

@robgjansen
Copy link
Copy Markdown
Member

In fact, you should be able to just add as many log streams as you want, where each has a log level and a destination. This is how c-tor logging works too.

@sporksmith
Copy link
Copy Markdown
Contributor Author

I think the default would be:

Stream 1:

* log-level=debug

* destination=file

Stream 2:

* log-level=info (or maybe just warn, but in both cases to include ERROR and WARN messages)

* destination=stderr

I'd be happy with something like this, with "stream 1" going to something like shadow.data/log.txt instead of the current stdout. I think it'd be a semver-breaking change, though.

Looking at https://shadow.github.io/docs/guide/semver.html, we do allow ourselves to create new files in the data directory. So we could create a new log file there.

But we also promise not to change default values of existing non-experimental options, which I think means we couldn't change the default logging to stdout to be less verbose (nor stop logging there by default at all in favor of stderr)

I suppose we could just do a major version bump, though it feels a little silly for just this.

@sporksmith
Copy link
Copy Markdown
Contributor Author

Oh, actually the default level is already info, which is what you proposed for stderr. https://shadow.github.io/docs/guide/shadow_config_spec.html#generallog_level

So I think without breaking semver we could start logging (e.g. at debug) to a file in the data directory, and add new options for disabling it, changing that log level, etc.

Likewise we could keep logging at info to stdout by default.

info is a bit more verbose than I'd generally want to stdout if we're not redirecting it; I think it'll make it easy for error-level messages to get lost. I think it's not too bad to live with for now though, and we could change it at the next major version bump.

@sporksmith
Copy link
Copy Markdown
Contributor Author

So I think without breaking semver we could start logging (e.g. at debug)

On the other (other other other) hand, while this doesn't technically break semver, it could result in surprising changes in storage usage for large simulations

@robgjansen
Copy link
Copy Markdown
Member

Right. Here is the current behavior that we would need to maintain until a major version bump:

-l, --log-level <level>                                                                                                                           
          Log level of output written on stdout. If Shadow was built in release mode, then log messages at level 'trace' will always be dropped     
          [default: "info"]

If we had the "add a log stream" mechanism, in addition to the two I noted above, we could add a third:

Stream 3:

  • log-level=<level from the --log-level option>
  • destination=stdout

And then on the next major version bump we could remove Stream 3 and the --log-level option in favor of a default set of options that enable the Stream 1 and Stream 2 I mentioned above..

Or something. But I'm not sure that anyone is asking for this. So maybe this is just a brainstorm for now and we file the idea away in an issue? Unless there is motiviation to handle it now while we're thinking about it :)

@sporksmith
Copy link
Copy Markdown
Contributor Author

Yeah, I like this idea, but I'd be in favor of getting something like this PR in for now to solve some of our pain in the Tor CI and continuing to think about this multi-logging feature a bit more

@stevenengler
Copy link
Copy Markdown
Contributor

Okay there's a lot for me to read here, so I will make a proper comment here shortly :) But I just wanted to mention that I don't think we should log at "debug" level anywhere by default. My opinion is that shadow is already too noisy at the default info level, and its not a good experience for new users when shadow dumps a lot of irrelevant info by default. I think "debug" should only be enabled when people are debugging something, for both disk and performance reasons (lots of logging will significantly slow down your simulation).

@stevenengler
Copy link
Copy Markdown
Contributor

stevenengler commented Oct 23, 2024

I added a stdout flush before we write to stderr. Also an explicit stderr flush before we write to stdout again, though there was already an implicit one.

Thanks, that looks good to me!

I think this addresses the mid-line interleaving.

Thinking about this some more, I think the mid-line interleaving is more complicated than just application buffering. The terminal just gets a stream of bytes for each of stdout/stderr, and shadow could write to both of those before the terminal wakes up to read them. In that case it wouldn't know whether to display stdout first or stderr. And it would have to chunk the reads from stdout/stderr, and possibly interleave the reads at some point so that stderr doesn't get "left behind" (I can't think of the right word here).

But I'm okay with ignoring this. Hopefully it will be rare in practice, and if it ever causes noticeable problems then we can revisit.

Oh good point, I missed that #3426 wanted both for them to be printed to stderr and repeated in the summary at the end.

I think #3426 was meant to be mostly that we have a "X managed processes in unexpected final state" at the end of the simulation, but we don't actually say which processes those were. If the simulation fails, those processes are probably the most important information for the user. And since we've already got some tracking set up, since we know the number X that failed, it would be good to show more details here. For example if it showed you that curl exited with status "3", then you wouldn't even need to open the log. You'd know exactly which processes' stdout to look at, or maybe you'd go right to the man page and see that it was a "URL malformed" error.

So I don't think it would hurt to put them in the summary, even if they're written earlier to stderr.

In fact, you should be able to just add as many log streams as you want, where each has a log level and a destination. This is how c-tor logging works too.

I would be good with this, but don't have strong feelings either way.

I suppose we could just do a major version bump, though it feels a little silly for just this.

I agree it feels silly, but also it's just a number that's encoding some semantics. Maybe if we quickly get to version >10, it won't feel as bad to make these changes :)

On the other (other other other) hand, while this doesn't technically break semver, it could result in surprising changes in storage usage for large simulations

I'd be a fan of a major version bump for that kind of change. The "MAJOR version increases (ex: 1.2.2 to 2.0.0) are intended for incompatible changes" is ambiguous, so I think we should go with whatever version bump best reflects the changes that we make. And since this would have a high probability of breaking someone's workflow, my vote would be for breaking change.

And then on the next major version bump we could remove Stream 3 and the --log-level option in favor of a default set of options that enable the Stream 1 and Stream 2 I mentioned above..

I'd vote for no intermediate stage and just jump directly to major version bump and implement the final version that we want.

Or something. But I'm not sure that anyone is asking for this. So maybe this is just a brainstorm for now and we file the idea away in an issue? Unless there is motiviation to handle it now while we're thinking about it :)

I don't have strong motivation to work on it, but the idea sounds good to me if anyone else thinks it will be useful.

Copy link
Copy Markdown
Contributor

@stevenengler stevenengler left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm good with these changes, especially since there's a real world use case for them (the tor CI). And the tor CI probably wants to see other error messages anyways and not just the "process failed" ones, so this PR helps with that too.

I don't think this PR completely fixes #3426 in the way that I had pictured it, so I think I'd like to keep it open. Mainly for the case where stdout and stderr are both printed to the terminal. And so we don't have to say "see the log for details", and we can just show the relevant details right there regardless of how shadow's output is being redirected (or not redirected).

I think the multiple output streams idea isn't impacted by this PR, and we don't need to block this PR while we decide on that.

@sporksmith
Copy link
Copy Markdown
Contributor Author

The terminal just gets a stream of bytes for each of stdout/stderr

I think that's not (typically) true, and in the common merged case it's a single stream. I suppose it could depend on the shell, though. e.g. on my machine, using the script:

#!/usr/bin/env python3

import os
for fd in [1, 2]:
    stat = os.fstat(fd)
    dev = stat.st_dev
    ino = stat.st_ino
    print(f"fd {fd}: dev:{dev} ino:{ino}")

With no redirection they're the exact same device and inode, which I think means the shell is reading back from a single pipe.

$ ~/fd.py 
fd 1: dev:26 ino:6
fd 2: dev:26 ino:6

vs other cases:

$ ~/fd.py 2> /dev/null
fd 1: dev:26 ino:6
fd 2: dev:6 ino:5

$ ~/fd.py | cat
fd 1: dev:15 ino:2890634
fd 2: dev:26 ino:6

Trying to detect whether stdout and stderr are terminals are tricky.  In
particular, in CI environments where job output is being logged but is
also displayed in some web-pseudo-terminal it's difficult to say what
the right behavior is.

It's probably better to have consistent behavior here that doesn't
quietly change under you based on what type of device we think stdout
and stderr are.

Effectively reverts 223aa4a.

Progress on shadow#3426.
I think conventionally stdout should be more of a
standardized/machine-parseable format (as our log records are), and
stderr is intended for human consumption, e.g. on a terminal.

Reformatting the log lines this way when mirroring them to stderr makes
them a little friendlier, and makes it a little easier to understand the
duplication in cases where stdout and stderr are merged.

With an error injected into our sleep test, and stdout redirected to a
file:

```
$ (cd build/src/test/sleep/ && rm -rf sleep-shadow.data       &&  /home/jnewsome/projects/shadow/build/src/main/shadow       --data-directory=sleep-shadow.data       --log-level=trace       --use-cpu-pinning false --strace-logging-mode standard --parallelism 1       /home/jnewsome/projects/shadow/src/test/sleep/sleep.yaml > shadow.log)
** Starting Shadow 3.2.0
Error: process 'testnode.test_sleep.1000' exited with status Normal(101); expected end state was exited: 0 but was exited: 101
Error: Failed to run the simulation
Error:
Error: Caused by:
Error:     1 managed processes in unexpected final state
** Shadow did not complete successfully: Failed to run the simulation
**   1 managed processes in unexpected final state
** See the log for details
```

And in merged output:

```
$ (cd build/src/test/sleep/ && rm -rf sleep-shadow.data       &&  /home/jnewsome/projects/shadow/build/src/main/shadow       --data-directory=sleep-shadow.data       --log-level=trace       --use-cpu-pinning false --strace-logging-mode standard --parallelism 1       /home/jnewsome/projects/shadow/src/test/sleep/sleep.yaml)
...
00:00:00.016603 [507267:shadow-worker] 00:00:01.000000000 [DEBUG] [testnode:11.0.0.1] [process.rs:1486] [shadow_rs::host::process] process 'testnode.test_sleep.1000' has completed or is otherwise no longer running
Error: process 'testnode.test_sleep.1000' exited with status Normal(101); expected end state was exited: 0 but was exited: 101
00:00:00.016660 [507267:shadow-worker] 00:00:01.000000000 [ERROR] [testnode:11.0.0.1] [process.rs:1571] [shadow_rs::host::process] process 'testnode.test_sleep.1000' exited with status Normal(101); expected end state was exited: 0 but was exited: 101
00:00:00.016710 [507267:shadow-worker] 00:00:01.000000000 [TRACE] [testnode:11.0.0.1] [process.rs:750] [shadow_rs::host::process] Not notifying parent of exit: no signal specified
...
```

The duplication in the merged case is still a little weird, but at least
identifiable.
This seems a little more accurate since we're no longer writing the full
log line to stderr.
Since we're using a buffered writer to write multiple lines to stdout,
previously it may have been possible to be in the state of having
flushed only part of a line to stdout when writing to stderr, which
could result in confusing interleaving when stdout and stderr are the
same destination.
@sporksmith sporksmith enabled auto-merge October 23, 2024 16:02
@sporksmith sporksmith merged commit 4e59b13 into shadow:main Oct 23, 2024
@sporksmith sporksmith deleted the errs-stderr branch October 23, 2024 16:34
@stevenengler
Copy link
Copy Markdown
Contributor

The terminal just gets a stream of bytes for each of stdout/stderr

I think that's not (typically) true, and in the common merged case it's a single stream.

Ah good point, I didn't realize they used the same pipe, but makes sense. I also see that with stat -L /dev/std{out,err}.

@sporksmith
Copy link
Copy Markdown
Contributor Author

Tried to capture the multi-logging "feature request" in #3430.

I'm +1 that it seems nice but I don't intend to work on it in the near future.

zydou pushed a commit to zydou/arti that referenced this pull request Oct 29, 2024
The changes in shadow/shadow#3428 should add
more details to the CI output of Shadow simulations.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Component: Build Build/install tools and dependencies Component: Documentation In-repository documentation, under docs/ Component: Main Composing the core Shadow executable Component: Testing Unit and integration tests and frameworks

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants