Skip to content

Use named pipes for shim logs#2631

Merged
crosbymichael merged 1 commit intocontainerd:masterfrom
masters-of-cats:shim-io-redirect
Nov 27, 2018
Merged

Use named pipes for shim logs#2631
crosbymichael merged 1 commit intocontainerd:masterfrom
masters-of-cats:shim-io-redirect

Conversation

@danail-branekov
Copy link
Contributor

Closes #2606

When waiting for a golang cmd to exit, golang will block exiting on all of its process streams being closed. Currently, the containerd server is configuring each shim processes output (stdout/stderr) to redirect to the server’s stdout/stderr. When running the containerd restart tests, these do not get closed - and waiting for the server process to finish hangs.
Since restarting the containerd server does not restart the shim processes, the shims output streams could still be outputting logs, never to be closed. The aim of this PR is to let the shim instead log to named pipes.

  • When a new task is created, the server:
  • creates a new named pipe (one for stdout and one for stderr).
  • assigns the start shim command stdout/err to the write ends of these pipes
  • schedules a go routine to copy from the pipes’ read end to os.Stdout/os.Stderr, thus capturing the shim’s logs
  • When the shim process starts it opens the stout and stderr pipes in read-only mode in order to keep the pipes open even if the server process goes down
  • Note that the shim process never reads from the pipes, it just keeps them open
  • When a containerd server comes back and restores its tasks:
  • it opens the pipes that have been already created
  • is schedules a goroutine to copy the pipes contents to os.Stdout/os.Stderr, thus capturing the logs of an already running shim processes
  • When shim debugging is off shim processes stdout/stderr are set to nil i.e. the server does not capture shim logs

In order to be able to test the change we:

  • Enabled shim debugging for all tests as the containerd server lifecycle and its settings are managed globally in TestMain
  • In order to be able to verify that the server connects to existing shim processes stdio we had to make the server output available to tests. Initially we attempted to read from the stdio buffer from the test but because the buffer is shared accross all parallel tests we go a d
    ata race. Therefore we sent the server output to a temporary file which tests can use to check the output without data races.

We noticed that there is something similar implemented for v2, however we do not really understand how to use and run tests against it. It seems in v2, it creates a single pipe for both stdout and stderr rather than one for each stream and the server does not reconnect to those pipes when restoring running tasks. We could fix it for v2 if we knew how!

cc @ostenbom

@crosbymichael
Copy link
Member

some packages on github are having some issues right now, i'm looking into the CI issues

@crosbymichael
Copy link
Member

@danail-branekov for v2. You can use it via ctr run --runtime io.containerd.runc.v1. I feel like it should work as we already use a named pipe but I havn't testing your reproduction case yet.

@danail-branekov
Copy link
Contributor Author

@danail-branekov for v2. You can use it via ctr run --runtime io.containerd.runc.v1. I feel like it should work as we already use a named pipe but I havn't testing your reproduction case yet.

My question was more about how to test v2 in an automated manner. Is that possible?
Running the very same integration tests against v2 does make sense to me, still the Makefile does not provide that capability.

@crosbymichael
Copy link
Member

Ahh, the CI does run both runtimes.

You should be able to set TEST_RUNTIME=io.containerd.runc.v1 and run the tests locally

@danail-branekov danail-branekov force-pushed the shim-io-redirect branch 2 times, most recently from 1b508ba to 74bbef0 Compare November 2, 2018 14:52
@danail-branekov
Copy link
Contributor Author

I have rebased the PR onto master and made the TestDaemonReconnectsToShimIOPipesOnRestart test verify that containerd reconnects to the shim log after restart for both v1 and v2. As a matter of fact this capability was already there for v2.

There are however some discrepancies between v1 and v2 logging:

  • v1 has separate log files for stdout and stderr that are located in /<root-dir>/io.containerd.runtime.v1.linux/<namespace>/<id>
  • v2 has a log file for stderr only located in /<state-dir>/io.containerd.runtime.v2.task/<namespace>/<id>

Do we need need to address these discrepancies and always put files in either root-dir or state-dir? I was also tempted to make v2 have separate log files for stdout/stderr but I saw that v2 uses stdout to communicate between the service and the task manager e.g. here.

@crosbymichael
Copy link
Member

@danail-branekov ya, i see what you are talking about. Let me look at the code again and see what would be a good solution.

Also don't worry about the windows tests, we have a PR to fix the gcc issue that poped up over night.

@jterry75
Copy link
Contributor

jterry75 commented Nov 2, 2018

V2 shim logging is done via a Fifo on Linux and Named Pipe on Windows. It does not use stdout, stderr so this should not affect the V2 runtimes at all.

@danail-branekov
Copy link
Contributor Author

@jterry75 Actually, v2 would forward stderr to the named pipe and would use stdout to return the shim address. This PR does pretty much the same for v1, except that the PR would forward both stdout and stderr to named pipes.
And yes, this PR does not affect v2 (unless we decide to unify named pipes implementation) except for introducing a test relevant to both v1 and v2 that verifies that a) the server reconnects to the named pipe on restart and b) shim logs are forwarded to the server log if debug is enabled.

@BooleanCat
Copy link
Contributor

@crosbymichael Has that GCC windows issue been resolved? If so we can force push to our fork to trigger CI again.

@jterry75
Copy link
Contributor

@BooleanCat - Yes it has been fixed for some time now.

@BooleanCat
Copy link
Contributor

@danail-branekov We tried rebasing on master and we still need windows failures. Let's take a deeper look at this.

@yulianedyalkova yulianedyalkova force-pushed the shim-io-redirect branch 2 times, most recently from 272de00 to 0fa0a9c Compare November 16, 2018 13:53
Relating to issue [containerd#2606](containerd#2606)

Co-authored-by: Oliver Stenbom <ostenbom@pivotal.io>
Co-authored-by: Georgi Sabev <georgethebeatle@gmail.com>
Co-authored-by: Giuseppe Capizzi <gcapizzi@pivotal.io>
Co-authored-by: Danail Branekov <danailster@gmail.com>

Signed-off-by: Oliver Stenbom <ostenbom@pivotal.io>
Signed-off-by: Georgi Sabev <georgethebeatle@gmail.com>
Signed-off-by: Giuseppe Capizzi <gcapizzi@pivotal.io>
Signed-off-by: Danail Branekov <danailster@gmail.com>
@codecov-io
Copy link

codecov-io commented Nov 16, 2018

Codecov Report

Merging #2631 into master will not change coverage.
The diff coverage is n/a.

Impacted file tree graph

@@           Coverage Diff           @@
##           master    #2631   +/-   ##
=======================================
  Coverage   43.72%   43.72%           
=======================================
  Files         100      100           
  Lines       10734    10734           
=======================================
  Hits         4693     4693           
  Misses       5311     5311           
  Partials      730      730
Flag Coverage Δ
#linux 47.39% <ø> (ø) ⬆️
#windows 40.89% <ø> (ø) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 44b90df...1d4105c. Read the comment docs.

@Callisto13
Copy link
Contributor

The windows failures were because we introduced a file with unix imports and had not disabled the build on windows (// +build !windows). This is resolved now, so we should be good to go.

bump @crosbymichael

@crosbymichael
Copy link
Member

Thanks, I'll review this today

@crosbymichael
Copy link
Member

LGTM

@BooleanCat
Copy link
Contributor

/poke Any thoughts on this? We'd like to get this merged before it gets stale.

@crosbymichael
Copy link
Member

cc @ehazlett @AkihiroSuda @dmcgowan

@ehazlett
Copy link
Member

LGTM

@crosbymichael crosbymichael merged commit 3eae8b9 into containerd:master Nov 27, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

TestDaemonRestart hangs if shim_debug is enabled

8 participants