Send HTTP Hijack headers after successful attach by mheon · Pull Request #7451 · containers/podman

mheon · 2020-08-25T21:08:10Z

Our previous flow was to perform a hijack before passing a connection into Libpod, and then Libpod would attach to the container's attach socket and begin forwarding traffic.

A problem emerges: we write the attach header as soon as the hijack completes. As soon as we write the header, the client assumes that all is ready, and sends a Start request. This Start may be processed before we successfully finish attaching, causing us to lose output.

The solution is to handle hijacking inside Libpod. Unfortunately, this requires a downright extensive refactor of the Attach and HTTP Exec StartAndAttach code. I think the result is an improvement in some places (a lot more errors will be handled with a proper HTTP error code, before the hijack occurs) but other parts, like the relocation of printing container logs, are just bad. Still, we need this fixed now to get CI back into good shape...

Fixes #7195

openshift-ci-robot · 2020-08-25T21:08:13Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: mheon

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details

Needs approval from an approver in each of these files:

~~OWNERS~~ [mheon]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

mheon · 2020-08-25T21:08:48Z

@baude @edsantiago Bear witness to my sorrow.

I think this probably fixes the #7195 flake but I have not performed anything more than trivial tests yet.

mheon · 2020-08-25T21:58:46Z

This appears to have made things worse. My interruption of their delicate dance has offended the attach functions and they will work no longer.

I will attempt to appease them tomorrow.

mheon · 2020-08-26T17:56:57Z

I think this is going to go green now.

mheon · 2020-08-26T18:53:55Z

Apparently I broke exec exit codes here. It looks like a race where the client queries for exec status before the exec session is marked as exited.

mheon · 2020-08-26T20:51:32Z

Alright, I know what's going on: the connection is being closed earlier than it used to be, so the client queries for exec session status much earlier than before, before we have time to save it to disk. I still need to figure out how to actually fix it; the code is not structured in a way that is conducive to closing the connection later.
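
A minimal sketch of the ordering constraint behind this race, under stated assumptions: it is not the fix applied in this PR, and `sessionStore`, `runExec`, and `clientExitCode` are hypothetical names. The client treats the closed connection as the signal to query the exit status, so the status has to be saved before the connection is closed.

```go
package main

import (
	"fmt"
	"sync"
)

// sessionStore is a hypothetical stand-in for the persisted exec session state.
type sessionStore struct {
	mu     sync.Mutex
	exited bool
	code   int
}

// save records that the session has exited with the given code.
func (s *sessionStore) save(code int) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.exited = true
	s.code = code
}

// inspect returns the recorded exit code and whether the session has exited.
func (s *sessionStore) inspect() (int, bool) {
	s.mu.Lock()
	defer s.mu.Unlock()
	return s.code, s.exited
}

// runExec simulates the server side. It saves the exit code first and only
// then closes connClosed, which stands in for closing the hijacked connection.
func runExec(s *sessionStore, code int, connClosed chan<- struct{}) {
	s.save(code)
	close(connClosed)
}

// clientExitCode simulates the client: it waits for the connection to close
// and then immediately asks for the exec session status.
func clientExitCode(s *sessionStore, code int) (int, bool) {
	connClosed := make(chan struct{})
	go runExec(s, code, connClosed)
	<-connClosed
	return s.inspect()
}

func main() {
	code, exited := clientExitCode(&sessionStore{}, 42)
	fmt.Println(code, exited)
}
```

If the two statements in `runExec` were swapped, the client could observe a session that is not yet marked as exited, which is the symptom described in the comment above.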

mheon · 2020-08-27T15:17:47Z

Failed to stop: Error getting access token for service account: Remote host terminated the handshake

Second time I've seen this...

mheon · 2020-08-27T15:42:23Z

[+0280s] # error opening file /sys/fs/cgroup//system.slice/crun-buildah-buildah175193391.scope/container/cgroup.freeze: No such file or directory

Flakes in our flake fixes...

edsantiago · 2020-08-27T15:48:27Z

At least that one usually passes on rerun

mheon · 2020-08-27T15:51:30Z

Never mind, exec tests look red still. Damn it.

Will look more after lunch.

mheon · 2020-08-27T16:48:20Z

Re-pushed again. I think this might be it.

mheon · 2020-08-27T17:00:07Z

Network flake in the docs job

mheon · 2020-08-27T18:29:44Z

@rhatdan @baude @giuseppe @TomSweeneyRedHat @QiWang19 PTAL and merge, please. Fixes the flake keeping CI red.

baude · 2020-08-27T18:30:24Z

LGTM

QiWang19 · 2020-08-27T18:30:54Z

LGTM

edsantiago · 2020-08-27T18:41:42Z

Looks green to me. I am supremely unqualified to review the code, but the results talk to me and I'm about to take all my PRs, rebase them, and see how they go. So...

/lgtm

Thank you @mheon. This was a nasty one.

edsantiago · 2020-08-27T18:54:05Z

Oh mergebot... yoo-hoo...

@jwhonce

- pause test: enable when rootless + cgroups v2 (was previously disabled for all rootless)
- run --pull: now works with podman-remote (in containers#7647, thank you @jwhonce)
- various other run/volumes tests: try reenabling

It looks like containers#7195 was fixed (by containers#7451? I'm not sure if I'm reading the conversation correctly). Anyway, remove all the skip()s on 7195. (Only time will tell if it's really fixed.)

Also:

- new test for podman image tree --whatrequires (because TIL). Doesn't work with podman-remote.

Signed-off-by: Ed Santiago <santiago@redhat.com>

openshift-ci-robot requested review from TomSweeneyRedHat and jwhonce August 25, 2020 21:08

openshift-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Aug 25, 2020

mheon force-pushed the fix_7195 branch 2 times, most recently from d839aa9 to b67ee9d Compare August 25, 2020 21:14

mheon force-pushed the fix_7195 branch 2 times, most recently from d662739 to 199158d Compare August 26, 2020 17:41

mheon force-pushed the fix_7195 branch 2 times, most recently from 6db6b01 to 7d65478 Compare August 27, 2020 14:36

mheon force-pushed the fix_7195 branch from 7d65478 to b318113 Compare August 27, 2020 16:47

mheon force-pushed the fix_7195 branch from b318113 to 2ea9dac Compare August 27, 2020 16:50

openshift-ci-robot assigned edsantiago Aug 27, 2020

openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label Aug 27, 2020

edsantiago merged commit b13af45 into containers:master Aug 27, 2020

edsantiago mentioned this pull request Sep 28, 2020

System tests: reenable some skipped tests #7803

Merged

flouthoc mentioned this pull request Nov 8, 2021

Unwanted logging output in API Bindings package #12204

Closed

github-actions bot added the locked - please file new issue/PR Assist humans wanting to comment on an old issue or PR with locked comments. label Sep 24, 2023

github-actions bot locked as resolved and limited conversation to collaborators Sep 24, 2023

5 participants