Skip to content

Send HTTP Hijack headers after successful attach#7451

Merged
edsantiago merged 1 commit intocontainers:masterfrom
mheon:fix_7195
Aug 27, 2020
Merged

Send HTTP Hijack headers after successful attach#7451
edsantiago merged 1 commit intocontainers:masterfrom
mheon:fix_7195

Conversation

@mheon
Copy link
Copy Markdown
Member

@mheon mheon commented Aug 25, 2020

Our previous flow was to perform a hijack before passing a connection into Libpod, and then Libpod would attach to the container's attach socket and begin forwarding traffic.

A problem emerges: we write the attach header as soon as the attach complete. As soon as we write the header, the client assumes that all is ready, and sends a Start request. This Start may be processed before we successfully finish attaching, causing us to lose output.

The solution is to handle hijacking inside Libpod. Unfortunately, this requires a downright extensive refactor of the Attach and HTTP Exec StartAndAttach code. I think the result is an improvement in some places (a lot more errors will be handled with a proper HTTP error code, before the hijack occurs) but other parts, like the relocation of printing container logs, are just bad. Still, we need this fixed now to get CI back into good shape...

Fixes #7195

@openshift-ci-robot
Copy link
Copy Markdown
Collaborator

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: mheon

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci-robot openshift-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Aug 25, 2020
@mheon
Copy link
Copy Markdown
Member Author

mheon commented Aug 25, 2020

@baude @edsantiago Bear witness to my sorrow.

I think this probably fixes the #7195 flake but I have not performed anything more than trivial tests yet.

@mheon mheon force-pushed the fix_7195 branch 2 times, most recently from d839aa9 to b67ee9d Compare August 25, 2020 21:14
@mheon
Copy link
Copy Markdown
Member Author

mheon commented Aug 25, 2020

This appears to have made things worse. My interruption of their delicate dance has offended the attach functions and they will work no longer.

I will attempt to appease them tomorrow.

@mheon mheon force-pushed the fix_7195 branch 2 times, most recently from d662739 to 199158d Compare August 26, 2020 17:41
@mheon
Copy link
Copy Markdown
Member Author

mheon commented Aug 26, 2020

I think this is going to go green now.

@mheon
Copy link
Copy Markdown
Member Author

mheon commented Aug 26, 2020

Apparently I broke exec exit codes here - it looks like a race where it's querying for exec status before the exec session is marked as exited.

@mheon
Copy link
Copy Markdown
Member Author

mheon commented Aug 26, 2020

Alright, I know what's going on (connection is being closed earlier than when it used to be closed, thus causing the client to query for exec session status a lot earlier than it used to, before we have time to save it to disk). Still need to figure out how to actually fix it, the code is not structured in a way that is conducive to closing the connection later.

@mheon mheon force-pushed the fix_7195 branch 2 times, most recently from 6db6b01 to 7d65478 Compare August 27, 2020 14:36
@mheon
Copy link
Copy Markdown
Member Author

mheon commented Aug 27, 2020

Failed to stop: Error getting access token for service account: Remote host terminated the handshake

Second time I've seen this...

@mheon
Copy link
Copy Markdown
Member Author

mheon commented Aug 27, 2020

[+0280s] # error opening file /sys/fs/cgroup//system.slice/crun-buildah-buildah175193391.scope/container/cgroup.freeze: No such file or directory

Flakes in our flake fixes...

@edsantiago
Copy link
Copy Markdown
Member

At least that one usually passes on rerun

@mheon
Copy link
Copy Markdown
Member Author

mheon commented Aug 27, 2020

Nevermind, exec tests look red still. Damn it.

Will look more after lunch.

@mheon
Copy link
Copy Markdown
Member Author

mheon commented Aug 27, 2020

Re-pushed again. I think this might be it.

Our previous flow was to perform a hijack before passing a
connection into Libpod, and then Libpod would attach to the
container's attach socket and begin forwarding traffic.

A problem emerges: we write the attach header as soon as the
attach complete. As soon as we write the header, the client
assumes that all is ready, and sends a Start request. This Start
may be processed *before* we successfully finish attaching,
causing us to lose output.

The solution is to handle hijacking inside Libpod. Unfortunately,
this requires a downright extensive refactor of the Attach and
HTTP Exec StartAndAttach code. I think the result is an
improvement in some places (a lot more errors will be handled
with a proper HTTP error code, before the hijack occurs) but
other parts, like the relocation of printing container logs, are
just *bad*. Still, we need this fixed now to get CI back into
good shape...

Fixes containers#7195

Signed-off-by: Matthew Heon <matthew.heon@pm.me>
@mheon
Copy link
Copy Markdown
Member Author

mheon commented Aug 27, 2020

Network flake in the docs job

@mheon
Copy link
Copy Markdown
Member Author

mheon commented Aug 27, 2020

@rhatdan @baude @giuseppe @TomSweeneyRedHat @QiWang19 PTAL and merge, please. Fixes the flake keeping CI red.

@baude
Copy link
Copy Markdown
Member

baude commented Aug 27, 2020

LGTM

1 similar comment
@QiWang19
Copy link
Copy Markdown
Member

LGTM

@edsantiago
Copy link
Copy Markdown
Member

Looks green to me. I am supremely unqualified to review the code, but the results talk to me and I'm about to take all my PRs, rebase them, and see how they go. So...

/lgtm

Thank you @mheon. This was a nasty one.

@openshift-ci-robot openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label Aug 27, 2020
@edsantiago
Copy link
Copy Markdown
Member

Oh mergebot... yoo-hoo...

@edsantiago edsantiago merged commit b13af45 into containers:master Aug 27, 2020
edsantiago added a commit to edsantiago/libpod that referenced this pull request Sep 28, 2020
 - pause test: enable when rootless + cgroups v2
   (was previously disabled for all rootless)

 - run --pull: now works with podman-remote
   (in containers#7647, thank you @jwhonce)

 - various other run/volumes tests: try reenabling
   It looks like containers#7195 was fixed (by containers#7451? I'm not
   sure if I'm reading the conversation correctly).
   Anyway, remove all the skip()s on 7195. Only time
   will tell if it's really fixed)

Also:

 - new test for podman image tree --whatrequires
   (because TIL). Doesn't work with podman-remote.

Signed-off-by: Ed Santiago <santiago@redhat.com>
@github-actions github-actions bot added the locked - please file new issue/PR Assist humans wanting to comment on an old issue or PR with locked comments. label Sep 24, 2023
@github-actions github-actions bot locked as resolved and limited conversation to collaborators Sep 24, 2023
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. lgtm Indicates that a PR is ready to be merged. locked - please file new issue/PR Assist humans wanting to comment on an old issue or PR with locked comments.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

CI flake: podman-remote: no output from container (and "does not exist in database"?)

5 participants