Skip to content

tree -> index diff for status#1363

Merged
Byron merged 9 commits intomainfrom
status
May 14, 2024
Merged

tree -> index diff for status#1363
Byron merged 9 commits intomainfrom
status

Conversation

@Byron
Copy link
Member

@Byron Byron commented May 7, 2024

Based on #1350


diff-correctness → gix-status → gix reset


Improve gix status to the point where it's suitable for use in reset functionality.
Leads to a proper worktree reset implementation, eventually leading to a high-level reset similar to how git supports it.

Architecture

The reason this PR deals quite a bit with gix status is that for a safe implementation of reset() we need to be sure that the files we would want to touch don't don't carry modifications or are untracked files. In order to know what would need to be done, we have to diff the current-index with target-index. The set of files to touch can then be used to lookup information provided by git-status, like worktree modifications, index modifications, and untracked files, to know if we can proceed or not. Here is also where the reset-modes would affect the outcome, i.e. what to change and how.

This is a very modular approach which facilitates testing and understanding of what otherwise would be a very complex algorithm. Having a set of changes as output also allows to one day parallelize applying these changes.

This leaves us in a situation where the current checkout() implementation wants to become a fastpath for situations where the reset involves an empty tree as source (i.e. create everything and overwrite local changes).

Extra Tasks

Out-of-band tasks that just should finally be done, with potential for great impact.

  • support for hasconfig as part of resolve_includes() without actual lookahead.

Tasks

  • diff index with index to learn what we would want to do in the worktree, or alternatively,
    diff tree with index (with reverse-diff functionality to simulate diff of index with tree), for better performance as it
    would avoid having to allocate a whole index even though we are only interested in a diff.
    • Must include rename tracking.
  • how to make diff results available from status with all transformations applied, to allow user to obtain diffs of any kind? Compare to tree-tree-diff which has that functionality already.
  • update is_dirty() and Submodule::status() to do full status.

Status Enables

  • cargo package and its use of complete status information.
  • Add gitoxide backend Byron/built#1
  • starship native dirty check (but needs diffstats of worktree-vs-index)
  • built can get fully-functional is-dirty flag for 'describe()'

Inbetween

Next PR: Reset

  • reset() that checks if it's allowed to perform a worktree modification is allowed, or if an entry should be skipped. That way we can postpone safety checks like --hard

Postponed

What follows is important for resets, but won't be needed for cargo worktree resets.

  • a way to expand sparse dirs (but figure out if this is truly always necessary) - probably not, unless sparse dirs can be empty, but even then no expansion is needed
    • wire it up in gix index entries to optionally expand sparse entries
  • gix status with implemented 'porcelain-v2` display mode

Research

  • Ignored files are considered expandable and can be overwritten on reset
  • How to integrate submodules - probably easy to answer once gix status can deal a little better with submodules. Even though in this case a lot of submodule-related information is needed for a complete reset, probably only doable by a higher-level caller which orchestrates it.
  • How to deal with various modes like merge and keep? How to control refresh? Maybe partial (only the files we touch), and full, to also update the files we don't touch as part of status? Maybe it's part of status if that is run before.
  • Worthwhile to make explicit the difference between git reset and git checkout in terms of HEAD modifications. With the former changing HEADs referent, and the latter changing HEAD itself.
  • figure out how this relates to the current checkout() method as technically that's a reset --hard with optional overwrite check. Could it be rolled into one, with pathspec support added?
    • just keep them separate until it's clear that reset() performs just as well, which is unlikely as there is more overhead. But maybe it's not worth to maintain two versions over it. But if so, one should probably rename it.
  • for git status: what about rename tracking? It's available for tree-diffs and quite complex on its own. Probably only needs HEAD-vs-index rename tracking. No, also can have worktree rename tracking, even though it's hard to imagine how this can be fast unless it's tightly integrated with untracked-files handling. This screams for a generalization of the tracking code though as the testing and implementation is complex, but should be generalisable.

Re-learn

  • pathspecs normalize themselves to turn from any kind of specification into repo-root relative patterns.
  • attribute/ignore file sources are naturally relative to the root of the repo, which remains relative (i.e. can be .. and that root will be always be used to open files like ../.gitignore, which is useful for display to the user)

turns out that Powershell is used by default there, and variable substitutions
are different there.
What's worse is that it doesn't seem to fail if something doesn't work as one
would expect.
@EliahKagan
Copy link
Member

Regarding 91b6549, is it required to use PowerShell? Otherwise, shell: bash can be used, even on Windows, which gives Git Bash, so anything that is due to PowerShell rather than a POSIX-compatible shell being used may be fixable that way.

@Byron
Copy link
Member Author

Byron commented May 8, 2024

Thanks for the reminder! Yes, that would have worked too. Overall, I think using the ${{ env.var }} syntax is more portable and should be preferred. There might be other cases where it's definitely better to change the shell though.

@EliahKagan
Copy link
Member

Yes, there is nothing wrong with ${{ env.var }}. The only reason I mentioned shell: bash was this bit from the commit message:

What's worse is that it doesn't seem to fail if something doesn't work as one
would expect.

Achieving the effect of set -e or set -u is less obvious and automatic in PowerShell than in POSIX-compatible shells, so I figured the ability to use bash might be relevant there.

@Byron
Copy link
Member Author

Byron commented May 8, 2024

Achieving the effect of set -e or set -u is less obvious and automatic in PowerShell than in POSIX-compatible shells, so I figured the ability to use bash might be relevant there.

And after writing the bit about powershell I also realized that I am so used to set -euo pipefail that maybe I forgot how bash would have done the same in that situation. But I am not sure how the defaults are set on CI, where one would hope a failing command indeed triggers failure by default.

Previousoly it would make an additional syscall to change permissions, which
is slower than necessary.
@Byron Byron force-pushed the status branch 5 times, most recently from 9ca81df to 7c9186a Compare May 11, 2024 19:39
…ng a single index.

The motivating example is here: praetorian-inc/noseyparker#179

Previously, it was possible for a trampling herd of threads to consolidate the
disk state. Most of them would be 'needs-init' threads which could notice that
the initialization already happened, and just use that.

But a thread might be late for the party and somehow manages to not get any
newly loaded index, and thus tries to consolidate with what's on disk again.
Then it would again determine no change, and return nothing, causing the caller
to abort and not find objects it should find because it wouldn't see the index
that it should have seen.

The reason the thread got into this mess is that the 'is-load-ongoing' flagging
was racy itself, so it would not wait for ongoing loads and just conclude nothing
happened. An extra delay (by yielding) now assures it either seees the loading state
and waits for it, sees the newly loaded indices.

Note that this issue can be reproduced with:

```
'./target/release/gix -r repo-with-one-pack -t10 --trace odb stats --extra-header-lookup'
```
Related to helix-editor/helix#10660
which runs into object types that are unknown.

I have looked into this and [couldn't find evidence of a new pack-entry type](https://github.com/git/git/blob/0f3415f1f8478b05e64db11eb8aaa2915e48fef6/packfile.c#L1303-L1358)
in the Git codebase.

It also looks like that Git [will never write packs that aren't V2](https://github.com/git/git/blob/0f3415f1f8478b05e64db11eb8aaa2915e48fef6/pack-write.c#L352)
 - initially I thought it might write V3 based on some other criteria.

For now, the thesis is that one would have to be able to mark bad objects
to be able to handle this more gracefully, but all we can do is try to fail.
@Byron Byron force-pushed the status branch 5 times, most recently from 25e43ec to db2acf9 Compare May 13, 2024 12:29
@Byron Byron linked an issue May 13, 2024 that may be closed by this pull request
@Byron Byron merged commit 04ef31e into main May 14, 2024
@Byron Byron mentioned this pull request May 14, 2024
9 tasks
EliahKagan added a commit to EliahKagan/gitoxide that referenced this pull request Jun 25, 2024
When the metadata of a symlink's target cannot be obtained, even if
the error is something other than `NotFound`, this falls back to
creating file symbolic links. This only affects scenarios where
either the checkout would fail entirely or where the symlink would
have been treated as a collision and skipped (even though it was
not really a collision, since only its target had an error). Other
cases are not affected, and all exisitng scenarios where directory
symlink would be created will still create directory symlinks.

This builds on 31d02a8 (GitoxideLabs#1363) by supporting dangling symlinks even
when the target filenames are unusual, such as when the name is
invalid or reserved. Windows permits such symlinks to be created,
and going ahead and creating the matches the Git behavior.

This should also support other errors beisdes `NotFound`. For
example, some permissions-related errors, in some cases where
traversal or acccess (even to access metadata) are not allowed,
would fail to create a symlink. This should address that as well.

This works by using `Path::is_dir()` in the standard library, which
automatically converts all errors (not just `NotFound`) into
`false`. The logic here is thus quite similar to what was already
present, just more tolerant, even though the code itself is shorter
and simpler.
EliahKagan added a commit to EliahKagan/gitoxide that referenced this pull request Jun 26, 2024
When the metadata of a symlink's target cannot be obtained, even if
the error is something other than `NotFound`, this falls back to
creating file symbolic links. This only affects scenarios where
either the checkout would fail entirely or where the symlink would
have been treated as a collision and skipped (even though it was
not really a collision, since only its target had an error). Other
cases are not affected, and all exisitng scenarios where directory
symlink would be created will still create directory symlinks.

This builds on 31d02a8 (GitoxideLabs#1363) by supporting dangling symlinks even
when the target filenames are unusual, such as when the name is
invalid or reserved. Windows permits such symlinks to be created,
and going ahead and creating the matches the Git behavior.

This should also support other errors beisdes `NotFound`. For
example, some permissions-related errors, in some cases where
traversal or acccess (even to access metadata) are not allowed,
would fail to create a symlink. This should address that as well.

This works by using `Path::is_dir()` in the standard library, which
automatically converts all errors (not just `NotFound`) into
`false`. The logic here is thus quite similar to what was already
present, just more tolerant, even though the code itself is shorter
and simpler.

This fixes GitoxideLabs#1420, and also fixes GitoxideLabs#1421.
LuaKT pushed a commit to LMonitor/gitoxide that referenced this pull request Aug 20, 2024
When the metadata of a symlink's target cannot be obtained, even if
the error is something other than `NotFound`, this falls back to
creating file symbolic links. This only affects scenarios where
either the checkout would fail entirely or where the symlink would
have been treated as a collision and skipped (even though it was
not really a collision, since only its target had an error). Other
cases are not affected, and all exisitng scenarios where directory
symlink would be created will still create directory symlinks.

This builds on 31d02a8 (GitoxideLabs#1363) by supporting dangling symlinks even
when the target filenames are unusual, such as when the name is
invalid or reserved. Windows permits such symlinks to be created,
and going ahead and creating the matches the Git behavior.

This should also support other errors beisdes `NotFound`. For
example, some permissions-related errors, in some cases where
traversal or acccess (even to access metadata) are not allowed,
would fail to create a symlink. This should address that as well.

This works by using `Path::is_dir()` in the standard library, which
automatically converts all errors (not just `NotFound`) into
`false`. The logic here is thus quite similar to what was already
present, just more tolerant, even though the code itself is shorter
and simpler.

This fixes GitoxideLabs#1420, and also fixes GitoxideLabs#1421.
@EliahKagan
Copy link
Member

EliahKagan commented Jan 5, 2026

But I am not sure how the defaults are set on CI, where one would hope a failing command indeed triggers failure by default.

I had meant to replied to this, but it looks like I forgot. This is relevant to #2340 (comment), which reminded me of it, but I think other aspects related to the problem are more central there. So I figured I'd reply here, as I'd meant to. (I'll also reply there, about those aspects I think are more central to what we're discussing there.)

See also #1559, which fixed a bug related to fail-fast behavior inside a script step. (That shouln't be confused with #2342, which fixed a conceptually similar but worse bug that was not caused by script-specific semantics.)

The what

PowerShell does not generally treat an external command that fails with a nonzero exit status as an error. So while it can stop on errors, and it can be made to return the exit status of the last command it runs, as far as I know there is no simple foolproof way to implicitly cause intermediate external commands in a multi-command script to terminate the script (and indicate failure).

The way PowerShell is run by default on CI does not attempt to do that, both back when this PR was merged and today. For example, consider this test workflow:

name: Test if PowerShell fails fast

on: [push, pull_request, workflow_dispatch]

jobs:
  powershell-failfast:
    runs-on: windows-latest

    steps:
      - name: Check out repository
        uses: actions/checkout@v4
      - name: External command reporting failure
        run: |
          echo a >t
          git add t
          echo b >>t
          git diff --exit-code -- t  # Should fail.
          echo $LASTEXITCODE
          git log --oneline -1  # Should succeed, if run.
      - name: Cmdlet reporting an error
        run: |
          cat foo
          echo done

As shown in a test run:

  1. The git diff command has an exit status of 1, but the first script step that contains it continues--and reports success, since the last command it runs has an exit status of 0.
  2. In contrast, since cat in PowerShell on Windows is an alias for the Get-Content cmdlet, when it fails to open a file, this is an error closer to PowerShell. That fails the step that runs it, and does so immediately, not running the subsequent command.

The why

This doesn't entirely answer the question of why PowerShell is not run by default on GitHub Actions runners (with the pwsh or powershell values of the shell key, or behavior on Windows that defaults to them based on availability in that order) with something like -e.

I think the reason is that PowerShell doesn't easily support doing this. My recollection is that it can be done for simple cases, but that this feature in PowerShell deviates from the behavior of POSIX -e in significant ways that developers would find unexpected if it were turned on automatically, relating to the distinction in PowerShell between an error and the result of a test.

Actually, POSIX-compatible shells also have subtleties where the behavior of -e is unexpected a significant fraction fo the time. But my understanding is that making PowerShell approximate the behavior of set -e neither well-approximates the real behavior of set -e nor approximates the effect people tend to assume set -e has.

In practice, I don't usually see problems in GHA steps arising from unexpected mismatch between what set -e does and what one may intuitively think it does. In contrast, the absence of fail-fast behavior in PowerShell is very much unexpected, when PowerShell is used implicitly on CI for script steps that are conceptually "lists of commands to run in any way." Ironically, this seems not to cause problems in big complex PowerShell script steps, where the semantics of PowerShell are front and center.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

2 participants