Skip to content

Flaky tests: discard stale events#1066

Merged
samoht merged 11 commits intomirage:masterfrom
samoht:flaky-tests
Aug 18, 2020
Merged

Flaky tests: discard stale events#1066
samoht merged 11 commits intomirage:masterfrom
samoht:flaky-tests

Conversation

@samoht
Copy link
Copy Markdown
Member

@samoht samoht commented Aug 14, 2020

Discard stale events

There are two different ways to produce events (which are then consumed by watchers):

  • every update to the reference store will call notify, which will notify watcher connected to the same repository in that process
  • there is a background thread, using irmin-watcher (with either polling mode, inotify or fsevents), which check for filesystem changes, and call notify whenever a reference change.

It is necessary that the events are delivered in (commit) order, as the API is exposing diffs and it would be confusing if the events go back in time.

There is a small race right now between these two mechanism: the background thread could read a new reference value, be interrupted, and then call notify. During that interruption, the store can change and events be delivered. If that's the case, we want to discard the current event and try again.

@samoht samoht added the no-changelog-needed No changelog is needed here label Aug 14, 2020
@samoht
Copy link
Copy Markdown
Member Author

samoht commented Aug 14, 2020

A new one 😭

#18 20.45 (cd _build/default/test/irmin-graphql && ./test.exe -q --color=always)
#18 20.45 Testing `irmin-graphql'.
#18 20.45 This run has ID `BFF293B6-3119-461B-8F17-FEB48E924EC8'.
#18 20.45 
#18 20.45   ...           GRAPHQL          0   get_contents-list.
> [FAIL]        GRAPHQL          0   get_contents-list.
#18 20.45   ...           GRAPHQL          1   get_tree-list.
  [FAIL]        GRAPHQL          1   get_tree-list.
#18 20.45 
#18 20.45 ┌──────────────────────────────────────────────────────────────────────────────┐
#18 20.45 │ [FAIL]        GRAPHQL          0   get_contents-list.                        │
#18 20.45 └──────────────────────────────────────────────────────────────────────────────┘
#18 20.45 [exception] Unix.Unix_error(Unix.ECONNREFUSED, "connect", "")
#18 20.45 
#18 20.45 Logs saved to `/src/_build/default/test/irmin-graphql/_build/_tests/irmin-graphql/GRAPHQL.000.output'.
#18 20.45  ──────────────────────────────────────────────────────────────────────────────
#18 20.45 
#18 20.45 Full test results in `/src/_build/default/test/irmin-graphql/_build/_tests/irmin-graphql'.
#18 20.45 2 failures! in 0.000s. 2 tests run.

@samoht
Copy link
Copy Markdown
Member Author

samoht commented Aug 14, 2020

I've pushed a fix for GraphQL ... it's a bit unfortunate that we don't have a reliable way to know when a server is ready to answer to client requests.

@samoht
Copy link
Copy Markdown
Member Author

samoht commented Aug 15, 2020

Unfortunately the failed logs do not show anything and I can't repro the error anywhere...

@craigfe
Copy link
Copy Markdown
Member

craigfe commented Aug 17, 2020

I've pushed a fix for GraphQL ... it's a bit unfortunate that we don't have a reliable way to know when a server is ready to answer to client requests.

@samoht: I assume from this that Cohttp_lwt_unix.Server.create returns before the server is ready? Is that done asynchronously?

@dinosaure
Copy link
Copy Markdown
Member

I've pushed a fix for GraphQL ... it's a bit unfortunate that we don't have a reliable way to know when a server is ready to answer to client requests.

About that, from what I know (and what I did on ocaml-conduit), something can exist (such as s Lwt_condition) to signal that the server was bound. However, the way to do that differs from LWT and ASYNC.

Cohttp_lwt_unix.Server.create

Even if it returns before, Lwt_unix.bind and Lwt_unix.accept can yield with LWT (it's why we need a Lwt_condition.t).

@craigfe
Copy link
Copy Markdown
Member

craigfe commented Aug 17, 2020

I pushed a commit that adds --no-buffer to the test runes in Opam, but this doesn't work since OCaml-CI appears to not use these any more. Will force-push that commit away now...

@samoht
Copy link
Copy Markdown
Member Author

samoht commented Aug 17, 2020

The verbose output seems to give an interesting trace, where a DELETE event triggers an add callback instead of a remove one:

#18 108.7 +43362387us application     [INFO] ASSERT state: [pre-condition] branch-watch-all remove 0 on=true expected=10/0/0:1/1 current=10/0/0:* (0, 0.000s)
#18 108.7 --------------------------------------------------------------------------------
#18 108.7 +43362418us test            [DEBUG] [waiting for]  branch-watch-all remove 0 on=true expected=10/0/1:1/1 current=10/0/0:*
#18 108.7 +43362443us irmin.http      [DEBUG] DELETE /branch/t0
#18 108.7 S+1170038us irmin.http-srv  [DEBUG] new connection 520
#18 108.7 S+1170098us irmin.git       [DEBUG] mem t0
#18 108.7 S+1170149us irmin.git       [DEBUG] find 591f0113075f4f119936725e309513de4bc91a1b
#18 108.7 S+1170226us irmin.git       [DEBUG] remove t0
#18 108.7 S+1170343us git-unix.fs     [DEBUG] [File.open_r] error: Unix.Unix_error(Unix.ENOENT, "open", "_build/test-db-git/.git/packed-refs")
#18 108.7 S+1170628us irmin.http-srv  [INFO] [519] 200 - GET /commit/591f0113075f4f119936725e309513de4bc91a1b
#18 108.7 S+1170698us irmin.watch     [DEBUG] notify-all[4.0:t0]: firing! (591f0113075f4f119936725e309513de4bc91a1b -> <none>)
#18 108.7 S+1170716us irmin.watch     [DEBUG] notify-all[4.1:t0]: firing! (591f0113075f4f119936725e309513de4bc91a1b -> <none>)
#18 108.7 S+1170728us irmin.watch     [DEBUG] notify-all[4.2:t0]: firing! (591f0113075f4f119936725e309513de4bc91a1b -> <none>)
#18 108.7 S+1170759us irmin.http-srv  [INFO] [520] 200 - DELETE /branch/t0
#18 108.7 S+1170879us irmin.http-srv  [DEBUG] connection 519 closed
#18 108.7 +43363485us irmin.watch     [DEBUG] notify-all[39.50:t0]: firing! (591f0113075f4f119936725e309513de4bc91a1b -> <none>)
#18 108.7 +43363561us irmin.http      [DEBUG] GET /commit/591f0113075f4f119936725e309513de4bc91a1b
#18 108.7 ASSERT state:  branch-watch-all remove 0 on=true expected=10/0/1:1/1 current=11/0/0:* (0, 0.000s): { adds=10; updates=0; removes=1 } / { adds=11; updates=0; removes=0 }

@craigfe craigfe force-pushed the flaky-tests branch 2 times, most recently from b72ef88 to bdf0af3 Compare August 17, 2020 15:11
@samoht
Copy link
Copy Markdown
Member Author

samoht commented Aug 18, 2020

New similar issue:

#18 82.96 --------------------------------------------------------------------------------
#18 82.96 +40198365us application     [INFO] ASSERT state: [pre-condition] branch-watch-all remove 1 on=true expected=10/0/1:1/1 current=10/0/1:* (0, 0.000s)
#18 82.96 --------------------------------------------------------------------------------
#18 82.96 +40198402us test            [DEBUG] [waiting for]  branch-watch-all remove 1 on=true expected=10/0/2:1/1 current=10/0/1:*
#18 82.96 +40198445us irmin.http      [DEBUG] DELETE /branch/t1
#18 82.96 S+1216046us irmin.http-srv  [DEBUG] new connection 522
#18 82.96 S+1216103us irmin.git       [DEBUG] mem t1
#18 82.96 S+1216162us git-unix.fs     [DEBUG] Dir.contents _build/test-db-git/.git/refs/tags
#18 82.96 S+1216257us irmin.git       [DEBUG] remove t1
#18 82.96 S+1216509us git-unix.fs     [DEBUG] [File.open_r] error: Unix.Unix_error(Unix.ENOENT, "open", "_build/test-db-git/.git/packed-refs")
#18 82.96 S+1216616us irmin.http-srv  [INFO] [521] 200 - GET /commit/16a4500b3a30d0cf548004da6f657b45f4aae2be
#18 82.96 +40199632us test            [DEBUG] add { adds=10; updates=0; removes=1 }
#18 82.96 S+1217108us irmin.watch     [DEBUG] notify-all[4.0:t1]: firing! (754be95ad44e65636bffeeb5a9857056b7a18567 -> <none>)
#18 82.96 S+1217135us irmin.watch     [DEBUG] notify-all[4.1:t1]: firing! (754be95ad44e65636bffeeb5a9857056b7a18567 -> <none>)
#18 82.96 S+1217147us irmin.watch     [DEBUG] notify-all[4.2:t1]: firing! (754be95ad44e65636bffeeb5a9857056b7a18567 -> <none>)
#18 82.96 S+1217176us irmin.http-srv  [INFO] [522] 200 - DELETE /branch/t1
#18 82.96 ASSERT state:  branch-watch-all remove 1 on=true expected=10/0/2:1/1 current=11/0/1:* (0, 0.000s): { adds=10; updates=0; removes=2 } / { adds=11; updates=0; removes=1 }
#18 82.96 Server [PID 2078] is killed.

@samoht samoht force-pushed the flaky-tests branch 3 times, most recently from 7a794e6 to a4a47ba Compare August 18, 2020 10:44
samoht and others added 10 commits August 18, 2020 15:49
There are two different ways to produce events (which are then consumed by watchers):
- every update to the reference store will call notify, which will notify watcher
  connected to the same repository in that process
- there is a background thread, using irmin-watcher (with either polling mode,
  inotify or fsevents), which check for filesystem changes, and call notify
  whenever a reference change.

It is necessary that the events are delivered in (commit) order, as the API
is exposing diffs and it would be confusing if the events go back in time.

There is a small race right now between these two mechanism: the background
thread could read a new reference value, be interrupted, and then call notify.
During that interuption, the store can change and events be delivered. If
that's the case, we want to discard the current event and try again.
@samoht
Copy link
Copy Markdown
Member Author

samoht commented Aug 18, 2020

f4fdc0f should fix the race ; the watches were not initialised properly and the background event loop were triggering an add event when it detected that master was actually not empty.

@samoht
Copy link
Copy Markdown
Member Author

samoht commented Aug 18, 2020

And now, a new connection race:

#18 165.6 ┌──────────────────────────────────────────────────────────────────────────────┐
#18 165.6 │ [FAIL]        HTTP.FS          21   Concurrent head updates.                 │
#18 165.6 └──────────────────────────────────────────────────────────────────────────────┘
#18 165.6 pwd=/src/_build/default
#18 165.6 Executing: "../../_build/default/test/irmin-unix/test.exe serve 0 938656918 &"
#18 165.6 +56738388us application     [DEBUG] waiting for the server to start...
#18 165.6 +52000us irw-polling     [INFO] Polling mode
#18 165.6 S+55841us application     [DEBUG] pwd: /src/_build/default
#18 165.6 S+55898us application     [DEBUG] Got server: FS, root=_build/test-db
#18 165.6 S+80588us git.unix        [DEBUG] mkdir _build/test-db
#18 165.6 S+83473us application     [DEBUG] write PID 2231 in test-http/irmin-test-FS-938656918.pid
#18 165.6 +56838943us application     [DEBUG] read PID 2231 from test-http/irmin-test-FS-938656918.pid
#18 165.6 +56839071us irmin           [DEBUG] set /a/b/c/0
#18 165.6 +56839133us irmin.http      [DEBUG] GET /branch/master
#18 165.6 +56839399us irmin.http      [DEBUG] request to http://irmin/branch/master failed: Unix.Unix_error(Unix.ENOENT, "connect", "")
#18 165.6 Server [PID 2231] is killed.
#18 165.6 [exception] Unix.Unix_error(Unix.ENOENT, "connect", "")

@samoht
Copy link
Copy Markdown
Member Author

samoht commented Aug 18, 2020

Travis is failing due a transient GH error (while downloading archives). Merging.

@samoht samoht merged commit 3adefbc into mirage:master Aug 18, 2020
@samoht samoht deleted the flaky-tests branch August 18, 2020 15:51
@dinosaure dinosaure mentioned this pull request Aug 28, 2020
3 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

no-changelog-needed No changelog is needed here

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants