Flaky tests: discard stale events by samoht · Pull Request #1066 · mirage/irmin

samoht · 2020-08-14T14:47:03Z

Discard stale events

There are two different ways to produce events (which are then consumed by watchers):

every update to the reference store will call notify, which will notify watcher connected to the same repository in that process
there is a background thread, using irmin-watcher (with either polling mode, inotify or fsevents), which check for filesystem changes, and call notify whenever a reference change.

It is necessary that the events are delivered in (commit) order, as the API is exposing diffs and it would be confusing if the events go back in time.

There is a small race right now between these two mechanism: the background thread could read a new reference value, be interrupted, and then call notify. During that interruption, the store can change and events be delivered. If that's the case, we want to discard the current event and try again.

samoht · 2020-08-14T14:54:51Z

A new one 😭

#18 20.45 (cd _build/default/test/irmin-graphql && ./test.exe -q --color=always)
#18 20.45 Testing `irmin-graphql'.
#18 20.45 This run has ID `BFF293B6-3119-461B-8F17-FEB48E924EC8'.
#18 20.45 
#18 20.45   ...           GRAPHQL          0   get_contents-list.
> [FAIL]        GRAPHQL          0   get_contents-list.
#18 20.45   ...           GRAPHQL          1   get_tree-list.
  [FAIL]        GRAPHQL          1   get_tree-list.
#18 20.45 
#18 20.45 ┌──────────────────────────────────────────────────────────────────────────────┐
#18 20.45 │ [FAIL]        GRAPHQL          0   get_contents-list.                        │
#18 20.45 └──────────────────────────────────────────────────────────────────────────────┘
#18 20.45 [exception] Unix.Unix_error(Unix.ECONNREFUSED, "connect", "")
#18 20.45 
#18 20.45 Logs saved to `/src/_build/default/test/irmin-graphql/_build/_tests/irmin-graphql/GRAPHQL.000.output'.
#18 20.45  ──────────────────────────────────────────────────────────────────────────────
#18 20.45 
#18 20.45 Full test results in `/src/_build/default/test/irmin-graphql/_build/_tests/irmin-graphql'.
#18 20.45 2 failures! in 0.000s. 2 tests run.

samoht · 2020-08-14T17:39:41Z

I've pushed a fix for GraphQL ... it's a bit unfortunate that we don't have a reliable way to know when a server is ready to answer to client requests.

samoht · 2020-08-15T11:07:36Z

Unfortunately the failed logs do not show anything and I can't repro the error anywhere...

craigfe · 2020-08-17T08:15:43Z

I've pushed a fix for GraphQL ... it's a bit unfortunate that we don't have a reliable way to know when a server is ready to answer to client requests.

@samoht: I assume from this that Cohttp_lwt_unix.Server.create returns before the server is ready? Is that done asynchronously?

dinosaure · 2020-08-17T08:27:01Z

I've pushed a fix for GraphQL ... it's a bit unfortunate that we don't have a reliable way to know when a server is ready to answer to client requests.

About that, from what I know (and what I did on ocaml-conduit), something can exist (such as s Lwt_condition) to signal that the server was bound. However, the way to do that differs from LWT and ASYNC.

Cohttp_lwt_unix.Server.create

Even if it returns before, Lwt_unix.bind and Lwt_unix.accept can yield with LWT (it's why we need a Lwt_condition.t).

craigfe · 2020-08-17T09:58:45Z

I pushed a commit that adds --no-buffer to the test runes in Opam, but this doesn't work since OCaml-CI appears to not use these any more. Will force-push that commit away now...

samoht · 2020-08-17T10:45:05Z

The verbose output seems to give an interesting trace, where a DELETE event triggers an add callback instead of a remove one:

#18 108.7 +43362387us application     [INFO] ASSERT state: [pre-condition] branch-watch-all remove 0 on=true expected=10/0/0:1/1 current=10/0/0:* (0, 0.000s)
#18 108.7 --------------------------------------------------------------------------------
#18 108.7 +43362418us test            [DEBUG] [waiting for]  branch-watch-all remove 0 on=true expected=10/0/1:1/1 current=10/0/0:*
#18 108.7 +43362443us irmin.http      [DEBUG] DELETE /branch/t0
#18 108.7 S+1170038us irmin.http-srv  [DEBUG] new connection 520
#18 108.7 S+1170098us irmin.git       [DEBUG] mem t0
#18 108.7 S+1170149us irmin.git       [DEBUG] find 591f0113075f4f119936725e309513de4bc91a1b
#18 108.7 S+1170226us irmin.git       [DEBUG] remove t0
#18 108.7 S+1170343us git-unix.fs     [DEBUG] [File.open_r] error: Unix.Unix_error(Unix.ENOENT, &quot;open&quot;, &quot;_build/test-db-git/.git/packed-refs&quot;)
#18 108.7 S+1170628us irmin.http-srv  [INFO] [519] 200 - GET /commit/591f0113075f4f119936725e309513de4bc91a1b
#18 108.7 S+1170698us irmin.watch     [DEBUG] notify-all[4.0:t0]: firing! (591f0113075f4f119936725e309513de4bc91a1b -&gt; &lt;none&gt;)
#18 108.7 S+1170716us irmin.watch     [DEBUG] notify-all[4.1:t0]: firing! (591f0113075f4f119936725e309513de4bc91a1b -&gt; &lt;none&gt;)
#18 108.7 S+1170728us irmin.watch     [DEBUG] notify-all[4.2:t0]: firing! (591f0113075f4f119936725e309513de4bc91a1b -&gt; &lt;none&gt;)
#18 108.7 S+1170759us irmin.http-srv  [INFO] [520] 200 - DELETE /branch/t0
#18 108.7 S+1170879us irmin.http-srv  [DEBUG] connection 519 closed
#18 108.7 +43363485us irmin.watch     [DEBUG] notify-all[39.50:t0]: firing! (591f0113075f4f119936725e309513de4bc91a1b -&gt; &lt;none&gt;)
#18 108.7 +43363561us irmin.http      [DEBUG] GET /commit/591f0113075f4f119936725e309513de4bc91a1b
#18 108.7 ASSERT state:  branch-watch-all remove 0 on=true expected=10/0/1:1/1 current=11/0/0:* (0, 0.000s): { adds=10; updates=0; removes=1 } / { adds=11; updates=0; removes=0 }

samoht · 2020-08-18T08:37:10Z

New similar issue:

#18 82.96 --------------------------------------------------------------------------------
#18 82.96 +40198365us application     [INFO] ASSERT state: [pre-condition] branch-watch-all remove 1 on=true expected=10/0/1:1/1 current=10/0/1:* (0, 0.000s)
#18 82.96 --------------------------------------------------------------------------------
#18 82.96 +40198402us test            [DEBUG] [waiting for]  branch-watch-all remove 1 on=true expected=10/0/2:1/1 current=10/0/1:*
#18 82.96 +40198445us irmin.http      [DEBUG] DELETE /branch/t1
#18 82.96 S+1216046us irmin.http-srv  [DEBUG] new connection 522
#18 82.96 S+1216103us irmin.git       [DEBUG] mem t1
#18 82.96 S+1216162us git-unix.fs     [DEBUG] Dir.contents _build/test-db-git/.git/refs/tags
#18 82.96 S+1216257us irmin.git       [DEBUG] remove t1
#18 82.96 S+1216509us git-unix.fs     [DEBUG] [File.open_r] error: Unix.Unix_error(Unix.ENOENT, &quot;open&quot;, &quot;_build/test-db-git/.git/packed-refs&quot;)
#18 82.96 S+1216616us irmin.http-srv  [INFO] [521] 200 - GET /commit/16a4500b3a30d0cf548004da6f657b45f4aae2be
#18 82.96 +40199632us test            [DEBUG] add { adds=10; updates=0; removes=1 }
#18 82.96 S+1217108us irmin.watch     [DEBUG] notify-all[4.0:t1]: firing! (754be95ad44e65636bffeeb5a9857056b7a18567 -&gt; &lt;none&gt;)
#18 82.96 S+1217135us irmin.watch     [DEBUG] notify-all[4.1:t1]: firing! (754be95ad44e65636bffeeb5a9857056b7a18567 -&gt; &lt;none&gt;)
#18 82.96 S+1217147us irmin.watch     [DEBUG] notify-all[4.2:t1]: firing! (754be95ad44e65636bffeeb5a9857056b7a18567 -&gt; &lt;none&gt;)
#18 82.96 S+1217176us irmin.http-srv  [INFO] [522] 200 - DELETE /branch/t1
#18 82.96 ASSERT state:  branch-watch-all remove 1 on=true expected=10/0/2:1/1 current=11/0/1:* (0, 0.000s): { adds=10; updates=0; removes=2 } / { adds=11; updates=0; removes=1 }
#18 82.96 Server [PID 2078] is killed.

There are two different ways to produce events (which are then consumed by watchers): - every update to the reference store will call notify, which will notify watcher connected to the same repository in that process - there is a background thread, using irmin-watcher (with either polling mode, inotify or fsevents), which check for filesystem changes, and call notify whenever a reference change. It is necessary that the events are delivered in (commit) order, as the API is exposing diffs and it would be confusing if the events go back in time. There is a small race right now between these two mechanism: the background thread could read a new reference value, be interrupted, and then call notify. During that interuption, the store can change and events be delivered. If that's the case, we want to discard the current event and try again.

…lient connections

…e queued

samoht · 2020-08-18T13:53:52Z

f4fdc0f should fix the race ; the watches were not initialised properly and the background event loop were triggering an add event when it detected that master was actually not empty.

samoht · 2020-08-18T13:56:18Z

And now, a new connection race:

#18 165.6 ┌──────────────────────────────────────────────────────────────────────────────┐
#18 165.6 │ [FAIL]        HTTP.FS          21   Concurrent head updates.                 │
#18 165.6 └──────────────────────────────────────────────────────────────────────────────┘
#18 165.6 pwd=/src/_build/default
#18 165.6 Executing: "../../_build/default/test/irmin-unix/test.exe serve 0 938656918 &"
#18 165.6 +56738388us application     [DEBUG] waiting for the server to start...
#18 165.6 +52000us irw-polling     [INFO] Polling mode
#18 165.6 S+55841us application     [DEBUG] pwd: /src/_build/default
#18 165.6 S+55898us application     [DEBUG] Got server: FS, root=_build/test-db
#18 165.6 S+80588us git.unix        [DEBUG] mkdir _build/test-db
#18 165.6 S+83473us application     [DEBUG] write PID 2231 in test-http/irmin-test-FS-938656918.pid
#18 165.6 +56838943us application     [DEBUG] read PID 2231 from test-http/irmin-test-FS-938656918.pid
#18 165.6 +56839071us irmin           [DEBUG] set /a/b/c/0
#18 165.6 +56839133us irmin.http      [DEBUG] GET /branch/master
#18 165.6 +56839399us irmin.http      [DEBUG] request to http://irmin/branch/master failed: Unix.Unix_error(Unix.ENOENT, "connect", "")
#18 165.6 Server [PID 2231] is killed.
#18 165.6 [exception] Unix.Unix_error(Unix.ENOENT, "connect", "")

samoht · 2020-08-18T15:51:06Z

Travis is failing due a transient GH error (while downloading archives). Merging.

samoht added the no-changelog-needed No changelog is needed here label Aug 14, 2020

samoht force-pushed the flaky-tests branch from 2d5aebe to d79d149 Compare August 14, 2020 17:36

samoht force-pushed the flaky-tests branch from d79d149 to a2f87f0 Compare August 14, 2020 18:10

craigfe force-pushed the flaky-tests branch from 3b8c3a6 to a2f87f0 Compare August 17, 2020 09:59

craigfe force-pushed the flaky-tests branch 2 times, most recently from b72ef88 to bdf0af3 Compare August 17, 2020 15:11

samoht force-pushed the flaky-tests branch from 32f5e66 to 9e4c9cc Compare August 17, 2020 17:06

samoht force-pushed the flaky-tests branch 3 times, most recently from 7a794e6 to a4a47ba Compare August 18, 2020 10:44

samoht and others added 10 commits August 18, 2020 15:49

irmin: add more debug information for watches

54c5cbe

irmin-test: delete socket files for HTTP tests

6d73584

irmin-graphql: use a unix-domain socket for GraphQL tests and retry c…

ea26139

…lient connections

irmin-test: limit retries to convergence for monotonic types

d614520

irmin-http: register exception printer for Escape

61bc305

irmin-git: call notify while holding the lock for set/remove

5f45489

irmin-test add more debug messages

bfbba7c

irmin: print event debug messages when they trigger, not when they ar…

cff5cdd

…e queued

irmin-test: init Branch.watch_all correctly

f4fdc0f

samoht force-pushed the flaky-tests branch from 2ff284f to f4fdc0f Compare August 18, 2020 13:49

irmin-http: also wait for the socket file to appear in the tests

489b313

craigfe approved these changes Aug 18, 2020

View reviewed changes

samoht merged commit 3adefbc into mirage:master Aug 18, 2020

samoht deleted the flaky-tests branch August 18, 2020 15:51

dinosaure mentioned this pull request Aug 28, 2020

Conduit 3.0.0 mirage/ocaml-cohttp#692

Merged

3 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Flaky tests: discard stale events#1066

Flaky tests: discard stale events#1066
samoht merged 11 commits intomirage:masterfrom
samoht:flaky-tests

samoht commented Aug 14, 2020 •

edited

Loading

Uh oh!

samoht commented Aug 14, 2020 •

edited

Loading

Uh oh!

samoht commented Aug 14, 2020

Uh oh!

samoht commented Aug 15, 2020

Uh oh!

craigfe commented Aug 17, 2020

Uh oh!

dinosaure commented Aug 17, 2020

Uh oh!

craigfe commented Aug 17, 2020

Uh oh!

samoht commented Aug 17, 2020 •

edited

Loading

Uh oh!

samoht commented Aug 18, 2020 •

edited

Loading

Uh oh!

samoht commented Aug 18, 2020

Uh oh!

samoht commented Aug 18, 2020

Uh oh!

samoht commented Aug 18, 2020

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

samoht commented Aug 14, 2020 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

samoht commented Aug 14, 2020 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

samoht commented Aug 14, 2020

Uh oh!

samoht commented Aug 15, 2020

Uh oh!

craigfe commented Aug 17, 2020

Uh oh!

dinosaure commented Aug 17, 2020

Uh oh!

craigfe commented Aug 17, 2020

Uh oh!

samoht commented Aug 17, 2020 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

samoht commented Aug 18, 2020 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

samoht commented Aug 18, 2020

Uh oh!

samoht commented Aug 18, 2020

Uh oh!

samoht commented Aug 18, 2020

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

samoht commented Aug 14, 2020 •

edited

Loading

samoht commented Aug 14, 2020 •

edited

Loading

samoht commented Aug 17, 2020 •

edited

Loading

samoht commented Aug 18, 2020 •

edited

Loading