Skip to content

feat: http sync service#1187

Merged
hacdias merged 88 commits intomasterfrom
sync-service
Sep 27, 2021
Merged

feat: http sync service#1187
hacdias merged 88 commits intomasterfrom
sync-service

Conversation

@hacdias
Copy link
Copy Markdown
Member

@hacdias hacdias commented Jan 15, 2021

Closes #1186.

Read more on: testground/sync-service#4


This PR introduces a new web service to Testground: the sync service. The sync service will act as the central service to which the tests should connect to get sync features (barriers, publishing, subscribing, entry signaling, event signaling). Previously, the tests were connecting directly to the Redis instance through the SDK. However, that brings us some issues:

  • In case we want to switch from Redis to a different technology (unlikely in the short term), we would need to change the implementation of all SDKs in all languages. refactor: replace Redis by custom implementation sync-service#8
  • The support of Redis on non-backend technologies (such as browsers) is non-existent, which would not allow us to successfully build a JavaScript SDK that successfully works in the browser.
  • By accessing Redis directly, a test may, on purpose, mess up with the purpose of Redis.

This new service is built to battle those issues, by:

  • Providing an HTTP server to which each instance should connect. From that point on, a WebSocket connection should be created. This WebSocket will live as long as the test instance is alive. All connections to the sync service shall be done through this single WebSocket to avoid useless connections.
  • The implementation of the sync service is abstracted by the sync.Service interface, which makes it easy to swap to a different backend in the future.

Regarding testing, I simply moved the tests from sdk-go that now belong here, and adapted them. Do you think we should do more unit tests or are the current integration tests enough?


To do

Before merging

  • Update sdk-go version to definite one on tests and tg
  • Update infra repository to use the new sync service Updating TaaS is deferred.

- signal event
- signal entry
= publish

License: MIT
Signed-off-by: Henrique Dias <hacdias@gmail.com>
License: MIT
Signed-off-by: Henrique Dias <hacdias@gmail.com>
License: MIT
Signed-off-by: Henrique Dias <hacdias@gmail.com>
License: MIT
Signed-off-by: Henrique Dias <hacdias@gmail.com>
@hacdias
Copy link
Copy Markdown
Member Author

hacdias commented Jan 29, 2021

@nonsense I implemented the WebSockets as we discussed. Tomorrow, I will start updating the client in order to support the new sync server. Plus, I still need to add tests here.

I wanted to ask you to take a quick look to validate the architecture. A small recap of what I did so far was:

  • sync.Service is an interface that should implement the actual sync actions using a certain backend. The current implementation, DefaultService uses Redis and it is mostly based on the original sync package from the Go SDK.
  • We have an HTTP server that accepts connections, which must upgrade to WebSockets. For each connection, we have a connection that is responsible for consuming the new requests and consuming (sending) the responses that we generate. When we receive a new request, we spin a new goroutine to handle that request. Usually, those are short lived goroutines except for Subscribe and Barrier.

One of my concerns was spinning up goroutines for each request but, as I mentioned, most of them are short-lived. And for the long-lived ones, it's unlikely that a test is calling Barriers and Subscribes at the same time.

License: MIT
Signed-off-by: Henrique Dias <hacdias@gmail.com>
License: MIT
Signed-off-by: Henrique Dias <hacdias@gmail.com>
License: MIT
Signed-off-by: Henrique Dias <hacdias@gmail.com>
@hacdias
Copy link
Copy Markdown
Member Author

hacdias commented Feb 12, 2021

I now noticed testground/pkg/runners and testground/pkg/sidecar both depend on sdk-go for the sync client. The usage of the watch client is making my life complicated as I'm not sure what's the use for it (or even the need).

I plan on moving the GC collection to the sync service.

@nonsense do you have any recommendation on what to do with the Watch Client? What was it used for? is it useful? Should the sidecar connect directly to Redis? Should I create some interface through the sync service for the watching thing?

@nonsense
Copy link
Copy Markdown
Member

The WatchClient is used by the testground daemon to fetch information about the run of individual testplan instances for a given run, and display that information on TaaS on the dashboard, i.e. clients 998/1000 ; miners 10/10 in terms of failed/ok testplan instances.

The testground daemon would have to connect to the new sync service to fetch that information, when we introduce the sync service, that would replace the sync library and the direct connection to Redis.

- fixes
- tested with some plans and it seems to work correctly
- NOTE: there is note on the logs saying websocket closed unexpectedly
    - IS  related to the client used by the sidecar.

License: MIT
Signed-off-by: Henrique Dias <hacdias@gmail.com>
License: MIT
Signed-off-by: Henrique Dias <hacdias@gmail.com>
NOTE:
- Im probably printing something inside doRun that is causing this error
    write error: write /Users/henriquedias/testground/data/daemon/c0jsphucie6mgsgepg6g.out: file already closed
- TODO: remove code from signalEvent from testrond/pkg/sync as .Subscribe/.Publish can do the same

License: MIT
Signed-off-by: Henrique Dias <hacdias@gmail.com>
License: MIT
Signed-off-by: Henrique Dias <hacdias@gmail.com>
License: MIT
Signed-off-by: Henrique Dias <hacdias@gmail.com>
License: MIT
Signed-off-by: Henrique Dias <hacdias@gmail.com>
@hacdias
Copy link
Copy Markdown
Member Author

hacdias commented Feb 16, 2021

Current issues

### Issue 1

Every time a test fails, I get something like this on the console.

Feb 16 15:31:37.247686	INFO	1.3661s INCOMPLETE << sidecar      >>
2021-02-16 16:31:37.24772 +0100 CET m=+263.132244532 write error: write /Users/henriquedias/testground/data/daemon/XXXXXXXXXXXX.out: file already closed

So I guess I'm still somehow writing to that file while I shouldn't (see logs).

Apparently this also happens on my computer on the master branch so it may be something to do with my local configuration.

### Issue 2 (splitbrain)

Splitbrain hangs and then fails with deadline exceeded. It appears that pubsub might be only sending data to one sub.

PubSub system is made for only one sub. Make it support more subs. Take into account that multiple subs might've started in different points in time, hence lastid might differ.

Rework barrier logic: currently we are asking for same barrier the number of instances waiting on that barrier.

License: MIT
Signed-off-by: Henrique Dias <hacdias@gmail.com>
License: MIT
Signed-off-by: Henrique Dias <hacdias@gmail.com>
License: MIT
Signed-off-by: Henrique Dias <hacdias@gmail.com>
License: MIT
Signed-off-by: Henrique Dias <hacdias@gmail.com>
@hacdias hacdias requested a review from nonsense April 15, 2021 14:26
hacdias and others added 3 commits April 17, 2021 08:32
License: MIT
Signed-off-by: Henrique Dias <hacdias@gmail.com>
License: MIT
Signed-off-by: Henrique Dias <hacdias@gmail.com>
@hacdias hacdias requested a review from nonsense April 21, 2021 13:38
@hacdias hacdias force-pushed the sync-service branch 2 times, most recently from 19a349a to f01af2b Compare May 15, 2021 08:25
./integration_tests/10_docker_splitbrain_reject.sh
./integration_tests/11_docker_splitbrain_drop.sh
./integration_tests/12_docker_example-js_pingpong.sh
# ./integration_tests/12_docker_example-js_pingpong.sh
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

SDK JS is a work in progress. We can re-add the tests afterwards.

hacdias added 2 commits July 2, 2021 11:16
License: MIT
Signed-off-by: Henrique Dias <hacdias@gmail.com>
License: MIT
Signed-off-by: Henrique Dias <hacdias@gmail.com>
@hacdias hacdias merged commit b0d9aa8 into master Sep 27, 2021
@hacdias hacdias deleted the sync-service branch September 27, 2021 16:26
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Sync service refactor sdk-go FetchAllEvents and SubscribeEvents

2 participants