
[WIP] Cleanup routine for old tasks.db tasks#2917

Closed
olljanat wants to merge 1 commit into moby:master from olljanat:2367-remove-tasks-from-db

Conversation

@olljanat (Contributor) commented Dec 15, 2019

relates to

- What I did
Added a cleanup routine for old tasks.db tasks to prevent it from growing on environments where the daemon is rarely restarted.

Without this change, those tasks are removed only during worker init:
https://github.com/docker/swarmkit/blob/42085d2f8e43a3ed90ed289d3f3ed3de57837100/agent/worker.go#L95-L103

closes #2367

- How I did it
The cleanup routine runs in its own goroutine; every 5 minutes it removes tasks in completed/failed/shutdown state that are more than 5 minutes old.

- How to test it

  • Add some constantly completing/failing service(s).
  • Let them run for a couple of minutes.
  • Remove those services.
  • Use e.g. https://github.com/nisboo/BoltGUI to read tasks.db and see that all tasks still exist in the database.
  • Wait 5 minutes.
  • When removal happens, it writes lines like these to the debug log:
time="2019-12-15T20:12:39+02:00" level=debug msg="Removing task ID: 1p9g7tjdfo4myfc58r1121fs8 State: COMPLETE LastUpdate: 2019-12-15T18:06:24.3410758Z" module=node/agent/worker node.id=x2cjt3iizyahokwt5n7maey11
time="2019-12-15T20:12:39+02:00" level=debug msg="Removing task ID: 1pra4b3nandod9lqlfy7qjr58 State: FAILED LastUpdate: 2019-12-15T18:07:01.7059872Z" module=node/agent/worker node.id=x2cjt3iizyahokwt5n7maey11
  • Use BoltGUI again to verify that no tasks are left in the DB anymore.

@olljanat force-pushed the 2367-remove-tasks-from-db branch 2 times, most recently from 226139b to 90977a2 on December 15, 2019 20:24
Avoid it growing on environments
where daemon is restarted rarely

Signed-off-by: Olli Janatuinen <olli.janatuinen@gmail.com>
@olljanat force-pushed the 2367-remove-tasks-from-db branch from 90977a2 to 64b5650 on December 15, 2019 20:51
@olljanat changed the title from "Cleanup routine for old tasks.db tasks" to "[WIP] Cleanup routine for old tasks.db tasks" on Dec 17, 2019
@dperny (Collaborator) commented Dec 18, 2019

Honestly, I have no idea what the purpose of the task database is, so I'm a bit nervous to merge code related to it. I'll spend some time looking into it and give you an answer, though. I know that this is a rather disruptive problem for most users.

@olljanat (Contributor, Author)

As far as I understand, it is a cache database containing information about started tasks, which might be needed if there is a network issue between the worker and manager. Maybe it also contains some info about why tasks failed (what is shown with docker service ps). That is why I think some kind of history of those is needed for troubleshooting purposes.

What I can see is that if I remove all services from the swarm, tasks stay in tasks.db, but only until swarmkit is restarted.

@SvenDowideit

As per moby/moby#34827, I had a single swarm node go splat and stay down with a 3.4GB tasks.db. Deleting it and starting Docker again seems to get us back to a functional swarm, with all the old services, volumes, secrets, etc. appearing to work.

now the tasks.db is 25M

@dperny if you do figure out what it's used for, would you be willing to write it down somewhere?

@thaJeztah (Member)

ping @stevvooe PTAL

@olljanat (Contributor, Author)

@dperny Do I understand correctly that with the new Jobs, tasks.db will grow even faster, so some kind of cleanup routine would be useful?

@cpuguy83 (Member) commented Mar 4, 2020

What cleans these up on start? Does this have any particular effect on leader nodes, e.g. access to task history?

It seems like this cleanup should be happening at the specified task history threshold rather than a randomly picked time interval.

@SvenDowideit commented Mar 4, 2020

@cpuguy83 I agree that, ideally, someone who knows what this database is supposed to do should work out how to make it do whatever was intended.

But given how long this has been a big issue, having a fallback "oh crap, it's all gone to hell, delete on a timer" is 100% better than punting this to some future when perfect happens.

@cpuguy83 (Member) commented Mar 4, 2020

I'm not suggesting that this be punted down the line.
This seems like a critical bug that probably should never have shipped.

@SvenDowideit

@cpuguy83 @thaJeztah @dperny so what do we need to do to get this merged before 20.03 is branched? Considering the date, the clock is ticking :(

@xals commented Mar 20, 2020

We have quite a lot of docker swarm nodes and we faced the huge tasks.db problem this morning.

We think it is also a performance issue, as the full database seems to be loaded into memory (I have a dockerd process eating 13Gb…), and updates may cause CPU usage too.

We consider this problem as critical and this patch would be very welcome in the next release.

@dperny mentioned this pull request Mar 23, 2020
@dperny (Collaborator) commented Mar 23, 2020

PTAL #2938, which is a bit cleaner of a fix.

@olljanat closed this Mar 23, 2020


Development

Successfully merging this pull request may close these issues.

Swarm's tasks.db takes up lots of disk space

6 participants