
[WIP] Cleanup routine for old tasks.db tasks#2917

Closed
olljanat wants to merge 1 commit into moby:master from olljanat:2367-remove-tasks-from-db

Conversation

@olljanat (Contributor) commented Dec 15, 2019

relates to

- What I did
Added a cleanup routine for old tasks.db tasks to prevent it from growing on environments where the daemon is rarely restarted.

Without this change, those tasks are removed only during worker init:
https://github.com/docker/swarmkit/blob/42085d2f8e43a3ed90ed289d3f3ed3de57837100/agent/worker.go#L95-L103

closes #2367

- How I did it
The cleanup routine runs in its own goroutine; every 5 minutes it removes tasks in completed/failed/shutdown state that are more than 5 minutes old.

- How to test it

  • Add some constantly completing/failing service(s).
  • Let them run for a couple of minutes.
  • Remove those services.
  • Use e.g. https://github.com/nisboo/BoltGUI to read tasks.db and see that all tasks still exist in the database.
  • Wait 5 minutes.
  • When removal happens, it writes lines like these to the debug log:
time="2019-12-15T20:12:39+02:00" level=debug msg="Removing task ID: 1p9g7tjdfo4myfc58r1121fs8 State: COMPLETE LastUpdate: 2019-12-15T18:06:24.3410758Z" module=node/agent/worker node.id=x2cjt3iizyahokwt5n7maey11
time="2019-12-15T20:12:39+02:00" level=debug msg="Removing task ID: 1pra4b3nandod9lqlfy7qjr58 State: FAILED LastUpdate: 2019-12-15T18:07:01.7059872Z" module=node/agent/worker node.id=x2cjt3iizyahokwt5n7maey11
  • Use BoltGUI again to verify that no tasks are left in the DB anymore.

@olljanat force-pushed the 2367-remove-tasks-from-db branch 2 times, most recently from 226139b to 90977a2 on December 15, 2019 20:24
Avoid it growing on environments
where daemon is restarted rarely

Signed-off-by: Olli Janatuinen <olli.janatuinen@gmail.com>
@olljanat force-pushed the 2367-remove-tasks-from-db branch from 90977a2 to 64b5650 on December 15, 2019 20:51
@olljanat changed the title from "Cleanup routine for old tasks.db tasks" to "[WIP] Cleanup routine for old tasks.db tasks" on Dec 17, 2019
@dperny (Collaborator) commented Dec 18, 2019

Honestly, I have no idea what the purpose of the task database is, so I'm a bit nervous to merge code related to it. I'll spend some time looking into it and give you an answer, though. I know that this is a rather disruptive problem for most users.

@olljanat (Contributor, Author)

As far as I understand, it is a cache database containing information about started tasks, which might be needed if there is a network issue between the worker and manager. Maybe it also contains some info about why tasks failed (what is shown with docker service ps). That is why I think some kind of history of those is needed for troubleshooting purposes.

What I can see is that if I remove all services from the swarm, tasks stay in tasks.db, but only until swarmkit is restarted.

@SvenDowideit

As per moby/moby#34827, I had a single swarm node go splat and stay down with a 3.4GB tasks.db. Deleting it and starting Docker again seems to get us back to a functional swarm, with all the old services, volumes, secrets, etc. appearing to work.

now the tasks.db is 25M

@dperny if you do figure out what it's used for, would you be willing to write it down somewhere?

@thaJeztah (Member)

ping @stevvooe PTAL

@olljanat (Contributor, Author)

@dperny Do I understand correctly that with the new Jobs, tasks.db will grow even faster, so some kind of cleanup routine would be useful?

@cpuguy83 (Member) commented Mar 4, 2020

What cleans these up on start? Does this have any particular effect on leader nodes, e.g. access to task history?

It seems like this cleanup should be happening at the specified task history threshold rather than a randomly picked time interval.

@SvenDowideit commented Mar 4, 2020

@cpuguy83 I agree that, ideally, someone who knows what this database is supposed to do should work out how to make it do whatever was intended.

But given how long this has been a big issue, having a fallback "oh crap, it's all gone to hell, delete on a timer" is 100% better than punting this to some future when perfect happens.

@cpuguy83 (Member) commented Mar 4, 2020

I'm not suggesting that this be punted down the line.
This seems like a critical bug that probably should never have shipped.

@SvenDowideit

@cpuguy83 @thaJeztah @dperny so what do we need to do to get this merged before 20.03 is branched? Considering the date, the clock is ticking :(

@xals commented Mar 20, 2020

We have quite a lot of docker swarm nodes and we faced the huge tasks.db problem this morning.

We think it is also a performance issue, as the full database seems to be loaded into memory (I have a dockerd process eating 13Gb…), and updates may cause CPU usage too.

We consider this problem as critical and this patch would be very welcome in the next release.

@dperny mentioned this pull request Mar 23, 2020
@dperny (Collaborator) commented Mar 23, 2020

PTAL #2938, which is a bit cleaner of a fix.

@olljanat closed this Mar 23, 2020


Development

Successfully merging this pull request may close these issues.

Swarm's tasks.db takes up lots of disk space

6 participants