Data queues, prefetching and multi-source #1775

cypof · 2015-01-22T02:56:34Z

I split the work on data_layer from #1148. It was written initially to get enough bandwidth to feed multiple GPUs and fix performance issues with the thread creation/destruction on each batch. Over time a few other things got in. In particular we are experimenting at Flickr with different ratios of classes by reading from multiple sources. E.g. each dataset can be setup to contain one class, and the probability of each source defines the class ratios at runtime.

In terms of performance, the current code could be fast enough, but it's hard to evaluate. If many solvers open the same DB and read, only the first one will actually load data, the other ones read from the cache. For parallel training, each solver needs to see a different batch, so either we split the dataset in several DBs, or use large initial offsets in the same DB and hope they won't catch up with each other. If the offset if large, data might not be in cache anymore when the next solver reaches the same location, requiring the disk to seek back and forth. Seeking kills mechanical disks performance. Using SSD helps but now the dataset might not fit and you need multiple sources. This PR tries to answer these different problems.

Features:

Multiple solvers read from a single queue. This makes sure they see different examples, and the source is accessed sequentially.
Reading from multiple sources, in case one network location or disk is not fast enough to feed all solvers, or contain the whole dataset. Each source can read from a shard, or a copy of the same dataset with a random offset.
Probabilities on sources, e.g. to change the ratio of positive/negative when doing binary classification. It is also useful to balance reads between sharded sources. If one is faster than another, some examples might used more often than others, which would change SGD behavior. Setting probabilities on sources, inverse to their size, ensures a balanced coverage.
One loading thread per database, even if multiple solvers are running. Needed for single threaded DBs like LevelDB, and to ensure sequential access. In almost all cases one thread is enough for loading speed as it doesn't do anything else. There is still a transform thread for each solver like today.
No thread creation/deletion per batch. It's inefficient and it causes problems with components that rely on thread-local caching. We also had problem with memory pinning and virtual memory. C.f. @thatguymike
Prefetch asynchronously to each GPU on a separate CUDA stream, so that the batch is already on the GPU when the solver needs it.
Prefetch a configurable number of batches in host memory to erase bandwidth glitches, in particular if data is loaded from a network it might make sense to configure a large prefetch queue.

cypof · 2015-02-02T22:11:09Z

I merged the move from datasets to db.

thatguymike · 2015-02-20T03:39:39Z

I'm hitting an interesting issue with the updated prefetch and the way the queues work. The CPU is absolutely slammed now. Looks like the queue implementation in boost, or at least how we are using it here, is busy waiting, but we need to dig into things more. There is a comment about the spsc queues not being in the boost version in ubuntu 12.04. Seems we can either detect the version of boost and choose the right approach, or or move to our own queuing system (there are several options and implementations).

However, I don't think moving to spsc queues will solve the busy wait problem. Thinking through the design here and how to implement a backoff strategy. (Really hit this looking at the multiGPU code where we are starving the driver when you have several really fast GPUs). Thinking perhaps we should be monitoring what the consume vs produce ratio is to implement backoff, which would also be useful to tell you if your IO system is fast enough. Parts of that performance monitor are actually in this PR already.

cypof · 2015-02-20T04:53:48Z

I don't see this on the code related to this PR, only on the P2P code. Are you using the branch with both? I actually just finished another P2P proto without busy waiting: https://github.com/cypof/caffe/tree/p2p. It uses blocking_queues and callbacks instead of looping over events, and can create groups of more than 2 gpus. Bandwidth is not that good, and when I try to increase it by changing P2PSync::QUEUE, it slows down SGD. GPUs are not fully used so there must be some congestion somewhere. I also still have a shutdown bug, but fixed a lot of them since last version.

thatguymike · 2015-02-20T05:07:30Z

I hit this without P2P just running single GPU on your branch and also in this one, but I have been swapping branches left and right tracing this down CPU load so it is possible I have a funky branch setup now. I traced at least one source of the CPU hammering to the prefetch update code which also appers in this PR and seems to originate here. I will retest with just this path in the morning with a clean checkout, but it has the same changes to pop/push the data prefetch queue which uses the same blocking queue setup and looping on queue pop. However, there are difference in the queue used here and the P2P branch.

There is a LOT of CPU activity we think is causing your SGD slowdown and limiting GPU scaling. The driver is getting partially starved issuing command to the GPU and there is a lot of transfer traffic. Hard to get clean traces with the segfaults at exit, glad to here you are making progress there.

The GPU load is why I started to backtrack to attempt to find all the sources of things occupying the GPU. (I quickly see things like this running gkrellm on ubuntu to track general system load, page fault rates, etc., as well as GPU temps and the like).

I'll revalidate my braches and try again.

thatguymike · 2015-02-20T18:51:13Z

Cyprien, you are correct. It's not this PR, it is an interaction with the P2P branch. The issue gets triggered when running single GPU, so I'll concentrate on working through that on your P2P branch.

I still suggest we build in feedback on if the IO system is fast enough to keep up with the solvers and spit out effective BW's we are getting from the different threads and data sources. That is going to be critical information when training multiple high performance GPUs.

cypof · 2015-02-20T23:30:13Z

Updated to latest dev. I first pushed data_layers without changes relative to dev, and github closed the PR. Not sure why I can't reopen it now that my branch is up to date...

Also I'm not sure what to do about the random numbers. The test needs to set a seed, but you don't want all components in Caffe to use the same seed, so I'm storing it in Caffe and using a fixed offset for the data_layer component. We should unify this to get each component a fixed offset, but that should probably be a separate PR?

shelhamer · 2015-02-21T02:25:59Z

@cypof please open a new PR against master. We decided to do away with dev to prevent confusion and overhead so now all development branches from master.

Of course the latest master is nearly dev since we just did a release, so if you rebased to dev the rebase to master should be easy. Let me know if you have any questions about the switch.

cypof mentioned this pull request Jan 22, 2015

Data queues, prefetching and multi-source #1773

Closed

shelhamer added the ready for review label Jan 22, 2015

cypof force-pushed the data_queues branch 2 times, most recently from 5c66513 to 778e645 Compare February 2, 2015 21:58

cypof force-pushed the data_queues branch 3 times, most recently from 166c7a5 to fe23447 Compare February 18, 2015 00:29

cypof closed this Feb 20, 2015

cypof force-pushed the data_queues branch from fe23447 to 079e416 Compare February 20, 2015 22:55

cypof mentioned this pull request Feb 21, 2015

Data queues, prefetching and multi-source #1933

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Data queues, prefetching and multi-source #1775

Data queues, prefetching and multi-source #1775

Uh oh!

cypof commented Jan 22, 2015

Uh oh!

cypof commented Feb 2, 2015

Uh oh!

thatguymike commented Feb 20, 2015

Uh oh!

cypof commented Feb 20, 2015

Uh oh!

thatguymike commented Feb 20, 2015

Uh oh!

thatguymike commented Feb 20, 2015

Uh oh!

cypof commented Feb 20, 2015

Uh oh!

shelhamer commented Feb 21, 2015

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Data queues, prefetching and multi-source #1775

Data queues, prefetching and multi-source #1775

Uh oh!

Conversation

cypof commented Jan 22, 2015

Uh oh!

cypof commented Feb 2, 2015

Uh oh!

thatguymike commented Feb 20, 2015

Uh oh!

cypof commented Feb 20, 2015

Uh oh!

thatguymike commented Feb 20, 2015

Uh oh!

thatguymike commented Feb 20, 2015

Uh oh!

cypof commented Feb 20, 2015

Uh oh!

shelhamer commented Feb 21, 2015

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants