Data queues, prefetching and multi-source #1775
Conversation
Force-pushed from 5c66513 to 778e645
I merged the move from datasets to db.
Force-pushed from 166c7a5 to fe23447
I'm hitting an interesting issue with the updated prefetch and the way the queues work: the CPU is absolutely slammed now. It looks like the queue implementation in boost, or at least how we are using it here, is busy-waiting, but we need to dig into this more. There is a comment about the spsc queues not being available in the boost version shipped with Ubuntu 12.04. We can either detect the boost version and choose the right approach, or move to our own queuing system (there are several options and implementations). However, I don't think moving to spsc queues will solve the busy-wait problem. I'm thinking through the design here and how to implement a backoff strategy. (I really hit this looking at the multi-GPU code, where we starve the driver when you have several very fast GPUs.) Perhaps we should monitor the consume vs. produce ratio to drive the backoff, which would also be useful for telling you whether your IO system is fast enough. Parts of that performance monitor are actually in this PR already.
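For context, the pattern I'd expect to avoid the busy wait is a condition-variable-backed blocking pop rather than spinning on a lock-free queue. A minimal sketch of that pattern (this is not the queue shipped in this PR, and the class name here is generic):

```cpp
#include <condition_variable>
#include <mutex>
#include <queue>

// Minimal sketch of a condition-variable-backed blocking queue: an idle
// consumer sleeps instead of spinning, so prefetch/consumer threads stop
// hammering the CPU when the queue is empty.
template <typename T>
class BlockingQueue {
 public:
  void push(const T& t) {
    {
      std::lock_guard<std::mutex> lock(mutex_);
      queue_.push(t);
    }
    condition_.notify_one();  // wake one waiting consumer, if any
  }

  T pop() {
    std::unique_lock<std::mutex> lock(mutex_);
    // Blocks (without busy waiting) until the queue is non-empty.
    condition_.wait(lock, [this] { return !queue_.empty(); });
    T t = queue_.front();
    queue_.pop();
    return t;
  }

 private:
  std::queue<T> queue_;
  std::mutex mutex_;
  std::condition_variable condition_;
};
```

The boost equivalents (`boost::mutex`, `boost::condition_variable`) would work the same way on toolchains without C++11.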
I don't see this in the code related to this PR, only in the P2P code. Are you using the branch with both? I actually just finished another P2P prototype without busy waiting: https://github.com/cypof/caffe/tree/p2p. It uses blocking_queues and callbacks instead of looping over events, and can create groups of more than 2 GPUs. Bandwidth is not that good, and when I try to increase it by changing P2PSync::QUEUE, it slows down SGD. The GPUs are not fully used, so there must be some congestion somewhere. I also still have a shutdown bug, but I have fixed a lot of them since the last version.
I hit this without P2P, just running single GPU on your branch and also on this one, but I have been swapping branches left and right tracing down this CPU load, so it is possible I have a funky branch setup now. I traced at least one source of the CPU hammering to the prefetch update code, which also appears in this PR and seems to originate here. I will retest with just this path in the morning with a clean checkout, but it has the same changes to pop/push the data prefetch queue, which uses the same blocking queue setup and looping on queue pop. However, there are differences between the queue used here and the one in the P2P branch. There is a LOT of CPU activity that we think is causing your SGD slowdown and limiting GPU scaling. The driver is getting partially starved issuing commands to the GPU, and there is a lot of transfer traffic. It is hard to get clean traces with the segfaults at exit; glad to hear you are making progress there. The GPU load is why I started to backtrack to try to find all the sources of things occupying the GPU. (I quickly spot things like this running gkrellm on Ubuntu to track general system load, page fault rates, etc., as well as GPU temps and the like.) I'll revalidate my branches and try again.
Cyprien, you are correct. It's not this PR, it is an interaction with the P2P branch. The issue gets triggered when running single GPU, so I'll concentrate on working through that on your P2P branch. I still suggest we build in feedback on whether the IO system is fast enough to keep up with the solvers, and report the effective bandwidth we are getting from the different threads and data sources. That is going to be critical information when training multiple high-performance GPUs.
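To make that concrete, here is a rough sketch of the kind of feedback I have in mind; the class and method names are hypothetical, not anything in this PR:

```cpp
#include <chrono>
#include <cstddef>
#include <cstdio>

// Hypothetical sketch (names invented): the prefetch thread records bytes
// read, the solver records how long it blocked waiting for a batch, and a
// periodic report shows whether the IO path keeps up with the GPUs.
// Synchronization is omitted for brevity.
class IOMonitor {
 public:
  IOMonitor() : bytes_(0), wait_seconds_(0.0),
                start_(std::chrono::steady_clock::now()) {}

  void AddBytes(std::size_t n) { bytes_ += n; }         // from prefetch thread
  void AddSolverWait(double s) { wait_seconds_ += s; }  // time spent in pop()

  void Report() const {
    double elapsed = std::chrono::duration<double>(
        std::chrono::steady_clock::now() - start_).count();
    double mb_per_s = (bytes_ / (1024.0 * 1024.0)) / elapsed;
    double stalled = 100.0 * wait_seconds_ / elapsed;
    // A high stall percentage means the solver is starved: the IO system
    // is not fast enough for this GPU setup.
    std::printf("effective IO: %.1f MB/s, solver stalled %.1f%% of the time\n",
                mb_per_s, stalled);
  }

 private:
  std::size_t bytes_;
  double wait_seconds_;
  std::chrono::steady_clock::time_point start_;
};
```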
Updated to latest dev. I first pushed data_layers without changes relative to dev, and github closed the PR. Not sure why I can't reopen it now that my branch is up to date... Also, I'm not sure what to do about the random numbers. The test needs to set a seed, but you don't want all components in Caffe to use the same seed, so I'm storing it in Caffe and using a fixed offset for the data_layer component. We should unify this so each component gets a fixed offset, but that should probably be a separate PR?
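Roughly what I mean by a fixed offset per component, with invented names and an arbitrary stride (not what the PR currently does):

```cpp
#include <cstdint>

// Illustrative only: every component derives its seed from the one global
// seed stored in Caffe plus a fixed per-component offset, so tests stay
// deterministic but components do not share a random sequence.
enum ComponentId { kSolverRng = 0, kDataLayerRng = 1, kDropoutRng = 2 };

inline uint64_t ComponentSeed(uint64_t global_seed, ComponentId id) {
  // Any fixed, well-spread stride works; the constant here is arbitrary.
  return global_seed + static_cast<uint64_t>(id) * 0x9E3779B97F4A7C15ull;
}
```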
@cypof please open a new PR against master. We decided to do away with dev to prevent confusion and overhead, so now all development branches from master. Of course the latest master is nearly dev since we just did a release, so if you have rebased onto dev, the rebase onto master should be easy. Let me know if you have any questions about the switch.
I split the work on data_layer from #1148. It was written initially to get enough bandwidth to feed multiple GPUs and to fix performance issues with thread creation/destruction on each batch. Over time a few other things got in. In particular, we are experimenting at Flickr with different ratios of classes by reading from multiple sources. E.g. each dataset can be set up to contain one class, and the probability of each source defines the class ratios at runtime.
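As a sketch of that multi-source idea (the struct, fields, and function here are illustrative, not the PR's actual API), source selection is just weighted sampling over the configured probabilities:

```cpp
#include <random>
#include <string>
#include <vector>

// Illustrative only: choose which source (DB) supplies the next datum, with
// per-source probabilities controlling the class ratios seen at runtime.
struct Source {
  std::string db_path;
  double probability;  // relative weight; e.g. one single-class DB per entry
};

int PickSource(const std::vector<Source>& sources, std::mt19937& rng) {
  std::vector<double> weights;
  weights.reserve(sources.size());
  for (const Source& s : sources) {
    weights.push_back(s.probability);
  }
  std::discrete_distribution<int> pick(weights.begin(), weights.end());
  return pick(rng);  // index of the source to read from next
}
```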
In terms of performance, the current code could be fast enough, but it's hard to evaluate. If many solvers open the same DB and read, only the first one actually loads data; the others read from the cache. For parallel training, each solver needs to see a different batch, so either we split the dataset into several DBs, or we use large initial offsets in the same DB and hope they won't catch up with each other. If the offset is large, the data might not be in cache anymore when the next solver reaches the same location, requiring the disk to seek back and forth. Seeking kills mechanical disk performance. Using an SSD helps, but then the dataset might not fit and you need multiple sources. This PR tries to answer these different problems.
Features: