RDMA, data pipeline #1629
Conversation
@cypof thanks for updating the parallel branch. I don't know what logic you are trying to implement in the data_layers with multiple sources. Are you trying to read from different DBs but add to the same queue, which in turn could be read from different threads? Could there be more than one thread loading data even from one source? Take a look at #1568, where I'm trying to rewrite datasets to be clearer, simpler, and easier to extend.
@sguada thanks for looking at this. The second queue is about reusing buffers, mostly for GPUs, as allocations and deletes are slow. It seems more efficient to put the buffer back into another queue when the solver is done with it. I used the same setup for the data loader for unification, and because I like the idea of having a constant number of Datums in the system at all times. Yes, multiple sources are for the case where one disk cannot feed the solvers; they all write to a shared queue that each solver picks from. It changes SGD behavior a bit, as some examples will be seen multiple times before the second epoch, but over time coverage should be the same. Would you like to work together on merging this and #1568? Here are the other goals of this update:
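To make the free/full queue idea above concrete, here is a minimal, self-contained sketch, not the branch's actual parallel.cpp: buffers are allocated once, loader threads (one per source) pull them from a "free" queue, fill them, and push them to a shared "full" queue that the solver drains and then recycles. The `BlockingQueue`, `Buffer`, and counts are illustrative assumptions.

```cpp
// Sketch of buffer recycling between a "free" and a "full" queue.
#include <condition_variable>
#include <cstdio>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

template <typename T>
class BlockingQueue {
 public:
  void push(T t) {
    {
      std::lock_guard<std::mutex> lock(mutex_);
      queue_.push(std::move(t));
    }
    cond_.notify_one();
  }
  T pop() {
    std::unique_lock<std::mutex> lock(mutex_);
    cond_.wait(lock, [this] { return !queue_.empty(); });
    T t = std::move(queue_.front());
    queue_.pop();
    return t;
  }
 private:
  std::queue<T> queue_;
  std::mutex mutex_;
  std::condition_variable cond_;
};

struct Buffer { std::vector<float> data; };  // stand-in for a pinned/GPU buffer

int main() {
  const int kNumBuffers = 4;   // constant number of buffers in flight
  const int kNumSources = 2;   // e.g. two DBs on different disks
  const int kIterations = 20;

  BlockingQueue<Buffer*> free_queue;  // buffers ready to be refilled
  BlockingQueue<Buffer*> full_queue;  // buffers ready for the solver

  std::vector<Buffer> pool(kNumBuffers);
  for (Buffer& b : pool) {
    b.data.resize(256 * 256 * 3);
    free_queue.push(&b);  // allocate once, recycle forever
  }

  // Each source thread takes a free buffer, fills it, and pushes it to the
  // shared full queue; the solver sees an interleaved stream from all sources.
  std::vector<std::thread> loaders;
  for (int s = 0; s < kNumSources; ++s) {
    loaders.emplace_back([&, s] {
      for (int i = 0; i < kIterations / kNumSources; ++i) {
        Buffer* b = free_queue.pop();
        b->data[0] = static_cast<float>(s);  // pretend to read a Datum here
        full_queue.push(b);
      }
    });
  }

  // Solver loop: consume a full buffer, use it, then return it to free_queue
  // instead of deleting it, avoiding repeated (slow) allocations.
  for (int i = 0; i < kIterations; ++i) {
    Buffer* b = full_queue.pop();
    std::printf("iteration %d got a buffer from source %d\n", i,
                static_cast<int>(b->data[0]));
    free_queue.push(b);
  }

  for (std::thread& t : loaders) t.join();
  return 0;
}
```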
@cypof sure, I will be happy to work with you on #1568 and this. There are several steps to feed the data into the GPU:
We should time each of these steps to know where the bottlenecks are and put more emphasis there. For instance, I'm not sure whether just reading Datums from disk is a bottleneck or not. Currently, if the Datum is not encoded and images are resized to 256x256, it can read at around 800 datum/second. But if the Datums are resized and encoded (i.e. resized first and then stored as a .jpg file), the sequential reading is even faster, around 2400 datum/second, but then the .jpg needs to be decoded, which takes time. I agree with the goals you set for this PR; should we split the work? Maybe I can take care of the parts related to datum_DB in #1568.
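As a rough illustration of timing each stage, here is a small sketch using std::chrono with placeholder stage functions; the real stages would be the DB read, JPEG decode, transformation, and host-to-device copy. The stage names and the simulated per-item costs are made up for the example, not measurements.

```cpp
// Per-stage timing harness: run a stage repeatedly and report items/second.
#include <chrono>
#include <cstdio>
#include <functional>
#include <string>
#include <thread>

void time_stage(const std::string& name, int iters,
                const std::function<void()>& stage) {
  auto start = std::chrono::steady_clock::now();
  for (int i = 0; i < iters; ++i) stage();
  auto end = std::chrono::steady_clock::now();
  double seconds = std::chrono::duration<double>(end - start).count();
  std::printf("%-10s %8.1f items/s\n", name.c_str(), iters / seconds);
}

int main() {
  const int kIters = 1000;
  // Placeholders standing in for the real stages; swap in actual DB reads,
  // image decoding, and data transformations to measure the real pipeline.
  auto fake_read = [] {    // stand-in, ~0.4 ms per item
    std::this_thread::sleep_for(std::chrono::microseconds(400));
  };
  auto fake_decode = [] {  // stand-in, ~1.2 ms per item
    std::this_thread::sleep_for(std::chrono::microseconds(1200));
  };

  time_stage("read", kIters, fake_read);
  time_stage("decode", kIters, fake_decode);
  return 0;
}
```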
I've set the parallel feature branch to this latest version. These are all good improvements, but I'll note that the branch is now massive: 4000+ new lines of code. It might be helpful to split it up. Thanks @cypof!
@shelhamer I started by cleaning and splitting the data_layer work first and will do a separate PR that should be easy to adapt to #1568. parallel.cpp can probably be simplified a lot once everything works well. Thanks!
@cypof Does the current version of the parallel branch work with GPUs? I have compiled the code, but when running "gpus" I get an error. Could you also say why there are two sources in examples/parallel? What is their purpose?
@melgor the gpus.cpp sample might be broken; I focused on the rdma one. I am currently working on chopping the parallel PR into simpler ones. data_layer was first; multi-GPU on one node is next, hopefully next week.
Big update. It took a while to finish, and I have changes to the new data_layer, so I had to branch again from the latest BVLC dev; let me know if that's fine.