
Conversation

@cypof (Member) commented Dec 24, 2014

Big update; it took a while to finish. I have changes to the new data_layer, so I had to branch again from the latest BVLC dev. Let me know if that's fine.

@sguada (Contributor) commented Dec 24, 2014

@cypof thanks for updating the parallel branch.
I was looking over some of the code, and it seems that you use two blocking_queues to simulate a queue with a predefined maximum capacity that blocks on push and on pop. Take a look at the modified version I wrote here:
https://gist.github.com/sguada/1e1d474a25f4ddcc7ba8
I think using this you will be able to remove the duplicate queues, one for empty buffers and another for full ones.
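
For reference, a minimal sketch of such a single bounded queue, blocking on push when full and on pop when empty (this is not the gist itself; the class and member names below are only illustrative):

```cpp
#include <condition_variable>
#include <mutex>
#include <queue>

template <typename T>
class BoundedBlockingQueue {
 public:
  explicit BoundedBlockingQueue(size_t capacity) : capacity_(capacity) {}

  // Blocks while the queue is at capacity.
  void push(const T& item) {
    std::unique_lock<std::mutex> lock(mutex_);
    not_full_.wait(lock, [&] { return queue_.size() < capacity_; });
    queue_.push(item);
    not_empty_.notify_one();
  }

  // Blocks while the queue is empty.
  T pop() {
    std::unique_lock<std::mutex> lock(mutex_);
    not_empty_.wait(lock, [&] { return !queue_.empty(); });
    T item = queue_.front();
    queue_.pop();
    not_full_.notify_one();
    return item;
  }

 private:
  const size_t capacity_;
  std::queue<T> queue_;
  std::mutex mutex_;
  std::condition_variable not_empty_, not_full_;
};
```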

I don't know what logic you are trying to implement in the data_layers with multiple sources. Are you trying to read from different DBs but add to the same queue, which in turn could be read from different threads? Could there be more than one thread loading data even from one source?

Take a look at #1568, where I'm trying to rewrite datasets to be clearer, simpler, and easier to expand.

@cypof (Member, Author) commented Dec 28, 2014

@sguada thanks for looking at this. The second queue is about reusing buffers, mostly for GPUs, as allocations and deletes are slow. It seems more efficient to put the buffer back in another queue when the solver is done with it. I used the same setup for the data loader for unification, and because I like the idea of having a constant number of Datums in the system at all times.
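
A rough sketch of this free/full recycling pattern, reusing the BoundedBlockingQueue sketched above; the Datum struct, Load(), and Consume() below are stand-ins, not Caffe's actual types:

```cpp
// BoundedBlockingQueue<T> is the bounded queue from the sketch in the earlier comment.
#include <thread>
#include <vector>

struct Datum { std::vector<float> data; };

void Load(Datum* d)    { d->data.assign(256 * 256 * 3, 0.5f); }   // fake read
void Consume(Datum& d) { volatile float x = d.data[0]; (void)x; }  // fake use

int main() {
  const int kBuffers = 4;
  BoundedBlockingQueue<Datum*> free_queue(kBuffers), full_queue(kBuffers);
  std::vector<Datum> pool(kBuffers);
  for (int i = 0; i < kBuffers; ++i) free_queue.push(&pool[i]);  // pre-fill

  // Loader thread: recycle buffers instead of allocating a Datum per item.
  std::thread loader([&] {
    for (int i = 0; i < 100; ++i) {
      Datum* d = free_queue.pop();   // blocks until a buffer is returned
      Load(d);                       // overwrite in place, no allocation
      full_queue.push(d);
    }
  });

  // Consumer ("solver") side: use a loaded buffer, then hand it back.
  for (int i = 0; i < 100; ++i) {
    Datum* d = full_queue.pop();     // blocks until data is available
    Consume(*d);
    free_queue.push(d);              // constant number of Datums in flight
  }
  loader.join();
  return 0;
}
```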

Yes, multiple sources are for the case where one disk cannot feed the solvers; the sources all write to a shared queue that each solver picks from. It changes SGD behavior a bit, as some examples will be seen multiple times before the second epoch, but over time coverage should be the same.

Would you like to work together on merging this and #1568? Here are the other goals of this update:

  • One context per database. E.g. leveldb can only be opened once, and lmdb allocates the whole database size as virtual memory for each context.
  • One loading thread per database, both because some DBs are single-threaded and to ensure sequential access, which is usually faster. In almost all cases one thread is enough for loading speed, as it doesn't do anything else.
  • Getting rid of thread creations/deletions. It's inefficient to create a thread for each batch, and it can cause problems with components that rely on thread-local caching.
  • Prefetching to each GPU on a separate CUDA thread so that the batch is already on the GPU when the solver needs it (see the sketch after this list).
  • Prefetching a configurable number of batches in host memory to absorb bandwidth glitches; in particular, if data is loaded over a network it might make sense to configure a large prefetch queue.
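
As a rough illustration of the GPU-prefetch goal above (not the PR's parallel.cpp; sizes and names are made up), an asynchronous host-to-device copy on a dedicated stream looks roughly like this:

```cpp
#include <cuda_runtime.h>

int main() {
  const size_t count = 64 * 3 * 256 * 256;  // one batch of floats (example size)
  float *host_batch, *gpu_batch;
  cudaMallocHost(&host_batch, count * sizeof(float));  // pinned memory for fast DMA
  cudaMalloc(&gpu_batch, count * sizeof(float));

  cudaStream_t prefetch_stream;
  cudaStreamCreate(&prefetch_stream);

  // Fill host_batch from the host-side prefetch queue (omitted), then start
  // the copy; the async copy overlaps with compute on the default stream.
  cudaMemcpyAsync(gpu_batch, host_batch, count * sizeof(float),
                  cudaMemcpyHostToDevice, prefetch_stream);

  // Before the solver consumes gpu_batch, it waits only on this stream.
  cudaStreamSynchronize(prefetch_stream);

  cudaStreamDestroy(prefetch_stream);
  cudaFree(gpu_batch);
  cudaFreeHost(host_batch);
  return 0;
}
```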

@sguada (Contributor) commented Dec 29, 2014

@cypof sure, I will be happy to work with you on #1568 and this.

There are several steps to feed the data into the GPU:

  1. Read Datum from disk
  2. Decode the Datum if it was encoded
  3. Apply the transformation to the Datum to make it part of the prefetch Blob
  4. Copy the prefetch Blob to the GPU

We should time each of these steps to know where the bottlenecks are and put more emphasis there. For instance, I'm not sure whether just reading Datums from disk is a bottleneck. Currently, if the Datum is not encoded and images are resized to 256x256, it can read around 800 datums/second. But if the Datums are resized and encoded (i.e. they are resized first and then stored as .jpg files), sequential reading is even faster, around 2400 datums/second, but then the .jpg needs to be decoded, which takes time.
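
A minimal way to get those per-stage numbers, assuming we wrap each stage in a lambda (the stage bodies below are placeholders, not actual Caffe calls):

```cpp
#include <chrono>
#include <functional>

// Runs one stage and returns its wall-clock duration in milliseconds.
double TimeMs(const std::function<void()>& stage) {
  auto start = std::chrono::steady_clock::now();
  stage();
  auto end = std::chrono::steady_clock::now();
  return std::chrono::duration<double, std::milli>(end - start).count();
}

// Usage (hypothetical lambdas wrapping the real pipeline calls):
//   double read_ms      = TimeMs([&] { /* 1. read Datum from disk */ });
//   double decode_ms    = TimeMs([&] { /* 2. decode if encoded */ });
//   double transform_ms = TimeMs([&] { /* 3. transform into prefetch Blob */ });
//   double copy_ms      = TimeMs([&] { /* 4. copy prefetch Blob to GPU */ });
// Accumulate over many iterations and compare the four totals to find the
// stage that dominates.
```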

I agree with the goals you set for this PR; should we split the work? Maybe I can take care of the parts related to datum_DB in #1568.

  • Each database can only be opened once, so for each source there should be only one open handle. By defining a DatumDB and a DatumDBFactory we can impose that constraint more easily (a sketch follows this list).
  • Do we want to allow multiple cursors per database (each cursor reading from a different position in the database)? This makes things a bit more complicated, since we need to keep track of multiple cursors.
  • Do we want to allow multiple readers per cursor? Each reader could be a different thread, so it should be thread-safe. We can get by with a blocking_queue, so the internal thread pushes Datums into the queue while the readers pop Datums from it.
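
A sketch of how the one-handle-per-source constraint could be enforced by the factory; DatumDB and DatumDBFactory are the names proposed above, but the body here is only an assumption, not the #1568 implementation:

```cpp
#include <map>
#include <memory>
#include <string>

class DatumDB { /* would wrap a leveldb/lmdb handle and its cursor(s) */ };

class DatumDBFactory {
 public:
  // Returns the single shared handle for `source`, opening it on first use.
  // A real implementation would guard the registry with a mutex so several
  // data layers can call this concurrently.
  static std::shared_ptr<DatumDB> GetDatumDB(const std::string& source) {
    static std::map<std::string, std::weak_ptr<DatumDB> > registry;
    std::shared_ptr<DatumDB> db = registry[source].lock();
    if (!db) {
      db = std::make_shared<DatumDB>();  // open the database exactly once
      registry[source] = db;
    }
    return db;  // every layer reading `source` shares the same context
  }
};
```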

@shelhamer merged commit aa3b877 into BVLC:parallel on Jan 16, 2015
@shelhamer (Member) commented:

I've set the parallel feature branch to this latest version. These are all good improvements, but I'll note that the branch is now massive, 4000+ new lines of code. It might be helpful to split parallel.cpp into different parts for, say, CPU, GPU, and distributed parallelism instead of having a monolithic implementation file.

Thanks @cypof!

@cypof (Member, Author) commented Jan 16, 2015

@shelhamer I started cleaning and splitting the data_layer work first and will do a separate PR that should be easy to adapt to #1568. parallel.cpp can probably be simplified a lot once everything works well. Thanks!

@melgor commented Feb 6, 2015

@cypof Does the current version of the parallel branch work with GPUs? I have compiled the code, but when running "gpus" I get this error:
F0206 17:00:12.894520 11428 syncedmem.cpp:51] Check failed: error == cudaSuccess (77 vs. 0) an illegal memory access was encountered

Could you also say why there are two sources in examples/parallel? What is their purpose?

@cypof (Member, Author) commented Feb 6, 2015

@melgor the gpus.cpp sample might be broken; I focused on the RDMA one. I am currently working on chopping the parallel PR up into simpler ones. data_layer was first; multi-GPU on one node is next, hopefully next week.
