
Conversation

@cypof (Member) commented Dec 24, 2014

Big update; it took a while to finish. I have changes to the new data_layer, so I had to branch again from the latest BVLC dev. Let me know if that's fine.

@sguada (Contributor) commented Dec 24, 2014

@cypof thanks for updating the parallel branch.
I was looking over some of the code, and it seems that you use two blocking_queues to simulate a queue with a predefined maximum capacity that blocks on push and on pop. Take a look at the modified version I wrote here:
https://gist.github.com/sguada/1e1d474a25f4ddcc7ba8
I think using this you will be able to remove the duplicate queues, one for empty buffers and another for full ones.
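
For reference, a minimal sketch of such a single bounded queue, blocking on push when full and on pop when empty (this is not the gist itself; the class and member names below are only illustrative):

```cpp
#include <condition_variable>
#include <mutex>
#include <queue>

template <typename T>
class BoundedBlockingQueue {
 public:
  explicit BoundedBlockingQueue(size_t capacity) : capacity_(capacity) {}

  // Blocks while the queue is at capacity.
  void push(const T& item) {
    std::unique_lock<std::mutex> lock(mutex_);
    not_full_.wait(lock, [&] { return queue_.size() < capacity_; });
    queue_.push(item);
    not_empty_.notify_one();
  }

  // Blocks while the queue is empty.
  T pop() {
    std::unique_lock<std::mutex> lock(mutex_);
    not_empty_.wait(lock, [&] { return !queue_.empty(); });
    T item = queue_.front();
    queue_.pop();
    not_full_.notify_one();
    return item;
  }

 private:
  const size_t capacity_;
  std::queue<T> queue_;
  std::mutex mutex_;
  std::condition_variable not_empty_, not_full_;
};
```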

I don't know what logic you are trying to implement in the data_layers with multiple sources. Are you trying to read from different DBs but add to the same queue, which in turn could be read from different threads? Could there be more than one thread loading data even from one source?

Take a look at #1568, where I'm trying to rewrite datasets to be clearer, simpler, and easier to expand.

@cypof (Member, Author) commented Dec 28, 2014

@sguada thanks for looking at this. The second queue is about reusing buffers, mostly for GPUs, as allocations and deletes are slow. It seems more efficient to put the buffer back in another queue when the solver is done with it. I used the same setup for the data loader for unification, and because I like the idea of having a constant number of Datums in the system at all times.
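
A rough sketch of this free/full recycling pattern, reusing the BoundedBlockingQueue sketched above; the Datum struct, Load(), and Consume() below are stand-ins, not Caffe's actual types:

```cpp
// BoundedBlockingQueue<T> is the bounded queue from the sketch in the earlier comment.
#include <thread>
#include <vector>

struct Datum { std::vector<float> data; };

void Load(Datum* d)    { d->data.assign(256 * 256 * 3, 0.5f); }   // fake read
void Consume(Datum& d) { volatile float x = d.data[0]; (void)x; }  // fake use

int main() {
  const int kBuffers = 4;
  BoundedBlockingQueue<Datum*> free_queue(kBuffers), full_queue(kBuffers);
  std::vector<Datum> pool(kBuffers);
  for (int i = 0; i < kBuffers; ++i) free_queue.push(&pool[i]);  // pre-fill

  // Loader thread: recycle buffers instead of allocating a Datum per item.
  std::thread loader([&] {
    for (int i = 0; i < 100; ++i) {
      Datum* d = free_queue.pop();   // blocks until a buffer is returned
      Load(d);                       // overwrite in place, no allocation
      full_queue.push(d);
    }
  });

  // Consumer ("solver") side: use a loaded buffer, then hand it back.
  for (int i = 0; i < 100; ++i) {
    Datum* d = full_queue.pop();     // blocks until data is available
    Consume(*d);
    free_queue.push(d);              // constant number of Datums in flight
  }
  loader.join();
  return 0;
}
```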

Yes, multiple sources are for the case where one disk cannot feed the solvers; the sources all write to a shared queue that each solver picks from. It changes SGD behavior a bit, as some examples will be seen multiple times before the second epoch, but over time coverage should be the same.

Would you like to work together on merging this and #1568? Here are the other goals of this update:

  • One context per database. E.g. leveldb can only be opened once, and lmdb allocates the whole database size as virtual memory for each context.
  • One loading thread per database, both because some DBs are single-threaded and to ensure sequential access, which is usually faster. In almost all cases one thread is enough for loading speed, as it doesn't do anything else.
  • Getting rid of thread creations/deletions. It's inefficient to create a thread for each batch, and it can cause problems with components that rely on thread-local caching.
  • Prefetching to each GPU on a separate CUDA thread so that the batch is already on the GPU when the solver needs it (see the sketch after this list).
  • Prefetching a configurable number of batches in host memory to absorb bandwidth glitches; in particular, if data is loaded over a network it might make sense to configure a large prefetch queue.
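
As a rough illustration of the GPU-prefetch goal above (not the PR's parallel.cpp; sizes and names are made up), an asynchronous host-to-device copy on a dedicated stream looks roughly like this:

```cpp
#include <cuda_runtime.h>

int main() {
  const size_t count = 64 * 3 * 256 * 256;  // one batch of floats (example size)
  float *host_batch, *gpu_batch;
  cudaMallocHost(&host_batch, count * sizeof(float));  // pinned memory for fast DMA
  cudaMalloc(&gpu_batch, count * sizeof(float));

  cudaStream_t prefetch_stream;
  cudaStreamCreate(&prefetch_stream);

  // Fill host_batch from the host-side prefetch queue (omitted), then start
  // the copy; the async copy overlaps with compute on the default stream.
  cudaMemcpyAsync(gpu_batch, host_batch, count * sizeof(float),
                  cudaMemcpyHostToDevice, prefetch_stream);

  // Before the solver consumes gpu_batch, it waits only on this stream.
  cudaStreamSynchronize(prefetch_stream);

  cudaStreamDestroy(prefetch_stream);
  cudaFree(gpu_batch);
  cudaFreeHost(host_batch);
  return 0;
}
```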

@sguada (Contributor) commented Dec 29, 2014

@cypof sure, I will be happy to work with you on #1568 and this.

There are several steps to feed the data into the GPU:

  1. Read Datum from disk
  2. Decode the Datum if it was encoded
  3. Apply the transformation to the Datum to make it part of the prefetch Blob
  4. Copy the prefetch Blob to the GPU

We should time each of these steps to know where the bottlenecks are and put more emphasis there. For instance, I'm not sure whether just reading Datums from disk is a bottleneck. Currently, if the Datum is not encoded and images are resized to 256x256, it can read around 800 datums/second. But if the Datums are resized and encoded (i.e. they are resized first and then stored as .jpg files), sequential reading is even faster, around 2400 datums/second, but then the .jpg needs to be decoded, which takes time.
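
A minimal way to get those per-stage numbers, assuming we wrap each stage in a lambda (the stage bodies below are placeholders, not actual Caffe calls):

```cpp
#include <chrono>
#include <functional>

// Runs one stage and returns its wall-clock duration in milliseconds.
double TimeMs(const std::function<void()>& stage) {
  auto start = std::chrono::steady_clock::now();
  stage();
  auto end = std::chrono::steady_clock::now();
  return std::chrono::duration<double, std::milli>(end - start).count();
}

// Usage (hypothetical lambdas wrapping the real pipeline calls):
//   double read_ms      = TimeMs([&] { /* 1. read Datum from disk */ });
//   double decode_ms    = TimeMs([&] { /* 2. decode if encoded */ });
//   double transform_ms = TimeMs([&] { /* 3. transform into prefetch Blob */ });
//   double copy_ms      = TimeMs([&] { /* 4. copy prefetch Blob to GPU */ });
// Accumulate over many iterations and compare the four totals to find the
// stage that dominates.
```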

I agree with the goals you set for this PR; should we split the work? Maybe I can take care of the parts related to datum_DB in #1568.

  • Each database can only be opened once, so for each source there should be only one open handle. By defining a DatumDB and a DatumDBFactory we can impose that constraint more easily (a sketch follows this list).
  • Do we want to allow multiple cursors per database (each cursor reading from a different position in the database)? This makes things a bit more complicated, since we need to keep track of multiple cursors.
  • Do we want to allow multiple readers per cursor? Each reader could be a different thread, so it should be thread-safe. We can get by with a blocking_queue, so the internal thread pushes Datums into the queue while the readers pop Datums from it.
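
A sketch of how the one-handle-per-source constraint could be enforced by the factory; DatumDB and DatumDBFactory are the names proposed above, but the body here is only an assumption, not the #1568 implementation:

```cpp
#include <map>
#include <memory>
#include <string>

class DatumDB { /* would wrap a leveldb/lmdb handle and its cursor(s) */ };

class DatumDBFactory {
 public:
  // Returns the single shared handle for `source`, opening it on first use.
  // A real implementation would guard the registry with a mutex so several
  // data layers can call this concurrently.
  static std::shared_ptr<DatumDB> GetDatumDB(const std::string& source) {
    static std::map<std::string, std::weak_ptr<DatumDB> > registry;
    std::shared_ptr<DatumDB> db = registry[source].lock();
    if (!db) {
      db = std::make_shared<DatumDB>();  // open the database exactly once
      registry[source] = db;
    }
    return db;  // every layer reading `source` shares the same context
  }
};
```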

@shelhamer merged commit aa3b877 into BVLC:parallel on Jan 16, 2015
@shelhamer (Member) commented:

I've set the parallel feature branch to this latest version. These are all good improvements, but I'll note that the branch is now massive, 4000+ new lines of code. It might be helpful to split parallel.cpp into different parts for, say, CPU, GPU, and distributed parallelism instead of having a monolithic implementation file.

Thanks @cypof!

@cypof (Member, Author) commented Jan 16, 2015

@shelhamer I started cleaning and splitting the data_layer work first and will do a separate PR that should be easy to adapt to #1568. parallel.cpp can probably be simplified a lot once everything works well. Thanks!

@melgor commented Feb 6, 2015

@cypof Does the current version of the parallel branch work with GPUs? I have compiled the code, but when running "gpus" I get this error:
F0206 17:00:12.894520 11428 syncedmem.cpp:51] Check failed: error == cudaSuccess (77 vs. 0) an illegal memory access was encountered

Could you also say why there are two sources in examples/parallel? What is their purpose?

@cypof (Member, Author) commented Feb 6, 2015

@melgor the gpus.cpp sample might be broken; I focused on the RDMA one. I am currently working on chopping the parallel PR up into simpler ones. data_layer was first; multi-GPU on one node is next, hopefully next week.
