@cypof
Member

cypof commented Jan 21, 2015

I split the data_layer work out of #1148. It was initially written to provide enough bandwidth to feed multiple GPUs and to fix performance issues caused by creating and destroying a thread on each batch. Over time a few other things got in. In particular, at Flickr we are experimenting with different class ratios by reading from multiple sources. E.g. each dataset can be set up to contain one class, and the probability of each source then defines the class ratios at runtime. Features:

  • Reading from multiple sources, in case one network location or disk cannot feed the solvers. Each source can hold only a shard, in which case the source probabilities need to be balanced by shard size. Or each source can be a copy of the same dataset read from a random offset, which might also change SGD behavior a bit, as some examples might be seen multiple times before the second epoch, but over time coverage should be the same.
  • Probabilities on sources, e.g. to change the ratio of positive/negative when doing binary classification.
  • One loading thread per database, even if multiple solvers are running. This suits single-threaded DBs like LevelDB and ensures sequential access, which is usually faster. In almost all cases one thread is enough for loading speed, as it doesn't do anything else. There is still a transform thread for each solver, like today.
  • No thread creation/deletion per batch. It's inefficient and it causes problems with components that rely on thread-local caching. We also had problems with memory pinning and virtual memory. Cf. @thatguymike
  • Prefetch asynchronously to each GPU on a separate CUDA stream, so that the batch is already on the GPU when the solver needs it.
  • Prefetch a configurable number of batches in host memory to smooth over bandwidth glitches. In particular, if data is loaded over a network, it might make sense to configure a large prefetch queue.
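
To make the combination of source probabilities, a single long-lived loading thread, and a bounded host-side prefetch queue concrete, here is a minimal Python sketch. It is not the code in this PR (which is C++ inside Caffe): the names `WeightedSources`, `PrefetchLoader` and `demo` are hypothetical, the sources are plain in-memory lists standing in for databases, and the asynchronous copy to the GPU on a CUDA stream is left out entirely.

```python
import queue
import random
import threading
from collections import Counter


class WeightedSources:
    """Hypothetical helper: draw each example from one of several sources,
    chosen according to a per-source probability."""

    def __init__(self, sources, probabilities, seed=0):
        if len(sources) != len(probabilities):
            raise ValueError("one probability per source is required")
        if any(len(s) == 0 for s in sources):
            raise ValueError("sources must not be empty")
        self._iters = [self._cycle(s) for s in sources]
        self._weights = list(probabilities)
        self._rng = random.Random(seed)

    @staticmethod
    def _cycle(source):
        # Sequential read that wraps around at the end, like a DB cursor.
        while True:
            for item in source:
                yield item

    def next(self):
        # Pick a source index by weight, then read its next example.
        index = self._rng.choices(
            range(len(self._iters)), weights=self._weights)[0]
        return next(self._iters[index])


class PrefetchLoader:
    """Hypothetical helper: one thread, created once, keeps a bounded
    queue of ready batches filled."""

    def __init__(self, sampler, batch_size, prefetch=4):
        self._sampler = sampler
        self._batch_size = batch_size
        self._queue = queue.Queue(maxsize=prefetch)
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def _run(self):
        while not self._stop.is_set():
            batch = [self._sampler.next() for _ in range(self._batch_size)]
            # Wait while the queue is full, waking up to notice stop().
            while not self._stop.is_set():
                try:
                    self._queue.put(batch, timeout=0.1)
                    break
                except queue.Full:
                    continue

    def next_batch(self):
        # Blocks only if the loader has fallen behind the consumer.
        return self._queue.get()

    def stop(self):
        self._stop.set()
        self._thread.join()


def demo():
    # Two single-class sources; the weights set the class ratio at runtime.
    positives = [("pos", i) for i in range(100)]
    negatives = [("neg", i) for i in range(100)]
    sampler = WeightedSources([positives, negatives], [0.25, 0.75], seed=0)
    loader = PrefetchLoader(sampler, batch_size=32, prefetch=4)
    counts = Counter()
    for _ in range(100):
        counts.update(label for label, _ in loader.next_batch())
    loader.stop()
    return counts
```

Calling `demo()` returns per-class counts over 100 batches of 32 examples; with the weights above, roughly a quarter of the examples should be positives. A larger `prefetch` value trades host memory for more tolerance to stalls in the source.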

@cypof cypof mentioned this pull request Jan 21, 2015
@shelhamer
Member

@cypof thanks for all the data pipeline improvements. Just a heads-up: this'll likely need a rebase after #1748.

@cypof cypof closed this Jan 22, 2015
@cypof cypof deleted the data_queues branch January 22, 2015 02:51
@cypof
Member Author

cypof commented Jan 22, 2015

Deleted my branch by mistake, copied the PR to #1775
