Conversation

@netheril96
Contributor

This is a new layer, as mentioned in #523. The IndirectionLayer maps each single integer label to an arbitrary blob. The mapping is stored in the file specified by the source parameter.

Motivation

In many cases, more than a single integer label per image is needed. It may be a group of independent labels, a linear array/matrix, or a combination of both. The current workaround is to create multiple DataLayers and ImageDataLayers, each outputting a data array. The problems with this approach are:

  1. It is not natural. The additional labels must be converted to an image format that serves only as a dummy container of numbers, combined with dummy labels, to create a DataLayer or ImageDataLayer. This approach is basically a hack that abuses the existing layers.
  2. It is not flexible. The order of the different sets of data must be manually synchronized, which rules out shuffling in many cases.
  3. It is error-prone. The images and their associated properties are tightly coupled in concept but separated in implementation, which makes it very easy to change one part and forget to change the others.

The Datum format can be reworked to allow arbitrary data in place of a single label. That approach, however, is likely to complicate the majority of use cases where a single label is already enough, and/or break backwards compatibility.

Network structure with indirection layer

In the first example, the single label output of DataLayer is transformed into three independent labels, suitable for three different loss layers.

In the second example, the label output is transformed into one large blob instead. Hence images (or other data) can have arbitrary data naturally associated with them.

More complex networks can be constructed by adding several indirection layers, so that each piece of data can be associated with any number of properties of arbitrary shape.
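For concreteness, a minimal prototxt sketch of the first example might look like the following. This is only an illustration: the INDIRECTION type name, the blob names, and the data_param values are assumptions, since only indirection_param is defined by this PR.

layers {
  name: "data"
  type: DATA
  top: "data"
  top: "label"
  data_param { source: "train_lmdb" backend: LMDB batch_size: 64 }
}
layers {
  name: "indirection"
  type: INDIRECTION  # assumed enum value for the new IndirectionLayer
  bottom: "label"
  top: "label_a"     # one top per repeated source entry
  top: "label_b"
  top: "label_c"
  indirection_param {
    type: SIMPLE_TEXT
    source: "label_a.txt"
    source: "label_b.txt"
    source: "label_c.txt"
  }
}
# label_a, label_b and label_c can then each feed a separate loss layer.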

Basic usage

The indirection_param is declared as

// Message that store parameters for IndirectionLayer
message IndirectionParameter {
  enum IndirectionSourceType {
    // Each line of the source file consists of numbers separated by whitespace
    SIMPLE_TEXT = 0;
    // Each line of the source file is a path to an unstructured binary file
    INDEXED_BINARY = 1;
    // Each line of the source file is a path to a binary BlobProto file
    INDEXED_BLOB = 2;
  }
  optional IndirectionSourceType type = 1;
  // The source file for index mapping
  repeated string source = 2;

  // The product of channels, height and width must equal the array length
  // in the source
  optional uint32 channels = 3 [default = 1];
  optional uint32 height = 4 [default = 1];
  optional uint32 width = 5 [default = 1];
  enum IndirectionCacheType {
    NONE = 0;  // No cache is used
    WHOLE = 1;  // The data is cached as a whole in memory

    // Part of the data is cached; replacement is based on clock algorithm
    CLOCK = 2;
  }
  optional IndirectionCacheType cache_type = 6;

  // The following parameters are used only with partial cache
  optional uint32 cache_block_size = 7 [default = 256];
  optional uint32 cache_block_num = 8 [default = 8];
}

The key attributes are type and source. The source attribute contains the file name of a text file containing the mapping, and can be repeated to produce multiple tops. In each mapping file, the zeroth line corresponds to the zeroth blob, the 33rd line to the 33rd blob, and so on.

The type parameter determines how each line is interpreted. In the default case, each line consists of space-separated floating-point numbers. Lines may also be interpreted as file names that contain the actual data. More types (such as indexed images) can be added by subclassing IndexedDataReader.

The channels, width and height attributes control the shape of the top blob, and must be specified (or left at the default value of 1) because many types of mapping do not contain this information. Changing these values allows different interpretations of the underlying data contained in the mapping.

When the type is not SIMPLE_TEXT, each lookup requires disk access and possibly format conversion, which is costly. To reduce the overhead, one can enable caching. Currently only the simplest caching mechanisms are implemented: a complete in-memory cache and a block cache with the CLOCK replacement algorithm. More caching strategies can be implemented by subclassing IndexedCachedDataReader.
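As a hypothetical illustration of the shape and cache parameters (the INDIRECTION type name and the file names are again assumptions), mapping each label to a 1x4x4 blob read from per-example BlobProto files might look like:

layers {
  name: "attributes"
  type: INDIRECTION  # assumed type name for IndirectionLayer
  bottom: "label"
  top: "attributes"
  indirection_param {
    # each line of attribute_index.txt is a path to a binary BlobProto file
    type: INDEXED_BLOB
    source: "attribute_index.txt"
    channels: 1
    height: 4
    width: 4
    # partial cache with CLOCK replacement, to avoid a disk read on every lookup
    cache_type: CLOCK
    cache_block_size: 256
    cache_block_num: 8
  }
}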

@mtamburrano
Contributor

@bhack and I tested this layer and it seems to work fine. We used indirection_layer to feed the multilabel_loss_layer, and the net converges normally, as with the old PR #523.

@netheril96
Contributor Author

The project maintainers are probably hibernating.

@ducha-aiki
Contributor

The project maintainers are probably hibernating.

http://www.pamitc.org/cvpr15/dates.php
Paper Submission Deadline 2014 November 14 (Fri).

So only until next week (Saturday and Sunday to sleep after submission :))

@netheril96
Contributor Author

Have the project maintainers awaken from their prolonged hibernation?

@bhack
Contributor

bhack commented Nov 20, 2014

@shelhamer @longjon @sguada @jeffdonahue what is the BVLC team status now?

@shelhamer
Member

Supplementary material. We're all nearly back and preparing pull requests, not to mention queuing up reviews, so everything will come back to life soon!

@netheril96
Contributor Author

The description of this PR has been updated to better clarify what this layer does.

@netheril96
Contributor Author

@shelhamer @longjon @sguada @jeffdonahue Is there any status update? I would like this PR to be merged upstream as soon as possible, because I have other ideas to implement, and maintaining several feature branches at the same time is becoming increasingly difficult.

@bhack
Contributor

bhack commented Nov 28, 2014

@netheril96 This problem is common. There is no "known" policy for reviewing user contributions. It is not about the age of the PR. It is not about the quality of the PR. It is not about the triviality of the PR.
I think we need at least a periodic review to classify the PR queue status with an updated tag (author MIA, NEW, NEEDS REVIEW, IN REVIEW, reviewer MIA, or whatever you want). I know that reviewing PRs is currently best effort, based on the BVLC team's interest and free time, and that everyone can maintain their own fork (as with every other open source project). But I think that, with Caffe's increasing popularity, the BVLC team needs to start thinking about whether Caffe should remain mainly an "open source" product of the BVLC team or become an "open source", partially community-driven project. It is obvious that with the current organization, without a clear path for PRs and with little visibility into what the BVLC team is cooking in the medium term, I see a very hard path ahead to scale, maintain, and attract more contributors.

@sguada
Contributor

sguada commented Nov 28, 2014

@netheril96 after looking over this PR, I think it should be split into two PRs.
The first PR would read labels from files, blobs, or, why not, from an lmdb or a leveldb:

  • I think reading labels from files, blobs, DBs, ... should be similar to reading data from files, blobs, DBs, ... so let's move this part into datasource or into labelsource.
  • I think that after this generalization, where labels can be Blobs, there is not much difference from data, so it should be handled similarly.
  • Maybe the only difference is that the keys for data are strings (and data are typically read sequentially), while the keys for labels are ints (and labels are typically retrieved by key).

The second PR would be the indirection layer itself (which should be part of data_layers.hpp); it would use a datasource (or labelsource) to read the labels and provide them when requested.

Also, I think being able to read data from txt files, binary files, or blob files would be a good addition to datasource.

@sguada
Contributor

sguada commented Nov 28, 2014

I forgot to mention that datasource #1238 needs to be cleaned up too. So @kmatzen, this would be a great time to do it.

@netheril96
Contributor Author

@sguada

Your idea of a further abstraction is interesting, but I have not completely understood it.

For one, there is no reading of labels from files, blobs, or DBs in this PR. The only real labels come from DataLayer and ImageDataLayer, which is mandated by how Datum is declared. The outputs from IndirectionLayer are regular blobs, which can be treated as labels if needed but are not fundamentally different from regular data. So I don't understand the need to unify data source and label source.

Maybe the only difference is that the keys for data are strings (and data are typically read sequentially), while the keys for labels are ints (and labels are typically retrieved by key).

If I understand it correctly, the keys in the key-value store (leveldb or lmdb) currently carry little semantic meaning. They are all prefixed with the ASCII representation of an integer to enforce ordering, and they are not amenable to random access, at least not through the current interface. Or do you propose we change that too?

@netheril96
Contributor Author

@sguada

Actually, isn't IndexedDataReader in the implementation already the unified data/label source you are talking about? LevelDB/LMDB support would not be very hard to add either.

@shelhamer
Member

I think reading labels from files, blobs, DBs, ... should be similar to reading data from files, blobs, DBs, ... so let's move this part into datasource or into labelsource.

@sguada in general I think there should not be a real distinction between data and label; instead there should be only blobs. I don't want to make this off-topic, so I'll post another issue.

@sguada
Contributor

sguada commented Nov 30, 2014

@shelhamer Agreed, that is what I was trying to say: data and label should be treated more uniformly. Unfortunately, Datum currently makes a clear distinction between data (which can be an int8 or float 3D matrix) and label (which can only be an int).
@netheril96 IndexedDataReader is not enough to unify data and label, since data is usually retrieved sequentially (although it has a string key), while a label should be retrieved using an int key associated with the data.
Regarding the IndexedDataReader I think the files should contain the key (int) in them and not assume that the line or the location in the file represents their key.

For the unification I see three options:

  1. Use int64 for all keys, for both data and labels. In this case we only need one mapping (key:int64, blob:Blob), which could be used to store data or labels (a sketch follows at the end of this comment).
    • The downside is that using integers as keys for data makes it difficult to know which data point we are using.
    • The upside is that the same key can be used to retrieve data and labels.
  2. Use strings for data keys and int64 for label keys. In this case we need two different mappings: a datasource with (key:string, value:Blob, label_idx:int64) and a labelsource with (key:int64, value:Blob).
    • The downside is that we need two different sets of keys, and they have to be aligned when constructing the datasource.
    • The upside is that it allows more meaningful keys for data and labels; they are kept separate and have different meanings.
  3. Define two types of sources, one using strings as keys and another using int64 as keys.
    • This approach is agnostic about what is stored; it could be data or labels, and the same key would be used to link them together after reading.
    • The downside is that in Caffe we cannot pass strings between layers, only Blobs, which means that if one wants to use strings for keys, then one also needs to define a mapping between strings and int64 to pass to another layer.

In any case, the source should contain the type of the data/label stored.
Now, I don't know whether it is best to try to fuse data and label in this PR or to leave that for later.
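To make option 1 concrete, a hypothetical mapping message (all names invented for illustration, reusing the existing BlobProto) could be sketched as:

// Sketch of option 1: a single int64-keyed mapping that can hold either data or labels.
message KeyedBlob {
  required int64 key = 1;       // the same key links a data blob and its label blob
  optional BlobProto blob = 2;  // the stored value, data or label alike
}

message KeyedBlobSource {
  repeated KeyedBlob item = 1;
}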

@sguada
Contributor

sguada commented Nov 30, 2014

@netheril96 I forgot to mention that IndirectionLayer should be part of DataLayers

@netheril96
Contributor Author

@sguada

About data and label

Unfortunately, Datum currently makes a clear distinction between data (which can be an int8 or float 3D matrix) and label (which can only be an int).

A label is a special case of data (a 1x1x1 3D matrix), so they are already unified. Datum makes a distinction between the two, but anything that does not need Datum does not have to make the same distinction.

since data is usually retrieved sequentially (although it has a string key), while a label should be retrieved using an int key associated with the data.

This does not seem true. Without the indirection layer, both are retrieved sequentially. If you are referring to the string keys in leveldb/lmdb, they are currently not meaningful and are never used.

In a nutshell, I strongly disagree with the suggestion that we should make a distinction between data and label. The latter is just a special case of the former, and they are already unified under the Blob interface. The Datum class is a bad design, and that bad design should not be propagated unless the alternative is to break backwards compatibility, which this PR does not do. @shelhamer

About representation

Regarding the IndexedDataReader I think the files should contain the key (int) in them and not assume that the line or the location in the file represents their key.

I considered such a design in the beginning, but decided to go for a simpler approach.

But maybe that was wrong. The alternative I am currently considering is to mandate that the source file be in text protobuf format. The structure could be declared as

message DataMappingEntry {
  required int64 key = 1;
  required string type = 2; 
  repeated float data = 3;
  optional string filename = 4;
}

message DataMapping {
  repeated DataMappingEntry entry = 1;
}
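For illustration, a mapping in this text protobuf format could then look like the following (the values and the meaning of the type strings are made up):

entry {
  key: 0
  type: "inline"     # data stored directly in the entry
  data: 0.5
  data: 1.0
  data: 2.0
}
entry {
  key: 7
  type: "blob_file"  # data stored in an external BlobProto file
  filename: "blobs/entry_7.binaryproto"
}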

This proposal has several changes with respect to the current design:

  • Easier parsing since protobuf format is structured.
  • Harder to produce the source file, especially programmatically.
  • More flexible and may result in overall better code.
  • Less efficient, because the keys may now have holes, and hash tables are probably needed in place of simple arrays.

Misc

I am not sure where to put this layer. Conceptually, this layer does belong with the data layers. But all the layers in data_layers.hpp are currently derived from BaseDataLayer, which makes IndirectionLayer a poor fit.

@longjon
Contributor

longjon commented Dec 1, 2014

Some comments:

  • I fear this layer does too many unrelated things... it knows about parsing text files, reading "unstructured binary data" (but note that blobs are Dtype arrays, so there is no such thing...), reading BlobProtos, and various caching mechanisms. I think my vision of this (based on offline discussion with @sguada) was that we would have some data layers that took keys as bottoms and produced certain blobs as tops, perhaps as extensions to the existing data layers, where possible, or as new layers, where not possible. Is there a good reason for all of the indexed data sources to live in one layer?
  • @netheril96, (not sure if you are reading @sguada differently than I am, but) we all agree that there is no fundamental distinction between data and label, and the current distinction made in Datum is likely to go away or be deprecated in the future.
  • The main issue I see here is therefore the one about strings: one often wants to use strings as keys, so... should there be string blobs? Or should layers be able to communicate with non-blob things? Or what?
  • But since this PR already seems to not deal in strings, that's really an issue for another place.

So the path I see is: break this into some data layers that take indices and produce data, and/or add options to existing data layers to take bottom blobs that contain indices, where that can be done and is desired. (However, I'm a newcomer to this PR, so let me know if I've missed some reason why things should be done differently.)

@longjon
Contributor

longjon commented Dec 1, 2014

And @bhack @netheril96, re: the reviewing process, I agree that we could be a bit more organized, and know that we are discussing how we can improve things to be more efficient with everyone's time.

In the case of this PR, I can say why I personally haven't had time for a close look: it's a behemoth at nine hundred lines, including hundreds of lines added to util/. You can help us out by making PRs the smallest useful diffs they can be (and this will probably result in more modular code to boot).

(If you have more to say about the process, let's take it to an issue to keep this on topic.)

@bhack
Contributor

bhack commented Dec 1, 2014

@longjon I know. But we need to add to this all the history of #523. Since we do not have a dev mailing list, I'll open a new ticket to collect ideas.

@netheril96
Contributor Author

@longjon

I think my vision of this (based on offline discussion with @sguada) was that we would have some data layers that took keys as bottoms and produced certain blobs as tops, perhaps as extensions to the existing data layers, where possible, or as new layers, where not possible.

Eh, isn't "keys as bottom and data as tops" exactly what IndirectionLayer does? Do I miss something?

Is there a good reason for all of the indexed data sources to live in one layer?

Are you suggesting that a different layer be made for each different type of data mapping, like SimpleTextIndirectionLayer, IndexedBlobIndirectionLayer, and IndexedImageIndirectionLayer?

The current design is that IndirectionLayer is itself a light wrapper. The layer delegates the task of actually retrieving the data to an abstract class, IndexedDataReader, which could be a source on disk, a cache over an underlying IndexedDataReader, or even something computed on the fly. The data source and the caching mechanism are orthogonal in this architecture: you can pair any data source with any caching mechanism, or with no caching at all. I don't know how this should be designed if each type of source corresponds to a different layer. Duplicate all the caching code in each of the layers? Make a common base class that handles all the different types of caching? Multiple inheritance? Or a separate CacheLayer?

it's a behemoth at nine hundred lines, including hundreds of lines added to util/. You can help us out by making PRs the smallest useful diffs they can be (and this will probably result in more modular code to boot)

My intention was to give users freedom of choice in data format, and the design is modular (or so I hope), but with many small modules. In fact, the several choices I give the users are pretty trivial, and the corresponding classes are lightweight; they were meant to demonstrate the flexibility and extensibility of this design. Apparently this choice has backfired. Maybe we should instead agree on one true format and thus dramatically cut down the lines of code needed.

@netheril96
Contributor Author

@sguada @shelhamer @longjon

Perhaps I got ahead of myself when I accounted for various scenarios that may or may not be applicable in practice. After the discussion, here is my revised design:

  • The layer will map an int32 to an arbitrary blob.
    • About 64-bit keys: they are not going to be supported, as any integer not representable in 32 bits cannot be accurately represented by a float, which is what will be passed between layers. In addition, protobuf has trouble dealing with more than 2^32 - 1 items.
    • About string keys: they cannot be passed between layers, and therefore would require manual synchronization. That defeats the purpose of this layer.
    • About the values: there will still be no distinction between data and labels.
  • The PR will be split into two:
    • The first one deals with data that can be loaded into memory all at once. The data format is text protobuf: each entry contains the integer key and a floating-point array (a minimal sketch follows at the end of this comment). No more choice of format, no more caching.
    • The second one, possibly a different layer class, will deal with data that is too large to fit in main memory. The format is restricted to LMDB only. LMDB is a key-value store, which is exactly what we need here. Besides, it has already solved the problem of caching. Again, no choice of data format or caching mechanism.

These two decisions should cut down the lines of code needed and improve performance (the caching mechanism I wrote is primitive). If there is no major disagreement with this design, I will implement it in the following days. Hopefully feedback will be quicker this time.
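A minimal sketch of the entry message for the first PR (message and field names are assumptions) might be:

// Sketch only: one entry of the in-memory mapping, keyed by int32.
message IndirectionEntry {
  required int32 key = 1;   // the integer label produced by the data layer
  repeated float data = 2;  // the blob contents associated with that key
}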

@sguada
Contributor

sguada commented Dec 2, 2014

I think having multiple ways to define data/labels and load them is good, but it should be isolated from the layers, including DataLayers and IndirectionLayer (what about renaming it EmbeddingLayer or MappingLayer?)

So the indirection layer should allow a different label_source (or indexed_data), which can be Text, Binary, protobuf, LMDB, ... or whatever we want to define in the future.

I would create a new message in caffe.proto that allows defining different types of sources. An indirection layer would define one of those sources.

Assuming int32 for the index seems good.

@netheril96
Contributor Author

I think having multiple ways to define data/labels and load them is good, but it should be isolated from the layers, including DataLayers and IndirectionLayer (what about renaming it EmbeddingLayer or MappingLayer?)

So the indirection layer should allow a different label_source (or indexed_data), which can be Text, Binary, protobuf, LMDB, ... or whatever we want to define in the future.

I would create a new message in caffe.proto that allows defining different types of sources. An indirection layer would define one of those sources.

@sguada

So you think the current design of IndexedDataReader is reasonable?

It also appears that you want to unify DataLayer and IndirectionLayer, as well as the sources of the data/labels. I'm not sure how to implement this idea, given that DataLayer is intertwined with Datum. Datum is a very rigid class in that it forces any data (usually images) to be associated with exactly one integer, even if it may have no integer to associate with, or may have multiple ones. Unifying the two would require ditching Datum and breaking a lot of existing code. Do you have something in mind?

@amiralush

@mtamburrano, can you please share your multi-label training example with us?

@bhack
Contributor

bhack commented Dec 7, 2014

@amiralush When all the involved PRs are merged into dev, we can try to port our network to SVHN, if we have enough spare time, and share it in the model zoo.

@amiralush

@bhack Thanks. I've already managed to run the example mentioned in #523 with success.
However, with my own data set it fails. I guess that my label density is just too low (~6%). How would you tackle this problem, in your experience?

@netheril96
Contributor Author

It has come to my attention that HDF5DataLayer can now produce multiple blob outputs as well, which makes this layer less needed. Strange that no one mentioned it.

@shelhamer
Member

@netheril96 HDF5DataLayer learned to make multiple tops in #1183.
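For reference, with #1183 an HDF5 data layer simply declares one top per dataset stored in the listed HDF5 files; a prototxt sketch (the dataset and top names here are assumptions) is:

layers {
  name: "data"
  type: HDF5_DATA
  top: "data"     # each top name must match a dataset in the HDF5 files
  top: "label_a"
  top: "label_b"
  hdf5_data_param {
    source: "train_h5_list.txt"  # text file listing the HDF5 file paths
    batch_size: 64
  }
}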

@netheril96
Contributor Author

@shelhamer It is very odd that no one, including you, had even hinted at it.

@bhack
Contributor

bhack commented Jan 9, 2015

@shelhamer So will using HDF5DataLayer officially be the only way to produce multiple blobs?

@shelhamer
Member

@bhack @netheril96 HDF5DataLayer will not necessarily be alone in this. #1183 offered a simple +24 line patch for a helpful generalization of the data layer that's useful for problems with rich ground truth like multi-label classification, multivariate regression, multiple losses, and so on, so it was merged.

From recent experience it seems that forming a new, general data pipeline without a lot of complications is a tricky task, so light, decisively useful changes like #1183 are important steps.

@netheril96
Contributor Author

@shelhamer

So you mean that this IndirectionLayer is too heavy a step towards a general data pipeline?

@shelhamer
Member

@netheril96 I don't know yet -- more code takes more time to review and really work out the consequences for the framework.

It's important to have different proposals to understand the options and pick the right way, so thanks for this interpretation of how to load rich data from an integer index.

@shelhamer shelhamer added the JL label Mar 10, 2015
@netheril96 netheril96 closed this Apr 6, 2015
@netheril96 netheril96 reopened this Apr 6, 2015
@netheril96 netheril96 closed this Jun 17, 2015
@bhack
Contributor

bhack commented Jun 17, 2015

Can somebody tag this PR as abandoned by reviewers?
