This repository was archived by the owner on Mar 29, 2025. It is now read-only.

Add PriorBox op and test.#1581

Merged
mergify[bot] merged 14 commits into plaidml-v1 from xin-add-PriorBoxOp
Dec 14, 2020

Conversation

@XinWangIntel
Contributor

No description provided.

@XinWangIntel
Contributor Author

Works with plaidml/openvino#117 to solve plaidml/openvino#71; changes the openvino commit to enable PriorBoxOp.

Signed-off-by: xin1.wang <xin1.wang@intel.com>
Contributor

@tzerrell tzerrell left a comment


We should rewrite this to not require the 2nd input to be a Constant tensor.

While making the changes necessary for that, also look for whether you can rewrite to remove some of the concatenates & contractions. My intuition is that there may be some room to streamline those. Perhaps I'm mistaken though; I can review in more detail once the other changes are in

const std::vector<std::vector<float>> maxSizes = {{315.0f}};
const std::vector<std::vector<float>> aspectRatios = {
{1.0f},
// {2.0f, 3.0f}
Contributor


Is this test case commented out to reduce the total number of tests, or because the case isn't working yet?

And I have the same question for the other parameters below, e.g. clip, flip, etc.

If it's for test size reduction, we should make 2 sets of tests, a smoke test that runs the current tests, and a non-smoke test that is more comprehensive (i.e., that runs all the currently-commented-out tests).

Contributor Author


Just to reduce the number of tests. There would be 700+ tests for this op, and in fact 1400+ if I add more detail; these may take hours on my side, so I'm not sure it's worth enabling all of them, since they all pass locally.

Contributor Author


Opened all tests and kept just a few in the smoke tests.

Contributor


I'm a little worried about how slow these tests are running. I would not expect PriorBox to be especially fast, but it's slower than I hoped for. I wonder if perhaps there's a way to use additional dimensions when constructing the output (perhaps a dimension for x vs y, a dimension for which aspect ratio we're on, etc) that would enable us to avoid some of the splits/concatenations currently required...

Contributor Author


I divided the tests into three parts, each with some fixed arguments, to reduce the size. This covers the most common combinations, and there are now only 76 tests, which take 2 minutes on my side. Since I have confirmed that the full set of combinations covering all paths passes, I think this is enough now.

std::vector<int> input_shape = input_shape_ngraph_op->get_vector<int>();
H = input_shape[0];
W = input_shape[1];
auto* image_shape_ngraph_op = ngraph::as_type<ngraph::op::Constant>(ctx.layer->get_input_node_ptr(1));
Contributor


This input should not be required to be a Constant. I don't believe it will be Constant in typical use cases, and it restricts the utility of this op to require a Constant here.

Unlike the other input, which affects the output shape and therefore must be a Constant (as PlaidML does not currently support dynamic shapes), I see no barriers to allowing this input to be a generic tensor. It looks like each use of IH and IW could be converted to use a Tensor instead, although some will need downstream changes as well.

You can use op::slice, or a simple edsl::Contraction to extract and separate the two values IH/IW as tensors.
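A hedged numpy analogue of that suggestion (the real code would use op::slice or an edsl::Contraction on Tensors; the names and values here are illustrative, not from the PR):

```python
import numpy as np

# Hypothetical stand-in for the runtime image-shape input; in the eDSL
# version this would be a Tensor, not a host-side Constant.
image_shape = np.array([300.0, 300.0])  # layout: [IH, IW]

# Extract IH and IW as size-1 tensors via slicing.
ih = image_shape[0:1]
iw = image_shape[1:2]

# Where IH/IW are used in similar ways, keep the values together to avoid
# a slice followed by a later re-concatenate: e.g. normalize
# (xmin, ymin, xmax, ymax) boxes by (IW, IH, IW, IH) in a single divide.
divisor = np.concatenate([iw, ih, iw, ih])    # shape (4,)
boxes = np.array([[10.0, 20.0, 40.0, 80.0]])  # one box row
normalized = boxes / divisor                  # broadcasts over rows
```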

Contributor


(although in places where IH/IW are used in similar ways, it is preferable to leave the values together to avoid the need to slice and then later re-concatenate)

Contributor Author


Yes, it seems operand 1 can be changed to a tensor; I will try this.

Contributor Author


The input operand layout is IH, IW, but in most cases we use the IW, IH layout. Using IW and IH for their related tensors and then concatenating may reduce some calculations (sometimes we concatenate four tensors that are only slightly different, but I am not sure whether calculating multiple times or concatenating multiple times is less expensive), so this version keeps IW and IH separate.

Contributor


I am uncertain of this, but I think concatenate is currently slower than repeated calculations, though in the long run repeated calculations may be slower. That's just a guess though.

Therefore I would not duplicate calculations to remove concatenations.

Comment on lines +120 to +121
auto CW = edsl::cast(edsl::index({edsl::TensorDim(H), edsl::TensorDim(W), edsl::TensorDim(1)}, 1), DType::FLOAT32);
auto CH = edsl::cast(edsl::index({edsl::TensorDim(H), edsl::TensorDim(W), edsl::TensorDim(1)}, 0), DType::FLOAT32);
Contributor


I notice the use of FLOAT32, but I wonder how this will affect networks that are intended to run with FLOAT16. Is there some other data type we can cast into rather than hard coding the type?

Contributor Author


For precision, I just follow the reference implementation, since the output is currently compared against it. I think it makes sense to use a different precision if we don't need to compare against it, which would require forcing RefMode to IE mode.

Contributor


I believe in all places FLOAT32 is currently used, it would be correct to replace with the requested output type. And that should be available as ctx.layer->get_element_type() since PriorBox is a single-output op.
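As a hedged numpy analogue of that change (the real code would cast edsl::index results to the type from get_output_element_type(0); the function name here is hypothetical):

```python
import numpy as np

def center_grids(h, w, dtype):
    """Build the CW/CH center-index grids in the requested element type
    instead of hard-coding float32 (analogous to casting edsl::index to
    the op's requested output type)."""
    ch, cw = np.indices((h, w))  # row (y) and column (x) indices
    return cw.astype(dtype), ch.astype(dtype)

# Hypothetical fp16 network: the grids follow the requested output type.
cw16, ch16 = center_grids(2, 3, np.float16)
```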

Contributor Author


Yes, I'll do this. I use get_output_element_type(0) to get the type, as the comments on get_element_type() advise.

Comment on lines +49 to +52
float step;
float step_x;
float step_y;
float offset;
Contributor


Generally I think we should use the highest precision floating point type (double) for constants and values computed at eDSL construction/compile time. Is there a reason why this wouldn't be possible in this context?

Contributor Author


Made step a tensor for the change to IW and IH.

…ino commit.

Signed-off-by: xin1.wang <xin1.wang@intel.com>
Contributor

@jd-bartlett96 jd-bartlett96 left a comment


I discussed this with @tzerrell and agree that it could be streamlined a bit to remove some unnecessary looping/concatenation/slicing, which I believe would improve the speed. Does what I suggest below make sense?

int density_s = 0, shift = 0;
float var = 0;

for (size_t s = 0; s < fixed_size_count; ++s) {
Contributor

@jd-bartlett96 jd-bartlett96 Dec 2, 2020


If you've calculated lists of X-centers CX and Y-centers CY, each of length N, and with a list of sizes S of length M, you can avoid this loop and the associated concatenation at line 262 and calculate all the box corners at once with something to the effect of

F1 = [1,0,1,0]
F2 = [0,1,0,1]
P = [-1,-1,1,1]
D = [W,H,W,H]
B[i,j,k] = (F1[i]*CX[j] + F2[i]*CY[j] + P[i]*S[k])/(2*D[i])

where the shape of B is 4xNxM (which can be reshaped into the correct flattened form).

I imagine doing it in this way would allow you to avoid some of the concatenations within the loop as well.

Contributor


Also, an implementation detail for the last line of pseudocode:

Since you can't use explicit indices with elementwise operations, you will need to expand F1/F2/P/D/S/CX/CY to rank 3 tensors by adding size 1 dims, then rely on autobroadcasting. I.e., reshape F1, F2, and P to (4, 1, 1), reshape CX and CY to (1, N, 1), and reshape S to (1, 1, M); then you can just write B = (F1 * CX + F2 * CY + P * S) / (2 * D) and autobroadcasting will handle the indices for you.
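A hedged numpy sketch of this broadcasting scheme (illustrative N, M, and values; the eDSL version would use reshapes plus autobroadcast rather than numpy arrays):

```python
import numpy as np

N, M = 3, 2  # N center positions, M box sizes (illustrative values)

# Hypothetical centers and sizes; in the op these would come from
# edsl::index and the size attributes.
CX = np.linspace(0.5, 2.5, N)   # x-centers, length N
CY = np.linspace(1.5, 3.5, N)   # y-centers, length N
S = np.array([1.0, 2.0])        # box sizes, length M
W, H = 10.0, 8.0                # grid width and height

# Reshape everything to rank 3 and let broadcasting supply the indices.
F1 = np.array([1., 0., 1., 0.]).reshape(4, 1, 1)   # selects CX rows
F2 = np.array([0., 1., 0., 1.]).reshape(4, 1, 1)   # selects CY rows
P = np.array([-1., -1., 1., 1.]).reshape(4, 1, 1)  # -S for min corners, +S for max
D = np.array([W, H, W, H]).reshape(4, 1, 1)        # per-coordinate normalizer
CXr, CYr = CX.reshape(1, N, 1), CY.reshape(1, N, 1)
Sr = S.reshape(1, 1, M)

# All 4*N*M box corners in one broadcasted elementwise expression.
B = (F1 * CXr + F2 * CYr + P * Sr) / (2 * D)       # shape (4, N, M)
```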

Contributor Author


Yes, I had thought about how to remove this loop, but I did not find a way to remove the slice, given the dynamic shape of each element, so I kept this path to avoid some duplicate calculations. The main reason is the attribute "density": each element of fixed_size uses the density at its index, which gives each of them a different shape, since each produces density[i] * density[i] elements (a different Center_temp for each attribute in one loop), and the concatenation for each attribute is a little different (a different reshape for each C_dst). One path, like the one you mentioned, is to broadcast the center to the result size, but that may need a separate slice for each attribute, or a concatenation of the other attributes, and may duplicate some calculation for the broadcast. I will think more about this.

Contributor


Ah, thanks for that explanation, we had not understood the significance of density.

I notice that density is an optional attribute and I suspect that it is frequently not set. I think it would be worthwhile to have a forked path where the more efficient computation method is used where density is unset or uniform, and the current method is used if density varies.

Alternatively, perhaps it would be possible to first do all the computations for the first density (in the manner @jd-bartlett96 suggested) and then where there are multiple densities follow up & concatenate those results; then reorder everything to the correct order. I suspect this may not be possible to do efficiently (in particular I think that reordering step might be very messy), but if there is a reasonable way to do this I think it's a good idea.
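A hedged Python sketch of that forked dispatch (all names and the box math here are illustrative toys, not the PR's actual implementation):

```python
import numpy as np

def boxes_general(fixed_sizes, densities):
    """General path: per-size loop, where each fixed size contributes a
    block of a different density[i] * density[i] shape, concatenated."""
    blocks = []
    for size, d in zip(fixed_sizes, densities):
        offsets = (np.arange(d) + 0.5) / d      # d shifted sub-centers
        cx, cy = np.meshgrid(offsets, offsets)  # d*d center grid
        blocks.append(np.stack([cx.ravel(), cy.ravel()], axis=1) * size)
    return np.concatenate(blocks, axis=0)

def boxes_uniform(fixed_sizes, density):
    """Fast path for unset/uniform density: one broadcasted computation,
    with no per-size concatenation of differently shaped blocks."""
    offsets = (np.arange(density) + 0.5) / density
    cx, cy = np.meshgrid(offsets, offsets)
    centers = np.stack([cx.ravel(), cy.ravel()], axis=1)  # (d*d, 2)
    sizes = np.asarray(fixed_sizes).reshape(-1, 1, 1)     # (S, 1, 1)
    return (sizes * centers).reshape(-1, 2)               # (S*d*d, 2)

def prior_box_centers(fixed_sizes, densities=None):
    # Fork: take the vectorized path when density is unset or uniform.
    if not densities or len(set(densities)) == 1:
        return boxes_uniform(fixed_sizes, densities[0] if densities else 1)
    return boxes_general(fixed_sizes, densities)
```

Both paths produce the same rows in the same order when the density is uniform, so the fork is transparent to callers.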

Contributor Author


It's a good idea to isolate the normal and special cases. I will try this, but it may take some time, as each change has to pass the full tests.

Contributor Author


Checking the reference implementation in openvino, "density" always has the same count as "fixed_size", even though the openvino spec does not state this requirement, so it seems inevitable to use density in all "fixed_size" paths, and I have kept the current logic. I followed your advice to use autobroadcast to replace some hand-written ops, but I did not find a way to combine all these conditions into one formula, since the shapes of the blocks are unequal, so there are still some concatenates and slices to combine the results.

Signed-off-by: xin1.wang <xin1.wang@intel.com>
// Preprocess data with some attributes
void PriorBoxImpl::prepareConfig() {
precision = to_plaidml(ctx.layer->get_output_element_type(0));
auto* input_shape_ngraph_op = ngraph::as_type<ngraph::op::Constant>(ctx.layer->get_input_node_ptr(0));
Contributor


Please add a check for whether this is a null pointer, and if so, throw an exception indicating that this input must be a constant (similar to other ops that use as_type<ngraph::op::Constant>).

Contributor Author


Done

Contributor


Please make this an exception not an assertion. There are legal IRv10 models that do not have a Constant here. If a user runs such a model, they should get an exception indicating that the model is unsupported in PlaidML. This should happen even if they are not checking assertions.

Contributor Author


Done

Contributor

@tzerrell tzerrell left a comment


Thanks for the upgrades! There's one detail to fix mentioned inline, but otherwise this looks good to land.

I am not certain we've found all the ways to eliminate concatenates. However, I'm also not certain we haven't found all the ways to eliminate concatenates. I am confident that this is a complex op and further upgrades will require long and careful work. Given this, I want to land now and leave this as-is unless and until we have evidence that this op is causing substantive performance issues.

Signed-off-by: xin1.wang <xin1.wang@intel.com>
@mergify mergify bot merged commit 927821d into plaidml-v1 Dec 14, 2020
@mergify mergify bot deleted the xin-add-PriorBoxOp branch December 14, 2020 17:57
