Conversation
Works with plaidml/openvino#117 to solve plaidml/openvino#71; updates the openvino commit to enable PriorBoxOp
Signed-off-by: xin1.wang <xin1.wang@intel.com>
tzerrell
left a comment
We should rewrite this to not require the 2nd input to be a Constant tensor.
While making the changes necessary for that, also look for whether you can rewrite to remove some of the concatenates & contractions. My intuition is that there may be some room to streamline those. Perhaps I'm mistaken though; I can review in more detail once the other changes are in
const std::vector<std::vector<float>> maxSizes = {{315.0f}};
const std::vector<std::vector<float>> aspectRatios = {
    {1.0f},
    // {2.0f, 3.0f}
There was a problem hiding this comment.
Is this test case commented out to reduce the total number of tests, or because the case isn't working yet?
And I have the same question for the other parameters below, e.g. clip, flip, etc.
If it's for test size reduction, we should make 2 sets of tests, a smoke test that runs the current tests, and a non-smoke test that is more comprehensive (i.e., that runs all the currently-commented-out tests).
There was a problem hiding this comment.
Just to reduce the number of tests: there would be 700+ tests for this op, and in fact 1400+ if I add more detail. They may take hours on my side, so I'm not sure it's worth enabling all of them, as I've already passed them all locally.
There was a problem hiding this comment.
Open all the tests, and keep just a few in the smoke tests.
There was a problem hiding this comment.
I'm a little worried about how slow these tests are running. I would not expect PriorBox to be especially fast, but it's slower than I hoped for. I wonder if perhaps there's a way to use additional dimensions when constructing the output (perhaps a dimension for x vs y, a dimension for which aspect ratio we're on, etc) that would enable us to avoid some of the splits/concatenations currently required...
There was a problem hiding this comment.
I divided the tests into three parts with some fixed arguments to reduce the size; they cover the most common combinations, and there are now only 76 tests, which take 2 minutes on my side. As I have confirmed that the full set of combinations, covering all paths, passes, I think this is enough for now.
std::vector<int> input_shape = input_shape_ngraph_op->get_vector<int>();
H = input_shape[0];
W = input_shape[1];
auto* image_shape_ngraph_op = ngraph::as_type<ngraph::op::Constant>(ctx.layer->get_input_node_ptr(1));
There was a problem hiding this comment.
This input should not be required to be a Constant. I don't believe it will be Constant in typical use cases, and it restricts the utility of this op to require a Constant here.
Unlike the other input, which affects the output shape and therefore must be a Constant (as PlaidML does not currently support dynamic shapes), I see no barriers to allowing this input to be a generic tensor. It looks like each use of IH and IW could be converted to use a Tensor instead, although some will need downstream changes as well.
You can use op::slice, or a simple edsl::Contraction to extract and separate the two values IH/IW as tensors.
There was a problem hiding this comment.
(although in places where IH/IW are used in similar ways, it is preferable to leave the values together to avoid the need to slice and then later re-concatenate)
There was a problem hiding this comment.
Yes, it seems operand 1 can be changed to a tensor; I will try this.
There was a problem hiding this comment.
The input operand layout is IH, IW, but in most cases we use the IW, IH layout. Keeping separate IW and IH tensors and then concatenating may reduce some calculations (sometimes we concatenate four tensors that differ only slightly, and I am not sure whether computing multiple times or concatenating multiple times is less expensive), so this version keeps IW and IH separate.
There was a problem hiding this comment.
I am uncertain of this, but I think concatenate is currently slower than repeating calculations, while in the long run repeated calculations may become the slower option. That's just a guess, though.
Therefore I would not duplicate calculations to remove concatenations.
auto CW = edsl::cast(edsl::index({edsl::TensorDim(H), edsl::TensorDim(W), edsl::TensorDim(1)}, 1), DType::FLOAT32);
auto CH = edsl::cast(edsl::index({edsl::TensorDim(H), edsl::TensorDim(W), edsl::TensorDim(1)}, 0), DType::FLOAT32);
There was a problem hiding this comment.
I notice the use of FLOAT32, but I wonder how this will affect networks that are intended to run with FLOAT16. Is there some other data type we can cast to rather than hard-coding the type?
There was a problem hiding this comment.
For precision, I just follow the reference implementation, since the output is currently compared against it. I think it makes sense to use a different precision if we don't need that comparison, which would require forcing RefMode to IE mode.
There was a problem hiding this comment.
I believe in all places FLOAT32 is currently used, it would be correct to replace with the requested output type. And that should be available as ctx.layer->get_element_type() since PriorBox is a single-output op.
There was a problem hiding this comment.
Yes, I'll do this. I use get_output_element_type(0) to get the type, as the comments on get_element_type() advise.
float step;
float step_x;
float step_y;
float offset;
There was a problem hiding this comment.
Generally I think we should use the highest precision floating point type (double) for constants and values computed at eDSL construction/compile time. Is there a reason why this wouldn't be possible in this context?
There was a problem hiding this comment.
This follows https://github.com/plaidml/openvino/blob/plaidml/ngraph/core/reference/include/ngraph/runtime/reference/prior_box.hpp#L95 to get the same results; it can be made flexible if the comparison is disabled.
There was a problem hiding this comment.
step is now a tensor, to go along with the IW/IH change.
…ino commit. Signed-off-by: xin1.wang <xin1.wang@intel.com>
jd-bartlett96
left a comment
I discussed this with @tzerrell and agree that it could be streamlined a bit to remove some unnecessary looping/concatenation/slicing, which I believe would improve the speed. Does what I suggest below make sense?
int density_s = 0, shift = 0;
float var = 0;

for (size_t s = 0; s < fixed_size_count; ++s) {
There was a problem hiding this comment.
If you've calculated lists of X-centers CX and Y-centers CY, each of length N, and with a list of sizes S of length M, you can avoid this loop and the associated concatenation at line 262 and calculate all the box corners at once with something to the effect of
F1 = [1,0,1,0]
F2 = [0,1,0,1]
P = [-1,-1,1,1]
D = [W,H,W,H]
B[i,j,k] = (F1[i]*CX[j] + F2[i]*CY[j] + P[i]*S[k])/(2*D[i])
where the shape of B is 4xNxM (which can be reshaped into the correct flattened form).
I imagine doing it in this way would allow you to avoid some of the concatenations within the loop as well.
There was a problem hiding this comment.
Also, an implementation detail for the last line of pseudocode:
Since you can't use explicit indices with elementwise operations, you will need to expand F1/F2/P/D/S/CX/CY to rank 3 tensors by adding size 1 dims, then rely on autobroadcasting. I.e., reshape F1, F2, and P to (4, 1, 1), reshape CX and CY to (1, N, 1), and reshape S to (1, 1, M); then you can just write B = (F1 * CX + F2 * CY + P * S) / (2 * D) and autobroadcasting will handle the indices for you.
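To make the suggested formula concrete, here is a minimal plain-C++ sketch (explicit nested loops rather than eDSL contractions or autobroadcasting, so the indexing is visible) that computes the same B as the pseudocode above; `priorCorners` is a hypothetical name for illustration only:

```cpp
#include <array>
#include <cassert>
#include <cmath>
#include <vector>

// Plain-C++ illustration (not eDSL) of the proposed broadcast formula:
//   B[i][j][k] = (F1[i]*CX[j] + F2[i]*CY[j] + P[i]*S[k]) / (2*D[i])
// Shapes: F1/F2/P/D are (4,), CX/CY are (N,), S is (M,); B is 4 x N x M,
// which can then be reshaped into the flattened output form.
std::vector<std::vector<std::vector<float>>> priorCorners(
    const std::vector<float>& CX, const std::vector<float>& CY,
    const std::vector<float>& S, float W, float H) {
  const std::array<float, 4> F1 = {1, 0, 1, 0};   // selects CX for xmin/xmax rows
  const std::array<float, 4> F2 = {0, 1, 0, 1};   // selects CY for ymin/ymax rows
  const std::array<float, 4> P = {-1, -1, 1, 1};  // -size/2 for min, +size/2 for max
  const std::array<float, 4> D = {W, H, W, H};    // normalization per coordinate
  const size_t N = CX.size(), M = S.size();
  std::vector<std::vector<std::vector<float>>> B(
      4, std::vector<std::vector<float>>(N, std::vector<float>(M)));
  for (size_t i = 0; i < 4; ++i)
    for (size_t j = 0; j < N; ++j)
      for (size_t k = 0; k < M; ++k)
        B[i][j][k] = (F1[i] * CX[j] + F2[i] * CY[j] + P[i] * S[k]) / (2 * D[i]);
  return B;
}
```

In eDSL the three loops would disappear: reshaping the operands to ranks (4,1,1), (1,N,1), and (1,1,M) and writing the elementwise expression once lets autobroadcasting produce the same 4xNxM result.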
There was a problem hiding this comment.
Yes, I have thought about how to remove this loop, but I did not find a way to remove the slice given the dynamic shape of each element, so I kept this path to avoid some duplicate calculations. The main reason is the "density" attribute: each element of fixed_size uses the density at its index, which gives each of them a different shape, since each produces density[i] * density[i] elements (a different Center_temp for each attribute in one loop iteration), and the concatenation for each attribute is slightly different (a different reshape for each C_dst). One path, as you mentioned, is to broadcast the centers to the result size, but this may need a separate slice for each attribute, or a concatenation of the other attributes, and may duplicate some calculation for the broadcast. I will think more about this.
There was a problem hiding this comment.
Ah, thanks for that explanation, we had not understood the significance of density.
I notice that density is an optional attribute and I suspect that it is frequently not set. I think it would be worthwhile to have a forked path where the more efficient computation method is used where density is unset or uniform, and the current method is used if density varies.
Alternatively, perhaps it would be possible to first do all the computations for the first density (in the manner @jd-bartlett96 suggested) and then where there are multiple densities follow up & concatenate those results; then reorder everything to the correct order. I suspect this may not be possible to do efficiently (in particular I think that reordering step might be very messy), but if there is a reasonable way to do this I think it's a good idea.
There was a problem hiding this comment.
It's a good idea to isolate the normal and special cases. I will try this, but it may take some time, as each change has to pass the full tests.
There was a problem hiding this comment.
As I checked the reference implementation in openvino, "density" always has the same count as "fixed_size", even though the openvino spec does not state this requirement, so it seems unavoidable to use density in all "fixed_size" paths, and I kept the current logic. I followed your advice to use auto-broadcast to replace some hand-written ops, but I could not find a way to combine all these conditions into one formula, since the shapes of the blocks are not equal; so there is still some concatenate and slice to combine the results.
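The shape mismatch described above can be made concrete with a small sketch. Assuming (per the discussion, not the actual kernel code) that fixed_size[s] with density[s] contributes density[s] * density[s] boxes per grid cell, the per-s blocks are ragged whenever densities differ, which is why one broadcast formula cannot cover them all; `boxesPerCell` is a hypothetical helper for illustration:

```cpp
#include <cstddef>
#include <vector>

// Sketch: with the `density` attribute, fixed_size[s] yields
// density[s] * density[s] boxes per grid cell, so each s produces a
// differently-sized block when the densities vary.
size_t boxesPerCell(const std::vector<size_t>& density) {
  size_t count = 0;
  for (size_t d : density) count += d * d;  // density[s]^2 boxes for fixed_size[s]
  return count;
}
```

For uniform density ({2, 2}: 4 + 4 boxes) the blocks match and a single broadcast works; for varying density ({1, 2, 3}: 1 + 4 + 9 boxes) the blocks are ragged, motivating the forked normal/special-case path suggested above.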
Signed-off-by: xin1.wang <xin1.wang@intel.com>
// Preprocess data with some attributes
void PriorBoxImpl::prepareConfig() {
  precision = to_plaidml(ctx.layer->get_output_element_type(0));
  auto* input_shape_ngraph_op = ngraph::as_type<ngraph::op::Constant>(ctx.layer->get_input_node_ptr(0));
There was a problem hiding this comment.
Please add a check for whether this is a null pointer, and if so, throw an exception indicating that this input must be a constant (similar to other ops that use as_type<ngraph::op::Constant>).
There was a problem hiding this comment.
Please make this an exception not an assertion. There are legal IRv10 models that do not have a Constant here. If a user runs such a model, they should get an exception indicating that the model is unsupported in PlaidML. This should happen even if they are not checking assertions.
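The requested guard could look something like the sketch below. It is self-contained for illustration (the template stands in for the pointer returned by ngraph::as_type<ngraph::op::Constant>, and `requireConstant` is a hypothetical helper name); the point is that a null result throws even in release builds, where assertions are compiled out:

```cpp
#include <cassert>
#include <stdexcept>
#include <string>

// Sketch of the requested check: if the downcast to Constant failed (null
// pointer), throw a descriptive exception rather than asserting, so a user
// running a legal IRv10 model without a Constant here gets a clear
// "unsupported in PlaidML" error instead of undefined behavior.
template <typename T>
T* requireConstant(T* maybe_constant, const std::string& op_name) {
  if (maybe_constant == nullptr) {
    throw std::runtime_error(
        op_name + ": this input must be a Constant; the model is unsupported in PlaidML");
  }
  return maybe_constant;
}
```

The same pattern (check, then throw) is what other ops that call as_type<ngraph::op::Constant> use, per the review comment above.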
Signed-off-by: xin1.wang <xin1.wang@intel.com>
tzerrell
left a comment
Thanks for the upgrades! There's one detail to fix mentioned inline, but otherwise this looks good to land.
I am not certain we've found all the ways to eliminate concatenates. However, I'm also not certain we haven't found all the ways to eliminate concatenates. I am confident that this is a complex op and further upgrades will require long and careful work. Given this, I want to land now and leave this as-is unless and until we have evidence that this op is causing substantive performance issues.
// Preprocess data with some attributes
void PriorBoxImpl::prepareConfig() {
  precision = to_plaidml(ctx.layer->get_output_element_type(0));
  auto* input_shape_ngraph_op = ngraph::as_type<ngraph::op::Constant>(ctx.layer->get_input_node_ptr(0));
There was a problem hiding this comment.
Please make this an exception not an assertion. There are legal IRv10 models that do not have a Constant here. If a user runs such a model, they should get an exception indicating that the model is unsupported in PlaidML. This should happen even if they are not checking assertions.
Signed-off-by: xin1.wang <xin1.wang@intel.com>
No description provided.