Skip to content

G-API: oneVPL (simplification) added CPU, DX11(fallback CPU) accels & surface pool#20727

Merged
alalek merged 4 commits intoopencv:masterfrom
sivanov-work:merge_vpl_accel_impl
Sep 23, 2021
Merged

G-API: oneVPL (simplification) added CPU, DX11(fallback CPU) accels & surface pool#20727
alalek merged 4 commits intoopencv:masterfrom
sivanov-work:merge_vpl_accel_impl

Conversation

@sivanov-work
Copy link
Copy Markdown
Contributor

@sivanov-work sivanov-work commented Sep 21, 2021

this PR is the single one in series of #20469
Added VPL CPU accelerator, DX11 ( fallback CPU) accelerator & surface pool

Pull Request Readiness Checklist

See details at https://github.com/opencv/opencv/wiki/How_to_contribute#making-a-good-pull-request

  • I agree to contribute to the project under Apache 2 License.
  • To the best of my knowledge, the proposed patch is not based on a code under GPL or other license that is incompatible with OpenCV
  • The PR is proposed to proper branch
  • There is reference to original bug report and related work
  • There is accuracy test, performance test and test data in opencv_extra repository, if applicable
    Patch to opencv_extra has the same branch name.
  • The feature is well documented and sample code can be built with the project CMake

Build configuration

force_builders=Custom,Custom Win,Custom Mac
build_gapi_standalone:Linux x64=ade-0.1.1f
build_gapi_standalone:Win64=ade-0.1.1f
build_gapi_standalone:Mac=ade-0.1.1f
build_gapi_standalone:Linux x64 Debug=ade-0.1.1f

Xbuild_image:Custom=centos:7
Xbuildworker:Custom=linux-1
build_gapi_standalone:Custom=ade-0.1.1f

build_image:Custom=ubuntu-openvino-2021.3.0:20.04
build_image:Custom Win=openvino-2021.2.0
build_image:Custom Mac=openvino-2021.2.0

test_modules:Custom=gapi,python2,python3,java
test_modules:Custom Win=gapi,python2,python3,java
test_modules:Custom Mac=gapi,python2,python3,java

buildworker:Custom=linux-1
test_opencl:Custom=OFF
test_bigdata:Custom=1
test_filter:Custom=*

Comment on lines +29 to +33
for (auto& pair : pool_table) {
pair.second.clear();
// do not free key here: last surface will release it
}
pool_table.clear();
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Doesn't this happen automatically?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it does, I wanted to make sure that all pointers are released straight before print "destroyed".

But later the next situation was emerged when some data consumer was keeping referenced surface in MediaFrame. So clear for pool_table here doesn't means that all pointers released until last consumer takes MediaFrame.

So "destoyed" doesn't mean free and this row might be removed

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So clear for pool_table here doesn't means that all pointers released until last consumer takes MediaFrame.

Sorry but all this object lifetime thing is still not very clear to me. Can you please explain in detail how these different objects relate?

Copy link
Copy Markdown
Contributor Author

@sivanov-work sivanov-work Sep 22, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, sure:

Accelerator created shared_ptr<void> for preallocated_memory and used its raw ptr as an unique key for a pool in pools table. Let's assume that "pool" here is a simple std::vector of "surface" ( actually shared_ptr<surface>). So
nor pool owns preallocated memory nor 'Accelerator.
as oneVPL mentioned - it's important for CPU that all surfaces shares continuous memory allocation (with different offset). According to ownership idea, this continuous memory area must exist until the last surface destroyed. This continuous area is preallocated memory which passed to every surface as shared_ptr argument during surface constructions. So when last surface dies it will release last shared_ptr for preallocated memory
Surfaces as shared_ptr<surface> remembered in pool ( std::vector or CachedPool) in pools table in Accelerator.
Additionally, surface as shared_ptr uses in MediaFrame construction which also shared its ownership. So Accellerator has shared ownership for surface through pool and MediaFrame shares surface ownership as aggregation in class.

If MediaFrames (usually in another thread) was destroyed before Accelerator then final shared_ptr<surface> release would happen in Accelerator in its dtor. After last surface shared_ptr release it will release last instance of shared_ptr for preallocated_memory (Normal shutdown)

If Accelerator dies first right before MediaFrame (living in another thread) then Accelerator just decrease surface reference counting during pool destruction. The last MediaFrame would keep last shared_ptr<surface > in its own thread. When it dies then last surface reference counter will reach 0 and surface destroyes preallocated_memory too (Delayed shutdown).

P.S. This problem is more acute for DX11 pool, because surface need living Accelerator object for destruction.... I'll check it in the DX11 PRs more precisely

P.P.S. Additionally, problem for "pool reallocation" is exposed here

it's important for CPU that all surfaces shares continuous memory allocation (with different offset).

I cannot simply reallocate preallocated_memory for bigger size in runtime, because shared_ptr<surfaces> (which owns this preallocated_memory) may or may not be assigned to MediaFrames in different threads and still be in-use. to solve this problem I need make COPY for all depended surfaces involved in VPL decode (and not assinged as MediaFrames) at prev step (because they may be dependendt as I/B frames) and realloc new pool with new key. So gready-pool is unavailable at now, and may require additional investigation (orphant MediaFrames)

#endif

if (!preallocated_pool_memory_ptr) {
throw std::runtime_error("VPLCPUAccelerationPolicy::create_surface_pool - failed: not enough memory."
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it throw this now on Linux by default?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good question... I aimed to write it for WIN-only as discussed in feature request. But most part of other code was written without WIN-specific details and I assume it would work on Linux too. But it hasn't been tested

So there are two possible way:
-Spent time to test/fix Linux incompatible now
-or put message for Windows-only on source creation.

I suggest the second one. But! it maybe make sense to ask another team member for trying to set it up under Linux (with removing such incompatibles) to keep sharing code expertise. @dmatveev, what do you think?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 to the second one, and I am definitely sure we'll have to keep Linux in mind - maybe in the backlog

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Issue was added in backlog

GAPI_LOG_INFO(nullptr, "Released workspace memory: " << ptr);
ptr = nullptr;
#else
abort(); //not implemented
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

abort() is a wrong way to indicate error for a library I believe - but it seems it can't be an exception here (due to destructor context) - right?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

agree, i suggest to put something up like

#else
        GAPI_Assert(false && "Not implemented for systems differ than \"_WIN32\". "
                             "Please feel free to set it up under OPENCV contribution policy");
#endif

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok

Comment on lines +87 to +102
try {
for (; i < pool_size; i++) {
size_t preallocated_mem_offset = static_cast<size_t>(i) * surface_size_bytes;

surface_ptr_t surf_ptr = creator(workspace_mem_owner,
preallocated_mem_offset,
preallocated_raw_bytes);
pool.push_back(std::move(surf_ptr));
}
} catch (const std::exception& ex) {
throw std::runtime_error(std::string("VPLCPUAccelerationPolicy::create_surface_pool - ") +
"cannot construct surface index: " + std::to_string(i) + ", error: " +
ex.what() +
"Requested surface count: " + std::to_string(pool_size) +
", surface bytes: " + std::to_string(surface_size_bytes));
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we really need this handling here? What throws here in particular - is it our code or the external one?

Any exception handling within our module complicates the future debug (or I miss something)

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From my perspective his method creates_surface_pool has the same semantics as operator new which normally throws exception std::bad_alloc to make for caller to take decision: decrease requested memory size or crash.
later we will introduce external params for configured pool size in optimization purposes (preallocated frames count due to SafeQueue size to avoid decoder stall when waiting till drawing thread or other consumer releases MediaFrames) and customization might required

Any exception handling within our module complicates the future debug

Not sure:

  1. in the next PRs this part would be handled by caller side in initialize_session
    https://github.com/sivanov-work/opencv/pull/4/files#diff-4abf50e2ded39bdb37c29822533abed9cd6395cc8b9826c667e3cd10af59c9b0R204
    then
    it would be handled by SourcePriv here
    https://github.com/sivanov-work/opencv/pull/5/files#diff-d1526c45e93418aa18b64502a4e01cc25151be376b891015a7d9049ec876fcddR216
    which would made all part for gracefull deinitialization: session close, mxf unload and so on with GAPI_LOG_WARNING printing

  2. go on with customization (prev paragrraph)

  3. "unit-testsable" for multiple scenarious

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok let's see in the future PRs

next_free_it = surfaces.begin();
}

CachedPool::surface_ptr_t CachedPool::find_free() {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this pool supposed to be thread safe or the protection is done outside?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pool considered as single-threaded. It contains constantly defined surface count and finds a first "free" one.
each pool is consumed (by key in accelerator table) by only one MFX session by requesting a "free" surface for the next decode frame portion.
Each MFX session makes async request to VPL for decode frame and provides "free" session from exclusive pool object (by key) . Looks like we can use only single thread to use up VPL decoder (CONSUMER thread).

"free" means LockedCounter == 0 in MFX surface inner structure it also means that this surface is not using by VPL decoder and we can use it for another decode or read-write it from application code. So Pool returns "free" surface if

  1. it is not currently used by VPL decode
    and...

Surface, as a data provider must be aggregated, by MediaFrame to allow read or write to it by application after MediaFrame::access, especially in another threads. So, when we create MediaFrame from fully decoded surface - the surface itself may or may not get LockedCounter ==0 (not 0 for I and B frames), but we additionally increment application atomic locked counter - ApplicationCounter. It happens in the CONSUMER thread after frame is decoded.
So created MediaFrame (LockedCounter ==0 or LockedCounter != 0 AND ApplicationCounter==1) just returned as data for another thread/threads.
So "free" means LockedCounter ==0 AND ApplicationCounter==0 in result

...and Pool returns "free" surface if

  1. and ApplicationCounter==0 too. Both counters must be 0

This ensure guarantees that pool doesn't return surface to decode (for CONSUMER thread) if it is used in MediaFrame in another threads.
Right after another thread destroy MediaFrame then ApplicationCounter decreased to 0 which means full "free" surface and pool may returns it for CONSUMER thread.

Classic solution for pool was to keep a queue and make pop in and insert back by critical section. In current case we just use lightweight atomic.load() for the next surface in list to make sure that it is free by application. But with single consume limitation (which is applicable here for decode model)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

MediaFrames may be destroyed concurrently, is it ok for this implementation?

Copy link
Copy Markdown
Contributor Author

@sivanov-work sivanov-work Sep 22, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, it is ok. It just decreased surface shared counter as shared_ptr<surface> and ApplicationCounter inside surface ( which helps surfrace became free in pool)

Comment on lines +29 to +33
for (auto& pair : pool_table) {
pair.second.clear();
// do not free key here: last surface will release it
}
pool_table.clear();
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it does, I wanted to make sure that all pointers are released straight before print "destroyed".

But later the next situation was emerged when some data consumer was keeping referenced surface in MediaFrame. So clear for pool_table here doesn't means that all pointers released until last consumer takes MediaFrame.

So "destoyed" doesn't mean free and this row might be removed

#endif

if (!preallocated_pool_memory_ptr) {
throw std::runtime_error("VPLCPUAccelerationPolicy::create_surface_pool - failed: not enough memory."
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good question... I aimed to write it for WIN-only as discussed in feature request. But most part of other code was written without WIN-specific details and I assume it would work on Linux too. But it hasn't been tested

So there are two possible way:
-Spent time to test/fix Linux incompatible now
-or put message for Windows-only on source creation.

I suggest the second one. But! it maybe make sense to ask another team member for trying to set it up under Linux (with removing such incompatibles) to keep sharing code expertise. @dmatveev, what do you think?

GAPI_LOG_INFO(nullptr, "Released workspace memory: " << ptr);
ptr = nullptr;
#else
abort(); //not implemented
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

agree, i suggest to put something up like

#else
        GAPI_Assert(false && "Not implemented for systems differ than \"_WIN32\". "
                             "Please feel free to set it up under OPENCV contribution policy");
#endif

Comment on lines +87 to +102
try {
for (; i < pool_size; i++) {
size_t preallocated_mem_offset = static_cast<size_t>(i) * surface_size_bytes;

surface_ptr_t surf_ptr = creator(workspace_mem_owner,
preallocated_mem_offset,
preallocated_raw_bytes);
pool.push_back(std::move(surf_ptr));
}
} catch (const std::exception& ex) {
throw std::runtime_error(std::string("VPLCPUAccelerationPolicy::create_surface_pool - ") +
"cannot construct surface index: " + std::to_string(i) + ", error: " +
ex.what() +
"Requested surface count: " + std::to_string(pool_size) +
", surface bytes: " + std::to_string(surface_size_bytes));
}
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From my perspective his method creates_surface_pool has the same semantics as operator new which normally throws exception std::bad_alloc to make for caller to take decision: decrease requested memory size or crash.
later we will introduce external params for configured pool size in optimization purposes (preallocated frames count due to SafeQueue size to avoid decoder stall when waiting till drawing thread or other consumer releases MediaFrames) and customization might required

Any exception handling within our module complicates the future debug

Not sure:

  1. in the next PRs this part would be handled by caller side in initialize_session
    https://github.com/sivanov-work/opencv/pull/4/files#diff-4abf50e2ded39bdb37c29822533abed9cd6395cc8b9826c667e3cd10af59c9b0R204
    then
    it would be handled by SourcePriv here
    https://github.com/sivanov-work/opencv/pull/5/files#diff-d1526c45e93418aa18b64502a4e01cc25151be376b891015a7d9049ec876fcddR216
    which would made all part for gracefull deinitialization: session close, mxf unload and so on with GAPI_LOG_WARNING printing

  2. go on with customization (prev paragrraph)

  3. "unit-testsable" for multiple scenarious

next_free_it = surfaces.begin();
}

CachedPool::surface_ptr_t CachedPool::find_free() {
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pool considered as single-threaded. It contains constantly defined surface count and finds a first "free" one.
each pool is consumed (by key in accelerator table) by only one MFX session by requesting a "free" surface for the next decode frame portion.
Each MFX session makes async request to VPL for decode frame and provides "free" session from exclusive pool object (by key) . Looks like we can use only single thread to use up VPL decoder (CONSUMER thread).

"free" means LockedCounter == 0 in MFX surface inner structure it also means that this surface is not using by VPL decoder and we can use it for another decode or read-write it from application code. So Pool returns "free" surface if

  1. it is not currently used by VPL decode
    and...

Surface, as a data provider must be aggregated, by MediaFrame to allow read or write to it by application after MediaFrame::access, especially in another threads. So, when we create MediaFrame from fully decoded surface - the surface itself may or may not get LockedCounter ==0 (not 0 for I and B frames), but we additionally increment application atomic locked counter - ApplicationCounter. It happens in the CONSUMER thread after frame is decoded.
So created MediaFrame (LockedCounter ==0 or LockedCounter != 0 AND ApplicationCounter==1) just returned as data for another thread/threads.
So "free" means LockedCounter ==0 AND ApplicationCounter==0 in result

...and Pool returns "free" surface if

  1. and ApplicationCounter==0 too. Both counters must be 0

This ensure guarantees that pool doesn't return surface to decode (for CONSUMER thread) if it is used in MediaFrame in another threads.
Right after another thread destroy MediaFrame then ApplicationCounter decreased to 0 which means full "free" surface and pool may returns it for CONSUMER thread.

Classic solution for pool was to keep a queue and make pop in and insert back by critical section. In current case we just use lightweight atomic.load() for the next surface in list to make sure that it is free by application. But with single consume limitation (which is applicable here for decode model)

[](const surface_ptr_t& val) {
GAPI_DbgAssert(val && "Pool contains empty surface");
return !val->get_locks_count();
});
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

circular buffer imitation

Copy link
Copy Markdown
Contributor

@dmatveev dmatveev left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok to me, stuff can be fixed with the next set of PRs

Comment on lines +11 to +12
//TODO Remove the next MACRO right after DX11 implementation
#define CPU_ACCEL_ADAPTER
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok, let's avoid this macro-defined behavior in the future (if I get what's being addressed there)

Comment on lines +57 to +60
next_free_it = it;
++next_free_it;

return *it;
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Formatting may look strange

@sivanov-work
Copy link
Copy Markdown
Contributor Author

@alalek or @mshabunin could you please merge it?
There is an one unaddressed comment about DX11 linkage: https://github.com/opencv/opencv/pull/20727/files#r713648649
Is there any best known practices from your side?

@alalek alalek merged commit 54386c8 into opencv:master Sep 23, 2021
@alalek alalek mentioned this pull request Oct 15, 2021
@sivanov-work sivanov-work deleted the merge_vpl_accel_impl branch March 5, 2022 05:49
a-sajjad72 pushed a commit to a-sajjad72/opencv that referenced this pull request Mar 30, 2023
G-API: oneVPL (simplification) added CPU, DX11(fallback CPU) accels & surface pool

* Add CPU, DX11(fallback CPU) accels & surface pool

* Fix build for surface_pool

* Apply some comments

* Fix indentation
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants