sync data to placement raise resource in use exception

Bug #2017513 reported by Wenping Song
This bug affects 1 person
Affects: Cyborg (OpenStack)
Status: New
Importance: Undecided
Assigned to: Unassigned
Milestone: (none listed)

Bug Description

When we remove a PCI device that is in use from the host, the cyborg-agent raises a 'resource provider in use' exception.

2023-04-24 14:50:08.267 1 ERROR oslo_service.periodic_task [None req-3493a3b4-7275-46cb-8767-253b414eac60 - - - - -] Error during AgentManager.update_available_resource: cyborg.common.exception_Remote.ResourceProviderInUse_Remote: An unknown exception occurred.
Traceback (most recent call last):

  File "/var/lib/kolla/venv/lib/python3.6/site-packages/oslo_messaging/rpc/server.py", line 163, in _process_incoming
    res = self.dispatcher.dispatch(message)

  File "/var/lib/kolla/venv/lib/python3.6/site-packages/oslo_messaging/rpc/dispatcher.py", line 265, in dispatch
    return self._do_dispatch(endpoint, method, ctxt, args)

  File "/var/lib/kolla/venv/lib/python3.6/site-packages/oslo_messaging/rpc/dispatcher.py", line 194, in _do_dispatch
    result = func(ctxt, **new_args)

  File "/var/lib/kolla/venv/lib/python3.6/site-packages/cyborg/conductor/manager.py", line 86, in report_data
    old_driver_device_list, driver_device_list)

  File "/var/lib/kolla/venv/lib/python3.6/site-packages/cyborg/conductor/manager.py", line 112, in drv_device_make_diff
    self._delete_provider_and_sub_providers(context, rp_uuid)

  File "/var/lib/kolla/venv/lib/python3.6/site-packages/cyborg/conductor/manager.py", line 388, in _delete_provider_and_sub_providers
    self.placement_client.delete_provider(rp["uuid"])

  File "/var/lib/kolla/venv/lib/python3.6/site-packages/cyborg/common/placement_client.py", line 313, in delete_provider
    raise exception.ResourceProviderInUse()

cyborg.common.exception.ResourceProviderInUse: An unknown exception occurred.

chandan kumar (chkumar246) wrote :

I am not sure I understand this bug correctly. Are we removing the PCI device from cyborg.conf, or trying to delete the PCI device from the device profile?

I tried to create a reproducer by removing the PCI device from the config. Below is the error:
(venv) stack@cyborg-devstack:/opt/stack/cyborg$ openstack accelerator device attribute list
+--------------------------------------+---------------+--------+----------------------------+
| uuid | deployable_id | key | value |
+--------------------------------------+---------------+--------+----------------------------+
| 3ebba534-0aaf-4681-880d-a01a6f001c30 | 2 | rc | CUSTOM_PCI |
| c54284b2-b042-4995-8aa9-8a92fd9782a3 | 2 | trait0 | CUSTOM_PCI_PRODUCT_ID_0010 |
+--------------------------------------+---------------+--------+----------------------------+
(venv) stack@cyborg-devstack:/opt/stack/cyborg$ openstack accelerator device list
+--------------------------------------+------+--------+-----------------+--------------------------------------------+
| uuid | type | vendor | hostname | std_board_info |
+--------------------------------------+------+--------+-----------------+--------------------------------------------+
| c81426fe-e130-45f0-bdd5-278776173f12 | GPU | 1b36 | cyborg-devstack | {"product_id": "0010", "controller": null} |
+--------------------------------------+------+--------+-----------------+--------------------------------------------+
(venv) stack@cyborg-devstack:/opt/stack/cyborg$ openstack accelerator device profile create bug2017513 '[{"resources:CUSTOM_PCI": "1", "trait:CUSTOM_PCI_PRODUCT_ID_0010": "required"}]'
+-------------+---------------------------------------------------------------------------------+
| Field | Value |
+-------------+---------------------------------------------------------------------------------+
| created_at | 2026-05-06 05:26:51+00:00 |
| updated_at | None |
| uuid | 32a5bf34-a7f9-419d-939e-143eef0e04cd |
| name | bug2017513 |
| groups | [{'resources:CUSTOM_PCI': '1', 'trait:CUSTOM_PCI_PRODUCT_ID_0010': 'required'}] |
| description | None |
+-------------+---------------------------------------------------------------------------------+
(venv) stack@cyborg-devstack:/opt/stack/cyborg$ openstack accelerator device profile list
+--------------------------------------+------------+---------------------------------------------------------------------------------+-------------+
| uuid | name | groups | description |
+--------------------------------------+------------...

sean mooney (sean-k-mooney) wrote :

just putting this here for future us

15:04 <sean-k-mooney> https://github.com/openstack/nova/commit/26c41eccade6412f61f9a8721d853b545061adcc https://github.com/openstack/nova/commit/284ea72e96604bdf16d1c5c4db47247334841b2f https://github.com/openstack/nova/commit/0208be629c3853863bcd49b8bdbe2b9889b85012 https://github.com/openstack/nova/commit/f37cdf0c4182103ad81dbf39188ff39955da3850
15:04 <sean-k-mooney> those are the nova patches related to the issue reported in https://bugs.launchpad.net/openstack-cyborg/+bug/2017513
15:05 <sean-k-mooney> https://bugs.launchpad.net/nova/+bug/1633120 https://bugs.launchpad.net/nova/+bug/1969496 and https://bugs.launchpad.net/nova/+bug/2115905 are the related nova bugs we had
15:07 <sean-k-mooney> the tl;dr is: if a device is referenced by an ARQ and that device is no longer in the whitelist or visible on the host, we cannot remove the device from the DB or placement until that ARQ is deleted, and we should not do that automatically
15:07 <sean-k-mooney> the admin needs to move or delete the VM, or re-add the device
15:08 <sean-k-mooney> we should complain very, very loudly in the logs when the compute agent starts up in a misconfigured state, but we don't generally want to make that an agent startup failure, as that is a potential DoS vector if we do

the tl;dr is we should not allow removing tracking of a device that is currently allocated, even if the driver no longer allows that device to be managed.

removing it is operator error, but it should not cause the agent to fail to start or the sync command to fail.

this will require some non-trivial thought to resolve correctly.

nova's pci tracker had a similar design issue that was fixed, but those fixes never made it to cyborg's version
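
For illustration only, the rule above could be sketched roughly as below. This is not Cyborg or nova code: `StaleDevice`, `split_stale_devices` and the `in_use_rp_uuids` set are hypothetical names, and how the set of in-use resource providers would actually be derived (for example from ARQs) is left as an assumption. The idea is just to keep tracking any stale device that is still allocated, log loudly about it, and only hand the unallocated ones on for deletion, without raising.

```python
import logging
from dataclasses import dataclass

LOG = logging.getLogger(__name__)


@dataclass(frozen=True)
class StaleDevice:
    """Hypothetical record: a tracked device the driver no longer reports."""
    rp_uuid: str   # resource provider UUID in placement
    address: str   # PCI address, used only for the log message


def split_stale_devices(stale, in_use_rp_uuids):
    """Partition stale devices into (safe_to_delete, must_keep).

    `in_use_rp_uuids` is assumed to be the set of resource provider UUIDs
    that still have allocations; how it is built is out of scope here.
    """
    safe, kept = [], []
    for dev in stale:
        if dev.rp_uuid in in_use_rp_uuids:
            # Operator error: complain loudly, but do not raise, so the
            # periodic sync and agent startup keep working.
            LOG.error(
                "Device %s (resource provider %s) is no longer reported by "
                "the driver but is still allocated; keeping it tracked. "
                "Move or delete the instance, or re-add the device.",
                dev.address, dev.rp_uuid)
            kept.append(dev)
        else:
            safe.append(dev)
    return safe, kept
```

With a sketch like this, the sync path would call the placement delete only for the first list, so an allocated provider never reaches the call that raised ResourceProviderInUse in the traceback above.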
