
Fix dead loop while the status of volume is error #126

Merged
lbajolet-hashicorp merged 11 commits into hashicorp:main from zhenggu:main
May 24, 2024

Conversation

@zhenggu
Contributor

@zhenggu zhenggu commented May 13, 2024

There is a dead-loop bug: when we create a volume successfully but its status afterwards becomes "error", the build hangs forever. The code

status, err := GetVolumeStatus(blockStorageClient, volumeID)

only checks the second return value, err, which reflects whether the OpenStack API call itself succeeded. It ignores the first return value, status, which reports the state of the volume. If err is nil but status is "error", the wait loop never terminates. This is easy to reproduce by setting the volume size smaller than the OS image size; the debug log looks like this:
[screenshot: debug log showing the wait loop repeating]

@zhenggu zhenggu requested a review from a team as a code owner May 13, 2024 06:25
@zhenggu zhenggu marked this pull request as draft May 13, 2024 06:45
@zhenggu zhenggu marked this pull request as ready for review May 13, 2024 07:19
@zhenggu
Contributor Author

zhenggu commented May 14, 2024

After the patch, the output will be:
[screenshot: debug output after the fix]

@zhenggu
Contributor Author

zhenggu commented May 14, 2024

@nywilken Can you help review this?

Contributor

@lbajolet-hashicorp lbajolet-hashicorp left a comment

Hey @zhenggu,

Thanks for opening this PR! I left a couple comments on the code, I think we can iterate on that and make it something simpler/more robust, but this is a good first draft imo.

Let me know when you're ready for another pass on my end, and if you need help with landing the next steps let me know!

@zhenggu
Contributor Author

zhenggu commented May 15, 2024

@lbajolet-hashicorp As you said, we can keep the error volume for troubleshooting, so I have removed the lines that cleaned up the bad volume. Can you review it again?

@zhenggu
Contributor Author

zhenggu commented May 21, 2024

@lbajolet-hashicorp any chance to review this again?

@lbajolet-hashicorp
Contributor

Hi @zhenggu,

Apologies it took me a bit to revisit this PR again.

I'm not sure I said something to the effect of "keeping the error volume for troubleshooting". That would be valid, but I'm not certain it's a good approach broadly speaking.

The GetVolumeStatus call succeeds and returns a volume state of "error". We don't necessarily know why at this point; all we know is that we won't be able to build. Do we really expect to get insight into why the volume errored after the VM gets torn down?
It feels to me like cleanup should happen, and if you're troubleshooting a build, that's a job for the --on-error flag, so Packer doesn't automatically clean everything up before you tell it to. That way you can inspect your VM/volumes.

I'm worried that systematically treating volume provisioning errors as an OK scenario and not cleaning up after ourselves will leave a bunch of volumes to clean up manually after the fact, so I'm cautious about this change.

One thing I probably wasn't clear about in the first place (sorry about that) was the cleanup strategy: I see we have a doCleanup flag which feels like a duplicate of volumeID, since both are set at the same time. Unless there's an option to not clean up the volume after the build, maybe?

I would suggest maybe revisiting the logic in the Run for the step:

  1. Can we remove doCleanup? If it has an external use (linked to a configuration flag, for example) we can't, but otherwise, at first glance, it looks like duplicated logic to me.
  2. Regarding volumeID, it should be set once the volume exists and cleared after the build ends, whether we successfully created the volume or there was an error during provisioning.
  3. In Cleanup, the logic can be simplified to return immediately if volumeID is empty, instead of looking at doCleanup.
  4. I'm not sure we need to call GetVolumeStatus at cleanup time, do we? If volumeID is not empty, a volume needs to be cleaned up; if the volume failed to be created in the first place, the check above ensures nothing executes here.
  5. In the end we can just call Delete on the volume, without caring about its state beforehand, unless I'm missing something, in which case please feel free to clarify.
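The simplified cleanup path suggested in the steps above might look like the following sketch. The step type, deleteVolume stub, and Cleanup signature are hypothetical illustrations, not the plugin's actual code; the real step would carry an OpenStack block-storage client and call its delete API.

```go
package main

import "fmt"

// step is a hypothetical stand-in for the builder's volume step;
// volumeID is set only once the volume actually exists, which is
// why no separate doCleanup flag is needed.
type step struct {
	volumeID string
}

// deleteVolume is a stub for the real block-storage delete call.
func deleteVolume(id string) error {
	fmt.Printf("deleting volume %s\n", id)
	return nil
}

// Cleanup returns immediately when no volume was created, and
// otherwise just deletes it -- no doCleanup flag, and no extra
// GetVolumeStatus call before deletion.
func (s *step) Cleanup() {
	if s.volumeID == "" {
		return // volume was never created; nothing to do
	}
	if err := deleteVolume(s.volumeID); err != nil {
		fmt.Printf("failed to delete volume %s: %s\n", s.volumeID, err)
	}
	s.volumeID = ""
}

func main() {
	s := &step{volumeID: "vol-123"}
	s.Cleanup()
	(&step{}).Cleanup() // no-op: volumeID was never set
}
```

Using the presence of volumeID as the single source of truth for "does cleanup have work to do" removes the duplicated doCleanup state and the status check entirely.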

Let me know if there's something else you need from me, and if you need help with the code feel free to ping, not sure when I'll be able to, but I can take a look at it as well.

Thanks!

@zhenggu
Contributor Author

zhenggu commented May 22, 2024

@lbajolet-hashicorp I totally agree with you, will submit a patch soon.

@zhenggu
Contributor Author

zhenggu commented May 23, 2024

@lbajolet-hashicorp I have updated the patch to restructure the volume cleanup logic, and found no issues in testing. Please have another look, thanks!

Contributor

@lbajolet-hashicorp lbajolet-hashicorp left a comment

Hi @zhenggu,

Thanks for the reroll, this looks cleaner to me!

I left a few last comments on the code, but I think we're close to a merge, let me know what you think and I'll let you come back to me if/when you've addressed those for another round of review.

zhenggu and others added 3 commits May 24, 2024 09:29
Co-authored-by: Lucas Bajolet <105649352+lbajolet-hashicorp@users.noreply.github.com>
@zhenggu
Contributor Author

zhenggu commented May 24, 2024

@lbajolet-hashicorp I have updated the code, please have another look.

Contributor

@lbajolet-hashicorp lbajolet-hashicorp left a comment

LGTM, thanks for the contribution @zhenggu!

Merging now

@lbajolet-hashicorp lbajolet-hashicorp merged commit 2e7e219 into hashicorp:main May 24, 2024