Prototype 'skip_role_on_error: True' in 4 roles {lokole, moodle, mongodb, sugarizer} #3255
holta merged 1 commit into iiab:master
If this is expanded to include the roles that rely on mariadb, this technique could be very useful for image creation.
Recap: Some IIAB implementers/testers/devs want to see errors immediately (e.g. when running ./runrole, ./iiab-install, etc.), whereas others want to bypass those errors (e.g. Admin Console, ./runroles, etc.) "until later." So this PR tries to tease apart and acknowledge some of these very different needs, which some IIAB implementers/testers/developers expect but others do not.
I wrote runroles and admin console and I have no wish to bypass errors until later. I'm not even sure what that means. All I want is for a role that knows it will fail due to lack of support for a particular OS to do so gracefully.
I also don't know what the reference to the readme is for; there's no mention of errors later.
This PR doesn't directly solve 3242, but I can see a use for this functionality with automated testing.
Can you expand on 'fail' and 'gracefully'? Stop and display an error message, as I proposed in 3244?
Think #3242 (comment) is causing some confusion.
All I'm saying is that if a particular role can't support a particular environment it should say so and do nothing.
Before trying to install the role, as a precautionary check? We know in advance where the failure is going to occur 'with a particular environment'. Are you suggesting we just ignore the *_install variable for that role and silently skip it, rather than relying on 'skip_role_on_error' being 'True' (trying to install the role, failing, then silently proceeding)? The summary is a nice touch though.
There is no crystal ball or magic wand that will say whether a role will complete or not, and there never will be. Oddsmakers are useful (and entertaining!) but in the end that's all they are: oddsmakers. Counting balls and strikes may be interesting (and yes, necessary, to monitor and improve reliability!) but the higher-level need here is the people who want to (1) run long Ansible jobs largely uninterrupted, to install/enable many roles (i.e. many IIAB apps/services), and (2) then address the consequences later. Some of these roles will succeed and some will fail, for innumerable and different reasons (e.g. an apt update applied last night fixing/breaking the OS, and/or any upstream CI app fixing/breaking last night, and countless other reasons such as transient root causes, quasi-permanent root causes, networking/mirror failures, hardware failures, etc.). This important need has been overlooked for too long. So this PR works towards allowing everything to be batched up in service to this growing population of people (who want to batch everything up, then deal with the consequences later, whether they have time later or not), regardless of how that batch process of installing many roles (in a more uninterrupted way) is invoked. Just one example is those installing a preset / learning bouquet that's trying to install many such roles/apps, whether from Admin Console, or from a script like ./iiab-install or ./runroles, or by any other means.
Indeed! QA (community testing) is one of several demographics that will contribute far more effectively with a batch process working through the installing/enabling of many roles in a best-effort fashion (rather than hurry-up-and-wait, constantly watching the kettle boil...). Reason: Being randomly interrupted may be "addicting" according to many psychological studies (and extremely mentally taxing, ask almost anyone in on-call / retail service industries) but such "ADHD" workflows can also be the worst way to get serious work accomplished. 😱
I ran another successful test (LARGE-sized IIAB install on Ubuntu 22.04 not 20.04 this time) using Output confirms the 3 expected errors are trapped, with the error count ( With error detail (for all 3 errors in this case) reviewable here:
@deldesir thanks for having reviewed this quickly overnight! FYI he is eager to install arbitrary collections of IIAB
ASIDE: This PR does not time out individual roles that "get stuck", for example by installing too slowly (e.g. due to Internet problems from Haiti/India/etc.), but that use case might be considered in future if it proves to be a common problem.
Now that I see more of the details I think I misunderstood this PR. I thought it was a wrapper approach, which I oppose, rather than causing a role internally to fail gracefully, which is what I have been asking for. So, sorry if I got it wrong. If I now understand correctly, let me ask: when and why would I ever want skip_role_on_error to be False? Would it be when I want Ansible to terminate immediately, rather than completing the run and installing other roles, because I am only interested in testing one role? If that is the case, perhaps True should be the default, not False. I also suggest that skip_role_on_error might be skip_roles_on_error, as it is global.
easily done with Admin Console.
"Traditional craftsmanship" is the reason arguably. For those who feel industrial efficiency is too often impersonal (which it is!) So we should keep this legacy option to help Perfectionists (which are not at all uncommon in the QA community, and in fact often incredibly insightful people as a result of their OCD attention to fine detail, often the only people who catch critical underlying flaws...) So I guess the answer is to support these very important people who sometimes/often want to watch every little step of Ansible's output (deliberately, as this can be a learning experience) and then rerun things like
I'm sympathetic. But also torn trying to convey that just 1 role will be skipped per error. (I didn't want to accidentally convey that multiple roles would be skipped upon hitting a single error...)
It's been more than 24h so I'm going to merge this to encourage many more people to bang on it and provide feedback around next steps. Changes in strategy/naming/etc can and should certainly be made down the road, as we learn more.
I assume you mean in local vars. Please don't encourage anyone to modify default vars.
Yes! To clarify, most everyone should add/modify IIAB Variables[*] within /etc/iiab/local_vars.yml, prior to IIAB installation if possible. [*] Used by Ansible and other things too.
It's not just that humanity is impatient (the world is melting after all) but also that some prefer to race through all roles and then look at the individual errors later.
So this PR prototypes an unblocking mechanism, allowing Ansible to continue to the next role(s) even despite serious errors in individual roles.
It uses Ansible's block/rescue technique. All this really does is indent most all of ROLE/tasks/main.yml — and then tack this on the bottom:
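For readers unfamiliar with the pattern, here is a minimal sketch of Ansible's block/rescue guarded by a boolean. It is an illustration only and not the snippet from this PR: the role path, the package task and the message text are hypothetical, and only the `skip_role_on_error` variable name comes from the discussion.

```yaml
# Hypothetical roles/example/tasks/main.yml -- illustrative only, not this PR's code
- block:

    # The role's existing tasks, indented one level under "block:"
    - name: Install a placeholder package (hypothetical task)
      ansible.builtin.package:
        name: example-package
        state: present

  rescue:

    # Runs only if a task inside the block failed.
    # skip_role_on_error false: this task fails too, so the play stops here.
    # skip_role_on_error true: this task is skipped, the error counts as
    # rescued, and Ansible moves on to the next role.
    - name: Stop here unless skip_role_on_error is set
      ansible.builtin.fail:
        msg: "Role failed in task '{{ ansible_failed_task.name }}'; see the error above."
      when: not (skip_role_on_error | bool)
```

In a sketch like this, the variable would typically be set in /etc/iiab/local_vars.yml, as suggested elsewhere in this conversation, to choose between the two behaviours.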
This means that if you override default_vars.yml to set `skip_role_on_error: True`, you will get a `rescued=2` count of all bypassed (rescued) errors in the bottom-right, to notify you of how many error(s) need to be addressed, as in this example output:
As you can see above, it immediately "pops to the top of the stack" if encountering an error in a deeply nested role when `skip_role_on_error: False`.
On the other hand, those who choose to override default_vars.yml to set `skip_role_on_error: True` later need to take responsibility for all the errors they intentionally chose to bypass! One way to do this is:
The `-B1` flag is the essential part, as it includes the actual error on the previous line of the log file(s), for every error.
PS This technique should very likely be expanded to most all/other user-facing IIAB apps (i.e. their roles) in the coming days, if it proves to work well!
Related: