Skip to content

Prototype 'skip_role_on_error: True' in 4 roles {lokole, moodle, mongodb, sugarizer}#3255

Merged
holta merged 1 commit intoiiab:masterfrom
holta:skip_role_on_error
Jun 18, 2022
Merged

Prototype 'skip_role_on_error: True' in 4 roles {lokole, moodle, mongodb, sugarizer}#3255
holta merged 1 commit intoiiab:masterfrom
holta:skip_role_on_error

Conversation

@holta
Copy link
Copy Markdown
Member

@holta holta commented Jun 17, 2022

It's not just that humanity is impatient (the world is melting after all) but also that some prefer to race through all roles and then look at the individual errors later.

So this PR prototypes an unblocking mechanism, allowing Ansible to continue to the next role(s) even despite serious errors in individual roles.

It uses Ansible's block/rescue technique. All this really does is indent most all of ROLE/tasks/main.yml — and then tack this on the bottom:

  rescue:

  - name: 'SEE ERROR ABOVE (skip_role_on_error: {{ skip_role_on_error }})'
    fail:
      msg: ""
    when: not skip_role_on_error

This means that if you override default_vars.yml to set skip_role_on_error: True you will get a rescued=2 count of all bypassed (rescued) errors in the bottom-right — to notify you of how many error(s) need to be addressed — as in this example output:

TASK [mongodb : SOME SERIOUS ERROR TRYING TO INSTALL MongoDB] **************************************************************************************************************************************************
fatal: [127.0.0.1]: FAILED! => {"changed": false, "msg": "DEEP FAIL 1! (MongoDB)"}

TASK [mongodb : SEE ERROR ABOVE (skip_role_on_error: False)] ***********************************************************************************************************
fatal: [127.0.0.1]: FAILED! => {"changed": false, "msg": ""}

TASK [sugarizer : SEE ERROR ABOVE (skip_role_on_error: False)] *********************************************************************************************************
fatal: [127.0.0.1]: FAILED! => {"changed": false, "msg": ""}

PLAY RECAP *************************************************************************************************************************************************************
127.0.0.1                  : ok=42   changed=7    unreachable=0    failed=1    skipped=5    rescued=2    ignored=0

As you can see above, it immediately "pops to the top of the stack" if encountering an error in a deeply nested role when skip_role_on_error: False

On the other hand, those who choose to override default_vars.yml to set skip_role_on_error: True later need to take responsibility for all the errors they intentionally chose to bypass! One way to do this is:

cd /opt/iiab/iiab
grep -B1 'SEE ERROR ABOVE' *.log

The -B1 flag is the essential part, as it includes the actual error on the previous line of the log file(s), for every error.

PS This technique should very likely be expanded to most all/other user-facing IIAB apps (i.e. their roles) in the coming days, if it proves to work well!

Related:

@holta holta added this to the 8.0 milestone Jun 17, 2022
@holta
Copy link
Copy Markdown
Member Author

holta commented Jun 17, 2022

  1. Explanation above revised to make it more clear.

  2. iiab-diagnostics might later include the following command, to surface disparate Ansible errors that folks have (intentionally or unintentionally) swept under the carpet:

    grep -B1 'SEE ERROR ABOVE' /opt/iiab/iiab/*.log
    

@holta
Copy link
Copy Markdown
Member Author

holta commented Jun 17, 2022

  1. This PR was heavily tested on Ubuntu. It should probably be merged shortly, for broader community evaluation and testing.

@jvonau
Copy link
Copy Markdown
Contributor

jvonau commented Jun 17, 2022

If this is expanded to include the roles that rely on mariadb this technique could be very useful for image creation.

@holta
Copy link
Copy Markdown
Member Author

holta commented Jun 17, 2022

Recap:

Some IIAB implementers/testers/devs want to see errors immediately (e.g. when running ./runrole, ./iiab-install, etc) — whereas others want to bypass those errors (e.g. Admin Console, ./runroles, etc) "until later."

(So this PR tries to tease apart and acknowledge... some of these very different needs... expected by some IIAB implementers/testers/developers but not others.)

@tim-moody
Copy link
Copy Markdown
Contributor

Some IIAB implementers/testers/devs want to see errors immediately (e.g. when running ./runrole, ./iiab-install, etc) — whereas others want to bypass those errors (e.g. Admin Console, ./runroles, etc) "until later."

I wrote runroles and admin console and I have no wish to bypass errors until later. I'm not even sure what that means.

All I want is for a role that knows it will fail due to lack of support for a particular OS to do so gracefully.

@tim-moody
Copy link
Copy Markdown
Contributor

I also don't know what the reference to the readme is for; there's no mention of errors later.

@jvonau
Copy link
Copy Markdown
Contributor

jvonau commented Jun 17, 2022

This RP doesn't directly solve 3242, but I can see a use for this functionally with automated testing.

All I want is for a role that knows it will fail due to lack of support for a particular OS to do so gracefully.

Can you expand on 'fail' and 'gracefully'? Stop and display error message as I proposed in 3244?

@jvonau
Copy link
Copy Markdown
Contributor

jvonau commented Jun 17, 2022

Think #3242 (comment) is causing some confusion.

@tim-moody
Copy link
Copy Markdown
Contributor

All I'm saying is that if a particular role can't support a particular environment it should say so and do nothing.

@jvonau
Copy link
Copy Markdown
Contributor

jvonau commented Jun 17, 2022

All I'm saying is that if a particular role can't support a particular environment it should say so and do nothing.

Before trying to install the role? as a precautionary check? We know in advance where the failure is going to occur 'with a particular environment'. Are you suggesting just to ignore the *_install variable for that role and silently skip the role without relying on 'skip_role_on_error' being 'True' trying to install the role, failing then silently proceeding? The summary is a nice touch thou.

@holta
Copy link
Copy Markdown
Member Author

holta commented Jun 17, 2022

There is no crystal ball or magic wand that will say whether a role will complete or not — and there never will be. Oddmakers are useful (and entertaining!) but in end that's all they are, oddsmakers.

Counting balls and strikes may be interesting (and yes necessary, to monitor and improve reliability!) but the higher-level need here is the people who want to (1) run long Ansible jobs largely uninterrupted, to install/enable many roles (i.e. many IIAB apps/services) (2) and then address the consequences later.

Some of these roles will succeed and some will fail, for innumerable and different reasons (e.g. an apt update that was applied last night fixing/breaking the OS and/or any upstream CI app fixing/breaking last night — and countless other reasons such as transient root causes, quasi-permanent root causes, networking/mirror failures, hardware failures, ETC, etc).

This important need has been overlooked for too long. So this PR works towards allowing everything to be batched up in service to this growing population of people (who want to batch everything up, then deal with the consequences later, indeed whether they have time later or not...)

Regardless how that batch process of installing many roles (in a more uninterrupted way) is invoked. Just one example is those installing a preset / learning bouquet that's trying to install many such roles/apps — whether from Admin Console, or from a script like ./iiab-install or ./runroles or any other means.

@holta
Copy link
Copy Markdown
Member Author

holta commented Jun 17, 2022

I can see a use for this functionally with automated testing

Indeed!

QA (community testing) is one of several demographics that will contribute far more effectively with a batch process working through the installing/enabling of many roles in a best effort fashion (rather than hurry-up-and-wait, constantly watching the kettle boil...)

Reason: Being randomly interrupted may be "addicting" according to many psychological studies (and extremely mentally taxing, ask almost anyone in on-call / retail service industries) but such "ADHD" workflows can also be the worst way to get serious work accomplished. 😱

@holta
Copy link
Copy Markdown
Member Author

holta commented Jun 18, 2022

I ran another successful test (LARGE-sized IIAB install on Ubuntu 22.04 not 20.04 this time) using skip_role_on_error: True to override default_vars.yml

Output confirms the 3 expected errors are trapped — with the error count (rescued=3 in this case) clear to anyone who chooses to look:

root@box:/opt/iiab/iiab# grep rescue *log
2022-06-17 22:15:15,859 p=4267 u=root n=ansible | 127.0.0.1                  : ok=793  changed=450  unreachable=0    failed=0    skipped=221  rescued=3    ignored=5

With error detail (for all 3 errors in this case) reviewable here:

root@box:/opt/iiab/iiab# grep -B1 'SEE ERROR ABOVE' *log
2022-06-17 22:08:06,133 p=4267 u=root n=ansible | fatal: [127.0.0.1]: FAILED! => {"changed": true, "cmd": "/opt/iiab/moodle/moodle_installer", "delta": "0:00:00.106486", "end": "2022-06-17 22:08:06.113942", "msg": "non-zero return code", "rc": 255, "start": "2022-06-17 22:08:06.007456", "stderr": "+ sudo -u www-data /usr/bin/php /opt/iiab/moodle/admin/cli/install.php --wwwroot=http://box.lan/moodle --dataroot=/library/moodle --dbtype=pgsql --dbname=moodle --dbuser=Admin --dbpass=changeme --fullname=Your_School --shortname=School --adminuser=admin --adminpass=changeme --non-interactive --agree-license --allow-unstable\nPHP Deprecated:  Return type of moodle_recordset::rewind() should either be compatible with Iterator::rewind(): void, or the #[\\ReturnTypeWillChange] attribute should be used to temporarily suppress the notice in /opt/iiab/moodle/lib/dml/moodle_recordset.php on line 61\nPHP Fatal error:  Type of xml_format_exception::$line must be int (as in class Exception) in /opt/iiab/moodle/lib/xmlize.php on line 42", "stderr_lines": ["+ sudo -u www-data /usr/bin/php /opt/iiab/moodle/admin/cli/install.php --wwwroot=http://box.lan/moodle --dataroot=/library/moodle --dbtype=pgsql --dbname=moodle --dbuser=Admin --dbpass=changeme --fullname=Your_School --shortname=School --adminuser=admin --adminpass=changeme --non-interactive --agree-license --allow-unstable", "PHP Deprecated:  Return type of moodle_recordset::rewind() should either be compatible with Iterator::rewind(): void, or the #[\\ReturnTypeWillChange] attribute should be used to temporarily suppress the notice in /opt/iiab/moodle/lib/dml/moodle_recordset.php on line 61", "PHP Fatal error:  Type of xml_format_exception::$line must be int (as in class Exception) in /opt/iiab/moodle/lib/xmlize.php on line 42"], "stdout": "\nDeprecated: Return type of moodle_recordset::rewind() should either be compatible with Iterator::rewind(): void, or the #[\\ReturnTypeWillChange] attribute should be used to temporarily suppress the notice in /opt/iiab/moodle/lib/dml/moodle_recordset.php on line 61\n\nFatal error: Type of xml_format_exception::$line must be int (as in class Exception) in /opt/iiab/moodle/lib/xmlize.php on line 42", "stdout_lines": ["", "Deprecated: Return type of moodle_recordset::rewind() should either be compatible with Iterator::rewind(): void, or the #[\\ReturnTypeWillChange] attribute should be used to temporarily suppress the notice in /opt/iiab/moodle/lib/dml/moodle_recordset.php on line 61", "", "Fatal error: Type of xml_format_exception::$line must be int (as in class Exception) in /opt/iiab/moodle/lib/xmlize.php on line 42"]}
2022-06-17 22:08:06,146 p=4267 u=root n=ansible | TASK [moodle : SEE ERROR ABOVE (skip_role_on_error: True)] *************************************************************************************************************
--
2022-06-17 22:11:35,032 p=4267 u=root n=ansible | fatal: [127.0.0.1]: FAILED! => {"cache_update_time": 1655518292, "cache_updated": false, "changed": false, "msg": "'/usr/bin/apt-get -y -o \"Dpkg::Options::=--force-confdef\" -o \"Dpkg::Options::=--force-confold\"       install 'mongodb-org=4.4.14' 'mongodb-org-server=4.4.14'' failed: E: Unable to correct problems, you have held broken packages.\n", "rc": 100, "stderr": "E: Unable to correct problems, you have held broken packages.\n", "stderr_lines": ["E: Unable to correct problems, you have held broken packages."], "stdout": "Reading package lists...\nBuilding dependency tree...\nReading state information...\nSome packages could not be installed. This may mean that you have\nrequested an impossible situation or if you are using the unstable\ndistribution that some required packages have not yet been created\nor been moved out of Incoming.\nThe following information may help to resolve the situation:\n\nThe following packages have unmet dependencies:\n mongodb-org-mongos : Depends: libssl1.1 (>= 1.1.0) but it is not installable\n mongodb-org-server : Depends: libssl1.1 (>= 1.1.0) but it is not installable\n mongodb-org-shell : Depends: libssl1.1 (>= 1.1.0) but it is not installable\n", "stdout_lines": ["Reading package lists...", "Building dependency tree...", "Reading state information...", "Some packages could not be installed. This may mean that you have", "requested an impossible situation or if you are using the unstable", "distribution that some required packages have not yet been created", "or been moved out of Incoming.", "The following information may help to resolve the situation:", "", "The following packages have unmet dependencies:", " mongodb-org-mongos : Depends: libssl1.1 (>= 1.1.0) but it is not installable", " mongodb-org-server : Depends: libssl1.1 (>= 1.1.0) but it is not installable", " mongodb-org-shell : Depends: libssl1.1 (>= 1.1.0) but it is not installable"]}
2022-06-17 22:11:35,048 p=4267 u=root n=ansible | TASK [mongodb : SEE ERROR ABOVE (skip_role_on_error: True)] ************************************************************************************************************
--
2022-06-17 22:11:35,109 p=4267 u=root n=ansible | fatal: [127.0.0.1]: FAILED! => {"changed": false, "msg": "MongoDB INSTALLATION FAILED, perhaps because MongoDB doesn't yet support Ubuntu 22.04 with libssl3?"}
2022-06-17 22:11:35,124 p=4267 u=root n=ansible | TASK [sugarizer : SEE ERROR ABOVE (skip_role_on_error: True)] **********************************************************************************************************

@holta
Copy link
Copy Markdown
Member Author

holta commented Jun 18, 2022

@deldesir thanks for having reviewed this quickly overnight!

FYI he is eager to install arbitrary collections of IIAB
roles/apps/services in a more "uninterrupted" way, using this PR's approach.

ASIDE: This PR does not time out individual roles that "get stuck" for example installing too slowly (e.g. due to Internet problems from Haiti/India/Etc) but that use case might be considered in future — if it proves to be a common problem.

@tim-moody
Copy link
Copy Markdown
Contributor

Now that I see more of the details I think I misunderstood this PR. I thought it was a wrapper approach, which I oppose, rather than causing a role internally to fail gracefully, which is what I have been asking for. So, sorry if I got it wrong.

If I now understand correctly, let me ask when and why would I ever want skip_role_on_error to be false, if I want ansible to terminate immediately rather than completing the run and installing other roles because I am only interested in testing one role? If that is the case perhaps True should be the default, not False.

I also suggest that skip_role_on_error might be skip_roles_on_error as it is global.

he is eager to install arbitrary collections of IIAB roles/apps/services

easily done with Admin Console.

@holta
Copy link
Copy Markdown
Member Author

holta commented Jun 18, 2022

why would I ever want skip_role_on_error to be false

"Traditional craftsmanship" is the reason arguably. For those who feel industrial efficiency is too often impersonal (which it is!)

So we should keep this legacy option to help Perfectionists (which are not at all uncommon in the QA community, and in fact often incredibly insightful people as a result of their OCD attention to fine detail, often the only people who catch critical underlying flaws...)

So I guess the answer is to support these very important people who sometimes/often want to watch every little step of Ansible's output (deliberately, as this can be a learning experience) and then rerun things like sudo iiab on each error — to build up a test/field IIAB progressively/linearly in a more organic/patient way.

I also suggest that skip_role_on_error might be skip_roles_on_error as it is global.

I'm sympathetic.

But also torn trying to convey that just 1 role will be skipped per error. (I didn't want to accidentally convey that multiple roles would be skipped upon hitting a single error...)

@holta
Copy link
Copy Markdown
Member Author

holta commented Jun 18, 2022

It's been more than 24h so I'm going to merge this to encourage many more people to bang on it and provide feedback around next steps.

Changes in strategy/naming/etc can and should certainly be made down the road, as we learn more.

@holta holta merged commit 83ea6ce into iiab:master Jun 18, 2022
@tim-moody
Copy link
Copy Markdown
Contributor

override default_vars.yml to set skip_role_on_error: True

I assume you mean in local vars, Please don't encourage anyone to modify default vars

@holta
Copy link
Copy Markdown
Member Author

holta commented Jun 18, 2022

I assume you mean in local vars

Yes!

To clarify, most everyone should add/modify IIAB Variables[*] within /etc/iiab/local_vars.yml — prior to IIAB installation if possible.

[*] Used by Ansible and other things too.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants