Skip to content

ERR: Make CRYPTO_malloc() and friends report ERR_R_MALLOC_FAILURE#14833

Closed
DDvO wants to merge 9 commits intoopenssl:masterfrom
siemens:malloc_err_reporting_fix_6251
Closed

ERR: Make CRYPTO_malloc() and friends report ERR_R_MALLOC_FAILURE#14833
DDvO wants to merge 9 commits intoopenssl:masterfrom
siemens:malloc_err_reporting_fix_6251

Conversation

@DDvO
Copy link
Contributor

@DDvO DDvO commented Apr 12, 2021

This proposes a simple fix of #6251. Maybe this is already sufficient?
It also takes into account thoughts given in #5781 and #6217.

@t8m
Copy link
Member

t8m commented Apr 12, 2021

You'd have to clean up all the calls where ERR_R_MALLOC_FAILURE is raised otherwise there would be double error reporting wouldn't it?

@DDvO DDvO added this to the Post 3.0.0 milestone Apr 12, 2021
@DDvO DDvO changed the title ERR: Make CRYPTO_malloc() and friends report ERR_R_MALLOC_FAILURE WIP: ERR: Make CRYPTO_malloc() and friends report ERR_R_MALLOC_FAILURE Apr 12, 2021
@DDvO
Copy link
Contributor Author

DDvO commented Apr 12, 2021

You'd have to clean up all the calls where ERR_R_MALLOC_FAILURE is raised otherwise there would be double error reporting wouldn't it?

Yes, but I'd spend this major effort only if/when the core solution is firmly accepted.

Marking this PR as WIP for now, and post-3.0.0.

@DDvO DDvO force-pushed the malloc_err_reporting_fix_6251 branch from 40e0bcd to 1a8143c Compare April 12, 2021 10:16
@DDvO
Copy link
Contributor Author

DDvO commented Apr 12, 2021

My goodness.
For testing purposes, I injected pseudo-malloc failures after a given number of {m,re}alloc calls
in the beginning of report_malloc_failure() as follows:

    static int count = 31000;
    if (--count < 0)
        result = NULL;

Then, for instance make -j4 -s test TESTS="test_cms" fails.
When using 32000, this works fine.
So running an innocent OpenSSL app can cause more than 31000 memory allocation calls!

Raising the limit to 5300000 still causes make test to fail with this outcome:

Test Summary Report
-------------------
99-test_fuzz_server.t            (Wstat: 256 Tests: 2 Failed: 1)
  Failed test:  2
  Non-zero exit status: 1
99-test_fuzz_asn1.t              (Wstat: 256 Tests: 2 Failed: 1)
  Failed test:  2
  Non-zero exit status: 1
04-test_encoder_decoder.t        (Wstat: 256 Tests: 1 Failed: 1)
  Failed test:  1
  Non-zero exit status: 1

while for 5400000 only the above two fuzz tests fail.

@richsalz
Copy link
Contributor

Suppose OPENSSL_malloc tries to raise an error condition, and the alloc at line 615 of err.c gets invoked. Infinite recursion I think.

@DDvO
Copy link
Contributor Author

DDvO commented Apr 12, 2021

Suppose OPENSSL_malloc tries to raise an error condition, and the alloc at line 615 of err.c gets invoked. Infinite recursion I think.

That's what the following code is for:

        /* prevent error-in-error-handler malloc loop */
        if (len < 6
            ? strcmp(file, "err.c") != 0
            : strcmp(file + len - 6, "/err.c") != 0) {

@richsalz
Copy link
Contributor

And you're sure all of the other possible "init" functions can't cause a loop? I guess not since they'll end up falling through to that err code. okay. thanks.

@davidben
Copy link
Contributor

davidben commented Apr 12, 2021

Conditioning based on the name of the file seems rather fragile. If __FILE__ behaves funny on some system, it will break. If someone renames err.c without realizing the malloc implementation has this check, it will break. If some other file matches the name check, it will break. Moreover, when it breaks, the problem is only visible on malloc failure, so it will go unnoticed for a long time. I would suggest just sticking to the straightforward thing and just making two functions.

This also isn't quite how you check for malloc failure. Some malloc implementations return NULL on zero, so you need to check both values.

@richsalz
Copy link
Contributor

somebody has to be at the bottom; either malloc or error-stuff. making them interdependent, which this PR (as of this writing) does is like David said, fragile.

It's a pain to report malloc failures all over the place. This fix seems wrong.

@DDvO
Copy link
Contributor Author

DDvO commented Apr 12, 2021

I had a closer look at what is potentially called through

            ERR_new();
            ERR_set_debug(file, line, NULL);
            ERR_set_error(ERR_LIB_CRYPTO, ERR_R_MALLOC_FAILURE, NULL);

and it's a whole mess of stuff, where as @richsalz wrote one cannot really be sure that this does not include any indirect memory allocation calls.
So indeed a different solution is needed.

@DDvO
Copy link
Contributor Author

DDvO commented Apr 12, 2021

I'm now considering options to use a pre-allocated place that is reserved per thread
solely for the purpose of reporting the first occurrence of a malloc failure,
basically with the following idea in mind:

When a malloc failure is encountered and this place is available and not (yet) occupied, it is used,
else simply no new error entry is generated.
The probability that this pre-allocated place is not available should be negligible because it means that already during initialization the thread application runs out of memory.
The situation that the place is already occupied means that the thread ran into a malloc failure before, which should be the actual failure to be reported, while any subsequent ones are just part of the aftermath;
anyway all one can then hope is that the application is still able to output its error queue before dying.

Does anyone see a problem with this approach?

@richsalz
Copy link
Contributor

The probability that this pre-allocated place is not available should be negligible because it means that already during initialization the thread runs out of memory.

In my experience, most of the time when malloc returns NULL it is because of a pointer problem (clobbering state) rather than (virtual) memory being exhausted. And also, we know some people are writing applications that spawn and reap multiple threads, rather than a fixed number near start-up time. So I think your probability estimate is wrong.

@DDvO
Copy link
Contributor Author

DDvO commented Apr 12, 2021

The probability that this pre-allocated place is not available should be negligible because it means that already during initialization the thread runs out of memory.

In my experience, most of the time when malloc returns NULL it is because of a pointer problem (clobbering state) rather than (virtual) memory being exhausted.

Well, in such a case, memory is corrupted anyway,
so being able to (correctly) output error messages is then a matter of good luck.

And also, we know some people are writing applications that spawn and reap multiple threads, rather than a fixed number near start-up time. So I think your probability estimate is wrong.

Good point. This brings me to an improvement of the sketched new approach:
have the pre-allocated space not per thread, but per application,
which is actually simpler and should be sufficient.
Also the OpenSSL error queue is allocated per application, right?

@richsalz
Copy link
Contributor

Also the OpenSSL error queue is allocated per application, right?

No it's per-thread. If that's not a typo, then I have to wonder if you really read the err.c code.

Quietly making "malloc failed" a global error, unlike all other errors, is not something to be just tossed into a PR but should be discussed by the OTC. I am going to not comment any more until after 3.0 is out.

@DDvO
Copy link
Contributor Author

DDvO commented Apr 12, 2021

Also the OpenSSL error queue is allocated per application, right?

No it's per-thread. If that's not a typo, then I have to wonder if you really read the err.c code.

Argh, I should stop working after 12 hours. Apparently I had too many context switches today.
Sure, it's per thread: state = CRYPTO_THREAD_get_local(&err_thread_local);

Quietly making "malloc failed" a global error, unlike all other errors, is not something to be just tossed into a PR but should be discussed by the OTC. I am going to not comment any more until after 3.0 is out.

In the light of the above clarification, reverting my proposal to an extra space per thread.
Even if the probability of not being able to allocate it along with the normal error queue entries
is a little higher than what I thought initially, this is not really a problem:
If a new thread is able to allocate sizeof(ERR_STATE), it should be able to allocate also 17 instead of 16 (= ERR_NUM_ERRORS) slots. And even if not, the application will be in severe trouble already at that point anyway (and the worst thing that can happen with the new approach is that the ERR_R_MALLOC_FAILURE could be dropped - so what).

@t8m t8m added branch: master Applies to master branch triaged: feature The issue/pr requests/adds a feature labels Aug 6, 2021
@DDvO DDvO changed the title WIP: ERR: Make CRYPTO_malloc() and friends report ERR_R_MALLOC_FAILURE ERR: Make CRYPTO_malloc() and friends report ERR_R_MALLOC_FAILURE Aug 15, 2022
@DDvO
Copy link
Contributor Author

DDvO commented Aug 15, 2022

Finally, I took the time to continue working in on this, taking into account the review comments above.
I now take a more robust approach for preventing potential error-in-error-handler malloc loops:
counting per thread how many malloc errors are being reported.

@t8m, as discussed above, I'm going to clean up all the calls where ERR_R_MALLOC_FAILURE is raised
when I got positive feedback on the solution taken here.

@DDvO DDvO force-pushed the malloc_err_reporting_fix_6251 branch from 1a8143c to 5802ad1 Compare August 15, 2022 15:16
@DDvO DDvO force-pushed the malloc_err_reporting_fix_6251 branch from 5802ad1 to 25ab843 Compare August 15, 2022 15:48
@t8m
Copy link
Member

t8m commented Aug 23, 2022

OTC: The current state of the PR is not acceptable because it possibly reports error on 0 bytes allocated. If this is resolved this PR can be merged if there will be followup PR to fix the duplicated ERR_raise(ERR_R_MALLOC_FAILURE) calls in one way or another depending on the context.

Since the former is done now, the question remains how best to handle the existing ERR_raise(ERR_LIB_FOO, ERR_R_MALLOC_FAILURE) calls.

My clear preference is to replace them by ERR_raise(ERR_LIB_FOO, ERR_R_CRYPTO_LIB) for the following reasons:

1. This can be done automatically and can be approved easily because each such change is trivial,
   which is a pragmatically important argument given that there are more than 1200 such calls.

2. The new pseudo reason `ERR_R_CRYPTO_LIB` is not anymore misleading in case the actual failure reason is something else than `ERR_R_MALLOC_FAILURE`.

3. This keeps the existing traceability while not much polluting the error queue because actual malloc failures are rare.

OTC wanted to resolve them based on the context. If the ERR_raise() call is in the same function where the OPENSSL_*alloc() is done and is done solely as the result of the malloc failure, it should NOT be kept. For the rest, the replacement as you suggest would be OK.

@t8m t8m added approval: review pending This pull request needs review by a committer approval: otc review pending labels Aug 23, 2022
Copy link
Member

@t8m t8m left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM apart from placing the docs into a different .pod file.

…loc() and friends report ERR_R_MALLOC_FAILURE
crypto/err/err.c Outdated
void *CRYPTO_malloc_with_err(void *(*fn)(size_t, const char *, int), size_t num,
const char *file, int line, const char *func)
{
void *ptr = (*fn)(num, file, line);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not a fan of an indirect call here. We already do one to get to malloc et al and this makes it two. Moreover, this one won't be predicted at all well because so many functions are funneling through here.

The alternative would be having each of the allocation functions separately defined which is ugly but won't hit performance.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sigh, a smart compiler might flatten this out, but likely we cannot hope for that.
Would be so nice to use just a macro for CRYPTO_malloc_with_err et al., but this is made impossible by the non-linear use of the num argument.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Anyway, I do not really see the very slight decrease in performance using the function pointer as an issue here.
Memory allocation is in itself a pretty expensive operation, so an extra function pointer dereference should not make any noticeable difference.
Moreover, memory allocation is very unusual in tight loops.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we make the functions static inline? Not sure if that doesn't open another can of worms though.

Copy link
Contributor Author

@DDvO DDvO Aug 24, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm uncertain if it is possible to declare CRYPTO_malloc_with_err et al. static because they need to be declared in a public header (crypto.h) for use with public macros such as OPENSSL_malloc().
I was also considering making CRYPTO_malloc_with_err et al. inline, but as there is s pretty large number of calls to OPENSSL_malloc() et al., this would likely add quite a bit to the code size, while the performance gain would be tiny.
So also in this direction I see no point trying to optimize on performance.

Copy link
Contributor

@richsalz richsalz Aug 24, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It has a spurious state to s change, that's not a big deal.

It adds new xxx_with_err functions that might not be needed if the decision is to use backtrace or its equivalent. (And if those functions are only for internal use, why are they in public header files and documentation?) It does no work to change the source base to use those with_err functions.

This seems like a partial change, no promise of future work to complete it and a known pending open issue about the direction in which to go.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@richsalz occasionally it feels like you wanna re-enter the OpenSSL project?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

malloc shouldn't be a huge expense most of the time. Modern libc's implement it pretty well. A branch misprediction takes hundreds of cycles which represents quite an expense.

The other concern is about what we make public. Once public, forever public (almost). Do we want CRYPTO_malloc_with_err() as a public API?

Copy link
Contributor Author

@DDvO DDvO Aug 25, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thinking a little more about all this, I found a solution that does not use function pointers and the like and
avoids using and adding (actually internal) aux functions to public headers. It even does not change header files at all.

It reports malloc errors once and for all in CRYPTO_malloc()
while making sure to prevent mem alloc error loops simply as follows:
ossl_err_get_state_int() in err.c uses the special call CRYPTO_zalloc(num, NULL, 0) for ERR_STATE allocation,
and CRYPTO_*alloc() do not report an malloc failure in this special case (i.e., file == NULL && line == 0).
All other OPENSSL_*malloc() and CRYPTO_*alloc() calls lead to file != NULL || line != 0 and thus do report any malloc error.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice.

…PTO_malloc() and friends report ERR_R_MALLOC_FAILURE
@github-actions github-actions bot removed the severity: fips change The pull request changes FIPS provider sources label Aug 25, 2022
@levitte
Copy link
Member

levitte commented Aug 25, 2022

@t8m:

If the ERR_raise() call is in the same function where the OPENSSL_*alloc() is done and is done solely as the result of the malloc failure, it should NOT be kept.

I disagree with this, especially with the latest changes

@paulidale paulidale requested a review from t8m August 26, 2022 03:40
@t8m
Copy link
Member

t8m commented Aug 26, 2022

I disagree with this, especially with the latest changes

Well, with the latest changes keeping the original ERR_raises makes some sense. I'd still change them to ERR_R_CRYPTO_LIB.

@DDvO given this should be now a simple sed command, could you please create a PR to do this based on top of this PR?

@t8m
Copy link
Member

t8m commented Aug 26, 2022

Could you please test that if CRYPTO_*alloc() returns always NULL we do not get an infinite recursion on OPENSSL_init_crypto() call?

@DDvO
Copy link
Contributor Author

DDvO commented Aug 26, 2022

Well, with the latest changes keeping the original ERR_raises makes some sense. I'd still change them to ERR_R_CRYPTO_LIB.

@DDvO given this should be now a simple sed command, could you please create a PR to do this based on top of this PR?

Done in #19072.

@DDvO
Copy link
Contributor Author

DDvO commented Aug 26, 2022

Could you please test that if CRYPTO_*alloc() returns always NULL we do not get an infinite recursion on OPENSSL_init_crypto() call?

I just made all CRYPTO_*alloc() return NULL (after the changes done in #19072) and ran the tests.
Of course, a lot of them failed, but none of them hang.

@t8m t8m added approval: done This pull request has the required number of approvals and removed approval: review pending This pull request needs review by a committer labels Aug 26, 2022
@DDvO
Copy link
Contributor Author

DDvO commented Aug 27, 2022

Merged - thanks a lot @t8m, @paulidale, @levitte, and @richsalz for the discussion and the approvals.
Very glad that we finally got this tricky thing through, even with a pretty elegant solution.

openssl-machine pushed a commit that referenced this pull request Aug 27, 2022
Fixes #6251

Reviewed-by: Tomas Mraz <tomas@openssl.org>
Reviewed-by: Paul Dale <pauli@openssl.org>
Reviewed-by: David von Oheimb <david.von.oheimb@siemens.com>
(Merged from #14833)
@openssl-machine
Copy link
Collaborator

24 hours has passed since 'approval: done' was set, but as this PR has been updated in that time the label 'approval: ready to merge' is not being automatically set. Please review the updates and set the label manually.

@DDvO
Copy link
Contributor Author

DDvO commented Aug 27, 2022

Very unfortunate that the malloc error reporting was not handled in a good way right from the beginning.
This lead to much extra work and redundancy on the callers' side over all the years since,
and even to partly misleading reasons being reported, which is now handled by #19072.

Also finding, discussing, and trying out solutions was a considerable amount of work.
Thanks also to @bernd-edlinger, @mspncp , and @mattcaswell who contributed to the discussion in #6251.

@DDvO DDvO closed this Aug 27, 2022
sftcd pushed a commit to sftcd/openssl that referenced this pull request Sep 24, 2022
Fixes openssl#6251

Reviewed-by: Tomas Mraz <tomas@openssl.org>
Reviewed-by: Paul Dale <pauli@openssl.org>
Reviewed-by: David von Oheimb <david.von.oheimb@siemens.com>
(Merged from openssl#14833)
beldmit pushed a commit to beldmit/openssl that referenced this pull request Dec 26, 2022
Fixes openssl#6251

Reviewed-by: Tomas Mraz <tomas@openssl.org>
Reviewed-by: Paul Dale <pauli@openssl.org>
Reviewed-by: David von Oheimb <david.von.oheimb@siemens.com>
(Merged from openssl#14833)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

approval: done This pull request has the required number of approvals branch: master Applies to master branch triaged: feature The issue/pr requests/adds a feature triaged: refactor The issue/pr requests/implements refactoring

Projects

None yet

Development

Successfully merging this pull request may close these issues.

8 participants