Skip to content

mgr: generate crash dumps for Python exceptions in mgr modules#41937

Merged
tchaikov merged 8 commits intoceph:masterfrom
liewegas:mgr-crash
Jun 26, 2021
Merged

mgr: generate crash dumps for Python exceptions in mgr modules#41937
tchaikov merged 8 commits intoceph:masterfrom
liewegas:mgr-crash

Conversation

@liewegas
Copy link
Member

@liewegas liewegas commented Jun 18, 2021

They look like so:

{
    "crash_id": "2021-06-18T20:52:36.836075Z_7a14ad79-c96a-484a-9ff5-828dc438148d",
    "timestamp": "2021-06-18T20:52:36.836075Z",
    "process_name": "ceph-mgr",
    "entity_name": "mgr.x",
    "ceph_version": "17.0.0-5306-gdfd9f85d5fd",
    "utsname_hostname": "dael",
    "utsname_sysname": "Linux",
    "utsname_release": "4.18.0-269.el8.x86_64",
    "utsname_version": "#1 SMP Tue Jan 12 17:55:05 UTC 2021",
    "utsname_machine": "x86_64",
    "os_name": "CentOS Stream",
    "os_id": "centos",
    "os_version_id": "8",
    "os_version": "8",
    "backtrace": [
        "  File \"/home/sage/src/ceph/src/pybind/mgr/balancer/module.py\", line 652, in serve\n    self.ifail()",
        "  File \"/home/sage/src/ceph/src/pybind/mgr/balancer/module.py\", line 648, in ifail\n    raise RuntimeError('test')",
    ],
    "mgr_module": "balancer",
    "mgr_module_caller": "PyModuleRunner::serve",
    "mgr_python_exception": "RuntimeError",
}

Notably, we exclude the 'value' portion of the exception when reporting telemetry, as that may contain identifying information.

liewegas added 7 commits June 23, 2021 13:00
Signed-off-by: Sage Weil <sage@newdream.net>
This may seem a bit backwards: we take nice C++ list<string> and do the
C dance.  It's a bit defensive: this class is used in the segv handler
(in the backtrace() and backtrace_symbol() path), so we want to minimize
the work we do on the heap in that case.  (For the list<string> path,
we can do whatever we like.)

Signed-off-by: Sage Weil <sage@newdream.net>
Signed-off-by: Sage Weil <sage@newdream.net>
Extend handle_pyerror() to generate a crash dump.  Pass some additional
context through from the callers (including the ability to not generate
a crash dump in the CLI handler case).

Extra crash dump fields look like so:

    "backtrace": [
        "  File \"/home/sage/src/ceph/src/pybind/mgr/balancer/module.py\", line 652, in serve\n    self.ifail()",
        "  File \"/home/sage/src/ceph/src/pybind/mgr/balancer/module.py\", line 648, in ifail\n    raise RuntimeError('test')",
    ],
    "mgr_module": "balancer",
    "mgr_module_caller": "PyModuleRunner::serve",
    "mgr_python_exception": "RuntimeError",

Notably, the backtrace deliberately excludes the 'value' of the exception,
as that may leak identifying information about the system.  Instead, we
only include the exception *type* and the portion of the traceback that
identifies the call path (where in the code we crashed).

Also note: a side-effect of this change is that module exceptions will
trigger cluster health warnings about daemon crashes.

Signed-off-by: Sage Weil <sage@newdream.net>
Generate a different warning for crashes in mgr module python code, as
they do not mean that the entire mgr daemon crashed.  Document.

Signed-off-by: Sage Weil <sage@newdream.net>
Broken by 8c009e2

Signed-off-by: Sage Weil <sage@newdream.net>
Include the exception value in teh crash dump, but redact it in telemetry.
That way the operator can see it (it's useful info!) but we don't risk
sharing identifying data via telemetry.

Signed-off-by: Sage Weil <sage@newdream.net>
@liewegas
Copy link
Member Author

jenkins test make check

@tchaikov tchaikov self-assigned this Jun 25, 2021
Copy link
Contributor

@tchaikov tchaikov left a comment

@liewegas
Copy link
Member Author

might want to address this one before merging this changeset

First two are addressed by #41946

Last one is a self-test, will add to ignorelist

One of the selftests triggers an exception from serve().

Signed-off-by: Sage Weil <sage@newdream.net>
@tchaikov
Copy link
Contributor

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants