Add support for Backtraces and profiling by scymtym · Pull Request #1550 · cqframework/clinical_quality_language

scymtym · 2025-05-14T10:44:16Z

This is a draft for extending the "stack of windows" concept of the CqlEngine to a stack of ActivationFrames which track more information than just arguments and local variables, namely

Expression/function definitions being evaluated in particular activation frames
Start and end times for the evaluation for activation frames
Cache hit information for certain activation frames

On top of that extension, this pull request adds two features to the CQLEngine

Backtraces in CQL exceptions that represent runtime error. The backtrace information can be used by clients to display backtraces with expression/function names, arguments, local variables and source information. Example:
Optional profiling. A new engine option can be used to turn enable profiling which the engine to aggregate information from ActivationFrames into a deterministic performance profile that can be used, for example, to render a flamegraph. Example:

Each call to CqlEngine.evaluate causes a call to State.init which, before this change, pushed one window. The actual evaluation associated with the CqlEngine.evaluate call pushed and popped a balanced number of windows but the one pushed by State.init remained on the stack in the end. Successive CqlEngine.evaluate calls would therefore accumulate an unbounded number of windows on the stack. As a slight interface change, clients that bypass CqlEngine.evaluate[Expressions] by using CqlEngine.getEvaluationVisitor now have to call State.beginEvaluation and State.endEvaluation themselves. The ExpressionCacheTest.expression_cache test had to be change in this way.

This makes the association between local variables introduced by, say, query expressions and the containing expression definition clear. For function definitions, there is no change since function invocation already pushed a fresh window.

The stack of ActivationFrames serves the same purpose as the window stack but records additional information compared to just groups of local variables: * Each ActivationFrame (except the dummy root frame) has an associated element which is an ExpressionDef (possibly a FunctionDef) or a Retrieve. * Each ActivationFrame has a start time and an end time so that the time spent evaluating the frame can be tracked. * Each ActivationFrame has an isCached field so that the cache hits and misses when evaluating ExpressionDef elements (but not FunctionDef elements) can be tracked. This additional information will be useful for at least two purposes: 1. The associated elements can be used to construct a backtrace with source information when an exception is thrown during evaluation. 2. The timing and cache hit information can be used to build a deterministic profile of a given evaluation which is useful for performance analysis. Subsequent commits will add code for these two use-cases.

A backtrace represents a stack of call-like frames from the root of an evaluation to a particular sub-expression, commonly a sub-expression in which an exception was thrown. Client code can extract this Backtrace object from a caught exception and use it to display a backtrace similar to those known from other programming languages.

Before this change, there were two performance issues in the profile construction code 1. Profile.register traversed the whole stack to find the appropriate profile node which is O(N) where N is the depth of the "call stack". 2. Children of profile nodes were stored in a HashMap the keys of which relied on object-equality involving Element objects. Object-equality for those is very expensive. This change solves issue 1 by remembering the stack of profile nodes that correspond to the stack of current activation frames and incrementally updating the top of the stack which is O(1). This change solves issue 2 by (basically; the context adds a slight complication) storing the children of profile nodes in an IdentityHashMap so that Element.equals is not used.

c-schuler · 2025-05-30T17:39:57Z

This is very cool, thank you @scymtym !

A couple of enhancements that I would suggest:

Introduce tiered profiling levels (e.g., summary vs. detailed). This being a binary (on/off) is fine for now, but introducing a toggle for verbosity or controlling granularity will likely be a desired enhancement in the future.
Only push activation frames if profiling or backtracing is enabled. This avoids overhead in case both are disabled.

As this PR goes through the paces I would also like to ensure the following take place:

Update the project’s documentation to include usage guidelines for the new backtrace and profiling features.
Provide configuration settings to enable or disable backtracing and profiling, allowing users to control the overhead associated with these features.
Assess the performance impact of the added features, especially in production environments, and optimize as necessary.
Tests should cover scenarios for both backtracing and profiling, including edge cases like nested evaluations and cached results.
Simulated errors and exceptions should be tested to verify that backtraces and debug logs provide meaningful insights.

With thorough testing and proper documentation, these features will be valuable additions to the project. Thanks again!

scymtym · 2025-06-02T15:01:32Z

@c-schuler Thank you for your feedback and suggestions. I can address some of the suggested changes by myself but will need more guidance for others.

Introduce tiered profiling levels (e.g., summary vs. detailed). This being a binary (on/off) is fine for now, but introducing a toggle for verbosity or controlling granularity will likely be a desired enhancement in the future.

I can see this as a useful distinction when presenting the results of profiling - the flamegraph-SVG renderer included in the pull request is mainly a demo of one particular result presentation style that works without dependencies and is not too complex (and happened to be sufficient for my use-case). One question could be whether this renderer should be omitted from the pull request so that clients have complete control and responsibility for the presentation of profiling results. None of this has any bearing on runtime performance, of course.

Regarding runtime behavior and the level of activation frames, I'm not sure what a summary level or other finer grained options could look like. The only information in activation frames that is specific to profiling are the start and end times of the respective evaluation and the cache hit/miss bit. The impact on execution time is (presumably) very minor compared to the profiling work of building the call tree and cumulative timings which are controlled by the singular engine option I added.

A related topic is coverage tracking. I remember seeing something about coverage tracking in the codebase but I can't remember where now. Since a flat profile (just call counts, no nesting) also provides coverage tracking, some integration work may be required in that area to avoid multiple mechanisms for the same functionality.

Only push activation frames if profiling or backtracing is enabled. This avoids overhead in case both are disabled.

Since the stack of activation frames replaces the previous stack of windows, activation frames themselves are not optional.

For clients that can tolerate worse backtraces, there could be an option to only push activation frames when absolutely necessary and only with the absolutely necessary information. I doubt the impact will be large, possibly to the point that querying the engine options frequently would negate the savings, but that would have to be measured.

Also, most of the work for backtraces is done only after an error has already occurred by executing code in catch blocks as the stack unwinds, not in advance. Granted, establishing the extra try/catch blocks might in itself have a slight performance impact but again repeatedly querying engine options could be the more expensive operation.

Update the project’s documentation to include usage guidelines for the new backtrace and profiling features.

I think I can do that. The documentation would mainly explain how get the backtrace and profiling information, not how to present them. Something like the backtrace in the above screenshots needs some work on the client's part. The flamegraph in the screenshot can be produced with just the code in this pull request - provided we don't decide against having the rendering code in the Profile class.

Provide configuration settings to enable or disable backtracing and profiling, allowing users to control the overhead associated with these features.

Such an option exists for profiling. Activation frames cannot be turned off completely since their main purpose, holding variable bindings, is not optional. Reducing the number of activation frames could be an option, as I mentioned above, if it turns out to have a noticeable performance impact.

Assess the performance impact of the added features, especially in production environments, and optimize as necessary.

I would need help with this one since I don't have access to any relevant production environments.

Tests should cover scenarios for both backtracing and profiling, including edge cases like nested evaluations and cached results.

Since the existing tests exercise those edge cases and activation frames are used throughout, the main missing piece would be running the existing tests with profiling enabled. Any suggestions regarding how I could do that?

Simulated errors and exceptions should be tested to verify that backtraces and debug logs provide meaningful insights.

I can add test cases that validate the information in backtraces.

scymtym · 2025-06-10T08:34:14Z

@c-schuler With the most recent commits, the points you brought up should be mostly addressed:

Options for enabled and disabling the features: I explained above that a single engine option should be enough.
Documentation: I completed the javadoc documentation of the added classes. I'm not sure whether there are other places in which the features should be documented.
Performance Impact: I ran CqlPerformanceTest for three configurations: without my changes, with my changes + profiling disabled and with my changes + profiling enabled. There were no significant differences in execution time. I also fixed some of the messages about unexpected results in CqlPerformanceTest (see #cql chat) to see whether that impacted execution time of the test, but there didn't seem to be a significant impact either.
Unit Tests: I added unit tests for the backtrace and profiling features

I haven't pushed the change that allows CqlPerformanceTest to run with profiling enabled since it seems like I should wait for #1556 (See discussion in #cql).

Src/java/engine/src/main/java/org/opencds/cqf/cql/engine/exception/Backtrace.java

Src/java/engine/src/main/java/org/opencds/cqf/cql/engine/execution/State.java

Src/java/engine/src/main/java/org/opencds/cqf/cql/engine/exception/Backtrace.java

Reported-by: Anton Vasetenkov <antvaset@gmail.com>

Src/java/engine/src/main/java/org/opencds/cqf/cql/engine/execution/State.java

...java/engine/src/main/java/org/opencds/cqf/cql/engine/elm/executing/FunctionRefEvaluator.java

Src/java/engine/src/main/java/org/opencds/cqf/cql/engine/exception/Backtrace.java

Src/java/engine/src/main/java/org/opencds/cqf/cql/engine/execution/EvaluationVisitor.java

Src/java/engine/src/main/java/org/opencds/cqf/cql/engine/execution/Profile.java

Src/java/engine/src/main/java/org/opencds/cqf/cql/engine/execution/State.java

Src/java/engine/src/test/java/org/opencds/cqf/cql/engine/execution/ProfilingTest.java

Suggested-by: Anton Vasetenkov <antvaset@gmail.com>

The language level was changed in commit b55dc7b. Suggested-by: Anton Vasetenkov <antvaset@gmail.com>

antvaset · 2025-06-26T10:38:47Z

Hi @scymtym , this is so cool and very well done! And thanks for merging in my suggested changes.

Agree with @c-schuler that more toggles will be needed to control future profiling features like verbosity and granularity. In terms of what those toggles could look like, this TODO is captured by #1287 and will be addressed separately and perhaps more holistically. We don't currently have a README for CQL engine options in this repo (only code comments), and I can do that separately as well for all engine options (see new issue #1574).

scymtym · 2025-06-26T11:44:23Z

@antvaset Thank you for the review.

I agree that some way of controlling the profiling feature, that is capturing profiling info as well as processing the finished profile, in more detail will be needed. The ad-hoc flamegraph-SVG writer I included is more a demo than anything. Whether those fleshed out details should take the form of engine options, I'm not really sure but that can be worked later as you suggest.

c-schuler

Looks great! Thank you!

scymtym added 2 commits May 15, 2025 11:27

scymtym force-pushed the backtrace-and-profile branch from baa0b81 to 23b902f Compare May 15, 2025 11:14

scymtym added 7 commits May 19, 2025 16:48

New Profile class

fa2dc05

Track context in ActivationFrames

c8b768a

Try to produce better colors in flamegraph

17af9d6

Include context information in backtrace frames

c77aa6b

scymtym force-pushed the backtrace-and-profile branch from 23b902f to 5e89f72 Compare May 19, 2025 14:50

JPercival and others added 2 commits May 22, 2025 13:39

Merge branch 'master' into backtrace-and-profile

77cf0bf

Merge branch 'master' into backtrace-and-profile

c227925

c-schuler self-requested a review May 30, 2025 17:17

scymtym added 7 commits June 3, 2025 16:04

Add method Variable.toString

d975465

Add method Profile.Node.toString

2855935

In Backtrace.Frame, order of arguments matches function signature

93ce206

Improve documentation of Profile class

0bbc3b0

Improve documentation of Backtrace class

02df890

Initial test for Backtrace functionality

da27418

Initial test for profiling functionality

91017fc

JPercival and others added 6 commits June 10, 2025 11:59

Merge branch 'master' into backtrace-and-profile

cd75d1a

Merge branch 'master' into backtrace-and-profile

24e37a2

Merge branch 'master' into backtrace-and-profile

d8d58e2

Merge branch 'master' into backtrace-and-profile

de4d122

Merge branch 'master' into backtrace-and-profile

500a825

Use multiple different Engine Options in CqlPerformanceIT

336950f

Remove unused code in ProfilingTest

b0602c7

scymtym marked this pull request as ready for review June 24, 2025 14:58

antvaset self-assigned this Jun 25, 2025

antvaset reviewed Jun 25, 2025

View reviewed changes

Src/java/engine/src/main/java/org/opencds/cqf/cql/engine/exception/Backtrace.java Outdated Show resolved Hide resolved

antvaset reviewed Jun 25, 2025

View reviewed changes

Src/java/engine/src/main/java/org/opencds/cqf/cql/engine/exception/Backtrace.java Outdated Show resolved Hide resolved

antvaset reviewed Jun 25, 2025

View reviewed changes

Src/java/engine/src/main/java/org/opencds/cqf/cql/engine/execution/State.java Outdated Show resolved Hide resolved

antvaset reviewed Jun 25, 2025

View reviewed changes

Src/java/engine/src/main/java/org/opencds/cqf/cql/engine/exception/Backtrace.java Outdated Show resolved Hide resolved

Fix typos in comments

ef6216f

Reported-by: Anton Vasetenkov <antvaset@gmail.com>

antvaset reviewed Jun 25, 2025

View reviewed changes

Src/java/engine/src/main/java/org/opencds/cqf/cql/engine/execution/State.java Outdated Show resolved Hide resolved

antvaset reviewed Jun 25, 2025

View reviewed changes

Src/java/engine/src/main/java/org/opencds/cqf/cql/engine/execution/State.java Outdated Show resolved Hide resolved