fix(exp): Add made by traceloop evaluators #3503
Conversation
Note: Other AI code review bot(s) detected. CodeRabbit has detected other AI code review bot(s) in this pull request and will avoid duplicating their findings in the review comments. This may lead to a less comprehensive review.

Walkthrough

This PR introduces comprehensive evaluator support to the Traceloop SDK. It adds the EvaluatorDetails configuration model, an EvaluatorMadeByTraceloop factory class for preconfigured evaluators, an enhanced Evaluator class with config propagation and task output validation, a refactored Experiment.run() that accepts and configures evaluators, and six new sample experiment scripts demonstrating various evaluator types.
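For orientation, a minimal usage sketch of the API described in the walkthrough might look like the following; the evaluator slug, field names, and the client wiring are assumptions based on this review and the sample apps, not verified against the SDK.

```python
# Sketch only: slugs, field names, and client wiring are assumptions from this review.
from traceloop.sdk.evaluator import EvaluatorDetails


async def qa_task(row: dict) -> dict:
    # The task output must contain every field the configured evaluators require.
    return {
        "question": row["question"],
        "answer": "generated answer",
        "completion": "generated answer",
        "context": row.get("context", ""),
    }


async def run_experiment(client) -> None:
    # `client` is assumed to expose `experiment.run(...)` as in the sample apps.
    results, errors = await client.experiment.run(
        task=qa_task,
        evaluators=[
            "my-custom-evaluator",  # plain slug string (hypothetical slug)
            EvaluatorDetails(
                slug="faithfulness",
                required_input_fields=["question", "completion", "context"],
            ),
        ],
    )
    print(results, errors)
```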
Sequence Diagram

sequenceDiagram
participant Client as Client Code
participant Exp as Experiment
participant Task as Task Function
participant Val as Validation
participant Eval as Evaluator
Client->>Exp: run(evaluators=[EvaluatorDetails(...)])
Exp->>Exp: _run_locally()
Exp->>Task: task(row)
Task-->>Exp: task_result (dict)
Exp->>Val: validate_task_output(task_result, evaluators)
alt Validation passes
Val-->>Exp: OK
Exp->>Eval: run_experiment_evaluator(task_result, evaluator_config)
Eval-->>Exp: evaluation_result
else Validation fails
Val-->>Exp: ValueError (missing fields)
Exp->>Exp: handle error (stop_on_error logic)
end
Exp-->>Client: experiment_results
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~60 minutes
Pre-merge checks and finishing touches: ✅ Passed checks (3 passed)
Caution
Changes requested ❌
Reviewed everything up to cff5fd4 in 2 minutes and 0 seconds.
- Reviewed 1715 lines of code in 15 files
- Skipped 0 files when reviewing
- Skipped posting 3 draft comments. View those below.
- Modify your settings and rules to customize what types of comments Ellipsis leaves. And don't forget to react with 👍 or 👎 to teach Ellipsis.
1. packages/sample-app/sample_app/experiment/made_by_traceloop/quality_exp.py:57
- Draft comment:
The perplexity evaluator requires a field named 'prompt', but this task returns 'logprobs' (using 'completion'). Consider renaming the key to 'prompt' or updating the evaluator configuration.
- Reason this comment was not posted:
Decided after close inspection that this draft comment was likely wrong and/or not actionable: usefulness confidence = 15% vs. threshold = 50%. The comment is making a specific claim about the API requirements of the perplexity evaluator. However, I cannot verify this without seeing the Traceloop SDK documentation or other parts of the codebase. The code comment on line 59 says "For perplexity", suggesting the author intended this field for the perplexity evaluator. Additionally, there's a problem with the logic: line 36 shows generate_response returns only response.choices[0].message.content (a string), not the full response object with logprobs. So completion on line 54 is just a string, not logprobs data. This is a separate issue from what the comment raises. The comment's claim about field naming cannot be verified without external context. I'm making assumptions about what the perplexity evaluator requires without having access to its documentation or implementation. The comment might be correct, but I cannot verify it from the diff alone. Additionally, there's actually a different bug (returning only content instead of logprobs) that the comment doesn't address, which makes me question if the reviewer understood the code correctly. Since I cannot verify the comment's claim without external documentation, and the rule states I must see STRONG EVIDENCE that the comment is correct, I should delete it. The comment requires knowledge of the Traceloop SDK API that isn't visible in this diff. Furthermore, the comment's phrasing is confusing and doesn't address the actual bug in the code (that logprobs aren't being returned at all). Delete this comment. It makes claims about API requirements that cannot be verified from the diff alone, violating the rule that I need strong evidence the comment is correct. The comment also appears to misunderstand the actual issue in the code.
2. packages/sample-app/sample_app/experiment/made_by_traceloop/security_exp.py:48
- Draft comment:
The prompt injection evaluator expects a field named 'prompt', but the task returns 'query'. Update the key (or evaluator requirements) for consistency.
- Reason this comment was not posted:
Decided after close inspection that this draft comment was likely wrong and/or not actionable: usefulness confidence = 15% vs. threshold = 50% The comment is making a specific claim about the API requirements of the prompt_injection evaluator - that it expects a field named "prompt" rather than "query". However, this is not something I can verify from the diff itself. The code is new (it's a new file), and there's no error message or documentation visible in the diff that confirms this requirement. The comment is essentially speculative - it assumes knowledge about the Traceloop SDK's internal requirements that isn't evident from the code shown. This violates the rule about not making speculative comments. If this were actually wrong, it would likely be caught at runtime or during testing. Perhaps the reviewer has external knowledge of the Traceloop SDK API that I don't have access to. Maybe there's documentation or previous examples in the codebase that show the correct field naming convention. The comment could be correct even if I can't verify it from this diff alone. Even if the reviewer has external knowledge, the rule states I must see STRONG EVIDENCE in the diff itself to keep a comment. Without documentation, error messages, or other clear indicators in the changed code, this is speculative. If the field naming is wrong, it would be caught when the code runs, not during code review. This comment should be deleted. There's no strong evidence in the diff that the field naming is incorrect. The comment is speculative about API requirements that aren't documented or evident in the code changes. If this were actually wrong, it would be caught at runtime.
3. packages/traceloop-sdk/traceloop/sdk/experiment/experiment.py:223
- Draft comment:
Minor suggestion: The comment currently reads "Print error to console so user can see it." Consider adding 'the' to improve readability, e.g., "Print error to console so the user can see it."
- Reason this comment was not posted:
Comment did not seem useful. Confidence is useful = 20% <= threshold 50%. This comment is purely about improving readability by adding a word. It doesn't suggest a code change or identify a potential issue. It seems to be more of a stylistic suggestion rather than a necessary change.
Workflow ID: wflow_SR8fxI3SxDzHInbm
You can customize by changing your verbosity settings, reacting with 👍 or 👎, replying to comments, or adding code review rules.
Actionable comments posted: 10
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
packages/traceloop-sdk/traceloop/sdk/experiment/experiment.py (1)
96-170: Guard against evaluators being None when building evaluators_to_validate (currently a crash)

In _run_locally, evaluators is annotated as optional and often left as the default None, but evaluators_to_validate is built with a bare comprehension:

  evaluators_to_validate = [evaluator for evaluator in evaluators if isinstance(evaluator, EvaluatorDetails)]

When evaluators is None this raises TypeError: 'NoneType' object is not iterable, which would break the default "no evaluators" path. You can fix this by treating None as an empty list:

- evaluators_to_validate = [evaluator for evaluator in evaluators if isinstance(evaluator, EvaluatorDetails)]
+ evaluators_to_validate = [
+     evaluator
+     for evaluator in (evaluators or [])
+     if isinstance(evaluator, EvaluatorDetails)
+ ]

While you're here, consider updating the _run_locally docstring description of evaluators ("List of evaluator slugs to run") to mention EvaluatorDetails as well, to match the actual accepted types.
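Putting the suggested guard and the validation step together, the relevant logic could be wrapped roughly like this; a sketch only, with import paths inferred from the files listed in this review.

```python
from typing import Any, Dict, List, Optional

from traceloop.sdk.evaluator import EvaluatorDetails
from traceloop.sdk.evaluator.evaluator import validate_task_output  # path assumed from this review


def maybe_validate(task_output: Dict[str, Any], evaluators: Optional[List[Any]]) -> None:
    # Sketch: treat a missing evaluators list as empty so the default
    # "no evaluators" path doesn't crash, then validate only EvaluatorDetails entries.
    to_validate = [e for e in (evaluators or []) if isinstance(e, EvaluatorDetails)]
    if to_validate:
        validate_task_output(task_output, to_validate)
```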
🧹 Nitpick comments (13)
packages/traceloop-sdk/traceloop/sdk/client/http.py (1)
97-130: Tighten _format_error_message exception handling and address linter hints

The helper is a solid improvement, but a couple of small tweaks will both satisfy the linters and make behavior clearer:

- Line 101 (RUF010): you can avoid the f-string conversion warning and the extra str() call by using an explicit conversion flag:

  error_parts = [f"Error making request to {path}: {str(exception)}"]
  error_parts = [f"Error making request to {path}: {exception!s}"]

- Lines 123–128 (S110, BLE001): catching bare Exception and doing pass is a code smell. Narrowing the exception type and handling it explicitly (even in a minimal way) makes the intent clearer and should quiet the warnings. For example:

    except (ValueError, AttributeError):
        # Not JSON, try to get text
        try:
            error_text = response.text
            if error_text and len(error_text) < 500:  # Only include short error messages
                error_parts.append(f"Server response: {error_text}")
  -     except Exception:
  -         pass
  +     except (AttributeError, UnicodeDecodeError) as exc:
  +         # Don't let response text decoding failures break error formatting
  +         error_parts.append(f"Failed to read server error text: {exc!s}")

  (Adjust the caught exception types if you see other concrete failures in practice.)

Optionally, you might also append the HTTP status code when exception.response is present to further improve debuggability, but the current behavior is already a clear upgrade.

packages/sample-app/sample_app/experiment/made_by_traceloop/style_exp.py (1)
90-98: Prefix unused unpacked variables with underscore.

The results and errors variables are unpacked but never used. Following Python conventions, prefix them with underscores to indicate intentional non-use. Apply this diff:

- results, errors = await client.experiment.run(
+ _results, _errors = await client.experiment.run(

packages/sample-app/sample_app/experiment/made_by_traceloop/compliance_exp.py (1)
88-96: Prefix unused unpacked variables with underscore.

The results and errors variables are unpacked but never used. Following Python conventions, prefix them with underscores to indicate intentional non-use. Apply this diff:

- results, errors = await client.experiment.run(
+ _results, _errors = await client.experiment.run(

packages/sample-app/sample_app/experiment/made_by_traceloop/quality_exp.py (1)
104-112: Prefix unused unpacked variables with underscore.

The results and errors variables are unpacked but never used. Following Python conventions, prefix them with underscores to indicate intentional non-use. Apply this diff:

- results, errors = await client.experiment.run(
+ _results, _errors = await client.experiment.run(

packages/sample-app/sample_app/experiment/made_by_traceloop/correctness_exp.py (2)
22-22: Use explicit Optional type hint.

PEP 484 prohibits implicit Optional. The parameter should explicitly declare the optional type.

+ from typing import Optional
+
- async def generate_response(prompt: str, context: str = None) -> str:
+ async def generate_response(prompt: str, context: Optional[str] = None) -> str:
90-98: Prefix unused variables with underscore.

The results and errors variables are unpacked but never used. As flagged by static analysis (Ruff RUF059), prefix them with underscores.

- results, errors = await client.experiment.run(
+ _results, _errors = await client.experiment.run(

packages/sample-app/sample_app/experiment/made_by_traceloop/formatting_exp.py (2)
127-135: Prefix unused variables with underscore.

As flagged by static analysis (Ruff RUF059), prefix unused variables with underscores.

- results, errors = await client.experiment.run(
+ _results, _errors = await client.experiment.run(
142-191: run_validation_examples() is defined but never called.

This function provides useful live examples but is never invoked from the main entry point. Consider either calling it (e.g., after run_formatting_experiment()) or removing it if it's not needed.

packages/sample-app/sample_app/experiment/made_by_traceloop/security_exp.py (1)
84-92: Prefix unused variables with underscore.

As flagged by static analysis (Ruff RUF059), prefix unused variables with underscores.

- results, errors = await client.experiment.run(
+ _results, _errors = await client.experiment.run(

packages/traceloop-sdk/traceloop/sdk/evaluator/evaluator.py (1)
211-223: Consider logging the exception in the fallback path.

While the fallback to response.text is appropriate, Ruff (S110) suggests logging exceptions rather than silently passing. This could aid debugging when error extraction fails unexpectedly.

+ import logging
+
+ logger = logging.getLogger(__name__)

  def _extract_error_from_response(response: httpx.Response) -> str:
      # ...
      try:
          error_json = response.json()
          # ...
-     except Exception:
-         pass  # Use response.text as fallback
+     except Exception as e:
+         logger.debug("Failed to parse error response as JSON: %s", e)
      return error_detail

packages/traceloop-sdk/traceloop/sdk/experiment/experiment.py (3)
43-57: Make evaluators explicitly optional to match the default and avoid implicit Optional

evaluators has a default of None but is annotated as a non-optional List[EvaluatorSpec], which both misleads type checkers and triggers Ruff's RUF013 ("implicit Optional"). Align it with the other methods and the default:

- evaluators: List[EvaluatorSpec] = None,
+ evaluators: Optional[List[EvaluatorSpec]] = None,

The updated docstring for evaluators already reflects the richer accepted types, so no further change needed there.
181-225: Simplify exception formatting in f-strings and consider using logging instead of print

The new error handling is welcome, but you're wrapping e with str(e) inside f-strings:

  error_msg = f"Error: {str(e)}"
  print(f"\033[91m❌ Evaluator '{evaluator_slug}' failed: {str(e)}\033[0m")
  error_msg = f"Error processing row: {str(e)}"
  print(f"\033[91m❌ Task execution failed: {str(e)}\033[0m")

Given f-strings already call str() by default, and Ruff flags this as RUF010, you can simplify and satisfy the linter:

- error_msg = f"Error: {str(e)}"
+ error_msg = f"Error: {e}"
  ...
- print(f"\033[91m❌ Evaluator '{evaluator_slug}' failed: {str(e)}\033[0m")
+ print(f"\033[91m❌ Evaluator '{evaluator_slug}' failed: {e}\033[0m")
  ...
- error_msg = f"Error processing row: {str(e)}"
+ error_msg = f"Error processing row: {e}"
  ...
- print(f"\033[91m❌ Task execution failed: {str(e)}\033[0m")
+ print(f"\033[91m❌ Task execution failed: {e}\033[0m")

Longer term, you might want to route these through the standard logging framework instead of print().
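If the print() calls are eventually replaced, a small logging-based helper along these lines would keep the messages while making them configurable; this is a sketch and the helper name is hypothetical.

```python
import logging

logger = logging.getLogger(__name__)


def report_evaluator_failure(evaluator_slug: str, error: Exception) -> None:
    # Hypothetical helper: centralizes failure reporting instead of ad-hoc print() calls.
    logger.error("Evaluator '%s' failed: %s", evaluator_slug, error)
```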
262-365: Align _run_in_github evaluator handling and docstring with supported types

_run_in_github now accepts evaluators: Optional[List[EvaluatorSpec]] and extracts slugs via:

  if evaluators:
      evaluator_slugs = []
      for evaluator in evaluators:
          if isinstance(evaluator, str):
              evaluator_slugs.append(evaluator)
          elif isinstance(evaluator, EvaluatorDetails):
              evaluator_slugs.append(evaluator.slug)

while the docstring still says:

  evaluators: List of evaluator slugs or (slug, version) tuples to run

Either reintroduce tuple support in the implementation or (more likely) update the docstring to mention str/EvaluatorDetails (or EvaluatorSpec) rather than (slug, version) tuples so users aren't misled about accepted shapes.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
Disabled knowledge base sources:
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (15)
packages/sample-app/sample_app/experiment/made_by_traceloop/compliance_exp.py (1 hunks)
packages/sample-app/sample_app/experiment/made_by_traceloop/correctness_exp.py (1 hunks)
packages/sample-app/sample_app/experiment/made_by_traceloop/formatting_exp.py (1 hunks)
packages/sample-app/sample_app/experiment/made_by_traceloop/quality_exp.py (1 hunks)
packages/sample-app/sample_app/experiment/made_by_traceloop/security_exp.py (1 hunks)
packages/sample-app/sample_app/experiment/made_by_traceloop/style_exp.py (1 hunks)
packages/sample-app/sample_app/experiment/run_research_experiment.py (1 hunks)
packages/traceloop-sdk/traceloop/sdk/client/http.py (4 hunks)
packages/traceloop-sdk/traceloop/sdk/evaluator/__init__.py (1 hunks)
packages/traceloop-sdk/traceloop/sdk/evaluator/config.py (1 hunks)
packages/traceloop-sdk/traceloop/sdk/evaluator/evaluator.py (9 hunks)
packages/traceloop-sdk/traceloop/sdk/evaluator/evaluators_made_by_traceloop.py (1 hunks)
packages/traceloop-sdk/traceloop/sdk/evaluator/model.py (1 hunks)
packages/traceloop-sdk/traceloop/sdk/experiment/experiment.py (11 hunks)
packages/traceloop-sdk/traceloop/sdk/experiment/model.py (1 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
**/*.py
📄 CodeRabbit inference engine (CLAUDE.md)
**/*.py: Store API keys only in environment variables/secure vaults; never hardcode secrets in code
Use Flake8 for code linting and adhere to its rules
Files:
packages/traceloop-sdk/traceloop/sdk/evaluator/model.py
packages/sample-app/sample_app/experiment/made_by_traceloop/quality_exp.py
packages/traceloop-sdk/traceloop/sdk/evaluator/config.py
packages/sample-app/sample_app/experiment/run_research_experiment.py
packages/traceloop-sdk/traceloop/sdk/evaluator/__init__.py
packages/traceloop-sdk/traceloop/sdk/evaluator/evaluator.py
packages/sample-app/sample_app/experiment/made_by_traceloop/correctness_exp.py
packages/sample-app/sample_app/experiment/made_by_traceloop/security_exp.py
packages/traceloop-sdk/traceloop/sdk/experiment/model.py
packages/sample-app/sample_app/experiment/made_by_traceloop/compliance_exp.py
packages/sample-app/sample_app/experiment/made_by_traceloop/style_exp.py
packages/sample-app/sample_app/experiment/made_by_traceloop/formatting_exp.py
packages/traceloop-sdk/traceloop/sdk/experiment/experiment.py
packages/traceloop-sdk/traceloop/sdk/client/http.py
packages/traceloop-sdk/traceloop/sdk/evaluator/evaluators_made_by_traceloop.py
🧠 Learnings (1)
📚 Learning: 2025-08-04T15:35:30.188Z
Learnt from: nina-kollman
Repo: traceloop/openllmetry PR: 3219
File: packages/traceloop-sdk/traceloop/sdk/datasets/dataset.py:357-360
Timestamp: 2025-08-04T15:35:30.188Z
Learning: In the traceloop SDK Dataset class, the established error handling pattern is that HTTP client methods return None on failure (after catching and logging RequestException), and the Dataset API methods check for None return values and raise Exception with descriptive messages. The update_row_api method is inconsistent with this pattern.
Applied to files:
packages/traceloop-sdk/traceloop/sdk/client/http.py
🧬 Code graph analysis (8)
packages/sample-app/sample_app/experiment/made_by_traceloop/quality_exp.py (2)
packages/traceloop-sdk/traceloop/sdk/__init__.py (2)
Traceloop (36-267), init (48-198)
packages/traceloop-sdk/traceloop/sdk/evaluator/evaluators_made_by_traceloop.py (5)
EvaluatorMadeByTraceloop (5-407), perplexity (394-407), agent_goal_accuracy (360-374), semantic_similarity (343-357), topic_adherence (377-391)
packages/sample-app/sample_app/experiment/made_by_traceloop/correctness_exp.py (2)
packages/traceloop-sdk/traceloop/sdk/__init__.py (2)
Traceloop (36-267), init (48-198)
packages/traceloop-sdk/traceloop/sdk/evaluator/evaluators_made_by_traceloop.py (2)
answer_relevancy (242-256), faithfulness (259-274)
packages/sample-app/sample_app/experiment/made_by_traceloop/security_exp.py (1)
packages/traceloop-sdk/traceloop/sdk/evaluator/evaluators_made_by_traceloop.py (4)
EvaluatorMadeByTraceloop (5-407), pii_detector (22-38), secrets_detector (312-324), prompt_injection (61-77)
packages/traceloop-sdk/traceloop/sdk/experiment/model.py (1)
packages/traceloop-sdk/traceloop/sdk/evaluator/config.py (1)
EvaluatorDetails (5-22)
packages/sample-app/sample_app/experiment/made_by_traceloop/style_exp.py (1)
packages/traceloop-sdk/traceloop/sdk/evaluator/evaluators_made_by_traceloop.py (4)
char_count (177-190), word_count (210-222), char_count_ratio (193-207), word_count_ratio (225-239)
packages/sample-app/sample_app/experiment/made_by_traceloop/formatting_exp.py (1)
packages/traceloop-sdk/traceloop/sdk/evaluator/evaluators_made_by_traceloop.py (4)
json_validator (114-137), sql_validator (327-340), regex_validator (80-111), placeholder_regex (140-174)
packages/traceloop-sdk/traceloop/sdk/experiment/experiment.py (3)
packages/traceloop-sdk/traceloop/sdk/evaluator/evaluator.py (2)
Evaluator (16-146), validate_task_output (149-195)
packages/traceloop-sdk/traceloop/sdk/experiment/model.py (1)
InitExperimentRequest (17-25)
packages/traceloop-sdk/traceloop/sdk/evaluator/config.py (1)
EvaluatorDetails (5-22)
packages/traceloop-sdk/traceloop/sdk/evaluator/evaluators_made_by_traceloop.py (1)
packages/traceloop-sdk/traceloop/sdk/evaluator/config.py (1)
EvaluatorDetails (5-22)
🪛 Flake8 (7.3.0)
packages/traceloop-sdk/traceloop/sdk/evaluator/evaluator.py
[error] 2-2: 'inspect' imported but unused
(F401)
[error] 3-3: 'typing.Callable' imported but unused
(F401)
[error] 3-3: 'typing.get_type_hints' imported but unused
(F401)
[error] 3-3: 'typing.get_origin' imported but unused
(F401)
[error] 3-3: 'typing.get_args' imported but unused
(F401)
🪛 Ruff (0.14.7)
packages/sample-app/sample_app/experiment/made_by_traceloop/quality_exp.py
104-104: Unpacked variable results is never used
Prefix it with an underscore or any other dummy variable pattern
(RUF059)
104-104: Unpacked variable errors is never used
Prefix it with an underscore or any other dummy variable pattern
(RUF059)
packages/traceloop-sdk/traceloop/sdk/evaluator/evaluator.py
221-222: try-except-pass detected, consider logging the exception
(S110)
221-221: Do not catch blind exception: Exception
(BLE001)
packages/sample-app/sample_app/experiment/made_by_traceloop/correctness_exp.py
22-22: PEP 484 prohibits implicit Optional
Convert to T | None
(RUF013)
90-90: Unpacked variable results is never used
Prefix it with an underscore or any other dummy variable pattern
(RUF059)
90-90: Unpacked variable errors is never used
Prefix it with an underscore or any other dummy variable pattern
(RUF059)
packages/sample-app/sample_app/experiment/made_by_traceloop/security_exp.py
84-84: Unpacked variable results is never used
Prefix it with an underscore or any other dummy variable pattern
(RUF059)
84-84: Unpacked variable errors is never used
Prefix it with an underscore or any other dummy variable pattern
(RUF059)
packages/sample-app/sample_app/experiment/made_by_traceloop/compliance_exp.py
88-88: Unpacked variable results is never used
Prefix it with an underscore or any other dummy variable pattern
(RUF059)
88-88: Unpacked variable errors is never used
Prefix it with an underscore or any other dummy variable pattern
(RUF059)
packages/sample-app/sample_app/experiment/made_by_traceloop/style_exp.py
90-90: Unpacked variable results is never used
Prefix it with an underscore or any other dummy variable pattern
(RUF059)
90-90: Unpacked variable errors is never used
Prefix it with an underscore or any other dummy variable pattern
(RUF059)
packages/sample-app/sample_app/experiment/made_by_traceloop/formatting_exp.py
127-127: Unpacked variable results is never used
Prefix it with an underscore or any other dummy variable pattern
(RUF059)
127-127: Unpacked variable errors is never used
Prefix it with an underscore or any other dummy variable pattern
(RUF059)
packages/traceloop-sdk/traceloop/sdk/experiment/experiment.py
43-43: PEP 484 prohibits implicit Optional
Convert to T | None
(RUF013)
212-212: Use explicit conversion flag
Replace with conversion flag
(RUF010)
215-215: Use explicit conversion flag
Replace with conversion flag
(RUF010)
222-222: Use explicit conversion flag
Replace with conversion flag
(RUF010)
224-224: Use explicit conversion flag
Replace with conversion flag
(RUF010)
packages/traceloop-sdk/traceloop/sdk/client/http.py
101-101: Use explicit conversion flag
Replace with conversion flag
(RUF010)
127-128: try-except-pass detected, consider logging the exception
(S110)
127-127: Do not catch blind exception: Exception
(BLE001)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (5)
- GitHub Check: Test Packages (3.11)
- GitHub Check: Test Packages (3.12)
- GitHub Check: Test Packages (3.10)
- GitHub Check: Lint
- GitHub Check: Build Packages (3.11)
🔇 Additional comments (18)
packages/traceloop-sdk/traceloop/sdk/client/http.py (4)
36-37: Centralized error formatting in post looks consistent with existing pattern

Using _format_error_message while still returning None on failure keeps behavior aligned with the established "HTTP client returns None and caller raises" convention. Based on learnings, this matches the existing error-handling approach.
58-59: Consistent error reporting in get

Reusing _format_error_message here keeps GET error handling consistent with POST/PUT/DELETE without changing the return contract.
73-74: DELETE error handling is aligned with other verbs
_format_error_message + returning False on failure matches the pattern used by the other HTTP methods.
93-94: PUT error handling reuses the shared formatter correctly

The new helper is applied here in the same way as for POST, preserving the None-on-error behavior while improving log detail.

packages/traceloop-sdk/traceloop/sdk/evaluator/__init__.py (1)
1-9: LGTM! Clean public API expansion.

The module correctly exposes the new evaluator configuration infrastructure (EvaluatorDetails and EvaluatorMadeByTraceloop) alongside the existing Evaluator class, following standard Python module patterns.

packages/traceloop-sdk/traceloop/sdk/evaluator/model.py (1)
16-23: LGTM! Backward-compatible configuration field added.

The addition of evaluator_config to ExecuteEvaluatorRequest enables per-evaluator configuration propagation without breaking existing usage. The flexible Dict[str, Any] type is appropriate for extensible configuration.

packages/sample-app/sample_app/experiment/run_research_experiment.py (2)
49-53: LGTM! Field name updated for consistency.

The output key change from "sentence" to "text" promotes consistency with other experiment tasks in the codebase (e.g., security_exp.py).

17-21: Good practice: API keys sourced from environment.

Properly retrieves sensitive credentials from environment variables, following the coding guidelines.
packages/traceloop-sdk/traceloop/sdk/experiment/model.py (1)
1-6: LGTM! Type definitions refactored for better organization.

Centralizing EvaluatorDetails in the config module and introducing EvaluatorSpec as a union type improves maintainability and enables flexible evaluator specification (either by slug string or detailed configuration object).

packages/sample-app/sample_app/experiment/made_by_traceloop/style_exp.py (1)
18-23: Good practice: API keys sourced from environment.

Properly retrieves sensitive credentials from environment variables, following the coding guidelines.
packages/sample-app/sample_app/experiment/made_by_traceloop/compliance_exp.py (1)
19-24: Good practice: API keys sourced from environment.

Properly retrieves sensitive credentials from environment variables, following the coding guidelines.
packages/sample-app/sample_app/experiment/made_by_traceloop/quality_exp.py (1)
21-26: Good practice: API keys sourced from environment.

Properly retrieves sensitive credentials from environment variables, following the coding guidelines.
packages/sample-app/sample_app/experiment/made_by_traceloop/correctness_exp.py (1)
40-56: Task output correctly provides required fields for configured evaluators.

The quality_task returns question, answer, completion, and context, which satisfies:
- answer_relevancy: requires ["question", "answer"] ✓
- faithfulness: requires ["question", "completion", "context"] ✓

packages/sample-app/sample_app/experiment/made_by_traceloop/formatting_exp.py (1)
54-70: Task output correctly provides required fields for all configured evaluators.

The validation_task returns text and placeholder_value, satisfying all four evaluators' requirements.

packages/traceloop-sdk/traceloop/sdk/evaluator/evaluator.py (2)
61-66: Improved error messages with evaluator slug context.

Including the evaluator slug and extracted error detail in exceptions significantly improves debuggability.
149-195: Well-structured validation with actionable error messages.

The validate_task_output function provides clear feedback by listing missing fields per evaluator and showing available keys. This will help developers quickly identify and fix configuration issues.
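As a rough illustration of the behaviour this comment describes (not the actual implementation), the check could be sketched as:

```python
from typing import Any, Dict, List

from traceloop.sdk.evaluator import EvaluatorDetails


def validate_task_output_sketch(task_output: Dict[str, Any], evaluators: List[EvaluatorDetails]) -> None:
    # Sketch: report missing required fields per evaluator plus the keys that are available.
    for evaluator in evaluators:
        required = evaluator.required_input_fields or []
        missing = [field for field in required if field not in task_output]
        if missing:
            raise ValueError(
                f"Evaluator '{evaluator.slug}' is missing required fields {missing}; "
                f"available keys: {sorted(task_output.keys())}"
            )
```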
packages/traceloop-sdk/traceloop/sdk/evaluator/evaluators_made_by_traceloop.py (1)

21-38: Well-structured factory pattern with clear configuration.

The static factory methods provide a clean API for creating preconfigured evaluators with proper type hints and sensible defaults.
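For context, one of these factory methods might be sketched roughly as follows; the class name used here is a stand-in and the real methods may take additional parameters.

```python
from traceloop.sdk.evaluator import EvaluatorDetails


class EvaluatorFactorySketch:
    """Illustrative stand-in for the EvaluatorMadeByTraceloop factory discussed above."""

    @staticmethod
    def faithfulness() -> EvaluatorDetails:
        # The real factory methods may accept extra configuration parameters.
        return EvaluatorDetails(
            slug="faithfulness",
            required_input_fields=["question", "completion", "context"],
        )
```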
packages/traceloop-sdk/traceloop/sdk/experiment/experiment.py (1)
8-21: Evaluator imports and type wiring look consistent

The new imports (Evaluator, validate_task_output, EvaluatorSpec, EvaluatorDetails) align with the later usage in this file and keep public API (EvaluatorSpec) separated from internal configuration (EvaluatorDetails). No issues here.
Caution
Changes requested ❌
Reviewed ac8a852 in 2 minutes and 44 seconds.
- Reviewed 61 lines of code in 1 file
- Skipped 0 files when reviewing
- Skipped posting 3 draft comments. View those below.
- Modify your settings and rules to customize what types of comments Ellipsis leaves. And don't forget to react with 👍 or 👎 to teach Ellipsis.
1. packages/traceloop-sdk/traceloop/sdk/experiment/experiment.py:309
- Draft comment:
Passing the 'evaluators' parameter to _execute_tasks in the GitHub flow ensures validation is applied. Verify that the error handling remains consistent with the local execution path.
- Reason this comment was not posted:
Comment did not seem useful. Confidence is useful = 0% <= threshold 50%. The comment is asking the PR author to verify that error handling remains consistent, which violates the rule against asking for confirmation or verification. It does not provide a specific suggestion or point out a specific issue.
2. packages/traceloop-sdk/traceloop/sdk/experiment/experiment.py:472
- Draft comment:
Filtering for EvaluatorDetails instances is appropriate. If non-EvaluatorDetails evaluators also need validation, consider extending this logic.
- Reason this comment was not posted:
Decided after close inspection that this draft comment was likely wrong and/or not actionable: usefulness confidence = 10% vs. threshold = 50% This comment says "consider extending this logic" which is speculative and not actionable. It's a "what if" suggestion without clear evidence that non-EvaluatorDetails evaluators need validation. The code intentionally filters to only EvaluatorDetails because string evaluators (the other type in EvaluatorSpec) don't have the required_input_fields needed for validation. This is working as designed. The comment violates the rule about not making speculative comments like "If X, then Y is an issue" - it's suggesting a potential future enhancement without evidence it's needed. Maybe the comment is pointing out a real limitation - perhaps string evaluators should also support validation somehow? Without seeing the EvaluatorSpec type definition and understanding the full design, I might be missing context about whether this is a genuine issue. Even if there's a potential limitation, the comment is still speculative ("consider extending") rather than pointing to a concrete bug or required change. The rules explicitly state to avoid speculative comments. If string evaluators needed validation, that would be a design decision requiring broader context beyond this PR, not something to address in this code review. This comment should be deleted. It's speculative ("consider extending") and doesn't point to a concrete issue with the changed code. The filtering to only EvaluatorDetails appears intentional since only those objects have the validation fields needed.
3. packages/traceloop-sdk/traceloop/sdk/experiment/experiment.py:497
- Draft comment:
Propagating ValueError in the outer exception block (via 'if isinstance(e, ValueError): raise e') might prevent error aggregation. Consider handling all errors uniformly so that one failed validation doesn’t cancel processing of other rows, unless that behavior is intended.
- Reason this comment was not posted:
Decided after close inspection that this draft comment was likely wrong and/or not actionable: usefulness confidence = 20% vs. threshold = 50% The comment raises a valid point about error handling behavior. The code re-raises ValueError (validation errors) but returns TaskResult with error for other exceptions. This creates inconsistent error handling. However, looking at line 513-515, there's no exception handling in the loop that collects results, so ANY exception (not just ValueError) would stop processing. This suggests the comment might be partially correct but incomplete in its analysis. The comment is speculative ("might prevent", "unless that behavior is intended") which violates the rule about not making speculative comments. The comment also doesn't clearly state what code change is required - it says "Consider handling all errors uniformly" which is vague. The comment is somewhat speculative with phrases like "might prevent" and "unless that behavior is intended". It's asking the author to confirm their intention rather than pointing out a definite bug. Also, the comment doesn't provide a clear, actionable fix - it just suggests to "consider" uniform handling. While the comment is somewhat speculative, it does point to a real inconsistency in error handling that was introduced in the diff. The ValueError is re-raised while other exceptions return TaskResult with error. However, the lack of exception handling at line 514 means this distinction doesn't actually matter in practice - both will stop processing. The comment is not clearly actionable and asks the author to confirm intent. The comment violates the rule against speculative comments ("might prevent", "unless that behavior is intended") and asks the author to confirm their intention rather than pointing out a clear bug. While there is an inconsistency in error handling, the comment doesn't provide clear, actionable guidance on what to fix.
Workflow ID: wflow_ginYKbQs0BEMASIq
You can customize by changing your verbosity settings, reacting with 👍 or 👎, replying to comments, or adding code review rules.
Actionable comments posted: 0
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
packages/traceloop-sdk/traceloop/sdk/experiment/experiment.py (2)
96-171: Fix crash when evaluators is None and align error logging with lint hints

There is a real bug here:

  evaluators_to_validate = [evaluator for evaluator in evaluators if isinstance(evaluator, EvaluatorDetails)]

If evaluators is left as the default None (the common case), this will raise TypeError: 'NoneType' object is not iterable before any rows are processed. Guard this to prevent the crash:

- evaluators_to_validate = [evaluator for evaluator in evaluators if isinstance(evaluator, EvaluatorDetails)]
+ evaluators_to_validate = [
+     evaluator
+     for evaluator in (evaluators or [])
+     if isinstance(evaluator, EvaluatorDetails)
+ ]

Additionally, replace the four instances of str(e) inside f-strings with the more concise e!s format to comply with Ruff's RUF010 rule:

- error_msg = f"Error: {str(e)}"
+ error_msg = f"Error: {e!s}"
@@
- print(f"\033[91m❌ Evaluator '{evaluator_slug}' failed: {str(e)}\033[0m")
+ print(f"\033[91m❌ Evaluator '{evaluator_slug}' failed: {e!s}\033[0m")
@@
- error_msg = f"Error processing row: {str(e)}"
+ error_msg = f"Error processing row: {e!s}"
@@
- print(f"\033[91m❌ Task execution failed: {str(e)}\033[0m")
+ print(f"\033[91m❌ Task execution failed: {e!s}\033[0m")
@@
- error_msg = f"Task execution error: {str(e)}"
+ error_msg = f"Task execution error: {e!s}"
130-151: Docstring advertises unsupported tuple format; evaluators not in EvaluatorSpec union are silently dropped

The EvaluatorSpec type is defined as Union[str, EvaluatorDetails] (no tuples), yet the _run_in_github docstring at line 280 states: "List of evaluator slugs or (slug, version) tuples to run". Both _run_locally (lines 130-151) and _run_in_github (lines 357-366) only handle str and EvaluatorDetails via isinstance checks; anything else is silently skipped:

- In _run_locally: unsupported specs never enter evaluator_details, so they won't run.
- In _run_in_github: unsupported specs never contribute to evaluator_slugs, so backend evaluators won't trigger despite the docstring promising tuple support.

This causes silent failures that are difficult to debug. Either (a) update the docstring and implementation to support tuples as shown in the review, or (b) update the docstring to match the actual EvaluatorSpec type and add an explicit TypeError for unsupported specs instead of silently dropping them.
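If option (b) is taken, the slug extraction could fail loudly on unsupported specs; the following is a sketch of that normalization, not the current code.

```python
from typing import List, Optional, Union

from traceloop.sdk.evaluator import EvaluatorDetails

EvaluatorSpec = Union[str, EvaluatorDetails]  # mirrors the union described above


def extract_evaluator_slugs(evaluators: Optional[List[EvaluatorSpec]]) -> List[str]:
    # Sketch: normalize specs and raise on anything unsupported instead of silently skipping it.
    slugs: List[str] = []
    for evaluator in evaluators or []:
        if isinstance(evaluator, str):
            slugs.append(evaluator)
        elif isinstance(evaluator, EvaluatorDetails):
            slugs.append(evaluator.slug)
        else:
            raise TypeError(
                f"Unsupported evaluator spec {evaluator!r}; expected str or EvaluatorDetails"
            )
    return slugs
```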
🧹 Nitpick comments (2)
packages/traceloop-sdk/traceloop/sdk/experiment/experiment.py (2)
38-58: Tighten evaluators type annotation in run to match default None

run() declares evaluators: List[EvaluatorSpec] = None, which is an implicit Optional and what Ruff is flagging (RUF013). The rest of the file uses Optional[List[EvaluatorSpec]] when None is allowed. To keep type checkers and linters happy and make the intent explicit, consider:

- evaluators: List[EvaluatorSpec] = None,
+ evaluators: Optional[List[EvaluatorSpec]] = None,
454-502: Make _execute_tasks exception flow more idiomatic and preserve original tracebacks

In _execute_tasks.run_single_row:

- You catch ValueError from validate_task_output, print, then raise a new ValueError(str(validation_error)), losing the original traceback.
- In the outer except, you check isinstance(e, ValueError) and do raise e, which again resets the traceback (TRY201 / B904).
- Ruff also flags the str(...) inside the f-strings (RUF010).

You can both simplify the flow and keep original traces by re-raising the current exception and separating ValueError from other failures:

  async def run_single_row(row: Optional[Dict[str, Any]]) -> TaskResult:
      try:
          task_output = await task(row)
-         # Validate task output schema on first execution
-         if evaluators_to_validate:
-             try:
-                 validate_task_output(task_output, evaluators_to_validate)
-             except ValueError as validation_error:
-                 print(f"\033[91m❌ Task validation failed: {str(validation_error)}\033[0m")
-                 raise ValueError(str(validation_error))
-
-         return TaskResult(
-             input=row,
-             output=task_output,
-         )
-     except Exception as e:
-         if isinstance(e, ValueError):
-             raise e
-         print(f"\033[91m❌ Task execution error: {str(e)}\033[0m")
-         return TaskResult(
-             input=row,
-             error=str(e),
-         )
+         # Validate task output schema each execution (fail fast on mismatch)
+         if evaluators_to_validate:
+             try:
+                 validate_task_output(task_output, evaluators_to_validate)
+             except ValueError as validation_error:
+                 print(f"\033[91m❌ Task validation failed: {validation_error!s}\033[0m")
+                 # Re-raise original ValueError with its traceback
+                 raise
+
+         return TaskResult(
+             input=row,
+             output=task_output,
+         )
+     except ValueError:
+         # Propagate validation/type errors unchanged to abort the run
+         raise
+     except Exception as e:
+         print(f"\033[91m❌ Task execution error: {e!s}\033[0m")
+         return TaskResult(
+             input=row,
+             error=str(e),
+         )

Semantics stay the same (all ValueErrors still bubble out and cause run_in_github to fail), but you keep meaningful stack traces and address the static-analysis warnings.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
Disabled knowledge base sources:
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (1)
packages/traceloop-sdk/traceloop/sdk/experiment/experiment.py (13 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
**/*.py
📄 CodeRabbit inference engine (CLAUDE.md)
**/*.py: Store API keys only in environment variables/secure vaults; never hardcode secrets in code
Use Flake8 for code linting and adhere to its rules
Files:
packages/traceloop-sdk/traceloop/sdk/experiment/experiment.py
🪛 Ruff (0.14.7)
packages/traceloop-sdk/traceloop/sdk/experiment/experiment.py
43-43: PEP 484 prohibits implicit Optional
Convert to T | None
(RUF013)
212-212: Use explicit conversion flag
Replace with conversion flag
(RUF010)
215-215: Use explicit conversion flag
Replace with conversion flag
(RUF010)
222-222: Use explicit conversion flag
Replace with conversion flag
(RUF010)
224-224: Use explicit conversion flag
Replace with conversion flag
(RUF010)
488-488: Use explicit conversion flag
Replace with conversion flag
(RUF010)
489-489: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling
(B904)
497-497: Use raise without specifying exception name
Remove exception name
(TRY201)
498-498: Use explicit conversion flag
Replace with conversion flag
(RUF010)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (5)
- GitHub Check: Test Packages (3.12)
- GitHub Check: Lint
- GitHub Check: Test Packages (3.10)
- GitHub Check: Test Packages (3.11)
- GitHub Check: Build Packages (3.11)
Caution
Changes requested ❌
Reviewed 7f1729d in 2 minutes and 10 seconds.
- Reviewed 87 lines of code in 3 files
- Skipped 0 files when reviewing
- Skipped posting 7 draft comments. View those below.
- Modify your settings and rules to customize what types of comments Ellipsis leaves. And don't forget to react with 👍 or 👎 to teach Ellipsis.
1. packages/traceloop-sdk/traceloop/sdk/evaluator/config.py:16
- Draft comment:
Docstring example now uses 'required_input_fields' instead of 'required_fields' for consistency with the model.
- Reason this comment was not posted:
Confidence changes required: 0% <= threshold 50%. None
2. packages/traceloop-sdk/traceloop/sdk/evaluator/evaluator.py:2
- Draft comment:
Removed unused imports ('inspect' and extra typing helpers); clean-up looks good.
- Reason this comment was not posted:
Confidence changes required: 0% <= threshold 50%. None
3. packages/traceloop-sdk/traceloop/sdk/evaluator/evaluators_made_by_traceloop.py:13
- Draft comment:
Updated example import to use 'EvaluatorMadeByTraceloop' instead of 'Predefined'.
- Reason this comment was not posted:
Confidence changes required: 0% <= threshold 50%. None
4. packages/traceloop-sdk/traceloop/sdk/evaluator/evaluators_made_by_traceloop.py:65
- Draft comment:
Updated 'prompt_injection' evaluator doc: required field changed from 'text' to 'prompt'.
- Reason this comment was not posted:
Confidence changes required: 0% <= threshold 50%. None
5. packages/traceloop-sdk/traceloop/sdk/evaluator/evaluators_made_by_traceloop.py:196
- Draft comment:
For 'char_count_ratio', required fields have been updated to 'numerator_text' and 'denominator_text' for clarity.
- Reason this comment was not posted:
Confidence changes required: 0% <= threshold 50%. None
6. packages/traceloop-sdk/traceloop/sdk/evaluator/evaluators_made_by_traceloop.py:212
- Draft comment:
In 'word_count', changed required field from 'completion' to 'text' to match actual input.
- Reason this comment was not posted:
Confidence changes required: 0% <= threshold 50%. None
7. packages/traceloop-sdk/traceloop/sdk/evaluator/evaluators_made_by_traceloop.py:263
- Draft comment:
In 'faithfulness', updated required field from 'answer' to 'completion' for consistency.
- Reason this comment was not posted:
Confidence changes required: 0% <= threshold 50%. None
Workflow ID: wflow_HqrJV37uV00rUJGe
You can customize by changing your verbosity settings, reacting with 👍 or 👎, replying to comments, or adding code review rules.
Actionable comments posted: 2
♻️ Duplicate comments (5)
packages/traceloop-sdk/traceloop/sdk/evaluator/evaluator.py (1)
2-2: LGTM! Unused imports removed.

The unused imports flagged by Flake8 have been properly removed, keeping only the types actually used in this file.
packages/traceloop-sdk/traceloop/sdk/evaluator/evaluators_made_by_traceloop.py (4)
5-19: LGTM! Clear factory class design.

The class provides a clean factory pattern for creating preconfigured evaluators with proper documentation and examples.

21-78: LGTM! Detection evaluators properly configured.

The detector evaluators (PII, toxicity, prompt injection) are correctly implemented with matching docstrings and required_input_fields.

241-274: LGTM! RAG evaluators properly configured.

The answer_relevancy and faithfulness evaluators have correct docstrings matching their required_input_fields.
224-239: Remove stray "+" character on Line 232.

There's a "+" character at the beginning of Line 232 that appears to be a formatting artifact or merge leftover. Apply this diff:

  Required task output fields:
  - numerator_text: The numerator text for ratio calculation
- - denominator_text: The denominator text for ratio calculation
+ - denominator_text: The denominator text for ratio calculation
🧹 Nitpick comments (1)
packages/traceloop-sdk/traceloop/sdk/evaluator/evaluator.py (1)
210-221: Consider narrowing the exception handler.

The broad except Exception is flagged by static analysis. While using a fallback here is appropriate, consider either catching json.JSONDecodeError specifically or adding a comment explaining why a broad exception handler is intentional.

Apply this diff to narrow the exception:

  error_detail = response.text
  try:
      error_json = response.json()
      if isinstance(error_json, dict):
          if 'error' in error_json:
              error_detail = error_json['error']
          elif 'message' in error_json:
              error_detail = error_json['message']
          elif 'msg' in error_json:
              error_detail = error_json['msg']
- except Exception:
-     pass  # Use response.text as fallback
+ except (ValueError, TypeError):
+     pass  # JSON parsing failed, use response.text as fallback

Alternatively, keep the broad exception but add a clarifying comment:

- except Exception:
-     pass  # Use response.text as fallback
+ except Exception:
+     # Intentionally catch all exceptions to ensure error extraction never fails
+     # Falls back to response.text which is already set above
+     pass

Based on static analysis hints from Ruff.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
Disabled knowledge base sources:
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (3)
packages/traceloop-sdk/traceloop/sdk/evaluator/config.py (1 hunks)
packages/traceloop-sdk/traceloop/sdk/evaluator/evaluator.py (9 hunks)
packages/traceloop-sdk/traceloop/sdk/evaluator/evaluators_made_by_traceloop.py (1 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
**/*.py
📄 CodeRabbit inference engine (CLAUDE.md)
**/*.py: Store API keys only in environment variables/secure vaults; never hardcode secrets in code
Use Flake8 for code linting and adhere to its rules
Files:
packages/traceloop-sdk/traceloop/sdk/evaluator/evaluator.py
packages/traceloop-sdk/traceloop/sdk/evaluator/config.py
packages/traceloop-sdk/traceloop/sdk/evaluator/evaluators_made_by_traceloop.py
🧬 Code graph analysis (1)
packages/traceloop-sdk/traceloop/sdk/evaluator/evaluators_made_by_traceloop.py (1)
packages/traceloop-sdk/traceloop/sdk/evaluator/config.py (1)
EvaluatorDetails (5-22)
🪛 Ruff (0.14.7)
packages/traceloop-sdk/traceloop/sdk/evaluator/evaluator.py
220-221: try-except-pass detected, consider logging the exception
(S110)
220-220: Do not catch blind exception: Exception
(BLE001)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (5)
- GitHub Check: Test Packages (3.12)
- GitHub Check: Test Packages (3.11)
- GitHub Check: Test Packages (3.10)
- GitHub Check: Build Packages (3.11)
- GitHub Check: Lint
🔇 Additional comments (6)
packages/traceloop-sdk/traceloop/sdk/evaluator/evaluator.py (5)
26-45: LGTM! Clean API extension for evaluator configuration.

The evaluator_config parameter is properly typed, documented, and propagated to the request model.
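For illustration, a caller could forward a per-evaluator configuration roughly like this; the call shape is inferred from this review (and the json_validator example elsewhere in it) and may not match the SDK exactly.

```python
from typing import Any, Dict


def run_with_config(evaluator, task_result: Dict[str, Any]):
    # Sketch: evaluator_config mirrors the keyword added in this PR; the keys shown
    # are taken from the json_validator example and are illustrative only.
    return evaluator.run_experiment_evaluator(
        task_result,
        evaluator_config={"enable_schema_validation": True, "schema_string": '{"type": "object"}'},
    )
```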
60-65: LGTM! Improved error reporting.

The enhanced error message now includes both the evaluator slug and extracted error details, which will significantly improve debugging when evaluators fail.

69-109: LGTM! Consistent API extension.

The evaluator_config parameter is properly added to the signature, documented in the docstring, and correctly forwarded to the request builder.

111-145: LGTM! Consistent API extension.

The evaluator_config parameter follows the same pattern as run_experiment_evaluator, maintaining API consistency.
148-194: No changes needed—the code is correct for this project's Python version.

The project requires Python 3.10+ (as specified in pyproject.toml), which natively supports PEP 585 generic type syntax. Using set[str] instead of Set[str] from typing is the modern, recommended approach for Python 3.9+. The implementation is correct as-is.

packages/traceloop-sdk/traceloop/sdk/evaluator/config.py (1)
1-22: LGTM! Clean Pydantic model design.

The EvaluatorDetails model is well-structured with proper typing, clear documentation, and helpful examples. This provides a solid foundation for the evaluator configuration system.
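A compressed sketch of the kind of model described here could look like the following; any field beyond slug and required_input_fields is an assumption drawn from this review rather than the actual file.

```python
from typing import Any, Dict, List, Optional

from pydantic import BaseModel


class EvaluatorDetailsSketch(BaseModel):
    """Illustrative shape only; see traceloop/sdk/evaluator/config.py for the real model."""

    slug: str
    version: Optional[str] = None
    required_input_fields: Optional[List[str]] = None
    config: Optional[Dict[str, Any]] = None
```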
Important
Looks good to me! 👍
Reviewed f64177c in 1 minute and 26 seconds.
- Reviewed 170 lines of code in 3 files
- Skipped 0 files when reviewing
- Skipped posting 4 draft comments. View those below.
- Modify your settings and rules to customize what types of comments Ellipsis leaves. And don't forget to react with 👍 or 👎 to teach Ellipsis.
1. packages/traceloop-sdk/traceloop/sdk/evaluator/config.py:10
- Draft comment:
Docstring for 'required_input_fields' is split over two lines. Consider aligning the continuation for clarity.
- Reason this comment was not posted:
Confidence changes required: 33% <= threshold 50%. None
2. packages/traceloop-sdk/traceloop/sdk/evaluator/evaluators_made_by_traceloop.py:171
- Draft comment:
Multi-line formatting for EvaluatorDetails instantiation improves readability. Ensure consistency across all evaluator factory methods.
- Reason this comment was not posted:
Comment did not seem useful. Confidence is useful = 0% <= threshold 50%. The comment is purely informative and suggests ensuring consistency, which violates the rules. It doesn't provide a specific code suggestion or ask for a specific test to be written.
3. packages/traceloop-sdk/traceloop/sdk/experiment/experiment.py:163
- Draft comment:
Added newline after filtering evaluators improves visual separation; this minor formatting change is clear.
- Reason this comment was not posted:
Comment did not seem useful. Confidence is useful = 0% <= threshold 50%. This comment is purely informative and does not provide any actionable feedback or suggestions for improvement. It simply states that a newline was added and that it improves visual separation, which is not necessary for the PR author to know.
4. packages/traceloop-sdk/traceloop/sdk/evaluator/config.py:13
- Draft comment:
Typographical note: The description for 'required_input_fields' is split awkwardly over two lines ('required fields to the evaluator' on line 13 and 'input. These fields must be present in the task output.' on line 14). Consider merging these lines or rephrasing for clarity if this split is not intentional.
- Reason this comment was not posted:
Decided after close inspection that this draft comment was likely wrong and/or not actionable: usefulness confidence = 10% vs. threshold = 50% The change in the diff is actually IMPROVING the formatting by splitting a long line into two lines with proper indentation, which is a common Python docstring practice to keep lines under 80-100 characters. The comment is criticizing this improvement and suggesting to undo it or change it. The split is intentional and follows Python conventions. The comment is incorrect because it's treating an improvement as a problem. The description reads naturally: "Optional list of required fields to the evaluator input. These fields must be present in the task output." The split happens at a natural point between "evaluator" and "input". Could the line split actually be awkward to read? Maybe the comment has a point about the phrasing being unclear regardless of the line split. Perhaps "required fields to the evaluator input" is grammatically awkward and could be rephrased to "required input fields for the evaluator" or similar. While the phrasing could potentially be improved, the comment is specifically criticizing the line split itself ("split awkwardly over two lines"), not the grammar or clarity of the text. The line split is a standard formatting improvement and the comment is asking to undo it or questioning if it's intentional. This is not a useful comment. The comment should be deleted. It's criticizing a standard and intentional formatting improvement (splitting a long docstring line). The change follows Python conventions and improves readability by keeping lines shorter. The comment is not identifying a real issue.
Workflow ID: wflow_3iUGxL2v85DZDBHb
You can customize by changing your verbosity settings, reacting with 👍 or 👎, replying to comments, or adding code review rules.
Actionable comments posted: 2
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
packages/traceloop-sdk/traceloop/sdk/experiment/experiment.py (1)
361-373: Evaluator configurations are not passed in GitHub execution mode.

The code extracts evaluator slugs but discards version and config information. This means evaluator configurations work in local execution (_run_locally) but are silently ignored in GitHub CI/CD mode (_run_in_github). This inconsistency could confuse users who configure evaluators expecting the same behavior across execution modes.
To fix this, you'll need to:
- Update the RunInGithubRequest model to accept evaluator details (slug, version, config) instead of just slugs
- Modify the backend to handle evaluator configs in GitHub execution
- Update this code to pass the full evaluator details
If GitHub execution doesn't support evaluator configs yet, consider:
- Adding a warning when configs are provided in GitHub mode
- Documenting this limitation
- Opening an issue to track full support
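If full support is pursued, a rough sketch of what carrying the full details in the request could look like follows; the EvaluatorRequestItem model and the exact RunInGithubRequest shape are assumptions for illustration only, not the SDK's current API:

```python
from typing import Any, Dict, List, Optional
from pydantic import BaseModel


class EvaluatorRequestItem(BaseModel):
    # Hypothetical wire format: one entry per configured evaluator
    slug: str
    version: Optional[str] = None
    config: Optional[Dict[str, Any]] = None


class RunInGithubRequest(BaseModel):
    # Existing fields omitted; "evaluators" here replaces a slug-only list
    evaluators: List[EvaluatorRequestItem] = []


# Example: preserving version/config instead of collapsing to slugs
request = RunInGithubRequest(
    evaluators=[
        EvaluatorRequestItem(slug="json-validator", config={"enable_schema_validation": True}),
        EvaluatorRequestItem(slug="pii-detector", version="v1"),
    ]
)
```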
♻️ Duplicate comments (2)
packages/traceloop-sdk/traceloop/sdk/evaluator/evaluators_made_by_traceloop.py (1)
235-246: Remove stray + and fix bullet indentation in the word_count_ratio docstring. The docstring line for denominator_text still has a leading +, and indentation for the bullets is inconsistent. This is a leftover from a diff and hurts readability, even though it doesn't affect runtime behavior. Suggested fix:

```diff
 Required task output fields:
-  - numerator_text: The numerator text for ratio calculation
-+ - denominator_text: The denominator text for ratio calculation
+  - numerator_text: The numerator text for ratio calculation
+  - denominator_text: The denominator text for ratio calculation
```

packages/traceloop-sdk/traceloop/sdk/experiment/experiment.py (1)
484-490: Consider graceful handling of validation errors. The current implementation re-raises ValueError on validation failure, which will abort all remaining tasks in GitHub execution mode. This could be unexpected if users want to process other rows despite validation failures in one row. Additionally, the exception re-raise should use from for better exception chaining. Apply this diff to improve exception chaining:

```diff
 except ValueError as validation_error:
     print(f"\033[91m❌ Task validation failed: {str(validation_error)}\033[0m")
-    raise ValueError(str(validation_error))
+    raise ValueError(str(validation_error)) from validation_error
```

Consider whether to:

- Return TaskResult(input=row, error=str(validation_error)) instead of raising (as suggested in the past review)
- Add a parameter to control whether validation failures should abort execution
🧹 Nitpick comments (2)
packages/traceloop-sdk/traceloop/sdk/evaluator/evaluators_made_by_traceloop.py (1)
114-137: Consider enforcing schema_string when enable_schema_validation=True. The docstring says schema_string is required when enable_schema_validation is True, but the implementation does not enforce this. To avoid silent misconfiguration, you could validate inputs:

```diff
 def json_validator(
     enable_schema_validation: bool = False,
     schema_string: Optional[str] = None,
 ) -> EvaluatorDetails:
 @@
-    config: Dict[str, Any] = {
-        "enable_schema_validation": enable_schema_validation,
-    }
-    if schema_string:
-        config["schema_string"] = schema_string
+    config: Dict[str, Any] = {"enable_schema_validation": enable_schema_validation}
+    if enable_schema_validation and schema_string is None:
+        raise ValueError("schema_string is required when enable_schema_validation is True")
+    if schema_string is not None:
+        config["schema_string"] = schema_string
```

This keeps the runtime behavior aligned with the documented contract.
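For what it's worth, a quick sketch of how the guard would surface to callers, assuming the factory is exposed as EvaluatorMadeByTraceloop.json_validator and the import path matches the file location in this PR:

```python
from traceloop.sdk.evaluator.evaluators_made_by_traceloop import EvaluatorMadeByTraceloop

# With the guard in place, this misconfiguration fails immediately at
# experiment-definition time instead of silently skipping schema validation.
try:
    EvaluatorMadeByTraceloop.json_validator(enable_schema_validation=True)
except ValueError as err:
    print(f"misconfigured evaluator: {err}")

# Correct usage: provide the schema when validation is enabled.
evaluator = EvaluatorMadeByTraceloop.json_validator(
    enable_schema_validation=True,
    schema_string='{"type": "object", "required": ["answer"]}',
)
```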
packages/traceloop-sdk/traceloop/sdk/experiment/experiment.py (1)
497-498: Use bare raise to preserve the original exception traceback. When re-raising an exception without modification, use bare raise instead of raise e to maintain the complete traceback chain. Apply this diff:

```diff
 if isinstance(e, ValueError):
-    raise e
+    raise
```
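A tiny standalone illustration of the pattern (hypothetical helper, not code from this PR):

```python
def parse_port(value: str) -> int:
    try:
        return int(value)
    except ValueError:
        # Bare raise re-raises the active exception unchanged;
        # naming it (raise e) is unnecessary when nothing is modified.
        raise
```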
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
Disabled knowledge base sources:
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (3)
- packages/traceloop-sdk/traceloop/sdk/evaluator/config.py (1 hunks)
- packages/traceloop-sdk/traceloop/sdk/evaluator/evaluators_made_by_traceloop.py (1 hunks)
- packages/traceloop-sdk/traceloop/sdk/experiment/experiment.py (13 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
- packages/traceloop-sdk/traceloop/sdk/evaluator/config.py
🧰 Additional context used
📓 Path-based instructions (1)
**/*.py
📄 CodeRabbit inference engine (CLAUDE.md)
**/*.py: Store API keys only in environment variables/secure vaults; never hardcode secrets in code
Use Flake8 for code linting and adhere to its rules
Files:
- packages/traceloop-sdk/traceloop/sdk/evaluator/evaluators_made_by_traceloop.py
- packages/traceloop-sdk/traceloop/sdk/experiment/experiment.py
🧬 Code graph analysis (2)
packages/traceloop-sdk/traceloop/sdk/evaluator/evaluators_made_by_traceloop.py (1)
packages/traceloop-sdk/traceloop/sdk/evaluator/config.py (1)
EvaluatorDetails(5-23)
packages/traceloop-sdk/traceloop/sdk/experiment/experiment.py (2)
packages/traceloop-sdk/traceloop/sdk/evaluator/evaluator.py (2)
Evaluator (15-145), validate_task_output (148-194)
packages/traceloop-sdk/traceloop/sdk/evaluator/config.py (1)
EvaluatorDetails(5-23)
🪛 Ruff (0.14.7)
packages/traceloop-sdk/traceloop/sdk/experiment/experiment.py
43-43: PEP 484 prohibits implicit Optional
Convert to T | None
(RUF013)
213-213: Use explicit conversion flag
Replace with conversion flag
(RUF010)
216-216: Use explicit conversion flag
Replace with conversion flag
(RUF010)
223-223: Use explicit conversion flag
Replace with conversion flag
(RUF010)
225-225: Use explicit conversion flag
Replace with conversion flag
(RUF010)
489-489: Use explicit conversion flag
Replace with conversion flag
(RUF010)
490-490: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling
(B904)
498-498: Use raise without specifying exception name
Remove exception name
(TRY201)
499-499: Use explicit conversion flag
Replace with conversion flag
(RUF010)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (5)
- GitHub Check: Test Packages (3.10)
- GitHub Check: Test Packages (3.12)
- GitHub Check: Test Packages (3.11)
- GitHub Check: Build Packages (3.11)
- GitHub Check: Lint
🔇 Additional comments (6)
packages/traceloop-sdk/traceloop/sdk/evaluator/evaluators_made_by_traceloop.py (1)
5-19: Factory class and usage example look consistent. The EvaluatorMadeByTraceloop factory and the docstring example (import + staticmethod calls) line up correctly with the actual API surface. No functional or documentation issues here.

packages/traceloop-sdk/traceloop/sdk/experiment/experiment.py (5)
8-8: LGTM: Imports correctly support the new evaluator configuration system. The addition of the validate_task_output, EvaluatorSpec, and EvaluatorDetails imports properly supports the enhanced evaluator functionality introduced in this PR.

Also applies to: 14-14, 21-21
130-140: LGTM: Evaluator conversion logic correctly handles both string slugs and EvaluatorDetails. The conversion properly extracts slug, version, and config for each evaluator type, creating the structured tuples needed for downstream processing.
163-171: LGTM: Task output validation correctly filters and validates against EvaluatorDetails. The logic appropriately extracts only EvaluatorDetails instances (which can have required_input_fields) and validates task output against them.
182-202: LGTM: Evaluator execution properly propagates configuration. The evaluator invocation now correctly passes evaluator_config to both run_experiment_evaluator and trigger_experiment_evaluator, enabling per-evaluator configuration.
213-216: LGTM: Enhanced error visibility with colored console output. The colored error messages improve the user experience by making failures immediately visible in console output.
Also applies to: 224-225
Important
Looks good to me! 👍
Reviewed a8719d8 in 2 minutes and 0 seconds. Click for details.
- Reviewed 71 lines of code in 2 files
- Skipped 0 files when reviewing.
- Skipped posting 5 draft comments. View those below.
- Modify your settings and rules to customize what types of comments Ellipsis leaves. And don't forget to react with 👍 or 👎 to teach Ellipsis.
1. packages/traceloop-sdk/traceloop/sdk/evaluator/evaluators_made_by_traceloop.py:149
- Draft comment:
Doc update: Added the missing 'text' field in the required task output fields for placeholder_regex. Ensure the input validation matches this change. - Reason this comment was not posted:
Comment did not seem useful. Confidence is useful = 0% <= threshold 50%. This comment is purely informative as it mentions a documentation update and suggests ensuring input validation matches the change. It doesn't provide a specific code suggestion or ask for a test to be written.
2. packages/traceloop-sdk/traceloop/sdk/evaluator/evaluators_made_by_traceloop.py:243
- Draft comment:
Doc update: Added 'denominator_text' to the required task output fields in word_count_ratio. This ensures the docs match the EvaluatorDetails configuration. - Reason this comment was not posted:
Comment was not on a location in the diff, so it can't be submitted as a review comment.
3. packages/traceloop-sdk/traceloop/sdk/experiment/experiment.py:41
- Draft comment:
API change: The 'evaluators' parameter in run() is now required (no default of None). Confirm that caller code is updated to supply at least an empty list if needed. - Reason this comment was not posted:
Comment looked like it was already resolved.
4. packages/traceloop-sdk/traceloop/sdk/experiment/experiment.py:92
- Draft comment:
Signature update: In _run_locally(), 'evaluators' is now a required parameter. Consider if a default empty list might be preferable to maintain backward compatibility. - Reason this comment was not posted:
Decided after close inspection that this draft comment was likely wrong and/or not actionable: usefulness confidence = 10% vs. threshold = 50%. The comment is about a private method _run_locally(). Private methods (prefixed with _) are internal implementation details and not part of the public API. The public run() method already requires evaluators as a mandatory parameter, and it's the only caller of _run_locally(). Since this is an internal method and the public API already requires evaluators, there's no backward compatibility concern. The comment is suggesting a change that isn't necessary and is based on a misunderstanding of the scope of the change. This violates the rule about not making comments unless there's clearly a code change required. Could there be other internal callers of _run_locally() that I'm not seeing in this file? Could this method be called from other files in the codebase? However, the rules say to ignore cross-file issues and only think about the file being reviewed. Even if there were other callers, the rules explicitly state to ignore cross-file issues. Within this file, _run_locally() is only called from run(), which always passes evaluators. The comment is speculative about backward compatibility without evidence of an actual problem in the code being reviewed. This comment should be deleted. It's about a private method's signature change, and the public API already requires the evaluators parameter. There's no backward compatibility issue within this file, and the comment is speculative without clear evidence of a problem.
5. packages/traceloop-sdk/traceloop/sdk/experiment/experiment.py:262
- Draft comment:
Signature update: In _run_in_github(), the 'evaluators' parameter is now non-optional. Verify that this change is intended and that callers pass a list (even if empty). - Reason this comment was not posted:
Comment looked like it was already resolved.
Workflow ID: wflow_fWAMHwi9aQr4ygXc
You can customize by changing your verbosity settings, reacting with 👍 or 👎, replying to comments, or adding code review rules.
Actionable comments posted: 0
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
packages/traceloop-sdk/traceloop/sdk/experiment/experiment.py (1)
8-50: Make the evaluators parameter optional to align the type signature with the defensive implementation

The evaluators: List[EvaluatorSpec] parameter in run(), _run_locally(), and _run_in_github() has no default, making it required and breaking existing code that omits it. However, the implementation uses defensive if evaluators: guards (lines 132, 360, 475) that safely handle empty or falsy cases. To align the type signature with the defensive implementation intent and restore backward compatibility, make evaluators optional in all three methods:

```diff
- async def run(..., evaluators: List[EvaluatorSpec], ...):
+ async def run(..., evaluators: Optional[List[EvaluatorSpec]] = None, ...):

- async def _run_locally(..., evaluators: List[EvaluatorSpec], ...):
+ async def _run_locally(..., evaluators: Optional[List[EvaluatorSpec]] = None, ...):

- async def _run_in_github(..., evaluators: List[EvaluatorSpec], ...):
+ async def _run_in_github(..., evaluators: Optional[List[EvaluatorSpec]] = None, ...):
```

The existing guards will safely handle None values.
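One way to pair the optional signature with an explicit normalization, sketched here with a deliberately simplified signature (the real methods take more parameters):

```python
from typing import List, Optional, Union

# Slug string or EvaluatorDetails; kept as a loose union to stay standalone in this sketch.
EvaluatorSpec = Union[str, object]


async def run(evaluators: Optional[List[EvaluatorSpec]] = None) -> List[EvaluatorSpec]:
    # Normalizing up front keeps the existing `if evaluators:` guards working
    # while letting callers omit the argument entirely.
    evaluators = evaluators or []
    return evaluators
```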
♻️ Duplicate comments (2)
packages/traceloop-sdk/traceloop/sdk/evaluator/evaluators_made_by_traceloop.py (1)
243-243: Remove the extra + character from the docstring. Line 243 contains a stray + character at the beginning of the line, likely a leftover diff marker. This will appear in the rendered documentation. Apply this diff to remove the extra character:

```diff
-+ - denominator_text: The denominator text for ratio calculation
+  - denominator_text: The denominator text for ratio calculation
```

packages/traceloop-sdk/traceloop/sdk/experiment/experiment.py (1)
455-503: _execute_tasks validation and ValueError handling may abort all tasks; also a few linter-aligned cleanups

The new _execute_tasks helper is a nice abstraction and the validation hook via validate_task_output is a good safeguard. A few points to tighten up:

- ValueError handling aborts the entire GitHub run (previous concern still applies)

Current logic:

```python
try:
    task_output = await task(row)
    if evaluators_to_validate:
        try:
            validate_task_output(task_output, evaluators_to_validate)
        except ValueError as validation_error:
            print(f"\033[91m❌ Task validation failed: {str(validation_error)}\033[0m")
            raise ValueError(str(validation_error))
    return TaskResult(...)
except Exception as e:
    if isinstance(e, ValueError):
        raise e
    print(f"\033[91m❌ Task execution error: {str(e)}\033[0m")
    return TaskResult(input=row, error=str(e))
```
- Any ValueError (from the task or from validation) is propagated out of _execute_tasks, aborting the loop at line 514 on the first failing row.
- That means later rows are never awaited, and GitHub runs become “fail fast” on ValueError.

This is very similar to the existing review comment at Line 490 questioning whether this should instead return a TaskResult with an error field. If you want behavior consistent with local execution (and to avoid partial processing), you could treat validation failures like other errors:

```diff
 if evaluators_to_validate:
     try:
         validate_task_output(task_output, evaluators_to_validate)
     except ValueError as validation_error:
-        print(f"\033[91m❌ Task validation failed: {str(validation_error)}\033[0m")
-        raise ValueError(str(validation_error))
+        msg = str(validation_error)
+        print(f"\033[91m❌ Task validation failed: {msg}\033[0m")
+        return TaskResult(input=row, error=msg)

-except Exception as e:
-    if isinstance(e, ValueError):
-        raise e
-    print(f"\033[91m❌ Task execution error: {str(e)}\033[0m")
-    return TaskResult(
-        input=row,
-        error=str(e),
-    )
+except Exception as e:
+    print(f"\033[91m❌ Task execution error: {e!s}\033[0m")
+    return TaskResult(input=row, error=str(e))
```

This preserves validation and logging but avoids aborting the whole run on the first ValueError.
- Comment vs. implementation mismatch

The comment says "# Validate task output schema on first execution", but validation runs on every row. Either adjust the comment to reflect that it validates each row, or short-circuit validation after the first success/failure if only schema shape is important.
- Linter-style improvements (RUF010, B904, TRY201)

If you decide to keep the “fail fast on ValueError” behavior, you can still clean up the exception handling and prints:

- Prefer explicit conversion flags in f-strings:

```diff
- print(f"\033[91m❌ Task validation failed: {str(validation_error)}\033[0m")
+ print(f"\033[91m❌ Task validation failed: {validation_error!s}\033[0m")
- print(f"\033[91m❌ Task execution error: {str(e)}\033[0m")
+ print(f"\033[91m❌ Task execution error: {e!s}\033[0m")
```

- If you keep re-wrapping the ValueError, chain it:

```diff
- raise ValueError(str(validation_error))
+ raise ValueError(str(validation_error)) from validation_error
```

- If you keep the current branchy except logic, prefer a dedicated except ValueError for re-raising instead of raise e:

```python
except ValueError:
    raise
except Exception as e:
    print(f"\033[91m❌ Task execution error: {e!s}\033[0m")
    return TaskResult(input=row, error=str(e))
```

These adjustments improve tracebacks and align with the static analysis hints without changing semantics.
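Putting those pieces together, one possible shape for the per-row execution is sketched below; it uses a stand-in TaskResult dataclass and takes validate_task_output as a parameter, so the field names and signature are illustrative rather than the SDK's actual helper:

```python
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class TaskResult:
    # Stand-in for the SDK's TaskResult; field names here are illustrative only.
    input: dict
    output: Optional[Any] = None
    error: Optional[str] = None


async def execute_single_task(task, row, evaluators_to_validate, validate_task_output):
    try:
        task_output = await task(row)
        if evaluators_to_validate:
            try:
                validate_task_output(task_output, evaluators_to_validate)
            except ValueError as validation_error:
                # Record the validation failure on this row instead of aborting the run.
                msg = str(validation_error)
                print(f"\033[91m❌ Task validation failed: {msg}\033[0m")
                return TaskResult(input=row, error=msg)
        return TaskResult(input=row, output=task_output)
    except Exception as e:  # broad catch mirrors the helper's current behavior
        print(f"\033[91m❌ Task execution error: {e!s}\033[0m")
        return TaskResult(input=row, error=str(e))
```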
🧹 Nitpick comments (2)
packages/traceloop-sdk/traceloop/sdk/experiment/experiment.py (2)
91-226: Evaluator normalization and local validation flow look sound; minor robustness & logging tweaks

The new _run_locally flow correctly:

- Normalizes evaluators: List[EvaluatorSpec] into (slug, version, config) tuples for evaluator execution.
- Extracts EvaluatorDetails instances into evaluators_to_validate and uses validate_task_output before creating tasks.
- Passes evaluator_config through to both run_experiment_evaluator and trigger_experiment_evaluator.

A couple of small improvements to consider:
- Defensive handling of unsupported EvaluatorSpec variants

Right now only str and EvaluatorDetails are handled:

```python
for evaluator in evaluators:
    if isinstance(evaluator, str):
        ...
    elif isinstance(evaluator, EvaluatorDetails):
        ...
```

If EvaluatorSpec is extended in the future, such evaluators will silently be ignored. You may want to either:

- Raise a TypeError for unknown types, or
- At least log a warning so misconfigured specs are visible.
- Use f-string conversion instead of str() for errors (RUF010)

For the new logging, you can simplify and satisfy the linter by using conversion flags instead of str() inside f-strings, e.g.:

```diff
- error_msg = f"Error: {str(e)}"
+ error_msg = f"Error: {e!s}"
- print(f"\033[91m❌ Evaluator '{evaluator_slug}' failed: {str(e)}\033[0m")
+ print(f"\033[91m❌ Evaluator '{evaluator_slug}' failed: {e!s}\033[0m")
- print(f"\033[91m❌ Task execution failed: {str(e)}\033[0m")
+ print(f"\033[91m❌ Task execution failed: {e!s}\033[0m")
```

Functionally everything here looks correct; these are just polish and future-proofing suggestions.
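For the defensive-handling suggestion, a small sketch of what the normalization loop could look like; the import path and attribute names follow the config model added in this PR, and the snippet is illustrative rather than the SDK's actual helper:

```python
from traceloop.sdk.evaluator.config import EvaluatorDetails


def normalize_evaluators(evaluators):
    normalized = []
    for evaluator in evaluators:
        if isinstance(evaluator, str):
            # Bare slug: no pinned version or config.
            normalized.append((evaluator, None, None))
        elif isinstance(evaluator, EvaluatorDetails):
            normalized.append((evaluator.slug, evaluator.version, evaluator.config))
        else:
            # Fail loudly instead of silently dropping unknown spec types.
            raise TypeError(
                f"Unsupported evaluator spec: {type(evaluator).__name__}; "
                "expected a slug string or EvaluatorDetails"
            )
    return normalized
```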
258-378: GitHub flow wiring to _execute_tasks and evaluator slugs is consistent, but error behavior now hinges on ValueError semantics

The GitHub path now:

- Executes tasks via _execute_tasks(rows, task, evaluators), which performs the same validation used locally.
- Derives evaluator_slugs from evaluators, supporting both str and EvaluatorDetails, then sends only slugs to the backend. This matches how _init_experiment is fed slugs locally.

One thing to be aware of: _execute_tasks re-raises any ValueError (see comment below), so in GitHub mode a ValueError from either:

- The task itself, or
- validate_task_output

will abort _run_in_github and fail the workflow, rather than returning a partial task_results list with per-row errors. That might be exactly what you want for CI, but it is a behavior change compared to local execution, where errors are wrapped in TaskResponse unless stop_on_error is set. If the intent is to keep GitHub runs “all-or-nothing” on validation failures, this is fine; otherwise, consider aligning _execute_tasks's ValueError handling with _run_locally (returning error-bearing TaskResult objects instead of aborting).
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
Disabled knowledge base sources:
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (2)
- packages/traceloop-sdk/traceloop/sdk/evaluator/evaluators_made_by_traceloop.py (1 hunks)
- packages/traceloop-sdk/traceloop/sdk/experiment/experiment.py (14 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
**/*.py
📄 CodeRabbit inference engine (CLAUDE.md)
**/*.py: Store API keys only in environment variables/secure vaults; never hardcode secrets in code
Use Flake8 for code linting and adhere to its rules
Files:
- packages/traceloop-sdk/traceloop/sdk/evaluator/evaluators_made_by_traceloop.py
- packages/traceloop-sdk/traceloop/sdk/experiment/experiment.py
🧬 Code graph analysis (1)
packages/traceloop-sdk/traceloop/sdk/evaluator/evaluators_made_by_traceloop.py (1)
packages/traceloop-sdk/traceloop/sdk/evaluator/config.py (1)
EvaluatorDetails(5-23)
🪛 Ruff (0.14.7)
packages/traceloop-sdk/traceloop/sdk/experiment/experiment.py
213-213: Use explicit conversion flag
Replace with conversion flag
(RUF010)
216-216: Use explicit conversion flag
Replace with conversion flag
(RUF010)
223-223: Use explicit conversion flag
Replace with conversion flag
(RUF010)
225-225: Use explicit conversion flag
Replace with conversion flag
(RUF010)
489-489: Use explicit conversion flag
Replace with conversion flag
(RUF010)
490-490: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling
(B904)
498-498: Use raise without specifying exception name
Remove exception name
(TRY201)
499-499: Use explicit conversion flag
Replace with conversion flag
(RUF010)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (5)
- GitHub Check: Test Packages (3.12)
- GitHub Check: Test Packages (3.11)
- GitHub Check: Lint
- GitHub Check: Build Packages (3.11)
- GitHub Check: Test Packages (3.10)
🔇 Additional comments (1)
packages/traceloop-sdk/traceloop/sdk/evaluator/evaluators_made_by_traceloop.py (1)
1-455: Well-structured factory implementation! The EvaluatorMadeByTraceloop class provides a clean, type-safe factory pattern for instantiating evaluators. All previous docstring/implementation inconsistencies have been resolved, and the code demonstrates:
- Consistent structure across all 20 evaluator factory methods
- Proper type hints for configuration parameters
- Clear documentation of required task output fields
- Correct instantiation of EvaluatorDetails with appropriate parameters

Once the minor formatting issue on line 243 is addressed, this will be ready to merge.
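To illustrate the factory pattern being praised here, a simplified sketch follows; the EvaluatorDetails stand-in and the slug value are illustrative only, while the required fields match the word_count_ratio docstring discussed above:

```python
from typing import Any, Dict, List, Optional
from pydantic import BaseModel


class EvaluatorDetails(BaseModel):
    # Simplified stand-in for traceloop.sdk.evaluator.config.EvaluatorDetails
    slug: str
    version: Optional[str] = None
    config: Optional[Dict[str, Any]] = None
    required_input_fields: Optional[List[str]] = None


class EvaluatorMadeByTraceloop:
    @staticmethod
    def word_count_ratio() -> EvaluatorDetails:
        # Each factory method pins the slug and the task-output fields it needs;
        # the slug string here is a placeholder, not necessarily the real one.
        return EvaluatorDetails(
            slug="word-count-ratio",
            required_input_fields=["numerator_text", "denominator_text"],
        )
```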
Caution
Changes requested ❌
Reviewed 91c0096 in 2 minutes and 13 seconds. Click for details.
- Reviewed 565 lines of code in 3 files
- Skipped 0 files when reviewing.
- Skipped posting 2 draft comments. View those below.
- Modify your settings and rules to customize what types of comments Ellipsis leaves. And don't forget to react with 👍 or 👎 to teach Ellipsis.
1. packages/traceloop-sdk/pyproject.toml:98
- Draft comment:
Consider adding a comment explaining why the asyncio_default_fixture_loop_scope is set to 'function'. This can help future maintainers understand the reasoning behind this choice. - Reason this comment was not posted:
Comment did not seem useful. Confidence is useful = 0% <= threshold 50%. This comment is asking the PR author to add an explanation for a specific setting. It is not a direct code suggestion or a request for a test, and it doesn't point out a potential issue or improvement in the code itself. It seems to be more about documentation, which is not allowed by the rules.
2. packages/traceloop-sdk/tests/experiment/test_experiment.py:168
- Draft comment:
The assertion 'assert len(results) + len(errors) <= 3' is a bit loose since the dataset contains 3 rows. If the intent is to stop processing immediately after a validation error, consider asserting that only one row is processed (or a stricter condition) to verify early termination. - Reason this comment was not posted:
Decided after close inspection that this draft comment was likely wrong and/or not actionable: usefulness confidence = 30% vs. threshold = 50%. The reviewer has a valid point about test precision. The current assertion <= 3 is indeed very loose - it would pass even if all 3 rows were processed, which would mean the early termination isn't working. However, I need to consider: (1) Without seeing the actual implementation of _run_locally, I can't be 100% certain that exactly 1 row should be processed. There might be legitimate reasons for the looser assertion (e.g., async processing, batching, or other implementation details). (2) The comment is about code quality/test improvement, which is generally good. (3) The comment is actionable and specific. (4) However, the test author may have intentionally used <= 3 for a reason - perhaps the implementation doesn't guarantee exactly 1 row is processed, or there are edge cases. Without seeing the implementation, I can't be certain the stricter assertion is correct. I'm making an assumption that the stricter assertion would be correct without seeing the actual implementation of _run_locally. The test author may have used the looser assertion deliberately because the implementation doesn't guarantee exactly one row is processed before stopping - there could be async processing, buffering, or other factors that make the exact count unpredictable. While the comment makes a reasonable suggestion for improving test precision, without seeing the implementation details of _run_locally, I cannot confirm that == 1 is the correct assertion. The test author may have intentionally used a looser assertion to account for implementation details. This is speculative advice about test quality rather than a clear bug. This comment suggests a test improvement but is speculative about what the correct behavior should be. Without seeing the implementation, we can't confirm that exactly 1 row should be processed. The current assertion may be intentionally loose to handle implementation details. This falls into the category of "speculative" or "code quality" suggestions that aren't clearly necessary.
Workflow ID: wflow_ipi0eNz9YZ0SP7P6
You can customize by changing your verbosity settings, reacting with 👍 or 👎, replying to comments, or adding code review rules.
```python
error_message = str(exc_info.value)
# Should only mention the failing evaluator
assert "relevance-checker requires:" in error_message
assert "pii-detector" not in error_message or "pii-detector requires:" not in error_message
```
The assert using or on line 119 is ambiguous. It's clearer to assert that 'pii-detector requires:' does not appear in error_message. Consider replacing the condition with a more explicit check.
| assert "pii-detector" not in error_message or "pii-detector requires:" not in error_message | |
| assert "pii-detector requires:" not in error_message |
Actionable comments posted: 0
🧹 Nitpick comments (3)
packages/traceloop-sdk/tests/evaluator/test_evaluator.py (2)
103-120: Simplify the assertion logic for clarity. The assertion on Line 120 is logically correct but unnecessarily complex. Since the test verifies that only failing evaluators appear in the error message, you can simplify it.
Apply this diff:
- assert "pii-detector" not in error_message or "pii-detector requires:" not in error_message + assert "pii-detector requires:" not in error_messageThis directly checks that the passing evaluator doesn't appear in the "requires:" format, which is clearer and achieves the same goal.
180-200: Simplify the assertion logic for clarity. Similar to Line 120, the assertion on Line 200 can be simplified for better readability.
Apply this diff:
- assert "pii-detector" not in error_message or "pii-detector requires:" not in error_message + assert "pii-detector requires:" not in error_messageThis directly verifies that the passing evaluator doesn't appear in the error message's "requires:" format.
packages/traceloop-sdk/tests/experiment/test_experiment.py (1)
268-314: Consider checking or prefixing the unused results variable. The test focuses on validation errors, which is fine, but Line 301 assigns results without using it. For clarity and to silence the linter, either:

- Check that results is empty: assert len(results) == 0
- Prefix with underscore: _results, errors = ...

Apply this diff:

```diff
- results, errors = await experiment._run_locally(
+ _results, errors = await experiment._run_locally(
```

Note: The unused row parameter on Line 283 is intentional for signature compatibility.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
Disabled knowledge base sources:
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (3)
- packages/traceloop-sdk/pyproject.toml (1 hunks)
- packages/traceloop-sdk/tests/evaluator/test_evaluator.py (1 hunks)
- packages/traceloop-sdk/tests/experiment/test_experiment.py (5 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
**/*.py
📄 CodeRabbit inference engine (CLAUDE.md)
**/*.py: Store API keys only in environment variables/secure vaults; never hardcode secrets in code
Use Flake8 for code linting and adhere to its rules
Files:
- packages/traceloop-sdk/tests/evaluator/test_evaluator.py
- packages/traceloop-sdk/tests/experiment/test_experiment.py
🧬 Code graph analysis (1)
packages/traceloop-sdk/tests/evaluator/test_evaluator.py (2)
packages/traceloop-sdk/traceloop/sdk/evaluator/evaluator.py (1)
validate_task_output (148-194)
packages/traceloop-sdk/traceloop/sdk/evaluator/config.py (1)
EvaluatorDetails(5-23)
🪛 Ruff (0.14.7)
packages/traceloop-sdk/tests/experiment/test_experiment.py
140-140: Unused function argument: row
(ARG001)
187-187: Unused function argument: row
(ARG001)
235-235: Unused function argument: row
(ARG001)
283-283: Unused function argument: row
(ARG001)
301-301: Unpacked variable results is never used
Prefix it with an underscore or any other dummy variable pattern
(RUF059)
336-336: Unused function argument: row
(ARG001)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (5)
- GitHub Check: Test Packages (3.12)
- GitHub Check: Test Packages (3.11)
- GitHub Check: Test Packages (3.10)
- GitHub Check: Build Packages (3.11)
- GitHub Check: Lint
🔇 Additional comments (13)
packages/traceloop-sdk/pyproject.toml (2)
101-102: No compatibility issues with the asyncio fixture scope configuration. The asyncio_default_fixture_loop_scope = "function" setting is appropriate for pytest-asyncio 0.23.7 and compatible with the existing test suite. All current fixtures use @pytest.fixture (not @pytest_asyncio.fixture) with explicit scopes (session, module, or function), so this setting will not affect them. Async tests already use @pytest.mark.asyncio decorators, which work seamlessly with function-scoped fixture defaults.
104-135: The evaluator code meets mypy strictness requirements. All new evaluator code is properly typed and complies with the strict mypy configuration (disallow_untyped_defs = true):

- EvaluatorMadeByTraceloop: All 22 static factory methods have complete type annotations for parameters and return types (-> EvaluatorDetails).
- Evaluator class: All methods (including async methods) have full parameter and return type hints.
- Supporting modules (model.py, config.py): All Pydantic models properly typed.
- Helper functions: validate_task_output() and _extract_error_from_response() have complete signatures.

The evaluator module is not in the mypy exclusion list, confirming it must meet the strict requirements, and it does. The excluded modules (decorators, prompts, tracing, utils) do not impact evaluator code typing.
1-8: LGTM!The imports and test class setup are clean and follow standard pytest conventions.
1-8: LGTM! The imports and test class setup are clean and follow standard pytest conventions.
- No evaluators are provided
- Evaluators have no required fields
- All required fields are present in the output
42-101: LGTM!Excellent coverage of missing field scenarios. The tests properly verify:
- Error messages include evaluator names and missing fields
- Task output contents are shown in errors
- Helpful hints are provided
122-149: LGTM!Good edge case coverage:
- Empty task output properly fails validation with clear error message
- Extra fields are correctly allowed (only required fields are enforced)
151-178: LGTM!These tests properly verify:
- Field name matching is case-sensitive (important for correctness)
- Evaluator config doesn't affect validation logic
202-220: LGTM!This test properly verifies that when multiple evaluators require the same field, all failing evaluators are reported in the error message. Good coverage of this edge case.
packages/traceloop-sdk/tests/experiment/test_experiment.py (5)
2-2: LGTM!The new imports (
AsyncMockandEvaluatorDetails) are necessary for the expanded test coverage and properly placed.Also applies to: 5-5
125-170: LGTM!This test properly verifies that validation failures stop processing when
stop_on_error=True. The test setup, mocking, and assertions are all correct.Note: The unused
rowparameter on Line 140 (flagged by static analysis) is intentional—it matches the expected task function signature.
172-213: LGTM!This test correctly verifies that validation errors are captured without stopping execution when
stop_on_error=False. Good complementary coverage to the previous test.Note: The unused
rowparameter on Line 187 is intentional for signature compatibility.
215-266: LGTM!This test properly verifies the success path when task output contains all required fields. The mocking strategy and assertions are correct.
Note: The unused
rowparameter on Line 235 is intentional for signature compatibility.
215-266: LGTM! This test properly verifies the success path when task output contains all required fields. The mocking strategy and assertions are correct.

Note: The unused row parameter on Line 235 is intentional for signature compatibility.
Important
Looks good to me! 👍
Reviewed 791ca28 in 26 seconds. Click for details.
- Reviewed 13 lines of code in 1 file
- Skipped 0 files when reviewing.
- Skipped posting 1 draft comment. View it below.
- Modify your settings and rules to customize what types of comments Ellipsis leaves. And don't forget to react with 👍 or 👎 to teach Ellipsis.
1. packages/traceloop-sdk/tests/evaluator/test_evaluator.py:120
- Draft comment:
Change improves clarity by explicitly asserting that 'pii-detector requires:' is absent from the error message. This directly enforces that only failing evaluators are mentioned. - Reason this comment was not posted:
Comment did not seem useful. Confidence is useful =0%<= threshold50%This comment is purely informative, as it only states that the change improves clarity without providing any actionable feedback or suggestions. It doesn't ask for confirmation or suggest any improvements.
Workflow ID: wflow_B0kqq0wTBcuwQFzc
You can customize by changing your verbosity settings, reacting with 👍 or 👎, replying to comments, or adding code review rules.
Important
Looks good to me! 👍
Reviewed 74ecdf4 in 1 minute and 8 seconds. Click for details.
- Reviewed 76 lines of code in 3 files
- Skipped 0 files when reviewing.
- Skipped posting 5 draft comments. View those below.
- Modify your settings and rules to customize what types of comments Ellipsis leaves. And don't forget to react with 👍 or 👎 to teach Ellipsis.
1. packages/traceloop-sdk/pyproject.toml:98
- Draft comment:
Removed the [tool.pytest.ini_options] block. Please double-check that dropping 'asyncio_default_fixture_loop_scope = "function"' won’t affect async tests. - Reason this comment was not posted:
Confidence changes required: 33% <= threshold 50%. None
2. packages/traceloop-sdk/tests/conftest.py:237
- Draft comment:
The added 'anyio_backend' fixture effectively forces the asyncio backend. This is appropriate—just ensure it aligns with any future async tests. - Reason this comment was not posted:
Confidence changes required: 0% <= threshold 50%. None
3. packages/traceloop-sdk/tests/experiment/test_experiment.py:124
- Draft comment:
Consider applying the @pytest.mark.anyio(backends=["asyncio"]) marker at the class level for TestRunLocallyValidation to avoid repetition, if all its methods require the same backend. - Reason this comment was not posted:
Confidence changes required: 33% <= threshold 50%. None
4. packages/traceloop-sdk/tests/experiment/test_experiment.py:136
- Draft comment:
If _init_experiment is asynchronous, consider using AsyncMock instead of Mock for consistency. - Reason this comment was not posted:
Confidence changes required: 33% <= threshold 50%. None
5. packages/traceloop-sdk/tests/experiment/test_experiment.py:360
- Draft comment:
In test_run_locally_no_validation_for_string_evaluators, consider asserting more details about the evaluator’s output if appropriate. - Reason this comment was not posted:
Confidence changes required: 33% <= threshold 50%. None
Workflow ID: wflow_P96uZpYKAaXaV4GL
You can customize by changing your verbosity settings, reacting with 👍 or 👎, replying to comments, or adding code review rules.
Actionable comments posted: 1
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
Disabled knowledge base sources:
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (2)
- packages/traceloop-sdk/tests/conftest.py (1 hunks)
- packages/traceloop-sdk/tests/experiment/test_experiment.py (5 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
**/*.py
📄 CodeRabbit inference engine (CLAUDE.md)
**/*.py: Store API keys only in environment variables/secure vaults; never hardcode secrets in code
Use Flake8 for code linting and adhere to its rules
Files:
- packages/traceloop-sdk/tests/conftest.py
- packages/traceloop-sdk/tests/experiment/test_experiment.py
🧬 Code graph analysis (1)
packages/traceloop-sdk/tests/experiment/test_experiment.py (3)
packages/traceloop-sdk/traceloop/sdk/evaluator/config.py (1)
EvaluatorDetails (5-23)
packages/traceloop-sdk/traceloop/sdk/datasets/datasets.py (1)
get_version_jsonl (218-223)
packages/traceloop-sdk/traceloop/sdk/evaluator/evaluator.py (1)
run_experiment_evaluator(69-109)
🪛 Ruff (0.14.7)
packages/traceloop-sdk/tests/experiment/test_experiment.py
140-140: Unused function argument: row
(ARG001)
188-188: Unused function argument: row
(ARG001)
237-237: Unused function argument: row
(ARG001)
286-286: Unused function argument: row
(ARG001)
304-304: Unpacked variable results is never used
Prefix it with an underscore or any other dummy variable pattern
(RUF059)
340-340: Unused function argument: row
(ARG001)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (5)
- GitHub Check: Test Packages (3.10)
- GitHub Check: Test Packages (3.11)
- GitHub Check: Build Packages (3.11)
- GitHub Check: Test Packages (3.12)
- GitHub Check: Lint
🔇 Additional comments (7)
packages/traceloop-sdk/tests/conftest.py (1)
236-239: LGTM! Standard pytest-anyio configuration. The session-scoped fixture correctly configures anyio to use the asyncio backend, which is necessary for the new async tests introduced in this PR.
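For reference, the usual shape of such a fixture is sketched below; this is consistent with the description above, though the actual conftest may differ slightly:

```python
import pytest


@pytest.fixture(scope="session")
def anyio_backend():
    # Restrict @pytest.mark.anyio tests to the asyncio backend only,
    # so they are not also parametrized over trio.
    return "asyncio"
```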
packages/traceloop-sdk/tests/experiment/test_experiment.py (6)
2-2: LGTM! Necessary imports for new test functionality. AsyncMock and EvaluatorDetails are required for the new async validation tests in TestRunLocallyValidation.
Also applies to: 5-5
125-170: Well-structured validation test. The test correctly verifies that validation failures stop processing when stop_on_error=True, and the assertions appropriately check for both error messages and early termination behavior.
172-214: LGTM! Complementary test for stop_on_error=False. The test correctly verifies that validation errors are captured without stopping processing when stop_on_error=False, providing good coverage alongside the first test.
216-268: LGTM! Comprehensive positive test case. The test correctly verifies the happy path when task output contains all required fields, with appropriate mocking and assertions.
319-368: LGTM! Important test for backward compatibility. This test correctly verifies that string evaluators (legacy format) bypass validation, ensuring backward compatibility while allowing EvaluatorDetails to opt into validation.
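A small usage sketch of the behavior this test pins down; the import path follows the config module added in this PR, and the slug and field names mirror the test fixtures described above rather than a guaranteed public catalog:

```python
from traceloop.sdk.evaluator.config import EvaluatorDetails

evaluators = [
    # Plain slug: accepted as-is, no task-output validation is applied.
    "legacy-evaluator-slug",
    # EvaluatorDetails: opts the evaluator into validation of the listed fields.
    EvaluatorDetails(
        slug="relevance-checker",
        required_input_fields=["completion"],
    ),
]
```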
121-123: Excellent test coverage for validation scenarios. The new TestRunLocallyValidation class provides comprehensive coverage of the validation functionality, testing various edge cases including stop_on_error behavior, missing fields, multiple evaluators, and backward compatibility with string evaluators. The tests are well-structured and use appropriate mocking patterns.
feat(instrumentation): ... or fix(instrumentation): ...

Important
This PR adds new evaluator configurations, task validation, and experiment execution enhancements to the Traceloop SDK, with comprehensive tests and improved error handling.
- Sample experiment scripts (compliance_exp.py, correctness_exp.py, formatting_exp.py) showcasing various evaluators like content compliance, correctness, and formatting.
- EvaluatorMadeByTraceloop class for creating evaluators with specific configurations.
- EvaluatorDetails in config.py for detailed evaluator configuration.
- Validation in evaluator.py to ensure required fields are present before evaluation.
- Error handling in http.py with detailed contextual messages.
- Updates to run_research_experiment.py.
- New tests in test_evaluator.py.
- New tests in test_experiment.py.
- Experiment class in experiment.py updated to handle evaluator configurations and task validations.
- Changes to evaluator.py and evaluators_made_by_traceloop.py.
This description was created by
for 74ecdf4. You can customize this summary. It will automatically update as commits are pushed.
Summary by CodeRabbit
Release Notes
✏️ Tip: You can customize this high-level summary in your review settings.