
Conversation

@nina-kollman
Contributor

@nina-kollman nina-kollman commented Dec 7, 2025

  • I have added tests that cover my changes.
  • If adding a new instrumentation or changing an existing one, I've added screenshots from some observability platform showing the change.
  • PR name follows conventional commits format: feat(instrumentation): ... or fix(instrumentation): ....
  • (If applicable) I have updated the documentation accordingly.

Important

This PR adds new evaluator configurations, task validation, and experiment execution enhancements to the Traceloop SDK, with comprehensive tests and improved error handling.

  • New Features:
    • Added multiple experiment demos (compliance_exp.py, correctness_exp.py, formatting_exp.py) showcasing various evaluators like content compliance, correctness, and formatting.
    • Introduced EvaluatorMadeByTraceloop class for creating evaluators with specific configurations.
    • Added EvaluatorDetails in config.py for detailed evaluator configuration.
  • Enhancements:
    • Implemented task-output validation in evaluator.py to ensure required fields are present before evaluation.
    • Improved HTTP error reporting in http.py with detailed contextual messages.
    • Standardized task output field from "sentence" to "text" in run_research_experiment.py.
  • Tests:
    • Added comprehensive tests for task-output validation in test_evaluator.py.
    • Enhanced experiment validation flows in test_experiment.py.
  • Miscellaneous:
    • Updated Experiment class in experiment.py to handle evaluator configurations and task validations.
    • Added support for new evaluators in evaluator.py and evaluators_made_by_traceloop.py.

This description was created by Ellipsis for 74ecdf4. You can customize this summary. It will automatically update as commits are pushed.


Summary by CodeRabbit

Release Notes

  • New Features
    • Added 20+ pre-built evaluators for common scenarios (PII detection, content safety, SQL/JSON validation, semantic similarity, and more).
    • Task output validation enforces required input fields before evaluation.
    • Evaluator configuration support for customized evaluation behavior.
    • Enhanced error messages with detailed server-provided context.

✏️ Tip: You can customize this high-level summary in your review settings.

@coderabbitai

coderabbitai bot commented Dec 7, 2025

Note

Other AI code review bot(s) detected

CodeRabbit has detected other AI code review bot(s) in this pull request and will avoid duplicating their findings in the review comments. This may lead to a less comprehensive review.

Walkthrough

This PR introduces comprehensive evaluator support to the Traceloop SDK. It adds the EvaluatorDetails configuration model and the EvaluatorMadeByTraceloop factory class for preconfigured evaluators, enhances the Evaluator class with config propagation and task-output validation, refactors Experiment.run() to accept and configure evaluators, and adds six new sample experiment scripts demonstrating various evaluator types.
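
For orientation, here is a minimal usage sketch assembled from this walkthrough. The EvaluatorDetails fields, the EvaluatorMadeByTraceloop factory, and the pii_detector/word_count evaluators come from the PR; the factory call arguments, config values, and the Experiment.run() call shape are assumptions, not the SDK's verified API.

# Hedged usage sketch; factory arguments, config values, and the run() call
# shape are assumptions based on the PR description.
from traceloop.sdk.evaluator import EvaluatorDetails, EvaluatorMadeByTraceloop


async def summarize_task(row: dict) -> dict:
    # Task output keys must cover every evaluator's required_input_fields,
    # otherwise validation raises a ValueError before evaluation starts.
    return {"text": f"Summary of: {row.get('document', '')}"}


evaluators = [
    "toxicity",                               # plain slug: no output validation
    EvaluatorMadeByTraceloop.pii_detector(),  # preconfigured EvaluatorDetails
    EvaluatorDetails(
        slug="word_count",
        config={"max_words": 100},
        required_input_fields=["text"],
    ),
]

# Hypothetical call shape, mirroring the sample scripts in this PR:
# results, errors = await client.experiment.run(task=summarize_task, evaluators=evaluators)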

Changes

Cohort / File(s) Summary
Sample experiment scripts
packages/sample-app/sample_app/experiment/made_by_traceloop/compliance_exp.py, correctness_exp.py, formatting_exp.py, quality_exp.py, security_exp.py, style_exp.py
New Python scripts demonstrating Traceloop experiments for content compliance, correctness, formatting, quality, security, and style evaluators. Each defines async task functions, response generation via OpenAI, and experiment execution with specific evaluators and configurations.
Research experiment update
packages/sample-app/sample_app/experiment/run_research_experiment.py
Changed returned dictionary key in research_task from "sentence" to "text" for consistency with evaluator expectations.
Evaluator public API
packages/traceloop-sdk/traceloop/sdk/evaluator/__init__.py
Exported EvaluatorDetails and EvaluatorMadeByTraceloop to public API alongside existing Evaluator export.
Evaluator configuration model
packages/traceloop-sdk/traceloop/sdk/evaluator/config.py
Added EvaluatorDetails Pydantic model with fields: slug, version, config, and required_input_fields for configuring evaluators (see the sketch after this table).
Evaluator factory class
packages/traceloop-sdk/traceloop/sdk/evaluator/evaluators_made_by_traceloop.py
New EvaluatorMadeByTraceloop factory providing static methods for 20+ preconfigured evaluators (PII detector, toxicity, regex validator, JSON validator, semantic similarity, perplexity, etc.) returning EvaluatorDetails instances.
Evaluator core enhancements
packages/traceloop-sdk/traceloop/sdk/evaluator/evaluator.py
Added evaluator_config parameter propagation through _build_evaluator_request, run_experiment_evaluator, and trigger_experiment_evaluator; introduced _extract_error_from_response for detailed error reporting; added validate_task_output utility for field validation against evaluator requirements.
Evaluator request model
packages/traceloop-sdk/traceloop/sdk/evaluator/model.py
Added optional evaluator_config field (Dict[str, Any]) to ExecuteEvaluatorRequest.
HTTP client error handling
packages/traceloop-sdk/traceloop/sdk/client/http.py
Introduced _format_error_message method for centralized, contextual error reporting; all HTTP methods (post, get, delete, put) now use this formatter to provide detailed server error details.
Experiment type definitions
packages/traceloop-sdk/traceloop/sdk/experiment/model.py
Replaced local EvaluatorDetails/EvaluatorSlug/EvaluatorVersion aliases with new EvaluatorSpec union type (Union[str, EvaluatorDetails]) to represent either string slug or detailed config.
Experiment orchestration
packages/traceloop-sdk/traceloop/sdk/experiment/experiment.py
Updated run() signature to accept List[EvaluatorSpec]; added evaluator_config propagation to evaluator invocations; integrated validate_task_output to verify task outputs meet evaluator requirements before evaluation; enhanced error handling with detailed console logging.
Evaluator validation tests
packages/traceloop-sdk/tests/evaluator/test_evaluator.py
Added comprehensive TestValidateTaskOutput test suite validating missing/present required fields, error messaging, partial matches, empty outputs, and evaluator configurations.
Experiment validation tests
packages/traceloop-sdk/tests/experiment/test_experiment.py
Added TestRunLocallyValidation class with async tests validating _run_locally behavior under various validation conditions (stop_on_error true/false) with failing and successful outputs; extended imports and test data formatting.
Test configuration
packages/traceloop-sdk/tests/conftest.py
Added anyio_backend session fixture forcing asyncio backend for test execution.
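
As a reference for the "Evaluator configuration model" row above, here is a hedged sketch of an EvaluatorDetails model carrying the four listed fields; the field types and defaults beyond the names are assumptions.

# Sketch of the EvaluatorDetails shape described in the table above; field names
# match the PR summary, while types and defaults are assumptions.
from typing import Any, Dict, List, Optional

from pydantic import BaseModel


class EvaluatorDetails(BaseModel):
    slug: str
    version: Optional[str] = None
    config: Optional[Dict[str, Any]] = None
    required_input_fields: Optional[List[str]] = None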

Sequence Diagram

sequenceDiagram
    participant Client as Client Code
    participant Exp as Experiment
    participant Task as Task Function
    participant Val as Validation
    participant Eval as Evaluator
    
    Client->>Exp: run(evaluators=[EvaluatorDetails(...)])
    Exp->>Exp: _run_locally()
    Exp->>Task: task(row)
    Task-->>Exp: task_result (dict)
    Exp->>Val: validate_task_output(task_result, evaluators)
    alt Validation passes
        Val-->>Exp: OK
        Exp->>Eval: run_experiment_evaluator(task_result, evaluator_config)
        Eval-->>Exp: evaluation_result
    else Validation fails
        Val-->>Exp: ValueError (missing fields)
        Exp->>Exp: handle error (stop_on_error logic)
    end
    Exp-->>Client: experiment_results

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

  • Evaluator config propagation chain: Verify evaluator_config flows correctly through run() → run_experiment_evaluator → ExecuteEvaluatorRequest across both local and GitHub experiment paths.
  • Task output validation logic: Ensure validate_task_output correctly identifies missing required_input_fields for each evaluator and error messages are actionable; test edge cases (empty outputs, case sensitivity, extra fields).
  • EvaluatorSpec type flexibility: Confirm run() correctly handles both string slugs (no validation) and EvaluatorDetails objects (with validation) and that backward compatibility is maintained (a dispatch sketch follows this list).
  • Sample script consistency: Verify all six experiment scripts follow consistent patterns for evaluator setup, OpenAI integration, and result handling.
  • HTTP error extraction: Confirm _format_error_message robustly handles various server error response formats (JSON vs plain text) and preserves existing control flow.
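
For the EvaluatorSpec point above, a hedged dispatch sketch; extract_slugs is a hypothetical helper name, and the silent-skip behavior mirrors what the review discusses rather than confirmed SDK code.

from typing import List, Union

from traceloop.sdk.evaluator import EvaluatorDetails

EvaluatorSpec = Union[str, EvaluatorDetails]  # per experiment/model.py in this PR


def extract_slugs(evaluators: List[EvaluatorSpec]) -> List[str]:
    slugs: List[str] = []
    for evaluator in evaluators or []:
        if isinstance(evaluator, str):
            slugs.append(evaluator)        # bare slug, no validation metadata
        elif isinstance(evaluator, EvaluatorDetails):
            slugs.append(evaluator.slug)   # detailed spec carries its own slug
        # other spec types are skipped here; raising TypeError would be stricter
    return slugs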

Possibly related PRs

Suggested reviewers

  • nirga
  • doronkopit5

Poem

🐰 Evaluators now configured with flair,
Validation checks ensure quality's there!
EvaluatorMadeByTraceloop so grand,
Makes quality testing across the land!
Five plus one experiments, all working together,
Building assessment that lasts forever! ✨

Pre-merge checks and finishing touches

✅ Passed checks (3 passed)
  • Description Check: ✅ Passed. Check skipped; CodeRabbit's high-level summary is enabled.
  • Title check: ✅ Passed. The PR title accurately describes the main addition of Traceloop evaluators via the EvaluatorMadeByTraceloop class and related factory methods across multiple experiment demos.
  • Docstring Coverage: ✅ Passed. Docstring coverage is 91.76%, which is above the required threshold of 80.00%.
✨ Finishing touches
  • 📝 Generate docstrings
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment
  • Commit unit tests in branch nk/exp_validation

Comment @coderabbitai help to get the list of available commands and usage tips.

@nina-kollman nina-kollman marked this pull request as ready for review December 7, 2025 10:27
Contributor

@ellipsis-dev ellipsis-dev bot left a comment


Caution

Changes requested ❌

Reviewed everything up to cff5fd4 in 2 minutes and 0 seconds. Click for details.
  • Reviewed 1715 lines of code in 15 files
  • Skipped 0 files when reviewing.
  • Skipped posting 3 draft comments. View those below.
  • Modify your settings and rules to customize what types of comments Ellipsis leaves. And don't forget to react with 👍 or 👎 to teach Ellipsis.
1. packages/sample-app/sample_app/experiment/made_by_traceloop/quality_exp.py:57
  • Draft comment:
    The perplexity evaluator requires a field named 'prompt', but this task returns 'logprobs' (using 'completion'). Consider renaming the key to 'prompt' or update the evaluator configuration.
  • Reason this comment was not posted:
    Decided after close inspection that this draft comment was likely wrong and/or not actionable: usefulness confidence = 15% vs. threshold = 50% The comment is making a specific claim about the API requirements of the perplexity evaluator. However, I cannot verify this without seeing the Traceloop SDK documentation or other parts of the codebase. The code comment on line 59 says "For perplexity" suggesting the author intended this field for the perplexity evaluator. Additionally, there's a problem with the logic: line 36 shows generate_response returns only response.choices[0].message.content (a string), not the full response object with logprobs. So completion on line 54 is just a string, not logprobs data. This is a separate issue from what the comment raises. The comment's claim about field naming cannot be verified without external context. I'm making assumptions about what the perplexity evaluator requires without having access to its documentation or implementation. The comment might be correct, but I cannot verify it from the diff alone. Additionally, there's actually a different bug (returning only content instead of logprobs) that the comment doesn't address, which makes me question if the reviewer understood the code correctly. Since I cannot verify the comment's claim without external documentation, and the rule states I must see STRONG EVIDENCE that the comment is correct, I should delete it. The comment requires knowledge of the Traceloop SDK API that isn't visible in this diff. Furthermore, the comment's phrasing is confusing and doesn't address the actual bug in the code (that logprobs aren't being returned at all). Delete this comment. It makes claims about API requirements that cannot be verified from the diff alone, violating the rule that I need strong evidence the comment is correct. The comment also appears to misunderstand the actual issue in the code.
2. packages/sample-app/sample_app/experiment/made_by_traceloop/security_exp.py:48
  • Draft comment:
    The prompt injection evaluator expects a field named 'prompt', but the task returns 'query'. Update the key (or evaluator requirements) for consistency.
  • Reason this comment was not posted:
    Decided after close inspection that this draft comment was likely wrong and/or not actionable: usefulness confidence = 15% vs. threshold = 50% The comment is making a specific claim about the API requirements of the prompt_injection evaluator - that it expects a field named "prompt" rather than "query". However, this is not something I can verify from the diff itself. The code is new (it's a new file), and there's no error message or documentation visible in the diff that confirms this requirement. The comment is essentially speculative - it assumes knowledge about the Traceloop SDK's internal requirements that isn't evident from the code shown. This violates the rule about not making speculative comments. If this were actually wrong, it would likely be caught at runtime or during testing. Perhaps the reviewer has external knowledge of the Traceloop SDK API that I don't have access to. Maybe there's documentation or previous examples in the codebase that show the correct field naming convention. The comment could be correct even if I can't verify it from this diff alone. Even if the reviewer has external knowledge, the rule states I must see STRONG EVIDENCE in the diff itself to keep a comment. Without documentation, error messages, or other clear indicators in the changed code, this is speculative. If the field naming is wrong, it would be caught when the code runs, not during code review. This comment should be deleted. There's no strong evidence in the diff that the field naming is incorrect. The comment is speculative about API requirements that aren't documented or evident in the code changes. If this were actually wrong, it would be caught at runtime.
3. packages/traceloop-sdk/traceloop/sdk/experiment/experiment.py:223
  • Draft comment:
    Minor suggestion: The comment currently reads "Print error to console so user can see it." Consider adding 'the' to improve readability, e.g., "Print error to console so the user can see it."
  • Reason this comment was not posted:
    Comment did not seem useful. Confidence is useful = 20% <= threshold 50% This comment is purely about improving readability by adding a word. It doesn't suggest a code change or identify a potential issue. It seems to be more of a stylistic suggestion rather than a necessary change.

Workflow ID: wflow_SR8fxI3SxDzHInbm

You can customize Ellipsis by changing your verbosity settings, reacting with 👍 or 👎, replying to comments, or adding code review rules.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 10

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
packages/traceloop-sdk/traceloop/sdk/experiment/experiment.py (1)

96-170: Guard against evaluators being None when building evaluators_to_validate (currently a crash)

In _run_locally, evaluators is annotated as optional and often left as the default None, but evaluators_to_validate is built with a bare comprehension:

evaluators_to_validate = [evaluator for evaluator in evaluators if isinstance(evaluator, EvaluatorDetails)]

When evaluators is None this raises TypeError: 'NoneType' object is not iterable, which would break the default “no evaluators” path. You can fix this by treating None as an empty list:

-        evaluators_to_validate = [evaluator for evaluator in evaluators if isinstance(evaluator, EvaluatorDetails)]
+        evaluators_to_validate = [
+            evaluator
+            for evaluator in (evaluators or [])
+            if isinstance(evaluator, EvaluatorDetails)
+        ]

While you’re here, consider updating the _run_locally docstring evaluators description (“List of evaluator slugs to run”) to mention EvaluatorDetails as well, to match the actual accepted types.

🧹 Nitpick comments (13)
packages/traceloop-sdk/traceloop/sdk/client/http.py (1)

97-130: Tighten _format_error_message exception handling and address linter hints

The helper is a solid improvement, but a couple of small tweaks will both satisfy the linters and make behavior clearer:

  • Line 101 (RUF010): you can avoid the f-string conversion warning and the extra str() call by using an explicit conversion flag:

-    error_parts = [f"Error making request to {path}: {str(exception)}"]
+    error_parts = [f"Error making request to {path}: {exception!s}"]

  • Lines 123–128 (S110, BLE001): catching bare Exception and doing pass is a code smell. Narrowing the exception type and handling it explicitly (even in a minimal way) makes the intent clearer and should quiet the warnings. For example:

            except (ValueError, AttributeError):
                # Not JSON, try to get text
                try:
                    error_text = response.text
                    if error_text and len(error_text) < 500:  # Only include short error messages
                        error_parts.append(f"Server response: {error_text}")
-                except Exception:
-                    pass
+                except (AttributeError, UnicodeDecodeError) as exc:
+                    # Don't let response text decoding failures break error formatting
+                    error_parts.append(f"Failed to read server error text: {exc!s}")

    (Adjust the caught exception types if you see other concrete failures in practice.)
    
    

Optionally, you might also append the HTTP status code when exception.response is present to further improve debuggability, but the current behavior is already a clear upgrade.

packages/sample-app/sample_app/experiment/made_by_traceloop/style_exp.py (1)

90-98: Prefix unused unpacked variables with underscore.

The results and errors variables are unpacked but never used. Following Python conventions, prefix them with underscores to indicate intentional non-use.

Apply this diff:

-    results, errors = await client.experiment.run(
+    _results, _errors = await client.experiment.run(
packages/sample-app/sample_app/experiment/made_by_traceloop/compliance_exp.py (1)

88-96: Prefix unused unpacked variables with underscore.

The results and errors variables are unpacked but never used. Following Python conventions, prefix them with underscores to indicate intentional non-use.

Apply this diff:

-    results, errors = await client.experiment.run(
+    _results, _errors = await client.experiment.run(
packages/sample-app/sample_app/experiment/made_by_traceloop/quality_exp.py (1)

104-112: Prefix unused unpacked variables with underscore.

The results and errors variables are unpacked but never used. Following Python conventions, prefix them with underscores to indicate intentional non-use.

Apply this diff:

-    results, errors = await client.experiment.run(
+    _results, _errors = await client.experiment.run(
packages/sample-app/sample_app/experiment/made_by_traceloop/correctness_exp.py (2)

22-22: Use explicit Optional type hint.

PEP 484 prohibits implicit Optional. The parameter should explicitly declare the optional type.

+from typing import Optional
+
-async def generate_response(prompt: str, context: str = None) -> str:
+async def generate_response(prompt: str, context: Optional[str] = None) -> str:

90-98: Prefix unused variables with underscore.

The results and errors variables are unpacked but never used. As flagged by static analysis (Ruff RUF059), prefix them with underscores.

-    results, errors = await client.experiment.run(
+    _results, _errors = await client.experiment.run(
packages/sample-app/sample_app/experiment/made_by_traceloop/formatting_exp.py (2)

127-135: Prefix unused variables with underscore.

As flagged by static analysis (Ruff RUF059), prefix unused variables with underscores.

-    results, errors = await client.experiment.run(
+    _results, _errors = await client.experiment.run(

142-191: run_validation_examples() is defined but never called.

This function provides useful live examples but is never invoked from the main entry point. Consider either calling it (e.g., after run_formatting_experiment()) or removing it if it's not needed.

packages/sample-app/sample_app/experiment/made_by_traceloop/security_exp.py (1)

84-92: Prefix unused variables with underscore.

As flagged by static analysis (Ruff RUF059), prefix unused variables with underscores.

-    results, errors = await client.experiment.run(
+    _results, _errors = await client.experiment.run(
packages/traceloop-sdk/traceloop/sdk/evaluator/evaluator.py (1)

211-223: Consider logging the exception in the fallback path.

While the fallback to response.text is appropriate, Ruff (S110) suggests logging exceptions rather than silently passing. This could aid debugging when error extraction fails unexpectedly.

+import logging
+
+logger = logging.getLogger(__name__)
+
 def _extract_error_from_response(response: httpx.Response) -> str:
     # ...
     try:
         error_json = response.json()
         # ...
-    except Exception:
-        pass  # Use response.text as fallback
+    except Exception as e:
+        logger.debug("Failed to parse error response as JSON: %s", e)

     return error_detail
packages/traceloop-sdk/traceloop/sdk/experiment/experiment.py (3)

43-57: Make evaluators explicitly optional to match the default and avoid implicit Optional

evaluators has a default of None but is annotated as a non‑optional List[EvaluatorSpec], which both misleads type checkers and triggers Ruff’s RUF013 (“implicit Optional”). Align it with the other methods and the default:

-        evaluators: List[EvaluatorSpec] = None,
+        evaluators: Optional[List[EvaluatorSpec]] = None,

The updated docstring for evaluators already reflects the richer accepted types, so no further change needed there.


181-225: Simplify exception formatting in f-strings and consider using logging instead of print

The new error handling is welcome, but you’re wrapping e with str(e) inside f-strings:

error_msg = f"Error: {str(e)}"
print(f"\033[91m❌ Evaluator '{evaluator_slug}' failed: {str(e)}\033[0m")
error_msg = f"Error processing row: {str(e)}"
print(f"\033[91m❌ Task execution failed: {str(e)}\033[0m")

Given f-strings already call str() by default, and Ruff flags this as RUF010, you can simplify and satisfy the linter:

-                            error_msg = f"Error: {str(e)}"
+                            error_msg = f"Error: {e}"
...
-                            print(f"\033[91m❌ Evaluator '{evaluator_slug}' failed: {str(e)}\033[0m")
+                            print(f"\033[91m❌ Evaluator '{evaluator_slug}' failed: {e}\033[0m")
...
-                error_msg = f"Error processing row: {str(e)}"
+                error_msg = f"Error processing row: {e}"
...
-                print(f"\033[91m❌ Task execution failed: {str(e)}\033[0m")
+                print(f"\033[91m❌ Task execution failed: {e}\033[0m")

Longer term, you might want to route these through the standard logging framework instead of print for better control in libraries.


262-365: Align _run_in_github evaluator handling and docstring with supported types

_run_in_github now accepts evaluators: Optional[List[EvaluatorSpec]] and extracts slugs via:

if evaluators:
    evaluator_slugs = []
    for evaluator in evaluators:
        if isinstance(evaluator, str):
            evaluator_slugs.append(evaluator)
        elif isinstance(evaluator, EvaluatorDetails):
            evaluator_slugs.append(evaluator.slug)

while the docstring still says:

evaluators: List of evaluator slugs or (slug, version) tuples to run

Either reintroduce tuple support in the implementation or (more likely) update the docstring to mention str / EvaluatorDetails (or EvaluatorSpec) rather than (slug, version) tuples so users aren’t misled about accepted shapes.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

Disabled knowledge base sources:

  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between f782741 and cff5fd4.

📒 Files selected for processing (15)
  • packages/sample-app/sample_app/experiment/made_by_traceloop/compliance_exp.py (1 hunks)
  • packages/sample-app/sample_app/experiment/made_by_traceloop/correctness_exp.py (1 hunks)
  • packages/sample-app/sample_app/experiment/made_by_traceloop/formatting_exp.py (1 hunks)
  • packages/sample-app/sample_app/experiment/made_by_traceloop/quality_exp.py (1 hunks)
  • packages/sample-app/sample_app/experiment/made_by_traceloop/security_exp.py (1 hunks)
  • packages/sample-app/sample_app/experiment/made_by_traceloop/style_exp.py (1 hunks)
  • packages/sample-app/sample_app/experiment/run_research_experiment.py (1 hunks)
  • packages/traceloop-sdk/traceloop/sdk/client/http.py (4 hunks)
  • packages/traceloop-sdk/traceloop/sdk/evaluator/__init__.py (1 hunks)
  • packages/traceloop-sdk/traceloop/sdk/evaluator/config.py (1 hunks)
  • packages/traceloop-sdk/traceloop/sdk/evaluator/evaluator.py (9 hunks)
  • packages/traceloop-sdk/traceloop/sdk/evaluator/evaluators_made_by_traceloop.py (1 hunks)
  • packages/traceloop-sdk/traceloop/sdk/evaluator/model.py (1 hunks)
  • packages/traceloop-sdk/traceloop/sdk/experiment/experiment.py (11 hunks)
  • packages/traceloop-sdk/traceloop/sdk/experiment/model.py (1 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

**/*.py: Store API keys only in environment variables/secure vaults; never hardcode secrets in code
Use Flake8 for code linting and adhere to its rules

Files:

  • packages/traceloop-sdk/traceloop/sdk/evaluator/model.py
  • packages/sample-app/sample_app/experiment/made_by_traceloop/quality_exp.py
  • packages/traceloop-sdk/traceloop/sdk/evaluator/config.py
  • packages/sample-app/sample_app/experiment/run_research_experiment.py
  • packages/traceloop-sdk/traceloop/sdk/evaluator/__init__.py
  • packages/traceloop-sdk/traceloop/sdk/evaluator/evaluator.py
  • packages/sample-app/sample_app/experiment/made_by_traceloop/correctness_exp.py
  • packages/sample-app/sample_app/experiment/made_by_traceloop/security_exp.py
  • packages/traceloop-sdk/traceloop/sdk/experiment/model.py
  • packages/sample-app/sample_app/experiment/made_by_traceloop/compliance_exp.py
  • packages/sample-app/sample_app/experiment/made_by_traceloop/style_exp.py
  • packages/sample-app/sample_app/experiment/made_by_traceloop/formatting_exp.py
  • packages/traceloop-sdk/traceloop/sdk/experiment/experiment.py
  • packages/traceloop-sdk/traceloop/sdk/client/http.py
  • packages/traceloop-sdk/traceloop/sdk/evaluator/evaluators_made_by_traceloop.py
🧠 Learnings (1)
📚 Learning: 2025-08-04T15:35:30.188Z
Learnt from: nina-kollman
Repo: traceloop/openllmetry PR: 3219
File: packages/traceloop-sdk/traceloop/sdk/datasets/dataset.py:357-360
Timestamp: 2025-08-04T15:35:30.188Z
Learning: In the traceloop SDK Dataset class, the established error handling pattern is that HTTP client methods return None on failure (after catching and logging RequestException), and the Dataset API methods check for None return values and raise Exception with descriptive messages. The update_row_api method is inconsistent with this pattern.

Applied to files:

  • packages/traceloop-sdk/traceloop/sdk/client/http.py
🧬 Code graph analysis (8)
packages/sample-app/sample_app/experiment/made_by_traceloop/quality_exp.py (2)
packages/traceloop-sdk/traceloop/sdk/__init__.py (2)
  • Traceloop (36-267)
  • init (48-198)
packages/traceloop-sdk/traceloop/sdk/evaluator/evaluators_made_by_traceloop.py (5)
  • EvaluatorMadeByTraceloop (5-407)
  • perplexity (394-407)
  • agent_goal_accuracy (360-374)
  • semantic_similarity (343-357)
  • topic_adherence (377-391)
packages/sample-app/sample_app/experiment/made_by_traceloop/correctness_exp.py (2)
packages/traceloop-sdk/traceloop/sdk/__init__.py (2)
  • Traceloop (36-267)
  • init (48-198)
packages/traceloop-sdk/traceloop/sdk/evaluator/evaluators_made_by_traceloop.py (2)
  • answer_relevancy (242-256)
  • faithfulness (259-274)
packages/sample-app/sample_app/experiment/made_by_traceloop/security_exp.py (1)
packages/traceloop-sdk/traceloop/sdk/evaluator/evaluators_made_by_traceloop.py (4)
  • EvaluatorMadeByTraceloop (5-407)
  • pii_detector (22-38)
  • secrets_detector (312-324)
  • prompt_injection (61-77)
packages/traceloop-sdk/traceloop/sdk/experiment/model.py (1)
packages/traceloop-sdk/traceloop/sdk/evaluator/config.py (1)
  • EvaluatorDetails (5-22)
packages/sample-app/sample_app/experiment/made_by_traceloop/style_exp.py (1)
packages/traceloop-sdk/traceloop/sdk/evaluator/evaluators_made_by_traceloop.py (4)
  • char_count (177-190)
  • word_count (210-222)
  • char_count_ratio (193-207)
  • word_count_ratio (225-239)
packages/sample-app/sample_app/experiment/made_by_traceloop/formatting_exp.py (1)
packages/traceloop-sdk/traceloop/sdk/evaluator/evaluators_made_by_traceloop.py (4)
  • json_validator (114-137)
  • sql_validator (327-340)
  • regex_validator (80-111)
  • placeholder_regex (140-174)
packages/traceloop-sdk/traceloop/sdk/experiment/experiment.py (3)
packages/traceloop-sdk/traceloop/sdk/evaluator/evaluator.py (2)
  • Evaluator (16-146)
  • validate_task_output (149-195)
packages/traceloop-sdk/traceloop/sdk/experiment/model.py (1)
  • InitExperimentRequest (17-25)
packages/traceloop-sdk/traceloop/sdk/evaluator/config.py (1)
  • EvaluatorDetails (5-22)
packages/traceloop-sdk/traceloop/sdk/evaluator/evaluators_made_by_traceloop.py (1)
packages/traceloop-sdk/traceloop/sdk/evaluator/config.py (1)
  • EvaluatorDetails (5-22)
🪛 Flake8 (7.3.0)
packages/traceloop-sdk/traceloop/sdk/evaluator/evaluator.py

[error] 2-2: 'inspect' imported but unused

(F401)


[error] 3-3: 'typing.Callable' imported but unused

(F401)


[error] 3-3: 'typing.get_type_hints' imported but unused

(F401)


[error] 3-3: 'typing.get_origin' imported but unused

(F401)


[error] 3-3: 'typing.get_args' imported but unused

(F401)

🪛 Ruff (0.14.7)
packages/sample-app/sample_app/experiment/made_by_traceloop/quality_exp.py

104-104: Unpacked variable results is never used

Prefix it with an underscore or any other dummy variable pattern

(RUF059)


104-104: Unpacked variable errors is never used

Prefix it with an underscore or any other dummy variable pattern

(RUF059)

packages/traceloop-sdk/traceloop/sdk/evaluator/evaluator.py

221-222: try-except-pass detected, consider logging the exception

(S110)


221-221: Do not catch blind exception: Exception

(BLE001)

packages/sample-app/sample_app/experiment/made_by_traceloop/correctness_exp.py

22-22: PEP 484 prohibits implicit Optional

Convert to T | None

(RUF013)


90-90: Unpacked variable results is never used

Prefix it with an underscore or any other dummy variable pattern

(RUF059)


90-90: Unpacked variable errors is never used

Prefix it with an underscore or any other dummy variable pattern

(RUF059)

packages/sample-app/sample_app/experiment/made_by_traceloop/security_exp.py

84-84: Unpacked variable results is never used

Prefix it with an underscore or any other dummy variable pattern

(RUF059)


84-84: Unpacked variable errors is never used

Prefix it with an underscore or any other dummy variable pattern

(RUF059)

packages/sample-app/sample_app/experiment/made_by_traceloop/compliance_exp.py

88-88: Unpacked variable results is never used

Prefix it with an underscore or any other dummy variable pattern

(RUF059)


88-88: Unpacked variable errors is never used

Prefix it with an underscore or any other dummy variable pattern

(RUF059)

packages/sample-app/sample_app/experiment/made_by_traceloop/style_exp.py

90-90: Unpacked variable results is never used

Prefix it with an underscore or any other dummy variable pattern

(RUF059)


90-90: Unpacked variable errors is never used

Prefix it with an underscore or any other dummy variable pattern

(RUF059)

packages/sample-app/sample_app/experiment/made_by_traceloop/formatting_exp.py

127-127: Unpacked variable results is never used

Prefix it with an underscore or any other dummy variable pattern

(RUF059)


127-127: Unpacked variable errors is never used

Prefix it with an underscore or any other dummy variable pattern

(RUF059)

packages/traceloop-sdk/traceloop/sdk/experiment/experiment.py

43-43: PEP 484 prohibits implicit Optional

Convert to T | None

(RUF013)


212-212: Use explicit conversion flag

Replace with conversion flag

(RUF010)


215-215: Use explicit conversion flag

Replace with conversion flag

(RUF010)


222-222: Use explicit conversion flag

Replace with conversion flag

(RUF010)


224-224: Use explicit conversion flag

Replace with conversion flag

(RUF010)

packages/traceloop-sdk/traceloop/sdk/client/http.py

101-101: Use explicit conversion flag

Replace with conversion flag

(RUF010)


127-128: try-except-pass detected, consider logging the exception

(S110)


127-127: Do not catch blind exception: Exception

(BLE001)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (5)
  • GitHub Check: Test Packages (3.11)
  • GitHub Check: Test Packages (3.12)
  • GitHub Check: Test Packages (3.10)
  • GitHub Check: Lint
  • GitHub Check: Build Packages (3.11)
🔇 Additional comments (18)
packages/traceloop-sdk/traceloop/sdk/client/http.py (4)

36-37: Centralized error formatting in post looks consistent with existing pattern

Using _format_error_message while still returning None on failure keeps behavior aligned with the established “HTTP client returns None and caller raises” convention. Based on learnings, this matches the existing error‑handling approach.


58-59: Consistent error reporting in get

Reusing _format_error_message here keeps GET error handling consistent with POST/PUT/DELETE without changing the return contract.


73-74: DELETE error handling is aligned with other verbs

_format_error_message + False on failure matches the pattern used by the other HTTP methods.


93-94: PUT error handling reuses the shared formatter correctly

The new helper is applied here in the same way as for POST, preserving the None on error behavior while improving log detail.

packages/traceloop-sdk/traceloop/sdk/evaluator/__init__.py (1)

1-9: LGTM! Clean public API expansion.

The module correctly exposes the new evaluator configuration infrastructure (EvaluatorDetails and EvaluatorMadeByTraceloop) alongside the existing Evaluator class, following standard Python module patterns.

packages/traceloop-sdk/traceloop/sdk/evaluator/model.py (1)

16-23: LGTM! Backward-compatible configuration field added.

The addition of evaluator_config to ExecuteEvaluatorRequest enables per-evaluator configuration propagation without breaking existing usage. The flexible Dict[str, Any] type is appropriate for extensible configuration.
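
For illustration, a hedged sketch of this kind of backward-compatible field addition; the other request fields are elided and this is not the model's actual definition.

from typing import Any, Dict, Optional

from pydantic import BaseModel


class ExecuteEvaluatorRequest(BaseModel):
    # ...existing request fields elided...
    evaluator_config: Optional[Dict[str, Any]] = None  # None default keeps old callers working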

packages/sample-app/sample_app/experiment/run_research_experiment.py (2)

49-53: LGTM! Field name updated for consistency.

The output key change from "sentence" to "text" promotes consistency with other experiment tasks in the codebase (e.g., security_exp.py).


17-21: Good practice: API keys sourced from environment.

Properly retrieves sensitive credentials from environment variables, following the coding guidelines.

packages/traceloop-sdk/traceloop/sdk/experiment/model.py (1)

1-6: LGTM! Type definitions refactored for better organization.

Centralizing EvaluatorDetails in the config module and introducing EvaluatorSpec as a union type improves maintainability and enables flexible evaluator specification (either by slug string or detailed configuration object).

packages/sample-app/sample_app/experiment/made_by_traceloop/style_exp.py (1)

18-23: Good practice: API keys sourced from environment.

Properly retrieves sensitive credentials from environment variables, following the coding guidelines.

packages/sample-app/sample_app/experiment/made_by_traceloop/compliance_exp.py (1)

19-24: Good practice: API keys sourced from environment.

Properly retrieves sensitive credentials from environment variables, following the coding guidelines.

packages/sample-app/sample_app/experiment/made_by_traceloop/quality_exp.py (1)

21-26: Good practice: API keys sourced from environment.

Properly retrieves sensitive credentials from environment variables, following the coding guidelines.

packages/sample-app/sample_app/experiment/made_by_traceloop/correctness_exp.py (1)

40-56: Task output correctly provides required fields for configured evaluators.

The quality_task returns question, answer, completion, and context, which satisfies the following (a minimal task sketch follows the list):

  • answer_relevancy: requires ["question", "answer"]
  • faithfulness: requires ["question", "completion", "context"]
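
A minimal sketch of a task shaped like the one described above; the prompt handling and returned values are placeholders, not the sample app's actual code.

from typing import Any, Dict


async def quality_task(row: Dict[str, Any]) -> Dict[str, Any]:
    question = row["question"]
    context = row["context"]
    answer = f"Answer to '{question}' grounded in the provided context."
    return {
        "question": question,   # answer_relevancy and faithfulness
        "answer": answer,       # answer_relevancy
        "completion": answer,   # faithfulness
        "context": context,     # faithfulness
    }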
packages/sample-app/sample_app/experiment/made_by_traceloop/formatting_exp.py (1)

54-70: Task output correctly provides required fields for all configured evaluators.

The validation_task returns text and placeholder_value, satisfying all four evaluators' requirements.

packages/traceloop-sdk/traceloop/sdk/evaluator/evaluator.py (2)

61-66: Improved error messages with evaluator slug context.

Including the evaluator slug and extracted error detail in exceptions significantly improves debuggability.


149-195: Well-structured validation with actionable error messages.

The validate_task_output function provides clear feedback by listing missing fields per evaluator and showing available keys. This will help developers quickly identify and fix configuration issues.
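
For readers skimming the review, a hedged sketch of a validation helper with the behavior described above; the real validate_task_output in evaluator.py may differ in structure and message wording.

from typing import Any, Dict, List

from traceloop.sdk.evaluator import EvaluatorDetails


def validate_task_output(task_output: Dict[str, Any], evaluators: List[EvaluatorDetails]) -> None:
    problems = []
    for evaluator in evaluators:
        required = evaluator.required_input_fields or []
        missing = [field for field in required if field not in task_output]
        if missing:
            problems.append(f"{evaluator.slug}: missing {missing}")
    if problems:
        raise ValueError(
            "Task output is missing required fields -> "
            + "; ".join(problems)
            + f"; available keys: {sorted(task_output.keys())}"
        )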

packages/traceloop-sdk/traceloop/sdk/evaluator/evaluators_made_by_traceloop.py (1)

21-38: Well-structured factory pattern with clear configuration.

The static factory methods provide a clean API for creating preconfigured evaluators with proper type hints and sensible defaults.
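
To illustrate the factory pattern being praised here, a hedged sketch of one such static method; the probe_threshold argument, its default, and the required field shown are assumptions, not the SDK's actual signature.

from traceloop.sdk.evaluator import EvaluatorDetails


class EvaluatorMadeByTraceloop:
    @staticmethod
    def pii_detector(probe_threshold: float = 0.5) -> EvaluatorDetails:
        # Each factory method returns a fully populated EvaluatorDetails so callers
        # never hand-write slugs, configs, or required input fields.
        return EvaluatorDetails(
            slug="pii_detector",
            config={"probe_threshold": probe_threshold},
            required_input_fields=["text"],
        )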

packages/traceloop-sdk/traceloop/sdk/experiment/experiment.py (1)

8-21: Evaluator imports and type wiring look consistent

The new imports (Evaluator, validate_task_output, EvaluatorSpec, EvaluatorDetails) align with the later usage in this file and keep public API (EvaluatorSpec) separated from internal configuration (EvaluatorDetails). No issues here.

Contributor

@ellipsis-dev ellipsis-dev bot left a comment


Caution

Changes requested ❌

Reviewed ac8a852 in 2 minutes and 44 seconds. Click for details.
  • Reviewed 61 lines of code in 1 files
  • Skipped 0 files when reviewing.
  • Skipped posting 3 draft comments. View those below.
  • Modify your settings and rules to customize what types of comments Ellipsis leaves. And don't forget to react with 👍 or 👎 to teach Ellipsis.
1. packages/traceloop-sdk/traceloop/sdk/experiment/experiment.py:309
  • Draft comment:
    Passing the 'evaluators' parameter to _execute_tasks in the GitHub flow ensures validation is applied. Verify that the error handling remains consistent with the local execution path.
  • Reason this comment was not posted:
    Comment did not seem useful. Confidence is useful = 0% <= threshold 50% The comment is asking the PR author to verify that error handling remains consistent, which violates the rule against asking for confirmation or verification. It does not provide a specific suggestion or point out a specific issue.
2. packages/traceloop-sdk/traceloop/sdk/experiment/experiment.py:472
  • Draft comment:
    Filtering for EvaluatorDetails instances is appropriate. If non-EvaluatorDetails evaluators also need validation, consider extending this logic.
  • Reason this comment was not posted:
    Decided after close inspection that this draft comment was likely wrong and/or not actionable: usefulness confidence = 10% vs. threshold = 50% This comment says "consider extending this logic" which is speculative and not actionable. It's a "what if" suggestion without clear evidence that non-EvaluatorDetails evaluators need validation. The code intentionally filters to only EvaluatorDetails because string evaluators (the other type in EvaluatorSpec) don't have the required_input_fields needed for validation. This is working as designed. The comment violates the rule about not making speculative comments like "If X, then Y is an issue" - it's suggesting a potential future enhancement without evidence it's needed. Maybe the comment is pointing out a real limitation - perhaps string evaluators should also support validation somehow? Without seeing the EvaluatorSpec type definition and understanding the full design, I might be missing context about whether this is a genuine issue. Even if there's a potential limitation, the comment is still speculative ("consider extending") rather than pointing to a concrete bug or required change. The rules explicitly state to avoid speculative comments. If string evaluators needed validation, that would be a design decision requiring broader context beyond this PR, not something to address in this code review. This comment should be deleted. It's speculative ("consider extending") and doesn't point to a concrete issue with the changed code. The filtering to only EvaluatorDetails appears intentional since only those objects have the validation fields needed.
3. packages/traceloop-sdk/traceloop/sdk/experiment/experiment.py:497
  • Draft comment:
    Propagating ValueError in the outer exception block (via 'if isinstance(e, ValueError): raise e') might prevent error aggregation. Consider handling all errors uniformly so that one failed validation doesn’t cancel processing of other rows, unless that behavior is intended.
  • Reason this comment was not posted:
    Decided after close inspection that this draft comment was likely wrong and/or not actionable: usefulness confidence = 20% vs. threshold = 50% The comment raises a valid point about error handling behavior. The code re-raises ValueError (validation errors) but returns TaskResult with error for other exceptions. This creates inconsistent error handling. However, looking at line 513-515, there's no exception handling in the loop that collects results, so ANY exception (not just ValueError) would stop processing. This suggests the comment might be partially correct but incomplete in its analysis. The comment is speculative ("might prevent", "unless that behavior is intended") which violates the rule about not making speculative comments. The comment also doesn't clearly state what code change is required - it says "Consider handling all errors uniformly" which is vague. The comment is somewhat speculative with phrases like "might prevent" and "unless that behavior is intended". It's asking the author to confirm their intention rather than pointing out a definite bug. Also, the comment doesn't provide a clear, actionable fix - it just suggests to "consider" uniform handling. While the comment is somewhat speculative, it does point to a real inconsistency in error handling that was introduced in the diff. The ValueError is re-raised while other exceptions return TaskResult with error. However, the lack of exception handling at line 514 means this distinction doesn't actually matter in practice - both will stop processing. The comment is not clearly actionable and asks the author to confirm intent. The comment violates the rule against speculative comments ("might prevent", "unless that behavior is intended") and asks the author to confirm their intention rather than pointing out a clear bug. While there is an inconsistency in error handling, the comment doesn't provide clear, actionable guidance on what to fix.

Workflow ID: wflow_ginYKbQs0BEMASIq

You can customize Ellipsis by changing your verbosity settings, reacting with 👍 or 👎, replying to comments, or adding code review rules.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
packages/traceloop-sdk/traceloop/sdk/experiment/experiment.py (2)

96-171: Fix crash when evaluators is None and align error logging with lint hints

There is a real bug here:

evaluators_to_validate = [evaluator for evaluator in evaluators if isinstance(evaluator, EvaluatorDetails)]

If evaluators is left as the default None (the common case), this will raise TypeError: 'NoneType' object is not iterable before any rows are processed.

Guard this to prevent the crash:

-        evaluators_to_validate = [evaluator for evaluator in evaluators if isinstance(evaluator, EvaluatorDetails)]
+        evaluators_to_validate = [
+            evaluator
+            for evaluator in (evaluators or [])
+            if isinstance(evaluator, EvaluatorDetails)
+        ]

Additionally, replace the four instances of str(e) inside f-strings with the more concise e!s format to comply with Ruff's RUF010 rule:

-                            error_msg = f"Error: {str(e)}"
+                            error_msg = f"Error: {e!s}"
@@
-                            print(f"\033[91m❌ Evaluator '{evaluator_slug}' failed: {str(e)}\033[0m")
+                            print(f"\033[91m❌ Evaluator '{evaluator_slug}' failed: {e!s}\033[0m")
@@
-                error_msg = f"Error processing row: {str(e)}"
+                error_msg = f"Error processing row: {e!s}"
@@
-                print(f"\033[91m❌ Task execution failed: {str(e)}\033[0m")
+                print(f"\033[91m❌ Task execution failed: {e!s}\033[0m")
@@
-                error_msg = f"Task execution error: {str(e)}"
+                error_msg = f"Task execution error: {e!s}"

130-151: Docstring advertises unsupported tuple format; evaluators not in EvaluatorSpec union are silently dropped

The EvaluatorSpec type is defined as Union[str, EvaluatorDetails] (no tuples), yet the _run_in_github docstring at line 280 states: "List of evaluator slugs or (slug, version) tuples to run". Both _run_locally (lines 130-151) and _run_in_github (lines 357-366) only handle str and EvaluatorDetails via isinstance checks; anything else is silently skipped:

  • In _run_locally: unsupported specs never enter evaluator_details, so they won't run.
  • In _run_in_github: unsupported specs never contribute to evaluator_slugs, so backend evaluators won't trigger despite the docstring promising tuple support.

This causes silent failures that are difficult to debug. Either (a) update the docstring and implementation to support tuples as shown in the review, or (b) update the docstring to match the actual EvaluatorSpec type and add explicit TypeError for unsupported specs instead of silently dropping them.

🧹 Nitpick comments (2)
packages/traceloop-sdk/traceloop/sdk/experiment/experiment.py (2)

38-58: Tighten evaluators type annotation in run to match default None

run() declares evaluators: List[EvaluatorSpec] = None, which is an implicit Optional and what Ruff is flagging (RUF013). The rest of the file uses Optional[List[EvaluatorSpec]] when None is allowed.

To keep type checkers and linters happy and make the intent explicit, consider:

-        evaluators: List[EvaluatorSpec] = None,
+        evaluators: Optional[List[EvaluatorSpec]] = None,

454-502: Make _execute_tasks exception flow more idiomatic and preserve original tracebacks

In _execute_tasks.run_single_row:

  • You catch ValueError from validate_task_output, print, then raise a new ValueError(str(validation_error)), losing the original traceback.
  • In the outer except, you check isinstance(e, ValueError) and do raise e, which again resets the traceback (TRY201 / B904).
  • Ruff also flags the str(...) inside the f-strings (RUF010).

You can both simplify the flow and keep original traces by re-raising the current exception and separating ValueError from other failures:

        async def run_single_row(row: Optional[Dict[str, Any]]) -> TaskResult:
             try:
                 task_output = await task(row)
 
-                # Validate task output schema on first execution
-                if evaluators_to_validate:
-                    try:
-                        validate_task_output(task_output, evaluators_to_validate)
-                    except ValueError as validation_error:
-                        print(f"\033[91m❌ Task validation failed: {str(validation_error)}\033[0m")
-                        raise ValueError(str(validation_error))
-
-                return TaskResult(
-                    input=row,
-                    output=task_output,
-                )
-            except Exception as e:
-                if isinstance(e, ValueError):
-                    raise e
-                print(f"\033[91m❌ Task execution error: {str(e)}\033[0m")
-                return TaskResult(
-                    input=row,
-                    error=str(e),
-                )
+                # Validate task output schema each execution (fail fast on mismatch)
+                if evaluators_to_validate:
+                    try:
+                        validate_task_output(task_output, evaluators_to_validate)
+                    except ValueError as validation_error:
+                        print(f"\033[91m❌ Task validation failed: {validation_error!s}\033[0m")
+                        # Re-raise original ValueError with its traceback
+                        raise
+
+                return TaskResult(
+                    input=row,
+                    output=task_output,
+                )
+            except ValueError:
+                # Propagate validation/type errors unchanged to abort the run
+                raise
+            except Exception as e:
+                print(f"\033[91m❌ Task execution error: {e!s}\033[0m")
+                return TaskResult(
+                    input=row,
+                    error=str(e),
+                )

Semantics stay the same (all ValueErrors still bubble out and cause run_in_github to fail), but you keep meaningful stack traces and address the static-analysis warnings.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

Disabled knowledge base sources:

  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between cff5fd4 and ac8a852.

📒 Files selected for processing (1)
  • packages/traceloop-sdk/traceloop/sdk/experiment/experiment.py (13 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

**/*.py: Store API keys only in environment variables/secure vaults; never hardcode secrets in code
Use Flake8 for code linting and adhere to its rules

Files:

  • packages/traceloop-sdk/traceloop/sdk/experiment/experiment.py
🪛 Ruff (0.14.7)
packages/traceloop-sdk/traceloop/sdk/experiment/experiment.py

43-43: PEP 484 prohibits implicit Optional

Convert to T | None

(RUF013)


212-212: Use explicit conversion flag

Replace with conversion flag

(RUF010)


215-215: Use explicit conversion flag

Replace with conversion flag

(RUF010)


222-222: Use explicit conversion flag

Replace with conversion flag

(RUF010)


224-224: Use explicit conversion flag

Replace with conversion flag

(RUF010)


488-488: Use explicit conversion flag

Replace with conversion flag

(RUF010)


489-489: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling

(B904)


497-497: Use raise without specifying exception name

Remove exception name

(TRY201)


498-498: Use explicit conversion flag

Replace with conversion flag

(RUF010)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (5)
  • GitHub Check: Test Packages (3.12)
  • GitHub Check: Lint
  • GitHub Check: Test Packages (3.10)
  • GitHub Check: Test Packages (3.11)
  • GitHub Check: Build Packages (3.11)

@nina-kollman nina-kollman changed the title from "fix(exp): Improve exp error handling" to "fix(exp): Add made by traceloop evaluators" on Dec 7, 2025
Contributor

@ellipsis-dev ellipsis-dev bot left a comment


Caution

Changes requested ❌

Reviewed 7f1729d in 2 minutes and 10 seconds. Click for details.
  • Reviewed 87 lines of code in 3 files
  • Skipped 0 files when reviewing.
  • Skipped posting 7 draft comments. View those below.
  • Modify your settings and rules to customize what types of comments Ellipsis leaves. And don't forget to react with 👍 or 👎 to teach Ellipsis.
1. packages/traceloop-sdk/traceloop/sdk/evaluator/config.py:16
  • Draft comment:
    Docstring example now uses 'required_input_fields' instead of 'required_fields' for consistency with the model.
  • Reason this comment was not posted:
    Confidence changes required: 0% <= threshold 50% None
2. packages/traceloop-sdk/traceloop/sdk/evaluator/evaluator.py:2
  • Draft comment:
    Removed unused imports ('inspect' and extra typing helpers); clean-up looks good.
  • Reason this comment was not posted:
    Confidence changes required: 0% <= threshold 50% None
3. packages/traceloop-sdk/traceloop/sdk/evaluator/evaluators_made_by_traceloop.py:13
  • Draft comment:
    Updated example import to use 'EvaluatorMadeByTraceloop' instead of 'Predefined'.
  • Reason this comment was not posted:
    Confidence changes required: 0% <= threshold 50% None
4. packages/traceloop-sdk/traceloop/sdk/evaluator/evaluators_made_by_traceloop.py:65
  • Draft comment:
    Updated 'prompt_injection' evaluator doc: required field changed from 'text' to 'prompt'.
  • Reason this comment was not posted:
    Confidence changes required: 0% <= threshold 50% None
5. packages/traceloop-sdk/traceloop/sdk/evaluator/evaluators_made_by_traceloop.py:196
  • Draft comment:
    For 'char_count_ratio', required fields have been updated to 'numerator_text' and 'denominator_text' for clarity.
  • Reason this comment was not posted:
    Confidence changes required: 0% <= threshold 50% None
6. packages/traceloop-sdk/traceloop/sdk/evaluator/evaluators_made_by_traceloop.py:212
  • Draft comment:
    In 'word_count', changed required field from 'completion' to 'text' to match actual input.
  • Reason this comment was not posted:
    Confidence changes required: 0% <= threshold 50% None
7. packages/traceloop-sdk/traceloop/sdk/evaluator/evaluators_made_by_traceloop.py:263
  • Draft comment:
    In 'faithfulness', updated required field from 'answer' to 'completion' for consistency.
  • Reason this comment was not posted:
    Confidence changes required: 0% <= threshold 50% None

Workflow ID: wflow_HqrJV37uV00rUJGe

You can customize Ellipsis by changing your verbosity settings, reacting with 👍 or 👎, replying to comments, or adding code review rules.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

♻️ Duplicate comments (5)
packages/traceloop-sdk/traceloop/sdk/evaluator/evaluator.py (1)

2-2: LGTM! Unused imports removed.

The unused imports flagged by Flake8 have been properly removed, keeping only the types actually used in this file.

packages/traceloop-sdk/traceloop/sdk/evaluator/evaluators_made_by_traceloop.py (4)

5-19: LGTM! Clear factory class design.

The class provides a clean factory pattern for creating preconfigured evaluators with proper documentation and examples.


21-78: LGTM! Detection evaluators properly configured.

The detector evaluators (PII, toxicity, prompt injection) are correctly implemented with matching docstrings and required_input_fields.


241-274: LGTM! RAG evaluators properly configured.

The answer_relevancy and faithfulness evaluators have correct docstrings matching their required_input_fields.


224-239: Remove stray "+" character on Line 232.

There's a "+" character at the beginning of Line 232 that appears to be a formatting artifact or merge leftover.

Apply this diff:

         Required task output fields:
            - numerator_text: The numerator text for ratio calculation
-+          - denominator_text: The denominator text for ratio calculation
+           - denominator_text: The denominator text for ratio calculation
🧹 Nitpick comments (1)
packages/traceloop-sdk/traceloop/sdk/evaluator/evaluator.py (1)

210-221: Consider narrowing the exception handler.

The broad except Exception is flagged by static analysis. While using a fallback here is appropriate, consider either catching json.JSONDecodeError specifically or adding a comment explaining why a broad exception handler is intentional.

Apply this diff to narrow the exception:

     error_detail = response.text
     try:
         error_json = response.json()
         if isinstance(error_json, dict):
             if 'error' in error_json:
                 error_detail = error_json['error']
             elif 'message' in error_json:
                 error_detail = error_json['message']
             elif 'msg' in error_json:
                 error_detail = error_json['msg']
-    except Exception:
-        pass  # Use response.text as fallback
+    except (ValueError, TypeError):
+        pass  # JSON parsing failed, use response.text as fallback

Alternatively, keep the broad exception but add a clarifying comment:

-    except Exception:
-        pass  # Use response.text as fallback
+    except Exception:
+        # Intentionally catch all exceptions to ensure error extraction never fails
+        # Falls back to response.text which is already set above
+        pass

Based on static analysis hints from Ruff.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

Disabled knowledge base sources:

  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between ac8a852 and 7f1729d.

📒 Files selected for processing (3)
  • packages/traceloop-sdk/traceloop/sdk/evaluator/config.py (1 hunks)
  • packages/traceloop-sdk/traceloop/sdk/evaluator/evaluator.py (9 hunks)
  • packages/traceloop-sdk/traceloop/sdk/evaluator/evaluators_made_by_traceloop.py (1 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

**/*.py: Store API keys only in environment variables/secure vaults; never hardcode secrets in code
Use Flake8 for code linting and adhere to its rules

Files:

  • packages/traceloop-sdk/traceloop/sdk/evaluator/evaluator.py
  • packages/traceloop-sdk/traceloop/sdk/evaluator/config.py
  • packages/traceloop-sdk/traceloop/sdk/evaluator/evaluators_made_by_traceloop.py
🧬 Code graph analysis (1)
packages/traceloop-sdk/traceloop/sdk/evaluator/evaluators_made_by_traceloop.py (1)
packages/traceloop-sdk/traceloop/sdk/evaluator/config.py (1)
  • EvaluatorDetails (5-22)
🪛 Ruff (0.14.7)
packages/traceloop-sdk/traceloop/sdk/evaluator/evaluator.py

220-221: try-except-pass detected, consider logging the exception

(S110)


220-220: Do not catch blind exception: Exception

(BLE001)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (5)
  • GitHub Check: Test Packages (3.12)
  • GitHub Check: Test Packages (3.11)
  • GitHub Check: Test Packages (3.10)
  • GitHub Check: Build Packages (3.11)
  • GitHub Check: Lint
🔇 Additional comments (6)
packages/traceloop-sdk/traceloop/sdk/evaluator/evaluator.py (5)

26-45: LGTM! Clean API extension for evaluator configuration.

The evaluator_config parameter is properly typed, documented, and propagated to the request model.


60-65: LGTM! Improved error reporting.

The enhanced error message now includes both the evaluator slug and extracted error details, which will significantly improve debugging when evaluators fail.


69-109: LGTM! Consistent API extension.

The evaluator_config parameter is properly added to the signature, documented in the docstring, and correctly forwarded to the request builder.


111-145: LGTM! Consistent API extension.

The evaluator_config parameter follows the same pattern as run_experiment_evaluator, maintaining API consistency.


148-194: No changes needed—the code is correct for this project's Python version.

The project requires Python 3.10+ (as specified in pyproject.toml), which natively supports PEP 585 generic type syntax. Using set[str] instead of Set[str] from typing is the modern, recommended approach for Python 3.9+. The implementation is correct as-is.
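
For readers less familiar with PEP 585, a minimal illustration of the two spellings (both are valid on Python 3.10+, which this project targets):

from typing import Set

legacy_fields: Set[str] = {"text", "prompt"}   # pre-PEP 585 typing alias
modern_fields: set[str] = {"text", "prompt"}   # built-in generic, preferred on 3.9+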

packages/traceloop-sdk/traceloop/sdk/evaluator/config.py (1)

1-22: LGTM! Clean Pydantic model design.

The EvaluatorDetails model is well-structured with proper typing, clear documentation, and helpful examples. This provides a solid foundation for the evaluator configuration system.
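
For orientation, a minimal sketch of what such a configuration model could look like; the field set (slug, version, config, required_input_fields) is inferred from this review, and the defaults and comments are assumptions rather than the shipped definition in config.py:

from typing import Any, Dict, List, Optional
from pydantic import BaseModel


class EvaluatorDetails(BaseModel):
    # Sketch only: the shipped config.py definition (lines 5-23) may differ.
    slug: str                                        # identifies the evaluator, e.g. "pii-detector"
    version: Optional[str] = None                    # optional pinned evaluator version
    config: Optional[Dict[str, Any]] = None          # evaluator-specific configuration
    required_input_fields: Optional[List[str]] = None  # fields the task output must contain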

@ellipsis-dev ellipsis-dev bot left a comment

Important

Looks good to me! 👍

Reviewed f64177c in 1 minute and 26 seconds. Click for details.
  • Reviewed 170 lines of code in 3 files
  • Skipped 0 files when reviewing.
  • Skipped posting 4 draft comments. View those below.
  • Modify your settings and rules to customize what types of comments Ellipsis leaves. And don't forget to react with 👍 or 👎 to teach Ellipsis.
1. packages/traceloop-sdk/traceloop/sdk/evaluator/config.py:10
  • Draft comment:
    Docstring for 'required_input_fields' is split over two lines. Consider aligning the continuation for clarity.
  • Reason this comment was not posted:
    Confidence changes required: 33% <= threshold 50% None
2. packages/traceloop-sdk/traceloop/sdk/evaluator/evaluators_made_by_traceloop.py:171
  • Draft comment:
    Multi-line formatting for EvaluatorDetails instantiation improves readability. Ensure consistency across all evaluator factory methods.
  • Reason this comment was not posted:
    Comment did not seem useful. Confidence is useful = 0% <= threshold 50% The comment is purely informative and suggests ensuring consistency, which violates the rules. It doesn't provide a specific code suggestion or ask for a specific test to be written.
3. packages/traceloop-sdk/traceloop/sdk/experiment/experiment.py:163
  • Draft comment:
    Added newline after filtering evaluators improves visual separation; this minor formatting change is clear.
  • Reason this comment was not posted:
    Comment did not seem useful. Confidence is useful = 0% <= threshold 50% This comment is purely informative and does not provide any actionable feedback or suggestions for improvement. It simply states that a newline was added and that it improves visual separation, which is not necessary for the PR author to know.
4. packages/traceloop-sdk/traceloop/sdk/evaluator/config.py:13
  • Draft comment:
    Typographical note: The description for 'required_input_fields' is split awkwardly over two lines ('required fields to the evaluator' on line 13 and 'input. These fields must be present in the task output.' on line 14). Consider merging these lines or rephrasing for clarity if this split is not intentional.
  • Reason this comment was not posted:
    Decided after close inspection that this draft comment was likely wrong and/or not actionable: usefulness confidence = 10% vs. threshold = 50% The change in the diff is actually IMPROVING the formatting by splitting a long line into two lines with proper indentation, which is a common Python docstring practice to keep lines under 80-100 characters. The comment is criticizing this improvement and suggesting to undo it or change it. The split is intentional and follows Python conventions. The comment is incorrect because it's treating an improvement as a problem. The description reads naturally: "Optional list of required fields to the evaluator input. These fields must be present in the task output." The split happens at a natural point between "evaluator" and "input". Could the line split actually be awkward to read? Maybe the comment has a point about the phrasing being unclear regardless of the line split. Perhaps "required fields to the evaluator input" is grammatically awkward and could be rephrased to "required input fields for the evaluator" or similar. While the phrasing could potentially be improved, the comment is specifically criticizing the line split itself ("split awkwardly over two lines"), not the grammar or clarity of the text. The line split is a standard formatting improvement and the comment is asking to undo it or questioning if it's intentional. This is not a useful comment. The comment should be deleted. It's criticizing a standard and intentional formatting improvement (splitting a long docstring line). The change follows Python conventions and improves readability by keeping lines shorter. The comment is not identifying a real issue.

Workflow ID: wflow_3iUGxL2v85DZDBHb

You can customize Ellipsis by changing your verbosity settings, reacting with 👍 or 👎, replying to comments, or adding code review rules.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
packages/traceloop-sdk/traceloop/sdk/experiment/experiment.py (1)

361-373: Evaluator configurations are not passed in GitHub execution mode.

The code extracts evaluator slugs but discards version and config information. This means evaluator configurations work in local execution (_run_locally) but are silently ignored in GitHub CI/CD mode (_run_in_github).

This inconsistency could confuse users who configure evaluators expecting the same behavior across execution modes.

To fix this, you'll need to:

  1. Update RunInGithubRequest model to accept evaluator details (slug, version, config) instead of just slugs
  2. Modify the backend to handle evaluator configs in GitHub execution
  3. Update this code to pass the full evaluator details

If GitHub execution doesn't support evaluator configs yet, consider:

  • Adding a warning when configs are provided in GitHub mode (see the sketch after this list)
  • Documenting this limitation
  • Opening an issue to track full support
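
As a rough illustration of the warning option above, one hedged sketch; the evaluators variable and the placement inside _run_in_github are assumptions, and RunInGithubRequest itself is left untouched:

import warnings

# Warn when per-evaluator configs would be silently dropped in GitHub mode,
# since only slugs are forwarded to the backend today.
if any(isinstance(e, EvaluatorDetails) and e.config for e in evaluators):
    warnings.warn(
        "Evaluator configurations are ignored in GitHub execution mode; "
        "only evaluator slugs are forwarded to the backend.",
        UserWarning,
    )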
♻️ Duplicate comments (2)
packages/traceloop-sdk/traceloop/sdk/evaluator/evaluators_made_by_traceloop.py (1)

235-246: Remove stray + and fix bullet indentation in word_count_ratio docstring.

The docstring line for denominator_text still has a leading +, and indentation for the bullets is inconsistent. This is a leftover from a diff and hurts readability, even though it doesn't affect runtime behavior.

Suggested fix:

         Required task output fields:
-           - numerator_text: The numerator text for ratio calculation
-+          - denominator_text: The denominator text for ratio calculation
+            - numerator_text: The numerator text for ratio calculation
+            - denominator_text: The denominator text for ratio calculation
packages/traceloop-sdk/traceloop/sdk/experiment/experiment.py (1)

484-490: Consider graceful handling of validation errors.

The current implementation re-raises ValueError on validation failure, which will abort all remaining tasks in GitHub execution mode. This could be unexpected if users want to process other rows despite validation failures in one row.

Additionally, the exception re-raise should use from for better exception chaining.

Apply this diff to improve exception chaining:

                     except ValueError as validation_error:
                         print(f"\033[91m❌ Task validation failed: {str(validation_error)}\033[0m")
-                        raise ValueError(str(validation_error))
+                        raise ValueError(str(validation_error)) from validation_error

Consider whether to:

  1. Return TaskResult(input=row, error=str(validation_error)) instead of raising (as suggested in the past review)
  2. Add a parameter to control whether validation failures should abort execution
🧹 Nitpick comments (2)
packages/traceloop-sdk/traceloop/sdk/evaluator/evaluators_made_by_traceloop.py (1)

114-137: Consider enforcing schema_string when enable_schema_validation=True.

The docstring says schema_string is required when enable_schema_validation is True, but the implementation does not enforce this. To avoid silent misconfiguration, you could validate inputs:

     def json_validator(
         enable_schema_validation: bool = False,
         schema_string: Optional[str] = None,
     ) -> EvaluatorDetails:
@@
-        config: Dict[str, Any] = {
-            "enable_schema_validation": enable_schema_validation,
-        }
-        if schema_string:
-            config["schema_string"] = schema_string
+        config: Dict[str, Any] = {"enable_schema_validation": enable_schema_validation}
+        if enable_schema_validation and schema_string is None:
+            raise ValueError("schema_string is required when enable_schema_validation is True")
+        if schema_string is not None:
+            config["schema_string"] = schema_string

This keeps the runtime behavior aligned with the documented contract.

packages/traceloop-sdk/traceloop/sdk/experiment/experiment.py (1)

497-498: Use bare raise to preserve the original exception traceback.

When re-raising an exception without modification, use bare raise instead of raise e to maintain the complete traceback chain.

Apply this diff:

                 if isinstance(e, ValueError):
-                    raise e
+                    raise
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

Disabled knowledge base sources:

  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between 7f1729d and f64177c.

📒 Files selected for processing (3)
  • packages/traceloop-sdk/traceloop/sdk/evaluator/config.py (1 hunks)
  • packages/traceloop-sdk/traceloop/sdk/evaluator/evaluators_made_by_traceloop.py (1 hunks)
  • packages/traceloop-sdk/traceloop/sdk/experiment/experiment.py (13 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • packages/traceloop-sdk/traceloop/sdk/evaluator/config.py
🧰 Additional context used
📓 Path-based instructions (1)
**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

**/*.py: Store API keys only in environment variables/secure vaults; never hardcode secrets in code
Use Flake8 for code linting and adhere to its rules

Files:

  • packages/traceloop-sdk/traceloop/sdk/evaluator/evaluators_made_by_traceloop.py
  • packages/traceloop-sdk/traceloop/sdk/experiment/experiment.py
🧬 Code graph analysis (2)
packages/traceloop-sdk/traceloop/sdk/evaluator/evaluators_made_by_traceloop.py (1)
packages/traceloop-sdk/traceloop/sdk/evaluator/config.py (1)
  • EvaluatorDetails (5-23)
packages/traceloop-sdk/traceloop/sdk/experiment/experiment.py (2)
packages/traceloop-sdk/traceloop/sdk/evaluator/evaluator.py (2)
  • Evaluator (15-145)
  • validate_task_output (148-194)
packages/traceloop-sdk/traceloop/sdk/evaluator/config.py (1)
  • EvaluatorDetails (5-23)
🪛 Ruff (0.14.7)
packages/traceloop-sdk/traceloop/sdk/experiment/experiment.py

43-43: PEP 484 prohibits implicit Optional

Convert to T | None

(RUF013)


213-213: Use explicit conversion flag

Replace with conversion flag

(RUF010)


216-216: Use explicit conversion flag

Replace with conversion flag

(RUF010)


223-223: Use explicit conversion flag

Replace with conversion flag

(RUF010)


225-225: Use explicit conversion flag

Replace with conversion flag

(RUF010)


489-489: Use explicit conversion flag

Replace with conversion flag

(RUF010)


490-490: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling

(B904)


498-498: Use raise without specifying exception name

Remove exception name

(TRY201)


499-499: Use explicit conversion flag

Replace with conversion flag

(RUF010)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (5)
  • GitHub Check: Test Packages (3.10)
  • GitHub Check: Test Packages (3.12)
  • GitHub Check: Test Packages (3.11)
  • GitHub Check: Build Packages (3.11)
  • GitHub Check: Lint
🔇 Additional comments (6)
packages/traceloop-sdk/traceloop/sdk/evaluator/evaluators_made_by_traceloop.py (1)

5-19: Factory class and usage example look consistent.

The EvaluatorMadeByTraceloop factory and the docstring example (import + staticmethod calls) line up correctly with the actual API surface. No functional or documentation issues here.
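
For context, a hedged usage sketch of the factory pattern under review; the import path mirrors the file location and may differ from what the SDK re-exports, the no-argument call to prompt_injection is an assumption, and json_validator's parameters are taken from the signature shown later in this review:

from traceloop.sdk.evaluator.evaluators_made_by_traceloop import EvaluatorMadeByTraceloop

# Each factory call returns an EvaluatorDetails instance that can be passed
# to Experiment.run(evaluators=[...]).
evaluators = [
    EvaluatorMadeByTraceloop.prompt_injection(),
    EvaluatorMadeByTraceloop.json_validator(
        enable_schema_validation=True,
        schema_string='{"type": "object"}',
    ),
]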

packages/traceloop-sdk/traceloop/sdk/experiment/experiment.py (5)

8-8: LGTM: Imports correctly support the new evaluator configuration system.

The addition of validate_task_output, EvaluatorSpec, and EvaluatorDetails imports properly supports the enhanced evaluator functionality introduced in this PR.

Also applies to: 14-14, 21-21


130-140: LGTM: Evaluator conversion logic correctly handles both string slugs and EvaluatorDetails.

The conversion properly extracts slug, version, and config for each evaluator type, creating the structured tuples needed for downstream processing.
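
A compressed sketch of that normalization; the variable names are illustrative, not the SDK's actual identifiers:

from traceloop.sdk.evaluator.config import EvaluatorDetails

normalized: list[tuple[str, str | None, dict | None]] = []
for evaluator in evaluators:  # evaluators: List[EvaluatorSpec] from run()
    if isinstance(evaluator, str):
        # Plain slug: no pinned version, no config
        normalized.append((evaluator, None, None))
    elif isinstance(evaluator, EvaluatorDetails):
        normalized.append((evaluator.slug, evaluator.version, evaluator.config))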


163-171: LGTM: Task output validation correctly filters and validates against EvaluatorDetails.

The logic appropriately extracts only EvaluatorDetails instances (which can have required_input_fields) and validates task output against them.
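
Roughly, the validation hook amounts to the following; a sketch under the assumption that only EvaluatorDetails instances carry required_input_fields:

evaluators_to_validate = [
    e for e in evaluators if isinstance(e, EvaluatorDetails)
]
if evaluators_to_validate:
    # Raises ValueError if any declared required_input_fields are missing
    validate_task_output(task_output, evaluators_to_validate)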


182-202: LGTM: Evaluator execution properly propagates configuration.

The evaluator invocation now correctly passes evaluator_config to both run_experiment_evaluator and trigger_experiment_evaluator, enabling per-evaluator configuration.


213-216: LGTM: Enhanced error visibility with colored console output.

The colored error messages improve the user experience by making failures immediately visible in console output.

Also applies to: 224-225

@ellipsis-dev ellipsis-dev bot left a comment

Important

Looks good to me! 👍

Reviewed a8719d8 in 2 minutes and 0 seconds. Click for details.
  • Reviewed 71 lines of code in 2 files
  • Skipped 0 files when reviewing.
  • Skipped posting 5 draft comments. View those below.
  • Modify your settings and rules to customize what types of comments Ellipsis leaves. And don't forget to react with 👍 or 👎 to teach Ellipsis.
1. packages/traceloop-sdk/traceloop/sdk/evaluator/evaluators_made_by_traceloop.py:149
  • Draft comment:
    Doc update: Added the missing 'text' field in the required task output fields for placeholder_regex. Ensure the input validation matches this change.
  • Reason this comment was not posted:
    Comment did not seem useful. Confidence is useful = 0% <= threshold 50% This comment is purely informative as it mentions a documentation update and suggests ensuring input validation matches the change. It doesn't provide a specific code suggestion or ask for a test to be written.
2. packages/traceloop-sdk/traceloop/sdk/evaluator/evaluators_made_by_traceloop.py:243
  • Draft comment:
    Doc update: Added 'denominator_text' to the required task output fields in word_count_ratio. This ensures the docs match the EvaluatorDetails configuration.
  • Reason this comment was not posted:
    Comment was not on a location in the diff, so it can't be submitted as a review comment.
3. packages/traceloop-sdk/traceloop/sdk/experiment/experiment.py:41
  • Draft comment:
    API change: The 'evaluators' parameter in run() is now required (no default of None). Confirm that caller code is updated to supply at least an empty list if needed.
  • Reason this comment was not posted:
    Comment looked like it was already resolved.
4. packages/traceloop-sdk/traceloop/sdk/experiment/experiment.py:92
  • Draft comment:
    Signature update: In _run_locally(), 'evaluators' is now a required parameter. Consider if a default empty list might be preferable to maintain backward compatibility.
  • Reason this comment was not posted:
    Decided after close inspection that this draft comment was likely wrong and/or not actionable: usefulness confidence = 10% vs. threshold = 50% The comment is about a private method _run_locally(). Private methods (prefixed with _) are internal implementation details and not part of the public API. The public run() method already requires evaluators as a mandatory parameter, and it's the only caller of _run_locally(). Since this is an internal method and the public API already requires evaluators, there's no backward compatibility concern. The comment is suggesting a change that isn't necessary and is based on a misunderstanding of the scope of the change. This violates the rule about not making comments unless there's clearly a code change required. Could there be other internal callers of _run_locally() that I'm not seeing in this file? Could this method be called from other files in the codebase? However, the rules say to ignore cross-file issues and only think about the file being reviewed. Even if there were other callers, the rules explicitly state to ignore cross-file issues. Within this file, _run_locally() is only called from run() which always passes evaluators. The comment is speculative about backward compatibility without evidence of an actual problem in the code being reviewed. This comment should be deleted. It's about a private method's signature change, and the public API already requires the evaluators parameter. There's no backward compatibility issue within this file, and the comment is speculative without clear evidence of a problem.
5. packages/traceloop-sdk/traceloop/sdk/experiment/experiment.py:262
  • Draft comment:
    Signature update: In _run_in_github(), the 'evaluators' parameter is now non-optional. Verify that this change is intended and that callers pass a list (even if empty).
  • Reason this comment was not posted:
    Comment looked like it was already resolved.

Workflow ID: wflow_fWAMHwi9aQr4ygXc

You can customize Ellipsis by changing your verbosity settings, reacting with 👍 or 👎, replying to comments, or adding code review rules.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
packages/traceloop-sdk/traceloop/sdk/experiment/experiment.py (1)

8-50: Make evaluators parameter optional to align type signature with defensive implementation

The evaluators: List[EvaluatorSpec] parameter in run(), _run_locally(), and _run_in_github() has no default, making it required and breaking existing code that omits it. However, the implementation uses defensive if evaluators: guards (lines 132, 360, 475) that safely handle empty or falsy cases.

To align the type signature with the defensive implementation intent and restore backward compatibility, make evaluators optional in all three methods:

-    async def run(..., evaluators: List[EvaluatorSpec], ...):
+    async def run(..., evaluators: Optional[List[EvaluatorSpec]] = None, ...):

-    async def _run_locally(..., evaluators: List[EvaluatorSpec], ...):
+    async def _run_locally(..., evaluators: Optional[List[EvaluatorSpec]] = None, ...):

-    async def _run_in_github(..., evaluators: List[EvaluatorSpec], ...):
+    async def _run_in_github(..., evaluators: Optional[List[EvaluatorSpec]] = None, ...):

The existing guards will safely handle None values.

♻️ Duplicate comments (2)
packages/traceloop-sdk/traceloop/sdk/evaluator/evaluators_made_by_traceloop.py (1)

243-243: Remove the extra + character from the docstring.

Line 243 contains a stray + character at the beginning of the line, likely a leftover diff marker. This will appear in the rendered documentation.

Apply this diff to remove the extra character:

-+          - denominator_text: The denominator text for ratio calculation
+           - denominator_text: The denominator text for ratio calculation
packages/traceloop-sdk/traceloop/sdk/experiment/experiment.py (1)

455-503: _execute_tasks validation and ValueError handling may abort all tasks; also a few linter-aligned cleanups

The new _execute_tasks helper is a nice abstraction and the validation hook via validate_task_output is a good safeguard. A few points to tighten up:

  1. ValueError handling aborts the entire GitHub run (previous concern still applies)

Current logic:

try:
    task_output = await task(row)
    if evaluators_to_validate:
        try:
            validate_task_output(task_output, evaluators_to_validate)
        except ValueError as validation_error:
            print(f"\033[91m❌ Task validation failed: {str(validation_error)}\033[0m")
            raise ValueError(str(validation_error))
    return TaskResult(...)
except Exception as e:
    if isinstance(e, ValueError):
        raise e
    print(f"\033[91m❌ Task execution error: {str(e)}\033[0m")
    return TaskResult(input=row, error=str(e))
  • Any ValueError (from the task or from validation) is propagated out of _execute_tasks, aborting the loop at line 514 on the first failing row.
  • That means later rows are never awaited, and GitHub runs become “fail fast” on ValueError.

This is very similar to the existing review comment at Line 490 questioning whether this should instead return a TaskResult with an error field. If you want behavior consistent with local execution (and to avoid partial processing), you could treat validation failures like other errors:

         if evaluators_to_validate:
             try:
                 validate_task_output(task_output, evaluators_to_validate)
             except ValueError as validation_error:
-                print(f"\033[91m❌ Task validation failed: {str(validation_error)}\033[0m")
-                raise ValueError(str(validation_error))
+                msg = str(validation_error)
+                print(f"\033[91m❌ Task validation failed: {msg}\033[0m")
+                return TaskResult(input=row, error=msg)

-    except Exception as e:
-        if isinstance(e, ValueError):
-            raise e
-        print(f"\033[91m❌ Task execution error: {str(e)}\033[0m")
-        return TaskResult(
-            input=row,
-            error=str(e),
-        )
+    except Exception as e:
+        print(f"\033[91m❌ Task execution error: {e!s}\033[0m")
+        return TaskResult(input=row, error=str(e))

This preserves validation and logging but avoids aborting the whole run on first ValueError.

  2. Comment vs. implementation mismatch

The comment says:

# Validate task output schema on first execution

but validation runs on every row. Either adjust the comment to reflect that it validates each row, or short-circuit validation after the first success/failure if only schema shape is important.

  3. Linter-style improvements (RUF010, B904, TRY201)

If you decide to keep the “fail fast on ValueError” behavior, you can still clean up the exception handling and prints:

  • Prefer explicit conversion flags in f-strings:
- print(f"\033[91m❌ Task validation failed: {str(validation_error)}\033[0m")
+ print(f"\033[91m❌ Task validation failed: {validation_error!s}\033[0m")

- print(f"\033[91m❌ Task execution error: {str(e)}\033[0m")
+ print(f"\033[91m❌ Task execution error: {e!s}\033[0m")
  • If you keep re-wrapping the ValueError, chain it:
- raise ValueError(str(validation_error))
+ raise ValueError(str(validation_error)) from validation_error
  • If you keep the current branchy except logic, prefer a dedicated except ValueError for re-raising instead of raise e:
except ValueError:
    raise
except Exception as e:
    print(f"\033[91m❌ Task execution error: {e!s}\033[0m")
    return TaskResult(input=row, error=str(e))

These adjustments improve tracebacks and align with the static analysis hints without changing semantics.

🧹 Nitpick comments (2)
packages/traceloop-sdk/traceloop/sdk/experiment/experiment.py (2)

91-226: Evaluator normalization and local validation flow look sound; minor robustness & logging tweaks

The new _run_locally flow correctly:

  • Normalizes evaluators: List[EvaluatorSpec] into (slug, version, config) tuples for evaluator execution.
  • Extracts EvaluatorDetails instances into evaluators_to_validate and uses validate_task_output before creating tasks.
  • Passes evaluator_config through to both run_experiment_evaluator and trigger_experiment_evaluator.

A couple of small improvements to consider:

  1. Defensive handling of unsupported EvaluatorSpec variants

Right now only str and EvaluatorDetails are handled:

for evaluator in evaluators:
    if isinstance(evaluator, str):
        ...
    elif isinstance(evaluator, EvaluatorDetails):
        ...

If EvaluatorSpec is extended in the future, such evaluators will silently be ignored. You may want to either:

  • Raise a TypeError for unknown types, or
  • At least log a warning so misconfigured specs are visible.
  2. Use f-string conversion instead of str() for errors (RUF010)

For the new logging, you can simplify and satisfy the linter by using conversion flags instead of str() inside f-strings, e.g.:

- error_msg = f"Error: {str(e)}"
+ error_msg = f"Error: {e!s}"

- print(f"\033[91m❌ Evaluator '{evaluator_slug}' failed: {str(e)}\033[0m")
+ print(f"\033[91m❌ Evaluator '{evaluator_slug}' failed: {e!s}\033[0m")

- print(f"\033[91m❌ Task execution failed: {str(e)}\033[0m")
+ print(f"\033[91m❌ Task execution failed: {e!s}\033[0m")

Functionally everything here looks correct; these are just polish and future-proofing suggestions.


258-378: GitHub flow wiring to _execute_tasks and evaluator slugs is consistent, but error behavior now hinges on ValueError semantics

The GitHub path now:

  • Executes tasks via _execute_tasks(rows, task, evaluators), which performs the same validation used locally.
  • Derives evaluator_slugs from evaluators, supporting both str and EvaluatorDetails, then sends only slugs to the backend. This matches how _init_experiment is fed slugs locally.

One thing to be aware of: _execute_tasks re-raises any ValueError (see comment below), so in GitHub mode a ValueError from either:

  • The task itself, or
  • validate_task_output

will abort _run_in_github and fail the workflow, rather than returning a partial task_results list with per-row errors. That might be exactly what you want for CI, but it is a behavior change compared to local execution, where errors are wrapped in TaskResponse unless stop_on_error is set.

If the intent is to keep GitHub runs “all-or-nothing” on validation failures, this is fine; otherwise, consider aligning _execute_tasks’s ValueError handling with _run_locally (returning error-bearing TaskResult objects instead of aborting).

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

Disabled knowledge base sources:

  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between f64177c and a8719d8.

📒 Files selected for processing (2)
  • packages/traceloop-sdk/traceloop/sdk/evaluator/evaluators_made_by_traceloop.py (1 hunks)
  • packages/traceloop-sdk/traceloop/sdk/experiment/experiment.py (14 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

**/*.py: Store API keys only in environment variables/secure vaults; never hardcode secrets in code
Use Flake8 for code linting and adhere to its rules

Files:

  • packages/traceloop-sdk/traceloop/sdk/evaluator/evaluators_made_by_traceloop.py
  • packages/traceloop-sdk/traceloop/sdk/experiment/experiment.py
🧬 Code graph analysis (1)
packages/traceloop-sdk/traceloop/sdk/evaluator/evaluators_made_by_traceloop.py (1)
packages/traceloop-sdk/traceloop/sdk/evaluator/config.py (1)
  • EvaluatorDetails (5-23)
🪛 Ruff (0.14.7)
packages/traceloop-sdk/traceloop/sdk/experiment/experiment.py

213-213: Use explicit conversion flag

Replace with conversion flag

(RUF010)


216-216: Use explicit conversion flag

Replace with conversion flag

(RUF010)


223-223: Use explicit conversion flag

Replace with conversion flag

(RUF010)


225-225: Use explicit conversion flag

Replace with conversion flag

(RUF010)


489-489: Use explicit conversion flag

Replace with conversion flag

(RUF010)


490-490: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling

(B904)


498-498: Use raise without specifying exception name

Remove exception name

(TRY201)


499-499: Use explicit conversion flag

Replace with conversion flag

(RUF010)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (5)
  • GitHub Check: Test Packages (3.12)
  • GitHub Check: Test Packages (3.11)
  • GitHub Check: Lint
  • GitHub Check: Build Packages (3.11)
  • GitHub Check: Test Packages (3.10)
🔇 Additional comments (1)
packages/traceloop-sdk/traceloop/sdk/evaluator/evaluators_made_by_traceloop.py (1)

1-455: Well-structured factory implementation!

The EvaluatorMadeByTraceloop class provides a clean, type-safe factory pattern for instantiating evaluators. All previous docstring/implementation inconsistencies have been resolved, and the code demonstrates:

  • Consistent structure across all 20 evaluator factory methods
  • Proper type hints for configuration parameters
  • Clear documentation of required task output fields
  • Correct instantiation of EvaluatorDetails with appropriate parameters

Once the minor formatting issue on line 243 is addressed, this will be ready to merge.

@ellipsis-dev ellipsis-dev bot left a comment

Caution

Changes requested ❌

Reviewed 91c0096 in 2 minutes and 13 seconds. Click for details.
  • Reviewed 565 lines of code in 3 files
  • Skipped 0 files when reviewing.
  • Skipped posting 2 draft comments. View those below.
  • Modify your settings and rules to customize what types of comments Ellipsis leaves. And don't forget to react with 👍 or 👎 to teach Ellipsis.
1. packages/traceloop-sdk/pyproject.toml:98
  • Draft comment:
    Consider adding a comment explaining why the asyncio_default_fixture_loop_scope is set to 'function'. This can help future maintainers understand the reasoning behind this choice.
  • Reason this comment was not posted:
    Comment did not seem useful. Confidence is useful = 0% <= threshold 50% This comment is asking the PR author to add an explanation for a specific setting. It is not a direct code suggestion or a request for a test, and it doesn't point out a potential issue or improvement in the code itself. It seems to be more about documentation, which is not allowed by the rules.
2. packages/traceloop-sdk/tests/experiment/test_experiment.py:168
  • Draft comment:
    The assertion 'assert len(results) + len(errors) <= 3' is a bit loose since the dataset contains 3 rows. If the intent is to stop processing immediately after a validation error, consider asserting that only one row is processed (or a stricter condition) to verify early termination.
  • Reason this comment was not posted:
    Decided after close inspection that this draft comment was likely wrong and/or not actionable: usefulness confidence = 30% vs. threshold = 50% The reviewer has a valid point about test precision. The current assertion <= 3 is indeed very loose - it would pass even if all 3 rows were processed, which would mean the early termination isn't working. However, I need to consider: (1) Without seeing the actual implementation of _run_locally, I can't be 100% certain that exactly 1 row should be processed. There might be legitimate reasons for the looser assertion (e.g., async processing, batching, or other implementation details). (2) The comment is about code quality/test improvement, which is generally good. (3) The comment is actionable and specific. (4) However, the test author may have intentionally used <= 3 for a reason - perhaps the implementation doesn't guarantee exactly 1 row is processed, or there are edge cases. Without seeing the implementation, I can't be certain the stricter assertion is correct. I'm making an assumption that the stricter assertion would be correct without seeing the actual implementation of _run_locally. The test author may have used the looser assertion deliberately because the implementation doesn't guarantee exactly one row is processed before stopping - there could be async processing, buffering, or other factors that make the exact count unpredictable. While the comment makes a reasonable suggestion for improving test precision, without seeing the implementation details of _run_locally, I cannot confirm that == 1 is the correct assertion. The test author may have intentionally used a looser assertion to account for implementation details. This is speculative advice about test quality rather than a clear bug. This comment suggests a test improvement but is speculative about what the correct behavior should be. Without seeing the implementation, we can't confirm that exactly 1 row should be processed. The current assertion may be intentionally loose to handle implementation details. This falls into the category of "speculative" or "code quality" suggestions that aren't clearly necessary.

Workflow ID: wflow_ipi0eNz9YZ0SP7P6

You can customize Ellipsis by changing your verbosity settings, reacting with 👍 or 👎, replying to comments, or adding code review rules.

error_message = str(exc_info.value)
# Should only mention the failing evaluator
assert "relevance-checker requires:" in error_message
assert "pii-detector" not in error_message or "pii-detector requires:" not in error_message
Contributor

The assert using or on line 119 is ambiguous. It's clearer to assert that 'pii-detector requires:' does not appear in error_message. Consider replacing the condition with a more explicit check.

Suggested change
assert "pii-detector" not in error_message or "pii-detector requires:" not in error_message
assert "pii-detector requires:" not in error_message

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

🧹 Nitpick comments (3)
packages/traceloop-sdk/tests/evaluator/test_evaluator.py (2)

103-120: Simplify the assertion logic for clarity.

The assertion on Line 120 is logically correct but unnecessarily complex. Since the test verifies that only failing evaluators appear in the error message, you can simplify it.

Apply this diff:

-        assert "pii-detector" not in error_message or "pii-detector requires:" not in error_message
+        assert "pii-detector requires:" not in error_message

This directly checks that the passing evaluator doesn't appear in the "requires:" format, which is clearer and achieves the same goal.


180-200: Simplify the assertion logic for clarity.

Similar to Line 120, the assertion on Line 200 can be simplified for better readability.

Apply this diff:

-        assert "pii-detector" not in error_message or "pii-detector requires:" not in error_message
+        assert "pii-detector requires:" not in error_message

This directly verifies that the passing evaluator doesn't appear in the error message's "requires:" format.

packages/traceloop-sdk/tests/experiment/test_experiment.py (1)

268-314: Consider checking or prefixing the unused results variable.

The test focuses on validation errors, which is fine, but Line 301 assigns results without using it. For clarity and to silence the linter, either:

  • Check that results is empty: assert len(results) == 0
  • Prefix with underscore: _results, errors = ...

Apply this diff:

-        results, errors = await experiment._run_locally(
+        _results, errors = await experiment._run_locally(

Note: The unused row parameter on Line 283 is intentional for signature compatibility.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

Disabled knowledge base sources:

  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between a8719d8 and 91c0096.

📒 Files selected for processing (3)
  • packages/traceloop-sdk/pyproject.toml (1 hunks)
  • packages/traceloop-sdk/tests/evaluator/test_evaluator.py (1 hunks)
  • packages/traceloop-sdk/tests/experiment/test_experiment.py (5 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

**/*.py: Store API keys only in environment variables/secure vaults; never hardcode secrets in code
Use Flake8 for code linting and adhere to its rules

Files:

  • packages/traceloop-sdk/tests/evaluator/test_evaluator.py
  • packages/traceloop-sdk/tests/experiment/test_experiment.py
🧬 Code graph analysis (1)
packages/traceloop-sdk/tests/evaluator/test_evaluator.py (2)
packages/traceloop-sdk/traceloop/sdk/evaluator/evaluator.py (1)
  • validate_task_output (148-194)
packages/traceloop-sdk/traceloop/sdk/evaluator/config.py (1)
  • EvaluatorDetails (5-23)
🪛 Ruff (0.14.7)
packages/traceloop-sdk/tests/experiment/test_experiment.py

140-140: Unused function argument: row

(ARG001)


187-187: Unused function argument: row

(ARG001)


235-235: Unused function argument: row

(ARG001)


283-283: Unused function argument: row

(ARG001)


301-301: Unpacked variable results is never used

Prefix it with an underscore or any other dummy variable pattern

(RUF059)


336-336: Unused function argument: row

(ARG001)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (5)
  • GitHub Check: Test Packages (3.12)
  • GitHub Check: Test Packages (3.11)
  • GitHub Check: Test Packages (3.10)
  • GitHub Check: Build Packages (3.11)
  • GitHub Check: Lint
🔇 Additional comments (13)
packages/traceloop-sdk/pyproject.toml (2)

101-102: No compatibility issues with asyncio fixture scope configuration.

The asyncio_default_fixture_loop_scope = "function" setting is appropriate for pytest-asyncio 0.23.7 and compatible with the existing test suite. All current fixtures use @pytest.fixture (not @pytest_asyncio.fixture) with explicit scopes (session, module, or function), so this setting will not affect them. Async tests already use @pytest.mark.asyncio decorators, which work seamlessly with function-scoped fixture defaults.


104-135: The evaluator code meets mypy strictness requirements.

All new evaluator code is properly typed and complies with the strict mypy configuration (disallow_untyped_defs = true):

  • EvaluatorMadeByTraceloop: All 22 static factory methods have complete type annotations for parameters and return types (-> EvaluatorDetails).
  • Evaluator class: All methods (including async methods) have full parameter and return type hints.
  • Supporting modules (model.py, config.py): All Pydantic models properly typed.
  • Helper functions: validate_task_output() and _extract_error_from_response() have complete signatures.

The evaluator module is not in the mypy exclusion list, confirming it must meet the strict requirements—and it does. The excluded modules (decorators, prompts, tracing, utils) do not impact evaluator code typing.

packages/traceloop-sdk/tests/evaluator/test_evaluator.py (6)

1-8: LGTM!

The imports and test class setup are clean and follow standard pytest conventions.


9-40: LGTM!

These baseline tests properly verify that validation passes when:

  • No evaluators are provided
  • Evaluators have no required fields
  • All required fields are present in the output

42-101: LGTM!

Excellent coverage of missing field scenarios. The tests properly verify:

  • Error messages include evaluator names and missing fields
  • Task output contents are shown in errors
  • Helpful hints are provided

122-149: LGTM!

Good edge case coverage:

  • Empty task output properly fails validation with clear error message
  • Extra fields are correctly allowed (only required fields are enforced)

151-178: LGTM!

These tests properly verify:

  • Field name matching is case-sensitive (important for correctness)
  • Evaluator config doesn't affect validation logic

202-220: LGTM!

This test properly verifies that when multiple evaluators require the same field, all failing evaluators are reported in the error message. Good coverage of this edge case.

packages/traceloop-sdk/tests/experiment/test_experiment.py (5)

2-2: LGTM!

The new imports (AsyncMock and EvaluatorDetails) are necessary for the expanded test coverage and properly placed.

Also applies to: 5-5


125-170: LGTM!

This test properly verifies that validation failures stop processing when stop_on_error=True. The test setup, mocking, and assertions are all correct.

Note: The unused row parameter on Line 140 (flagged by static analysis) is intentional—it matches the expected task function signature.


172-213: LGTM!

This test correctly verifies that validation errors are captured without stopping execution when stop_on_error=False. Good complementary coverage to the previous test.

Note: The unused row parameter on Line 187 is intentional for signature compatibility.


215-266: LGTM!

This test properly verifies the success path when task output contains all required fields. The mocking strategy and assertions are correct.

Note: The unused row parameter on Line 235 is intentional for signature compatibility.


316-364: LGTM!

This test correctly verifies that validation is only performed for EvaluatorDetails objects, not string evaluators. This is an important behavioral distinction and good to have test coverage for.

Note: The unused row parameter on Line 336 is intentional for signature compatibility.

@ellipsis-dev ellipsis-dev bot left a comment

Important

Looks good to me! 👍

Reviewed 791ca28 in 26 seconds. Click for details.
  • Reviewed 13 lines of code in 1 files
  • Skipped 0 files when reviewing.
  • Skipped posting 1 draft comments. View those below.
  • Modify your settings and rules to customize what types of comments Ellipsis leaves. And don't forget to react with 👍 or 👎 to teach Ellipsis.
1. packages/traceloop-sdk/tests/evaluator/test_evaluator.py:120
  • Draft comment:
    Change improves clarity by explicitly asserting that 'pii-detector requires:' is absent from the error message. This directly enforces that only failing evaluators are mentioned.
  • Reason this comment was not posted:
    Comment did not seem useful. Confidence is useful = 0% <= threshold 50% This comment is purely informative, as it only states that the change improves clarity without providing any actionable feedback or suggestions. It doesn't ask for confirmation or suggest any improvements.

Workflow ID: wflow_B0kqq0wTBcuwQFzc

You can customize Ellipsis by changing your verbosity settings, reacting with 👍 or 👎, replying to comments, or adding code review rules.

@ellipsis-dev ellipsis-dev bot left a comment

Important

Looks good to me! 👍

Reviewed 74ecdf4 in 1 minute and 8 seconds. Click for details.
  • Reviewed 76 lines of code in 3 files
  • Skipped 0 files when reviewing.
  • Skipped posting 5 draft comments. View those below.
  • Modify your settings and rules to customize what types of comments Ellipsis leaves. And don't forget to react with 👍 or 👎 to teach Ellipsis.
1. packages/traceloop-sdk/pyproject.toml:98
  • Draft comment:
    Removed the [tool.pytest.ini_options] block. Please double-check that dropping 'asyncio_default_fixture_loop_scope = "function"' won’t affect async tests.
  • Reason this comment was not posted:
    Confidence changes required: 33% <= threshold 50% None
2. packages/traceloop-sdk/tests/conftest.py:237
  • Draft comment:
    The added 'anyio_backend' fixture effectively forces the asyncio backend. This is appropriate—just ensure it aligns with any future async tests.
  • Reason this comment was not posted:
    Confidence changes required: 0% <= threshold 50% None
3. packages/traceloop-sdk/tests/experiment/test_experiment.py:124
  • Draft comment:
    Consider applying the @pytest.mark.anyio(backends=["asyncio"]) marker at the class level for TestRunLocallyValidation to avoid repetition, if all its methods require the same backend.
  • Reason this comment was not posted:
    Confidence changes required: 33% <= threshold 50% None
4. packages/traceloop-sdk/tests/experiment/test_experiment.py:136
  • Draft comment:
    If _init_experiment is asynchronous, consider using AsyncMock instead of Mock for consistency.
  • Reason this comment was not posted:
    Confidence changes required: 33% <= threshold 50% None
5. packages/traceloop-sdk/tests/experiment/test_experiment.py:360
  • Draft comment:
    In test_run_locally_no_validation_for_string_evaluators, consider asserting more details about the evaluator’s output if appropriate.
  • Reason this comment was not posted:
    Confidence changes required: 33% <= threshold 50% None

Workflow ID: wflow_P96uZpYKAaXaV4GL

You can customize Ellipsis by changing your verbosity settings, reacting with 👍 or 👎, replying to comments, or adding code review rules.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

Disabled knowledge base sources:

  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between 791ca28 and 74ecdf4.

📒 Files selected for processing (2)
  • packages/traceloop-sdk/tests/conftest.py (1 hunks)
  • packages/traceloop-sdk/tests/experiment/test_experiment.py (5 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

**/*.py: Store API keys only in environment variables/secure vaults; never hardcode secrets in code
Use Flake8 for code linting and adhere to its rules

Files:

  • packages/traceloop-sdk/tests/conftest.py
  • packages/traceloop-sdk/tests/experiment/test_experiment.py
🧬 Code graph analysis (1)
packages/traceloop-sdk/tests/experiment/test_experiment.py (3)
packages/traceloop-sdk/traceloop/sdk/evaluator/config.py (1)
  • EvaluatorDetails (5-23)
packages/traceloop-sdk/traceloop/sdk/datasets/datasets.py (1)
  • get_version_jsonl (218-223)
packages/traceloop-sdk/traceloop/sdk/evaluator/evaluator.py (1)
  • run_experiment_evaluator (69-109)
🪛 Ruff (0.14.7)
packages/traceloop-sdk/tests/experiment/test_experiment.py

140-140: Unused function argument: row

(ARG001)


188-188: Unused function argument: row

(ARG001)


237-237: Unused function argument: row

(ARG001)


286-286: Unused function argument: row

(ARG001)


304-304: Unpacked variable results is never used

Prefix it with an underscore or any other dummy variable pattern

(RUF059)


340-340: Unused function argument: row

(ARG001)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (5)
  • GitHub Check: Test Packages (3.10)
  • GitHub Check: Test Packages (3.11)
  • GitHub Check: Build Packages (3.11)
  • GitHub Check: Test Packages (3.12)
  • GitHub Check: Lint
🔇 Additional comments (7)
packages/traceloop-sdk/tests/conftest.py (1)

236-239: LGTM! Standard pytest-anyio configuration.

The session-scoped fixture correctly configures anyio to use the asyncio backend, which is necessary for the new async tests introduced in this PR.
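
For reference, the conventional pytest-anyio pattern looks like this; a sketch of the fixture described above, not necessarily the exact contents of conftest.py:

import pytest


@pytest.fixture(scope="session")
def anyio_backend():
    # Force all @pytest.mark.anyio tests in the session onto the asyncio backend
    return "asyncio"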

packages/traceloop-sdk/tests/experiment/test_experiment.py (6)

2-2: LGTM! Necessary imports for new test functionality.

AsyncMock and EvaluatorDetails are required for the new async validation tests in TestRunLocallyValidation.

Also applies to: 5-5


125-170: Well-structured validation test.

The test correctly verifies that validation failures stop processing when stop_on_error=True, and the assertions appropriately check for both error messages and early termination behavior.


172-214: LGTM! Complementary test for stop_on_error=False.

The test correctly verifies that validation errors are captured without stopping processing when stop_on_error=False, providing good coverage alongside the first test.


216-268: LGTM! Comprehensive positive test case.

The test correctly verifies the happy path when task output contains all required fields, with appropriate mocking and assertions.


319-368: LGTM! Important test for backward compatibility.

This test correctly verifies that string evaluators (legacy format) bypass validation, ensuring backward compatibility while allowing EvaluatorDetails to opt into validation.


121-123: Excellent test coverage for validation scenarios.

The new TestRunLocallyValidation class provides comprehensive coverage of the validation functionality, testing various edge cases including stop_on_error behavior, missing fields, multiple evaluators, and backward compatibility with string evaluators. The tests are well-structured and use appropriate mocking patterns.

@nina-kollman nina-kollman merged commit f12aaec into main Dec 7, 2025
12 checks passed