Skip to content

Conversation

@sunxilin
Copy link
Contributor

@sunxilin sunxilin commented Jan 5, 2026

No description provided.

@sunxilin sunxilin self-assigned this Jan 5, 2026
@claude
Copy link

claude bot commented Jan 6, 2026

Code Review: PR #1946 - Fix Azure ASR Fatal Error

Overview

This PR introduces proper fatal error handling for Azure ASR by distinguishing between fatal errors (authentication failures, bad requests) that should stop retries and non-fatal errors that allow reconnection attempts.

Strengths

  1. Correct Problem Identification: The PR properly identifies that authentication failures and bad requests are unrecoverable errors that should not trigger infinite retry loops.

  2. Good Code Organization: Adding the FATAL_ERROR_CODES constant to const.py follows the repository's pattern of centralizing configuration values.

  3. Version Bump: Appropriately incremented version from 0.2.3 to 0.2.4 in manifest.json.

  4. Clear Logic: The conditional error code assignment is straightforward and easy to understand.

Issues and Concerns

1. Incomplete Fatal Error Coverage (Medium Priority)

Location: const.py:15-18

The PR only handles two error codes: AuthenticationFailure and BadRequest. Consider adding other fatal error codes like Forbidden for authorization issues. Documentation on which specific Azure error codes should be fatal vs. retryable would help future maintainers.

2. Reconnection Loop Termination (Important)

Location: extension.py:435

Setting self.stopped = True prevents reconnection, but there's a potential race condition. The canceled event handler may fire AFTER session_stopped has already initiated a reconnect attempt, since the order of events from Azure SDK is not guaranteed.

Recommendation: Consider adding a check at the start of _handle_reconnect() to respect the stopped flag immediately and prevent race conditions.

3. Test Coverage Gap (Important)

Location: tests/test_vendor_error.py:60

The existing test expects a NON_FATAL_ERROR (code 1000) for any cancellation. With this PR, fatal errors should return code -1000.

Required: Update test_vendor_error.py or create test_fatal_vendor_error.py to verify fatal error codes produce FATAL_ERROR (-1000), non-fatal codes produce NON_FATAL_ERROR (1000), self.stopped flag prevents reconnection after fatal errors, and reconnection still works for non-fatal errors.

4. Logging Enhancement (Minor)

Location: extension.py:433-435

The code sets self.stopped = True but doesn't explicitly log that reconnection has been disabled. Add a log message when stopping reconnection attempts due to fatal error to improve debugging.

Security Considerations

No security issues identified. The PR properly handles authentication failures and prevents resource exhaustion from infinite retries on auth failures.

Performance Considerations

Positive performance impact: Eliminates infinite retry loops for unrecoverable errors, reduces unnecessary API calls to Azure when credentials are invalid, and properly fails fast for configuration errors.

Additional Recommendations

  1. Add a comment explaining WHY these specific error codes are fatal
  2. Link to Azure Speech SDK error code documentation for future reference
  3. Monitor logs after deployment for unexpected fatal error classifications

Summary

This is a good fix that addresses a real issue with infinite retry loops on fatal errors. The implementation is clean and follows existing patterns in the codebase.

Before merging, please address:

  1. Add test coverage for fatal vs non-fatal error handling (Required)
  2. Consider adding Forbidden to fatal error list
  3. Add safety check in _handle_reconnect() to respect stopped flag
  4. Enhance logging when disabling reconnection

Approval Status: Approve with recommendations for test coverage.

@claude
Copy link

claude bot commented Jan 6, 2026

Code Review: Azure ASR Fatal Error Handling

Summary

This PR introduces fatal error handling for Azure ASR to prevent unnecessary reconnection attempts when encountering authentication failures or bad requests. The changes properly differentiate between fatal and non-fatal errors, with corresponding test coverage.


Positive Aspects

  1. Well-Defined Fatal Errors: The introduction of FATAL_ERROR_CODES in const.py clearly identifies AuthenticationFailure and BadRequest as non-retriable errors - this is correct behavior.

  2. Comprehensive Test Coverage: The test improvements in test_reconnection.py are excellent:

    • Tracks error codes to validate error sequences
    • Tests both fatal and non-fatal error paths
    • Validates that fatal errors stop retries
    • Verifies non-fatal errors trigger continued retries
  3. Proper Version Bump: Incrementing the manifest version from 0.2.3 to 0.2.4 follows semantic versioning for a bug fix.

  4. Clear Logic Flow: The error handling logic in extension.py:431-452 is clear and well-structured.


🔍 Issues & Concerns

1. Critical Bug: Incorrect Error Code in ReconnectManager ⚠️

Location: reconnect_manager.py:106

await error_handler(
    ModuleError(
        module=MODULE_NAME_ASR,
        code=ModuleErrorCode.FATAL_ERROR.value,  # ❌ Wrong error code
        message=f"Reconnection attempt #{self.attempts} failed: {str(e)}",
    )
)

Problem: When a reconnection attempt fails due to an exception, it reports FATAL_ERROR but then continues retrying (returns False allowing retries). This is inconsistent - reconnection failures should be NON_FATAL_ERROR since the system continues attempting reconnection.

Recommended Fix:

code=ModuleErrorCode.NON_FATAL_ERROR.value,  # Reconnection will be retried

2. Missing State Management When Fatal Error Occurs

Location: extension.py:433-435

if is_fatal:
    # Stop retrying for fatal errors
    self.stopped = True

Issue: When a fatal error occurs:

  • The code sets self.stopped = True
  • But it doesn't clean up resources:
    • Should call stop_connection() to clean up the Azure client
    • Should close the stream
    • Should prevent _handle_reconnect() from being called in on_session_stopped

Current Flow:

  1. Fatal error occurs → sets self.stopped = True
  2. Session stopped event → calls _handle_reconnect() because not self.stopped check is in on_session_stopped
  3. But the client/stream remain in an undefined state

Recommendation: Add explicit cleanup:

if is_fatal:
    # Stop retrying for fatal errors
    self.stopped = True
    await self.stop_connection()  # Clean up resources

3. Test Assertion Logic Could Be More Robust

Location: test_reconnection.py:384-386

# After fatal error, no more errors should be received
# (fatal error should be the last error, or there might be some cleanup errors)

The comment acknowledges "cleanup errors" might occur after fatal errors, but the assertion doesn't account for this. If cleanup errors DO occur, they would appear after the fatal error in error_codes, but the test doesn't validate this is acceptable.

Suggestion: Either:

  • Assert fatal error is the last error: assert fatal_error_index == len(tester.error_codes) - 1
  • Or explicitly allow and count cleanup errors after fatal error

🎯 Performance Considerations

  • ✅ No performance regressions introduced
  • ✅ Prevents unnecessary reconnection attempts for fatal errors (reduces network overhead)
  • ✅ Exponential backoff strategy remains appropriate for non-fatal errors

🔒 Security Considerations

  • ✅ No security vulnerabilities introduced
  • ✅ Authentication failures are properly classified as fatal
  • ✅ Error messages don't expose sensitive credentials (already handled by config's sensitive_handling=True)

📝 Code Quality & Best Practices

  1. Import Organization: The new import in const.py is at the top ✅
  2. Type Hints: Consistent with existing codebase ✅
  3. Naming: FATAL_ERROR_CODES is descriptive and follows convention ✅
  4. Code Formatting: The multi-line print statement formatting in test is correct ✅

🧪 Testing Recommendations

  1. Add explicit test case for the fatal error path with known bad credentials
  2. Verify resource cleanup after fatal errors in integration tests
  3. Test reconnection manager error handling path separately

📋 Summary & Recommendations

Must Fix Before Merge:

  1. Fix the error code in reconnect_manager.py:106 from FATAL_ERROR to NON_FATAL_ERROR

Should Fix:
2. Add resource cleanup when fatal errors occur (stop_connection() call)
3. Strengthen test assertions for post-fatal error behavior

Nice to Have:
4. Add explicit test case for fatal error scenarios with bad credentials

Overall, this is a solid PR that addresses an important issue. The fatal error detection logic is sound, but the reconnection manager has an inconsistent error code, and resource cleanup could be more explicit.


Verdict: ⚠️ Approve with changes - The critical bug in reconnect_manager.py should be fixed before merging.

@claude
Copy link

claude bot commented Jan 6, 2026

Code Review for PR #1946: Fix Azure ASR Fatal Error

Summary

This PR adds fatal error handling to the Azure ASR extension to prevent infinite retry loops on non-recoverable errors like authentication failures and bad requests. The changes include categorizing error codes, stopping retries for fatal errors, and adding comprehensive test coverage.

Strengths

  1. Well-Targeted Fix: The PR correctly identifies and addresses the root cause - certain Azure error types should not trigger reconnection attempts
  2. Good Error Code Selection: AuthenticationFailure and BadRequest are appropriate choices for fatal errors
  3. Consistent Pattern: Follows the same pattern used in the ByteDance ASR extension (bytedance_asr/const.py:46-48)
  4. Comprehensive Testing: The test additions properly validate both fatal and non-fatal error paths
  5. Version Bump: Correctly increments the version in manifest.json

Code Quality Observations

extension.py:431-435 - The fatal error handling logic is clean and well-placed. Setting self.stopped = True is the correct mechanism to prevent reconnection attempts.

const.py:14-18 - The constant definition is clear. Consider adding inline documentation explaining why each error code is considered fatal.

Potential Issues

Test Logic Error (test_reconnection.py:396-399) - There appears to be a logic error in the non-fatal error validation. The current code filters for codes that are NOT NON_FATAL_ERROR and asserts this equals the total length, which seems backwards. Should verify that NO fatal errors are present instead.

Missing Coverage

  1. Other Fatal Error Codes: Consider whether other Azure error codes should be fatal (TooManyRequests, Forbidden, ConnectionFailure types)
  2. Reconnect Manager Integration: The reconnect_manager isn't notified about fatal errors - check if cleanup is needed
  3. Error Message Clarity: Error messages could explicitly state no retry for fatal errors

Security & Performance

No security issues identified. Positive performance impact - prevents wasteful infinite retry loops.

Recommendations

Must Fix: Fix the test logic error in test_reconnection.py:396-399

Should Consider: Review more fatal error codes, add documentation, notify reconnect_manager

Nice to Have: Improve error message clarity, add test verifying reconnection prevention

Overall Assessment

This is a solid fix that addresses a real problem with appropriate error handling. The main issue is a logic error in the test validation. Once corrected, this should be good to merge.

@claude
Copy link

claude bot commented Jan 6, 2026

PR Review: Fix Azure ASR Fatal Error

Summary

This PR fixes Azure ASR error handling by distinguishing between fatal and non-fatal errors, preventing unnecessary retry attempts for authentication/bad request failures, and adding connection delay metrics.


Code Quality and Best Practices

Strengths:

  • ✅ Clean separation of concerns with FATAL_ERROR_CODES constant in const.py:13-18
  • ✅ Good use of type hints throughout the codebase
  • ✅ Comprehensive test coverage for both the fatal error handling and connection delay metrics
  • ✅ Well-documented test cases with clear descriptions of objectives

Issues:

  1. Logic Error in Test Validation (test_reconnection.py:396-399)

    non_fatal_codes = [code for code in tester.error_codes if code != ModuleErrorCode.NON_FATAL_ERROR.value]
    assert len(non_fatal_codes) == len(
        tester.error_codes
    ), f"All errors should be non-fatal, but found fatal error in sequence: {tester.error_codes}"

    This logic is inverted. The assertion checks if non_fatal_codes (which filters OUT non-fatal errors) has the same length as all error codes, which would be true only if there are NO non-fatal errors. The intended logic should be:

    fatal_codes = [code for code in tester.error_codes if code == ModuleErrorCode.FATAL_ERROR.value]
    assert len(fatal_codes) == 0, f"Expected only non-fatal errors, but found fatal error in sequence: {tester.error_codes}"
  2. Missing Import in const.py (const.py:1)
    The import statement import azure.cognitiveservices.speech as speechsdk is added at the top but this creates a dependency that wasn't there before. While necessary for FATAL_ERROR_CODES, consider if this should be in a different location to maintain separation of concerns (constants vs imports).

  3. Incomplete Error Handling (extension.py:438-440)
    When a fatal error is detected, only self.stopped is set to True. However, the code doesn't explicitly stop the recognition or clean up resources:

    if is_fatal:
        # Stop retrying for fatal errors
        self.stopped = True

    Consider adding cleanup logic:

    if is_fatal:
        self.stopped = True
        if self.client:
            self.client.stop_continuous_recognition()

Potential Bugs

  1. Race Condition in Connection Timestamp (extension.py:196-198, 481-484)
    The connection_start_timestamp is set immediately before start_continuous_recognition(), but if the connection happens synchronously or very quickly, there could be a race where the connected event fires before the timestamp is set. While unlikely, consider setting the timestamp BEFORE calling _register_azure_event_handlers().

  2. Unhandled Negative Delay (extension.py:481-484)
    If system time changes (NTP sync, daylight savings, etc.), the connection delay calculation could theoretically be negative:

    connection_delay_ms = (
        int(datetime.now().timestamp() * 1000)
        - self.connection_start_timestamp
    )

    Add a safeguard:

    connection_delay_ms = max(0, int(datetime.now().timestamp() * 1000) - self.connection_start_timestamp)
  3. Missing Method Definition
    The code calls await self.send_connect_delay_metrics(connection_delay_ms) at extension.py:491, but I couldn't find the definition of this method in the diff. Ensure this method exists in the base class AsyncASRBaseExtension or is defined in this extension.


Performance Considerations

  1. Timestamp Calculation (extension.py:481-484)
    Using datetime.now().timestamp() * 1000 and int() conversion is fine, but consider using time.perf_counter() for more accurate timing measurements that aren't affected by system clock changes:

    import time
    self.connection_start_timestamp = time.perf_counter()
    # Later:
    connection_delay_ms = int((time.perf_counter() - self.connection_start_timestamp) * 1000)
  2. Fatal Error Stopping (extension.py:438-440)
    Setting self.stopped = True is good, but ensure the reconnection loop checks this flag promptly to avoid unnecessary processing.


Security Concerns

  1. Error Message Exposure (extension.py:450)
    The error details from Azure (cancellation_details.error_details) are directly passed to error messages. Ensure these don't contain sensitive information like API keys or tokens that might be logged or sent to clients.

  2. Test Credentials (test_connection_delay_metrics.py:218-223)
    Using "fake_key" and "fake_region" in tests is good practice. Ensure no real credentials are ever committed.


Test Coverage

Excellent test coverage:

  • ✅ Test for connection delay metrics with realistic timing simulation
  • ✅ Test for fatal vs non-fatal error handling
  • ✅ Proper async test structure with timeout handling
  • ✅ Comprehensive validation of metrics structure

Suggestions:

  1. Add a test case for the edge case where connection_start_timestamp is not initialized (defensive programming)
  2. Add a test to verify that fatal errors actually stop reconnection attempts
  3. Consider adding a test for multiple fatal error types (AuthenticationFailure and BadRequest)

Additional Observations

  1. Version Bump (manifest.json:4)
    Version bumped from 0.2.3 to 0.2.4 - good practice for tracking changes.

  2. Code Formatting (test_reconnection.py:45-47)
    Inconsistent string formatting - some places use line continuation for long strings, which is good for readability.

  3. Documentation
    The new test file (test_connection_delay_metrics.py:177-186) has excellent docstring documentation explaining what is being tested.


Recommendations

Critical:

  1. Fix the inverted logic in test_reconnection.py:396-399
  2. Verify that send_connect_delay_metrics() method exists

High Priority:
3. Add safeguard for negative connection delay
4. Consider adding cleanup logic when fatal error occurs
5. Use time.perf_counter() instead of datetime.now().timestamp()

Medium Priority:
6. Add test case for fatal error stopping reconnection
7. Review error message content for sensitive information exposure

Low Priority:
8. Consider moving Azure SDK import from const.py to a more appropriate location
9. Add edge case tests for uninitialized timestamp


Overall Assessment

This is a well-structured PR that addresses an important issue with fatal error handling. The code quality is good, tests are comprehensive, and the implementation is mostly sound. The main concern is the logic error in the test validation which should be fixed before merging. The other issues are minor and can be addressed in follow-up PRs if needed.

Recommendation: Request changes for the critical logic error, then approve after fix.

@plutoless plutoless merged commit f07769b into main Jan 7, 2026
34 checks passed
@plutoless plutoless deleted the fix/azure_asr branch January 7, 2026 14:25
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants