Skip to content

gtid: Refactor reconnect logic & prevent events_count reset#5226

Merged
renecannao merged 3 commits intov3.0from
refactor-gtid-events-count
Nov 25, 2025
Merged

gtid: Refactor reconnect logic & prevent events_count reset#5226
renecannao merged 3 commits intov3.0from
refactor-gtid-events-count

Conversation

@wazir-ahmed
Copy link
Member

  • This patch was originally added by commit 0a70fd5 and reverted by 8d1b5b5, prior to the release of v3.0.3.
  • The following issues are addressed in this update,
    • Fix for use-after-free issue which occured during CI test.
    • Fix for deadlock issue between GTID_syncer and MySQL_Worker.

@gemini-code-assist
Copy link

Summary of Changes

Hello @wazir-ahmed, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request reintroduces and refines previously reverted GTID synchronization logic to resolve critical stability issues. The changes focus on robustly managing GTID server connections and data, specifically preventing memory corruption (use-after-free) and deadlocks that could occur during binlog reader reconnections. The updated logic ensures that GTID transaction intervals are correctly merged and events_count metrics are accurately maintained across connection fluctuations.

Highlights

  • Resolved Critical Bugs: Addressed a use-after-free vulnerability and a deadlock condition between the GTID_syncer and MySQL_Worker components, which were previously identified and caused a revert of similar changes.
  • Refactored GTID Reconnect Logic: Significantly improved the handling of GTID server data during reconnections, ensuring proper lifecycle management of GTID_Server_Data objects and their associated ev_io watchers.
  • Prevented events_count Reset: Introduced a new addGtidInterval function to intelligently update GTID transaction ID intervals, preventing the events_read counter from being reset when a binlog reader reconnects and provides updated GTID sets.
  • Improved Error Handling: Enhanced network read operations (readall) to better distinguish between EOF, actual errors, and non-blocking EINTR/EAGAIN conditions, leading to more robust connection management.
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

Feature Command Description
Code Review /gemini review Performs a code review for the current pull request in its current state.
Pull Request Summary /gemini summary Provides a summary of the current pull request in its current state.
Comment @gemini-code-assist Responds in comments when explicitly tagged, both in pull request comments and review comments.
Help /gemini help Displays a list of available commands.

Customization

To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

Copy link

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces a significant and much-needed refactoring of the GTID synchronization logic. The changes address critical issues like a use-after-free bug and a deadlock, making the system more robust and stable. The introduction of addGtidInterval and the revised logic in read_next_gtid correctly prevent the events_count from being reset on reconnection, which is a key functional improvement. The overall code structure is now cleaner, more readable, and easier to maintain. I have found one critical issue that needs to be addressed.

- This patch was originally added by commit 0a70fd5 and
  reverted by 8d1b5b5, prior to the release of `v3.0.3`.
- The following issues are addressed in this update,
  - Fix for `use-after-free` issue which occured during CI test.
  - Fix for deadlock issue between `GTID_syncer` and `MySQL_Worker`.

Signed-off-by: Wazir Ahmed <wazir@proxysql.com>
@wazir-ahmed wazir-ahmed force-pushed the refactor-gtid-events-count branch from 35ad33f to 50c6028 Compare November 25, 2025 06:59
@noizu noizu added this to the Release 3.0.4 milestone Nov 25, 2025
@renecannao
Copy link
Contributor

Detailed Code Review Analysis

Overall Assessment ✅

This PR successfully addresses critical stability issues in GTID synchronization while maintaining backward compatibility. The refactoring is well-executed and resolves the memory corruption and deadlock problems that caused the previous revert.

Key Strengths

Memory Safety Improvements:

  • Fixed use-after-free bug with proper cleanup of GTID_Server_Data objects and ev_io watchers
  • Enhanced resource management with clean separation between connection lifecycle and GTID data management
  • Better error handling that distinguishes between EOF, actual errors, and non-blocking conditions in readall()

Improved Connection Logic:

  • Refactored connect_cb() with cleaner separation of connection establishment vs. read operations (lines 65-85)
  • Better watcher lifecycle that eliminates potential memory leaks and use-after-free scenarios
  • Enhanced reconnection handling where servers marked as inactive can be properly reconnected without full recreation

GTID Processing Enhancements:

  • New addGtidInterval() function prevents events_count reset during binlog reader reconnection
  • Smart interval merging that correctly handles cases like server_UUID:1-10 to server_UUID:1-19
  • Accurate metrics that maintain correct GTID event counts across connection fluctuations

Critical Issues Addressed

  1. ✅ Use-after-free: Fixed memory corruption during CI tests
  2. ✅ Deadlock: Resolved GTID_syncer to MySQL_Worker deadlock
  3. ✅ Metrics accuracy: Prevented events_count reset on reconnection
  4. ✅ Resource leaks: Proper cleanup of watchers and GTID data

Technical Details

Connection Lifecycle Management:

  • Before: Complex intertwined connection/reading logic prone to memory issues
  • After: Clear separation with proper cleanup at each stage
  • Impact: Reduced memory footprint and eliminated crashes

GTID Interval Processing:

  • Before: Simple replacement causing count resets
  • After: Intelligent merging preserving historical counts
  • Impact: Accurate Prometheus metrics across reconnections

Deployment Considerations

  • Low risk: Maintains backward compatibility
  • High value: Fixes critical stability issues affecting production
  • Ready for 3.0.4: Solid candidate for next release

Summary

This is a well-engineered fix that addresses serious stability issues while improving code quality. The refactoring maintains clean separation of concerns and introduces robust error handling. The new addGtidInterval() function is particularly elegant in solving the metrics reset problem.

renecannao and others added 2 commits November 25, 2025 16:34
- Document addGtidInterval() function with parameter details and reconnection behavior
- Add documentation for readall() method explaining robust error handling
- Document connect_cb() and reader_cb() callbacks with resource management details
- Document generate_mysql_gtid_executed_tables() with multi-phase process explanation
- Focus on functionality, thread safety, and performance improvements
- Provide clear parameter descriptions and return value semantics
docs: Add comprehensive Doxygen documentation for GTID refactoring
@renecannao renecannao merged commit 05ab128 into v3.0 Nov 25, 2025
1 check passed
@renecannao
Copy link
Contributor

renecannao commented Nov 25, 2025

Detailed Code Review Analysis

Overall Assessment ✅

This PR successfully addresses critical stability issues in GTID synchronization while maintaining backward compatibility. The refactoring is well-executed and resolves the memory corruption and deadlock problems that caused the previous revert.

Key Strengths

🔧 Memory Safety Improvements

  • Fixed use-after-free bug: Proper cleanup of objects and watchers
  • Enhanced resource management: Clean separation between connection lifecycle and GTID data management
  • Better error handling: Distinguishes between EOF, actual errors, and non-blocking conditions in

🔄 Improved Connection Logic

  • **Refactored **: Cleaner separation of connection establishment vs. read operations (lines 65-85)
  • Better watcher lifecycle: Eliminates potential memory leaks and use-after-free scenarios
  • Enhanced reconnection handling: Servers marked as inactive can be properly reconnected without full recreation

📊 GTID Processing Enhancements

  • New function: Prevents reset during binlog reader reconnection
  • Smart interval merging: Correctly handles cases like →
  • Accurate metrics: Maintains correct GTID event counts across connection fluctuations

Code Quality Highlights

Critical Issues Addressed

  1. Use-after-free: Fixed memory corruption during CI tests
  2. Deadlock: Resolved GTID_syncer ↔ MySQL_Worker deadlock
  3. Metrics accuracy: Prevented reset on reconnection
  4. Resource leaks: Proper cleanup of watchers and GTID data

Technical Details

Connection Lifecycle Management

  • Before: Complex intertwined connection/reading logic prone to memory issues
  • After: Clear separation with proper cleanup at each stage
  • Impact: Reduced memory footprint and eliminated crashes

GTID Interval Processing

  • Before: Simple replacement causing count resets
  • After: Intelligent merging preserving historical counts
  • Impact: Accurate Prometheus metrics across reconnections

Deployment Considerations

  • Low risk: Maintains backward compatibility
  • High value: Fixes critical stability issues affecting production
  • Ready for 3.0.4: Solid candidate for next release

Summary

This is a well-engineered fix that addresses serious stability issues while improving code quality. The refactoring maintains clean separation of concerns and introduces robust error handling. The new function is particularly elegant in solving the metrics reset problem.

@sonarqubecloud
Copy link

@renecannao renecannao deleted the refactor-gtid-events-count branch March 7, 2026 20:39
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants