Apache Airflow Provider(s)
fab
Versions of Apache Airflow Providers
apache-airflow-providers-fab==3.3.0
apache-airflow-providers-common-sql==1.28.2
apache-airflow-providers-mysql==6.3.4
apache-airflow-providers-cncf-kubernetes==10.11.0
apache-airflow-providers-celery==3.13.0
apache-airflow-providers-standard==1.10.0
Apache Airflow version
3.1.6
Operating System
Debian 12 (bookworm) — official Airflow Docker image
Deployment
Official Apache Airflow Helm Chart
Deployment details
- Kubernetes: Amazon EKS
- Metadata DB: Amazon Aurora MySQL (MySQL 8.0 compatible)
- MySQL wait_timeout: 28800 seconds (8 hours)
- SQLAlchemy pool config: pool_recycle=60, pool_pre_ping=true, pool_size=3, max_overflow=2
- api-server replicas: 2 pods
- Airflow image: custom image based on apache/airflow:3.1.6 with providers-fab==3.3.0
What happened
The cleanup_session_middleware introduced in PR #61480 (included in providers-fab 3.3.0) calls Session.remove() in a bare finally block without any error handling. When the underlying MySQL connection has been closed server-side (e.g., due to timeout, network interruption, or Aurora failover), Session.remove() internally attempts a ROLLBACK on the dead connection, which raises MySQLdb.OperationalError: (2006, 'Server has gone away').
This unhandled exception propagates up as a 500 Internal Server Error to the client, even though the original request may have completed successfully.
Error log from api-server pod:
2026-02-20T05:50:24.526091553Z [error ] Exception in ASGI application [airflow.providers.fab.auth_manager.fab_auth_manager] loc=fab_auth_manager.py:243
Traceback (most recent call last):
File ".../uvicorn/protocols/http/h11_impl.py", line 406, in run_asgi
result = await app(self.scope, self.receive, self.send)
...
File ".../airflow/providers/fab/auth_manager/fab_auth_manager.py", line 243, in cleanup_session_middleware
settings.Session.remove()
File ".../sqlalchemy/orm/scoping.py", line 246, in remove
self.registry().close()
File ".../sqlalchemy/orm/session.py", line 2081, in close
self._close_impl(invalidate=False)
File ".../sqlalchemy/orm/session.py", line 2124, in _close_impl
self.rollback()
...
File ".../MySQLdb/connections.py", line 260, in query
_mysql.connection.query(self, query)
MySQLdb.OperationalError: (2006, 'Server has gone away')
Relevant source code (fab_auth_manager.py, lines 235-243):
async def cleanup_session_middleware(request, call_next):
    try:
        response = await call_next(request)
        return response
    finally:
        from airflow import settings
        if settings.Session:
            settings.Session.remove()  # <-- unhandled exception here
The finally block does not catch exceptions from Session.remove(). Since this is a cleanup operation, any failure here should be logged and suppressed — not propagated to the client.
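The failure mode can be illustrated without Airflow or a database. The sketch below is a minimal stand-in, not the provider's code: DeadConnectionSession and this call_next are hypothetical stubs, and ConnectionError substitutes for MySQLdb.OperationalError. It shows that an exception raised inside finally discards the response that was about to be returned.

```python
import asyncio


class DeadConnectionSession:
    """Hypothetical stand-in for settings.Session with a dead connection."""

    def remove(self):
        # Simulates the ROLLBACK failing because the server closed the connection.
        raise ConnectionError("(2006, 'Server has gone away')")


async def call_next(request):
    # Hypothetical downstream handler: the request itself succeeds.
    return {"status": 200, "path": request}


async def cleanup_session_middleware(request, session):
    try:
        response = await call_next(request)
        return response
    finally:
        # An exception raised here replaces the pending return value.
        session.remove()


async def main():
    try:
        return await cleanup_session_middleware(
            "/auth/fab/v1/login", DeadConnectionSession()
        )
    except ConnectionError as exc:
        # In the real stack this surfaces as "Exception in ASGI application".
        return f"500 caused by cleanup: {exc}"


result = asyncio.run(main())
print(result)
```

Although call_next completed and produced a response, the caller only ever sees the cleanup error.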
What you think should happen instead
Session.remove() in the finally block should be wrapped with suppress(Exception) to gracefully handle database connection errors during cleanup. The cleanup middleware's purpose is to prevent stale sessions — if cleanup itself fails because the connection is already dead, that's not an error that should affect the HTTP response.
Suggested fix:
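from contextlib import suppress  # already imported in fab_auth_manager.py

async def cleanup_session_middleware(request, call_next):
    try:
        response = await call_next(request)
        return response
    finally:
        from airflow import settings
        if settings.Session:
            # Cleanup failures must not replace the response.
            with suppress(Exception):
                settings.Session.remove()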
This is consistent with the suppress(Exception) pattern already used in deserialize_user (PR #62153, merged 2026-02-19) for identical session cleanup error handling. The from contextlib import suppress import already exists in the file.
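If maintainers would rather log the failure than drop it silently (as the "What happened" section suggests), a try/except variant is one possible alternative to suppress(Exception). The sketch below is an assumption-laden illustration, not a proposed patch: the logger name, the DeadConnectionSession stub, and this call_next are all hypothetical, and ConnectionError stands in for the MySQL driver error.

```python
import asyncio
import logging

log = logging.getLogger("cleanup_demo")  # hypothetical logger name


class DeadConnectionSession:
    """Hypothetical stand-in for settings.Session with a dead connection."""

    def __init__(self):
        self.remove_calls = 0

    def remove(self):
        self.remove_calls += 1
        raise ConnectionError("(2006, 'Server has gone away')")


async def call_next(request):
    # Hypothetical downstream handler: the request itself succeeds.
    return {"status": 200}


async def cleanup_session_middleware(request, session):
    try:
        return await call_next(request)
    finally:
        try:
            session.remove()
        except Exception:
            # Record the failure, but let the original response through.
            log.warning("Session cleanup failed; ignoring", exc_info=True)


session = DeadConnectionSession()
response = asyncio.run(cleanup_session_middleware("/auth/fab/v1/login", session))
print(response)
```

Either form keeps the response intact. The trade-off is only whether the cleanup failure leaves a trace in the api-server logs.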
How to reproduce
- Deploy Airflow 3.1.6 with providers-fab==3.3.0 using MySQL (Aurora MySQL) as metadata DB
- Configure SQLAlchemy with pool_pre_ping=true and pool_recycle=60
- Have api-server running with multiple replicas
- Wait for a MySQL connection in the SQLAlchemy pool to become stale (connection closed server-side due to timeout, network issue, or Aurora maintenance)
- Send a request to the api-server (e.g., login via /auth/fab/v1/login) that triggers cleanup_session_middleware
- The stale connection causes Session.remove() → ROLLBACK → MySQLdb.OperationalError: (2006, 'Server has gone away') → 500 error
Note: This is timing-dependent and occurs intermittently. In our production environment, it appeared on 1 of 2 api-server pods. The issue is more likely to manifest with MySQL than PostgreSQL, since MySQL's Server has gone away error has no automatic retry at the driver level.
Anything else
Context — this is a follow-up to PR #61480:
PR #61480 correctly addressed the root cause of PendingRollbackError (issue #59349) by adding cleanup_session_middleware to ensure Session.remove() runs after every request. However, the finally block assumes Session.remove() always succeeds. When the DB connection is already dead, the cleanup itself fails and turns a successful request into a 500 error.
Impact:
- Intermittent 500 errors on api-server login/UI pages
- Self-recovers on retry (next request gets a fresh connection from the pool)
- In our case: 2 occurrences over several days, both on the same pod
Related issues and PRs:
- #59349: Original PendingRollbackError issue that motivated PR #61480
- #61480: PR that introduced cleanup_session_middleware
- #62153: PR that established the suppress(Exception) pattern for session cleanup in deserialize_user (same class of problem, different code path)
- #57470, #57859: Earlier reports of the same session lifecycle problem
Environment evidence:
- pool_pre_ping=true is enabled, so SQLAlchemy validates connections when they are checked out of the pool, but Session.remove() bypasses this check because it operates on a session that already holds a connection
- MySQL wait_timeout=28800 (8h) and pool_recycle=60 should prevent most stale connections, but edge cases (Aurora failover, network blips) can still cause disconnections
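The gap that pre-ping leaves open can be shown with a toy model. This is deliberately not SQLAlchemy's implementation: ToyPool and ToyConnection are hypothetical classes that only mimic the "validate at checkout" behaviour, to show why a connection that dies while a session is holding it is never re-validated before the ROLLBACK.

```python
class ToyConnection:
    """Hypothetical connection that can be killed from the server side."""

    def __init__(self):
        self.alive = True

    def ping(self):
        return self.alive

    def rollback(self):
        if not self.alive:
            raise ConnectionError("(2006, 'Server has gone away')")


class ToyPool:
    """Hypothetical pool that pre-pings only when a connection is checked out."""

    def __init__(self):
        self._idle = [ToyConnection()]

    def checkout(self):
        conn = self._idle.pop() if self._idle else ToyConnection()
        if not conn.ping():
            # Pre-ping: replace a connection that died while idle.
            conn = ToyConnection()
        return conn

    def checkin(self, conn):
        self._idle.append(conn)


pool = ToyPool()

# Case 1: the connection dies while idle in the pool; pre-ping replaces it.
idle = pool.checkout()
pool.checkin(idle)
idle.alive = False
fresh = pool.checkout()
fresh.rollback()  # succeeds on the replacement connection
idle_case_ok = fresh is not idle

# Case 2: the connection dies after checkout, while the session holds it.
held = pool.checkout()
held.alive = False  # e.g. a failover or network blip mid-request
try:
    held.rollback()  # no pre-ping happens here
    held_case_error = None
except ConnectionError as exc:
    held_case_error = str(exc)

print(idle_case_ok, held_case_error)
```

In this model only the second case fails, which matches the intermittent, timing-dependent behaviour described above.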
Full error traceback
2026-02-20T05:50:24.526091553Z [error ] Exception in ASGI application [airflow.providers.fab.auth_manager.fab_auth_manager] loc=fab_auth_manager.py:243
Traceback (most recent call last):
File "/home/airflow/.local/lib/python3.12/site-packages/uvicorn/protocols/http/h11_impl.py", line 406, in run_asgi
result = await app(self.scope, self.receive, self.send)
File "/home/airflow/.local/lib/python3.12/site-packages/uvicorn/middleware/proxy_headers.py", line 60, in __call__
return await self.app(scope, receive, send)
File "/home/airflow/.local/lib/python3.12/site-packages/starlette/middleware/base.py", line 101, in __call__
response = await self.dispatch_func(request, call_next)
File "/home/airflow/.local/lib/python3.12/site-packages/airflow/providers/fab/auth_manager/fab_auth_manager.py", line 243, in cleanup_session_middleware
settings.Session.remove()
File "/home/airflow/.local/lib/python3.12/site-packages/sqlalchemy/orm/scoping.py", line 246, in remove
self.registry().close()
File "/home/airflow/.local/lib/python3.12/site-packages/sqlalchemy/orm/session.py", line 2081, in close
self._close_impl(invalidate=False)
File "/home/airflow/.local/lib/python3.12/site-packages/sqlalchemy/orm/session.py", line 2124, in _close_impl
self.rollback()
File "/home/airflow/.local/lib/python3.12/site-packages/sqlalchemy/orm/session.py", line 1982, in rollback
self._transaction.rollback(_to_root=True)
File "/home/airflow/.local/lib/python3.12/site-packages/sqlalchemy/orm/session.py", line 1040, in rollback
self._connection_rollback(self._connections[transaction])
File "/home/airflow/.local/lib/python3.12/site-packages/sqlalchemy/orm/session.py", line 1092, in _connection_rollback
connection.rollback()
File "/home/airflow/.local/lib/python3.12/site-packages/sqlalchemy/engine/base.py", line 1065, in rollback
self._transaction.rollback()
File "/home/airflow/.local/lib/python3.12/site-packages/sqlalchemy/engine/base.py", line 1768, in rollback
self.connection._rollback_impl()
File "/home/airflow/.local/lib/python3.12/site-packages/sqlalchemy/engine/base.py", line 902, in _rollback_impl
self._handle_dbapi_exception(e, None, None, None, None)
File "/home/airflow/.local/lib/python3.12/site-packages/sqlalchemy/engine/base.py", line 2240, in _handle_dbapi_exception
raise sqlalchemy_exception.with_traceback(exc_info[2]) from e
File "/home/airflow/.local/lib/python3.12/site-packages/sqlalchemy/engine/base.py", line 899, in _rollback_impl
self.connection.dbapi_connection.rollback()
File "/home/airflow/.local/lib/python3.12/site-packages/MySQLdb/connections.py", line 272, in rollback
self.query("ROLLBACK")
File "/home/airflow/.local/lib/python3.12/site-packages/MySQLdb/connections.py", line 260, in query
_mysql.connection.query(self, query)
MySQLdb.OperationalError: (2006, 'Server has gone away')