[UI Telemetry] Implement SharedWorker for batching frontend logs#19292
[UI Telemetry] Implement SharedWorker for batching frontend logs#19292daniellok-db merged 26 commits intomlflow:masterfrom
Conversation
c2cc1fe to
53c8f83
Compare
f0fb458 to
3bc5eb5
Compare
|
@daniellok-db Thank you for the contribution! Could you fix the following issue(s)? ⚠ DCO checkThe DCO check failed. Please sign off your commit(s) by following the instructions here. See https://github.com/mlflow/mlflow/blob/master/CONTRIBUTING.md#sign-your-work for more details. |
There was a problem hiding this comment.
Pull request overview
This PR implements a SharedWorker-based system for batching and uploading frontend telemetry logs in MLflow. The implementation includes:
- Backend handlers for fetching telemetry configuration and receiving batched telemetry records
- Frontend SharedWorker that batches events and uploads them every 15 seconds
- Configuration management with caching and fallback values
- Proto definitions for UI telemetry messages
Key Changes
- Added UI-specific telemetry configuration endpoints with caching
- Implemented SharedWorker for cross-tab telemetry batching
- Extended telemetry Record schema to support custom session/installation IDs
- Added comprehensive test coverage for new functionality
Reviewed changes
Copilot reviewed 16 out of 18 changed files in this pull request and generated no comments.
Show a summary per file
| File | Description |
|---|---|
mlflow/telemetry/utils.py |
Added fetch_ui_telemetry_config() function with fallback support |
mlflow/telemetry/constant.py |
Added UI config URLs and fallback config constant |
mlflow/telemetry/schemas.py |
Extended Record to support optional installation_id and session_id |
mlflow/server/handlers.py |
Added GET/POST handlers for UI telemetry with TTL caching |
mlflow/server/__init__.py |
Registered new telemetry routes |
mlflow/server/js/src/telemetry/worker/TelemetryLogger.worker.ts |
Implemented SharedWorker for batching telemetry |
mlflow/server/js/src/telemetry/worker/LogQueue.ts |
Implemented queue with 15-second flush interval |
mlflow/server/js/craco.config.js |
Configured webpack to build worker as separate entry point |
tests/telemetry/test_*.py |
Added comprehensive tests for new functionality |
tests/server/test_handlers.py |
Added tests for GET/POST telemetry handlers |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
|
Documentation preview for d06c455 is available at: More info
|
05a93cb to
19de120
Compare
| // serve SharedWorker file at top-level, it seems to be more | ||
| // stable than if it's contained in `static/js/...`. previously | ||
| // i was running into issues with webpack path resolution | ||
| return 'TelemetryLogger.[name].[contenthash].worker.js'; |
There was a problem hiding this comment.
i think it is possible not to include contenthash here so that the file can be stable across hot reloads, but i was running into some issue where the yarn dev server was OOMing. i assume there were a lot of references that could not be cleaned up.
this way when the hash changes, we can manually terminate the orphaned workers. anyway it is a dev only issue as the bundle is static post-build.
| import { UI_TELEMETRY_ENDPOINT } from './constants'; | ||
| import { type TelemetryRecord } from './types'; | ||
|
|
||
| const FLUSH_INTERVAL_MS = 15000; // 15 seconds |
There was a problem hiding this comment.
to be discussed: what's the right flush interval? UI sessions can be longer lived, 15 seconds may be a bit too short. the tradeoff here is log fidelity vs. request volume. the longer the interval, the more logs are lost when the browser is closed, but we can save some request overhead.
| if (!config || (config.disable_ui_telemetry ?? true)) { | ||
| return; | ||
| } |
There was a problem hiding this comment.
if config couldn't be fetched, or if for some reason we cannot read the disable_ui_telemetry field, then drop the log
| private config: Promise<TelemetryConfig | null> = fetchConfig(); | ||
| private sessionId = crypto.randomUUID(); | ||
| private logQueue: LogQueue = new LogQueue(); | ||
| private samplingValue: number = Math.random() * 100; |
There was a problem hiding this comment.
Math.random() generates a random float between 0 - 1, so we can specify a pretty granular ui_rollout_percent in the config, like 12.345678.
121fb44 to
d06c455
Compare
🥞 Stacked PR
Use this link to review incremental changes.
Related Issues/PRs
#xxxWhat changes are proposed in this pull request?
This PR implements the SharedWorker code that runs in the browser. The next PR contains the client code that is actually used by the frontend.
The worker is fairly simple, it receives logs via
postMessagefrom the client, does a few validation checks, enriches the data with session ID, and then queues up the logs to be dispatched periodically.How is this PR tested?
Bug bashed manually
Does this PR require documentation update?
Release Notes
Is this a user-facing change?
What component(s), interfaces, languages, and integrations does this PR affect?
Components
area/tracking: Tracking Service, tracking client APIs, autologgingarea/models: MLmodel format, model serialization/deserialization, flavorsarea/model-registry: Model Registry service, APIs, and the fluent client calls for Model Registryarea/scoring: MLflow Model server, model deployment tools, Spark UDFsarea/evaluation: MLflow model evaluation features, evaluation metrics, and evaluation workflowsarea/gateway: MLflow AI Gateway client APIs, server, and third-party integrationsarea/prompts: MLflow prompt engineering features, prompt templates, and prompt managementarea/tracing: MLflow Tracing features, tracing APIs, and LLM tracing functionalityarea/projects: MLproject format, project running backendsarea/uiux: Front-end, user experience, plotting, JavaScript, JavaScript dev serverarea/build: Build and test infrastructure for MLflowarea/docs: MLflow documentation pagesHow should the PR be classified in the release notes? Choose one:
rn/none- No description will be included. The PR will be mentioned only by the PR number in the "Small Bugfixes and Documentation Updates" sectionrn/breaking-change- The PR will be mentioned in the "Breaking Changes" sectionrn/feature- A new user-facing feature worth mentioning in the release notesrn/bug-fix- A user-facing bug fix worth mentioning in the release notesrn/documentation- A user-facing documentation change worth mentioning in the release notesShould this PR be included in the next patch release?
Yesshould be selected for bug fixes, documentation updates, and other small changes.Noshould be selected for new features and larger changes. If you're unsure about the release classification of this PR, leave this unchecked to let the maintainers decide.What is a minor/patch release?
Bug fixes, doc updates and new features usually go into minor releases.
Bug fixes and doc updates usually go into patch releases.