Skip to content

Fix a leaked run error in test_pytorch_test_metrics_logged#4302

Merged
harupy merged 1 commit intomlflow:masterfrom
harupy:fix-pl-leaked-run
Apr 30, 2021
Merged

Fix a leaked run error in test_pytorch_test_metrics_logged#4302
harupy merged 1 commit intomlflow:masterfrom
harupy:fix-pl-leaked-run

Conversation

@harupy
Copy link
Copy Markdown
Member

@harupy harupy commented Apr 29, 2021

Signed-off-by: harupy 17039389+harupy@users.noreply.github.com

What changes are proposed in this pull request?

Fix a leaked run error in test_pytorch_test_metrics_logged.

https://github.com/mlflow/mlflow/runs/2464693969?check_suite_focus=true#step:10:23

==================================== ERRORS ====================================
____________ ERROR at teardown of test_pytorch_test_metrics_logged _____________

    @pytest.fixture(autouse=True)
    def clean_up_leaked_runs():
        """
        Certain test cases validate safety API behavior when runs are leaked. Leaked runs that
        are not cleaned up between test cases may result in cascading failures that are hard to
        debug. Accordingly, this fixture attempts to end any active runs it encounters and
        throws an exception (which reported as an additional error in the pytest execution output).
        """
        try:
            yield
>           assert (
                not mlflow.active_run()
            ), "test case unexpectedly leaked a run. Run info: {}. Run data: {}".format(
                mlflow.active_run().info, mlflow.active_run().data
            )
E           AssertionError: test case unexpectedly leaked a run. Run info: <RunInfo: artifact_uri='./mlruns/0/2a9d070e8c4b4d4b8e27cf0754d2190e/artifacts', end_time=None, experiment_id='0', lifecycle_stage='active', run_id='2a9d070e8c4b4d4b8e27cf0754d2190e', run_uuid='2a9d070e8c4b4d4b8e27cf0754d2190e', start_time=1619683570521, status='RUNNING', user_id='runner'>. Run data: <RunData: metrics={}, params={}, tags={'mlflow.source.name': '/usr/share/miniconda/bin/pytest',
E              'mlflow.source.type': 'LOCAL',
E              'mlflow.user': 'runner'}>
E           assert not <ActiveRun: >
E            +  where <ActiveRun: > = <function active_run at 0x7fabc1ac96a8>()
E            +    where <function active_run at 0x7fabc1ac96a8> = mlflow.active_run

How is this patch tested?

Fixed test

Release Notes

Is this a user-facing change?

  • No. You can skip the rest of this section.
  • Yes. Give a description of this change to be included in the release notes for MLflow users.

(Details in 1-2 sentences. You can just refer to another PR with a description if this PR is part of a larger change.)

What component(s), interfaces, languages, and integrations does this PR affect?

Components

  • area/artifacts: Artifact stores and artifact logging
  • area/build: Build and test infrastructure for MLflow
  • area/docs: MLflow documentation pages
  • area/examples: Example code
  • area/model-registry: Model Registry service, APIs, and the fluent client calls for Model Registry
  • area/models: MLmodel format, model serialization/deserialization, flavors
  • area/projects: MLproject format, project running backends
  • area/scoring: Local serving, model deployment tools, spark UDFs
  • area/server-infra: MLflow server, JavaScript dev server
  • area/tracking: Tracking Service, tracking client APIs, autologging

Interface

  • area/uiux: Front-end, user experience, JavaScript, plotting
  • area/docker: Docker use across MLflow's components, such as MLflow Projects and MLflow Models
  • area/sqlalchemy: Use of SQLAlchemy in the Tracking Service or Model Registry
  • area/windows: Windows support

Language

  • language/r: R APIs and clients
  • language/java: Java APIs and clients
  • language/new: Proposals for new client languages

Integrations

  • integrations/azure: Azure and Azure ML integrations
  • integrations/sagemaker: SageMaker integrations
  • integrations/databricks: Databricks integrations

How should the PR be classified in the release notes? Choose one:

  • rn/breaking-change - The PR will be mentioned in the "Breaking Changes" section
  • rn/none - No description will be included. The PR will be mentioned only by the PR number in the "Small Bugfixes and Documentation Updates" section
  • rn/feature - A new user-facing feature worth mentioning in the release notes
  • rn/bug-fix - A user-facing bug fix worth mentioning in the release notes
  • rn/documentation - A user-facing documentation change worth mentioning in the release notes

Signed-off-by: harupy <17039389+harupy@users.noreply.github.com>
@github-actions github-actions Bot added the rn/none List under Small Changes in Changelogs. label Apr 29, 2021
Comment on lines -270 to +273
trainer.fit(model, dm)
trainer.test()
with mlflow.start_run() as run:
trainer.fit(model, dm)
trainer.test()
Copy link
Copy Markdown
Member Author

@harupy harupy Apr 29, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This change in the pytorch-lightning repo caused the issue:

https://github.com/PyTorchLightning/pytorch-lightning/pull/7232/files#diff-667a2513c158c2b01d73b479e4dea75587d8531f9cd286891d34570a2fd145dcR986-R987

Prior to this change, Trainer.test calls Trainer.fit (which is patched and automatically creates / terminates a run). This means when on_test_end is called, an active run exists.

Now Trainer.test no longer calls Trainer.fit. This means when on_test_end is called, no active run exists and a new run will be created by mlflow.set_tag and leaked.

def on_test_end(self, trainer, pl_module):
    """
    Logs accuracy and other relevant metrics on the testing end

    :param trainer: pytorch lightning trainer instance
    :param pl_module: pytorch lightning base module
    """
    try_mlflow_log(mlflow.set_tag, "Mode", "testing")
    for key, value in trainer.callback_metrics.items():
        try_mlflow_log(mlflow.log_metric, key, float(value))

@harupy harupy mentioned this pull request Apr 29, 2021
27 tasks
@harupy harupy merged commit 780b3bf into mlflow:master Apr 30, 2021
@harupy harupy deleted the fix-pl-leaked-run branch April 30, 2021 08:32
YQ-Wang pushed a commit to YQ-Wang/mlflow that referenced this pull request May 29, 2021
Signed-off-by: harupy <17039389+harupy@users.noreply.github.com>
Signed-off-by: Yiqing Wang <yiqing@wangemail.com>
harupy added a commit to wamartin-aml/mlflow that referenced this pull request Jun 7, 2021
Signed-off-by: harupy <17039389+harupy@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

rn/none List under Small Changes in Changelogs.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants