
[Fix] training time validation bug fix#611

Merged
mzr1996 merged 10 commits into open-mmlab:master from 0x4f5da2:fix_sampler
Dec 23, 2021

Conversation

@0x4f5da2 (Contributor)

Motivation

This PR fixes the bug introduced in #588

Modification

Add `sampler_cfg` for the validation dataloader in `mmcls/apis/train.py`.
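For context, the fix amounts to extracting the sampler config for the validation dataloader the same way the training dataloader already does. The sketch below is a self-contained, plain-Python illustration of that shape; the helper names and the `RepeatAugSampler` type echo mmcls but are not the exact diff:

```python
import copy


def get_sampler_cfg(cfg):
    # Illustrative stand-in for the helper this PR routes the validation
    # dataloader through: copy the sampler config out of the top-level cfg.
    return copy.deepcopy(cfg.get('sampler', None))


def dataloader_args(cfg, shuffle):
    # Sketch of the relevant build_dataloader plumbing: when a custom
    # sampler is configured, fold `shuffle` into it and disable the
    # DataLoader's own shuffling.
    args = {'shuffle': shuffle}
    sampler_cfg = get_sampler_cfg(cfg)
    if sampler_cfg:
        sampler_cfg['shuffle'] = shuffle  # shuffle=False for val/test
        args['sampler_cfg'] = sampler_cfg
        args['shuffle'] = False
    return args


cfg = {'sampler': {'type': 'RepeatAugSampler'}}
# The bug: the val loader built inside train.py never received sampler_cfg.
# After the fix, both loaders go through the same extraction:
train_args = dataloader_args(cfg, shuffle=True)
val_args = dataloader_args(cfg, shuffle=False)
print(train_args['sampler_cfg']['shuffle'])  # True
print(val_args['sampler_cfg']['shuffle'])    # False
```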


codecov bot commented Dec 22, 2021

Codecov Report

Merging #611 (21fd957) into master (d5ae255) will increase coverage by 0.44%.
The diff coverage is 100.00%.


@@            Coverage Diff             @@
##           master     #611      +/-   ##
==========================================
+ Coverage   81.33%   81.78%   +0.44%     
==========================================
  Files         116      116              
  Lines        6720     6709      -11     
  Branches     1156     1151       -5     
==========================================
+ Hits         5466     5487      +21     
+ Misses       1103     1069      -34     
- Partials      151      153       +2     
Flag       Coverage Δ
unittests  81.78% <100.00%> (+0.44%) ⬆️

Flags with carried forward coverage won't be shown.

Impacted Files Coverage Δ
mmcls/apis/train.py 21.25% <ø> (-6.53%) ⬇️
mmcls/datasets/builder.py 73.21% <100.00%> (+27.06%) ⬆️
mmcls/datasets/samplers/repeat_aug.py 90.47% <100.00%> (+5.36%) ⬆️
mmcls/datasets/samplers/distributed_sampler.py 84.00% <0.00%> (+56.00%) ⬆️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update d5ae255...21fd957.

@0x4f5da2 0x4f5da2 requested a review from mzr1996 December 22, 2021 04:51
@mzr1996 (Member)

mzr1996 commented Dec 22, 2021

Please also add it in tools/test.py

@0x4f5da2 (Contributor, Author)

> Please also add it in tools/test.py

Fixed.

@mzr1996 mzr1996 requested a review from RangiLyu December 23, 2021 01:39
    """
    rank, world_size = get_dist_info()

    sampler_cfg = get_sampler_cfg(cfg, distributed=dist)
Member

It is not a good choice to pass the whole config into a module's builder.

What if we change it like this?

    rank, world_size = get_dist_info()
    # default logic
    if dist:
        sampler = DistributedSampler(
            dataset, world_size, rank, shuffle=shuffle, round_up=round_up)
        batch_size = samples_per_gpu
        num_workers = workers_per_gpu
    else:
        sampler = None
        batch_size = num_gpus * samples_per_gpu
        num_workers = num_gpus * workers_per_gpu

    # build custom sampler logic
    if sampler_cfg:
        # shuffle=False when val and test
        sampler_cfg.update(shuffle=shuffle)
        sampler = build_sampler(
            sampler_cfg,
            default_args=dict(dataset=dataset, num_replicas=world_size, rank=rank))

    # If sampler exists, turn off dataloader shuffle
    if sampler is not None:
        shuffle = False

@mzr1996 (Member) Dec 23, 2021

I think it's better to use:

    rank, world_size = get_dist_info()

    # Custom sampler logic
    if sampler_cfg:
        # shuffle=False when val and test
        sampler_cfg.update(shuffle=shuffle)
        sampler = build_sampler(
            sampler_cfg,
            default_args=dict(dataset=dataset, num_replicas=world_size, rank=rank))
    # Default sampler logic
    elif dist:
        sampler = DistributedSampler(
            dataset, world_size, rank, shuffle=shuffle, round_up=round_up)
    else:
        sampler = None

    # If sampler exists, turn off dataloader shuffle
    if sampler is not None:
        shuffle = False

    if dist:
        batch_size = samples_per_gpu
        num_workers = workers_per_gpu
    else:
        batch_size = num_gpus * samples_per_gpu
        num_workers = num_gpus * workers_per_gpu

This way we avoid an unnecessary instantiation of DistributedSampler, and we separate the sampler logic from the batch_size/num_workers logic.
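The point about the wasted instantiation can be demonstrated with a runnable, dependency-free sketch of the reordered logic (plain-Python stand-ins for the mmcv classes, with a construction counter; the real classes and builder differ):

```python
class DistributedSampler:
    # Stand-in for mmcv's sampler; counts constructions so we can verify
    # the custom-sampler branch never builds one.
    constructed = 0

    def __init__(self, dataset, num_replicas, rank, shuffle=True, round_up=True):
        DistributedSampler.constructed += 1
        self.shuffle = shuffle


def build_sampler(sampler_cfg, default_args):
    # Stand-in for the registry-based builder: merge cfg over the defaults.
    return dict(default_args, **sampler_cfg)


def choose_sampler(sampler_cfg, dist, dataset, world_size, rank,
                   shuffle=True, round_up=True):
    if sampler_cfg:  # custom sampler logic comes first
        sampler_cfg = dict(sampler_cfg, shuffle=shuffle)
        sampler = build_sampler(
            sampler_cfg,
            default_args=dict(dataset=dataset, num_replicas=world_size,
                              rank=rank))
    elif dist:       # default distributed logic
        sampler = DistributedSampler(
            dataset, world_size, rank, shuffle=shuffle, round_up=round_up)
    else:
        sampler = None
    return sampler


custom = choose_sampler({'type': 'RepeatAugSampler'}, dist=True,
                        dataset=[], world_size=8, rank=0)
print(DistributedSampler.constructed)  # 0: no throwaway DistributedSampler
default = choose_sampler(None, dist=True, dataset=[], world_size=8, rank=0)
print(DistributedSampler.constructed)  # 1
```

With the first proposal, the `dist` branch would have built a DistributedSampler even when a custom sampler immediately replaced it.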

    SAMPLERS = Registry('sampler')


    def get_sampler_cfg(cfg, distributed):
Member

MME needs to handle DistributedSampler and InfiniteSampler as the default sampler depending on the runner type, so it has a set_default_sampler_cfg function.
But MMClassification has only one default sampler, DistributedSampler, so I think this function is unnecessary here.
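As a rough illustration of the registry pattern under discussion, a minimal sampler registry can be sketched in plain Python (a dict standing in for mmcv's Registry; the RepeatAugSampler signature below is illustrative, not the real mmcls one):

```python
SAMPLERS = {}


def register_sampler(cls):
    # Decorator standing in for @SAMPLERS.register_module().
    SAMPLERS[cls.__name__] = cls
    return cls


def build_sampler(cfg, default_args=None):
    # Look up cfg['type'] in the registry and construct it, letting
    # explicit cfg keys override the defaults injected by the caller.
    if cfg is None:
        return None
    cfg = dict(cfg)
    sampler_cls = SAMPLERS[cfg.pop('type')]
    return sampler_cls(**{**(default_args or {}), **cfg})


@register_sampler
class RepeatAugSampler:
    def __init__(self, dataset, num_replicas=1, rank=0, shuffle=True):
        self.dataset = dataset
        self.num_replicas = num_replicas
        self.rank = rank
        self.shuffle = shuffle


sampler = build_sampler(
    dict(type='RepeatAugSampler', shuffle=False),
    default_args=dict(dataset=list(range(10)), num_replicas=4, rank=1))
print(type(sampler).__name__, sampler.shuffle)  # RepeatAugSampler False
```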

@mzr1996 (Member) left a comment

LGTM

@mzr1996 mzr1996 merged commit da39ca6 into open-mmlab:master Dec 23, 2021
mzr1996 added a commit to mzr1996/mmpretrain that referenced this pull request Nov 24, 2022
* sampler bugfixes

* sampler bugfixes

* reorganize code

* minor fixes

* reorganize code

* minor fixes

* Use `mmcv.runner.get_dist_info` instead of `dist` package to get rank
and world size.

* Add `build_dataloader` unit tests and fix sampler's unit tests.

* Fix unit tests

* Fix unit tests

Co-authored-by: mzr1996 <mzr1996@163.com>

3 participants