Fix scheduling for fp16 reductions#370

Merged
jjsjann123 merged 9 commits into 20_8_18_devel from fp16reduction on Sep 14, 2020

Conversation

@csarofeen
Owner

Fixes #362

Collaborator

@jjsjann123 jjsjann123 left a comment


LGTM. I'm not stamping it yet, as I haven't gotten around to trying the fix.

// them.
struct FindOutputs : public IterVisitor {
const std::unordered_set<Val*>& of_;
std::unordered_set<Val*> outs;
Collaborator


Nitpick: inconsistent naming between `of_` and `outs` (only one member has the trailing underscore).

traverseFrom(fusion, fusion->outputs(), false);
};

public:
Collaborator


Nitpick: `public:` is redundant in a struct, since struct members are public by default.

// Use dependency check to find the reduction tv as it returns used values
// instead of exprs.
auto used_vals = DependencyCheck::getAllValsBetween(
{fusion.inputs().begin(), fusion.inputs().end()}, fusion.outputs());
Collaborator


Does getAllValsBetween(...) have a smaller traversal space than fusion.exprs() because we might have trailing nodes after the outputs? Just curious about the motivation for this code change.

// tv->axis(-1)->parallelize(ParallelType::TIDx);
// }
}
TensorView* out0 = fusion->outputs()[0]->as<TensorView>();
Collaborator


The code below `out0` should be cleaned up as well; it was only used previously, when scheduling was marked explicitly in the transformation.

jjsjann123 added a commit that referenced this pull request Sep 14, 2020
Fixes #357. Two things in this PR:

Do type propagation even for the profiling executor -> this is the root cause of the bug reported in #357.
Allow a dtype argument in sum, which is simply handled in our type propagation.

This exposes a scheduling issue in #362, for which we added python tests (currently disabled). The fix is in PR #370; we will enable the tests after it merges.
@jjsjann123 jjsjann123 merged commit 83028a4 into 20_8_18_devel Sep 14, 2020
@jjsjann123 jjsjann123 deleted the fp16reduction branch September 14, 2020 23:15
Successfully merging this pull request may close these issues.

reduction with fp16 cast generates wrong kernel indexing

3 participants