
Remove paragraph on simulating data from first tip#218

Merged
cgreene merged 4 commits into master from remove-simulated-data-from-concepts
Oct 1, 2020

Conversation

@Benjamin-Lee (Owner)

This paragraph doesn't seem to fit the theme of learning about general ML before diving into DL, so I think we should remove it.

@cgreene (Collaborator) left a comment


I agree this doesn't belong here but it's probably good to have in Tip 4. I will file a PR adding it there.

@cgreene cgreene mentioned this pull request Sep 16, 2020
This was referenced Sep 29, 2020
@ejfertig (Collaborator)

I agree. Personally, I think Tip 1 should be revised to focus entirely on introducing concepts about choosing the right model (e.g., supervised vs. unsupervised, architecture choice, model complexity). Data is mentioned pretty early on, but it's confusing. It may be nice to consolidate the focus on data selection in Tip 4.

@Benjamin-Lee (Owner, Author)

By the way @ejfertig, I just invited you as a collaborator to the repo. Once you accept, you can review this PR. In general, the procedure we do is to have one person make the PR, a second person approve it, and a third approve and merge it. So, if you think the changes are good to go, mark your approval and merge!

To accurately test the performance of the model, it is important that simulated datasets be generated for a range of parameters.
Simulated datasets generated from parameters that are consistent with the assumptions of the deep learning models can be used to verify the correctness of the model’s implementation.
Further varying the parameters to violate the model's assumptions can test the sensitivity of the model's performance to those assumptions.
Generative models based upon sample training sets can be an important tool for generating large cohorts of simulated datasets that are representative of the real world problems to which the machine learning prediction models will be applied.
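As a minimal sketch of the parameter sweep this paragraph describes (the generator, its `noise` and `curvature` parameters, and the linear baseline model are all illustrative assumptions, not taken from the manuscript):

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate(n=200, noise=0.5, curvature=0.0):
    # Hypothetical simulator: y is linear in x, optionally bent by a
    # quadratic term that violates a linear model's assumptions.
    x = rng.uniform(-3, 3, n)
    y = 2.0 * x + curvature * x**2 + rng.normal(0, noise, n)
    return x, y

def fit_linear_r2(x, y):
    # Ordinary least squares with an intercept, scored in-sample by R^2.
    X = np.column_stack([np.ones_like(x), x])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return 1 - resid.var() / y.var()

# Sweep the simulation parameters: first consistent with the model's
# assumptions (curvature=0), then increasingly violating them.
for curvature in [0.0, 0.5, 2.0]:
    x, y = simulate(curvature=curvature)
    print(f"curvature={curvature}: R^2={fit_linear_r2(x, y):.3f}")
```

When the simulated data match the model's assumptions, near-perfect performance confirms the implementation; as the assumption-violating parameter grows, the drop in performance quantifies the model's sensitivity to that assumption.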
Owner Author

Can we clarify this sentence a bit? I'm still not 100% sure I understand what it's saying. Does it mean that generative models are useful for testing predictive models (like what the last sentence said)? Is it a conclusion? Or is it introducing a different idea that I failed to catch?

Collaborator

I re-read. It is a bit confusing. It seems like it can either be the conclusion or the start of going in a more specific direction (since it uses the word "generative" for the first time). Not sure the last sentence is needed.

@pstew (Collaborator) Sep 30, 2020

Now that I re-read, I think this paragraph needs to be tweaked. The intro sentence is talking about datasets not being thoroughly understood...so meaning the true distribution is unknown? If the idea is to learn the true data distribution of the data for the purpose of generating new, simulated data, then I think this should be spelled out for the reader. Do we also want to spell out generative vs discriminative? GAN vs VAE?

Not sure if this is commit worthy, but for a starting place how about something like

Data simulation is a powerful approach to creating additional data for model testing. In data simulation, a generative model is used to learn the true distribution of a training set for the purpose of creating new data points. Simulations should be performed under reasonable assumptions since the goal would be to identify useful model architectures and hyperparameters, and simulated datasets can be used to verify the correctness of a model’s implementation. To accurately test the performance of the model, it is important that simulated datasets be generated for a range of parameters. For example, varying the parameters to violate the model's assumptions can test the sensitivity of the model's performance.
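As a minimal sketch of the generative step in the suggested paragraph, with a multivariate Gaussian fit standing in for a learned generative model such as a GAN or VAE (the training data here are synthetic and purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical "real" training set: 500 points from an unknown distribution.
train = rng.multivariate_normal([1.0, -2.0], [[1.0, 0.6], [0.6, 2.0]], size=500)

# Simplest possible generative model: estimate the mean and covariance of
# the training set (standing in for a learned model such as a GAN or VAE).
mu = train.mean(axis=0)
cov = np.cov(train, rowvar=False)

# Sample a large cohort of new, simulated data points from the fitted model.
simulated = rng.multivariate_normal(mu, cov, size=10_000)
```

The simulated cohort reproduces the training set's summary statistics, so it can serve as additional, representative data for testing model architectures and hyperparameters.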

Owner Author

I remember @rasbt suggested we avoid specifically talking about VAEs due to their declining use (correct me if I'm wrong). Your paragraph is clearer so I suggest we use that in its place since it covers the same topics as the current suggested change. If someone else agrees, I'll update the PR and re-request reviews.

Collaborator

I think it depends a bit. We used VAEs recently to better understand a phenomenon that had been widely observed but not really empirically demonstrated: https://www.biorxiv.org/content/10.1101/2020.05.03.066597v1 . The data from the VAE are clearly a bit over-smoothed, but depending on your use case that might or might not matter.

There are also elements of data simulation that aren't based on a generative neural network (i.e., learning the true distribution of a training set) and are instead based upon researcher suppositions about the problem at hand. If we want to start from @pstew's I'm happy to suggest some modifications on top of it! 👍

Collaborator

I guess we could also see if @ajlee21 wants to weigh in on revising the paragraph from @pstew since she led that dataset simulation work.

Collaborator

Sure, very happy to help with this. This manuscript sounds like it will be very helpful! Once @pstew's changes are merged in, I'm happy to add to it.

Collaborator

@pstew - can you go to "Files changed" and mark as approved? Then we can merge this, and @ajlee21 can contribute a PR with suggested modifications.

Collaborator

@ajlee21 Thanks! I made the changes. Please tweak as much as needed. Since we took this from another rule, do we need to ultimately bring this paragraph back around to the idea of this rule (knowing your data/knowing your problem)? I suppose getting to the true distribution of the data is a start, but maybe as written it doesn't exactly fit.

Collaborator

@cgreene Done!

@Benjamin-Lee Benjamin-Lee requested a review from pstew September 30, 2020 18:00
Start tweaking last paragraph
@cgreene (Collaborator)

cgreene commented Oct 1, 2020

Doing a merge commit here because there are so many commits from so many different people that I don't know what squash will do with authorship.

@cgreene cgreene merged commit bafe3ad into master Oct 1, 2020
@cgreene (Collaborator)

cgreene commented Oct 1, 2020

Well - it squash-merged, but it had 2 authors on the commit, so it looks like it worked 🤷

@cgreene cgreene deleted the remove-simulated-data-from-concepts branch October 1, 2020 13:27