Remove paragraph on simulating data from first tip (#218)
Conversation
cgreene
left a comment
I agree this doesn't belong here but it's probably good to have in Tip 4. I will file a PR adding it there.
I agree. Personally, I think Tip 1 should be revised to introduce concepts about choosing the right model entirely (e.g., supervised vs. unsupervised, architecture choice, model complexity). Data is mentioned pretty early on, but it's confusing. It may be nice to put all the focus on data selection in Tip 4 overall.
By the way @ejfertig, I just invited you as a collaborator to the repo. Once you accept, you can review this PR. In general, our procedure is to have one person make the PR, a second person approve it, and a third approve and merge it. So, if you think the changes are good to go, mark your approval and merge!
content/06.know-your-problem.md
Outdated
To accurately test the performance of the model, it is important that simulated datasets be generated for a range of parameters.
Simulated datasets generated from parameters that are consistent with the assumptions of the deep learning models can be used to verify the correctness of the model's implementation.
Further varying the parameters to violate the model's assumptions can test the sensitivity of the model's performance to those assumptions.
Generative models based upon sample training sets can be an important tool for generating large cohorts of simulated datasets that are representative of the real-world problems to which the machine learning prediction models will be applied.
Can we clarify this sentence a bit? I'm still not 100% sure I understand what it's saying. Does it mean that generative models are useful for testing predictive models (like what the last sentence said)? Is it a conclusion? Or is it introducing a different idea that I failed to catch?
I re-read. It is a bit confusing. Seems like it can either be the conclusion or the start of going in a more specific direction (since it seems to use the word generative for the first time). Not sure the last sentence is needed.
Now that I re-read, I think this paragraph needs to be tweaked. The intro sentence is talking about datasets not being thoroughly understood...so meaning the true distribution is unknown? If the idea is to learn the true data distribution of the data for the purpose of generating new, simulated data, then I think this should be spelled out for the reader. Do we also want to spell out generative vs discriminative? GAN vs VAE?
Not sure if this is commit worthy, but for a starting place how about something like
Data simulation is a powerful approach to creating additional data for model testing. In data simulation, a generative model is used to learn the true distribution of a training set for the purpose of creating new data points. Simulations should be performed under reasonable assumptions since the goal would be to identify useful model architectures and hyperparameters, and simulated datasets can be used to verify the correctness of a model’s implementation. To accurately test the performance of the model, it is important that simulated datasets be generated for a range of parameters. For example, varying the parameters to violate the model's assumptions can test the sensitivity of the model's performance.
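The parameter sweep described in the suggested paragraph could be sketched roughly as follows. This is a hypothetical, minimal illustration (all names and numbers are made up for this example): a simple least-squares fit stands in for the model under test, data are simulated across a range of noise parameters, and a heavy-tailed noise setting deliberately violates the Gaussian-noise assumption to probe sensitivity.

```python
# Illustrative sketch only: simulate data for a range of parameters,
# fit a stand-in model, and see how estimation error responds.
import random
import statistics

rng = random.Random(0)

def simulate(n=200, noise_sd=0.5, heavy_tails=False):
    """Simulate y = 2*x + 1 + noise. heavy_tails=True injects occasional
    outliers, a crude violation of the Gaussian-noise assumption."""
    xs, ys = [], []
    for _ in range(n):
        x = rng.uniform(-1, 1)
        noise = rng.gauss(0, noise_sd)
        if heavy_tails and rng.random() < 0.05:
            noise *= 10  # rare large outliers
        xs.append(x)
        ys.append(2.0 * x + 1.0 + noise)
    return xs, ys

def fit_slope(xs, ys):
    """Ordinary least-squares slope (stand-in for the model under test)."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = sum((x - mx) ** 2 for x in xs)
    return num / den

# Sweep the noise parameter: error should grow smoothly with noise_sd,
# a basic check that the implementation behaves sensibly across settings.
for sd in (0.0, 0.5, 1.0):
    errs = [abs(fit_slope(*simulate(noise_sd=sd)) - 2.0) for _ in range(30)]
    print(f"noise_sd={sd}: mean |slope error| = {statistics.fmean(errs):.3f}")

# Violating the noise assumption probes the model's sensitivity to it.
errs_heavy = [abs(fit_slope(*simulate(heavy_tails=True)) - 2.0) for _ in range(30)]
print(f"heavy-tailed noise: mean |slope error| = {statistics.fmean(errs_heavy):.3f}")
```

The same pattern generalizes to the manuscript's setting: replace the least-squares fit with the deep learning model and the noise parameter with whatever assumptions the simulation encodes.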
I remember @rasbt suggested we avoid specifically talking about VAEs due to their declining use (correct me if I'm wrong). Your paragraph is clearer so I suggest we use that in its place since it covers the same topics as the current suggested change. If someone else agrees, I'll update the PR and re-request reviews.
I think it depends a bit. We used VAEs recently to better understand a phenomenon that had been widely observed but not really empirically demonstrated: https://www.biorxiv.org/content/10.1101/2020.05.03.066597v1 . The data from the VAE are clearly a bit over-smoothed, but depending on your use case that might or might not matter.
There are also elements of data simulation that aren't based on a generative neural network (i.e., learning the true distribution of a training set) and are instead based upon researcher suppositions about the problem at hand. If we want to start from @pstew's I'm happy to suggest some modifications on top of it! 👍
Sure, very happy to help with this. This manuscript sounds like it will be very helpful! Once @pstew's changes are merged in, I'm happy to add to this.
@ajlee21 Thanks! I made the changes. Please tweak as much as needed. Since we took this from another rule, do we need to ultimately bring this paragraph back around to the idea of this rule (knowing your data/knowing your problem)? I suppose getting to the true distribution of the data is a start, but maybe as written it doesn't exactly fit.
Start tweaking last paragraph
Doing a merge commit here because there are so many commits from so many different people that I don't know what squash will do with authorship.
Well, it squash-merged, but it had two authors on the commit, so it looks like it worked 🤷
[ci skip] This build is based on bafe3ad. This commit was created by the following CI build and job: https://github.com/Benjamin-Lee/deep-rules/commit/bafe3adb935bef5addbcd6717de18b32a968539a/checks https://github.com/Benjamin-Lee/deep-rules/runs/282422705
This paragraph doesn't seem to fit the theme of learning about general ML before diving into DL, so I think we should remove it.