Prompt ranking evaluation task in Hierarchical Neural Story Generation paper #733
apappu97 wants to merge 6 commits into
This pull request has been automatically marked as stale. If this pull request is still relevant, please leave any comment (for example, "bump"), and we'll keep it open. We are sorry that we haven't been able to prioritize reviewing it yet. Your contribution is very much appreciated.
Closing this pull request after a prolonged period of inactivity. If this issue is still present in the latest release, please ask for this pull request to be reopened. Thank you!
Hi all,
Thanks for releasing the code and models for the Hierarchical Neural Story Generation paper!
We have been trying to reproduce the prompt ranking accuracy metric reported in the paper. According to Figure 6, the full fusion model achieves 16.3% accuracy. However, when we run our own implementation of the prompt ranking accuracy evaluation, we get roughly 39.8% accuracy instead.
As far as we can tell, we've done the evaluation as described in the paper, but perhaps we've done something wrong, or maybe our method for sampling the examples differs from the one used in the paper?
This PR contains code and instructions to run our prompt ranking accuracy evaluation and reproduce our 39.8% number. We're hoping that someone at FAIR might be able to help us figure out why our number is so different from the one reported in the paper.
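To make the discussion concrete, here is a minimal sketch of a prompt ranking accuracy evaluation of this kind. It is not the code in this PR and not necessarily the procedure used for the paper. It assumes each story is scored against its true prompt plus a number of randomly sampled distractor prompts, and counts as correct only when the true prompt scores strictly highest. The `score` callable, the default of 9 distractors, the 1000-example sample size, and the seed are all illustrative assumptions; in practice `score` would wrap a model's log-likelihood of the story given the prompt.

```python
import random
from typing import Callable, Sequence, Tuple


def prompt_ranking_accuracy(
    pairs: Sequence[Tuple[str, str]],      # (prompt, story) pairs
    score: Callable[[str, str], float],    # hypothetical scorer: higher = story more likely given prompt
    num_distractors: int = 9,              # assumption, not taken from this PR
    num_examples: int = 1000,              # assumption, not taken from this PR
    seed: int = 0,
) -> float:
    """Fraction of sampled stories whose true prompt outscores every distractor prompt."""
    if len(pairs) <= num_distractors:
        raise ValueError("need more pairs than distractors")

    rng = random.Random(seed)  # fixed seed so the sampled examples are reproducible
    n = min(num_examples, len(pairs))
    sampled = rng.sample(range(len(pairs)), n)

    correct = 0
    for i in sampled:
        true_prompt, story = pairs[i]
        # Draw distractor prompts from all other examples.
        others = [j for j in range(len(pairs)) if j != i]
        distractors = [pairs[j][0] for j in rng.sample(others, num_distractors)]

        true_score = score(true_prompt, story)
        # Ties count as incorrect: the true prompt must score strictly highest.
        if all(true_score > score(p, story) for p in distractors):
            correct += 1
    return correct / n
```

Under these assumptions, a scorer that ranks prompts at random would be expected to land near 1 / (num_distractors + 1), so differences in the number of distractors, the tie-breaking rule, or how examples are sampled could all shift the reported accuracy.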
Thanks a lot!
Aneesh