[45] Ambitious P-Hacking and P-Curve 4.0
Data Colada, 2016-01-14 (https://datacolada.org/45)

In this post, we first consider how plausible it is for researchers to engage in more ambitious p-hacking (i.e., past the nominal significance level of p<.05). Then, we describe how we have modified p-curve (see app 4.0: http://www.p-curve.com/app4) to deal with this possibility.

**Ambitious p-hacking is hard.**
In "False-Positive Psychology" (http://papers.ssrn.com/sol3/papers.cfm?abstract_id=1850704), we simulated the consequences of four (at the time acceptable) forms of p-hacking.
We found that the probability of finding a statistically significant result (p<.05) skyrocketed from the nominal 5% to 61%.

[Figure 1: https://datacolada.org/wp-content/uploads/2016/01/f1.png]

For a recently published paper, "Better P-Curves" (http://urisohn.com/sohn_files/wp/wordpress/wp-content/uploads/2019/01/better-p-curves-published.pdf), we modified those simulations to see how hard it would be for p-hackers to keep going past .05. We found that the amount of p-hacking needs to increase exponentially to obtain smaller and smaller p-values.
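A toy version of one p-hacking lever, testing several dependent variables and reporting whichever comes out significant, already shows the basic inflation. This sketch uses only Python's standard library; the settings (3 DVs, n = 30 per cell, a z approximation in place of the t-test) are our own illustrative assumptions, not the paper's simulation design.

```python
import random
from statistics import NormalDist

random.seed(1)

def z_test_p(a, b):
    """Two-sided p-value for a difference in means (normal approximation)."""
    n = len(a)
    mean_diff = sum(a) / n - sum(b) / n
    # Pooled variance from the two samples' sums of squared deviations.
    ss_a = sum(x * x for x in a) - sum(a) ** 2 / n
    ss_b = sum(x * x for x in b) - sum(b) ** 2 / n
    se = (((ss_a + ss_b) / (2 * n - 2)) * 2 / n) ** 0.5
    return 2 * NormalDist().cdf(-abs(mean_diff / se))

def one_phacked_study(n=30, n_dvs=3):
    """True effect is zero; report 'significant' if ANY of n_dvs works."""
    control = [[random.gauss(0, 1) for _ in range(n)] for _ in range(n_dvs)]
    treat = [[random.gauss(0, 1) for _ in range(n)] for _ in range(n_dvs)]
    return any(z_test_p(a, b) < 0.05 for a, b in zip(treat, control))

rate = sum(one_phacked_study() for _ in range(2000)) / 2000
# With 3 independent shots at p < .05, the false-positive rate roughly
# triples the nominal 5% (about 1 - .95**3 = 14%).
```

More levers (optional stopping, covariates, dropping conditions) compound in the same way, which is how the original paper's four levers reached 61%.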
For instance, once a nonexistent effect has been p-hacked to p<.05, a researcher would need to attempt nine times as many analyses to achieve p<.01.

[Figure 2: https://datacolada.org/wp-content/uploads/2016/01/F2.png]

Moreover, as Panel B shows, because there is a limited number of alternative analyses one can run (96 in our simulations), ambitious p-hacking often fails.[1]

**P-Curve and ambitious p-hacking**
P-curve is a tool that allows you to diagnose the evidential value of a set of statistically significant findings. It is simple: you plot the significant p-values of the statistical tests of interest to the original researchers and look at the curve's shape. If your p-curve is significantly right-skewed, then the literature you are examining has evidential value.
If it's significantly flat or left-skewed, then it does not.

In the absence of p-hacking, there is, by definition, a 5% chance of mistakenly observing a significantly right-skewed p-curve when one is in fact examining a literature full of nonexistent effects. Thus, p-curve's false-positive rate is 5%.

However, when researchers p-hack trying to get p<.05, that probability drops quite a bit, because p-hacking causes p-curve to be left-skewed in expectation, making it harder to (mistakenly) observe a right skew. Thus, literatures studying nonexistent effects through p-hacking have less than a 5% chance of obtaining a right-skewed p-curve.

But if researchers get ambitious and keep p-hacking past .05, the barely significant results start disappearing, and so p-curve starts showing a spurious right skew. Intuitively, the ambitious p-hacker will eliminate the .04s and push past them to get more .03s or .02s. The resulting p-curve starts to look artificially good.

**Updated p-curve app, 4.0, is robust to ambitious p-hacking**
In "Better P-Curves" (http://urisohn.com/sohn_files/wp/wordpress/wp-content/uploads/2019/01/better-p-curves-published.pdf) we introduced a new test for evidential value that is much more robust to ambitious p-hacking. The new app (http://www.p-curve.com/app4/) incorporates it (it also computes confidence intervals for power estimates, among many other improvements; see the summary: http://www.p-curve.com/app4/versions.php).

The new test focuses on the "half p-curve": the distribution of p-values that are p<.025.
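Under the null of no effect, p-values that clear a significance cutoff are uniform on (0, cutoff), so dividing by the cutoff renormalizes them to uniform on (0, 1); p-curve's tests are built on such renormalized values. A hypothetical helper sketching how the full (p<.05) and half (p<.025) curves select and renormalize p-values:

```python
def pp_values(p_values, cutoff):
    """Keep p-values below the cutoff and rescale them to (0, 1).

    Under the null, the result is uniform on (0, 1), which is what the
    skewness tests exploit. This helper is our own illustration.
    """
    return [p / cutoff for p in p_values if p < cutoff]

ps = [0.004, 0.011, 0.024, 0.041]
full = pp_values(ps, 0.05)   # uses every significant result
half = pp_values(ps, 0.025)  # drops the barely significant .041
```

Note how the half curve simply discards the barely significant result, which is exactly the region ambitious p-hacking distorts.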
On the one hand, because the half p-curve does not include barely significant results, it has a lower probability of mistaking ambitious p-hacking for evidential value. On the other hand, dropping observations makes the half p-curve less powerful, so it has a higher chance of failing to recognize actual evidential value.

Fortunately, by combining the full and half p-curves into a single analysis, we obtain inferences that are robust to ambitious p-hacking with minimal loss of power.

The new test of evidential value: *A set of studies is said to contain evidential value if either the half p-curve's right-skew test gives p<.05, or both the full and half p-curves' right-skew tests give p<.1.*[2]

In the figure below we compare the performance of this new combination test with that of the full p-curve alone (the "old" test). The top three panels show that both tests are similarly powered to detect true effects. Only when the original research is underpowered, at 33%, is the difference noticeable, and even then it seems acceptable.
With just 5 p-values, the new test still has more power than the underlying studies do.

[Figure 3: https://datacolada.org/wp-content/uploads/2016/01/f3.png]

The bottom panels show that moderately ambitious p-hacking fully invalidates the "old" test, but the new test is unaffected by it.[3]

We believe that these revisions to p-curve, incorporated in the updated app (http://www.p-curve.com/app4), make it much harder to falsely conclude that a set of ambitiously p-hacked results contains evidential value.
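The combination rule quoted above can be sketched in code. The right-skew test here is a Stouffer-style aggregation of renormalized p-values, one of the tests p-curve uses; treat this as a minimal illustration under our own simplifications, not the app's exact implementation.

```python
from math import sqrt
from statistics import NormalDist

def right_skew_p(p_values, cutoff):
    """One-sided p-value testing whether significant results pile up near zero."""
    pp = [p / cutoff for p in p_values if p < cutoff]
    if not pp:
        return 1.0  # no usable p-values, so no evidence of right skew
    z = [NormalDist().inv_cdf(x) for x in pp]  # uniform -> standard normal
    stouffer = sum(z) / sqrt(len(z))
    return NormalDist().cdf(stouffer)          # small when skewed right

def has_evidential_value(p_values):
    """The combination rule: half curve at .05, or both curves at .1."""
    p_full = right_skew_p(p_values, 0.05)
    p_half = right_skew_p(p_values, 0.025)
    return p_half < 0.05 or (p_full < 0.10 and p_half < 0.10)

strong = has_evidential_value([0.001, 0.002, 0.003, 0.01, 0.02])
barely = has_evidential_value([0.040, 0.045, 0.048])
```

A set of clearly small p-values passes; a set consisting only of barely significant results, the signature of ambitious p-hacking on a nonexistent effect, has an empty half curve and fails.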
As a consequence, the incentives to ambitiously p-hack are even lower than they were before.
**Footnotes.**
1. This is based on simulations of what we believe to be realistic combinations and levels of p-hacking. The results will vary depending on the types and levels of p-hacking.
2. As with all cutoffs, it only makes sense to use these as points of reference. A half p-curve with p=.051 is nearly as good as one with p=.049, and both tests with p<.001 are much stronger than both tests with p=.099.
3. When the true effect is zero and researchers do not p-hack (an unlikely combination), the probability that the new test leads to concluding the studies contain evidential value is 6.2% instead of the nominal 5%.
R Code: https://osf.io/mbw5g/