A few weeks ago we spoke with Joe Simmons from Wharton about his p-curve methodology for reviewing papers. A p-curve is a graph of the statistically significant p-values reported in a paper. Ideally, the significant p-values bunch up close to zero. If they instead tend to bunch up close to .05, it is a sign that the authors may have been working a bit too hard to reach significance, indicating possible "p-hacking". I found the method fascinating, so I've tried to apply some of that same thinking to a class assignment where we were asked to evaluate a program of study. Here is the paper I evaluated:
Hong, J., & Chang, H. H. (2015). “I” Follow My Heart and “We” Rely on Reasons: The Impact of Self-Construal on Reliance on Feelings versus Reasons in Decision Making. Journal of Consumer Research, 41(6), 1392-1411.
Note: This is not meant to be a takedown of this article. As a first-year PhD student, I'm conducting this analysis to fulfill a class assignment and to learn about research.
My methodology for this evaluation is basically to look at the core findings of each study and see how statistically solid the results are (how far the p-values fall below p=.05; further is better).
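To make the p-curve idea concrete, here is a minimal sketch of the tally behind a p-curve. It is in Python rather than R, and the p-values are made up purely for illustration; a real p-curve keeps only the significant values and bins them.

```python
# Minimal p-curve sketch: tally only the significant p-values (p < .05)
# into bins. These p-values are made up for illustration.
pvals = [0.003, 0.012, 0.021, 0.034, 0.041, 0.044, 0.048, 0.062, 0.110]

significant = [p for p in pvals if p < 0.05]

# Standard p-curve bins: (0, .01], (.01, .02], ..., (.04, .05]
bins = {}
for upper in (0.01, 0.02, 0.03, 0.04, 0.05):
    lower = upper - 0.01
    bins[upper] = sum(1 for p in significant if lower < p <= upper)

# Evidential value looks like counts piling up near zero;
# p-hacking looks like counts piling up near .05.
print(bins)  # {0.01: 1, 0.02: 1, 0.03: 1, 0.04: 1, 0.05: 3}
```

With these made-up numbers the counts pile up in the (.04, .05] bin, which is the worrying right-skewed shape.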
I had a lot of fun with study 1a because there was enough information in the paper to derive the actual data:
Apartment Choice. A chi-square test revealed a significant effect of self-construal on participants’ apartment choice (χ²(1) = 4.21, p < .05). As predicted, participants primed with an independent self-construal were more likely to choose the affectively superior apartment (55.2%) than those primed with an interdependent self-construal (29.0%). Given that the decision task involved relative preferences, these results also suggest that participants primed with an interdependent self-construal were more likely to choose the cognitively superior option (71.0%) than those primed with an independent self-construal (44.8%).
On the prior page, the authors state that n=60 for this experiment. With 30 in each condition, the percentages above don't break down into whole numbers, so I played with it a bit and figured out that there must have been 29 people in the independent condition and 31 in the interdependent condition.
The R code I used to recreate the chi-squared table is in the gist linked at the end of this post.
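For a cross-check outside R, the same 2x2 table can be rebuilt in a few lines of Python. This is just a sketch for illustration; the counts follow from the reported percentages and the group sizes of 29 and 31 derived above.

```python
# Reconstructing the 2x2 table from the reported percentages and n = 60.
# 29 in the independent condition and 31 in the interdependent condition
# (derived above); the cell counts must reproduce 55.2% and 29.0%.
n_indep, n_interdep = 29, 31

affective_indep = round(0.552 * n_indep)        # 16 of 29 = 55.2%
affective_interdep = round(0.290 * n_interdep)  # 9 of 31 = 29.0%

table = [
    [affective_indep, n_indep - affective_indep],           # independent: affective, cognitive
    [affective_interdep, n_interdep - affective_interdep],  # interdependent: affective, cognitive
]
print(table)  # [[16, 13], [9, 22]]
```

So the independent condition splits 16/13 and the interdependent condition splits 9/22 between the affectively and cognitively superior apartments.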
When I ran a chi-squared test on this data, I got a p-value of .07, so at first I thought I had done it wrong; it turns out that R's chisq.test() applies Yates's continuity correction by default. Without the correction, I got the exact same χ²(1) = 4.21 that the authors report. Then I ran a couple of other tests (Fisher's exact test and a Markov chain simulation) to have a few more points of comparison, borrowing the methodology from William King.
The p-values for the above tests (in the same order) are .073, .040, .066, and .069. These values are fairly consistent, except for the .040 obtained by the authors by not using any sort of continuity correction. I think it likely that the authors' statistical software defaults to no continuity correction.
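For readers who want to verify these numbers outside R, here is a sketch using Python's scipy that reproduces the chi-squared test, with and without the correction, plus Fisher's exact test (the Markov chain simulation is left out here):

```python
from scipy.stats import chi2_contingency, fisher_exact

# 2x2 table derived from the paper: rows = independent/interdependent
# condition, columns = affective/cognitive apartment choice.
table = [[16, 13], [9, 22]]

# Authors' test: Pearson chi-square without continuity correction.
chi2_raw, p_raw, _, _ = chi2_contingency(table, correction=False)

# Same test with Yates's continuity correction (R's default).
_, p_yates, _, _ = chi2_contingency(table, correction=True)

# Fisher's exact test, two-sided.
_, p_fisher = fisher_exact(table)

print(round(chi2_raw, 2), round(p_raw, 3))  # 4.21 0.04
print(round(p_yates, 3))                    # 0.073
print(round(p_fisher, 3))                   # ~.066, matching fisher.test in R
```

Same story as in R: only the uncorrected test clears the .05 bar.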
Just for fun, I wanted to graphically compare the four different tests: the authors' test and my three. I also wanted to see what would happen if a single participant changed their mind. In the plot, the authors' .040 p-value is shown in red. The x-axis shows what happens to the significance if a single subject in the independent condition votes the other way. You can see that this study is barely significant: if one participant switched from the affectively superior to the cognitively superior apartment, the result would be non-significant even by the authors' less stringent test, while a switch in the other direction would make the result significant by every method. The black line is the chi-squared test without continuity correction, which the authors used. The other three lines are the Markov chain simulation, Fisher's exact test, and Yates's continuity correction; Yates's is the top line in green, and the other two are virtually identical.
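The plot itself is in R (see the gist below), but the underlying sensitivity check is easy to sketch. Here it is in Python with scipy, shifting one independent-condition participant between the two apartments and recomputing the chi-squared p-values:

```python
from scipy.stats import chi2_contingency

def p_values(shift):
    """Move `shift` independent-condition participants from the
    cognitively superior to the affectively superior apartment
    (negative = the other direction), then recompute both p-values."""
    table = [[16 + shift, 13 - shift], [9, 22]]
    _, p_raw, _, _ = chi2_contingency(table, correction=False)   # authors' test
    _, p_yates, _, _ = chi2_contingency(table, correction=True)  # Yates-corrected
    return p_raw, p_yates

for shift in (-1, 0, 1):
    p_raw, p_yates = p_values(shift)
    print(shift, round(p_raw, 3), round(p_yates, 3))
# shift = -1: non-significant even without the correction
# shift =  0: the reported result (p ~ .040 uncorrected, .073 corrected)
# shift = +1: significant by both versions of the test
```

A single participant's choice flips the headline result across the .05 line in either direction.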
If you want the R code for the above plot, you can get all of my code from this gist: https://gist.github.com/aaroncharlton/b1ce016b7ea320423e93
I made the same plot for the interdependent condition and got essentially the same result: non-significant by every method but the authors', and a one-unit change would make it non-significant even by their method.
Study 1b

I didn't find enough information here to recreate their tests, so I skipped this study.
Study 2

Once again, I had to derive the exact p-values because the paper reported them only as p<.05. In study 2, the interaction effect was safely significant (p=.013), but the planned contrast was barely significant (p=.048). Just like study 1a, it's so close to not being significant that…I'll just leave it at that. My code is in the gist linked above.
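Deriving an exact p-value from a reported test statistic is just a matter of the distribution's survival function. Here is a sketch in Python with scipy; the F value and degrees of freedom below are hypothetical placeholders, not the paper's actual numbers.

```python
from scipy.stats import f

# Deriving an exact p-value from a reported F statistic.
# F = 4.05 and df = (1, 120) are HYPOTHETICAL placeholders for
# illustration, not the values from the paper.
f_stat, df1, df2 = 4.05, 1, 120
p = f.sf(f_stat, df1, df2)  # survival function = 1 - CDF
print(round(p, 3))
```

With these placeholder inputs the p-value lands just under .05, which is exactly the kind of borderline result this whole exercise is about.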
Study 3: Willingness to Pay
I couldn't find any issues with study 3. The stats looked good to me.
Study 4

This one looked good too, but it seems like they were testing some fairly obvious predictions here.
Study 5

I didn't have to derive the p-values in study 5 because they were mostly given. Unfortunately, there were a dozen or so p-values that were all barely significant or just short of it (ranging from p=.04 to p=.07). Everything was right on the edge.
If you find any problems with my analysis please let me know.
In a follow-up post, I compare the different types of chi-square tests and find that the method the authors used was acceptable, and that their results were not likely due to random chance in this case.