Don't Be Fooled by Flimsy Results

We are used to reading reports claiming that a study was done, and the participants showed the effect. The participants in the experimental group outperformed those in the control group. Often the report adds that the result was statistically significant, perhaps at the .05 level or even at the .01 level.

So it is easy to form the impression that all the participants showed the effect. However, that would be wrong.

In tests of statistical significance, the statistic behind the p-value (like an effect size) is essentially the difference between group means divided by (or expressed in terms of) the standard deviation. Statistical significance is with respect to group averages. So to what extent does a difference between group means represent the differences between individuals?
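To make the arithmetic concrete, here is a minimal Python sketch. The scores are made up for illustration and are not the data from the study discussed later, and the function names are ours. Assuming two independent groups with a pooled standard deviation, it computes Cohen's d (the difference in means over the pooled standard deviation) and the t statistic (the same difference over its standard error).

```python
import math
import statistics

def pooled_sd(a, b):
    """Pooled standard deviation of two independent samples."""
    na, nb = len(a), len(b)
    va, vb = statistics.variance(a), statistics.variance(b)
    return math.sqrt(((na - 1) * va + (nb - 1) * vb) / (na + nb - 2))

def cohens_d(a, b):
    """Difference in means expressed in pooled-SD units."""
    return (statistics.mean(a) - statistics.mean(b)) / pooled_sd(a, b)

def t_statistic(a, b):
    """Difference in means divided by its standard error."""
    se = pooled_sd(a, b) * math.sqrt(1 / len(a) + 1 / len(b))
    return (statistics.mean(a) - statistics.mean(b)) / se

# Hypothetical percent-correct scores, invented for illustration.
experimental = [52, 61, 58, 70, 66, 49, 73, 64]
control = [50, 47, 55, 60, 44, 52, 58, 49]

d = cohens_d(experimental, control)
t = t_statistic(experimental, control)
print(f"Cohen's d = {d:.2f}, t = {t:.2f}")
```

Both numbers summarize the two groups as wholes. Neither says how many individual participants in the experimental group actually did better than a typical control participant.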

Perhaps the participants all showed the effect, but only to a very small extent, and the large number of participants was sufficient to produce statistical significance. Statisticians have worried about this possibility and have invented measures of effect size to clarify the finding. Unfortunately, few reports, especially in the media, include any details about effect size, probably because when you add more details like this, you just muddy the picture and confuse the reader.
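The dependence on sample size can be sketched in a few lines of Python. The numbers are hypothetical: a fixed, very small effect (Cohen's d of 0.1) and two equal groups of n participants each. For equal groups, t equals d times the square root of n/2. The p-value below uses a normal approximation to the t distribution, which is our simplifying assumption and is only reasonable when n is large.

```python
import math

def approx_two_tailed_p(d, n_per_group):
    """Approximate two-tailed p for effect size d with n per group.

    Uses t = d * sqrt(n / 2) and a normal approximation to the
    t distribution (an assumption; adequate only for large n).
    """
    t = d * math.sqrt(n_per_group / 2)
    return math.erfc(abs(t) / math.sqrt(2))

small_effect = 0.1  # hypothetical Cohen's d, chosen for illustration
for n in (30, 300, 3000):
    p = approx_two_tailed_p(small_effect, n)
    print(f"n per group = {n:5d}  p is about {p:.4f}")
```

The effect size is identical on every row. Only the number of participants changes, and with enough of them the same tiny effect crosses the conventional significance thresholds.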

Perhaps only some of the participants showed the effect, but showed it to a large extent, balancing out those who did not show any effect at all. Simple measures of variability will show if this might be the case, but again, many lay readers will not understand or be interested in the meaning of variability, and most of the reports they receive won't include standard deviations.
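A small Python sketch shows why the mean alone cannot distinguish the two cases. The gain scores are invented for illustration: in one group every participant improves by 5 points, while in the other 2 of 10 participants improve by 25 points and the rest do not improve at all.

```python
import statistics

# Hypothetical gains over a baseline, invented for illustration.
everyone_a_little = [5] * 10
few_a_lot = [25, 25] + [0] * 8

for label, gains in (("everyone a little", everyone_a_little),
                     ("a few a lot", few_a_lot)):
    mean = statistics.mean(gains)
    sd = statistics.stdev(gains)
    share = sum(g > 0 for g in gains) / len(gains)
    print(f"{label}: mean={mean:.1f} sd={sd:.1f} share improved={share:.0%}")
```

Both groups have the same mean gain. The standard deviation and the share of participants who improved are what separate them.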

So let's try another approach: making it very easy for readers to grasp how pervasive an effect is.

To explore this issue, we obtained data from an actual funded study (not conducted by us) comparing two groups in terms of performance at a task that was either aided by an AI system (experimental group) or not aided by an AI system (control group). There were 30 participants in each group, a reasonable number. This particular data set was selected because the distributions appear approximately normal (using the “eyeball” test).

You can see the results below. The distribution for the experimental condition is shown in blue and the distribution for the control condition is shown in orange. The figure shows that the two distributions overlap a great deal.

Overlapping Distributions for Experimental and Control Groups.

Source: Robert Hoffman

So there’s an effect here, and it is statistically significant: p<.001, using a two-tailed t-test. But it certainly isn’t universal. We need to keep that in mind when we discuss findings such as these.

But what would it take to make this significant effect at the p<.001 level disappear?

The Peelback Method

We can progressively peel away the extremes. First, we removed the data for the two participants in the experimental group who scored the highest and the one participant in the control group who scored the lowest.

Bingo.

The proportion of correct responses in the experimental group dropped from 65% to 54%, and the proportion correct in the control group increased a little, from 48% to 49%, and now the t-test shows p<.334. Not even close to being counted as statistically significant. So the initial “effect” doesn’t seem very robust.

If the statistical significance were still achieved after this first peel, we could keep peeling and recomputing until the p<.05 level was crossed. We might find that we had to do a lot of peeling. In that case, we would have much more confidence in the conclusion about the statistical significance. But if statistical significance disappears merely by dropping three of the 60 participants, then how seriously can we take the results? Or how seriously should we take the t-test?

This general method could be turned into an actual proportional “metric,” that is, the number of “peeling steps” relative to the total sample size. In the present case, that number is 3/60. The smaller that number, the more tenuous the statistical effect.
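The peeling steps and the ratio described above can be sketched in Python. This is an illustration under stated assumptions, not the procedure used on the study data. The scores are hypothetical, the function names are ours, the experimental group is assumed to have the higher mean, and the order of removal (alternating between the highest experimental score and the lowest control score, recomputing after each removal) is our choice since the text does not fix one. The t-test is a pooled-variance two-tailed test, with the p-value obtained by numerically integrating the t density so that only the standard library is needed.

```python
import math
import statistics

def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    log_c = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
             - 0.5 * math.log(df * math.pi))
    return math.exp(log_c - (df + 1) / 2 * math.log1p(x * x / df))

def two_tailed_p(a, b, steps=2000):
    """Two-tailed p for a pooled-variance t-test (Simpson's rule)."""
    na, nb = len(a), len(b)
    df = na + nb - 2
    pooled_var = ((na - 1) * statistics.variance(a)
                  + (nb - 1) * statistics.variance(b)) / df
    se = math.sqrt(pooled_var * (1 / na + 1 / nb))
    diff = abs(statistics.mean(a) - statistics.mean(b))
    if se == 0:  # degenerate case: no variability in either group
        return 1.0 if diff == 0 else 0.0
    t = diff / se
    h = t / steps  # steps must be even for Simpson's rule
    area = t_pdf(0, df) + t_pdf(t, df)
    for i in range(1, steps):
        area += (4 if i % 2 else 2) * t_pdf(i * h, df)
    return max(0.0, 1 - 2 * area * h / 3)

def peelback(experimental, control, alpha=0.05):
    """Count removals needed before p rises to alpha or above."""
    exp, ctl = sorted(experimental), sorted(control)
    total = len(exp) + len(ctl)
    removed = 0
    while two_tailed_p(exp, ctl) < alpha and min(len(exp), len(ctl)) > 2:
        if removed % 2 == 0:
            exp.pop()    # drop the highest experimental score
        else:
            ctl.pop(0)   # drop the lowest control score
        removed += 1
    return removed, removed / total

# Hypothetical percent-correct scores, invented for illustration.
experimental = [48, 49, 50, 51, 52, 53, 54, 70, 72, 75]
control = [25, 28, 30, 46, 47, 48, 49, 50, 51, 52]

steps_needed, ratio = peelback(experimental, control)
print(f"peeling steps: {steps_needed}, ratio to sample size: {ratio:.2f}")
```

A ratio near zero means the significant result rests on a handful of participants. A large ratio means it survives the loss of many extreme scores.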

Side note: We did not cherry-pick this example. We simply wanted a simple data set that yielded statistically significant results. We had no idea in advance that the example would illustrate our thesis so well.

Conclusion

For lay readers of psychology research reports, the peelback method might be much easier to grasp than other kinds of statistics such as effect sizes.

Researchers themselves might find it a useful exercise to explore and examine the peelback method for their own experiments. If they have the courage. Researchers could then consider which participants were responsible for the findings and what these participants were like.

But let’s think again about what the peeling entails. Some statistics textbooks refer to the problem of “outliers,” and even present procedures enabling researchers to justify the removal of data from outliers. The concept seems to be that “outliers” just add noise to the data, hiding the “true” effects.

The highest-performing participants in an experimental group (in studies like the one referenced here) demonstrate what is humanly possible. The worst-performing participants in both the experimental and the control groups might be pointers to issues of motivation or selection. The “best” and the “worst” performers are the individuals who should be studied in greater detail, say, by in-depth post-experimental cognitive interviews.

Unfortunately, many psychological studies do not include in-depth post-experimental cognitive interviews as a key part of their method, and even when interviews are conducted, the results are usually given short shrift in the research reports. Data about what people are thinking will always clarify the meaning and “significance” of the numerical results.

So this little exploration of a simple idea exposes some substantive issues and traps in research methodology. Not the least of these is a cautionary tale, to never confuse a statistical effect (about groups) with a causal effect an independent variable might have on individuals.

Robert Hoffman is a co-author for this post.
