The second part of my two-part blog post at PLOS Mind the Brain involves assisting readers to do some debunking of bad neuroscience for themselves. The particular specimen is neurononsense intended to promote emotionally focused psychotherapy (EFT) to the unwary. A promotional video and press releases drawing upon a PLOS One article were aimed at wowing therapists seeking further training and CE credits. The dilemma is that most folks are not equipped with the basic neuroscience to detect neurobollocks. Near the end of the blog, I provide some basic principles for cultivating skepticism about bad neuroscience. Namely,
- Multiple statistical tests performed with large numbers of fMRI data points from small numbers of subjects. Results capitalize on chance and probably will not generalize.
- The new phrenology: claims that complex mental functions are localized in single regions of the brain, so that a difference in that mental function can be inferred from a specific finding for that region.
- Glib interpretations of what it means when a particular region of the brain is activated. Activation may simply mean that certain mental processes are occurring. Among other things, it could simply mean that these processes are now taking more effort.
- Claims that changes in activation observed in fMRI data represent changes in the structure of the brain or mental processes. Function does not equal structure.
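To put a rough number on the first of these points, here is a minimal sketch of how quickly chance findings pile up as tests multiply. It assumes the tests are independent, which fMRI comparisons generally are not, so treat it as an illustration of the trend rather than an estimate for any particular study. The function name is my own.

```python
def chance_of_false_positive(num_tests, alpha=0.05):
    """Probability of at least one false positive among independent tests
    when every null hypothesis is actually true."""
    return 1 - (1 - alpha) ** num_tests


# How the risk grows with the number of tests (independence is an assumption).
for k in (1, 5, 20, 100):
    print(f"{k:>3} tests: {chance_of_false_positive(k):.2f}")
```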
But mainly, I guided readers through the article calling attention to anomalies and just plain weirdness at the level of basic numbers and descriptions of procedures. Some of my points were rather straightforward, but some may need further explanation or documentation. That is why I have provided this auxiliary blog.
The numbers below correspond to footnotes embedded in the text of the Mind the Brain blog post.
1. Including or excluding one or two participants can change results.
Many of the analyses depended on correlation coefficients. For a sample of 23, a correlation of at least .41 is required for significance at the .05 level. To get a sense of how adding or leaving out a few subjects can affect results, look at the scatterplots below.
The first has 28 data points and a correlation of -.272. The second plot adds three data points, none of them particularly outlying, and the correlation jumps to -.454.
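Readers who want to check these numbers can do so with a few lines of Python. The sketch below uses the Fisher z approximation to get the smallest correlation that reaches two-tailed significance, which is slightly different from the exact t-based value but agrees to two decimals for n = 23. The data in the second half are randomly generated by me for illustration; they are not the data behind the scatterplots or anything from the PLOS One article.

```python
import math
import random
from statistics import NormalDist


def critical_r(n, alpha=0.05):
    """Approximate smallest |r| reaching two-tailed significance at alpha,
    using the Fisher z approximation (standard error = 1 / sqrt(n - 3))."""
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    return math.tanh(z_crit / math.sqrt(n - 3))


def pearson_r(xs, ys):
    """Plain Pearson correlation coefficient."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


print(f"critical r for n = 23: {critical_r(23):.2f}")

# Hypothetical data: 28 points with a weak negative relationship.
rng = random.Random(0)
xs = [rng.gauss(0, 1) for _ in range(28)]
ys = [-0.3 * x + rng.gauss(0, 1) for x in xs]

# Three made-up extra points, well within the range of the rest.
extra = [(1.5, -1.5), (-1.5, 1.5), (1.8, -1.2)]
xs_more = xs + [x for x, _ in extra]
ys_more = ys + [y for _, y in extra]

print(f"r with 28 points: {pearson_r(xs, ys):.3f}")
print(f"r with 31 points: {pearson_r(xs_more, ys_more):.3f}")
```

Try changing the three extra points, or dropping three at random, and watch how far the coefficient moves relative to the significance cutoff.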
2. There is some evidence this could have occurred after initial results were known.
The article notes:
A total of 35 couples completed the 1.5 hour pre-therapy fMRI scan. Over the course of therapy, 5 couples either became pregnant, started taking medication, or revealed a history of trauma which made them no longer eligible for the study. Four couples dropped out of therapy and therefore did not complete the post EFT scan, two couples were dropped for missing data, and one other was dropped whose overall threat-related brain activation in a variety of regions was an extreme a statistical outlier (e.g., greater than three standard deviations below the average of the rest of the sample).
I am particularly interested in the women who revealed a history of trauma after the initial fMRI. When did they reveal it? Did disclosure occur in the course of therapy?
If the experiment had the rigor of a clinical trial as the authors claim, results for all couples would be retained, analogous to what is termed an “intention-to-treat analysis.”
There are clinical trials that started with more patients per cell in which dropping or retaining just a few patients affected the overall significance of results. Notable examples are Fawzy et al., who turned a null trial into a positive one by dropping three patients, and Classen et al., in which results of a trial with 353 participants are significant or not depending on whether one patient is excluded.
3. Any positive significant findings are likely to be false, and of necessity, significant findings will be large in magnitude, even when they are false positives.
A good discussion of why significant findings from underpowered trials are likely to be false can be found here. Significant findings from small numbers of participants are larger because larger effect sizes are required to reach significance.
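A small simulation makes the inflation concrete. The sketch below is my own illustration, not an analysis of the study's data: it assumes a modest true correlation of .20, draws many imaginary studies of 23 participants each, and compares the average correlation across all studies with the average among only those that reach significance. The cutoff again uses the Fisher z approximation, and the true correlation, number of studies, and seed are arbitrary choices.

```python
import math
import random
from statistics import NormalDist, mean


def pearson_r(xs, ys):
    """Plain Pearson correlation coefficient."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def simulate(true_rho=0.2, n=23, studies=2000, alpha=0.05, seed=1):
    """Return (all correlations, correlations that reached significance)."""
    rng = random.Random(seed)
    # Approximate significance cutoff for |r| (Fisher z approximation).
    r_crit = math.tanh(NormalDist().inv_cdf(1 - alpha / 2) / math.sqrt(n - 3))
    noise_sd = math.sqrt(1 - true_rho ** 2)
    all_rs, significant = [], []
    for _ in range(studies):
        xs = [rng.gauss(0, 1) for _ in range(n)]
        ys = [true_rho * x + noise_sd * rng.gauss(0, 1) for x in xs]
        r = pearson_r(xs, ys)
        all_rs.append(r)
        if abs(r) >= r_crit:
            significant.append(r)
    return all_rs, significant


all_rs, significant = simulate()
print(f"share of studies reaching significance: {len(significant) / len(all_rs):.2f}")
print(f"mean r across all studies:              {mean(all_rs):.2f}")
print(f"mean |r| among significant studies:     {mean([abs(r) for r in significant]):.2f}")
```

Whatever the true effect, every study that clears the bar must report a correlation of roughly .41 or more in absolute value, so the significant results overstate it.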
4. They stated that they recruited couples with the criterion that their marital dissatisfaction scores initially be between 80 and 96 on the DAS. They then report that the initial mean DAS score was 81.2 (SD=14.0). Impossible.
Couples with mild to moderate marital distress are quite common in the general population to which advertisements were directed. It is statistically improbable that they recruited from such a pool and obtained a mean score of 81.2. Furthermore, with a lower bound of 80, it makes no sense that if the mean score was 81.2, there would be a standard deviation of 14. This is overall a very weird distribution if we accept what they say.
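The mismatch can be checked with simple arithmetic. For scores confined to a range, the Bhatia-Davis inequality caps the variance at (upper bound minus mean) times (mean minus lower bound). The sketch below applies that cap to the stated eligibility range and reported mean. The optional sample-size correction uses n = 23, the sample size mentioned in note 1; I am assuming, not asserting, that the reported mean refers to that sample. The function name is mine.

```python
import math


def max_sd_given_bounds(lower, upper, mean_score, n=None):
    """Largest SD possible for scores confined to [lower, upper] with the
    given mean (Bhatia-Davis inequality). If n is supplied, apply the
    n / (n - 1) correction used for a sample SD."""
    if not lower <= mean_score <= upper:
        raise ValueError("mean must lie inside the allowed range")
    max_var = (upper - mean_score) * (mean_score - lower)
    if n is not None:
        max_var *= n / (n - 1)
    return math.sqrt(max_var)


# Stated eligibility range 80-96, reported mean 81.2.
print(f"max population SD:      {max_sd_given_bounds(80, 96, 81.2):.2f}")
print(f"max sample SD (n = 23): {max_sd_given_bounds(80, 96, 81.2, n=23):.2f}")
```

Either way the ceiling is a little over 4, nowhere near the reported 14.0, so the stated range, mean, and SD cannot all be right.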
5. The amount of therapy that these wives received (M = 22.9, range = 13-35) was substantially more than what was provided in past EFT outcome studies. Whatever therapeutic gains were observed in the sample could not be expected to generalize to past studies.
Past outcome studies of EFT have provided 8 to 12 sessions of EFT with one small dissertation study providing 15 sessions.
6. The average couple finishing the study still qualified for entering it.
Mean DAS scores after EFT was declared completed were 96.0 (SD =17.2). In order to enroll in the study, couples had to have DAS scores 97 or less.
7. No theoretical or clinical rationale is given for not studying husbands or presenting their data as well.
Jim Coan’s video presentation suggests that he was inspired to do this line of research by observing how a man in individual psychotherapy for PTSD was soothed by his wife in the therapy sessions after the man requested that she be present. There is nothing in the promotional materials associated with either the original Coan study or the present one to indicate that fMRI would be limited to wives.
Again, if the studies really had the rigor of a clinical trial as claimed by the authors, the exclusive focus on wives’ rather than husbands’ fMRI would have been pre-specified in the registration of the study. There are no registrations for the studies.
8. The size of many differences between results characterized as significant versus nonsignificant is not itself statistically significant.
With a sample size of 23, let’s take a correlation coefficient of .40, which just misses statistical significance. A correlation of about .80 (p < .001) is required to be significantly larger than .40 (p > .05). So, many “statistically significant” findings are not significantly larger than correlations that were ignored as not significant. This highlights the absurdity of simply tallying up differences that reach the threshold of significance, particularly when no confidence intervals are provided.
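Here is a minimal sketch of that comparison using the standard Fisher z-test for the difference between two correlations. One caveat: the test treats the two correlations as coming from independent samples of 23, whereas correlations computed on the same women are not independent, so this is a rough check rather than the exact test one would run on the study's data. The function name is mine.

```python
import math
from statistics import NormalDist


def compare_independent_correlations(r1, n1, r2, n2):
    """Fisher z-test for the difference between two correlations from
    independent samples. Returns (z statistic, two-tailed p value)."""
    z1, z2 = math.atanh(r1), math.atanh(r2)
    se = math.sqrt(1 / (n1 - 3) + 1 / (n2 - 3))
    z = (z2 - z1) / se
    p = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p


# A "nonsignificant" .40 against progressively larger "significant" values.
for r2 in (0.5, 0.6, 0.7, 0.8):
    z, p = compare_independent_correlations(0.40, 23, r2, 23)
    print(f".40 vs {r2:.2f}: z = {z:.2f}, p = {p:.3f}")
```

Under these assumptions, correlations of .50, .60, or .70 are each significant on their own yet not distinguishable from the .40 that got ignored.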
9. The graphic representations in Figures 2 and 4 were produced by throwing away two thirds of the available data.
Throwing away the data for 16 women leaves us with 7. These were distributed across the four lines in Figures 2 and 4, one or two to a line. Funky? Yup.