Needed: more informative and trustworthy abstracts. Recommendations for some simple reforms.

An analysis of an uninformative, seriously spun abstract chosen from PLOS One shows why we need guidelines for writing and interpreting abstracts.

  •  With so much to read and so little time, readers need to be able to screen abstracts quickly and decide whether an article is worth the further effort of retrieving it.
  •  More informative, trustworthy abstracts would be of great benefit in this process.

A Call to Action

  •  Journals should adopt, publicize, and enforce standards for writing abstracts.
  •  In the interim, authors can adopt basic standards and editors and reviewers can begin insisting on them.

From Neuroskeptic

Personally – and I speak only for myself, not any official policies of PLOS One – I’m already applying the standards and desk rejecting manuscripts that don’t comply. My decision letters explain that further consideration of a manuscript is contingent on a revision providing a more informative abstract.

Casual readers benefit from more informative abstracts, but so does everybody else.

We can think of the abstract as a point part way down the funnel from a reader encountering the title of an article, to downloading the actual paper, to eventually citing it.

For instance, in conducting a systematic review, a large number of abstracts are typically screened  in order to identify a much smaller number for more intensive review. Although there may be some spot checking on this process, the accuracy of an abstract can be decisive in determining  whether it is further examined for inclusion in a review.

Journalists often screen abstracts to choose the articles about which they will write a story. Hype and distortion in media coverage can be linked to exaggerations in an article’s abstract. It is unclear whether that is because journalists read only the abstract, or because the transparency and completeness of an abstract reliably signals the authors’ broader commitment to trustworthy reporting of their study.

Abstracts for clinical trials are increasingly accompanied by trial registration. PubMed now routinely provides trial registration information so that readers can compare the abstract to the trial registration without going to the actual article.

There is little evidence that trial registrations are routinely considered in evaluating manuscripts.  The problem starts with reviewers and editors failing even to access the trial registration.
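Accessing a registration takes only a minute and can even be scripted. Below is a minimal sketch in Python for pulling a registration’s designated outcomes so they can be compared against an abstract. It assumes the current ClinicalTrials.gov v2 REST API and its JSON field names (protocolSection, outcomesModule); those field paths are my assumption about the present schema rather than anything taken from the article, and the NCT number is the registration discussed below.

```python
# Minimal sketch: fetch a trial registration and list its designated outcomes,
# so they can be checked against what an abstract reports.
# Assumes the ClinicalTrials.gov v2 API and its current JSON field names.
import requests

nct_id = "NCT01906892"  # the registration discussed below

resp = requests.get(f"https://clinicaltrials.gov/api/v2/studies/{nct_id}", timeout=30)
resp.raise_for_status()
outcomes = resp.json()["protocolSection"]["outcomesModule"]

print("Primary outcome measures:")
for outcome in outcomes.get("primaryOutcomes", []):
    print("  -", outcome.get("measure"))

print("Secondary outcome measures:")
for outcome in outcomes.get("secondaryOutcomes", []):
    print("  -", outcome.get("measure"))
```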

In this blog post, I will show how an uninformative abstract made it difficult to assess an article published in PLOS One.

I first compare the abstract to the trial registration and then delve into the article itself. We will soon see why a more informative abstract would have led to dismissing this article out of hand.

The article

The open access article, downloadable by anybody with access to the Internet, is:

Fancourt D, Perkins R, Ascenso S, Carvalho LA, Steptoe A, Williamon A (2016) Effects of Group Drumming Interventions on Anxiety, Depression, Social Resilience and Inflammatory Immune Response among Mental Health Service Users. PLOS ONE 11(3): e0151136. doi:10.1371/journal.pone.0151136.

Abstract

Growing numbers of mental health organizations are developing community music-making interventions for service users; however, to date there has been little research into their efficacy or mechanisms of effect. This study was an exploratory examination of whether 10 weeks of group drumming could improve depression, anxiety and social resilience among service users compared with a non-music control group (with participants allocated to group by geographical location.) Significant improvements were found in the drumming group but not the control group: by week 6 there were decreases in depression (-2.14 SE 0.50 CI -3.16 to -1.11) and increases in social resilience (7.69 SE 2.00 CI 3.60 to 11.78), and by week 10 these had further improved (depression: -3.41 SE 0.62 CI -4.68 to -2.15; social resilience: 10.59 SE 1.78 CI 6.94 to 14.24) alongside significant improvements in anxiety (-2.21 SE 0.50 CI -3.24 to -1.19) and mental wellbeing (6.14 SE 0.92 CI 4.25 to 8.04). All significant changes were maintained at 3 months follow-up. Furthermore, it is now recognised that many mental health conditions are characterised by underlying inflammatory immune responses. Consequently, participants in the drumming group also provided saliva samples to test for cortisol and the cytokines interleukin (IL) 4, IL6, IL17, tumour necrosis factor alpha (TNFα), and monocyte chemoattractant protein (MCP) 1. Across the 10 weeks there was a shift away from a pro-inflammatory towards an anti-inflammatory immune profile. Consequently, this study demonstrates the psychological benefits of group drumming and also suggests underlying biological effects, supporting its therapeutic potential for mental health.

Trial registration for Creative Practice as Mutual Recovery: ClinicalTrials.gov NCT01906892

The Primary Outcome Measure is designated as the Warwick-Edinburgh Mental Well-being Scale.

Secondary Outcome Measures are both psychological and biological. The psychological measures are Secker’s measure of social inclusion, the Connor-Davidson Resilience Scale (CD-RISC), and the Anxiety and Depression subscales of the Hospital Anxiety and Depression Scale (HADS). Biological secondary outcome measures include saliva levels of cortisol, immunoglobulin and interleukins including IL6, as well as blood pressure and heart rate.

Comment and integration

We immediately see evidence of outcome switching. The Warwick-Edinburgh Mental Well-being Scale, the registered primary outcome, is not identified as such in the abstract. Instead, the Depression subscale of the Hospital Anxiety and Depression Scale (HADS) has been elevated to a primary outcome. Among the psychological outcomes designated in the trial registration, only the HADS subscales and the Connor-Davidson Resilience Scale figure in the abstract’s stated aims; Secker’s measure of social inclusion is not mentioned at all.

A battery of cortisol and immunological measures derived from saliva is mentioned in the abstract, but there is hand waving rather than presentation of the actual results: only a claim of “a shift away from a pro-inflammatory towards an anti-inflammatory immune profile.”

There is no mention of blood pressure or heart rate in the abstract.

What I wanted to see in the abstract, but did not.

A careful reader can figure out that this study was not a randomized trial. Rather, participants were recruited to the drumming group if they lived close enough to attend one; a control group was somehow constructed from participants who lived further away. So this is not a randomized trial, and there is no assurance of equivalence or comparability between the intervention and control groups, which likely precludes meaningful, generalizable comparisons.

The nature of the study design certainly needs to be taken into account in interpreting the resulting data. I would have required an explicit statement in the abstract that this is a nonrandomized trial.

Checking with the trial registration, it appears the design was a compromise from what had been originally planned. Whenever I see that a study design has been compromised, I look more carefully for other ways in which compromises  may have introduced bias into the results that are reported.

But I also want to know how many participants were recruited and how many were retained for the analyses. Selective retention in a control group that derives no benefit from participating in the study is another source of bias. More generally, I want to know whether all participants were retained for analysis, as well as whether the sample size was adequate, so that I can determine if the study is so underpowered that I can drop it from further consideration.

Enter the CONSORT Elaboration for Abstracts

Many journals require the well-known Consolidated Standards of Reporting Trials (CONSORT) checklist to be completed when a manuscript reporting a clinical trial is submitted. But there is also a CONSORT checklist for abstracts that has not received nearly as much attention. It can be readily modified to cover other designs for intervention studies if an item is added requiring explicit designation of the design the paper is reporting.

[Figure: the CONSORT for Abstracts checklist]
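To make that concrete, here is a rough sketch of how such a checklist might be used as a screening aid when reading or reviewing an abstract. The item wording below is paraphrased from memory rather than quoted from the published checklist, and the final item is the modification proposed above; consult the published CONSORT for Abstracts checklist for the authoritative wording.

```python
# A rough screening aid: which CONSORT for Abstracts items does an abstract address?
# Item wording is paraphrased, not quoted from the published checklist.
CONSORT_ABSTRACT_ITEMS = [
    "Title identifies the study as a randomized trial",
    "Trial design described",
    "Eligibility criteria for participants and the settings where data were collected",
    "Interventions intended for each group",
    "Specific objective or hypothesis",
    "Clearly defined primary outcome",
    "How participants were allocated to interventions",
    "Whether participants, caregivers, and outcome assessors were blinded",
    "Number of participants randomized to each group",
    "Number of participants analysed in each group",
    "Primary outcome result for each group, with effect size and its precision",
    "Important adverse events or side effects",
    "General interpretation of the results",
    "Trial registration number and name of register",
    "Source of funding",
    # The modification argued for above, for non-randomized intervention studies:
    "Explicit designation of the study design (e.g. non-randomized, allocation by geography)",
]


def missing_items(items_addressed: set) -> list:
    """Return the checklist items an abstract fails to address."""
    return [item for item in CONSORT_ABSTRACT_ITEMS if item not in items_addressed]


# Example: an abstract that states aims and results but never reports design,
# allocation, blinding, or numbers analysed.
for item in missing_items({
    "Specific objective or hypothesis",
    "Primary outcome result for each group, with effect size and its precision",
    "General interpretation of the results",
    "Trial registration number and name of register",
}):
    print("Missing:", item)
```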

Comparison to the body of the article

The Design and Participants section indicates:

Control participants were recruited through the same channels in South and North London. To minimise potential bias, they met the same inclusion criteria but were not within the vicinity of the group drumming sessions to take part. The groups were matched for age, sex, ethnicity and employment status, within the constraints of our recruitment channels…

Drumming participants were not blinded as to which study condition they were in. Control participants were told they were participating in a study about music and mental health but were not aware that they could have had access to a drumming group had they lived in West London. Staff collecting saliva samples were not blinded, but laboratory analysis was blind.

This is a weak design with multiple risks of bias. There is no randomization and no blinding of intervention and control participants, who were recruited with different consent forms. It is not at all clear that comparison with this nonequivalent control condition can tell us much about the intervention condition.

Glancing at Table 1, which presents the baseline differences, I noticed a substantial difference in depression scores between the intervention and control conditions that would cause problems for any reliance on depression as an outcome.

[Table 1: baseline differences between the drumming and control groups]

Nonetheless, I next learned that this particular measure was used for calculating sample size:

Sample size was calculated using data from the previous six-week study with the primary endpoint of depression (HADSD) which showed an effect size of f = 0.6 [6]. Using this effect size, an a priori sample size calculation using G*Power 3.1 for a between-factors ANOVA with an alpha of 0.05, power of 0.9 and assuming two-sided tests and a correlation of 0.8 among repeated measures (2 groups, 3 timepoints) was made which showed that an overall total of 28 participants would be required (14 per group). For the control group, to allow for drop-outs of 30% (estimated based on the six-week study), 20 participants were targeted for recruitment. For the experimental group, because of the range of biological markers being tested, we decided to match sample size with our preliminary study [6], and so 39 participants were initially recruited. Recruitment was continued until these targets had been reached before being closed one week before the drumming started. Following drop-outs, 15 control and 30 drumming participants remained.

[It is beyond the focus of the present blog post, but small feasibility studies should not be used to provide effect sizes for determining the sample size of a larger, better-resourced study. If the smaller study produced any significant findings, they are likely to exaggerate what will be found in the larger study.]
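The bracketed point is easy to demonstrate with a quick simulation. The sketch below is illustrative only: it assumes a hypothetical modest true effect (d = 0.3) and pilot-sized groups of 14, matching the per-group number in the authors’ calculation, and shows that among small studies that happen to reach p < 0.05, the observed effect size is systematically larger than the true one.

```python
# Why pilot-derived effect sizes mislead: condition small studies on significance
# and the surviving effect sizes are inflated. Numbers here are illustrative only.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
true_d, n_per_group, n_sims = 0.3, 14, 20_000  # hypothetical true effect, pilot-sized groups

observed_d, significant = [], []
for _ in range(n_sims):
    a = rng.normal(0.0, 1.0, n_per_group)
    b = rng.normal(true_d, 1.0, n_per_group)
    pooled_sd = np.sqrt((a.var(ddof=1) + b.var(ddof=1)) / 2)
    observed_d.append((b.mean() - a.mean()) / pooled_sd)
    significant.append(stats.ttest_ind(b, a).pvalue < 0.05)

observed_d, significant = np.array(observed_d), np.array(significant)
print(f"True d: {true_d}")
print(f"Mean observed d, all studies:       {observed_d.mean():.2f}")
print(f"Mean observed d, significant only:  {observed_d[significant].mean():.2f}")
print(f"Share of studies reaching p < 0.05: {significant.mean():.2f}")
```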

Enough is enough. I am accumulating information here that would have been sufficient for me to drop this article from further consideration, had it appeared in the abstract. Namely, this is a nonrandomized trial with different recruitment procedures for intervention and control participants. If that were not enough, only 15 control patients were retained for analysis. I would not expect robust and generalizable conclusions.
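For a back-of-the-envelope sense of what 15 control and 30 intervention participants can detect, here is a sketch using statsmodels’ power routines for a simple two-group comparison at a single time point. The actual repeated-measures design differs, so this is a rough bound under my own simplifying assumptions, not a reanalysis.

```python
# Back-of-the-envelope: the smallest standardized effect a simple two-group
# comparison with 15 vs 30 participants could detect with 80% power at alpha = 0.05.
from statsmodels.stats.power import TTestIndPower

min_d = TTestIndPower().solve_power(effect_size=None, nobs1=15, ratio=2.0,
                                    alpha=0.05, power=0.80, alternative='two-sided')
print(f"Minimal detectable Cohen's d: {min_d:.2f}")  # roughly 0.9, i.e. a very large effect
```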

Results

I am not sure I could have given an a priori prediction of what benefits, if any, a drumming group would provide for people accessing mental health services, but I would not expect any effects to be strong enough to be detected in a trial with 15 control patients. Nonetheless…

The analysis of anxiety ratings showed a significant condition by time interaction (F2,84 = 3.63, p<0.05), with anxiety falling over the 10 weeks of drumming (mean -2.21, SE 0.50, CI -3.24 to -1.19) while remaining unchanged in the control condition (mean -0.33, SE 0.57, CI -1.55 to 0.88). Within-subjects contrasts showed that the time by group interaction did not reach significance at 6 weeks but did reach significance by 10 weeks (F1,42 = 5.357, p<0.05). The overall decrease in anxiety from baseline in the drumming group averaged 9% by week 6 and 20% by week 10. Fig 2A shows the within-subject change from baseline at weeks 6 and 10 in both the drumming and control conditions.

The analysis of depression ratings showed a similar pattern, with a significant condition by time interaction (F2,84 = 10.23, p<0.001). Depression fell over the 10 weeks of drumming (mean -3.41, SE 0.62, CI -4.68 to -2.15) while remaining unchanged in the control condition (mean 0.47, SE 0.52, CI -0.66 to 1.59). Within-subjects contrasts showed that the time by group interaction reached significance at 6 weeks (F1,42 = 10.038, p<0.01) and was seen even more strongly by 10 weeks (F1,42 = 17.048, p<0.001). The overall decrease in depression from baseline in the drumming group averaged 24% by week 6 and 38% by week 10. In the light of the baseline differences in depression ratings, we also analyzed change scores over time controlling for baseline levels; the difference between drumming and control conditions remained significant at week 10 (F1,41 = 5.035, p<0.05). Fig 2B shows the within-subject change from baseline at weeks 6 and 10 in both the drumming and control conditions.

So, for anxiety, which was not an original primary outcome, results reached the magical p<0.05, not at six weeks, but at 10. Results for depression at first seem more impressive – that is, until we recall the large group difference at the outset, which no control for baseline characteristics will overcome. As for the originally designated primary outcome, well-being?
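To see why a large baseline gap is so troublesome, consider a minimal simulation of regression to the mean. This is not a reanalysis of the study’s data; it assumes, purely for illustration, that part of an observed baseline difference reflects measurement noise, and shows that the higher-scoring group will then appear to “improve” at follow-up even when nothing has been done to anyone.

```python
# Regression to the mean: with noisy measurements and no intervention at all,
# a group that scores higher at baseline will tend to show "improvement".
import numpy as np

rng = np.random.default_rng(0)
n = 100_000  # large n only so the expected pattern shows up clearly

true_score = rng.normal(10, 3, size=n)            # stable underlying depression scores
baseline = true_score + rng.normal(0, 2, size=n)  # noisy measurement at baseline
followup = true_score + rng.normal(0, 2, size=n)  # noisy measurement at follow-up

# Two hypothetical groups that differ at baseline (split on the baseline score itself)
high = baseline > np.median(baseline)

print("High-baseline group, mean change:", round((followup[high] - baseline[high]).mean(), 2))
print("Low-baseline group,  mean change:", round((followup[~high] - baseline[~high]).mean(), 2))
# The high-baseline group "improves" and the low-baseline group "worsens"
# purely through regression to the mean, with no treatment involved.
```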

In the analysis of the wellbeing scores, the condition by time interaction was nearly significant (F2,82 = 2.91, p = 0.06), with changes in the drumming group (mean 6.14, SE 0.92, CI 4.25 to 8.04) but not the control group (mean 2.33, SE 1.56, CI -1.01 to 5.67). Within-subjects contrasts showed that the time by group interaction was not significant at 6 weeks but was significant by 10 weeks (F1,41 = 5.033, p<0.05). The improvement in the drumming group averaged 8% by week 6 and 16% by week 10. Fig 2D shows the within-subject change from baseline at weeks 6 and 10 in both the drumming and control conditions.

The results for the biological variables are presented with a lot of selective reporting and hand waving, but they are worth reading through to the dramatic dénouement of a nonsignificant decrease in cortisol, p = 0.557. A real trend developing there that will need a bigger study to explore. [Sigh!]

Biological results

In order to explore the mechanisms of change in the drumming group, exploratory saliva samples were taken immediately before and after baseline and weeks 6 and 10 of drumming. Drumming significantly increased the anti-inflammatory cytokine IL4 (F2,34 = 3.830, p<0.05). Planned polynomial contrasts showed that there was a linear effect across time (F1,17 = 6.504, p<0.05), with a 9% increase in levels by week 6 and a 13% increase by week 10 (see Fig 3A). Alongside this, there was a significant change in levels of the pro-inflammatory chemokine MCP1 (F2,30 = 4.221, p<0.05), which polynomial contrasts revealed to be a quadratic effect, with a decrease of 10% by week 6 followed by a return to near baseline levels by week 10 (F1,15 = 10.793, p<0.01) (see Fig 3B). There was also a near-significant effect for IL17 (F2,30 = 2.502, p = 0.099), also shown to be a quadratic effect characterized by an initial increase of 13% followed by a small decrease returning to 6% above baseline levels (F1,15 = 4.301, p = 0.056) (see Fig 3C). No changes were found in levels of TNFα across the 10 weeks (F2,34 = 0.134, p = 0.875) nor IL6 (F2,34 = 0.808, p = 0.454), and although there was a decrease in cortisol across the 10 weeks, this was not significant (F2,42 = 0.593, p = 0.557).

Discussion

We are long beyond the days of browsing the scientific literature by going to the library and taking a volume off the shelf. Authors and journals that expect us to commit our attention should realize that they, in turn, need to capture and retain that attention with informative abstracts that cultivate our sense of a trustworthy source.

The particular PLOS One article that I examined in detail was chosen largely at random. I’m confident that I could find numerous examples elsewhere and surely worse ones. Misleading abstracts are endemic, and an important part of promoting confused and deliberately misleading science. In this article, as I’m sure we would find elsewhere, the deviations from an accurate abstract conforming to a trial registration and a fair interpretation of the results were in the service of creating a confirmatory bias.

PLOS One prides itself on not requiring breakthrough or innovative research as a condition of publication, but it does insist on the transparent reporting of studies which, even if not perfect, are clear about their limitations. This abstract fails that test.

I think it would be great for PLOS One to lead the field forward in insisting on better abstracts and educating authors and reviewers accordingly.