The authors of a recent article in JAMA Internal Medicine
“Physician Gender and Outcomes of Hospitalized Medicare Beneficiaries in the U.S.,” Yusuke Tsugawa, Anupam B. Jena, Jose F. Figueroa, E. John Orav, Daniel M. Blumenthal, Ashish K. Jha, JAMA Internal Medicine, online December 19, 2016, doi: 10.1001/jamainternmed.2016.7875
stirred a lot of attention in the media with direct quotes like these:
“If we had a treatment that lowered mortality by 0.4 percentage points or half a percentage point, that is a treatment we would use widely. We would think of that as a clinically important treatment we want to use for our patients,” said Ashish Jha, professor of health policy at the Harvard School of Public Health. The estimate that 32,000 patients’ lives could be saved in the Medicare population alone is on par with the number of deaths from vehicle crashes each year.
Washington Post: Women really are better doctors, study suggests.
My immediate reactions after looking at the abstract were only confirmed when I delved deeper.
Basically, we have a large, but limited and very noisy data set. It is unlikely that these data allow us to be confident about the strength of any signal concerning the relationship between physician gender and patient outcome that is so important to the authors. The small apparent differences could be just more noise on which the authors have zeroed in so that they can make a statement about the injustice of gender differences in physician pay.
I am unwilling to relax methodological and statistical standards to manufacture support for such a change. There could be unwanted consequences of accepting that arguments can be made with such weak evidence, even for a good cause.
What if the authors had found the same small differences in noisy data in the reverse direction? Would they argue that we should preserve gender differences in physician pay? What if the authors had focused on a different variable in all this noise and concluded that the lower pay which women receive was associated with reduced mortality? Would we then advocate reducing the pay of both male and female physicians in order to improve patient outcomes?
Despite all the excitement that the claim about an effect of physician gender on patient mortality is generating, it is most likely that we are dealing with noise arising from overinterpretation of complex analyses that assume more completeness and precision than can be found in the data being analyzed.
These claims are not just a matter of causal relationships being spun from correlation. Rather, they are causal claims being made on the basis of partial correlations emerging in complex multivariate relationships found in an administrative data set.
- Administrative data sets, particularly Medicare data sets like this one, are not constructed with such research questions in mind. There are severe constraints on what variables can be isolated and which potential confounds can be identified and tested.
- Administrative data sets consist of records, not actual behaviors. It’s reasonable to infer a patient death from a record of a death. Inferring that a physician of a given gender was responsible for the care behind a particular record is more problematic, as we will see. Even if we accept the association found in these records, it does not necessarily mean that physicians engaged in any particular behaviors or that physician behavior is associated with the pattern of deaths emerging in these multivariate analyses.
- The authors start out with a statement about differences in how female and male physicians practice. In the actual article and the media, they have referred to variables like communication skills, providing evidence-based treatments, and encouraging health-related behaviors. None of these variables are remotely accessible in a Medicare data set.
- Analyses of such administrative data sets do not allow isolation of the effects of physician gender from the effects of the contexts in which their practice occurs and relevant associated variables. We are not talking about a male or female physician encountering a particular patient being associated with a death or not, but an administrative record of physician gender arising in a particular context being interpreted as associated with a death. Male and female physicians may differ in being found in particular contexts in nonrandom fashion. It’s likely that these differences will dwarf any differences in outcomes. There will be a real challenge in even confidently attributing those outcomes to whether patients had an attending male or female physician.
The validity of complex multivariate analyses is strongly threatened by specification bias and residual confounding. The analyses must assume that all of the relevant confounds have been identified and measured without error. Departures from these ideal conditions can lead to spurious results, and generally do. Examination of the limitations in the variables available in a Medicare data set and how they were coded can quickly undermine any claim to validity.
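To make the point about residual confounding concrete, here is a minimal sketch in Python. Every number in it is hypothetical and chosen only for illustration: I assume an unmeasured high-risk patient characteristic, mortality risks of 9% and 30% for low- and high-risk patients, and a 2-percentage-point imbalance in how often male and female physicians see high-risk patients. Physician gender has no effect at all in this toy model.

```python
# Toy model: an unmeasured patient risk factor, NO true effect of physician gender.
# All values below are hypothetical and are not taken from the study.

def expected_mortality(p_high_risk, risk_low=0.09, risk_high=0.30):
    """Expected mortality for a patient group in which a fraction
    p_high_risk carries the unmeasured high-risk characteristic."""
    return (1 - p_high_risk) * risk_low + p_high_risk * risk_high

# Assumed case mix: female physicians see slightly fewer high-risk patients.
mortality_female = expected_mortality(0.20)
mortality_male = expected_mortality(0.22)

# The gap arises purely from case mix, because the risk factor is unmeasured
# and therefore cannot be adjusted away.
gap = mortality_female - mortality_male

print(f"female physicians: {mortality_female:.4f}")
print(f"male physicians:   {mortality_male:.4f}")
print(f"apparent difference: {gap * 100:.2f} percentage points")
```

Under these assumed numbers, a small imbalance in an unrecorded variable yields an apparent difference of about 0.4 percentage points, the same order of magnitude as the difference the authors report. This does not show that such an imbalance exists in the Medicare data, only that a difference this small is easy to produce without any effect of physician gender.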
Acceptance of claims about effects of particular variables like female physician gender arising in complex multivariate analyses involves assumptions of “all-other-things-being-equal.” If we attempt to move from statistical manipulation to inference about a real world encounter, we are no longer talking about a particular female physician, but a construction that may be very different from particular physicians interacting with particular patients in particular contexts.
The potential for misleading counterfactual statements can be seen if we move from the study to an example involving science nerds and basketball players, and hypothesize that if John and Jason were of equivalent height, John would not study so hard.
Particularly in complex social situations, it is usually a fantasy that we can change one variable, and only one variable, without changing others. Just how did John and Jason get to be of equal height? And how are they now otherwise different?
Associations discovered in administrative data sets most often do not translate into effects observed in randomized trials. I’m not sure how we could get a representative sample of patients to disregard their preferences and accept random assignment to a male or female physician. It would have to be a very large study to detect the effect sizes reported in this observational study, and I’m skeptical this sufficiently strong signal would emerge from all of the noise.
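How large is "very large"? A rough sketch, assuming a simple two-arm trial analyzed with a two-sided test of two proportions at alpha = 0.05 and 80% power, and taking the adjusted mortality figures reported in the article (11.07% vs 11.49%) as the true rates. The function name `n_per_arm` and the design assumptions are mine, not the authors'.

```python
from math import ceil
from statistics import NormalDist

def n_per_arm(p1, p2, alpha=0.05, power=0.80):
    """Approximate sample size per arm for comparing two proportions
    (normal approximation, equal allocation, two-sided test)."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value for the test
    z_beta = NormalDist().inv_cdf(power)           # quantile for the desired power
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil((z_alpha + z_beta) ** 2 * variance / (p1 - p2) ** 2)

# Adjusted 30-day mortality reported in the article, treated here as true rates.
n = n_per_arm(0.1107, 0.1149)
print(f"patients needed per arm: {n:,}")
print(f"patients needed in total: {2 * n:,}")
```

Under these assumptions the calculation comes out on the order of 90,000 patients per arm, before allowing for any dropout, crossover, or clustering of patients within physicians and hospitals, all of which would push the number higher.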
We might relax our standards and accept a quasi-experimental design that would be smaller but encompass a wider range of relevant variables. For instance, it is conceivable that we could construct a large sample in which physicians varied in terms of whether they had formal communication skills training. We might examine whether communications training influenced subsequent patient mortality, independent of physician gender, and vice versa. This would be a reasonable translation of the authors’ hypothesis that communication skills differences between male and female physicians account for what the authors believe is the observed association between physician gender and mortality. I know of no such study having been done. I know of no study demonstrating that physician communication training affects patient mortality. I’m skeptical that the typical communication training is so powerful in its effects. If such a study required substantial resources, rather than relying on data on hand, the strength of the results of the present study would not encourage me to marshal those resources.
What I saw when I looked at the article
We are dealing with very small adjusted differences in percentages arising in a large sample.
Patients treated by female physicians had lower 30-day mortality (adjusted mortality, 11.07% vs 11.49%; adjusted risk difference, –0.43%; 95% CI, –0.57% to –0.28%; P < .001; number needed to treat to prevent 1 death, 233).
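The number needed to treat in that passage is simply the reciprocal of the absolute risk difference. A short sketch, using only the figures quoted above, shows how it is obtained and what the reported confidence interval implies for it. The function name is my own.

```python
def number_needed_to_treat(absolute_risk_reduction):
    """NNT is the reciprocal of the absolute risk reduction (as a proportion)."""
    return 1.0 / absolute_risk_reduction

# Reported adjusted risk difference: 0.43 percentage points (95% CI 0.28 to 0.57).
nnt = number_needed_to_treat(0.0043)
nnt_best_case = number_needed_to_treat(0.0057)   # largest difference in the CI
nnt_worst_case = number_needed_to_treat(0.0028)  # smallest difference in the CI

print(f"NNT at the point estimate: {nnt:.0f}")
print(f"NNT across the 95% CI: {nnt_best_case:.0f} to {nnt_worst_case:.0f}")
```

Taking the authors' own interval at face value, the number needed to treat ranges from about 175 to about 357, and that interval reflects only sampling error, not the specification and measurement problems discussed here.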
Assignment of a particular patient to a particular physician is done with a lot of noise.
We assigned each hospitalization to a physician based on the National Provider Identifier in the Carrier File that accounted for the largest amount of Medicare Part B spending during that hospitalization. Part B spending comprises professional and other fees determined by the physician. On average, these physicians were responsible for 51.1% of total Part B spending for a given hospitalization.
One commentator quoted in a news article noted:
William Weeks, a professor of psychiatry at Dartmouth’s Geisel School of Medicine, said that the researchers had done a good job of trying to control for other factors that might influence the outcome. He noted that one caveat is that hospital care is usually done by a team. That fact was underscored by the method the researchers used to identify the doctor who led the care for patients in the study. To identify the gender of the physician, they looked for the doctor responsible for the biggest chunk of billing for hospital services — which was, on average, about half. That means that almost half of the care was provided by others.
Actually, much of the care is not provided by the attending physician, but other staff, including nurses and residents.
The authors undertook the study to call attention to gender disparities in physician pay. But could disparities show up in males being able to claim more billable procedures – greater credit administratively for what is done with patients during hospitalization, including by other physicians? This might explain at least some of the gender differences, but could undermine the validity of this key variable in relating physician gender to differences in patient outcome.
The statistical control of differences in patient and physician characteristics afforded by variables in this data set is inadequate.
Presumably, a full range of patient variables is related to whether patients die within 30 days of a hospitalization. Recall the key assumption that all of the relevant confounds have been identified and assessed without error in considering the variables used to characterize patient characteristics:
Patient characteristics included patient age in 5-year increments (the oldest group was categorized as ≥95 years), sex, race/ethnicity (non-Hispanic white, non-Hispanic black, Hispanic, and other), primary diagnosis (Medicare Severity Diagnosis Related Group), 27 coexisting conditions (determined using the Elixhauser comorbidity index), median annual household income estimated from residential zip codes (in deciles), an indicator variable for Medicaid coverage, and indicator variables for year.
Note that the comorbidity index is based on collapsing 27 other variables into one number. This simplifies the statistics, yes, but with a tremendous loss of information.
Recall the assumption that this set of variables represents not just what is available in an administrative data set, but all the patient characteristics relevant to their dying within 30 days after discharge from the hospital. Are we really willing to accept this assumption?
For the physician variables displayed at the top of Table 1, there are huge differences between male and female physicians, relative to the modest difference in patient mortality, adjusted mortality, 11.07% vs 11.49%.
These authors encourage us to think about the results as simulating a randomized trial, except that statistical controls are serving the function that randomization of patients to physician gender would serve. We are being asked to accept that these differences in baseline characteristics of the practices of female versus male physicians can be eliminated through statistics. We would never accept that argument in a randomized trial.
Addressing criticisms of the authors’ interpretation of their results.
The senior author provided a pair of blog posts in which he acknowledges criticism of his study, but attempts to defuse key objections. It’s unfortunate that the sources of these objections are not identified, and so we are dependent on the author’s summary out of context. I think the key responses are to straw man objections.
Correlation is not causation.
… We often make causal inferences based on observational data – and here’s the kicker: sometimes, we should. Think smoking and lung cancer. Remember the RCT that assigned people to smoking (versus not) to see if it really caused lung cancer? Me neither…because it never happened. So, if you are a strict “correlation is not causation” person who thinks observational data only create hypotheses that need to be tested using RCTs, you should only feel comfortable stating that smoking is associated with lung cancer but it’s only a hypothesis for which we await an RCT. That’s silly. Smoking causes lung cancer.
No, it is this argument that is silly. We can now look back on the data concerning smoking and lung cancer and benefit from the hindsight provided by years of sorting smoking as a risk factor from potential confounds. I recall that, at some point, drinking coffee was related to lung cancer in the United States, whereas drinking tea was correlated with it in the UK. Of course, if we don’t know that smoking is the culprit, we might miss that in the US, smoking was done while drinking coffee, whereas in the UK, it was done while drinking tea.
And isolating smoking as a risk factor, rather than just a marker for risk, is so much simpler than isolating whatever risk factors for death are hidden behind physician gender as a marker for risk of mortality.
Coming up with alternative explanations for the apparent link between physician gender and patient mortality.
The final issue – alternative explanations – has been brought up by nearly every critic. There must be an alternative explanation! There must be confounding! But the critics have mostly failed to come up with what a plausible confounder could be. Remember, a variable, in order to be a confounder, must be correlated both with the predictor (gender) and outcome (mortality).
This is similarly a fallacious argument. I am not arguing for alternative substantive explanations. I’m proposing that spurious results were produced by pervasive specification bias, including measurement error. There is no potential confounder I have to identify. I am simply arguing that the small differences in mortality are dwarfed by specification and measurement error.
This tiny difference is actually huge in its implications.
Several critics have brought up the point that statistical significance and clinical significance are not the same thing. This too is epidemiology 101. Something can be statistically significant but clinically irrelevant. Is a 0.43 percentage point difference in mortality rate clinically important? This is not a scientific or a statistical question. This is a clinical question. A policy and public health question. And people can reasonably disagree. From a public health point of view, a 0.43 percentage point difference in mortality for Medicare beneficiaries admitted for medical conditions translates into potentially 32,000 additional deaths. You might decide that this is not clinically important. I think it is. It’s a judgment call and we can disagree.
The author is taking a small difference and magnifying its importance by applying it to a larger population. He is attributing the “additional deaths” to patients being treated by men. I feel he hasn’t made a case that physician gender is the culprit, and so nothing is accomplished except introducing shock and awe by amplifying the small effect into its implications for the larger population.
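The arithmetic behind this amplification is easy to reproduce. In the sketch below, the population size is not taken from the article: it is back-calculated by me from the two quoted figures (32,000 deaths and a 0.43 percentage point difference), so treat it as an assumption. The same scaling is then applied to the ends of the reported confidence interval.

```python
# Figures quoted from the authors.
deaths_claimed = 32_000
risk_difference = 0.0043             # 0.43 percentage points
ci_low, ci_high = 0.0028, 0.0057     # reported 95% CI for the risk difference

# ASSUMPTION: implied number of hospitalizations, back-calculated from the
# two quoted figures rather than reported in the article.
implied_hospitalizations = deaths_claimed / risk_difference

# Scaling the ends of the confidence interval by the same population.
deaths_low = ci_low * implied_hospitalizations
deaths_high = ci_high * implied_hospitalizations

print(f"implied hospitalizations: {implied_hospitalizations:,.0f}")
print(f"deaths implied by the 95% CI: {deaths_low:,.0f} to {deaths_high:,.0f}")
```

Multiplying a small, uncertain difference by several million hospitalizations produces an impressive headline number, but it does nothing to establish that the difference is real or that physician gender is what produces it.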
In response to a journalist, the author makes a parallel argument:
The estimate that 32,000 patients’ lives could be saved in the Medicare population alone is on par with the number of deaths from vehicle crashes each year.
In addition to what I have already argued, if we know the same number of deaths are attributable to automobile crashes, we at least know how to take steps to reduce these crashes and the mortality associated with them. We don’t know how to change the mortality the authors claim is associated with physician gender. We don’t even know that the author’s claims are valid.
Searching for meaning where no meaning is to be found.
In framing the study and interpreting the results to the media, the authors undertake a search of the literature with a heavy confirmation bias, ignoring the many contradictions that are uncovered with a systematic search. For instance, one commentator on the senior author’s blog notes
It took me about 5 minutes of Google searching to find a Canadian report suggesting that female physicians in that country have workloads around 75% to 80% of male physicians:
If US data is even vaguely similar, that factor would be a serious omission from your article.
But the authors were looking for what supported the results, not for studies that potentially challenged or contradicted their results. They are looking to strengthen a narrative, not expose it to refutation.
Is there a call to action here?
As consumers of health services, we could all switch to being cared for by female physicians. I suspect that some of the systems and structural issues associated with the appearance that care by male physicians is inferior would then be spread among females, including increased workloads. The bias in the ability of male physicians to claim credit for the work of others would be redistributed to women. Neither would improve patient mortality.
We should push for reduction in inequalities in pay related to gender. But we don’t need results of this study to encourage us.
I certainly know health care professionals and researchers who have more confidence than I do in communication learning modules producing clinically significant changes in physician behavior. I don’t know any of them who could produce evidence that these changes include measurable reductions in patient mortality. If someone produces such data, I’m capable of being persuaded. But the present study adds nothing to my confidence in that likelihood.
If we are uncomfortable with the communication skills or attention to evidence that our personal physicians display, we should replace them. But I don’t think this study provides additional evidence for us doing so, beyond the legitimacy of us acting on our preferences.
In the end, this article reminds us to stick to our standards and not be tempted to relax them to make socially acceptable points.