Sex-based data in the spotlight
Recent interest in developing sex-specific medical treatments has heightened calls for the discovery of biological sex differences. Biomedical researchers are under pressure to test for sex differences in each and every variable they measure. Because the results of sex comparisons could ultimately be used to decide who gets access to which medical treatments, it is imperative that the research is conducted according to the highest standards of rigor. Otherwise, a person of any gender could be denied access to effective treatment.
In 2020, a report in Nature claimed to reveal large sex differences in immune responses to Covid-19. The authors stated that their results provided a basis for the development of sex-specific treatments and vaccines, causing a media frenzy. When a different group of scientists analyzed the same data, however, they found few sex differences and reported that the findings for men and women were similar.
How could the conclusions of two different scientific teams, using the same dataset, differ so dramatically? In a 2021 analysis in eLife, Neuroscience student Yesenia Garcia-Sifuentes and I delve into the dark underbelly of sex differences research. We show that across a variety of disciplines in the biological sciences, approaches to testing for sex differences are quite variable. Some approaches are more rigorous than others. In fact, sex differences are often claimed without comparing the sexes statistically at all. We also show, conversely, that significant sex differences are often ignored and the data analyzed without regard to sex, even when it would be prudent to consider it.
Is rigor in the study of biological sex differences in danger?
This 'explainer' covers our main findings. See also the Emory press release and the eLife Insights commentary.
This riverplot shows the analysis of 720 articles published in 2019 in the biological sciences. The first part of the analysis, up to "Analyzed by sex", was conducted by Woitowich et al. (2020). We picked up where they left off and looked closely at all of the articles in which data were disaggregated by sex. We found that in that subset of 147 articles, the sexes were compared, either statistically or by assertion, more than 80% of the time. Of those articles, sex differences were reported in more than 70%. The difference was treated as a major finding, that is, highlighted in the title or the abstract, about half of the time.
Next, we asked whether claims of sex-specific effects were supported by sufficient statistical evidence.
Answer: Mostly not.
In most of the studies we analyzed, the authors were testing for an effect of an experimental manipulation in females and males. A control group was compared with a treated group in both sexes. This is called a 'factorial design' because there are two factors being considered: treatment and sex.
Graphic adapted from Gompers (2021)
The hypothetical dataset to the left shows a potential result of such a study: an effect that reached statistical significance in females but not quite in males. In males, perhaps there were not enough animals to detect a difference, but we can see that the effect of treatment is fairly similar in the two sexes. We cannot, at this point, conclude that the effect of treatment differed between the sexes. In order to conclude that the "difference in differences" is significant, the difference must itself be statistically compared between the sexes. Not comparing the effects of treatment between the sexes represents a common statistical error, one that I have made in my own work.
This common error has been written about numerous times, for example by Gelman and Stern (2006), Nieuwenhuis et al. (2011), and Makin and Orban de Xivry (2019). The problem is that comparing the general outcome of two separate tests does not itself test for a difference between the two effects. It is like comparing a "probably so" with an "I don't know" or "too soon to tell". This is not scientific evidence. To show evidence that a response to treatment differed between females and males, we must show statistically that the effect of treatment differs between the sexes. We must compare the male response directly with the female response. To do this, we test for a statistical interaction between treatment and sex. A significant interaction means that the response to treatment likely depends on sex: the females and males responded differently to the treatment. That is, we have found a "sex-specific" effect.
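To make the "difference in differences" idea concrete, here is a minimal Python sketch of an interaction test for a 2x2 design (treatment by sex). It is an illustration only, not the analysis used in any of the articles we reviewed. The function name interaction_test and all of the numbers are invented for this example. The p value relies on a normal approximation, which is too liberal for small samples; in practice one would fit a two-way ANOVA in a statistics package, where the interaction F statistic for a 2x2 design equals the square of the t statistic computed here.

```python
from math import sqrt
from statistics import NormalDist, mean, variance


def interaction_test(f_ctrl, f_trt, m_ctrl, m_trt):
    """Compare the treatment effect in females with that in males.

    Returns the difference in differences, its t statistic, and an
    approximate two-sided p value (normal approximation, an assumption
    made here to keep the sketch free of third-party libraries).
    """
    groups = [f_ctrl, f_trt, m_ctrl, m_trt]

    # Effect of treatment within each sex
    effect_f = mean(f_trt) - mean(f_ctrl)
    effect_m = mean(m_trt) - mean(m_ctrl)

    # The quantity a "sex-specific effect" claim is really about
    diff_in_diff = effect_f - effect_m

    # Pooled within-group variance (the ANOVA mean squared error)
    df_error = sum(len(g) - 1 for g in groups)
    mse = sum((len(g) - 1) * variance(g) for g in groups) / df_error

    # Standard error of the interaction contrast
    se = sqrt(mse * sum(1 / len(g) for g in groups))

    t = diff_in_diff / se
    p_approx = 2 * (1 - NormalDist().cdf(abs(t)))
    return diff_in_diff, t, p_approx


# Hypothetical measurements, invented purely for illustration
females_control = [10, 12, 11, 13, 9, 11]
females_treated = [14, 15, 13, 16, 12, 14]
males_control = [10, 13, 9, 14, 8, 12]
males_treated = [12, 16, 10, 17, 9, 14]

did, t_stat, p_val = interaction_test(
    females_control, females_treated, males_control, males_treated
)
print(f"difference in differences = {did:.2f}")
print(f"t = {t_stat:.2f}, approximate p = {p_val:.3f}")
```

The point of the design is that a single test is run on the difference between the two treatment effects, rather than one test per sex followed by a comparison of which one crossed the significance threshold.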
In our analysis, we found that most of the studies we analyzed did have a factorial design with sex as one of the factors. However, authors failed to test for interactions in a large majority of articles. Overall, we found that authors tested for and reported the results of tests for interactions only 29% of the time.
Next, we tested whether an inappropriate analysis typically resulted in a particular conclusion.
Answer: Authors who failed to test for sex-specific effects were significantly more likely to 'find' them.
When authors failed to test for interactions, they reported sex-specific effects almost 90% of the time. In contrast, authors who did test for interactions reported sex-specific effects only about 63% of the time. This difference was statistically significant (p = 0.016). Our results thus suggest that researchers are predisposed to finding sex differences and that sex-specific effects are likely over-reported in the literature.
Finally, we asked whether authors may have missed sex differences by pooling their data from males and females together.
Answer: Even when data were initially separated by sex, they were often pooled without tests for sex differences.
We analyzed only articles in which data were initially separated (see the top river plot above). Nonetheless, data from females and males were ultimately pooled more than a third of the time (top pie chart to the left). When data were pooled, researchers tested for a significant difference between males and females only about half of the time (bottom pie chart). Even when sex differences were found, males and females were sometimes pooled anyway.
Pooling without testing for a sex difference could cause important differences between females and males to go undetected, potentially causing sex differences to be under-reported.
We recommend reporting not only p values but also effect sizes (e.g., Cohen's d) before pooling. Click here for a helpful tool for visualizing and calculating effect sizes.
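For readers who want to compute an effect size by hand, the following is a minimal Python sketch of Cohen's d using the pooled standard deviation. The function name cohens_d and the example values are hypothetical, and other variants of the statistic exist (for example, Hedges' g, which corrects for small samples).

```python
from math import sqrt
from statistics import mean, variance


def cohens_d(group_a, group_b):
    """Standardized mean difference between two independent groups."""
    n_a, n_b = len(group_a), len(group_b)

    # Pooled variance, weighting each group's sample variance by its
    # degrees of freedom
    pooled_var = (
        (n_a - 1) * variance(group_a) + (n_b - 1) * variance(group_b)
    ) / (n_a + n_b - 2)

    return (mean(group_a) - mean(group_b)) / sqrt(pooled_var)


# Hypothetical values, invented purely for illustration
females = [10, 12, 11, 13, 9, 11]
males = [10, 13, 9, 14, 8, 12]
print(f"Cohen's d (females vs. males) = {cohens_d(females, males):.2f}")
```

Reporting d alongside the p value lets readers judge whether a non-significant sex difference is genuinely small or simply underpowered before the data are pooled.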
Rigor in crisis?
When it comes to research on sex differences, the stakes are high. Many researchers, administrators and even politicians are calling for the development of sex-specific medical treatments. Thus, in order to ensure that people of all genders maintain access to care that works for them, research on potential sex-specific effects of treatments must be carried out according to the highest possible standards of rigor. Our results show room for improvement. We call upon funding agencies, editors, and our colleagues to raise the bar.