Same Data, Different Story: An Insight into Simpson's Paradox
When I went to the Economics Department to collect my results, I could see many students gathered around the lobby, trying to find out how they had performed in their examinations. The first semester results were special; they showed where you stood among your peers.
As I walked by, I found some disgruntled friends of mine, mostly girls, arguing that the department was biased against female students. It was a genuine concern, given that the boys were often given more opportunities by that particular department even though the number of boys and girls taking its courses was equal. This time, however, the girls were really angry because, apparently, 30% of the girls had failed the courses offered by this department compared to 22.9% of the boys. It was annoying, given that they had been doing really well in the college's internal evaluation and had generally been better in class.
Exasperated, the Head of the Department called in the individual subject teachers for an inquiry. The department offered only Macroeconomics and Microeconomics to first semester students, so only the two teachers of those subjects were summoned.
Right away, the subject teachers were confused. They were asked to explain the discrepancy in the results favouring boys, but the teachers had a different story to tell. To the Head of the Department’s surprise, both teachers reported that the girls had performed better in their respective subjects. She looked at the teachers’ reports and was stunned. 50% of the boys taking Macroeconomics had failed the subject compared to 45% of the girls, whereas 12% of the boys had failed Microeconomics compared to just 10% of the girls. What had happened? How could the girls perform better than the boys in each individual subject but do worse overall?
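The reports presented by the individual teachers and the department result looked like this:
Macroeconomics: 18 of 40 girls failed (45%); 10 of 20 boys failed (50%)
Microeconomics: 3 of 30 girls failed (10%); 6 of 50 boys failed (12%)
Overall: 21 of 70 girls failed (30%); 16 of 70 boys failed (22.9%)
In total, 21 of the 70 girls failed in either subject whereas 16 of the 70 boys failed. Yet the girls performed better in both subjects.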
What happened is an example of Simpson’s Paradox. The paradox shows that the same data can support arguments in two opposite directions, depending on how it is grouped. The answer to this puzzle is that a large number of female students took Macroeconomics, which, as the results suggest, was probably the more demanding subject. Out of the 70 boys, only 20 took Macroeconomics, whereas twice as many girls (40) took the same subject. Half of the boys failed this subject, compared to 45% of the girls. However, in absolute terms, 45% of the girls meant 18 failures, almost twice the number of boys who failed (10). Only a few students of either sex failed Microeconomics. The 18 girls who failed Macroeconomics were by themselves more than all the boys who failed in both courses combined. (Note: this is not a necessary condition for the paradox to exist, and neither is both groups having an equal number of students. It just happened to be so in this case.) Hence, even though the girls did better in both subjects, the large number of girls taking and failing the difficult course led to a higher failure rate for girls in the aggregate data.
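The figures in the story can be checked with a few lines of Python. This is only a sketch of the arithmetic: the Microeconomics counts (30 girls with 3 failing, 50 boys with 6 failing) are derived from the percentages and totals quoted in the story, under the assumption that each student took exactly one of the two courses. The names results, fail_rate and overall_rate are made up for this illustration.

```python
# Counts from the story, stored as (students enrolled, students who failed).
# Assumption: every student took exactly one of the two courses.
results = {
    "girls": {"Macroeconomics": (40, 18), "Microeconomics": (30, 3)},
    "boys": {"Macroeconomics": (20, 10), "Microeconomics": (50, 6)},
}


def fail_rate(enrolled, failed):
    """Failure rate as a percentage."""
    return 100 * failed / enrolled


def overall_rate(group):
    """Failure rate for a group after pooling both subjects."""
    enrolled = sum(n for n, _ in results[group].values())
    failed = sum(f for _, f in results[group].values())
    return fail_rate(enrolled, failed)


# Within each subject, the girls have the lower failure rate.
for subject in ("Macroeconomics", "Microeconomics"):
    girls = fail_rate(*results["girls"][subject])
    boys = fail_rate(*results["boys"][subject])
    print(f"{subject}: girls {girls:.1f}%, boys {boys:.1f}%")

# Pooled over both subjects, the ordering flips.
print(f"Overall: girls {overall_rate('girls'):.1f}%, boys {overall_rate('boys'):.1f}%")
```

The reversal comes entirely from the weights: the girls' overall rate is dominated by the 40 who sat the harder course, while the boys' overall rate is dominated by the 50 who sat the easier one.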
This paradox shows how data can be misinterpreted. Although the above story is fictional (sorry for not disclosing it beforehand), a real-life case of Simpson’s Paradox arose in 1973, when UC Berkeley was accused of gender bias against women in its admissions. The case was brought up because UC Berkeley admitted 44% of all male applicants but only 35% of female applicants. But after breaking down the admissions by department, it was found that the admission rate of female candidates was higher in 4 out of 6 departments. If anything, the bias was for, rather than against, female applicants. In most departments, women were rejected in a lower proportion than men, yet because a large proportion of female applicants applied to, and were rejected by, departments with low acceptance rates, the aggregate rejection rate gave a misleading picture of the university’s admissions. The UC Berkeley case is one of the best-known cases of Simpson’s Paradox. (More about this on the link below.)
Personal Take
Just as a writer can tune an argument in his favor with properly structured lines, Simpson’s Paradox can do the same with data. If we are naive, data can lie. Such an incident happened once when a friend of mine and I were analyzing data about human lung capacity. A simple linear model showed that smokers were likely to have a higher lung capacity than non-smokers, which was surprising as well as counter-intuitive. Later we realized that the data included a large number of children and young teens, who had lower lung capacity than older people and also did not smoke, making it seem as if being a smoker meant having a higher lung capacity. The estimate changed when we controlled for age.
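The mechanism can be reproduced with a toy example. The sketch below does not use our actual data or the linear model we fitted. It builds a small synthetic dataset in which smoking lowers lung capacity by construction, then compares plain group means against a comparison made within each age. Every number and name in it (capacity, people, mean_gap) is hypothetical.

```python
from statistics import mean


def capacity(age, smoker):
    """Hypothetical lung capacity: rises with age, lowered by smoking."""
    return 1.0 + 0.2 * age - (0.3 if smoker else 0.0)


# Synthetic people as (age, smoker, capacity). Four people per age from 5 to 19;
# in this toy data only those aged 14 and over smoke, and only half of them.
people = []
for age in range(5, 20):
    for i in range(4):
        smoker = age >= 14 and i < 2
        people.append((age, smoker, capacity(age, smoker)))


def mean_gap(rows):
    """Mean capacity of smokers minus mean capacity of non-smokers."""
    smokers = [c for _, s, c in rows if s]
    others = [c for _, s, c in rows if not s]
    return mean(smokers) - mean(others)


# Naive comparison: pool everyone, children included.
naive_gap = mean_gap(people)

# Controlling for age: compare smokers and non-smokers of the same age,
# then average those within-age gaps.
ages_with_smokers = sorted({a for a, s, _ in people if s})
adjusted_gap = mean(
    mean_gap([row for row in people if row[0] == a]) for a in ages_with_smokers
)

print(f"naive gap: {naive_gap:+.3f}")
print(f"age-adjusted gap: {adjusted_gap:+.3f}")
```

In this toy data the naive gap is positive, because the non-smoking group is full of children, while the age-adjusted gap is negative, which is the effect that was built in.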
In the examples discussed, we needed to split the data into groups for proper evaluation. I am bringing this up because some of my friends (and some politicians) have been claiming that Nepalese are more immune to Covid-19 because the death rate in the country has been below 0.5%. I am no doctor and cannot argue for or against claims about people’s immunity, but what I do know is that young adults have been disproportionately infected by Covid-19 in Nepal, and that across the globe the mortality rate of Covid-19 is lowest for that age group. This raises the question of whether the low death rate among Nepalese is an instance of Simpson’s Paradox in disguise. (If I get access to the data, I shall write a post on it.) Apart from age, pre-existing health conditions and other health complications are also known to influence mortality from Covid-19, which calls for an even more finely grouped analysis.
To avoid misinterpreting data, one must know whether grouping the data or aggregating it is the better alternative. The key is understanding the causal relationships between the factors that may be influencing the data. Every figure can be statistically correct and the interpretation can still be misleading.
Infection of Covid-19 in Nepalese by age:
UC Berkeley Case: