Simpson's Paradox in Linear Regression
Simpson's Paradox is a fascinating statistical phenomenon that occurs more often than people realize, yet few are aware of it. It arises when the trends within sub-groups of a dataset contradict the overall trend. A classic example is the famous UC Berkeley admissions case, which stirred controversy over apparently sexist practices. For a more in-depth exploration, I recommend this article: https://www.statisticshowto.com/what-is-simpsons-paradox/. You can also read my own piece on the subject, “Same Data, Two Stories: An Insight on Simpson’s Paradox,” available at https://medium.com/@alabhya/same-data-two-story-an-insight-on-simpson-paradox-f1d64ad360e7. This paradox is more than a statistical quirk; it’s a reminder that the same data can tell different stories depending on how it is interpreted.
Simpson’s Paradox isn’t just an intriguing statistical concept; it’s a critical element in analytics that demands attention. Failing to recognize this paradox can lead your analysis astray, yielding results that aren’t just slightly off or unreliable, but starkly opposite to what they should be. Emphasizing the word ‘opposite’ here is key — it means the outcome is a direct contradiction of the expected result.
Today, I’ll demonstrate with a compelling example how overlooking Simpson’s Paradox can flip the direction of linear regression, turning your analysis on its head.
Here’s a scenario for you. Imagine I’m wandering around California, gathering data on training hours and income. With the data I collected, I plotted total training hours against each individual’s income, separately for each location. The graphs are as follows:
It’s obvious that in every location, the more training you have, the higher your income. Let’s run a linear regression in each location to verify this:
Linear Regression in San Diego
The model shows a significant, positive relationship between training hours and income in San Diego.
Linear Regression in LA
Again, same story. We see a significant and positive relationship between hours and income.
Linear Regression in SF
Again, significant and positive relationship between hours and income.
Linear Regression in Irvine
Same story in Irvine as well. (Notice the decreasing value of the coefficient!)
Linear Regression in Riverside
Significant and positive relationship.
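The per-location regressions above can be sketched in code. This is a minimal, hypothetical reconstruction, not the author’s actual R simulation: the baseline incomes, typical training hours, the assumed return of 300 per training hour, and the random seed are all made-up values chosen only to reproduce the qualitative pattern (a positive slope within every area).

```python
import numpy as np

# Hypothetical area parameters: (baseline income, mean training hours).
# These numbers are illustrative assumptions, not the article's real data.
rng = np.random.default_rng(42)
areas = {
    "San Diego": (40_000, 60),
    "LA":        (55_000, 45),
    "SF":        (90_000, 20),
    "Irvine":    (70_000, 35),
    "Riverside": (30_000, 75),
}
TRUE_SLOPE = 300  # assumed within-area income gain per training hour

slopes = {}
for name, (base, mean_hours) in areas.items():
    hours = rng.normal(mean_hours, 5, 100)
    income = base + TRUE_SLOPE * hours + rng.normal(0, 2_000, 100)
    slope, intercept = np.polyfit(hours, income, 1)  # simple OLS line per area
    slopes[name] = slope
    print(f"{name}: slope = {slope:.0f}")
```

Every area’s fitted slope comes out positive, mirroring the five regressions above.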
These linear models are not very interesting, given that all they do is confirm the graphs I plotted earlier. But now, rather than doing a place-by-place analysis, I am interested in the relationship between training hours and income in California as a whole. So I compile all the data and run a single linear model.
These are the results that I got.
Amazing, right? In case you did not notice, the coefficient is now negative. The estimate is still significant, but in the opposite direction. That’s strange, given that every subsection showed a positive relationship.
(The low R² gives this model away, but believe me, with Simpson’s Paradox the R² is often much better than it is here.)
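The sign flip can be reproduced with the same kind of hypothetical simulation. Again, the area parameters and the 300-per-hour within-area slope are assumptions for illustration; the key ingredient is that the high-baseline-income areas happen to have fewer training hours, so pooling everything into one regression reverses the sign.

```python
import numpy as np

rng = np.random.default_rng(42)
# Hypothetical (baseline income, mean training hours) per area: note that
# the richest area (SF) trains the least and the poorest (Riverside) the most.
areas = {"San Diego": (40_000, 60), "LA": (55_000, 45), "SF": (90_000, 20),
         "Irvine": (70_000, 35), "Riverside": (30_000, 75)}

all_hours, all_income = [], []
for base, mean_hours in areas.values():
    hours = rng.normal(mean_hours, 5, 100)
    all_hours.append(hours)
    all_income.append(base + 300 * hours + rng.normal(0, 2_000, 100))

# Pool everything into one "California" dataset and fit a single line.
hours = np.concatenate(all_hours)
income = np.concatenate(all_income)
pooled_slope, _ = np.polyfit(hours, income, 1)
print(f"pooled slope = {pooled_slope:.0f}")  # negative, despite positive slopes in every area
```

The between-area variation (rich areas with few hours, poor areas with many) dominates the pooled fit, so the single regression line slopes downward.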
This is the case of Simpson’s Paradox. The essence of Simpson’s Paradox lies in its ability to remain concealed within data. Had I initially conducted a California-wide regression analysis correlating hours to income, the nuanced discrepancies at the location level would have remained undetected, exemplifying how Simpson’s Paradox can subtly influence data interpretation. Moreover, what if I had not accounted for the area bias when I took the survey and had one single dataset of training hours and income in California?
The issue with this data is actually straightforward. Within each individual area, there’s a positive correlation between the extent of training and income; more training typically equates to higher earnings. However, when we consolidate data across different areas, this trend shifts. In certain high-income regions, individuals can earn more even with less training. This variance highlights how regional economic factors can significantly influence income, independent of training levels.
This is how the graphs look:
The visualization reveals a striking phenomenon: while each sub-area’s regression line shows a positive correlation, the overall trend dramatically reverses when we apply a single regression model to the entire dataset. This is a classic instance of Simpson’s Paradox.
To rectify this, introducing an area dummy variable in the regression model can align the results more accurately. By doing so, the adjusted regression demonstrates a pattern consistent with those observed in the individual sub-areas, as follows:
After adjusting the regression model, the impact of hours on income is once again positive, demonstrating the correction of the model. Furthermore, the area dummy variables shed light on regional income disparities: areas with higher-than-average incomes show positive coefficients, while those with incomes below the average exhibit negative coefficients. This adjustment highlights the significant role regional factors play in determining income.
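The dummy-variable fix can be sketched as follows. This is again a hypothetical simulation (same invented area parameters and assumed 300-per-hour slope as before), here fitted with ordinary least squares on a design matrix that includes one dummy per area, with the first area as the baseline:

```python
import numpy as np

rng = np.random.default_rng(42)
# Hypothetical (baseline income, mean training hours) per area.
areas = {"San Diego": (40_000, 60), "LA": (55_000, 45), "SF": (90_000, 20),
         "Irvine": (70_000, 35), "Riverside": (30_000, 75)}

hours_list, income_list, labels = [], [], []
for i, (base, mean_hours) in enumerate(areas.values()):
    h = rng.normal(mean_hours, 5, 100)
    hours_list.append(h)
    income_list.append(base + 300 * h + rng.normal(0, 2_000, 100))
    labels.append(np.full(100, i))
hours = np.concatenate(hours_list)
income = np.concatenate(income_list)
labels = np.concatenate(labels)

# Design matrix: intercept, hours, and one dummy per area (area 0 is the baseline).
dummies = (labels[:, None] == np.arange(1, 5)).astype(float)
X = np.column_stack([np.ones_like(hours), hours, dummies])
coefs, *_ = np.linalg.lstsq(X, income, rcond=None)
print(f"hours coefficient with area dummies: {coefs[1]:.0f}")
```

With the area dummies absorbing the regional income differences, the hours coefficient returns to its positive within-area value, and the dummy coefficients reflect each area’s income gap relative to the baseline area.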
Simpson’s Paradox represents a critical reminder of the limitations of analytic models, underscoring the necessity for astute analytical judgment. It’s a cautionary note to delve deeper and consider underlying patterns before hastily applying statistical models.
All the data were self-generated through random simulation in R (and later in Python). You can find the code in my GitHub repository: https://github.com/AlabhyaMe/SimpsonsParadox