Causal Machine Learning — Using ML Models in Social Experiments

Alabhya Dahal
10 min read · Sep 25, 2024


In the age of AI and machine learning, ML is frequently treated as a one-size-fits-all solution for every problem. While undeniably powerful and valuable, it is not always the appropriate tool for every situation.

Impact evaluation methods such as Randomized Control Trials (RCTs), A/B testing, and Difference-in-Differences are powerful tools for causal inference, but they are not designed for prediction. Machine learning methods like Logistic Regression, Random Forest, and Neural Networks excel at prediction but fall short of providing causal insights. In this article, I will demonstrate how we can combine both approaches to build a prescriptive model that predicts outcomes and allows for causal inference.

Suppose I am an aid mobilizer at a philanthropic organization. I face a choice: either fund training for trade school students studying electronics or allocate the money to broader labor skill development. The agency leans towards training the students, but I must ensure that the project has a significant impact and that resources are not wasted. To that end, I decided to pilot a research program to evaluate the effect of offering additional training to trade school students.

The priority is to ensure the program’s effectiveness. I want to observe a significant improvement in career opportunities for students who received the training. To assess this, I designed an experiment as part of the pilot program, where approximately 15,000 students were randomly selected to receive the training. To establish causal inference, I also randomly selected an equal number of students with similar characteristics to form a control group. Six months after the training was completed, I tracked the placement outcomes for each individual, which will serve as the benchmark for evaluating the program’s effectiveness.

The dataset looks like this:

  1. treatment: whether the student received the treatment; assigned at random
  2. age: the student’s age, ranging from 18 to 22
  3. score: the score obtained in the school-level (grade 10) examination
  4. gender: male or female
  5. experience: whether the student has previous work experience
  6. previous_exp: months of previous experience
  7. distance_majorcity: how far the student lives from a major city
  8. owns_motor: whether the student owns a motor vehicle
  9. placement: whether the student got a job after completing their education

1. Evaluating the Program — Randomized Control Trials

First, let me evaluate the program. By design, this experiment lends itself to a Randomized Control Trial (RCT) evaluation. To determine whether the program had a meaningful impact, I compared outcomes between the treatment and control groups using the following model:

Effect of Treatment = E(P = 1 | T = 1) − E(P = 1 | T = 0)

  • E(P = 1 | T = 1): the expected value (probability) of the outcome P = 1 (placement) for the treatment group (T = 1).
  • E(P = 1 | T = 0): the expected value (probability) of the outcome P = 1 for the control group (T = 0).

The placement and treatment|control matrix is as follows:

We observe that the treatment group is more likely to secure placements: 5,141 of the treated got placement, whereas only 2,359 in the control group did. Both groups have roughly 15,000 individuals.

The conditional probabilities of placement, with and without treatment, are:

E(Placement = 1 | Treatment = 1) = 5141 / 15032 ≈ 34%

E(Placement = 1 | Treatment = 0) = 2359 / 14968 ≈ 16%

The probability of securing a placement after receiving the treatment is nearly double that of those who were not treated. The average treatment effect is 18 percentage points. We run a hypothesis test to check whether we can reject the null hypothesis that the treatment has no effect on placement:

The t-test results show a t-value of 38.18 and a p-value far below 0.05, indicating statistical significance. The difference in conditional probability, combined with the t-test, allows us to confidently conclude that the program has been effective. Students who received the treatment secured placements at a far higher rate than those who did not.
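As a sanity check, the conditional probabilities, the treatment effect, and the t-test can be reproduced directly from the counts above. This is a quick sketch: the exact t-value depends on the variance assumption used, so it will land near, not exactly on, the reported 38.18.

```python
import numpy as np
from scipy import stats

# Reconstruct the two groups as binary placement outcomes from the
# counts reported above: 5,141 of 15,032 treated placed, 2,359 of
# 14,968 control placed.
treated = np.array([1] * 5141 + [0] * (15032 - 5141))
control = np.array([1] * 2359 + [0] * (14968 - 2359))

p_treated = treated.mean()   # E(P = 1 | T = 1)
p_control = control.mean()   # E(P = 1 | T = 0)
ate = p_treated - p_control  # average treatment effect

# Two-sample t-test on the binary outcomes (pooled variance by default)
t_stat, p_value = stats.ttest_ind(treated, control)

print(f"P(placed | treated) = {p_treated:.3f}")  # ~0.342
print(f"P(placed | control) = {p_control:.3f}")  # ~0.158
print(f"ATE                 = {ate:.3f}")        # ~0.184
print(f"t = {t_stat:.2f}, p = {p_value:.2e}")
```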

Based on the test results, I am confident the program has a significant impact. However, this model's limitation is that it only provides the average treatment effect across the whole sample, without identifying which specific students benefit most from the program. I want to estimate the conditional average treatment effect, i.e., how large the benefit is for a particular group of individuals. We need an ML model to estimate this conditional impact.

2. Machine Learning Prediction — Logistic Model

The RCT results confirmed that the program had a significant impact. But my goal is not just to evaluate whether the program is impactful, but also to predict who is more likely to secure placement. Many traditional machine learning models can handle this prediction; I will use logistic regression for its interpretability.

The logistic regression will be modeled as follows:

lr = (Placement ~ c + age + score + gender + experience + previous_exp + distance_majorcity + owns_motor + treatment)

Seventy-five percent of the dataset has been separated for training and 25% for testing. This is a simple logistic regression model with no hyperparameter tuning.
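A minimal sketch of this setup, using the column names from the dataset description but simulated stand-in data (the real pilot data is not reproduced here, so the resulting accuracy will not match the article's):

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)
n = 30_000

# Simulated stand-in for the pilot data; column names follow the
# dataset description above, but the values are made up.
df = pd.DataFrame({
    "treatment": rng.integers(0, 2, n),
    "age": rng.integers(18, 23, n),
    "score": rng.uniform(40, 100, n),
    "gender": rng.integers(0, 2, n),          # encoded 1 = male, 0 = female
    "experience": rng.integers(0, 2, n),
    "previous_exp": rng.integers(0, 25, n),   # months
    "distance_majorcity": rng.uniform(0, 100, n),
    "owns_motor": rng.integers(0, 2, n),
})
# Synthetic outcome loosely mimicking the article's findings
logit = -2.5 + 1.0 * df["treatment"] + 0.02 * df["score"] + 0.05 * df["previous_exp"]
df["placement"] = (rng.uniform(size=n) < 1 / (1 + np.exp(-logit))).astype(int)

X, y = df.drop(columns="placement"), df["placement"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42)  # 75% train / 25% test

# Plain logistic regression, no hyperparameter tuning
lr = LogisticRegression(max_iter=1000).fit(X_train, y_train)
acc = accuracy_score(y_test, lr.predict(X_test))
print(f"test accuracy: {acc:.3f}")
```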

The purpose of the logistic model is twofold: first, to identify the variables that most significantly impact placement, including the effect of the treatment itself, and second, to estimate which individuals are most likely to secure placement. By using this model, the program can better target individuals who are more likely to succeed in future placement efforts, optimizing the program’s impact.

The model gives the following results:

The treatment has a positive and significant impact on placement, which was also previously confirmed by the RCT experiment. All variables, except for gender, have a significant impact on placement. In addition to treatment, both experience and months of experience have large log-odds coefficients.

In this model, my primary interest is identifying individuals who will benefit the most from the program. However, I am equally concerned about minimizing errors to avoid wasting resources on individuals who are less likely to benefit. Based on the ROC curve below, I believe that a threshold of 0.4 provides a good balance between accuracy, true positive rate, and false positive rate.
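One way to pick such a threshold programmatically is to scan the ROC curve. The sketch below uses toy scores and Youden's J statistic (maximizing TPR - FPR), which is one common criterion and close in spirit to eyeballing the curve as done above:

```python
import numpy as np
from sklearn.metrics import roc_curve

rng = np.random.default_rng(1)

# Toy predicted probabilities and true labels, standing in for the
# fitted model's test-set output (illustrative only).
y_true = rng.integers(0, 2, 5000)
y_prob = np.clip(0.5 * y_true + rng.normal(0.25, 0.2, 5000), 0, 1)

fpr, tpr, thresholds = roc_curve(y_true, y_prob)

# Pick the threshold maximizing TPR - FPR (Youden's J)
best = np.argmax(tpr - fpr)
print(f"chosen threshold: {thresholds[best]:.2f} "
      f"(TPR={tpr[best]:.2f}, FPR={fpr[best]:.2f})")
```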

Running the regression, I achieved an accuracy of 89.97% on the test set. Does this mean I can use logistic regression to identify individuals who should be treated or not in the future since the model has such high accuracy?

This model has a significant limitation. It is built post-experiment, meaning it captures the impact of the treatment as it occurred. In other words, for those who received the treatment, we cannot determine what their outcome would have been without it. A possible workaround is to exclude the treatment variable, but then we would lose insight into the treatment's impact, even though half the individuals did benefit from being treated. The primary limitation, therefore, is that the machine learning model fails to provide a valid counterfactual for comparison.

3. Causal Machine Learning — Two Model Uplift

I want to ensure that the program reaches those who need it most, specifically students who would otherwise not secure a placement without the intervention. I want to avoid enrolling individuals who would achieve placement regardless of the treatment, and I certainly do not want to include those who are unlikely to get placed, whether they receive the treatment or not.

To address the limitations of both RCT and machine learning models while leveraging their strengths, I will integrate and synergize the two approaches.

Instead of running one model, I will develop two logistic models. One model will be developed from the treated group and the other will be developed from the control group. The rationale for this approach is that the logistic regression trained on the treatment group will provide predictions for the scenario where all individuals are treated, while the logistic regression on the control group will predict outcomes for the scenario where none of the individuals are treated. By doing this, I can estimate both possible outcomes — whether someone is treated or not — giving me a more complete understanding of the potential impact of the program.

lr_treat = (Placement ~ c + age + score + gender + experience + previous_exp + distance_majorcity + owns_motor | treatment =1)

lr_control = (Placement ~ c + age + score + gender + experience + previous_exp + distance_majorcity + owns_motor | treatment =0)

(Note: we no longer use the treatment variable as a predictor.)
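A minimal sketch of the two-model approach, using simulated data and a reduced, hypothetical feature set for brevity:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(2)
n = 30_000

# Simulated pilot data (columns follow the dataset description;
# values are made up for illustration).
df = pd.DataFrame({
    "treatment": rng.integers(0, 2, n),
    "score": rng.uniform(40, 100, n),
    "previous_exp": rng.integers(0, 25, n),
})
logit = -3 + 1.2 * df["treatment"] + 0.02 * df["score"] + 0.05 * df["previous_exp"]
df["placement"] = (rng.uniform(size=n) < 1 / (1 + np.exp(-logit))).astype(int)

features = ["score", "previous_exp"]  # treatment is NOT a feature

# lr_treat: trained only on the treated group
lr_treat = LogisticRegression(max_iter=1000).fit(
    df.loc[df.treatment == 1, features], df.loc[df.treatment == 1, "placement"])
# lr_control: trained only on the control group
lr_control = LogisticRegression(max_iter=1000).fit(
    df.loc[df.treatment == 0, features], df.loc[df.treatment == 0, "placement"])

# Score EVERY individual under both counterfactual scenarios
p_if_treated = lr_treat.predict_proba(df[features])[:, 1]
p_if_control = lr_control.predict_proba(df[features])[:, 1]
df["uplift"] = p_if_treated - p_if_control  # estimated individual effect
print(df["uplift"].describe())
```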

With both models, I have now made placement predictions with the full dataset.

The treatment prediction gives the probability that an individual will secure a placement, assuming everyone receives the treatment. Conversely, the control prediction gives the probability of placement if everyone is in the control group. Running the model separately allows me to estimate how other variables influence placement outcomes.

I will calculate the difference between the treated and control predictions to determine the impact of the treatment for each individual. While the individual effect may not be generalizable, it can be generalized across groups. To achieve this, I will sort the differences and divide them into deciles, allowing me to assess the impact of treatment within each decile.
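The sorting-and-decile step can be sketched as follows. The uplift scores, treatment flags, and outcomes here are random placeholders, so the incremental column will hover around zero rather than showing the real pattern; only the mechanics are illustrated.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(3)
n = 30_000

# Stand-in for the scored pilot data: predicted uplift plus the
# actual treatment assignment and actual placement outcome.
df = pd.DataFrame({
    "uplift": rng.normal(0.18, 0.1, n),
    "treatment": rng.integers(0, 2, n),
    "placement": rng.integers(0, 2, n),
})

# Sort by predicted uplift (descending) and cut into 10 equal deciles;
# decile 1 holds the largest predicted treatment effects.
df["decile"] = pd.qcut(df["uplift"].rank(method="first", ascending=False),
                       10, labels=range(1, 11))

# Actual (not predicted) outcomes summarized per decile
df["placed_t"] = df["placement"] * df["treatment"]
df["placed_c"] = df["placement"] * (1 - df["treatment"])
summary = df.groupby("decile", observed=True).agg(
    n_treated=("treatment", "sum"),
    n_total=("treatment", "size"),
    placed_treated=("placed_t", "sum"),
    placed_control=("placed_c", "sum"),
)
summary["n_control"] = summary["n_total"] - summary["n_treated"]
summary["incremental"] = (summary.placed_treated / summary.n_treated
                          - summary.placed_control / summary.n_control)
print(summary)
```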

Below is the summary of what each decile looks like.

Average Incremental over each decile

The table presents the distribution of individuals in both the treated and control groups, as well as those who secured placements. These values represent actual outcomes, not predicted ones. As expected, in the first decile, many individuals in the treated group received placements, while only a few did so in the control group. This is anticipated, since the deciles were sorted so that the largest predicted differences fall in the first decile. Note, however, that the deciles were divided based on predicted probabilities, whereas the table reflects actual outcomes. The alignment between predictions and actual results reinforces confidence in the model’s effectiveness.

By examining the proportional difference, it’s clear that the first decile shows the greatest impact. Individuals in decile 5 and beyond see minimal improvement. However, the average improvement across the first four deciles is quite significant. It is up to the decision maker to decide how many people to cover for the program. If we had cost data, we could calculate the ROI and determine which deciles to target for maximum return. Since this is a social program, though, the primary focus is on effectiveness rather than ROI. Thus, our goal would be to evaluate the overall impact and prioritize deciles based on the program’s social objectives.

Another method to evaluate this model is to look at the cumulative impact.

Cumulative Decile

I calculate the cumulative numbers of individuals in the treatment group, the control group, treatment individuals who received placement, and control individuals who received placement. Next, I calculate the difference between the weighted proportion of placements in the treatment and control groups within each decile. This difference is then divided by the total number of individuals in the control group.
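One way to compute such a cumulative incremental measure, using illustrative per-decile counts (these are not the article's actual numbers, and this is one common uplift-style formulation of the weighting described above):

```python
import pandas as pd

# Illustrative per-decile counts (made up for this sketch)
summary = pd.DataFrame({
    "n_treated":      [1500, 1520, 1490, 1510, 1500, 1480, 1505, 1495, 1500, 1500],
    "n_control":      [1500, 1480, 1510, 1490, 1500, 1520, 1495, 1505, 1500, 1500],
    "placed_treated": [900, 750, 600, 480, 400, 380, 370, 365, 360, 355],
    "placed_control": [200, 250, 300, 340, 370, 375, 368, 363, 360, 356],
}, index=range(1, 11))

# Running totals down the deciles
cum = summary.cumsum()

# Cumulative incremental placements: treated placements minus control
# placements re-weighted to the treated group's size, expressed per
# control individual.
cum["incremental"] = (cum.placed_treated
                      - cum.placed_control * cum.n_treated / cum.n_control
                      ) / cum.n_control
print(cum["incremental"])
```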

By doing this, I can determine the incremental effect of the treatment on placement outcomes at each decile, showing how much the treatment increases placements relative to the control group. This type of analysis is more common in marketing, where treating certain customers can actually backfire (a real risk in business, though rarely in social programs).

This model shows the additional benefits the treatment group experiences over what would have happened if they had not received the treatment. You can see that the first few deciles have the highest benefits, and then the benefits go down.

In a social program, the primary focus is on the benefits rather than the costs. However, as any economist would argue, resources are limited while needs are endless. Ideally, we would want to offer everyone an equal chance of receiving the treatment, but given resource constraints, targeted intervention proves to be more effective. Based on the model discussed above, it would be reasonable to exclude individuals in the 10th decile, as they derive no benefit from the treatment. Excluding those in the 5th decile onwards is also justifiable since the impact of the treatment on them is minimal.

As we scale up the program and aim to enroll new participants, we can utilize the models we’ve developed to make predictions. When expanding beyond the pilot phase, we can collect the same information on new individuals and apply the causal machine learning models we developed to identify those most likely to benefit. We know the program has a positive impact on placement outcomes, and with these models, we can target the students who stand to gain the most.

While I did not run a formal train-test split in this case (since the data was synthetic and simulated, and I already know the outcomes), it is essential to do so in real-world applications.

Find my GitHub repository here for the analysis.
