Six months ago Matthew Kraft published an excellent article on effect sizes.
I worked in education for five years before I had any understanding of research design and reporting. I wish Matt’s piece had been around a decade ago.
His article is a bit dense if you’re trying to just wrap your head around the issue, so consider this post a lay person’s intro to Matt’s piece and the subject itself.
If you catch any mistakes, please do let me know. I’m still learning.
Why are effect sizes useful?
Consider currencies. Currencies are useful because they allow you to easily compare prices across various goods. Instead of having to constantly refer to one set of goods in relation to another set (i.e., three apples are worth the same as four oranges, which are worth the same as three paperclips), we can use the same unit (dollars) to compare a bunch of different goods.
Effect sizes serve the same function. They help us easily compare the magnitude of the impact of a bunch of different interventions. We can do research on graduation rates, test scores, suspension rates, or whatever we want, and then we can convert our results into an effect size to help us compare how big of an impact we had.
Effect sizes are the unit of currency for measuring impact.
What is an effect size?
Many effect size calculations in education research are expressed in standard deviations.
A common formula to determine the effect size is:
(mean of experimental group – mean of control group) / standard deviation
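Here is a minimal sketch of that formula in Python. It assumes the denominator is the pooled standard deviation of the two groups, which is one common choice; the formula above does not say which standard deviation to use, and some studies use the control group's instead. The function name and the test scores are made up for illustration.

```python
from math import sqrt
from statistics import mean, variance


def effect_size(treatment, control):
    """Standardized mean difference: (treatment mean - control mean) / SD.

    Assumption: the SD is pooled across both groups.
    """
    n_t, n_c = len(treatment), len(control)
    # Weight each group's sample variance by its degrees of freedom.
    pooled_variance = (
        (n_t - 1) * variance(treatment) + (n_c - 1) * variance(control)
    ) / (n_t + n_c - 2)
    return (mean(treatment) - mean(control)) / sqrt(pooled_variance)


# Hypothetical test scores, for illustration only.
new_curriculum = [72, 68, 75, 80, 70]
old_curriculum = [67, 63, 70, 75, 65]

print(f"Effect size: {effect_size(new_curriculum, old_curriculum):.2f} standard deviations")
```

With only five students per group the number itself means little; the point is just to show how the pieces of the formula fit together.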
Let’s say we are trying to find the effect size of a new math curriculum on test scores. We might give half the population the new curriculum, half the old curriculum, and then see what the difference is.
Let’s say the difference is +5 pts out of 100 for the students using the new curriculum. The curriculum “worked.”
But what does that mean?
We now want to know if +5 pts is a big deal. This is where the standard deviation comes in.
A low standard deviation means there is very little difference in the population (everyone is scoring about the same score). A large standard deviation means there is a wide spread in scores.
Because the standard deviation is the denominator in the formula, the smaller it is, the larger the effect will be for any given difference between two groups.
In other words, if everyone is scoring between 62 and 65 out of a hundred, and you jump five points, you could go from the bottom 1% of test takers to the top 1% of test takers.
Because the standard deviation is low (small spread), a modest jump leads to a big effect.
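To make this concrete, here is a small Python sketch that holds the +5 point difference fixed and varies the spread. The three standard deviations are hypothetical values picked for illustration, not figures from any study.

```python
def standardized_effect(mean_difference, standard_deviation):
    """Express a raw score difference in standard deviation units."""
    return mean_difference / standard_deviation


difference = 5  # the same +5 point gain in every scenario

# Hypothetical spreads: tightly clustered, moderate, and widely spread scores.
for sd in (1, 5, 15):
    effect = standardized_effect(difference, sd)
    print(f"SD = {sd:>2} points -> effect size = {effect:.2f}")
```

The same five points is an enormous effect when scores are tightly clustered and a modest one when they are widely spread.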
What is a large effect?
This is where Matt’s paper is particularly useful.
Much of the previous literature on effect sizes made many mistakes:
- Sample sizes were ignored.
- Duration of treatment was ignored.
- Time elapsed until measurement was ignored.
- Cost was ignored.
Taken together, scalability of interventions was ignored. This had the unintended consequence of setting the bar too high for what should be considered a large effect size.
Bloom’s 2 standard deviation effect
You may have heard of Bloom’s 2 sigma tutoring intervention. This result is taken to show that 1-1 tutoring can have a 2 standard deviation (very large!) effect.
But Bloom’s study design was the following: take dozens of 4th, 5th, and 8th graders; give them 1-1 tutoring in discrete subjects like cartography or probability; and then test them on what they learned after 3-4 weeks!
It’s much easier to squeeze out a big effect under these conditions.
These types of small sample studies led to a research norm where an effect size had to be .8 standard deviations for it to be considered large.
New Orleans’ .4 standard deviation effect
Contrast Bloom’s study to Doug Harris’ study on the New Orleans education reforms.
The New Orleans study covered tens of thousands of students. Students received the treatment across all major subjects, including math, reading, science, and social studies. The treatment lasted multiple years. And students were tested once every year in each subject.
It’s a lot harder to make large gains under these conditions, especially when the intervention costs under 20% of 1-1 tutoring.
Doug’s study found .4 standard deviation effects for New Orleans students over a five year period.
In his paper he wrote that he was “not aware of any other districts that have made such large improvements in such a short time.”
- The standard bar for a large effect was .8 standard deviations. This was regardless of sample size, length of treatment, measurement proximity, or cost. The bar was poorly constructed.
- New Orleans achieved a +.4 standard deviation effect on test scores.
- Researchers had never seen a citywide effect this large before.
There are two ways to interpret this.
- The previous .8 standard deviation bar was way too high for large samples.
- The New Orleans effect, despite being relatively large for district improvement, is still so absolutely small that we should not be too impressed.
Was the New Orleans effect too small?
The +.4 standard deviation effect equates to the average New Orleans student moving from the 22nd to 37th percentile in performance.
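For readers who want to see how a standard deviation gain translates into percentiles, here is a rough Python sketch. It assumes test scores follow a normal distribution, which is my simplification and not necessarily how the study computed its figures. The function name is hypothetical.

```python
from statistics import NormalDist


def percentile_after_gain(start_percentile, effect_size_sd):
    """Percentile reached after a gain of effect_size_sd standard deviations.

    Assumption: scores are normally distributed.
    """
    standard_normal = NormalDist()  # mean 0, standard deviation 1
    # Convert the starting percentile to a z-score, shift it, convert back.
    z_start = standard_normal.inv_cdf(start_percentile / 100)
    return 100 * standard_normal.cdf(z_start + effect_size_sd)


print(f"{percentile_after_gain(22, 0.4):.1f}")
```

Under the normality assumption this lands in the mid-30s, in the same neighborhood as the 37th percentile cited above. The exact number depends on the real shape of the score distribution.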
For any individual, this might or might not be life changing. But in the aggregate this means the average New Orleans student roughly went from a borderline high school dropout (bottom 20% of performance) to a student who has a real chance to enter a two year or four year college (modestly below average performance).
Across a large population, this is a pretty big deal.
We should pay attention to a city level +.4 standard deviation increase in test scores. If this effect (or even one somewhat lower) can be scaled, kids across the country will have a better chance at leading a good life.
Of course, academics and test scores are just one piece of the puzzle of economic mobility, but they are an important piece. Schools with negative effects on test scores tend not to deliver great long-term life outcomes for kids.
Matt Kraft’s proposed effect size scale
When it comes to large interventions, Matt argues we should get rid of the .8 standard deviation benchmark.
Matt proposes the following rough scale:
Small effect: less than .05
Medium effect: .05 to .2
Large effect: .2 or larger
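The scale is simple enough to write as a lookup. Here is a minimal Python sketch; the function name is hypothetical, and judging negative effects by their absolute size is my assumption, not something the scale above specifies.

```python
def kraft_category(effect_size_sd):
    """Label an effect size (in standard deviations) using the rough scale above."""
    # Assumption: negative effects are judged by their absolute magnitude.
    magnitude = abs(effect_size_sd)
    if magnitude < 0.05:
        return "small"
    if magnitude < 0.2:
        return "medium"
    return "large"


for effect in (0.02, 0.1, 0.4):
    print(f"{effect:.2f} standard deviations -> {kraft_category(effect)}")
```

By this scale the New Orleans effect of .4 standard deviations is comfortably large.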
Matt reviews a bunch of educational studies to help come up with this table. While I don’t love that it averages a bunch of very different studies, at the very least it sets conservative estimates on effects and cost (given that averages include studies that don’t meet the highest bar for sample / duration / etc.).
Take a look at where .4 standard deviations shows up. The New Orleans reforms are in the 90th percentile of magnitude but the 60th percentile of costs. New Orleans increased its per-pupil spending by $1,400 in the years following Katrina, though it’s not clear to me that the money is what really drove the effect. But even if you assume it did, the results pass an ROI test.
Again, the New Orleans impacts are pretty remarkable.
In considering impact, cost, and scale, Matt also provides the following matrix:
New Orleans does well.
In Sum: Toyotas > Ferraris
When it comes to effect sizes, be very careful to review sample sizes, treatment duration, measurement proximity, and cost.
Holding out for .8 standard deviation effects is foolish. These effects will rarely occur and when they do they tend to be very hard to scale.
When it comes to large scale interventions across medium term time frames, effects above .2 standard deviations warrant our attention.
The most realistic path for broad academic gains is to look for meaningful jumps in student performance that are caused by an intervention that has a real chance of scaling over time. And then testing and scaling and testing and scaling.
In other words: Toyotas > Ferraris.