The Challenges of Experimental Research in Education

In educational research, the theory of change is usually articulated as: participation in intervention X will result in outcome or change Y. This is a causal claim, and causal claims are extremely difficult to substantiate in educational settings. How does a researcher know for sure that the outcome is truly the result of the intervention? How does one make a causal inference? For the purposes of this brief overview, let’s explore a hypothetical research question. As a teacher of art history, you are interested in understanding the effects of taking an introductory art history class offered at a local university. You believe that students who take this class have a higher capacity for historical empathy, as measured by a questionnaire that has been validated by psychometricians.

The first thing a researcher must consider is a counterfactual: what would have happened to the same students without the intervention. In practice it is approximated by a control group (in a randomized experiment) or a comparison group (in a quasi-experiment). The most common and weakest design in education is the good old pre-test, post-test. Imagine the local university has required all freshmen to enroll in the course and decided to measure its effects by having them take a pre-test of their historical empathy. Then, at the end of the course, the students take a post-test to see if there was any growth in their historical empathy. The results reveal a 200% increase in historical empathy, which must be the result of having taken this course, right? Even with the most exceptional teacher, one cannot attribute this change to the course. In between the pre-test and the post-test, this thing called life happened. Maybe the students saw a movie that increased their interest, read a book, or did any number of other things we can’t control. Because there was no counterfactual (a control or comparison group experiencing the messiness of life as well), the researcher cannot make any causal claims, only correlational ones at best.

Sometimes, researchers find natural experiments that can be looked at causally. Imagine the same scenario of an introductory art history course offered by the local university. The new administration has determined that all freshmen must take this course because they recognize the awesomeness of art history and believe that these young adults will have an increased measure of historical empathy after taking the course and will, therefore, be better humans. There is a clear line in the sand between the students who did not have to take the class in 2014 and the students who had to take the course in 2015. On average, the 2014 and 2015 students are probably the same. The only thing that makes them different is that one group did not have to take the art history class and the other group did. This is a Regression Discontinuity, with the year of enrollment serving as the cutoff. While it can get at causality, these types of policy decisions do not happen every day. Most often, researchers have to design their own experiments.

Ideally, a researcher would want to conduct an experiment similar to a medical or psychological study using random assignment. To do this, one would need more students wanting to enroll than the course has capacity for (this always happens in art history!). A lottery would then randomly determine who got to take the course and who did not. On average, both groups are the same. This could even be tested by giving both the treatment and control groups a pre-test; the results, on average, should be the same. Following the course, any difference in a post-test can be attributed to the course, because the only thing different about these two groups is that one group took the course and one group did not. Simple, right?
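To make the logic of random assignment concrete, here is a minimal simulation sketch in Python. Everything in it is an assumption made for illustration: the number of applicants, the score scale, a "life happens" gain of about 3 points for everyone, and a true course effect of 5 points. None of these numbers come from a real study.

```python
import random
import statistics


def simulate_lottery(n_applicants=4000, seats=2000, true_effect=5.0, seed=0):
    """Simulate a lottery for an oversubscribed course (all numbers are made up)."""
    rng = random.Random(seed)
    # Pre-test: every applicant has a baseline historical-empathy score.
    pre = [rng.gauss(50, 10) for _ in range(n_applicants)]
    # Random assignment: a lottery picks who gets a seat in the course.
    winners = set(rng.sample(range(n_applicants), seats))
    # Post-test: "life happens" to everyone (+3 on average, with noise);
    # only lottery winners also receive the assumed course effect.
    post = [
        score + rng.gauss(3, 5) + (true_effect if i in winners else 0.0)
        for i, score in enumerate(pre)
    ]
    treated = [i for i in range(n_applicants) if i in winners]
    control = [i for i in range(n_applicants) if i not in winners]
    mean = statistics.mean
    return {
        # Should be near zero: randomization balances the groups on average.
        "pre_gap": mean(pre[i] for i in treated) - mean(pre[i] for i in control),
        # Pre/post gain of the treated group alone: mixes "life" with the course.
        "naive_gain": mean(post[i] for i in treated) - mean(pre[i] for i in treated),
        # Treatment vs. control on the post-test: isolates the course effect.
        "effect_estimate": mean(post[i] for i in treated) - mean(post[i] for i in control),
    }


result = simulate_lottery()
for name, value in result.items():
    print(f"{name}: {value:.2f}")
```

In this sketch the pre/post gain of the treated group overstates the course effect because it also absorbs everything else that happened between the two tests, while the treatment-versus-control difference lands close to the assumed effect.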

Wrong. The problem with experimental design in education is that we are dealing with real humans in the real world, not in a lab where we can keep them locked up. People break the rules all the time. Imagine that over half the treatment group decides not to attend, giving you a high rate of attrition (strictly speaking, noncompliance with the assigned treatment). How could you possibly measure the effects of the intervention? This often happens, and researchers are left measuring the effect of the Intent to Treat, which captures only the effect of winning the lottery. Simply comparing the students who attended with everyone else would bias an estimate of the effect of Treatment on the Treated, because there may be something different about the students who won the lottery but decided not to attend. When there is high attrition, as there often is in educational research, one will need to use another approach to prevent bias, such as an instrumental variable, a technique developed in econometrics. That is beyond the scope of this brief overview, but you can find more about it here.
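As a rough illustration of the difference between Intent to Treat and Treatment on the Treated, the sketch below uses a tiny made-up data set and the simplest instrumental-variable calculation (often called the Wald estimator), with the lottery result as the instrument. The record layout and the function name `itt_and_tot` are hypothetical. The calculation assumes that winning the lottery affects scores only through attending the course and that lottery losers cannot attend.

```python
import statistics


def itt_and_tot(records):
    """records: list of (won_lottery, attended, post_score) tuples."""
    won = [r for r in records if r[0]]
    lost = [r for r in records if not r[0]]
    # Intent to Treat: compare by lottery result, ignoring actual attendance.
    itt = statistics.mean(r[2] for r in won) - statistics.mean(r[2] for r in lost)
    # Take-up: how much winning the lottery raises the attendance rate.
    take_up = (statistics.mean(int(r[1]) for r in won)
               - statistics.mean(int(r[1]) for r in lost))
    # Wald (instrumental-variable) estimate: scale the ITT by the take-up rate.
    tot = itt / take_up
    return itt, take_up, tot


# Made-up data: half of the lottery winners never attend the course.
records = (
    [(True, True, 60)] * 4      # won and attended
    + [(True, False, 50)] * 4   # won but did not attend
    + [(False, False, 50)] * 8  # lost the lottery
)
itt, take_up, tot = itt_and_tot(records)
print(itt, take_up, tot)
```

With these invented numbers the Intent to Treat effect is 5 points, but only half of the winners attended, so the implied effect on students who actually took the course is 10 points.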

So even if you have all the conditions to conduct a Random Assignment Evaluation, considered the gold standard, so many things can happen that will bias your results. However, once in a while we do get great research that meets this gold standard. In 2011, Crystal Bridges Museum of American Art, located in Bentonville, Arkansas, opened as the first art museum of significance in the region. Researchers were able to randomize school groups into treatment (field trip) and control (no field trip) groups and measure the effects of a one-time field trip on students. In addition, because of the large sample size, there was enough statistical power to detect statistically significant differences between the groups. You can read more about the research design, and the instruments used to measure outcomes, here.

More common in educational research is a comparison between a treatment group and a comparison group (note: not a control group, because the intervention is not randomly assigned). In this scenario, the researcher must find a comparison group of students within the local university. How can one know if the two groups are truly, on average, the same? Isn’t there something about the student who wants to take the course that inherently makes them different? It is really difficult to find a virtual twin, but propensity score matching can help by pairing participants with non-participants who are similar on measured variables. But there are always qualities that are difficult to measure, so unlike with random assignment, it is difficult to say that the groups are, on average, the same.
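The idea of finding a "virtual twin" can be sketched with nearest-neighbor matching on a single measured variable, here the pre-test score. This is a deliberately simplified stand-in for propensity score matching, which would instead match on each student's estimated probability of enrolling given many variables. The function name and the scores below are invented for illustration.

```python
import statistics


def matched_difference(treated, comparison):
    """Each student is a (pre_score, post_score) tuple.

    For every treated student, pick the comparison student with the closest
    pre-test score (matching with replacement) and average the differences
    in post-test scores across the matched pairs.
    """
    diffs = []
    for pre_t, post_t in treated:
        # The "virtual twin": the comparison student most similar on the pre-test.
        _, post_c = min(comparison, key=lambda s: abs(s[0] - pre_t))
        diffs.append(post_t - post_c)
    return statistics.mean(diffs)


# Made-up scores: students who chose the course vs. students who did not.
treated = [(60, 70), (70, 80)]
comparison = [(40, 42), (59, 64), (71, 75), (90, 92)]

naive = (statistics.mean(post for _, post in treated)
         - statistics.mean(post for _, post in comparison))
print("unmatched difference:", naive)
print("matched difference:", matched_difference(treated, comparison))
```

Matching changes the estimate here only because the groups differ on the pre-test. It does nothing about differences on qualities nobody measured, which is exactly the limitation described above.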

In a nutshell, rigorous educational research is really difficult. But it is also really important. Policy decisions are made on the basis of sound research, from deciding whether students can go on field trips to which courses should be eliminated from, or added to, a university’s curriculum. If the research you are conducting or reading is making a causal claim but cannot provide a counterfactual, if it cannot answer the question “as compared to what?”, then be skeptical of its validity.


The What Works Clearinghouse provides a database of research that offers credible evidence on educational programs.

A great website on research methods, with a definition of each approach, can be found here.

