An experiment is a study in which the researcher manipulates the treatment, or intervention, and then measures the outcome. It addresses the question “if we change X (the treatment or intervention), what happens to Y (the outcome)?” Conducted both in the laboratory and in real life situations, experiments are powerful techniques for evaluating cause-and-effect relationships. The researcher may manipulate whether research subjects receive a treatment (e.g., attendance in a Head Start program: yes or no) or the level of treatment (e.g., hours per day in the program).
Suppose, for example, a group of researchers was interested in the effect of government-funded child care subsidies on maternal employment. They might hypothesize that the provision of government-subsidized child care would promote such employment. They could then design an experiment in which some mothers would be provided the option of government-funded child care subsidies and others would not. The researchers might also manipulate the value of the child care subsidies in order to determine if higher subsidy values might result in different levels of maternal employment.
The group of participants that receives the intervention or treatment is known as the “treatment group.” The group that does not is known as the “control group” in randomized experiments and the “comparison group” in quasi-experiments.
The key distinction between randomized experiments and quasi-experiments is how participants end up in each group: in a randomized experiment, participants are randomly assigned to either the treatment or the control group, whereas in a quasi-experiment they are not.
Random assignment ensures that all participants have the same chance of being in a given experimental condition. Randomized experiments (also known as randomized controlled trials, or RCTs) are considered the most rigorous approach, or the “gold standard,” for identifying causal effects because, in theory, they eliminate systematic preexisting differences between the treatment and control groups. However, some differences might still occur due to chance. In practice, therefore, researchers often control for observed characteristics that might differ between individuals in the treatment and control groups when estimating treatment effects. The use of control variables also improves the precision of treatment effect estimates.
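The logic of random assignment can be illustrated with a small simulation. This is a minimal sketch, not an analysis of any real study: the function name, the baseline characteristic, the sample size, and the true effect of 5 points are all hypothetical values chosen for illustration.

```python
import random
import statistics


def simulate_randomized_experiment(n=2000, true_effect=5.0, seed=0):
    """Randomly assign n simulated participants and estimate the effect.

    All numbers are made up for illustration (hypothetical data).
    Returns (baseline gap between groups, estimated treatment effect).
    """
    rng = random.Random(seed)
    # A preexisting characteristic that also influences the outcome.
    baseline = [rng.gauss(50, 10) for _ in range(n)]

    # Random assignment: shuffle participant indices, treat the first half.
    order = list(range(n))
    rng.shuffle(order)
    treated = set(order[: n // 2])

    outcome = [
        baseline[i] + (true_effect if i in treated else 0.0) + rng.gauss(0, 5)
        for i in range(n)
    ]

    t_idx = [i for i in range(n) if i in treated]
    c_idx = [i for i in range(n) if i not in treated]

    # With random assignment the groups should be similar at baseline,
    # apart from chance differences.
    baseline_gap = (statistics.mean(baseline[i] for i in t_idx)
                    - statistics.mean(baseline[i] for i in c_idx))
    # The simple difference in mean outcomes estimates the treatment effect.
    effect = (statistics.mean(outcome[i] for i in t_idx)
              - statistics.mean(outcome[i] for i in c_idx))
    return baseline_gap, effect


baseline_gap, effect = simulate_randomized_experiment()
print(f"baseline gap: {baseline_gap:.2f}, estimated effect: {effect:.2f}")
```

In this sketch the baseline gap is small but not exactly zero, which is the chance imbalance that control variables are meant to absorb.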
Despite being the “gold standard” in causal study design, randomized experiments are not common in social science research because it is often impossible or unethical to randomize individuals to experimental conditions. Cluster-randomized experiments, in which groups (e.g., schools or classes) instead of individuals are randomized, often encounter fewer ethical objections and are therefore more feasible in real life. They also help prevent treatment spillover to the control group. For example, if students in the same class are randomly assigned to either the treatment or control group, with the treatment being a new curriculum, teachers may introduce features of the treatment (i.e., the new curriculum) when working with students in the control group in ways that might affect the outcomes.
One drawback of cluster-randomized experiments is a reduction in statistical power. Because individuals within the same cluster tend to resemble one another, each additional participant contributes less independent information than in an individually randomized design, so the likelihood of detecting a true effect is reduced.
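One common way to quantify this loss of power is the design effect, 1 + (m - 1) × ICC, where m is the cluster size and ICC is the intraclass correlation. The sketch below assumes equal-sized clusters; the function names and the example values (40 schools, 25 students each, ICC of 0.10) are hypothetical.

```python
def design_effect(cluster_size, icc):
    """Variance inflation from clustering, assuming equal cluster sizes."""
    return 1.0 + (cluster_size - 1) * icc


def effective_sample_size(n_clusters, cluster_size, icc):
    """Number of independent observations the clustered sample is 'worth'."""
    total = n_clusters * cluster_size
    return total / design_effect(cluster_size, icc)


# Hypothetical example: 40 schools with 25 students each, ICC = 0.10.
deff = design_effect(25, 0.10)
ess = effective_sample_size(40, 25, 0.10)
print(round(deff, 2), round(ess))  # 3.4 294
```

Under these assumed values, 1,000 students in 40 schools carry roughly the information of 294 independently randomized students.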
Quasi-experiments are characterized by the lack of randomized assignment. They may or may not have comparison groups. When there are both comparison and treatment groups in a quasi-experiment, the groups differ not only in terms of the experimental treatment they receive, but also in other, often unknown or unknowable, ways. As a result, there may be several "rival hypotheses" competing with the experimental manipulation as explanations for observed results.
Quasi-experiments come in a variety of forms. Below are some of the most common types in social and policy research, arranged from weakest to strongest in their ability to address threats to the claim that the relationship between the treatment and the outcome of interest is causal.
One-group pretest-posttest
A single group that receives the treatment is observed at two time points, one before the treatment and one after the treatment. Changes in the outcome of interest are presumed to be the effect of the treatment. For example, a new fourth grade math curriculum is introduced and students' math achievement is assessed in the fall and spring of the school year. Improved scores on the assessment are attributed to the curriculum. The biggest weakness of this design is that a number of events can happen around the time of the treatment and influence the outcome. There can be multiple plausible alternative explanations for the observed results.
Interrupted time series
A single group that receives the treatment is observed at multiple time points both before and after the treatment. A change in the trend around the time of the treatment is presumed to be the treatment effect. For example, individuals participating in an exercise program might be weighed each week before and after a new exercise routine is introduced. A downward shift in their weight trend around the time the new routine was introduced would be seen as evidence of the effectiveness of the treatment. This design is stronger than the one-group pretest-posttest design because it shows the trend in the outcome variable both before and after the treatment instead of a simple two-point-in-time comparison. However, it still suffers from the same weakness: other events can happen at the time of the treatment and be alternative causes of the observed outcome.
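A simple version of this analysis fits a trend line to the pre-treatment observations, projects it into the post-treatment period, and compares the projection with what was actually observed. The sketch below is one of several ways to analyze an interrupted time series; the helper names and the weekly weights are hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def its_level_change(times, outcomes, treatment_start):
    """Average gap between observed post-treatment outcomes and the
    values projected from the pre-treatment trend."""
    pre = [(t, y) for t, y in zip(times, outcomes) if t < treatment_start]
    post = [(t, y) for t, y in zip(times, outcomes) if t >= treatment_start]
    slope, intercept = fit_line([t for t, _ in pre], [y for _, y in pre])
    gaps = [y - (intercept + slope * t) for t, y in post]
    return sum(gaps) / len(gaps)


# Hypothetical data: weight already falls 0.5 lb per week before the new
# routine; from week 6 onward it is an additional 3 lb lower.
weeks = list(range(12))
weights = [200 - 0.5 * t - (3.0 if t >= 6 else 0.0) for t in weeks]
print(round(its_level_change(weeks, weights, treatment_start=6), 2))  # -3.0
```

With these made-up numbers, a naive comparison of the average weight before and after week 6 would show a 6 lb drop, half of which is just the continuation of the earlier trend.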
Static-group comparison
A group that has experienced some treatment is compared with one that has not. Observed differences between the two groups are assumed to be the result of the treatment. For example, fourth graders in some classrooms in a school district are introduced to a new math curriculum while fourth graders in other classrooms in the district are not. Differences in the math scores of the two groups, assessed only in the spring of the school year, are assumed to be the result of the new curriculum. The weakness of this design is that the treatment and comparison groups may not be truly comparable because participants are not randomly assigned to the groups and there may be important differences in the characteristics and experiences of the groups, only some of which may be known. If the two groups differ in ways that affect the outcome of interest, a causal interpretation of the difference is not warranted.
Difference-in-differences
Both treatment and comparison groups are measured before and after the treatment. The difference between the two before-after differences is presumed to be the treatment effect. This design is an improvement on the static-group comparison because it compares changes in outcomes measured both before and after the treatment is introduced instead of two post-treatment outcomes. For example, the fourth graders in the prior example are assessed in both the fall (pre-treatment) and spring (post-treatment). Differences in the fall-to-spring score gains between the two fourth grade groups are seen as evidence of the effect of the curriculum. Because the design compares changes rather than levels, the treatment and comparison groups do not have to be perfectly comparable at the outset. The biggest challenge for the researcher is to defend the parallel trends assumption, namely that the change in the treatment group would have been the same as the change in the comparison group in the absence of the treatment.
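The calculation itself is short. The sketch below works through it with hypothetical fall and spring math scores; the function name and all score values are made up for illustration.

```python
def difference_in_differences(treat_pre, treat_post, comp_pre, comp_post):
    """Change in the treatment group minus change in the comparison group."""
    def mean(values):
        return sum(values) / len(values)

    treat_change = mean(treat_post) - mean(treat_pre)
    comp_change = mean(comp_post) - mean(comp_pre)
    return treat_change - comp_change


# Hypothetical scores: the comparison classrooms start higher, so a
# spring-only comparison would be misleading.
effect = difference_in_differences(
    treat_pre=[60, 62, 64], treat_post=[72, 74, 76],   # gain of 12
    comp_pre=[70, 72, 74], comp_post=[76, 78, 80],     # gain of 6
)
print(effect)  # 6.0
```

In these made-up data the treatment group still scores lower in the spring (74 versus 78 on average), yet it gained 6 points more than the comparison group. That 6-point difference is the treatment effect estimate only if the parallel trends assumption holds.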
Regression discontinuity
Participants are assigned to experimental conditions based on whether their scores are above or below a cut point on a quantitative variable. For example, students who score below 75 on a math test are assigned to the treatment group, with the treatment being an intensive tutoring program. Those who score at or above 75 are assigned to the comparison group. The students who score just above or just below the cut point are considered to be, on average, identical because their score differences are most likely due to chance. These students can therefore be treated as if they had been randomly assigned. The difference in the outcome of interest (e.g., math ability as measured by a different test after the treatment) between the students right around the cut point is presumed to be the treatment effect.
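One way to estimate the difference at the cut point is to fit a separate line to the observations on each side, within a chosen bandwidth, and compare the two predictions at the cut point. The sketch below applies this to the tutoring example; the function names, the bandwidth of 10 points, and the data are hypothetical, and real analyses involve further choices (bandwidth selection, functional form) not shown here.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def rd_estimate(scores, outcomes, cutoff, bandwidth):
    """Treated-minus-comparison gap in predicted outcomes at the cutoff.

    Students scoring below the cutoff are treated. The running variable is
    centered at the cutoff so each intercept is the prediction at the cutoff.
    """
    below = [(s - cutoff, y) for s, y in zip(scores, outcomes)
             if cutoff - bandwidth <= s < cutoff]
    above = [(s - cutoff, y) for s, y in zip(scores, outcomes)
             if cutoff <= s < cutoff + bandwidth]
    _, treated_at_cutoff = fit_line([x for x, _ in below], [y for _, y in below])
    _, comparison_at_cutoff = fit_line([x for x, _ in above], [y for _, y in above])
    return treated_at_cutoff - comparison_at_cutoff


# Hypothetical data: later outcomes rise with the earlier test score, and
# tutoring (offered below 75) adds 8 points.
scores = list(range(60, 90))
outcomes = [40 + 0.5 * s + (8.0 if s < 75 else 0.0) for s in scores]
print(round(rd_estimate(scores, outcomes, cutoff=75, bandwidth=10), 2))  # 8.0
```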
Regression discontinuity is an alternative to randomized experiments when the latter design is not possible. It is the only recognized quasi-experimental design that meets the Institute of Education Sciences standards for establishing causal effects. Although considered to be a strong quasi-experimental design, it needs to meet certain conditions.
See the following for additional information on randomized and quasi-experimental designs.
- The Core Analytics of Randomized Experiments for Social Research (PDF)
- Experimental and Quasi-Experimental Designs for Research (PDF)
- Experimental and Quasi-Experimental Designs for Generalized Causal Inference (PDF)
Instrumental variables
An instrumental variable (IV) is a variable that is correlated with the independent variable of interest and affects the dependent variable only through that independent variable. The IV approach can be used in both randomized experiments and quasi-experiments.
In randomized experiments, the IV approach is used to estimate the effect of treatment receipt, which is different from treatment offer. Many social programs can only offer participants the treatment, or intervention, and cannot require them to use it. For example, parents are randomly assigned by lottery to a school voucher program. Those in the treatment group are offered vouchers to help pay for private school, but ultimately it is up to the parents to decide whether or not they will use the vouchers. If the researcher is interested in estimating the impact of voucher usage, namely the effect of treatment receipt, the IV approach is one way to do so. In this case, the IV is the treatment assignment status (e.g., a dummy variable with 1 being in the treatment group and 0 being in the control group). Assignment status is used to predict the probability of a parent using the voucher, and that predicted probability is in turn used as the independent variable of interest to estimate the effect of voucher usage.
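With a binary assignment indicator as the only instrument, the IV estimate reduces to a simple ratio (often called the Wald estimator): the effect of being offered the treatment divided by the difference in take-up between the two groups. The sketch below uses a tiny hypothetical voucher data set; the function name and all values are made up.

```python
def wald_iv_estimate(assigned, received, outcome):
    """Effect of treatment receipt, using random assignment as the instrument.

    assigned: 1 if offered the voucher, 0 otherwise
    received: 1 if the voucher was actually used, 0 otherwise
    Returns (effect of the offer, difference in take-up, effect of receipt).
    """
    def mean(values):
        return sum(values) / len(values)

    offered = [i for i, a in enumerate(assigned) if a == 1]
    control = [i for i, a in enumerate(assigned) if a == 0]

    # Effect of being offered the treatment (intent-to-treat effect).
    offer_effect = (mean([outcome[i] for i in offered])
                    - mean([outcome[i] for i in control]))
    # How much the offer changes the share actually receiving treatment.
    take_up_gap = (mean([received[i] for i in offered])
                   - mean([received[i] for i in control]))
    if take_up_gap == 0:
        raise ValueError("The offer does not change take-up; no IV estimate.")
    return offer_effect, take_up_gap, offer_effect / take_up_gap


# Hypothetical data: half of the offered families use the voucher.
assigned = [1, 1, 1, 1, 0, 0, 0, 0]
received = [1, 1, 0, 0, 0, 0, 0, 0]
outcome = [60, 60, 50, 50, 50, 50, 50, 50]
print(wald_iv_estimate(assigned, received, outcome))  # (5.0, 0.5, 10.0)
```

In this made-up example the offer raises the average outcome by 5 points, but only half of the offered families use the voucher, so the implied effect of actually using it is 10 points.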
In quasi-experiments, the IV approach is used to address the issue of endogeneity, namely that the treatment status is determined by participants themselves (self-selection) or by criteria established by the program designer (treatment selection). Endogeneity is an issue that plagues quasi-experiments and is often a source of threats to the causal claim. The IV approach can be used to tease out the causal impact of an endogenous variable on the outcome. For example, researchers used cigarette taxes as an instrumental variable to estimate the effect of maternal smoking on birth outcomes (Evans and Ringel, 1999). Cigarette taxes affect how much pregnant mothers smoke but are assumed not to affect birth outcomes directly. They therefore meet the conditions for being an IV: they correlate with the independent variable/treatment (i.e., maternal smoking habit) and affect the dependent variable (i.e., birth outcomes) only through that independent variable. The estimated effect is, strictly speaking, a local average treatment effect, namely the effect of treatment (maternal smoking) among those mothers whose smoking is affected by the IV (cigarette taxes). It does not include mothers whose smoking habit is not affected by the price of cigarettes (e.g., chain smokers who may be addicted to nicotine).
An instrumental variable needs to meet certain conditions to provide a consistent estimate of a causal effect, including the two described above: it must be correlated with the treatment, and it must affect the outcome only through the treatment.
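A small simulation can show why an instrument helps when treatment is endogenous. The sketch below is loosely patterned on the smoking example but uses entirely hypothetical numbers; it does not reproduce the data or results of Evans and Ringel (1999). An unobserved factor raises smoking and lowers the birth outcome, so the naive regression slope is biased, while the ratio of covariances with the instrument recovers the assumed true effect.

```python
import random


def covariance(a, b):
    """Population covariance of two equal-length lists."""
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    return sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b)) / len(a)


def simulate_iv(n=20000, true_effect=-2.0, seed=1):
    """Return (naive regression slope, IV estimate) on simulated data.

    All coefficients are hypothetical values chosen for illustration.
    """
    rng = random.Random(seed)
    tax = [rng.gauss(0, 1) for _ in range(n)]         # instrument
    unobserved = [rng.gauss(0, 1) for _ in range(n)]  # confounder
    # Higher taxes reduce smoking; the unobserved factor increases it.
    smoking = [-0.8 * z + u + rng.gauss(0, 1)
               for z, u in zip(tax, unobserved)]
    # The outcome depends on smoking and on the unobserved factor, but the
    # tax has no direct path to it (the exclusion restriction).
    birth_outcome = [true_effect * x - 3.0 * u + rng.gauss(0, 1)
                     for x, u in zip(smoking, unobserved)]

    naive = covariance(smoking, birth_outcome) / covariance(smoking, smoking)
    iv = covariance(tax, birth_outcome) / covariance(tax, smoking)
    return naive, iv


naive, iv = simulate_iv()
print(f"naive slope: {naive:.2f}, IV estimate: {iv:.2f}")
```

In this setup the naive slope overstates the harm of smoking because it also picks up the unobserved factor, while the IV estimate should land close to the assumed true effect of -2.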
See the following for additional information on instrumental variables.
- An introduction to instrumental variable assumptions, validation and estimation
- An Introduction to Instrumental Variables (PDF)
Two types of validity are of concern when evaluating experimental results: internal and external. It is often difficult to achieve both in social science experiments.
- Internal Validity
- Internal validity refers to the strength of evidence of a causal relationship between the treatment (e.g., child care subsidies) and the outcome (e.g., maternal employment).
- When subjects are randomly assigned to treatment or control groups, we can assume that the treatment caused the observed outcomes because the two groups should not have differed from one another at the start of the experiment.
- For example, take the child care subsidy example above. Since research subjects were randomly assigned to the treatment (child care subsidies available) and control (no child care subsidies available) groups, the two groups should not have differed at the outset of the study. If, after the intervention, mothers in the treatment group were more likely to be working, we can assume that the availability of child care subsidies promoted maternal employment.
One potential threat to internal validity in experiments occurs when participants either drop out of the study or refuse to participate in the study. If individuals with particular characteristics drop out or refuse to participate more often than individuals with other characteristics, this is called differential attrition. For example, suppose an experiment was conducted to assess the effects of a new reading curriculum on the reading achievement of 10th graders. Schools were randomly assigned to use the new curriculum in all classrooms (treatment schools) or to continue using their current curriculum (control schools). If many of the slowest readers in treatment schools left the study before it was completed (e.g., dropped out of school or transferred to a school in another state), schools with the new curriculum would experience an increase in their average reading scores. That increase, however, would occur because weaker readers left the schools, not because the new curriculum improved students' reading skills. The effects of the curriculum on the achievement of 10th graders might therefore be overestimated if the control schools did not experience the same type of attrition.
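The reading curriculum scenario can be made concrete with a few lines of arithmetic. In the sketch below the curriculum has no effect at all, yet the treatment schools appear to do better once their weakest readers leave. The function name, the scores, and the dropout threshold are hypothetical.

```python
def apparent_effect_with_attrition(treatment_scores, control_scores,
                                   dropout_below):
    """Difference in mean scores after treatment-school students scoring
    below `dropout_below` leave the study. Control schools lose no one."""
    def mean(values):
        return sum(values) / len(values)

    remaining = [s for s in treatment_scores if s >= dropout_below]
    return mean(remaining) - mean(control_scores)


# Hypothetical data: both sets of schools have identical score
# distributions, so the true effect of the curriculum is zero.
treatment_scores = list(range(40, 100))
control_scores = list(range(40, 100))

print(apparent_effect_with_attrition(
    treatment_scores, control_scores, dropout_below=55))  # 7.5
```

The 7.5-point gap in this made-up example is produced entirely by who remained in the study, which is why researchers compare attrition rates and the characteristics of leavers across groups.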
- External Validity
- External validity, or generalizability, is also of particular concern in social science experiments.
- It can be very difficult to generalize experimental results to groups that were not included in the study.
- Studies that randomly select participants from the most diverse and representative populations are more likely to have external validity.
For example, a study shows that a new curriculum improved reading comprehension of third-grade children in Iowa. To assess the study's external validity, the researcher would consider whether this new curriculum would also be effective with third graders in New York or with children in other elementary grades.
Advantages of randomized experiments
- Yield the most accurate assessment of cause and effect.
- Typically have strong internal validity.
- Ensure that the treatment and control groups are comparable on average and that treatment status is not determined by participant characteristics that might influence the outcome.
Disadvantages of randomized experiments
- In social policy research, it can be impractical or unethical to conduct randomized experiments.
- They typically have limited external validity because they often rely on volunteers and are implemented in a somewhat artificial experimental setting with a small number of participants.
- Despite being the “gold standard” for identifying causal impacts, they can also face threats to internal validity such as attrition, contamination, cross-overs, and Hawthorne effects.
Advantages of quasi-experiments
- Often have stronger external validity than randomized experiments because they are typically implemented in real-world settings and on a larger scale.
- May be more feasible than randomized experiments because they face fewer of the time and logistical constraints associated with random assignment.
- Avoid the ethical concerns associated with random assignment.
- Are often less expensive than randomized experiments.
Disadvantages of quasi-experiments
- They often have weaker internal validity than randomized experiments.
- The lack of randomized assignment means that the treatment and comparison groups may not be comparable and that treatment status may be driven by participant characteristics or other experiences that might influence the outcome.
- Conclusions about causality are less definitive than those from randomized experiments due to the lack of randomization and reduced internal validity.
- Despite having weaker internal validity, they are often the best option available when it is impractical or unethical to conduct randomized experiments.