Day 3 of a week-long series on teacher effectiveness. Links to posts on what makes a teacher good and teacher interdependence here. Today's installment involves the problem of sorting.
Measuring the effects of teachers on students is challenging because students are not randomly assigned to teachers. In everyday conversation and journalistic accounts, the distinction between treatment and selection effects is rarely acknowledged. Ironically, teachers know this statistical trade secret well, but get lambasted when they point out that not all teachers are working with the same inputs.
Imagine a study evaluating the effectiveness of an exercise program. The first trial randomly assigns study participants to two conditions: running five miles a day or watching television. If, two months later, exercisers are in better shape than the TV watchers, we can conclude that the program is responsible. The average difference in fitness between these two groups is called a treatment effect.
Now, imagine a second trial, where participants themselves decide whether they want to exercise or watch TV (baseball post-season, anyone?). People know how athletic they are. They have a pretty good idea of whether they would like to, or even can, run five miles a day. Suppose that the track team signs up for the exercise program, while those who have a special distaste for physical activity (or were cut from the track team) opt for television. Two months later, we find that those participating in the exercise condition have lower blood pressure and less body fat.
What can we make of the effects of the program? Not much. Common sense dictates that members of the track team likely had lower blood pressure and less body fat to begin with. In this scenario, we can't just compare average differences in fitness between exercisers and TV watchers because of selection bias.
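To make the contrast between the two trials concrete, here is a minimal simulation sketch. Everything in it is an assumption made for illustration: the fitness scale, the sample size, the size of the program's true effect (`true_effect`), and the rule that fitter people are more likely to sign up for exercise. None of it comes from a real study.

```python
import random
import statistics

def simulate_trial(self_selected, n=10_000, true_effect=5.0, seed=0):
    """Return the observed fitness gap (exercisers minus TV watchers).

    All numbers are hypothetical; the program's real benefit is `true_effect`.
    """
    rng = random.Random(seed)
    exercisers, watchers = [], []
    for _ in range(n):
        baseline = rng.gauss(50, 10)  # fitness before the study, made-up units
        if self_selected:
            # Second trial: people who are already fit tend to pick exercise.
            chooses_exercise = baseline + rng.gauss(0, 5) > 50
        else:
            # First trial: a coin flip decides, regardless of baseline fitness.
            chooses_exercise = rng.random() < 0.5
        if chooses_exercise:
            exercisers.append(baseline + true_effect)
        else:
            watchers.append(baseline)
    return statistics.mean(exercisers) - statistics.mean(watchers)

random_gap = simulate_trial(self_selected=False)
selected_gap = simulate_trial(self_selected=True)
print(f"gap under random assignment: {random_gap:.1f}")
print(f"gap under self-selection:    {selected_gap:.1f}")
```

Under these assumptions, the randomized trial recovers a gap close to the true effect, while the self-selected trial reports a much larger gap. The extra difference is not the program at all; it is the head start the track team brought with them.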
If students were randomly assigned to teachers, differences in teacher performance would represent a treatment effect, or a “teacher effect.” These differences could thus be attributed to something the teacher did. As the example above illustrates, accurately measuring the effects of teachers would require this type of assignment. Of course, students are not randomly assigned to teachers. Parents do not mindlessly flip a coin and leave their child’s placement with a bad teacher up to chance; we know that principals and guidance counselors often heed parents’ wishes in the teacher placement process. Parents aside, we also know that principals non-randomly assign kids to teachers based on their sense of which teachers are good with certain kinds of kids. (In the worst examples of this I’ve seen, new teachers are given the biggest behavior problems to “make them or break them.”)
If performance is measured by absolute criteria, and lower-performing students are not distributed equally across teachers, some teachers will spuriously appear to be doing much better than others. Growth models that account for students’ initial scores help, but we rarely hear of models that also take into account the demographic and social characteristics that affect student achievement. Given the way we currently measure teacher effectiveness, a high-performing teacher is one who enrolls high-performing students in the first place.
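The same logic can be sketched for two classrooms. In this toy simulation, both teachers are given an identical true effect by construction, but one receives the students with the higher prior scores. The `background` variable is a hypothetical stand-in for unmeasured demographic and social factors, and every coefficient is an invented number chosen only to illustrate the mechanism.

```python
import random
import statistics

rng = random.Random(1)
TRUE_TEACHER_EFFECT = 5.0  # identical for both teachers by construction

def make_student():
    """Return (prior_score, final_score) for one hypothetical student."""
    background = rng.gauss(0, 1)  # unmeasured home/social factor (assumed)
    prior = 50 + 8 * background + rng.gauss(0, 5)
    # Background influences this year's growth too, not just the starting point.
    growth = TRUE_TEACHER_EFFECT + 2 * background + rng.gauss(0, 3)
    return prior, prior + growth

students = sorted((make_student() for _ in range(2000)), key=lambda s: s[0])
class_b, class_a = students[:1000], students[1000:]  # Teacher A gets the top half

def mean_final(classroom):
    return statistics.mean(final for _, final in classroom)

def mean_growth(classroom):
    return statistics.mean(final - prior for prior, final in classroom)

final_gap = mean_final(class_a) - mean_final(class_b)
growth_gap = mean_growth(class_a) - mean_growth(class_b)
print(f"gap in end-of-year scores: {final_gap:.1f}")
print(f"gap in score growth:       {growth_gap:.1f}")
```

Under these assumptions, Teacher A looks far better on end-of-year scores. Switching to growth shrinks the gap but does not eliminate it, because the unmeasured background factor still travels with the students. Neither number reflects anything the teachers did differently.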
Any legitimate model for measuring teacher effectiveness needs to take non-random assignment seriously. Researchers who study teacher effects have developed a formidable repertoire of tools to deal with this problem, but they all involve 1) controlling for characteristics of kids, like race and poverty, and 2) using multiple observations of the same teacher (e.g., the last three years of this teacher's performance) to establish a “permanent” teacher effect. To date, the former has been politically unpopular, while the latter would be unacceptable to those who want to measure a teacher’s performance one year at a time.
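Why do multiple years matter? A rough sketch, under the simplifying assumption that each year's estimate for a teacher equals a fixed true effect plus independent classroom-level noise. The noise level (`YEAR_NOISE_SD`) is a made-up number, and this is not how any particular value-added model is actually specified.

```python
import random
import statistics

rng = random.Random(2)
TRUE_EFFECT = 0.0     # a perfectly average teacher (assumed)
YEAR_NOISE_SD = 4.0   # hypothetical year-to-year noise from class composition

def yearly_estimate():
    """One year's noisy estimate of the same underlying teacher effect."""
    return TRUE_EFFECT + rng.gauss(0, YEAR_NOISE_SD)

# Simulate many identical teachers, rated on one year versus a three-year average.
one_year = [yearly_estimate() for _ in range(5000)]
three_year = [
    statistics.mean([yearly_estimate() for _ in range(3)]) for _ in range(5000)
]

spread_one = statistics.stdev(one_year)
spread_three = statistics.stdev(three_year)
print(f"spread of one-year ratings:   {spread_one:.2f}")
print(f"spread of three-year ratings: {spread_three:.2f}")
```

Every simulated teacher here is equally effective, yet single-year ratings scatter widely. Averaging three years narrows the scatter noticeably, which is the intuition behind estimating a “permanent” effect rather than rating one year at a time.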
Because of these political tensions, my fear is that we end up with a model of teacher effectiveness that a) isn’t valid and b) provides disincentives for teaching the lowest performing kids. More on both of these issues in the next two days.