Thursday, September 27, 2007

The oops factor (in measuring teacher effectiveness)

Part 4 of 5 of a week-long teacher effectiveness soiree. First 3 posts here.

Everything in the world is measured with error. This isn't the sexiest topic - so I'll try to keep it short - but it's important. I want to draw your attention to two different problems:

1) Randomness:
Repeated measurements of anything necessarily have some random variation. Step onto the scale daily for a month (assuming you're not preparing to hibernate for the winter, in which case your weight changes are not "random variation") and this will quickly become apparent. In terms of measuring teacher effects, this random variation is less of a problem if we are willing to assess teacher effectiveness across multiple years of data (e.g., my students' test scores in 2005, 2006, and 2007).

But I've heard a lot of plans kicked around that propose to reward some proportion of teachers based on a single year of data; certainly, from an incentives perspective, this makes sense - we want teachers to have an incentive to push hard this year. But when Mrs. Scott is awarded merit pay in 2005, misses it in 2006, and then is a merit-pay-worthy teacher again in 2007, the system doesn't have a lot of face validity for educators or the public.
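To make this concrete, here's a quick back-of-the-envelope simulation (the number of teachers, the size of the noise, and the top-quartile cutoff are all made up for illustration - this isn't any state's actual model). Each teacher has a fixed "true" effectiveness, but any single year's estimate comes with a healthy dose of noise; average three years and the picture settles down:

```python
import numpy as np

rng = np.random.default_rng(0)

n_teachers = 100
true_effect = rng.normal(0, 1, n_teachers)   # each teacher's "real" effectiveness
years = 3
noise_sd = 2.0                               # one-year noise, made up but plausibly large

# One-year estimate = true effect + noise from a single small class of students
yearly_estimate = true_effect[:, None] + rng.normal(0, noise_sd, (n_teachers, years))

def top_quartile(x):
    return x >= np.quantile(x, 0.75)

# How often does a teacher rated "merit-worthy" in year 1 keep that rating in year 2?
y1, y2 = top_quartile(yearly_estimate[:, 0]), top_quartile(yearly_estimate[:, 1])
print("repeat rate for single-year ratings:", (y1 & y2).sum() / y1.sum())

# Averaging three years of data cuts the noise and tracks true effectiveness better
avg3 = yearly_estimate.mean(axis=1)
print("corr(one-year estimate, truth): ", np.corrcoef(yearly_estimate[:, 0], true_effect)[0, 1])
print("corr(three-year average, truth):", np.corrcoef(avg3, true_effect)[0, 1])
```

Run it and you'll see a large share of the "top" teachers in year one drop out of the top group in year two, purely by chance, while the three-year average correlates much more strongly with the true effect.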

2) A student's baseline is not necessarily a good control:
To be sure, value-added models are a tremendous improvement on NCLB's proficiency system. (Note: value-added models with unrealistic proficiency targets aren't really an improvement - more on this next week.) Value-added approaches give us a more accurate portrayal of how a school or teacher is really doing. Here's a description of value-added, Tennessee-style, from the Center for Greater Philadelphia at Penn:

Because individual students rather than cohorts are traced over time, each student serves as his or her own "baseline" or control, which removes virtually all of the influence of the unvarying characteristics of the student, such as race or socioeconomic factors.

Sounds about right. Right? The clearest example of why a student's baseline test score does not "remove virtually all the influence of unvarying characteristics of the student" is the following: Mrs. Jones' class enrolls only wealthy children, while Mrs. Scott's class enrolls only kids who qualify for free lunch. Tests are given in the spring of each school year, so we're going to measure May to May. We know that poor students have lower rates of learning in the summer compared to their more advantaged peers. But if we only take into account their initial score, not their socioeconomic status, we're going to come up with a biased estimate of teacher effectiveness. In this case, teachers who teach low-income kids look like poorer teachers simply because of summer learning loss.
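If it helps, here's the same story with invented numbers (class sizes, the 10-point school-year gain, and the summer gains and losses are all made up for illustration). Both teachers add exactly the same amount during the school year; the only difference is what happens over the summer:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 500  # students per class

# Both teachers are equally effective: each adds 10 points during the school year.
school_year_gain = 10.0

# Mrs. Jones: wealthy students; Mrs. Scott: students who qualify for free lunch.
may_last_year_jones = rng.normal(50, 5, n)
may_last_year_scott = rng.normal(50, 5, n)

# Summer: advantaged kids gain a bit, low-income kids lose ground (made-up magnitudes).
sept_jones = may_last_year_jones + 2.0
sept_scott = may_last_year_scott - 3.0

# May-to-May "growth" attributes the whole year, summer included, to the teacher.
may_jones = sept_jones + school_year_gain
may_scott = sept_scott + school_year_gain

print("May-to-May growth, Mrs. Jones:", (may_jones - may_last_year_jones).mean())  # about 12
print("May-to-May growth, Mrs. Scott:", (may_scott - may_last_year_scott).mean())  # about 7
```

Identical teaching, but Mrs. Scott comes out several points "worse" - the summer gap gets charged to the teacher.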

You respond, "Okay, let's measure September to June - the initial score explains everything that's out of the teacher's control." What we're assuming, then, is that there are no effects of race, socioeconomic status, gender, or anything else beyond what the initial score explains. Studies using data from the Early Childhood Longitudinal Study, which followed kindergartners through the fifth grade, have found that black kids fall behind each year, even controlling for their initial score. (See, for example, Roland Fryer and Steve Levitt here.) It's not clear why this is the case, and certainly some component is explained by things that happen in school - the very problem we're trying to fix. But there are good reasons to believe that some part of falling behind is explained by things that happen outside of school. The key point is that the initial score doesn't capture all non-school factors that affect learning, and this remains a big problem with growth models as they're currently conceived.
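Here's that worry sketched as a regression, again with invented numbers - the "out-of-school support" term is just a stand-in for whatever non-school factors drive within-year growth. We control for the September score, exactly as proposed, and the omitted factor still lands in the teacher comparison:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 1000  # students per classroom

# Two equally effective teachers, measured September to June.
teach_gain = 10.0
sept_a = rng.normal(50, 5, n)
sept_b = rng.normal(50, 5, n)

# A non-school factor (out-of-school support, made up here) differs by classroom
# and adds to within-year growth on top of whatever the baseline score predicts.
june_a = sept_a + teach_gain + 3.0 + rng.normal(0, 2, n)
june_b = sept_b + teach_gain - 3.0 + rng.normal(0, 2, n)

# Regress June score on September score plus a Teacher B dummy
# (i.e., "control for the initial score").
june = np.concatenate([june_a, june_b])
sept = np.concatenate([sept_a, sept_b])
is_b = np.concatenate([np.zeros(n), np.ones(n)])
X = np.column_stack([np.ones(2 * n), sept, is_b])
coefs, *_ = np.linalg.lstsq(X, june, rcond=None)

print("apparent Teacher B effect vs. Teacher A:", coefs[2])  # about -6 points
```

Both teachers add exactly 10 points, yet Teacher B looks about 6 points worse even with the baseline score in the model, because the initial score can't absorb a factor that operates during the year itself.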

Check out the PowerPoint at the Center for Greater Philadelphia - link above. It's a great overview of value-added.
