Friday, November 9, 2007

Is this a wake-up call for the people who work there? You betcha.

Earlier this week, Mayor Michael Bloomberg flexed his muscles by threatening to close F schools as early as June. He quipped, "Is this a wake-up call for the people who work there? You betcha."

Through analyzing these data, I've concluded that the people in need of a wake-up call work not at F schools, but at the NYC Department of Education. Undoubtedly, data can and should be used for organizational learning and school improvement. But if we're going to rank and sort schools - an action that has serious consequences for the kids, educators, and parents affected - the Department of Ed's methods should be in line with the standards to which statisticians and quantitative social scientists hold themselves. Needless to say, NYC's report cards are not.

There are five reasons the report cards might kindly be called statistical malpractice:

1) Ignoring measurement error

Measurement error isn't sexy and won't attract the attention of journalists and commentators. But it may be the central downfall of the NYC report card system. For example, elementary school PS 179 (score=30.9) got an F, while PS 277 (score=31.06) got a D. Similarly, Queens Gateway to Health Sciences Secondary School (score=65.21) got a B, while IS 229 (score=65.22) got an A.

Once we acknowledge that these overall scores are measured with error, a school scoring a 65.21 is not statistically distinguishable from one scoring a 65.22 (a difference of .0007 standard deviations). And Mayor Bloomberg is threatening to close PS 179 this year and keep PS 277 open over a difference of .16? (See the grade brackets in the table below to see how close your school was to earning a higher or lower grade.)
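To make this concrete, here's a quick sketch of the significance test you'd want before declaring two schools different. The standard error here is made up for illustration - the Dept doesn't publish one, which is part of the problem:

```python
import math

def significantly_different(score_a, score_b, se=3.0):
    """Two-sample z-test for two school scores, each assumed to be
    measured with standard error `se` (a hypothetical value)."""
    z = abs(score_a - score_b) / math.sqrt(2 * se ** 2)
    return z > 1.96  # 5% significance level

print(significantly_different(65.21, 65.22))  # the B school vs. the A school -> False
print(significantly_different(30.90, 31.06))  # the F school vs. the D school -> False
```

Under any remotely plausible standard error, differences of .01 or .16 points are pure noise.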

2) Arbitrary grade distributions and cutoffs

Initially, the Dept announced a curve on which schools would be graded, but it has since curiously changed the distribution of grades and created different distributions for elementary, middle, high, and K-8 schools. Why should 25.26% of middle schools get As while only 21.72% of elementary schools do? By the same token, why should 5.12% of high schools get Ds while 9.69% of middle schools do? It's not that the Dept has set criterion-referenced score cutoffs for attaining these grades, as the table above demonstrates - so what's going on here? The Dept of Ed needs to release more information about why this distribution of grades was chosen, and why it differs by school level.

For example, there are more A/B middle schools than A/B elementary schools - does this mean that NYC's middle schools are "better"? The table below shows the percentage of schools receiving each grade at each school level, as well as the number of schools receiving that grade.

* For high schools, the denominator does not include schools with grades "under review."

3) Grade discrepancies in 6-12 schools

Schools serving grades 6-12 got two grades: one for grades 6-8 and one for grades 9-12. These are the same schools, with the same principal, under the same roof. Yet of the 33 schools with both middle and high school grades available, 22 received different grades. Sometimes these differences are substantial.

Consider the Academy of Environmental Science - its high school got a C, but its middle school got an F. At Hostos Lincoln Academy of Science, the middle school got a D, but the high school got a B. At the Bronx School for Law, Government, and Justice, the middle school got an F, but the high school got a C.

4) Poorly constructed comparison groups

As I've written here, the Dept of Ed flubbed the comparison groups by treating the percent African-American and percent Hispanic as interchangeable (i.e. a school that is 59% Hispanic and 1% African-American is a perfect match for one that is 59% African-American and 1% Hispanic). In addition, the Dept did not consider the proportion of Asian students when creating comparison groups; schools with higher proportions of Asian kids were more likely to get As and Bs, and there's no reason to believe that Asian kids in NYC have access to much higher quality schools. It's more likely that Asian kids grow academically at a faster rate because of things that happen outside of school.
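To illustrate the flaw - this is my reconstruction, not the Dept's actual formula, which it has not released - a matching rule that pools the two percentages treats the two schools in the example above as identical:

```python
def demographic_distance(school_a, school_b):
    """Distance between two schools using only the pooled
    Black + Hispanic percentage (the flaw described above)."""
    pooled_a = school_a["pct_black"] + school_a["pct_hispanic"]
    pooled_b = school_b["pct_black"] + school_b["pct_hispanic"]
    return abs(pooled_a - pooled_b)

mostly_hispanic = {"pct_black": 1, "pct_hispanic": 59}
mostly_black = {"pct_black": 59, "pct_hispanic": 1}
print(demographic_distance(mostly_hispanic, mostly_black))  # -> 0, a "perfect match"
```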

Until the Dept releases the comparison groups, it is difficult to know how bad these comparisons are - so stay tuned.

5) Problems with growth models: Interval scaling and ceiling effects

I'm all for growth models, but you can't treat 1 unit of growth at the bottom of the distribution (e.g. moving from a 13 to a 14 on a 100-point scale) the same as 1 unit of growth at the top (e.g. moving from an 89 to a 90). Put formally, the Department of Ed's model assumes that tests are "interval scaled," but they are not. Similarly, if a student is scoring near the top possible score on a test (the ceiling), there is very little room left to grow. One can address this problem by weighting growth at different parts of the distribution differently, but the Dept chose not to do so.
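One simple version of that weighting idea - my own illustration, not anything the Dept uses - is to credit growth relative to the headroom a student had left on the scale:

```python
def weighted_growth(pre, post, ceiling=100):
    """Express raw growth as a share of the room the student had
    left to grow, so gains near the ceiling count for more."""
    headroom = ceiling - pre
    if headroom == 0:
        return 0.0  # already at the top of the scale
    return (post - pre) / headroom

print(weighted_growth(13, 14))  # 1 point gained out of 87 possible
print(weighted_growth(89, 90))  # 1 point gained out of only 11 possible
```

The student moving from 89 to 90 closed about 9% of the remaining gap, while the student moving from 13 to 14 closed barely 1% - the two gains are not comparable.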

Hopefully, folks who care about public education in NYC will issue a wake-up call to the Dept of Ed and demand that these problems be fixed before vital decisions about schools are made on the basis of highly questionable methods.


Anonymous said...

6. New York State assessments were never designed to be used in this fashion. Current SED tests were designed around a static model that is more sensitive at the middle of the performance scale than at the tails. SED is currently investigating redesigning the tests to accommodate a growth model. There's a reason why no one else in New York State is using this model.

Anonymous said...

If I understand what the DOE did, it's even worse than you describe. Many standardized tests are scored with "scale scores," which are intended to be interval-level. The proficiency levels of NYS's assessments - not meeting learning standards (1), partially meeting learning standards (2), meeting learning standards (3), and meeting learning standards with distinction (4) - are ordinal categories, not interval-level ones. The DOE took the scale scores and mapped them onto the cut-off scores for the proficiency levels, calling the results Proficiency Ratings, which were then treated as interval-level values.

For example, a Proficiency Rating of 2.5 corresponds to the midpoint between the cut-off scores for Performance Level 2 and Performance Level 3; a Proficiency Rating of 3.0 corresponds to the cut-off score for Performance Level 3; and a Proficiency Rating of 3.5 corresponds to the midpoint of the cut-off scores for Performance Levels 3 and 4.

The Grade 3 cut scores on the 2006 ELA for Levels 2, 3 and 4 were 616, 650 and 730, respectively. This means that a scale score of 633 was assigned a Proficiency Rating of 2.5; a scale score of 650 was assigned a Proficiency Rating of 3.0; and a scale score of 690 was assigned a Proficiency Rating of 3.5. The difference in scale scores between Proficiency Ratings of 2.5 and 3.0 is 17 points, and between Proficiency Ratings of 3.0 and 3.5 is 40 points. For a true interval-level scale, these differences should represent the same number of scale score points. Unless I'm missing something, the DOE created Proficiency Ratings that are not interval-level from scale scores that are.
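The mapping the commenter describes can be sketched as piece-wise linear interpolation between the cut scores. The interpolation rule is inferred from the midpoint examples above, and the Level 1 range is left out for simplicity:

```python
CUTS = {2: 616, 3: 650, 4: 730}  # 2006 Grade 3 ELA cut scores for Levels 2-4

def proficiency_rating(scale_score):
    """Map a scale score to a Proficiency Rating by linear interpolation
    between adjacent level cut scores, as described in the comment above."""
    if scale_score >= CUTS[4]:
        return 4.0
    if scale_score < CUTS[2]:
        return 2.0  # simplification: the Level 1 range is not modeled here
    level = 3 if scale_score >= CUTS[3] else 2
    span = CUTS[level + 1] - CUTS[level]
    return level + (scale_score - CUTS[level]) / span

print(proficiency_rating(633))  # -> 2.5 (midpoint of 616 and 650)
print(proficiency_rating(650))  # -> 3.0
print(proficiency_rating(690))  # -> 3.5 (midpoint of 650 and 730)
```

Note that the step from a 2.5 rating to a 3.0 spans 17 scale-score points while the step from 3.0 to 3.5 spans 40 - equal-looking rating intervals correspond to very unequal score intervals, which is exactly the commenter's point.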