{"id":26,"date":"2026-02-26T14:57:24","date_gmt":"2026-02-26T14:57:24","guid":{"rendered":"https:\/\/cosinus.opalstacked.com\/?p=26"},"modified":"2026-02-26T14:57:24","modified_gmt":"2026-02-26T14:57:24","slug":"gates-measures-of-effective-teaching-study-more-value-added-madness","status":"publish","type":"post","link":"https:\/\/cosinus.opalstacked.com\/?p=26","title":{"rendered":"Gates\u2019 Measures of Effective Teaching Study: More Value-Added Madness"},"content":{"rendered":"\n<p>By&nbsp;Justin Baeder&nbsp;\u2014 December 21, 2010&nbsp;&nbsp;5 min read<\/p>\n\n\n\n<p>Justin Baeder is a public school principal in Seattle and a doctoral student studying principal performance and productivity at the University of Washington.<\/p>\n\n\n\n<p><a href=\"https:\/\/twitter.com\/eduleadership\" target=\"_blank\" rel=\"noreferrer noopener\">@eduleadership<\/a><\/p>\n\n\n\n<p>The Measures of Effective Teaching project, funded to the tune of $45 million by the Gates Foundation, has&nbsp;<a href=\"http:\/\/documents.latimes.com\/measures-of-effective-teaching\/\" target=\"_blank\" rel=\"noreferrer noopener\">released its first of four reports<\/a>. While the report is full of intelligent insights, it makes a number of astounding logical leaps to justify the use of value-added teacher ratings, and I can already see how the study will be used to make the case for sloppy value-added teacher evaluation systems.<\/p>\n\n\n\n<p>This is an ambitious study, and a very well-designed one at that. I can\u2019t imagine a better-funded or better-designed study of teacher effectiveness measures; top-notch researchers and the use of five different measures will doubtless make this one of the stronger studies of its type. However, one of its fundamental premises is deeply flawed, and this affects the conclusions drawn throughout the report:<\/p>\n\n\n\n<blockquote class=\"wp-block-quote is-layout-flow wp-block-quote-is-layout-flow\">\n<p>Second, any additional components of the evaluation (e.g., classroom observations, student feedback) should be demonstrably related to student achievement gains. p. 5<\/p>\n<\/blockquote>\n\n\n\n<p>This assumption would make sense if \u201cstudent achievement gains\u201d were as legitimate and stable a construct as the study asserts. 
Certainly it makes sense to evaluate teachers based on how well they improve student learning, but the assumption that we can isolate teacher effects from all of the other influences on student test performance has not been borne out by research to date on value-added measurement (VAM), including the research done in the MET study itself.<\/p>\n\n\n\n<p>In fact, every indication is that it will remain impossible to isolate teacher effects from such other influences as:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Nonrandom assignment of students to teachers, which can happen due to purposeful assignment of struggling students to stronger teachers, or due to unintentional scheduling oddities (for example, if Advanced Band is offered when I teach Algebra II, I\u2019m going to get non-band students in my Algebra II class, who may be different in significant ways from band students)<\/li>\n\n\n\n<li>Other teachers who serve the same students, e.g. special education, gifted &amp; talented, academic support, or other subject-area teachers whose effects may spill over into other classes<\/li>\n\n\n\n<li>Class composition, which every teacher will tell you varies from year to year despite attempts to \u201cbalance\u201d classes. While it\u2019s the teacher\u2019s job to create an environment conducive to learning, students do play an important role in determining the culture of a class, which can have a significant impact on learning<\/li>\n<\/ul>\n\n\n\n<p>Despite attempts to control for such extraneous variables in value-added measurement, there is strong empirical evidence that \u201cstudent achievement gains\u201d are not stable from year to year\u2014nor, as the MET report notes, even between different sections of the same subject taught at the same time by the same teacher. 
As&nbsp;<a href=\"https:\/\/www.epi.org\/publications\/entry\/bp278\" target=\"_blank\" rel=\"noreferrer noopener\">this EPI briefing paper notes<\/a>,<\/p>\n\n\n\n<blockquote class=\"wp-block-quote is-layout-flow wp-block-quote-is-layout-flow\">\n<p>there is broad agreement among statisticians, psychometricians, and economists that student test scores alone are not sufficiently reliable and valid indicators of teacher effectiveness to be used in high-stakes personnel decisions, even when the most sophisticated statistical applications such as value-added modeling are employed.<\/p>\n<\/blockquote>\n\n\n\n<p>Specifically,<\/p>\n\n\n\n<blockquote class=\"wp-block-quote is-layout-flow wp-block-quote-is-layout-flow\">\n<p>One study found that across five large urban districts, among teachers who were ranked in the top 20% of effectiveness in the first year, fewer than a third were in that top group the next year, and another third moved all the way down to the bottom 40%. Another found that teachers&#8217; effectiveness ratings in one year could only predict from 4% to 16% of the variation in such ratings in the following year. Thus, a teacher who appears to be very ineffective in one year might have a dramatically different result the following year. The same dramatic fluctuations were found for teachers ranked at the bottom in the first year of analysis. 
This runs counter to most people&#8217;s notions that the true quality of a teacher is likely to change very little over time and raises questions about whether what is measured is largely a &#8220;teacher effect&#8221; or the effect of a wide variety of other factors.<\/p>\n<\/blockquote>\n\n\n\n<p>These cited studies are not the final word, of course, so it would be reasonable for the MET study to try to improve value-added measurements in order to obtain more stable year-to-year ratings.<\/p>\n\n\n\n<p>However, that does not appear to be part of the design; instead, first-year value-added scores will be treated as the \u201ctruth\u201d against which all other measures of teacher effectiveness will be judged. But how good is this \u201ctruth\u201d?<\/p>\n\n\n\n<p>Let\u2019s see what the MET study found about the stability of its own value-added measures. Two types of data were available: comparisons between different sections of the same subject taught by the same teacher in the same year, and comparisons between the same class taught by the same teacher in two different years. The report states that<\/p>\n\n\n\n<blockquote class=\"wp-block-quote is-layout-flow wp-block-quote-is-layout-flow\">\n<p>When the between-section or between-year correlation in teacher value-added is below .5, the implication is that more than half of the observed variation is due to transitory effects rather than stable differences between teachers. That is the case for all of the measures of value-added we calculated. We observed the highest correlations in teacher value-added on the state math tests, with a between-section correlation of .38 and a between-year correlation of .40. The correlation in value-added on the open-ended version of the Stanford 9 was comparable, .35. 
However, the correlation in teacher value-added on the state ELA test was considerably lower&#8211;.18 between sections and .20 between years.<\/p>\n<\/blockquote>\n\n\n\n<p>In other words, the value-added score for one math class only has a 40% correlation with the same teacher\u2019s score for another class, whether taught at the same time or in a different year. For ELA, the correlation is only about 20%.<\/p>\n\n\n\n<p>You\u2019d think that the researchers would at this point give up on value-added and start looking for more reliable measures. Instead, we\u2019re treated to a full paragraph of logical gymnastics and implication-avoidance (emphasis mine):<\/p>\n\n\n\n<blockquote class=\"wp-block-quote is-layout-flow wp-block-quote-is-layout-flow\">\n<p>Does this mean that there are no persistent differences between teachers? Not at all. The correlations merely report the proportion of the variance that is due to persistent differences between teachers. Given that the total (unadjusted) variance in teacher value-added is quite large, the implied variance associated with persistent differences between teachers also turns out to be large, despite the low between-year and between-section correlations. For instance, the implied variance in the stable component of teacher value-added on the state math test is .020 using the between-section data and .016 using the between-year data. Recall that the value-added measures are all reported in terms of standard deviations in student achievement at the student level. Assuming that the distribution of teacher effects is &#8220;bell-shaped&#8221; (that is, a normal distribution), this means that if one could accurately identify the subset of teachers with value-added in the top quartile, they would raise achievement for the average student in their class by .18 standard deviations relative to those assigned to the median teacher. Similarly, the worst quarter of teachers would lower achievement by .18 standard deviations. 
So the difference in average student achievement between having a top or bottom quartile teacher would be .36 standard deviations. That is far more than one-third of the black-white achievement gap in 4th and 8th grade as measured by the National Assessment of Educational Progress&#8211;closed in a single year!<\/p>\n<\/blockquote>\n\n\n\n<p>So we\u2019ve gone from \u201cthese results are highly unstable\u201d to \u201cwe can eliminate the achievement gap in one year!\u201d in the space of a single paragraph. Please leave a comment if I\u2019m misinterpreting this part of the study, but it seems to me that if your measure is only a 20% predictor of&nbsp;<em>itself<\/em>, you don\u2019t have a meaningful measure at all. It\u2019s certainly true that some teachers are much, much better than others, but forgive me if I\u2019m hesitant to trust a measure that is potentially wrong 80% of the time.<\/p>\n\n\n\n<p>Earlier, the authors describe their plan for handling this risk: simply give VAM less weight by<\/p>\n\n\n\n<blockquote class=\"wp-block-quote is-layout-flow wp-block-quote-is-layout-flow\">\n<p>scaling down (or up) the value-added measures themselves. But that&#8217;s largely a matter of determining how much weight should be attached to value-added as one of multiple measures of teacher effectiveness. p. 4<\/p>\n<\/blockquote>\n\n\n\n<p>Let me get this straight: If I choose to evaluate you on the basis of a coin toss, which is totally random, I know I\u2019ll be wrong 50% of the time. Therefore, the coin toss is a valid evaluation tool provided that it only counts for 50% of your overall evaluation.<\/p>\n\n\n\n<p>Please tell me I\u2019m reading this wrong\u2014my background in statistics barely qualifies as graduate-level, and I\u2019m certainly not a VAM expert. 
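To be fair, the report's quartile arithmetic itself is sound under its own normality assumption. A quick sketch (Python, standard library only) reproduces the .18 and .36 figures from the .020 stable-variance number quoted above:

```python
from math import sqrt
from statistics import NormalDist  # Python 3.8+ standard library

# Checking the MET report's quartile arithmetic under its stated assumption
# of normally distributed teacher effects. The .020 stable-variance figure
# (state math test, between-section data) is quoted from the report; units
# are student-level standard deviations of achievement.
stable_var = 0.020
teacher_sd = sqrt(stable_var)          # SD of the stable teacher component

z = NormalDist()
cut = z.inv_cdf(0.75)                  # 75th-percentile cutoff of a standard normal
top_quartile_mean = z.pdf(cut) / 0.25  # mean of the top quartile, ~1.27 SDs

effect = top_quartile_mean * teacher_sd  # top-quartile teacher vs. median teacher
gap = 2 * effect                         # top-quartile vs. bottom-quartile teacher
print(round(effect, 2), round(gap, 2))   # -> 0.18 0.36
```

So the algebra checks out; the question is whether a score whose year-to-year correlation with itself is .40 or lower can accurately identify that top quartile in the first place.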
But I think I\u2019m interpreting the authors\u2019 argument correctly.<\/p>\n\n\n\n<p>Interestingly, student ratings are much more stable between classes and from year to year\u2014their correlation is on the order of 67%. If anything, this first MET report provides good evidence that simply asking students about their teachers is a much better idea than going through both statistical and logical gymnastics to obtain a VAM score.<\/p>\n\n\n\n<p>In its closing section, the report argues that VAM is useful even though its predictive power is incredibly weak. Now, if you wanted to report the utter unreliability of VAM as good news, how would you do it? The authors take this tack:<\/p>\n\n\n\n<blockquote class=\"wp-block-quote is-layout-flow wp-block-quote-is-layout-flow\">\n<p>Two types of evidence\u2014student achievement gains and student feedback\u2014do seem to point in the same direction, with teachers performing better on one measure tending to perform better on the other measures. p. 31<\/p>\n<\/blockquote>\n\n\n\n<p>In other words, the correlations between VAM and other forms of assessment are ridiculously weak, but hey, at least they\u2019re not negative. The report goes on to say that<\/p>\n\n\n\n<blockquote class=\"wp-block-quote is-layout-flow wp-block-quote-is-layout-flow\">\n<p>many people simply forget all the bad decisions being made now, when there is essentially no evidence base available by which to judge performance. Every day, effective teachers are being treated as if they were the same as ineffective teachers and ineffective teachers are automatically granted tenure after two or three years on the job. Given that we know there are large differences in teacher effects on children, we are effectively mis-categorizing everyone when we treat everyone the same. Value-added data adds information. Better information will lead to fewer mistakes, not more. 
Better information will also allow schools to make decisions which will lead to higher student achievement.<\/p>\n<\/blockquote>\n\n\n\n<p>Wow. Notice the careful wording: \u201cValue-added data adds information. Better information&#8230;\u201d Does it say that value-added data&nbsp;<em>is<\/em>&nbsp;better information? No, because clearly it\u2019s not. If and when we\u2019re able to obtain better information (e.g. from student ratings or more rigorous observation methods), we should certainly incorporate it into teacher evaluations. But for now, VAM doesn\u2019t give us anything useful\u2014just a contextless number accompanied by a false sense of certainty.<\/p>\n\n\n\n<p>In the end, what does this report tell us? Let\u2019s look at the problem the study is intended to solve: More than 99% of teachers are rated \u201csatisfactory\u201d every year, which should seem incredible to even the most ardent supporter of teachers. There is no doubt that principals, myself included, need to do a far better job of identifying underperforming teachers and helping them to improve (or, if that doesn\u2019t work, to exit the profession). I look forward to reading the forthcoming MET reports that tell us more about effective observations; Gates has convened some of the best minds in our country to tackle the issue. 
I can only hope that the effectiveness of rigorous classroom observations is not judged against the shoddy \u201ctruth\u201d of value-added measurements.<\/p>\n\n\n\n<p>Related Tags:<\/p>\n\n\n\n<p><a href=\"https:\/\/www.edweek.org\/teacher-evaluations\">Teacher Evaluations<\/a><\/p>\n\n\n\n<p>The opinions expressed in On Performance are strictly those of the author(s) and do not reflect the opinions or endorsement of Editorial Projects in Education, or any of its publications.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>By&nbsp;Justin Baeder&nbsp;\u2014 December 21, 2010&nbsp;&nbsp;5 min read Justin Baeder Justin Baeder is a public school principal in Seattle and a doctoral student studying principal performance and productivity at the University of Washington. @eduleadership The Measures of Effective Teaching project, funded to the tune of $45 million by the Gates Foundation, has&nbsp;released its first of four [&hellip;]<\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"closed","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[1],"tags":[],"class_list":["post-26","post","type-post","status-publish","format-standard","hentry","category-uncategorized"],"_links":{"self":[{"href":"https:\/\/cosinus.opalstacked.com\/index.php?rest_route=\/wp\/v2\/posts\/26","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/cosinus.opalstacked.com\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/cosinus.opalstacked.com\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/cosinus.opalstacked.com\/index.php?rest_route=\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/cosinus.opalstacked.com\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=26"}],"version-history":[{"count":1,"href":"https:\/\/cosinus.opalstacked.com\/index.php?rest_route=\/wp\/v2\/posts\/26\/revisions"}],"predecessor-version":[{"id":2
7,"href":"https:\/\/cosinus.opalstacked.com\/index.php?rest_route=\/wp\/v2\/posts\/26\/revisions\/27"}],"wp:attachment":[{"href":"https:\/\/cosinus.opalstacked.com\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=26"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/cosinus.opalstacked.com\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=26"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/cosinus.opalstacked.com\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=26"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}