How Did Teacher Evaluation Become a Thing?

Part One of a post-mortem on the movement to distinguish good teaching from bad.

Apr 09, 2024

Are teachers interchangeable parts?

In 2006, I was the newly minted Vice President for Policy and Research at TNTP and I was 29 years old and obviously I knew everything, so that was nice.

TNTP had written a couple of reports about why urban school districts had so much trouble staffing high poverty schools with good teachers.1 In addition to the standard issues like low pay and high stress, we found that schools were often burdened with ineffective teachers who were transferred from other schools.

If these teachers were bad, why weren’t they fired?

There were two competing explanations. The first – favored by teachers unions – put the blame on administrators. They were too lazy or inept to provide teachers the feedback and support that were prerequisite for seeking dismissal, and their documentation was so sloppy that it couldn’t survive the scrutiny of an honest arbitrator. The second – favored by school officials – was that the burden of proof to dismiss a tenured teacher was impossible to meet. A principal could follow every step perfectly for years, spending untold hours, only to have a hearing officer rule that a teacher’s shortcomings were not worthy of dismissal. Back to square one.

This argument was important. Neither group denied that schools had some teachers who were not good. No one could agree what to do about it. Districts were under pressure, during the height of NCLB accountability, to improve. These teachers made it harder.

My colleagues and I decided to launch a multi-year research project to get to the bottom of it. Our grand ambition was to study 12 districts across four states. We invited everyone to the table – district leaders, unions, policy organizations – and formed an advisory panel totaling 80 people. There were large districts, like Chicago and Denver, and smaller ones, like Springdale (AR) and Rockford (IL). We hoped that by getting all parties in the same room, looking at the same data, we could move past stalemate.

In June 2009, we published our findings as The Widget Effect.2 Our conclusion? It didn’t matter whether the process for dismissing ineffective teachers was easy or impossible because it hardly ever started. Virtually zero teachers were assigned low ratings on official evaluations. Instead, almost all of them received the highest option available. These evaluations - which occurred less than annually for some teachers - were superficial box-checking observations that principals found burdensome and teachers found useless.

Our schools were unable to distinguish good teaching from bad. That’s why we used the term “widget.” Despite all the rhetoric about teachers being earthly saints, on a functional level they were treated as interchangeable. With that approach, how could systems possibly help teachers get better? Or help students learn more?

Time for a post-mortem

In this multi-part edition of The Education Daly, we’ll unpack how The Widget Effect, among many other factors, made teacher evaluation one of the hottest topics in education.3 I’ll be using the first person more often than usual because I was an active participant in the whole shebang. When I refer to those who were pushing for evaluation reforms, that’s me.

I have a lot of regrets.

My goal is to excavate the history of the movement with the benefit of a little hindsight. This post-mortem comes on the heels of my attempt, a few months ago, to explain our brief but intense obsession with Finland’s schools. At the time, I promised to give the same treatment to some topics nearer to my own heart.

Like Finland-mania, the timeline here is relatively short. After bursting onto the scene around 2009, evaluation reform more or less ran its course by 2015, when Congress passed a new law to replace NCLB that did not require districts to do anything with teacher evaluation. No one was surprised. The wave had rolled back.

Let’s begin by stating the obvious: The movement to reform teacher evaluation did not achieve its goals. Teachers continue to receive superlative evaluation ratings. Those ratings are based almost entirely on perfunctory classroom observations conducted by administrators, just as was the case in 2006. Teachers are almost never fired for poor evaluations and they rarely receive higher pay or recognition for good ones. Nearly every teacher who is considered for tenure is granted it.

To many, the teacher evaluation crusade is a cautionary tale – maybe the ultimate one – of hubristic confidence in a fad.

It’s all of that. And also, more. It’s a window into how our schools truly operate and a great way to understand why we’ve stopped making academic progress in the last decade – and why our recovery from COVID-era learning loss is going so poorly.

Today, we’ll set the stage by understanding the elements that converged around 2009 to set the whole thing aflame. In future posts, I’ll cover what happened, how it went wrong, and what we learned.

Buckle up, friends. There’s a lot to digest.

Where did evaluation reform start?

It became clear that traditional measures of teacher value were inaccurate. There were two ways to move up (and earn more) in teaching: 1) Accumulate more years of seniority and 2) Obtain a master’s degree. By the early 2000s, research had shown that assuming a given teacher would be better at their job due to either of these traits was comically wrong. On average, teachers improve for the first few years of their careers, but they level off relatively quickly. Some teachers with ten years in the classroom are amazing… and some of them are awful. Same is true of master’s degrees. They are easy to earn and most teachers pursue them – because they come with a pay bump. But teachers who have advanced degrees aren’t any better instructionally than peers without them – and might be worse.
Some teachers are much more effective than others. We always knew this. Every third grader knows this. But as an empirical research fact, it became inescapable as annual testing of students became more common. Researchers cranked out a wave of studies showing that even when controlling for baseline student traits and achievement, some teachers consistently moved their students further each year. They added more learning value. But not only were they not being recognized for it, they weren’t told that their outcomes were exceptional. Nobody noticed.4
Charter schools put heat on districts. In the early 2000s, networks like KIPP, Aspire, North Star, YES Prep and Achievement First posted exceptional results with teachers who were non-unionized 20-somethings willing to work long hours for school leaders who could dismiss them at-will. These schools had flexibility to innovate and a passionate, mission-driven spirit. Folks started to wonder if districts needed to borrow a page from the charter playbook and give their principals more autonomy to select and re-shape their teams.

Crossover to the mainstream

Big city superintendents began challenging employee unions. Many of them came from non-traditional backgrounds. Joel Klein had been prosecuting Microsoft for antitrust violations prior to being named Chancellor in New York City. Michael Bennet – now a US Senator from Colorado – was a business lawyer and chief of staff to Denver’s mayor before running its school system. Michelle Rhee, the most famous of them all, was leading the relatively small non-profit that I worked for when, at 37, she took over DC Public Schools. They viewed large districts as broadly dysfunctional and shared a willingness to address third-rail issues that were generally career-killers for traditional superintendents. Firing low performing teachers was on that list.
Philanthropy wrote big checks. A week after Obama’s historic election victory in November 2008, the Bill and Melinda Gates Foundation convened a who’s who of education players in Seattle to announce a shift in strategy. Bill Gates was disappointed by the lackluster results from his efforts to convert large schools into smaller, more personal configurations. Going forward, he would turn his attention where he believed the research was pointing: higher academic standards for students and more effective teachers. Over the next few years, the foundation poured hundreds of millions into grants for districts that promised to reform their evaluation systems as well as research to test new methods of measuring teaching.
Major press outlets took notice. In August 2009, The New Yorker ran a long feature by Steven Brill about so-called “rubber rooms” where New York City warehoused approximately 600 teachers awaiting disposition of their cases for misconduct or incompetence. They had been collecting full salaries and benefits for an average of three years. Readers were aghast at some of the stories. Rubber room became an all-purpose term for teachers who should have been dismissed but were not.
The federal government dangled a carrot. In 2008, the real estate market imploded and the economy teetered on the brink of insolvency. Congress quickly passed a rescue package along bipartisan lines. Tucked inside was significant funding to incent innovation in schools. While these funds were approved at the end of George W. Bush's second term, they were administered by President Obama. As states competed for their share of the loot, they committed to overhaul their outdated drive-by evaluations with multi-measure systems that included consideration of student achievement. Later, waivers from NCLB accountability also required states to adopt new teacher evals. By 2016, 44 states had passed legislation.
Unions signaled openness. In January 2010, AFT President Randi Weingarten won headlines for a speech announcing her commitment to modernizing evaluations, which included the use of student test scores as one measure.5 The AFT retained the services of renowned mediator Kenneth Feinberg, who had administered compensation for 9/11 victims, to oversee the development of a new model.

There’s more, but you get the idea. All roads converged. Acting on the differences in performance among teachers held the potential to improve student learning substantially and reduce the race- and class-based achievement gaps that had bedeviled US schools for decades. The consensus was pretty strong.

One could envision a future where, over the course of a decade or so, this momentum led to a new generation of evaluations and the elevation of teaching as a higher-paying, more prestigious profession that held its members to shared standards of practice.

But that’s not what happened. So, where did it go wrong? Read part two.

I wish I could claim some credit for these early reports but my work at TNTP between 2001 and 2006 focused exclusively on teacher pipeline programs. Jess Levin led the research and writing on two papers that challenged conventional wisdom about school staffing. The first, Missed Opportunities, in 2003, explained how late, disorganized hiring processes caused schools to miss out on the best candidates. Unintended Consequences, in 2005, showed how collectively bargained provisions that were meant to create order and fairness in teacher transfers had instead fueled inequity. They hold up surprisingly well after two decades.

The Widget Effect became a juggernaut, frequently cited in the media and academic research. It did not start out that way. We held a release event in DC - I think it was at the National Press Club - with some district and union representatives who had been part of the research process. Barely anyone attended. We tried mightily to pitch national reporters on the news-worthiness of the report without much luck. I still remember feeling in those first days after the unveiling that we’d completely failed and nobody was going to pay any attention.

Shout out to Matthew Yglesias for his series on education reform, which included a particularly good entry on teacher evaluation. Give it a read. That series was one of my inspirations for this post-mortem.

Later, researchers showed that teachers also contribute differently to non-test student outcomes like absenteeism, suspensions, and course grades, which is all the more reason for developing better ways of assessing performance. But so far as I can tell, nothing’s happening on this front.

I was among those who joined in the chorus of praise for Weingarten in 2010. https://www.nydailynews.com/2010/07/05/weingarten-delivers-the-goods-a-frequent-critic-praises-the-union-head-for-backing-bold-reforms/

Emily Gordon

Apr 9, 2024

Tim, this is a great retrospective. Really looking forward to Part 2!

The first footnote made me smile. As recently as 2022, I was onboarding new staff at TNTP and telling them to read Missed Opportunities and Unintended Consequences if they were going to do talent work. They’ve held up well, which is to say, change moves very slowly in the staffing world. Just subscribed and looking forward to reading more!

Jane Frantz

This landed in my email box this morning. As a long time teacher (45 years, recently retired) in a highly respected school system, I found your assessment to be a combination of insightful comments and others that missed the mark. IMO, a huge part of the problem is the lack of professional development for principals and other administrators as to how to evaluate good teaching. I never received insightful information about my teaching from a formal evaluation, so I figured out a way to handle the process so it took up less of my time: I put together a run of the mill lesson that I knew would go well (the kids were paying attention and getting their work done), but lacked in creativity, spark, etc. I was a highly respected teacher in my school and district and confident in my skills, and frankly, the evaluation had no meaning for me. It was just another administrative function to get through - one of many. I gave this same advice to my younger colleagues because the process often left them completely demoralized.

Some of the people you quote as being reformers aren't well respected within the profession because they jump in with solutions and simply don't know enough. Bill Gates had his heart in the right place, but for probably the first time in his life, he was in over his head. Michelle Rhee was never considered a serious educator within the profession.

One thing I note from this article is a lack of input from educators. This is a problem in the profession. Input is gathered from everyone but the people who do the actual work.

I look forward to reading your future articles.
