Opinion, Berkeley Blogs

Do student evaluations measure teaching effectiveness?

By Philip Stark

Since 1975, course evaluations at Berkeley have included the following question: Considering both the limitations and possibilities of the subject matter and course, how would you rate the overall teaching effectiveness of this instructor?

1 (not at all effective), 2, 3, 4 (moderately effective), 5, 6, 7 (extremely effective)

Among faculty, student evaluations of teaching are a source of pride and satisfaction—and frustration and anxiety. High-stakes decisions including merit reviews, tenure, and promotions are based in part on these evaluations.  Yet, it is widely believed that evaluations reflect little more than a popularity contest; that it’s easy to “game” the ratings; that good teachers get bad ratings; that bad teachers get good ratings; and that fear of bad ratings stifles pedagogical innovation and encourages faculty to water down course content.

What do we really know about student evaluations of teaching effectiveness?

Quantitative student ratings of teaching are the most common method to evaluate teaching.[1] De facto, they define “effective teaching” for many purposes, including faculty promotions. They are popular partly because the measurement is easy: Students fill out forms. It takes about 10 minutes of class time and even less faculty time. The major labor for the institution is to transcribe the data; online evaluations automate that step. Averages of student ratings have an air of objectivity by virtue of being numerical.  And comparing the average rating of any instructor to the average for her department as a whole is simple.

While we are not privy to the deliberations of the Academic Senate Budget Committee (BIR), the idea of comparing an instructor’s average score to averages for other instructors or other courses pervades our institution’s culture. For instance, a sample letter offered by the College of Letters and Science for department chairs to request a “targeted decoupling” of faculty salary includes:

Smith has a strong record of classroom teaching and mentorship.  Recent student evaluations are good, and Smith’s average scores for teaching effectiveness and course worth are (around) ____________ on a seven-point scale, which compares well with the relevant departmental averages.  Narrative responses by students, such as “________________,” are also consistent with Smith’s being a strong classroom instructor.

This places heavy weight on student teaching evaluation scores and encourages comparing an instructor’s average score to the average for her department.

What does such a comparison show?

In this three-part series, we report statistical considerations and experimental evidence that lead us to conclude that comparing average scores on “omnibus” questions, such as the mandatory question quoted above, should be avoided entirely. Moreover, we argue that student evaluations of teaching should be only a piece of a much richer assessment of teaching, rather than the focal point. We will ask:

●      How good are the statistics? Teaching evaluation data are typically spotty and the techniques used to summarize evaluations and compare instructors or courses are generally statistically inappropriate.

●      What do the data measure? While students are in a good position to evaluate some aspects of teaching, there is compelling empirical evidence that student evaluations are only tenuously connected to overall teaching effectiveness.[2] Responses to general questions, such as overall effectiveness, are particularly influenced by factors unrelated to learning outcomes, such as the gender, ethnicity, and attractiveness of the instructor.

●      What’s better? Other ways of evaluating teaching can be combined with student teaching evaluations to produce a more reliable, meaningful, and useful composite; such methods were used in a pilot in the Department of Statistics in spring 2013 and are now department policy.

At the risk of losing our audience right away, we start with a quick nontechnical look at statistical issues that arise in collecting, summarizing, and comparing student evaluations. Please read on!

Administering student teaching evaluations

Until recently, paper teaching evaluations were distributed to Berkeley students in class. The instructor left the room while students filled out the forms. A designated student collected the completed forms and delivered them to the department office. Department staff calculated average effectiveness scores, among other things. Ad hoc committees and department chairs also might excerpt written comments from the forms.

Online teaching evaluations may become (at departments’ option) the primary survey method at Berkeley this year. This raises additional concerns. For instance, the availability of data in electronic form invites comparisons across courses, instructors, and departments; such comparisons are often inappropriate, as we discuss below. There also might be systematic differences between paper-based and online evaluations, which could make it difficult to compare ratings across the “discontinuity.”[3]

Who responds?

Some students are absent when in-class evaluations are administered.  Students who are present may not fill out the survey; similarly, some students will not fill out online evaluations.[4] The response rate will be less than 100%. The further the response rate is from 100%, the less we can infer about the class as a whole.

For example, suppose that only half the class responds, and that all those “responders” rate the instructor’s effectiveness as 7.  The mean rating for the entire class might be 7, if the nonresponders would also have rated it 7. Or it might be as low as 4, if the nonresponders would have rated the effectiveness 1. While this example is unrealistically extreme, in general there is no reason to think that the nonresponders are like the responders. Indeed, there is good reason to think they are not like the responders: They were not present or they did not fill out the survey. These might be precisely the students who find the instructor unhelpful.
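The arithmetic in this example generalizes to any response rate. The sketch below is illustrative only: the function name `class_mean_bounds` and its parameters are our own, hypothetical choices, and the only assumption about nonresponders is that each could have given any rating from 1 to 7.

```python
def class_mean_bounds(responder_mean, response_rate, lowest=1, highest=7):
    """Return (min, max) of the mean rating for the whole class.

    Assumes each nonresponder could have given any rating from
    `lowest` to `highest`; nothing else is assumed about them.
    """
    if not 0 < response_rate <= 1:
        raise ValueError("response_rate must be in (0, 1]")
    missing = 1 - response_rate  # fraction of the class that did not respond
    low = response_rate * responder_mean + missing * lowest
    high = response_rate * responder_mean + missing * highest
    return low, high


# The example from the text: half the class responds, all with a 7.
print(class_mean_bounds(7, 0.5))  # (4.0, 7.0)

# The range of possible class means narrows as the response rate rises.
for rate in (0.25, 0.5, 0.75, 1.0):
    low, high = class_mean_bounds(7, rate)
    print(f"response rate {rate:.0%}: class mean between {low:.2f} and {high:.2f}")
```

These bounds are worst cases. They show what the data alone can guarantee, not what is likely.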

There may be biases in the other direction, too.  It is human nature to complain more loudly than one praises: People tend to be motivated to action more by anger than by satisfaction. Have you ever seen a public demonstration where people screamed “we’re content!”?[5]

The lower the response rate, the less representative of the overall class the responders might be.  Treating the responders as if they are representative of the entire class is a statistical blunder.

The 1987 Policy for the Evaluation of Teaching (for advancement and promotion) requires faculty to provide an explanation if the response rate is below ⅔. This seems to presume that it is the instructor’s fault if the response rate is low, and that a low response rate is in itself a sign of bad teaching.[6]  The truth is that if the response rate is low, the data should not be considered representative of the class as a whole.  An explanation of the low response rate—which generally is not in the instructor’s control—solves nothing.

Averages of small samples are more susceptible to “the luck of the draw” than averages of larger samples.  This can make teaching evaluations in small classes more extreme than evaluations in larger classes, even if the response rate is 100%.  Moreover, in small classes students might imagine their anonymity to be more tenuous, which might reduce their willingness to respond truthfully or to respond at all.
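A small simulation can make “the luck of the draw” concrete. This is a sketch under stated assumptions, not an analysis of real evaluations: the rating distribution `RATING_PROBS` is made up, every simulated student rates independently from that same distribution, and everyone responds. Any difference between class averages is therefore due to chance alone.

```python
import random
import statistics


def simulated_class_means(class_size, n_classes, rating_probs, rng):
    """Simulate the average rating in many classes of the same size.

    Every student rates independently from the same distribution, so
    differences between class averages are pure luck of the draw.
    """
    ratings = list(rating_probs)
    weights = [rating_probs[r] for r in ratings]
    return [
        statistics.mean(rng.choices(ratings, weights=weights, k=class_size))
        for _ in range(n_classes)
    ]


# Hypothetical rating distribution (made up for illustration only).
RATING_PROBS = {1: 0.05, 2: 0.05, 3: 0.10, 4: 0.20, 5: 0.25, 6: 0.20, 7: 0.15}

rng = random.Random(0)  # fixed seed so the run is repeatable
spread = {}
for size in (10, 50, 250):
    means = simulated_class_means(size, 2000, RATING_PROBS, rng)
    spread[size] = statistics.pstdev(means)
    print(f"class size {size:>3}: class averages range from "
          f"{min(means):.2f} to {max(means):.2f} (SD {spread[size]:.2f})")
```

Under these assumptions, the averages for the smallest classes spread much more widely than those for the largest, even though the simulated instructor is identical in every class.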


As noted above, Berkeley’s merit review process invites reporting and comparing averages of scores, for instance, comparing an instructor’s average scores to the departmental average.  Averaging student evaluation scores makes little sense, as a matter of statistics.  It presumes that the difference between 3 and 4 means the same thing as the difference between 6 and 7.  It presumes that the difference between 3 and 4 means the same thing to different students. It presumes that 5 means the same things to different students in different courses. It presumes that a 4 “balances” a 6 to make two 5s. For teaching evaluations, there’s no reason any of those things should be true.[7]

Effectiveness ratings are what statisticians call an “ordinal categorical” variable: The ratings fall in categories with a natural order (7 is better than 6 is better than … is better than 1), but the numbers 1, 2, …, 7 are really labels of categories, not quantities of anything. We could replace the numbers with descriptive words and no information would be lost: The ratings might as well be “not at all effective,” “slightly effective,” “somewhat effective,” “moderately effective,” “rather effective,” “very effective,” and “extremely effective.”
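One way to see that the numbers are only labels: swap them for a different set of numbers that preserves the order of the categories, and check whether a comparison of averages survives. In the hedged sketch below, the two sets of ratings and the alternative labels in `relabel` are hypothetical, chosen only to illustrate the point. The median is shown purely as a contrast, not as a recommended summary.

```python
import statistics

# Hypothetical ratings for two instructors (made up for illustration).
ratings_a = [5, 5, 5]
ratings_b = [3, 6, 7]

# An alternative set of numeric labels for the seven categories.
# The spacing is arbitrary; only the order of the categories is kept.
relabel = {1: 1, 2: 2, 3: 3, 4: 10, 5: 20, 6: 21, 7: 22}


def summarize(ratings):
    """Return (mean, median) of a list of numeric labels."""
    return statistics.mean(ratings), statistics.median(ratings)


mean_a, median_a = summarize(ratings_a)
mean_b, median_b = summarize(ratings_b)
new_mean_a, new_median_a = summarize([relabel[r] for r in ratings_a])
new_mean_b, new_median_b = summarize([relabel[r] for r in ratings_b])

# Which instructor "wins" on the mean depends on the arbitrary labels.
print("labels 1-7: A has the higher mean?", mean_a > mean_b)          # False
print("relabeled:  A has the higher mean?", new_mean_a > new_mean_b)  # True

# A summary that uses only the order of the categories does not flip.
print("labels 1-7: A has the higher median?", median_a > median_b)          # False
print("relabeled:  A has the higher median?", new_median_a > new_median_b)  # False
```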

Does it make sense to take the average of “slightly effective” and “very effective” ratings given by two students? If so, is the result the same as two “moderately effective” scores?  Relying on average evaluation scores does just that: It equates the effectiveness of an instructor who receives two ratings of 4 and the effectiveness of an instructor who receives a 2 and a 6, since both instructors have an average rating of 4. Are they really equivalent?

They are not, as this joke shows: Three statisticians go hunting. They spot a deer. The first statistician shoots; the shot passes a yard to the left of the deer.  The second shoots; the shot passes a yard to the right of the deer.  The third one yells, “we got it!”

Even though the average location of the two misses is a hit, the deer is quite unscathed: Two things can be equal on average, yet otherwise utterly dissimilar. Averages alone are not adequate summaries of evaluation scores.
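The same point in a few lines of code: two instructors with identical averages and entirely different ratings. The helper `rating_distribution` and the two lists of ratings are hypothetical, introduced here only to illustrate the example in the text.

```python
from collections import Counter
from statistics import mean


def rating_distribution(ratings, scale=range(1, 8)):
    """Percentage of ratings in each category of the 1-7 scale."""
    counts = Counter(ratings)
    return {r: 100 * counts[r] / len(ratings) for r in scale}


instructor_x = [4, 4]  # two "moderately effective" ratings
instructor_y = [2, 6]  # one "slightly effective", one "very effective"

print(mean(instructor_x) == mean(instructor_y))  # True: both averages are 4

# Print each distribution as percentages with a crude text bar chart.
for name, ratings in (("X", instructor_x), ("Y", instructor_y)):
    print(f"Instructor {name}: mean {mean(ratings)}")
    for rating, pct in rating_distribution(ratings).items():
        print(f"  {rating}: {'#' * round(pct / 10):<10} {pct:.0f}%")
```

The averages agree, but the percentages in each category do not. That difference is exactly what a report of averages throws away.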

Scatter matters

Comparing an individual instructor’s (average) performance with an overall average for a course or a department is less informative than campus guidelines appear to assume. For instance, suppose that the departmental average for a particular course is 4.5, and the average for a particular instructor in a particular semester is 4.2.  The instructor is “below average.” How bad is that? Is the difference meaningful?

There is no way to tell from the averages alone, even if response rates were perfect. Comparing averages in this way ignores instructor-to-instructor and semester-to-semester variability.  If all other instructors get an average of exactly 4.5 when they teach the course, 4.2 would be atypically low.  On the other hand, if other instructors get 6s half the time and 3s the other half of the time, 4.2 is almost exactly in the middle of the distribution. The variability of scores across instructors and semesters matters, just as the variability of scores within a class matters. Even if evaluation scores could be taken at face value, the mere fact that one instructor’s average rating is above or below the mean for the department says very little. Averages paint a very incomplete picture.  It would be far better to report the distribution of scores for instructors and for courses: the percentage of ratings that fall in each category (1–7) and a bar chart of those percentages.
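The two scenarios above can be checked directly. The sketch below is illustrative: the function `share_below` and the two lists of past instructor averages are hypothetical, constructed to match the scenarios in the text (ten past offerings in each, both with a course average of 4.5).

```python
def share_below(value, reference_averages):
    """Fraction of reference averages that are strictly below `value`."""
    below = sum(1 for avg in reference_averages if avg < value)
    return below / len(reference_averages)


instructor_average = 4.2

# Two hypothetical histories for the same course, both averaging 4.5.
tight = [4.5] * 10                  # every other instructor averages exactly 4.5
scattered = [6.0] * 5 + [3.0] * 5   # 6s half the time, 3s the other half

for name, history in (("tight", tight), ("scattered", scattered)):
    course_average = sum(history) / len(history)
    share = share_below(instructor_average, history)
    print(f"{name}: course average {course_average}, "
          f"share of past averages below 4.2 = {share}")
# tight: course average 4.5, share of past averages below 4.2 = 0.0
# scattered: course average 4.5, share of past averages below 4.2 = 0.5
```

In the first history, 4.2 is the lowest average on record. In the second, it sits in the middle. The course average is 4.5 in both cases, so it cannot distinguish them.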

All the children are above average

At least half the faculty in any department will have teaching evaluation averages at or below the median for that department. Someone in the department will be worst. Of course, it is possible for an entire department to be “above average” compared to all Berkeley faculty, by some measure. Rumor has it that department chairs sometimes argue in merit cases that a faculty member with below-average teaching evaluations is an excellent teacher, just perhaps not as good as the other teachers in the department, all of whom are superlative. This could be true in some departments, but it cannot be true in every department. With apologies to Garrison Keillor, while we have no doubt that all Berkeley faculty are above average compared to faculty elsewhere, as a matter of arithmetic they cannot all be above average among Berkeley faculty.

Comparing incommensurables

Different courses fill different roles in students’ education and degree paths, and the nature of the interaction between students and faculty in different types of courses differs. These variations are large and may be confounded with teaching evaluation scores.[8] Similarly, lower-division students and new transfer students have less experience with Berkeley courses than seniors have. Students’ motivations for taking courses vary, in some cases systematically by the type of course. It is not clear how to make fair comparisons of student teaching evaluations across seminars, studios, labs, large lower-division courses, gateway courses, required upper-division courses, etc., although such comparisons seem to be common[9] and are invited by the administration, as evidenced by the excerpt above.

Student Comments

What about qualitative responses, rather than numerical ratings? Students are well situated to comment on their experience of the course, including factors that influence teaching effectiveness, such as the instructor’s audibility, legibility, and availability outside class.[10]

However, the depth and quality of students’ comments vary widely by discipline. Students in science, technology, engineering, and mathematics tend to write much less, and much less enthusiastically, than students in arts and humanities. That makes it hard to use student comments to compare teaching effectiveness across disciplines—a comparison the Senate Budget Committee and the Academic Personnel Office make. Below are comments on two courses, one in Physical Sciences and one in Humanities. By the standards of the disciplines, all four comments are “glowing.”

Physical Sciences Course:

“Lectures are well organized and clear”

“Very clear, organized and easy to work with”

Humanities Course:

“There is great evaluation of understanding in this course and allows for critical analysis of the works and comparisons. The professor prepares the students well in an upbeat manner and engages the course content on a personal level, thereby captivating the class as if attending the theater. I’ve never had such pleasure taking a class. It has been truly incredible!”

“Before this course I had only read 2 plays because they were required in High School. My only expectation was to become more familiar with the works. I did not expect to enjoy the selected texts as much as I did, once they were explained and analyzed in class. It was fascinating to see texts that the author’s were influenced by; I had no idea that such a web of influence in Literature existed. I wish I could be more ‘helpful’ in this evaluation, but I cannot. I would not change a single thing about this course. I looked forward to coming to class everyday. I looked forward to doing the reading for this class. I only wish that it was a year long course so that I could be around the material, GSI’s and professor for another semester.”

While some student comments are extremely informative—and we strongly advocate that faculty read all student comments—it is not obvious how to compare comments across disciplines to gauge teaching effectiveness accurately and fairly.[11]

In summary:

●      Response rates matter, but not in the way campus policy suggests. Low response rates need not signal bad teaching, but they do make it impossible to generalize results reliably to the whole class. Class size matters, too: All else equal, expect more semester-to-semester variability in smaller classes.

●      Taking averages of student ratings does not make much sense statistically.  Rating scales are ordinal categorical, not quantitative, and they may well be incommensurable across students. Moreover, distributions matter more than averages.

●      Comparing instructor averages to department averages is, by itself, uninformative. Again, the distribution of scores—for individual instructors and for departments—is crucial to making meaningful comparisons, even if the data are taken at face value.

●      Comparisons across course types (seminar/lecture/lab/studio), levels (lower division / upper division / MA / PhD), functions (gateway/major/elective), sizes (e.g., 7/35/150/300/800), or disciplines are problematic. Online teaching evaluations invite potentially inappropriate comparisons.

●      Student comments provide valuable data about the students’ experiences. Whether they are a good measure of teaching effectiveness is another matter.

In the next installment, we consider what student teaching evaluations can measure reliably. While students can observe and report accurately some aspects of teaching, randomized, controlled studies consistently show that end-of-term student evaluations of teaching effectiveness can be misleading.


[1] See Cashin (1999), Clayson (2009), Davis (2009), Seldin (1999).

[2] Defining and measuring teaching effectiveness are knotty problems in themselves; we discuss this in the second installment of this blog.

[3] There were plans to conduct randomized, controlled experiments to estimate systematic differences during the pilot of online teaching evaluations in 2012-2013; the plans didn’t work out. One of us (PBS) was involved in designing the experiments.

[4] There are many proposals to provide social and administrative incentives to students to encourage them to fill out online evaluations, for instance, allowing them to view their grades sooner if they have filled out the evaluations. The proposals, some of which have been tried elsewhere, have pros and cons.

[5] See, e.g., http://xkcd.com/470/

[6] Consider these scenarios:

(1) The instructor has invested an enormous amount of effort in providing the material in several forms, including online materials, online self-test exercises, and webcast lectures; the course is at 8am. We might expect attendance and response rates to in-class evaluations to be low.

(2) The instructor is not following any text and has not provided any notes or supplementary materials. Attending lecture is the only way to know what is covered. We might expect attendance and response rates to in-class evaluations to be high.

(3) The instructor is exceptionally entertaining and gives “hints” in lecture about what to expect on exams; the course is at 11am. We might expect attendance and response rates to in-class evaluations to be high.

The point: Response rates in themselves say little about teaching effectiveness.

[7] See, e.g., McCullough & Radson (2011).

[8] See Cranton & Smith (1986), Feldman (1978, 1984).

[9] See, e.g., McKeachie (1997).

[10] They might also be able to judge clarity of exposition, but clarity may be confounded with the intrinsic difficulty of the material.

[11] See Cashin (1990), Cashin & Clegg (1987), Cranton & Smith (1986), Feldman (1978).

Co-authored with senior consultant Richard Freishtat, Ph.D., and cross-posted from UC Berkeley's Center for Teaching and Learning blog