Evaluation of Teaching

Attempts to Improve the Evaluation of Teaching

For additional discussion, see K. Bain, What the Best College Teachers Do (Cambridge, MA: Harvard University Press, 2004).


When we evaluate teaching, we ordinarily want to assess both what the teaching hopes to help people learn and whether it succeeds in that intent. In short, we want to know what kind of sustained influence the teaching and the course have on the way students think and act. Does the teaching help and encourage students to learn something we regard as important and appropriate? Does it have a sustained, substantial, and positive influence on the way students think and act? (We might also want to know whether the instructor has assessed students’ learning accurately and fairly.)

How can an evaluator know any of this? Data to answer these questions might come from a variety of sources, including syllabi and other course material or an instructor’s reflective essay on the intellectual definition of the course. Part of the data might come from student ratings and comments. Committees, chairs, deans, the Provost, and the President must evaluate using the data from student ratings and comments and from other sources. Notice the distinction between sources of information about teaching and the process of evaluation.

I. Student Ratings and the Evaluation of Teaching

What can student ratings tell us that will help us find out what the teaching hopes to accomplish and whether it has been successful? The literature on student ratings is voluminous. One summary in September 1995 found more than 1500 articles and books on the subject.1

The extensive research covered in those works has found that student ratings and comments can provide valid and reliable information that can help an evaluator determine the effectiveness of a teacher. Indeed, the research has discovered that student ratings can correlate well with external measures of student learning and with instructor self-ratings when the latter are collected independently of personnel decisions. It has also found that student ratings are statistically reliable (i.e., they have internal stability and are consistent over time), are more statistically reliable than are colleague ratings, and are not easily or automatically manipulated by grades. In fact, intellectually challenging classes average higher ratings than do easier courses with light workloads.2 Most important, student ratings can, as one observer put it, “report the extent to which the students have been reached [educationally].”3 Yet student ratings have their limitations. We will say more later about those limits, but, first, let us consider how student ratings can help.

Research has found that certain questions produce the most reliable results. The following types of questions have a strong track record:4

Using a six-point scale (1=lowest; 6=highest)

  • Provide an overall rating of the instructor.
  • Give an overall rating of the course.
  • Estimate how much you learned in the course.

Two additional questions find favor with many evaluators because they also solicit information about the student’s perception of the results of instruction (in essence, asking, “Did the course reach you educationally?”) and are, therefore, highly recommended:

Again, on a six-point scale:

  • Rate the effectiveness of the instructor in stimulating your interest in the subject.5
  • Rate the effectiveness of this course in challenging you intellectually.6

A form should also collect some demographic information on students:7

  • My classification is: a) Graduate b) Senior c) Junior d) Sophomore e) Freshman
  • My major is:
  • I took this course to satisfy: a) Major or minor field requirements b) Other specific degree requirements c) Elective credits required for a degree d) Non-degree requirements e) General interest
  • Before taking this course my interest in the subject was: a) Very low b) Low c) Average d) High e) Very high

A consideration of these factors can help control for possible sources of bias in student ratings. Research has found, for example, that prior student interest in the subject does influence the outcome of student ratings of effectiveness.8
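One way an evaluator might take such a factor into account is to look at average ratings within each level of prior interest rather than only at a single class-wide average. The following is a minimal sketch in Python, not a prescribed procedure: the response data, the variable names, and the helper function mean_rating_by_group are all hypothetical and chosen only for illustration.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical responses (invented for illustration only):
# (prior interest in the subject, overall instructor rating on the 1-6 scale).
responses = [
    ("very low", 3), ("low", 4), ("average", 4),
    ("average", 5), ("high", 5), ("high", 6), ("very high", 6),
]

def mean_rating_by_group(rows):
    """Return the mean rating for each group label that appears in rows."""
    grouped = defaultdict(list)
    for group, rating in rows:
        grouped[group].append(rating)
    return {group: mean(ratings) for group, ratings in grouped.items()}

# Mean rating within each prior-interest level, plus the class-wide mean.
by_interest = mean_rating_by_group(responses)
overall = mean(rating for _, rating in responses)
```

Comparing the per-group means with the class-wide mean gives a rough sense of how much the mix of students in a class, rather than the teaching itself, may be shaping the overall figure. A real analysis would need far more responses and more careful statistical controls.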

Finally, a form can usefully solicit open-ended responses if evaluators are willing to read all of the student comments on a given class. To read only a few responses invites distortions in the mind of the evaluator. Because student ratings and student comments convey virtually the same information, evaluators are likely to make fewer mistakes if they use ratings alone than if they use ratings plus a reading of only a few of those comments or summaries of them.

II. Student Ratings for Formative (Self-Improvement) Purposes

The questions noted above can provide an evaluator with extremely valuable information with which to make a judgment about the quality of teaching. There are some additional questions that might be used to help professors improve their teaching. Such questions ask about student perceptions of particular methods of teaching: Did the instructor communicate well? Was the instructor available and willing to provide assistance outside of the classroom? Was the course carefully planned and well organized? Research has found that if feedback is collected in the first half of the term, it can help instructors improve the ratings they will receive at the end of the term, and that those end-of-term ratings improve greatly if consultation accompanies the feedback.9 The Office for Faculty Advancement currently offers a service called a “Student Small Group Analysis” designed to help instructors collect detailed feedback from their students during the term and receive feedback on the results.

Yet questions about methods should not be used for summative evaluations (personnel decisions) because they ask about processes of achieving good teaching while the questions in section one concentrate on assessing the results. In other words, one might get high marks on “how much students learned” and low marks on, say, being “readily available and willing to provide assistance outside the classroom” or on “the course was carefully planned and well organized.” We might then argue that this person, nevertheless, did excellent teaching (helped and encouraged students to learn), but did so despite ignoring some conventional wisdom on how best to teach. Conversely, one might get high marks on all of the process questions, and still fail miserably as a teacher (not help students learn anything worth learning as defined by the curriculum and the school or in ways that make a sustained and substantial difference). Process questions do not tell us anything that result questions cannot, except perhaps that a person used this-or-that process, but they can be misleading, potentially punishing those who achieve good results with unorthodox methods, or who teach in fields in which some conventional methods are not appropriate.

III. Limitations of Student Ratings

Two objections to the use of student ratings for summative purposes often emerge. One objection argues that teachers can buy higher ratings with higher grades for the students, thus corrupting the evaluation of both students and faculty. Yet considerable research on the subject has found that students do not automatically give higher ratings to classes in which they receive the highest grades.10 Indeed, the highest marks often go to the most challenging courses. Furthermore, researchers using multiple regression analysis and path analysis to study the influence of various factors on the outcome of student ratings have found that expected grades account for only a tiny percentage (2.6 percent) of the result. Other factors account for more:11 prior subject interest, 5.1 percent; workload/difficulty, 3.6 percent (notice that, despite popular conceptions to the contrary, the latter factor is positively related to student ratings; that is, more difficult well-taught classes receive higher marks).

Perhaps, as one observer put it, “what matters is what faculty think, not what is true. . . . If faculty believe, no matter how erroneously, that lowering of standards will produce higher student [ratings]…, then faculty will live out that belief. They will lower standards and have guilty consciences, or they will hold the line on standards and feel victimized or virtuous–all on the basis of what they believe to be the connection between [ratings]… and standards.”12 Thus, some understanding of the research on student ratings is essential. NYU faculty members have a long history of using student ratings and understand that intellectually challenging courses graded with high standards will produce the best results. The university can help new faculty members develop that same appreciation.

A second objection arises from the belief that some students will use the rating process as an opportunity to punish teachers, presumably over some low grade or other factor unrelated to teaching quality. While there may be isolated instances of such behavior, the extensive research on student ratings has found that such cases are largely spurious or so infrequent that they do not corrupt the process. Indeed, research has found that student ratings tend to be higher when the directions say the results will be used for personnel decisions than they do when the form indicates that the results will go only to the instructor. This is certainly not the pattern of students determined to “punish” the instructor. [Would student ratings go up at Montclair State if more students became convinced that the ratings really do matter in personnel decisions?] Furthermore, student ratings do not automatically go down with lower grades or up with higher ones;13 they have both high internal consistency and high rater stability over time.14

Yet it is possible to ask questions that tend to corrupt the process. Perhaps the most unreliable question is one that asks little more than “How much did you enjoy the class?” The early Dr. Fox experiments demonstrated that if surveys ask some form of that query, student responses may or may not tell us much about the quality of teaching (such a question can be phrased in all sorts of ways, of course). While students usually do enjoy courses that are the most intellectually challenging and meaningful, they can also report that they enjoy a particular class that contributes little to their learning.15 Yet, and this is extremely important, when surveys ask those same students to assess their learning in that particular class or to provide an overall rating of the instructor and course or to assess its intellectual contributions, the students, as a group, are able to distinguish “fluff” from substance.

Equally important, what is true of the whole is not necessarily true of every part. Exceptions exist for all of the generalizations noted here. Student ratings can be wrong16 (although they may err more on the side of too much praise than too little). Students are not always well equipped to judge the course as an intellectual product, to determine whether it is appropriate to the curriculum or sufficiently rigorous. We must use other sources of information, along with the results of student ratings, to clarify the picture. Student ratings can provide valuable information, but they cannot always tell evaluators everything needed to make valid, reliable assessments of teaching effectiveness.

Evaluators should use student rating data along with information from other sources to evaluate teaching. The AAHE Peer Review Project has developed some highly effective ways to collect such additional information. That project treats teaching as a serious intellectual enterprise and courses as important intellectual creations, and it emphasizes ways to assess both the nature of the course and its intellectual influence on students. As the Peer Review Project recognizes, instructors can provide valuable information to evaluators, not simply to say how good or bad they have been but to make a case. That case should be an argument (with supporting evidence) (1) that the instructor tried to help students achieve certain intellectual (physical or emotional) goals and (2) that the effort had a certain success (or failure). If good teaching helps and encourages students to learn something worthwhile and in a way that makes more than a passing difference in the way students think and act, what evidence can the instructor offer about the value of the content (the learning objectives) and about the success or failure of efforts to help students achieve those objectives? Can the instructor offer evidence that the effort to help students learn was somehow worthy even if students did not learn?

Evaluators must decide whether the objectives are valuable and the effort to help and encourage students to learn is sufficiently successful (or commendable despite its failures). Student ratings, evidence of students’ work, information about assignments, and so forth can provide evidence to support claims from the instructor and help evaluators make judgments.

To make the best use of the data from student rating forms, evaluators need to understand and apply the major findings on what factors will influence ratings (e.g., level of the course, student motivation, etc.), what differences the influences will make, and what factors will not influence ratings significantly (e.g., time of day when the class is taught, the student’s grade point average, etc.). They must consider carefully what the student ratings will and will not tell us about the results of the teaching (about whether it actually helped and encouraged students to learn).

IV. A Summary and Comments on Formative Evaluations

A successful evaluation system should help teachers improve, not just provide evidence for summative judgments. While the five results questions and the four demographics questions discussed in section one should tell evaluators everything they need to know from students about the success of the teaching, such questions may not tell instructors how they can enhance their teaching. Professors may need to ask additional “formative” questions that will help them identify specific strengths and weaknesses. Comments from students can also help instructors improve.

Each instructor should have the opportunity to add one or more formative questions, the results of which would go only to the instructor. A central office could maintain a bank of such formative questions from which each instructor could choose, or an instructor could write his or her own questions. Instructors could ask about specific assignments or particular lectures or discussions. Every form could also contain some standard open-ended questions, the results of which could go only to the instructor:

  • What are the primary teaching strengths of the instructor?
  • What are the primary weaknesses of the instruction? Can you offer suggestions for improvement?
  • Did the course help you learn? Why or why not? If so, what did it help you learn?

Finally, each form could contain at least one question, the results of which for each class could be reported to individual instructors while departmental and/or school averages on this question could be reported to chairs, deans, and the Provost:

  • On average, I spent the following number of hours per week on this course:
    (a) 0-3; (b) 4-7; (c) 8-11; (d) 12-15; (e) 16-19; (f) 20+

The results of such a question might be valuable in, for example, a study of how students spend their time, but they should not be used in evaluating teaching because the question measures process rather than learning results.

Copyright(c) 1996 by Kenneth R. Bain
Used by Permission

End Notes

1. William E. Cashin, “Student Ratings of Teaching: The Research Revisited.” IDEA Paper, No. 32, September, 1995 [Center for Faculty Evaluation and Development, Kansas State University, Manhattan, KS].

2. See, for example, Peter A. Cohen. “Student Ratings of Instruction and Student Achievement: A Meta-analysis of Multisection Validity Studies.” Review of Educational Research 51 (Fall, 1981): 281-309; Judith D. Aubrecht. “Are Student Ratings of Teacher Effectiveness Valid?” IDEA Paper, No. 2, November 1979 [Center for Faculty Evaluation and Development, Kansas State University, Manhattan, KS]; Robert T. Blackburn and Mary Jo Clark. “An Assessment of Faculty Performance: Some Correlates Between Administrator, Colleague, Student and Self-ratings.” Sociology of Education 48 (Spring, 1975): 242-256; Larry Braskamp and Darrel Caulley. “Student Ratings and Instructor Self-ratings, and Their Relationship to Student Achievement.” American Educational Research Journal 16 (Summer, 1979): 295-306; Frank Costin, William Greenough, and Robert Menges. “Student Ratings of College Teaching: Reliability, Validity, and Usefulness.” Review of Educational Research 41 (December, 1971): 511-535; Frank Costin, “Do Student Ratings of College Teachers Predict Student Achievement?” Teaching of Psychology 5 (April, 1978): 86-88.

3. Kenton Machina. “Evaluating Student Evaluations.” Academe 73 (May-June, 1987): 19-20.

4. P. C. Abrami. “How Should We Use Student Ratings to Evaluate Teaching?” Research in Higher Education 30 (1989): 221-227; J. A. Centra. Reflective Faculty Evaluation: Enhancing Teaching and Determining Faculty Effectiveness. San Francisco: Jossey-Bass, 1993; Larry A. Braskamp and John C. Ory. Assessing Faculty Work: Enhancing Individual and Institutional Performance. San Francisco: Jossey-Bass, 1994; William E. Cashin and Richard G. Downey. “Using Global Student Ratings for Summative Evaluation.” Journal of Educational Psychology 84 (1992): 563-572; William Cashin, R. G. Downey and G. R. Sixbury. “Global and Specific Ratings of Teaching Effectiveness and Their Relation to Course Objectives: Reply to Marsh.” Journal of Educational Psychology 86 (1994): 649-657; Kenneth A. Feldman. “Instructional Effectiveness of College Teachers as Judged by Teachers Themselves, Current and Former Students, Colleagues, Administrators and External (Neutral) Observers.” Research in Higher Education 30 (1989): 583-645.

5. This question is important if the teaching is supposed to stimulate continued student interest and possible learning in the course, one of the ways the course might have a sustained, substantial and positive influence on the way students think and act.

6. The language of “rating” is used throughout to emphasize the notion that students offer ratings, not evaluations.

7. See, for example, Herbert W. Marsh. “Students’ Evaluations of University Teaching: Dimensionality, Reliability, Validity, Potential Biases, and Utility.” Journal of Educational Psychology 76 (No. 5, 1984): 707-754; Kenneth A. Feldman. “Course Characteristics and College Students’ Ratings of Their Teachers: What We Know and What We Don’t.” Research in Higher Education 9 (No. 3, 1978): 199-242.

8. Students who take courses to satisfy general interest or as a major elective tend to give higher ratings; students who take courses to satisfy a major requirement or a general education requirement tend to give lower ratings. See, for example, Herbert W. Marsh and M. Dunkin. “Students’ Evaluations of University Teaching: A Multidimensional Perspective.” In J. C. Smart, editor. Higher Education: Handbook of Theory and Research. Volume 8. New York: Agathon, 1992: 143-233.

9. Peter A. Cohen. “Effectiveness of Student-Rating Feedback for Improving College Instruction: A Meta-Analysis of Findings.” Research in Higher Education 13 (1980): 321-341.

10. See, for example, George Howard and Scott Maxwell. “Correlation Between Student Satisfaction and Grades: A Case of Mistaken Causation.” Journal of Educational Psychology 72 (December, 1980): 810-820.

11. See, for example, H. W. Marsh. “The Influence of Student, Course, and Instructor Characteristics in the Evaluations of University Teaching.” American Educational Research Journal 17 (Summer, 1980): 219-237.

12. Kenton Machina. “Evaluating Student Evaluations.” 20.

13. The literature on the correlations between grades and student ratings is long and complex. As noted in the text earlier, student ratings tend to be slightly higher if students expect to receive higher grades. But this does not necessarily mean that grade leniency accounts for the differences that have been noticed. Some research suggests that the differences come because students give higher ratings if (1) they are highly motivated and (2) they are learning more and can thus expect to get higher grades. See, for example, Howard and Maxwell. “Correlation Between Student Satisfaction and Grades: A Case of Mistaken Causation.”: 810-820; and George Howard and Scott Maxwell. “Do Grades Contaminate Student Evaluations of Instruction?” Research in Higher Education 16 (1982): 175-188. The best way to determine if a course is leniently graded is probably through a review of course materials and methods and practices of evaluating students. Lenient grading, however, does not necessarily mean less learning. Because of the different standards by which different faculty members assign different letter grades, the only way to determine levels of learning is to look in detail at actual student performances (the papers they write, the types of questions they can answer, the problems they can solve, the performances they give) and the way those performances change over time; mere class grade point averages cannot provide that information.

14. See, for example, Larry A. Braskamp, Dale C. Brandenburg, and John C. Ory. Evaluating Teaching Effectiveness: A Practical Guide. Beverly Hills: Sage Publications, 1984; Kenneth A. Feldman. “The Significance of Circumstances for College Students’ Ratings of Their Teachers and Courses.” Research in Higher Education 10 (No. 2, 1979): 149-172.

15. See, for example, Robert M. Kaplan. “Reflections on the Doctor Fox Paradigm.” Journal of Medical Education 49 (March, 1974): 310-312; Donald H. Naftulin and John E. Ware, Jr. “The Dr. Fox Lecture: A Paradigm of Educational Seduction.” Journal of Medical Education 48 (July, 1973): 630-635; H. W. Marsh. “Experimental Manipulations of University Student Motivation and Their Effects on Examination Performance.” British Journal of Educational Psychology 54 (June, 1984): 206-213.

16. Naftulin and Ware. “The Dr. Fox Lecture.” 630-635; Braskamp, et al. Evaluating Teaching Effectiveness; Feldman. “The Significance of Circumstances for College Students’ Ratings of Their Teachers and Courses.” 149-172.

17. Special Note: The research has found a very high positive correlation between student comments and ratings on the kinds of results questions recommended here, suggesting that the student comments will not provide the evaluator with any evidence to make a judgment that is not available from the ratings. See, for example, John C. Ory, Larry Braskamp, and D. M. Pieper. “Congruency of Student Evaluative Information Collected by Three Methods.” Journal of Educational Psychology 72 (1980): 181-185.