Research on Evaluation of Teaching: A Comprehensive Perspective
The Evaluation of Teaching
By Ken Bain
"Improvements in the evaluation and improvement of teaching must rest on carefully considered views of what counts as excellence in university pedagogy . . . Departments, schools or programs should generate their own statements of mission and, hence, of teaching excellence in their field. No single statement will adequately cover all fields, nor should anyone presume to prescribe a universal set of standards."
"Towards Greater Excellence in Teaching at Stanford"
While in the twentieth century academic eminence often came solely from reputations in research and published scholarship,(1) it seems likely, given important changes that are already taking place among the great institutions of higher learning, that such a standing in the twenty-first century will come only if the university excels in both research and teaching and recognizes how these two enterprises can complement each other. Rather than thinking in terms of the traditional dichotomy of research and teaching, a separation that often paralyzed higher education in the twentieth century, we can begin to think of ourselves as a learning community concerned with the learning of both faculty (research) and students (teaching) and the ways in which the learning of one can benefit the other.
Yet a university can achieve greater eminence in neither teaching (the cultivation of student learning) nor research (faculty learning) unless it has a system to evaluate each enterprise. While methods to evaluate the latter are long established in the academic community, methods to evaluate teaching are not, in part because many academics have been convinced that there is no good way to evaluate teaching. Accordingly, teaching excellence has perhaps not been as visibly promoted and rewarded among the faculty as have been the enterprises of research and traditional scholarship. Every university must rectify this situation if it is to achieve high standards. The changing expectations of great universities will allow nothing else.
We propose in what follows a concrete and systematic set of guidelines to identify, recognize, and reward outstanding teaching within a university. Drawing upon various sources, including material developed in the American Association of Higher Education’s Peer Review Project, we sketch out here the practices and insights that must play a role in any comprehensive evaluation program, designing a set of flexible guidelines and suggestions to evaluate teaching for summative purposes. We envision this report serving as a template for the evaluation process, from which administrators in individual schools and departments can formulate procedures that are specific to their needs and expectations. We have included in what follows some of the research from the theoretical literature on evaluating teaching, but we have striven primarily to offer a very practical and useful set of guidelines and suggestions.
Before laying out that plan, one final and extremely important introductory note: the first step in implementing a serious and meaningful program of teaching evaluation is to conceptualize teaching as a form of scholarship, something most university faculty members already do. Scholars have long believed, as the physicist Robert Oppenheimer said in 1954, “It is proper to the role of the scientist that he not merely find the truth . . . but that he teach, that he try to bring the most honest and most intelligible account of new knowledge to all who will try to learn.” But it was Ernest Boyer’s 1990 report Scholarship Reconsidered that took these ancient commitments of the scholar to teaching and carried the idea an additional step. He argued that teaching is not merely a logical outcome of scholarship but is most properly thought of as a form of scholarship in its own right, along with the scholarships of discovery [what we normally call research], integration, and application. Teaching as scholarship implies that we recognize that the creation and manifestation of a course are challenging, creative, and consequential intellectual tasks and that every course we craft is a lens into our field and our personal conception of our disciplines or inter-disciplines. As Russell Edgerton, Pat Hutchings, and Kathleen Quinlan wrote in their discussion of the scholarship of teaching, “At bottom, the concept entails a view that teaching, like other scholarly activities . . . relies on a base of expertise, a ‘scholarly knowing’ that needs to and can be identified, made public, and evaluated; a scholarship that faculty themselves must be responsible for monitoring.” Lee Shulman, Boyer’s successor as president of the Carnegie Foundation for the Advancement of Teaching, argued that teaching is the highest form of scholarship because it, unlike any of the others, necessarily entails all of the others. “Indeed,” Boyer wrote, “as Aristotle said, ‘Teaching is the highest form of understanding.’”
If we recognize the creation and manifestation of a course as an intellectual task as challenging, creative, and consequential as a piece of published scholarship, then we will begin to recognize teaching as a practice that demands both serious evaluation and meaningful rewards. If we recognize, as Boyer wrote, that “knowledge is acquired through research, through synthesis, through practice, and through teaching,” we can accept that teaching is properly thought of as a form of scholarship that deserves meaningful recognition, evaluation, and reward. To think in these terms means that we must expect the scholarly qualities of our teaching to meet the highest standards and that, in our search for a method to evaluate teaching, we must find ways to identify and assess the intellectual or artistic qualities of our teaching.
What Kind of Evaluation and for What Purpose?
Evaluation of teaching might be done in a variety of circumstances and for several different reasons. We might evaluate to help someone improve his or her teaching (what the literature often calls "formative" evaluations), or we might do so to help make judgments about hiring, retention, merit pay, awards, or promotions (what the literature often calls "summative" evaluations). Faculty members and departments should already be fostering formative evaluations. Many schools have already implemented regular mechanisms and means through which faculty members can obtain feedback on, and work towards improving, their own teaching.
Our concern in this report, however, is primarily with the evaluation of teaching for summative purposes.(2) Summative evaluations of teaching can play a substantial role in decisions made 1) to hire someone to the faculty; 2) to retain, promote, or tenure a faculty member; or 3) to offer merit pay raises to faculty members.
There are two fundamental and simple questions that we should ask before we begin considering how to evaluate teaching:
1) What questions do we want to answer?
In other words, what are we evaluating? Are we interested in the teacher's ability to motivate students? To have a sustained and substantial influence on a student's intellectual development? To foster a sense of community in the classroom? To prepare students for external exams of some sort? All of these elements? The evaluation process must begin with an answer (or answers) to this question, which defines the terms and objectives of that process.
Our response to this question, based both on practical experience in the classroom and extensive research in the learning sciences, runs like this: "Does the teacher help and encourage students to learn something worth learning in a way that makes a sustained, substantial and positive difference in the way they think, act, or feel, without doing the students any harm?" This answer breaks down into four essential components:
A. Is the teacher's material worth learning? Has the teacher identified questions worth answering? reasoning skills worth obtaining? abilities worth developing? Do the teacher's course objectives and course materials offer a window, however wide or narrow, into the fundamental problems and questions of that teacher's discipline? Will the learning objectives, if achieved by students, prepare them adequately for additional study?
It is here, more than with the questions to follow, that we can begin to define and assess the scholarship of teaching, to apply scholarly standards to the evaluation of teaching. Are the information, ideas and practices that students are expected to learn both appropriate to the level of the course and important within the discipline, either as gateways to ongoing learning of the field or as important parts of the field on their own? Does what is taught show that the professor has chosen learning objectives with an adequate understanding of the existing scholarship in the field? Do the content and the professor's teaching of that content reflect that same understanding?
B. Are the students learning what the course is supposedly teaching? Do exams, projects, and classroom performance reflect students' mastery of the course material? Have the students gained or honed the intellectual, physical, or emotional skills the teacher's course objectives promised to them? Has the course had a sustained and substantial influence on how they think, act, or feel?
C. Are the teacher's strategies effective in helping and encouraging the students to learn? Has the teacher successfully motivated their interest in the course? Do his or her teaching strategies and techniques actually facilitate student learning? Or are students learning in spite of the teacher?
D. Has the teacher avoided doing harm to students? Or has the teacher done harm to the students by simply fostering short-term learning with threats and intimidation tactics in the classroom? Has the instructor inappropriately discouraged rather than stimulated additional interest in the field? Has he or she conducted the class in an ethically appropriate manner? Has the teacher handled student diversity with sensitivity, respected students' right to dissent and disagree, and allotted to the course sufficient time and energy to teach the material effectively?(3)
Perhaps most important, has the instructor evaluated student learning fairly and accurately? Has that evaluation been based on the students' ability to achieve the stated learning objectives? Or has it been based on some other standard, in essence redefining the objectives (perhaps along lines that might be deemed inappropriate or inadequate)?
These questions are probably essential ones for evaluation in any imaginable discipline and teaching situation, but individual administrators certainly might have additional questions that they would like to see answered as part of the evaluation process.(4)
There are certain questions, however, we would insist that it is not important to answer. Any questions that try to evaluate aspects of the teacher's delivery skills would be inappropriate for an evaluation process that is interested in assessing a teacher's ability to help and encourage student learning. Teachers need not be entertainers or performers to be effective; indeed, many outstanding teachers foster student learning most effectively by remaining off-stage as much as possible. While performance can certainly affect student learning, it is not the performance per se that should be under review but the success or failure of that performance in helping and encouraging students to learn. In short, the evaluation should not judge how the instructor helps and encourages students to learn something worth learning (as long as no harm is done) but whether such help and encouragement exist.
This leads to a second set of issues that the evaluation process should not address: the worth and quality of various teaching methods and strategies. The evaluation process will lose credibility if it implicitly or explicitly endorses any one teaching method as the most effective tool for enhancing student learning. The research on teaching and learning continually insists that no teaching method trumps all others. As Braskamp and Ory have written, "no single instructional strategy is always superior to any other . . . . Faculty who lecture are not necessarily better teachers than faculty members who use discussion techniques" (Ory, 18). Hence faculty members should not necessarily be rewarded or punished for using either traditional or experimental instructional techniques. The evaluation should focus upon the ability of the teacher to help and encourage students to learn, whatever the methods may be.(5)
We can, however, imagine a situation in which a department or school or a university might want to reward someone for conducting a carefully studied experiment that might inform and benefit colleagues even if the results did not produce better learning. Indeed, the evaluation system we outline here can and should identify and reward efforts to test methods of university teaching, enterprises that contribute to a better understanding of what and how we should teach.
2) What will count as evidence in answering those questions?
What materials will give you satisfactory answers to your evaluative questions? While answers to this question will vary, depending upon your initial evaluative questions, what we can say with absolute certainty is that no one source of data--either self-evaluations, or student ratings and comments, or the observations of a peer--will provide you with sufficient evidence for a summative judgment. Any evaluative process must rely on multiple sources of data, which are then compiled and interpreted by an evaluator or evaluative committee. Student remarks and ratings on rating forms, in other words, are not evaluations; they are one set of data that an evaluation process should take into consideration. The same can be said for self-evaluations, and the results of peer or administrative observations. In short, different sources of data, or evidence, will be appropriate for different evaluative questions. In the next section, we consider the kinds of evidence that might be appropriate for the four parts of the evaluative question we posed above.
Once you have determined both the questions you wish to answer and the evidence which will provide you with answers to these questions, a final step remains: selecting the evaluators, and considering how you will train them to make informed and reflective decisions. It seems evident that more than one individual should evaluate individual teachers, and that evaluation committees should be composed of both administrators and peers. Precisely how those evaluators should be selected is a complex question that requires ongoing discussion.
But it also seems evident that evaluators should have some means of communicating with each other, and achieving a loose consensus upon, the standards of evaluation. That process may take the form of a brief training program, sponsored by experts in the evaluation of teaching, or it may simply take the form of group discussions prior to the evaluative process. Each member of the evaluative committee could spell out the standards he or she intends to apply, and then the group could work together to compile standards acceptable to the entire committee. Evaluation will be most successful, effective, and valid if some minimum degree of consensus on standards is achieved prior to the evaluation process.
Finally, as Larry A. Braskamp, Dale C. Brandenburg, and John C. Ory have argued in Evaluating Teaching Effectiveness: A Practical Guide, "we should regard [evaluation] as a form of argument" (8). Evaluation, like research, is not simply about numbers. It involves presenting a case: advancing an argument about the effectiveness of a teacher, presenting evidence to support that argument, and submitting that argument to the judgment of one's peers. We have no numerical scale to determine the ultimate worth of a particular piece of research; the same applies to teaching. But we do evaluate research, relying upon our own self-scrutiny, as well as the careful and experienced judgment of peers, publishers, and the reactions of our reading audiences. Teaching is no different: we can evaluate teaching, and we can do so by relying upon our own self-examinations, as well as the careful and experienced judgment of our peers, administrators, and our students.
How can we use these ideas to carry out a specific evaluation process? We expect an evaluation to answer the question, does the teacher help and encourage students to learn something worth learning without doing the students any harm? What will count as evidence to answer that question? How will that evidence be collected and presented? Will we require different kinds and levels of evidence depending on the purpose of the evaluation?
We suggested above that evaluation involves an argument. What we describe here about the nature of the argument may remind some of the much discussed "teaching portfolio," but we deliberately avoid using the term "portfolio" because too often it has been thought of as a kind of container into which a faculty member simply pours all of the products and descriptions of his or her teaching.(6) We have in mind a case, complete with supporting evidence, that the faculty member would make about her or his efforts to help and encourage students to learn something worth learning (without doing them harm).(7) That case would, in fact, be a series of arguments, with supporting evidence, that answer each of the questions that the department and school have decided are important. For example, such a case might provide an answer to the following questions: What have you tried to help and encourage students to learn? Why are those learning objectives worth achieving for the course you are teaching? What strategies did you use? Were those strategies effective in helping students to learn? Why or why not? What did your students learn as a result of your teaching? [If they are not learning what you want them to learn, why not?] Did you stimulate their interest in the subject?
Those arguments would require careful and rigorous thought on the part of the teacher. Rather than simply gathering material--student ratings, syllabus, etc.--and sending it to the evaluator, the faculty member under review would offer synthetic and carefully organized arguments. Thus, the burden of establishing connections with the evidence and offering coherence throughout would fall on the teacher under review rather than on the evaluator. If the teacher merely submits a container of documents that lack coherence, then the argument for teaching effectiveness has simply not been made.
This approach has the positive benefit of allowing teachers to assume control over what aspects of their teaching are subject to evaluation. Properly conceived and overseen, the case (or argument) on teaching quality can also help ensure that all faculty members are subjected to the same high standards. If we require each faculty member to submit some evidence in each of the four categories we discuss here--though the precise nature of that evidence will be the teacher's decision--then we will ensure both flexibility and a certain degree of continuity in the evaluation process.
Because faculty members must build a case and evaluators must decide whether the case has demonstrated the existence of good teaching, we turn again to the question of what constitutes evidence for each major type of question we have proposed.
A. Is the material worth learning and appropriate to the course and curriculum?
There are really two types of evidence here: evidence that comes from the teacher and evidence that comes from an outside review:
1. Evidence from the teacher: The most important piece of evidence is a statement from the teacher of what he or she has helped students to do intellectually, physically, or emotionally. That statement, in turn, might be supported with references to a lesson plan, lecture, PBL session, clerkship, syllabus, specific assignments or patterns of assignments, or assessments that reflect the learning objectives. Perhaps the strongest support the teacher can provide is not simply a statement that certain objectives have been pursued but evidence from examinations, assignments, problem sets, and so forth demonstrating that students were ultimately evaluated on their ability to meet those objectives rather than on other considerations; it is these assessments, after all, that spell out the true nature of the learning objectives.
2. Evidence from an outside review: The individuals most qualified to judge the quality and adequacy of the learning objectives would be the instructor's peers: other members of his or her discipline, preferably even within that instructor's area of specialty. Just as each discipline has its own protocols and criteria for evaluating research, so will each have its own standards for evaluating learning objectives. The faculty member would work with the department to solicit from another expert in the field an initial judgment of the learning objectives that the faculty member has defined. This is especially important for the case on a single course, but can also be done for the general case on teaching quality. The outside reviewer would look at the case offered by the faculty member, including the evidence reflected in the way students are evaluated, and offer a review. That review of objectives then becomes part of the evidence that subsequent reviewers will see; thus, the first level of evaluation actually develops additional evidence for later evaluators to consider.
B. Are the students learning the material?
As noted earlier, the best kind of evidence of student learning comes from examples of student work. Faculty members must consider carefully what makes them think they have helped students learn. What evidence best illustrates the level of student learning? If students have not progressed in their ability to do whatever they are trying to do intellectually, physically, or emotionally, what does their lack of progress suggest, if anything, about the quality of teaching? We recognize that special circumstances beyond the control of the instructor may keep students from learning. The instructor must simply explain the levels of student learning, connect that explanation to the evidence, and provide examples of that evidence. One important factor we must keep in mind: this question calls for evidence not just of superior student achievements but of improvements in student performances; that is, evidence that the instructor has made a difference.
There is, however, an additional category of evidence available to every instructor. There are questions that one can ask students on a rating form that can provide important evidence about the level of student learning. Considerable research suggests that students are actually quite accurate judges of their own learning, if we ask them the right questions.(8)
Given good questions, students obviously can provide prima facie evidence about the degree of their own intellectual stimulation or the decline or growth in their interest in a subject. Thus, student responses to the question "Rate the effectiveness of the teacher in challenging you intellectually" and the question "Rate the effectiveness of the instructor in stimulating your interest in the subject" provide important evidence about student "learning" in the broadest sense. The evidence from the question "Estimate how much you learned" is a little trickier. We would not regard student self-reports of their learning as sufficient evidence to evaluate their learning because of the possibility of self-serving responses. But research has found that if, on a form to rate teaching, we ask students to estimate how much they have learned, their responses usually have a high positive correlation with independent measures of their learning. Thus, while examples of student work remain the best evidence of their learning, responses to the "Estimate learning" question can usually provide a strong indication of the level of overall student learning in a class and can thus help answer an important question about teaching.
The class "average" responses to each of these questions can reflect the levels of student "achievements," but we want to offer a word of caution that is so important that we will repeat it in other sections. Averages can emerge from a variety of distributions of ratings. They may come from all of the numbers clustered fairly close to the mean. They may come from a combination of both high and low ratings. Each distribution might suggest something quite different about the success of the teaching. In the former case, the instructor might be only marginally successful in reaching everyone while in the latter, the instructor may be highly successful in helping many students but fail completely with others. If evaluators look only at averages, they may fail, for example, to recognize the qualities of a course that is highly successful in helping some students achieve spectacular results yet suffers the wrath of a disgruntled few. Rather than asking only about the averages, both faculty members and evaluators should look at individual ratings and ask, has the teaching been highly successful in reaching anyone? What percentage of the class? How many students, if any, reported that they learned a great deal? Or were challenged intellectually? How many did not? Why not?
Indeed, throughout the evaluation process, professors and their evaluators should focus on the qualities of learning objectives and the efforts to help students achieve them rather than on numbers. What does the teaching contribute to student learning? Does the instructor expect ambitious and creative learning objectives that make important contributions to the thinking about student learning within the discipline? What are the nature and qualities of the learning objectives? Do those objectives reflect the highest scientific and scholarly standards? Is there any reason to believe that the instructor helps any of the students to achieve that highest quality of work? What quality of work do most students achieve?
C. Is the teacher effective in helping and encouraging students to learn?
Again, important evidence will come from student responses to the right student rating questions. When students respond to questions like "Provide an overall rating of the instruction" they are indicating how well the instruction reached them educationally, how well the methods of instruction and the design of the course helped and encouraged them to learn. As already noted, when they respond to the questions noted above they are indicating how well the instructor and course stimulated their interest in the subject and challenged them intellectually.
The instructor can also provide important evidence with a thoughtful self-analysis that explains what he or she has done to promote the specific intellectual, physical, or emotional abilities that are the goals of the course and why there is reason to believe that those efforts have been successful or unsuccessful.
We are reluctant to recommend peer observations as evidence that the instructor has effectively helped students learn. Experience with such peer observations has shown that observers tend to give high marks to colleagues who provide the same kind of help the observer would offer and lower marks to colleagues who do it differently. Furthermore, observing only one or two classes of a course can provide a very distorted picture of what goes on in that classroom on a daily basis or leave observers with false impressions about the way the instructor understands and explains key concepts. An instructor might, for example, help students learn complex ideas by exposing them first to simple explanations and then gradually, over several sessions, unfolding for the students the complexity of the concepts. An observer watching only the first iteration of the idea might believe that the teacher is leaving students with overly simplified notions that distort basic principles when, in fact, the instructor may have employed a strategy that helped students learn the complexities quite effectively. Moreover, an instructor could, of course, have a bad day only when colleagues show up to observe. Most important, we are interested not in the specific methods the teacher uses but whether he or she helps and encourages students to learn on an appropriate level.
A teacher might, however, present, as part of a larger body of evidence about a particular course (see above), a videotape of one or two sessions of class, along with both a written analysis from the teacher and a review from a colleague. That way the instructor can pick those sessions that best represent his or her efforts to help and encourage students to learn and that best capture what he or she is trying to teach. Other observers (students) are in the class on a regular basis to provide a broader report on how well the class is going (see discussion of the use of student ratings).
D. Is the teacher fostering learning without harming students?
Evidence about possible harms might come from a variety of sources. If the course or instructor discourages interest in the subject, students will indicate as much in their ratings of effectiveness in stimulating interest. If the instructor evaluates students unfairly by basing the evaluation on abilities different from the stated objectives, that should be apparent from an examination of evaluation procedures (see above) and, no doubt, reflected in the ratings the students will give the instructor. We are not suggesting instructors must provide evidence of lack of harm, but that evaluators should be sensitive to evidence of harm that does emerge--always insisting on substantial evidence. While we believe that it is necessary to address this question, we suspect that rarely will it be an issue that needs extensive exposition.
We are suggesting that the faculty member think about teaching (in a single session or an entire course) as a serious intellectual act, a kind of scholarship, and that he or she develop a case, complete with evidence, exploring the intellectual meaning and qualities of that scholarship. Each faculty member under review for tenure or promotion could present two cases about his or her teaching: one case on a single course and another for teaching in general.(9)
Each case would consist of (1) a narrative (usually 3-5 typed pages)--a statement of teaching philosophy--that would
(a) define what students should be expected to do intellectually, physically, or emotionally, and why they should develop those abilities;
(b) explain and assess the efforts that have been made to help students learn, with references to specific aspects of the course (assignments, activities, issues, etc.) that have been designed to foster and/or assess the learning that is supposed to take place; and
(c) explore the learning that has taken place (what kind of sustained influence is the teaching likely to have on the way students think, act, or feel), with references to what the students' work, their responses to the student ratings, or other evidence reflect about the influence of the course on them.
It would also consist of (2) the evidence referenced in the narrative (student ratings, syllabus, examples of students' work, assignment sheets, videotapes of class sessions with commentary, etc.).(10)
If we envision the presentation of materials about teaching quality as an argument, we can conceive of the evaluation of teaching as the evaluation of an argument, and the case becomes the pedagogical equivalent of the scholarly paper--a vehicle to capture the scholarship of teaching. Hence--as with the traditional scholarly paper--while the general protocols for conducting the argument should be the subject of university consensus, the final form and content of the argument should remain the choice of the individual teacher. This conception of the case allows individual freedom in determining the data of evaluation, but still requires careful and rigorous thought on the part of the teacher being evaluated. He or she must make the argument, complete with evidence, inferences, and conclusions. If, as we noted above, the case lacks coherence, or is submitted merely as a container of documents on teaching, then the argument for teaching effectiveness has simply not been made.
We have outlined here a procedure that should work well for most faculty members, but departments, schools, and the university must decide who will review these cases. As we have already noted, even in the process of compiling the case and the evidence, the faculty member would work with the department to solicit from another expert in the field an initial judgement of the learning objectives that the faculty member has defined. This is especially important for the case on a single course, but can also be done for the general case on teaching quality. That review of objectives then becomes part of the evidence that other reviewers will see. Once that evidence has been created, the departments should work with the faculty member under review to select both internal and external reviewers who will receive the completed cases and make evaluations of the teaching. The evaluations from each level of review should become part of the growing portfolio that other, subsequent, reviewers will see.
There are some central points that we should repeat for the sake of emphasis and clarity:
1. The information (ratings and comments) collected from the student rating form provides evidence that evaluators can use to make judgements about the quality of teaching if students are asked the five core questions. The ratings are not, by themselves, evaluations.
2. Evaluators should look at the distribution of responses rather than the averages and should keep clearly in mind what each response might suggest about the success of teaching for the student who offered it. [As we noted earlier, averages can emerge from a variety of distributions of ratings. They may come from numbers clustered fairly close to the mean, or from a combination of both high and low ratings. Each distribution might suggest something quite different about the success of the teaching. In the former case, the instructor might be only marginally successful in reaching everyone, while in the latter the instructor may be highly successful in helping most students but fail completely with others.] Rather than asking only about the averages, both faculty members and evaluators should look at individual ratings and ask: has the teaching been highly successful in reaching anyone? What percentage of the class? How many students, if any, reported that they learned a great deal? Or were challenged intellectually? How many did not? Why not?
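The arithmetic behind this point is easy to demonstrate. In the hypothetical sketch below (the rating values are invented for illustration), two classes of ten students produce identical average ratings on a 1-to-5 scale yet describe very different teaching outcomes:

```python
# Two hypothetical classes rate "amount learned" on a 1 (low) to 5 (high) scale.
clustered = [3, 3, 3, 3, 3, 3, 3, 3, 3, 3]  # every student moderately served
polarized = [5, 5, 5, 5, 5, 1, 1, 1, 1, 1]  # half served very well, half not at all

# Both distributions yield exactly the same average.
print(sum(clustered) / len(clustered))  # 3.0
print(sum(polarized) / len(polarized))  # 3.0

# The questions recommended above require looking past the mean:
# how many students reported learning a great deal, and how many did not?
for name, ratings in (("clustered", clustered), ("polarized", polarized)):
    high = sum(1 for r in ratings if r >= 4)
    low = sum(1 for r in ratings if r <= 2)
    print(f"{name}: {high} of {len(ratings)} rated 4 or above; {low} rated 2 or below")
```

The same mean of 3.0 thus describes one instructor who reached no student strongly and another who reached half the class powerfully while failing the rest entirely, which is why evaluators should examine the full distribution rather than the average alone.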
3. Each of the five core questions provides different kinds of information to the evaluator and must be read with care.
4. Some external factors beyond the control of the instructor can influence the way students respond to certain questions, and these factors should be taken into consideration when using the information to make evaluations. Students who take courses to satisfy general interest or as a major elective tend to give slightly higher ratings; students who take courses to satisfy a major requirement or a general education requirement tend to give slightly lower ratings. Prior student interest in the subject can account for as much as 5.1 percent of the variance in ratings. Thus, instructors of senior courses filled with students who report high "interest before taking [the] class" and/or students who are not required to take the class can expect slightly higher ratings than instructors of introductory-level classes filled with students with low prior interest and/or students who are required to take the class. Demographic questions provide information on prior interest and so forth.(11) Evaluators must therefore make comparisons between faculty members who teach classes with similar demographics on these issues.
5. The literature on the correlations between grades and student ratings is long and complex. Student ratings tend to be slightly higher if students expect to receive higher grades. But this does not necessarily mean that grade leniency accounts for the differences that have been noticed. Research has found that students, in general, tend to give higher ratings to courses they regard as intellectually challenging and helpful in meeting those challenges and lower ratings to courses that are easy and in which they do not learn much. Furthermore, students give higher ratings if (1) they are highly motivated and (2) they are learning more and can thus expect to get higher grades.(12)
6. The best way to determine if a course is leniently graded is through a review of course materials and methods and practices of evaluating students. Lenient grading, however, does not necessarily mean less learning. Because of the different standards by which different faculty members assign different letter grades, the only way to determine levels of learning is to look in detail at actual student performances (the papers they write, the types of questions they can answer, the problems they can solve, the performances they give) and the way those performances change over time; mere class grade point averages cannot provide that information.
7. As with all evaluations, the evaluators should keep in mind the questions they are attempting to answer. We believe that student ratings on the Core Questions can provide evidence to help answer the key question, is the teacher effective in helping and encouraging students to learn?(13)
Implementing the Program
In devising these recommendations, we have also considered how they might be implemented. To implement this program, departments, schools, and universities must first identify evaluators, provide them with training, and begin the discussion about the standards of teaching quality that will be expected of the faculty. Many disciplines have a long history of conversations about what their courses should help students be able to do intellectually, physically, or emotionally; others do not, but all departments will have to engage in that conversation as this evaluation process emerges. It is a conversation that will arise every time an evaluator makes a decision about the learning objectives that a course or instructor tries to help students to meet. In many disciplines the expectations are well established and fairly exact; in others, they are more general. Some disciplines resist any attempt to spell out a list of what students should be taught, and rightly so, but all disciplines have standards of scholarship--intellectual standards--that they can apply to this conversation, in much the same way that they have always applied those scholarly standards to questions about the quality of research and published scholarship that faculty members have produced.
Developed by Kenneth R. Bain and James Lang
1. Research reflects what Ernest Boyer called the scholarship of discovery, which in Boyer's terms and in the terms we use later in this report represents only one aspect of a broad definition of scholarship. See Ernest L. Boyer, Scholarship Reconsidered: Priorities of the Professorate. The Carnegie Foundation for the Advancement of Teaching, Princeton: 1990.
2. As Pat Hutchings, director of the AAHE Teaching Initiative, has argued, evaluation done for formative purposes may eventually feed into evaluation for summative decisions. Material that faculty members gather for their own self-improvement could easily become part of the portfolios which they submit to chairs and deans for summative evaluations. An obvious caution here is that material gathered for formative purposes should not be required for summative evaluations; requiring--or even the threat of requiring--faculty members to submit information gathered for self-improvement for summative decisions would potentially stifle their willingness to pursue self-improvement. See Hutchings, "Peer Evaluation and the Review of Teaching" (14-5).
3. While we assume that it is harmful to discourage or "kill" students' interest in a subject, we must be careful to distinguish such intellectually damaging outcomes from careful and sensitive efforts to guide students into certain fields and careers and away from others because of their abilities. We must likewise distinguish between the destruction of curiosity and interest and the necessary and proper use of high standards, which may keep some students from pursuing study in a particular area because they have not demonstrated sufficient ability or preparation.
4. It should be noted that students sometimes fail to learn for reasons beyond the teacher's control. Students may have an exceptionally strong bias against the subject matter, may be experiencing personal problems, or may be distracted by some external political or social issue on or off campus that disrupts the learning process. Isolated instances of this should not harm a teacher's evaluation, but if it happens consistently with one instructor, it is likely indicative of some problem with that instructor's methods.
5. Here we encounter one of the important distinctions that might be made between questions that may be valuable for a formative process and ones that work best for summative evaluations. If an instructor is trying to identify changes in teaching behavior that could improve efforts to help students learn, it may be useful to ask about specific performances. But a failure to do well with a particular type of teaching performance does not necessarily indicate that the teacher has not helped and encouraged students to learn something worth learning and appropriate to the course and discipline. Therefore, the results of questions about means should not be used to judge the quality of teaching.
6. See James Lang and Ken Bain, "The Teaching Portfolio," The Teaching Professor (December 1997): 1.
7. It is the conception of the teaching portfolio as a "container" that has created documents with no consistency across faculty members or disciplines, leaving administrators and promotion and tenure committees understandably frustrated and reluctant to rely heavily on them to evaluate teaching. At Northwestern University's 1993 Focus on Teaching Conference, then-Provost David Cohen explained that the material he receives from faculty members on their teaching is not very well-defined or consistent across schools or departments. As a result, "while the dossiers are enormously improved with respect to the level of attention to teaching, the documentation is not working very well" (Proceedings, 7). He implicitly calls for the kind of guidelines we are setting out here.
8. Research has also found that we must be extremely careful what we ask students because they will answer truthfully. Thus, if we ask them a form of the question "did you enjoy the course," they may provide accurate responses, but their answers may say little about how much they learned.
9. We realize that such a model will not work for all faculty members because some people may teach only one course or have such limited experience outside one course that it would not make sense to divide matters in the fashion we have suggested. In other instances, a faculty member's entire teaching may consist of participation in a series of courses rather than responsibility for any one course. Departments and schools must adjust the expectations accordingly.
10. Encouraging the use of the statement of teaching philosophy will encourage faculty members to become what Donald A. Schon has called "reflective practitioners." A reflective practitioner continually examines and reexamines his or her practices, modifying or adjusting them in the light of new information and experiences. The reflective practitioner represents practical artistry at its best, and can handle skillfully what Schon describes as those "indeterminate zones of practice" (36) which we so often encounter in our classrooms: difficult and unexpected situations which can become, in the hands of a thoughtful and experienced teacher, moments of intensive teaching and learning. Peter Seldin, in the AAHE Bulletin, also recommends highly such uses of faculty self-assessment: "The trend toward wider and more structured information gathering is reflected in the growing popularity of self-assessment. Many academics--administrators and faculty alike--are convinced that self-evaluation provides useful insights into course and instructional objectives as well as classroom competency" (12).
11. See, for example, Herbert W. Marsh and M. Dunkin, "Students' Evaluations of University Teaching: A Multidimensional Perspective," in J. C. Smart, ed., Higher Education: Handbook of Theory and Research, vol. 8. New York: Agathon, 1992: 143-233; and H. W. Marsh, "The Influence of Student, Course, and Instructor Characteristics in the Evaluations of University Teaching," American Educational Research Journal 17 (Summer 1980): 219-237.
12. See, for example, Howard and Maxwell, "Correlation Between Student Satisfaction and Grades: A Case of Mistaken Causation": 810-820; and George Howard and Scott Maxwell, "Do Grades Contaminate Student Evaluations of Instruction?" Research in Higher Education 16 (1982): 175-188.
13. The Carnegie Foundation for the Advancement of Teaching offers the following additional guidelines first made by researcher John Centra: "examining several sets of evaluation results for each professor for patterns or trends; making sure that a sufficient number of students evaluate each course; considering course characteristics and comparative data when interpreting results; relying primarily on global or summary items (rather than questions about specific aspects of the scholar's teaching) for purposes of personnel decision [the core questions are such global questions]; and not overestimating the importance of small differences in scores." John A. Centra, Reflective Faculty Evaluation: Enhancing Teaching and Determining Faculty Effectiveness. Jossey-Bass, San Francisco: 1993: 89-90. Cited in Charles E. Glassick, Mary Taylor Huber, and Gene I. Maeroff, Scholarship Assessed: Evaluation of the Professoriate. An Ernest L. Boyer Project of the Carnegie Foundation for the Advancement of Teaching. Jossey-Bass, San Francisco: 1997: 47.