September 01, 2005
The problem with grades and other summary evaluations
In previous postings (see here and here), I discussed why college rankings vary so much depending on who does the survey. One of the reasons is that different criteria are used to arrive at the rankings, making it difficult to arrive at apples-to-apples comparisons. In this posting, I will discuss why I think that rankings may actually be harmful, even if the measures used to arrive at them are good.
The main problem with rankings is that they require a single summary score obtained by combining scores from a variety of individual measures, and it seems that people focus exclusively on that final score and do not pay much attention to the scores on the individual measures that went into the summary.
This is a general problem. For example, in course evaluations by students of their teachers, there are usually many questions that ask students to evaluate their teachers on important and specific issues, such as whether the teacher encourages discussions, is respectful to students, etc.
But there is usually also a question that asks students to give an overall evaluation of the teacher and when such questions exist, those people who usually read the results of the surveys (students, teachers, and department chairs) tend to focus almost exclusively on this summary score and not pay much attention to the other questions. But it is the other questions that provide useful feedback on what kinds of actions need to be taken to improve. For example, a poor score on "encouraging students to discuss" tells a teacher where to look to make improvements. But an overall evaluation of "good" or "poor" for teaching does not tell the teacher anything useful on which to base specific actions.
Teachers face the same problems with course grades. To arrive at a grade for a student, a teacher will make judgments about writing, participation, content knowledge, etc. using a variety of measures. Each of those measures gives useful feedback to the students on their strengths and weaknesses. But as soon as you combine them into a single course grade using a weighted average, then people tend to look only at the grade, even though that really does not tell you anything useful about what a student's capabilities are. But teachers are required to give grades so we cannot avoid this.
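To make the loss of information concrete, here is a minimal sketch in Python. The weights, measure names, and scores are all hypothetical, invented purely for illustration and not taken from any actual course. It shows two students with opposite strengths who end up with the same weighted average.

```python
# Hypothetical weights (in percent) for three measures; they sum to 100.
WEIGHTS = {"writing": 40, "participation": 20, "content": 40}


def course_score(scores):
    """Combine per-measure scores (0-100) into one weighted average."""
    return sum(WEIGHTS[m] * scores[m] for m in WEIGHTS) / 100


# Two made-up students with opposite strengths and weaknesses.
student_a = {"writing": 95, "participation": 80, "content": 65}
student_b = {"writing": 65, "participation": 80, "content": 95}

# Both summary scores come out to 80.0, although the profiles differ.
print(course_score(student_a), course_score(student_b))
```

The single number is identical for both students, yet one needs help with content knowledge and the other with writing. That distinction is visible only in the individual measures.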
I often hear faculty complain that they give extensive and detailed feedback on students' written work, only to see students take a quick look at the grade for the paper and then put it away in their folders. Faculty wonder if students ever read the comments. I too give students a lot of feedback on their writing and have been considering the following idea to try to deal with this issue. Instead of writing the final grade for the paper on the paper itself, I am toying with the idea of omitting that last step and asking the students to estimate the grade that I gave the paper based on their reading of my comments. I am hoping that this will make them examine their own writing more carefully in the light of the feedback they get from others. Then when they have shared with me what grade they think they got and why, I'll tell them their grade. I am even willing to change it if they make a good case for a change.
I am a little worried that this process seems somewhat artificial, but perhaps that is because it is not common practice yet and anything new always feels a little strange. I am going to try it this semester.
Back to college rankings: those can be harmful for another reason, which is that the goals of a school might not mesh with the way that scores are weighted. For example, the US News & World Report rankings take into account incoming students' scores on things like the SAT and ACT. But a school that feels that such scores do not measure anything meaningful in terms of student qualities (and a good case can be made for this view) might wish to look at other things it values, like creativity, ingenuity, citizenship, writing, problem solving, etc. Such a school is doomed to sink in the USN&WR rankings, even though it might be able to provide a great college experience for its students.
I am a great believer that getting useful feedback, in whatever area of activity, is an excellent springboard for improving one's performance and capabilities. In order to do so, one needs criteria, and targeted and valid measures of achievement. But all that useful information can be completely undermined when one takes that last step and combines these various measures in order to get a single score for ranking or overall summary purposes.
I am a theoretical physicist and currently Director of 

Comments
"I am a great believer that getting useful feedback, in whatever area of activity, is an excellent springboard for improving one's performance and capabilities."
What a great reason to blog! Over the years I have said many stupid things and angered many people, but I have learned a lot from it. Real feedback from real people, real thoughts accessible to anyone in the world. Is there any better way to be criticized and learn?
I wish everyone blogged.
"I often hear faculty complain that they give extensive and detailed feedback on students' written work, only to see students take a quick look at the grade for the paper and then put it away in their folders. Faculty wonder if students ever read the comments."
As a student who has scanned for her grade and then put the paper away many times, allow me to make some slight defense for students. I am a sensitive writer. I tend to take negative comments rather harshly--this is true of all criticism for me, really, even when it's constructive. If a paper has a good grade, I'm likely to take a few moments in class to scan the comments, but I'll read them in depth later. In the case of a bad grade, though, or one that's simply lower than I expected, I often have to put the paper aside temporarily. I need time to get over the shock, if you will, and I prefer reading my professor's comments at a time when a) I am not as emotionally vulnerable and b) no one can see that vulnerability.
I can't speak for all students, obviously, but I imagine there are many who feel similarly.
I'll be interested to hear how things go with this new system you're trying. It seems to me that another way of achieving similar results would be to require turning essays in twice, with the second version accompanied by a half page or so responding to the professor's comments on the first version.
Nicole and Aaron,
Both of you have touched on a really important point that I need to mull over. I think ALL of us are sensitive to the response to our work. I think this is especially so with writing because, whether we like it or not, writing reveals something about ourselves. So a "grade" is in some ways always taken as a judgment about us personally.
But clearly feedback is important in order for us to improve. So how can we separate feedback from judgment? This is a crucial question to which I do not know the answer.
Like Nicole and Aaron, I paid great attention to the comments. No matter what the grade received, I had to know "WHY?" But I too would wait until I was out of the classroom to review them. I wanted neither my classmates nor my professor to see my first reaction.
I also had a number of professors who expected rewrites as a matter of course. Given that the philosophy major required a lot of writing I think they felt it was most useful to comment on the first draft then give us the opportunity to review, rethink and rewrite. This was particularly helpful when the topic was both new and somewhat esoteric. We were also encouraged to meet with them during office hours if we needed clarification. Overall I think that worked out very well.
Dr. Singham:
A summary judgement I won't bother to support (at least in this comment), in other words an axiom: Some sort of summary judgement is necessarily employed in everyday life, inside or outside of school.
Given this axiom, it's better that one use relatively objective methods such as grades or tests since subjective methods will allow bias to creep into the evaluation. Teachers and other evaluators are not uniformly capable of filtering out their own biases: Given such variance in the quality of teachers, I think the wide adoption of your methods will result in a net increase in bias.
Two examples to buttress my point:
1. Both IQ tests and the SAT were crucial in dismantling the old WASP stranglehold on higher education in America, early in the 20th century. Poor, bright kids were able to show that they could compete with the kids of the elite. Such tests allow one to filter out the 'noise' introduced by all sorts of 'random' variables like class, rigor of curriculum, etc.
2. The MCAT is a good predictor of the pre-clinical year grades in med school and also performance on the USMLE-I. The poorer correlation with med student performance in the clinical years is due, I think, to the 'noise' introduced by the predominance of subjective evaluations in the clinical years.
Kumar
Kumar,
The point I was making was that it is the act of summarizing all the diverse data and evaluations into a single number/grade for ranking purposes that is problematic, since the input data may have more useful information than the output number/grade.
The subjective/objective assessments issue is a different one. Here too, it is assumed that multiple choice tests, for example, are objective and don't allow for instructor bias, while grading of essays is considered subjective.
But again, a major subjective element in multiple choice arises in the examiner deciding what to test on, how to word the tests, and what kinds of options are given for the answers.
Similarly, constructing detailed scoring rubrics for essays/problems can make grading essays more "objective."
The key question that needs to be addressed is what do we want to measure and do our assessments measure it?