Click Nothing

design from a long time ago

GDC09 – Part 3 – Wines and Rants

April 2, 2009

The last set of slides I have to put up are those for my five minute talk about game rating systems that I gave as part of Richard Lemarchand's Microtalks session on Thursday morning.

Overall I really enjoyed the session – though the prep was a nightmare. I guess it's pretty easy to enjoy a session where you only have to do 10% of the work and then you get to listen to smart people explain to you what exactly it is that you do for a living.

Anyway, the five minute format is colossally difficult, especially if you are trying to make a point and support it with arguments instead of being all hand-wavy and theoretical. I wish I could have brought to bear all the arguments I have for why we should switch from 10, 20, 100 or even 1000 point rating systems to a simple five point rating system, but I just didn't have time.

I guess that's what blogs are for.

Here are the arguments I made in the talk:

100 point rating systems are so analog that the human brain cannot make sense of the granularity (what is the difference between an 86 and an 87 in real perceptible, measurable terms?). These systems nobly attempt to make a threshold-less system – but when the brain is overloaded we add thresholds back in… we say '61-70 is above average' and '71-80 is good' or whatever. This is effectively adding back in the thresholds that a 100 point system endeavors to remove… but it is adding them back in in a less rigorous way leading to a mysticism created around certain improbable thresholds at the far reaches of the curve. I called this mysticism the Cult of 90+ and talked about it specifically in the context of wine rating.

I talked also about the relationship between the review scores, the reviewer, the publishers of the review and the developers and publishers making the games. There are pressures in these relationships that push reviewers to give higher scores. This is very easy to do in a 100 point system (because of the bracketed question in the previous paragraph), meaning a reviewer under pressure can relieve that pressure incrementally by inflating a score ever so slightly to make everyone involved a little bit happier. This systemic, distributed pressure release into the system of review scores is inflationary and increases game review scores over time.

In terms of inflation, I present the evidence of it by looking at the aggregated game ratings taken from gamerankings.com and I discuss why it is bad: principally because it prevents cross-generational comparison of games (in fact, this need for cross generational comparison means we should impose and maintain a bell curve in ratings even if games are getting empirically better over time).

As an aside: Adam Sessler ranted against a different set of problems inherent in the way aggregators work at this year's Rant session. Adam is also right, in my opinion, and a global switch to a five star system would also help alleviate his complaint to some extent.

In any case, if you want the discussion on the above points, you can download the slides and the text of the talk here.

Also of note is that I used Kent Hudson's 'How To Pick a Lock' talk from GDC 2008 as a point in the discussion and I have had a couple people ask me about it. For those interested in that discussion, Kent's slides and video support can be found here. Thanks Kent.

Now – over and above the points raised in the talk there are a couple other arguments for why we ought to undertake a global switch to a five star rating system for games that I did not have time to cover in the talk itself.

Firstly, 100 point rating systems leverage the 'fallacy of precision' to lend authority to the accuracy of the rating (difference between accuracy and precision is here). Essentially the argument here is that if I claim a game is an 82, then most readers are likely to think 'well, it must be somewhere between an 80 and an 85'. In fact, this is a fundamentally invalid assumption. If I were to use a 10 point system and give the game an 8, readers would assume it to be 'probably between a 7 and a 9'. If I were to use a five star system and give a 4, they would assume it 'probably between a 3 and a 5'.

The problem is that the precision afforded by an 82 does not make the accuracy of the rating higher. The game is as likely to be a 60 or a 100, just as a 4 star rating is probably somewhere between a 3 and a 5. Accuracy and precision are orthogonal concepts, but the fallacy of precision leverages the fact that our brains tend to correlate them. In the highly subjective field of rating games, it is much more important that we be accurate than that we be precise, and someone who plays a lot of games can be accurate in a five star system.

To go one step further with this argument, I would point out that in a world full of aggregators, we still have 100 and 1000 point systems and having these systems aggregating five star ratings into percentages is useful and good (provided everyone uses a five star system and their aggregation policies are standardized, fair and well understood – cf Sessler). Having them aggregating ratings that are already claiming 100 points of precision makes the aggregators susceptible to the effects of the aforementioned inflation (in fact it is in the aggregators that we are able to measure the inflation).
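As a rough illustration of what aggregating five star ratings into a percentage could look like, here is a minimal Python sketch. It assumes a plain linear mapping (each star is worth 20 points) and an unweighted mean. Both are assumptions made for illustration, not a description of how any real aggregator works, and the review scores are made up.

```python
def stars_to_percent(stars, max_stars=5):
    """Map a star rating onto a 0-100 scale, assuming a linear mapping."""
    if not 0 <= stars <= max_stars:
        raise ValueError("star rating out of range")
    return 100.0 * stars / max_stars


def aggregate(star_ratings, max_stars=5):
    """Unweighted mean of the ratings, expressed as a percentage."""
    if not star_ratings:
        raise ValueError("need at least one rating")
    percents = [stars_to_percent(s, max_stars) for s in star_ratings]
    return sum(percents) / len(percents)


# Hypothetical five star scores from five outlets (invented for this example).
reviews = [4, 3, 5, 4, 4]
print(aggregate(reviews))  # 80.0
```

The finer-grained number only appears at the aggregation step, where it summarizes many coarse judgments rather than claiming 100 points of precision for any single review.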

For the record – this point was misquoted in IGN's article on the talk wherein they wrote:

According to Hocking … 35 of the 100 current top-rated games (as tracked by an aggregate site) were released after 2001. That presents a skewed view of the industry's history and points toward a recent inflation of sorts in game reviews.

Note that an astute commenter then went on to reply with this:

To say 35 of the top 100 rated games of all time came out since 2001 means that scores have been inflated since them is not consistent. Since 2001, there have been 8 years of games released. We can assume that almost all games with ratings have come out since around 1985, or 24 years ago. The past 8 years represents 1/3 of this time, and 35/100 is about 1/3 of the total game volume. It is consistent that ratings have NOT been inflated as 1/3 of the top rated games have come out in 1/3 of the time period of game releases.

The commenter is exactly right if 35 of the Top 100 rated games of all time were released since 2001. But the real quote was that 35 of the Top 41 highest rated games of all time were released since 2001. That's not the 35% we would predict, that is 85%. Get it right, IGN: you kind of invalidated the entire talk by erroneously reporting the most critical and central piece of data. Maybe it was the precision that threw you off?
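The arithmetic here can be checked with a short Python sketch. It takes the commenter's baseline at face value: roughly 24 years of rated games, the last 8 of them since 2001, so about one third of any top list would be recent if nothing had changed. The binomial tail at the end adds an idealized assumption that is not in the original argument: each slot in the list is treated as independently "recent" with probability 1/3.

```python
from math import comb


def share(hits, total):
    """Fraction of a top list released in the recent window."""
    return hits / total


def binomial_tail(n, k, p):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


# Commenter's baseline: 8 recent years out of roughly 24 years of rated games.
expected = 8 / 24

misquoted = share(35, 100)  # the figure as reported by IGN
actual = share(35, 41)      # the figure from the talk

# Chance of 35 or more recent games in a top 41 under the baseline
# (assumes independent slots, which is a simplification).
tail = binomial_tail(41, 35, expected)

print(f"expected {expected:.0%}, misquoted {misquoted:.0%}, actual {actual:.0%}")
print(f"tail probability under the baseline: {tail:.1e}")
```

Under those assumptions the misquoted figure sits right on the baseline, while the real one is far outside it. The sketch only shows that the top of the list is not consistent with the one-third baseline; it cannot say on its own whether the cause is score inflation or something else.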

I won't go into the other horribly broken arguments in the comments thread… I should know better… but here's my personal favorite:

Clint is a joke. Far Cry was overrated trash. The 100 point system is fine. It allows a more accurate score. Period.

That's like a straw-man multiplied by the denial of the fallacy of precision. Awesome stuff.

A follow-on from the fallacy of precision point is the argument that the kinds of precision implied in 100 point systems feel elitist to 'ordinary' people. At a time when the industry is experiencing explosive growth – when new players of all demographics are starting to take up gaming – it serves our interest to give them reviews and rating scores they can understand. An 89.3 implies a level of refinement in perception among gamers and reviewers that not only does not exist, but is also intimidating for the mass market. It is the equivalent of terms like 'mature oak finish' and 'fruity nose' in wine discussion. These concepts are alienating for the market and make people feel excluded when – in the end – the entire purpose of game reviews and particularly of giving numerical scores at all is to help inform people about the product and invite them to play.

The final point I wanted to make was that – as members of a press with increasingly important social responsibilities – the reviewers, writers and editors giving the review scores should be the ones championing the push to a five star system. For all the reasons mentioned above, finer granularity systems are broken and compromise the integrity of their work (and the utility of their work to developers). Chris Hecker also gave a mini rant this year and he called for the gaming press to do their job well. I believe the first step in doing that lies in radically reducing the importance and weight given to numerical scores and emphasizing the importance placed on quality reviews, well written.

I am pretty sure no matter what I say, this call is going to go unheeded… but I am also pretty sure that those who don't make the transition are going to find themselves very rapidly serving a smaller and smaller niche of an explosively growing market. I look very much forward to continuing to work with those who survive the transition.

Posted in Art, Current Affairs, Design Theory, Game Criticism, Game Developers, Game Industry, Game Reviews, Games, GDC, GDC09, Lectures, Meaning, Travel

12 responses to “GDC09 – Part 3 – Wines and Rants”

Mike Thomsen

April 2, 2009

IGN has a better summation of the rant article than the Joystiq grab bag:
http://ps3.ign.com/articles/967/967319p1.html
Have you emailed anyone at IGN to correct the article? If not, you should…

Nat Loh

April 2, 2009

Was nodding my head throughout this talk. The thing I notice about the 10 or 100 point system is that scores from 0-50 are typically underused. This probably has as much to do with rating inflation as it does with the human psychological condition. My brain is a lot more open to a 3/5 star restaurant than it is to a 60% rated restaurant. When I look through my Zagat restaurant guide, if the food is rated under 20, my brain automatically disqualifies them even if the food is actually ‘fine’ but not ‘great.’ I would even try a 2 star place but 40% is psychologically inhibitive. Even in the 5 star system – a quick sampling on Yelp.com shows that there is a higher skew of 5 and 4 star ratings than the 1 and 2 star ratings (although I hear yelp extorts restaurants suppressing bad reviews for a fee). And this doesn’t even address the issue that review scores often fail to reflect different target audiences. I.e. gamespot reviews games from the hardcore p.o.v. and to hardcore gamers, maybe random title X deserves a 4/10 but it really is a 8/10 for the intended audience.

Ilia

April 3, 2009

I was surprised that you made no note of the “target audience” of the review.
Board game reviews on BGG are a perfect example:
http://boardgamegeek.com/thread/308846
Basically, how much I will like the game has a lot to do with what kind of games I like to play, and basically with my own reviews of existing games.
The BGG example uses generic stereotypes of games (which may not be accessible to a non-gaming outsider), but at least get the point across really well.
The next step I’m always looking for is the Netflix system – I should get review scores based on my own recorded preferences. There are some FPS that are super-accessible for non-shooter fans and they should really explore them, and some may be perfect but only for die-hard shooter fanatics. I think this dimension is sorely missing from the current review system.

James O

April 3, 2009

Nat Loh: “40% is psychologically inhibitive.”
This brings up an interesting problem with the 100 point rating systems – schools generally also use 100 point rating systems, and grades below 70 are considered failures. Thus there’s an in-built aversion to anything below that, because we (or at least I) pre-judge a sub-70 rating as being a failed game. 70 reads to me as “mediocre” thanks to school, but logically I would think mediocre would be more like a 50 (i.e. right in the middle of the scale). So maybe to some extent the scale is already positively shifted, because we only have 30 points of “usable” scale out of the 100 to judge something positively.

MJ Irwin

April 3, 2009

That would require the games press to agree on universal practices. A welcome idea, but it is something that is unlikely to transpire. Too much enthusiast site identity is wrapped in review scores.
Nonetheless, the argument for a simplified review scale has been bandied about for years within the game critics community. As publicized during Shawn Elliott’s epic review symposium (http://shawnelliott.blogspot.com/2008/12/symposium-part-one-review-scores.html) most dislike assigning review scores. Many critics would actually prefer that review scores were banished altogether in favor of more thoughtful assessments of games. Of course such experiments (Computer Gaming World) have ended in disaster. The problem is that the most vocal game players–core consumers–demand the granularity. A site’s scoring system is so embedded in its readership that it would be nigh impossible to wean audiences from a 100- or 10-point scale without a severe backlash. Few are bold enough to risk it.

Mike Thomsen

April 3, 2009

@ MJ & Clint: All scores are abstractions. There can’t be anything wrong with anyone’s choice for how to abstract value. The larger issue is that people are still afraid to make honest statements with review scores, regardless of the scale. And that reviewers are afraid to write from an explicitly subjective point of view. I could make a pretty hard argument that Far Cry 2 is a 10 and a 1/10; I could write for both sides if I could hide behind the veil of objectivity. Objectivity is fundamentally dishonest and that’s the larger issue with reviews, as far as I see it. The abstractions are only vague and useless because the thinking put into their application is dishonest. If reviewers could speak only for themselves and score according to their own tastes there’d be nothing wrong with a 100 point scale. If I could have reviewed FF12 and given it a 4.7 instead of having to offer due diligence about production value, graphics, “fans of the genre,” and consumer value, the conversation might actually move forward. That would require a much deeper editorial overhaul than simply rearranging an arbitrary valuation abstraction.

Kim

April 3, 2009

Clint, a couple points:
1) I thought the talk was great. While I agree the 5 minute format is hard, it also forces one to boil an argument down to its essence, and I think you hit it on the head.
2) Only negative feedback I have is that I’d like to have heard a call to action. Some ideas on how to get the industry to go ‘five star’. A revolt? Torch-and-pitchfork mobs sent to whatever sites still stick to 100% reviews?
3) It’s false logic, but it could be argued that 85% in the last 1/3 of the period just indicates that we are getting better. I don’t believe this at all, and think your conclusion is right, but if we’re playing follow the logic… (e.g. If we saw data that 18 of the top 20 years of airline safety were in the last 20 years – I’m making that up – we’d probably think we were getting better, not that standards were being lowered).
– Doesn’t ‘5 star’ suffer from the same issue, but just coarsen it a bit? e.g. I just read a car magazine roundup where they use a 5 star system, but there’s still a perception that 5 star is hell of a lot better than 4 star…

Charles S.

April 3, 2009

I agree with Kim, I would like a proposed “plan of action” here. I know you don’t have all the answers or anything, but I think discussing it might be a good idea.
We have the ESRB to make sure we have a united ratings system (for their territory, anyway), but who would be in charge of making these rules and semi-enforcing them? As your quotes from the IGN comments point out, the consumer base doesn’t really care, so I don’t think the pressure will come from there.
Actually, I can’t think of a single unified review scale in existence. Maybe it’s all just a pipe dream. The way journalists review games is kind of like a personal “stamp”. Like Ebert and Roeper have thumbs up/down, Famitsu has the dreaded Quadra-10 point review.
It would be nice if we could invent an organization (perhaps supported by publishers? Only “good standing” members gain access to early game info?) that could instill some review rules… like the 5-star system, but with the potential for other rules as well.

Jake

April 3, 2009

@Mike All scores are abstractions, fair enough, but they also look different and mean different things to our stupid human brains, due to context and training and history.
“60%” and “3 out of 5 stars” read so amazingly differently to me at least, due to the aforementioned seeing 60% or 6.0 as a D on a test or a report card, versus seeing “***” meaning “a decent enough thing – nothing special but not bad!” in restaurant/hotel/film reviews.
That “***” and “60%” are aggregated together into the same pot by aggregate sites doesn’t help anything either. Metacritic argues that “3 out of 5” equates directly to “60%” in game reviews, but I’d wager that sites which use star ratings dip into the 2-3 star range more commonly than 10 or 100 point sites dip into 40-60% or 4.0-6.0 out of 10.0.
I don’t know how the numbers work out on any of this because I am lamely writing off the top of my head instead of actually investigating anything, but it’s something that’s often bothered me based on my own known internal bias when looking at “***” and saying to myself “hm not bad,” and then looking at “60%” and saying “eugh, yikes.”

Nat

April 5, 2009

I’ve been thinking this over and another idea I’d like to propose was to actually use a 6 star system. Why? Because I can EASILY do the math to convert a 5 star rating into a 100 point rating. Using 6 stars… what is 1/6 of 100? I don’t know. I could figure it out but generally it’s not something I keep stored in my fractions memory bank. I know the square root of 2. I know pi up to several decimal places but 1/6 is just not a fraction I have cached by default. This mental barrier is enough to stop me from even thinking about “what would this score be in a 100 pt scale?” whereas in a 5 point system, the math is so simple I can calculate it subconsciously. I also found through using Yelp that one more star is useful. I frequently find myself thinking about the 3.5 star rating because some restaurants just fall into that “good, I like this, but it’s not awesome.”
my star interpretation goes something like:
one star – bad, doubt anyone will like it.
two stars – fair, maybe someone else might like it.
three stars – good! a majority of people will like it.
four stars – great! a lot of people will like it.
five stars – awesome! you’ll be hard pressed to find someone who doesn’t like this.
six stars – mind blowing. generational greatness personified.
and there is also an implied zero stars for “WTF, absolutely no one, even the developers themselves could ever like this game”

Nat

April 5, 2009

one other thing. Going through the reviewing process and using 5 star system on yelp.com, I constantly think that “I can’t use the 5 star rating” because that is the top rating and I can’t go any higher than that. Having the 6th star, I feel like I have more flexibility in my praise. It’s sort of like how another poster mentioned we have 3 levels of “good” in the current ratings setup. We effectively maintain the 7,8,9,10 paradigm with 3,4,5, and 6 stars. 2 stars covers not very good games. 1 star covers bad games. 0 stars (if necessary, should be used as sparingly as the 6th star) covers epic mistakes and blights upon the gaming world.

Alex Epstein

April 7, 2010

I agree with Kim. Games are getting better. If you look at the first 30 years of movies, how many of those are even watchable today? Critics like to say nice things about NOSFERATU and THE CABINET OF DR. CALIGARI, but they’re really more interesting than good.
There’s a world of difference between the original PRINCE OF PERSIA and HEAVY RAIN, say. They’re not even vaguely comparable. The game engine technology has advanced unbelievably, and so have the AI’s, but storytelling and understanding of what is a game have also advanced, and continue to advance. We’re still under Moore’s Law, if you will.
So yeah, there is always some bias in favor of recent games – just as the Oscars tend to feature movies that came out in December, not March – but a lot of it is valid.
