Correlating Goodreads vs. Amazon vs. Bookscan Numbers, Part 1
Time for another of my “boring but important” mathematical posts. One thing I’d like to know, and I think many other SFF observers would as well, is how much SFF books actually sell. For many entertainment industries, this kind of information is readily available. There are great sites like Box Office Mojo for films or TV by the Numbers for TV. Both are free, well-designed, and easy to use.
But the book industry? They either don’t offer the information or, if they do, lock it behind paywalls. BookScan purports to track a fair portion of the field, but they have exorbitant rates (here’s an enrollment form asking $2,000 for a membership through Publishers Marketplace) and also have strict terms of service that would prevent anyone from broadly sharing that data in public. Other sites like BookStats (an annual survey of publishers) offer equally steep rates (that begin at an eye-popping $2,995 and only head up!).
What does all of this mean? That we don’t have access to free, reliable sales data for the book industry. I think this is a huge mistake on the part of the book industry; freely sharing data hasn’t hurt the movie industry, and it lets viewers hotly debate their favorite movies and the tricky relationship between sales and quality. If people are talking about your industry, they’re involved in it, and likely growing it. The more readers are locked out of conversations about books, the more likely they are to drift over to other industries that allow for fuller participation. People like numbers, and charts, and debating; they like to see how their favorite movie or TV show or book is selling.
Sadly, transparency has never been a feature of the book industry. So that means a site like Chaos Horizon is left to patch together popularity estimates through frustratingly inadequate and inaccurate techniques. Since we don’t have point-of-sale tracking, we’re left moving to a space like Amazon or Goodreads, which samples the reading public through the number of reader ratings and reviews. Since both those websites are fairly big, you can argue that both websites sample a large enough portion of the population to be statistically meaningful. If a book sells 10,000 copies, and 1,000 people rate it on Goodreads, that’s a pretty solid 10%. Amazon tends to sample at a lower rate, so they might only grab 1% of the total readership. Still, that’s better than nothing . . .
Or it would be, if there weren’t bias built into the Goodreads and Amazon user bases. I’m using bias in a purely statistical sense here, to indicate a demographic issue that skews from the norm. So, let’s say 50% of readers are men and 50% are women. To make a good sample, you’d have to be sure to have 50% men and 50% women in your sample (or you could correct your sample after it was done, if you like fancy math). Simple enough? The same goes for age, income level, etc., all the basic demographic categories.
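Setting bias aside for a moment, the sampling arithmetic from the 10,000-copy example is easy to sketch. This is only an illustration: the function names are mine, and the figure of 100 Amazon ratings is simply what the hypothetical 1% rate would produce for that same book, not a real count.

```python
def sampling_rate(ratings, copies_sold):
    """Fraction of buyers who left a rating."""
    return ratings / copies_sold


def estimated_sales(ratings, assumed_rate):
    """Back out a sales estimate from a rating count and an ASSUMED rate."""
    return ratings / assumed_rate


# The example from the post: 10,000 copies sold, 1,000 Goodreads ratings.
goodreads_rate = sampling_rate(1_000, 10_000)

# Hypothetical: the same book with 100 Amazon ratings at an assumed 1% rate.
amazon_estimate = estimated_sales(100, 0.01)

print(f"Goodreads sampling rate: {goodreads_rate:.0%}")
print(f"Sales implied by Amazon ratings: {amazon_estimate:,.0f}")
```

The catch is that the estimate is only as good as the assumed rate, and the rest of this post is about why that rate is not constant from book to book.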
If you take a look at the demographic information for Goodreads, you’ll see that it skews pretty substantially. If you head over to Quantcast, a web demographic site (why is this free but I can’t look at book sales numbers?), they report Goodreads as having a 71% women / 29% men visit ratio. That’ll definitely skew the data. All this info is at the bottom of the Quantcast page; if you click through it, you’ll see that Goodreads skews towards women, towards people aged 18-34, and towards people with either undergraduate or graduate education. All of that makes a certain amount of sense: younger people are more likely to use social media, and avid book readers are probably more likely to be college educated. This means, though, that every bit of Goodreads data is going to be biased towards certain audience tastes.
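For the curious, the “fancy math” for correcting a sample after the fact is usually some form of post-stratification weighting: each group is weighted by its population share divided by its sample share. Here is a minimal sketch using the illustrative 50/50 reader split and the 71/29 Quantcast figure. The per-book gender breakdowns are invented, since Goodreads doesn’t publish rater gender per book, and the function names are my own.

```python
def poststratification_weights(population_share, sample_share):
    """Weight per group = population share / sample share."""
    return {g: population_share[g] / sample_share[g] for g in population_share}


def reweighted_count(ratings_by_group, weights):
    """Rating count adjusted as if the site matched the population."""
    return sum(ratings_by_group[g] * weights[g] for g in ratings_by_group)


population = {"women": 0.50, "men": 0.50}  # illustrative split, not measured
goodreads = {"women": 0.71, "men": 0.29}   # Quantcast visit ratio quoted above

weights = poststratification_weights(population, goodreads)

# HYPOTHETICAL breakdowns for two books with 1,000 ratings each.
skews_female = {"women": 900, "men": 100}
skews_male = {"women": 300, "men": 700}

adjusted_female = reweighted_count(skews_female, weights)
adjusted_male = reweighted_count(skews_male, weights)

print(f"weights: {weights}")
print(f"female-skewing book: 1000 raw -> {adjusted_female:.0f} adjusted")
print(f"male-skewing book:   1000 raw -> {adjusted_male:.0f} adjusted")
```

Under these made-up numbers, two books with identical raw counts end up far apart after correction (roughly 806 versus 1,418), which is the whole problem with reading raw Goodreads counts as popularity.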
Amazon’s demographic bias is harder to find. They’ve opted out of Quantcast, but I’ve read several studies that suggest Amazon is biased towards older users, high-income users, and the highly educated. CBS News echoes all of that, and also reports Amazon as gender neutral. That’s from 2010 (when Quantcast data was still available for Amazon); I don’t know if it has changed or not. Still, that means the demographics of Amazon and Goodreads are hopelessly off from each other: young versus old, average income versus high income, 70/30 gender versus 50/50. At least they converge on education level!
How much does bias like that matter in practice? An enormous amount. Let’s look at a comparison of # of Goodreads ratings to # of Amazon ratings for the 2015 Hugo contenders:
Take a look at the far right column: that’s the ratio of Goodreads/Amazon ratings. That range is what we’d call in statistical terms “a hot mess.” You range from Beukes having a 60x multiplier down to Gibson having a paltry 8x. Even if you toss out Beukes as an outlier, it would appear that some writers get rated at a rate 4 times higher on Goodreads than on Amazon, relative to other writers. What’s happening?
This is demographic bias at work. Goodreads favors certain books and dislikes other books (in sampling terms). Since the Goodreads readership is younger + female, a book by Gibson (older + male) shows up much lower on the list. Books that appeal to the Goodreads demographic (presumably female-friendly books that market/cater to a slightly younger audience) do very well on Goodreads. Books catering to an older or more male audience tend to do worse, by a factor of about 3 to 4 (at the most extreme; most books are a little less than that).
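The ratio in that far right column, and the spread between books, can be computed like this. To be clear, the titles and counts below are placeholders I picked so that the ratios land on the 60x and 8x extremes and the roughly 4x spread discussed above; they are not the real figures for any book.

```python
# PLACEHOLDER counts chosen to reproduce the 60x / 8x extremes; not real data.
ratings = {
    "Book A (outlier)": {"goodreads": 12_000, "amazon": 200},
    "Book B": {"goodreads": 6_400, "amazon": 200},
    "Book C": {"goodreads": 4_000, "amazon": 500},
}


def goodreads_to_amazon_ratio(counts):
    """How many Goodreads ratings a book has per Amazon rating."""
    return counts["goodreads"] / counts["amazon"]


ratios = {title: goodreads_to_amazon_ratio(c) for title, c in ratings.items()}

# Express each ratio relative to the least Goodreads-friendly book.
lowest = min(ratios.values())
relative = {title: r / lowest for title, r in ratios.items()}

for title in ratings:
    print(f"{title}: {ratios[title]:.0f}x Goodreads/Amazon, "
          f"{relative[title]:.1f}x the lowest book")
```

The relative column is the useful one: it says how much harder Goodreads samples one book than another, independent of the overall size difference between the two sites.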
So, what do we conclude? That Goodreads is biased. That Amazon is biased. If you wanted to correlate the Goodreads numbers to the Amazon numbers, you’d have to multiply the counts for the least Goodreads-friendly books by around 3 or 4. But why correlate one set of biased numbers to another set of biased numbers? That’s statistically pointless: if you want to know the younger, more female audience, use Goodreads. If you want the older, richer, more gender-neutral audience, use Amazon.
So, we’re swirling around the question: can we correlate Amazon or Goodreads numbers to actual sales? Here’s what we’ve learned in this post, based on Quantcast data (either openly accessible or as reported by CBS):
1. In demographic terms, Goodreads is biased towards women, younger readers (18-34), and the highly educated.
2. In demographic terms, Amazon is biased towards older readers, higher income readers, and the highly educated.
3. This results in substantial differences (up to around 4 times) in the rate at which their users rate and review books.
That’s all well and good. We now have a way to compare Amazon to Goodreads numbers through demographic correction (if we wanted to). But how do those two sets of biased numbers actually sync up with sales? I have some Bookscan data I’ll be sharing with you in the next post!