Correlating Goodreads vs. Amazon vs. Bookscan Numbers, Part 2
Over the past few posts, I’ve been looking at the correlation between Amazon data, Goodreads data, and the mythical “actual books sold” data that we don’t have. It would be nice if either Amazon or Goodreads correlated with that data, because then we’d be able to get a good estimate of actual sales.
Unfortunately, it appears that both the Goodreads and the Amazon data are demographically unreliable. That makes a certain amount of sense: websites cater to very specific audiences, and specific audiences don’t reflect the general reading public. Goodreads seems to lean very young (under 40) and, according to Quantcast, has around a 70/30 female/male demographic. Amazon seems more neutral in terms of gender, but leans older (over 40) and wealthier (i.e. people who have enough money to buy lots of books online).
I’ve got one more set of data to present to you: for the past 5 months, I’ve been collecting data points that compare Goodreads to BookScan numbers. BookScan is a point-of-sale recording service that tracks how many books are sold; instead of estimating from a sample, it actually tries to count how many books are sold by different venues. They claim to cover some 80-90% of the market, although that’s probably inflated. Here’s a good article from Forbes that can serve as an introduction. I’ve heard plenty of authors say that BookScan grabs less than 25% of their sales, and I think the more you sell through untraditional means (at cons, through small bookstores, etc.), the worse the BookScan numbers are.
Most BookScan data is locked behind a huge paywall—but Publisher’s Weekly prints weekly Hardcover bestseller lists on their website. They try to make it difficult (they only print # books sold this year), and they don’t include e-books. Still, if we were to take that data and compare it to Goodreads data . . . we’d have something.
This is exactly what I’ve done. Since early November, I’ve been tracking (weekly) any SFF book (broadly defined, and also including horror) that has shown up on the Top Hardcover Fiction list. I’ve then been comparing that to the number of Goodreads ratings for that week to see if there’s a sensible correlation between the two numbers.
While this isn’t perfect—some authors may sell a higher proportion of e-books than others—we’ll at least have a rough look at “actual physical” sales versus Goodreads ratings.
Only six SFF novels showed up on the Hardcover Bestseller list in the last 4 months. I put down the publication date because more people buy a book right when it comes out than read it right when it comes out; a long book like Revival might take people a month to read, so it may take a while for Goodreads ratings to catch up with sales. “Last Data” is the data for the week when the book fell off the chart; Gibson fell off quickly (two weeks), while Rothfuss stuck around longer. King is still going strong. The Hardcover column is the total number of hardcover sales as given by Publisher’s Weekly for that week; the Goodreads column is the number of Goodreads ratings from that same week.
Lastly we have the interesting column: the ratio of Hardcover sales to Goodreads ratings. In an ideal world, that would have been close to the same number for everyone.
It is not. Goodreads is tracking Rothfuss and Mandel in a totally different way than it is King, Rice, or Koontz. Perhaps this is because Mandel and Rothfuss are selling more to Goodreads’ specific audience (younger, female). Perhaps this is because the books are shorter. Perhaps this is because King and Rice sell more in places like Wal-Mart or Target, whose readers aren’t using Goodreads. Perhaps King and Rice are selling primarily to older readers, who are less inclined to use websites to record their reading habits.
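For anyone who wants to run the same comparison on their own numbers, the ratio column is just hardcover sales divided by Goodreads ratings. Here is a minimal Python sketch. The function name and the two "books" are placeholders I made up for illustration; they are not the figures from the table.

```python
def sales_to_ratings_ratio(hardcover_sales, goodreads_ratings):
    """Hardcover copies sold per Goodreads rating."""
    if goodreads_ratings <= 0:
        raise ValueError("need at least one Goodreads rating")
    return hardcover_sales / goodreads_ratings


# Placeholder figures for illustration only, NOT the numbers from the table.
books = {
    "Book A": (150_000, 10_000),  # (hardcover sales, Goodreads ratings)
    "Book B": (60_000, 30_000),
}

ratios = {
    title: sales_to_ratings_ratio(sales, ratings)
    for title, (sales, ratings) in books.items()
}

# How far apart the highest and lowest ratios are. If Goodreads tracked
# every book the same way, this would sit close to 1.
spread = max(ratios.values()) / min(ratios.values())

for title, ratio in ratios.items():
    print(f"{title}: {ratio:.1f} hardcovers sold per rating")
print(f"spread between highest and lowest ratio: {spread:.1f}x")
```

The spread is a crude summary, but with only a handful of books it is about as much as the data can support.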
In a complex statistical case like this, it probably comes down to a multitude of factors. With just 6 books, we don’t have enough data to hash that out. What we can say is that someone like Mandel is overperforming (compared to the average) on Goodreads in an enormous way. Even if we assume that a young author like Mandel has a 50/50 hardcover/e-book split in her sales (meaning she’s sold around 120,000 copies of Station Eleven, which seems reasonable), almost 25% of her total readers rated the book on Goodreads. That is astonishingly high. In contrast, King, who has probably tripled or quadrupled Mandel’s sales, has only about 5% of his readers on Goodreads.
That’s an enormous gap, and it reinforces what we learned in the last post: Goodreads is not a reliable indicator of total readers. It’s tracking Mandel and King in totally different ways, and to compare Mandel to King via Goodreads makes Mandel seem more popular than she is and King less popular.
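The e-book adjustment in that estimate can be written out as a short Python sketch. The function name is hypothetical, and the inputs are round numbers back-solved from the paragraph above (120,000 total copies at a 50/50 split, roughly 25% of readers rating), not the actual table values. The hardcover share is itself a guess, so the sketch also shows how much the answer moves when that guess changes.

```python
def rating_share(hardcover_sales, goodreads_ratings, hardcover_fraction):
    """Estimated fraction of all buyers who left a Goodreads rating.

    hardcover_fraction is the ASSUMED share of total copies sold in
    hardcover (0.5 means a 50/50 hardcover/e-book split).
    """
    if not 0 < hardcover_fraction <= 1:
        raise ValueError("hardcover_fraction must be in (0, 1]")
    estimated_total_sales = hardcover_sales / hardcover_fraction
    return goodreads_ratings / estimated_total_sales


# Round numbers implied by the text, not real table data:
# 60,000 hardcovers at a 50/50 split gives about 120,000 total copies.
share_50_50 = rating_share(60_000, 30_000, hardcover_fraction=0.5)

# Same inputs, but assuming 75% of copies were hardcover instead.
share_75_25 = rating_share(60_000, 30_000, hardcover_fraction=0.75)

print(f"50/50 split: {share_50_50:.1%} of readers rated the book")
print(f"75/25 split: {share_75_25:.1%} of readers rated the book")
```

The more hardcover-heavy the assumed split, the smaller the estimated total and the higher the rating share, so the 50/50 assumption is doing real work in the 25% figure.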
That doesn’t mean Goodreads is useless: it just means that it tracks a specific demographic. Whether that demographic is more in touch with the Hugo/Nebula awards is an open question.
One last chart for true stat geeks: let’s see what’s happened to the Hardcover/Goodreads ratio over time. Not a ton of data here, as only 4 SFF books had a decent run on the Bestseller chart. Here it is:
You can see that Rice and King have reasonably shaped curves that are converging to around 15 in King’s case and about 30 in Rice’s case. Mandel and Rothfuss have basically straight lines: they were popular on Goodreads to start, and that hasn’t changed at all. That reinforces my last point: Goodreads treats King and Rice fundamentally differently than Mandel or Rothfuss.
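If you wanted to put a rough number on “curve” versus “straight line,” one option is to compute the ratio week by week and see how far it drifts from its starting value. This is only a sketch: the helper names are my own, and both weekly series are invented to mimic the two shapes described above, not taken from the chart.

```python
def weekly_ratios(weekly_totals):
    """Turn (cumulative sales, cumulative ratings) pairs into ratios."""
    return [sales / ratings for sales, ratings in weekly_totals]


def relative_drift(ratios):
    """Relative change from the first week's ratio to the last week's."""
    return (ratios[-1] - ratios[0]) / ratios[0]


# Invented series for illustration only, NOT the data behind the chart.
# "curved": ratings lag sales at first, then catch up, so the ratio falls.
curved = [(50_000, 1_000), (80_000, 2_500), (100_000, 5_000), (120_000, 8_000)]
# "flat": ratings keep pace with sales from the start.
flat = [(20_000, 10_000), (30_000, 15_000), (40_000, 20_000)]

curved_ratios = weekly_ratios(curved)
flat_ratios = weekly_ratios(flat)

print("curved:", curved_ratios, f"drift {relative_drift(curved_ratios):+.0%}")
print("flat:  ", flat_ratios, f"drift {relative_drift(flat_ratios):+.0%}")
```

A large negative drift means the ratings took a while to catch up with sales; a drift near zero means the book’s readers were on Goodreads from the first week.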
With enough time and data—which we don’t have—we might be able to get a better sense of why books are tracked in different ways. Perhaps it would be a simple demographic correction (authors over 40 have this kind of ratio, authors under 40 have this kind of ratio). However, since Publisher’s Weekly doesn’t share enough data, we’re stuck. So be careful when looking at Goodreads numbers; they reflect a young audience, and are misleading when making comparisons between a Mandel and a Gibson.
I won’t lie: I’m a little disappointed that the Goodreads data isn’t more reliable. Given the large sample size, I’d hoped that Goodreads would flatten out any demographic bias. It doesn’t appear to do so, so any Goodreads numbers should be approached with healthy skepticism.
Next up for Chaos Horizon: start collecting Amazon data to see if that’s a better match to Publisher’s Weekly. Check back in a couple months, and we’ll see if that data lines up any better!