Building the Nebula Model, Part 1
The raison d’être of Chaos Horizon has always been to provide numerical predictions for the Nebula and Hugo Awards for Best Novel based on data-mining principles. I’ve always liked odds, percentages, stats, and so forth. I was surprised that no one was doing this already for the major SFF awards, so I figured I could step into this void and see where a statistical exploration would take us.
Over the past few months, I’ve been distracted trying to predict the Hugo and Nebula slates. Now that we have the Nebula slate (and the Hugo is coming shortly), I can turn my attention back to my Nebula and Hugo models. Last year, I put together my first mathematical models for the Hugos and Nebulas. They both predicted eventual winner Leckie, which is good for the models. As I’ll discuss in a few posts, my current model has around 67% accuracy over the last 15 years. Of course, past accuracy doesn’t guarantee future accuracy, but at least you know where the model stands. In a complex, multi-variable problem like this, perfect accuracy is impossible.
I’m going to be rebuilding and updating the model over the next several weeks. There are a couple of tweaks I want to make, and I also want to bring Chaos Horizon readers who weren’t around last year into the process. Over the next few days, we’ll go through the following:
1. Guiding principles
2. The basics of the model
3. Model reliability
4. To-do list for 2015
Let’s get started today with the guiding principles for my Nebula and Hugo models:
1. The past predicts the future. Chaos Horizon uses a type of statistics called data-mining, which means I look for statistical patterns in past data to predict the future. There are other equally valid statistical models, such as sampling. In a sampling methodology, you would ask a certain number of Nebula or Hugo voters what their award votes were going to be, and then use that sample to extrapolate the final results, usually correcting for demographic issues. This is the methodology of Presidential voting polls, for instance. A lot of people do this informally on the web, gathering up the various posted Hugo and Nebula ballots and trying to predict the awards from that.
Data-mining works differently. You take past data and comb through it to come up with trends and relationships, and then you assume (and it’s only an assumption) that such trends will continue into the future. Since there is carryover in both the SFWA and WorldCon voting pools, this makes a certain amount of logical sense. If the past 10 years of Hugo data show that a SF novel wins most of the time, you should predict a SF novel to win in the future. If 10 years of data show that the second novel in a series never wins, you shouldn’t predict a second novel to win.
Now, the data is usually not that precise. Instead, there is a historical bias towards SF novels, and first or standalone novels, and past winners, and novels that do well on critical lists, and novels that do well in other awards, etc. I transform these observations into percentages (60% of the time a SF novel wins, 75% of the time the Nebula winner wins the Hugo, etc.) and then combine those percentages to come up with a final percent. We’ll talk about how I combine all this data in the next few posts.
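To make the "observations into percentages" step concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration: the PastWinner record, the share helper, and above all the five sample rows, which are placeholders and not real award history. It only covers the first half of the step (turning a pattern into a percentage); how the percentages get combined is left for the later posts.

```python
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass(frozen=True)
class PastWinner:
    """One past Hugo winner (hypothetical record layout)."""
    year_index: int
    is_sf: bool        # science fiction rather than fantasy
    won_nebula: bool   # also won that year's Nebula


def share(records: Iterable[PastWinner],
          predicate: Callable[[PastWinner], bool]) -> float:
    """Fraction of records that satisfy the predicate (0.0 if there are none)."""
    rows = list(records)
    if not rows:
        return 0.0
    return sum(1 for row in rows if predicate(row)) / len(rows)


# Placeholder rows for illustration only. NOT real award data.
sample = [
    PastWinner(1, is_sf=True, won_nebula=True),
    PastWinner(2, is_sf=True, won_nebula=False),
    PastWinner(3, is_sf=False, won_nebula=True),
    PastWinner(4, is_sf=True, won_nebula=True),
    PastWinner(5, is_sf=False, won_nebula=False),
]

sf_rate = share(sample, lambda w: w.is_sf)
nebula_rate = share(sample, lambda w: w.won_nebula)

print(f"SF novel won: {sf_rate:.0%} of the time")
print(f"Nebula winner also won the Hugo: {nebula_rate:.0%} of the time")
```

Swap in real records and each percentage is just a count divided by N, which is why a small N makes every one of these numbers jumpy.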
Lastly for this point, data-mining has difficulty predicting sudden and dramatic changes in data sets. Huge changes in sentiment will be missed in what Chaos Horizon does, as they aren’t reflected in past statistical trends. Understand the limitations of this approach, and proceed accordingly.
2. Simple data means simple statistics. The temptation for any statistician is to use the most high-powered, shiny statistical toys on their data sets: multi-variable regressions, computer-assisted Bayesian inferences, etc. All that has its place, and maybe in a few years we’ll try one of those out to see how far off it is from the simpler statistical modeling Chaos Horizon uses.
For the Nebulas and Hugos, though, we’re dealing with a low N (number of observations) but a high number of variables (genre, awards history, popularity, critical response, reader response, etc.). As a result, the project itself is, from a statistical reliability perspective, fatally flawed. That doesn’t mean it can’t be interesting, or that we can’t learn anything from close observation, but I never want to hide the relative lack of data by pretending my results are more solid than they are. Low data will inevitably result in unreliable predictions.
Let’s think about what the N is for the Nebula Award. Held since 1966, 219 individual novels have been nominated for the Nebula. That’s our N, the total number of observations we have. We don’t get individual voting numbers for the Nebula, so that’s not an option for a more robust N. Compare that to something like the NCAA basketball tournament (since it’s going on right now). That’s been held since 1939. The field expanded to our familiar 64 teams in 1985. That means, in the tournament proper (the play-in round is silly), 63 games have been contested every year since 1985. So, if you’re modeling who will win an NCAA tournament game, you have 63 * (2014-1985) = 1827 observations. Now, if we wanted to add in the number of games played in the regular season, we’d wind up with 347 teams (Division I teams) * 30 games each / 2 (they play each other, so we don’t want to count every game twice) = 5,205 more observations. That’s just one year of college basketball regular season games! Multiply that by 30 seasons, and you’re looking at an N of roughly 150,000 in the regular season, plus an N of roughly 2,000 for the postseason. You can do a lot with data sets that big!
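If you want to check that arithmetic yourself, here is a short Python sketch. It uses only the figures quoted above; the function and variable names are mine, and the 30-games-per-team figure is the same round number used in the paragraph, not an exact schedule count.

```python
# Sample-size arithmetic from the paragraph above, using the post's own figures.

NEBULA_NOMINEES = 219  # novels nominated for the Nebula since 1966


def tournament_games(first_year: int, last_year: int, games_per_year: int = 63) -> int:
    """Tournament games in the 64-team era, counted as games * (year difference)."""
    return games_per_year * (last_year - first_year)


def regular_season_games(teams: int = 347, games_per_team: int = 30) -> int:
    """Each game involves two teams, so halve the product to avoid double counting."""
    return teams * games_per_team // 2


postseason = tournament_games(1985, 2014)
one_season = regular_season_games()
thirty_seasons = one_season * 30

print(f"Postseason observations:       {postseason:,}")
print(f"Regular season, one year:      {one_season:,}")
print(f"Regular season, thirty years:  {thirty_seasons:,}")
print(f"Nebula nominees, all time:     {NEBULA_NOMINEES:,}")
```

The exact thirty-season product comes out a little above the rounded 150,000, but the point is the gap: basketball gives you hundreds of times more observations than the Nebula does.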
So our 219 Nebula Best Novel observations look pretty paltry. Let’s throw in the reality that the Nebulas have changed greatly over the last 40 years. Does 1970 data really predict what will happen in 2015? That’s before the internet, before fantasy became part of the process, etc. So, at Chaos Horizon, I primarily use the post-2000 data: new millennium, new data, new trends. That leaves us with an N of just 87. From a statistical perspective, that should make everyone very sad. One option is to pack up and go home, to conclude that any trends we see in the Nebulas will be random statistical noise.
I do think, however, that the awards have some very clear trends (favoring certain kinds of novels, favoring past nominees and winners) that help settle down the variability. Chaos Horizon should be considered an experiment (perhaps a grand failed experiment, but those are the best kind) to see if statistics can get us anywhere. Who knows? In 5 years I may have to conclude that no, we can’t use data-mining to predict the awards.
3. No black boxing the math. As a corollary to point #2, I’ve decided to keep the mathematics on Chaos Horizon at roughly the high school level. I want anyone, with a little work, to be able to follow the way I’m putting my models together. As such, I’ve had to choose some simpler mathematical modeling. I think that clarity is important: if people understand the math, they can contest and argue against it. Chaos Horizon is meant to be the beginning of a conversation about the Hugos and Nebulas, not the end of one.
So I try to avoid the following statement: given the data, we get this prediction. Notice how that sentence isn’t logically constructed: how was the data used? What kind of mathematics was it pushed through? If you wanted to do the math yourself, could you? I want to write: given this data, and this mathematical processing of that data, we get this prediction.
4. Neutral presentation. To trust any statistical presentation, you have to trust that the statistics are presented in a fair, logical, and unbiased fashion. While 100% lack of bias is impossible as long as humans are doing the calculating, the attempt at neutrality is very important for me on this website. Opinions are great, and have their place in the SFF realm: to get those, simply go to another site. You won’t find a shortage of those!
Chaos Horizon is trying to do something different. Whether I’m always successful or not is for you to judge. Keep in mind that neutrality does not mean completely hiding my opinions; doing so is just as artificial as putting those opinions in the forefront. If you know some of my opinions, it should allow you to critique my work better. You should question everything that is put up on Chaos Horizon, and I hope to facilitate that questioning by making the chains of my reasoning clear. What I want to avoid at all costs is saying: I like this author (or this author deserves an award), therefore I’m going to up their statistical chances. Nor do I want to punish authors because I dislike them; I try to apply the same processing and data-mining principles to everyone who crosses my desk.
5. Chaos Horizon is not definitive. I hold that the statistical predictions provided on Chaos Horizon are no more than opinions. Stats like this are not a science; the past is not a 100% predictor of the future. These opinions are arrived at through a logical process, but since I am the one designing and guiding the process, they are my ideas alone. If you agree with the predictions, agree because you think the process is sound. If you disagree with the process, feel free to use my data and crunch it differently. If you really hate the process, feel free to find other types of data and process them in whatever way you see appropriate. Then post them and we can see if they make more sense!
Each of these principles is easily contestable, and different statisticians/thinkers may wish to approach the problem differently. If I make my assumptions, biases, and axioms clearly visible, this should allow you to engage with my model fully, and to understand both the strengths and weaknesses of the Chaos Horizon project.
I’ll get into the details of the model over the next few days. If you’ve got any questions, let me know.