
How to predict anything*

There has been a lot of press since the election about how Nate Silver (and others) correctly predicted the outcome. For whatever reason, I feel compelled to explain the bare basics of how to make similar predictions. I will make this so simple that only knowledge of arithmetic is required. Here is my recipe:

  1. Make a list of all the outcomes that you care about. For example, do you want to predict if you will pass or fail a class, or do you want to predict what your grade will be? In the first case there are only 2 outcomes (pass/fail), so the prediction will likely be easier. In the second case your prediction gives you more information but will be harder to compute.
  2. Guess the chances of each of those outcomes. You can try to be fair and say all the outcomes are equally likely. Or you can try to use all the information you have, for example by putting a higher chance on failing the class if you already know you had a low grade on some homework. Each of the “chances” should be a number greater than zero but below one, and if you add the chances for all of the outcomes the answer should be one.
  3. Update the chances whenever you learn new information that’s relevant to the outcomes. For example, if you get another homework grade and it’s good then you should make your chance of passing a little bit higher. And remember that all the chances should add to one, so if one goes up then the others have to go down.
  4. Make predictions based on the chances. There’s more than one way to do this. If you only care about which one outcome is the most likely, then see which of the chances is the largest. You might also want to know about a range of outcomes, like the chance that you get at least a B in the class. In that case just add the chance that you get a B and the chance that you get an A.
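Here is a rough sketch of the recipe in Python, using the class-grade example. The starting chances and the `good_homework` numbers are made up for illustration. The update rule in step 3 (multiply each chance by how likely the new information would be under that outcome, then rescale so the chances add to one) is one reasonable way to do the update, not the only one.

```python
def rescale(weights):
    """Rescale positive weights so they add up to one."""
    total = sum(weights.values())
    return {outcome: w / total for outcome, w in weights.items()}


def update(chances, likelihood):
    """Step 3: multiply each chance by how likely the new information
    would be under that outcome, then rescale so the chances add to one."""
    return rescale({o: chances[o] * likelihood[o] for o in chances})


# Step 1: list the outcomes you care about.
outcomes = ["A", "B", "C", "D", "F"]

# Step 2: guess the chances (here: all outcomes equally likely).
chances = {o: 1 / len(outcomes) for o in outcomes}

# Step 3: a good homework grade comes back. These numbers are invented:
# the guessed chance of seeing a good homework grade under each final grade.
good_homework = {"A": 0.9, "B": 0.7, "C": 0.5, "D": 0.3, "F": 0.1}
chances = update(chances, good_homework)

# Step 4: predict. The single most likely outcome...
most_likely = max(chances, key=chances.get)
# ...and the chance of a range of outcomes (at least a B).
at_least_b = chances["A"] + chances["B"]

print(most_likely, round(at_least_b, 2))
```

Notice that after the update the chances of an A and a B went up and the others went down, while the total stayed at one.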

That should probably be good enough for most people’s purposes. What Nate Silver and those other people did was more complicated. Here are a few bonus points, but they require a bit more knowledge than arithmetic (high school math and a little bit of computer programming).

  • Modeling: The above recipe describes a “multinomial” probability model, that is, one with a finite list of outcomes that you care about. But maybe there are too many outcomes to list: maybe the outcome is a count like the number of goals scored in a game, or maybe the outcomes are numbers that can be ordered, like the amount of money that you make by investing in a stock (it would be silly to list outcomes like $0-100, $100-200, etc). There are many standard probability models that are useful depending on the situation. A few of the most common are: Binomial for the number of “successes” in a given number of “independent trials” (e.g. number of heads when tossing a coin 100 times), Poisson for counting the number of times something happens when you know the rate (e.g. if you usually have 5 customers visiting your store every day, what is the chance of having 100 or fewer customers in the next month?), Exponential for how long it will take before something happens (e.g. how long until the next customer visits your store?), and Normal (aka “bell curve”) for almost everything else, especially if the middle outcome is more likely than all other outcomes.
  • Bayes’ formula for updating chances is perhaps one of the most important formulas ever written down. It tells you how to update your model after you learn new information. Depending on your model it might be difficult to calculate the formulas explicitly, but you can always use computer help, which brings us to the next point.
  • Simulation: The models mentioned above are very helpful because they come with standard formulas for all the kinds of predictions you might want to make, like what is the single most likely outcome, or what is the chance of being a certain amount higher or lower than the most likely outcome, and so on. But sometimes your situation is too complicated for any of the simple models listed above, or the assumptions needed to make those models work are not true for your situation (like the “independence” of trials for the Binomial model). In these kinds of cases it can be very helpful to write a computer program that randomly simulates the outcomes many times. For example, maybe I have a good guess of the chances that Obama will win each state in the electoral college, and I want to know the chances that he wins the election with at least 300 electoral college votes. Then I could write a computer program that simulates a thousand elections and records the percentage of those simulations in which he won 300 or more votes. Another very useful kind of simulation method was invented in the Statistics department here: it’s called bootstrapping, and everybody’s doing it.
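To make two of the bonus points concrete, here is a small Python sketch. The first part works out the store example with the Poisson model, assuming a 30-day month. The second part simulates a thousand elections. The `toy_states` list is entirely made up (it lumps "safe" states into two big blocs and invents the chances for the rest), and the simulation treats states as independent of each other, which is a simplifying assumption that a serious forecast would probably not make.

```python
import math
import random


def poisson_at_most(k, mean):
    """Chance that a Poisson count with the given mean is k or fewer."""
    term = math.exp(-mean)      # chance of exactly 0
    total = term
    for i in range(1, k + 1):
        term *= mean / i        # chance of exactly i, built from i - 1
        total += term
    return total


# Store example: 5 customers a day on average, over an assumed 30-day month,
# gives an expected count of 150. What is the chance of 100 or fewer?
p_slow_month = poisson_at_most(100, 5 * 30)


def chance_of_reaching(states, target, n_sims=1000, seed=0):
    """Simulate n_sims elections and return the fraction in which the
    candidate gets at least `target` electoral votes.

    `states` is a list of (electoral_votes, chance_of_winning) pairs.
    Each state is decided independently (a simplifying assumption).
    """
    rng = random.Random(seed)   # fixed seed so the run is repeatable
    hits = 0
    for _ in range(n_sims):
        votes = sum(ev for ev, p in states if rng.random() < p)
        if votes >= target:
            hits += 1
    return hits / n_sims


# Invented numbers for illustration only; the votes add up to 538.
toy_states = [
    (200, 1.0),   # bloc of states treated as certain wins
    (55, 0.9),
    (29, 0.5),
    (20, 0.6),
    (18, 0.5),
    (216, 0.0),   # bloc of states treated as certain losses
]
p_300 = chance_of_reaching(toy_states, 300)

print(p_slow_month, p_300)
```

With an expected 150 customers, a month with 100 or fewer turns out to be very unlikely. The election number depends entirely on the invented inputs, so the point is the method and not the answer.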

If you master all these skills then not only will you be able to predict anything*, but you’ll be able to do it better than everyone else!

* Of course “anything” is an exaggeration. These methods usually work better if you are predicting the kinds of outcomes that happen repeatedly, so that there is a history of similar outcomes that suggests a type of model for you to use or allows you to make good guesses about the various chances. Sometimes the outcome is something that you didn’t even consider, like neither team winning a game because the game is canceled due to weather.
