Predicting the second half of the Democratic primary

I’m a statistician and I support Bernie Sanders.

So this is how I spent part of my Saturday.

I gathered data from the US census, Wikipedia, WolframAlpha, and a few other sources (the Facebook map data from FiveThirtyEight, and Google Trends data pulled with the gtrendsR package). Using this data, and no polls, I built a predictive model in a standard fashion: ridge regression, with cross-validation to choose the level of regularization. Since only 26 states have voted so far, there is reason to believe that 7-fold cross-validation will not be very stable, so I also averaged the predictions of 100 models, each fit using a different random split for cross-validation. The plot below shows that the resulting predictions come pretty close to Senator Sanders’s actual share of the vote. (Important note: vote share is what I’m predicting, not probability of winning.)
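For readers who want to see the procedure spelled out, here is a minimal sketch of ridge regression with repeated 7-fold cross-validation and averaged predictions. It is not the code behind this post: the data is synthetic, the function names (`ridge_fit`, `cv_choose_lambda`, `averaged_ridge_predictions`) and the penalty grid are made up for illustration, and it assumes each of the 100 models differs only in which penalty its random split selects.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Ridge fit on standardized predictors; returns (weights, intercept) on the original scale."""
    mu, sd = X.mean(axis=0), X.std(axis=0)
    sd = np.where(sd == 0, 1.0, sd)  # guard against constant columns
    Z = (X - mu) / sd
    y_mean = y.mean()
    # Closed-form solution of (Z'Z + lam*I) w = Z'(y - mean(y)); the intercept is not penalized.
    w = np.linalg.solve(Z.T @ Z + lam * np.eye(Z.shape[1]), Z.T @ (y - y_mean))
    w_orig = w / sd
    return w_orig, y_mean - mu @ w_orig

def cv_choose_lambda(X, y, lambdas, k, rng):
    """Pick the penalty with the lowest k-fold cross-validated squared error for one random split."""
    n = len(y)
    folds = np.array_split(rng.permutation(n), k)
    cv_errors = []
    for lam in lambdas:
        sse = 0.0
        for test_idx in folds:
            train_idx = np.setdiff1d(np.arange(n), test_idx)
            w, b = ridge_fit(X[train_idx], y[train_idx], lam)
            sse += np.sum((X[test_idx] @ w + b - y[test_idx]) ** 2)
        cv_errors.append(sse / n)
    return lambdas[int(np.argmin(cv_errors))]

def averaged_ridge_predictions(X, y, X_new, lambdas, k=7, n_models=100, seed=0):
    """Average the predictions of n_models ridge fits, each tuned on a different random CV split."""
    rng = np.random.default_rng(seed)
    preds = []
    for _ in range(n_models):
        lam = cv_choose_lambda(X, y, lambdas, k, rng)
        w, b = ridge_fit(X, y, lam)  # refit on all voted states with the chosen penalty
        preds.append(X_new @ w + b)
    return np.mean(preds, axis=0)

# Synthetic stand-in: 26 "states that have voted", 5 made-up predictors, 4 "upcoming states".
rng = np.random.default_rng(1)
X = rng.normal(size=(26, 5))
y = 45 + X @ np.array([3.0, -2.0, 0.0, 1.0, 0.0]) + rng.normal(scale=2.0, size=26)
X_new = rng.normal(size=(4, 5))
lambdas = np.logspace(-2, 3, 20)  # hypothetical penalty grid
pred = averaged_ridge_predictions(X, y, X_new, lambdas)
print(np.round(pred, 1))
```

With so few rows, the penalty chosen by any single 7-fold split can jump around quite a bit, which is the motivation for averaging over many splits.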


The individual results for each state are listed below. My apologies for the absurdly large table… WordPress won’t let me change it (clearly they are part of the political establishment).

State Actual (%) Predicted (%) Error (Predicted - Actual)
Iowa 49.6 50.8 1.2
New Hampshire 60.4 60.8 0.4
Nevada 47.3 46.7 -0.6
South Carolina 26 25.4 -0.6
Alabama 19.2 22.6 3.4
Arkansas 29.7 31.2 1.5
Colorado 59 60.6 1.6
Georgia 28.2 26 -2.2
Massachusetts 48.7 51.1 2.4
Minnesota 61.7 61.9 0.2
Oklahoma 51.9 46.9 -5
Tennessee 32.4 32.7 0.3
Texas 33.2 33.2 0
Vermont 86.1 82.9 -3.2
Virginia 35.2 36.2 1
Kansas 67.7 62.4 -5.3
Louisiana 23.2 24.5 1.3
Nebraska 57.2 61.5 4.3
Maine 64.2 62.8 -1.4
Michigan 49.8 48.8 -1
Mississippi 16.5 20.2 3.7
Florida 33.3 36 2.7
Illinois 48.7 47.1 -1.6
Missouri 49.4 45.1 -4.3
North Carolina 40.8 38.3 -2.5
Ohio 42.7 46.3 3.6
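To summarize the table in a couple of numbers, the short script below recomputes the errors as predicted minus actual and reports the mean error, the mean absolute error, and the root mean squared error. The figures are copied from the table; the variable names are just for illustration.

```python
from math import sqrt

# (state, actual %, predicted %) copied from the table above
results = [
    ("Iowa", 49.6, 50.8), ("New Hampshire", 60.4, 60.8), ("Nevada", 47.3, 46.7),
    ("South Carolina", 26.0, 25.4), ("Alabama", 19.2, 22.6), ("Arkansas", 29.7, 31.2),
    ("Colorado", 59.0, 60.6), ("Georgia", 28.2, 26.0), ("Massachusetts", 48.7, 51.1),
    ("Minnesota", 61.7, 61.9), ("Oklahoma", 51.9, 46.9), ("Tennessee", 32.4, 32.7),
    ("Texas", 33.2, 33.2), ("Vermont", 86.1, 82.9), ("Virginia", 35.2, 36.2),
    ("Kansas", 67.7, 62.4), ("Louisiana", 23.2, 24.5), ("Nebraska", 57.2, 61.5),
    ("Maine", 64.2, 62.8), ("Michigan", 49.8, 48.8), ("Mississippi", 16.5, 20.2),
    ("Florida", 33.3, 36.0), ("Illinois", 48.7, 47.1), ("Missouri", 49.4, 45.1),
    ("North Carolina", 40.8, 38.3), ("Ohio", 42.7, 46.3),
]

# Error is predicted minus actual, so a positive value means the model overshot.
errors = {state: predicted - actual for state, actual, predicted in results}
n = len(errors)

mean_error = sum(errors.values()) / n                 # signed bias
mae = sum(abs(e) for e in errors.values()) / n        # mean absolute error
rmse = sqrt(sum(e * e for e in errors.values()) / n)  # root mean squared error
worst = max(errors, key=lambda state: abs(errors[state]))

print(f"states: {n}")
print(f"mean error: {mean_error:.2f}  MAE: {mae:.2f}  RMSE: {rmse:.2f}")
print(f"largest miss: {worst} ({errors[worst]:+.1f})")
```

The mean absolute error works out to roughly 2.1 points and the RMSE to roughly 2.6, with almost no overall bias. Keep in mind that these are presumably errors on the same states the models were fit to, so misses on states that have not voted yet could well be larger.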

Before we get to the predictions, let’s also look at how the models weighted each predictor variable. The boxplots below show, for each predictor, the weights it received across the 100 models (remember, I’m averaging the predictions of these models).


Several things are worth noting here.

  • The census data I used had variables for the proportion of workers in all kinds of different industries, and I threw most of these away. But I had a suspicion that IndustryFinance and IndustryManufacturing might be important. Apparently I was right about finance: states relying on that industry do not like Bernie. Surprise! Actually, the effect may not be as large as it appears in these models. Among states that have voted so far, the ones with the largest finance-industry share are Florida and Massachusetts (tied), and the one with the smallest is Vermont, so this weight may partly reflect Sanders’s home-state advantage rather than finance itself. We’ll know more when New York votes on April 19.
  • Age effects: Age18to34 has a smaller effect than Age45to55, which is a little surprising. It might reflect the lower turnout rates among younger voters. And even though Bernie wins millennials by a larger margin, he wins people under 55 as well, all other things being equal. To answer TNR’s question, “Who is the Hillary voter?”: they are mostly fairly old.
  • None of the available exit poll data has information about Asians, so it is pretty surprising that RaceAsian is an important variable. It’s certainly not the narrative about race that we’ve been hearing.
  • Other things working in Sanders’s favor: education above the high school level, high-speed internet access, and surpluses of likes on Facebook.
  • Things working in Clinton’s favor: poverty, high unemployment, lack of higher education, and larger shares of the population that are high-income, black, or old.

And now for the predictions!

State Predicted (%)
Arizona 44.8
Idaho 61
Utah 62.8
Alaska 66
Hawaii 79
Washington 65.5
Wisconsin 65.9
Wyoming 67.2
New York 46
Connecticut 44.6
Delaware 27.1
Maryland 38
Pennsylvania 48.7
Rhode Island 57.7
Indiana 58.7
West Virginia 45.5
Kentucky 49.8
Oregon 75.1
Puerto Rico 44
California 60.6
Montana 73
New Jersey 44.9
New Mexico 61.6
North Dakota 80.2
South Dakota 71.4
D.C. 26.6

These predictions are not great news for Sanders. They would translate to roughly 1923 delegates, not enough to win. Bernie will need to beat these expectations by about 5% across the board in order to win. This is nothing new: FiveThirtyEight has been singing this song since the day after the first Super Tuesday. However, I think there’s reason to still hope. Three reasons.

First, I think the IndustryFinance and RaceAsian effects are probably not as large as they appear based on the previous elections. This means the predictions for California and Hawaii might be a little too high, and the predictions for Delaware, Connecticut, New Jersey, Arizona, and New York might be too low.

Second, in a word: momentum. If Bernie can hold his own in Arizona and win big in Utah and Idaho, which is certainly within reach, he’ll be set up for a long stretch of big wins before New York votes on April 19th. This could yield the all-important media coverage he has been denied since early March (with the one exception of the Michigan upset). Bernie is probably too old for the nickname “comeback kid,” so media people reading this have an action item: think of a better nickname before April 5-9 (Wisconsin and Wyoming, both predicted over 60%).

The last reason is also the only reason I have any hope left for democracy in this country. Billionaire donors and high-rolling campaign bundlers are one thing; an army of volunteers is another. The grassroots supporting Sanders have been growing and getting better organized. We’ve made over 30 million phone calls to voters, and the rate is increasing. When this effort was focused on Michigan, it was part of the reason the outcome swung by 20% from the polls. Spread out over 5 states on the last Super Tuesday, it wasn’t enough. It’s currently focused on 3 states, 2 of which are already favorable, then another group of 3 that are all favorable, and then one state at a time leading up to New York. It will then remain to be seen whether our organization has improved enough to handle 5 states at once on April 26th, and 6 on June 7th (including the crux: California).

Taken together, these three things give me hope despite the numbers. And I’m a numbers guy. So Sanders supporters: keep up the great work! It’s gonna take a lot more of it to beat these expectations by large enough margins to win.

Now I will predict my own actions

I predict that I will post again with an update after the votes are tallied next week, and get back to phonebanking and facebanking in the meantime.

Speculation: once more western states have voted, the West variable will become less negative and yield higher predictions in California. Also, once New York votes the IndustryFinance variable will probably be less negative.

Something I probably won’t do: aggregate data at the county or congressional district level. I’m pretty sure that would yield much more accurate predictions, but it’s too much work for me. I don’t know the API necessary to scrape Facebook so I entered the state numbers by hand. If someone hired me to do it I would (psst, hey Jeff Weaver…).

Note: A previous version of this post had different results because I had not standardized the predictor variables properly.
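Why standardization matters so much for ridge regression: the penalty acts on the size of the weights, so the fit depends on the units of each predictor unless they are all put on a common scale first. The toy example below is only an illustration with made-up numbers, not a reconstruction of the earlier mistake. It fits a one-predictor ridge model to the same hypothetical variable expressed as a proportion and as a percentage.

```python
from statistics import mean, pstdev

def standardize(values):
    """Rescale to mean 0 and (population) standard deviation 1."""
    m, s = mean(values), pstdev(values)
    return [(v - m) / s for v in values]

def ridge_1d_fitted(x, y, lam):
    """Fitted values of a one-predictor ridge regression with an unpenalized intercept."""
    x_mean, y_mean = mean(x), mean(y)
    xc = [v - x_mean for v in x]
    yc = [v - y_mean for v in y]
    w = sum(a * b for a, b in zip(xc, yc)) / (sum(a * a for a in xc) + lam)
    return [y_mean + w * a for a in xc]

# Hypothetical predictor measured two ways: as a proportion and as a percentage.
x_prop = [0.05, 0.10, 0.15, 0.20, 0.25, 0.30]
x_pct = [100 * v for v in x_prop]
y = [30, 38, 41, 52, 55, 63]  # made-up vote shares
lam = 1.0

# Without standardizing, the same penalty shrinks the two versions very differently.
raw_prop = ridge_1d_fitted(x_prop, y, lam)
raw_pct = ridge_1d_fitted(x_pct, y, lam)

# After standardizing, the units no longer matter.
std_prop = ridge_1d_fitted(standardize(x_prop), y, lam)
std_pct = ridge_1d_fitted(standardize(x_pct), y, lam)

print("raw, last fitted value:         ", round(raw_prop[-1], 2), "vs", round(raw_pct[-1], 2))
print("standardized, last fitted value:", round(std_prop[-1], 2), "vs", round(std_pct[-1], 2))
```

In the raw version the proportion-scaled predictor is shrunk almost to nothing while the percentage-scaled one is barely penalized, even though they carry identical information.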


3 thoughts on “Predicting the second half of the Democratic primary”

  1. purple says:

    West Virginia and Kentucky seem too low, NY too high.

  2. PW says:

    AP reports the following delegate splits:

    Idaho — 17 Sanders, 5 Clinton (78%/21% of the vote).
    Arizona — 26 Sanders, 41 Clinton (40%/58% of the vote).
    Utah — 24 Sanders, 5 Clinton (80%/20% of the vote).

    So Sanders won the night, gaining a net of 16 delegates and reducing his delegate gap with Clinton from ~324 to ~308.

    The results for Sanders in UT and ID were much better than your projections, and in AZ worse. Is there a way to adjust your projections in light of these results? I tend to agree with the commenter above that your NY projection is probably too high for Sanders, for example.

  3. […] week ago I predicted the second half of the Democratic primaries. Six states have voted since then. How did my predictions […]
