I’m a statistician and I support Bernie Sanders.
So this is how I spent part of my Saturday.
I gathered data from sources including the US census, Wikipedia, WolframAlpha, and a few others (the Facebook map data from FiveThirtyEight, and Google Trends data via the gtrendsR package). Using this data, and no polls, I built a predictive model in a standard fashion: ridge regression, with cross-validation to choose the level of regularization. Since I only had data for the 26 states that have voted so far, there is reason to believe (7-fold) cross-validation will not be very stable, so I fit 100 models, each using a different random split of the data into folds, and averaged their predictions. The plot below shows that the resulting predictions come pretty close to Senator Sanders’s actual share of the vote. (Important note: what I’m predicting is vote share, not probability of winning.)
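For anyone who wants to play along, here is a minimal sketch of the procedure described above. It is not the code behind this post: gtrendsR is an R package, so the original analysis was presumably done in R, while this sketch uses Python with NumPy. The data is synthetic, and the function names, the grid of penalty values, and the number of predictors are all assumptions made for illustration.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Closed-form ridge regression with an unpenalized intercept."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc = X - x_mean
    beta = np.linalg.solve(Xc.T @ Xc + lam * np.eye(X.shape[1]),
                           Xc.T @ (y - y_mean))
    return y_mean - x_mean @ beta, beta

def cv_choose_lambda(X, y, lambdas, k, rng):
    """Pick the penalty with the lowest k-fold CV error for one random split."""
    n = len(y)
    folds = np.array_split(rng.permutation(n), k)
    errors = []
    for lam in lambdas:
        sse = 0.0
        for test_idx in folds:
            train_idx = np.setdiff1d(np.arange(n), test_idx)
            b0, beta = ridge_fit(X[train_idx], y[train_idx], lam)
            sse += np.sum((y[test_idx] - (b0 + X[test_idx] @ beta)) ** 2)
        errors.append(sse / n)
    return lambdas[int(np.argmin(errors))]

def averaged_ridge_predictions(X_train, y_train, X_new, lambdas,
                               k=7, n_models=100, seed=0):
    """Fit one ridge model per random CV split and average the predictions."""
    rng = np.random.default_rng(seed)
    # Standardize once, using the training states only. Doing this outside
    # the CV loop is a simplification made for this sketch.
    mu, sd = X_train.mean(axis=0), X_train.std(axis=0)
    Z_train, Z_new = (X_train - mu) / sd, (X_new - mu) / sd
    preds, coefs = [], []
    for _ in range(n_models):
        lam = cv_choose_lambda(Z_train, y_train, lambdas, k, rng)
        b0, beta = ridge_fit(Z_train, y_train, lam)
        preds.append(b0 + Z_new @ beta)
        coefs.append(beta)
    return np.mean(preds, axis=0), np.array(coefs)

# Synthetic stand-in data: 26 "voted" states, 24 remaining, 5 made-up predictors.
demo_rng = np.random.default_rng(1)
X_voted = demo_rng.normal(size=(26, 5))
y_voted = (50 + X_voted @ np.array([3.0, -2.0, 0.0, 1.0, 0.5])
           + demo_rng.normal(scale=2.0, size=26))
X_remaining = demo_rng.normal(size=(24, 5))
lambdas = np.logspace(-2, 3, 20)

pred, coefs = averaged_ridge_predictions(X_voted, y_voted, X_remaining, lambdas)
print(pred.round(1))
```

The second return value holds one row of weights per model, which is the kind of matrix a per-predictor boxplot would be drawn from.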
The individual results for each state are listed below. My apologies for the absurdly large table… WordPress won’t let me change it (clearly they are part of the political establishment).
Before we get to the predictions, let’s also look at how the models weighted each predictor variable. Below is a boxplot for each predictor, showing the spread of weights it received across the 100 models (remember, I’m averaging the predictions of these models).
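A boxplot is just a picture of a five-number summary, so the same information can be tabulated directly. The sketch below assumes the weights are stored as a matrix with one row per model and one column per predictor. The predictor names and numbers are placeholders, not the weights from this post.

```python
import numpy as np

def boxplot_summary(coefs, names):
    """Return (name, min, q1, median, q3, max) per predictor, sorted by median."""
    q = np.percentile(coefs, [0, 25, 50, 75, 100], axis=0)  # shape (5, n_predictors)
    rows = [(name, *q[:, j]) for j, name in enumerate(names)]
    return sorted(rows, key=lambda row: row[3])  # row[3] is the median

# Placeholder weights: 3 models (rows) by 2 predictors (columns).
toy_coefs = np.array([[1.0, -2.0],
                      [2.0, -1.0],
                      [3.0, -3.0]])
summary = boxplot_summary(toy_coefs, ["predictor_a", "predictor_b"])
for name, lo, q1, med, q3, hi in summary:
    print(f"{name}: min={lo}, q1={q1}, median={med}, q3={q3}, max={hi}")
```

A predictor whose box sits entirely on one side of zero kept the same sign across every split, which is a rough indication that its direction does not depend on how the folds happened to fall.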
Several things are worth noting here.
- The census data I used had variables for the proportion of workers in all kinds of different industries, and I threw most of these away. But I had a suspicion that IndustryFinance and IndustryManufacturing might be important. Apparently I was right about finance: states relying on that industry do not like Bernie. Surprise! Actually, the effect may not be as large as it appears in these models. Among states that have voted so far, the ones with the largest share of finance-industry workers are Florida and Massachusetts (tied), and the one with the smallest is Vermont (Bernie’s home state). We’ll know more when New York votes on April 19.
- Age effects: Age18to34 has a smaller effect than Age45to55, which is a little surprising. It might reflect the lower turnout rates among younger voters. And although Bernie wins millennials by a larger margin, he also wins people under 55 more generally, all other things being equal. To answer TNR’s question “Who is the Hillary voter?”: they are mostly fairly old.
- None of the available exit poll data has information about Asians, so it is pretty surprising that RaceAsian is an important variable. It’s certainly not the narrative about race that we’ve been hearing.
- Other things working in Sanders’s favor: education above the high school level, high speed internet access, and surpluses of likes on Facebook.
- Things working in Clinton’s favor: poverty, high unemployment, lack of higher education, and larger proportions of the population that are high-income, black, or old.
And now for the predictions!
These predictions are not great news for Sanders. They would translate to roughly 1923 delegates, not enough to win. Bernie will need to beat these expectations by about 5% across the board in order to win. This is nothing new: FiveThirtyEight has been singing this song since the day after the first Super Tuesday. However, I think there’s reason to still hope. Three reasons.
First, I think the IndustryFinance and RaceAsian effects are probably not as large as they appear based on the previous elections. This means the predictions for California and Hawaii might be a little too high, and the predictions for Delaware, Connecticut, New Jersey, Arizona, and New York might be too low.
Second, in a word: momentum. If Bernie can hold his own in Arizona and win big in Utah and Idaho, which is certainly within reach, he’ll be set up for a long stretch of big wins before New York votes on April 19th. This could yield the all-important media coverage he has been denied since early March (with the one exception of the Michigan upset). Bernie is probably too old for the nickname “comeback kid,” so media people reading this have an action item: think of a better nickname before April 5-9 (Wisconsin and Wyoming, both predicted over 60%).
The last reason is also the only reason I have any hope left for democracy in this country. Billionaire donors and high-rolling campaign bundlers are one thing; an army of volunteers is another. The grassroots effort supporting Sanders has been growing and getting better organized. We’ve made over 30 million phone calls to voters, and the rate is increasing. When this effort was focused on Michigan, it was part of the reason the outcome swung by 20% from the polls. Spread out over 5 states on the last Super Tuesday, it wasn’t enough. It’s currently focused on 3 states, 2 of which are already favorable, then another group of 3 that are all favorable, and then one state at a time leading up to New York. It will then remain to be seen whether our organization has improved enough to handle 5 states at once on April 26th, and 6 on June 7th (including the crux: California).
Taken together, these three things give me hope despite the numbers. And I’m a numbers guy. So Sanders supporters: keep up the great work! It’s gonna take a lot more of it to beat these expectations by large enough margins to win.
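To make "beating these expectations" concrete, here is a rough sketch of how predicted vote shares could be turned into a delegate total, and how a uniform overperformance changes it. It makes two assumptions. First, it reads "about 5%" as 5 percentage points of vote share. Second, it splits each state's delegates in simple proportion to the statewide two-way vote share, which ignores the congressional-district allocation and the 15% viability threshold in the actual Democratic rules. The state names, shares, and delegate counts are invented placeholders, not the predictions from this post.

```python
def estimated_delegates(predictions, shift=0.0):
    """Sum delegates under naive proportional allocation.

    predictions maps state name -> (predicted vote share in [0, 1], delegates).
    shift is a uniform change in vote share applied to every state.
    """
    total = 0
    for share, delegates in predictions.values():
        adjusted = min(max(share + shift, 0.0), 1.0)  # keep the share in [0, 1]
        total += round(adjusted * delegates)
    return total

# Invented placeholder numbers, for illustration only.
hypothetical = {
    "State A": (0.55, 100),
    "State B": (0.40, 60),
    "State C": (0.62, 20),
}

baseline = estimated_delegates(hypothetical)
with_boost = estimated_delegates(hypothetical, shift=0.05)
print(baseline, with_boost)
```

Under proportional allocation a uniform shift in vote share moves the delegate total by roughly the same fraction of the delegates still available, which is why a few points across the board matter more than one lopsided win.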
Now I will predict my own actions
Speculation: once more western states have voted, the West variable will become less negative and yield higher predictions in California. Also, once New York votes the IndustryFinance variable will probably be less negative.
Something I probably won’t do: aggregate data at the county or congressional district level. I’m pretty sure that would yield much more accurate predictions, but it’s too much work for me. I don’t know the API needed to scrape Facebook, so I entered even the state-level numbers by hand. If someone hired me to do the finer-grained version, I would (psst, hey Jeff Weaver…).
Note: A previous version of this post had different results because I had not standardized the predictor variables properly.
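Standardization matters here because the ridge penalty acts on the size of the coefficients, and a coefficient's size depends on the units of its predictor. A minimal sketch of z-scoring, with a made-up predictor, shows that the standardized values no longer depend on whether a variable was recorded as a proportion or as a percentage. The function name and the data are illustrative assumptions, not the post's actual preprocessing.

```python
import numpy as np

def standardize(X):
    """Center each column to mean 0 and scale it to standard deviation 1."""
    mu, sd = X.mean(axis=0), X.std(axis=0)
    return (X - mu) / sd, mu, sd

rng = np.random.default_rng(0)
as_proportion = rng.uniform(0.0, 0.3, size=(26, 1))  # made-up predictor, 26 states
as_percent = as_proportion * 100                      # same predictor, other units

z_prop, _, _ = standardize(as_proportion)
z_pct, _, _ = standardize(as_percent)
print(np.allclose(z_prop, z_pct))
```

When predicting for new states, the same means and standard deviations computed from the training states should be reused, which is why the function returns them.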