FiveThirtyEight’s Nate Silver and Harry Enten are not data journalists (or “empirical journalists”), they are data contortionists. Throughout this entire election season their coverage has been so consistently inaccurate, both in description and prediction, that their work can only be viewed generously as a good parody of data analysis or journalism. This did not begin in 2015. Silver has in the past headlined articles saying that early polls are not important, only to show in the body of his article that they actually are somewhat predictive (R^2 of 0.4) of election outcomes. If his writing had been subjected to anything like the peer-review process that data analyses must pass in an academic setting, these inconsistencies between actual results and narrative would have raised many questions. But his celebrity status affords him a large audience with no real checks, so he can say whatever he wants regardless of what the data shows. And that’s what he does.
For example, as “data journalists” it is particularly surprising that they would ever tell us to ignore the data. That they consistently do so whenever the data shows an outcome that doesn’t fit with the status quo, or would raise doubts concerning their obviously favored candidate, turns mere surprise into suspicion. The straw that broke the camel’s back, to me, is a recent article claiming that the Democratic primary season has actually been rigged in Sanders’s favor. This is a classic example of “big lie” propaganda: promote an idea so incredibly opposite to the truth that people cannot believe anyone would be able to lie so audaciously and conclude they must be telling the truth. In this case, their motivation is probably to role play as bold contrarians. It would be much easier to believe this act were it not for the fact that they are always biased in the same direction, and that direction is perfectly aligned with the rest of the media establishment, whose bias against Sanders verges on bloodlust.
To write the most ridiculous possible article it helps to begin with an absurd premise, so I must pause and explain the background. In this case, media reporting concerning Clinton’s lead in both the popular vote and delegates has been consistently inaccurate. Many news agencies have been including superdelegates in their delegate totals despite the fact that superdelegates do not vote until the convention and this has been explained to them by the DNC. Many news outlets report that Clinton’s lead in the popular vote is about 3 million, when in fact it is closer to 2.5 million. The incorrect number completely ignores votes from caucus states where the number of individual voters is usually not known, and the “correct” number is an estimate based on aggregating guesses by caucus precinct captains (which could be highly inaccurate). Sanders supporters have been angry about this for months, because we think our guy’s chances are being hurt by the media portraying him as a lost cause. Here is a timeline of 538 articles that all essentially say the same thing: Bernie has no chance, Clinton will certainly win, abandon all hope:
- July 8, 2015: Bernie Sanders Could Win Iowa And New Hampshire. Then Lose Everywhere Else.
- February 2: Bernie Sanders Needs More Than The Tie He Got In Iowa.
- February 10: It Gets Harder From Here For Bernie Sanders.
- March 2, before the halfway point of the primary: Hillary Clinton’s got this.
- March 30: It’s Really Hard To Get Bernie Sanders 988 More Delegates.
- April 8: Bernie Sanders Is Even Less Competitive Than He Appears.
- April 26: Today Is Clinton’s Chance To End The ‘Groundhog Day’ Campaign.
- April 28: A Sanders Comeback Would Be Unprecedented. (Is there ANYTHING about his campaign that has not been unprecedented?)
- May 24: Clinton Will Likely Clinch The Democratic Nomination In New Jersey.
This is not exhaustive; I stopped opening browser tabs because my machine was getting slow. And this is just from 538. The Upshot over at the New York Times has been almost an exact clone in terms of their coverage. Here they are dismissing Bernie immediately following his astounding upset in Michigan. And if we lower the bar to non-data journalism we can probably find thousands of other articles and TV clips pushing the same narrative: Bernie has no chance. They all said basically the same about Trump almost right up until he won. They’re probably right about Bernie, but they definitely helped make it true in his case with their absurd doom and gloom coverage. And they also exclusively reported the single most boring and uninsightful fact about his amazing campaign, completely missing what is history in the making and probably the most important change in politics in America for decades.
Against this backdrop, Shaun King at the New York Daily News was correcting the claim about Clinton’s 3 million vote lead by pointing out it did not include caucuses.
The Clinton campaign knows this. Their friends in the media know this, but they continue to allow the campaign to tout that 3 million number even though they know full well that it’s not accurate. The Democratic primaries and caucuses simply don’t have accurate popular vote totals.
He also made several other points about the unfairness of the Democratic primary, with his headline being about superdelegates. He pointed out that if superdelegates had split the opposite way than they have now, counting their votes (as the media does) would actually put Sanders ahead of Clinton.
The straw that broke the camel’s back
In response to all of this, Silver and Enten ignored every single thing about the entire process except for caucuses and wrote an article claiming that they have unfairly benefitted Sanders. It is true that caucuses generally have much lower turnout than primaries, despite the fact that they are usually held on the weekend while primaries are usually held on Tuesdays. The claim usually cited for this is that they take a long time. Except, in this primary season, due to systematic attempts at voter suppression in states like Arizona, many primary voters have ended up standing in line for longer than they would have in a caucus. These facts aside, let’s present the logic of Silver and Enten’s argument.
- Fact: Sanders has done better in caucus states than primary states.
- Claim: The demographic differences between these states are not enough to explain this over-performance.
- Fact: Sanders won the caucuses of Washington and Nebraska, and Clinton won their symbolic primaries.
They then present the predictions of a model based on demographics and the caucus effect to claim that if all states had done primaries, Clinton would have won some of the states where Sanders won and would currently be leading by much more. In other words, the suppressed voter turnout in caucuses benefited Sanders, and furthermore his claim that he wins when voter turnout is high is false.
It is a fact that Sanders has done better in caucus states. However, their claim that demographic differences cannot explain this is an essential point to their argument and they present exactly zero evidence to prove it. Fact #3 above (the Washington and Nebraska results) is not even remotely close to being evidence in favor of this for reasons any statistically-literate person would know. A symbolic primary is likely to show a biased sampling of voters, and this bias is probably going to be in favor of people who were upset their candidate lost the caucus. When a sampling method is biased, it doesn’t matter how large you make the sample. Increasing sample size reduces variance, but not bias, unless your sample becomes the entire population. The symbolic primary in Washington, for example, represents about 11% of their population. It is possible that this 11% is composed almost entirely of elderly loyal Democratic voters who have tended to favor Clinton. We don’t have any exit polls from either the caucus or this pointless symbolic primary so we don’t know. The fact that Silver and Enten draw attention to the primary having a larger sample is, I believe, a tell that they are ignoring the possibility of bias completely, because any bias would make the sample size irrelevant.
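To make the bias-versus-variance point concrete, here is a minimal simulation sketch. Every number in it is made up for illustration (a candidate with 60% true support whose supporters show up at half the rate of opponents); none of it is an estimate for Washington or any real contest, and the function name is hypothetical.

```python
import random

def biased_sample_share(true_share, turnout_if_supporter, turnout_if_opponent,
                        n_people, seed=0):
    """Share of votes a candidate gets when supporters and opponents
    turn out at different rates (a biased sampling mechanism)."""
    rng = random.Random(seed)
    votes_for = 0
    votes_total = 0
    for _ in range(n_people):
        supporter = rng.random() < true_share
        turnout = turnout_if_supporter if supporter else turnout_if_opponent
        if rng.random() < turnout:
            votes_total += 1
            votes_for += supporter  # True counts as 1
    return votes_for / votes_total

# Hypothetical setup: 60% true support, but supporters turn out at 20%
# and opponents at 40%.
TRUE_SHARE = 0.60
for n in (1_000, 10_000, 100_000):
    estimate = biased_sample_share(TRUE_SHARE, 0.2, 0.4, n)
    print(f"population {n:>7}: observed share {estimate:.3f}")

# As n grows the observed share settles near
# 0.6*0.2 / (0.6*0.2 + 0.4*0.4) = 0.12 / 0.28, roughly 0.43,
# and not near the true 0.60. More voters shrink the noise, not the bias.
```

The larger populations give steadier numbers, but they are steady around the wrong answer, which is why a bigger symbolic primary is not automatically better evidence than a smaller caucus.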
What evidence could they have presented? They might have compared actual elections (instead of symbolic ones) between similar states. However, this would immediately disprove their entire claim. For example, Northern Marianas, Guam, American Samoa, and Hawaii all had caucuses. Sanders won Hawaii by a large margin. Clinton won the other 3 by equally large margins. What happened to the mysterious caucus effect in Guam? Why were the outcomes in Illinois, Missouri, and Iowa all roughly the same, instead of Sanders getting the eye-popping 20-25 point boost their model predicts in Iowa because of its caucus?
Pitfalls of regression analysis
Many of these questions would be easy to answer if they provided any details about their analysis. What were the units of analysis, states or counties/precincts? Since the primary/caucus rules apply at the state level, data at the county/precinct level would only help us in getting more accurate estimates of demographic effects, but not the caucus effect. If their data is at the state level, like mine, the most appropriate response would be to just throw statistics textbooks at them. They’ve got fewer than 50 rows of data that they are proposing to use to simultaneously estimate all the relevant demographic effects and the primary/caucus rule effects. This is insanity. In my own dataset, I have over 30 demographic predictor variables. If I did linear regression with ~50 observations and ~30 variables, I would have next to no confidence in the estimated coefficients, especially given the fact that a lot of demographic variables are highly collinear (e.g. income and education).
As an exercise in such idiocy, I just ran a few linear models and looked at the coefficients for the caucus effect and the open/closed (to independents) effect. In one model I got negative coefficients consistent with the 538 analysis. I noticed that some of the coefficient estimates were absurdly large in absolute value (because of collinearity) so I removed income and left education in the model… and the sign of the caucus effect changed, meaning that model predicted Bernie did better in primaries than in caucuses.
Maybe Nate Silver doesn’t actually know this. In that case, I am plainly frightened that the most famous “data journalist” doesn’t understand basic facts about introductory regression analysis. Another possibility is that they understand variability of regression coefficients and used a model with fewer variables… but that would only introduce bias by not adjusting for enough of the demographic effects.
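The instability described above can be reproduced in a toy setting. The sketch below is not my actual dataset and not 538's model: it simulates 50 fake "states" with two collinear predictors (labeled education and income), a caucus flag that is correlated with income, and an outcome with no caucus effect at all. It then counts how often the sign of the fitted caucus coefficient differs between a model with both predictors and a model with income dropped. All coefficients, thresholds, and function names are assumptions chosen for illustration.

```python
import random

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x

def ols(X, y):
    """Least-squares coefficients from the normal equations (X'X) b = X'y."""
    p = len(X[0])
    XtX = [[sum(r[i] * r[j] for r in X) for j in range(p)] for i in range(p)]
    Xty = [sum(r[i] * yi for r, yi in zip(X, y)) for i in range(p)]
    return solve(XtX, Xty)

def simulate_states(rng, n_states=50, rho=0.8):
    """One fake dataset. The TRUE caucus effect on the outcome is zero."""
    while True:
        rows = []
        for _ in range(n_states):
            edu = rng.gauss(0, 1)
            inc = rho * edu + (1 - rho ** 2) ** 0.5 * rng.gauss(0, 1)
            caucus = 1.0 if inc + rng.gauss(0, 0.5) > 0.8 else 0.0
            y = edu - inc + rng.gauss(0, 0.5)  # note: no caucus term
            rows.append((caucus, edu, inc, y))
        n_caucus = sum(r[0] for r in rows)
        if 3 <= n_caucus <= n_states - 3:  # avoid degenerate designs
            return rows

def sign_flip_rate(n_sims=200, seed=1):
    """Fraction of datasets where dropping income flips the caucus sign."""
    rng = random.Random(seed)
    flips = 0
    for _ in range(n_sims):
        rows = simulate_states(rng)
        y = [r[3] for r in rows]
        full = ols([[1.0, r[0], r[1], r[2]] for r in rows], y)
        reduced = ols([[1.0, r[0], r[1]] for r in rows], y)
        if (full[1] > 0) != (reduced[1] > 0):
            flips += 1
    return flips / n_sims

print("share of datasets with a caucus sign flip:", sign_flip_rate())
```

In this setup the full model estimates a caucus coefficient that is just noise around zero, while the reduced model lets the caucus flag absorb part of the omitted income effect. Both problems show up even with only two demographic variables; with ~30 of them and ~50 rows, it would be reasonable to expect things to get worse, not better.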
The immediate response to this article on social media was disbelief. People were stunned, because 538 pushed the race narrative so hard from the beginning and then claimed Iowa, an extremely white state, would have gone 24 points in Clinton’s favor if it had not been a caucus. To me this is most worrisome. Even smart people can forget what they learned in a class years ago. That’s one thing. But if you see that prediction and you don’t immediately realize your model has failed a sanity test, you don’t even have intuition anymore. You’re flying without the instruments and you’ve lost your senses. You may as well not even know the meaning of the variable names.
Nate Silver and Harry Enten, you get a C- for this ridiculous article. Retract it and you might regain some credibility, but you’re making it harder and harder to take anything you say seriously.