Others have written about 538’s recent spate of journalistic/scientific malpractice (e.g. formulating a question in a limited manner that ignores the part of the data that doesn’t conform to one’s hypothesis or narrative). I’m going to write about statistical malpractice and then tie it back in to scientific/journalistic standards at the end. I previously wrote about this but went a bit over the top with prose and punditry, so this version is going to be short and to the point.
What did they do?
In a recent article titled The System Isn’t ‘Rigged’ Against Sanders, they argue that Sanders has benefited from some kind of biasing effect caused by caucuses, and that if all states held primaries he would be significantly further behind. They do some kind of regression analysis to “control for” demographic differences and estimate this mysterious caucus effect. And they predict what the vote might have been if a state that held a caucus had held a primary instead. For example, they claim that in Iowa the result would have been Clinton winning by 24% instead of the tie that occurred.
Why is this wrong?
First of all, they begin the analysis by comparing vote margins in caucus states versus primary states to point out that Sanders has won more votes among all caucus states and Clinton has won more votes among all primary states. This comparison is pointless, which they acknowledge, because the states which held caucuses are demographically different from the states that held primaries. This method could only give a heavily biased estimate of the caucus effect because it does not control for demographics or other important differences. They included this pointless analysis anyway, along with its own gigantic table, right at the beginning of the article, perhaps in an attempt to anchor the readers to the conclusion they’re trying to show.
Omitted variable bias
Next they carried out a regression analysis attempting to control for some demographic differences and get an unbiased estimate of the caucus effect. Here’s what they said:
The model considers each 2016 contest and controls for (i) the black and Hispanic share of the Democratic vote in that state in the 2008 general election, (ii) whether that primary or caucus is “open” to independent voters unaffiliated with a political party, and (iii) the margin in national primary polls at the time the contest is held.
So they reduced all demographic differences between states with primaries and states with caucuses to two variables: proportion of black voters and proportion of Hispanic voters. Age? Income? Education? Geographic region? Religious differences? Economic indicators, like unemployment? Apparently none of these things matter, according to Silver and Enten. (They should know better: by their own admission they have tried various other models at different times during this election season, controlling for other variables and doing ad hoc things like leaving out Arkansas, New York, and Vermont on the premise that “home state” advantages would skew the results.)
This is obviously incorrect. We know very well that, at minimum, age is an incredibly important variable. Young voters of all races prefer Sanders, and older voters of all races prefer Clinton. They might believe that states don’t differ much on age, but for example Florida has only 21.5% of its population age 18-34 while for Alaska this number is nearly 4% higher.
By leaving out many factors which are important for determining differences between states that could affect the vote outcomes, their analysis is subject to omitted variable bias. The practical consequence of this is that essentially we have no way of knowing if the caucus effect they estimated is anywhere near the truth. They might even have the wrong sign, meaning it is possible that caucuses actually hurt and do not help Sanders.
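To make the sign-flip point concrete, here is a minimal simulation sketch. Everything in it is synthetic and hypothetical: the variable names (caucus, young_share, margin), the effect sizes, and the sample size are my own assumptions for illustration, not 538's data or model. In this made-up world the true caucus effect on Sanders's margin is negative, but caucus units also tend to be younger, and youth strongly favors Sanders.

```python
import numpy as np

def fit_ols(X, y):
    """Least-squares coefficients for y ~ X (X must already contain an intercept column)."""
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef

rng = np.random.default_rng(0)
n = 5000  # large synthetic sample so that bias, not noise, drives the result

# Hypothetical data-generating process (all numbers invented for illustration).
caucus = rng.integers(0, 2, size=n).astype(float)           # 1 = caucus, 0 = primary
young_share = 0.5 * caucus + rng.normal(0.0, 1.0, size=n)   # correlated with caucus
true_caucus_effect = -1.0   # caucuses actually HURT Sanders in this toy world
true_young_effect = 4.0     # younger electorates help Sanders
margin = (true_caucus_effect * caucus
          + true_young_effect * young_share
          + rng.normal(0.0, 1.0, size=n))

ones = np.ones(n)
# Regression that controls for the confounder.
caucus_full = fit_ols(np.column_stack([ones, caucus, young_share]), margin)[1]
# Regression that omits the confounder.
caucus_short = fit_ols(np.column_stack([ones, caucus]), margin)[1]

print(f"caucus coefficient, controlling for age: {caucus_full:.2f}")
print(f"caucus coefficient, age omitted:         {caucus_short:.2f}")
```

The regression that omits age attributes the youth effect to the caucus variable, so its caucus coefficient comes out positive even though the true effect is negative. With real data we never get to see the "true" line, which is exactly the problem.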
Collinearity and variance
The bias issue is already enough to completely kill their result, but there’s more. They also included the national polls in their regression model. This is problematic because the national polls are highly correlated with time. What else is correlated with time? Well, most caucuses happened after March 5th. There were 5 caucuses before and including March 1st, and March 1st is the day when Clinton expanded her votes the most. There were 11 caucuses later. So the caucus variable is also probably correlated with time, and hence correlated with the national polls. In regression models, when multiple predictor variables are correlated with each other, it becomes harder to separate their effects, and the variance of the estimates will be high. This means that if the model estimates a large caucus effect, it is more likely that its size is just due to noise rather than signal. And why would national polls have any relevance for state votes? Silver wrote a book about basically this, so he and Enten really should know better.
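The variance inflation from correlated predictors is easy to see in a small simulation. This is only a sketch under assumptions of my own: the function name caucus_estimate_spread is hypothetical, the predictors are continuous stand-ins rather than a real caucus indicator and real polling numbers, and the 40 "contests" per simulated dataset is an arbitrary choice. The true coefficient on the caucus-like predictor is the same in both scenarios; only its correlation with the polls-like predictor changes.

```python
import numpy as np

def caucus_estimate_spread(rho, n_contests=40, n_sims=2000, seed=0):
    """Std. dev. of the estimated 'caucus' coefficient across simulated datasets,
    when the caucus-like and polls-like predictors have correlation rho."""
    rng = np.random.default_rng(seed)
    estimates = np.empty(n_sims)
    for i in range(n_sims):
        z1 = rng.normal(size=n_contests)
        z2 = rng.normal(size=n_contests)
        caucus_like = z1                                   # stand-in for the caucus variable
        polls = rho * z1 + np.sqrt(1.0 - rho**2) * z2      # stand-in for national polls
        y = 1.0 * caucus_like + 1.0 * polls + rng.normal(size=n_contests)
        X = np.column_stack([np.ones(n_contests), caucus_like, polls])
        coef, *_ = np.linalg.lstsq(X, y, rcond=None)
        estimates[i] = coef[1]                             # coefficient on caucus_like
    return estimates.std()

spread_uncorrelated = caucus_estimate_spread(rho=0.0)
spread_correlated = caucus_estimate_spread(rho=0.9)

print(f"spread of caucus estimate, rho = 0.0: {spread_uncorrelated:.3f}")
print(f"spread of caucus estimate, rho = 0.9: {spread_correlated:.3f}")
```

Nothing about the true effect differs between the two runs, yet the estimate bounces around far more when the predictors are strongly correlated. A single fitted coefficient from such a model tells you much less than it appears to.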
How could they correct this?
They should just retract the article, honestly. There are statistical methods that could be used in scenarios like this (I’m thinking of propensity score matching, for example), but the fundamental limitation is the size of the data. There are only about 16 states with caucuses. There are many potentially confounding effects, like the demographics I mentioned earlier. Further, even estimating the demographic effects to properly control for them may not be possible due to further confounding. For example, New York is one of the youngest states, but we have no idea how much the outcome there was affected by voter suppression. Recall that over 100,000 voters had their registrations purged in Brooklyn alone. Minority voters have gone for Clinton, but they were also directly targeted by pro-Clinton SuperPAC spending in the early voting states.
I think the only way to try to answer the underlying question concerning the effect of caucuses would be to collect new data. We would need polls/interviews of many registered voters, both those who voted and those who didn’t, preferably on either side of a state border where the two states are very similar and the only difference of importance is that one holds caucuses and the other primaries. A rough version of this could be done by looking at the vote outcomes for neighboring states. Many people have pointed out that the Iowa +24% for Clinton estimate is absurd given that the surrounding states which held primaries were close to ties.
Ask more questions
Apparently, Silver and Enten aren’t interested in the whys. Suppose their analysis didn’t have holes in it large enough to drive a truck full of statistics textbooks through. Shouldn’t they wonder why caucuses would help Sanders? There is empirical evidence that strong support for Sanders is more widespread than strong support for Clinton. But there is also plenty of evidence that habitual/dutiful voters tend to be much older and favor Clinton. So if caucuses suppress turnout, why would the young strong supporters necessarily outnumber the older, dutiful voters?
Silver and Enten probably have one answer ready: race! It’s their favorite “demographic destiny” variable, after all. And I do believe that racial minorities might be less likely to participate in caucuses. However, the state of Iowa has about 2.9% black population and 3.7% Hispanic. So how can race explain a 24% increase for Clinton if Iowa had held a primary? Also, if this were the hypothesis for why the caucus effect exists, they should have included interaction terms between the caucus variable and race variables in their regression model.
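A back-of-the-envelope bound shows how far race alone falls short. The sketch below makes deliberately extreme assumptions, all of them mine and all of them generous to the race explanation: the minority share of the electorate equals the population shares quoted above, no minority voters at all took part in the caucus, every one of them would have voted in a primary, every one of them would have voted for Clinton, and everyone else would still have split evenly. The function name max_margin_shift is hypothetical.

```python
def max_margin_shift(minority_share, baseline_margin=0.0):
    """Clinton's margin if a group making up `minority_share` of the electorate
    votes 100% for Clinton and the rest keep `baseline_margin` (0.0 = a tie)."""
    return minority_share * 1.0 + (1.0 - minority_share) * baseline_margin

# Population shares quoted in the text: 2.9% black, 3.7% Hispanic.
iowa_minority_share = 0.029 + 0.037
bound = max_margin_shift(iowa_minority_share)

print(f"most generous race-only shift toward Clinton: {bound:.1%}")
print(f"shift claimed for Iowa:                       {0.24:.1%}")
```

Even under these unrealistically favorable assumptions, the shift is a small fraction of the 24% claimed, so race cannot be the main mechanism in Iowa.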
Bigger problems with data journalism
As a statistician, I can mentally reverse-engineer their description of their analysis and make an educated guess about what precise mathematical model they’re using. But the audience for data journalism is the general public, not professional statisticians. FiveThirtyEight has a large audience and an even larger indirect audience through the influence of their reporting on other, less data-oriented news outlets. They are perceived as having a greater degree of rigor in their journalism because of the emphasis on analyzing data. But they are subject to none of the checks of peer review in academic analysis. They don’t make their work reproducible by making the data and code openly available. I can understand the difficulty in finding a balance between giving more mathematical details and appealing to a wider audience. But the downside of the compromise they currently use is that they get all the benefits of appearing to be scientific without being subject to any of the aspects of the scientific method that actually make it rigorous.