Monthly Archives: May 2016

FiveThirtyEight is wrong: the system IS rigged against Sanders

FiveThirtyEight is wrong

Nate Silver and Harry Enten claim The System Isn’t ‘Rigged’ Against Sanders. I’ve written at length already debunking their argument and drawing attention to the statistical malpractice they rely on to make it. To summarize, their argument is that caucuses have favored Sanders by suppressing the vote, and that somehow this disadvantages Clinton supporters more than Sanders supporters. Using a severely flawed statistical model they estimate that Clinton would have done 20-25% better in caucus states if they held primaries instead. To their credit, Silver and Enten attempted to address the question of having open vs closed primaries. But despite the sweeping title of their article (the system!), their focus is entirely too narrow. They identified two possible mechanisms by which the system could be influencing votes: caucuses vs primaries and whether or not the vote is open to independents.

The system IS rigged against Sanders

I conducted my own analysis to address some problems with theirs. Their model included percent of population that is black, percent Hispanic, whether the vote was a primary or caucus, whether it was open or closed to independents, and the national polling margin at the time of the vote. I’ll do several things slightly differently. Instead of national polling margin, I’ll just use the date–this is highly correlated with the national polling margin anyway. I did this more out of convenience than anything else, because my data already had date but not national polls. This difference is not important. Next, I’ll include just one more variable: whether or not the state has same-day registration. It just so happens that almost every caucus state also has same-day registration. Here are the coefficients of the resulting model:

Variable        Estimate   Std. Err.   p-value
(Intercept)      69.2271      4.2055     <0.01
Date              7.5544      7.1372      0.3
Deadline         -5.6614      5.1451      0.28
Type             -5.946       5.2812      0.27
Independents      2.232       2.7884      0.43
RaceBlack        -1.1415      0.1488     <0.01
RaceHispanic     -0.3431      0.1974      0.09

Let me break this down for you. Ignore the (Intercept) variable. The Date variable estimate of roughly 7.5 means that, on average, Sanders has gained 7.5% comparing the most recent votes to the first votes early on. The Deadline variable at about -5.7 means Sanders loses about 5.7% on average when states do not allow same-day registration. The Type variable means Sanders loses 5.9% in primaries compared to caucuses, again on average. (As an aside, if I leave out Deadline and have almost the same model as 538, the Type variable estimate is about -10.46, still not quite the absurd estimate Silver and Enten present). The Std. Err. and p-value columns tell us roughly how certain we can be that the estimate is good and that the effect isn’t really just zero. Many of the p-values are above the traditional 0.05 “significance” cutoff because this model is not very good.
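To make the reading of the Std. Err. column concrete, here is a minimal sketch that recomputes the t-statistic (estimate divided by standard error) and a rough 95% interval for each coefficient in the table above. The multiplier of 2 is an approximation I am assuming for illustration, not something taken from Rigged.R; the exact value depends on the residual degrees of freedom.

```python
# Estimates and standard errors copied from the first coefficient table.
coefficients = {
    "Date": (7.5544, 7.1372),
    "Deadline": (-5.6614, 5.1451),
    "Type": (-5.946, 5.2812),
    "Independents": (2.232, 2.7884),
    "RaceBlack": (-1.1415, 0.1488),
    "RaceHispanic": (-0.3431, 0.1974),
}


def summarize(estimate, std_err, z=2.0):
    """Return the t-statistic and a rough interval: estimate +/- z * std_err.

    z = 2.0 is an assumed approximation to the 95% critical value.
    """
    t_stat = estimate / std_err
    return t_stat, (estimate - z * std_err, estimate + z * std_err)


summary = {name: summarize(est, se) for name, (est, se) in coefficients.items()}

for name, (t_stat, (low, high)) in summary.items():
    includes_zero = low <= 0.0 <= high
    print(f"{name:13s} t = {t_stat:6.2f}  "
          f"interval = ({low:7.2f}, {high:7.2f})  includes zero: {includes_zero}")
```

Any coefficient whose rough interval includes zero, such as Type and Deadline here, is one where this model cannot distinguish the effect from no effect at all.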

Let’s try a better model. As I described in a previous post, Silver and Enten are not adjusting for other important demographic variables like age, income, and so on. Due to the limited sample size (I have 44 rows in my data), it’s not realistic to simultaneously estimate many demographic effects. I’ll just include two more variables: median age and percent of population having a high school degree or less.

Variable        Estimate   Std. Err.   p-value
(Intercept)     130.7826     23.6031     <0.01
Date              6.9082      6.3529      0.28
Deadline         -5.3206      4.4755      0.24
Type              0.3334      4.9385      0.95
Independents      0.983       2.5311      0.7
RaceBlack        -1.1497      0.1502     <0.01
RaceHispanic     -0.8587      0.2384     <0.01
MedianAge        -0.8648      0.5918      0.15
EduHSorless      -0.7326      0.2329     <0.01

Surprise! The Type estimate is now only 0.33 with a p-value of 0.95, meaning that after a slight adjustment for age and education the caucus effect essentially vanishes; read the same way as before, the positive sign says Sanders does 0.33% better in primaries than in caucuses. The Deadline estimate is still roughly the same. The fact that the Deadline estimate is stable to this change in the model gives me more confidence that its effect is real. If I include another variable, InternetAccess (an estimate of what percent of the population has access to high speed internet), the Deadline estimate becomes -4.87 and Type is -0.25, consistent with the above. If I also include some regional indicators for the South East, North East, and West (leaving the Midwest as part of the intercept), Deadline becomes -5.74 and Type becomes 0.55, meaning Sanders again slightly benefits from primaries relative to caucuses.
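The stability argument can be checked with simple arithmetic. This sketch only collects the Deadline and Type estimates quoted for the four specifications and compares how far each one moves. The short labels for the specifications are mine, and the range (max minus min) is just one convenient way to summarize spread.

```python
# (label, Deadline estimate, Type estimate) for each specification in the text.
specifications = [
    ("base model", -5.6614, -5.946),
    ("+ age, education", -5.3206, 0.3334),
    ("+ internet access", -4.87, -0.25),
    ("+ regional indicators", -5.74, 0.55),
]


def spread(values):
    """Range of an estimate across specifications (max minus min)."""
    return max(values) - min(values)


deadline_estimates = [deadline for _, deadline, _ in specifications]
type_estimates = [type_ for _, _, type_ in specifications]

deadline_spread = spread(deadline_estimates)
type_spread = spread(type_estimates)

print(f"Deadline moves by {deadline_spread:.2f} points and never changes sign")
print(f"Type moves by {type_spread:.2f} points and changes sign")
```

The Deadline estimate stays within about one point of itself, while the Type estimate swings by more than six points and crosses zero, which is the sense in which one is stable and the other is not.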

The data and code for this analysis are available in this GitHub repo in the files DemPrimaryData.csv and Rigged.R

It’s the voter registration deadlines, stupid

I shouldn’t be writing any of this. I’m supposed to be finishing my thesis right now. So I’m not going to spend the time to find data for primary turnout this year and do a regression to show that turnout is depressed by early registration deadlines. Instead, I will cite several facts which are either obvious or easy to verify with Google.

  1. Young people are more likely to be first time voters.
  2. Young people and first time voters are less likely to be registered, and if they are registered they are more likely to be registered as Independents.
  3. Young people and first time voters are less likely to know that registration deadlines exist and can be surprisingly early.
  4. Some states with closed primaries, like New York, have even earlier deadlines for party affiliation changes. New York’s was back in October of 2015, four days before the first Democratic debate. New York’s turnout was also second lowest of any state…

Nate Silver and Harry Enten ignored all of this. They conducted a highly flawed statistical analysis that left out important demographic controls and had no data at all related to registration deadlines or other forms of voter suppression. Enten in particular with his background in political science should know there is a vast literature of research on voter suppression involving things like registration deadlines and voter ID requirements. By pretending that the caucus effect is the only one that matters, they claim to answer a far bigger and far more important question than they actually do, and the answer they give for their limited question is still flawed.

Bernie might have been winning…

My own analysis, controlling for more demographic variables and checking that my results are stable when I add or remove several of these controls, shows that Sanders probably lost at least 5% on average in states that did not allow same-day registration. Sanders currently has about 45% of the delegates. It’s impossible to say anything counter-factual about this with certainty, but try to imagine how different things would be. The first Super Tuesday would have been far less devastating, and we may never have seen the widespread media narrative that developed about Clinton’s commanding lead in “delegate math.” The following states might have switched from a loss/tie to a tie/victory.

State            Advanced days   Vote % Bernie
North Carolina        23             40.76
Arizona               28             41.39
New York              23             42.01
Ohio                  27             43.13
Pennsylvania          28             43.56
Kentucky              28             46.33
Connecticut            1             46.42
Illinois              27             48.61
Massachusetts         19             48.69
Missouri              26             49.36
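As a rough what-if on the table above, this sketch adds a flat 5 points to Sanders's share in each state and lists the states that end up above 50%. Both the uniform shift and the 50% cutoff are simplifying assumptions of mine: the regression gives an average effect rather than a state-by-state one, and a state can be won with less than 50% when other candidates take votes.

```python
# Sanders vote share (%) by state, copied from the table above.
sanders_share = {
    "North Carolina": 40.76, "Arizona": 41.39, "New York": 42.01,
    "Ohio": 43.13, "Pennsylvania": 43.56, "Kentucky": 46.33,
    "Connecticut": 46.42, "Illinois": 48.61, "Massachusetts": 48.69,
    "Missouri": 49.36,
}

SHIFT = 5.0       # assumed uniform gain, in percentage points
THRESHOLD = 50.0  # crude stand-in for "wins the state"


def shifted(shares, shift):
    """Apply the same additive shift to every state's share."""
    return {state: share + shift for state, share in shares.items()}


what_if = shifted(sanders_share, SHIFT)
flipped = [state for state, share in what_if.items() if share > THRESHOLD]

for state, share in what_if.items():
    marker = "  <- above 50%" if share > THRESHOLD else ""
    print(f"{state:15s} {share:6.2f}{marker}")
```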

Conservatively, Bernie might have won 4 or 5 more states, and might have come close to a tie in New York. The clear change-point in this graph might not have happened:

[Figure: “depressed”]

I think it’s safe to say that the lack of same-day registration is a very significant factor in Clinton’s lead. In all of this, I did not even begin to ask how it might have been different if closed primaries were open to independents.

Statistical malpractice at FiveThirtyEight

Others have written about 538’s recent spate of journalistic/scientific malpractice (e.g. formulating a question in a limited manner that ignores the part of the data that doesn’t conform to one’s hypothesis or narrative). I’m going to write about statistical malpractice and then tie it back in to scientific/journalistic standards at the end. I previously wrote about this but went a bit over the top with prose and punditry, so this version is going to be short and to the point.

What did they do?

In a recent article titled The System Isn’t ‘Rigged’ Against Sanders, they argue that Sanders has benefited from some kind of biasing effect caused by caucuses, and that if all states held primaries he would be significantly further behind. They do some kind of regression analysis to “control for” demographic differences and estimate this mysterious caucus effect. And they predict what the vote might have been if a state that held a caucus had been a primary. For example, they claim that in Iowa the result would have been Clinton winning by 24% instead of the tie that occurred.

Why is this wrong?

First of all, they begin the analysis by comparing vote margins in caucus states versus primary states to point out that Sanders has won more votes among all caucus states and Clinton has won more votes among all primary states. This comparison is pointless, which they acknowledge, because the states which held caucuses are demographically different from the states that held primaries. This method could only give a heavily biased estimate of the caucus effect because it does not control for demographics or other important differences. They included this pointless analysis anyway, along with its own gigantic table, right at the beginning of the article, perhaps in an attempt to anchor the readers to the conclusion they’re trying to show.

Omitted variable bias

Next they carried out a regression analysis attempting to control for some demographic differences and get an unbiased estimate of the caucus effect. Here’s what they said:

The model considers each 2016 contest and controls for (i) the black and Hispanic share of the Democratic vote in that state in the 2008 general election, (ii) whether that primary or caucus is “open” to independent voters unaffiliated with a political party, and (iii) the margin in national primary polls at the time the contest is held.

So they reduced all demographic differences between states with primaries and states with caucuses to two variables: proportion of black voters and proportion of Hispanic voters. Age? Income? Education? Geographic region? Religious differences? Economic indicators, like unemployment? Apparently none of these things matter, according to Silver and Enten. (They should know better: by their own admission they have tried various other models at different times during this election season controlling for other variables, and doing ad hoc things like leaving out Arkansas, New York, and Vermont on the premise that “home state” advantages would skew the results).

This is obviously incorrect: we know very well that at minimum age is an incredibly important variable. Young voters of all races prefer Sanders, and older voters of all races prefer Clinton. They might believe that states don’t differ much on age, but for example Florida has only 21.5% of its population age 18-34 while for Alaska this number is nearly 4 percentage points higher.

By leaving out many factors which are important for determining differences between states that could affect the vote outcomes, their analysis is subject to omitted variable bias. The practical consequence of this is that essentially we have no way of knowing if the caucus effect they estimated is anywhere near the truth. They might even have the wrong sign, meaning it is possible that caucuses actually hurt and do not help Sanders.
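Omitted variable bias is easy to demonstrate on synthetic data. In the sketch below every number is made up for illustration: caucus states are generated to be younger, the vote depends only on the share of young voters, and the true caucus effect is exactly zero. Regressing the vote on the caucus indicator alone still produces a large "caucus effect", which disappears once age is controlled for. The adjustment uses the standard trick of regressing on the part of the caucus indicator that age does not explain.

```python
import random


def mean(values):
    return sum(values) / len(values)


def slope(x, y):
    """OLS slope from regressing y on x (with an intercept)."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx


def residuals(x, y):
    """Residuals of y after regressing it on x (with an intercept)."""
    b = slope(x, y)
    a = mean(y) - b * mean(x)
    return [yi - (a + b * xi) for xi, yi in zip(x, y)]


rng = random.Random(0)
n = 2000  # synthetic "states"; far more than exist, to keep noise small

caucus = [1.0 if rng.random() < 0.3 else 0.0 for _ in range(n)]
# Caucus states are younger by construction (hypothetical numbers).
young = [20.0 + 4.0 * c + rng.gauss(0.0, 1.0) for c in caucus]
# The vote depends on age only: the true caucus effect is zero.
sanders = [30.0 + 1.5 * x + rng.gauss(0.0, 2.0) for x in young]

# Age omitted: the caucus indicator absorbs the age effect.
naive_effect = slope(caucus, sanders)

# Age controlled: use only the variation in caucus not explained by age.
caucus_given_young = residuals(young, caucus)
adjusted_effect = slope(caucus_given_young, sanders)

print(f"caucus effect, age omitted:    {naive_effect:.2f}")
print(f"caucus effect, age controlled: {adjusted_effect:.2f}")
```

With the real data the direction and size of the bias are unknown, which is exactly the problem: the estimate could be inflated, shrunk, or have its sign flipped.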

Collinearity and variance

The bias issue is already enough to completely kill their result, but there’s more. They also included the national polls in their regression model. This is problematic because the national polls are highly correlated with time. What else is correlated with time? Well, most caucuses happened after March 5th. There were 5 caucuses before and including March 1st, and March 1st is the day where Clinton expanded her votes the most. There were 11 caucuses later. So, the caucus effect is also probably correlated with time, and hence correlated with the national polls. In regression models, when multiple predictor variables are correlated with each other, it becomes harder to estimate their effects. The variance of the estimates will be high. This means it is more likely that if the data estimates a large caucus effect, its size is just due to noise rather than signal. And why would national polls have any relevance for state votes? Silver wrote a book about basically this, so he and Enten really should know better.
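The cost of correlated predictors can be quantified with the variance inflation factor. For the simple case of two predictors with correlation r, the variance of each coefficient estimate is multiplied by 1 / (1 - r^2) relative to the uncorrelated case. The correlations below are hypothetical values chosen for illustration; I have not computed the actual correlation between the caucus indicator and the national polls.

```python
def variance_inflation(r):
    """Variance inflation factor in the two-predictor case: 1 / (1 - r^2)."""
    if not -1.0 < r < 1.0:
        raise ValueError("r must lie strictly between -1 and 1")
    return 1.0 / (1.0 - r * r)


# Hypothetical correlations between two predictors.
for r in (0.0, 0.5, 0.8, 0.9, 0.95):
    print(f"r = {r:4.2f}  variance multiplied by {variance_inflation(r):5.2f}")
```

The inflation grows slowly at first and then very quickly as the correlation approaches 1, so even a moderately strong relationship between caucus timing and national polls would make the caucus estimate noticeably noisier.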

How could they correct this?

They should just retract the article, honestly. There are statistical methods that could be used in scenarios like this (I’m thinking of propensity score matching, for example), but the fundamental limitation is the size of the data. There are only about 16 states with caucuses. There are many potentially confounding effects, like the demographics I mentioned earlier. Further, even estimating the demographic effects to properly control for them may not be possible due to further confounding. For example, New York is one of the youngest states, but we have no idea how much the outcome there was affected by voter suppression–recall that over 100,000 voters had their registrations purged in just Brooklyn alone. Minority voters have gone for Clinton, but they were also directly targeted by pro-Clinton SuperPAC spending in the early voting states.

I think the only way to try to answer the underlying question concerning the effect of caucuses would be to collect new data. We would need polls/interviews of many registered voters, both ones who voted and who didn’t, preferably on either side of a state border where the two states are very similar with the only difference of importance being that one holds caucuses and the other primaries. Something like this could be done by looking at the vote outcomes for neighboring states. Many people have pointed out the Iowa +24% for Clinton estimate is absurd given that the surrounding states which held primaries were close to ties.

Ask more questions

Apparently, Silver and Enten aren’t interested in the whys. Suppose their analysis didn’t have holes in it large enough to drive a truck full of statistics textbooks through. Shouldn’t they wonder why caucuses would help Sanders? There is empirical evidence that strong support for Sanders is more widespread than strong support for Clinton. But there is also plenty of evidence that habitual/dutiful voters tend to be much older and favor Clinton. So if caucuses suppress turnout, why would the young strong supporters necessarily outnumber the older, dutiful voters?

Silver and Enten probably have one answer ready: race! It’s their favorite “demographic destiny” variable, after all. And I do believe that racial minorities might be less likely to participate in caucuses. However, the state of Iowa has about 2.9% black population and 3.7% Hispanic. So how can race explain a 24% increase for Clinton if Iowa had been a primary? Also, if this was the hypothesis for why the caucus effect exists, they should have included interaction terms between the caucus variable and race variables in their regression model.
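A back-of-the-envelope bound shows how little room race leaves in Iowa. The sketch takes the most extreme scenario in Clinton's favor: every black and Hispanic voter stayed home from the caucus and every one of them would have voted for Clinton in a primary, while everyone else still splits evenly as in the actual tie. It assumes, purely for illustration, that these groups make up the same share of the electorate as of the population.

```python
BLACK_SHARE = 0.029     # Iowa figure quoted in the text
HISPANIC_SHARE = 0.037  # Iowa figure quoted in the text


def max_clinton_margin(minority_share, caucus_margin=0.0):
    """Upper bound on Clinton's primary margin (Clinton minus Sanders, as a
    fraction of the electorate) if minority voters were entirely absent from
    the caucus and all voted for Clinton in a primary.

    The rest of the electorate keeps the caucus margin; the added voters
    contribute a margin of +1 on their share.
    """
    rest = 1.0 - minority_share
    return caucus_margin * rest + minority_share


minority = BLACK_SHARE + HISPANIC_SHARE
bound = max_clinton_margin(minority)

print(f"largest possible Clinton margin from race alone: {100 * bound:.1f} points")
```

Even under these extreme assumptions the margin tops out at under 7 points, nowhere near 24.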

Bigger problems with data journalism

As a statistician, I can mentally reverse-engineer their description of their analysis and make an educated guess about what precise mathematical model they’re using. But the audience for data journalism is the general public, not professional statisticians. FiveThirtyEight has a large audience and an even larger indirect audience through the influence of their reporting on other, less data-oriented news outlets. They are perceived as having a greater degree of rigor in their journalism because of the emphasis on analyzing data. But they are subject to none of the checks of peer-review in academic analysis. They don’t make their work reproducible by making the data and code openly available. I can understand the difficulty in finding a balance between giving more mathematical details and appealing to a wider audience. But the downside of the compromise they currently use is that they get all the benefits of appearing to be scientific without being subject to any of the aspects of the scientific method that actually make it rigorous.

 

 

Nate Silver: data contortionist

FiveThirtyEight’s Nate Silver and Harry Enten are not data journalists (or “empirical journalists”), they are data contortionists. Throughout this entire election season their coverage has been so consistently inaccurate, both in description and prediction, that their work can only be viewed generously as a good parody of data analysis or journalism. This did not begin in 2015. Silver has in the past headlined articles saying that early polls are not important, only to show in the body of his article that they actually are somewhat predictive (R^2 of 0.4) of election outcomes. If his writing had been subjected to anything like the peer-review process that data analyses must pass in an academic setting, these inconsistencies between actual results and narrative would have raised many questions. But his celebrity status affords him a large audience with no real checks, so he can say whatever he wants regardless of what the data shows. And that’s what he does.

For example, it is particularly surprising that “data journalists” would ever tell us to ignore the data. That they consistently do so whenever the data shows an outcome that doesn’t fit with the status quo, or would raise doubts concerning their obviously favored candidate, turns mere surprise into suspicion. The straw that broke the camel’s back, to me, is a recent article claiming that the Democratic primary season has actually been rigged in Sanders’s favor. This is a classic example of “big lie” propaganda: promote an idea so incredibly opposite to the truth that people cannot believe anyone would be able to lie so audaciously and conclude they must be telling the truth. In this case, their motivation is probably to role play as bold contrarians. It would be much easier to believe this act were it not for the fact that they are always biased in the same direction, and that direction is perfectly aligned with the rest of the media establishment, whose bias against Sanders verges on bloodlust.

Background

To write the most ridiculous possible article it helps to begin with an absurd premise, so I must pause and explain the background. In this case, media reporting concerning Clinton’s lead in both the popular vote and delegates has been consistently inaccurate. Many news agencies have been including superdelegates in their delegate totals despite the fact that superdelegates do not vote until the convention and this has been explained to them by the DNC. Many news outlets report that Clinton’s lead in the popular vote is about 3 million, when in fact it is closer to 2.5 million. The incorrect number completely ignores votes from caucus states where the number of individual voters is usually not known, and the “correct” number is an estimate based on aggregating guesses by caucus precinct captains (which could be highly inaccurate). Sanders supporters have been angry about this for months, because we think our guy’s chances are being hurt by the media portraying him as a lost cause. Here is a timeline of 538 articles that all essentially say the same thing: Bernie has no chance, Clinton will certainly win, abandon all hope:

This is not exhaustive; I stopped opening browser tabs because my machine was getting slow. And this is just from 538. The Upshot over at the New York Times has been almost an exact clone in terms of their coverage. Here they are dismissing Bernie immediately following his astounding upset in Michigan. And if we lower the bar to non-data journalism we can probably find thousands of other articles and TV clips pushing the same narrative: Bernie has no chance. They all said basically the same about Trump almost right up until he won. They’re probably right about Bernie, but they definitely helped make it true in his case with their absurd doom and gloom coverage. And they also exclusively reported the single most boring and uninsightful fact about his amazing campaign, completely missing what is history in the making and probably the most important change in politics in America for decades.

Against this backdrop, Shaun King at the New York Daily News was correcting the claim about Clinton’s 3 million vote lead by pointing out it did not include caucuses.

The Clinton campaign knows this. Their friends in the media know this, but they continue to allow the campaign to tout that 3 million number even though they know full well that it’s not accurate. The Democratic primaries and caucuses simply don’t have accurate popular vote totals.

He also made several other points about the unfairness of the Democratic primary, with his headline being about superdelegates. He pointed out that if superdelegates had split the opposite way than they have now, counting their votes (as the media does) would actually put Sanders ahead of Clinton.

The straw that broke the camel’s back

In response to all of this, Silver and Enten ignored every single thing about the entire process except for caucuses and wrote an article claiming that they have unfairly benefitted Sanders. It is true that caucuses generally have much lower turnout than primaries, despite the fact that they are usually held on the weekend while primaries are usually held on Tuesdays. The claim usually cited for this is that they take a long time. Except, in this primary season, due to systematic attempts at voter suppression in states like Arizona, many primary voters have ended up standing in line for longer than they would have in a caucus. These facts aside, let’s present the logic of Silver and Enten’s argument.

  1. Fact: Sanders has done better in caucus states than primary states.
  2. Claim: The demographic differences between these states are not enough to explain this over-performance.
  3. Fact: Sanders won the caucuses of Washington and Nebraska, and Clinton won their symbolic primaries.

They then present the predictions of a model based on demographics and the caucus effect to claim that if all states had done primaries, Clinton would have won some of the states where Sanders won and would currently be leading by much more. In other words, the suppressed voter turnout in caucuses benefited Sanders, and furthermore his claim that he wins when voter turnout is high is false.

It is a fact that Sanders has done better in caucus states. However, their claim that demographic differences cannot explain this is an essential point to their argument and they present exactly zero evidence to prove it. Fact #3 above is not even remotely close to being evidence in favor of this for reasons any statistically-literate person would know. A symbolic primary is likely to show a biased sampling of voters, and this bias is probably going to be in favor of people who were upset their candidate lost the caucus. When a sampling method is biased, it doesn’t matter how large you make the sample. Increasing sample size reduces variance, but not bias, unless your sample becomes the entire population. The symbolic primary in Washington, for example, represents about 11% of their population. It is possible that this 11% is composed almost entirely of elderly loyal Democratic voters who have tended to favor Clinton. We don’t have any exit polls from either the caucus or this pointless symbolic primary so we don’t know. The fact that Silver and Enten draw attention to the primary having a larger sample is, I believe, a tell that they are ignoring the possibility of bias completely, because any bias would make the sample size irrelevant.
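The point about sample size and bias can be seen in a small simulation. All three constants below are hypothetical, chosen only to illustrate the mechanism: Clinton has 47% support in the full electorate, but her supporters are twice as likely as Sanders supporters to bother with a symbolic vote. Larger samples settle down more and more tightly, but on the wrong number.

```python
import random

TRUE_CLINTON = 0.47      # hypothetical true support in the full electorate
TURNOUT_CLINTON = 0.60   # hypothetical chance a Clinton supporter participates
TURNOUT_SANDERS = 0.30   # hypothetical chance a Sanders supporter participates


def biased_sample_share(n, rng):
    """Clinton's share among the first n people who choose to participate."""
    clinton = 0
    collected = 0
    while collected < n:
        supports_clinton = rng.random() < TRUE_CLINTON
        turnout = TURNOUT_CLINTON if supports_clinton else TURNOUT_SANDERS
        if rng.random() < turnout:
            collected += 1
            clinton += supports_clinton
    return clinton / n


rng = random.Random(1)
estimates = {n: biased_sample_share(n, rng) for n in (100, 10_000, 100_000)}

# The value the biased procedure converges to as n grows.
limit = (TRUE_CLINTON * TURNOUT_CLINTON) / (
    TRUE_CLINTON * TURNOUT_CLINTON + (1 - TRUE_CLINTON) * TURNOUT_SANDERS
)

for n, share in estimates.items():
    print(f"n = {n:7d}  Clinton share in sample = {share:.3f}")
print(f"true support = {TRUE_CLINTON:.3f}, biased limit = {limit:.3f}")
```

Growing the sample shrinks the noise around the biased limit, not the gap between that limit and the truth.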

What evidence could they have presented? They might have compared actual elections (instead of symbolic ones) between similar states. However, this would immediately disprove their entire claim. For example, Northern Marianas, Guam, American Samoa, and Hawaii all had caucuses. Sanders won Hawaii by a large margin. Clinton won the other 3 by equally large margins. What happened to the mysterious caucus effect in Guam? Why were the outcomes in Illinois, Missouri, and Iowa all roughly the same, instead of Sanders getting the eye-popping 20-25 point boost their model predicts in Iowa because of its caucus?

Pitfalls of regression analysis

Many of these questions would be easy to answer if they provided any details about their analysis. What were the units of analysis, states or counties/precincts? Since the primary/caucus rules apply at the state level, data at the county/precinct level would only help us in getting more accurate estimates of demographic effects, but not the caucus effect. If their data is at the state level, like mine, the most appropriate response would be to just throw statistics textbooks at them. They’ve got fewer than 50 rows of data that they are proposing to use to simultaneously estimate all the relevant demographic effects and the primary/caucus rule effects. This is insanity. In my own dataset, I have over 30 demographic predictor variables. If I did linear regression with ~50 observations and ~30 variables, I would have next to no confidence in the estimated coefficients, especially given the fact that a lot of demographic variables are highly collinear (e.g. income and education).

As an exercise in such idiocy, I just ran a few linear models and looked at the coefficients for the caucus effect and the open/closed (to independents) effect. In one model I got negative coefficients consistent with the 538 analysis. I noticed that some of the coefficient estimates were absurdly large in absolute value (because of collinearity) so I removed income and left education in the model… and the sign of the caucus effect changed, meaning that model predicted Bernie did better in primaries than in caucuses.

Maybe Nate Silver doesn’t actually know this. In that case, I am plainly frightened that the most famous “data journalist” doesn’t understand basic facts about introductory regression analysis. Another possibility is that they understand variability of regression coefficients and used a model with fewer variables… but that would only introduce bias by not adjusting for enough of the demographic effects.

Postmortem

The immediate response to this article on social media was disbelief. People were stunned, because 538 pushed the race narrative so hard from the beginning and then claimed Iowa, an extremely white state, would have gone 24 points in Clinton’s favor if it had not been a caucus. To me this is most worrisome. Even smart people can forget what they learned in a class years ago. That’s one thing. But if you see that prediction and you don’t immediately realize your model has failed a sanity test, you don’t even have intuition anymore. You’re flying without the instruments and you’ve lost your senses. You may as well not even know the meaning of the variable names.

Nate Silver and Harry Enten, you get a C- for this ridiculous article. Retract it and you might regain some credibility, but you’re making it harder and harder to take anything you say seriously.

 

A brief thought on the Democratic Party

According to polls, a majority of all voters prefer most of Bernie Sanders’s big policy proposals. By absurd margins, the next generation of progressive voters (and many of the libertarians as well) prefer Bernie as the candidate to enact those policies. The same is true of the largest group of voters in America: independents, who outnumber both Democrats and Republicans.

Despite all of this, and despite him winning over 45% of the delegates so far, the Democratic Party establishment is doing everything in its power to stop the candidacy of Bernie Sanders and to curb his influence on the Democratic convention. Not only do they want to prevent him from being the nominee, they want to prevent his policy proposals from becoming part of their platform. This is true even though a majority of all Americans favor his policy proposals.

Superdelegates and party insiders have gone to Hillary by absurd margins even when the voters in their states made it very loud and clear they prefer Bernie. SuperPACs have been spending to boost her in the primary instead of waiting until the general election to defeat Republicans. Think about that. SuperPACs are spending to defeat Bernie and his coalition of young voters and independents. In fact, they specifically targeted minority voters with ads associating Hillary with the historic and enormously favorable presidency of Barack Obama (despite the fact that she was his opponent in 2008 and tried all kinds of racist attacks against him).

My argument is not to disenfranchise the people who voted for Hillary. But that’s absurd anyway, because it’s the opposite of what’s happening. She has less than 55% of the delegates but she is on track to have 100% control of the Democratic Party. Again, this is despite the fact that a majority of all voters prefer Bernie’s policies even if they prefer her as the candidate. Given the low voter participation in our sad excuse for a Democracy, the restrictive rules in many states disallowing independents from voting in the primaries, and strong statistical evidence of systematic election fraud favoring Hillary, I’m not even sure a majority of voters prefer her as the candidate.