In case you haven’t heard, Election Day is Nov. 3rd! While ads on TV, text messages, social media, and pretty much any other channel you can think of are hard at work encouraging everyone to get out and vote, data scientists have been hard at work building models that try to use the plethora of available data to understand voter behavior and predict the outcome of this election.
Much like everything else in our current political climate, our relationship with predictive models and polling data has been complex over the past two election cycles (if not longer). What first began as a blog post about how prediction strategies based on polling data have adjusted since the 2016 presidential election evolved into an interesting timeline of analyses and literature reflecting how our thinking about the polls has changed over the past few years, raising questions about the place of polling in the democratic process and why it’s not the only data that matters.
Going back to 2015 (which feels like much more than five years ago…), Jill Lepore’s New Yorker article covered some polling basics and included a deep dive into the history of polling and analytics. She discussed how polling response rates have been declining even as efforts to leverage this data for analytics have increased. The piece also raised questions about the power of polls and data science to reflect public opinion on the one hand, and to exert undue influence on the other.
Shortly thereafter, more questions about election polling, as well as the data science models built on that data, came to the forefront of the public consciousness after the results of the 2016 presidential election, as the press pondered, “Why did the polls get it so wrong?”
This wave of incredulity seemed to be relatively short-lived; by 2017/2018 the consensus seemed to be that the 2016 polls were about as accurate as could be expected. A FiveThirtyEight blog post provided a more detailed breakdown of how some of the polling errors fell within a normal range of expected error, and analyzed why there was so much criticism of how the 2016 polls performed. However, this conversation validating that the 2016 predictions were “all right” also included some criticism of subjective statistical modeling.
Speaking of those subjective predictive models: as analytics has become a bigger part of the common vocabulary, discussions of polling and analytics in 2020 have started to include more critical analysis of what data science can and can’t do when it comes to predicting elections.
A general suspicion of polling data seems to have further spurred a desire for more transparency about how models are built. In an August 2020 blog post, the well-known Nate Silver of FiveThirtyEight spelled out the adjustments he has made to his models specifically for Covid-19, as well as new features he and his team have worked on in the name of continuous improvement. The post also includes a detailed explanation of the steps the FiveThirtyEight model follows, from data collection to simulation.
Similarly, The Economist proposed their own model for election forecasting (based on polling, economic, and demographic data), explained their methodology, and even published their source code so that interested readers can play around with it.
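To give a feel for what the simulation step of these forecasting models looks like, here is a minimal Monte Carlo sketch. To be clear, this is not FiveThirtyEight’s or The Economist’s actual code — the state names, margins, electoral-vote counts, and error size below are all made-up placeholders — but the core idea is the same: perturb each state’s polled margin with random error many times, and count how often each side wins a majority of electoral votes.

```python
import random

# Hypothetical polling margins (in points, positive favors one side)
# and electoral votes for a few illustrative states -- not real data.
STATES = {
    "A": {"margin": 4.0, "ev": 20},
    "B": {"margin": -2.0, "ev": 15},
    "C": {"margin": 0.5, "ev": 10},
}

def simulate_election(states, n_sims=10_000, error_sd=3.0, seed=42):
    """Estimate a win probability by Monte Carlo simulation.

    In each simulated election, every state's polled margin is
    perturbed by normally distributed error, and that state's
    electoral votes go to whichever side the simulated margin favors.
    """
    rng = random.Random(seed)
    total_ev = sum(s["ev"] for s in states.values())
    wins = 0
    for _ in range(n_sims):
        ev = 0
        for s in states.values():
            if s["margin"] + rng.gauss(0, error_sd) > 0:
                ev += s["ev"]
        if ev > total_ev / 2:
            wins += 1
    return wins / n_sims

print(f"Win probability: {simulate_election(STATES):.1%}")
```

Real models layer much more on top of this — poll weighting by pollster quality, correlated errors between similar states, economic fundamentals — but at the bottom of most of them sits a simulation loop like this one.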
However, it is also important to note that there are data models at work that aren’t based primarily on polling data.
First, we have the AI engines that perform sentiment analysis, studying what people’s words and actions on social media say about public opinion and the intent to vote.
Building on what we learned in 2016 about the ability of fake social media accounts to influence voters’ opinions, these engines can also identify factors that flag a potentially fake social media account spreading misinformation to a target demographic. Sentiment analysis may be especially important in this election: since voting is already underway, and has been for some time, changes in sentiment in September and October could affect how voters vote on any given day. Given this, and looking back at the sentiment analysis predictions from 2016, there is some argument that sentiment analysis could end up being a better predictor of elections than polling. However, it is also worth noting that the two AI engines referenced above produced dramatically different predictions.
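For readers curious what “sentiment analysis” means mechanically, here is a deliberately tiny lexicon-based sketch. Production engines like the ones discussed above use trained language models rather than word lists, and the lexicon and posts below are invented for illustration — but the basic shape is the same: score each post, then aggregate over many posts to estimate the mood of a population.

```python
# Toy sentiment lexicons -- stand-ins for a trained model's judgment.
POSITIVE = {"great", "love", "support", "excited", "win"}
NEGATIVE = {"terrible", "hate", "oppose", "worried", "lose"}

def score_post(text):
    """Return (positive word count - negative word count) for one post."""
    words = text.lower().split()
    return sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)

def aggregate_sentiment(posts):
    """Average per-post sentiment across a collection of posts."""
    return sum(score_post(p) for p in posts) / len(posts)

posts = [
    "I love this candidate and support the plan",
    "excited to win",
    "worried about the debate",
]
print(aggregate_sentiment(posts))  # → 1.0 (net favorable)
```

The hard parts in practice — sarcasm, context, bot detection, deciding whose posts to count — are exactly where the two engines mentioned above can diverge and produce dramatically different predictions.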
Another type of predictive model not built on polling data comes from quantitative historian Allan Lichtman, who bases his statistical pattern-recognition algorithm on thirteen variables scored by their likelihood to produce stability (the incumbent or their party remains in power) or instability (the challenger wins). In some ways this approach is uniquely suited to survive even a chaotic year like 2020, as the model is immune to some of the fickleness that plagues polling and public sentiment on social media.
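Mechanically, a “keys”-style model is about as simple as prediction gets, which is part of its appeal. The sketch below follows Lichtman’s published decision rule — the challenger is predicted to win when six or more keys turn against the incumbent party — but the key values in the example are placeholders, not an actual forecast.

```python
# The thirteen keys, named loosely after Lichtman's published list.
KEYS = [
    "party_mandate", "no_primary_contest", "incumbent_seeking_reelection",
    "no_third_party", "strong_short_term_economy", "strong_long_term_economy",
    "major_policy_change", "no_social_unrest", "no_scandal",
    "no_foreign_military_failure", "major_foreign_military_success",
    "charismatic_incumbent", "uncharismatic_challenger",
]

def predict(key_values):
    """key_values maps each key to True (favors incumbent) or False.

    Returns 'challenger' when six or more keys are False,
    'incumbent' otherwise -- Lichtman's published threshold.
    """
    false_count = sum(not key_values[k] for k in KEYS)
    return "challenger" if false_count >= 6 else "incumbent"

# Placeholder example: seven keys against the incumbent party.
example = {k: (i >= 7) for i, k in enumerate(KEYS)}
print(predict(example))  # → challenger
```

Note that all the modeling difficulty hides in scoring the keys themselves (is the economy “strong”? is the incumbent “charismatic”?), which is a judgment call rather than a computation — the subjectivity critics point to lives there, not in the arithmetic.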
While the exploration of Lichtman’s model in the piece linked above is interesting, I found the key prediction of the article to be this: “TV advertising is likely to have little practical effect other than to annoy most voters.” Personal experience confirms this, and I would extend the prediction to encompass all of those political text messages I have been getting as well.
Fortunately, the end of this election season is almost upon us. In the months to come, I would expect to see another round of analysis about how and why the various polls and predictive efforts either succeeded or failed this cycle. Based on the points covered in the articles linked above, some things that will be particularly interesting to review include: comparisons of the accuracy of the various models, analysis of how those models were able to adapt (both to what was learned from the 2016 election and to all of the unique curveballs 2020 has provided), and discussions of the value of polling data and sentiment analysis when so many voters turned in their ballots early.
And those are my predictions about the predictions.
Until next time…
My only problem with Katie’s newsletters is, how is she going to top herself next month? I love this one! It didn’t lower my anxiety about the election but it was great work, Katie! (now get to work on next month’s! 🙂 )
MAKE STAFF QUALITY A GOAL FOR 2021 Dataspace provides contract staffing in analytics, data science, and data engineering. The top reason clients give for working with us is how well we screen our consultants. In fact, only about 1% of the people we see make it through our screen. Data managers who get fed up with stacks of unscreened, unqualified candidates become our best customers. So, if you’re thinking about adding contractors, give me a call — I’d love to discuss our process and how we might be able to help.
MOVING TO THE CLOUD IS HARD! Have you ever tried to move an application to a cloud platform, such as AWS or Google Cloud? I’ve been working on porting Golden Record (see below) to AWS for a while now and, it turns out, it’s not easy. I’m being mentored by an experienced tech genius on this. So far, our AWS stack includes the following AWS products: CLI, VPC, Route 53, EC2, EFS, RDS, Cloudwatch, ECS, Fargate, ECR, and IAM (EKS was in there, too, until recently). Yes, it’s a long list of acronyms and you don’t need to look them all up but, to move a database-centric application to AWS, they’re pretty much all required. The learning curve, and the opportunity for frustration, is enormous. No wonder cloud architects are so valuable.
While we’re almost done, we are also investigating simpler options. For example, services like Heroku and PythonAnywhere hide all of this complexity. They don’t give you all of the flexibility or all of the tools, but they will let you get an application up in the cloud quickly. If we do move to one of these simpler options, I know we’ll eventually end up back at AWS or Google Cloud, but, while we’re still young, let’s start with training wheels. We’ll take them off when we’re ready to ride faster.
While I’m talking about Golden Record, our cloud-based record matching and deduplication technology, let me add a bit about what’s up with that. We have a number of enhancements in mind for right after we move to the cloud. First is multipass matching. This will allow users to define different ways to match records. Users will then be able to review matches and determine if they’re valid or not. We’ve also started on first name synonyms. Once implemented, Golden Record will be able to tell that Ben and Benjamin could be the same person. And, we’re not done, there are a ton more coming after these. Golden Record is already strong (and somewhat unique!) and, over time, it will become even better.
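To illustrate the first-name synonym idea, here is a tiny sketch of how a synonym table can let a matcher decide that “Ben” and “Benjamin” could be the same person. This is purely illustrative — it is not Golden Record’s implementation, and the synonym table is a two-entry stand-in for a real nickname dictionary.

```python
# Stand-in synonym table mapping name variants to a canonical form.
# A real system would load thousands of entries from a nickname dictionary.
SYNONYMS = {
    "ben": "benjamin",
    "benjamin": "benjamin",
    "liz": "elizabeth",
    "elizabeth": "elizabeth",
}

def canonical_first_name(name):
    """Normalize a first name to its canonical form via the synonym table."""
    key = name.strip().lower()
    return SYNONYMS.get(key, key)

def names_could_match(a, b):
    """True if two first names share a canonical form."""
    return canonical_first_name(a) == canonical_first_name(b)

print(names_could_match("Ben", "Benjamin"))   # → True
print(names_could_match("Ben", "Elizabeth"))  # → False
```

In a full deduplication pipeline, a check like this would be just one signal among several (last name, address, date of birth, fuzzy string distance) feeding an overall match decision.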
If you’re interested in becoming a beta tester and finding common records across an unlimited number of data sets, for free, please reach out. I’d love to add you to our program!
That’s all for now. Thanks for reading and stay safe!