Why social media polling doesn’t work, and why we don’t want it to

In this post I will explain why people have attempted to use social media for political polling, why the results aren’t accurate under the current circumstances, and, quite importantly, why we don’t want it to work. Or in internet speak, TL;DR: data from Facebook and other social media sites is a terrible measure of how well a candidate is doing, and be thankful for that.

This post was originally written in response to a question posted on Reddit in /r/ask_politics by user ‘bean0s0rz’: are Facebook “Likes” a good measure of how well a candidate is doing? Below is my response.

 

Many academics have attempted to use social media as a tool for polling, for the simple reason that the data is cheap to acquire. It’s easy for anyone to make a quick comparison of candidates’ success through the number of followers each candidate has. This is in contrast to most polls, which involve purchasing or collecting surveys. Collecting polling data can cost into the millions, because you have to pay people to phone up, hassle, and door-knock until enough people have completed surveys. Even after all that, you will still have to pay people to complete a form. Certain demographics (for example, company directors) are so hard for polling companies to reach that they are willing to pay a premium for completed surveys from them – it’s not uncommon for prices to reach £50+ per completed survey of 10 questions.

Furthermore, accurate polling data taken from social media, if perfected, would provide instant data on candidates. Elections would surely change if a political campaign could see the impact of a particular debate, speech, or PR stunt instantly. Rather than holing up in their rooms until the next day waiting for a press or public reaction, campaigns could use social media data, taken for free, to automatically optimise their campaign for the highest number of ‘likes’, or of positive statements from the public as identified by semantic text analysis. We are already feeling the effects of the popularisation of politics by social media. But, at the risk of sounding like a futurologist, if social media could be used to gauge how well a particular candidate was doing, we would see a fairly quick shift in the way politics is conducted, and we don’t know whether the result for democracy as a whole would be negative or positive.

However, the reason we haven’t seen a huge change in the way politics is conducted is this: regardless of how many people have attempted to use social media for polling, they can’t get it to work. Some people have attempted to cheat, that is, to make the model fit the result, but these prediction models always fail at the next election. There are a large number of reasons for this; here are some.

Firstly, social media users are not representative of the general demographics of most countries. The old saying that the ‘internet is mostly urbanised and educated’ may not be as true of internet users as it used to be, but the users of the internet, and even more so of social media, are still not reflective of states’ own populations. This results in a heavily skewed poll that only represents a certain section of the population. The normal response would be to weight the data; I’ll go into the issues with this in a later point.

Secondly, people lie on the internet. The social de-individualisation and anonymisation of people on the internet has led to important differences in social communication. YouTube comments can be generally hateful, a proportion of Reddit users lie for karma or steal other people’s content, and Twitter is full of trolls. The key here is that online, the goals for social success change depending on the website. This means people may lie about their true political affiliations. For example, Republican parents may force their children into liking particular pages that support their political view. This kind of pressure is generally why your individual vote at the ballot box will never be released. Therefore, which Facebook pages you support, or who you follow on Twitter, may be an indication of who you support, but it can’t be taken as the full-blown truth.

Candidates may lie too! A high number of Likes or Followers on social media is desirable for candidates. They may use it as political capital, or a high number of followers may start a “rich get richer” process. This is where accounts with many followers attract even more, because as an individual you make the social judgement that if other people follow a particular company or candidate, there must be some social value in following them, so you follow them too. So if there is a way to get followers less legitimately than by attracting users organically (paying for Facebook Likes or retweets, for instance), it is not out of the question that a candidate would find this advantageous. Therefore, unless you can certify that the users who follow a candidate are genuine, the number is worth squat.

Thirdly, you can’t tell from the top-line follower numbers alone which demographics the followers come from. For example, 5m likes does not tell us where the followers of a particular page are geographically from. Trump may have millions of followers on social media, but if most of them are from Russia, then you cannot make assumptions about how many people will vote for Trump come an election.

More of an issue is that if you cannot understand the demographic make-up of a group of people, then statisticians can’t do something called ‘weighting’. Ideally, a selected sample (in this instance, the people who follow each of the presidential candidates on social media) is a miniature of the population it came from, meaning the sample is representative with respect to all variables measured in the survey. Unfortunately, this is usually not the case. To fix this, a weighting adjustment is used to make the data representative of the country as a whole. In an ideal world we would assign an adjustment weight to each person following a particular candidate. People from an under-represented demographic would get a weight larger than 1, and those in over-represented groups would get a weight smaller than 1. This would mean that if you had 1 million people from the 18-24 age group and only 35,000 from the 55+ group, each person in the older age group would be given more value. Without this, we cannot generalise to the population as a whole from a sample of social media users. Therefore, we cannot use data from social media, combined with demographic assumptions, to infer how the whole country, or individual states, would vote for a particular candidate.
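To make the weighting idea concrete, here is a minimal Python sketch of one common approach (weight = population share divided by sample share). Only the follower counts of 1 million and 35,000 come from the example above. The population shares, the per-group rates of following a hypothetical "candidate A", and the function names are all assumptions made up for illustration, not real data or anyone's actual method.

```python
def group_weights(sample_counts, population_shares):
    """Weight per group = population share / sample share.

    Under-represented groups get a weight above 1,
    over-represented groups get a weight below 1.
    """
    total = sum(sample_counts.values())
    return {
        group: population_shares[group] / (count / total)
        for group, count in sample_counts.items()
    }


def weighted_rate(sample_counts, rates, weights):
    """Share of the weighted sample that follows the candidate."""
    weighted_total = sum(sample_counts[g] * weights[g] for g in sample_counts)
    weighted_yes = sum(
        sample_counts[g] * weights[g] * rates[g] for g in sample_counts
    )
    return weighted_yes / weighted_total


# Follower counts taken from the example in the text.
sample_counts = {"18-24": 1_000_000, "55+": 35_000}

# ASSUMPTION: made-up population shares for a two-group population.
population_shares = {"18-24": 0.25, "55+": 0.75}

# ASSUMPTION: made-up share of each group following hypothetical candidate A.
follow_rates = {"18-24": 0.60, "55+": 0.30}

weights = group_weights(sample_counts, population_shares)

# Unweighted rate: dominated by the over-represented 18-24 group.
raw_rate = sum(
    sample_counts[g] * follow_rates[g] for g in sample_counts
) / sum(sample_counts.values())

# Weighted rate: each group counts in proportion to its population share.
adjusted_rate = weighted_rate(sample_counts, follow_rates, weights)

print(weights)
print(round(raw_rate, 3), round(adjusted_rate, 3))
```

With these invented inputs the unweighted figure is about 59%, while the weighted figure is 37.5%. The catch is that none of this can be computed unless you know which group each follower belongs to, and that is precisely the information a top-line follower count doesn't give you.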

Fourthly, computers are stupid. One of the biggest problems in trying to infer a person’s political affiliations from social media posts with machine learning is that computers can’t reliably detect sarcasm (this is more of an issue in the United Kingdom than the USA due to differences in how language is used). Computers aren’t yet at the point where they can understand different people’s speech patterns 100% of the time. This can skew results in semantic text analysis. Therefore, if you were to use social media to see how the general population wanted to vote, you would have to verify each and every post with humans… and then it becomes more expensive than traditional polling methods.
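As a toy illustration of the sarcasm problem, here is a deliberately naive word-counting sentiment scorer in Python. The word lists, the function name, and both example posts are invented for this sketch. Real text analysis tools are more sophisticated than this, but the failure it shows is the one described above.

```python
# ASSUMPTION: tiny made-up word lists; a real sentiment lexicon is far larger.
POSITIVE = {"great", "brilliant", "love", "good"}
NEGATIVE = {"terrible", "awful", "hate", "bad"}


def naive_sentiment(post):
    """Score = number of positive words minus number of negative words."""
    words = [w.strip(".,!?\"'").lower() for w in post.split()]
    positives = sum(w in POSITIVE for w in words)
    negatives = sum(w in NEGATIVE for w in words)
    return positives - negatives


# Two invented posts: one sincere, one sarcastic.
sincere = "I love this candidate, great speech!"
sarcastic = "Oh great, another brilliant speech. Just what we needed."

print(naive_sentiment(sincere), naive_sentiment(sarcastic))
```

Both posts get the same positive score, even though a human reader would take the second one as criticism.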

There are other problems with machine learning too, so at the moment computers can’t reliably tell us whether particular posts are happy or sad, politically left or right, or in support of one candidate or another.

These are only some of the reasons why we can’t use social media to make assumptions about how well a particular candidate is doing in the run-up to an election. In summary: firstly, the users of social media don’t represent countries’ demographics. Secondly, candidates can lie, and so can users of the internet; particular websites change how we behave, and once you are offline your values might actually change. Thirdly, we can’t accurately determine how candidates’ user bases on social media are made up, so we cannot apply important statistical mechanisms. And fourthly, we are not yet in a technological position for computers to understand us, and therefore to make predictions.

Then again, returning to my discussion of whether we could tell how well candidates are doing instantly from social media: do we really want political campaigns to be able to instantly tell how well they are doing? I don’t think we have properly considered the implications of a hyper-reactive politician, and how this would change democracy. After all, a politician’s job is to spend time thinking rationally on our behalf, not to make the snap judgements that we as a citizenry would. Would laws we think of today as progressive have been passed? Would democracy inch towards anarchy, or at least a form closer to direct democracy? This is something we don’t know, but my assumption is that if politicians started to react instantly to citizens’ views without thinking or debating about particular issues, this would only be bad for democracy.