13 common data mistakes you should learn to avoid

You might have heard most of these terms before. Maybe you nodded in agreement when someone claimed one of them must be involved, even though you had no clue what it meant. If that’s the case, this text is for you.

I’ve collected 13 common data mistakes, or data fallacies, that we all make at times. For each one, I’ll explain the background and give you a real-life example. I know you are working hard on getting more comfortable with data, and you shouldn’t have to make these mistakes yourself, or get fooled by someone else making them, if you don’t want to.

1. Regression towards the mean

Regression towards the mean happens when luck plays a role in a result. An unusually good or bad outcome is partly down to luck, and since luck rarely repeats itself, the next result tends to land closer to the average. Over time, good and bad luck even out.

An example of Regression towards the mean

Think of long jumping. Sometimes a strong headwind will lead to even the best jumper showing poor results. Similarly, a strong tailwind can “help” a mediocre athlete, creating a remarkable (but temporary) bump in her results. Both these effects will disappear when the conditions change, and the results will move back to “normal”. So with more jumps, each athlete’s results will tend to move towards her own average.

Therefore, if you run an experiment over and over again, and there’s an element of luck involved (and trust me, that’s always the case), an extreme result will usually be followed by a more “average” one.
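If you want to see this effect for yourself, here is a minimal Python sketch of the long-jump example. Everything in it is made up for illustration: the athletes, their skill levels, and the size of the luck component are my assumptions, not real data. It picks the best 10% from a first round of jumps and checks how the same athletes do in a second round.

```python
import random

random.seed(42)  # fixed seed so the sketch gives the same numbers every run

def jump(skill):
    """One jump = the athlete's true skill plus random luck (wind, timing)."""
    return skill + random.gauss(0, 0.5)

# Hypothetical athletes: true skill in metres, centred on 7.0 (an assumption)
skills = [random.gauss(7.0, 0.3) for _ in range(10_000)]
first = [jump(s) for s in skills]
second = [jump(s) for s in skills]

# Pick the 10% who jumped farthest in round one
cutoff = sorted(first)[-1000]
top = [i for i, result in enumerate(first) if result >= cutoff]

overall_avg = sum(first) / len(first)
top_first_avg = sum(first[i] for i in top) / len(top)
top_second_avg = sum(second[i] for i in top) / len(top)

print(f"Everyone, round 1:      {overall_avg:.2f} m")
print(f"Top 10%, round 1:       {top_first_avg:.2f} m")
print(f"Same athletes, round 2: {top_second_avg:.2f} m")
```

The top group is still better than average in round two, because they really are a bit more skilled. But their results drop back towards the overall average, because the luck that put them at the top in round one doesn’t carry over.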

2. Cherry Picking

“Cherry picking” is when you choose to communicate only the results that confirm a particular position and exclude the ones that don’t. This behaviour is widespread and happens when you point to data or individual cases that suit your goals while ignoring a large number of related cases or data that go against that position.

An example of Cherry Picking

Remember the last time you made a CV? You probably didn’t include everything you’ve ever done but picked the parts you thought were most likely to get you the job, to fulfil your goal. You left lots of things out, things that you felt were irrelevant. Still, the data you left out could have painted a very different picture of who you are. (This is probably why a CV is not the only data collected by potential employers.)

Sure, cherry picking data for your CV, or online dating profile, is standard practice. But in many situations, it’s not. If you have employees presenting a small fraction of result data, it might be a good idea to ask about the rest…

3. Sampling Bias

Sampling bias happens when you draw conclusions from a dataset that isn’t representative of the population you’re trying to understand. It’s a systematic error that can appear if you don’t have a random sample. Sampling bias is a form of selection bias, sometimes called the “selection effect”.

An example of Sampling Bias

What do Americans think of Donald Trump’s presidency? Most of us have an instinctive answer. But many people use Facebook as their primary information source when answering the question, and what they see there isn’t public opinion; it’s their friends’ opinion. This is classic selection bias: you use data that is easy to access, but it only captures a particular, unrepresentative subset of the whole population.

Another case of sampling bias is crime statistics, for example on rape. These datasets only contain the known and reported cases and miss the many cases that never come to light. The statistics therefore never show the rate of actual rapes, only the rate of reported ones.
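To make the difference between a random sample and a convenient one concrete, here is a small Python sketch. The population, the “bubble” of friends, and all the approval rates are numbers I invented for the example. It compares the true approval rate with what a random sample says and what a sample drawn only from the bubble says.

```python
import random

random.seed(7)  # fixed seed so the sketch is repeatable

# Hypothetical population of 100,000 people. Assumption for illustration:
# 5% are in "my bubble", and people in the bubble approve far less often.
population = []
for _ in range(100_000):
    in_my_bubble = random.random() < 0.05
    approves = random.random() < (0.10 if in_my_bubble else 0.45)
    population.append((in_my_bubble, approves))

def approval_rate(people):
    """Share of people in the list who approve."""
    return sum(approves for _, approves in people) / len(people)

true_rate = approval_rate(population)

# A proper random sample of 1,000 people
random_sample = random.sample(population, 1_000)

# The "easy" sample: the first 1,000 people from my own bubble
feed_sample = [person for person in population if person[0]][:1_000]

print(f"Whole population: {true_rate:.1%}")
print(f"Random sample:    {approval_rate(random_sample):.1%}")
print(f"My feed only:     {approval_rate(feed_sample):.1%}")
```

Both samples have the same size. The random one lands close to the true rate, while the one from the feed is far off, and collecting more data from the same feed would not fix it.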

4. The Observer Effect

The Observer Effect is when people modify aspects of their behaviour because they know they’re being observed. So, when you monitor someone, you might not get findings that are representative of situations outside the controlled setting. The Observer Effect is sometimes called the Hawthorne effect.

An example of the Observer Effect

I remember when I, as a child, was supposed to estimate how much time I spent brushing my teeth as part of my homework. I decided to set a timer, and then I brushed and brushed and brushed (I was pretty ambitious as a kid). This was of course very far from the amount of tooth brushing I usually did. Since I knew someone would look at my data point, it became unrepresentative. This is often true for self-reported data as well.

5. False Causality

False Causality is when two events happen at the same time and we falsely assume that one must have caused the other. But sometimes this is just coincidence, or something else is causing both events.

An example of False Causality

So, if I eat an apple before I take a big test, and do really well… the apple must have caused, or at least had a significant impact on, the result, right? Or, if the number of crimes goes up about as much as the number of ice creams sold, surely eating ice cream must make people more criminal? You realise that both of these conclusions are wrong, but sometimes the fallacy is not as obvious as this.

(It’s also good to learn the difference between causation and correlation.)
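Here is a small Python sketch of how “something else” can create both events in the ice cream example. I’m assuming, purely for illustration, that temperature drives both ice cream sales and crime, and all the numbers are invented. The sketch measures the correlation over all days, and then again on days where the temperature is almost the same.

```python
import random

random.seed(1)  # fixed seed so the sketch is repeatable

def correlation(xs, ys):
    """Pearson correlation coefficient, written out by hand."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

# Hypothetical daily data for ten years. Temperature influences both series;
# ice cream and crime never influence each other in this simulation.
temps = [random.uniform(0, 30) for _ in range(3_650)]
ice_creams = [20 * t + random.gauss(0, 60) for t in temps]
crimes = [2 * t + random.gauss(0, 10) for t in temps]

r_all = correlation(ice_creams, crimes)

# Hold the weather roughly constant: only days between 19 and 21 degrees
mild = [i for i, t in enumerate(temps) if 19 <= t <= 21]
r_mild = correlation([ice_creams[i] for i in mild], [crimes[i] for i in mild])

print(f"All days:       r = {r_all:.2f}")
print(f"Mild days only: r = {r_mild:.2f}")
```

Over all days the two series move together strongly. Once the temperature is held roughly constant, the relationship mostly disappears, which is what you would expect if neither one causes the other.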

6. The Cobra Effect

The Cobra Effect is when you create an incentive that accidentally produces the opposite result to the one you intended. (Oops!)

An example of the Cobra Effect

The Cobra Effect got its name from when the British government wanted to reduce the number of cobras in India during colonial times. Therefore, they offered a bounty for every dead cobra. Initially, this was a successful tactic, but soon enterprising people began to breed cobras for the income. The Brits quickly terminated the program, the cobra breeders set all the bred snakes free, and the end result was an increased wild cobra population. So the intended solution made the problem even worse.

The Cobra Effect is a well-known example of what is called a “perverse incentive”.

7. Gerrymandering

Gerrymandering is when you manipulate the geographical boundaries used to group data because you want to change the result of an election. In practice, it’s often about drawing district lines to give a specific political party, minority, or other interest group a disadvantage in an election.

An example of Gerrymandering

I wish that Gerrymandering was mostly a hypothetical fallacy, or at least that it was not common practice to redraw election districts to give certain groups a disadvantage. I’m sad to say that’s not the case.
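To show how much the boundaries alone can matter, here is a tiny Python sketch. The 15 voters, the two parties “A” and “B”, and both district maps are hypothetical. The votes are exactly the same in both maps; only the grouping changes.

```python
def seats(districts):
    """Count the districts each party wins by simple majority."""
    wins = {"A": 0, "B": 0}
    for district in districts:
        winner = "A" if district.count("A") > district.count("B") else "B"
        wins[winner] += 1
    return wins

# The same 15 voters (9 for A, 6 for B), grouped into five districts of three
map_one = ["AAB", "AAB", "AAB", "AAB", "ABB"]
map_two = ["AAA", "AAA", "ABB", "ABB", "ABB"]

print(seats(map_one))  # {'A': 4, 'B': 1}
print(seats(map_two))  # {'A': 2, 'B': 3}
```

In the second map, many A voters are packed into two districts where their extra votes change nothing, so B wins a majority of districts with a minority of the votes.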

8. The Monte Carlo fallacy

The Monte Carlo fallacy is the mistaken belief that, if something occurs more frequently than usual during a given period, it will happen less often in the future (or vice versa). You might know this as the Gambler’s Fallacy or the “fallacy of the maturity of chances”.

An example of the Monte Carlo fallacy

You know that friend with four children, all of them daughters? It’s tempting to assume that when the fifth child is on the way, it must be a son. But the probability is the same as always.

When events are created by chance and are independent of each other, the previous outcomes tell you nothing about the upcoming ones.
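You can check this with a quick simulation. The Python sketch below makes a simplifying assumption: every birth is an independent 50/50 event (real birth ratios are only roughly that). It creates a large number of hypothetical five-child families, keeps the ones where the first four children are girls, and looks at how often the fifth child is a boy.

```python
import random

random.seed(3)  # fixed seed so the sketch is repeatable

# 200,000 hypothetical families with five children each.
# Assumption: each child is a girl ("G") or boy ("B") with equal probability,
# independently of the siblings.
families = [[random.choice("GB") for _ in range(5)] for _ in range(200_000)]

# Keep only the families whose first four children are all girls
four_girls = [family for family in families if family[:4] == ["G"] * 4]

# Among those families, how often is the fifth child a boy?
boys_fifth = sum(family[4] == "B" for family in four_girls) / len(four_girls)

print(f"Families with four girls first: {len(four_girls)}")
print(f"Share with a boy as fifth child: {boys_fifth:.1%}")
```

The share stays close to one half. Four daughters in a row do not make a son “due”.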

9. The Danger of Summary metrics

The danger of summary metrics appears when you look only at summary metrics, such as the average. The summary is only part of the story, since there might be a lot of variation in a dataset that a summary won’t show. So you can easily miss interesting or significant differences in a dataset.

The danger of summary metrics is also why you need to know the difference between mean, median and mode by heart.

An example of the Danger of Summary metrics

Say you are the CEO of a large fishing company. To be able to sell your fish, each one needs to weigh about 500 grams. Every week you get updates about the weight of the average fish in a sample and the total number of fish. It looks good: the average fish weighs about 500 grams, so you feel confident and base your profit calculations on these values.

When it’s time to sell, the fish are caught and prepared. About half of them weigh 200 grams, and the other half weigh 800 grams. Neither kind is fish you can sell at full price. So, by looking at only the average [(200 + 800)/2 = 500], you got fooled into thinking everything was perfect, while it most certainly was not.
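The fish example is easy to reproduce in a few lines of Python. The sample of 100 fish is hypothetical, and the tolerance of 50 grams around the target weight is my own assumption, since “about 500 grams” could mean many things.

```python
from statistics import mean, median, stdev

# Hypothetical sample of 100 fish, weights in grams
weights = [200] * 50 + [800] * 50

# Assumption: a fish is sellable at full price within 50 g of the 500 g target
sellable = [w for w in weights if 450 <= w <= 550]

print(f"Mean weight:   {mean(weights)} g")        # 500
print(f"Median weight: {median(weights)} g")
print(f"Std deviation: {stdev(weights):.0f} g")
print(f"Sellable fish: {len(sellable)} of {len(weights)}")  # 0 of 100
```

The mean looks perfect, but the standard deviation is huge and not a single fish is actually sellable. A spread measure, or a simple histogram, next to the average would have revealed the problem in the weekly report.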

10. Data Fishing

Data fishing is when you misuse data analysis to search for patterns in data that reach statistical significance when there is no real underlying effect. By repeatedly testing new hypotheses against your data, you forget that if you run enough tests, some correlations will look significant purely by chance. Instead, you keep going until you find some significant effect to communicate. Data fishing is the same as data dredging, data snooping, data butchery, and p-hacking.

An example of Data Fishing

Imagine you take a huge sample of people winning the lottery. You don’t know why some people are more likely to succeed than others, but you know there must be some pattern you can find if you look closely enough. So, you start testing hypotheses: height, food consumption, number of siblings, and you keep going until you see a pattern that shows significance. Finally! Of course, you should have known that people born in August are more likely to win the lottery.

Well, they’re not. You just found a pattern that appeared in the dataset by chance.
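Here is a rough Python sketch of how this happens. All the data is random by construction: the outcome (“won” or not) and 200 hypothetical yes/no attributes have nothing to do with each other. The sketch runs a standard two-proportion z-test for each attribute and counts how many come out “significant” at the usual 5% level (|z| > 1.96).

```python
import random

random.seed(11)  # fixed seed so the sketch is repeatable

N = 2_000
# Outcome decided by pure chance (hypothetical: did the person win a prize?)
won = [random.random() < 0.5 for _ in range(N)]

def z_score(attribute, outcome):
    """Two-proportion z-test: does the win rate differ between the groups?"""
    with_attr = [o for a, o in zip(attribute, outcome) if a]
    without = [o for a, o in zip(attribute, outcome) if not a]
    p_with = sum(with_attr) / len(with_attr)
    p_without = sum(without) / len(without)
    pooled = (sum(with_attr) + sum(without)) / (len(with_attr) + len(without))
    se = (pooled * (1 - pooled) * (1 / len(with_attr) + 1 / len(without))) ** 0.5
    return (p_with - p_without) / se

# Test 200 random attributes that have no connection to the outcome
hits = 0
for _ in range(200):
    attribute = [random.random() < 0.5 for _ in range(N)]
    if abs(z_score(attribute, won)) > 1.96:
        hits += 1

print(f"'Significant' attributes found: {hits} of 200")
```

With a 5% threshold you should expect roughly 10 of the 200 meaningless attributes to pass the test. If you only report the ones that passed, you have gone data fishing.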

11. Survivorship Bias

Survivorship bias is the logical error of focusing on the people, things or events that have “survived” some selection criteria while overlooking those that did not, typically because they are not visible. This error can lead to false conclusions and is a form of selection bias. It is sometimes called “survival bias”.

An example of Survivorship Bias

You might have heard about the damaged US bomber planes in World War II? The returning planes were filled with bullet holes, and the US armed forces realised they needed to reinforce them with armour. They started to think about where to put the reinforcement and plotted out the damage on some planes. The holes were spread out but concentrated around the planes’ wings, tail and body.

But Abraham Wald, a statistician, made an interesting observation. He claimed that reinforcing the planes in these areas would be a tremendous mistake. When looking at the bullet holes, the military had only looked at the aircraft they had in front of them and had failed to factor in the damage to those that didn’t make it back.

Some planes didn’t make it back because their bullet holes weren’t in the same areas as the ones in the sample of returned planes. Most of these planes were hit in the engine, a part that, compared to the tail, wings and body, was extremely vulnerable. A bullet in the engine made the plane crash, so it didn’t return home to be part of the sample.
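The logic can be sketched as a small simulation in Python. The numbers are invented for illustration, not historical: I assume every plane takes three hits, that hits land on each section with equal probability, and that an engine hit is far more likely to bring the plane down than a hit anywhere else.

```python
import random
from collections import Counter

random.seed(5)  # fixed seed so the sketch is repeatable

SECTIONS = ["wings", "tail", "body", "engine"]
# Assumed chance that a single hit in a section brings the plane down
LOSS_CHANCE = {"wings": 0.05, "tail": 0.05, "body": 0.05, "engine": 0.8}

all_hits, returned_hits = Counter(), Counter()
for _ in range(10_000):
    hits = [random.choice(SECTIONS) for _ in range(3)]  # hits land uniformly
    survived = all(random.random() > LOSS_CHANCE[section] for section in hits)
    all_hits.update(hits)
    if survived:
        returned_hits.update(hits)  # only these planes can be inspected

engine_share_all = all_hits["engine"] / sum(all_hits.values())
engine_share_returned = returned_hits["engine"] / sum(returned_hits.values())

print(f"Engine share of all hits:               {engine_share_all:.1%}")
print(f"Engine share of hits on returned planes: {engine_share_returned:.1%}")
```

In the simulation, engines are hit just as often as any other section, but engine hits are rare on the planes that return. Judging only from the survivors, you would conclude that the engine is the last place that needs armour, which is exactly the wrong way round.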

12. McNamara fallacy

The McNamara fallacy is when a decision is based exclusively on quantitative observations (i.e., metrics, hard data, statistics) while ignoring all qualitative factors. You, therefore, lose sight of the bigger picture.

An example of the McNamara fallacy

One good example of the McNamara fallacy happens daily in classrooms: learning versus performance. While it’s easy to measure performance, measuring learning is hard. We tend to focus on what is easy to measure, test results, and use it as a proxy for learning. We then use our test result data to improve standards.

But with too much focus on test results, we might get worse teaching or even cheating, because the focus has moved away from making sure students learn to scoring specific numbers on a test. So, when we focus on something that is easy to measure, there’s a significant risk that everything else is considered unimportant. Instead, we should be looking to improve the amount of learning and let this improvement drive up test results.

13. Overfitting

Overfitting is a modelling error. It happens when a function fits a limited dataset too closely because you’ve created a model that is too complicated, one that explains the dataset you study in every detail. This makes your model overly tailored to the data you have and, therefore, not representative of the general trend.

An example of Overfitting

“Oh no! Lisa is leaving the marketing department. How will we ever find a good replacement?”
Wanted: 36-year-old female with degrees in marketing and political science from Stockholm School of Economics. Needs to have a boring husband and two kids (4 and 6). You should spend your weekends hiking with your Golden Retriever. You should be 67 inches tall with blonde hair and freckles, and loudly curse people who eat fish in the microwave.

In this case, the employer is unable to differentiate between relevant and irrelevant characteristics. The requested qualifications are probably only met by the one person they know is right for the job, because she used to have it. The problem is that she no longer wants it.
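The same thing happens with numbers. Below is a minimal Python sketch with invented data: noisy points around a simple trend (y = 2x + 1, my assumption). It compares two hypothetical models. One is a plain straight line. The other, which I call the memoriser, is the job ad of models: it simply repeats the answer of the closest data point it has already seen.

```python
import random

random.seed(13)  # fixed seed so the sketch is repeatable

def make_data(n):
    """Noisy points around the assumed true trend y = 2x + 1."""
    xs = [random.uniform(0, 10) for _ in range(n)]
    return [(x, 2 * x + 1 + random.gauss(0, 3)) for x in xs]

train, test = make_data(200), make_data(1_000)

# Simple model: least-squares straight line fitted to the training data
n = len(train)
mean_x = sum(x for x, _ in train) / n
mean_y = sum(y for _, y in train) / n
slope = (sum((x - mean_x) * (y - mean_y) for x, y in train)
         / sum((x - mean_x) ** 2 for x, _ in train))
intercept = mean_y - slope * mean_x

def line(x):
    return slope * x + intercept

def memoriser(x):
    """Overfitted model: repeat the answer of the closest training point."""
    return min(train, key=lambda point: abs(point[0] - x))[1]

def mse(model, data):
    """Mean squared error of a model on a dataset."""
    return sum((model(x) - y) ** 2 for x, y in data) / len(data)

for name, model in [("line", line), ("memoriser", memoriser)]:
    print(f"{name:10s} train error: {mse(model, train):6.2f}"
          f"   new-data error: {mse(model, test):6.2f}")
```

The memoriser is flawless on the data it has seen, with a training error of zero, but it does clearly worse than the simple line on new data. It has learned the noise along with the trend, just like the job ad that describes Lisa instead of the job.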

