fbpx

How to prepare data for analysis with simple tools

So:

Data is one everybody lips these days. And we all know, being more data-driven makes sense. So, now we need to get started.

But you might ask:

“How do I get started?”

Well, getting started with data is a bit tricky, because it tends to become advanced pretty quickly – and when that happens, most of us give up. Not everyone is a data scientist – or a statistician, or an engineer. Some of us are more normal and less advanced.

But shouldn’t we be able to work with data too? Is it THAT hard? The answer is NO, at least not always – not even most of the time.

So, if you are saying:

“I don’t need a rocket, just a skateboard. Where do I start?”

Then, this post is for you.

Why do you need to prepare data for analysis?

Here’s the thing:

The first step when you work with data is to turn your collected data into something meaningful. Most data don’t arrive like that; it comes full of missing values and errors and line breaks where they shouldn’t be.

And, 99% of the time you need to prepare your data in some way before you can get anything done.

The step before doing any data analysis is often called cleaning data or data processing. According to Wikipedia, data processing is: ”the collection and manipulation of items of data to produce meaningful information.”

You can find simple tools for data processing here and there. And more of them are popping up. These tools can make the work a lot easier.

Here’s the thing:

I’m not going to list them all here.

Why?

I don’t know them all.

Instead, I’m guiding you to the ones I regularly use when I prepare data, and when and how I use them.

So, here we go…

Simple tools to prepare data

Here are the tools I use to prepare data for analysis. I will walk you through them from most basic to most advanced:

  • Google Sheets
  • Microsoft Excel
  • Atom
  • Tableau Prep
  • R and RStudio

Use Google Sheets when you prepare data for analysis

Use Google Sheets to prepare data for analysis
  • Price: Free
  • Skill-level: Basic
  • Platforms: Google Sheets works on any computer running Chrome, Firefox, Internet Explorer 11, Microsoft Edge, or Safari as their web-browser. Or as an app for Android (running version 4.4 KitKat or later) or iOS (running version 9.0 or later) devices.

Some beginner user cases:

  • Merge data files
  • Re-organise your data
  • Remove duplicate values
  • Get rid of columns and rows you don't need
  • Remove empty cells
  • Make simple calculations
  • Use filters to see only the information you want
  • Collaborate with others
  • And much more

What is Google Sheets?

Google Sheets is a free and accessible web-based spreadsheet tool for data processing. It is cloud-based, and part of the Google Drive Suite. It is very similar to Microsoft Excel, but lack some functionality.

Google Sheets is an excellent way for beginners to start working with data. It is especially useful for certain types of data, more specifically medium-sized data-sets structured in rows and columns – for example, files exported as .xlsx or .csv.

How do I use Google Sheets to prepare data for analysis?

So:

My first real relationship with spreadsheets started with Google Sheets. The web-browser interface made the tool feel less intimidating than Excel, that is very mythical. Google Sheets made me feel safe(r than other spreadsheet tools), and that was what I needed to get started.

Do you want to know another thing that was crucial for me?

I could find the answers to most of my Google Spreadsheet related questions online – in blog posts, forums and on YouTube. That was invaluable since I didn’t have anyone to ask.

Some simple use cases for Google Sheets at work

My first real use of Google Sheets at work was when I had to calculate a large number of KPIs and deliver monthly performance reports.

You might ask:

“So, how did Google Sheets help you to calculate KPIs effectively?”

I had a large number of data sources that all had an “export data” function. I only needed a small subset of data in the files I exported to calculate my KPIs, but the calculations were quite elaborate.

How did I use Google Sheets for this?

I exported standard raw data files from Facebook, Twitter, Pinterest and the other sources I needed.

In my KPI Spreadsheet-file, I had two types of sheets:

  • A “master sheet” that calculated all my KPIs
  • One sheet per data source (Facebook, Twitter, Pinterest etc)

The “master sheet” was picking up the specific data – “referring to” data – from the other sheets I had created per data source. And making the necessary calculations to get the KPI value.

When a new monthly report was due, and I wanted to update – recalculate – my KPIs, I downloaded export files from all channels with data from the previous month.

I then navigated to the correct sheet in the KPI spreadsheet (say Pinterest) and used the import function “Replace Spreadsheet” in Google Sheets to upload my new (Pinterest) export-file.

The import overwrote all the previous (Pinterest) data in the sheet, and all calculations in my master sheet refreshed automatically.

As long as all the columns in my export file were in the same place every time – so that all my referral formulas where intact, this method worked seamlessly. And, usually, that was the case.

Smart Google Sheets functionality for beginners to learn

  • Filter data
  • Hide sheets
  • Hide rows and columns
  • Conditional formatting (how to make a column turn green or get a bold text when it meets specific criteria)
  • Protect ranges, to make sure that individual cells or the complete sheet don't get edited by mistake
  • Data validation, to only have people choose from pre-determined choices instead of writing text free-hand

Use Microsoft Excel to prepare data for analysis

  • Price: 129 USD
  • Skill-level: Basic
  • Platforms: Microsoft Excel works on any PC with Windows or Mac with OSX

Some beginner user cases:

  • Everything Google Sheets can do
  • Change from a decimal . to a , or vice versa
  • Work with large(r) data files
  • When you don't have an internet connection
  • Advanced calculations
  • Clean data from weird symbols or remove spaces
  • Split data into multiple cells
  • Automate some reccuring tasks with macros

What is Microsoft Excel?

Here’s the thing:

Microsoft Excel is the go-to software when working with data in many fields. Especially in the business world where people use it for budgeting, time-plans and Gantt-schemes, calculations and more.

While more advanced users usually disagree, I would say that Google Sheets and Excel are very similar. You can do most things on a fundamental level in both tools. Some things are quicker to do in one than the other, and I guess this is why I continue to use them both in parallel.

How do I use Google Sheets to prepare data for analysis?

I tried to stay as far away as possible from Excel for very long. To me, Excel was something only boring people needed, and my biggest fear in life was to come across as boring. 

The result:

For a very long time, I opened my .xlsx files in Google Sheets.

Since Google Sheets and Excel are so similar for many users, I will focus on the differences.

When is Excel better than Google Sheets?

You might ask:

“When should I use Excel and not Google Sheets?”

First:

The first feature I ever needed in Excel was “offline mode”.

Boring answer, I know. But, I didn’t always have a stable internet connection, and when you work in your web browser, you do.

Yes, I know – it is possible to work with Google Sheets in “offline mode”. But offline mode doesn’t feel as safe as having Excel installed and running locally on my computer. And the risk of losing a data analysis I’ve worked on for hours is not a risk I want to take.

Second:

The more I work with data, the more time I spend cleaning and prepping data files for data analysis.

Also, my data files have gotten larger and heavier (measured in megabytes) over time. They still do.

Quite quickly, I learned that Excel is a bit more stable than Google Sheets when working with large files. (It’s not like Excel is perfectly durable though, I’m often watching the spinning rainbow ball when I work with Excel).

There’s a nifty feature that helps with this:

In Excel, you can turn off “automatic calculation” and make all your changes to the document before it calculates what’s in your cell.

Compared to the standard setting, when “automatic calculation” is turned on, a cells auto-updates when you’ve changed something – that affects them – somewhere in your sheet.

When you turn off “automatic calculation” you usually get some extra processor power and trust me, you will need it at times.

When do I use Excel today?

The reason for me to choose Excel over Google Sheets these days is because it has more advanced functions. And with functions, I don’t mean functionality, but the ones you remember from math class.

Most of the time, I could probably do the same thing in Google Sheets – if I used a plugin – but since it’s already in place with Excel, it saves me time.

When do I use Google Sheets over Excel?

I use Google Sheets when I import data from an API using Google Script. Feeding in data from APIs is not something I do in Excel, even though it’s possible. Also, as soon as I need to share my documents with someone else, working in Google Sheets is often easier.

Smart Excel functionality for beginners to learn

  • Functions like: Split, Concatenate and Trim
  • Turn off Automatic Calculations
  • Convert formulas into values
  • Different "count" functions (Useful when you want to answer questions like How many of our users live in Stockholm?)
  • Simple macros, like "Save all sheets in this file as .pdf in this Dropbox folder"

Use Atom for simple data cleaning

  • Price: Free
  • Skill-level: Basic
  • Platforms: Mac and Windows (+ a couple of other ones you probably haven’t heard of…)

Some beginner user cases:

  • Format raw data
  • Make the same change at many places in a file all at once with multiple cursors
  • Find, change and/or remove words or symbols you don't want

What is Atom?

Atom is a text editor (just like TextEdit on a Mac or Notes on a PC). GitHub – a popular platform for developers to keep track of their code – is behind Atom.

You might ask:

“Why do I need a text editor to work with data?”

Well, at times when I prepare data, I get across data stored in JSON or XML-files. These files look like gibberish, and it’s impossible to read what’s in them. To decode these files, you need a good text editor.

Here’s the thing though:

When you are preparing data for analysis or doing other more advanced tasks, any text editor won’t help you. You need a good enough one (like Atom or Sublime Text).

The reason is that a good text editor formats the text in your files to make it more readable. This process is often called “prettifying” in the developer world. It unfolds code that is hard to read into a structure that is friendly and easy to understand for a human.

Atom uses colours, space, new rows and highlights to make your text easy to read. Getting helpful syntax is even more critical if you are unfamiliar to code.

Using Atom when preparing data for analysis:

So:

A text editor might not be what you think of when someone says “great tool for data processing”?

But Atom is one of the tools I use most often and to a wide variety of things: Notes, Collecting data, Cleaning formatting from text, and much more. I write most of my e-mails in Atom.

But the use case might become clearer with a real world example:

For example:

The Facebook Ads API displays information about “targeting” for an ad as a JSON-object. JSON is a type of code.

If you try to read this information in a spreadsheet column, you will struggle (a lot). But if you instead open the data in Atom, it formats JSON beautifully.

It’s as simple as this:

  • 1. Create a new file
  • 2. Paste your Facebook targeting data into the file
  • 3. Save it to your hard drive as anything.json

When you save the file, it will magically become formatted in an (almost) understandable way, with colours and space and highlights. And you can easily spot and pick the information you need.

Also, this method same is valid for many other file formats that you might need to decode.

Smart Atom functionality for beginners to learn

  • Search and replace things in your file
  • Use multiple cursors at the same time
  • "Soft wrap" your text
  • Intendent your text automatically
  • Go from uppercase to lowercase and vice versa

Use Tableau Prep when preparing data for analysis – automatically

Tableau Prep to preparing data
  • Price: 70 USD per month
  • Skill-level: Medium
  • Platforms: Mac and Windows

Some beginner user cases:

  • Clean data automatically with a click
  • Merge data files in different formats into one file
  • Split longer strings into multiple variables
  • When you are tired of do the same cleaning operations over and over again for many files

What is Tableau Prep?

So:

Tableau Prep is a tool that lets users cleanse, aggregate, merge or otherwise prepare their data for analysis. It’s made specifically with this use case in mind. The idea is that you should then continue to work with your data in Tableau, but that is not required.

You might ask:

“So, why do I need it when I have the other tools you’ve listed here?”

First: Compared with all the other tools, Tableau Prep is very visual. You drag and drop files and actions in a visual interface, and you can see how your different data files branch together, and you can see the various cleaning steps in a visual timeline.

Second: A lot of things are quick in Tableau Prep, but might take forever using Excel even if it would be technically possible.

Now you might think:

“Yay, this sounds great let’s get started!”

Well. There’s one big minus: $

Tableau prep comes with Tableau, and it’s pricy. A bit too pricy for most users with primary use cases. But they have a trial, so you can always try it out and see the magic happen if you want to.

Smart Tableau Prep functionality for beginners

  • Rename fields in the input step
  • Find your next step with smart recommendations
  • Document your process by describing your changes
  • Set specific Data Roles to save time

Use R and RStudio for data preparation

  • Price: Free
  • Skill-level: Medium
  • Platforms: Mac and Windows + several UNIX-platforms

Some beginner user cases when preparing data for analysis

  • Work with BIG data sets
  • Execute the same process over and over again
  • Look at multiple variables (such as time, message and channel) at the same time

What are R and RStudio?

R is a programming language and software environment for statistical computing and graphics. It has a powerful core, with a broad variety of mathematical and graphical methods.

RStudio is the visual interface that lets you use R without touching the terminal, something not everyone likes.

Typical uses are linear and nonlinear modelling, classical statistical tests, time-series analysis, classification, clustering, etc. Also, it’s possible to extend R with different libraries, and you can take the legos you need to build a tool that suits your needs.

For most users, R is way too advanced. But for anyone planning to work a lot with data or become a data scientist, it’s a tool you will have to learn sooner or later.

Here’s a harsh reality:

R is not very useful if you don’t know enough math to use it.

Now you might think:

” So why are you even bringing it up in a guide with beginner tools?”

Well:

R can solve some problems that are not possible with the tools I’ve talked about earlier in this post. And sometimes it might be the easiest way to solve a problem.

What makes it hard?

When you work in R and RStudio, you tell the tool what you want to accomplish by writing code instead of using a friendly visual interface.

Even when you use RStudio, the process is a bit similar to the terminal on your computer, in many ways. You write code, and you get answers back.

When do I use R?

I use R in several cases:

  1. When I have massive data files
  2. If I need to calculate relationships between data points or data sets
  3. If I’m going to do the same thing with many data sets over and over again

Neither of these needs appears very often in my everyday life.

Now, you might say:

“Tell us about a time where R was your only option!”

The one critical feature of R that I use when preparing data is that I can “borrow” processor power from the cloud.

When you have large data sets, sometimes your computer processor cannot handle them. And buying a bigger computer isn’t always an option – or necessary.

Instead, you run your operations on a virtual machine in the could and the result get sent back to you. For the user, it’s not so different from doing the same thing (calculate, merge, etc.) on your computer.

Are you with me?

Well, I’m not saying you should start doing that. But it’s good to know that it’s possible if you run into that type of problem.

Two other times when I needed R, and no other tool could help me, was in the analysis step:

  • Like a couple of years ago when I did a network analysis with data from open Facebook groups. That analysis wasn’t possible to do in any of the other tools. 
  • Or when I had to calculate the overlaps between data sets and create a Venn diagram. Calculating overlaps and creating Venn-diagrams is often hard, or not possible, with other tools, but R does it nicely (even though you might need to install some libraries).

 

Smart R functionality for beginners to learn

  • Use Help! to get more info about different commands
  • Install and use different libraries
  • Data types, arithmetic & logical operators
  • Create and print matrices, adding columns and slice
  • The difference between categorical & continuous variables

Some final thoughts about preparing data for analysis

You might ask:

“Can I do this?”

I’m confident that you can. Start with small tasks and learn from there.

I learned almost all my data and analytics skills on my own by trial and error. I started with small side projects, and I’ve used MOOCs and tutorials online to learn what I need to know.

Sure, I had some university credits in basic statistics, before I started this journey – and I once knew how to perform multivariate analysis in SPSS. But, to be honest, that wasn’t too helpful.

To tell the truth:

I have learned everything I know because no one else around me knew how to do data analysis

I didn’t try to figure something out on my own; I wouldn’t have any quantitative insight at all.

And not using the data I had access to in any way felt like a worse idea than to try to do some fundamental data analysis on my own.

You can do it too. I’m sure.

Share on facebook
Share on google
Share on twitter
Share on linkedin
Share on pinterest
Share on print
Share on email

Leave a Reply

Your email address will not be published. Required fields are marked *

Work with me?

I help people and businesses use data and technology to reach their goals.

Do you want help to reach your goals?

Hi!

Every Sunday,
more than 100 people get my executive newsletter with the most essential Internet-related news from the previous week.

Do you want it too?

All you have to do is fill in your name and email address below. Also, the newsletter is free.