# Load package
library(tidyverse)
# Import data
survey_data <- read_csv("data/raw/survey_student_rawdata.csv")Lesson 1: Data Cleaning
Learning Objectives
By the end of this lesson, you’ll be able to:
- Describe the importance of data cleaning for the research data lifecycle.
- Identify data cleaning needs for an example dataset.
- Use the pipe operator and functions from the
tidyverseto clean data of an example dataset.
- Write reproducible code by applying concepts of literate programming.
Activity - What Needs to Be Cleaned?
Before cleaning a dataset, you should first explore your data so that you can understand what actions you must take in order to answer your research questions. In a previous lesson, you imported and looked at the dataset in RStudio. Now, you will perform a deeper inspection to make a plan for data cleaning.
Lecture - Data Cleaning
Lecture
Data Cleaning
Data cleaning is the process of checking, fixing, or transforming your data. Depending on how it was collected, your data may contain errors (e.g., typos) or it may be in a format that does not allow it to be easily used. During data cleaning, you will correct and transform the data to ensure it is ready for analysis.
Data cleaning can be time consuming, but it is a crucial step in the research data lifecycle. If your data contains errors, the results of your analysis will not be reliable, regardless of how good your methods are.
As you clean your data, use the data management practices you’ve learned so far to document your data cleaning process. A preferred way to document this step is to create a script that performs and describes all of the required data cleaning actions. The comments in your script will not only help you in the future, but enhance the reproducibility of your research for others.
Best Practices for Data Cleaning
The single most important rule in data cleaning is to never directly edit your raw data. Once your data has been collected, save the source file in the appropriate data folder in your project directory and do not change it. This ensures that you can reload the original dataset if you make a mistake. It also allows others (and yourself) to understand the ways in which you cleaned and manipulated your data. This is where file versioning practices are necessary, as any changes to the data should be saved and labelled as a new version.
When performing data cleaning using a script, create a single data cleaning script that uploads the raw data, performs all necessary data cleaning steps, and then saves the cleaned dataset in a new file in the correct folder in your project directory.
When sharing the dataset with others (which we will cover in more detail in another lesson), always remember to share the clean version of your data. This file will be easy to identify if you have named it appropriately.
Data Cleaning Steps
Each dataset is unique and the type of data cleaning practices you use will need to be adapted for each new project. But, here is a general idea of what a dataset may require for cleaning:
- Fix column names.
- Remove duplicate observations.
- Correct spelling mistakes.
- Standardize letter case and use of spaces or other special characters.
- Group values into larger categories.
- Convert variables to correct variable types (e.g., numeric, categorical, date, etc).
- Merge and split columns.
- Create new variables.
- Identify missing values.
- Create a subset of the data.
- Reshape the dataset for analysis.
Tutorial - Data Cleaning
In this tutorial, we are going to learn about common cleaning steps and techniques. As you go through this tutorial, transcribe the code into your own script file, and then run the code to execute the commands.
The Pipe Operator
You’ve already learned that the tidyverse package has many functions that are helpful for data cleaning. You will use these functions in this lesson, but before you do, it is helpful to understand one additional rule about the syntax used in this library. The tidyverse uses a pipe operator, represented by the symbol |>. The pipe is used to feed the result of one function directly into the next function.
In an earlier lesson, you used “nesting” to view frequencies of observations by age and gender in our example dataset. We will continue in this script to learn how to use the pipe. For that, first open your R Project by clicking in the .RProj file in your project directory. This will open the project in R Studio. Then, to run the nesting code again, open your script in RStudio, and run the first lines of code loading the tidyverse and uploading the data, and the final line of code exploring the count by age and gender:
# Get the number of observations by age and gender and print all results
print(count(survey_data, Age, Gender), n = Inf) # A tibble: 51 × 3
Age Gender n
<dbl> <chr> <int>
1 16 Female 5
2 16 Male 16
3 16 Non-binary 4
4 16 <NA> 2
5 17 Female 31
6 17 Male 17
7 17 Non-binary 1
8 17 Prefer not to say 1
9 18 Female 27
10 18 Male 19
11 18 Non-binary 2
12 18 Prefer not to say 1
13 19 Female 28
14 19 Male 36
15 19 Non-binary 1
16 19 <NA> 3
17 20 Female 21
18 20 Male 34
19 20 Prefer not to say 1
20 20 <NA> 3
21 21 Female 31
22 21 Male 23
23 21 Prefer not to say 1
24 21 <NA> 2
25 22 Female 17
26 22 Male 19
27 22 Non-binary 4
28 22 Prefer not to say 2
29 22 <NA> 2
30 23 Female 21
31 23 Male 23
32 23 <NA> 1
33 24 Female 19
34 24 Male 14
35 24 Non-binary 1
36 24 Prefer not to say 1
37 24 <NA> 1
38 25 Female 6
39 25 Male 8
40 25 Non-binary 4
41 25 <NA> 2
42 26 Female 2
43 26 Male 6
44 27 Female 5
45 27 Male 5
46 29 Male 1
47 NA Female 27
48 NA Male 18
49 NA Non-binary 1
50 NA Prefer not to say 4
51 NA <NA> 1
Here’s where the pipe operator becomes useful. Instead of nesting count(survey_data, Age, Gender) inside the print() function, you can use the |> symbol to feed the result of the first part of the argument (count(survey_data, Age, Gender) into the second part (print()).
# Get the number of observations by age and gender and print all results
count(survey_data, Age, Gender) |>
print(n = Inf) # A tibble: 51 × 3
Age Gender n
<dbl> <chr> <int>
1 16 Female 5
2 16 Male 16
3 16 Non-binary 4
4 16 <NA> 2
5 17 Female 31
6 17 Male 17
7 17 Non-binary 1
8 17 Prefer not to say 1
9 18 Female 27
10 18 Male 19
11 18 Non-binary 2
12 18 Prefer not to say 1
13 19 Female 28
14 19 Male 36
15 19 Non-binary 1
16 19 <NA> 3
17 20 Female 21
18 20 Male 34
19 20 Prefer not to say 1
20 20 <NA> 3
21 21 Female 31
22 21 Male 23
23 21 Prefer not to say 1
24 21 <NA> 2
25 22 Female 17
26 22 Male 19
27 22 Non-binary 4
28 22 Prefer not to say 2
29 22 <NA> 2
30 23 Female 21
31 23 Male 23
32 23 <NA> 1
33 24 Female 19
34 24 Male 14
35 24 Non-binary 1
36 24 Prefer not to say 1
37 24 <NA> 1
38 25 Female 6
39 25 Male 8
40 25 Non-binary 4
41 25 <NA> 2
42 26 Female 2
43 26 Male 6
44 27 Female 5
45 27 Male 5
46 29 Male 1
47 NA Female 27
48 NA Male 18
49 NA Non-binary 1
50 NA Prefer not to say 4
51 NA <NA> 1
In the example above, print(n = Inf) is indented on a new line for readability. When using the pipe symbol (|>), editors such as RStudio may automatically indent the next line to show that the command continues.
Note: The symbol %>% is another way of expressing the pipe syntax in tidyverse, while |> is used in base R. Both symbols will work exactly the same in most cases. In this course, we will use |>, which is now considered the standard pipe symbol, but if you see %>% in older tutorials or in tidyverse documentation, you can interpret it in the same way as |>. If you want to know more about their differences, you can refer to this page on differences between the base R and magrittr pipes.
Now that we have learned how to use the pipe, we can move on to creating the data cleaning script. You will continue to use the pipe throughout this lesson, so there will be plenty of opportunities to practice this new syntax. Save your current script and close it. We will start data cleaning in a new script.
Set up the Data Cleaning Script
It is good practice to start a new task in R with a clean working environment. This removes existing objects and packages, which ensures that your code runs predictably and includes all necessary steps.
In RStudio, in the topmost menu bar, click on ‘Session’ and then ‘Restart R’.
Now, open a new R script and save it with a meaningful name in the correct folder location.
At the top of your script, create a title and add a brief description of its purpose. Remember, this comment should be preceded by a hash # symbol.
# DATA CLEANING
# this script:
## uploads the raw social media survey data \+ cleans it
## outputs a cleaned dataset ready to be used in analysis Organize your working environment
Before you start with the actual data cleaning tasks, make sure your working environment is ready to go. Since you are working within an R Project, the working directory will already be set.
Next, add a line that loads the packages you need. In this lesson, we’ll be using tidyverse, though it is common to see more than one package here. Be sure to include a comment about what the script is doing. It may seem obvious to you, but the more you get into the habit of writing comments, the more likely you are to do it consistently.
# Load the tidyverse package
library(tidyverse) Remember to check the console (below your script editing window) to ensure your code ran without errors.
Then, load the dataset using read_csv()
If you can’t remember how this function works, look back to your practice scripts, or review Day 3.
# import data
survey_data <- read_csv("data/raw/survey_student_rawdata.csv") You should now see a survey_data object listed under Data, in the top right panel Environment tab.
Take some time to inspect the console output for the code chunk you just created. What kind of information does each line provide? You may need to scroll up in the console to see the entire output. We won’t review them again here, but you should be seeing terms like tibble, <chr>, and print(n = ...), along with a sample of the survey data, both in the console and in a new window.
Update Column Names
Now that we know that the data have been imported correctly, it is time to begin cleaning. Remember, each dataset is unique, so what we do in this lesson may not apply to the next project you work on. Use what you learn here as a starting point.
The first thing we need to address are the column names, as they are currently very long with extra characters such as spaces and punctuation. Why do you think the column names look like this?
# check names
names(survey_data) [1] "Timestamp"
[2] "Name"
[3] "Email"
[4] "Date of birth"
[5] "Age"
[6] "Province"
[7] "Gender"
[8] "Year of Study"
[9] "Field of Study"
[10] "How many hours per day do you use each platform? [Facebook]"
[11] "How many hours per day do you use each platform? [Instagram]"
[12] "How many hours per day do you use each platform? [Snapchat]"
[13] "How many hours per day do you use each platform? [Tiktok]"
[14] "How many hours per day do you use each platform? [X (Twitter)]"
[15] "How many hours per day do you use each platform? [Other]"
[16] "On average, how many hours of sleep do you get per night?"
[17] "What devices do you use to access social media"
[18] "Please indicate how much you agree or disagree with the following statements about social media and your mental health: [Social media helps me feel connected to others.]"
[19] "Please indicate how much you agree or disagree with the following statements about social media and your mental health: [Social media makes me feel anxious or stressed.]"
[20] "Please indicate how much you agree or disagree with the following statements about social media and your mental health: [Social media distracts me from academic work.]"
[21] "Please indicate how much you agree or disagree with the following statements about social media and your mental health: [Social media improves my mood when I feel lonely.]"
[22] "Select any of the following reasons for using social media:"
The column names consist of the question text from the survey form. While the single-word column names are acceptable, the other ones will be more difficult to work with, and require cleaning.
To change column names we will use rename(). This function is part of the dplyr package that is installed with tidyverse.
Type the following code into your script to change “Date of birth” to “date_birth”:
# change column names
survey_data_fixnames <- survey_data |>
rename(date_birth = "Date of birth") What happened when you ran the names(survey_data)code again? Did you see the name change? Remember, we didn’t change column names in the survey_data object, we created a new object called survey_data_fixnames and changed the column name there.
rename() can change multiple column names at once if you provide it with a comma separated list. Modify your code so that it changes the column “How many hours per day do you use each platform? [Facebook]” to “hours_facebook”. You can do this by adding a comma after date_birth = "Date of birth.
# change column names
survey_data_fixnames <- survey_data |>
rename(date_birth = "Date of birth",
hours_facebook = "How many hours per day do you use each platform? [Facebook]") Now, let’s change the rest of the column names. Use the code below as a guide, but get used to navigating your dataset in RStudio. Also, if you want to use copy / paste to collect the column names, do it in RStudio, not here. You should never copy code from a website for use on your computer. Not only can the HTML markup cause errors, the code itself could be malicious.
Remember, a script is a document. Take your time to create it, and check for typos as you go. Take the same care as you would when writing a paper or formatting a list of citations. Grammar, punctuation, and style matter.
You can now check if column names were correctly modified.
# Check if names were correctly changed
names(survey_data_fixnames) [1] "timestamp" "name" "email"
[4] "date_birth" "age" "province"
[7] "gender" "year_study" "field_study"
[10] "hours_facebook" "hours_instagram" "hours_snapchat"
[13] "hours_tiktok" "hours_xtwitter" "hours_other"
[16] "hours_sleep" "devices_used" "feel_connected"
[19] "feel_stress" "feel_distracted" "feel_improvedmood"
[22] "reasons_use"
Did everything work as you expected? You should see a list of all of the updated column names. If you don’t, go back through this lesson from the beginning and try again. Remember, another way to keep track of your objects in RStudio is to check the Environment tab.
Remove Duplicates
Duplicated information can happen in a dataset for many reasons. There may have been an error in the survey collection tool that created a duplicate column, or user errors could lead to the same data appearing in multiple records, or rows. It is good practice to search for duplicate data.
One way to check for duplicates in a survey dataset like this, is to check that the number of respondents is the same as the number of rows in the dataset.
To see the number of respondents, we will use the name and email columns. While some people in a dataset could have the same name, they will have a different email address. Use the function distinct() from dplyr, which shows the unique combinations of the variables you choose.
# Number of respondents in the dataset.
survey_data_fixnames |>
distinct(name, email) # A tibble: 500 × 2
name email
<chr> <chr>
1 Mei Lopez mei.lopez@outlook.com
2 Yuki Smith yuki.smith@proton.me
3 Arjun Khan arjun.khan@uottawa.ca
4 Lucas Chen lucas.chen@icloud.com
5 Nina Anderson nina.anderson@icloud.com
6 Olivia Garcia olivia.garcia@wilfridlaurier.ca
7 Aisha Singh aisha.singh@hotmail.com
8 Jamal Wilson jamal.wilson@yorku.ca
9 Jamal Nguyen jamal.nguyen@gmail.com
10 Nina Brown nina.brown@icloud.com
# ℹ 490 more rows
We are not creating a new object with this information, as we’ve not used the <- operator to assign new values. We are just displaying this information in the console.
The console output shows the following information: A tibble: 500 x 2
This means that the function has retrieved two columns (name and email) and 500 rows (500 unique respondents).
To check the total number of rows in the dataset, use the function nrow().
# Number of rows in the dataset.
nrow(survey_data_fixnames) [1] 525
What was the output from nrow()? If there are more rows than unique respondents, it means there are duplicate observations.
To remove them, we will use the distinct() function and assign the function’s output to a new object. If you do not specify column names inside the parentheses, this function will search for and keep unique rows from across the entire data frame or tibble.
# Remove duplicated values
survey_data_nodupl <- survey_data_fixnames |>
distinct() Now, check the number of rows in the dataset again.
# Check number of rows in the dataset and the number of distinct rows
nrow(survey_data_nodupl)
distinct(survey_data_nodupl) You should now have the same number of rows as respondents, which shows that duplicates have been removed. If you don’t see the same result, refresh your R session and go through your code from the top.
Tackle Personal Identifiers
This dataset includes the names and email address of respondents, as well as other personal details, such as age and gender. While the design of a survey instrument is outside of the scope of this course, it is always a good idea to consider how much data you actually need before you start collection. Your project may have documentation about how respondent data are to be collected or used, so make sure you follow the norms of your lab or research group.
It is common practice to remove personal identifiers from survey data, but you may also wish to create a code for each entry, should you ever need to connect information back to the respondent. In this example, we will create a new ID column, and then write it to a new object with the personal information removed.
Create an ID Column
The mutate() function, part of the dplyr package, is useful when you want to modify columns or create new columns based on other functions or calculations with existing variables.
We can use mutate() with the row_number() function to create a new column that assigns a unique number to each row. This has the added benefit of identifying the unique entries in the dataset.
# add a column with an anonymous ID per respondent
survey_data_id <- survey_data_nodupl |>
mutate(id = row_number()) Now, check if the id column was created.
# add a column with an anonymous ID per respondent
View(survey_data_id) Scroll down to the right to see the final column in the dataset, it should be the newly created id column.
Reorder Columns
Now, we want to move our newly created id column from the right of the tibble to the left, so that it is easier to see when we view the data.
The relocate() function, which is also part of the dplyr package, is the next function we will learn. It is a simple function that relocates columns in a dataset.
The default of relocate() is to put the specified column in the first position, so we don’t need to specify location here, though that is possible with added arguments. We will learn those below.
# moves id column to the left most position (default of relocate function)
survey_data_ordered <- survey_data_id |>
relocate(id)
# check if it worked
survey_data_ordered # A tibble: 500 × 23
id timestamp name email date_birth age province gender year_study
<int> <chr> <chr> <chr> <chr> <dbl> <chr> <chr> <dbl>
1 1 3/18/2026 19:0… Mei … mei.… 9/16/2001 24 British… Female 3
2 2 3/5/2026 17:00… Yuki… yuki… 4/16/2004 21 MB Female 1
3 3 3/6/2026 08:00… Arju… arju… 2/14/2007 19 Manitoba Male 4
4 4 3/6/2026 12:00… Luca… luca… 7/20/2001 24 BC Female 2
5 5 3/10/2026 08:0… Nina… nina… 5/3/2001 NA ON Male 1
6 6 3/18/2026 18:0… Oliv… oliv… 1/15/2007 19 ontario Male 2
7 7 3/8/2026 08:00… Aish… aish… 12/30/2008 17 british… Female 1
8 8 3/5/2026 11:00… Jama… jama… 3/2/2005 21 MB Female 1
9 9 3/18/2026 19:0… Jama… jama… 2/1/2002 24 <NA> Female 1
10 10 3/8/2026 08:00… Nina… nina… 8/27/2005 20 Alberta <NA> 2
# ℹ 490 more rows
# ℹ 14 more variables: field_study <chr>, hours_facebook <dbl>,
# hours_instagram <dbl>, hours_snapchat <dbl>, hours_tiktok <dbl>,
# hours_xtwitter <dbl>, hours_other <dbl>, hours_sleep <dbl>,
# devices_used <chr>, feel_connected <chr>, feel_stress <chr>,
# feel_distracted <chr>, feel_improvedmood <chr>, reasons_use <chr>
However, if you want to put a column in another location, you can specify the new location in reference to another column by using the arguments .before or .after. For example, the code below moves the gender column to the position after the age column.
# move gender column to the right of age column
survey_data_ordered <- survey_data_ordered |>
relocate(gender, .after = age)
# check
survey_data_ordered # A tibble: 500 × 23
id timestamp name email date_birth age gender province year_study
<int> <chr> <chr> <chr> <chr> <dbl> <chr> <chr> <dbl>
1 1 3/18/2026 19:0… Mei … mei.… 9/16/2001 24 Female British… 3
2 2 3/5/2026 17:00… Yuki… yuki… 4/16/2004 21 Female MB 1
3 3 3/6/2026 08:00… Arju… arju… 2/14/2007 19 Male Manitoba 4
4 4 3/6/2026 12:00… Luca… luca… 7/20/2001 24 Female BC 2
5 5 3/10/2026 08:0… Nina… nina… 5/3/2001 NA Male ON 1
6 6 3/18/2026 18:0… Oliv… oliv… 1/15/2007 19 Male ontario 2
7 7 3/8/2026 08:00… Aish… aish… 12/30/2008 17 Female british… 1
8 8 3/5/2026 11:00… Jama… jama… 3/2/2005 21 Female MB 1
9 9 3/18/2026 19:0… Jama… jama… 2/1/2002 24 Female <NA> 1
10 10 3/8/2026 08:00… Nina… nina… 8/27/2005 20 <NA> Alberta 2
# ℹ 490 more rows
# ℹ 14 more variables: field_study <chr>, hours_facebook <dbl>,
# hours_instagram <dbl>, hours_snapchat <dbl>, hours_tiktok <dbl>,
# hours_xtwitter <dbl>, hours_other <dbl>, hours_sleep <dbl>,
# devices_used <chr>, feel_connected <chr>, feel_stress <chr>,
# feel_distracted <chr>, feel_improvedmood <chr>, reasons_use <chr>
Save Data with Personal Identifiers
Before we go further, let’s save a copy of our data as it is currently formatted, just as our project lead asked us to do in the project scenario. This copy will include the personal identifiers and the ID codes. We won’t be using this dataset for analysis, but it will be necessary should the project lead need to connect with respondents.
To save the data, we can use the function write_csv(), which is very similar to the function read_csv that you learned in an earlier lesson. The general syntax is:
write_csv(object_to_save, “file-path-to-file-to-be-saved.csv”)
It is important to know that if you use the name of an existing file, R will overwrite it without warning. As we mentioned earlier, you should never edit your original data, so in this step we are going to give the file a new descriptive name.
# Save data linking ID to names
write_csv(survey_data_ordered, "data/processed/survey_student_identifiers-IDs.csv") Remember, if you want to inspect your file directory, you can see it under the ‘Files’ tab in the bottom right panel of RStudio. Did your file save as expected?
Remove Columns
Now that we have saved a new version of the data, we will remove the email and name columns to de-identify the dataset. We will also use the same command to remove unnecessary columns.
There are a number of ways to remove columns in R, but here, we’ll focus on the select() function from the dplyr package in tidyverse.
This function uses the same general syntax that we have been learning as we explore the tidyverse. In the parentheses, you must insert a dash - or minus symbol, followed by the name of the column you want to remove.
Remove the name and email columns using the select() function.
# Remove name and email columns
survey_data_anon <- survey_data_ordered |>
select(-name, -email) You can now check that all column removals worked:
# check that it worked
survey_data_anon # A tibble: 500 × 20
id gender date_birth age province year_study field_study hours_facebook
<int> <chr> <chr> <dbl> <chr> <dbl> <chr> <dbl>
1 1 Female 9/16/2001 24 British … 3 Statistics 1
2 2 Female 4/16/2004 21 MB 1 Computer S… 1
3 3 Male 2/14/2007 19 Manitoba 4 Psychology 1
4 4 Female 7/20/2001 24 BC 2 Chemistry 1
5 5 Male 5/3/2001 NA ON 1 Biology 1
6 6 Male 1/15/2007 19 ontario 2 psycology 1
7 7 Female 12/30/2008 17 british … 1 psychology 1
8 8 Female 3/2/2005 21 MB 1 stats 1
9 9 Female 2/1/2002 24 <NA> 1 Psychology 1
10 10 <NA> 8/27/2005 20 Alberta 2 Computer S… 1
# ℹ 490 more rows
# ℹ 12 more variables: hours_instagram <dbl>, hours_snapchat <dbl>,
# hours_tiktok <dbl>, hours_xtwitter <dbl>, hours_other <dbl>,
# hours_sleep <dbl>, devices_used <chr>, feel_connected <chr>,
# feel_stress <chr>, feel_distracted <chr>, feel_improvedmood <chr>,
# reasons_use <chr>
Before we continue to the next step, it is useful to know that the select() function can also be used to select specific variables from your dataset. For example, if you want to show the variables year_study and field_study, use the following code:
# select specific columns
survey_data_anon |>
select(year_study, field_study) # A tibble: 500 × 2
year_study field_study
<dbl> <chr>
1 3 Statistics
2 1 Computer Science
3 4 Psychology
4 2 Chemistry
5 1 Biology
6 2 psycology
7 1 psychology
8 1 stats
9 1 Psychology
10 2 Computer Science
# ℹ 490 more rows
Save current version of cleaned data
Now that we removed the personal identifiers, it is good practice to save a version of the data at this stage, before any more steps of data cleaning, so we can share this version later on.
To continue the script, we want to import the data that we just saved. This will make sure that the rest of the code is reproducible later on when we share the anonymized data with clean column names.
# import data
survey_data_anon <- read_csv("data/processed/survey_student_anonymized-cleancolnames.csv") Create New Variables
Depending on your research question, you may need to create a new variable based on your existing variables. For example, because we are investigating overall social media usage, we could combine the time spent in different platforms into a single variable. This is called a “derived variable”, as it does not appear in the original dataset. However, if our question was about comparing the usage of different platforms, we would want to keep the variables separate. It is always important to go back to your questions to have more clarity on how you want your variables to be.
To create a derived variable, we will use the mutate() function. Remember, the syntax for this function requires an argument inside the parentheses. Last time, we used mutate() to create the new ID column. This time, we are using it to combine variables, so we need to tell R the new column name, followed by the names of the columns we want to merge. Since the column names are existing variables in our dataset, we don’t need to enclose them in quotations. Take a look at the code below to see how the mutate() function works for this example.
# add a variable for total amount of time spent on social media
survey_data_tothour <- survey_data_anon |>
mutate(
hours_total = hours_facebook + hours_instagram + hours_snapchat + hours_tiktok + hours_xtwitter + hours_other
) Remember, by default, R creates the new variable as the last column in the dataset. For a cleaner dataset, we may want to move this variable closer to the individual platform variables.
Now that we have created this new variable, let’s see if the numbers from each individual column were added together correctly in the new column. First, let’s use the range() function to see the range of total hours spent in social media:
# check the range of hours\_total
range(survey_data_tothour$hours_total) [1] NA NA
If your code ran correctly, you’ll be seeing NA, or ‘not available’ in your console. NA in R means that a value is missing or unavailable.
Deal with Missing Data
Missing data can create difficulties when working with certain functions because they can’t be used in calculations. Certain statistical analyses, such as linear models, will not work if there is missing data. This means that dealing with NA values during data cleaning is very important. There are many different ways to handle NAs, but we will only cover some of them here.
First, let’s look at how many NAs we have in our dataset.
# check NAs in the dataset
survey_data_tothour |> summary() id gender date_birth age
Min. : 1.0 Length:500 Length:500 Min. :16.00
1st Qu.:125.8 Class :character Class :character 1st Qu.:18.00
Median :250.5 Mode :character Mode :character Median :20.00
Mean :250.5 Mean :20.47
3rd Qu.:375.2 3rd Qu.:23.00
Max. :500.0 Max. :29.00
NA's :47
province year_study field_study hours_facebook
Length:500 Min. :1.000 Length:500 Min. :0.000
Class :character 1st Qu.:1.000 Class :character 1st Qu.:1.000
Mode :character Median :2.000 Mode :character Median :1.000
Mean :2.213 Mean :0.831
3rd Qu.:3.000 3rd Qu.:1.000
Max. :4.000 Max. :2.000
NA's :16 NA's :3
hours_instagram hours_snapchat hours_tiktok hours_xtwitter hours_other
Min. :0.000 Min. :0.0000 Min. :0.000 Min. :0.0000 Min. :0
1st Qu.:1.000 1st Qu.:1.0000 1st Qu.:1.000 1st Qu.:0.0000 1st Qu.:0
Median :1.000 Median :1.0000 Median :1.000 Median :0.0000 Median :0
Mean :1.472 Mean :0.9919 Mean :1.306 Mean :0.0261 Mean :0
3rd Qu.:2.000 3rd Qu.:1.0000 3rd Qu.:2.000 3rd Qu.:0.0000 3rd Qu.:0
Max. :3.000 Max. :2.0000 Max. :3.000 Max. :1.0000 Max. :0
NA's :4 NA's :8 NA's :4 NA's :2 NA's :7
hours_total hours_sleep devices_used feel_connected
Min. : 0.00 Min. :4.00 Length:500 Length:500
1st Qu.: 4.00 1st Qu.:6.00 Class :character Class :character
Median : 4.00 Median :7.00 Mode :character Mode :character
Mean : 4.61 Mean :6.95
3rd Qu.: 6.00 3rd Qu.:8.00
Max. :11.00 Max. :9.00
NA's :28 NA's :18
feel_stress feel_distracted feel_improvedmood reasons_use
Length:500 Length:500 Length:500 Length:500
Class :character Class :character Class :character Class :character
Mode :character Mode :character Mode :character Mode :character
If you inspect the output in the console, you will see detailed information about each of the variables. The numeric variables show a set of descriptive statistics, while the character variables show the character count and other information. You may remember seeing these descriptive statistics in an earlier lesson. At the bottom of each column generated by the summary() function, you will also see the number of NAs present in the data for each variable.
As you can see in the columns related to the hours variables, there are NA values across all of them. For example, there were three participants that did not answer the question about hours spent on Facebook. Because this data is missing, R can’t accurately calculate the total hours, which explains the NAs in the hours_total variable.
Every missing data situation is different, so when you’re working with your own data, you will have to make a decision about how you want to deal with NAs based on your dataset and research question.
In some cases, you may want to preserve the missing values. Since we don’t have information to replace the missing data about social media usage, it would be unwise to add incorrect information simply to get rid of the error. In this situation, we can ask R to prevent the inclusion of NA values by using the na.rm=TRUE argument inside our function. This tells R to remove the NA values before performing a calculation.
# check range of total_hour_per_day, removing NAs
range(survey_data_tothour$hours_total, na.rm = TRUE) [1] 0 11
With this additional argument, you can now see the range of the hours_total variable, which shows the values 0 and 11. This means that the lowest value in the hours_total column is zero and the highest value is 11.
In other situations, you may wish to replace the NA values, rather than ignore them. In this dataset, we can safely substitute the NAs in the hours variables with zeroes, as the sum of all hours will still give us a minimum number of hours. To do this we will return to the mutate() function, combining it with the replace_na() function. The replace_na function takes a vector in its argument followed by a value to replace the missing values.
# replace NAs in the hours variable with 0
survey_data_replna <- survey_data_tothour |>
mutate(hours_facebook = replace_na(hours_facebook, 0),
hours_instagram = replace_na(hours_instagram, 0),
hours_snapchat = replace_na(hours_snapchat, 0),
hours_tiktok = replace_na(hours_tiktok, 0),
hours_xtwitter = replace_na(hours_xtwitter, 0),
hours_other = replace_na(hours_other, 0))Now you can check if there still are NAs in those columns.
# check: that NAs were replaced by zeroes in hours-of-social-media-use columns
summary(survey_data_replna) Now, we can update our hours_total column by re-running the code snippet again.
# update our 'hours_total' column
survey_data_replna <- survey_data_replna |>
mutate(
hours_total = hours_facebook + hours_instagram + hours_snapchat + hours_tiktok + hours_xtwitter + hours_other
)Now you can check if there still are NAs in the hours_total column.
# check no NAs in hours_total
summary(survey_data_replna$hours_total) Min. 1st Qu. Median Mean 3rd Qu. Max.
0.000 4.000 4.000 4.584 5.250 11.000
We can also check the range again.
# check the range again
range(survey_data_replna$hours_total) [1] 0 11
Looks good, now we have only numerical values, without having to use the na.rm argument.
Finally, another way that you can deal with NAs is to remove them for your dataset. For example, it might be that knowing the number of hours people sleep on average is very important for your research question, in which case you might want to remove all the entries where hours of sleep were not specified from the dataset.
We can remove the entries (i.e., rows) where hours_sleep are NA using the function drop_na():
# remove observations with NAs for hours slept
survey_data_dropna <- survey_data_replna |>
drop_na(hours_sleep) To drop NAs, you have to drop the entire observation, which means you end up with a smaller dataset.
# check number of rows now
nrow(survey_data_dropna) [1] 437
Therefore, dropping NAs is only recommended when essential for your analysis. Here, as we have a larger sample size, it is okay to continue with this smaller dataset.
Now that we have dealt with NAs, let’s continue with other data cleaning steps.
Fix Typos and Recode Variables
Now that we have organized our columns and dealt with NAs, it is time to start looking at the actual values of variables to see if we will be easily able to analyze them, or if they need cleaning too. A quick inspection of the dataset shows us that some variables like province and field of study need some work to be put into a stage that is ready for analysis. When we want to substitute some values of a variable for other values, we call that recoding. This can be done both to update an existing variable, or to create a new variable. Let’s look at both options.
The province variable was entered as a free entry text, and there are a lot of typos that need to be corrected. A good way to see all possible entries is to use the function distinct(), which we have already seen.
# check all entries for the province variable
survey_data_dropna |>
distinct(province) |>
arrange(province) |>
print(n = Inf) # A tibble: 18 × 1
province
<chr>
1 AB
2 Alberta
3 Alta
4 B.C.
5 BC
6 British Columbia
7 MB
8 Manitoba
9 NS
10 Nova Scotia
11 ON
12 Ontario
13 QC
14 Quebec
15 Québec
16 british columbia
17 ontario
18 <NA>
Look at the results. Can you spot the typos that need to be fixed?
Now you can use these values to fix the variable. In this case, we want to update the variable, and not create a new one. For that, we can use the replace_values() function from the dplyr package. This function is useful when you want to replace values in a column with new values. We will use this inside the mutate() function because we will be modifying an existing column. This is how we would use it for the province variable:
# fix values for the province variable
survey_data_notypos <- survey_data_dropna |>
mutate(
province = replace_values(province,
c("AB", "Alta") ~ "Alberta",
c("B.C.", "BC", "british columbia") ~ "British Columbia",
"MB" ~ "Manitoba",
"NS" ~ "Nova Scotia",
c("ON", "ontario") ~ "Ontario",
c("QC", "Québec") ~ "Quebec"
)
) Now we can see if that worked:
# check if it worked
survey_data_notypos |>
distinct(province) |>
arrange(province) |>
print(n = Inf) # A tibble: 7 × 1
province
<chr>
1 Alberta
2 British Columbia
3 Manitoba
4 Nova Scotia
5 Ontario
6 Quebec
7 <NA>
Great, we fixed the typos and inconsistencies in the field of study variable.
Sometimes you might also want to create a new variable based on values from other variables. For example, it might be useful to group participants according to certain ranges of social media usage, instead of looking at the exact number of hours used per day.
For that, we can use the case_when() function inside mutate() to create a new categorical variable that bins participants into four categories of social media usage. The case_when function works well for when you want to create a whole new variable using values from an existing variable (this is different from replacing an existing variable, as we did above for the province and field_study variables, when we were interested in updating an existent column, rather than creating a new one). To create this new variable, case_when looks for statements of conditions, and creates the new values when the condition is TRUE. Let’s see how this works:
# create categorical variable for hours\_total
survey_data_bins <- survey_data_notypos |>
mutate(bin_hours_total = case_when(
hours_total <= 2 ~ "Less than or equal to 2 hours",
hours_total <= 4 ~ "2-4 hours",
hours_total <= 6 ~ "4-6 hours",
hours_total > 6 ~ "More than 6 hours"))If you want to check how that worked, you can see how the bin_hours_total variable maps to the hours_total variable using this code:
# check if it worked
survey_data_bins |>
distinct(hours_total, bin_hours_total) |>
arrange(hours_total) # A tibble: 11 × 2
hours_total bin_hours_total
<dbl> <chr>
1 0 Less than or equal to 2 hours
2 1 Less than or equal to 2 hours
3 2 Less than or equal to 2 hours
4 3 2-4 hours
5 4 2-4 hours
6 5 4-6 hours
7 6 4-6 hours
8 7 More than 6 hours
9 8 More than 6 hours
10 9 More than 6 hours
11 11 More than 6 hours
Separate Values
In our final cleaning step for this lesson (but there is more to come in the next lesson), we’ll talk about how to separate values from a variable. In certain cases, you might have a variable that actually contains two variables. For example, the date_birth variable can actually be understood as containing information about the date, month, and year of birth. To separate the values of this column into three columns, we can use the function separate_wider_delim() of the tidyr package, which specializes in dealing with columns like this.
The separate_wider_delim() function will separate a column into multiple columns, therefore making the dataset wider, hence its name. It also uses a delimiter to separate the column, and that’s why it has delim on its name (we all love an informative function name, right?). The function has three main arguments: the column(s) you want to separate, the delimiter that should be used to separate the values, and the names for the new columns. In the case of the date_birth column, the delimiter is the character /.
# separate 'date of birth' into 'day', 'month and 'year' columns, using the '/ as the separator
survey_data_sep <- survey_data_bins |>
separate_wider_delim(cols = date_birth,
delim = "/",
names = c("day", "month", "year")) Let’s see if it worked:
# check the result
survey_data_sep # A tibble: 437 × 24
id gender day month year age province year_study field_study
<int> <chr> <chr> <chr> <chr> <dbl> <chr> <dbl> <chr>
1 1 Female 9 16 2001 24 British Columbia 3 Statistics
2 2 Female 4 16 2004 21 Manitoba 1 Computer Sc…
3 3 Male 2 14 2007 19 Manitoba 4 Psychology
4 4 Female 7 20 2001 24 British Columbia 2 Chemistry
5 5 Male 5 3 2001 NA Ontario 1 Biology
6 6 Male 1 15 2007 19 Ontario 2 Psychology
7 7 Female 12 30 2008 17 British Columbia 1 Psychology
8 8 Female 3 2 2005 21 Manitoba 1 Statistics
9 9 Female 2 1 2002 24 <NA> 1 Psychology
10 10 <NA> 8 27 2005 20 Alberta 2 Computer Sc…
# ℹ 427 more rows
# ℹ 15 more variables: hours_facebook <dbl>, hours_instagram <dbl>,
# hours_snapchat <dbl>, hours_tiktok <dbl>, hours_xtwitter <dbl>,
# hours_other <dbl>, hours_total <dbl>, hours_sleep <dbl>,
# devices_used <chr>, feel_connected <chr>, feel_stress <chr>,
# feel_distracted <chr>, feel_improvedmood <chr>, reasons_use <chr>,
# bin_hours_total <chr>
Great, seems like we managed to separate the variable into three!
We can also unite the columns together using the unite() function, which does the opposite as separate_wider_delim(). This function takes in the name of the new column you want to create, followed by the current columns to be united, followed by the separator to be added between column values. For example, to recreate the date_birth variable, you can use this code:
# Now unite them
survey_data_unit <- survey_data_sep |>
unite(date_birth, day, month, year, sep = "/")
# check the result
survey_data_unit # A tibble: 437 × 22
id gender date_birth age province year_study field_study hours_facebook
<int> <chr> <chr> <dbl> <chr> <dbl> <chr> <dbl>
1 1 Female 9/16/2001 24 British … 3 Statistics 1
2 2 Female 4/16/2004 21 Manitoba 1 Computer S… 1
3 3 Male 2/14/2007 19 Manitoba 4 Psychology 1
4 4 Female 7/20/2001 24 British … 2 Chemistry 1
5 5 Male 5/3/2001 NA Ontario 1 Biology 1
6 6 Male 1/15/2007 19 Ontario 2 Psychology 1
7 7 Female 12/30/2008 17 British … 1 Psychology 1
8 8 Female 3/2/2005 21 Manitoba 1 Statistics 1
9 9 Female 2/1/2002 24 <NA> 1 Psychology 1
10 10 <NA> 8/27/2005 20 Alberta 2 Computer S… 1
# ℹ 427 more rows
# ℹ 14 more variables: hours_instagram <dbl>, hours_snapchat <dbl>,
# hours_tiktok <dbl>, hours_xtwitter <dbl>, hours_other <dbl>,
# hours_total <dbl>, hours_sleep <dbl>, devices_used <chr>,
# feel_connected <chr>, feel_stress <chr>, feel_distracted <chr>,
# feel_improvedmood <chr>, reasons_use <chr>, bin_hours_total <chr>
Although this is a simple example, there are multiple cases of weirdly structured data where the separate_wider_delim() and unite() functions will come in handy.
Save Your Script
Wow, that was a lot of data cleaning in one day! We are going to stop for now, but we will continue working on this script in Day 5 - Lesson 1, so there is no need to save a final cleaned dataset just yet. However, you will want to document and save everything that you did.
You will have time to update the project documentation in the next lesson.
Optional Data Cleaning Activity
Practice your skills in data cleaning with another dataset! This is an optional activity not required for the course. But if you choose to complete this activity, please let us know so that we know how many students were interested in our optional activities. You are also welcome to ask us any questions about it. We estimate it should take you 30–45min to complete this activity.
- Download this dataset. If you want to read more about it, check out the dataset documentation.
- Create a data cleaning script that:
- Renames variables to use consistent names with lower caps, no spaces, and no units of measurements
- Checks for duplicates and corrects if needed
- Reorders columns to move the Latitude and Longitude columns to after Squirrel ID
- Fix typos and inconsistencies in the Height Aboveground column
- You will have to decide what you want to do with inconsistent values, but if you do not know the height, you can assign them an NA value
- Recodes location variable to specify only “groundplane”, “aboveground”, or “both”.
- There is no need to mention “Specific Location” in the values here as a column for specific location exists and details their location.
- All entries that say solely “Specific Location” can be assigned to “aboveground” as they describe squirrels seen on a bench, tree, handrail or fence, as detailed in the specific location variable.
- Saves the cleaned dataset with an informative name different from the raw dataset name
If starting from scratch feels like too much for you, you could also download this fill-in-the-blanks data cleaning script template with hints for you to complete this activity.
As always with R, there are multiple ways to perform the same actions, so your script won’t necessarily be the same as ours. But if you want a reference script that performs the above data cleaning tasks, you can find one here