Lesson 1: Data Cleaning

Learning Objectives

By the end of this lesson, you’ll be able to:

  • Describe the importance of data cleaning for the research data lifecycle.
  • Identify data cleaning needs for an example dataset.
  • Use the pipe operator and functions from the tidyverse to clean data of an example dataset.
  • Write reproducible code by applying concepts of literate programming.

Activity - What Needs to Be Cleaned?

Before cleaning a dataset, you should first explore your data so that you can understand what actions you must take in order to answer your research questions. In a previous lesson, you imported and looked at the dataset in RStudio. Now, you will perform a deeper inspection to make a plan for data cleaning.

NoteBreakout Room

In your breakout rooms, open the survey dataset in R (e.g., use the View() function that you learned in Day 3), and discuss the following questions:

  1. Is this dataset in a consistent format for subsequent work and analysis? Why or why not?
  2. What fields would you want to use to answer the project research question? Do they exist already?
  3. Are there any fields that you think might be difficult to work with? Why?
  4. What fields would you want to modify or correct in this dataset? How would you do that?

Please make a list of things you would want to modify, transform, or correct in this dataset, and choose someone in your group to share that list with the larger group.

You will have 20 minutes in the breakout room.

Lecture - Data Cleaning

Lecture

Data Cleaning

Data cleaning is the process of checking, fixing, or transforming your data. Depending on how it was collected, your data may contain errors (e.g., typos) or it may be in a format that does not allow it to be easily used. During data cleaning, you will correct and transform the data to ensure it is ready for analysis.

Data cleaning can be time consuming, but it is a crucial step in the research data lifecycle. If your data contains errors, the results of your analysis will not be reliable, regardless of how good your methods are.

As you clean your data, use the data management practices you’ve learned so far to document your data cleaning process. A preferred way to document this step is to create a script that performs and describes all of the required data cleaning actions. The comments in your script will not only help you in the future, but enhance the reproducibility of your research for others.

Best Practices for Data Cleaning

The single most important rule in data cleaning is to never directly edit your raw data. Once your data has been collected, save the source file in the appropriate data folder in your project directory and do not change it. This ensures that you can reload the original dataset if you make a mistake. It also allows others (and yourself) to understand the ways in which you cleaned and manipulated your data. This is where file versioning practices are necessary, as any changes to the data should be saved and labelled as a new version.

When performing data cleaning using a script, create a single data cleaning script that uploads the raw data, performs all necessary data cleaning steps, and then saves the cleaned dataset in a new file in the correct folder in your project directory.

When sharing the dataset with others (which we will cover in more detail in another lesson), always remember to share the clean version of your data. This file will be easy to identify if you have named it appropriately.

Data Cleaning Steps

Each dataset is unique and the type of data cleaning practices you use will need to be adapted for each new project. But, here is a general idea of what a dataset may require for cleaning:

  • Fix column names.
  • Remove duplicate observations.
  • Correct spelling mistakes.
  • Standardize letter case and use of spaces or other special characters.
  • Group values into larger categories.
  • Convert variables to correct variable types (e.g., numeric, categorical, date, etc).
  • Merge and split columns.
  • Create new variables.
  • Identify missing values.
  • Create a subset of the data.
  • Reshape the dataset for analysis.

Tutorial - Data Cleaning

In this tutorial, we are going to learn about common cleaning steps and techniques. As you go through this tutorial, transcribe the code into your own script file, and then run the code to execute the commands.

The Pipe Operator

You’ve already learned that the tidyverse package has many functions that are helpful for data cleaning. You will use these functions in this lesson, but before you do, it is helpful to understand one additional rule about the syntax used in this library. The tidyverse uses a pipe operator, represented by the symbol |>. The pipe is used to feed the result of one function directly into the next function.

In an earlier lesson, you used “nesting” to view frequencies of observations by age and gender in our example dataset. We will continue in this script to learn how to use the pipe. For that, first open your R Project by clicking in the .RProj file in your project directory. This will open the project in R Studio. Then, to run the nesting code again, open your script in RStudio, and run the first lines of code loading the tidyverse and uploading the data, and the final line of code exploring the count by age and gender:

# Load package
library(tidyverse)

# Import data
survey_data <- read_csv("data/raw/survey_student_rawdata.csv")
# Get the number of observations by age and gender and print all results
print(count(survey_data, Age, Gender), n = Inf) 
# A tibble: 51 × 3
     Age Gender                n
   <dbl> <chr>             <int>
 1    16 Female                5
 2    16 Male                 16
 3    16 Non-binary            4
 4    16 <NA>                  2
 5    17 Female               31
 6    17 Male                 17
 7    17 Non-binary            1
 8    17 Prefer not to say     1
 9    18 Female               27
10    18 Male                 19
11    18 Non-binary            2
12    18 Prefer not to say     1
13    19 Female               28
14    19 Male                 36
15    19 Non-binary            1
16    19 <NA>                  3
17    20 Female               21
18    20 Male                 34
19    20 Prefer not to say     1
20    20 <NA>                  3
21    21 Female               31
22    21 Male                 23
23    21 Prefer not to say     1
24    21 <NA>                  2
25    22 Female               17
26    22 Male                 19
27    22 Non-binary            4
28    22 Prefer not to say     2
29    22 <NA>                  2
30    23 Female               21
31    23 Male                 23
32    23 <NA>                  1
33    24 Female               19
34    24 Male                 14
35    24 Non-binary            1
36    24 Prefer not to say     1
37    24 <NA>                  1
38    25 Female                6
39    25 Male                  8
40    25 Non-binary            4
41    25 <NA>                  2
42    26 Female                2
43    26 Male                  6
44    27 Female                5
45    27 Male                  5
46    29 Male                  1
47    NA Female               27
48    NA Male                 18
49    NA Non-binary            1
50    NA Prefer not to say     4
51    NA <NA>                  1

Here’s where the pipe operator becomes useful. Instead of nesting count(survey_data, Age, Gender) inside the print() function, you can use the |> symbol to feed the result of the first part of the argument (count(survey_data, Age, Gender) into the second part (print()).

# Get the number of observations by age and gender and print all results

count(survey_data, Age, Gender) |>  
  print(n = Inf)  
# A tibble: 51 × 3
     Age Gender                n
   <dbl> <chr>             <int>
 1    16 Female                5
 2    16 Male                 16
 3    16 Non-binary            4
 4    16 <NA>                  2
 5    17 Female               31
 6    17 Male                 17
 7    17 Non-binary            1
 8    17 Prefer not to say     1
 9    18 Female               27
10    18 Male                 19
11    18 Non-binary            2
12    18 Prefer not to say     1
13    19 Female               28
14    19 Male                 36
15    19 Non-binary            1
16    19 <NA>                  3
17    20 Female               21
18    20 Male                 34
19    20 Prefer not to say     1
20    20 <NA>                  3
21    21 Female               31
22    21 Male                 23
23    21 Prefer not to say     1
24    21 <NA>                  2
25    22 Female               17
26    22 Male                 19
27    22 Non-binary            4
28    22 Prefer not to say     2
29    22 <NA>                  2
30    23 Female               21
31    23 Male                 23
32    23 <NA>                  1
33    24 Female               19
34    24 Male                 14
35    24 Non-binary            1
36    24 Prefer not to say     1
37    24 <NA>                  1
38    25 Female                6
39    25 Male                  8
40    25 Non-binary            4
41    25 <NA>                  2
42    26 Female                2
43    26 Male                  6
44    27 Female                5
45    27 Male                  5
46    29 Male                  1
47    NA Female               27
48    NA Male                 18
49    NA Non-binary            1
50    NA Prefer not to say     4
51    NA <NA>                  1

In the example above, print(n = Inf) is indented on a new line for readability. When using the pipe symbol (|>), editors such as RStudio may automatically indent the next line to show that the command continues.

NoteExercise

You can also use the pipe to feed the dataset survey_data into the count() function. What would the code look like?

You can check the answer below, but please make sure to try it yourself before you see the answer. The best way to learn is to try to do it by yourself!

# Get the number of observations by age and gender and print all results  
survey_data |>  
  count(Age, Gender) |>  
  print(n = Inf) 

Using pipes can help the code look cleaner and easier to understand. When formatted this way, you can see each step:

  1. survey_data is fed into the count() function using |>
  2. count() is counting the frequencies of the Age and Gender observations in survey_data, which is fed into the print() function using another |>.
  3. print() shows the data you’ve requested from count() in the console and uses the n = Inf flag to ask R to show all of the data available.

This makes the code both machine-readable and easier to understand by humans, following the practices of literate programming.

Note: The symbol %>% is another way of expressing the pipe syntax in tidyverse, while |> is used in base R. Both symbols will work exactly the same in most cases. In this course, we will use |>, which is now considered the standard pipe symbol, but if you see %>% in older tutorials or in tidyverse documentation, you can interpret it in the same way as |>. If you want to know more about their differences, you can refer to this page on differences between the base R and magrittr pipes.

Now that we have learned how to use the pipe, we can move on to creating the data cleaning script. You will continue to use the pipe throughout this lesson, so there will be plenty of opportunities to practice this new syntax. Save your current script and close it. We will start data cleaning in a new script.

Set up the Data Cleaning Script

It is good practice to start a new task in R with a clean working environment. This removes existing objects and packages, which ensures that your code runs predictably and includes all necessary steps.

In RStudio, in the topmost menu bar, click on ‘Session’ and then ‘Restart R’.

Now, open a new R script and save it with a meaningful name in the correct folder location.

CautionNote

Consult the project’s file naming and folder structure conventions (Day 3) if needed.

At the top of your script, create a title and add a brief description of its purpose. Remember, this comment should be preceded by a hash # symbol.

# DATA CLEANING

# this script:  
## uploads the raw social media survey data \+ cleans it  
## outputs a cleaned dataset ready to be used in analysis  

Organize your working environment

Before you start with the actual data cleaning tasks, make sure your working environment is ready to go. Since you are working within an R Project, the working directory will already be set.

Next, add a line that loads the packages you need. In this lesson, we’ll be using tidyverse, though it is common to see more than one package here. Be sure to include a comment about what the script is doing. It may seem obvious to you, but the more you get into the habit of writing comments, the more likely you are to do it consistently.

# Load the tidyverse package  
library(tidyverse)  

Remember to check the console (below your script editing window) to ensure your code ran without errors.

Then, load the dataset using read_csv()

If you can’t remember how this function works, look back to your practice scripts, or review Day 3.

# import data  
survey_data <- read_csv("data/raw/survey_student_rawdata.csv")  

You should now see a survey_data object listed under Data, in the top right panel Environment tab.

NoteExercise

Was the data imported correctly? Use some of the functions you learned in Day 3 to check.

# check that data was correctly imported  
survey_data # prints in the console  
View(survey_data) # prints in a separate window  
head(survey_data) # prints the first 6 rows of data  

Take some time to inspect the console output for the code chunk you just created. What kind of information does each line provide? You may need to scroll up in the console to see the entire output. We won’t review them again here, but you should be seeing terms like tibble, <chr>, and print(n = ...), along with a sample of the survey data, both in the console and in a new window.

Update Column Names

Now that we know that the data have been imported correctly, it is time to begin cleaning. Remember, each dataset is unique, so what we do in this lesson may not apply to the next project you work on. Use what you learn here as a starting point.

The first thing we need to address are the column names, as they are currently very long with extra characters such as spaces and punctuation. Why do you think the column names look like this?

# check names  
names(survey_data)  
 [1] "Timestamp"                                                                                                                                                                  
 [2] "Name"                                                                                                                                                                       
 [3] "Email"                                                                                                                                                                      
 [4] "Date of birth"                                                                                                                                                              
 [5] "Age"                                                                                                                                                                        
 [6] "Province"                                                                                                                                                                   
 [7] "Gender"                                                                                                                                                                     
 [8] "Year of Study"                                                                                                                                                              
 [9] "Field of Study"                                                                                                                                                             
[10] "How many hours per day do you use each platform? [Facebook]"                                                                                                                
[11] "How many hours per day do you use each platform? [Instagram]"                                                                                                               
[12] "How many hours per day do you use each platform? [Snapchat]"                                                                                                                
[13] "How many hours per day do you use each platform? [Tiktok]"                                                                                                                  
[14] "How many hours per day do you use each platform? [X (Twitter)]"                                                                                                             
[15] "How many hours per day do you use each platform? [Other]"                                                                                                                   
[16] "On average, how many hours of sleep do you get per night?"                                                                                                                  
[17] "What devices do you use to access social media"                                                                                                                             
[18] "Please indicate how much you agree or disagree with the following statements about social media and your mental health: [Social media helps me feel connected to others.]"  
[19] "Please indicate how much you agree or disagree with the following statements about social media and your mental health: [Social media makes me feel anxious or stressed.]"  
[20] "Please indicate how much you agree or disagree with the following statements about social media and your mental health: [Social media distracts me from academic work.]"    
[21] "Please indicate how much you agree or disagree with the following statements about social media and your mental health: [Social media improves my mood when I feel lonely.]"
[22] "Select any of the following reasons for using social media:"                                                                                                                

The column names consist of the question text from the survey form. While the single-word column names are acceptable, the other ones will be more difficult to work with, and require cleaning.

To change column names we will use rename(). This function is part of the dplyr package that is installed with tidyverse.

Type the following code into your script to change “Date of birth” to “date_birth”:

CautionNote

In this script, we are going to create a new object every time we perform an action on the dataset (such as, survey_data_fixnames or survey_data_nodupl). We do this to keep a visual record of the changes we are making to the dataset. However, when working on your own projects, avoid creating too many similar objects in your working environment. Instead, you will overwrite the original object with the changes implemented (e.g., always overwrite the object named survey_data without adding the description).

# change column names  
survey_data_fixnames <- survey_data |>  
  rename(date_birth = "Date of birth")  
CautionStep-by-step explanation:
  • survey_data_fixnames - this is a new object name
  • <- - this is the assignment operator. It assigns a value to a name and will take whatever comes next (on the right side) and apply it to the object on the left.
  • survey_data - this is our original dataset, loaded into RStudio from a file.
  • |> - this is the pipe. It passes information to the next function, taking the result from the left and using it as the first input on the right.
  • rename() - this is a function used to rename columns. Inside the parentheses, provide the new column name followed by the existing column name.
  • date_birth = "Date of birth" - you are renaming the “Date of birth” column to date_birth. This new column name is written without quotes because it has no spaces.

Run names(survey_data) again to check if the name change worked.

What happened when you ran the names(survey_data)code again? Did you see the name change? Remember, we didn’t change column names in the survey_data object, we created a new object called survey_data_fixnames and changed the column name there.

rename() can change multiple column names at once if you provide it with a comma separated list. Modify your code so that it changes the column “How many hours per day do you use each platform? [Facebook]” to “hours_facebook”. You can do this by adding a comma after date_birth = "Date of birth.

# change column names  
survey_data_fixnames <- survey_data |>  
  rename(date_birth = "Date of birth",  
         hours_facebook = "How many hours per day do you use each platform? [Facebook]")  
NoteYour turn!

Edit the code above so that it also changes the following column names in the same call of the the rename function:

  • “Timestamp” to timestamp
  • ”Name” to name
  • “Email” to email

You can check the answer below, but please make sure to try it yourself before you see the answer. The best way to learn is to try to do it by yourself!

# change column names  
survey_data_fixnames <- survey_data  |>  
  rename(date_birth = "Date of birth",  
         hours_facebook = "How many hours per day do you use each platform? [Facebook]",  
         timestamp = "Timestamp",  
         name = "Name",  
         email = "Email")  

Now, let’s change the rest of the column names. Use the code below as a guide, but get used to navigating your dataset in RStudio. Also, if you want to use copy / paste to collect the column names, do it in RStudio, not here. You should never copy code from a website for use on your computer. Not only can the HTML markup cause errors, the code itself could be malicious.

Remember, a script is a document. Take your time to create it, and check for typos as you go. Take the same care as you would when writing a paper or formatting a list of citations. Grammar, punctuation, and style matter.

# change column names  
survey_data_fixnames <- survey_data  |>  
  rename(date_birth = "Date of birth",  
         hours_facebook = "How many hours per day do you use each platform? [Facebook]",  
         timestamp = "Timestamp",  
         name = "Name",  
         email = "Email",  
         age = "Age",  
         province = "Province",  
         gender = "Gender",  
         year_study = "Year of Study",  
         field_study = "Field of Study",  
         hours_instagram = "How many hours per day do you use each platform? [Instagram]",  
         hours_snapchat = "How many hours per day do you use each platform? [Snapchat]",  
         hours_tiktok = "How many hours per day do you use each platform? [Tiktok]",  
         hours_xtwitter = "How many hours per day do you use each platform? [X (Twitter)]",  
         hours_other = "How many hours per day do you use each platform? [Other]",  
         hours_sleep = "On average, how many hours of sleep do you get per night?",  
         devices_used = "What devices do you use to access social media",  
         feel_connected = "Please indicate how much you agree or disagree with the following statements about social media and your mental health: [Social media helps me feel connected to others.]",  
         feel_stress = "Please indicate how much you agree or disagree with the following statements about social media and your mental health: [Social media makes me feel anxious or stressed.]",  
         feel_distracted = "Please indicate how much you agree or disagree with the following statements about social media and your mental health: [Social media distracts me from academic work.]",  
         feel_improvedmood = "Please indicate how much you agree or disagree with the following statements about social media and your mental health: [Social media improves my mood when I feel lonely.]",  
         reasons_use = "Select any of the following reasons for using social media:"  
  )

You can now check if column names were correctly modified.

# Check if names were correctly changed  
names(survey_data_fixnames)
 [1] "timestamp"         "name"              "email"            
 [4] "date_birth"        "age"               "province"         
 [7] "gender"            "year_study"        "field_study"      
[10] "hours_facebook"    "hours_instagram"   "hours_snapchat"   
[13] "hours_tiktok"      "hours_xtwitter"    "hours_other"      
[16] "hours_sleep"       "devices_used"      "feel_connected"   
[19] "feel_stress"       "feel_distracted"   "feel_improvedmood"
[22] "reasons_use"      

Did everything work as you expected? You should see a list of all of the updated column names. If you don’t, go back through this lesson from the beginning and try again. Remember, another way to keep track of your objects in RStudio is to check the Environment tab.

Remove Duplicates

Duplicated information can happen in a dataset for many reasons. There may have been an error in the survey collection tool that created a duplicate column, or user errors could lead to the same data appearing in multiple records, or rows. It is good practice to search for duplicate data.

One way to check for duplicates in a survey dataset like this, is to check that the number of respondents is the same as the number of rows in the dataset.

To see the number of respondents, we will use the name and email columns. While some people in a dataset could have the same name, they will have a different email address. Use the function distinct() from dplyr, which shows the unique combinations of the variables you choose.

# Number of respondents in the dataset.  
survey_data_fixnames |>  
  distinct(name, email)  
# A tibble: 500 × 2
   name          email                          
   <chr>         <chr>                          
 1 Mei Lopez     mei.lopez@outlook.com          
 2 Yuki Smith    yuki.smith@proton.me           
 3 Arjun Khan    arjun.khan@uottawa.ca          
 4 Lucas Chen    lucas.chen@icloud.com          
 5 Nina Anderson nina.anderson@icloud.com       
 6 Olivia Garcia olivia.garcia@wilfridlaurier.ca
 7 Aisha Singh   aisha.singh@hotmail.com        
 8 Jamal Wilson  jamal.wilson@yorku.ca          
 9 Jamal Nguyen  jamal.nguyen@gmail.com         
10 Nina Brown    nina.brown@icloud.com          
# ℹ 490 more rows
CautionStep-by-step explanation:
  • survey_data_fixnames - this is the object with the renamed columns that holds our survey data.
  • |> - this is the pipe.
  • distinct() - this is a function used to show unique rows or unique combinations of values in a dataset.
  • name, email - these are the column names you want the distinct() function to check

We are not creating a new object with this information, as we’ve not used the <- operator to assign new values. We are just displaying this information in the console.

The console output shows the following information: A tibble: 500 x 2

This means that the function has retrieved two columns (name and email) and 500 rows (500 unique respondents).

To check the total number of rows in the dataset, use the function nrow().

# Number of rows in the dataset.  
nrow(survey_data_fixnames)  
[1] 525

What was the output from nrow()? If there are more rows than unique respondents, it means there are duplicate observations.

To remove them, we will use the distinct() function and assign the function’s output to a new object. If you do not specify column names inside the parentheses, this function will search for and keep unique rows from across the entire data frame or tibble.

# Remove duplicated values  
survey_data_nodupl <- survey_data_fixnames |>   
  distinct()  
CautionStep-by-step explanation:
  • survey_data_nodupli - this is a new object to hold the deduplicated dataset.
  • <- - the assignment operator.
  • survey_data_fixnames - the object we want to read the data from
  • |> - the pipe passes information to the function.
  • distinct() - this function shows unique rows or unique combinations of values in a dataset. Without an argument in the parentheses, it will search all columns.

Now, check the number of rows in the dataset again.

# Check number of rows in the dataset and the number of distinct rows  
nrow(survey_data_nodupl)  
distinct(survey_data_nodupl)  

You should now have the same number of rows as respondents, which shows that duplicates have been removed. If you don’t see the same result, refresh your R session and go through your code from the top.

Tackle Personal Identifiers

This dataset includes the names and email address of respondents, as well as other personal details, such as age and gender. While the design of a survey instrument is outside of the scope of this course, it is always a good idea to consider how much data you actually need before you start collection. Your project may have documentation about how respondent data are to be collected or used, so make sure you follow the norms of your lab or research group.

It is common practice to remove personal identifiers from survey data, but you may also wish to create a code for each entry, should you ever need to connect information back to the respondent. In this example, we will create a new ID column, and then write it to a new object with the personal information removed.

Create an ID Column

The mutate() function, part of the dplyr package, is useful when you want to modify columns or create new columns based on other functions or calculations with existing variables.

We can use mutate() with the row_number() function to create a new column that assigns a unique number to each row. This has the added benefit of identifying the unique entries in the dataset.

# add a column with an anonymous ID per respondent  
survey_data_id <- survey_data_nodupl |>  
  mutate(id = row_number())  
CautionStep-by-step explanation:

You can see that the grammar works in the same way as the rename() and distinct() function, which is one of the strengths of using the tidyverse.

  • survey_data_id - an object to hold the dataset with the new id column.
  • <- - the assignment operator.
  • survey_data_nodupli - the object we want to read the data from
  • |> - the pipe passes information to the function.
  • mutate() - this function adds new variables or changes existing ones by applying calculations or transformations.
  • id = - the name of the new column. The equal operator here is used in the same way as it was in the rename() function, here it assigns a column name to whatever comes next.
  • row_number() - this is a function used to assign a number to each row, usually based on the current order of the data

Now, check if the id column was created.

# add a column with an anonymous ID per respondent  
View(survey_data_id)  

Scroll down to the right to see the final column in the dataset, it should be the newly created id column.

Reorder Columns

Now, we want to move our newly created id column from the right of the tibble to the left, so that it is easier to see when we view the data.

The relocate() function, which is also part of the dplyr package, is the next function we will learn. It is a simple function that relocates columns in a dataset.

The default of relocate() is to put the specified column in the first position, so we don’t need to specify location here, though that is possible with added arguments. We will learn those below.

# moves id column to the left most position (default of relocate function)  
survey_data_ordered <- survey_data_id |>  
  relocate(id)

# check if it worked  
survey_data_ordered  
# A tibble: 500 × 23
      id timestamp       name  email date_birth   age province gender year_study
   <int> <chr>           <chr> <chr> <chr>      <dbl> <chr>    <chr>       <dbl>
 1     1 3/18/2026 19:0… Mei … mei.… 9/16/2001     24 British… Female          3
 2     2 3/5/2026 17:00… Yuki… yuki… 4/16/2004     21 MB       Female          1
 3     3 3/6/2026 08:00… Arju… arju… 2/14/2007     19 Manitoba Male            4
 4     4 3/6/2026 12:00… Luca… luca… 7/20/2001     24 BC       Female          2
 5     5 3/10/2026 08:0… Nina… nina… 5/3/2001      NA ON       Male            1
 6     6 3/18/2026 18:0… Oliv… oliv… 1/15/2007     19 ontario  Male            2
 7     7 3/8/2026 08:00… Aish… aish… 12/30/2008    17 british… Female          1
 8     8 3/5/2026 11:00… Jama… jama… 3/2/2005      21 MB       Female          1
 9     9 3/18/2026 19:0… Jama… jama… 2/1/2002      24 <NA>     Female          1
10    10 3/8/2026 08:00… Nina… nina… 8/27/2005     20 Alberta  <NA>            2
# ℹ 490 more rows
# ℹ 14 more variables: field_study <chr>, hours_facebook <dbl>,
#   hours_instagram <dbl>, hours_snapchat <dbl>, hours_tiktok <dbl>,
#   hours_xtwitter <dbl>, hours_other <dbl>, hours_sleep <dbl>,
#   devices_used <chr>, feel_connected <chr>, feel_stress <chr>,
#   feel_distracted <chr>, feel_improvedmood <chr>, reasons_use <chr>

However, if you want to put a column in another location, you can specify the new location in reference to another column by using the arguments .before or .after. For example, the code below moves the gender column to the position after the age column.

# move gender column to the right of age column  
survey_data_ordered <- survey_data_ordered |>  
  relocate(gender, .after = age)

# check  
survey_data_ordered  
# A tibble: 500 × 23
      id timestamp       name  email date_birth   age gender province year_study
   <int> <chr>           <chr> <chr> <chr>      <dbl> <chr>  <chr>         <dbl>
 1     1 3/18/2026 19:0… Mei … mei.… 9/16/2001     24 Female British…          3
 2     2 3/5/2026 17:00… Yuki… yuki… 4/16/2004     21 Female MB                1
 3     3 3/6/2026 08:00… Arju… arju… 2/14/2007     19 Male   Manitoba          4
 4     4 3/6/2026 12:00… Luca… luca… 7/20/2001     24 Female BC                2
 5     5 3/10/2026 08:0… Nina… nina… 5/3/2001      NA Male   ON                1
 6     6 3/18/2026 18:0… Oliv… oliv… 1/15/2007     19 Male   ontario           2
 7     7 3/8/2026 08:00… Aish… aish… 12/30/2008    17 Female british…          1
 8     8 3/5/2026 11:00… Jama… jama… 3/2/2005      21 Female MB                1
 9     9 3/18/2026 19:0… Jama… jama… 2/1/2002      24 Female <NA>              1
10    10 3/8/2026 08:00… Nina… nina… 8/27/2005     20 <NA>   Alberta           2
# ℹ 490 more rows
# ℹ 14 more variables: field_study <chr>, hours_facebook <dbl>,
#   hours_instagram <dbl>, hours_snapchat <dbl>, hours_tiktok <dbl>,
#   hours_xtwitter <dbl>, hours_other <dbl>, hours_sleep <dbl>,
#   devices_used <chr>, feel_connected <chr>, feel_stress <chr>,
#   feel_distracted <chr>, feel_improvedmood <chr>, reasons_use <chr>
NoteYour turn!

Use relocate() to move the gender column after the name column by rewriting your code to change the order of columns.

You can check the answer below, but please make sure to try it yourself before you see the answer. The best way to learn is to try it yourself first.

# move gender column to the right of name column  
survey_data_ordered <- survey_data_ordered |>  
  relocate(gender, .after = name)

# check  
survey_data_ordered  
# A tibble: 500 × 23
      id timestamp       name  gender email date_birth   age province year_study
   <int> <chr>           <chr> <chr>  <chr> <chr>      <dbl> <chr>         <dbl>
 1     1 3/18/2026 19:0… Mei … Female mei.… 9/16/2001     24 British…          3
 2     2 3/5/2026 17:00… Yuki… Female yuki… 4/16/2004     21 MB                1
 3     3 3/6/2026 08:00… Arju… Male   arju… 2/14/2007     19 Manitoba          4
 4     4 3/6/2026 12:00… Luca… Female luca… 7/20/2001     24 BC                2
 5     5 3/10/2026 08:0… Nina… Male   nina… 5/3/2001      NA ON                1
 6     6 3/18/2026 18:0… Oliv… Male   oliv… 1/15/2007     19 ontario           2
 7     7 3/8/2026 08:00… Aish… Female aish… 12/30/2008    17 british…          1
 8     8 3/5/2026 11:00… Jama… Female jama… 3/2/2005      21 MB                1
 9     9 3/18/2026 19:0… Jama… Female jama… 2/1/2002      24 <NA>              1
10    10 3/8/2026 08:00… Nina… <NA>   nina… 8/27/2005     20 Alberta           2
# ℹ 490 more rows
# ℹ 14 more variables: field_study <chr>, hours_facebook <dbl>,
#   hours_instagram <dbl>, hours_snapchat <dbl>, hours_tiktok <dbl>,
#   hours_xtwitter <dbl>, hours_other <dbl>, hours_sleep <dbl>,
#   devices_used <chr>, feel_connected <chr>, feel_stress <chr>,
#   feel_distracted <chr>, feel_improvedmood <chr>, reasons_use <chr>

Save Data with Personal Identifiers

Before we go further, let’s save a copy of our data as it is currently formatted, just as our project lead asked us to do in the project scenario. This copy will include the personal identifiers and the ID codes. We won’t be using this dataset for analysis, but it will be necessary should the project lead need to connect with respondents.

To save the data, we can use the function write_csv(), which is very similar to the function read_csv that you learned in an earlier lesson. The general syntax is:

write_csv(object_to_save, “file-path-to-file-to-be-saved.csv”)

It is important to know that if you use the name of an existing file, R will overwrite it without warning. As we mentioned earlier, you should never edit your original data, so in this step we are going to give the file a new descriptive name.

# Save data linking ID to names  
write_csv(survey_data_ordered, "data/processed/survey_student_identifiers-IDs.csv")  

Remember, if you want to inspect your file directory, you can see it under the ‘Files’ tab in the bottom right panel of RStudio. Did your file save as expected?

Remove Columns

Now that we have saved a new version of the data, we will remove the email and name columns to de-identify the dataset. We will also use the same command to remove unnecessary columns.

There are a number of ways to remove columns in R, but here, we’ll focus on the select() function from the dplyr package in tidyverse.

This function uses the same general syntax that we have been learning as we explore the tidyverse. In the parentheses, you must insert a dash - or minus symbol, followed by the name of the column you want to remove.

Remove the name and email columns using the select() function.

# Remove name and email columns  
survey_data_anon <- survey_data_ordered |>  
  select(-name, -email)  
NoteYour turn!

Now remove other unnecessary columns. The timestamp is a default column added by the survey tool we used (Google Forms), but we will not be using this for our analysis. Use the select() function to remove this column from the dataset.

You can check the answer below, but please make sure to try it yourself before you see the answer.

# Remove other unnecessary columns  
survey_data_anon <- survey_data_anon |>  
  select(-timestamp)

You can now check that all column removals worked:

# check that it worked  
survey_data_anon  
# A tibble: 500 × 20
      id gender date_birth   age province  year_study field_study hours_facebook
   <int> <chr>  <chr>      <dbl> <chr>          <dbl> <chr>                <dbl>
 1     1 Female 9/16/2001     24 British …          3 Statistics               1
 2     2 Female 4/16/2004     21 MB                 1 Computer S…              1
 3     3 Male   2/14/2007     19 Manitoba           4 Psychology               1
 4     4 Female 7/20/2001     24 BC                 2 Chemistry                1
 5     5 Male   5/3/2001      NA ON                 1 Biology                  1
 6     6 Male   1/15/2007     19 ontario            2 psycology                1
 7     7 Female 12/30/2008    17 british …          1 psychology               1
 8     8 Female 3/2/2005      21 MB                 1 stats                    1
 9     9 Female 2/1/2002      24 <NA>               1 Psychology               1
10    10 <NA>   8/27/2005     20 Alberta            2 Computer S…              1
# ℹ 490 more rows
# ℹ 12 more variables: hours_instagram <dbl>, hours_snapchat <dbl>,
#   hours_tiktok <dbl>, hours_xtwitter <dbl>, hours_other <dbl>,
#   hours_sleep <dbl>, devices_used <chr>, feel_connected <chr>,
#   feel_stress <chr>, feel_distracted <chr>, feel_improvedmood <chr>,
#   reasons_use <chr>

Before we continue to the next step, it is useful to know that the select() function can also be used to select specific variables from your dataset. For example, if you want to show the variables year_study and field_study, use the following code:

# select specific columns  
survey_data_anon |>  
  select(year_study, field_study)  
# A tibble: 500 × 2
   year_study field_study     
        <dbl> <chr>           
 1          3 Statistics      
 2          1 Computer Science
 3          4 Psychology      
 4          2 Chemistry       
 5          1 Biology         
 6          2 psycology       
 7          1 psychology      
 8          1 stats           
 9          1 Psychology      
10          2 Computer Science
# ℹ 490 more rows
NoteYour turn!

Write code that selects the date_birth, age, and province columns:

You can check the answer below, but try it yourself before you see the answer.

# select specific columns  
survey_data_anon |>  
  select(date_birth, age, province) 
# A tibble: 500 × 3
   date_birth   age province        
   <chr>      <dbl> <chr>           
 1 9/16/2001     24 British Columbia
 2 4/16/2004     21 MB              
 3 2/14/2007     19 Manitoba        
 4 7/20/2001     24 BC              
 5 5/3/2001      NA ON              
 6 1/15/2007     19 ontario         
 7 12/30/2008    17 british columbia
 8 3/2/2005      21 MB              
 9 2/1/2002      24 <NA>            
10 8/27/2005     20 Alberta         
# ℹ 490 more rows

Save current version of cleaned data

Now that we removed the personal identifiers, it is good practice to save a version of the data at this stage, before any more steps of data cleaning, so we can share this version later on.

NoteYour turn!

Use the write_csv() function to save the object survey_data_anon into a .csv file named “survey_student_anonymized-cleancolnames.csv” in the processed data folder.

You can check the answer below, but please make sure to try it yourself before you see the answer.

# Save the file without personal identifiers  
write_csv(survey_data_anon, "data/processed/survey_student_anonymized-cleancolnames.csv")  

To continue the script, we want to import the data that we just saved. This will make sure that the rest of the code is reproducible later on when we share the anonymized data with clean column names.

# import data  
survey_data_anon <- read_csv("data/processed/survey_student_anonymized-cleancolnames.csv")  

Create New Variables

Depending on your research question, you may need to create a new variable based on your existing variables. For example, because we are investigating overall social media usage, we could combine the time spent in different platforms into a single variable. This is called a “derived variable”, as it does not appear in the original dataset. However, if our question was about comparing the usage of different platforms, we would want to keep the variables separate. It is always important to go back to your questions to have more clarity on how you want your variables to be.

To create a derived variable, we will use the mutate() function. Remember, the syntax for this function requires an argument inside the parentheses. Last time, we used mutate() to create the new ID column. This time, we are using it to combine variables, so we need to tell R the new column name, followed by the names of the columns we want to merge. Since the column names are existing variables in our dataset, we don’t need to enclose them in quotations. Take a look at the code below to see how the mutate() function works for this example.

# add a variable for total amount of time spent on social media  
survey_data_tothour <- survey_data_anon |>  
  mutate(  
    hours_total = hours_facebook + hours_instagram + hours_snapchat + hours_tiktok + hours_xtwitter  + hours_other  
  )  

Remember, by default, R creates the new variable as the last column in the dataset. For a cleaner dataset, we may want to move this variable closer to the individual platform variables.

NoteYour turn!

Remember how to reorder columns? Now try to move the newly created column hours_total to after the hours_other variable.

You can check the answer below, but please make sure to try it yourself first.

# move the new variable to after hour\_other  
survey_data_tothour <- survey_data_tothour |>   
  relocate(hours_total,  
           .after = hours_other)  

Now that we have created this new variable, let’s see if the numbers from each individual column were added together correctly in the new column. First, let’s use the range() function to see the range of total hours spent in social media:

# check the range of hours\_total  
range(survey_data_tothour$hours_total)  
[1] NA NA

If your code ran correctly, you’ll be seeing NA, or ‘not available’ in your console. NA in R means that a value is missing or unavailable.

Deal with Missing Data

Missing data can create difficulties when working with certain functions because they can’t be used in calculations. Certain statistical analyses, such as linear models, will not work if there is missing data. This means that dealing with NA values during data cleaning is very important. There are many different ways to handle NAs, but we will only cover some of them here.

First, let’s look at how many NAs we have in our dataset.

# check NAs in the dataset  
survey_data_tothour |> summary()   
       id           gender           date_birth             age       
 Min.   :  1.0   Length:500         Length:500         Min.   :16.00  
 1st Qu.:125.8   Class :character   Class :character   1st Qu.:18.00  
 Median :250.5   Mode  :character   Mode  :character   Median :20.00  
 Mean   :250.5                                         Mean   :20.47  
 3rd Qu.:375.2                                         3rd Qu.:23.00  
 Max.   :500.0                                         Max.   :29.00  
                                                       NA's   :47     
   province           year_study    field_study        hours_facebook 
 Length:500         Min.   :1.000   Length:500         Min.   :0.000  
 Class :character   1st Qu.:1.000   Class :character   1st Qu.:1.000  
 Mode  :character   Median :2.000   Mode  :character   Median :1.000  
                    Mean   :2.213                      Mean   :0.831  
                    3rd Qu.:3.000                      3rd Qu.:1.000  
                    Max.   :4.000                      Max.   :2.000  
                    NA's   :16                         NA's   :3      
 hours_instagram hours_snapchat    hours_tiktok   hours_xtwitter    hours_other
 Min.   :0.000   Min.   :0.0000   Min.   :0.000   Min.   :0.0000   Min.   :0   
 1st Qu.:1.000   1st Qu.:1.0000   1st Qu.:1.000   1st Qu.:0.0000   1st Qu.:0   
 Median :1.000   Median :1.0000   Median :1.000   Median :0.0000   Median :0   
 Mean   :1.472   Mean   :0.9919   Mean   :1.306   Mean   :0.0261   Mean   :0   
 3rd Qu.:2.000   3rd Qu.:1.0000   3rd Qu.:2.000   3rd Qu.:0.0000   3rd Qu.:0   
 Max.   :3.000   Max.   :2.0000   Max.   :3.000   Max.   :1.0000   Max.   :0   
 NA's   :4       NA's   :8        NA's   :4       NA's   :2        NA's   :7   
  hours_total     hours_sleep   devices_used       feel_connected    
 Min.   : 0.00   Min.   :4.00   Length:500         Length:500        
 1st Qu.: 4.00   1st Qu.:6.00   Class :character   Class :character  
 Median : 4.00   Median :7.00   Mode  :character   Mode  :character  
 Mean   : 4.61   Mean   :6.95                                        
 3rd Qu.: 6.00   3rd Qu.:8.00                                        
 Max.   :11.00   Max.   :9.00                                        
 NA's   :28      NA's   :18                                          
 feel_stress        feel_distracted    feel_improvedmood  reasons_use       
 Length:500         Length:500         Length:500         Length:500        
 Class :character   Class :character   Class :character   Class :character  
 Mode  :character   Mode  :character   Mode  :character   Mode  :character  
                                                                            
                                                                            
                                                                            
                                                                            

If you inspect the output in the console, you will see detailed information about each of the variables. The numeric variables show a set of descriptive statistics, while the character variables show the character count and other information. You may remember seeing these descriptive statistics in an earlier lesson. At the bottom of each column generated by the summary() function, you will also see the number of NAs present in the data for each variable.

As you can see in the columns related to the hours variables, there are NA values across all of them. For example, there were three participants that did not answer the question about hours spent on Facebook. Because this data is missing, R can’t accurately calculate the total hours, which explains the NAs in the hours_total variable.

Every missing data situation is different, so when you’re working with your own data, you will have to make a decision about how you want to deal with NAs based on your dataset and research question.

In some cases, you may want to preserve the missing values. Since we don’t have information to replace the missing data about social media usage, it would be unwise to add incorrect information simply to get rid of the error. In this situation, we can ask R to prevent the inclusion of NA values by using the na.rm=TRUE argument inside our function. This tells R to remove the NA values before performing a calculation.

# check range of total_hour_per_day, removing NAs  
range(survey_data_tothour$hours_total, na.rm = TRUE)  
[1]  0 11

With this additional argument, you can now see the range of the hours_total variable, which shows the values 0 and 11. This means that the lowest value in the hours_total column is zero and the highest value is 11.

In other situations, you may wish to replace the NA values, rather than ignore them. In this dataset, we can safely substitute the NAs in the hours variables with zeroes, as the sum of all hours will still give us a minimum number of hours. To do this we will return to the mutate() function, combining it with the replace_na() function. The replace_na function takes a vector in its argument followed by a value to replace the missing values.

# replace NAs in the hours variable with 0  
survey_data_replna <- survey_data_tothour |>   
  mutate(hours_facebook = replace_na(hours_facebook, 0),  
         hours_instagram  = replace_na(hours_instagram, 0),  
         hours_snapchat = replace_na(hours_snapchat, 0),  
         hours_tiktok = replace_na(hours_tiktok, 0),  
         hours_xtwitter = replace_na(hours_xtwitter, 0),  
         hours_other = replace_na(hours_other, 0))
CautionStep-by-step explanation:

The syntax used by the mutate() function is similar to the rename() function, as it allows you to modify several columns at once.

  • You start the mutate call with the first column you want to modify. By giving here the same name of a current column (e.g. hours_facebook), R will overwrite the column with the new set of values.
  • After the = sign for each variable, you insert the new values for the column. Here they are calculated using the replace_na function.
  • Inside the replace_na function, you add the column name, a comma, and the value for NAs.
  • After closing the parenthesis for the replace_na function, you add a comma, and then can repeat for all the columns you want to modify.

Now you can check if there still are NAs in those columns.

# check: that NAs were replaced by zeroes in hours-of-social-media-use columns  
summary(survey_data_replna)  

Now, we can update our hours_total column by re-running the code snippet again.

# update our 'hours_total' column   
survey_data_replna <- survey_data_replna |>  
  mutate(  
    hours_total = hours_facebook + hours_instagram + hours_snapchat + hours_tiktok + hours_xtwitter  + hours_other  
  )

Now you can check if there still are NAs in the hours_total column.

# check no NAs in hours_total  
summary(survey_data_replna$hours_total)  
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  0.000   4.000   4.000   4.584   5.250  11.000 

We can also check the range again.

# check the range again  
range(survey_data_replna$hours_total)  
[1]  0 11

Looks good, now we have only numerical values, without having to use the na.rm argument.

Finally, another way that you can deal with NAs is to remove them for your dataset. For example, it might be that knowing the number of hours people sleep on average is very important for your research question, in which case you might want to remove all the entries where hours of sleep were not specified from the dataset.

We can remove the entries (i.e., rows) where hours_sleep are NA using the function drop_na():

# remove observations with NAs for hours slept  
survey_data_dropna <- survey_data_replna |>   
  drop_na(hours_sleep)  
CautionStep-by-step explanation:
  • The syntax is similar as before.
  • Then you pipe the current dataset survey_data_replna into the drop_na function
  • Inside the drop_na function, you add the column (or columns) for which you want the NAs to be dropped
NoteYour turn!

Now remove all observations with NA in the date_birth variable.

You can check the answer below, but please make sure to try it yourself before you see the answer.

# remove observations with NAs for age  
survey_data_dropna <- survey_data_dropna |>   
  drop_na(date_birth)  

To drop NAs, you have to drop the entire observation, which means you end up with a smaller dataset.

# check number of rows now  
nrow(survey_data_dropna)  
[1] 437

Therefore, dropping NAs is only recommended when essential for your analysis. Here, as we have a larger sample size, it is okay to continue with this smaller dataset.

Now that we have dealt with NAs, let’s continue with other data cleaning steps.

Fix Typos and Recode Variables

Now that we have organized our columns and dealt with NAs, it is time to start looking at the actual values of variables to see if we will be easily able to analyze them, or if they need cleaning too. A quick inspection of the dataset shows us that some variables like province and field of study need some work to be put into a stage that is ready for analysis. When we want to substitute some values of a variable for other values, we call that recoding. This can be done both to update an existing variable, or to create a new variable. Let’s look at both options.

The province variable was entered as a free entry text, and there are a lot of typos that need to be corrected. A good way to see all possible entries is to use the function distinct(), which we have already seen.

# check all entries for the province variable  
survey_data_dropna |>  
  distinct(province) |>  
  arrange(province) |>  
  print(n = Inf)  
# A tibble: 18 × 1
   province        
   <chr>           
 1 AB              
 2 Alberta         
 3 Alta            
 4 B.C.            
 5 BC              
 6 British Columbia
 7 MB              
 8 Manitoba        
 9 NS              
10 Nova Scotia     
11 ON              
12 Ontario         
13 QC              
14 Quebec          
15 Québec          
16 british columbia
17 ontario         
18 <NA>            
CautionStep-by-step explanation:

Again, the syntax is similar to what you have previously done using the tidyverse.

  • We start with our current dataset survey_data_dropna and pipe it into our next function. We don’t use an assignment operator here as we are not trying to save anything, just see the output.
  • The distinct() function will show you the unique rows for a variable or combination of variables. Here, we ask for the unique rows of province
  • We then pipe this into the arrange() function, which will order the dataset based on a variable. We use province again here to order the dataset, so that we see it in alphabetical order and can more easily spot typos.
  • Finally, we pipe it into the print() function. Using n = Inf here tells R to print all the rows in the tibble.
  • As a result, you will see printed in the console a list of all unique entries for the province variable, organized alphabetically.

Look at the results. Can you spot the typos that need to be fixed?

Now you can use these values to fix the variable. In this case, we want to update the variable, and not create a new one. For that, we can use the replace_values() function from the dplyr package. This function is useful when you want to replace values in a column with new values. We will use this inside the mutate() function because we will be modifying an existing column. This is how we would use it for the province variable:

# fix values for the province variable  
survey_data_notypos <- survey_data_dropna |>  
  mutate(   
    province = replace_values(province, 
                              c("AB", "Alta") ~ "Alberta",  
                              c("B.C.", "BC", "british columbia") ~ "British Columbia",
                              "MB" ~ "Manitoba",
                              "NS" ~ "Nova Scotia",
                              c("ON", "ontario") ~ "Ontario",
                              c("QC", "Québec") ~ "Quebec"  
    )  
  )  
CautionStep-by-step explanation:
  • We start with a similar syntax to the other functions in the tidyverse package: you start with tour current dataset survey_data_dropna, and you pipe it into the mutate function
  • Inside the mutate function, you tell R that you want to update the provincevariable by indicating province =
  • Then, you use the replace_values() function to specify what and how you want to change
  • Inside the replace_values() function, you first specify the variable that you want to fix, in our case, the province variable
  • Then, a sequence of statements separated by commas should follow, with each statement specifying the values to be changed inside a character vector c(), followed by the ~ sign, followed by the new value. For example, the statement c("AB", "Alta") ~ "Alberta" tells R to substitute every entry of “AB”, and “Alta” with “Alberta”.
  • If you have just one value that you want to replace, you do not need to use c(). For example, "NS" ~ "Nova Scotia" tells R to substitute every entry of “NS” with “Nova Scotia”.

Now we can see if that worked:

# check if it worked  
survey_data_notypos |>  
  distinct(province) |>  
  arrange(province) |>  
  print(n = Inf)  
# A tibble: 7 × 1
  province        
  <chr>           
1 Alberta         
2 British Columbia
3 Manitoba        
4 Nova Scotia     
5 Ontario         
6 Quebec          
7 <NA>            
NoteYour turn!

You might have noticed that the field of study variable also has typos. Check which typos exist and use replace_values() to fix these typos.

You can check the answer below, but please make sure to try it yourself before you see the answer.

# check all entries for the field\_study variable  
survey_data_notypos |>  
  distinct(field_study) |>  
  arrange(field_study) |>  
  print(n = Inf)
# A tibble: 22 × 1
   field_study     
   <chr>           
 1 Biolgy          
 2 Biology         
 3 Chemistry       
 4 CompSci         
 5 Computer Science
 6 Creative Writing
 7 Economics       
 8 Engineering     
 9 Physics         
10 Psychology      
11 Statistics      
12 Visual Art      
13 biology         
14 chem            
15 comp sci        
16 econ            
17 engineering     
18 enginering      
19 psychology      
20 psycology       
21 stats           
22 <NA>            
# fix values for the field\_study variable  
survey_data_notypos <- survey_data_notypos |>  
  mutate(field_study = replace_values(field_study,  
                                      c("Biolgy", "biology") ~ "Biology",  
                                      c("CompSci", "comp sci") ~ "Computer Science",  
                                      "chem" ~ "Chemistry",  
                                      "econ" ~ "Economics",  
                                      c("engineering", "enginering") ~ "Engineering",  
                                      c("psychology", "psycology") ~"Psychology",  
                                      "stats" ~ "Statistics"  
  )  
  )

# check if it worked  
survey_data_notypos |>  
  distinct(field_study) |>  
  arrange(field_study) |>  
  print(n = Inf)
# A tibble: 11 × 1
   field_study     
   <chr>           
 1 Biology         
 2 Chemistry       
 3 Computer Science
 4 Creative Writing
 5 Economics       
 6 Engineering     
 7 Physics         
 8 Psychology      
 9 Statistics      
10 Visual Art      
11 <NA>            

Great, we fixed the typos and inconsistencies in the field of study variable.

Sometimes you might also want to create a new variable based on values from other variables. For example, it might be useful to group participants according to certain ranges of social media usage, instead of looking at the exact number of hours used per day.

For that, we can use the case_when() function inside mutate() to create a new categorical variable that bins participants into four categories of social media usage. The case_when function works well for when you want to create a whole new variable using values from an existing variable (this is different from replacing an existing variable, as we did above for the province and field_study variables, when we were interested in updating an existent column, rather than creating a new one). To create this new variable, case_when looks for statements of conditions, and creates the new values when the condition is TRUE. Let’s see how this works:

# create categorical variable for hours\_total  
survey_data_bins <- survey_data_notypos  |>   
  mutate(bin_hours_total = case_when(  
    hours_total <= 2 ~ "Less than or equal to 2 hours",  
    hours_total <= 4 ~ "2-4 hours",  
    hours_total <= 6 ~ "4-6 hours",  
    hours_total > 6 ~ "More than 6 hours"))
CautionStep-by-step explanation:
  • The syntax is similar to the replace_values() function we used above: you start with our current dataset survey_data_notypos, and then pipe it into the mutate()function
  • Inside the mutate function, you name the new variable to be created bin_hours_total, and follow it with a = sign
  • Then, you use the case_when function to assign specific values for the new variable based on certain conditions
  • The syntax is similar to the replace_values() syntax: the sequence of statements should be separated by commas, with each statement specifying a condition, followed by the ~ sign, followed by the new value. For example, the statement hours_total <= 2 ~ "Less than or equal to 2 hours", tells R to check if the value in the variable hours_total is smaller or equal to 2. If yes, it assigns the value ““Less than or equal to 2 hours” in the new variable.
  • With this function, R uses the first condition that is TRUE to assign the new value. For example, someone who spent 1 hour per day on social media would satisfy the three first conditions, but R will assign it the value of “Less than or equal to 2 hours” because that was listed first.

If you want to check how that worked, you can see how the bin_hours_total variable maps to the hours_total variable using this code:

# check if it worked  
survey_data_bins  |>   
  distinct(hours_total, bin_hours_total)  |>   
  arrange(hours_total)  
# A tibble: 11 × 2
   hours_total bin_hours_total              
         <dbl> <chr>                        
 1           0 Less than or equal to 2 hours
 2           1 Less than or equal to 2 hours
 3           2 Less than or equal to 2 hours
 4           3 2-4 hours                    
 5           4 2-4 hours                    
 6           5 4-6 hours                    
 7           6 4-6 hours                    
 8           7 More than 6 hours            
 9           8 More than 6 hours            
10           9 More than 6 hours            
11          11 More than 6 hours            

Separate Values

In our final cleaning step for this lesson (but there is more to come in the next lesson), we’ll talk about how to separate values from a variable. In certain cases, you might have a variable that actually contains two variables. For example, the date_birth variable can actually be understood as containing information about the date, month, and year of birth. To separate the values of this column into three columns, we can use the function separate_wider_delim() of the tidyr package, which specializes in dealing with columns like this.

The separate_wider_delim() function will separate a column into multiple columns, therefore making the dataset wider, hence its name. It also uses a delimiter to separate the column, and that’s why it has delim on its name (we all love an informative function name, right?). The function has three main arguments: the column(s) you want to separate, the delimiter that should be used to separate the values, and the names for the new columns. In the case of the date_birth column, the delimiter is the character /.

# separate 'date of birth' into 'day', 'month and 'year' columns, using the '/ as the separator  
survey_data_sep <- survey_data_bins  |>   
  separate_wider_delim(cols = date_birth,  
                       delim = "/",  
                       names = c("day", "month", "year"))  

Let’s see if it worked:

# check the result  
survey_data_sep  
# A tibble: 437 × 24
      id gender day   month year    age province         year_study field_study 
   <int> <chr>  <chr> <chr> <chr> <dbl> <chr>                 <dbl> <chr>       
 1     1 Female 9     16    2001     24 British Columbia          3 Statistics  
 2     2 Female 4     16    2004     21 Manitoba                  1 Computer Sc…
 3     3 Male   2     14    2007     19 Manitoba                  4 Psychology  
 4     4 Female 7     20    2001     24 British Columbia          2 Chemistry   
 5     5 Male   5     3     2001     NA Ontario                   1 Biology     
 6     6 Male   1     15    2007     19 Ontario                   2 Psychology  
 7     7 Female 12    30    2008     17 British Columbia          1 Psychology  
 8     8 Female 3     2     2005     21 Manitoba                  1 Statistics  
 9     9 Female 2     1     2002     24 <NA>                      1 Psychology  
10    10 <NA>   8     27    2005     20 Alberta                   2 Computer Sc…
# ℹ 427 more rows
# ℹ 15 more variables: hours_facebook <dbl>, hours_instagram <dbl>,
#   hours_snapchat <dbl>, hours_tiktok <dbl>, hours_xtwitter <dbl>,
#   hours_other <dbl>, hours_total <dbl>, hours_sleep <dbl>,
#   devices_used <chr>, feel_connected <chr>, feel_stress <chr>,
#   feel_distracted <chr>, feel_improvedmood <chr>, reasons_use <chr>,
#   bin_hours_total <chr>

Great, seems like we managed to separate the variable into three!

We can also unite the columns together using the unite() function, which does the opposite as separate_wider_delim(). This function takes in the name of the new column you want to create, followed by the current columns to be united, followed by the separator to be added between column values. For example, to recreate the date_birth variable, you can use this code:

# Now unite them  
survey_data_unit <- survey_data_sep  |>   
  unite(date_birth, day, month, year, sep = "/")

# check the result  
survey_data_unit  
# A tibble: 437 × 22
      id gender date_birth   age province  year_study field_study hours_facebook
   <int> <chr>  <chr>      <dbl> <chr>          <dbl> <chr>                <dbl>
 1     1 Female 9/16/2001     24 British …          3 Statistics               1
 2     2 Female 4/16/2004     21 Manitoba           1 Computer S…              1
 3     3 Male   2/14/2007     19 Manitoba           4 Psychology               1
 4     4 Female 7/20/2001     24 British …          2 Chemistry                1
 5     5 Male   5/3/2001      NA Ontario            1 Biology                  1
 6     6 Male   1/15/2007     19 Ontario            2 Psychology               1
 7     7 Female 12/30/2008    17 British …          1 Psychology               1
 8     8 Female 3/2/2005      21 Manitoba           1 Statistics               1
 9     9 Female 2/1/2002      24 <NA>               1 Psychology               1
10    10 <NA>   8/27/2005     20 Alberta            2 Computer S…              1
# ℹ 427 more rows
# ℹ 14 more variables: hours_instagram <dbl>, hours_snapchat <dbl>,
#   hours_tiktok <dbl>, hours_xtwitter <dbl>, hours_other <dbl>,
#   hours_total <dbl>, hours_sleep <dbl>, devices_used <chr>,
#   feel_connected <chr>, feel_stress <chr>, feel_distracted <chr>,
#   feel_improvedmood <chr>, reasons_use <chr>, bin_hours_total <chr>

Although this is a simple example, there are multiple cases of weirdly structured data where the separate_wider_delim() and unite() functions will come in handy.

Save Your Script

Wow, that was a lot of data cleaning in one day! We are going to stop for now, but we will continue working on this script in Day 5 - Lesson 1, so there is no need to save a final cleaned dataset just yet. However, you will want to document and save everything that you did.

NoteYour turn!

Make sure the script is up to date:

  • At the top of the cleaning script, update the brief description to describe what you did
  • Save your script as survey_student_cleaning.R, in the scripts folder of your project.

You will have time to update the project documentation in the next lesson.

Optional Data Cleaning Activity

Practice your skills in data cleaning with another dataset! This is an optional activity not required for the course. But if you choose to complete this activity, please let us know so that we know how many students were interested in our optional activities. You are also welcome to ask us any questions about it. We estimate it should take you 30–45min to complete this activity.

  1. Download this dataset. If you want to read more about it, check out the dataset documentation.
  2. Create a data cleaning script that:
  • Renames variables to use consistent names with lower caps, no spaces, and no units of measurements
  • Checks for duplicates and corrects if needed
  • Reorders columns to move the Latitude and Longitude columns to after Squirrel ID
  • Fix typos and inconsistencies in the Height Aboveground column
  1. You will have to decide what you want to do with inconsistent values, but if you do not know the height, you can assign them an NA value
  • Recodes location variable to specify only “groundplane”, “aboveground”, or “both”.
  1. There is no need to mention “Specific Location” in the values here as a column for specific location exists and details their location.
  2. All entries that say solely “Specific Location” can be assigned to “aboveground” as they describe squirrels seen on a bench, tree, handrail or fence, as detailed in the specific location variable.
  • Saves the cleaned dataset with an informative name different from the raw dataset name

If starting from scratch feels like too much for you, you could also download this fill-in-the-blanks data cleaning script template with hints for you to complete this activity.

As always with R, there are multiple ways to perform the same actions, so your script won’t necessarily be the same as ours. But if you want a reference script that performs the above data cleaning tasks, you can find one here