Data Organization in R

Data in R can be stored as a variety of data types and organized in a variety of data structures. We can also recode data in R depending on our research needs.

This section covers how data is organized in R and how to clean a dataset. Understanding how your data is organized and being able to adjust that organization to match your research needs is an important part of research data management. For example, you might need your data to be in a certain structure to run certain statistical analyses or you may need to transform variables into versions that are more easily understood.

This section covers five main topics:

  1. What data types exist in R
  2. What data structures exist in R
  3. Introduction to the dplyr package
  4. Recoding existing data into another type of data
  5. Dealing with missing values

Let’s get started!

Data Types in R

Data can come in many different formats or units. Sometimes we have numbers with decimals, sometimes we have counts of how many times something happened, and sometimes we have categories with names. It is important to know what type of data you have, how to figure that out in R, and how to modify the data type if needed. Various functions in R require certain data types, so these are important foundational skills for managing data in R.

Numeric Data

Numeric data is data made up of numbers. This is what we most commonly think of as data, and this data type can have decimals. Examples include how many hours people slept on a given day (6, 8.25, 7.5), the percent accuracy that students got on an exam (87.45, 75.92, 98.40), or the number of people who attended a data management workshop each day (32, 38, 28).

Integer Data

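This data type deals with whole numbers (i.e., no decimals) and operates similarly to the numeric data type. R treats integers as a kind of numeric data, so an integer variable is also recognized as numeric by functions such as is.numeric(). Note that whole numbers you type or import are often stored as “numeric” by default unless they are explicitly made integers.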
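
To see how R distinguishes the two, here is a minimal sketch using made-up values. The object names are hypothetical and are not part of the workshop dataset.

```r
# Hypothetical attendance counts for illustration
workshop_count <- 32        # a plain number is stored as "numeric" by default
workshop_count_int <- 32L   # the L suffix asks R to store an integer

class(workshop_count)
class(workshop_count_int)

# Integers still count as numeric for functions that need numbers
is.numeric(workshop_count_int)
```

In practice you rarely need to convert between the two; the distinction mostly matters when a function reports "integer" where you expected "numeric".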

Character/String Data

Character variables are variables with words or letters, such as names of people (Erin, Maria, Huy) or places (ON, BC, QC). Character values usually need to be put in quotation marks in R code, so they are not confused with the names of objects or functions in R.

Factors

Factors are how R handles categorical data. Factors can be built from string or integer data, because sometimes numbers represent categories (rather than a continuous range of numbers). The categories of a factor are called levels. For example, the variable of education may have the levels of “high school,” “bachelors degree,” and “masters degree.” Factors can be ordered or unordered. Factor structure is particularly useful for statistical modeling.

We will create some factor variables later in this section.
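
In the meantime, here is a minimal sketch of an unordered and an ordered factor, built with the base R function factor(). The education responses are invented for illustration.

```r
# Hypothetical education responses for five participants
education <- c("high school", "bachelors degree", "high school",
               "masters degree", "bachelors degree")

# Unordered factor: levels are sorted alphabetically by default
education_fact <- factor(education)
levels(education_fact)

# Ordered factor: we state the order of the levels ourselves
education_ord <- factor(education,
                        levels = c("high school", "bachelors degree",
                                   "masters degree"),
                        ordered = TRUE)
levels(education_ord)
is.ordered(education_ord)
```

Stating the levels explicitly is usually worth the extra typing, since alphabetical order rarely matches the meaningful order of categories.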

Dates and Times

Data can also be stored in date and/or time format in R. Our data doesn’t contain dates or times, but we want to let you know this is possible. Date data would typically look like “22-05-2025” or “2025-05-22.” Times could look like “11:12:34.”
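
As a small sketch of what this looks like in practice, the base R function as.Date() can convert text into a date. The object names are hypothetical; the date strings are the two examples above.

```r
# Year-month-day is the default layout that as.Date() expects
iso_date <- as.Date("2025-05-22")

# Other layouts need a format string describing the order of the parts
dmy_date <- as.Date("22-05-2025", format = "%d-%m-%Y")

class(iso_date)

# Both objects represent the same day, so this comparison is TRUE
iso_date == dmy_date
```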

Logical Data

Logical data consists of the values TRUE and FALSE. R also returns TRUE or FALSE when we ask whether a condition is met.
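
For instance, comparing a vector against a threshold produces logical data. This is a minimal sketch with invented sleep durations; the object names are hypothetical.

```r
# Hypothetical sleep durations in hours
hours_slept <- c(6, 8.25, 7.5)

# Ask R whether each value is below 7; the answer is a logical vector
slept_under_7 <- hours_slept < 7
slept_under_7
class(slept_under_7)

# TRUE counts as 1 and FALSE as 0, so sum() counts the TRUE values
sum(slept_under_7)
```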

Data Structures in R

Data structures are how the data is organized in R. Here are the major data structures in R. There are some additional structures (e.g., matrices and arrays) that are mostly used only when doing statistical modeling. We will skip over these for now and focus on the data structures you will likely encounter in a research data management context.

Vectors

Vectors are the simplest data structure. They are essentially a list of elements of the same data type. For example, you could have a numeric vector containing the elements (10, 15, 20, 25) or a character vector containing the elements (“blue”, “red”, “green”). They are one dimensional, since they are just one long list. A single variable (e.g., a list of ages of all the participants) can be considered a vector.
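
The vectors described above can be created with the base R function c(). This sketch also shows what happens if you try to mix data types in one vector; the object names are hypothetical.

```r
# A numeric vector and a character vector
ages <- c(10, 15, 20, 25)
colours <- c("blue", "red", "green")

class(ages)
length(ages)
class(colours)

# A vector holds one data type only, so mixing numbers and text
# makes R convert everything to character
mixed <- c(10, "red")
class(mixed)
```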

Data Frames

Data frames are the most common and popular type of data structure used in R. Data frames work essentially like spreadsheets in that they are two dimensional (i.e., have columns and rows) and can store different data types in the same data frame (i.e., numeric and character). Data frames are made up of vectors of the same length.

However, data frames do have some constraints. First, the columns and rows must be named. Second, the lists of vectors in the data frame must have equal lengths (i.e., each column needs an identical number of elements/rows). Third, all elements in a single column must be of the same data type (i.e., you can’t switch between numbers and names in a single column).
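
To make these constraints concrete, here is a minimal sketch that builds a small data frame from three equal-length vectors. The dataset and column names are invented for illustration.

```r
# Hypothetical three-participant dataset; each column is a vector of length 3
toy_data <- data.frame(
  name = c("Erin", "Maria", "Huy"),
  hours_slept = c(6, 8.25, 7.5),
  province = c("ON", "BC", "QC")
)

dim(toy_data)   # number of rows, then number of columns
str(toy_data)   # the data type of each column

# A single column pulled out with $ is a vector
toy_data$hours_slept
```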

You may see mention of tibbles, which are an updated version of data frames provided by the tibble package (part of the tidyverse). They have the same constraints as data frames and are a somewhat more user-friendly version of tabular data.

For this course, our data is in data frame format.

Lists

Lists are an ordered collection of elements. They can contain different data types and they are one dimensional. You can also have lists that contain a series of vectors or data frames. We don’t often have data in list form, but because lists can contain elements of different lengths (unlike data frames) some functions in R will output results into list format, so it is good to be aware of them. Here is an example of a list:

list_example <- list("red", 400, "skyblue", "forestgreen", .333)
list_example
## [[1]]
## [1] "red"
## 
## [[2]]
## [1] 400
## 
## [[3]]
## [1] "skyblue"
## 
## [[4]]
## [1] "forestgreen"
## 
## [[5]]
## [1] 0.333

Because lists are ordered (i.e., each element in the list has a specific spot in the list), you can call for an element based on the order it has been assigned. For example, “skyblue” is third in this example list.

list_example[3]
## [[1]]
## [1] "skyblue"
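
Notice that single square brackets return a list containing the element. If you want the element itself, base R also offers double square brackets. A minimal sketch using the same example list:

```r
list_example <- list("red", 400, "skyblue", "forestgreen", .333)

# Single brackets: the result is still a list (of length one)
class(list_example[3])

# Double brackets: the result is the element itself
class(list_example[[3]])
list_example[[3]]
```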

Introduction to dplyr

In order to check and modify data types we suggest using the dplyr package, which is part of the tidyverse. dplyr is a powerful R package designed to simplify data manipulation tasks. It provides a set of intuitive functions that help with filtering, selecting, and transforming data frames. It is the go-to package for data organization. We will use just a few of the functions of this package in this course, but there are many more applications of dplyr you can explore. You can find more information about dplyr on the tidyverse website.

Here are some of the most common functions of dplyr. We will practice using some of these today with our dataset.

  • mutate(): Useful when you want to create new columns based on values in existing columns in your dataset. For example, you may want to recode numeric data into new categories with character names.

  • recode(): Useful for replacing specific values in a variable with other values. For example, you could replace the numeric code 35 with the label “ON.” (Labels based on a condition, such as labeling everyone who sleeps less than seven hours a night as “sleep deprived,” are usually created with if_else() or case_when() inside mutate().)

  • group_by(): Allows you to view your data based on groups that you define or already exist in your data. For example, you may want to group your data based on whether people live in a city or rural area, or by gender.

  • summarize() or summarise(): Useful for seeing basic descriptive info about your data. This is particularly helpful when used with the group_by() function. For example, you can see if people in cities compared to rural areas feel more rushed on average (i.e., compare means between these two groups).

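  • |> (pipe operator): Passes the result on its left to the function on its right, so you can chain several steps together. You can read it more or less as “and then” when thinking about what your R code is doing. The |> pipe is built into R (version 4.1 and later) rather than being part of dplyr, but it is commonly used with dplyr functions.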

  • filter(): This function allows you to extract rows that meet specific conditions you include.

  • select(): This function lets you choose particular columns based on your specifications.

It is important when using dplyr to assign your output to an object if you want to retain the info you’ve found or created. Remember the <- symbol can be used to assign output to a named object. We will practice this today too.

Together, these functions in dplyr enable you to easily explore your data by various subgroups, create new variables, or create a focused subset of your data for further analysis.
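
To give a feel for how these pieces fit together, here is a minimal sketch on a small invented dataset (not the workshop data). The object and column names are hypothetical, and if_else() is a dplyr helper that is not covered in the list above.

```r
library(dplyr)

# Hypothetical data: area type and minutes of sleep for six people
toy_data <- data.frame(
  area = c("city", "city", "rural", "rural", "city", "rural"),
  sleep_min = c(420, 510, 480, 390, 450, 540)
)

# Take toy_data "and then" add columns, group, and summarise
sleep_summary <- toy_data |>
  mutate(sleep_hours = sleep_min / 60) |>
  mutate(sleep_status = if_else(sleep_hours < 7,
                                "sleep deprived", "rested")) |>
  group_by(area) |>
  summarise(mean_sleep_hours = mean(sleep_hours),
            n_sleep_deprived = sum(sleep_status == "sleep deprived"))

sleep_summary

# Keep only the city rows, and only the sleep_min column
city_only <- toy_data |>
  filter(area == "city") |>
  select(sleep_min)

city_only
```

Both results are assigned to new objects with <-, so toy_data itself is left unchanged.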

Let’s explore the data types in our data using the dplyr functions!

Checking Data Types

First we load the dplyr package (remember to install it first if you haven’t yet). If you have installed the tidyverse package, then dplyr was also included. However, it is best practice to load only the libraries you need (i.e., just dplyr and not all of the tidyverse packages) to save space.

We will also load in our data. Yesterday we used the read_csv function to read in the csv file, but now we are working with an Rdata file we saved yesterday, so we can use the load function.

library(dplyr)
load("data/timeuse_day2.Rdata")

Just to reacquaint ourselves with our data, use the head function to peek at the first 6 rows of our data.

head(js_data)
id ageGrp sex maritalStat province popCenter eduLevel feelRushed extraTime durSleep durMealPrep durEating durAlone durDriving durWork durShoolSite durSchoolOnline durStudy mainStudy mainJobHunting mainWork worked12m workedWeek enrollStat dailyTexts timeSlowDown timeWorkaholic timeNotFamFriends timeWantAlone
10000 5 1 5 46 1 3 1 1 510 60 120 770 90 0 0 0 0 NA 1 1 1 2 NA 8 2 2 2 2
10001 5 1 1 59 1 4 3 4 420 150 0 0 0 0 0 0 0 NA 2 1 1 2 NA 1 2 2 2 2
10002 4 2 1 47 1 5 1 6 570 0 0 630 30 480 0 0 0 NA NA NA 1 1 NA 7 2 1 1 1
10003 6 2 5 35 1 4 2 4 510 10 45 875 80 20 0 0 0 NA NA NA 1 1 NA 1 2 2 2 2
10004 2 1 6 35 1 NA 1 3 525 90 40 815 0 0 0 0 0 NA NA NA 2 2 NA 1 2 2 2 2
10005 1 1 6 35 1 1 1 6 435 0 0 430 40 530 0 0 0 NA NA NA 1 1 NA 2 2 1 1 2

We can check the data type of variables using the class() function and calling for a specific variable from our dataset js_data using the $ operator, followed by the name of our variable.

class(js_data$durWork)
## [1] "numeric"

Try exploring the data type of a few other variables on your own.
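
If you would like to check every variable in one go, one option is to apply class() to each column with the base R function sapply(). This is a sketch on a small invented data frame; the same call should work on js_data, assuming it is loaded as a data frame.

```r
# Hypothetical mini dataset with two numeric columns and one character column
toy_data <- data.frame(
  id = c(10000, 10001),
  province = c(46, 59),
  note = c("a", "b")
)

# Apply class() to each column; the result is a named character vector
column_types <- sapply(toy_data, class)
column_types

# With the workshop data you could try: sapply(js_data, class)
```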

Recoding Data Types

For some variables, we might want to change the data type based on our codebook. For instance, we can see that our “province” variable is in numeric format.

class(js_data$province)
## [1] "numeric"

However, maybe we would prefer to have province abbreviations (i.e., factor format) rather than the numeric representation of them. We can re-code variables using the pipe (|>) together with the dplyr functions mutate() and recode_factor(). The possible values for the province variable, and what province they represent, can be found in the dataset codebook.

# Inside mutate() we can refer to the column by name (province),
# without repeating js_data$
js_data <- js_data |> mutate(province_fact = recode_factor(province, 
                            "10" = "NL", 
                            "11" = "PEI", 
                            "12" = "NS",
                            "13" = "NB", 
                            "24" = "QC", 
                            "35" = "ON", 
                            "46" = "MB", 
                            "47" = "SK", 
                            "48" = "AB", 
                            "59" = "BC"))

  • This code is telling R to use the function recode_factor() to recode the variable “province” from the dataset “js_data” into a new variable called “province_fact” (you could call it anything, but make sure it is a meaningful name).

  • The list outlines how each value of the current “province” variable should be recoded for the new “province_fact” variable.

  • The mutate() portion tells R to create a new column called “province_fact”.

  • Before the mutate() function this code tells R to take the “js_data” object (i.e., the dataset) “and then” (i.e., the pipe function) apply the following functions.

  • Because recode_factor() accepts all of the old-value and new-value pairs in a single call, every value is recoded at once, rather than having to write out a separate recode step for each one.

  • Lastly, all of this work needs to be assigned back to our dataset “js_data” using the <- symbol in order to keep our work in our environment.

We did not assign this work to replace/overwrite the existing “province” variable. In order to have transparent data management and to prevent overwriting the original data in the instance that you make a mistake in your recoding, it is always best practice to keep the original variables in their original format and create new variables based on them as needed. In this way, if you ever need to reference the original data (and you definitely will need to), you still have it available to you.

Now let’s check the data type of this new variable we just created.

class(js_data$province_fact)
## [1] "factor"
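
Beyond checking the class, it can be reassuring to confirm that each original code ended up with the label you intended. One possible check, sketched here on a small invented data frame rather than js_data, is to cross-tabulate the old and new variables with the base R function table().

```r
library(dplyr)

# Hypothetical province codes for four participants
toy_data <- data.frame(province = c(35, 46, 59, 35))

# Same recoding pattern as above, with only the three codes present here
toy_data <- toy_data |>
  mutate(province_fact = recode_factor(province,
                                       "35" = "ON",
                                       "46" = "MB",
                                       "59" = "BC"))

# Rows are the original codes, columns are the new labels;
# each row should have counts in exactly one column
table(toy_data$province, toy_data$province_fact)
```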

An alternative way to check your data types is using is.numeric(), is.factor(), or other variations of this function. These functions return a logical value (TRUE or FALSE) as the output. These are base R functions, and not from the tidyverse.

is.numeric(js_data$province_fact)
## [1] FALSE
is.factor(js_data$province_fact)
## [1] TRUE

Sometimes you may want to change the data type without recoding all of the data. In this case a useful function is as.numeric(). Because this data is already stored in numeric format, this function will not change the data type in this instance, but this can be helpful for other datasets you may work with in the future. Here is an example of how you would use this function. This is a base R function, and not from the tidyverse.

as.numeric(js_data$durSleep)
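
To show a case where the conversion does change something, here is a minimal sketch with invented values stored as text. It also illustrates a common surprise when converting factors. The object names are hypothetical.

```r
# Hypothetical durations that were read in as text
dur_text <- c("510", "420", "570")
class(dur_text)

# After conversion we can do arithmetic on the values
dur_num <- as.numeric(dur_text)
class(dur_num)
mean(dur_num)

# Caution with factors: as.numeric() returns the internal level codes,
# so convert to character first to recover the original numbers
prov_fact <- factor(c("46", "59", "47"))
as.numeric(prov_fact)                 # level codes, not province codes
as.numeric(as.character(prov_fact))   # the original province codes
```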

Missing Data

You may have noticed that there are some cells in the data with NA. This represents missing data in our dataset. Most datasets have missing data. There are many reasons for missing data and many different ways to handle missing data. These choices will depend on the context of specific research, but generally you should plan for how to approach missing data in advance.

In the original dataset, missing data had been notated with various numerical codes (you can see these in the original codebook), but we have recoded the data so that all missing data has been denoted NA.

If your data contains NA values, some functions will return NA or give an error message because the data is incomplete. For example, if you try to get the mean of education level (eduLevel), R returns NA because it can’t calculate a mean when there are missing values.

Note: getting a mean for education level, which is a categorical variable, is not a meaningful number. However, it is used here to illustrate this point, given that the continuous numeric variables in this dataset do not contain missing values.

mean(js_data$eduLevel)
## [1] NA

If we look up the mean function (i.e., run help(mean) in your console), we can see that the default for mean is to have na.rm = FALSE.

Instead, we can tell R to ignore those missing values. Adding na.rm = TRUE to the function call tells R to drop the missing values before computing the result.

mean(js_data$eduLevel, na.rm = TRUE)
## [1] 3.723747
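
The same behaviour is easy to verify by hand on a short invented vector, where the mean of the non-missing values is simple to work out. The object name is hypothetical.

```r
# Hypothetical education codes with one missing value
edu <- c(3, 4, NA, 5, 1)

# With the default na.rm = FALSE the result is NA
mean(edu)

# With na.rm = TRUE the NA is dropped: (3 + 4 + 5 + 1) / 4
mean(edu, na.rm = TRUE)
```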

If you want to check if a variable or data frame has missing data, you can use the function is.na.

is.na(js_data$eduLevel)

If you want to know the number of cells with missing data for a given variable you can use the same function, wrapped in the sum function. We can layer functions like this, with R reading the functions starting from inside the parentheses and then moving outwards. We will continue to work with syntax like this throughout the week.

sum(is.na(js_data$eduLevel))
## [1] 630

If you want to find instances where there is no missing data you can use complete.cases().

complete.cases(js_data)

If you want to know the total number of rows (i.e., participants) with no missing data you can take the complete.cases() function and wrap it in the sum() function.

sum(complete.cases(js_data))
## [1] 699
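
These checks can be combined to get an overview of missing data. The sketch below uses a small invented data frame; colSums() is a base R function not covered above, and the final step (keeping only complete rows) is shown as one possible approach rather than a recommendation for every project.

```r
# Hypothetical mini dataset with some missing values
toy_data <- data.frame(
  id = c(1, 2, 3, 4),
  eduLevel = c(3, NA, 5, NA),
  durSleep = c(510, 420, NA, 525)
)

# Number of missing values in each column
colSums(is.na(toy_data))

# TRUE for rows with no missing values, and how many such rows there are
complete.cases(toy_data)
sum(complete.cases(toy_data))

# Keep only the complete rows in a new object, leaving toy_data untouched
toy_complete <- toy_data[complete.cases(toy_data), ]
nrow(toy_complete)
```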

Your Turn!

Recoding Data Types

What type of data is the variable for marital status?

class(js_data$maritalStat)

Can you re-code marital status into a factor variable?

js_data <- js_data |> mutate(maritalStat_fact = recode_factor(maritalStat, 
                            "1" = "Married", 
                            "2" = "Living common-law", 
                            "3" = "Widowed",
                            "4" = "Separated", 
                            "5" = "Divorced", 
                            "6" = "Single, never married"))

class(js_data$maritalStat_fact)

Can you recode level of education into a factor variable?

js_data <- js_data |> mutate(eduLevel_fact = recode_factor(eduLevel,
                            "1" = "Less than high school diploma or its equivalent",
                            "2" = "High school diploma or equivalency",
                            "3" = "Trade certificate or diploma",
                            "4" = "College, CEGEP, or other non-university certificate or diploma",
                            "5" = "University certificate or diploma below the bachelor's level",
                            "6" = "Bachelor's degree",
                            "7" = "University certificate, diploma, degree above the BA level"))

class(js_data$eduLevel_fact)

Check Your Knowledge

What data structure would you use to store the following data?

  1. The names of the 10 most popular movies this year (10 data points)

  2. The names of the leading actor from each of the 10 most popular movies this year (10 data points total)

  3. A collection of all of the actors in each of the 10 most popular movies this year (100 data points total)

  4. A combination of 1 and 2

  5. A combination of 1, 2, and 3

  6. The data we have been working with in this workshop

Is our data in a data frame format? Hint: use the function “is.data.frame()”

