Data Organization in R
Data in R can be stored as a variety of data types and organized in a
variety of data structures. We can also recode data in R depending on
our research needs.
This section covers how data is organized in R and how to clean a
dataset. Understanding how your data is organized and being able to
adjust that organization to match your research needs is an important
part of research data management. For example, you might need your data
to be in a certain structure to run certain statistical analyses or you
may need to transform variables into versions that are more easily
understood.
This section covers five main topics:
- What data types exist in R
- What data structures exist in R
- Introduction to the dplyr package
- Recoding existing data into another type of data
- Dealing with missing values
Let’s get started!
Data Types in R
Data can come in many different formats or units. Sometimes we have
numbers with decimals, sometimes we have counts of how many times
something happened, and sometimes we have categories with names. It is
important to know what type of data you have, how to figure that out in
R, and how to modify the data type if needed. Various functions in R
require using certain data types, so these are important foundational
skills for managing data in R.
Numeric Data
Numeric data is data made up of numbers. This is what we most
commonly think of as data, and this data type can have decimals. For
example, how many hours people slept on a given day (6, 8.25, 7.5), the
percent accuracy that students got on an exam (87.45, 75.92, 98.40), or
number of people who attended a data management workshop each day (32,
38, 28).
Integer Data
This data type deals with whole numbers (i.e., no decimals) and
operates similarly to numeric data types. R treats integers as a special
case of numeric data, so functions that expect numeric data also accept
integers.
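As a quick sketch of the difference between the two types, here is a
small example. The object names and values are made up for illustration
and are not part of our dataset.

```r
# Hypothetical example values, not from the course dataset
hours_slept <- c(6, 8.25, 7.5)    # decimals, stored as numeric
attendees <- c(32L, 38L, 28L)     # the L suffix stores whole numbers as integers

class(hours_slept)       # "numeric"
class(attendees)         # "integer"
is.numeric(attendees)    # TRUE: integers also count as numeric data
```

Without the L suffix, R stores whole numbers such as 32 as numeric by
default.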
Character/String Data
Character variables are variables with words or letters, such as
names of people (Erin, Maria, Huy) or places (ON, BC, QC). Character
values need to be put in quotation marks in R code, so they
are not confused with the names of objects or functions in R.
Factors
Factors are how R handles categorical data. Factors can store string
or integer data, because sometimes numbers do represent categories
(rather than a continuous range of numbers). Factors have leveled data.
For example, the variable of education may have the levels of “high
school,” “bachelor’s degree,” and “master’s degree.” Factors can be
ordered or unordered. Factor structure is particularly useful for
statistical modeling.
We will create some factor variables later in this section.
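In the meantime, here is a minimal sketch of unordered and ordered
factors using the base R factor() function. The education values are
invented for illustration.

```r
# Hypothetical education values for five people
education <- c("high school", "master's degree", "bachelor's degree",
               "high school", "bachelor's degree")

# Unordered factor: levels are sorted alphabetically by default
edu_factor <- factor(education)
levels(edu_factor)

# Ordered factor: we state the order of the levels ourselves
edu_ordered <- factor(education,
                      levels = c("high school", "bachelor's degree",
                                 "master's degree"),
                      ordered = TRUE)

# Ordered factors can be compared: is person 1 below person 2?
edu_ordered[1] < edu_ordered[2]   # TRUE
```

Stating the levels yourself is usually safer than relying on
alphabetical order, which rarely matches a meaningful ranking.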
Dates and Times
Data can also be stored in date and/or time format in R. Our data
doesn’t contain dates or times, but we want to let you know this is
possible. Date data would typically look like “22-05-2025” or
“2025-05-22.” Times could look like “11:12:34.”
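If you do run into dates in another dataset, a minimal sketch with the
base R function as.Date() might look like this. The object names are
hypothetical.

```r
# The year-month-day layout is read without extra instructions
workshop_day <- as.Date("2025-05-22")
class(workshop_day)          # "Date"

# For other layouts, describe the order with the format argument
other_day <- as.Date("22-05-2025", format = "%d-%m-%Y")
other_day == workshop_day    # TRUE: both are the same day

# Dates support arithmetic, here adding seven days
next_week <- workshop_day + 7
format(next_week)            # "2025-05-29"
```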
Logical Data
Logical data includes TRUE and FALSE data. We can also ask R to give us
a TRUE or FALSE response when we ask whether a condition is met.
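For instance, a comparison returns one TRUE or FALSE per element. This
is a small sketch with made-up sleep values.

```r
# Hypothetical hours of sleep for three people
hours_slept <- c(6, 8.25, 7.5)

# Ask R whether each person slept at least 7 hours
slept_enough <- hours_slept >= 7
slept_enough           # FALSE  TRUE  TRUE
class(slept_enough)    # "logical"

# TRUE counts as 1 and FALSE as 0, so sum() counts the TRUE values
sum(slept_enough)      # 2
```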
Data Structures in R
Data structures are how the data is organized in R. Here are the
major data structures in R. There are some additional structures (e.g.,
matrices and arrays) that are mostly used only when doing statistical
modeling. We will skip over these for now and focus on the data
structures you will likely encounter in a research data management
context.
Vectors
Vectors are the simplest data structure. They are essentially a list
of elements of the same data type. For example, you could have a numeric
vector containing the elements (10, 15, 20, 25) or a character vector
containing the elements (“blue”, “red”, “green”). They are one
dimensional, since they are just one long list. A single variable (e.g.,
a list of ages of all the participants) can be considered a vector.
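Vectors are usually built with the c() function. The sketch below uses
the example values from the paragraph above; the object names are made
up.

```r
# A numeric vector and a character vector
ages <- c(10, 15, 20, 25)
colours <- c("blue", "red", "green")

length(ages)    # 4 elements
ages[2]         # 15, the second element

# A vector holds one data type only, so mixing types converts everything
mixed <- c(10, "red")
class(mixed)    # "character"
```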
Data Frames
Data frames are the most common and popular type of data structure
used in R. Data frames work essentially like spreadsheets in that they
are two dimensional (i.e., have columns and rows) and can store
different data types in the same data frame (i.e., numeric and
character). Data frames are made up of vectors of the same length.
However, data frames do have some constraints. First, the columns and
rows must be named. Second, the lists of vectors in the data frame must
have equal lengths (i.e., each column needs an identical number of
elements/rows). Third, all elements in a single column must be of the
same data type (i.e., you can’t switch between numbers and names in a
single column).
You may see mention of tibbles, which are an updated version of data
frames and part of the tidyverse, and have the same constraints
as data frames. They are a slightly more user-friendly version of
tabular data.
For this course, our data is in data frame format.
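To make the idea concrete, here is a minimal sketch that builds a small
data frame from three vectors of equal length. The data and the object
name participants are invented for illustration.

```r
# Each argument is one column; all columns have three elements
participants <- data.frame(
  name = c("Erin", "Maria", "Huy"),
  province = c("ON", "BC", "QC"),
  hours_slept = c(6, 8.25, 7.5)
)

str(participants)              # shows each column and its data type
dim(participants)              # 3 rows and 3 columns
is.data.frame(participants)    # TRUE
```

If the vectors had different lengths that do not divide evenly,
data.frame() would stop with an error, which is the equal-length
constraint in action.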
Lists
Lists are an ordered collection of elements. They can contain
different data types and they are one dimensional. You can also have
lists that contain a series of vectors or data frames. We don’t often
have data in list form, but because lists can contain elements of
different lengths (unlike data frames) some functions in R will output
results into list format, so it is good to be aware of them. Here is an
example of a list:
list_example <- list("red", 400, "skyblue", "forestgreen", .333)
list_example
## [[1]]
## [1] "red"
##
## [[2]]
## [1] 400
##
## [[3]]
## [1] "skyblue"
##
## [[4]]
## [1] "forestgreen"
##
## [[5]]
## [1] 0.333
Because lists are ordered (i.e., each element in the list has a
specific spot in the list), you can call for an element based on the
order it has been assigned. For example, “skyblue” is third in this
example list.
list_example[3]
## [[1]]
## [1] "skyblue"
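Notice that single square brackets return a list containing the
element. As a short sketch, double square brackets return the element
itself:

```r
list_example <- list("red", 400, "skyblue", "forestgreen", .333)

# Single brackets keep the list wrapper
class(list_example[3])      # "list"

# Double brackets pull out the element itself
list_example[[3]]           # "skyblue"
class(list_example[[3]])    # "character"
```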
Introduction to dplyr
In order to check and modify data types we suggest using the dplyr
package, which is part of the tidyverse. dplyr is a powerful R package
designed to simplify data manipulation tasks. It provides a set of
intuitive functions that help with filtering, selecting, and
transforming data frames. It is the go-to package for data
organization. We will use just a few of the functions of this package in
this course, but there are many more applications of dplyr you can
explore. You can find more information about dplyr on the tidyverse
website.
Here are some of the most common functions of dplyr. We will practice
using some of these today with our dataset.
- mutate(): Useful when you want to create new columns based on values
  in existing columns in your dataset. For example, you may want to
  recode numeric data into new categories with character names.
- recode(): Useful for re-categorizing certain values in a variable as
  some other value. For example, maybe you want to relabel the province
  code 35 as “ON.”
- group_by(): Allows you to work with your data based on groups that you
  define or that already exist in your data. For example, you may want
  to group your data based on whether people live in a city or rural
  area, or by gender.
- summarize() or summarise(): Useful for seeing basic descriptive info
  about your data. This is particularly helpful when used with the
  group_by() function. For example, you can see if people in cities
  compared to rural areas feel more rushed on average (i.e., compare
  means between these two groups).
- |> (pipe operator): Passes the result on its left into the function on
  its right, so you can chain several steps together. You can read it
  more or less as “and then” when thinking about what your R code is
  doing. The |> pipe is built into R (version 4.1 and later) rather than
  dplyr, but the two are very often used together.
- filter(): This function allows you to extract rows that meet specific
  conditions you include.
- select(): This function lets you choose particular columns based on
  your specifications.
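The functions listed above can be chained together. Here is a minimal
sketch on a tiny made-up data frame (toy_data and its columns are
hypothetical and are not part of the course dataset). It assumes dplyr
is installed and that you are using R 4.1 or later for the |> pipe.

```r
library(dplyr)

# Hypothetical toy data, not the course dataset
toy_data <- data.frame(
  id = 1:6,
  area = c("city", "rural", "city", "rural", "city", "city"),
  hours_slept = c(6, 8, 7.5, 6.5, 5, 9)
)

sleep_summary <- toy_data |>
  filter(hours_slept >= 6) |>                    # keep rows with 6+ hours
  select(area, hours_slept) |>                   # keep only two columns
  mutate(sleep_minutes = hours_slept * 60) |>    # create a new column
  group_by(area) |>                              # split by city vs. rural
  summarize(mean_minutes = mean(sleep_minutes))  # one mean per group

sleep_summary
```

Reading the pipe as “and then”: take toy_data, and then filter it, and
then select columns, and so on. The result has one row per area, with
the mean sleep in minutes for each.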
It is important when using dplyr to assign your output to an object if
you want to retain the info you’ve found or created. Remember the <-
symbol can be used to assign output to a named object. We will practice
this today too.
Together, these functions in dplyr enable you to easily explore your
data by various subgroups, create new variables, or create a focused
subset of your data for further analysis.
Let’s explore the data types in our data using the dplyr functions!
Checking Data Types
First we load the dplyr package (remember to install it first if you
haven’t yet). If you have installed the tidyverse package, then dplyr
was also included. However, it is best practice to load only the
libraries you need (i.e., just dplyr and not all of the tidyverse
packages) to save space.
We will also load in our data. Yesterday we used the read_csv function
to read in the csv file, but now we are working with an Rdata file we
saved yesterday, so we can use the load function.
library(dplyr)
load("data/timeuse_day2.Rdata")
Just to reacquaint ourselves with our data, use the head function to
peek at the first 6 rows of our data.
head(js_data)
| id | ageGrp | sex | maritalStat | province | popCenter | eduLevel | feelRushed | extraTime | durSleep | durMealPrep | durEating | durAlone | durDriving | durWork | durShoolSite | durSchoolOnline | durStudy | mainStudy | mainJobHunting | mainWork | worked12m | workedWeek | enrollStat | dailyTexts | timeSlowDown | timeWorkaholic | timeNotFamFriends | timeWantAlone |
| 10000 | 5 | 1 | 5 | 46 | 1 | 3 | 1 | 1 | 510 | 60 | 120 | 770 | 90 | 0 | 0 | 0 | 0 | NA | 1 | 1 | 1 | 2 | NA | 8 | 2 | 2 | 2 | 2 |
| 10001 | 5 | 1 | 1 | 59 | 1 | 4 | 3 | 4 | 420 | 150 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | NA | 2 | 1 | 1 | 2 | NA | 1 | 2 | 2 | 2 | 2 |
| 10002 | 4 | 2 | 1 | 47 | 1 | 5 | 1 | 6 | 570 | 0 | 0 | 630 | 30 | 480 | 0 | 0 | 0 | NA | NA | NA | 1 | 1 | NA | 7 | 2 | 1 | 1 | 1 |
| 10003 | 6 | 2 | 5 | 35 | 1 | 4 | 2 | 4 | 510 | 10 | 45 | 875 | 80 | 20 | 0 | 0 | 0 | NA | NA | NA | 1 | 1 | NA | 1 | 2 | 2 | 2 | 2 |
| 10004 | 2 | 1 | 6 | 35 | 1 | NA | 1 | 3 | 525 | 90 | 40 | 815 | 0 | 0 | 0 | 0 | 0 | NA | NA | NA | 2 | 2 | NA | 1 | 2 | 2 | 2 | 2 |
| 10005 | 1 | 1 | 6 | 35 | 1 | 1 | 1 | 6 | 435 | 0 | 0 | 430 | 40 | 530 | 0 | 0 | 0 | NA | NA | NA | 1 | 1 | NA | 2 | 2 | 1 | 1 | 2 |
We can check the data type of variables using the class() function and
calling for a specific variable from our dataset js_data using the $
operator, followed by the name of our variable.
class(js_data$durWork)
## [1] "numeric"
Try exploring the data type of a few other variables on your own.
Recoding Data Types
For some variables, we might want to change the data type based on
our codebook. For instance, we can see that our “province” variable is
in numeric format.
class(js_data$province)
## [1] "numeric"
However, maybe we would prefer to have province acronyms (i.e.,
factor format) rather than the numeric representation of them. We can
re-code variables using the pipe operator (|>) and the dplyr function
mutate(). The possible values for the province variable, and what
province they represent, can be found in the dataset codebook.
# Create a new factor column from the numeric province codes
js_data <- js_data |>
  mutate(province_fact = recode_factor(province,
                                       "10" = "NL",
                                       "11" = "PEI",
                                       "12" = "NS",
                                       "13" = "NB",
                                       "24" = "QC",
                                       "35" = "ON",
                                       "46" = "MB",
                                       "47" = "SK",
                                       "48" = "AB",
                                       "59" = "BC"))
This code is telling R to use the function recode_factor() to recode
the variable “province” from the dataset “js_data” into a new variable
called “province_fact” (you could call it anything, but make sure it is
a meaningful name).
The list outlines how each value of the current “province” variable
should be recoded for the new “province_fact” variable. recode_factor()
handles all of these values in one call, so you do not have to write a
separate recode step for each one.
The mutate() portion tells R to create a new column called
“province_fact”.
Before the mutate() function, this code tells R to take the “js_data”
object (i.e., the dataset) “and then” (i.e., the pipe operator, |>)
pass it along to the function that follows.
Lastly, all of this work needs to be assigned back to our dataset
“js_data” using the <- symbol in order to keep our work in our
environment.
We did not assign this work to replace/overwrite the existing
“province” variable. In order to have transparent data management and to
prevent overwriting the original data in the instance that you make a
mistake in your recoding, it is always best practice to keep the
original variables in their original format and create new variables
based on them as needed. In this way, if you ever need to reference the
original data (and you definitely will need to), you still have it
available to you.
Now let’s check the data type of this new variable we just created.
class(js_data$province_fact)
## [1] "factor"
An alternative way to check your data types is using is.numeric(),
is.factor(), or other variations of this function. These functions
return a logical value (TRUE or FALSE) as the output. These are base R
functions, and not from the tidyverse.
is.numeric(js_data$province_fact)
## [1] FALSE
is.factor(js_data$province_fact)
## [1] TRUE
Sometimes you may want to change the data type without recoding all
of the data. In this case a useful function is as.numeric(). Because
this data is already stored in numeric format, this function will not
change the data type in this instance, but this can be helpful for
other datasets you may work with in the future. Here is an example of
how you would use this function. This is a base R function, and not
from the tidyverse.
as.numeric(js_data$durSleep)
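To see as.numeric() actually change a data type, here is a small sketch
with made-up values where numbers were read in as text. The object
names are hypothetical.

```r
# Numbers stored as text, as can happen when importing messy files
text_numbers <- c("510", "420", "570")
class(text_numbers)     # "character"

# Convert to numeric and assign the result to keep it
real_numbers <- as.numeric(text_numbers)
class(real_numbers)     # "numeric"
mean(real_numbers)      # 500
```

As with dplyr output, the conversion is only kept if you assign it to
an object.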
Missing Data
You may have noticed that there are some cells in the data with NA.
This represents missing data in our dataset. Most datasets have missing
data. There are many reasons for missing data and many different ways
to handle missing data. These choices will depend on the context of
specific research, but generally you should plan for how to approach
missing data in advance.
In the original dataset, missing data had been notated with various
numerical codes (you can see these in the original codebook), but we
have recoded the data so that all missing data has been denoted NA.
If your data contains NA values, some functions will return NA or give
an error message because the data is incomplete. For example, if you
try to get the mean of education level (eduLevel), R will return NA
because it can’t calculate a mean when there are missing values.
Note: getting a mean for education level, which is a categorical
variable, is not a meaningful number. However, it is used here to
illustrate this point, given that the continuous numeric variables in
this dataset do not contain missing values.
mean(js_data$eduLevel)
## [1] NA
If we look up the mean function (i.e., run help(mean) in your
console), we can see that the default for mean is na.rm = FALSE, which
means missing values are not removed before the calculation.
Instead, we can tell R to ignore those missing values. Adding
na.rm = TRUE to the function will typically tell R that it should
complete the function, ignoring missing values.
mean(js_data$eduLevel, na.rm = TRUE)
## [1] 3.723747
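The same behaviour is easy to see on a tiny made-up vector, where you
can check the arithmetic by hand. The values below are hypothetical.

```r
# Four hypothetical values, one of them missing
sleep_hours <- c(6, 8, NA, 7)

mean(sleep_hours)                  # NA, because of the missing value
mean(sleep_hours, na.rm = TRUE)    # 7, the mean of 6, 8, and 7
```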
If you want to check if a variable or data frame has missing data,
you can use the function is.na.
is.na(js_data$eduLevel)
If you want to know the number of cells with missing data for a given
variable you can use the same function, wrapped in the sum function. We
can layer functions like this, with R reading the functions starting
from inside the parentheses and then moving outwards. We will continue
to work with syntax like this throughout the week.
sum(is.na(js_data$eduLevel))
## [1] 630
If you want to find the rows where there is no missing data you can
use complete.cases().
complete.cases(js_data)
If you want to know the total number of rows (i.e., participants)
with no missing data you can take the complete.cases() function and
wrap it in the sum() function.
sum(complete.cases(js_data))
## [1] 699
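Here is a minimal sketch of these missing-data checks on a small
made-up data frame, so you can verify each result by eye. toy_data is
hypothetical and only borrows two column names from our dataset.

```r
# Hypothetical toy data with two missing values
toy_data <- data.frame(
  id = 1:4,
  eduLevel = c(3, NA, 5, 2),
  durSleep = c(510, 420, NA, 525)
)

# Count the missing values in each column
colSums(is.na(toy_data))

# One TRUE or FALSE per row: TRUE means the row has no missing values
complete.cases(toy_data)    # TRUE FALSE FALSE TRUE

# Keep only the complete rows in a new object
complete_rows <- toy_data[complete.cases(toy_data), ]
nrow(complete_rows)         # 2
```

Saving the complete rows to a new object, rather than overwriting
toy_data, follows the same practice of keeping the original data
intact.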
Your Turn!
Recoding Data Types
What type of data is the variable for marital status?
class(js_data$maritalStat)
Can you re-code marital status into a factor variable?
# Create a new factor column from the numeric marital status codes
js_data <- js_data |>
  mutate(maritalStat_fact = recode_factor(maritalStat,
                                          "1" = "Married",
                                          "2" = "Living common-law",
                                          "3" = "Widowed",
                                          "4" = "Separated",
                                          "5" = "Divorced",
                                          "6" = "Single, never married"))
class(js_data$maritalStat_fact)
Can you recode level of education into a factor variable?
# Create a new factor column from the numeric education level codes
js_data <- js_data |>
  mutate(eduLevel_fact = recode_factor(eduLevel,
    "1" = "Less than high school diploma or its equivalent",
    "2" = "High school diploma or equivalency",
    "3" = "Trade certificate or diploma",
    "4" = "College, CEGEP, or other non-university certificate or diploma",
    "5" = "University certificate or diploma below the bachelor's level",
    "6" = "Bachelor's degree",
    "7" = "University certificate, diploma, degree above the BA level"))
class(js_data$eduLevel_fact)
Check Your Knowledge
What data structure would you use to store the following data?
a. The names of the 10 most popular movies this year (10 data points)
b. The names of the leading actor from each of the 10 most popular
movies this year (10 data points total)
c. A collection of all of the actors in each of the 10 most popular
movies this year (100 data points total)
d. A combination of a and b
e. A combination of a, b, and c
f. The data we have been working with in this workshop
Is our data in a data frame format? Hint: use the function
“is.data.frame()”
