Introduction

RMarkdown lets you combine R code and text in one document to create dynamic, reproducible reports.

By embedding code directly within your written explanations, RMarkdown ensures that your analysis, results, and visuals are regenerated every time the document is rendered, so they stay in step with the data and code. This makes it easy for others (and your future self!) to understand, verify, and rerun your work exactly as it was done.

In this short tutorial, we will learn:

  • How to use R code within an RMarkdown document
  • How to import and explore our dataset
  • How to rename columns

Intro to Data in R

Basic Syntax of R

The most important components of an R script are objects and functions. Objects store information and functions are used to manipulate the data.

Assignment operators, pipes and arguments are used to link objects and functions and communicate what we want to do.

Objects

An object is anything you create and name in R. It can be a number, a dataset, a function, or even a plot. Objects take on content from everything to the right of the assignment operator.

x <- 5 #x is now an object that holds the value 5
b <- "Anna" #b is now an object that holds the character Anna

Notes:

  • Since Anna is a character, it needs to be wrapped in quotation marks (we will learn more about data types tomorrow).

  • The symbol # is used within a code chunk to insert comments. Comments won’t affect how the code runs, but text that is not signaled as a comment will generate errors.

Functions

A function is a set of instructions that accomplishes a task, such as calculating, sorting, or plotting. Functions are often (though not always) applied to one or more arguments. You call a function by typing its name followed by parentheses.
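As a minimal sketch of calling functions by name, the lines below use three functions that ship with base R. The object names (root, shout, today) and the input values are arbitrary choices for illustration.

```r
# Call a function by typing its name followed by parentheses.
root <- sqrt(16)          # square root of 16
shout <- toupper("anna")  # convert text to upper case
today <- Sys.Date()       # some functions need nothing inside the parentheses

root
shout
today
```

The first two calls act on a value placed inside the parentheses, while Sys.Date() runs with the parentheses left empty.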

Arguments

Arguments are the details you give a function so it knows what to do. They go inside the parentheses of a function. Let’s take a look at the function mean().

Type mean in the search box of the Help tab (in the bottom right panel of the default RStudio layout), or run ?mean in the console. The help page describes the function, including its arguments.

Argument | What it means | Default value
x | An R object that contains the numbers we want to find the mean of | No default (required)
trim | A number between 0 and 0.5. Removes that fraction of the highest and lowest values before computing the mean (useful if you want a trimmed mean that ignores outliers) | 0 (no trimming)
na.rm | Indicates whether NA values should be removed before the calculation: TRUE if they should be removed, FALSE if not | FALSE (NA values are kept)
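To see how these arguments change the result, here is a short sketch that calls mean() on a made-up vector. The object name values and its contents are hypothetical, chosen to include one outlier and one missing value.

```r
# Hypothetical values: one outlier (100) and one missing value (NA)
values <- c(2, 4, 6, 8, 100, NA)

default_mean <- mean(values)                           # NA values are kept by default
no_na_mean   <- mean(values, na.rm = TRUE)             # drop the NA, then average
trimmed_mean <- mean(values, trim = 0.2, na.rm = TRUE) # also drop the lowest and highest 20%

default_mean   # NA
no_na_mean     # 24
trimmed_mean   # 6
```

Arguments are written as name = value and separated by commas. Any argument you leave out keeps its default.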

Assignment Operators

The assignment operator <- is how you store a value in R. It’s like saying: “Let this name hold this value.” It assigns content from the objects/functions/arguments on its right to the object on its left.

name <- "Maria" #Now name holds the string "Maria".

Note: The assignment operator <- is also considered a function. It is a ‘store’ function that assigns information to an object. The arrow <- is the most common, but = can also be used in some contexts.

You can assign a new value to an object name that already exists. When you assign again, the previous content is replaced.

name <- "Anna" #Now name replaces the previous information and holds the string "Anna".

Why Overwriting is Useful

  • As your analysis becomes more complicated, you often build your results step-by-step.

  • Instead of creating dozens of different object names, you can reuse the same object name to store updated versions of your data or results.

  • This keeps your environment clean and your code easier to read.
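The step-by-step idea above can be sketched as follows. The vector scores and the two cleaning steps are hypothetical and only illustrate reusing one object name.

```r
# Hypothetical example: clean a vector in several steps,
# reusing the name `scores` instead of creating scores2, scores3, ...
scores <- c(10, 25, NA, 40)

scores <- scores[!is.na(scores)]  # step 1: drop the missing value
scores <- scores / 5              # step 2: rescale

scores
```

Keep in mind that the earlier versions are gone once they are overwritten. If a step goes wrong, rerun the code from the first assignment.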

Pipes

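The pipe operator |> is used to chain steps together in a readable way. Instead of nesting functions, you move step by step, like a recipe: the result on the left of the pipe becomes the first argument of the function on the right.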

mynumbers <- c(1, 2, 3) #storing the numbers 1, 2 and 3 in the object called "mynumbers"
mynumbers |> mean() #take the object mynumbers and pipe it into the mean function
## [1] 2

Note: In R, c() stands for “combine” or “concatenate”. The role of c() is to combine the values inside it into a vector — a basic data structure in R. It takes the individual numbers 1, 2, and 3 and creates a single vector: 1 2 3. You can think of c() as “gluing” elements together into one group. You will learn more about data types and structures tomorrow.
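To make the difference between nesting and piping concrete, the sketch below runs the same three steps both ways. The extra steps (sqrt() and round()) are added here for illustration only, and the native pipe |> needs R 4.1.0 or later.

```r
mynumbers <- c(1, 2, 3)

# Nested: read from the inside out
nested <- round(sqrt(mean(mynumbers)), 2)

# Piped: read from left to right
piped <- mynumbers |>
  mean() |>
  sqrt() |>
  round(2)

identical(nested, piped)  # both approaches give the same result
```

Both versions compute the mean, take its square root, and round to two decimals. The piped version simply lists the steps in the order they happen.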

Packages and Libraries

Collections of R functions are stored in packages. To use a function that does not come with R itself, we first need to install the package that contains it.

Tidyverse

When we talk about “Base R” we refer to the original functions and syntax included with R, with no extra packages needed. Base R contains many functions like read.csv(), mean(), subset(), and plot(). It can be very powerful and flexible, but it is sometimes less intuitive for beginners.

We will install Tidyverse, which is a collection of packages designed to make data analysis easier and more consistent.

Think of the tidyverse as a toolbox that gives you simple and readable functions for the most common steps in working with data:

  • Importing data (e.g., readr, readxl)
  • Cleaning and transforming data (e.g., dplyr, tidyr)
  • Visualizing data (e.g., ggplot2)
  • Working with strings (e.g., stringr) or dates (e.g., lubridate)

All these packages follow the same logic and syntax, so once you learn one, the others feel familiar too. For additional information, visit the tidyverse info page

We will be using packages from Tidyverse later today and tomorrow…

Install Package

To install a package we use the function install.packages().

#install.packages("tidyverse")

Load Libraries

Installed packages are stored in a library on your computer. Once a package is installed, we need to load it with the function library().

library(tidyverse)
## Warning: package 'purrr' was built under R version 4.3.3
## Warning: package 'lubridate' was built under R version 4.3.3

Note that the package name needs to be in quotation marks when installing the package, but not when loading the library.

Packages only need to be installed once.

Libraries need to be loaded in each work session.

Remember the Tidyverse Data Science Workflow? Today we will be focusing on the first two steps:

[Figure: Tidyverse data science workflow diagram]

Read Data

Read a csv file

To import a csv file we can use the read_csv() function and assign its result to a new object we will call js_data. Storing the data in an object lets us use it in other functions later on.

js_data <- read_csv("data/timeuse_day1_na.csv")

Read Other Formats

In our example, the data is stored in a csv file. The readr package from Tidyverse also provides functions for other text formats: read_tsv() (tab-separated values), read_delim() (files with any delimiter, including CSV and TSV), read_table() (whitespace-separated files), and read_log() (web log files).

There are other functions and packages that allow us to import different file types.

File type | Function | Package
.csv | read_csv() | readr
.xlsx | read.xlsx() | xlsx
.sav | read_sav() | haven
.sas7bdat, .sas7bcat | read_sas() | haven
.dta | read_dta() | haven
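As a small sketch of reading a format other than csv, the code below writes a tiny semicolon-separated file to a temporary location and reads it back with read_delim(). The file contents and the object names tmp and small are hypothetical; only the readr functions themselves come from the package.

```r
library(readr)

# Write a small semicolon-separated file to a temporary location
# (made-up data, only for illustration)
tmp <- tempfile(fileext = ".txt")
writeLines(c("id;hours", "1;7.5", "2;8"), tmp)

# read_delim() needs to be told which character separates the columns
small <- read_delim(tmp, delim = ";", show_col_types = FALSE)
small
```

The other functions in the table follow the same pattern: pass the file path as the first argument and assign the result to an object.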

Listing Column Names

To ask for a list of all the column names in our dataset we can use the names() function.

names(js_data)
##  [1] "PUMFID"   "AGEGR10"  "SEX"      "MARSTAT"  "PRV"      "LUC_RST" 
##  [7] "EHG_ALL"  "GTU_110"  "GTU_130"  "DUR01"    "DUR05"    "DUR06"   
## [13] "DURS200"  "DURL313"  "DUR08"    "DUR13"    "DUR14"    "DUR15"   
## [19] "MRW_20"   "MRW_30"   "MRW_40"   "MRW_D40A" "MRW_D40B" "EDM_02"  
## [25] "TST_01"   "TCS_110"  "TCS_120"  "TCS_150"  "TCS_200"

Notice that the column names from the original dataset don’t provide a clear description of what the variable is. We will change the column names later to facilitate working with our data in the future.

Head Function

The head() function displays the first rows of the dataset (six by default). The output also shows the data type that was assigned to each column when the file was read. You will learn more about data types tomorrow.

head(js_data)
PUMFID AGEGR10 SEX MARSTAT PRV LUC_RST EHG_ALL GTU_110 GTU_130 DUR01 DUR05 DUR06 DURS200 DURL313 DUR08 DUR13 DUR14 DUR15 MRW_20 MRW_30 MRW_40 MRW_D40A MRW_D40B EDM_02 TST_01 TCS_110 TCS_120 TCS_150 TCS_200
10000 5 1 5 46 1 3 1 1 510 60 120 770 90 0 0 0 0 NA 1 1 1 2 NA 8 2 2 2 2
10001 5 1 1 59 1 4 3 4 420 150 0 0 0 0 0 0 0 NA 2 1 1 2 NA 1 2 2 2 2
10002 4 2 1 47 1 5 1 6 570 0 0 630 30 480 0 0 0 NA NA NA 1 1 NA 7 2 1 1 1
10003 6 2 5 35 1 4 2 4 510 10 45 875 80 20 0 0 0 NA NA NA 1 1 NA 1 2 2 2 2
10004 2 1 6 35 1 NA 1 3 525 90 40 815 0 0 0 0 0 NA NA NA 2 2 NA 1 2 2 2 2
10005 1 1 6 35 1 1 1 6 435 0 0 430 40 530 0 0 0 NA NA NA 1 1 NA 2 2 1 1 2
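If six rows are too many or too few, head() accepts an argument n. The sketch below uses mtcars, a small dataset that ships with R, as a stand-in so it runs without the workshop file. The object names first_three and last_three are arbitrary.

```r
# mtcars is a small example dataset included with R
first_three <- head(mtcars, n = 3)  # first 3 rows instead of the default 6
last_three  <- tail(mtcars, n = 3)  # tail() shows the last rows instead

first_three
dim(mtcars)  # number of rows and number of columns
```

The same calls should work on js_data, for example head(js_data, n = 10).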

Viewing Data

To look at the full dataset in a spreadsheet-style viewer we use the View() function (note the capital V). This will open our dataset in a separate window.

View(js_data)

Change Column Names

We mentioned earlier that we wanted to work with column names that were more descriptive of the content of each variable. To change column names we can use the function rename().

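The function rename() is part of dplyr, one of the packages that was installed with tidyverse. Inside rename(), the new name goes on the left of the = sign and the current name on the right.

Type the following code to change the column name from “PUMFID” to “id”.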

js_data <- js_data |>
  rename ("id" = "PUMFID")

Did it work?
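One way to check is to ask R whether the new name is present and the old one is gone. The sketch below does this on a hypothetical two-column stand-in called toy, so it can be run on its own. The same %in% test should also work with names(js_data).

```r
library(dplyr)

# Hypothetical two-column stand-in for js_data
toy <- data.frame(PUMFID = c(10000, 10001), SEX = c(1, 2))

toy <- toy |>
  rename("id" = "PUMFID")  # new name on the left, old name on the right

"id" %in% names(toy)      # TRUE: the new name exists
"PUMFID" %in% names(toy)  # FALSE: the old name is gone
```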

Your Turn!

Now, to change the rest of the column names, copy the following code.

js_data <- js_data |>
  rename ("ageGrp" = "AGEGR10",
          "sex" = "SEX",
          "maritalStat" = "MARSTAT",
          "province" =  "PRV",
          "popCenter" = "LUC_RST",
          "eduLevel" = "EHG_ALL",
          "feelRushed" = "GTU_110",
          "extraTime" = "GTU_130",
          "durSleep" = "DUR01",
          "durMealPrep" = "DUR05",
          "durEating" = "DUR06",
          "durAlone" = "DURS200",
          "durDriving" = "DURL313",
          "durWork" = "DUR08",
          "durShoolSite" = "DUR13",
          "durSchoolOnline" = "DUR14",
          "durStudy" = "DUR15",
          "mainStudy" = "MRW_20",
          "mainJobHunting" = "MRW_30",
          "mainWork" = "MRW_40",
          "worked12m" = "MRW_D40A",
          "workedWeek" = "MRW_D40B",
          "enrollStat" = "EDM_02",
          "dailyTexts" = "TST_01",
          "timeSlowDown" = "TCS_110",
          "timeWorkaholic" = "TCS_120",
          "timeNotFamFriends" = "TCS_150",
          "timeWantAlone" = "TCS_200")

Use the function names() to display the column names.

names(js_data)
##  [1] "id"                "ageGrp"            "sex"              
##  [4] "maritalStat"       "province"          "popCenter"        
##  [7] "eduLevel"          "feelRushed"        "extraTime"        
## [10] "durSleep"          "durMealPrep"       "durEating"        
## [13] "durAlone"          "durDriving"        "durWork"          
## [16] "durShoolSite"      "durSchoolOnline"   "durStudy"         
## [19] "mainStudy"         "mainJobHunting"    "mainWork"         
## [22] "worked12m"         "workedWeek"        "enrollStat"       
## [25] "dailyTexts"        "timeSlowDown"      "timeWorkaholic"   
## [28] "timeNotFamFriends" "timeWantAlone"

Save your work

Saving in R format (RData) will preserve data types and metadata assigned to the dataset. The text format (csv) does not keep that information, but almost any software can open it, which makes it the ideal format to share the data.

save(js_data, file="data/timeuse_day2.RData")
write_csv(js_data, file="data/timeuse_day2.csv")
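To illustrate the difference between the two formats, here is a minimal sketch that saves a made-up data frame both ways and reads it back. The object demo, its columns, and the temporary file names are hypothetical. Base R's write.csv() and read.csv() are used here instead of the readr functions so the example needs no extra packages, and the last result assumes R 4.0 or later.

```r
# Hypothetical data frame with a factor (categorical) column
demo <- data.frame(id = 1:3, group = factor(c("a", "b", "a")))

rdata_file <- tempfile(fileext = ".RData")
csv_file   <- tempfile(fileext = ".csv")

save(demo, file = rdata_file)                 # R format
write.csv(demo, csv_file, row.names = FALSE)  # plain text format

rm(demo)           # remove the object from the environment
load(rdata_file)   # restores an object named `demo`
class(demo$group)  # "factor": the data type was preserved

from_csv <- read.csv(csv_file)
class(from_csv$group)  # "character": the csv only stored the text
```

Note that load() brings objects back under the names they had when saved, so no assignment is needed.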

Upload to OSF

At the end of each work session, remember to save your data as .RData and .csv, and also your RMarkdown file (.Rmd). We will upload those files to OSF.

