Saving outputs is a crucial part of responsible research data
management and directly supports the FAIR principles: making data
Findable, Accessible, Interoperable, and Reusable.
FAIR
Saving outputs for reproducible research is all part of supporting
the FAIR principles, making sure your data is findable, accessible,
interoperable, and re-usable.
In this section, we will learn how to save cleaned datasets, export
plots, update metadata, and document our data processing in a
transparent and accessible way. These practices all support
reproducibility and align with the FAIR principles.
library(dplyr) # for data manipulation
library(readr) # for reading and writing data
Saving Datasets
For this section, you can continue from the previous section or load
a backup of the dataset from the resources
folder:
Rdata_path <- "data/timeuse_day3_2.Rdata"
load(Rdata_path)
The js_data object contains the dataset we have used, transformed, and
modified up to now. In this session, we will assume this is our final
dataset to deposit into a repository such as Borealis.
Saving in .csv or .tsv makes sure our files are open and usable across
platforms, and easily imported into other programming languages,
databases, or spreadsheet software.
Computational Reproducibility
Using save() to create an .RData file is great for putting together a
‘reproducible package’ because it stores information related to data
types and data structures. For example, it retains which variables have
been converted to factors or dates. However, it is a file type unique to
R; considering that the data may need to be accessed by non-R users, the
data should also be stored in a more interoperable and accessible format.
In our dataset (js_data), variables like isFeelRushed (a binary factor:
1 = Yes, 0 = No) or other categorical variables (e.g., popCenter or
feelRushed) are stored as numeric codes in a plain text file, so
read_csv() would read them back as numbers rather than factors. Saving
as .RData preserves these R-specific data types (e.g., factors, dates,
or custom attributes), ensuring consistency when the data is reloaded in
R across different sections of our analysis.
As we have seen, we can save js_data in .RData format by using the
save() function:
save(js_data, file = "data/timeuse_day4_3_20250515.RData")
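As a rough illustration of the difference, the sketch below round-trips a small made-up data frame (`toy`, a hypothetical stand-in for js_data) through both formats. It uses base R's write.csv() and read.csv() with temporary files so that it runs without extra packages; read_csv() would likewise not return a factor here.

```r
# Toy data standing in for js_data (hypothetical values)
toy <- data.frame(
  isFeelRushed = factor(c(1, 0, 1), levels = c(0, 1)),
  durSleep     = c(480, 510, 450)
)

# Temporary paths, used here only so the sketch leaves no files behind
csv_path   <- tempfile(fileext = ".csv")
rdata_path <- tempfile(fileext = ".RData")

write.csv(toy, csv_path, row.names = FALSE)
save(toy, file = rdata_path)

# Read the CSV copy back: the factor information is gone
from_csv <- read.csv(csv_path)

# Load the .RData copy into its own environment to avoid overwriting `toy`
env <- new.env()
load(rdata_path, envir = env)
from_rdata <- env$toy

is.factor(from_csv$isFeelRushed)    # FALSE: plain text cannot store the type
is.factor(from_rdata$isFeelRushed)  # TRUE: .RData keeps the R data type
```

If you share only the plain text version, the data dictionary is where the factor levels and their labels should be recorded.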
Interoperability and Reusability
Now, let’s save this dataset into a csv (comma-separated values) or tsv
(tab-separated values) file for sharing or reuse. Both are plain text,
non-proprietary formats, which promotes interoperability and
accessibility.
While csv is commonly used (and what we’ve been using), in many respects
tsv is more versatile and enhances reusability due to the following
factors:
- csv: Uses commas to separate values; it’s widely
supported but may face issues if data contains commas.
- tsv: Uses tabs as delimiters, avoiding
comma-related problems and improving readability in text editors or
spreadsheet software.
write_csv(js_data, "data/timeuse_day4_3_20250515.csv")
write_tsv(js_data, "data/timeuse_day4_3_20250515.tsv")
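To make the comma issue concrete, here is a minimal sketch with a made-up data frame in which one value contains a comma. It uses base R's write.csv() and write.table() with temporary files (assumptions chosen to keep the example dependency-free) and prints the raw lines so you can see how each format copes.

```r
# Hypothetical data: the first activity label contains a comma
toy <- data.frame(
  id       = 1:2,
  activity = c("Sleeping, resting", "Working"),
  stringsAsFactors = FALSE
)

csv_path <- tempfile(fileext = ".csv")
tsv_path <- tempfile(fileext = ".tsv")

write.csv(toy, csv_path, row.names = FALSE)
write.table(toy, tsv_path, sep = "\t", row.names = FALSE, quote = FALSE)

csv_lines <- readLines(csv_path)
tsv_lines <- readLines(tsv_path)

csv_lines[2]  # the value is wrapped in quotes to protect the comma
tsv_lines[2]  # no quoting needed, the tab is the only delimiter

# Both files read back to the same values
back_csv <- read.csv(csv_path)
back_tsv <- read.delim(tsv_path)
```

A tsv file has the mirror-image problem if a value contains a tab, but tabs are much rarer than commas in survey labels and free text.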
write_csv() and write_tsv() (from the tidyverse package readr) are fast
writers for delimited text files that produce UTF-8 output and do not
write row names. They also match the conventions of read_csv() and
read_tsv() (also from readr). In contrast, base R’s write.csv() is
slower and by default writes a row-names column.
While csv and tsv are accessible and interoperable due to being plain
text file formats with simple delimiters, neither is governed by a
single, universally followed standard. RFC 4180 documents a common csv
convention, but tools do not implement it uniformly, which is why there
is no single way to handle the presence of commas in a csv file. There
are other plain text options for storing data that are supported by
standards, such as xml and json, but for rectangular data, csv and tsv
are very simple options that enhance accessibility to a human reader.
File Compression
One of the disadvantages of plain text file formats such as csv and tsv
is that they are larger than their software-specific, or ‘binary’,
counterparts. In fact, if you look at the data files we’ve saved
throughout the week, you’ll notice that the csv versions are
significantly larger. However, depending on the kind of data one is
working with, even software-specific formats can be quite large.
Compression is especially helpful when sharing data through email or
uploading to repositories, enhancing accessibility by minimizing
bandwidth requirements. Even though in this example we’re working with a
relatively small dataset, in real-world scenarios it’s common to
encounter csv files containing millions of rows, potentially several
gigabytes in size. In such cases, compression becomes not just helpful
but essential to efficiently store, transfer, and share data files,
especially when uploading to platforms like OSF or institutional
repositories, such as Borealis.
Here, we’ll use the .gz compression format, which compresses a single
file at a time.
con <- gzfile("data/timeuse_day4_3_20250515.csv.gz", "wb")
write_csv(js_data, con)
close(con) # close the connection so the compressed file is fully written
gzfile("…", "wb")
creates an R connection that writes
directly into a gzip-compressed file, and
write_csv(js_data, con)
writes your data frame directly
into that compressed CSV file.
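To check that the compressed file really is a regular CSV underneath, the following sketch writes a made-up data frame both plainly and through a gzip connection, compares the file sizes, and reads the compressed copy back. It uses base R's write.csv() and read.csv() with temporary files; the toy data and paths are assumptions for illustration only.

```r
# Hypothetical, highly repetitive data so the size difference is visible
toy <- data.frame(
  id       = 1:5000,
  durSleep = rep(c(480, 510, 450, 525, 600), times = 1000)
)

plain_path <- tempfile(fileext = ".csv")
gz_path    <- tempfile(fileext = ".csv.gz")

# Plain CSV
write.csv(toy, plain_path, row.names = FALSE)

# Same content written through a gzip connection
con <- gzfile(gz_path, "w")
write.csv(toy, con, row.names = FALSE)
close(con)  # always close the connection so the file is complete

# Compare sizes in bytes
file.size(plain_path)
file.size(gz_path)

# read.csv() decompresses the gzip file transparently
back <- read.csv(gz_path)
```

Keeping .csv in the file name (.csv.gz) tells both people and software what is inside the archive.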
Why use .gz specifically? The .gz format is widely supported, simple,
and natively handled by R through the gzfile() function. It’s also
compatible with most data repositories and file-sharing platforms.
To bundle multiple files or folders (e.g., datasets, plots, and
dictionaries), you can use ZIP compression. A .zip file combines several
items into a single compressed archive, which helps simplify storage and
makes sharing easier among collaborators or repositories.
Saving Plots
Exporting plots in multiple formats ensures your visualizations are
reusable and accessible for different
purposes: publication, presentation, or web display.
Let’s first regenerate the scatterplot from the previous section, which
shows the association between sleep and work duration.
library(ggplot2)
p <- ggplot(js_data, aes(durWork, durSleep)) +
  geom_point(color = "#2D5E7F", alpha = .2) +
  geom_smooth(method = "lm", color = "black") +
  xlab("Minutes Spent Working") +
  ylab("Minutes Spent Sleeping") +
  scale_x_continuous(breaks = seq(0, 1500, 250)) +
  labs(title = "Association Between Working and Sleeping") +
  theme(text = element_text(size = 18))
p

PNG
PNG format is suitable for web use or slide
presentations. PNG is a raster format that offers high
quality and is widely supported across platforms.
ggsave(
  filename = "outputs/association_sleep_working_day4.png",
  plot = p,
  width = 8,
  height = 6,
  dpi = 300
)
ggsave() is a ggplot2 function that saves your last (or specified) plot
to a file, automatically picking the correct format from the filename
extension and letting you set size and resolution.
PDF
If you want your plots to be resizable without losing quality, you
can use PDF files, as they preserve vector graphics.
This is especially useful for embedding in academic papers or generating
print-friendly outputs.
ggsave(
  filename = "outputs/association_sleep_working_day4.pdf",
  plot = p,
  width = 8,
  height = 6
)
TIFF
You can also save images in TIFF format, as some academic journals
request or prefer it. TIFF is a raster format that supports lossless
compression, so when saved at a high resolution (here, 600 dpi) it gives
sharp, detailed visuals suitable for professional publication
requirements.
ggsave(
  filename = "outputs/association_sleep_working_day4.tiff",
  plot = p,
  width = 8,
  height = 6,
  dpi = 600,
  device = "tiff"
)
Saving Documentation
Data Dictionaries
To enhance Findability and
Reusability, metadata must be kept up to date. Metadata
provides context about the data – what each variable means, what values
are allowed and how they are determined, and how data were collected.
This makes it easier for others (including your future self!) to
interpret and reuse your data.
One practical way to create and maintain metadata is to use a
data dictionary. A data dictionary is, ideally, a
structured summary that describes each variable in a dataset, including
its name, type, and label or definition.
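A first draft of such a dictionary can be generated from the data itself. The sketch below is one possible approach in base R, using a made-up data frame (`toy`) and a hand-written `labels` vector; both are assumptions for illustration, and the descriptions would normally come from the survey documentation.

```r
# Toy data standing in for js_data (hypothetical values)
toy <- data.frame(
  durSleep   = c(480, 510, NA, 450),
  feelRushed = factor(c(1, 3, 6, NA), levels = 1:6)
)

# Human-written descriptions, keyed by variable name
labels <- c(
  durSleep   = "Duration - Sleeping, resting, relaxing, sick in bed",
  feelRushed = "General time use - Feel rushed"
)

# One row per variable: name, type, description, and simple counts
dictionary <- data.frame(
  variable         = names(toy),
  type             = vapply(toy, function(col) class(col)[1], character(1)),
  description      = unname(labels[names(toy)]),
  missing          = vapply(toy, function(col) sum(is.na(col)), integer(1)),
  unique_responses = vapply(toy, function(col) length(unique(na.omit(col))), integer(1)),
  row.names        = NULL,
  stringsAsFactors = FALSE
)

dictionary
```

Generating the counts programmatically keeps the dictionary in sync with the data, while the descriptions still need a human author.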
Human-Readable vs. Machine-Readable
Data dictionaries can be designed to be
human-readable or machine-readable,
each serving distinct purposes with specific advantages and limitations.
Choosing the appropriate type depends on your audience, whether it’s
researchers reading documentation or systems processing metadata
automatically.
Human-Readable Data Dictionary
- Description: A human-readable data dictionary is
formatted for easy comprehension by people, and is typically presented
as a table in a spreadsheet (e.g., Excel) or document (e.g., PDF, Word).
It includes plain-language descriptions in natural language (e.g.,
English) and is designed to be intuitive for researchers, students, or
non-technical collaborators. For example, a PDF document containing a
table of variable descriptions is human-readable but often not
machine-readable, as it lacks structured data that represents the
relationships present in the dataset.
- Example (PDF format):

Machine-Readable Data Dictionary
Description: A machine-readable data dictionary
uses structured formats like JSON, XML, YAML, or RDF, designed for
software to parse and process automatically without human involvement.
These formats align with metadata standards and are ideal for
integration with data repositories or automated systems, enabling
automatic data feeds and processing.
Example (JSON format):
[
  {
    "variable": "durSleep",
    "type": "integer",
    "description": "Duration - Sleeping, resting, relaxing, sick in bed",
    "unique_responses": 274,
    "missing": 0,
    "count": 17390.0,
    "mean": 522.3948246118459,
    "std": 133.06481348435142,
    "min": 0.0,
    "percentile_25": 450.0,
    "median": 510.0,
    "percentile_75": 585.0,
    "max": 1440.0
  },
  {
    "variable": "feelRushed",
    "type": "factor",
    "description": "General time use - Feel rushed",
    "unique_responses": 6,
    "missing": 62,
    "levels": [
      {"value": 1.0, "label": "Every day"},
      {"value": 2.0, "label": "A few times a week"},
      {"value": 3.0, "label": "About once a week"},
      {"value": 4.0, "label": "About once a month"},
      {"value": 5.0, "label": "Less than once a month"},
      {"value": 6.0, "label": "Never"}
    ]
  }
]
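A dictionary held in a data frame can be exported to JSON with a few lines of R. The sketch below assumes the jsonlite package is installed (it is not loaded elsewhere in this lesson) and uses a small hand-made `dictionary` data frame and a temporary file as stand-ins for the real thing.

```r
library(jsonlite)  # assumption: install.packages("jsonlite") has been run

# Minimal hand-made dictionary (illustrative subset of fields)
dictionary <- data.frame(
  variable    = c("durSleep", "feelRushed"),
  type        = c("integer", "factor"),
  description = c(
    "Duration - Sleeping, resting, relaxing, sick in bed",
    "General time use - Feel rushed"
  ),
  stringsAsFactors = FALSE
)

json_path <- tempfile(fileext = ".json")

# Each row of the data frame becomes one JSON object
write_json(dictionary, json_path, pretty = TRUE)

# Reading the file back returns a data frame again
back <- fromJSON(json_path)
back$variable
```

Because the file can be parsed back into a data frame, other tools (and other languages) can consume the same dictionary without retyping it.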
Summary Table
| | Human-Readable | Machine-Readable |
|---|---|---|
| Pros | Accessibility: Easy to read, supports reusability for non-technical users. Ease of Creation: Manual or CSV export, minimal tools needed. Broad Usability: Suits workshops/reports, enhances accessibility. | Interoperability: JSON/XML enable automation, repository integration. Efficiency: Programmatic updates reduce errors for large datasets. Standardization: Metadata standards enhance findability. |
| Cons | Limited Automation: Unstructured, hard to parse, hinders interoperability. Manual Maintenance: Manual updates risk errors in large datasets. Scalability Issues: Inefficient for many variables. Non-FAIR Compliance: Lacks algorithmic processing, incompatible with FAIR. | Technical Barrier: Needs scripting, challenging for users. Reduced Accessibility: Complex without tools. Format Dependency: JSON/XML may lack universal support. Learning Curve: Structured formats hard for new users. |
The human-readable PDF dictionary provided in the workshop is ideal
for learning and quick reference. For long-term storage or submission to
repositories like OSF or Borealis, you might want to convert it to a
machine-readable format to maximize Interoperability and
Findability.
Documenting Changes
Documentation is one of the most important pieces of RDM, and a key
element of documentation is recording the decisions you make as you work
through your data, whether in the cleaning or the analysis stage. To
this end, you should use your RMarkdown file as a living logbook. Write
clear descriptions of how your data has been transformed, and which
variables have been changed, removed, or created. This documentation
improves transparency and promotes reusability.
Example explanation:
“We filtered the dataset to include only respondents who feel rushed
at least once a week (feelRushed <= 3), and retained only time-use
columns for analysis.”
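The step described in that example explanation could look roughly like the sketch below. It runs on a made-up data frame (`toy`) and assumes that time-use columns are the ones whose names start with "dur", as durWork and durSleep do; it uses base R so that it runs without extra packages, although dplyr's filter() and select() would do the same job.

```r
# Toy data standing in for js_data (hypothetical values)
toy <- data.frame(
  feelRushed = c(1, 3, 5, NA, 2),
  popCenter  = c(1, 2, 1, 2, 1),
  durWork    = c(480, 0, 510, 300, 450),
  durSleep   = c(420, 540, 480, 500, 450)
)

n_before <- nrow(toy)

# which() drops rows where feelRushed is missing instead of keeping NA rows
keep_rows <- which(toy$feelRushed <= 3)

# Assumption: time-use columns share the "dur" prefix
keep_cols <- startsWith(names(toy), "dur")

rushed_weekly <- toy[keep_rows, keep_cols]
n_after <- nrow(rushed_weekly)

# A one-line log entry that can sit next to the prose explanation
message(
  "Kept ", n_after, " of ", n_before, " rows; columns: ",
  paste(names(rushed_weekly), collapse = ", ")
)
```

Noting how missing values were handled, as the comment on which() does here, is exactly the kind of decision worth writing down.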
Remember that we can create the rushed and not_rushed data frames by
filtering isFeelRushed == 1 for rushed participants and
isFeelRushed == 0 for those who are not rushed.
rushed <- js_data |>
filter(isFeelRushed == 1)
not_rushed <- js_data |>
filter(isFeelRushed == 0)
You can also track changes quantitatively:
summary_changes <- data.frame(
Step = c("Original rows", "Filtered for isFeelRushed == 1", "Final rows"),
Count = c(nrow(js_data), sum(js_data$isFeelRushed == 1, na.rm = TRUE), nrow(rushed))
)
summary_changes
## Step Count
## 1 Original rows 17390
## 2 Filtered for isFeelRushed == 1 12689
## 3 Final rows 12689
Such logs help other researchers understand your workflow and
replicate or build upon your analysis.
Saving Documents to PDF
Once you’ve saved individual datasets and plots, you’ll often want to
bundle your entire analysis – code, text, figures, and tables –
into a single file. You can export your RMarkdown document as a
PDF for a polished, human-readable output or as
HTML for a dynamic, web-friendly format. Each format
has distinct advantages and limitations, depending on your needs.
| | PDF | HTML |
|---|---|---|
| Pros | Human-readable. Universally accessible. Consistent formatting. Ideal for formal distribution (e.g., publications, reports). | Machine-readable. Supports interactivity (e.g., collapsible code). Web-friendly. Smaller file sizes. |
| Cons | Not machine-readable. Lacks interactivity. Large file sizes. Limited for web-based sharing. | Requires web browser or specific software. Less consistent formatting across devices. May need technical setup for sharing. |
In this section, we can practice exporting our RMarkdown document as
either PDF or HTML.
Update your YAML
At the very top of your .Rmd file, ensure you have a YAML header
supporting both PDF and HTML outputs:
---
title: "RDM Jumpstart Workshop"
author: "Your Name"
date: "2025-05-16"
output:
  pdf_document:
    toc: true # optional: include table of contents
    number_sections: true # optional: number your sections
  html_document:
    toc: true # optional: include table of contents
    code_folding: show # optional: toggle code visibility
    code_download: true # optional: allow downloading .Rmd
---
This tells RMarkdown to produce either a PDF via LaTeX or an HTML
file, depending on your knitting choice.
Install TinyTeX
R Markdown relies on LaTeX under the hood to produce PDF output. We
recommend TinyTeX as a simple solution. TinyTeX is lightweight and will
automatically pull in any missing LaTeX packages as you knit.
install.packages("tinytex")
tinytex::install_tinytex()
Knit
In RStudio:
- PDF: Click the arrow next to Knit and choose Knit to PDF
(or simply hit Knit if PDF is your default).
- HTML: Choose Knit to HTML for a web-friendly output with
interactive features.
Summary
| Practice | Recommendation |
|---|---|
| Compression | Use .csv.gz or .tsv.gz for large files to reduce size and improve access speed. |
| Interoperability | Prefer .csv or .tsv; avoid locked-in formats to maximize cross-platform use. |
| Naming Conventions | Use descriptive names with dates/versions: timeuse_day4_3_20250515.csv. |
| Transparency | Log all changes in RMarkdown; describe decisions in plain language. |
| Metadata | Maintain and save an updated data dictionary in human-readable (e.g., PDF) or machine-readable (e.g., JSON) formats with rich, structured metadata, enhancing findability and interoperability. |
| Plot Formats | Use appropriate formats (.png, .pdf, .tiff) based on audience and/or platform. |
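The naming convention above is easy to automate. The helper below is a hypothetical function (`dated_filename()` is not part of any package) that appends a date stamp in YYYYMMDD form to a file stem; it is a small sketch rather than a prescribed workflow.

```r
# Hypothetical helper: build "<stem>_<YYYYMMDD>.<ext>"
dated_filename <- function(stem, ext, date = Sys.Date()) {
  paste0(stem, "_", format(date, "%Y%m%d"), ".", ext)
}

# With a fixed date, this reproduces the file name used in this lesson
example_name <- dated_filename("timeuse_day4_3", "csv", as.Date("2025-05-15"))
example_name
# "timeuse_day4_3_20250515.csv"

# With the default, today's date is used
todays_name <- dated_filename("timeuse_day4_3", "tsv")
```

Putting the date in year-month-day order means files sort chronologically when listed by name.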
Your Turn
Scenario 1: You’re preparing to share your fully
cleaned dataset so that collaborators using Python, SPSS, or Excel can
easily load it. Would you choose .csv or .Rdata to maximize
interoperability and why?
Scenario 2: You’ve created a human-readable data
dictionary in PDF format, describing newly created variables like
isFeelRushed. A collaborator needs a machine-readable version, such as
JSON, to share it with a data repository. Why would you choose JSON over
PDF for this purpose, and how does this support findability and
interoperability?
Scenario 3: You’ve just created a
publication-quality ggplot showing the relationship between work time
and sleep. For web sharing and for submission to an academic journal,
which file formats would you export (PNG, PDF, TIFF) and why?
Scenario 4: You need to share your RMarkdown
analysis with both a journal (requiring a formal report) and an online
community (preferring interactive content). Which output formats (PDF or
HTML) would you choose for each, and how do these choices support
accessibility and reusability?
Challenge 1: Select one plot you created in the
visualization section and save it using ggsave(). Choose an
appropriate file format (e.g., PNG for web, PDF for papers, TIFF for
high-res), create a descriptive filename that includes the date and plot
type, and write the exact ggsave() command you would use.
Experiment with parameters like resolution (dpi) or dimensions (width,
height) to optimize the output.
Challenge 2: Revisit your RMarkdown now and improve
the documentation of one of your data-cleaning or data-transformation
steps. Imagine you (or a colleague who is not familiar with the
project) come back to this 6 months from now: will they understand
exactly what you did and why?
Wrap-Up
Following these practices ensures your research outputs are
understandable, reusable, and verifiable. Saving data, metadata, and
visuals properly helps you, your collaborators, and future researchers
reproduce and extend your work. More importantly, it aligns your
workflow with the FAIR principles, making your research more open,
ethical, and impactful.
Remember: every saved dataset, every labeled plot, and every comment
in your RMarkdown is a contribution toward a more FAIR research
ecosystem.
