Handling and organizing research data

A primer for researchers

FRDR curation team

Digital Research Alliance of Canada

Principles for handling research data

Agenda

  1. Principles for handling research data

  2. Handling data tables

  3. Handling images

  4. Organizing (and sharing) data

  5. Write a README file

  6. Checklist for reproducible research

Make datasets understandable

Research data comes in many flavors and shapes (tables, images, videos, text).

In all cases, it is essential that the dataset has a clear structure and is understandable by others.

Tip

Try to put yourself in the shoes of an outside observer when structuring the data.

A black-and-white cartoon illustration of a frustrated person holding their head, surrounded by question marks. The person is sitting at a table covered with large, disorganized spreadsheets filled with numbers and text, symbolizing the challenge of managing and understanding research data.

Others generally do not understand research data

1. Use naming conventions

Use consistent naming conventions that accurately describe each file's content and make the relationships between files explicit (see the examples and the sketch below):

  • A1.tif -> Exp_MouseID_Day_Condition_Marker.tif
  • CellsTable.xls -> Widefield_5x_Cortex_NeuN_Counts.csv
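Descriptive names can also be generated programmatically so they stay consistent across files. A minimal R sketch (the metadata fields and values below are hypothetical examples):

```r
# Minimal sketch: build descriptive, consistent file names from experiment
# metadata (the fields and values below are hypothetical examples).
experiment <- "Widefield"
mouse_id   <- "M12"
day        <- "Day07"
condition  <- "MCAO"
marker     <- "NeuN"

file_name <- paste0(paste(experiment, mouse_id, day, condition, marker, sep = "_"), ".tif")
file_name
#> "Widefield_M12_Day07_MCAO_NeuN.tif"
```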

2. Prioritize open file formats

Use proper, open/accessible file formats to improve accessibility:

  • .tif for images (preserves metadata).
  • .csv for tables (non-proprietary).
  • .png or .svg for graphs (preserves quality).
  • .txt or .pdf for documentation (non-proprietary).

3. Provide comprehensive metadata

Use comprehensive metadata (README files and data dictionaries/codebooks) to contextualize and describe research files.

A table displaying a codebook for a dataset, with columns labeled 'Variable Name,' 'Description,' 'Type,' and 'Values or Characteristics.' The table defines variables such as patient ID, gender, procedure date, treatment group, and clinical outcomes, specifying data types (numeric, date, character) and value meanings (e.g., 1=Female, 2=Male). This codebook provides a structured overview of dataset variables for research data management.

Codebook example (https://domstat.med.ucla.edu/)

4. Implement reproducible workflows

Implement reproducible workflows using code (R, Python) to transform raw data into analysis-ready data.
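A minimal sketch of such a workflow in R (the file and column names are hypothetical); the point is that every step from raw data to analysis data is recorded as code:

```r
# Minimal sketch of a scripted raw-to-analysis step (hypothetical file and
# column names): the raw file is read, cleaned, and saved as a new file,
# so the raw data are never modified by hand.
library(dplyr)
library(readr)

raw <- read_csv("Data_Raw/cell_counts_raw.csv")

clean <- raw |>
  filter(!is.na(Counts)) |>                  # drop incomplete observations
  mutate(Condition = tolower(Condition))     # harmonize categorical values

write_csv(clean, "Data_Processed/cell_counts_clean.csv")
```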

Tip

These practices ensure organized, clean, and validated datasets.

Handling data tables

Agenda

  1. Principles for handling research data

  2. Handling data tables

  3. Handling images

  4. Organizing (and sharing) data

  5. Write a README file

  6. Checklist for reproducible research

Tables are the core of research data

Although spreadsheets (.xls) are the most common format for recording and storing data, tables are often among the most poorly organized and least reusable objects in research.

Example of bad data formatting, showcasing a spreadsheet with combined cells and different variables in the same column.

from https://dansteer.wordpress.com/

Example of bad data formatting, showcasing a spreadsheet with combined cells and different variables in the same column. We can also observe a combination of figures and numeric data in the same sheet.

Courtesy of researcher

Examples from published research

Example of bad data formatting, showcasing a spreadsheet with combined cells and different variables in the same column. We can also observe a combination of figures and numeric data in the same sheet.

Zhao et al. (2024). Nature Comm. DOI: 10.1038/s41467-024-50836-6

Example of bad data formatting, showcasing a spreadsheet with combined cells and different variables in the same column. We can also see color codes used to convey information, a practice that should be avoided in data spreadsheets.

Balinda et al. (2024). Nature Comm. DOI: 10.1038/s41467-024-50558-9

Examples from Crystal Lewis (2024)

Two tables comparing data structures: The left table, labeled 'Not a rectangle,' has an irregular structure with inconsistent alignment of variable names and values, making it difficult to interpret. The right table, labeled 'Rectangle,' follows a structured format with clearly defined columns for student ID, age in months, raw reading score, and standardized reading score. This contrast highlights the importance of tidy, well-structured data in research data management.

Lewis (2024). DOI: 10.1201/9781032622835-3

Two tables comparing inconsistent and consistent data formatting. The left table, labeled 'Inconsistent column values,' contains mixed date formats (e.g., '10-12-2023,' 'Oct. 15, 2023,' 'September 15') and inconsistent categorical values for survey completion ('y,' 'Yes,' 'Y,' 'no'). The right table, labeled 'Consistent column values,' standardizes dates to 'YYYY-MM-DD' format and unifies categorical responses to 'y' and 'n.' This highlights best practices in data management for improving data clarity and usability.

Lewis (2024). DOI: 10.1201/9781032622835-3

Two tables comparing character and numeric variable formatting. The left table, labeled 'Character variable,' contains inconsistent age values: '24' has an extra space, '49 years old' includes unnecessary text, and '36..0' has a formatting error, causing them to be stored as text instead of numbers. The right table, labeled 'Numeric variable,' correctly stores ages as numerical values without extra spaces, text, or formatting issues. This demonstrates the importance of maintaining clean numeric data for proper analysis.

Lewis (2024). DOI: 10.1201/9781032622835-3

Two tables comparing improper and proper data structuring. The left table, labeled 'Two things in one variable,' combines incident counts and enrollment numbers into a single column using a 'incident_rate' format (e.g., '55/250'). The right table, labeled 'Two things in two variables,' properly separates these values into distinct columns: 'incident' for the number of incidents and 'enrollment' for the total population. This demonstrates best practices in data management by ensuring each variable represents only one piece of information.

Lewis (2024). DOI: 10.1201/9781032622835-3

Examples from Crystal Lewis (2024)

Two tables comparing implicit and explicit data entry. The left table, labeled 'Not explicit values,' omits repeated school IDs and years, assuming they apply to multiple rows, which can cause confusion in data processing. The right table, labeled 'Explicit values,' explicitly repeats the school ID and year for each row, ensuring clarity and making the dataset more machine-readable. This highlights best practices in research data management for maintaining completeness and reducing ambiguity.

Lewis (2024). DOI: 10.1201/9781032622835-3

Two tables comparing implicit and explicit variable representation. The left table, labeled 'Not explicit variables,' uses cell color to indicate treatment conditions without an explicit variable, which can be misinterpreted or lost in data processing. The right table, labeled 'Explicit variables,' adds a 'treatment' column with numerical values (0 or 1) to explicitly indicate the treatment condition for each student. This demonstrates best practices in research data management by ensuring that all meaningful information is stored as explicit variables rather than relying on formatting or visual cues.

Lewis (2024). DOI: 10.1201/9781032622835-3

Building accessible data tables

A well-structured dataset in table format, displaying experimental data for different mice. The table includes columns for 'MouseID,' 'DPI' (days post-injury), 'Condition' (MCAO), 'Region' (Contra, Ipsi, Peri), and cell counts for NeuN, Ki67, and BrdU markers. This table demonstrates a clean and organized data structure

A typical data table organizes information in rows and columns (a small sketch follows below).

Columns

  • Identifier variables: animal ID, time point, condition (factors or characters).
  • Analysis variables: score, area, number of cells, etc. (numerical or categorical).
  • Variables created during processing (proportions, ratios, etc.).

Rows

  • Variable values: entries for each column (variable). Each row corresponds to a unique observation.
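As a small illustration of this structure (the values are made up), the table described above can be written as a tidy data frame in R:

```r
# Minimal sketch of a tidy table (illustrative values): identifier variables
# (MouseID, DPI, Condition, Region) plus an analysis variable (NeuN counts).
library(tibble)

cells <- tribble(
  ~MouseID, ~DPI, ~Condition, ~Region,   ~NeuN,
  "M01",       7, "MCAO",     "Contra",    154,
  "M01",       7, "MCAO",     "Ipsi",       83,
  "M02",       7, "MCAO",     "Contra",    161
)
```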

Wide table formats

Diagram illustrating the transformation of longitudinal data from separate tables into a wide format. The top two tables represent 'Wave 1 data' and 'Wave 2 data,' each containing anxiety measures ('anx1' and 'anx2') for students identified by 'stu_id.' The bottom table, labeled 'Wide format data,' consolidates both waves into a single dataset by renaming variables with wave-specific prefixes ('w1_anx1,' 'w1_anx2,' 'w2_anx1,' 'w2_anx2'). This transformation makes it easier to analyze individual-level changes over time.

A typical wide-format data table, from Lewis (2024). DOI: 10.1201/9781032622835-3

In a wide-format table, each subject occupies a single row and each variable is a separate column: Subject, Id1, Id2, Var1, Var2, Time1, Time2, Time3.

Tip

Here, columns are responses or predictors in a regression. Example:

Cells_7D ~ Cells_2D + Cells_3D.
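In R, this maps directly onto a model formula. A minimal sketch, assuming a hypothetical wide-format data frame `wide_data` with those columns:

```r
# Minimal sketch: in a wide table each time point is its own column, so it can
# serve directly as a response or predictor (wide_data is hypothetical).
fit <- lm(Cells_7D ~ Cells_2D + Cells_3D, data = wide_data)
summary(fit)
```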

Long table formats

Diagram illustrating the transformation of longitudinal data from separate tables into a long format. The top and bottom tables represent 'Wave 1 data' and 'Wave 2 data,' each containing anxiety measures ('anx1' and 'anx2') for students identified by 'stu_id.' The right table, labeled 'Long format data,' restructures the data by adding a 'wave' column, with each row representing one student's measurements at a specific wave. This transformation optimizes the dataset for longitudinal analysis and efficient storage.

A typical long-format data table, from Lewis (2024). DOI: 10.1201/9781032622835-3

In a long-format table, each subject's observations occupy multiple rows: the subject ID repeats, and a separate column indicates the time point (1, 2, 3).

Tip

Useful when analyzing time-lapse data. Example:

Cells ~ TimePoint (1D, 2D, 3D).

Long-format is usually the first choice for data analysis.

The best of all…

You can use R (or Python) and Quarto to convert tables from long to wide format, or vice versa.

Diagram illustrating the transformation between long and wide data formats using `pivot_wider()` and `pivot_longer()`. The left table represents long format data, where each row contains a 'country', a 'year', and a corresponding 'metric' value. The right table represents wide format data, where 'year' values are spread across multiple columns (e.g., 'yr1960', 'yr1970', 'yr2010'), each containing the corresponding metric for each country. The `pivot_wider()` function converts long format to wide format, while `pivot_longer()` reverses the process, demonstrating flexible data reshaping in R.

Long to wide format (https://tavareshugo.github.io/)

Tip

Check the following R and Python tutorials.
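A minimal tidyr sketch of both conversions (the data frame and column names are hypothetical):

```r
# Minimal sketch of reshaping between long and wide formats with tidyr
# (hypothetical data: one row per subject and time point in the long form).
library(tidyr)

cells_long <- data.frame(
  MouseID   = c("M01", "M01", "M02", "M02"),
  TimePoint = c("1D", "2D", "1D", "2D"),
  Cells     = c(120, 95, 132, 101)
)

# Long -> wide: one column per time point
cells_wide <- pivot_wider(cells_long,
                          names_from   = TimePoint,
                          values_from  = Cells,
                          names_prefix = "Cells_")

# Wide -> long: back to one observation per row
pivot_longer(cells_wide,
             cols         = starts_with("Cells_"),
             names_to     = "TimePoint",
             names_prefix = "Cells_",
             values_to    = "Cells")
```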

Provide metadata (README files)

  • Datasets are unintelligible if they are not accompanied by a README file (.txt, .md) that describes their context and content. Codebooks and data dictionaries (.txt, .md, .csv) are also useful for describing the variables in each data table.

Screenshot of descriptive metadata for a dataset related to PDGFR-B+ cell reactivity in a mouse model of cerebral ischemia. The text provides details on the dataset's origin, file naming conventions, experimental conditions, and image processing methods. It describes how images were derived from high-resolution Zenodo deposits and processed using ImageJ and CellProfiler. The document also outlines folder contents, including Ki67/PDGFR-B object detections and outlines, and references an OSF repository for further details.

Example of a readme file

Handling images

Agenda

  1. Principles for handling research data

  2. Handling data tables

  3. Handling images

  4. Organizing (and sharing) data

  5. Write a README file

  6. Checklist for reproducible research

When handling images, please consider:

Fluorescence microscopy image of a coronal section of a mouse brain. The section is stained with immunofluorescence markers

Manrique-Castano et al. (2024). DOI: 10.17605/OSF.IO/3VG8J
  • Transform proprietary files (e.g., .czi) to open formats with no compression (.tif).
  • Share technical (acquisition parameters) and descriptive (context and content) metadata.
  • Document all procedures applied to images (resizing, background subtraction, etc.), for example using coding/scripting software (see the sketch below).
  • Perform analysis using coding/scripting software to ensure reproducibility. Avoid manual analysis.

Tip

Visit this resource for additional information on handling and sharing images.
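As one way of documenting image operations in code, here is a minimal sketch using the magick R package; the file names, operations, and parameters are hypothetical and not those of the cited dataset:

```r
# Minimal sketch: every operation applied to an image is recorded as code
# (magick package; file names, operations, and parameters are hypothetical).
library(magick)

img <- image_read("Data_Raw/section_01.tif")

img_processed <- img |>
  image_normalize() |>          # normalize intensity across the image
  image_resize("50%")           # downscale for analysis

image_write(img_processed, "Data_Intermediate/section_01_small.tif",
            format = "tiff")
```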

Transform images to open formats

Screenshot of an ImageJ macro script written in JavaScript. The script automates the conversion of `.czi` microscopy image files into `.tif` format. It prompts the user to select a directory, retrieves the list of `.czi` files, and processes each file by opening it with Bio-Formats Importer. It then extracts two image channels, saving them separately as `.tif` files in an 'Images_Tiff' folder. The script ensures all files are processed systematically and closes all windows after completion.

FIJI script to save .czi images as .tif. From Manrique-Castano et al. (2024). DOI: 10.17605/OSF.IO/3VG8J

You can easily transform proprietary files (.czi) to open formats (.tif) using, for example, FIJI scripts.

Caution

Saving .czi images as .tif using FIJI will result in the loss of the metadata archived within the .czi file.

Keep track of metadata

Technical

Export technical metadata from proprietary images (e.g., .czi) as .txt or .csv files.

Screenshot of the metadata viewer displaying technical metadata from a `.czi` microscopy image file. The metadata table includes keys and values such as 'BitsPerPixel' (14), 'DimensionOrder' (XYZCT), and 'PixelType' (uint16). Other metadata details indicate that the image has 4 channels (SizeC), a single time point (SizeT = 1), dimensions of 2752x2208 pixels (SizeX, SizeY), and a single Z-plane (SizeZ = 1). This metadata provides essential information for image processing and analysis in research microscopy.

Example of technical metadata in FIJI: Image -> Show Info

Descriptive

Document the provenance and naming conventions in README files.

Organizing (and sharing) data

Agenda

  1. Principles for handling research data

  2. Handling data tables

  3. Handling images

  4. Organizing (and sharing) data

  5. Write a README file

  6. Checklist for reproducible research

A worrying research landscape

We live in a pandemic of fraudulent and irreproducible science.

A chart from The Economist titled 'Pants on fire,' illustrating the cumulative number of retracted biomedical science papers from 1996 to 2023. The graph shows an exponential increase in retractions, reaching over 15,000 by 2023. The data is sourced from Retraction Watch, covering 4,244 assessed journals. The chart highlights growing concerns about research integrity and the rise in retracted publications over time.

Increase in the number of retracted articles in the last three decades

This landscape demands that, as responsible researchers, we employ good research practices to share data and analysis procedures.

Define a dataset structure

A clear structure is the key to understanding and reusing a dataset.

A display of traditional Russian Matryoshka dolls, also known as nesting dolls, painted in vibrant colors.

From pexels.com

A structured directory tree representing an organized research project folder. The top-level folders include 'Code,' containing R scripts for data cleaning and analysis ('clean_raw_data.r,' 'analysis_1.r,' 'analysis_2.r'); 'Data,' which is divided into 'Raw_data' (containing raw files 'file_a.raw' and 'file_b.raw') and 'Processed_data' (containing cleaned CSV files 'file_a.csv' and 'file_b.csv'); 'Outputs,' which includes subfolders for 'Figures' and 'Models'; and a 'README.txt' file. This structure follows best practices for research data organization.

File structure

Principles for structuring a dataset

Define a structure for the data at the beginning of your research (ideally) or during its course.

Think about

  • Folder/directory structures
  • File types and formats
  • Logical, descriptive naming conventions

Overall, ensure that the dataset structure is logical, consistent, and understandable to external users.
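A minimal sketch of setting up such a structure from R, using the folder names shown in the example figure above (adapt them to your project):

```r
# Minimal sketch: create a consistent project skeleton at the start
# (folder names follow the example figure and can be adapted).
folders <- c("Data/Raw_data", "Data/Processed_data",
             "Code", "Outputs/Figures", "Outputs/Models")

for (f in folders) dir.create(f, recursive = TRUE, showWarnings = FALSE)

file.create("README.txt")   # top-level README describing the dataset
```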

Diving into the folder tree

TIER 4.0 is a project template to standardize datasets.

Download the project structure and adapt it to specific cases.

A hierarchical directory structure following the TIER Protocol 4.0 for research data organization. The top-level 'Project/' folder contains key documents such as 'The Read Me File' and 'The Report.' The 'Data/' folder is divided into 'InputData/' (with 'Input Data Files' and 'Metadata' subfolders, including 'Data Sources Guide' and 'Codebooks'), 'AnalysisData/' (with 'Analysis Data Files' and 'The Data Appendix'), and 'IntermediateData/'. The 'Scripts/' folder includes subfolders for 'ProcessingScripts/', 'DataAppendixScripts/', 'AnalysisScripts/', and 'The Master Script.' The 'Output/' folder contains 'DataAppendixOutput' and 'Results.' This structure ensures transparency and reproducibility in research data management.

Folder tree

Raw data

A Data_Raw folder can contain:

  • Original images (.tiff, .czi)
  • Measuring device output files (.txt, .csv)
  • Original registration datasheets (.png, .csv, .xlsx)

A screenshot of a structured directory containing organized raw data.

Folder tree

Raw Data - metadata

Include metadata that allows the file contents to be understood and reused:

  • Methodological and technical details.

  • Codebooks / data dictionaries that explain variables and units. They can be .txt, .csv, or .xlsx files.

  • Instrument and acquisition parameters for images.

Analysis (processed) data

A Data_Analysis folder contains processed files to generate the research results.

  • Provide metadata (as done for raw data).

  • (Optional) Include Data_Appendix files showing basic descriptive statistics or data distributions.

A screenshot of a structured directory containing organized processed data.

Folder tree

Intermediate data (Optional)

A Data_Intermediate folder can contain intermediate or pre-processed files (e.g., image 'masks' or machine learning classifiers) generated as part of an analysis pipeline.

Scripting is the way

While most researchers may be more comfortable with GUIs, the current research landscape requires the use of scripts and code to ensure reproducible research results.

A humorous Star Wars-themed meme comparing different programming languages. The image is divided into three sections, each showing a Star Wars character wielding a lightsaber. On the left, Luke Skywalker, with an 'R' programming language logo, holds a blue lightsaber. In the center, Kylo Ren wields a red crossguard lightsaber with the GraphPad logo. On the right, Mace Windu, associated with the Python logo, holds a purple lightsaber. This meme humorously depicts the perceived roles open software has in the research landscape.

Tip

Coding should be considered an essential skill like other research methods.

Partners to handle code/scripts

R-studio/Quarto (R + Python)

Screenshot of an R-Studio session displaying a Quarto data analysis notebook.

R-studio/quarto screen

GitHub (Version control)

Screenshot of a GitHub repository named 'Stroke_PDGRF-B_Reactivity,' forked from 'elalilab/Stroke_PDGRF-B_Reactivity.' The repository is public and contains directories such as 'Data_Processed' and multiple Quarto markdown files (`.qmd`) related to data analysis

GitHub screen

With R-Studio/Quarto (R and Python) you can write data analysis notebooks that document every step of your processing and analysis.


Keep track with version control


With GitHub or GitLab you can:

  • Store your code/data in a secure place and share it with collaborators and the public.

  • Keep a history of changes and version your code (v 1.0, 1.2, 2.0).

  • Link/render your code on other platforms (e.g., OSF or Borealis).

  • Support other researchers and contribute to a culture of open and reproducible science.

Global supporting communities for coding

Processing scripts

A Scripts_Processing folder contains code to transform the raw data for analysis:

  • Drop variables (subset the dataset)
  • Generate new variables (perform computations, calculate averages, etc.)
  • Combine different sources of information (merge tables or files)
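A minimal sketch of what such a processing script could look like (file, column, and folder names are hypothetical):

```r
# Minimal sketch of a processing script (hypothetical file, column, and folder
# names): subset variables, derive a new one, and merge a second source.
library(dplyr)
library(readr)

counts   <- read_csv("Data_Raw/cell_counts.csv")
metadata <- read_csv("Data_Raw/animal_metadata.csv")

analysis_data <- counts |>
  select(MouseID, DPI, Region, NeuN, Area) |>   # drop unneeded variables
  mutate(NeuN_per_mm2 = NeuN / Area) |>         # generate a new variable
  left_join(metadata, by = "MouseID")           # combine different sources

write_csv(analysis_data, "Data_Analysis/cell_counts_analysis.csv")
```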

Tip

Consider saving the generated intermediate files in the Data_Intermediate/ folder.

Keep in mind

Logical naming conventions are the key to linking the raw data, processing scripts, and analysis data.

Analysis scripts

The Scripts_Analysis folder hosts code to generate results that may be in the form of:

  • Images
  • Figures
  • Tables
  • Statistical models

A screenshot of a structured directory containing organized analysis data.

Folder tree

Tip

These scripts import and process the analysis data, as sketched below.
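A minimal sketch of an analysis script (hypothetical file, column, and output names) that imports the analysis data and saves a figure and a model as independent artifacts:

```r
# Minimal sketch of an analysis script (hypothetical names): import the
# analysis data, then save a figure and a statistical model as outputs.
library(readr)
library(ggplot2)

analysis_data <- read_csv("Data_Analysis/cell_counts_analysis.csv")

fig <- ggplot(analysis_data, aes(x = DPI, y = NeuN, color = Region)) +
  geom_point() +
  geom_smooth(method = "lm")

ggsave("Results/neun_by_dpi.png", fig, width = 6, height = 4)

model <- lm(NeuN ~ DPI + Region, data = analysis_data)
saveRDS(model, "Results/neun_model.rds")
```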

A master script?

The Scripts folder can also contain a master script that executes all other scripts, creating a fully automated pipeline.
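A minimal sketch of such a master script (the script names are hypothetical); running this one file re-executes the whole pipeline:

```r
# Minimal sketch of a master script (hypothetical script names): running this
# single file re-executes the full pipeline from raw data to results.
source("Scripts_Processing/clean_raw_data.R")
source("Scripts_Analysis/analysis_1.R")
source("Scripts_Analysis/analysis_2.R")
```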

The Results folder

The Results folder contains files generated by the analysis scripts in the form of:

  • Images
  • Figures
  • Tables
  • Statistical models

A screenshot of a structured directory containing research figures/plots

Folder tree

Write a README file

Agenda

  1. Principles for handling research data

  2. Handling data tables

  3. Handling images

  4. Organizing (and sharing) data

  5. Write a README file

  6. Checklist for reproducible research

README files

README files are guides to understanding datasets and tables.

Screenshot of a GitHub README file for the 'Bootstrap Ruby Gem,' a library used in Ruby on Rails applications. The README file includes a badge indicating that the gem build is passing and version 4.1.1 is available. The document provides installation instructions, including how to add the gem to the Gemfile and ensure compatibility with 'sprockets-rails.' The guide references different environments, including Ruby on Rails and other Ruby frameworks. The page contains formatted code snippets for easy integration into a Rails project.

From https://github.com/twbs/bootstrap-rubygem

There are templates and resources to guide the generation of README files:

  • Creating a README file
  • Readme.so
  • Readme.ai

Contents of a README file

Generally, a dataset README file includes the following (a minimal skeleton is sketched after this list):

  • Dataset identification, such as title, authors, data collection dates, and geographic information.

  • A map of files/folders defining the content and hierarchy of folders and subfolders, together with naming conventions.

  • Methodological information showcasing methods for data collection/generation, analysis, and experimental conditions.

  • A set of instructions and software requirements for opening, handling, and reproducing the research pipeline.

  • Sharing and access information detailing permissions and conditions of use.
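A minimal sketch of such a README skeleton, written from R for convenience (the fields and folder map are illustrative placeholders):

```r
# Minimal sketch: write a README skeleton from R (fields and folder map are
# illustrative placeholders to be filled in for the actual dataset).
readme <- c(
  "Title / authors / contact: <...>",
  "Data collection dates and geographic location: <...>",
  "",
  "File and folder map:",
  "  Data_Raw/        raw instrument output (naming conventions described here)",
  "  Data_Analysis/   processed tables used to generate results",
  "  Scripts/         processing and analysis code",
  "",
  "Methods: <data collection, processing, and analysis procedures>",
  "Software and instructions: <tools and versions needed to reproduce the pipeline>",
  "Sharing and access: <license and conditions of use>"
)
writeLines(readme, "README.txt")
```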

Please note

A dataset is a standalone object. Methodological information MUST NOT be relegated to associated research articles.

Checklist for reproducible research

Agenda

  1. Principles for handling research data

  2. Handling data tables

  3. Handling images

  4. Organizing (and sharing) data

  5. Write a README file

  6. Checklist for reproducible research

Commitment to reproducibility

A reproducible research project meets these characteristics:

  1. Folders and files are organized in a structured way with open file formats (e.g., CSV, TIFF) and consistent naming conventions.

  2. Processing and analysis are based on reproducible workflows. Results (images, tables, figures, plots) are shared as independent artifacts.

  3. README and data dictionary files allow the dataset to be understood as a standalone object, providing context, methods, processing steps, and variables.

In summary

A dataset is an independent research object that can be used (and cited) independently of the research article.

Better yet, think of articles as supplements to your dataset!

Tip

Visit this resource for principles on depositing data into repositories.

Resources and support

Supporting material

A QR code image that redirects to the presentation located in a GitHub repository.

This presentation is available here (English or French)

Support Services

Contact us to ensure that your data are well prepared and can be effectively shared with the research community.