A primer for researchers
Agenda
Principles for handling research data
Handling data tables
Handling images
Organizing (and sharing) data
Write a README file
Checklist for reproducible research
Research data comes in many flavors and shapes (tables, images, videos, text).
In all cases, it is essential that the dataset has a clear structure and is understandable by others.
Tip
Try to put yourself in the shoes of an outside observer when structuring the data.
Use consistent naming conventions that accurately describe each file’s content and make the relationships between files clear:
Use open, non-proprietary file formats to improve accessibility:
Use comprehensive metadata (README files and data dictionaries/codebooks) to contextualize and describe research files.
Implement reproducible workflows using code (R, Python) to transform raw data into data for analysis.
Tip
These practices ensure organized, clean, and validated datasets.
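To illustrate the naming-convention principle above, here is a minimal Python sketch. The pattern (project, subject, time point, content, date) and every name in it are hypothetical, chosen only for illustration; adapt them to your own project.

```python
import re
from datetime import date

# Hypothetical convention (an assumption, not a standard):
#   <project>_<subject>_<timepoint>_<content>_<YYYYMMDD>.<ext>
NAME_PATTERN = re.compile(
    r"^(?P<project>[A-Za-z0-9]+)_"
    r"(?P<subject>[A-Za-z0-9]+)_"
    r"(?P<timepoint>[A-Za-z0-9]+)_"
    r"(?P<content>[A-Za-z0-9]+)_"
    r"(?P<date>\d{8})\.(?P<ext>[a-z0-9]+)$"
)


def build_name(project, subject, timepoint, content, when, ext):
    """Assemble a file name that follows the convention above."""
    return f"{project}_{subject}_{timepoint}_{content}_{when:%Y%m%d}.{ext}"


def parse_name(filename):
    """Return the name's components as a dict, or None if it does not conform."""
    match = NAME_PATTERN.match(filename)
    return match.groupdict() if match else None


name = build_name("ProjA", "M01", "7D", "Cells", date(2024, 5, 17), "csv")
print(name)
print(parse_name(name))
# A name with spaces and no structure does not match the convention.
print(parse_name("final_data_v2 (copy).xlsx"))
```

Because the same components appear in every name, a raw image, its processing script, and the resulting table can be matched to each other by subject and time point.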
Although spreadsheets (.xls) are the most common file type for recording and storing data, tables are often among the most poorly organized and least reusable objects in research.
In a wide-format table, each subject occupies a single row and variables are individual columns: subject, Id1, Id2, Var1, Var2, Time1, Time2, Time3.
Tip
Here, columns are responses or predictors in a regression. Example:
Cells_7D ~ Cells_2D + Cells_3D.
In a long-format table, the observations for each subject occupy several rows: subject (repeated), Time (1, 2, 3).
Tip
Useful when analyzing time-lapse data. Example:
Cells ~ TimePoint (1D, 2D, 3D).
Long format is usually the first choice for data analysis.
You can use R (or Python) and Quarto to convert a table from long to wide format, or vice versa.
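As a minimal sketch of this conversion, the Python example below uses only the standard library. The table, the column names (subject, Cells_2D, Cells_3D, Cells_7D), and the values are invented for illustration. In practice, pandas (`melt`, `pivot`) or the R package tidyr (`pivot_longer`, `pivot_wider`) are the usual tools for this job.

```python
# Wide format: one row per subject, one column per time point.
# All values below are invented for illustration.
wide = [
    {"subject": "M01", "Cells_2D": 10, "Cells_3D": 14, "Cells_7D": 30},
    {"subject": "M02", "Cells_2D": 8, "Cells_3D": 11, "Cells_7D": 25},
]


def wide_to_long(rows, id_col, prefix):
    """Turn each '<prefix><timepoint>' column into its own row."""
    long_rows = []
    for row in rows:
        for column, value in row.items():
            if column.startswith(prefix):
                long_rows.append({
                    id_col: row[id_col],
                    "TimePoint": column[len(prefix):],
                    "Cells": value,
                })
    return long_rows


def long_to_wide(rows, id_col, prefix):
    """Inverse operation: collect each subject's rows back into one row."""
    by_subject = {}
    for row in rows:
        wide_row = by_subject.setdefault(row[id_col], {id_col: row[id_col]})
        wide_row[prefix + row["TimePoint"]] = row["Cells"]
    return list(by_subject.values())


long = wide_to_long(wide, "subject", "Cells_")
for row in long:
    print(row)
```

The long table has one row per subject and time point, which matches the `Cells ~ TimePoint` layout described above.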
Tip
Visit this resource for additional information on handling and sharing images.
You can easily convert proprietary files (.czi) to open formats (.tif) using, for example, FIJI scripts.
Caution
Saving .czi images as .tif using FIJI will result in the loss of the metadata stored within the .czi file.
Export technical metadata from proprietary images (e.g., .czi) as .txt or .csv files.
Document the provenance and naming conventions in README files.
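The sketch below shows one way to store exported metadata as a .csv file next to the converted image. It covers only the writing step: the metadata values are hypothetical placeholders, and reading them out of a .czi file would require a dedicated reader that is not shown here. The function name and the `_metadata.csv` suffix are assumptions made for illustration.

```python
import csv
import tempfile
from pathlib import Path


def write_metadata_csv(metadata, image_name, out_dir):
    """Write key/value metadata to '<image stem>_metadata.csv' in out_dir."""
    out_path = Path(out_dir) / f"{Path(image_name).stem}_metadata.csv"
    with out_path.open("w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow(["parameter", "value"])
        for key, value in metadata.items():
            writer.writerow([key, value])
    return out_path


# Hypothetical values; in practice these come from the acquisition software.
example_metadata = {
    "objective": "20x",
    "pixel_size_um": 0.325,
    "channels": "DAPI;GFP",
}

# A temporary folder keeps the example from touching real project files.
with tempfile.TemporaryDirectory() as tmp:
    path = write_metadata_csv(example_metadata, "ProjA_M01_7D.czi", tmp)
    saved_name = path.name
    saved_text = path.read_text(encoding="utf-8")

print(saved_name)
print(saved_text)
```

Keeping the image stem in the metadata file name ties the two files together, in line with the naming conventions documented in the README.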
We live in a pandemic of fraudulent and irreproducible science.
This landscape demands that, as responsible researchers, we employ good research practices to share data and analysis procedures.
A structured dataset is the key to understanding and reusing it.
Define a structure for the data at the beginning (best) or during the course of your research.
Think about
Overall, ensure that the dataset structure is logical, consistent, and understandable to external users.
A Data_Raw folder can contain:
Include metadata that allows the file contents to be understood and reused:
Methodological and technical details.
Codebooks / data dictionaries that explain variables and units. They can be .txt, .csv, or .xlsx files.
Instrument and acquisition parameters for images.
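A data dictionary can itself be a small table. The sketch below, with hypothetical variables and descriptions, writes one as CSV text and checks whether every column of a data table is documented. The field names (variable, type, unit, description) are an assumption, not a required schema.

```python
import csv
import io

# Hypothetical variables; replace with the columns of your own table.
codebook = [
    {"variable": "subject", "type": "string", "unit": "",
     "description": "Subject identifier"},
    {"variable": "TimePoint", "type": "string", "unit": "days",
     "description": "Time of measurement (2D, 3D, 7D)"},
    {"variable": "Cells", "type": "integer", "unit": "count",
     "description": "Number of cells counted per image"},
]

FIELDS = ["variable", "type", "unit", "description"]


def codebook_to_csv(entries):
    """Serialize the codebook as CSV text, one row per variable."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS, lineterminator="\n")
    writer.writeheader()
    writer.writerows(entries)
    return buffer.getvalue()


def undocumented_columns(table_columns, entries):
    """Return the table columns that the codebook does not describe."""
    documented = {entry["variable"] for entry in entries}
    return [column for column in table_columns if column not in documented]


print(codebook_to_csv(codebook))
print(undocumented_columns(["subject", "TimePoint", "Cells", "Notes"], codebook))
```

Running the check whenever a table changes helps keep the codebook and the data in step.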
A Data_Analysis folder contains processed files to generate the research results.
Provide metadata (as done for raw data).
(Optional) Include Data_Appendix files showing basic descriptive statistics or data distributions.
A Data_Intermediate folder can contain intermediate or pre-processed files (e.g., image ‘masks’ or machine learning classifiers) as part of an analysis pipeline.
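The three data folders described above can be created with a few lines of code, so that every project starts from the same layout. This is a minimal sketch: the function name is hypothetical, and a temporary directory stands in for a real project root.

```python
import tempfile
from pathlib import Path

# Folder names taken from the structure described above.
DATA_FOLDERS = ["Data_Raw", "Data_Intermediate", "Data_Analysis"]


def create_data_folders(project_root):
    """Create the data folders under project_root and list what exists."""
    root = Path(project_root)
    for name in DATA_FOLDERS:
        # exist_ok=True makes the function safe to run more than once.
        (root / name).mkdir(parents=True, exist_ok=True)
    return sorted(path.name for path in root.iterdir() if path.is_dir())


# A temporary folder stands in for the real project directory.
with tempfile.TemporaryDirectory() as tmp:
    created = create_data_folders(tmp)

print(created)
```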
While most researchers may be more comfortable with GUIs, the current research landscape requires the use of scripts and code to ensure reproducible research results.
Tip
Coding should be considered an essential skill like other research methods.
Handle data tables and variables using the R tidyverse.
Process flow cytometry files/data using the R package flowCore from Bioconductor.
Analyze RNA-seq data using the R package DESeq2 from Bioconductor.
Perform state-of-the-art statistical modeling using the R package brms.
And anything else you can imagine…
With GitHub or GitLab you can:
Store your code/data in a secure place and share it with collaborators and the public.
Keep a history of changes and version your code (v1.0, v1.2, v2.0).
Link or render your code on different platforms (e.g., OSF or Borealis).
Support other researchers and contribute to a culture of open and reproducible science.
A Scripts_Processing folder contains code to transform the raw data for analysis:
Tip
Consider saving the generated intermediate files in the Data_Intermediate/ folder.
Logical naming conventions are the key to linking the raw data, processing scripts, and analysis data.
The Scripts_Analysis folder hosts code to generate results that may be in the form of:
Tip
These scripts import and process the analysis data.
The Scripts folder can also contain a master script that executes all other scripts, creating a fully automated pipeline.
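A master script can be as simple as a loop that runs every script in a fixed order. The sketch below assumes Python scripts whose numeric prefixes (01_, 02_) define the order within each folder; the script names are hypothetical. For R scripts, the same idea applies, with each script launched through `Rscript` instead.

```python
import runpy
import tempfile
from pathlib import Path


def run_pipeline(script_dirs):
    """Run every .py script in each folder, in folder order, then name order."""
    executed = []
    for folder in script_dirs:
        for script in sorted(Path(folder).glob("*.py")):
            # Each script runs in a fresh namespace, as if called directly.
            # An exception in any script stops the whole pipeline.
            runpy.run_path(str(script), run_name="__main__")
            executed.append(script.name)
    return executed


# Demonstration with two placeholder scripts in a temporary project.
with tempfile.TemporaryDirectory() as tmp:
    processing = Path(tmp) / "Scripts_Processing"
    analysis = Path(tmp) / "Scripts_Analysis"
    for folder, name in [(processing, "01_clean.py"), (analysis, "01_model.py")]:
        folder.mkdir()
        (folder / name).write_text(f"print('running {name}')\n")
    order = run_pipeline([processing, analysis])

print(order)
```

Listing the processing folder before the analysis folder guarantees that the analysis data exist before the analysis scripts try to read them.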
The Results folder contains files generated by the analysis scripts in the form of:
README files are guides for understanding datasets and tables.
There are templates and resources to guide the creation of README files:
- Creating a README file
- Readme.so
- Readme.ai
Generally, a dataset README file includes:
A dataset identifier with information such as title, authors, data collection date, and geographic information.
A map of files/folders defining the content and hierarchy of folders and subfolders, together with naming conventions.
Methodological information showcasing methods for data collection/generation, analysis, and experimental conditions.
Instructions and software requirements for opening and handling the files and for reproducing the research pipelines.
Sharing and access information detailing permissions and conditions of use.
Please note
A dataset is a standalone object. Methodological information MUST NOT be relegated to associated research articles.
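The README elements listed above can be turned into a reusable skeleton. The sketch below generates a plain-text template and reports which sections an existing README lacks. The section titles and the underline style are assumptions made for illustration, not a prescribed format.

```python
# Section titles and hints paraphrase the elements listed above.
README_SECTIONS = [
    ("Dataset identification",
     "Title, authors, data collection date, geographic information."),
    ("File and folder map",
     "Folder hierarchy, file contents, naming conventions."),
    ("Methodological information",
     "Data collection, analysis methods, experimental conditions."),
    ("Instructions and software",
     "How to open the files and reproduce the pipeline."),
    ("Sharing and access",
     "Permissions and conditions of use."),
]


def build_readme(title, sections=README_SECTIONS):
    """Return a plain-text README skeleton with one block per section."""
    lines = [title, "=" * len(title), ""]
    for heading, hint in sections:
        lines += [heading, "-" * len(heading), f"TODO: {hint}", ""]
    return "\n".join(lines)


def missing_sections(readme_text, sections=README_SECTIONS):
    """List the section headings that do not appear in readme_text."""
    return [heading for heading, _ in sections if heading not in readme_text]


template = build_readme("Example dataset")
print(template)
print(missing_sections("Dataset identification\nTitle: ..."))
```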
A reproducible research project meets these characteristics:
Folders and files are organized in a structured way with open file formats (e.g., CSV, TIFF) and consistent naming conventions.
Processing and analysis are based on reproducible workflows. Results (images, tables, figures, plots) are shared as independent artifacts.
README and data dictionary files allow the dataset to be understood as a standalone object, providing context, methods, processing steps, and variables.
A dataset is an independent research object that can be used (and cited) independently of the research article.
Better yet, think of articles as supplements to your dataset!
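Part of this checklist can be automated. The sketch below checks a project folder for two things only: a README file at the top level and files in proprietary formats. Treating .xls and .czi as proprietary follows the examples in this primer; the extension list, function name, and file names are otherwise assumptions to adapt.

```python
import tempfile
from pathlib import Path

# Assumption for illustration: extensions treated as proprietary.
PROPRIETARY_EXTENSIONS = {".xls", ".czi"}


def check_project(project_root):
    """Return a list of problems found; an empty list means none were found."""
    root = Path(project_root)
    problems = []
    top_level_files = [path for path in root.iterdir() if path.is_file()]
    if not any(path.name.lower().startswith("readme") for path in top_level_files):
        problems.append("No README file at the top level")
    for path in sorted(root.rglob("*")):
        if path.is_file() and path.suffix.lower() in PROPRIETARY_EXTENSIONS:
            relative = path.relative_to(root).as_posix()
            problems.append(f"Proprietary format: {relative}")
    return problems


# Demonstration on a temporary project with placeholder (empty) files.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    raw = root / "Data_Raw"
    raw.mkdir()
    (raw / "M01_7D.czi").write_bytes(b"")
    before = check_project(root)

    (root / "README.txt").write_text("Dataset description\n")
    (raw / "M01_7D.czi").unlink()
    (raw / "M01_7D.tif").write_bytes(b"")  # stand-in for a converted image
    after = check_project(root)

print(before)
print(after)
```

A check like this does not replace a manual review of the README content or the data dictionary, but it catches the most common omissions before deposit.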
Tip
Visit this resource for principles on depositing data into repositories.
Contact us to ensure that your data are well prepared and can be effectively shared with the research community.
Handling and organizing research data - FRDR curation team