MAT-files are the proprietary binary data container format used by MATLAB. They are widely used in engineering and science but pose a significant challenge for long-term preservation.
Identify file versions and inventory internal variables. Our objective is to distinguish between classic Level 5 (v5) files and modern Level 7.3 (HDF5-based) files, ensuring the correct open-source libraries are used for long-term access.
Proprietary format lock-in is a major risk. Version 7.3 files require specialized HDF5 parsers, while older versions are opaque without MATLAB or compatible open-source alternatives like GNU Octave.
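To make the version distinction concrete, here is a minimal base R sketch that classifies a file from its first 128 bytes. It assumes the published Level 5 header layout: a descriptive text field that starts with "MATLAB 5.0 MAT-file" (or "MATLAB 7.3 MAT-file" for HDF5-based files) and a two-byte endian indicator in bytes 127-128. The function name classify_mat_header is hypothetical and not part of any package.

```r
# Hypothetical helper: classify a MAT-file from its first 128 header bytes.
classify_mat_header <- function(header_raw) {
  stopifnot(is.raw(header_raw))
  if (length(header_raw) < 128) {
    return(list(version = "unknown (header too short)", byte_order = NA_character_))
  }

  # Keep printable ASCII from the 116-byte text field so rawToChar() never sees NULs.
  codes <- as.integer(header_raw[1:116])
  text <- rawToChar(as.raw(codes[codes > 31 & codes < 127]))

  # Bytes 127-128 hold the endian indicator as it appears on disk.
  tag <- header_raw[127:128]
  byte_order <- if (identical(tag, charToRaw("IM"))) {
    "little-endian"
  } else if (identical(tag, charToRaw("MI"))) {
    "big-endian"
  } else {
    NA_character_
  }

  version <- if (startsWith(text, "MATLAB 7.3 MAT-file")) {
    "v7.3 (HDF5)"
  } else if (startsWith(text, "MATLAB 5.0 MAT-file")) {
    "v5"
  } else {
    "unknown (possibly v4 or not a MAT-file)"
  }

  list(version = version, byte_order = byte_order)
}

# Synthetic header for illustration only (not a real MAT-file).
demo_header <- raw(128)
demo_text <- charToRaw("MATLAB 5.0 MAT-file, synthetic example")
demo_header[seq_along(demo_text)] <- demo_text
demo_header[127:128] <- charToRaw("IM")
demo_result <- classify_mat_header(demo_header)
```

For a real file, the 128 bytes can be read with readBin() on a binary connection. Files without a recognizable text header are reported as unknown rather than guessed.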
This notebook inspects .mat files to identify each file's format version and to inventory the variables stored inside it.
This workflow relies on the R.matlab package for v4/v5 files and on the rhdf5 package for v7.3 files.
If you don’t have the packages installed, run these commands once in your R console:
# install.packages(c("tidyverse", "R.matlab", "hdf5r", "rstudioapi"))
if (!require("BiocManager", quietly = TRUE)) install.packages("BiocManager")
BiocManager::install("rhdf5")

library(tidyverse)
library(R.matlab)
library(hdf5r)
library(rstudioapi)

We select the folder containing the .mat files.
Note: If running interactively in RStudio, a dialog box will appear. Otherwise, the directory defaults to the target_dir parameter.
# 1. Try to select interactively if in RStudio
if (interactive() && rstudioapi::isAvailable()) {
  selected_dir <- rstudioapi::selectDirectory(caption = "Select MAT-file Directory")
} else {
  selected_dir <- NULL
}
# 2. Use the selected directory, or fall back to the notebook parameter
if (!is.null(selected_dir)) {
  target_dir <- selected_dir
} else {
  target_dir <- params$target_dir
}
print(paste("Analyzing directory:", target_dir))
[1] "Analyzing directory: data/Inspect_mat/"
We scan the directory for files ending in .mat.
mat_files <- list.files(
  path = target_dir,
  pattern = "\\.mat$",
  recursive = TRUE,
  full.names = TRUE,
  ignore.case = TRUE
)
print(paste("Found", length(mat_files), "MATLAB files."))
[1] "Found 2 MATLAB files."
head(mat_files)
[1] "data/Inspect_mat//preprocessed.mat" "data/Inspect_mat//trial.mat"
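We first inspect each file's header to decide whether it is an HDF5-based v7.3 file. If it is, we list the HDF5 hierarchy with rhdf5::h5ls() and extract each object's path, object type, dimensions, and data class.
If the file is not HDF5, the code falls back to R.matlab::readMat() and extracts the same information from a v4/v5 structure.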
# --- Helper 1: Extract v5 Content (R.matlab) ---
extract_v5_content <- function(fp) {
  tryCatch({
    # readMat returns a named list, one element per MATLAB variable
    content <- R.matlab::readMat(fp)
    var_names <- names(content)
    # Map each variable to its dimensions and class
    map_dfr(var_names, function(v) {
      obj <- content[[v]]
      dims <- paste(dim(obj), collapse = "x")
      if (dims == "") dims <- paste(length(obj), "(len)") # Handle vectors
      tibble(
        object_path = v,
        object_type = "MATLAB Variable",
        dimensions = dims,
        data_class = class(obj)[1]
      )
    })
  }, error = function(e) {
    tibble(object_path = "Parse Error", object_type = "Error", dimensions = NA, data_class = e$message)
  })
}
# --- Helper 2: Extract v7.3 Content (rhdf5) ---
extract_v73_content <- function(fp) {
  tryCatch({
    # h5ls recursively lists the HDF5 hierarchy
    content <- rhdf5::h5ls(fp, all = TRUE)
    content %>%
      mutate(
        object_path = paste0(group, "/", name),
        object_path = str_replace_all(object_path, "//", "/") # Clean paths
      ) %>%
      select(
        object_path,
        object_type = otype, # e.g., H5I_DATASET
        dimensions = dim,
        data_class = dclass # e.g., INTEGER, FLOAT, COMPOUND
      )
  }, error = function(e) {
    tibble(object_path = "HDF5 Error", object_type = "Error", dimensions = NA, data_class = e$message)
  })
}
# --- Main Processing Function ---
inspect_mat_file <- function(fp) {
  fname <- basename(fp)
  # 1. Header Check (Identify Version)
  # v7.3 files begin with a text header naming "MATLAB 7.3" and the HDF5 schema
  con <- file(fp, "rb")
  header_raw <- tryCatch(readBin(con, "raw", n = 128), finally = close(con))
  # Keep printable ASCII only so rawToChar() never sees embedded NULs
  header_codes <- as.integer(header_raw)
  header_txt <- rawToChar(header_raw[header_codes > 31 & header_codes < 127])
  # fixed() matches the strings literally ("." is not treated as a regex wildcard)
  is_hdf5 <- str_detect(header_txt, fixed("HDF5")) || str_detect(header_txt, fixed("MATLAB 7.3"))
  version_label <- if (is_hdf5) "v7.3 (HDF5)" else "v5 (Standard)"
  # 2. Branching Logic
  if (is_hdf5) {
    content_df <- extract_v73_content(fp)
  } else {
    content_df <- extract_v5_content(fp)
  }
  # 3. Add Metadata
  content_df %>%
    mutate(
      filename = fname,
      version = version_label
    ) %>%
    select(filename, version, everything())
}
# Run Analysis
mat_files <- list.files(target_dir, pattern = "\\.mat$", full.names = TRUE, recursive = TRUE, ignore.case = TRUE)
message(paste("Processing", length(mat_files), "files..."))
if (length(mat_files) > 0) {
  results <- map_dfr(mat_files, inspect_mat_file)
} else {
  results <- tibble(filename = character(), version = character())
}
print(paste("Extracted", nrow(results), "objects from", length(unique(results$filename)), "files."))
[1] "Extracted 2 objects from 2 files."
head(results, 10)
          filename     version object_path object_type       dimensions data_class
1 preprocessed.mat v7.3 (HDF5)      /arr_0 H5I_DATASET 3000 x 180 x 180    INTEGER
2        trial.mat v7.3 (HDF5)      /arr_0 H5I_DATASET 3030 x 720 x 240    INTEGER
output_dir <- "Results/Inspect_mat"
dir.create(output_dir, recursive = TRUE, showWarnings = FALSE)
output_file <- file.path(output_dir, paste0("MAT_Deep_Report_", format(Sys.Date(), "%Y%m%d"), ".csv"))
write.csv(results, output_file, row.names = FALSE)
print(paste("Report saved to:", output_file))
[1] "Report saved to: Results/Inspect_mat/MAT_Deep_Report_20260515.csv"
Use the generated CSV report to guide your preservation actions:
Format Obsolescence (v5 vs v7.3): The version column identifies the file architecture. v7.3 files are essentially HDF5 files and require HDF5 libraries (rhdf5 in R, h5py in Python) to read. Consider creating a codebook that maps the HDF5 paths (e.g., /group1/dataset_A) to their scientific meaning.
Workspace Dumps: Files with high object counts (>50) and generic names (var1, temp, data_copy) are likely dumps of an entire MATLAB workspace (everything held in RAM at save time). For these files, request a data dictionary or a cleaned version containing only the final variables.
Complex Structures (H5T_COMPOUND): In v7.3 files, H5T_COMPOUND (reported as COMPOUND in the data_class column) indicates records with several named fields, such as nested, table-like data. These do not map cleanly onto a flat CSV. We recommend converting these specific objects to JSON or to separate CSVs to preserve the hierarchy.
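The checks above can be sketched as a small triage step over the report. This is a minimal base R sketch, assuming a data frame with the filename and data_class columns produced earlier; the function name flag_review_candidates is hypothetical, and the default threshold of 50 objects is simply the rule of thumb quoted above.

```r
# Hypothetical helper: summarize the report per file and flag review candidates.
flag_review_candidates <- function(results, max_objects = 50) {
  # Number of objects recorded for each file.
  counts <- as.data.frame(
    table(filename = results$filename),
    responseName = "n_objects",
    stringsAsFactors = FALSE
  )

  # Number of objects per file whose data class is reported as compound.
  is_compound <- grepl("COMPOUND", results$data_class, fixed = TRUE)
  n_compound <- tapply(is_compound, results$filename, sum)
  counts$n_compound <- as.integer(n_compound[counts$filename])

  # Files above the threshold are candidates for a data dictionary request.
  counts$likely_workspace_dump <- counts$n_objects > max_objects
  counts
}

# Synthetic report for illustration only.
example_results <- data.frame(
  filename = c(rep("a.mat", 60), rep("b.mat", 2)),
  data_class = c(rep("FLOAT", 59), "COMPOUND", "INTEGER", "INTEGER"),
  stringsAsFactors = FALSE
)
triage <- flag_review_candidates(example_results)
```

The object count alone is a coarse signal. Checking for generic variable names still needs a human look at the object_path column.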
MATLAB .mat files can vary significantly in format and complexity. We have some limitations to keep in mind:
rhdf5 (Bioconductor): This library allows R users to browse and read v7.3 MAT-files as if they were standard HDF5 archives.
GNU Octave: An open-source high-level language, largely compatible with MATLAB. It can read and write most MAT-files without a MATLAB license.
HDFView: This is a visual tool to browse the contents of v7.3 MAT-files without needing code.
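As an illustration of the rhdf5 route, the sketch below reads the attributes and an optional slice of one dataset. It assumes rhdf5 is installed. The function name peek_v73_dataset is hypothetical, while h5readAttributes() and h5read() are rhdf5 functions. Passing an index avoids loading a large array in full.

```r
# Hypothetical helper: look at one dataset inside a v7.3 (HDF5) MAT-file.
peek_v73_dataset <- function(fp, dataset_path, index = NULL) {
  if (!requireNamespace("rhdf5", quietly = TRUE)) {
    stop("Package 'rhdf5' is required for this sketch.")
  }
  list(
    # Attributes stored on the dataset (whatever the writing software recorded).
    attributes = rhdf5::h5readAttributes(fp, dataset_path),
    # index is a list with one entry per dimension; NULL means "everything".
    values = rhdf5::h5read(fp, dataset_path, index = index)
  )
}

# Example call, assuming a 3-D dataset at "/arr_0" as in the report shown earlier:
# peek <- peek_v73_dataset("data/Inspect_mat/preprocessed.mat", "/arr_0",
#                          index = list(1:5, NULL, NULL))
# dim(peek$values)
```

Reading by index is a deliberate choice here: the datasets in the report hold tens of millions of values, so a small slice is usually enough to confirm what a path contains.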
For users who want to run this analysis on a server, in a batch job, or from the command line, here is a pure R script that performs the same process.
Download the R Script: Inspect_mat_Script.R
Batch script (Inspect_mat_Script.sh):
#!/bin/bash
#SBATCH --job-name=matlab_inspect
#SBATCH --cpus-per-task=2
#SBATCH --mem=8G
#SBATCH --time=01:00:00
#SBATCH --output=matlab_inspect_%j.out
#SBATCH --error=matlab_inspect_%j.err
module load R/4.3.0 # adapt to your cluster's module system
Rscript Inspect_mat_Script.R \
--target_dir /path/to/your/mat/files \
--output_dir /path/to/output/folder \
--max_full_numeric 1000000 \
--max_sample_numeric 100000
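The job script passes its settings as "--key value" pairs. How the downloadable R script reads them is not shown here, so the following is only a sketch of one way to do it in base R. The function name parse_cli_args and the default values are assumptions made for illustration.

```r
# Hypothetical parser for "--key value" pairs passed on the command line.
parse_cli_args <- function(args, defaults = list()) {
  opts <- defaults
  i <- 1
  while (i <= length(args)) {
    key <- args[i]
    # Every flag must start with "--" and be followed by a value.
    if (!startsWith(key, "--") || i == length(args)) {
      stop("Expected '--key value' pairs; problem near: ", key)
    }
    opts[[substring(key, 3)]] <- args[i + 1]
    i <- i + 2
  }
  opts
}

# In a script, the arguments would come from commandArgs(trailingOnly = TRUE).
# Here a fixed vector stands in for them.
opts <- parse_cli_args(
  c("--target_dir", "/path/to/your/mat/files", "--max_full_numeric", "1000000"),
  defaults = list(target_dir = ".", output_dir = "Results/Inspect_mat")
)
# Values arrive as character strings and need explicit conversion.
max_full_numeric <- as.numeric(opts$max_full_numeric)
```

Keeping the defaults in one list means the same script runs unchanged with no arguments at all, which is convenient for quick local tests before submitting a batch job.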