19  MATLAB (.mat) Files

Author

Daniel Manrique-Castano

Published

December 18, 2025

19.1 Overview

MAT-files are the proprietary binary data container format used by MATLAB. They are widely used in engineering and science but pose a significant challenge for long-term preservation.

Note: Curation Goal

Identify file versions and inventory internal variables. The objective is to distinguish classic Level 5 (v5) files from modern Version 7.3 (HDF5-based) files, so that the correct open-source libraries are used for long-term access.

Warning: Preservation Risk

Proprietary format lock-in is a major risk. Version 7.3 files require specialized HDF5 parsers, while older versions are opaque without MATLAB or compatible open-source alternatives like GNU Octave.

This notebook inspects .mat files to:

  1. Identify Version: Distinguish between v5 and v7.3 without crashing.
  2. Inventory Content: List variable names and array dimensions to understand dataset complexity.

19.2 Setup

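This workflow relies on the R.matlab package to read v4/v5 files and on the rhdf5 package (Bioconductor) to read v7.3 files. The tidyverse handles the data wrangling, and rstudioapi provides the optional folder dialog.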

19.2.1 R Packages

If you don’t have the packages installed, uncomment and run these commands once in your R console:

Code
# CRAN packages
# install.packages(c("tidyverse", "R.matlab", "rstudioapi"))

# rhdf5 is distributed through Bioconductor, not CRAN
# if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager")
# BiocManager::install("rhdf5")

19.2.2 Load libraries

Code
library(tidyverse)
library(R.matlab)
library(rhdf5)      # provides h5ls(), used below for v7.3 files
library(rstudioapi)

19.3 Select Target Directory

We select the folder containing the .mat files.

Note: When the notebook runs interactively in RStudio, a folder-selection dialog appears. Otherwise, the directory defaults to the target_dir parameter.

Code
# 1. Try to select interactively if in RStudio
if (interactive() && rstudioapi::isAvailable()) {
  selected_dir <- rstudioapi::selectDirectory(caption = "Select MAT-file Directory")
} else {
  selected_dir <- NULL
}

# 2. Logic to determine final directory
if (!is.null(selected_dir)) {
  target_dir <- selected_dir
} else {
  target_dir <- params$target_dir
}

print(paste("Analyzing directory:", target_dir))
[1] "Analyzing directory: data/Inspect_mat/"

19.4 Find MATLAB files

We scan the directory for files ending in .mat.

Code
mat_files <- list.files(
  path = target_dir,
  pattern = "\\.mat$", 
  recursive = TRUE, 
  full.names = TRUE, 
  ignore.case = TRUE
)

print(paste("Found", length(mat_files), "MATLAB files."))
[1] "Found 2 MATLAB files."
Code
head(mat_files)
[1] "data/Inspect_mat//preprocessed.mat" "data/Inspect_mat//trial.mat"       

19.5 Generate Metadata Report

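We first read the first 128 bytes of each .mat file to decide whether it is an HDF5-based (v7.3) file. If it is, rhdf5::h5ls() lists every object in the file and we extract:

  • object path (group and dataset names)
  • object type (for example, H5I_DATASET)
  • array dimensions
  • data class (for example, INTEGER or FLOAT)

The number of dimensions and the number of elements are not stored as separate columns, but both can be derived from the dimensions column.

If the file is not HDF5, the code falls back to R.matlab::readMat() and extracts the equivalent information (variable name, dimensions, and R class) from the v4/v5 structure.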

Code
# --- Helper 1: Extract v5 Content (R.matlab) ---
extract_v5_content <- function(fp) {
  tryCatch({
    # readMat returns a simple list
    content <- R.matlab::readMat(fp)
    var_names <- names(content)
    
    # Map each variable to its dimensions and class
    map_dfr(var_names, function(v) {
      obj <- content[[v]]
      dims <- paste(dim(obj), collapse = "x")
      if (dims == "") dims <- paste(length(obj), "(len)") # Handle vectors
      
      tibble(
        object_path = v,
        object_type = "MATLAB Variable",
        dimensions = dims,
        data_class = class(obj)[1]
      )
    })
  }, error = function(e) {
    tibble(object_path = "Parse Error", object_type = "Error", dimensions = NA, data_class = e$message)
  })
}

# --- Helper 2: Extract v7.3 Content (rhdf5) ---
extract_v73_content <- function(fp) {
  tryCatch({
    # h5ls recursively lists the HDF5 hierarchy
    content <- rhdf5::h5ls(fp, all = TRUE)
    
    content %>%
      mutate(
        object_path = paste0(group, "/", name),
        object_path = str_replace_all(object_path, "//", "/") # Clean paths
      ) %>%
      select(
        object_path,
        object_type = otype,  # e.g., H5I_DATASET
        dimensions = dim,
        data_class = dclass     # e.g., H5T_FLOAT
      )
  }, error = function(e) {
    tibble(object_path = "HDF5 Error", object_type = "Error", dimensions = NA, data_class = e$message)
  })
}

# --- Main Processing Function ---
inspect_mat_file <- function(fp) {
  fname <- basename(fp)
  
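  # 1. Header Check (Identify Version)
  # Read only the first 128 bytes, so the file is never fully loaded at this step
  con <- file(fp, "rb")
  header_raw <- tryCatch(readBin(con, "raw", n = 128), finally = close(con))
  # Keep printable ASCII only, so rawToChar() does not fail on embedded NUL bytes
  header_txt <- rawToChar(header_raw[header_raw > 31 & header_raw < 127])
  
  # fixed() matches the literal text; in a regex, the "." in "7.3" would match any character
  is_hdf5 <- str_detect(header_txt, fixed("HDF5")) || str_detect(header_txt, fixed("MATLAB 7.3"))
  version_label <- if (is_hdf5) "v7.3 (HDF5)" else "v5 (Standard)"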
  
  # 2. Branching Logic
  if (is_hdf5) {
    content_df <- extract_v73_content(fp)
  } else {
    content_df <- extract_v5_content(fp)
  }
  
  # 3. Add Metadata
  content_df %>%
    mutate(
      filename = fname,
      version = version_label
    ) %>%
    select(filename, version, everything())
}

# Run Analysis
mat_files <- list.files(target_dir, pattern = "\\.mat$", full.names = TRUE, recursive = TRUE, ignore.case = TRUE)

message(paste("Processing", length(mat_files), "files..."))

if (length(mat_files) > 0) {
  results <- map_dfr(mat_files, inspect_mat_file)
} else {
  results <- tibble(filename = character(), version = character())
}

print(paste("Extracted", nrow(results), "objects from", length(unique(results$filename)), "files."))
[1] "Extracted 2 objects from 2 files."
Code
head(results, 10)
          filename     version object_path object_type       dimensions
1 preprocessed.mat v7.3 (HDF5)      /arr_0 H5I_DATASET 3000 x 180 x 180
2        trial.mat v7.3 (HDF5)      /arr_0 H5I_DATASET 3030 x 720 x 240
  data_class
1    INTEGER
2    INTEGER
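
The header check in inspect_mat_file() labels every file without an HDF5 marker as v5, so a v4 file or a file that is not a MAT-file at all would also be reported as v5. A stricter variant can be sketched in base R, assuming that Level 5 headers begin with the text "MATLAB 5.0 MAT-file" and v7.3 headers with "MATLAB 7.3 MAT-file". The function name classify_mat_header() and the temporary demo files are hypothetical and only illustrate the idea.

```r
# Sketch: classify a MAT-file by the text at the start of its header (base R only).
# Assumption: v5 headers start with "MATLAB 5.0 MAT-file" and v7.3 headers
# with "MATLAB 7.3 MAT-file"; anything else is reported as unknown.
classify_mat_header <- function(path) {
  con <- file(path, "rb")
  on.exit(close(con))
  bytes <- readBin(con, what = "raw", n = 128)

  # Keep printable ASCII only so rawToChar() never meets an embedded NUL byte
  codes <- as.integer(bytes)
  txt <- rawToChar(bytes[codes >= 32L & codes <= 126L])

  if (startsWith(txt, "MATLAB 7.3 MAT-file")) {
    "v7.3 (HDF5)"
  } else if (startsWith(txt, "MATLAB 5.0 MAT-file")) {
    "v5"
  } else {
    "unknown (possibly v4 or not a MAT-file)"
  }
}

# Demo with throwaway files that contain only a fake header line
demo_v5  <- tempfile(fileext = ".mat")
demo_v73 <- tempfile(fileext = ".mat")
demo_txt <- tempfile(fileext = ".mat")
writeBin(charToRaw("MATLAB 5.0 MAT-file, Platform: demo"), demo_v5)
writeBin(charToRaw("MATLAB 7.3 MAT-file, Platform: demo"), demo_v73)
writeBin(charToRaw("plain text, not a MAT-file"), demo_txt)

labels <- vapply(c(demo_v5, demo_v73, demo_txt), classify_mat_header, character(1))
print(unname(labels))
```

Reporting an explicit "unknown" label keeps unexpected files visible in the report, so they are not silently passed to readMat().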

19.6 Export Results

Code
output_dir <- "Results/Inspect_mat"
dir.create(output_dir, recursive = TRUE, showWarnings = FALSE)

output_file <- file.path(output_dir, paste0("MAT_Deep_Report_", format(Sys.Date(), "%Y%m%d"), ".csv"))

write.csv(results, output_file, row.names = FALSE)
print(paste("Report saved to:", output_file))
[1] "Report saved to: Results/Inspect_mat/MAT_Deep_Report_20260515.csv"

19.7 Curation Insights

Use the generated CSV report to guide your preservation actions:

  • Format Obsolescence (v5 vs v7.3): The version column identifies the file architecture. v7.3 files are essentially HDF5 files and require HDF5 libraries (rhdf5 in R, h5py in Python) to read. Consider creating a codebook that maps the HDF5 paths (e.g., /group1/dataset_A) to their scientific meaning.

  • Workspace Dumps: Files with high object counts (>50) and generic variable names (var1, temp, data_copy) are likely dumps of an entire MATLAB workspace rather than curated datasets. For these files, request a data dictionary or a cleaned version containing only the final variables.

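  • Complex Structures (H5T_COMPOUND): In v7.3 files, a compound data class (reported as COMPOUND by rhdf5) indicates datasets whose elements hold several named fields, such as record-like tables or complex numbers stored as real and imaginary parts. MATLAB structs, by contrast, usually appear as HDF5 groups. Compound datasets are difficult to convert to a single CSV. Recommend converting them to JSON or to separate CSVs to preserve the hierarchy.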
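
The workspace-dump and codebook suggestions above can be sketched in base R. The data frame below is a toy stand-in for the exported report, and its file and variable names are invented for illustration. In practice the table would come from read.csv() on the generated CSV. The threshold of 50 objects is the rule of thumb stated above, not a fixed standard.

```r
# Toy stand-in for the report; file and variable names are invented for illustration.
results <- data.frame(
  filename    = c(rep("session_dump.mat", 60), rep("trial.mat", 2)),
  object_path = c(paste0("var", 1:60), "/signal", "/fs"),
  stringsAsFactors = FALSE
)

# Count inventoried objects per file and flag likely workspace dumps (> 50 objects)
object_counts <- table(results$filename)
dump_candidates <- names(object_counts)[object_counts > 50]
print(dump_candidates)

# Codebook skeleton: one row per object, with an empty column for the depositor to fill in
codebook <- unique(results[, c("filename", "object_path")])
codebook$description <- NA_character_
print(nrow(codebook))
```

The codebook skeleton can be written out with write.csv() and sent to the depositor alongside the list of flagged files.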

19.7.1 Notes and limitations

MATLAB .mat files vary widely in format and complexity. Keep the following limitations in mind:

  • Very large datasets may not be fully loaded and may be summarized based on sampling.
  • Deeply nested MATLAB structs or objects may require custom handling.
  • Character encoding in .mat files may differ across MATLAB versions.
  • Complex cell arrays can contain heterogeneous types that cannot be summarized numerically.
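
Because very large arrays may not fit in memory, it can help to estimate their size from the dimensions column before attempting to load them. The sketch below assumes 8 bytes per element (double precision), which overestimates smaller integer types. The helper name estimate_size() is hypothetical.

```r
# Sketch: estimate element count and in-memory size from a dimensions string.
# Assumption: every element takes 8 bytes (double precision); other types differ.
estimate_size <- function(dim_string, bytes_per_element = 8) {
  # Pull out every run of digits, e.g. "3000 x 180 x 180" -> 3000, 180, 180
  dims <- as.numeric(regmatches(dim_string, gregexpr("[0-9]+", dim_string))[[1]])
  if (length(dims) == 0) {
    return(c(n_dims = NA, n_elements = NA, approx_mb = NA))
  }
  n_elements <- prod(dims)
  c(
    n_dims     = length(dims),
    n_elements = n_elements,
    approx_mb  = n_elements * bytes_per_element / 1e6
  )
}

# Dimensions taken from the preprocessed.mat row of the report above
est <- estimate_size("3000 x 180 x 180")
print(est)
```

For this array the estimate is 97,200,000 elements, or roughly 778 MB if every element were held as a double.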

19.8 Additional Tools

  • rhdf5 (Bioconductor): This library allows R users to browse and read v7.3 MAT-files as if they were standard HDF5 archives.

  • GNU Octave: An open-source, high-level language that is largely compatible with MATLAB. It can read and write most MAT-files without a MATLAB license.

  • HDFView: This is a visual tool to browse the contents of v7.3 MAT-files without needing code.

19.9 Using the Non-Interactive R Script

For users who want to run this analysis on a server, in a batch job, or from the command line, here is a pure R script that performs the same process.

Download the R Script: Inspect_mat_Script.R

19.9.1 Example HPC Submission Script (Inspect_mat_Script.sh)

#!/bin/bash
#SBATCH --job-name=matlab_inspect
#SBATCH --cpus-per-task=2
#SBATCH --mem=8G
#SBATCH --time=01:00:00
#SBATCH --output=matlab_inspect_%j.out
#SBATCH --error=matlab_inspect_%j.err

module load R/4.3.0   # adapt to your cluster's module system

Rscript Inspect_mat_Script.R \
  --target_dir /path/to/your/mat/files \
  --output_dir /path/to/output/folder \
  --max_full_numeric 1000000 \
  --max_sample_numeric 100000
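
The downloadable script is not reproduced here. As a rough illustration only, flags in the "--name value" style used above could be parsed in base R as follows. parse_cli_args() is a hypothetical helper, and the actual script may handle its arguments differently.

```r
# Sketch: parse "--name value" pairs such as those passed by the submission script.
# parse_cli_args() is a hypothetical helper, not part of the downloadable script.
parse_cli_args <- function(args, defaults = list()) {
  opts <- defaults
  i <- 1
  while (i <= length(args)) {
    is_flag <- startsWith(args[i], "--")
    if (!is_flag || i == length(args)) {
      stop("Expected '--name value' pairs; problem at: ", args[i])
    }
    # Strip the leading dashes and store the following token as the value
    opts[[sub("^--", "", args[i])]] <- args[i + 1]
    i <- i + 2
  }
  opts
}

# In a real script, pass commandArgs(trailingOnly = TRUE) instead of this example vector
example_args <- c("--target_dir", "/path/to/your/mat/files", "--max_full_numeric", "1000000")
opts <- parse_cli_args(example_args, defaults = list(output_dir = "Results/Inspect_mat"))

# Command-line values arrive as character strings, so convert numeric options explicitly
opts$max_full_numeric <- as.numeric(opts$max_full_numeric)
str(opts)
```

Supplying defaults keeps the script usable when a flag is omitted, and failing on malformed input avoids silently analyzing the wrong directory in a batch job.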
  

19.10 References