Code
# install.packages(c("tidyverse", "readxl", "rstudioapi", "tidyxl"))Excel files are complex “containers” (Zip-compressed XML) that hold much more than raw data. They are considered high-risk formats for long-term preservation.
Unpack the complex structure of Excel spreadsheets. Our objective is to identify hidden content, validate structural integrity, and detect external dependencies that may compromise the file’s future usability.
Proprietary macros (VBA), external cell links that break upon transfer, and hidden sheets containing sensitive scratch work or PII are significant threats to data integrity and privacy in archival settings.
This notebook combines two approaches:
We use readxl for data reading and tidyxl for deep structural inspection of modern Excel files.
If you don’t have these packages installed, run this command once in your R console:
# install.packages(c("tidyverse", "readxl", "rstudioapi", "tidyxl"))This chunk loads all the necessary libraries for the session.
library(tidyverse)
library(readxl) # Standard data reader
library(tidyxl) # Structural inspector (Hidden sheets, formatting)
library(rstudioapi) # For directory selectionThis section identifies the directory to be analyzed and finds all Excel files within it.
Note: If you are running this interactively in RStudio, a dialog box will appear. If you are rendering this document (where no user interaction is possible), it will default to the
params$target_dirdefined in the YAML header.
# 1. Try to select interactively if in RStudio
if (interactive() && .Platform$OS.type == "windows") {
selected_dir <- rstudioapi::selectDirectory(caption = "Select Excel Directory")
} else {
selected_dir <- NULL
}
# 2. Logic to determine final directory (Interactive vs Parameter)
if (!is.null(selected_dir)) {
target_dir <- selected_dir
} else {
target_dir <- params$target_dir
}
print(paste("Analyzing directory:", target_dir))[1] "Analyzing directory: data/Inspect_xlsx/"
Now we scan the selected directory for all .xls and .xlsx files.
# Find all Excel files using a regex pattern
excel_files <- list.files(
path = target_dir,
pattern = "\\.xlsx?$", # Matches .xls or .xlsx
recursive = TRUE,
full.names = TRUE,
ignore.case = TRUE
)
# Print the number of files found and show the first few paths
print(paste("Found", length(excel_files), "Excel files."))[1] "Found 3 Excel files."
head(excel_files)[1] "data/Inspect_xlsx//41467_2025_65055_MOESM9_ESM.xlsx"
[2] "data/Inspect_xlsx//41467_2025_65141_MOESM4_ESM.xlsx"
[3] "data/Inspect_xlsx//41467_2025_65179_MOESM4_ESM.xlsx"
In this section, we iterate through each file. For .xlsx files, we perform a deep scan for hidden content. For legacy .xls, we rely on basic extraction.
message("Generating Excel Structure Report...")
# Helper: Analyze Structure (Hidden Sheets, Links)
analyze_structure <- function(fp) {
is_modern <- grepl("\\.xlsx$|\\.xlsm$", fp, ignore.case = TRUE)
is_macro <- grepl("\\.xlsm$", fp, ignore.case = TRUE)
hidden_sheets <- "Unknown (Legacy)"
has_ext_links <- "Unknown (Legacy)"
if (is_modern) {
tryCatch({
# tidyxl allows us to see sheet state (visible/hidden)
sheet_meta <- tidyxl::xlsx_sheet_names(fp)
hidden_count <- sum(sheet_meta$state != "visible")
hidden_sheets <- if (hidden_count > 0) paste0(hidden_count, " Hidden") else "None"
# Check for external links (naive check in cells is slow, sticking to sheets/macros)
# We could scan formulas, but for speed we focus on sheets here.
}, error = function(e) {
hidden_sheets <- "Scan Failed"
})
}
return(list(
Macros = is_macro,
Hidden = hidden_sheets,
Format = if (is_modern) "XML (Modern)" else "Binary (Legacy)"
))
}
# Main Loop
all_sheets_summary <- list()
for (file_path in excel_files) {
file_name <- basename(file_path)
# 1. Structural Scan
struct <- analyze_structure(file_path)
tryCatch({
sheet_names <- readxl::excel_sheets(file_path)
if (length(sheet_names) == 0) {
all_sheets_summary[[length(all_sheets_summary) + 1]] <- tibble(
FileName = file_name, SheetName = "(No Sheets)", Dimensions = "0 x 0",
HeaderPreview = "(Empty)", IsMacro = struct$Macros,
HiddenSheets = struct$Hidden, Format = struct$Format, Status = "Empty"
)
next
}
# 2. Content Scan (Sheet by Sheet)
for (sheet in sheet_names) {
tryCatch({
# Read minimal data for dimensions
# Suppress messages about 'new names'
full_data <- suppressMessages(read_excel(file_path, sheet = sheet, .name_repair = "minimal"))
dims <- paste(dim(full_data), collapse = " x ")
# Read header for preview
preview <- suppressMessages(read_excel(file_path, sheet = sheet, n_max = 5, .name_repair = "minimal"))
col_names <- colnames(preview)
# Heuristic: Is it non-tabular? (Default columns ...1, ...2 or single column)
is_messy <- any(grepl("^\\.\\.\\.\\d+$", col_names)) || length(col_names) <= 1
header_str <- paste(head(col_names, 5), collapse = " | ")
if (length(col_names) > 5) header_str <- paste(header_str, "...")
all_sheets_summary[[length(all_sheets_summary) + 1]] <- tibble(
FileName = file_name,
SheetName = sheet,
Dimensions = dims,
HeaderPreview = header_str,
IsMacro = struct$Macros,
HiddenSheets = struct$Hidden,
Format = struct$Format,
LikelyNonTabular = is_messy,
Status = "Success"
)
}, error = function(e) {
all_sheets_summary[[length(all_sheets_summary) + 1]] <- tibble(
FileName = file_name, SheetName = sheet, Dimensions = "Error",
HeaderPreview = e$message, IsMacro = struct$Macros,
HiddenSheets = struct$Hidden, Format = struct$Format,
LikelyNonTabular = TRUE, Status = "Read Failed"
)
})
}
}, error = function(e) {
all_sheets_summary[[length(all_sheets_summary) + 1]] <- tibble(
FileName = file_name, SheetName = "Error", Dimensions = "Error",
HeaderPreview = e$message, IsMacro = struct$Macros,
HiddenSheets = struct$Hidden, Format = struct$Format,
LikelyNonTabular = TRUE, Status = "File Failed"
)
})
}
summary_report <- bind_rows(all_sheets_summary)
# Display preview
print("--- Excel Structure Report ---")[1] "--- Excel Structure Report ---"
head(summary_report)# A tibble: 6 × 9
FileName SheetName Dimensions HeaderPreview IsMacro HiddenSheets Format
<chr> <chr> <chr> <chr> <lgl> <chr> <chr>
1 41467_2025_650… Fig2a-FA 48 x 5 White matter… FALSE Unknown (Le… XML (…
2 41467_2025_650… Fig2b-FA 11042 x 3 Count_Greate… FALSE Unknown (Le… XML (…
3 41467_2025_650… Fig2a-GM 139 x 5 Grey matter … FALSE Unknown (Le… XML (…
4 41467_2025_650… Fig2b-GMV 11042 x 3 Count_Greate… FALSE Unknown (Le… XML (…
5 41467_2025_650… Fig3b 82 x 8 Category | G… FALSE Unknown (Le… XML (…
6 41467_2025_650… Fig3c 11 x 10 Category | I… FALSE Unknown (Le… XML (…
# ℹ 2 more variables: LikelyNonTabular <lgl>, Status <chr>
output_dir <- "Results/Inspect_xlsx"
dir.create(output_dir, recursive = TRUE, showWarnings = FALSE)
output_file <- file.path(output_dir, paste0("Excel_Structure_", format(Sys.Date(), "%Y%m%d"), ".csv"))
write.csv(summary_report, output_file, row.names = FALSE)
print(paste("Report saved to:", output_file))[1] "Report saved to: Results/Inspect_xlsx/Excel_Structure_20260515.csv"
Use the generated report to identify high-risk files:
Macros (IsMacro = TRUE): Macros are a risk for security and prone to obsolescence. .xlsm files contain code that may not run in the future or may be blocked by security software. Curators can ask depositors to remove macros and save as .xlsx unless the code is essential (in which case, the code can be archived as text).
Hidden Sheets: Hidden data often contains PII, messy scratchpads, or deprecated versions of the data. Curators can inspect these sheets manually, and if they are not needed, delete them to clean the dataset.
Likely Non-Tabular (LikelyNonTabular = TRUE): This implies that thw sheet is likely a report, chart, or form, not a raw data table. These sheets generally cannot be converted to CSV automatically and require manual review. To keep tabular data tidy, graphs and images must be removed and saved apart.
LibreOffice Calc: Excellent open-source tool for batch converting legacy .xls files to modern open formats (.ods or .xlsx) via command line.
OpenRefine: is a tool for cleaning messy tabular data (e.g., untangling merged cells or fixing inconsistent headers) before archiving.
VBA Remover: These are a set of different tools to strip VBA from Excel files to sanitize them for preservation. Several are available on-line.
For users who want to run this analysis on a server or from the command line, here is a pure R script that performs the same process.
This “hybrid” script will: 1. If run interactively (e.g., in RStudio), pop up a dialog to select a directory. 2. If run non-interactively (e.g., on an HPC cluster), accept a directory path as a command-line argument. 3. Process all .xls and .xlsx files found in that directory.
Download the R Script: Inspect_xlsx_Script.R
submit_excel_check.sh)#!/bin/bash
#SBATCH --job-name=excel_check
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --time=00:30:00
#SBATCH --mem=8G
#SBATCH --output=logs/excel_check_%j.log
module load R
# Define target directory
TARGET_DIR="/scratch/user/project_data/financials"
# Prepare folders
mkdir -p Results/Inspect_xlsx
mkdir -p logs
# Run
echo "Starting Excel Inspection on $TARGET_DIR"
Rscript Inspect_xlsx_Script.R "$TARGET_DIR"