XML (Extensible Markup Language) is a standard for storing structured data. In curation, we distinguish between “well-formedness” (the file obeys XML syntax rules) and “validity” (the file also conforms to a schema).
Ensure structural and semantic validity. Our objective is to verify that XML files are syntactically well-formed and conform to specific schemas (XSD), ensuring metadata adheres to community standards like DDI, TEI, or Dublin Core.
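The difference between the two checks can be illustrated with a minimal sketch using xml2. The inline documents and the toy schema below are invented for illustration and are not part of the dataset analyzed in this notebook.

```r
library(xml2)

# Hypothetical toy schema: a <record> must contain exactly one <title>.
schema <- read_xml('<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="record">
    <xs:complexType>
      <xs:sequence>
        <xs:element name="title" type="xs:string"/>
      </xs:sequence>
    </xs:complexType>
  </xs:element>
</xs:schema>')

# Not well-formed: <title> is never closed, so parsing itself fails.
broken <- tryCatch(
  read_xml("<record><title>Survey</record>"),
  error = function(e) NULL
)
is_well_formed <- !is.null(broken)

# Well-formed but invalid: <name> is not allowed by the toy schema.
doc_invalid <- read_xml("<record><name>Survey</name></record>")
res_invalid <- xml_validate(doc_invalid, schema)

# Well-formed and valid.
doc_valid <- read_xml("<record><title>Survey</title></record>")
res_valid <- xml_validate(doc_valid, schema)

# xml_validate() returns a logical; validator messages are stored
# in its "errors" attribute.
attr(res_invalid, "errors")
```

A file that fails the first check cannot be validated at all, which is why the report further down records parsing status and schema validation in separate columns.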
Broken nesting structures, missing namespaces, and “XML Bombs” (recursive expansion risks) can crash archival parsers and lead to the permanent loss of structural context.
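One cautious way to screen for entity-expansion risks before handing a file to a parser is to scan the raw text for entity declarations. The sketch below is an assumption about how such a pre-check could look and is not part of this notebook's workflow. `has_entity_declarations()` is a hypothetical helper, and the plain-text match is only a heuristic: it also flags harmless entities and will miss declarations in files whose encoding is not ASCII-compatible (for example UTF-16).

```r
# Hypothetical helper: TRUE if the file text contains an <!ENTITY declaration.
has_entity_declarations <- function(path) {
  lines <- readLines(path, warn = FALSE)
  # useBytes avoids errors on lines that are invalid in the current locale
  any(grepl("<!ENTITY", lines, fixed = TRUE, useBytes = TRUE))
}

# Demo files, invented for illustration
safe_file <- tempfile(fileext = ".xml")
writeLines("<root><item>plain</item></root>", safe_file)

risky_file <- tempfile(fileext = ".xml")
writeLines(c(
  '<?xml version="1.0"?>',
  '<!DOCTYPE root [',
  '  <!ENTITY a "aaaa">',
  '  <!ENTITY b "&a;&a;&a;&a;">',
  ']>',
  '<root>&b;</root>'
), risky_file)

has_entity_declarations(safe_file)   # FALSE
has_entity_declarations(risky_file)  # TRUE
```

Flagged files can then be reviewed by hand before they are parsed.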
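This notebook performs a structural analysis of each file: it checks well-formedness by parsing, validates against any accompanying .xsd files, and summarizes the structure (root node, namespaces, and nesting depth).
We use the xml2 package, which is a wrapper around the comprehensive libxml2 C library.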
If you do not have the required packages, run this command once in your R console:
# install.packages(c("tidyverse", "xml2", "rstudioapi"))
library(tidyverse)
library(xml2) # Parsing and Validation
library(rstudioapi) # Directory selection
Select the folder containing the XML files.
Note: If running interactively in RStudio on Windows, a dialog box will appear. Otherwise, the directory defaults to the target_dir parameter.
# 1. Try to select interactively if in RStudio
if (interactive() && rstudioapi::isAvailable() && .Platform$OS.type == "windows") {
  selected_dir <- rstudioapi::selectDirectory(caption = "Select XML Directory")
} else {
  selected_dir <- NULL
}
# 2. Logic to determine final directory
if (!is.null(selected_dir)) {
  target_dir <- selected_dir
} else {
  target_dir <- params$target_dir
}
print(paste("Analyzing directory:", target_dir))
[1] "Analyzing directory: data/Inspect_xml/"
We scan for XML files and any accompanying XSD schemas.
# Find XML files
xml_files <- list.files(
  path = target_dir,
  pattern = "\\.xml$",
  recursive = TRUE,
  full.names = TRUE,
  ignore.case = TRUE
)
# Find XSD Schemas
xsd_files <- list.files(
  path = target_dir,
  pattern = "\\.xsd$",
  recursive = TRUE,
  full.names = TRUE,
  ignore.case = TRUE
)
print(paste("Found", length(xml_files), "XML files."))
[1] "Found 2 XML files."
print(paste("Found", length(xsd_files), "XSD Schema files."))
[1] "Found 0 XSD Schema files."
# Use the first XSD found for validation (if any)
active_schema <- if (length(xsd_files) > 0) read_xml(xsd_files[1]) else NULL
if (!is.null(active_schema)) message(paste("Using Schema:", basename(xsd_files[1])))
This code iterates through the files and extracts their internal structure.
message("Generating XML Report...")
# Helper: Calculate Max Depth (Recursive)
get_max_depth <- function(node) {
  children <- xml_children(node)
  if (length(children) == 0) return(1)
  return(1 + max(sapply(children, get_max_depth)))
}
report <- purrr::map_dfr(xml_files, function(file_path) {
  fname <- basename(file_path)
  tryCatch({
    # 1. Parse (Well-Formedness Check)
    doc <- read_xml(file_path)
    # 2. Validation (if Schema exists)
    validity <- "Not Checked (No XSD)"
    if (!is.null(active_schema)) {
      is_valid <- xml_validate(doc, active_schema)
      validity <- if (is_valid) "Valid" else "Invalid (Schema Violation)"
    }
    # 3. Structure Analysis
    root <- xml_root(doc)
    root_name <- xml_name(root)
    num_children <- length(xml_children(root))
    # Namespaces
    ns <- xml_ns(doc)
    ns_str <- if (length(ns) > 0) paste(names(ns), collapse = ", ") else "None"
    # Complexity (Depth)
    # Note: For huge files, the recursive depth calculation can be slow.
    # Any error it raises (e.g. from excessive recursion) is caught and reported as "Error".
    max_depth <- tryCatch(get_max_depth(root), error = function(e) "Error")
    tibble(
      FileName = fname,
      Status = "Well-Formed",
      SchemaValidation = validity,
      RootNode = root_name,
      Namespaces = ns_str,
      MaxDepth = as.character(max_depth),
      DirectChildren = num_children
    )
  }, error = function(e) {
    # Typed NAs keep the column types consistent with the success branch
    tibble(
      FileName = fname,
      Status = "Parsing Failed (Not Well-Formed)",
      SchemaValidation = "Failed",
      RootNode = "Error",
      Namespaces = "",
      MaxDepth = NA_character_,
      DirectChildren = NA_integer_
    )
  })
})
# Display preview
print("--- XML Structure Report ---")
[1] "--- XML Structure Report ---"
head(report)
# A tibble: 2 × 7
  FileName   Status SchemaValidation RootNode Namespaces MaxDepth DirectChildren
  <chr>      <chr>  <chr>            <chr>    <chr>      <chr>             <int>
1 LaurentEt… Well-… Not Checked (No… PAMData… None       6                    31
2 LaurentEt… Well-… Not Checked (No… PAMData… None       7                     5
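For very large or very deep documents, a recursive depth helper can be slow or run into R's nesting limits. As an alternative sketch, the depth can be read off the XPath of each element as returned by `xml_path()`, assuming that counting the `/` separators is an acceptable proxy for nesting level. `max_depth_by_path()` is a hypothetical name and is not used elsewhere in this notebook.

```r
library(xml2)

# Hypothetical non-recursive alternative: depth = number of "/" in the node path.
max_depth_by_path <- function(doc) {
  elements <- xml_find_all(doc, "//*")   # every element in the document
  paths <- xml_path(elements)            # e.g. "/a/b/c"
  depths <- lengths(gregexpr("/", paths, fixed = TRUE))
  max(depths)
}

# Toy document, invented for illustration: a > b > c is three levels deep.
doc <- read_xml("<a><b><c/></b><d/></a>")
max_depth_by_path(doc)  # 3
```

On this toy document the result matches what a recursive count of nested elements gives, with the root counted as level 1.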
Save the report to a CSV file for review.
output_dir <- "Results/Inspect_xml"
dir.create(output_dir, recursive = TRUE, showWarnings = FALSE)
output_file <- file.path(output_dir, paste0("XML_Structure_", format(Sys.Date(), "%Y%m%d"), ".csv"))
write.csv(report, output_file, row.names = FALSE)
print(paste("Report saved to:", output_file))
[1] "Report saved to: Results/Inspect_xml/XML_Structure_20260515.csv"
Use the generated CSV to guide your preservation actions:
Status (“Parsing Failed”): If parsing fails, the file may be corrupted or contain syntax errors (e.g., unclosed tags). Curators may need to discuss the validity of these files with the depositors.
Namespaces (Namespaces column): It is advisable to look for standard prefixes such as dc (Dublin Core) or mods. If a specific standard is detected, verify that the file strictly adheres to it (schema validation).
Complexity (MaxDepth > 50): Deeply nested XML files can be difficult to process with standard XSLT tools. Curators can inspect these files manually.
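The checks above can be sketched as a simple filter over the report. The data frame below is a toy stand-in with invented values that reuses the report's column names, and the `NeedsReview` column is a hypothetical addition rather than part of the saved CSV.

```r
# Toy report with the same column names as the notebook's output (values invented).
report <- data.frame(
  FileName = c("a.xml", "b.xml", "c.xml"),
  Status = c("Well-Formed", "Parsing Failed (Not Well-Formed)", "Well-Formed"),
  SchemaValidation = c("Valid", "Failed", "Invalid (Schema Violation)"),
  Namespaces = c("dc", "", "None"),
  MaxDepth = c("6", NA, "72"),
  stringsAsFactors = FALSE
)

# MaxDepth is stored as text (it may hold "Error"), so convert defensively.
depth <- suppressWarnings(as.integer(report$MaxDepth))

# Flag rows that match any of the criteria described above.
report$NeedsReview <- report$Status != "Well-Formed" |
  report$SchemaValidation == "Invalid (Schema Violation)" |
  (!is.na(depth) & depth > 50)

report[report$NeedsReview, c("FileName", "Status", "SchemaValidation", "MaxDepth")]
```

Here `b.xml` is flagged because it failed to parse, and `c.xml` because it is both schema-invalid and deeper than 50 levels.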
Oxygen XML Editor: A standard IDE for XML development. It provides visual schema validation and XSLT debugging.
xmllint: A command-line tool, also part of libxml2, for checking syntax and validating against schemas in batch.
For users who want to run this analysis on a server (HPC), in a batch job, or from the command line, here is the pure R script version.
Download the R Script: Inspect_xml_Script.R
Batch submission script (Inspect_xml_submit.sh):
#!/bin/bash
#SBATCH --job-name=xml_check
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --time=00:15:00
#SBATCH --mem=4G
#SBATCH --output=logs/xml_check_%j.log
module load R
# Define target directory
TARGET_DIR="/scratch/user/project_data/metadata"
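# Prepare folders
# Note: logs/ must already exist when the job is submitted, because Slurm opens the --output file before this script starts.
mkdir -p Results/Inspect_xml
mkdir -p logs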
# Run
echo "Starting XML Inspection on $TARGET_DIR"
Rscript Inspect_xml_Script.R "$TARGET_DIR"
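The contents of Inspect_xml_Script.R are not shown here. As a sketch only, a script invoked this way could pick up the directory from its first command-line argument with `commandArgs()`. The helper name `resolve_target_dir()` is an assumption for illustration, and the fallback path simply reuses the directory shown in the notebook output above.

```r
# Hypothetical helper: use the first command-line argument if present,
# otherwise fall back to a default directory.
resolve_target_dir <- function(args = commandArgs(trailingOnly = TRUE),
                               default = "data/Inspect_xml/") {
  if (length(args) >= 1 && nzchar(args[1])) args[1] else default
}

target_dir <- resolve_target_dir()

# Report a missing directory without stopping the run.
if (!dir.exists(target_dir)) {
  message("Target directory not found: ", target_dir)
}
```

Passing the arguments into the helper rather than reading them inside it keeps the logic easy to test without launching `Rscript`.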