Author

Daniel Manrique-Castano

Published

December 18, 2025

16.1 Overview

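XML (Extensible Markup Language) is a standard for storing structured data. In curation, we distinguish between “well-formedness” (the file obeys XML syntax rules, such as properly nested and closed tags) and “validity” (a well-formed file that also conforms to a schema, such as an XSD).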
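
The well-formedness half of that distinction can be illustrated with a minimal sketch. It assumes the xml2 package is installed; the helper name `is_well_formed` and the two toy strings are made up for this example and are not part of the notebook.

```r
library(xml2)

# Hypothetical helper: TRUE if the text parses, FALSE if the parser raises an error.
is_well_formed <- function(xml_text) {
  tryCatch({
    read_xml(xml_text)
    TRUE
  }, error = function(e) FALSE)
}

well_formed <- "<record><title>Survey</title></record>"
malformed   <- "<record><title>Survey</record>"  # <title> is never closed

is_well_formed(well_formed)
is_well_formed(malformed)
```

A file that passes this check can still be invalid: validity is only decided once the document is compared against a schema.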

Note: Curation Goal

Ensure structural and semantic validity. Our objective is to verify that XML files are syntactically well-formed and conform to specific schemas (XSD), ensuring metadata adheres to community standards like DDI, TEI, or Dublin Core.

Warning: Preservation Risk

Broken nesting structures, missing namespace declarations, and “XML bombs” (nested entity definitions that expand to an enormous size when parsed) can crash archival parsers and lead to the permanent loss of structural context.
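
One cautious way to spot candidates for entity-expansion problems is to scan the raw text for DOCTYPE and ENTITY declarations before handing the file to a parser. The sketch below is a heuristic in base R, not a security control: the function name `scan_for_entities`, the 200-line window, and the temporary demo files are all assumptions made for illustration.

```r
# Hypothetical helper: look at the first lines of a file for declarations
# that could define custom entities. XML keywords are case-sensitive,
# so a fixed, upper-case match is sufficient for this rough check.
scan_for_entities <- function(path, n_lines = 200) {
  head_lines <- readLines(path, n = n_lines, warn = FALSE)
  c(
    has_doctype = any(grepl("<!DOCTYPE", head_lines, fixed = TRUE)),
    has_entity  = any(grepl("<!ENTITY", head_lines, fixed = TRUE))
  )
}

# Demo on two temporary files (toy content)
plain_file <- tempfile(fileext = ".xml")
writeLines("<record><title>Survey</title></record>", plain_file)

entity_file <- tempfile(fileext = ".xml")
writeLines(c(
  '<?xml version="1.0"?>',
  '<!DOCTYPE record [ <!ENTITY a "text"> ]>',
  "<record>&a;</record>"
), entity_file)

plain_flags  <- scan_for_entities(plain_file)
entity_flags <- scan_for_entities(entity_file)
plain_flags
entity_flags
```

A flagged file is not necessarily malicious, since many legitimate documents declare a DOCTYPE; the flag only marks files that deserve a closer look before parsing.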

This notebook performs a four-step structural analysis:

  1. Parse Check: Verifying well-formedness.
  2. Schema Validation: Validating XML files against accompanying .xsd files.
  3. Complexity Scan: Calculating Max Nesting Depth.
  4. Namespace Inventory: Extracting declared standards and vocabularies.
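
As an illustration of the namespace inventory in step 4, the sketch below lists the prefixes and URIs declared in a toy document and matches the URIs against a small lookup table. It assumes xml2 is installed; the toy document and the `known_standards` table are illustrative and far from exhaustive.

```r
library(xml2)

# Toy document declaring the Dublin Core element namespace
doc <- read_xml(
  '<record xmlns:dc="http://purl.org/dc/elements/1.1/">
     <dc:title>Survey</dc:title>
   </record>'
)

# xml_ns() returns a named character vector: names are prefixes, values are URIs
ns <- xml_ns(doc)
inventory <- data.frame(
  prefix = names(ns),
  uri = unname(as.character(ns)),
  stringsAsFactors = FALSE
)

# Illustrative lookup table (an assumption for this sketch)
known_standards <- c(
  "http://purl.org/dc/elements/1.1/" = "Dublin Core",
  "http://www.loc.gov/mods/v3"       = "MODS",
  "http://www.tei-c.org/ns/1.0"      = "TEI"
)
inventory$standard <- unname(known_standards[inventory$uri])
inventory
```

Matching on the URI rather than the prefix is a deliberate choice here: prefixes are arbitrary labels chosen by the file's author, while the URI identifies the vocabulary.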

16.2 Setup

We use the xml2 package, which is a wrapper around the comprehensive libxml2 C library.

16.2.1 R Packages

If you do not have the required packages, uncomment the line below and run it once in your R console:

Code
# install.packages(c("tidyverse", "xml2", "rstudioapi"))

16.2.2 Load libraries

Code
library(tidyverse)
library(xml2)       # Parsing and Validation
library(rstudioapi) # Directory selection

16.3 Select Target Directory

Select the folder containing the XML files.

Note: If running interactively in RStudio on Windows, a dialog box will appear. Otherwise, the directory defaults to the target_dir parameter.

Code
# 1. Try to select interactively (only in an interactive RStudio session on Windows)
if (interactive() && .Platform$OS.type == "windows" && rstudioapi::isAvailable()) {
  selected_dir <- rstudioapi::selectDirectory(caption = "Select XML Directory")
} else {
  selected_dir <- NULL
}

# 2. Logic to determine final directory
if (!is.null(selected_dir)) {
  target_dir <- selected_dir
} else {
  target_dir <- params$target_dir
}

print(paste("Analyzing directory:", target_dir))
[1] "Analyzing directory: data/Inspect_xml/"

16.4 Inventory of files and schemas

We scan for XML files and any accompanying XSD schemas.

Code
# Find XML files
xml_files <- list.files(
  path = target_dir,
  pattern = "\\.xml$", 
  recursive = TRUE, 
  full.names = TRUE, 
  ignore.case = TRUE
)

# Find XSD Schemas
xsd_files <- list.files(
  path = target_dir,
  pattern = "\\.xsd$", 
  recursive = TRUE, 
  full.names = TRUE, 
  ignore.case = TRUE
)

print(paste("Found", length(xml_files), "XML files."))
[1] "Found 2 XML files."
Code
print(paste("Found", length(xsd_files), "XSD Schema files."))
[1] "Found 0 XSD Schema files."
Code
# Use the first XSD found for validation (if any); it is applied to every XML file
active_schema <- if (length(xsd_files) > 0) read_xml(xsd_files[1]) else NULL
if (!is.null(active_schema)) message(paste("Using Schema:", basename(xsd_files[1])))
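
When a schema is available, the result of `xml_validate()` carries more than a TRUE/FALSE flag: the parser messages are attached as an `errors` attribute. The sketch below shows how those messages could be read. The schema and document are toy examples invented for this illustration, and xml2 is assumed to be installed.

```r
library(xml2)

# Toy schema (hypothetical): a <study> must contain a <title> and an integer <year>.
schema <- read_xml(
  '<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
     <xs:element name="study">
       <xs:complexType>
         <xs:sequence>
           <xs:element name="title" type="xs:string"/>
           <xs:element name="year" type="xs:integer"/>
         </xs:sequence>
       </xs:complexType>
     </xs:element>
   </xs:schema>'
)

# Well-formed, but <year> is not an integer
doc <- read_xml("<study><title>Survey</title><year>unknown</year></study>")

result <- xml_validate(doc, schema)
is_valid <- as.logical(result)           # drop the attribute, keep TRUE/FALSE
schema_errors <- attr(result, "errors")  # character vector of validation messages

is_valid
schema_errors
```

Recording these messages alongside the file name gives depositors something concrete to fix, rather than a bare "Invalid" label.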

16.5 Generate Data Dictionary

This code iterates through the files and extracts their internal structure.

Code
message("Generating XML Report...")

# Helper: Calculate Max Depth (Recursive)
get_max_depth <- function(node) {
  children <- xml_children(node)
  if (length(children) == 0) return(1)
  return(1 + max(sapply(children, get_max_depth)))
}

report <- purrr::map_dfr(xml_files, function(file_path) {
  
  fname <- basename(file_path)
  
  tryCatch({
    # 1. Parse (Well-Formedness Check)
    doc <- read_xml(file_path)
    
    # 2. Validation (if Schema exists)
    validity <- "Not Checked (No XSD)"
    if (!is.null(active_schema)) {
      is_valid <- xml_validate(doc, active_schema)
      validity <- if (is_valid) "Valid" else "Invalid (Schema Violation)"
    }
    
    # 3. Structure Analysis
    root <- xml_root(doc)
    root_name <- xml_name(root)
    num_children <- length(xml_children(root))
    
    # Namespaces
    ns <- xml_ns(doc)
    ns_str <- if (length(ns) > 0) paste(names(ns), collapse = ", ") else "None"
    
    # Complexity (Depth)
    # Note: For very large or deeply nested files, the recursive depth calculation
    # can be slow or fail; any error is caught and recorded as "Error".
    max_depth <- tryCatch(get_max_depth(root), error = function(e) "Error")
    
    tibble(
      FileName = fname,
      Status = "Well-Formed",
      SchemaValidation = validity,
      RootNode = root_name,
      Namespaces = ns_str,
      MaxDepth = as.character(max_depth),
      DirectChildren = num_children
    )
    
  }, error = function(e) {
    tibble(
      FileName = fname,
      Status = "Parsing Failed (Not Well-Formed)",
      SchemaValidation = "Not Checked (Parse Error)",
      RootNode = "Error",
      Namespaces = "",
      MaxDepth = NA_character_,
      DirectChildren = NA_integer_
    )
  })
})

# Display preview
print("--- XML Structure Report ---")
[1] "--- XML Structure Report ---"
Code
head(report)
# A tibble: 2 × 7
  FileName   Status SchemaValidation RootNode Namespaces MaxDepth DirectChildren
  <chr>      <chr>  <chr>            <chr>    <chr>      <chr>             <int>
1 LaurentEt… Well-… Not Checked (No… PAMData… None       6                    31
2 LaurentEt… Well-… Not Checked (No… PAMData… None       7                     5
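
The recursive helper works well for documents of moderate size. As an alternative sketch, depth can also be derived without recursion by asking xml2 for the path of every element and counting its segments. The function name `max_depth_by_path` and the toy document are assumptions for illustration; the notebook itself uses `get_max_depth()`.

```r
library(xml2)

# Hypothetical non-recursive alternative: the depth of an element equals
# the number of segments in its path, e.g. "/a/b/c" has depth 3.
max_depth_by_path <- function(doc) {
  elements <- xml_find_all(doc, "//*")
  paths <- xml_path(elements)
  # "/a/b/c" splits into "", "a", "b", "c"; subtract the leading empty string
  max(lengths(strsplit(paths, "/", fixed = TRUE)) - 1L)
}

doc <- read_xml("<a><b><c/></b><d/></a>")
toy_depth <- max_depth_by_path(doc)
toy_depth
```

Like the recursive version, this counts the root element as depth 1 and ignores text nodes, so the two approaches should agree on the same document.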

16.6 Save Results

Save the dictionary to a CSV file for review.

Code
output_dir <- "Results/Inspect_xml"
dir.create(output_dir, recursive = TRUE, showWarnings = FALSE)

output_file <- file.path(output_dir, paste0("XML_Structure_", format(Sys.Date(), "%Y%m%d"), ".csv"))

write.csv(report, output_file, row.names = FALSE)
print(paste("Report saved to:", output_file))
[1] "Report saved to: Results/Inspect_xml/XML_Structure_20260515.csv"

16.7 Curation Insights

Use the generated CSV to guide your preservation actions:

  • Status (“Parsing Failed”): A parsing failure may indicate that the file is corrupted or contains syntax errors (e.g., unclosed tags). Curators may need to discuss the validity of these files with depositors.

  • Namespaces (Namespaces column): It is advisable to look for standard prefixes such as dc (Dublin Core) or mods. If a specific standard is detected, verify that the file strictly adheres to it through schema validation.

  • Complexity (MaxDepth > 50): Deeply nested XML files can be difficult to process with standard XSLT tools. Curators should inspect these files manually.
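
The checks above can be turned into a short triage step. The sketch below uses a toy data frame that mimics three columns of the report (the file names and values are invented) and flags rows that failed parsing or exceed the depth threshold of 50. Because MaxDepth is stored as text, it is converted to a number first.

```r
# Toy stand-in for the report; in practice, read the saved CSV with read.csv()
report <- data.frame(
  FileName = c("a.xml", "b.xml", "c.xml"),
  Status   = c("Well-Formed", "Parsing Failed (Not Well-Formed)", "Well-Formed"),
  MaxDepth = c("6", NA, "72"),
  stringsAsFactors = FALSE
)

# MaxDepth is character; non-numeric entries such as "Error" become NA
depth <- suppressWarnings(as.integer(report$MaxDepth))

needs_review <- report[
  report$Status != "Well-Formed" | (!is.na(depth) & depth > 50),
]
needs_review$FileName
```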

16.8 Additional Tools

  • Oxygen XML Editor: A widely used IDE for XML development. It provides visual schema validation and XSLT debugging.

  • xmllint: A command-line tool, also part of libxml2, for checking syntax and validating against schemas in batch.

16.9 Using the Non-Interactive R Script

For users who want to run this analysis on a server (HPC), in a batch job, or from the command line, here is the pure R script version.

Download the R Script: Inspect_xml_Script.R

16.9.1 Example HPC Submission Script (Inspect_xml_submit.sh)

#!/bin/bash
#SBATCH --job-name=xml_check
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --time=00:15:00
#SBATCH --mem=4G
#SBATCH --output=logs/xml_check_%j.log

module load R

# Define target directory
TARGET_DIR="/scratch/user/project_data/metadata"

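# Prepare the results folder.
# Note: logs/ must already exist when the job is submitted (run `mkdir -p logs`
# before calling sbatch), because SLURM opens the --output file before this script starts.
mkdir -p Results/Inspect_xml
mkdir -p logs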

# Run
echo "Starting XML Inspection on $TARGET_DIR"
Rscript Inspect_xml_Script.R "$TARGET_DIR"
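
The submission script passes the target directory as the first positional argument. A minimal sketch of how an R script could pick it up is shown below; the function name `resolve_target_dir` is hypothetical, the default is the directory shown in the notebook output above, and the actual Inspect_xml_Script.R may handle arguments differently.

```r
# Hypothetical helper: use the first command-line argument if present,
# otherwise fall back to a default directory.
resolve_target_dir <- function(args = commandArgs(trailingOnly = TRUE),
                               default = "data/Inspect_xml/") {
  if (length(args) >= 1 && nzchar(args[1])) args[1] else default
}

target_dir <- resolve_target_dir()
message("Analyzing directory: ", target_dir)
```

Keeping the fallback in one small function makes the same script usable both from `Rscript` with an argument and from an interactive session without one.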

16.10 References