PDF is considered the standard for fixed-layout documents. However, not all PDFs are created in the same manner.
Validate the archival suitability of PDF documents. Our objective is to distinguish between high-quality PDF/A files and “Image-Only” scans, ensuring content is searchable, accessible, and free from encryption that would lock data away.
Not all PDFs are equal. Encrypted files, image-only scans without OCR, and documents with external font dependencies pose significant risks for long-term accessibility and automated text mining.
This notebook evaluates PDF files on three critical dimensions: encryption, accessibility (whether a text layer is present), and font embedding.
We use pdftools for PDF analysis and tidyverse for reporting.
# install.packages(c("tidyverse", "pdftools", "rstudioapi"))
library(tidyverse)
library(pdftools)
library(rstudioapi)
if (interactive() && .Platform$OS.type == "windows") {
  selected_dir <- rstudioapi::selectDirectory(caption = "Select PDF Directory")
} else {
  selected_dir <- NULL
}
if (!is.null(selected_dir)) {
  target_dir <- selected_dir
} else {
  target_dir <- params$target_dir
}
print(paste("Analyzing directory:", target_dir))
[1] "Analyzing directory: data/Inspect_pdf/"
pdf_files <- list.files(
  path = target_dir,
  pattern = "\\.pdf$",
  recursive = TRUE,
  full.names = TRUE,
  ignore.case = TRUE
)
print(paste("Found", length(pdf_files), "PDF files."))
[1] "Found 6 PDF files."
We iterate through the files to extract metadata, check for encryption, and sample the text content to determine its accessibility status.
message("Generating PDF Report...")
results_list <- lapply(pdf_files, function(file_path) {
  fname <- basename(file_path)
  tryCatch({
    # 1. Metadata and security
    info <- pdftools::pdf_info(file_path)
    # Record the encryption status explicitly
    is_encrypted <- isTRUE(info$encrypted)
    # 2. Text extraction
    # Extraction can fail on encrypted files, so fall back to an empty string
    first_page_text <- tryCatch(pdftools::pdf_text(file_path)[1], error = function(e) "")
    # Treat NULL, empty, or NA text as empty (NULL is checked first so is.na() always sees a value)
    if (is.null(first_page_text) || length(first_page_text) == 0 || is.na(first_page_text)) first_page_text <- ""
    char_count <- nchar(trimws(first_page_text))
    # More than 10 characters on the first page counts as having a text layer
    has_text <- char_count > 10
    # 3. Fonts (up to five unique font names)
    fonts <- tryCatch(pdftools::pdf_fonts(file_path), error = function(e) NULL)
    font_names <- if (!is.null(fonts)) paste(head(unique(fonts$name), 5), collapse = ", ") else "Unknown"
    # Helper for metadata: missing, NA, or empty values become "Unknown"
    get_meta <- function(x) {
      if (is.null(x) || length(x) == 0 || is.na(x) || x == "") return("Unknown")
      return(as.character(x))
    }
    # Return a one-row tibble with explicit column types
    tibble(
      FileName = fname,
      Pages = as.integer(info$pages),
      Encrypted = as.logical(is_encrypted),
      Author = get_meta(info$keys$Author),
      Title = get_meta(info$keys$Title),
      Creator = get_meta(info$keys$Creator),
      Created = as.character(info$created),
      HasText = as.logical(has_text),
      FirstPageChars = as.integer(char_count),
      Fonts = as.character(font_names),
      Status = "Success"
    )
  }, error = function(e) {
    # Same columns and types as the success case, so bind_rows() can combine them
    tibble(
      FileName = fname,
      Pages = NA_integer_,
      Encrypted = NA,
      Author = NA_character_,
      Title = NA_character_,
      Creator = NA_character_,
      Created = NA_character_,
      HasText = NA,
      FirstPageChars = NA_integer_,
      Fonts = NA_character_,
      Status = paste("Failed:", e$message)
    )
  })
})
report <- bind_rows(results_list)
# Display preview
print("--- PDF Report Preview ---")
[1] "--- PDF Report Preview ---"
print(head(report))
# A tibble: 6 × 11
FileName Pages Encrypted Author Title Creator Created HasText FirstPageChars
<chr> <int> <lgl> <chr> <chr> <chr> <chr> <lgl> <int>
1 20230724_… 7 FALSE Unkno… Unkn… Unknown 2024-0… TRUE 57
2 example_1… 1 FALSE Unkno… Unkn… Matplo… 2024-0… FALSE 0
3 Oil_capac… 4 FALSE Unkno… Oil_… TextEd… 2025-0… TRUE 2883
4 Oil_Trans… 1 FALSE Unkno… Oil_… TextEd… 2025-0… TRUE 1982
5 Past_Spil… 2 FALSE Unkno… Past… TextEd… 2025-0… TRUE 2819
6 Spill_eve… 1 FALSE Unkno… Spil… TextEd… 2025-0… TRUE 814
# ℹ 2 more variables: Fonts <chr>, Status <chr>
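The per-file rows in the preview can be condensed into one label per file before the report is exported. The following is a minimal sketch, not part of the original notebook: `classify_report()`, the `Risk` column, and its category labels are hypothetical names, and `toy_report` is an illustrative stand-in for the real `report` tibble.

```r
library(dplyr)
library(tibble)

# Toy stand-in for the `report` tibble (values are illustrative only)
toy_report <- tibble(
  FileName  = c("a.pdf", "b.pdf", "c.pdf", "d.pdf"),
  Encrypted = c(FALSE, FALSE, TRUE, NA),
  HasText   = c(TRUE, FALSE, TRUE, NA),
  Status    = c("Success", "Success", "Success", "Failed: could not open file")
)

# Assign each file to exactly one category; the first matching rule wins
classify_report <- function(report) {
  report %>%
    mutate(
      Risk = case_when(
        Status != "Success" ~ "Unreadable",
        Encrypted           ~ "Encrypted",
        !HasText            ~ "Image-Only",
        TRUE                ~ "Text-Based"
      )
    )
}

# Count the files in each category
risk_summary <- classify_report(toy_report) %>% count(Risk, name = "Files")
print(risk_summary)
```

Checking `Status` first keeps files that could not be opened out of the other categories, since their `Encrypted` and `HasText` values are `NA`.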
output_dir <- file.path("Results", "Inspect_pdf")
dir.create(output_dir, recursive = TRUE, showWarnings = FALSE)
output_file <- file.path(output_dir, paste0("PDF_Report_", Sys.Date(), ".csv"))
write.csv(report, output_file, row.names = FALSE)
print(paste("Report saved to:", output_file))
[1] "Report saved to: Results/Inspect_pdf/PDF_Report_2026-05-15.csv"
Use the generated CSV report to guide your preservation actions:
Encryption (Encrypted = TRUE): Encrypted files cannot be migrated, indexed, or arguably “preserved.” Curators must contact the depositor immediately to remove the password. If the password is lost, the file cannot be reused.
Accessibility (HasText = FALSE, “Image-Only”): These files are scanned images. They are readable by humans but invisible to search engines and screen readers, which is an accessibility violation. You can run these files through an OCR tool (such as Tesseract or Adobe Acrobat) to generate a text layer.
Font Embedding (Fonts): If Fonts is empty or lists generic names like “Helvetica” without embedding, the layout may break on future systems. The curator can convert the file to PDF/A-1b or PDF/A-2b, both of which require fonts to be embedded.
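The Fonts column in the report stores font names only. The data frame returned by `pdftools::pdf_fonts()` also carries an `embedded` column, which can be used to list fonts that are referenced but not embedded. The sketch below is an illustration under that assumption: `non_embedded_fonts()` is a hypothetical helper, and `toy_fonts` is made-up data shaped like `pdf_fonts()` output.

```r
# Return the names of fonts that are referenced but not embedded.
# `fonts` is expected to look like the data frame returned by
# pdftools::pdf_fonts() (columns: name, type, embedded, file).
non_embedded_fonts <- function(fonts) {
  if (is.null(fonts) || nrow(fonts) == 0) return(character(0))
  unique(fonts$name[!is.na(fonts$embedded) & !fonts$embedded])
}

# Toy data frame standing in for pdf_fonts() output (illustrative values)
toy_fonts <- data.frame(
  name     = c("ABCDEF+Calibri", "Helvetica", "Helvetica"),
  type     = c("truetype", "type1", "type1"),
  embedded = c(TRUE, FALSE, FALSE),
  file     = c("", "", ""),
  stringsAsFactors = FALSE
)

print(non_embedded_fonts(toy_fonts))

# With a real file (the path is hypothetical):
# non_embedded_fonts(pdftools::pdf_fonts("data/Inspect_pdf/example.pdf"))
```

A file for which this returns any names is a candidate for conversion to PDF/A.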
VeraPDF: This is the standard open-source validator for PDF/A compliance. It checks if fonts are embedded and if multimedia is excluded (see https://verapdf.org).
Tesseract OCR: This is an open-source engine for optical character recognition. It can take an “Image-Only” PDF and output a searchable text file or a new PDF with a text layer.
Google AI: Extracts text and data from images and documents using AI capabilities. See a use case in another chapter of this book.
Ghostscript: This is a command-line interpreter often used to repair corrupted PDFs or convert standard PDFs into PDF/A format.
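For the Tesseract route, files without a text layer can be picked out of the report and passed to `pdftools::pdf_ocr_text()`, which relies on the optional tesseract package. This is a rough sketch rather than a tested workflow: `ocr_candidates()` and `ocr_pdf()` are hypothetical helpers, the `dpi` value is an arbitrary choice, and the result is plain text per page, not a new PDF with a text layer.

```r
# Pick the files flagged as lacking a text layer.
# Rows that failed to parse have HasText = NA and are skipped.
ocr_candidates <- function(report) {
  report$FileName[!is.na(report$HasText) & !report$HasText]
}

# OCR one PDF and return one character string per page.
# Assumes the 'pdftools' and 'tesseract' packages are installed.
ocr_pdf <- function(path, language = "eng", dpi = 300) {
  if (!requireNamespace("tesseract", quietly = TRUE)) {
    stop("Package 'tesseract' is required for OCR.")
  }
  pdftools::pdf_ocr_text(path, language = language, dpi = dpi)
}

# Toy report (illustrative values only)
toy_report <- data.frame(
  FileName = c("scan.pdf", "born_digital.pdf", "broken.pdf"),
  HasText  = c(FALSE, TRUE, NA),
  stringsAsFactors = FALSE
)
print(ocr_candidates(toy_report))
```

Because the report stores only the base file name, a real run would need the full paths from `pdf_files` before calling `ocr_pdf()`.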
For users who want to run this analysis on a server, in a batch job, or from the command line, here is a pure R script that performs the same process.
Download the R Script: Inspect_PDF_Script.R
#!/bin/bash
#SBATCH --job-name=pdf_inspect
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --time=00:30:00
#SBATCH --mem=8G
#SBATCH --output=logs/pdf_inspect_%j.log
module load R
# Define target directory
TARGET_DIR="/scratch/user/project_data/documents"
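# Prepare output folders
mkdir -p Results/Inspect_pdf
# Note: Slurm opens the --output file before this script starts,
# so logs/ must already exist when the job is submitted
mkdir -p logs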
# Run
echo "Starting PDF Inspection on $TARGET_DIR"
Rscript Inspect_PDF_Script.R "$TARGET_DIR"
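The batch script passes the target directory as the first command-line argument. The contents of Inspect_PDF_Script.R are not reproduced here, so the following is only a sketch of how such a script could read that argument: `resolve_target_dir()` is a hypothetical helper, and the fallback path is taken from the notebook output shown earlier.

```r
# Resolve the directory to analyse from the command-line arguments.
# `args` defaults to whatever Rscript received after the script name.
resolve_target_dir <- function(args = commandArgs(trailingOnly = TRUE),
                               default = "data/Inspect_pdf/") {
  if (length(args) >= 1 && nzchar(args[1])) args[1] else default
}

target_dir <- resolve_target_dir()
message("Analyzing directory: ", target_dir)
```

Keeping the argument handling in a small function makes it easy to check both the batch case and the no-argument case without launching a job.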