Unlike generic formats like CSV, SPSS files (.sav) utilize a “Labelled Data” paradigm where the codebook is embedded directly within the file (Wickham et al. 2023).
Extract and preserve the embedded codebook. Our objective is to inventory variable labels, value mappings, and missing value definitions to ensure the data’s semantic meaning is not lost during long-term preservation.
The accidental loss of “Missing Value” definitions during conversion is a critical risk. If placeholder codes (e.g., -99, 999) are not explicitly identified, they may be mistaken for valid numeric measurements, leading to biased statistical results.
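The size of this risk can be illustrated with a minimal base R sketch. The ages and the placeholder codes (-99, 999) below are made-up values for illustration, not data from the survey files inspected later.

```r
# Hypothetical ages in which 999 and -99 are placeholder codes, not measurements
age <- c(34, 51, 999, 28, -99, 45)

# Treating the placeholders as real numbers distorts the mean
naive_mean <- mean(age)

# Declaring the placeholder codes and recoding them to NA restores a sensible estimate
missing_codes <- c(-99, 999)
age_clean <- replace(age, age %in% missing_codes, NA)
clean_mean <- mean(age_clean, na.rm = TRUE)

print(c(naive = naive_mean, clean = clean_mean))
```

The recoding step is only possible if the list of placeholder codes survives alongside the data, which is why the codebook has to be inventoried before any conversion.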
This notebook performs a rigorous metadata extraction:
We use the haven package to read SPSS files. It is specifically designed to handle the “labelled” vectors that are unique to statistical software like SPSS, SAS, and Stata.
If you do not have the required packages, run this command once in your R console:
# install.packages(c("tidyverse", "haven", "rstudioapi", "stringr"))
library(tidyverse)
library(haven)      # Primary reader
library(rstudioapi) # For directory selection
library(stringr)    # For encoding checks
Select the folder containing the SPSS files you wish to inspect.
Note: If running interactively in RStudio on Windows (the code checks .Platform$OS.type), a dialog box will appear. In every other case (other operating systems, the command line, or rendering), it falls back to the target_dir parameter defined at the top of this file.
if (interactive() && .Platform$OS.type == "windows") {
  selected_dir <- rstudioapi::selectDirectory(caption = "Select SPSS Directory")
} else {
  selected_dir <- NULL
}
target_dir <- if (!is.null(selected_dir)) selected_dir else params$target_dir
print(paste("Analyzing:", target_dir))
[1] "Analyzing: data/Inspect_sav/"
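As a variation on the selection logic, the sketch below asks rstudioapi whether an RStudio session is actually available instead of checking the operating system. The helper name choose_target_dir is hypothetical and not part of the notebook; the default path is the one shown in the output above.

```r
# Hypothetical helper: prompt for a directory only when an RStudio session
# is available, otherwise return the supplied default path.
choose_target_dir <- function(default_dir) {
  can_prompt <- interactive() &&
    requireNamespace("rstudioapi", quietly = TRUE) &&
    rstudioapi::isAvailable()
  if (can_prompt) {
    picked <- rstudioapi::selectDirectory(caption = "Select SPSS Directory")
    # selectDirectory() returns NULL when the dialog is cancelled
    if (!is.null(picked)) return(picked)
  }
  default_dir
}

target_dir <- choose_target_dir("data/Inspect_sav/")
print(paste("Analyzing:", target_dir))
```

In a batch job or during rendering, interactive() is FALSE, so the helper returns the default without trying to open a dialog.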
We scan the directory for files ending in .sav.
spss_files <- list.files(
  path = target_dir,
  pattern = "\\.sav$",
  recursive = TRUE,
  full.names = TRUE,
  ignore.case = TRUE
)
print(paste("Found", length(spss_files), "SPSS files."))
[1] "Found 2 SPSS files."
head(spss_files)
[1] "data/Inspect_sav//GovernanceSurvey_Data.sav"
[2] "data/Inspect_sav//GovernanceSurvey.sav"
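To see what the pattern does and does not pick up, it can be tried on a few file names with grepl(). The names below are invented for illustration. Note that compressed .zsav files, which haven's read_sav() can also read, are not matched by this pattern.

```r
# Made-up file names used only to exercise the regular expression
candidates <- c(
  "survey.sav",      # standard SPSS file
  "SURVEY.SAV",      # upper-case extension, matched because ignore.case = TRUE
  "survey.sav.bak",  # backup copy, ".sav" is not at the end of the name
  "survey.zsav",     # compressed SPSS file, not matched by "\\.sav$"
  "notes_sav.txt"    # unrelated text file
)

is_sav <- grepl("\\.sav$", candidates, ignore.case = TRUE)
print(candidates[is_sav])
```

If a collection is known to contain compressed files, a pattern such as "\\.z?sav$" would be one way to include them; whether that is wanted depends on the deposit.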
Here, the code loops through every file found and performs the following:
Reads the Header: It reads only the first 100 rows (n_max = 100). This makes the process extremely fast, even for multi-gigabyte files, because we only need the metadata headers, not the full dataset.
Extracts Metadata: It uses the attr() function to pull out the hidden “label” and “labels” attributes attached to each column.
Formats Output: It creates a tidy table listing every variable, its type, and its definitions.
message("Generating Data Dictionary...")
# Helper: Check for non-ASCII characters (Encoding Risk)
check_encoding <- function(col) {
  # Only text-like columns (character or factor) are checked
  if (is.character(col) || is.factor(col)) {
    # Coerce to character for check
    txt <- as.character(col)
    has_special <- any(str_detect(txt, "[^\\x00-\\x7F]"), na.rm = TRUE)
    return(if (has_special) "Contains Special/Non-ASCII" else "ASCII")
  }
  return("N/A")
}
# Helper: Extract User-Defined Missing Values
# haven stores these in the 'na_values' or 'na_range' attribute, but only
# when the file is read with user_na = TRUE
get_na_values <- function(col) {
  na_vals <- attr(col, "na_values", exact = TRUE)
  na_range <- attr(col, "na_range", exact = TRUE)
  res <- ""
  if (!is.null(na_vals)) res <- paste(na_vals, collapse = ", ")
  if (!is.null(na_range)) {
    range_str <- paste0(na_range[1], " to ", na_range[2])
    res <- if (res == "") range_str else paste(res, range_str, sep = "; ")
  }
  return(if (res == "") "None" else res)
}
report <- purrr::map_dfr(spss_files, function(file_path) {
  fname <- basename(file_path)
  tryCatch({
    # Read only the first 100 rows for speed. user_na = TRUE keeps SPSS
    # user-defined missing values as na_values / na_range attributes; with
    # haven's default (user_na = FALSE) they are converted to NA and the
    # MissingValues column would always read "None".
    data <- read_sav(file_path, n_max = 100, user_na = TRUE)
    purrr::map_dfr(names(data), function(var) {
      col <- data[[var]]
      # 1. Variable Label
      lbl <- attr(col, "label", exact = TRUE)
      if (is.null(lbl)) lbl <- "(No Label)"
      # 2. Value Labels, formatted as "code=label; code=label"
      val_lbls <- attr(col, "labels", exact = TRUE)
      val_str <- if (!is.null(val_lbls)) {
        paste(val_lbls, names(val_lbls), sep = "=", collapse = "; ")
      } else {
        ""
      }
      # 3. Data Type: the column's primary R class (e.g. numeric, haven_labelled).
      #    This is a storage class, not the SPSS measurement level.
      dtype <- class(col)[1]
      # 4. Create Row
      tibble(
        FileName = fname,
        VariableName = var,
        VariableLabel = lbl,
        DataType = dtype,
        ValueLabels = substr(val_str, 1, 100), # truncated to 100 characters
        MissingValues = get_na_values(col),
        EncodingCheck = check_encoding(col),
        Status = "Success"
      )
    })
  }, error = function(e) {
    # Record the failure as a row so one unreadable file does not stop the loop
    tibble(
      FileName = fname,
      VariableName = "ERROR",
      VariableLabel = e$message,
      DataType = "Error",
      ValueLabels = "",
      MissingValues = "",
      EncodingCheck = "",
      Status = "Failed"
    )
  })
})
# Display preview
print("--- SPSS Data Dictionary Preview ---")
[1] "--- SPSS Data Dictionary Preview ---"
head(report)
# A tibble: 6 × 8
  FileName         VariableName VariableLabel DataType ValueLabels MissingValues
  <chr>            <chr>        <chr>         <chr>    <chr>       <chr>
1 GovernanceSurve… ID           "(No Label)"  numeric  ""          None
2 GovernanceSurve… RecordedDate "Recorded Da… POSIXct  ""          None
3 GovernanceSurve… Q1           "Representat… haven_l… "1=Yes"     None
4 GovernanceSurve… Q2           "How old you… numeric  ""          None
5 GovernanceSurve… Q3           "If a federa… haven_l… "1=The Con… None
6 GovernanceSurve… Q4           "Is there a … haven_l… "1=The Con… None
# ℹ 2 more variables: EncodingCheck <chr>, Status <chr>
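The attribute lookups used in the loop can be tried in isolation on a small hand-built vector. The sketch below assumes the haven package is installed and uses haven::labelled_spss() to construct a variable with a label, value labels, and user-defined missing values. The variable name, labels, and codes are invented for illustration.

```r
library(haven)

# Hypothetical survey item: 1 = Yes, 2 = No, -99 = Refused (declared missing),
# plus a declared missing range of 900 to 999
consent <- labelled_spss(
  c(1, 2, -99, 1),
  labels = c(Yes = 1, No = 2, Refused = -99),
  na_values = -99,
  na_range = c(900, 999),
  label = "Consent given"
)

# The same attribute lookups the dictionary loop performs
var_label <- attr(consent, "label", exact = TRUE)
val_lbls  <- attr(consent, "labels", exact = TRUE)
value_str <- paste(val_lbls, names(val_lbls), sep = "=", collapse = "; ")
na_vals   <- attr(consent, "na_values", exact = TRUE)
na_rng    <- attr(consent, "na_range", exact = TRUE)

print(var_label)
print(value_str)
print(na_vals)
print(na_rng)
print(class(consent)[1])
```

When a .sav file is read with read_sav(), these missing-value attributes are present only if user_na = TRUE is passed; by default haven converts user-defined missing values to NA.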
We save the resulting table to a CSV file. This CSV acts as the Data Dictionary for the curated deposit.
#| label: save-results
output_dir <- "Results/Inspect_sav"
dir.create(output_dir, recursive = TRUE, showWarnings = FALSE)
output_file <- file.path(output_dir, paste0("SPSS_Dictionary_", format(Sys.Date(), "%Y%m%d"), ".csv"))
write.csv(report, output_file, row.names = FALSE)
print(paste("Data Dictionary saved to:", output_file))
[1] "Data Dictionary saved to: Results/Inspect_sav/SPSS_Dictionary_20260515.csv"
Use the generated CSV to perform these checks:
User-Defined Missing Values (MissingValues): Look for placeholder numbers such as -9, -99, or 999, and ensure they are documented in the study’s metadata. If you convert this file to CSV, these numbers become regular integers (e.g., Age = 999), invalidating any downstream analysis (see IBM SPSS Statistics 28 Brief Guide).
Missing Value Labels: Look for variables whose DataType is haven_labelled or numeric but whose ValueLabels column is empty. If such a variable holds coded categories, the codes cannot be interpreted without an external codebook.
Encoding Issues (EncodingCheck): “Contains Special/Non-ASCII” often indicates files created in older localized Windows versions (e.g., French or Spanish SPSS) using the legacy Windows-1252 encoding. If the labels look corrupted (e.g., EncuestÃ³ instead of Encuestó), the data requires character set remediation before long-term storage.
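The kind of corruption described in the encoding check can be reproduced and reversed in base R. The sketch below is illustrative only: it simulates a UTF-8 string being misread as Windows-1252, flags it with a hypothetical has_non_ascii() helper, and then undoes the damage. Real remediation depends on knowing the encoding the file was actually written in.

```r
# A correctly encoded UTF-8 string (the escape \u00f3 is the letter o with acute accent)
original <- "Encuest\u00f3"

# Simulate the misreading: treat the UTF-8 bytes as if they were Windows-1252
garbled <- iconv(original, from = "WINDOWS-1252", to = "UTF-8")
print(garbled)

# Hypothetical helper: TRUE when a string contains any code point above 127
has_non_ascii <- function(x) {
  vapply(enc2utf8(x), function(s) any(utf8ToInt(s) > 127L),
         logical(1), USE.NAMES = FALSE)
}
print(has_non_ascii(c(garbled, "Survey")))

# Reverse the misreading: write the characters back out as Windows-1252 bytes,
# then declare those bytes to be UTF-8
repaired <- iconv(garbled, from = "UTF-8", to = "WINDOWS-1252")
Encoding(repaired) <- "UTF-8"
print(repaired == original)
```

A non-ASCII flag alone does not prove corruption, since correctly encoded accented text is also flagged; it only marks the variables a curator should inspect by eye.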
PSPP: An open-source alternative to IBM SPSS Statistics. It allows curators to verify .sav files and provides a command-line interface for batch conversion without requiring a proprietary license.
StatTransfer: A commercial tool for converting between statistical formats (SAS, SPSS, Stata, R). It is effective at preserving missing value definitions and variable labels during migration.
For users who want to run this analysis on a server (HPC), in a batch job, or from the command line, here is the pure R script version.
Download the R Script: Inspect_sav_Script.R
Slurm submission script (Inspect_sav_submit.sh):
#!/bin/bash
#SBATCH --job-name=spss_check
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --time=00:20:00
#SBATCH --mem=8G
#SBATCH --output=logs/spss_check_%j.log
# 1. Load R Module
# (Adjust version based on your cluster, e.g., 'module load R/4.3.0')
module load R
# 2. Define Target Directory
# Replace with the actual path to your SPSS data
DATA_DIR="/scratch/user/project_data/surveys"
# 3. Prepare Environment
# Ensure output directories exist
mkdir -p Results/Inspect_sav
mkdir -p logs
# 4. Run Analysis
echo "Starting SPSS Inspection on $DATA_DIR"
Rscript Inspect_sav_Script.R "$DATA_DIR"
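The submission script passes the data directory as the first command-line argument. The contents of Inspect_sav_Script.R are not shown here, so the following is only a sketch of how an R script could pick that argument up; the helper name resolve_target_dir and the fallback path are assumptions.

```r
# Hypothetical helper: use the first command-line argument as the target
# directory, or fall back to a default when no argument is supplied.
resolve_target_dir <- function(args = commandArgs(trailingOnly = TRUE),
                               default_dir = "data/Inspect_sav/") {
  if (length(args) >= 1 && nzchar(args[1])) args[1] else default_dir
}

target_dir <- resolve_target_dir()

# Warn early when the path does not exist, so the job log explains an empty report
if (!dir.exists(target_dir)) {
  message("Directory not found: ", target_dir)
}
message("Analyzing: ", target_dir)
```

With this approach, the call Rscript Inspect_sav_Script.R "$DATA_DIR" would set target_dir to the value of DATA_DIR, while running the script without arguments would use the default.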