11  SPSS Data (.sav) Files

Author

Daniel Manrique-Castano

Published

December 18, 2025

11.1 Overview

Unlike generic formats like CSV, SPSS files (.sav) utilize a “Labelled Data” paradigm where the codebook is embedded directly within the file (Wickham et al. 2023).

Note: Curation Goal

Extract and preserve the embedded codebook. Our objective is to inventory variable labels, value mappings, and missing value definitions to ensure the data’s semantic meaning is not lost during long-term preservation.

Warning: Preservation Risk

The accidental loss of “Missing Value” definitions during conversion is a critical risk. If placeholder codes (e.g., -99, 999) are not explicitly identified, they may be mistaken for valid numeric measurements, leading to biased statistical results.
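
The size of this bias is easy to see on a toy vector. The sketch below uses invented ages and assumes 999 is the placeholder code for a refused answer; it is an illustration, not output from the survey files inspected later.

```r
# Toy ages; 999 is a hypothetical "refused to answer" code.
ages <- c(25, 34, 41, 999)

# A naive mean treats the placeholder as a real measurement.
naive_mean <- mean(ages)

# Recode the placeholder to NA before summarising.
ages_clean <- replace(ages, ages == 999, NA)
clean_mean <- mean(ages_clean, na.rm = TRUE)

print(c(naive = naive_mean, clean = clean_mean))
```

A single unflagged code moves the mean from about 33 to about 275, which is why the missing value inventory below matters.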

This notebook performs a rigorous metadata extraction:

  1. Dictionary Extraction: Inventorying human-readable labels and value mappings.
  2. Missing Value Inventory: Explicitly listing codes that should be treated as Nulls.
  3. Encoding Scan: Checking for character corruption (Mojibake) in older datasets.

11.2 Setup

We use the haven package to read SPSS files. It is specifically designed to handle the “labelled” vectors used by statistical software such as SPSS, SAS, and Stata. If you do not have the required packages, uncomment the command below and run it once in your R console:

11.2.1 R Packages

Code
# install.packages(c("tidyverse", "haven", "rstudioapi", "stringr"))

11.2.2 Load Libraries

Code
library(tidyverse)
library(haven)      # Primary reader 
library(rstudioapi) # For directory selection
library(stringr)    # For encoding checks

11.3 Select Directory

Select the folder containing the SPSS files you wish to inspect.

Note: If running interactively in RStudio on Windows (the case the code below checks for), a dialog box will appear. Otherwise, including command-line runs and rendering, the directory defaults to the target_dir parameter defined at the top of this file.

Code
if (interactive() && .Platform$OS.type == "windows") { 
  selected_dir <- rstudioapi::selectDirectory(caption = "Select SPSS Directory")
} else { selected_dir <- NULL }

target_dir <- if (!is.null(selected_dir)) selected_dir else params$target_dir
print(paste("Analyzing:", target_dir))
[1] "Analyzing: data/Inspect_sav/"

11.4 File inventory

We scan the directory for files ending in .sav.

Code
spss_files <- list.files(
  path = target_dir,
  pattern = "\\.sav$", 
  recursive = TRUE, 
  full.names = TRUE, 
  ignore.case = TRUE
)

print(paste("Found", length(spss_files), "SPSS files."))
[1] "Found 2 SPSS files."
Code
head(spss_files)
[1] "data/Inspect_sav//GovernanceSurvey_Data.sav"
[2] "data/Inspect_sav//GovernanceSurvey.sav"     

11.5 Generate Data Dictionary

Here, the code loops through every file found and performs the following:

  • Reads the Header: It reads only the first 100 rows (n_max = 100). This makes the process extremely fast, even for multi-gigabyte files, because we only need the metadata headers, not the full dataset.

  • Extracts Metadata: It uses the attr() function to pull out the hidden “label” and “labels” attributes attached to each column.

  • Formats Output: It creates a tidy table listing every variable, its type, and its definitions.
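
To make the attr() step concrete, the following sketch builds a labelled vector in memory with haven::labelled() and reads back the same two attributes. The variable name, label text, and codes are invented for illustration; no .sav file is read.

```r
library(haven)

# Hypothetical survey item built in memory (not read from a .sav file).
voted <- labelled(
  c(1, 2, 1, 2),
  labels = c(Yes = 1, No = 2),
  label  = "Voted in the last election"
)

# The variable label and the value labels are plain R attributes.
var_label  <- attr(voted, "label", exact = TRUE)
val_labels <- attr(voted, "labels", exact = TRUE)

print(var_label)
# Same "code=label" formatting as the dictionary's ValueLabels column.
print(paste(val_labels, names(val_labels), sep = "=", collapse = "; "))
print(class(voted)[1])
```

The first element of the class vector is what ends up in the DataType column of the dictionary.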

Code
message("Generating Data Dictionary...")

# Helper: Check for non-ASCII characters (Encoding Risk)
check_encoding <- function(col) {
  # Only text columns (character or factor) are scanned; the check looks at values, not labels
  if (is.character(col) || is.factor(col)) {
    # Coerce to character for check
    txt <- as.character(col)
    has_special <- any(str_detect(txt, "[^\\x00-\\x7F]"), na.rm = TRUE)
    return(if (has_special) "Contains Special/Non-ASCII" else "ASCII")
  }
  return("N/A")
}

# Helper: Extract User-Defined Missing Values
# haven stores these in the 'na_values' or 'na_range' attribute
get_na_values <- function(col) {
  na_vals <- attr(col, "na_values", exact = TRUE)
  na_range <- attr(col, "na_range", exact = TRUE)
  
  res <- ""
  if (!is.null(na_vals)) res <- paste(na_vals, collapse = ", ")
  if (!is.null(na_range)) {
    range_str <- paste0(na_range[1], " to ", na_range[2])
    res <- if(res == "") range_str else paste(res, range_str, sep = "; ")
  }
  return(if (res == "") "None" else res)
}

report <- purrr::map_dfr(spss_files, function(file_path) {
  
  fname <- basename(file_path)
  
  tryCatch({
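    # Read only the first 100 rows for speed.
    # user_na = TRUE keeps SPSS user-defined missing values (na_values / na_range);
    # with the default (FALSE) haven converts them to NA and the attributes are not set.
    data <- read_sav(file_path, n_max = 100, user_na = TRUE)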
    
    purrr::map_dfr(names(data), function(var) {
      col <- data[[var]]
      
      # 1. Variable Label
      lbl <- attr(col, "label", exact = TRUE)
      if (is.null(lbl)) lbl <- "(No Label)"
      
      # 2. Value Labels
      val_lbls <- attr(col, "labels", exact = TRUE)
      val_str <- if (!is.null(val_lbls)) paste(val_lbls, names(val_lbls), sep="=", collapse="; ") else ""
      
      # 3. Data Type
      # First element of the R class (e.g., numeric, haven_labelled, POSIXct);
      # this is a storage type, not the SPSS measurement level
      dtype <- class(col)[1]
      
      # 4. Create Row
      tibble(
        FileName = fname,
        VariableName = var,
        VariableLabel = lbl,
        DataType = dtype,
        ValueLabels = substr(val_str, 1, 100),  # truncated to the first 100 characters
        MissingValues = get_na_values(col),
        EncodingCheck = check_encoding(col),
        Status = "Success"
      )
    })
    
  }, error = function(e) {
    tibble(
      FileName = fname,
      VariableName = "ERROR",
      VariableLabel = e$message,
      DataType = "Error",
      ValueLabels = "",
      MissingValues = "",
      EncodingCheck = "",
      Status = "Failed"
    )
  })
})

# Display preview
print("--- SPSS Data Dictionary Preview ---")
[1] "--- SPSS Data Dictionary Preview ---"
Code
head(report)
# A tibble: 6 × 8
  FileName         VariableName VariableLabel DataType ValueLabels MissingValues
  <chr>            <chr>        <chr>         <chr>    <chr>       <chr>        
1 GovernanceSurve… ID           "(No Label)"  numeric  ""          None         
2 GovernanceSurve… RecordedDate "Recorded Da… POSIXct  ""          None         
3 GovernanceSurve… Q1           "Representat… haven_l… "1=Yes"     None         
4 GovernanceSurve… Q2           "How old you… numeric  ""          None         
5 GovernanceSurve… Q3           "If a federa… haven_l… "1=The Con… None         
6 GovernanceSurve… Q4           "Is there a … haven_l… "1=The Con… None         
# ℹ 2 more variables: EncodingCheck <chr>, Status <chr>
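
The MissingValues column depends on two attributes that haven attaches to labelled_spss vectors. haven sets them on import only when a file is read with user_na = TRUE; with the default, user-defined missing values are converted to NA. The sketch below builds such a vector in memory so the attributes can be inspected directly. The variable, codes, and range are invented for illustration.

```r
library(haven)

# Hypothetical income item with one discrete missing code and one missing range.
income <- labelled_spss(
  c(1200, 3400, -99, 999),
  labels    = c("Refused" = -99),
  na_values = -99,
  na_range  = c(990, 999),
  label     = "Monthly income"
)

# These are the attributes that get_na_values() inspects.
print(attr(income, "na_values", exact = TRUE))
print(attr(income, "na_range", exact = TRUE))

# haven reports user-defined missing values as missing.
print(is.na(income))
print(class(income)[1])
```

If every row of the dictionary shows "None", it is worth confirming how the file was read before concluding that no missing codes were declared.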

11.6 Save the results

We save the resulting table to a CSV file. This CSV acts as the Data Dictionary for the curated deposit.

Code
#| label: save-results

output_dir <- "Results/Inspect_sav"
dir.create(output_dir, recursive = TRUE, showWarnings = FALSE)

output_file <- file.path(output_dir, paste0("SPSS_Dictionary_", format(Sys.Date(), "%Y%m%d"), ".csv"))

write.csv(report, output_file, row.names = FALSE)
print(paste("Data Dictionary saved to:", output_file))
[1] "Data Dictionary saved to: Results/Inspect_sav/SPSS_Dictionary_20260515.csv"
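
The saved table can also be screened programmatically. The sketch below is a minimal example in base R on a toy data frame that mimics some columns of the dictionary; the rows are invented, and the three filters are one possible reading of the checks a curator might run, not a fixed procedure.

```r
# Toy stand-in for the dictionary; rows are invented for illustration.
report_demo <- data.frame(
  VariableName  = c("ID", "Q1", "Q2", "Q3"),
  DataType      = c("numeric", "haven_labelled", "numeric", "character"),
  ValueLabels   = c("", "1=Yes; 2=No", "", ""),
  MissingValues = c("None", "None", "-99", "None"),
  EncodingCheck = c("N/A", "N/A", "N/A", "Contains Special/Non-ASCII"),
  stringsAsFactors = FALSE
)

# Variables that declare user-defined missing codes.
has_user_na <- subset(report_demo, MissingValues != "None")

# Numeric or labelled variables without any value labels.
unlabelled <- subset(
  report_demo,
  DataType %in% c("haven_labelled", "numeric") & ValueLabels == ""
)

# Text variables flagged by the encoding check.
non_ascii <- subset(report_demo, EncodingCheck == "Contains Special/Non-ASCII")

print(has_user_na$VariableName)
print(unlabelled$VariableName)
print(non_ascii$VariableName)
```

With the real output, the toy data frame would be replaced by the result of reading the saved CSV with read.csv().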

11.7 Curation Insights

Use the generated CSV to perform these checks:

  • User-Defined Missing Values (MissingValues): Look for codes such as -9, -99, or 999 and ensure they are documented in the study’s metadata. If you convert the file to CSV, these codes become ordinary numbers (e.g., Age = 999) that are indistinguishable from real measurements and will bias any downstream analysis (see the IBM SPSS Statistics 28 Brief Guide).

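  • Missing Value Labels: Look for variables whose DataType is haven_labelled or numeric but whose ValueLabels entry is empty. If such a variable is categorical, its codes cannot be interpreted without an external codebook.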

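  • Encoding Issues (EncodingCheck): “Contains Special/Non-ASCII” flags text columns whose values include characters outside the ASCII range. Such characters are legitimate in many languages, but they are also where corruption shows up, often in files created in older localized Windows versions (e.g., French or Spanish SPSS) using the legacy Windows-1252 encoding. If values or labels look corrupted (e.g., EncuestÃ³ instead of Encuestó), the data requires character set remediation before long-term storage.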
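
The corruption pattern in the last check can be reproduced and reversed with base R's iconv(). The sketch below assumes the common case of UTF-8 bytes that were misread as Windows-1252 (named "CP1252" in iconv); other encoding mix-ups need a different from/to pair, and the repair only works when no bytes were lost along the way.

```r
original <- "Encuest\u00f3"  # "Encuestó", stored as UTF-8

# Simulate the corruption: interpret the UTF-8 bytes as Windows-1252.
garbled <- iconv(original, from = "CP1252", to = "UTF-8")
print(garbled)  # "EncuestÃ³"

# Reverse it: write the characters back out as Windows-1252 bytes,
# then declare those bytes to be UTF-8.
repaired <- iconv(garbled, from = "UTF-8", to = "CP1252")
Encoding(repaired) <- "UTF-8"
print(identical(repaired, original))
```

When the problem is a wrong encoding declared in the file itself, haven::read_sav() also accepts an encoding argument that overrides the declared value, which may avoid the need for repair after import.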

11.8 Additional Tools

  • PSPP: An open-source alternative to IBM SPSS Statistics. It allows curators to verify .sav files and provides a command-line interface for batch conversion without requiring a proprietary license.

  • StatTransfer: A commercial tool for converting between statistical formats (SAS, SPSS, Stata, R). It is effective at preserving missing value definitions and variable labels during migration.

11.9 Using the Non-Interactive R Script

For users who want to run this analysis on a server (HPC), in a batch job, or from the command line, here is the pure R script version.

Download the R Script: Inspect_sav_Script.R

11.9.1 Example HPC Submission Script (Inspect_sav_submit.sh)

#!/bin/bash
#SBATCH --job-name=spss_check
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --time=00:20:00
#SBATCH --mem=8G
#SBATCH --output=logs/spss_check_%j.log

# 1. Load R Module
# (Adjust version based on your cluster, e.g., 'module load R/4.3.0')
module load R

# 2. Define Target Directory
# Replace with the actual path to your SPSS data
DATA_DIR="/scratch/user/project_data/surveys"

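# 3. Prepare Environment
# Ensure output directories exist
# (Slurm opens the --output log before this script runs, so logs/ must
# already exist when you call sbatch; the mkdir below only covers later runs)
mkdir -p Results/Inspect_sav
mkdir -p logs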

# 4. Run Analysis
echo "Starting SPSS Inspection on $DATA_DIR"
Rscript Inspect_sav_Script.R "$DATA_DIR"
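
The submission script passes the data directory as the first command-line argument. The sketch below shows one way an R script could pick it up with commandArgs(). The helper name resolve_target_dir and the fallback path are assumptions made for illustration; the downloadable Inspect_sav_Script.R may handle its arguments differently.

```r
# Hypothetical helper: take the directory from the command-line arguments,
# falling back to a default when none is supplied.
resolve_target_dir <- function(args, default = "data/Inspect_sav/") {
  if (length(args) >= 1 && nzchar(args[1])) args[1] else default
}

# trailingOnly = TRUE drops R's own arguments and keeps only those
# that follow the script name on the Rscript command line.
args <- commandArgs(trailingOnly = TRUE)
target_dir <- resolve_target_dir(args)
message("Analyzing: ", target_dir)
```

Keeping the fallback in a small function makes the behaviour easy to check without submitting a job.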

11.10 References