11 SPSS Data (.sav) Files

Author

Daniel Manrique-Castano

Published

December 18, 2025

11.1 Overview

Unlike generic formats like CSV, SPSS files (.sav) utilize a “Labelled Data” paradigm where the codebook is embedded directly within the file (Wickham et al. 2023).

Curation Goal

Extract and preserve the embedded codebook. Our objective is to inventory variable labels, value mappings, and missing value definitions to ensure the data’s semantic meaning is not lost during long-term preservation.

Preservation Risk

The accidental loss of “Missing Value” definitions during conversion is a critical risk. If placeholder codes (e.g., -99, 999) are not explicitly identified, they may be mistaken for valid numeric measurements, leading to biased statistical results.
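To make the risk concrete, here is a minimal sketch in base R. The ages and the codes -99 and 999 are invented for illustration; they are not taken from any file inspected in this chapter.

```r
# Hypothetical ages where -99 and 999 are placeholder codes, not measurements
ages <- c(34, 41, 29, 999, 52, -99)
missing_codes <- c(-99, 999)

# Treating the codes as real values distorts the mean
naive_mean <- mean(ages)

# Recode the placeholder codes to NA before computing statistics
clean_ages <- replace(ages, ages %in% missing_codes, NA)
clean_mean <- mean(clean_ages, na.rm = TRUE)

cat("Naive mean:", naive_mean, "\n")   # 176
cat("Clean mean:", clean_mean, "\n")   # 39
```

The recoding step is only possible when the placeholder codes are known, which is why the missing value definitions need to be inventoried before any conversion.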

This notebook performs a rigorous metadata extraction:

Dictionary Extraction: Inventorying human-readable labels and value mappings.
Missing Value Inventory: Explicitly listing codes that should be treated as Nulls.
Encoding Scan: Checking for character corruption (Mojibake) in older datasets.

11.2 Setup

We use the haven package to read SPSS files. It is specifically designed to handle the “labelled” vectors used by statistical software such as SPSS, SAS, and Stata. If you do not have the required packages, run the command below once in your R console:

11.2.1 R Packages

Code

# install.packages(c("tidyverse", "haven", "rstudioapi", "stringr"))
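As an optional alternative to installing everything unconditionally, the following sketch reports which of the required packages are not yet available. The helper name `missing_packages` is hypothetical and not part of the notebook.

```r
required <- c("tidyverse", "haven", "rstudioapi", "stringr")

# Hypothetical helper: return the packages that cannot be loaded
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

to_install <- missing_packages(required)
if (length(to_install) > 0) {
  message("Missing packages: ", paste(to_install, collapse = ", "))
  # install.packages(to_install)  # uncomment to install only what is missing
}
```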

11.2.2 Load Libraries

Code

library(tidyverse)
library(haven)      # Primary reader 
library(rstudioapi) # For directory selection
library(stringr)    # For encoding checks

11.3 Select Directory

Select the folder containing the SPSS files you wish to inspect.

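Note: In an interactive RStudio session on Windows, a dialog box will appear. When running from the command line, rendering the document, or working on another operating system, the code falls back to the target_dir parameter defined at the top of this file (here, data/Inspect_sav/).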

Code

if (interactive() && .Platform$OS.type == "windows") { 
  selected_dir <- rstudioapi::selectDirectory(caption = "Select SPSS Directory")
} else { selected_dir <- NULL }

target_dir <- if (!is.null(selected_dir)) selected_dir else params$target_dir
print(paste("Analyzing:", target_dir))

[1] "Analyzing: data/Inspect_sav/"
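If the dialog should also be offered on other platforms, one possible variation is to ask rstudioapi whether RStudio is available instead of testing the operating system. This is a sketch under that assumption, not the notebook's own logic; `choose_target_dir` is a hypothetical helper, and the default path stands in for params$target_dir.

```r
# Hypothetical helper: prompt only when an interactive RStudio session exists
choose_target_dir <- function(default_dir) {
  can_prompt <- interactive() &&
    requireNamespace("rstudioapi", quietly = TRUE) &&
    rstudioapi::isAvailable()

  if (can_prompt) {
    picked <- rstudioapi::selectDirectory(caption = "Select SPSS Directory")
    # selectDirectory() returns NULL when the dialog is cancelled
    if (!is.null(picked)) return(picked)
  }
  default_dir
}

target_dir <- choose_target_dir("data/Inspect_sav/")
message("Analyzing: ", target_dir)
```

When run with Rscript or during rendering, `interactive()` is FALSE and the function simply returns the default.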

11.4 File Inventory

We scan the directory for files ending in .sav.

Code

spss_files <- list.files(
  path = target_dir,
  pattern = "\\.sav$", 
  recursive = TRUE, 
  full.names = TRUE, 
  ignore.case = TRUE
)

print(paste("Found", length(spss_files), "SPSS files."))

[1] "Found 2 SPSS files."

Code

head(spss_files)

[1] "data/Inspect_sav//GovernanceSurvey_Data.sav"
[2] "data/Inspect_sav//GovernanceSurvey.sav"
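To see what the pattern and the ignore.case and recursive arguments do, the sketch below builds a throwaway directory with invented file names and runs the same list.files() call on it.

```r
# Build a temporary directory with hypothetical file names
demo_dir <- tempfile("sav_demo_")
dir.create(file.path(demo_dir, "wave2"), recursive = TRUE)
file.create(file.path(
  demo_dir,
  c("survey.sav", "SURVEY_OLD.SAV", "notes.sav.txt", "wave2/followup.sav")
))

found <- list.files(
  path = demo_dir,
  pattern = "\\.sav$",   # must END in .sav, so notes.sav.txt is skipped
  recursive = TRUE,      # descends into wave2/
  full.names = TRUE,
  ignore.case = TRUE     # also matches .SAV
)

print(basename(found))
```

Three of the four files match: the upper-case extension and the file in the subfolder are included, while the text file is excluded.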

11.5 Generate Data Dictionary

Here, the code loops through every file found and performs the following:

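Reads the Header: It reads only the first 100 rows (n_max = 100). This keeps the process fast, even for multi-gigabyte files, because the variable-level metadata is attached to each column no matter how many rows are loaded. Checks that look at cell values, such as the encoding scan, therefore only cover these first 100 rows.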
Extracts Metadata: It uses the attr() function to pull out the hidden “label” and “labels” attributes attached to each column.
Formats Output: It creates a tidy table listing every variable, its type, and its definitions.
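The attributes mentioned above can be inspected on a small hand-built vector. The sketch below uses haven's labelled_spss() constructor; the item wording, codes, and labels are invented for illustration.

```r
library(haven)

# Hypothetical survey item: 1 = Yes, 2 = No, -99 declared as user-missing
q1 <- labelled_spss(
  c(1, 2, -99, 1),
  labels    = c(Yes = 1, No = 2, Refused = -99),
  na_values = -99,
  label     = "Is there a representative body?"
)

var_label  <- attr(q1, "label", exact = TRUE)      # variable label (one string)
val_labels <- attr(q1, "labels", exact = TRUE)     # named vector: names = labels, values = codes
na_codes   <- attr(q1, "na_values", exact = TRUE)  # user-defined missing codes

# Same "code=label" formatting idea as used for the dictionary table
paste(val_labels, names(val_labels), sep = "=", collapse = "; ")
```

Note the direction of the "labels" attribute: the codes are the values and the human-readable text is in the names, which is why the dictionary pastes `val_lbls` before `names(val_lbls)`.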

Code

message("Generating Data Dictionary...")

# Helper: Check for non-ASCII characters (Encoding Risk)
check_encoding <- function(col) {
  # Scan the values of character or factor columns; labels are not checked here
  if (is.character(col) || is.factor(col)) {
    # Coerce to character for check
    txt <- as.character(col)
    has_special <- any(str_detect(txt, "[^\\x00-\\x7F]"), na.rm = TRUE)
    return(if (has_special) "Contains Special/Non-ASCII" else "ASCII")
  }
  return("N/A")
}

# Helper: Extract User-Defined Missing Values
# haven stores these in the 'na_values' or 'na_range' attribute
get_na_values <- function(col) {
  na_vals <- attr(col, "na_values", exact = TRUE)
  na_range <- attr(col, "na_range", exact = TRUE)
  
  res <- ""
  if (!is.null(na_vals)) res <- paste(na_vals, collapse = ", ")
  if (!is.null(na_range)) {
    range_str <- paste0(na_range[1], " to ", na_range[2])
    res <- if(res == "") range_str else paste(res, range_str, sep = "; ")
  }
  return(if (res == "") "None" else res)
}

report <- purrr::map_dfr(spss_files, function(file_path) {
  
  fname <- basename(file_path)
  
  tryCatch({
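    # Read only the first 100 rows for speed. user_na = TRUE keeps user-defined
    # missing values (na_values / na_range); the default converts them to NA
    data <- read_sav(file_path, n_max = 100, user_na = TRUE)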
    
    purrr::map_dfr(names(data), function(var) {
      col <- data[[var]]
      
      # 1. Variable Label
      lbl <- attr(col, "label", exact = TRUE)
      if (is.null(lbl)) lbl <- "(No Label)"
      
      # 2. Value Labels
      val_lbls <- attr(col, "labels", exact = TRUE)
      val_str <- if (!is.null(val_lbls)) paste(val_lbls, names(val_lbls), sep="=", collapse="; ") else ""
      
      # 3. Data Type
      # First R class of the column (storage class, not the SPSS measurement level)
      dtype <- class(col)[1]
      
      # 4. Create Row
      tibble(
        FileName = fname,
        VariableName = var,
        VariableLabel = lbl,
        DataType = dtype,
        ValueLabels = substr(val_str, 1, 100),
        MissingValues = get_na_values(col),
        EncodingCheck = check_encoding(col),
        Status = "Success"
      )
    })
    
  }, error = function(e) {
    tibble(
      FileName = fname,
      VariableName = "ERROR",
      VariableLabel = e$message,
      DataType = "Error",
      ValueLabels = "",
      MissingValues = "",
      EncodingCheck = "",
      Status = "Failed"
    )
  })
})

# Display preview
print("--- SPSS Data Dictionary Preview ---")

[1] "--- SPSS Data Dictionary Preview ---"

Code

head(report)

# A tibble: 6 × 8
  FileName         VariableName VariableLabel DataType ValueLabels MissingValues
  <chr>            <chr>        <chr>         <chr>    <chr>       <chr>        
1 GovernanceSurve… ID           "(No Label)"  numeric  ""          None         
2 GovernanceSurve… RecordedDate "Recorded Da… POSIXct  ""          None         
3 GovernanceSurve… Q1           "Representat… haven_l… "1=Yes"     None         
4 GovernanceSurve… Q2           "How old you… numeric  ""          None         
5 GovernanceSurve… Q3           "If a federa… haven_l… "1=The Con… None         
6 GovernanceSurve… Q4           "Is there a … haven_l… "1=The Con… None         
# ℹ 2 more variables: EncodingCheck <chr>, Status <chr>
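One detail worth checking when every row of MissingValues reads "None": haven's read_sav() converts user-defined missing values to NA by default (user_na = FALSE), in which case the na_values and na_range attributes are not attached to the columns. The sketch below writes a tiny invented dataset to a temporary .sav file and reads it back both ways, so the two results can be compared. It assumes the installed haven version round-trips labelled_spss() columns through write_sav().

```r
library(haven)

# Hypothetical data: 999 is declared as a user-defined missing code for age
demo <- tibble::tibble(
  id  = c(1, 2, 3),
  age = labelled_spss(
    c(34, 999, 51),
    labels    = c(Unknown = 999),
    na_values = 999,
    label     = "Age in years"
  )
)

tmp <- tempfile(fileext = ".sav")
write_sav(demo, tmp)

default_read <- read_sav(tmp)                  # user-missing codes become NA
full_read    <- read_sav(tmp, user_na = TRUE)  # codes and definitions are kept

# Compare what each read retains about the missing value definition
str(attr(default_read$age, "na_values", exact = TRUE))
str(attr(full_read$age, "na_values", exact = TRUE))
class(full_read$age)[1]
```

For a missing value inventory, the second form is the one that carries the definitions forward.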

11.6 Save the Results

We save the resulting table to a CSV file. This CSV acts as the Data Dictionary for the curated deposit.

Code

output_dir <- "Results/Inspect_sav"
dir.create(output_dir, recursive = TRUE, showWarnings = FALSE)

output_file <- file.path(output_dir, paste0("SPSS_Dictionary_", format(Sys.Date(), "%Y%m%d"), ".csv"))

write.csv(report, output_file, row.names = FALSE)
print(paste("Data Dictionary saved to:", output_file))

[1] "Data Dictionary saved to: Results/Inspect_sav/SPSS_Dictionary_20260515.csv"
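Because files that cannot be read are recorded as rows with Status = "Failed" rather than stopping the loop, it may be worth checking for such rows before depositing the CSV. A minimal sketch, using an invented stand-in for the report table:

```r
# Hypothetical stand-in for the 'report' table built earlier
report_demo <- data.frame(
  FileName      = c("a.sav", "a.sav", "b.sav"),
  VariableName  = c("ID", "Q1", "ERROR"),
  VariableLabel = c("(No Label)", "Example item", "Failed to parse file"),
  Status        = c("Success", "Success", "Failed"),
  stringsAsFactors = FALSE
)

# Count rows by status and list the files that could not be read
n_by_status <- table(report_demo$Status)
failed <- report_demo[report_demo$Status == "Failed", c("FileName", "VariableLabel")]

print(n_by_status)
print(failed)
```

For failed files, the VariableLabel column holds the error message captured by tryCatch().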

11.7 Curation Insights

Use the generated CSV to perform these checks:

User-Defined Missing Values (MissingValues): Look for codes such as -9, -99, or 999, and ensure they are documented in the study’s metadata. If you convert this file to CSV, these codes become ordinary numbers (e.g., Age = 999) and will bias downstream analysis unless they are recoded first (see the IBM SPSS Statistics 28 Brief Guide).
Missing Value Labels: Look for variables whose DataType is haven_labelled or numeric but whose ValueLabels column is empty. If such a variable holds coded categories, its codes cannot be interpreted without an external codebook.
Encoding Issues (EncodingCheck): “Contains Special/Non-ASCII” flags character variables whose values include accented or other non-ASCII characters. Such characters are legitimate in many languages, but in files created on older localized Windows versions (e.g., French or Spanish SPSS) they may have been written in the legacy Windows-1252 encoding. If the text looks corrupted (e.g., EncuestÃ³ instead of Encuestó), the data requires character set remediation before long-term storage.
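The corrupted example above is what appears when UTF-8 bytes are read as a single-byte Western encoding. The sketch below reproduces and then reverses that damage with base R's iconv(). It uses "latin1" as a stand-in for Windows-1252 (the two agree for these characters), and it only covers this one simple kind of corruption.

```r
original <- "Encuest\u00f3"   # "Encuestó", stored as UTF-8

# Simulate the damage: the UTF-8 bytes are misread as Latin-1
utf8_bytes <- rawToChar(charToRaw(enc2utf8(original)))
garbled <- iconv(utf8_bytes, from = "latin1", to = "UTF-8")

# Reverse it: map back to single bytes, then declare those bytes as UTF-8
repaired <- iconv(garbled, from = "UTF-8", to = "latin1")
Encoding(repaired) <- "UTF-8"

print(garbled)
print(repaired == original)
```

When the problem originates in the file's declared encoding rather than in later processing, read_sav() also accepts an encoding argument that overrides the encoding stated in the file.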

11.8 Additional Tools

PSPP: It is an open-source alternative to IBM SPSS Statistics. It allows curators to verify .sav files and provides a command-line interface for batch conversion without requiring a proprietary license.
StatTransfer: A commercial tool for converting between statistical formats (SAS, SPSS, Stata, R). It is effective at preserving missing value definitions and variable labels during migration.

11.9 Using the Non-Interactive R Script

For users who want to run this analysis on a server (HPC), in a batch job, or from the command line, here is the pure R script version.

Download the R Script: Inspect_sav_Script.R
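The contents of Inspect_sav_Script.R are not reproduced here. As a rough sketch of how a script of this kind could pick up the directory passed on the command line, the hypothetical helper below reads the first trailing argument and falls back to a default path otherwise.

```r
# Hypothetical helper: first command-line argument, or a default directory
resolve_target_dir <- function(args = commandArgs(trailingOnly = TRUE),
                               default_dir = "data/Inspect_sav/") {
  if (length(args) >= 1 && nzchar(args[1])) args[1] else default_dir
}

target_dir <- resolve_target_dir()
message("Analyzing: ", target_dir)
```

Invoked as `Rscript some_script.R /path/to/data`, the helper would return `/path/to/data`; with no argument it returns the default.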

11.9.1 Example HPC Submission Script (`Inspect_sav_submit.sh`)

#!/bin/bash
#SBATCH --job-name=spss_check
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --time=00:20:00
#SBATCH --mem=8G
#SBATCH --output=logs/spss_check_%j.log

# 1. Load R Module
# (Adjust version based on your cluster, e.g., 'module load R/4.3.0')
module load R

# 2. Define Target Directory
# Replace with the actual path to your SPSS data
DATA_DIR="/scratch/user/project_data/surveys"

# 3. Prepare Environment
# Ensure output directories exist
mkdir -p Results/Inspect_sav
mkdir -p logs

# 4. Run Analysis
echo "Starting SPSS Inspection on $DATA_DIR"
Rscript Inspect_sav_Script.R "$DATA_DIR"

11.10 References

