12  Stata (.dta) Files

Author

Daniel Manrique-Castano

Published

December 1, 2025

12.1 Overview

This notebook inspects Stata (.dta) files. Unlike simple text formats, Stata files are “self-describing” binary containers that bundle raw data with extensive metadata.

Note: Curation Goal

Quantify “metadata richness” and archival readiness. Our objective is to extract embedded variable labels and “Extended Missing Values” (e.g., .a, .b) to ensure the dataset remains contextually usable after preservation.

Warning: Preservation Risk

Proprietary lock-in is a primary concern. Furthermore, Stata’s unique extended missing values are frequently lost when files are blindly converted to CSV, leading to data degradation if not properly documented.
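The loss described above can be reproduced in a few lines. This is a minimal sketch, assuming the haven package is installed; the column name q1 and its codes are hypothetical. haven represents Stata's extended missing values as "tagged" NAs, and the tag does not survive a round trip through CSV.

```r
library(haven)

# Hypothetical survey item: 3 and 5 are valid answers,
# .a and .b stand for two different reasons for missingness
x <- c(3, tagged_na("a"), 5, tagged_na("b"))

# Inside R the two kinds of missingness are still distinguishable
na_tag(x)

# Round trip through a plain CSV file
tmp <- tempfile(fileext = ".csv")
write.csv(data.frame(q1 = x), tmp, row.names = FALSE)
back <- read.csv(tmp)

# After the round trip every missing value is an ordinary NA without a tag
any(is_tagged_na(as.double(back$q1)))
```

If the distinction between the codes matters for analysis, it has to be recorded elsewhere (for example in a codebook) before the file is converted.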

Curation Objectives:

  1. Metadata Coverage: Assess the quality of variable and value labels.
  2. Missing Value Integrity: Detect extended missing codes that require special handling.
  3. Privacy Screening: Scan string variables for Personally Identifiable Information (PII).

12.2 Setup

We use haven (Wickham et al. 2023) to read Stata files while preserving their metadata attributes.

12.2.1 R Packages

If you do not have the required packages, run this command once in your R console:

Code
# install.packages(c("tidyverse", "haven", "labelled", "rstudioapi", "skimr", "DT"))
# Note: tools ships with base R and does not need to be installed.

12.2.2 Load libraries

Code
library(tidyverse)
library(haven)      # Read/Write Stata .dta files
library(labelled)   # Tools for manipulating variable labels
library(skimr)      # Statistical profiling
library(DT)         # Interactive tables
library(tools)      # File utilities
library(rstudioapi)

12.3 Select Target Directory

Select the folder containing the Stata files you wish to inspect.

Note: If running interactively (RStudio), a dialog box will appear. If running via command line or rendering, it defaults to the target_dir parameter defined at the top of this file.

Code
# 1. Try to select interactively if in RStudio
if (interactive() && rstudioapi::isAvailable()) {
  selected_dir <- rstudioapi::selectDirectory(caption = "Select Stata Directory")
} else {
  selected_dir <- NULL
}

# 2. Logic to determine final directory
if (!is.null(selected_dir)) {
  target_dir <- selected_dir
} else {
  target_dir <- params$target_dir
}

print(paste("Analyzing directory:", target_dir))
[1] "Analyzing directory: ."

12.4 Find Stata Files

We scan the directory for files ending in .dta.

Code
dta_files <- list.files(
  path = target_dir,
  pattern = "\\.dta$", 
  recursive = TRUE, 
  full.names = TRUE, 
  ignore.case = TRUE
)

print(paste("Found", length(dta_files), "Stata files."))
[1] "Found 1 Stata files."
Code
head(dta_files)
[1] "./data/Inspect_dta/DATASET_KE.dta"

12.5 Batch Health Check (Metadata & Structure)

This section assesses the quality of the embedded documentation. We define a “Documented Variable” as one that possesses a descriptive Variable Label (e.g., q1 = “Respondent Age”).

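Key Metrics:

  • Label_Coverage_Pct: Share of variables that carry a variable label.
  • Has_Value_Labels: Whether any variable carries value labels (e.g., 1 = Male).
  • PII_Risk: Whether an email-like string appears in the first 1,000 rows of any string variable.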

Code
analyze_dta_health <- function(file_path) {
  
  fname <- basename(file_path)
  file_info <- file.info(file_path)
  
  tryCatch({
    # 1. Read Data (Lazy load not possible with haven, must read full)
    data <- read_dta(file_path)
    
    # 2. Structural Metrics
    n_vars <- ncol(data)
    n_obs <- nrow(data)
    
    # 3. Metadata Quality
    # Check Variable Labels
    # exact = TRUE stops attr() from partially matching "labels" (the value labels)
    var_labels <- map_lgl(data, ~ !is.null(attr(., "label", exact = TRUE)))
    pct_labeled <- round(100 * sum(var_labels) / n_vars, 1)
    
    # Check Value Labels
    val_labels <- map_lgl(data, ~ !is.null(attr(., "labels")))
    has_val_labels <- any(val_labels)
    
    # 4. Extended Missing Check
    # haven imports Stata's extended missing values (.a to .z) as tagged NAs,
    # which live in double columns; is_tagged_na() separates them from plain NA.
    has_ext_missing <- any(map_lgl(data, function(x) {
      is.double(x) && any(is_tagged_na(x))
    }))
    
    # 5. PII Scan (Email Regex)
    char_cols <- select(data, where(is.character))
    pii_found <- FALSE
    if (ncol(char_cols) > 0) {
      email_pattern <- "[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\\.[a-zA-Z]{2,}"
      sample_size <- min(n_obs, 1000)
      pii_check <- char_cols %>%
        slice_head(n = sample_size) %>%
        summarise(across(everything(), ~ any(str_detect(., email_pattern), na.rm = TRUE)))
      pii_found <- any(unlist(pii_check))
    }
    
    tibble(
      FileName = fname,
      Size_MB = round(file_info$size / 1024^2, 2),
      Vars = n_vars,
      Obs = n_obs,
      Label_Coverage_Pct = pct_labeled,
      Has_Value_Labels = has_val_labels,
      Has_Ext_Missing = has_ext_missing,
      PII_Risk = pii_found,
      Status = "Success"
    )
    
  }, error = function(e) {
    tibble(
      FileName = fname,
      Size_MB = round(file_info$size / 1024^2, 2),
      Vars = NA, Obs = NA, Label_Coverage_Pct = NA, Has_Value_Labels = NA,
      Has_Ext_Missing = NA, PII_Risk = NA,
      Status = paste("Read Failed:", e$message)
    )
  })
}

if (length(dta_files) > 0) {
  health_report <- purrr::map_dfr(dta_files, analyze_dta_health)
  datatable(health_report, 
            caption = "Table 1: Stata Metadata Health Check",
            options = list(scrollX = TRUE))
} else {
  message("No Stata files found.")
}
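To make the label-coverage metric concrete, here is a small sketch on a hand-built table rather than a real .dta file. The variables q1, q2, and v3 are hypothetical, and haven and tibble are assumed to be installed. Note that q2 has value labels but no variable label: without exact = TRUE, attr(x, "label") would partially match the "labels" attribute and count q2 as documented.

```r
library(haven)
library(tibble)

# Hypothetical columns: q1 has a variable label, q2 only value labels, v3 nothing
toy <- tibble(
  q1 = labelled(c(34, 51), label = "Respondent Age"),
  q2 = labelled(c(1, 2), labels = c(Male = 1, Female = 2)),
  v3 = c("a", "b")
)

# A variable counts as documented only if it has an exact "label" attribute
has_var_label <- vapply(
  toy, function(col) !is.null(attr(col, "label", exact = TRUE)), logical(1)
)
has_val_labels <- vapply(
  toy, function(col) !is.null(attr(col, "labels", exact = TRUE)), logical(1)
)

# One of three variables is documented
pct_labeled <- round(100 * sum(has_var_label) / ncol(toy), 1)
pct_labeled
```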

12.6 Data Dictionary

To facilitate curation, we extract the full schema of the dataset. This creates a “Codebook” in CSV format, listing every variable name alongside its internal label. This allows curators to check for consistency across years (e.g., ensuring Q1 always means “Age”) without opening Stata.

Code
extract_dictionary <- function(file_path) {
  tryCatch({
    data <- read_dta(file_path, n_max = 1) # Read a single row; column metadata is still attached
    
    map_dfr(names(data), function(var) {
      # exact = TRUE: otherwise "label" can partially match "labels"
      lbl <- attr(data[[var]], "label", exact = TRUE)
      if (is.null(lbl)) lbl <- NA_character_
      
      # Extract value labels if present (first 3 examples)
      val_lbls <- attr(data[[var]], "labels")
      val_str <- if (!is.null(val_lbls)) {
        paste(head(names(val_lbls), 3), collapse = "; ")
      } else {
        NA_character_
      }
      
      tibble(
        FileName = basename(file_path),
        Variable = var,
        Label = lbl,
        Type = typeof(data[[var]]),
        Value_Examples = val_str
      )
    })
  }, error = function(e) NULL)
}

if (length(dta_files) > 0) {
  message("Extracting data dictionaries...")
  full_dictionary <- map_dfr(dta_files, extract_dictionary)
  
  datatable(head(full_dictionary, 50), 
            caption = "Table 2: Variable Dictionary (Preview)",
            options = list(scrollX = TRUE))
}
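The dictionary above keeps only the first three value labels per variable as a preview. When the full code-to-label mapping is needed, it can be unpacked into a long table. The following is a sketch under assumptions: the variable sex and its codes are hypothetical, and haven and tibble are assumed to be installed.

```r
library(haven)
library(tibble)

# Hypothetical labelled variable, as read_dta() would return it
sex <- labelled(
  c(1, 2, 2),
  labels = c(Male = 1, Female = 2),
  label = "Respondent sex"
)

# The "labels" attribute is a named vector: names are labels, values are codes
value_labels <- attr(sex, "labels", exact = TRUE)

# One row per code, suitable for saving next to a CSV export
mapping <- tibble(
  Variable = "sex",
  Code = unname(value_labels),
  Value_Label = names(value_labels)
)
mapping
```

A table in this shape can be written with write_csv() alongside the data, so the meaning of each code is not lost when the .dta file is converted.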

12.7 Save Results

Save the health report and the data dictionary to CSV files for review.

Code
output_dir <- file.path("Results", "Inspect_DTA")
if (!dir.exists(output_dir)) dir.create(output_dir, recursive = TRUE)

health_file <- file.path(output_dir, paste0("DTA_Health_Check_", Sys.Date(), ".csv"))
dict_file <- file.path(output_dir, paste0("DTA_Data_Dictionary_", Sys.Date(), ".csv"))

write_csv(health_report, health_file)
write_csv(full_dictionary, dict_file)

message("Reports saved:")
message("1. ", health_file)
message("2. ", dict_file)

12.8 Curation Insights

  • Documentation Quality: If Label_Coverage_Pct is low (< 50%), the dataset is “Orphaned Data.” Without an external PDF codebook, the variables v1, v2, etc., are meaningless.

  • Preservation Strategy: If the file contains Value Labels (e.g., 1=Male), converting it to a plain CSV will result in data loss (you get 1, not Male). You must generate a setup script (R/SPSS/SAS) or use a format like CSV + DDI to preserve these mappings.

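  • Privacy: Stata files often contain string variables with open-ended survey responses. The PII_Risk flag only detects email-like strings in the first 1,000 rows, so an unflagged file is not guaranteed to be clean. If PII_Risk is flagged, review these columns manually for names, emails, and other identifiers.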
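As an illustration of the privacy check, the sketch below applies the same email pattern used in the health check to a small, made-up table and reports which columns would need manual review. It uses only base R; the data frame responses and its contents are hypothetical.

```r
# Same email pattern as in the health check
email_pattern <- "[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\\.[a-zA-Z]{2,}"

# Hypothetical open-ended survey responses
responses <- data.frame(
  comment = c("No complaints", "Reach me at jane.doe@example.org", NA),
  region = c("North", "South", "East"),
  stringsAsFactors = FALSE
)

# TRUE for every column where at least one value looks like an email address
flagged <- vapply(
  responses,
  function(col) any(grepl(email_pattern, col), na.rm = TRUE),
  logical(1)
)

# Columns to review by hand
names(flagged)[flagged]
```

A pattern like this catches only email addresses; names, phone numbers, and free-text identifiers still require a human reader.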

12.9 Additional Tools

Curators often need specialized tools to manage the transition from proprietary Stata files to open preservation formats.

  • Stat/Transfer: The industry-standard software for high-fidelity conversion between statistical formats (Stata, SPSS, SAS, R) while preserving variable labels and missing values (https://stattransfer.com).

  • Colectica / DDI: A suite of tools for documenting data using the Data Documentation Initiative (DDI) standard. It allows you to extract Stata metadata into XML for long-term archiving (https://colectica.com).

  • sjlabelled (R Package): A powerful R toolkit for dealing with labelled data. It allows you to modify, remove, or standardize variable labels programmatically (Lüdecke 2018).

12.10 Using the Non-Interactive R Script

For users who want to run this analysis on a server (HPC), in a batch job, or from the command line, here is the pure R script version.

12.10.1 The Inspect_dta_Script.R Script

Download the R Script: Inspect_dta_Script.R

12.10.2 Example HPC Submission Script (Inspect_dta_submit.sh)

#!/bin/bash
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --time=00:10:00
#SBATCH --job-name=dta_check

# Load R module
module load R

# Define data directory
DATA_DIR="/scratch/your_user/surveys"
OUTPUT_DIR="/scratch/your_user/dta_results"

# Run Script
Rscript Inspect_dta_Script.R "$DATA_DIR"
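The submission script passes the data directory as a positional argument. The contents of Inspect_dta_Script.R are not shown here, so the following is only a sketch of how an R script could pick up that argument; the helper resolve_target_dir() and its default value are hypothetical, not part of the published script.

```r
# Hypothetical helper: take the first command-line argument as the
# target directory, or fall back to a default when none is given
resolve_target_dir <- function(args = commandArgs(trailingOnly = TRUE),
                               default = ".") {
  if (length(args) >= 1 && nzchar(args[1])) args[1] else default
}

target_dir <- resolve_target_dir()
message("Analyzing directory: ", target_dir)
```

Falling back to a default keeps the same script usable both from a batch job and from an interactive session where no argument is supplied.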

12.11 References