13 SAS Data (.sas7bdat) Files

Author

Daniel Manrique-Castano

Published

December 18, 2025

13.1 Overview

SAS files (.sas7bdat) are a standard for clinical and statistical data. However, they often split information across multiple files, creating a significant curation challenge.

Curation Goal

Ensure the completeness and interoperability of SAS datasets. Our objective is to verify that both data files and catalog files (.sas7bcat) are present, enabling the correct decoding of categorical variables and labels.

Preservation Risk

Proprietary “CPORT” formats and missing catalog files are major risks. Without the associated formats, numeric codes in the data file become cryptic and contextually meaningless, requiring expensive proprietary software to recover.

This notebook performs a rigorous inventory:

Catalog Check: Scans for .sas7bcat files to ensure metadata isn’t missing.
Format Validation: Distinguishes between open XPORT (.xpt) and proprietary CPORT formats.
Encoding Scan: Checks for character corruption common in older datasets.

13.2 Setup

We use the haven package. It is the gold standard for reading SAS files in R and can handle both the data files and, if available, the catalog files.

13.2.1 R Packages

If you do not have the required packages, run this command once in your R console:

Code

# install.packages(c("tidyverse", "haven", "sas7bdat", "rstudioapi", "stringr"))

13.2.2 Load libraries

Code

library(tidyverse)
library(haven)      # Primary reader (Gold Standard)
library(sas7bdat)   # Fallback reader
library(rstudioapi) # For directory selection
library(stringr)    # For encoding checks

13.3 Select Target Directory

We select the folder containing the SAS files.

Note: If running interactively, a dialog box will appear. Otherwise, it defaults to the target_dir parameter.

Code

# 1. Try to select interactively if in RStudio
if (interactive() && .Platform$OS.type == "windows") { 
  selected_dir <- rstudioapi::selectDirectory(caption = "Select SAS Directory")
} else {
  selected_dir <- NULL
}

# 2. Logic to determine final directory
if (!is.null(selected_dir)) {
  target_dir <- selected_dir
} else {
  target_dir <- params$target_dir
}

print(paste("Analyzing directory:", target_dir))

[1] "Analyzing directory: data/Inspect_sas7bdat/"

13.4 Find SAS Files

We scan the directory for files ending in .sas7bdat.

Code

# 1. Find Data Files
sas_files <- list.files(
  path = target_dir,
  pattern = "\\.sas7bdat$|\\.xpt$", 
  recursive = TRUE, 
  full.names = TRUE, 
  ignore.case = TRUE
)

# 2. Find Catalog Files
cat_files <- list.files(
  path = target_dir,
  pattern = "\\.sas7bcat$", 
  recursive = TRUE, 
  full.names = TRUE, 
  ignore.case = TRUE
)

print(paste("Found", length(sas_files), "SAS Data files."))

[1] "Found 8 SAS Data files."

Code

print(paste("Found", length(cat_files), "SAS Catalog files (.sas7bcat)."))

[1] "Found 0 SAS Catalog files (.sas7bcat)."

Code

if (length(cat_files) == 0) {
  message("⚠️ Warning: No Catalog files found. Categorical variables may lack value labels.")
} else {
  print(basename(cat_files))
}

13.5 Generate Data Dictionary

We iterate through the files to extract metadata. We read only the first 100 rows to extract the metadata without processing the full dataset volume.

Code

message("Generating Data Dictionary...")

# Helper: Check for non-ASCII characters (Encoding Risk)
check_encoding <- function(col) {
  if (is.character(col)) {
    # Check if any character in the vector is non-ASCII
    # This often reveals if special characters (accents) are present
    has_special <- any(str_detect(col, "[^\\x00-\\x7F]"), na.rm = TRUE)
    return(if (has_special) "Contains Special/Non-ASCII" else "ASCII")
  }
  return("N/A")
}

report <- purrr::map_dfr(sas_files, function(file_path) {
  
  fname <- basename(file_path)
  
  # --- STRATEGY 0: DETECT FORMAT ---
  # Read first 80 bytes to check for "SAS FILE" signature common in Transport files
  header_check <- tryCatch({
    con <- file(file_path, "rb")
    on.exit(close(con))
    raw_header <- readBin(con, "raw", n = 80)
    rawToChar(raw_header)
  }, error = function(e) "")
  
  is_transport <- grepl("SAS FILE", header_check, fixed = TRUE)
  
  if (is_transport) {
    # --- STRATEGY 1A: HAVEN (XPORT) ---
    tryCatch({
      data <- read_xpt(file_path, n_max = 100)
      
      purrr::map_dfr(names(data), function(var) {
        col <- data[[var]]
        lbl <- attr(col, "label", exact = TRUE)
        if (is.null(lbl)) lbl <- "(No Label)"
        
        val_lbls <- attr(col, "labels", exact = TRUE)
        val_str <- if (!is.null(val_lbls)) paste(val_lbls, names(val_lbls), sep="=", collapse="; ") else ""
        
        tibble(
          FileName = fname, VariableName = var, VariableLabel = lbl, 
          DataType = class(col)[1], ValueLabels = substr(val_str, 1, 100),
          EncodingCheck = check_encoding(col),
          Method = "Haven (XPORT)"
        )
      })
    }, error = function(e_xpt) {
       tibble(
        FileName = fname, 
        VariableName = "ERROR", 
        VariableLabel = paste("Detected SAS Transport but failed to read (likely CPORT format):", e_xpt$message), 
        DataType = "Error", 
        ValueLabels = "Requires SAS to convert CPORT to XPORT/SAS7BDAT",
        EncodingCheck = "NA",
        Method = "Failed (XPORT/CPORT)"
      )
    })
    
  } else {
    # --- STRATEGY 1B: HAVEN (SAS7BDAT) ---
    tryCatch({
      # Attempt to read with Haven
      data <- read_sas(file_path, n_max = 100)
      
      # If successful, extract metadata
      purrr::map_dfr(names(data), function(var) {
        col <- data[[var]]
        lbl <- attr(col, "label", exact = TRUE)
        if (is.null(lbl)) lbl <- "(No Label)"
        
        val_lbls <- attr(col, "labels", exact = TRUE)
        val_str <- if (!is.null(val_lbls)) paste(val_lbls, names(val_lbls), sep="=", collapse="; ") else ""
        
        tibble(
          FileName = fname, VariableName = var, VariableLabel = lbl, 
          DataType = class(col)[1], ValueLabels = substr(val_str, 1, 100),
          EncodingCheck = check_encoding(col),
          Method = "Haven"
        )
      })
      
    }, error = function(e_haven) {
      
      # --- STRATEGY 2: SAS7BDAT (Fallback) ---
      tryCatch({
        # Haven failed, so try the pure-R sas7bdat package
        data_fallback <- read.sas7bdat(file_path, debug = FALSE) 
        
        col_info <- attr(data_fallback, "column.info")
        
        if (!is.null(col_info)) {
          purrr::map_dfr(col_info, function(col) {
            tibble(
              FileName = fname,
              VariableName = col$name,
              VariableLabel = if(!is.null(col$label) && col$label != "") col$label else "(No Label)",
              DataType = col$type,
              ValueLabels = "(Check Catalog)", 
              EncodingCheck = "Unknown",
              Method = "Fallback (sas7bdat)"
            )
          })
        } else {
          purrr::map_dfr(names(data_fallback), function(var) {
               tibble(FileName = fname, VariableName = var, VariableLabel = "-", 
                      DataType = "-", ValueLabels = "-", EncodingCheck = "Unknown", Method = "Fallback (Basic)")
          })
        }
        
      }, error = function(e_fallback) {
        # --- BOTH FAILED ---
        tibble(
          FileName = fname, 
          VariableName = "ERROR", 
          VariableLabel = paste("Haven:", e_haven$message, "| Fallback:", e_fallback$message), 
          DataType = "Error", 
          ValueLabels = "",
          EncodingCheck = "NA",
          Method = "Failed"
        )
      })
    })
  }
})

# Display preview
print("--- SAS Data Dictionary Preview ---")

[1] "--- SAS Data Dictionary Preview ---"

Code

head(report)

# A tibble: 6 × 7
  FileName  VariableName VariableLabel DataType ValueLabels EncodingCheck Method
  <chr>     <chr>        <chr>         <chr>    <chr>       <chr>         <chr> 
1 abi_chan… studyid      studyid       numeric  ""          N/A           Haven 
2 abi_chan… female       Derived Fema… numeric  ""          N/A           Haven 
3 abi_chan… prob10chd_c  Framingham R… numeric  ""          N/A           Haven 
4 abi_chan… alloloadcrp… Measure of A… numeric  ""          N/A           Haven 
5 abi_chan… EDUC3_new    Education st… numeric  ""          N/A           Haven 
6 abi_chan… ALCOHOL_FRE… ALCOHOL_FREQ… numeric  ""          N/A           Haven

13.6 Save Results

Save the dictionary to a CSV file for review.

Code

output_dir <- "Results/Inspect_sas7bdat"
dir.create(output_dir, recursive = TRUE, showWarnings = FALSE)

output_file <- file.path(output_dir, paste0("SAS_Dictionary_", format(Sys.Date(), "%Y%m%d"), ".csv"))

write.csv(report, output_file, row.names = FALSE)
print(paste("Data Dictionary saved to:", output_file))

[1] "Data Dictionary saved to: Results/Inspect_sas7bdat/SAS_Dictionary_20260515.csv"

13.7 Curation Insights

Use the generated CSV to perform these checks:

CPORT Detection: Files flagged as Failed (XPORT/CPORT) are likely CPORT files. These are proprietary and the depositor needs to re-export them as standard XPORT (.xpt) or .sas7bdat using SAS.
Missing Catalogs: If you see categorical variables (e.g., Treatment values 1, 2) with empty ValueLabels and NO .sas7bcat files were found in the inventory step. In this case, the metadata is incomplete and a request for the missing .sas7bcat file must be done.
Encoding Issues: Check the EncodingCheck column. If it says “Contains Special/Non-ASCII”, verify the text. If the text renders unreadable characters (e.g., Ã± instead of ñ), the file was likely saved in WINDOWS-1252 but read as UTF-8. You may need to use read_sas(…, encoding = “latin1”) to fix it.

13.8 Additional Tools

SAS Universal Viewer: It is a free tool from SAS Institute that allows you to view (but not edit) SAS datasets and catalogs without a full SAS license. It can be used to verify CPORT files.
StatTransfer: is a commercial tool for converting between statistical formats (SAS, SPSS, Stata, R). This tool can handle effective CPORT conversion.

13.9 Using the Non-Interactive R Script

For users who want to run this analysis on a server (HPC), in a batch job, or from the command line, here is the pure R script version.

Download the R Script: Inspect_sas7bdat_Script.R

13.9.1 Example HPC Submission Script (`Inspect_sas7bdat_submit.sh`)

#!/bin/bash
#SBATCH --job-name=sas_check
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --time=00:20:00
#SBATCH --mem=8G
#SBATCH --output=logs/sas_check_%j.log

module load R

# Define target directory
TARGET_DIR="/scratch/user/project_data/clinical_trials"

# Prepare folders
mkdir -p Results/Inspect_sas7bdat
mkdir -p logs

# Run
echo "Starting SAS Inspection on $TARGET_DIR"
Rscript Inspect_sas7bdat_Script.R "$TARGET_DIR"

13.10 References

--- title: "SAS Data (.sas7bdat) Files" author: "Daniel Manrique-Castano" date: "2025-12-18" format: html: toc: true toc-location: left code-fold: true theme: cosmo bibliography: references.bib params: target_dir: "data/Inspect_sas7bdat/" --- ## Overview SAS files (`.sas7bdat`) are a standard for clinical and statistical data. However, they often split information across multiple files, creating a significant curation challenge. ::: {.callout-note title="Curation Goal"} Ensure the completeness and interoperability of SAS datasets. Our objective is to verify that both data files and catalog files (`.sas7bcat`) are present, enabling the correct decoding of categorical variables and labels. ::: ::: {.callout-warning title="Preservation Risk"} Proprietary "CPORT" formats and missing catalog files are major risks. Without the associated formats, numeric codes in the data file become cryptic and contextually meaningless, requiring expensive proprietary software to recover. ::: **This notebook performs a rigorous inventory:** 1. **Catalog Check:** Scans for `.sas7bcat` files to ensure metadata isn't missing. 2. **Format Validation:** Distinguishes between open XPORT (.xpt) and proprietary CPORT formats. 3. **Encoding Scan:** Checks for character corruption common in older datasets. ------------------------------------------------------------------------ ## Setup We use the `haven` package. It is the gold standard for reading SAS files in R and can handle both the data files and, if available, the catalog files. ### R Packages If you do not have the required packages, run this command once in your R console: ```{r} # install.packages(c("tidyverse", "haven", "sas7bdat", "rstudioapi", "stringr")) ``` ### Load libraries ```{r} #| label: load-libraries #| message: false library(tidyverse) library(haven) # Primary reader (Gold Standard) library(sas7bdat) # Fallback reader library(rstudioapi) # For directory selection library(stringr) # For encoding checks ``` ## Select Target Directory We select the folder containing the SAS files. Note: If running interactively, a dialog box will appear. Otherwise, it defaults to the target_dir parameter. ```{r} #| label: select-target-dir # 1. Try to select interactively if in RStudio if (interactive() && .Platform$OS.type == "windows") { selected_dir <- rstudioapi::selectDirectory(caption = "Select SAS Directory") } else { selected_dir <- NULL } # 2. Logic to determine final directory if (!is.null(selected_dir)) { target_dir <- selected_dir } else { target_dir <- params$target_dir } print(paste("Analyzing directory:", target_dir)) ``` ## Find SAS Files We scan the directory for files ending in .sas7bdat. ```{r} #| label: find-files # 1. Find Data Files sas_files <- list.files( path = target_dir, pattern = "\\.sas7bdat$|\\.xpt$", recursive = TRUE, full.names = TRUE, ignore.case = TRUE ) # 2. Find Catalog Files cat_files <- list.files( path = target_dir, pattern = "\\.sas7bcat$", recursive = TRUE, full.names = TRUE, ignore.case = TRUE ) print(paste("Found", length(sas_files), "SAS Data files.")) print(paste("Found", length(cat_files), "SAS Catalog files (.sas7bcat).")) if (length(cat_files) == 0) { message("⚠️ Warning: No Catalog files found. Categorical variables may lack value labels.") } else { print(basename(cat_files)) } ``` ## Generate Data Dictionary We iterate through the files to extract metadata. We read only the first 100 rows to extract the metadata without processing the full dataset volume. ```{r} #| label: extraction-logic #| warning: false #| message: false message("Generating Data Dictionary...") # Helper: Check for non-ASCII characters (Encoding Risk) check_encoding <- function(col) { if (is.character(col)) { # Check if any character in the vector is non-ASCII # This often reveals if special characters (accents) are present has_special <- any(str_detect(col, "[^\\x00-\\x7F]"), na.rm = TRUE) return(if (has_special) "Contains Special/Non-ASCII" else "ASCII") } return("N/A") } report <- purrr::map_dfr(sas_files, function(file_path) { fname <- basename(file_path) # --- STRATEGY 0: DETECT FORMAT --- # Read first 80 bytes to check for "SAS FILE" signature common in Transport files header_check <- tryCatch({ con <- file(file_path, "rb") on.exit(close(con)) raw_header <- readBin(con, "raw", n = 80) rawToChar(raw_header) }, error = function(e) "") is_transport <- grepl("SAS FILE", header_check, fixed = TRUE) if (is_transport) { # --- STRATEGY 1A: HAVEN (XPORT) --- tryCatch({ data <- read_xpt(file_path, n_max = 100) purrr::map_dfr(names(data), function(var) { col <- data[[var]] lbl <- attr(col, "label", exact = TRUE) if (is.null(lbl)) lbl <- "(No Label)" val_lbls <- attr(col, "labels", exact = TRUE) val_str <- if (!is.null(val_lbls)) paste(val_lbls, names(val_lbls), sep="=", collapse="; ") else "" tibble( FileName = fname, VariableName = var, VariableLabel = lbl, DataType = class(col)[1], ValueLabels = substr(val_str, 1, 100), EncodingCheck = check_encoding(col), Method = "Haven (XPORT)" ) }) }, error = function(e_xpt) { tibble( FileName = fname, VariableName = "ERROR", VariableLabel = paste("Detected SAS Transport but failed to read (likely CPORT format):", e_xpt$message), DataType = "Error", ValueLabels = "Requires SAS to convert CPORT to XPORT/SAS7BDAT", EncodingCheck = "NA", Method = "Failed (XPORT/CPORT)" ) }) } else { # --- STRATEGY 1B: HAVEN (SAS7BDAT) --- tryCatch({ # Attempt to read with Haven data <- read_sas(file_path, n_max = 100) # If successful, extract metadata purrr::map_dfr(names(data), function(var) { col <- data[[var]] lbl <- attr(col, "label", exact = TRUE) if (is.null(lbl)) lbl <- "(No Label)" val_lbls <- attr(col, "labels", exact = TRUE) val_str <- if (!is.null(val_lbls)) paste(val_lbls, names(val_lbls), sep="=", collapse="; ") else "" tibble( FileName = fname, VariableName = var, VariableLabel = lbl, DataType = class(col)[1], ValueLabels = substr(val_str, 1, 100), EncodingCheck = check_encoding(col), Method = "Haven" ) }) }, error = function(e_haven) { # --- STRATEGY 2: SAS7BDAT (Fallback) --- tryCatch({ # Haven failed, so try the pure-R sas7bdat package data_fallback <- read.sas7bdat(file_path, debug = FALSE) col_info <- attr(data_fallback, "column.info") if (!is.null(col_info)) { purrr::map_dfr(col_info, function(col) { tibble( FileName = fname, VariableName = col$name, VariableLabel = if(!is.null(col$label) && col$label != "") col$label else "(No Label)", DataType = col$type, ValueLabels = "(Check Catalog)", EncodingCheck = "Unknown", Method = "Fallback (sas7bdat)" ) }) } else { purrr::map_dfr(names(data_fallback), function(var) { tibble(FileName = fname, VariableName = var, VariableLabel = "-", DataType = "-", ValueLabels = "-", EncodingCheck = "Unknown", Method = "Fallback (Basic)") }) } }, error = function(e_fallback) { # --- BOTH FAILED --- tibble( FileName = fname, VariableName = "ERROR", VariableLabel = paste("Haven:", e_haven$message, "| Fallback:", e_fallback$message), DataType = "Error", ValueLabels = "", EncodingCheck = "NA", Method = "Failed" ) }) }) } }) # Display preview print("--- SAS Data Dictionary Preview ---") head(report) ``` ## Save Results Save the dictionary to a CSV file for review. ```{r} #| label: save-results output_dir <- "Results/Inspect_sas7bdat" dir.create(output_dir, recursive = TRUE, showWarnings = FALSE) output_file <- file.path(output_dir, paste0("SAS_Dictionary_", format(Sys.Date(), "%Y%m%d"), ".csv")) write.csv(report, output_file, row.names = FALSE) print(paste("Data Dictionary saved to:", output_file)) ``` ## Curation Insights Use the generated CSV to perform these checks: - **CPORT Detection:** Files flagged as Failed (XPORT/CPORT) are likely CPORT files. These are proprietary and the depositor needs to re-export them as standard XPORT (.xpt) or .sas7bdat using SAS. - **Missing Catalogs:** If you see categorical variables (e.g., Treatment values 1, 2) with empty ValueLabels and NO .sas7bcat files were found in the inventory step. In this case, the metadata is incomplete and a request for the missing .sas7bcat file must be done. - **Encoding Issues:** Check the EncodingCheck column. If it says "Contains Special/Non-ASCII", verify the text. If the text renders unreadable characters (e.g., Ã± instead of ñ), the file was likely saved in WINDOWS-1252 but read as UTF-8. You may need to use read_sas(..., encoding = "latin1") to fix it. ## Additional Tools - **SAS Universal Viewer:** It is a [free tool](https://support.sas.com/en/software/universal-viewer-support.html) from SAS Institute that allows you to view (but not edit) SAS datasets and catalogs without a full SAS license. It can be used to verify CPORT files. - **StatTransfer:** is a [commercial tool](https://stattransfer.com/) for converting between statistical formats (SAS, SPSS, Stata, R). This tool can handle effective CPORT conversion. ## Using the Non-Interactive R Script For users who want to run this analysis on a server (HPC), in a batch job, or from the command line, here is the pure R script version. Download the **R Script:** [**`Inspect_sas7bdat_Script.R`**](Scripts/Inspect_sas7bdat_Script.R) ### Example HPC Submission Script (`Inspect_sas7bdat_submit.sh`) ``` bash #!/bin/bash #SBATCH --job-name=sas_check #SBATCH --nodes=1 #SBATCH --ntasks=1 #SBATCH --time=00:20:00 #SBATCH --mem=8G #SBATCH --output=logs/sas_check_%j.log module load R # Define target directory TARGET_DIR="/scratch/user/project_data/clinical_trials" # Prepare folders mkdir -p Results/Inspect_sas7bdat mkdir -p logs # Run echo "Starting SAS Inspection on $TARGET_DIR" Rscript Inspect_sas7bdat_Script.R "$TARGET_DIR" ``` ## References ::: {#refs} :::

13.1 Overview

13.2 Setup

13.2.1 R Packages

13.2.2 Load libraries

13.3 Select Target Directory

13.4 Find SAS Files

13.5 Generate Data Dictionary

13.6 Save Results

13.7 Curation Insights

13.8 Additional Tools

13.9 Using the Non-Interactive R Script

13.9.1 Example HPC Submission Script (Inspect_sas7bdat_submit.sh)

13.10 References

13.9.1 Example HPC Submission Script (`Inspect_sas7bdat_submit.sh`)