Author

Daniel Manrique-Castano

Published

December 8, 2025

3.1 Overview

This notebook provides a static analysis framework for auditing R code (.R) and literate programming documents (.qmd, .Rmd).

Note: Curation Goal

Static analysis refers to the examination of source code without executing it. Our objective is to safely assess code quality, reproducibility, and potential security risks without triggering harmful scripts.

Warning: Preservation Risk

“Code rot” is a pervasive issue in research curation. Scripts that run perfectly on a researcher’s laptop often fail in archival environments due to undocumented dependencies, hard-coded absolute paths (often introduced through setwd()), or outdated package versions (Stodden 2010).

Key Curation Objectives:

  1. Structural/Syntax Validation: Verify that script files contain syntactically valid R code.
  2. Dependency Mapping: Extract explicit and implicit package calls to assist in environment reconstruction.
  3. Risk Assessment: Detect commands that threaten reproducibility or security.

3.2 Setup

We use tidyverse for data manipulation, DT for the interactive report table, and rstudioapi for interactive directory selection.

3.2.1 Install Packages

If you do not have the required packages, run this command once in your R console:

Code
# tools ships with base R and does not need to be installed
# install.packages(c("DT", "tidyverse", "rstudioapi", "readr"))

3.2.2 Load libraries

Code
library(tidyverse)
library(DT)
library(rstudioapi)
library(readr)
library(tools)

3.3 Select Target Directory

Select the directory containing the code files to be analyzed. If running interactively, a dialog box will appear; otherwise, the script defaults to the parameter defined in the YAML header.

Code
# 1. Try to select interactively if running inside RStudio (any operating system)
if (interactive() && rstudioapi::isAvailable()) {
  selected_dir <- rstudioapi::selectDirectory(caption = "Select Code Directory")
} else {
  selected_dir <- NULL
}

# 2. Logic to determine final directory
if (!is.null(selected_dir)) {
  target_dir <- selected_dir
} else {
  target_dir <- params$target_dir
}

print(paste("Analyzing directory:", target_dir))
[1] "Analyzing directory: ."

3.4 Find Code Files

We scan the directory for .R, .qmd, and .Rmd files.

Code
code_files <- list.files(
  path = target_dir,
  pattern = "\\.(R|qmd|Rmd)$", 
  recursive = TRUE, 
  full.names = TRUE, 
  ignore.case = TRUE
)

print(paste("Found", length(code_files), "code files."))
[1] "Found 91 code files."
Code
head(code_files)
[1] "./_book/Scripts/Inspect_Code_Script.R"      
[2] "./_book/Scripts/Inspect_Containers_Script.R"
[3] "./_book/Scripts/Inspect_csv_Script.R"       
[4] "./_book/Scripts/Inspect_dta_Script.R"       
[5] "./_book/Scripts/Inspect_Extensions_Script.R"
[6] "./_book/Scripts/Inspect_gpkg_Script.R"      
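
To see which file names the pattern accepts, it can be tried on a few names with base R's grepl(). This is only an illustrative sketch: the file names are invented, and ignore.case = TRUE mirrors the list.files() call above.

```r
# Hypothetical file names, used only to illustrate the pattern
candidates <- c("analysis.R", "report.qmd", "notes.Rmd",
                "legacy.r", "data.csv", "README.md", "model.RData")

# Same regular expression as in list.files(); ignore.case = TRUE
# lets lower-case extensions such as ".r" match as well
is_code <- grepl("\\.(R|qmd|Rmd)$", candidates, ignore.case = TRUE)

# Names that would be picked up by the scan
candidates[is_code]
```

Because the pattern is anchored with $, a name such as model.RData is not mistaken for an R script. Note that a recursive scan also descends into build folders (the listing above includes files under _book), so the count may include rendered copies of the same scripts.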

3.5 Code Analysis Function

Because static analysis never executes the source code, it is safer for curators than running untrusted scripts. We parse the code to check for syntax errors and use regular expressions (regex) to identify dependencies and risks. The function below performs four checks per file:

  • Safe Reading: Uses readr::read_lines to handle encoding nuances.

  • Syntax Checking: Uses R’s parse() function to verify structural integrity.

  • Comment Stripping: Removes comments to prevent false positives (e.g., commented-out libraries).

  • Pattern Matching: Scans for dependencies, risky commands, and secrets.
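
Before reading the full function, the comment-stripping and pattern-matching steps can be tried on a small snippet. The sketch below is an illustration under stated assumptions: the example script is invented, capture_all is a hypothetical helper, and base R's gregexpr() and regmatches() stand in for the stringr functions so that the block runs without extra packages. The two regular expressions are copied from the function.

```r
# Invented example script, held as a character vector of lines
raw_lines <- c(
  "library(dplyr)",
  "# library(ggplot2)   commented out, should be ignored",
  "require('readr')",
  "x <- stringr::str_trim('  a  ')",
  "setwd('C:/Users/Dan/Project')"
)

# Strip comments (heuristic: drops everything after '#', even inside strings)
clean_lines <- gsub("#.*", "", raw_lines)
code <- paste(clean_lines, collapse = "\n")

# Patterns copied from the analysis function
library_call  <- "(?:library|require|p_load)\\s*\\(\\s*[\"']?([a-zA-Z0-9\\.]+)[\"']?\\s*\\)"
implicit_call <- "([a-zA-Z0-9\\.]+)::[a-zA-Z0-9_\\.]+"

# Hypothetical helper: first capture group of every match, base R only
capture_all <- function(pattern, text) {
  hits <- regmatches(text, gregexpr(pattern, text, perl = TRUE))[[1]]
  sub(pattern, "\\1", hits, perl = TRUE)
}

explicit_pkgs <- capture_all(library_call, code)
implicit_pkgs <- capture_all(implicit_call, code)
pkgs <- sort(unique(c(explicit_pkgs, implicit_pkgs)))

# Risk check: is there any setwd() call outside comments?
has_setwd <- any(grepl("setwd\\s*\\(", clean_lines))

pkgs
has_setwd
```

Two limitations follow from the patterns themselves: a multi-package call such as p_load(dplyr, ggplot2) is not captured, because the expression expects a single name followed by the closing parenthesis, and the comment stripping also truncates lines containing a # inside a string (for example a hex colour).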

Code
analyze_r_file <- function(file_path) {
  
  fname <- basename(file_path)
  
  # Regex Patterns
  patterns <- list(
    # Explicit loading: library(pkg), require(pkg), p_load(pkg)
    library_call = "(?:library|require|p_load)\\s*\\(\\s*[\"']?([a-zA-Z0-9\\.]+)[\"']?\\s*\\)",
    # Implicit loading: package::function
    implicit_call = "([a-zA-Z0-9\\.]+)::[a-zA-Z0-9_\\.]+",
    # API Tokens (Heuristics for GitHub, Slack, etc.)
    tokens = "(?:ghp_|sk-|xoxb-|xoxp-)[a-zA-Z0-9]+"
  )
  
  # Risk Patterns (Bryan, 2017)
  risk_patterns <- list(
    "Hard Setwd"    = "setwd\\s*\\(",
    "System Call"   = "(?:system|shell|system2)\\s*\\(",
    "Web Download"  = "(?:download\\.file|curl_download)\\s*\\(",
    "Source File"   = "source\\s*\\("
  )
  
  # Absolute Path Pattern (Windows/Unix roots)
  abs_path_pattern <- "(?:[a-zA-Z]:\\\\|/Users/|/home/|/scratch/)"

  tryCatch({
    # 1. Read File Content
    raw_lines <- readr::read_lines(file_path, lazy = FALSE)
    
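    # 2. Syntax Validation
    # parse() only understands pure R. Literate documents (.qmd, .Rmd) mix
    # YAML and prose with code chunks, so parsing them whole would report
    # spurious errors; they are marked as not checked instead.
    syntax_status <- "Valid"
    if (tolower(tools::file_ext(file_path)) == "r") {
      tryCatch({
        parse(file = file_path, keep.source = FALSE)
      }, error = function(e) {
        clean_msg <- gsub("[\r\n]+", " ", e$message)
        syntax_status <<- paste("Error:", clean_msg)
      })
    } else {
      syntax_status <- "Not checked (literate document)"
    }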
    
    # 3. Strip Comments for Analysis
    clean_lines <- gsub("#.*", "", raw_lines)
    content_str <- paste(clean_lines, collapse = "\n")
    
    # 4. Extract Dependencies
    lib_matches <- str_match_all(content_str, patterns$library_call)[[1]]
    explicit_pkgs <- if (length(lib_matches) > 0) lib_matches[, 2] else character(0)
    
    colon_matches <- str_match_all(content_str, patterns$implicit_call)[[1]]
    implicit_pkgs <- if (length(colon_matches) > 0) colon_matches[, 2] else character(0)
    
    all_pkgs <- unique(c(explicit_pkgs, implicit_pkgs))
    all_pkgs <- setdiff(all_pkgs, "base") # Exclude base R
    packages_str <- paste(sort(all_pkgs), collapse = ", ")
    
    # 5. Identify Risks
    risks_found <- names(risk_patterns) %>% 
      map_chr(function(risk_name) {
        # map_chr() expects character results, so use the character NA
        if (any(str_detect(clean_lines, risk_patterns[[risk_name]]))) return(risk_name) else return(NA_character_)
      }) %>% 
      discard(is.na) %>% 
      paste(collapse = "; ")
      
    # 6. Count Absolute Paths
    num_abs_paths <- sum(str_count(clean_lines, abs_path_pattern))
    
    # 7. Scan for Secrets (on raw lines)
    num_tokens <- sum(str_count(raw_lines, patterns$tokens))
    
    tibble(
      FileName = fname,
      FileType = tools::file_ext(fname),
      Syntax_Check = syntax_status,
      Packages = substr(packages_str, 1, 150),
      AbsPathsFound = num_abs_paths,
      Other_Risks = risks_found,
      Potential_Secrets = num_tokens,
      Status = "Success"
    )
    
  }, error = function(e) {
    tibble(
      FileName = fname,
      FileType = tools::file_ext(fname),
      Syntax_Check = paste("Read Failed:", e$message),
      Packages = "", AbsPathsFound = NA, Other_Risks = "", Potential_Secrets = NA,
      Status = "Failed"
    )
  })
}
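
A note on literate documents: parse() expects pure R code, so a .qmd or .Rmd file read as a whole will usually be reported as a syntax error because of its YAML header and prose. One possible approach, sketched here as an assumption rather than as part of the notebook's method, is to keep only the lines inside R chunks and pass them to parse(text = ...). The helpers extract_r_chunks and check_syntax are hypothetical names, and the document content is invented; knitr::purl() is a more robust way to extract chunks in practice.

```r
fence <- strrep("`", 3)  # the three-backtick chunk delimiter

# Hypothetical helper: keep only lines inside R chunks of a .qmd/.Rmd file
extract_r_chunks <- function(lines) {
  in_chunk <- FALSE
  keep <- logical(length(lines))
  for (i in seq_along(lines)) {
    if (!in_chunk && grepl("^\\s*`{3,}\\s*\\{[rR][ ,}]", lines[i], perl = TRUE)) {
      in_chunk <- TRUE             # opening fence: skip the fence line itself
    } else if (in_chunk && grepl("^\\s*`{3,}\\s*$", lines[i], perl = TRUE)) {
      in_chunk <- FALSE            # closing fence
    } else {
      keep[i] <- in_chunk          # ordinary line: keep only if inside a chunk
    }
  }
  lines[keep]
}

# Hypothetical helper: "Valid" or the parser message, without running the code
check_syntax <- function(code_lines) {
  tryCatch({
    parse(text = code_lines, keep.source = FALSE)
    "Valid"
  }, error = function(e) {
    paste("Error:", gsub("[\r\n]+", " ", conditionMessage(e)))
  })
}

# Invented .qmd content for illustration
doc <- c(
  "---",
  "title: \"Demo\"",
  "---",
  "",
  "Some prose that is not R code.",
  "",
  paste0(fence, "{r}"),
  "#| label: setup",
  "x <- c(1, 2, 3)",
  "mean(x)",
  fence
)

chunk_code <- extract_r_chunks(doc)

check_syntax(chunk_code)  # chunk contents only
check_syntax(doc)         # whole document, prose included
```

In this example the chunk contents parse cleanly, while the whole document is rejected at the first line of prose, which is why a whole-file check on literate documents says little about the code they contain.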

3.6 Execute Analysis

We map the analysis function over the list of files detected in the previous step.

Code
if (length(code_files) > 0) {
  report <- purrr::map_dfr(code_files, analyze_r_file)
  
  datatable(report, 
            caption = "Table 1: Code Inspection Report",
            options = list(scrollX = TRUE))
} else {
  message("No code files found.")
}

3.7 Visualization: Dependency Ecosystem

Understanding the software environment is critical for long-term preservation. This chart illustrates the most frequently used packages across the analyzed codebase, helping curators prioritize which libraries to document in renv.lock or DESCRIPTION files.

Code
if (nrow(report) > 0 && any(report$Packages != "")) {
  
  dependency_counts <- report %>%
    filter(Packages != "") %>%
    separate_rows(Packages, sep = ", ") %>%
    count(Packages, sort = TRUE) %>%
    head(10)
  
  ggplot(dependency_counts, aes(x = reorder(Packages, n), y = n)) +
    geom_col(fill = "#4C78A8") + # Standard formal blue
    coord_flip() +
    labs(
      title = "Most Frequent Package Dependencies",
      x = "Package Name",
      y = "Frequency (Script Count)"
    ) +
    theme_minimal() +
    theme(
      panel.grid.major.y = element_blank(),
      axis.text = element_text(size = 10)
    )
} else {
  message("No packages detected to visualize.")
}

Figure: Top 10 Package Dependencies in Project

3.8 Save Results

The full analysis report is exported to a CSV file for auditing and distribution.

Code
#| label: save-results

output_dir <- file.path("Results", "Inspect_rCode")
if (!dir.exists(output_dir)) dir.create(output_dir, recursive = TRUE)

output_file <- file.path(output_dir, paste0("Code_Inspection_", Sys.Date(), ".csv"))
write_csv(report, output_file)

message("Report successfully saved to: ", output_file)

3.9 Curation Insights

Use the report to prioritize fixes:

  • Reproducibility (Packages): If Packages is empty for a script, verify if it relies on base R only or if the researcher assumes packages are pre-loaded.

  • Portability (AbsPathsFound): Any file with AbsPathsFound > 0 requires attention. Ask the researcher to replace paths like C:/Users/Dan/Project/Data with relative paths like ./Data or use the here package.

  • Security (Potential_Secrets): If Potential_Secrets > 0, manually inspect the file. Do not publish code with active API keys or credentials.

  • Syntax Integrity (Syntax_Check): If the column contains an error message (e.g., “unexpected symbol”), the script is broken and will not run. These files should be flagged for immediate correction by the author.

  • Environment Reconstruction: The Packages column provides the raw material for building a DESCRIPTION file or renv.lock. Without this list, future users must guess which software versions to install.
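
The checks in this list can be combined into a simple triage table. The following is a minimal sketch, assuming a report with the column names produced by the analysis function; the rows of report_demo are invented, and the ordering rule (security first) is only one reasonable choice.

```r
# Mock report with the same column names as the analysis output (values invented)
report_demo <- data.frame(
  FileName          = c("clean.R", "paths.R", "token.R", "broken.R"),
  Syntax_Check      = c("Valid", "Valid", "Valid", "Error: unexpected symbol"),
  Packages          = c("dplyr", "", "httr", "ggplot2"),
  AbsPathsFound     = c(0L, 3L, 0L, 0L),
  Other_Risks       = c("", "Hard Setwd", "Web Download", ""),
  Potential_Secrets = c(0L, 0L, 1L, 0L),
  stringsAsFactors  = FALSE
)

# One logical flag per curation concern
flags <- with(report_demo, data.frame(
  FileName    = FileName,
  Portability = AbsPathsFound > 0,
  Security    = Potential_Secrets > 0,
  Syntax      = startsWith(Syntax_Check, "Error"),
  NoPackages  = Packages == "",
  stringsAsFactors = FALSE
))

# Keep files that trip at least one flag, possible secrets first
needs_review <- flags[rowSums(flags[, -1]) > 0, ]
needs_review <- needs_review[order(-needs_review$Security), ]

needs_review$FileName
```

With real data, report_demo would be replaced by the report object, and the resulting table can be sent to the researcher as a checklist.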

3.10 Additional Tools for Researchers

To prevent these issues before curation, researchers may consider adopting the following tools:

  • renv (Package Management): A tool for creating reproducible environments. It generates a lockfile (renv.lock) recording the exact version of every package used, ensuring the project can be restored on another machine (Ushey and Wickham 2024).

  • lintr (Static Analysis): A package that automatically checks code for syntax errors, style violations, and potential bugs as you write it (Hester et al. 2025).

  • Docker (Containerization): A technology that packages code and data together with the required system libraries and software into a portable container image, often regarded as the gold standard for reproducibility (Boettiger 2015).

3.11 Using the Non-Interactive R Script

For users who want to run this analysis on a server (HPC), in a batch job, or from the command line, here is the pure R script version.

3.11.1 The Inspect_Code_Script.R Script

Download the R Script: Inspect_Code_Script.R

3.11.2 Example HPC Submission Script (Inspect_Code_submit.sh)

#!/bin/bash
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --time=00:10:00
#SBATCH --job-name=code_check

# Load R module
module load R

# Define data directory
DATA_DIR="/scratch/your_user/your_project/code_dir"

# Run Script
# Quote the variable so that paths containing spaces are passed as one argument
Rscript Scripts/Inspect_Code_Script.R "$DATA_DIR"
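
The downloadable script is not reproduced here. On the R side, a non-interactive script typically picks up the directory passed by Rscript through commandArgs(trailingOnly = TRUE). The sketch below shows one way this could look; resolve_target_dir is a hypothetical helper, and the fallback to the current directory is an assumption, not necessarily what Inspect_Code_Script.R does.

```r
# Hypothetical helper: take the target directory from the command-line
# arguments, falling back to a default when none is supplied
resolve_target_dir <- function(args = commandArgs(trailingOnly = TRUE),
                               default = ".") {
  target <- if (length(args) >= 1 && nzchar(args[1])) args[1] else default
  if (!dir.exists(target)) {
    stop("Directory not found: ", target, call. = FALSE)
  }
  normalizePath(target)
}

# Example with an explicit argument vector instead of a real command line
demo_dir <- tempdir()
resolve_target_dir(args = demo_dir)
```

Failing early with a clear message when the directory does not exist is useful in batch jobs, where an empty report would otherwise be the only sign of a mistyped path.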

3.12 References