5 Text (.txt) and Markdown (.md) Files

Author

Data Curation Team

Published

December 18, 2025

5.1 Overview

Text files are the simplest form of documentation. However, they are often prone to encoding and structural issues that impede interoperability.

Curation Goal

Ensure universal readability and “Archival Readiness” of text documents. Our objective is to validate UTF-8 encoding, identify “invisible” characters (BOM), and normalize line endings to ensure documents remain readable across all operating systems.

Preservation Risk

Character corruption (“Mojibake”) caused by legacy encodings (e.g., Windows-1252) and “link rot” from broken external URLs are the primary threats to the long-term usability of plain text documentation.
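
To make the mojibake risk concrete, the following minimal sketch in base R decodes the same bytes two ways. It uses "latin1" as a stand-in for Windows-1252 (the two encodings agree on the bytes used here), and the sample word is purely illustrative.

```r
# UTF-8 bytes for "café": the "é" is the two-byte sequence C3 A9.
utf8_bytes <- as.raw(c(0x63, 0x61, 0x66, 0xc3, 0xa9))

# Wrong: decode the bytes as Latin-1, where every byte is one character.
garbled <- iconv(rawToChar(utf8_bytes), from = "latin1", to = "UTF-8")

# Right: declare the bytes as UTF-8.
correct <- rawToChar(utf8_bytes)
Encoding(correct) <- "UTF-8"

print(garbled)  # five characters: the "é" has split into two symbols
print(correct)  # four characters
```

The same mechanism works in reverse: a Windows-1252 file read as UTF-8 produces invalid byte sequences, which is why the inspection below records a guessed encoding for every file.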

Key Curation Objectives:

Encoding Validation: Detect character encoding and ensure compliance with the UTF-8 standard.
Structural Integrity: Identify Byte Order Marks (BOM) and mixed line endings (CRLF/LF).
Link & Security Scan: Extract external URLs and scan for accidental PII leaks (e.g., email addresses).

5.2 Setup

We use readr for encoding detection and stringr for link extraction.

5.2.1 R Packages

# install.packages(c("tidyverse", "readr", "rstudioapi", "stringr"))

5.2.2 Load libraries

library(tidyverse)
library(readr)      # For encoding guessing
library(stringr)    # For Regex (Links/Emails)
library(rstudioapi) # For directory selection

5.3 Select Target Directory

if (interactive() && .Platform$OS.type == "windows") { 
  selected_dir <- rstudioapi::selectDirectory(caption = "Select Text Directory")
} else {
  selected_dir <- NULL
}

if (!is.null(selected_dir)) {
  target_dir <- selected_dir
} else {
  target_dir <- params$target_dir
}

print(paste("Analyzing directory:", target_dir))

[1] "Analyzing directory: data/Inspect_Text/"

5.4 Inventory and Inspection

We scan for .txt, .md, .csv, and .rmd files (matched case-insensitively). For each file, the inspection records the guessed encoding and its confidence, checks for a UTF-8 BOM, identifies the line-ending style, and counts URLs and email addresses (potential PII).

message("Generating Text Report...")

text_files <- list.files(
  path = target_dir,
  pattern = "\\.(txt|md|csv|rmd)$", 
  recursive = TRUE, 
  full.names = TRUE, 
  ignore.case = TRUE
)

# Regex Patterns
url_pattern <- "http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\\(\\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+"
email_pattern <- "[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\\.[a-zA-Z]{2,}"

report <- purrr::map_dfr(text_files, function(file_path) {
  
  fname <- basename(file_path)
  
  tryCatch({
    # 1. BOM Detection (Read raw bytes)
    # readBin() opens and closes the file itself when given a path,
    # so no connection is left open if the read fails.
    bytes <- readBin(file_path, what = "raw", n = 4)
    
    # Check for UTF-8 BOM (EF BB BF)
    has_bom <- identical(bytes[1:3], as.raw(c(0xef, 0xbb, 0xbf)))
    
    # 2. Encoding Guess
    guess <- readr::guess_encoding(file_path, n_max = 1000)[1, ]
    encoding <- if (!is.na(guess$encoding)) guess$encoding else "Unknown"
    confidence <- if (!is.na(guess$confidence)) guess$confidence else 0
    
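    # 3. Content Analysis (Read text)
    # readLines() does not convert encodings; the text is read as-is.
    # warn = FALSE silences warnings about a missing final newline or embedded nuls.
    content_lines <- readLines(file_path, warn = FALSE)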
    full_text <- paste(content_lines, collapse = "\n")
    
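    # 4. Line Ending Detection
    # Re-read the raw bytes, because readLines() strips line terminators.
    # Only the first 2000 bytes are sampled and the first matching style wins,
    # so a file with mixed endings is reported as "Windows (CRLF)".
    raw_text <- readChar(file_path, nchars = 2000, useBytes = TRUE)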
    eol_type <- "Unknown"
    if (grepl("\r\n", raw_text)) {
      eol_type <- "Windows (CRLF)"
    } else if (grepl("\n", raw_text)) {
      eol_type <- "Unix (LF)"
    } else if (grepl("\r", raw_text)) {
      eol_type <- "Classic Mac (CR)"
    }
    
    # 5. Extract Artifacts
    urls <- str_extract_all(full_text, url_pattern)[[1]]
    emails <- str_extract_all(full_text, email_pattern)[[1]]
    
    example_links <- paste(head(unique(urls), 3), collapse = ", ")
    
    tibble(
      FileName = fname,
      Encoding = encoding,
      Confidence = confidence,
      HasBOM = has_bom,
      LineEndings = eol_type,
      LineCount = length(content_lines),
      URL_Count = length(urls),
      Email_Count = length(unique(emails)),
      Example_Links = substr(example_links, 1, 100),
      Status = "Success"
    )
    
  }, error = function(e) {
    tibble(
      FileName = fname, Encoding = NA, Confidence = NA, HasBOM = NA,
      LineEndings = NA, LineCount = NA, URL_Count = NA, Email_Count = NA,
      Example_Links = NA, Status = paste("Failed:", e$message)
    )
  })
})

# Display preview
print("--- Text Report Preview ---")

[1] "--- Text Report Preview ---"

head(report)

# A tibble: 6 × 10
  FileName            Encoding Confidence HasBOM LineEndings LineCount URL_Count
  <chr>               <chr>         <dbl> <lgl>  <chr>           <int>     <int>
1 14_corr_ring_stats… ASCII             1 FALSE  Unix (LF)          12         0
2 388_corr_ring_stat… ASCII             1 FALSE  Unix (LF)          11         0
3 413_corr_ring_stat… ASCII             1 FALSE  Unix (LF)          13         0
4 File_tree.txt       UTF-8             1 FALSE  Unix (LF)         146         0
5 README_v2.txt       UTF-8             1 FALSE  Unix (LF)         783        14
6 readme.md           ASCII             1 FALSE  Unix (LF)          31         1
# ℹ 3 more variables: Email_Count <int>, Example_Links <chr>, Status <chr>
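
The inspection above samples the first 2000 bytes and reports a single line-ending style, so it cannot flag files that mix styles. The sketch below counts each style over all bytes of a file. `count_line_endings` is a hypothetical helper that is not part of the report code, and the demo file is created only for illustration.

```r
# Count CRLF, bare LF, and bare CR terminators in a file's raw bytes.
count_line_endings <- function(path) {
  bytes <- readBin(path, what = "raw", n = file.size(path))
  n <- length(bytes)
  is_cr <- bytes == as.raw(0x0d)
  is_lf <- bytes == as.raw(0x0a)
  # Shifted copies let us test each byte's neighbour without a loop.
  next_is_lf <- c(is_lf[-1], FALSE)
  prev_is_cr <- c(FALSE, is_cr[-n])

  counts <- c(
    CRLF = sum(is_cr & next_is_lf),   # CR immediately followed by LF
    LF   = sum(is_lf & !prev_is_cr),  # LF not preceded by CR
    CR   = sum(is_cr & !next_is_lf)   # CR not followed by LF
  )
  present <- names(counts)[counts > 0]
  label <- if (length(present) == 0) {
    "None"
  } else if (length(present) > 1) {
    "Mixed"
  } else {
    present
  }
  list(counts = counts, label = label)
}

# Demo: two CRLF terminators and one bare LF in the same file.
demo_path <- tempfile(fileext = ".txt")
writeBin(charToRaw("a\r\nb\nc\r\n"), demo_path)
eol <- count_line_endings(demo_path)
print(eol$counts)
print(eol$label)
```

Reading the whole file is more expensive than sampling, so for very large files a curator might apply this only to files that the report labels as CRLF.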

5.5 Visualization

We can visualize the distribution of detected encodings. Ideally, the repository should be 100% UTF-8 (or ASCII). Any “ISO-8859” or “Windows-1252” files are candidates for remediation.

if (nrow(report) > 0) {
  ggplot(report %>% filter(Status == "Success"), aes(x = Encoding, fill = Encoding)) +
    geom_bar() +
    labs(
      title = "Text File Encodings",
      subtitle = "Archival Standard: UTF-8 / ASCII",
      x = "Detected Encoding",
      y = "File Count"
    ) +
    theme_minimal() +
    theme(legend.position = "none")
}
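
The chart shows how many files fall into each encoding, but not which ones. One way to list the remediation candidates is sketched below in base R. `toy_report` contains made-up rows whose column names mirror the report, and `archival_encodings` is an assumed name for the accepted set.

```r
# Toy stand-in for the `report` tibble (illustrative rows only).
toy_report <- data.frame(
  FileName   = c("readme.md", "notes.txt", "legacy.txt"),
  Encoding   = c("ASCII", "UTF-8", "windows-1252"),
  Confidence = c(1, 1, 0.45),
  Status     = "Success",
  stringsAsFactors = FALSE
)

# Encodings treated as archival-ready; ASCII is a subset of UTF-8.
archival_encodings <- c("UTF-8", "ASCII")

# Keep successfully inspected files whose guessed encoding is outside that set.
needs_review <- subset(
  toy_report,
  Status == "Success" & !(toupper(Encoding) %in% archival_encodings)
)
print(needs_review[, c("FileName", "Encoding", "Confidence")])
```

A low confidence value suggests the guess itself should be checked by opening the file before any conversion is attempted.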

5.6 Save Results

output_dir <- "Results/Inspect_Text"
dir.create(output_dir, recursive = TRUE, showWarnings = FALSE)

output_file <- file.path(output_dir, paste0("Text_Report_", format(Sys.Date(), "%Y%m%d"), ".csv"))

write.csv(report, output_file, row.names = FALSE)
print(paste("Report saved to:", output_file))

[1] "Report saved to: Results/Inspect_Text/Text_Report_20260515.csv"

5.7 Curation Insights

Use the generated CSV to perform these checks:

PII Check (Email_Count > 0): Text files (especially READMEs) often contain developer contact information. Verify whether these addresses are personal (e.g., gmail.com) or professional.
Encoding (Encoding != UTF-8): Files in legacy encodings such as Windows-1252 may display corrupted characters on the web. Convert them to UTF-8 with a tool such as iconv (see Section 5.8). Files reported as ASCII need no conversion, because ASCII is a subset of UTF-8.
BOM (HasBOM = TRUE): The Byte Order Mark (BOM) is often unnecessary for UTF-8 and can break some scripts (e.g., shebang lines in bash). Curators can remove the BOM if the file is intended for code execution.
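
As an illustration of the BOM check, the sketch below removes a leading UTF-8 BOM and writes the result to a separate path, so the original file is left untouched. `strip_utf8_bom` is a hypothetical helper, not part of the report code, and the demo file is created only for this example.

```r
# Copy a file to out_path without its leading UTF-8 BOM (EF BB BF), if present.
# Returns (invisibly) whether a BOM was found.
strip_utf8_bom <- function(in_path, out_path) {
  bytes <- readBin(in_path, what = "raw", n = file.size(in_path))
  bom <- as.raw(c(0xef, 0xbb, 0xbf))
  has_bom <- length(bytes) >= 3 && identical(bytes[1:3], bom)
  if (has_bom) {
    bytes <- bytes[-(1:3)]
  }
  writeBin(bytes, out_path)
  invisible(has_bom)
}

# Demo: a shell script that starts with a BOM before the shebang line.
script_text <- "#!/bin/bash\necho hi\n"
bom_path <- tempfile(fileext = ".sh")
clean_path <- tempfile(fileext = ".sh")
writeBin(c(as.raw(c(0xef, 0xbb, 0xbf)), charToRaw(script_text)), bom_path)

had_bom <- strip_utf8_bom(bom_path, clean_path)
print(had_bom)
print(readBin(clean_path, what = "raw", n = 3))  # first bytes are now "#!/"
```

Working on raw bytes avoids any re-encoding of the rest of the file, which matters when the goal is to change nothing except the three BOM bytes.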

5.8 Additional Tools

iconv: The standard command-line tool for converting text encodings (e.g., iconv -f WINDOWS-1252 -t UTF-8 in.txt > out.txt).
dos2unix: A tool to normalize line endings (converting Windows CRLF to Unix LF). This may be useful for ensuring scripts run correctly on Linux clusters.
Internet Archive Wayback Machine: Use this website to find archived snapshots of pages whose URLs no longer resolve.
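
For curators who prefer to stay in R, base R's `iconv()` function can perform a similar conversion. The sketch below converts a file to UTF-8 and normalizes line endings to LF in one pass. `convert_to_utf8_lf` is a hypothetical helper, and "latin1" is used as the source encoding because it is widely supported. Names such as "WINDOWS-1252" depend on the platform, and `iconvlist()` shows what is available.

```r
# Convert a text file to UTF-8 with LF line endings, writing to out_path.
convert_to_utf8_lf <- function(in_path, out_path, from = "latin1") {
  bytes <- readBin(in_path, what = "raw", n = file.size(in_path))
  text <- rawToChar(bytes)
  utf8 <- iconv(text, from = from, to = "UTF-8")
  if (is.na(utf8)) {
    stop("Could not convert ", in_path, " from ", from)
  }
  # Replace CRLF first, then any remaining bare CR.
  utf8 <- gsub("\r\n", "\n", utf8, fixed = TRUE)
  utf8 <- gsub("\r", "\n", utf8, fixed = TRUE)
  writeBin(charToRaw(utf8), out_path)
  invisible(out_path)
}

# Demo: "café" in Latin-1 (é is the single byte E9) with CRLF endings.
legacy_path <- tempfile(fileext = ".txt")
utf8_path <- tempfile(fileext = ".txt")
writeBin(
  c(charToRaw("caf"), as.raw(0xe9), charToRaw("\r\nok\r\n")),
  legacy_path
)
convert_to_utf8_lf(legacy_path, utf8_path, from = "latin1")
converted <- readBin(utf8_path, what = "raw", n = 100)
print(converted)
```

Because the source encoding is supplied by the curator, a wrong `from` value yields mojibake rather than an error, so it is worth inspecting the output before replacing the original.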

5.9 Using the Non-Interactive R Script

For users who want to run this analysis on a server (HPC), in a batch job, or from the command line, here is the pure R script version.

Download the R Script: Inspect_Text_Script.R

5.9.1 Example HPC Submission Script (Inspect_Text_submit.sh)

#!/bin/bash
#SBATCH --job-name=text_check
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --time=00:15:00
#SBATCH --mem=4G
#SBATCH --output=logs/text_check_%j.log

module load R

# Define target directory
TARGET_DIR="/scratch/user/project_data/docs"

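# Prepare folders
# Note: Slurm opens the --output file before this script starts, so logs/
# must already exist when the job is submitted (run `mkdir -p logs` first).
mkdir -p Results/Inspect_Text
mkdir -p logs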

# Run
echo "Starting Text Inspection on $TARGET_DIR"
Rscript Inspect_Text_Script.R "$TARGET_DIR"
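
The submission script passes the target directory as the first positional argument. The contents of Inspect_Text_Script.R are not shown here, so the following is only a sketch of how a script might read that argument. `resolve_target_dir` is a hypothetical helper, and the default path is taken from the directory shown in Section 5.3.

```r
# Resolve the directory to inspect from command-line arguments,
# falling back to a default when no argument is supplied.
resolve_target_dir <- function(args = commandArgs(trailingOnly = TRUE),
                               default = "data/Inspect_Text/") {
  target_dir <- if (length(args) >= 1 && nzchar(args[1])) args[1] else default
  if (!dir.exists(target_dir)) {
    stop("Target directory not found: ", target_dir)
  }
  normalizePath(target_dir)
}

# Demo: pass the argument explicitly instead of relying on the command line.
resolved <- resolve_target_dir(args = tempdir())
print(resolved)
```

Failing early on a missing directory keeps a batch job from running to completion and producing an empty report.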
