8  Container/Archive (.zip, .tar, .7z) Formats

Author

Data Curation Team

Published

January 5, 2025

8.1 Overview

Curators often receive data packaged in containers like ZIP, TAR, or 7-Zip. While convenient for transfer, these formats are “black boxes” that hide potential preservation risks.

NoteCuration Goal

“Peek” inside compressed containers without full extraction. Our objective is to inventory contents, verify file integrity, and identify nested structures to ensure all data remains discoverable and accessible.

WarningPreservation Risk

Archives are prone to bit-rot; a single corrupted bit in a solid archive can render the entire dataset unreadable. Furthermore, “Zip Bombs” can crash curation workstations, and nested archives often defy automated indexing systems.

This notebook uses the archive package to assess:

  1. Integrity: Can the file be read? (Basic corruption check).
  2. Expansion Ratio: Identifying potential “Zip Bombs” (Expansion > 100x).
  3. Manifest: Generating a detailed summary of contents.

8.2 Setup

We use the archive package, which is a robust binding to the industry-standard libarchive C library.

8.2.1 R Packages

If you do not have the required packages, run this command once in your R console:

Code
# install.packages(c("tidyverse", "archive", "fs", "rstudioapi"))

8.2.2 Load libraries

Code
library(tidyverse)
library(archive)    # The engine for reading archives
library(fs)         # File system tools
library(rstudioapi)

8.3 Select Target Directory

Select the folder containing the TiFF files.

Note: If running interactively, a dialog box will appear. Otherwise, it defaults to the target_dir parameter.

Code
if (interactive() && .Platform$OS.type == "windows") {
  selected_dir <- rstudioapi::selectDirectory(caption = "Select TIFF Directory")
} else {
  selected_dir <- NULL
}

if (!is.null(selected_dir)) {
  target_dir <- selected_dir
} else {
  target_dir <- params$target_dir
}

print(paste("Analyzing directory:", target_dir))
[1] "Analyzing directory: data/Inspect_Containers/"

8.4 Inventory of files

We scan for .tif and .tiff files.

Code
# Find archives (zip, tar, gz, 7z, rar)
archive_files <- list.files(
  path = target_dir,
  pattern = "\\.(zip|tar|tar\\.gz|tgz|7z|rar)$", 
  recursive = TRUE, 
  full.names = TRUE, 
  ignore.case = TRUE
)

if (length(archive_files) == 0) {
  stop("No archive files found in this folder.")
}

print(paste("Found", length(archive_files), "archive files."))
[1] "Found 4 archive files."

8.5 Non-Invasive Inspection

This function reads the header (Table of Contents) of the archive to calculate metrics without filling up your hard drive with extracted files.

Code
message("Generating Archive Manifests...")

inspect_archive <- function(fp) {
  fname <- basename(fp)
  
  tryCatch({
    # Get Physical File Size (Compressed)
    size_compressed_bytes <- file.size(fp)
    
    # Read the Manifest (Does NOT extract files)
    # Returns a tibble: path, size, date, mode
    contents <- archive::archive(fp)
    
    # Calculate Metrics
    file_count <- nrow(contents)
    size_extracted_bytes <- sum(contents$size)
    
    # Zip Bomb Detection (Compression Ratio)
    # Avoid division by zero for empty archives
    ratio <- if(size_compressed_bytes > 0) size_extracted_bytes / size_compressed_bytes else 0
    
    # Content Profiling
    # What kind of files are inside? Get extensions.
    extensions <- fs::path_ext(contents$path)
    top_exts <- names(sort(table(extensions), decreasing = TRUE))[1:3]
    content_summary <- paste(top_exts, collapse = ", ")
    
    # Check for Nested Archives
    has_nested <- any(extensions %in% c("zip", "tar", "gz", "7z", "rar"))
    
    tibble(
      FileName = fname,
      FileCount = file_count,
      Compressed_MB = round(size_compressed_bytes / 1024^2, 2),
      Extracted_MB = round(size_extracted_bytes / 1024^2, 2),
      CompressionRatio = round(ratio, 1),
      ContentTypes = content_summary,
      HasNestedArchives = has_nested,
      Status = "Success"
    )
    
  }, error = function(e) {
    tibble(
      FileName = fname, FileCount = NA, Compressed_MB = NA, 
      Extracted_MB = NA, CompressionRatio = NA, ContentTypes = NA,
      HasNestedArchives = NA,
      Status = paste("Corrupt/Unreadable:", e$message)
    )
  })
}

# Execute Analysis
report <- map_dfr(archive_files, inspect_archive)

# Display
print("--- Archive Safety Report ---")
[1] "--- Archive Safety Report ---"
Code
print(head(report))
# A tibble: 4 × 8
  FileName    FileCount Compressed_MB Extracted_MB CompressionRatio ContentTypes
  <chr>           <int>         <dbl>        <dbl>            <dbl> <chr>       
1 ctd_splitt…         1          0            0.01              3.2 py, NA, NA  
2 ddi_2_5_1.…       963          3.59        39.6              11   html, , xsd 
3 ScatterPlo…        52         15.4         15.6               1   png, NA, NA 
4 VietorisRi…       287         24.4         24.7               1   png, , NA   
# ℹ 2 more variables: HasNestedArchives <lgl>, Status <chr>

8.6 Expansion Risk

We visualize the difference between the Compressed Size (Storage) and Extracted Size (Risk). Huge discrepancies indicate potential decompression issues.

Code
if (nrow(report) > 0 && any(report$Status == "Success")) {
  
  # Reshape for plotting
  plot_data <- report %>%
    filter(Status == "Success") %>%
    select(FileName, Compressed_MB, Extracted_MB) %>%
    pivot_longer(cols = c("Compressed_MB", "Extracted_MB"), names_to = "Type", values_to = "Size")

  ggplot(plot_data, aes(x = FileName, y = Size, fill = Type)) +
    geom_bar(stat = "identity", position = "dodge") +
    scale_y_log10() + # Use log scale because differences can be massive
    labs(
      title = "Archive Expansion Potential",
      y = "Size (MB) - Log Scale",
      x = NULL
    ) +
    theme_minimal() +
    theme(axis.text.x = element_text(angle = 45, hjust = 1))
}

Compressed vs. Extracted Size (Log Scale)

8.7 Save Results

Save the report to a CSV file for review.

Code
output_dir <- "Results/Inspect_Containers"
dir.create(output_dir, recursive = TRUE, showWarnings = FALSE)

output_file <- file.path(output_dir, paste0("Container_Manifest_", format(Sys.Date(), "%Y%m%d"), ".csv"))

write.csv(report, output_file, row.names = FALSE)
print(paste("Manifest saved to:", output_file))
[1] "Manifest saved to: Results/Inspect_Containers/Container_Manifest_20260515.csv"

8.8 Curation Insights

Use this report to ensure safety before ingesting:

  • Compression Ratio > 100: This is a potential Zip Bomb. A 500 MB file turning into 500 GB can fill a hard drive instantly. Inspect these files cautiously.

  • Status “Corrupt/Unreadable”: This implies the container header may be broken and the data inside is likely lost. Request a re-transfer from the depositor.

  • HasNestedArchives = TRUE: Preservation tools often fail to index “Zip inside a Zip.” In this case, the curator must unpack the outer layer and archive the inner files directly to ensure discoverability.

8.9 Additional Tools

  • 7-Zip: The open-source standard for handling high-compression archives. It supports almost every format.

  • Droid (Digital Record Object Identification): A tool from The National Archives (UK) to profile file formats inside archives without full extraction.

8.10 Using the Non-Interactive R Script

For users who want to run this analysis on a server (HPC), in a batch job, or from the command line, here is the pure R script version.

Download the R Script: Inspect_Containers_Script.R

8.10.1 Example HPC Submission Script (Inspect_Containers_submit.sh)

#!/bin/bash
#SBATCH --job-name=archive_check
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --time=00:30:00
#SBATCH --mem=8G
#SBATCH --output=logs/archive_check_%j.log

module load R
# Note: The 'archive' package relies on libarchive. 
# If it fails, you might need: module load libarchive

# Define target directory
DATA_DIR="/scratch/user/project_data/deposits"

# Prepare Environment
mkdir -p Results/Inspect_Archives
mkdir -p logs

# Run
echo "Starting Archive Inspection on $DATA_DIR"
Rscript Inspect_Archive_Script.R "$DATA_DIR"

8.11 References