8 Container/Archive (.zip, .tar, .7z) Formats

Author

Data Curation Team

Published

January 5, 2025

8.1 Overview

Curators often receive data packaged in containers like ZIP, TAR, or 7-Zip. While convenient for transfer, these formats are “black boxes” that can hide potential data security risks.

Curation Goal

Peek inside compressed containers without full extraction. Our objective is to inventory contents, verify file integrity, and identify nested structures to ensure all data remains discoverable and accessible.

Identifying Risks

Archives are prone to bit-rot; a single corrupted bit in a solid archive can render the entire dataset unreadable. Furthermore, “Zip Bombs” can crash curation workstations, and nested archives often defy automated indexing systems.
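Bit-rot of the kind described above is commonly detected by comparing a checksum recorded at deposit time with one computed later. The sketch below shows one possible way to do this with base R's tools::md5sum(). The helper name verify_checksum and the throwaway file are hypothetical, used only for illustration, and are not part of this notebook's workflow.

```r
# Hypothetical helper: does the file's current MD5 match the recorded one?
verify_checksum <- function(path, expected_md5) {
  actual <- unname(tools::md5sum(path))
  identical(actual, expected_md5)
}

# A throwaway file stands in for a deposited archive
demo_file <- tempfile(fileext = ".bin")
writeBin(as.raw(1:10), demo_file)
recorded_md5 <- unname(tools::md5sum(demo_file))  # checksum stored at deposit time

# The untouched file should still match its recorded checksum
unchanged_ok <- verify_checksum(demo_file, recorded_md5)

# Simulate corruption by changing the last byte, then check again
writeBin(as.raw(c(1:9, 11)), demo_file)
corrupted_ok <- verify_checksum(demo_file, recorded_md5)
```

A mismatch only shows that the file changed. It does not say which bytes changed or whether the archive can still be opened, so it complements rather than replaces the readability check used in this notebook.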

This notebook uses the archive package to assess:

Integrity: Can the file be read? (Basic corruption check).
Expansion Ratio: Identifying potential “Zip Bombs” (Expansion > 100x).
Manifest: Generating a detailed summary of contents.

8.2 Setup

We use the archive package, which is a robust binding to the industry-standard libarchive C library.

8.2.1 R Packages

If you do not have the required packages, uncomment and run this command once in your R console:

Code

# install.packages(c("tidyverse", "archive", "fs", "rstudioapi"))

8.2.2 Load libraries

Code

library(tidyverse)
library(archive)    # The engine for reading archives
library(fs)         # File system tools
library(rstudioapi)

8.3 Select Target Directory

Select the folder containing the archive files (e.g. zip, tar, gz, 7z).

Note: When run interactively on Windows, a folder-selection dialog box will appear. Otherwise, the notebook falls back to the target_dir parameter.

Code

if (interactive() && .Platform$OS.type == "windows") {
  selected_dir <- rstudioapi::selectDirectory(caption = "Select Archive Directory")
} else {
  selected_dir <- NULL
}

if (!is.null(selected_dir)) {
  target_dir <- selected_dir
} else {
  target_dir <- params$target_dir
}

print(paste("Analyzing directory:", target_dir))

[1] "Analyzing directory: data/Inspect_Containers/"

8.4 Inventory of files

We scan for .zip, .tar, .tar.gz, .tgz, .7z, and .rar files. A standalone .gz file that is not a tarball is not matched by this pattern.

Code

# Find archives (zip, tar, tar.gz, tgz, 7z, rar)
archive_files <- list.files(
  path = target_dir,
  pattern = "\\.(zip|tar|tar\\.gz|tgz|7z|rar)$", 
  recursive = TRUE, 
  full.names = TRUE, 
  ignore.case = TRUE
)

if (length(archive_files) == 0) {
  stop("No archive files found in this folder.")
}

print(paste("Found", length(archive_files), "archive files."))

[1] "Found 4 archive files."
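To see which file names the inventory pattern picks up, it can be tried against a few made-up names with grepl(). This is only a sketch: the file names below are hypothetical, and the variable name archive_pattern is introduced here for illustration.

```r
# Same regular expression as in the inventory step
archive_pattern <- "\\.(zip|tar|tar\\.gz|tgz|7z|rar)$"

# Hypothetical file names, for illustration only
candidates <- c("deposit.zip", "DEPOSIT.ZIP", "raw.tar.gz", "raw.tgz",
                "notes.txt.gz", "images.7z", "readme.txt")

# ignore.case = TRUE mirrors the list.files() call
matches <- grepl(archive_pattern, candidates, ignore.case = TRUE)

# Named logical vector: TRUE means the file would be inventoried
setNames(matches, candidates)
```

Under this pattern notes.txt.gz and readme.txt are not selected, while upper-case extensions such as .ZIP are.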

8.5 Non-Invasive Inspection

This function lists each archive's entries (file paths and sizes) rather than extracting them, so it can calculate metrics without filling up your hard drive with extracted files.

Code

message("Generating Archive Manifests...")

inspect_archive <- function(fp) {
  fname <- basename(fp)
  
  tryCatch({
    # Get Physical File Size (Compressed)
    size_compressed_bytes <- file.size(fp)
    
    # Read the Manifest (Does NOT extract files)
    # Returns a tibble: path, size, date, mode
    contents <- archive::archive(fp)
    
    # Calculate Metrics
    file_count <- nrow(contents)
    size_extracted_bytes <- sum(contents$size)
    
    # Zip Bomb Detection (Compression Ratio)
    # Avoid division by zero for empty archives
    ratio <- if(size_compressed_bytes > 0) size_extracted_bytes / size_compressed_bytes else 0
    
    # Content Profiling
    # What kind of files are inside? Get (lower-cased) extensions.
    extensions <- tolower(fs::path_ext(contents$path))
    # Ignore entries without an extension (e.g. directories); head() avoids NA padding
    known_exts <- extensions[nzchar(extensions)]
    top_exts <- head(names(sort(table(known_exts), decreasing = TRUE)), 3)
    content_summary <- paste(top_exts, collapse = ", ")
    
    # Check for Nested Archives
    has_nested <- any(extensions %in% c("zip", "tar", "gz", "tgz", "7z", "rar"))
    
    tibble(
      FileName = fname,
      FileCount = file_count,
      Compressed_MB = round(size_compressed_bytes / 1024^2, 2),
      Extracted_MB = round(size_extracted_bytes / 1024^2, 2),
      CompressionRatio = round(ratio, 1),
      ContentTypes = content_summary,
      HasNestedArchives = has_nested,
      Status = "Success"
    )
    
  }, error = function(e) {
    tibble(
      FileName = fname, FileCount = NA, Compressed_MB = NA, 
      Extracted_MB = NA, CompressionRatio = NA, ContentTypes = NA,
      HasNestedArchives = NA,
      Status = paste("Corrupt/Unreadable:", e$message)
    )
  })
}

# Execute Analysis
report <- map_dfr(archive_files, inspect_archive)

# Display
print("--- Archive Safety Report ---")

[1] "--- Archive Safety Report ---"

Code

print(head(report))

# A tibble: 4 × 8
  FileName    FileCount Compressed_MB Extracted_MB CompressionRatio ContentTypes
  <chr>           <int>         <dbl>        <dbl>            <dbl> <chr>       
1 ctd_splitt…         1          0            0.01              3.2 py, NA, NA  
2 ddi_2_5_1.…       963          3.59        39.6              11   html, , xsd 
3 ScatterPlo…        52         15.4         15.6               1   png, NA, NA 
4 VietorisRi…       287         24.4         24.7               1   png, , NA   
# ℹ 2 more variables: HasNestedArchives <lgl>, Status <chr>
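The content-profiling and nested-archive checks can be tried on a mock manifest, with no archive on disk. The sketch below uses base R only. The helper names summarise_extensions and find_nested_archives and the example paths are hypothetical. It uses head() so that archives with fewer than three file types are not padded with NA, and it drops entries that have no extension, such as directories.

```r
# Hypothetical helper: the n most frequent file extensions among the paths
summarise_extensions <- function(paths, n = 3) {
  exts <- tolower(tools::file_ext(paths))
  exts <- exts[nzchar(exts)]   # drop directories and extension-less entries
  if (length(exts) == 0) return(NA_character_)
  counts <- sort(table(exts), decreasing = TRUE)
  paste(head(names(counts), n), collapse = ", ")
}

# Hypothetical helper: entries that are themselves archives
find_nested_archives <- function(paths) {
  nested_exts <- c("zip", "tar", "gz", "tgz", "7z", "rar")
  paths[tolower(tools::file_ext(paths)) %in% nested_exts]
}

# Mock manifest paths, for illustration only
mock_paths <- c("data/", "data/a.csv", "data/b.csv", "data/readme.txt",
                "data/inner.ZIP", "LICENSE")

ext_summary  <- summarise_extensions(mock_paths)
nested_found <- find_nested_archives(mock_paths)
```

Here csv is the most frequent extension, and data/inner.ZIP is reported as a nested archive even though its extension is upper-case.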

8.6 Expansion Risk

We visualize the difference between the Compressed Size (Storage) and Extracted Size (Risk). Huge discrepancies indicate potential decompression issues.

Code

if (nrow(report) > 0 && any(report$Status == "Success")) {
  
  # Reshape for plotting
  plot_data <- report %>%
    filter(Status == "Success") %>%
    select(FileName, Compressed_MB, Extracted_MB) %>%
    pivot_longer(cols = c("Compressed_MB", "Extracted_MB"), names_to = "Type", values_to = "Size") %>%
    # Sizes that round to 0 MB cannot be drawn on a log scale; floor them at 0.01 MB
    mutate(Size = pmax(Size, 0.01))

  ggplot(plot_data, aes(x = FileName, y = Size, fill = Type)) +
    geom_col(position = "dodge") +
    scale_y_log10() + # Use log scale because differences can be massive
    labs(
      title = "Archive Expansion Potential",
      y = "Size (MB) - Log Scale",
      x = NULL
    ) +
    theme_minimal() +
    theme(axis.text.x = element_text(angle = 45, hjust = 1))
}

Compressed vs. Extracted Size (Log Scale)

8.7 Save Results

Save the report to a CSV file for review.

Code

output_dir <- "Results/Inspect_Containers"
dir.create(output_dir, recursive = TRUE, showWarnings = FALSE)

output_file <- file.path(output_dir, paste0("Container_Manifest_", format(Sys.Date(), "%Y%m%d"), ".csv"))

write.csv(report, output_file, row.names = FALSE)
print(paste("Manifest saved to:", output_file))

[1] "Manifest saved to: Results/Inspect_Containers/Container_Manifest_20260703.csv"

8.8 Curation Insights

Use this report to ensure safety before ingesting:

Compression Ratio > 100: This is a potential Zip Bomb. A 500 MB file turning into 500 GB can fill a hard drive instantly. Inspect these files cautiously.
Status “Corrupt/Unreadable”: The archive could not be listed. The container header may be damaged, in which case the data inside may be unrecoverable. Check the error message in the Status column, then request a re-transfer from the depositor.
HasNestedArchives = TRUE: There is extra complexity in properly reviewing a Zip inside a Zip. If the above cautions are clear, then unpack the outer layer and review the inner archive formats directly. Consider discussing with the depositor ways to re-organize or re-package the dataset so that nested archive formats aren’t needed.
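The three checks above can also be applied to the report programmatically. The sketch below is one possible triage, assuming a data frame with the same column names as the report built earlier. The function name triage_report, the action labels, and the mock rows are hypothetical. The 100x threshold is the one used in this chapter.

```r
# Hypothetical helper: attach a suggested action to each row of the report
triage_report <- function(report, ratio_threshold = 100) {
  unreadable <- report$Status != "Success"
  bomb <- !unreadable & !is.na(report$CompressionRatio) &
    report$CompressionRatio > ratio_threshold
  nested <- !unreadable & !is.na(report$HasNestedArchives) &
    report$HasNestedArchives

  # Priority order: unreadable first, then expansion risk, then nesting
  report$Action <- ifelse(unreadable, "Request re-transfer",
                   ifelse(bomb, "Do not extract; inspect cautiously",
                   ifelse(nested, "Unpack outer layer, review inner archives",
                          "No flags raised")))
  report
}

# Mock rows, for illustration only (a ratio of 1000 matches 500 MB -> 500 GB)
mock_report <- data.frame(
  FileName          = c("ok.zip", "bomb.zip", "nested.tar", "broken.7z"),
  CompressionRatio  = c(3.2, 1000, 1.5, NA),
  HasNestedArchives = c(FALSE, FALSE, TRUE, NA),
  Status            = c("Success", "Success", "Success",
                        "Corrupt/Unreadable: truncated file")
)

triaged <- triage_report(mock_report)
triaged[, c("FileName", "Action")]
```

Only one action is assigned per archive, so an archive that is both highly compressed and nested is reported under the expansion-risk label. A curator may prefer separate flag columns instead.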

8.9 Additional Tools

7-Zip: A widely used open-source archiver for handling high-compression archives. It supports most common archive formats.
DROID (Digital Record Object Identification): A tool from The National Archives (UK) to profile file formats inside archives without full extraction.

8.10 Using the Non-Interactive R Script

For users who want to run this analysis on a server (HPC), in a batch job, or from the command line, here is the pure R script version.

Download the R Script: Inspect_Containers_Script.R
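How a batch script receives its target directory is not shown in this chapter. One common pattern, sketched below, is to read the first trailing command-line argument with commandArgs() and fall back to a default. The helper name resolve_target_dir is hypothetical, and the actual Inspect_Containers_Script.R may handle its arguments differently.

```r
# Hypothetical helper: take the directory from the command line if given,
# otherwise use the notebook's default parameter value
resolve_target_dir <- function(args = commandArgs(trailingOnly = TRUE),
                               default = "data/Inspect_Containers/") {
  if (length(args) >= 1 && nzchar(args[1])) args[1] else default
}

# With an explicit argument, as passed by `Rscript script.R <dir>`
from_cli <- resolve_target_dir(args = "/scratch/user/project_data/deposits")

# With no argument, the default is used
from_default <- resolve_target_dir(args = character(0))
```

Checking dir.exists() on the result before scanning gives a clearer error in batch logs than letting the inventory step report that no archives were found.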

8.10.1 Example HPC Submission Script (`Inspect_Containers_submit.sh`)

#!/bin/bash
#SBATCH --job-name=archive_check
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --time=00:30:00
#SBATCH --mem=8G
#SBATCH --output=logs/archive_check_%j.log

module load R
# Note: The 'archive' package relies on libarchive. 
# If it fails, you might need: module load libarchive

# Define target directory
DATA_DIR="/scratch/user/project_data/deposits"

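# Prepare Environment
# Note: Slurm opens the --output log before this script runs,
# so the logs/ directory must already exist at submission time.
mkdir -p Results/Inspect_Containers
mkdir -p logs

# Run
echo "Starting Archive Inspection on $DATA_DIR"
Rscript Inspect_Containers_Script.R "$DATA_DIR"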

