Author

Natalie Williams

Published

October 16, 2025

17.1 Overview

NetCDF (Network Common Data Form) is a self-describing, machine-independent file format for multidimensional scientific data. It was developed by Unidata and has been adopted as an OGC standard.

Note: Curation Goal

Validate the self-describing nature of scientific data. Our objective is to ensure NetCDF files follow community conventions (CF Conventions) and contain sufficient metadata (units, dimensions, and CRS) for reliable future reuse.

Warning: Preservation Risk

NetCDF defines how to store data, but not what it represents. Non-compliance with the Climate and Forecast (CF) conventions renders the data contextually unusable, as software cannot reliably interpret physical units or spatial alignment.

This notebook evaluates NetCDF files on three levels:

  1. Metadata Compliance: Checking for global attributes like Conventions or institution.
  2. Spatial Awareness: Detecting Coordinate Reference Systems (CRS).
  3. Data Health: Scanning for empty datasets (100% NaNs).

17.2 Setup

Before running this notebook, you need to ensure the required R packages are installed.

17.2.1 R Packages

The following R packages are required. If any are missing, uncomment the line below and run it once in your R console:

Code
# install.packages(c("tidyverse", "tidync", "ncmeta", "rstudioapi"))
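
The install line is commented out so that rendering the notebook never installs packages as a side effect. A minimal sketch of a more selective approach, assuming you only want to install what is missing (the helper name `missing_packages` is illustrative and not part of the notebook):

```r
# Hypothetical helper: return the packages from `required` that are not installed.
missing_packages <- function(required,
                             installed = rownames(installed.packages())) {
  setdiff(required, installed)
}

required <- c("tidyverse", "tidync", "ncmeta", "rstudioapi")
to_install <- missing_packages(required)

if (length(to_install) > 0) {
  message("Missing packages: ", paste(to_install, collapse = ", "))
  # Uncomment to install only what is missing:
  # install.packages(to_install)
} else {
  message("All required packages are already installed.")
}
```

The actual installation stays commented out, so the decision to modify the local library remains with the user.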

17.3 Load Libraries

This chunk loads all the necessary libraries for the session.

Code
library(tidyverse)
library(tidync)     # Tidy interface for NetCDF
library(ncmeta)     # Low-level metadata extraction
library(rstudioapi) # For directory selection

17.4 Select target directory with NetCDF files

This section identifies the directory to be analyzed and finds all .nc files within it.

Note: If you are running this interactively in RStudio, a dialog box will appear. If you are rendering this document (where no user interaction is possible), it will default to the params$target_dir defined in the YAML header.

Code
# 1. Try to select interactively if running inside RStudio (on any operating system)
if (interactive() && rstudioapi::isAvailable()) {
  selected_dir <- rstudioapi::selectDirectory(caption = "Select NetCDF Directory")
} else {
  selected_dir <- NULL
}

# 2. Logic to determine final directory (Interactive vs Parameter)
if (!is.null(selected_dir)) {
  target_dir <- selected_dir
} else {
  target_dir <- params$target_dir
}

print(paste("Analyzing directory:", target_dir))
[1] "Analyzing directory: data/Inspect_nc/"
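
The fallback logic above does not check that the chosen directory exists, so a typo in the path would only show up later as "Found 0 NetCDF files." A minimal sketch of an early check, assuming a hypothetical helper named `resolve_target_dir` (not part of the notebook):

```r
# Hypothetical helper: pick the first usable directory and fail early if it is missing.
resolve_target_dir <- function(selected = NULL, fallback = NULL) {
  candidates <- c(selected, fallback)  # NULL entries are dropped by c()
  candidates <- candidates[!is.na(candidates) & nzchar(candidates)]
  if (length(candidates) == 0) {
    stop("No target directory supplied (neither selected nor in params).")
  }
  dir <- candidates[1]
  if (!dir.exists(dir)) {
    stop("Target directory does not exist: ", dir)
  }
  dir
}

# Example: a temporary directory stands in for params$target_dir
demo_dir <- tempdir()
resolve_target_dir(selected = NULL, fallback = demo_dir)
```

In the notebook this would be called as `resolve_target_dir(selected_dir, params$target_dir)`.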

17.4.1 Find NetCDF files

Now we scan the selected directory for all .nc files.

Code
# Find all NetCDF files
nc_files <- list.files(
  path = target_dir,
  pattern = "\\.nc$",
  recursive = TRUE,
  full.names = TRUE,
  ignore.case = TRUE
)

# Print the number of files found and show the first few paths
print(paste("Found", length(nc_files), "NetCDF files."))
[1] "Found 6 NetCDF files."
Code
head(nc_files)
[1] "data/Inspect_nc//Averaged_exceedance_TP_p99p0_ERA5.nc"
[2] "data/Inspect_nc//Geopotential_orography.nc"           
[3] "data/Inspect_nc//TPp99p0_2001_2020_ERA5.nc"           
[4] "data/Inspect_nc//TPp99p0_2001_2020_IMERG.nc"          
[5] "data/Inspect_nc//TPp99p0_2001_2020_IMERG025grid.nc"   
[6] "data/Inspect_nc//WSp99p0_2001_2020_ERA5.nc"           
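
To see what the `"\\.nc$"` pattern does and does not match, it can be tried on a few made-up file names. This is a small illustration only. The file names are invented, and the broader `.nc4` pattern is an assumption about what you might also want to include:

```r
# Illustrative file names (not from the dataset above)
candidates <- c("a.nc", "B.NC", "c.nc4", "d.ncml", "notes.nc.txt")

# Same pattern and case handling as the list.files() call above
is_nc <- grepl("\\.nc$", candidates, ignore.case = TRUE)
candidates[is_nc]

# A broader, assumed pattern that also accepts the .nc4 extension
is_nc_or_nc4 <- grepl("\\.nc4?$", candidates, ignore.case = TRUE)
candidates[is_nc_or_nc4]
```

Files that use other extensions are silently skipped by the scan, so check the extensions present in your archive before relying on the file count.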

17.5 Usability scan

This phase evaluates the “fitness for use.” We check if the files contain valid spatial coordinates (CRS) and if the data variables actually contain numbers (Data Health).
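
The idea behind the Data Health check can be shown on in-memory arrays, independent of any NetCDF library. The sketch below is a simplified illustration. The helper `data_health` and its labels are assumptions, and it reports the share of missing values rather than only the all-or-nothing case:

```r
# Hypothetical helper: classify a numeric array by its share of missing values.
data_health <- function(x) {
  if (length(x) == 0) return("Empty (no values)")
  frac_na <- mean(is.na(x))  # is.na() is TRUE for both NA and NaN
  if (frac_na == 1) {
    "All NaNs (Empty)"
  } else {
    sprintf("Contains Data (%.0f%% missing)", 100 * frac_na)
  }
}

empty_grid   <- array(NaN, dim = c(4, 3))
partial_grid <- array(c(1, NaN, 3, 4), dim = c(2, 2))

data_health(empty_grid)
data_health(partial_grid)
```

Reporting the missing fraction can help separate masked grids (for example, ocean-only data) from files that are truly empty.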

Code
# Reuse nc_files from Section 17.4.1 (found with ignore.case = TRUE)

# Helper: Usability Scan
inspect_nc_inventory <- function(fp) {
  fname <- basename(fp)
  
  # 1. Safe Load
  tnc <- tryCatch(tidync(fp), error = function(e) NULL)
  
  if (is.null(tnc)) {
    return(tibble(
      FileName = fname,
      Status = "Corrupt/Unreadable",
      DimsSummary = NA,
      VarCount = NA,
      HasCRS = NA,
      DataHealth = NA
    ))
  }
  
  # 2. Extract Metadata Summary
  # Active grid dimensions
  dims <- tnc %>% hyper_dims()
  dims_str <- paste(dims$name, collapse = " x ")
  
  # Variables
  vars <- tnc %>% hyper_vars()
  var_count <- length(vars$name)
  
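  # 3. Spatial Check (CRS)
  # Look for standard lat/lon dimension names or a "grid_mapping" attribute.
  # Whole-name matching avoids false positives such as "year" (y) or "index" (x).
  has_lat <- any(str_detect(dims$name, regex("^(lat|latitude|y)$", ignore_case = TRUE)))
  has_lon <- any(str_detect(dims$name, regex("^(lon|longitude|x)$", ignore_case = TRUE)))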
  
  # Robust attribute check using ncmeta
  all_atts <- ncmeta::nc_atts(fp)
  has_grid_mapping <- any(all_atts$name == "grid_mapping")
  
  spatial_status <- if (has_grid_mapping || (has_lat && has_lon)) "Georeferenced" else "No Spatial Grid"
  
  # 4. Data Health Check (Sparsity)
  # Read the first variable of the active grid and test whether every value is missing.
  # Note: this reads the whole variable, which can be slow for very large files.
  is_empty_label <- "Unknown"
  try({
    first_var <- vars$name[1]
    # hyper_array() returns a named list of arrays, one per selected variable
    sample_data <- hyper_array(tnc, select_var = first_var)
    
    if (all(is.na(sample_data[[first_var]]))) {
      is_empty_label <- "⚠️ All NaNs (Empty)"
    } else {
      is_empty_label <- "Contains Data"
    }
  }, silent = TRUE)
  
  return(tibble(
    FileName = fname,
    Status = "Valid",
    DimsSummary = dims_str,
    VarCount = var_count,
    HasCRS = spatial_status,
    DataHealth = is_empty_label
  ))
}

# Run Inventory
message(paste("Scanning", length(nc_files), "files for usability..."))
inventory_results <- map_dfr(nc_files, inspect_nc_inventory)

print(paste("Inventory complete for", nrow(inventory_results), "files."))
[1] "Inventory complete for 6 files."
Code
head(inventory_results)
# A tibble: 6 × 6
  FileName                         Status DimsSummary VarCount HasCRS DataHealth
  <chr>                            <chr>  <chr>          <int> <chr>  <chr>     
1 Averaged_exceedance_TP_p99p0_ER… Valid  longitude …        2 Geore… Unknown   
2 Geopotential_orography.nc        Valid  longitude …        1 Geore… Unknown   
3 TPp99p0_2001_2020_ERA5.nc        Valid  longitude …        1 Geore… Unknown   
4 TPp99p0_2001_2020_IMERG.nc       Valid  longitude …        1 Geore… Unknown   
5 TPp99p0_2001_2020_IMERG025grid.… Valid  lon x lat          1 Geore… Unknown   
6 WSp99p0_2001_2020_ERA5.nc        Valid  longitude …        1 Geore… Unknown   
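
The spatial check is a heuristic based on names, so it helps to see how it behaves on a few inputs. The sketch below is a standalone illustration using base R. The helper `looks_georeferenced` is hypothetical, and it matches whole dimension names so that names such as "year" or "index" do not count as y or x:

```r
# Hypothetical helper: heuristic check on dimension and attribute names only.
looks_georeferenced <- function(dim_names, att_names = character()) {
  has_lat <- any(grepl("^(lat|latitude|y)$", dim_names, ignore.case = TRUE))
  has_lon <- any(grepl("^(lon|longitude|x)$", dim_names, ignore.case = TRUE))
  has_grid_mapping <- "grid_mapping" %in% att_names
  has_grid_mapping || (has_lat && has_lon)
}

looks_georeferenced(c("longitude", "latitude", "time"))  # lat/lon dimensions
looks_georeferenced(c("year", "index"))                  # no spatial names
looks_georeferenced(c("rlat", "rlon"), "grid_mapping")   # projected grid
```

A positive result only indicates that spatial dimensions or a grid_mapping attribute are present. It does not confirm that the CRS is fully or correctly defined.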

17.6 Metadata Extraction

Now, we perform a deep extraction of all attributes to create detailed documentation.

Code
# Create a "safely" version of tidync to handle potentially corrupt files
safe_tidync <- purrr::safely(tidync)

# 1. Process all files
processed_files <- purrr::map(nc_files, ~safe_tidync(.x)) %>% 
  set_names(nc_files)

# 2. Separate successful results
successful_results <- purrr::map(processed_files, "result") %>% 
  purrr::compact()

errors <- purrr::map(processed_files, "error") %>% 
  purrr::compact()

if (length(errors) > 0) {
  message("The following files failed deep extraction:")
  walk(names(errors), message)
}

# 3. Extract Raw Components
nc_dimensions <- purrr::map(successful_results, ~.x$dimension) %>% 
  bind_rows(.id = "FileName") %>% mutate(FileName = basename(FileName))

nc_variables <- purrr::map(successful_results, ~.x$variable) %>% 
  bind_rows(.id = "FileName") %>% mutate(FileName = basename(FileName))

nc_attributes <- purrr::map(successful_results, ~.x$attribute) %>% 
  bind_rows(.id = "FileName") %>% mutate(FileName = basename(FileName))

print("Deep extraction complete.")
[1] "Deep extraction complete."
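
The `safely()` pattern used above can be tried without any NetCDF files. In this sketch a stand-in function (`fake_open`, an invented name) fails for one input, which shows how results and errors end up in separate lists:

```r
library(purrr)

# Stand-in for tidync(): fails for one "file" (illustrative names only)
fake_open <- function(path) {
  if (grepl("broken", path)) stop("cannot open ", path)
  paste("opened", path)
}

safe_open <- safely(fake_open)
paths <- c("good_a.nc", "broken_b.nc", "good_c.nc")

results <- set_names(map(paths, safe_open), paths)

ok     <- compact(map(results, "result"))  # successful opens only
failed <- compact(map(results, "error"))   # captured error objects only

names(ok)
names(failed)
```

Because each call returns both a `result` and an `error` slot, one unreadable file does not stop the whole batch.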

17.7 Reshape metadata for comparison

We reshape the metadata into two summary tables: one for Global Attributes (file-level) and one for Variable Attributes (variable-level).

Code
# A. Global Attributes Summary
nc_attributes_global <- nc_attributes %>%
  filter(variable == "NC_GLOBAL") %>%
  pivot_wider(
    id_cols = FileName,
    names_from = name,
    values_from = value,
    values_fn = ~paste(., collapse = "; ")
  )

print("--- Global Attributes Summary ---")
[1] "--- Global Attributes Summary ---"
Code
glimpse(nc_attributes_global)
Rows: 1
Columns: 3
$ FileName    <chr> "Geopotential_orography.nc"
$ Conventions <chr> "CF-1.6"
$ history     <chr> "2024-04-18 11:32:43 GMT by grib_to_netcdf-2.25.1: /opt/ec…
Code
# B. Variable Attributes Summary
nc_variables_with_attributes <- nc_variables %>%
  left_join(
    filter(nc_attributes, variable != "NC_GLOBAL"),
    by = c("name" = "variable", "FileName")
  ) %>%
  pivot_wider(
    names_from = name.y, 
    values_from = value,
    values_fn = ~paste(., collapse = "; ")
  )

print("--- Variables and Attributes Summary ---")
[1] "--- Variables and Attributes Summary ---"
Code
head(nc_variables_with_attributes)
# A tibble: 6 × 20
  FileName       id.x name  type  ndims natts dim_coord active  id.y `NA`  units
  <chr>         <int> <chr> <chr> <int> <int> <lgl>     <lgl>  <dbl> <chr> <chr>
1 Averaged_exc…     0 long… NC_F…     1     0 TRUE      FALSE     NA NULL  <NA> 
2 Averaged_exc…     1 lati… NC_F…     1     0 TRUE      FALSE     NA NULL  <NA> 
3 Averaged_exc…     2 seaE… NC_F…     3     0 FALSE     TRUE      NA NULL  <NA> 
4 Averaged_exc…     3 seaE… NC_F…     3     0 FALSE     TRUE      NA NULL  <NA> 
5 Averaged_exc…     4 Exco… NC_F…     2     0 FALSE     FALSE     NA NULL  <NA> 
6 Averaged_exc…     5 Exce… NC_F…     2     0 FALSE     FALSE     NA NULL  <NA> 
# ℹ 9 more variables: long_name <chr>, calendar <chr>, scale_factor <chr>,
#   add_offset <chr>, `_FillValue` <chr>, missing_value <chr>,
#   standard_name <chr>, coordinates <chr>, axis <chr>
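
To make the reshaping step concrete, here is the same `pivot_wider()` pattern applied to a small, made-up attribute table. The file names and values are illustrative only:

```r
library(dplyr)
library(tidyr)
library(tibble)

# Toy long-format attribute table (values are illustrative)
atts_long <- tribble(
  ~FileName, ~variable,   ~name,         ~value,
  "a.nc",    "NC_GLOBAL", "Conventions", "CF-1.6",
  "a.nc",    "NC_GLOBAL", "history",     "created",
  "a.nc",    "NC_GLOBAL", "history",     "regridded",
  "b.nc",    "NC_GLOBAL", "Conventions", "CF-1.8"
)

atts_wide <- atts_long %>%
  filter(variable == "NC_GLOBAL") %>%
  pivot_wider(
    id_cols = FileName,
    names_from = name,
    values_from = value,
    values_fn = ~ paste(.x, collapse = "; ")
  )

atts_wide
```

Each file becomes one row, repeated attributes are joined with "; ", and attributes a file lacks appear as NA, which makes gaps easy to spot.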

17.8 Save Summary Reports

Finally, we save the most useful summary tables to .csv files for documentation and further analysis.

Code
output_dir <- "Results/Inspect_nc"
dir.create(output_dir, recursive = TRUE, showWarnings = FALSE)

timestamp <- format(Sys.Date(), "%Y%m%d")

# 1. Save Inventory (High-Level Scan)
write.csv(inventory_results, file.path(output_dir, paste0("NetCDF_Inventory_", timestamp, ".csv")), row.names = FALSE)

# 2. Save Deep Metadata (Detailed Tables)
write.csv(nc_dimensions, file.path(output_dir, paste0("NetCDF_Dimensions_", timestamp, ".csv")), row.names = FALSE)
write.csv(nc_attributes_global, file.path(output_dir, paste0("NetCDF_Global_Attributes_", timestamp, ".csv")), row.names = FALSE)
write.csv(nc_variables_with_attributes, file.path(output_dir, paste0("NetCDF_Variables_Attributes_", timestamp, ".csv")), row.names = FALSE)

print(paste("All reports saved to:", output_dir))
[1] "All reports saved to: Results/Inspect_nc"

17.9 Curation Insights

Use the generated reports to guide your preservation actions:

  • Spatial Awareness (HasCRS): Files marked “No Spatial Grid” have neither recognizable latitude/longitude dimensions nor a grid_mapping attribute, and may fail to load correctly in GIS software. Check whether they are genuinely non-spatial (e.g., a time series at a single station) or whether the spatial metadata is missing. Note that “Georeferenced” is a name-based heuristic and does not confirm that a CRS is fully defined.

  • Data Health (DataHealth): Files marked “⚠️ All NaNs (Empty)” are likely empty shells: the model ran but produced no output. Verify these files manually before excluding them from the archive. “Unknown” means the variable could not be read, so the check was inconclusive.

  • Metadata Compliance: In the NetCDF_Global_Attributes CSV, check that the Conventions attribute is present (e.g., “CF-1.6”) and verify that the file actually meets the convention version it declares.
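
As a first pass on Metadata Compliance, the Conventions column can be screened with a simple pattern. This is a rough sketch: the helper `declares_cf` is hypothetical, and the example values are invented:

```r
# Hypothetical helper: does a Conventions string declare a CF version?
declares_cf <- function(conventions) {
  !is.na(conventions) & grepl("CF-[0-9]+\\.[0-9]+", conventions)
}

# Illustrative values, as they might appear in the Conventions column
conv <- c("CF-1.6", "CF-1.8, ACDD-1.3", "COARDS", NA)
declares_cf(conv)
```

A declared version is only a claim made by the file. Confirming real compliance requires a dedicated CF compliance checker or manual review of variable attributes.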

17.10 Additional Tools & Resources

17.10.1 Verify consistent global attributes

Check whether all files in a dataset share the same title, institution, source, and CF Conventions. A single row in the result means the files are consistent.

if (nrow(nc_attributes_global) > 0) {
  # Select a few key attributes and count the unique combinations
  global_consistency_check <- nc_attributes_global %>%
    select(FileName, contains("title"), contains("institution"), contains("source"), contains("Conventions")) %>%
    # The line below groups by all columns except FileName
    group_by(across(-FileName)) %>%
    summarise(file_count = n(), .groups = "drop")
  
  print("Consistency Check of Key Global Attributes:")
  print(global_consistency_check)
}
[1] "Consistency Check of Key Global Attributes:"
# A tibble: 1 × 2
  Conventions file_count
  <chr>            <int>
1 CF-1.6               1
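
The output above contains a single combination, so it does not show what an inconsistency looks like. The toy example below uses invented attribute values and adds a `files` column (an addition for illustration) to show which files fall into each combination:

```r
library(dplyr)
library(tibble)

# Toy global-attribute table (illustrative values)
globals <- tribble(
  ~FileName, ~institution,  ~Conventions,
  "a.nc",    "Institute A", "CF-1.6",
  "b.nc",    "Institute A", "CF-1.6",
  "c.nc",    "Institute A", "CF-1.8"
)

combos <- globals %>%
  group_by(across(-FileName)) %>%
  summarise(
    file_count = n(),
    files = paste(FileName, collapse = ", "),
    .groups = "drop"
  )

# More than one row means the files do not share the same key attributes
nrow(combos) > 1
combos
```

Listing the file names per combination makes it easier to find the outliers in a large collection.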

17.10.2 Check for essential variable attributes

For data to be reusable, variables should always have attributes like long_name and units. The check below counts, across all files, the variables that lack these attributes.

if (nrow(nc_variables_with_attributes) > 0) {
  missing_attribute_check <- nc_variables_with_attributes %>%
    # Summarise the number of variables missing these key attributes
    summarise(
      missing_long_name = sum(is.na(long_name)),
      missing_units = sum(is.na(units))
    )
  
  print("Check for Missing Essential Variable Attributes:")
  print(missing_attribute_check)
}
[1] "Check for Missing Essential Variable Attributes:"
# A tibble: 1 × 2
  missing_long_name missing_units
              <int>         <int>
1                46            46
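
A single total does not show which files need attention. One possible refinement, sketched here on a made-up variable table (the column names follow the summary above, the values are invented), is to break the counts down per file:

```r
library(dplyr)
library(tibble)

# Toy variable table (illustrative values)
vars_tbl <- tribble(
  ~FileName, ~name,      ~long_name,      ~units,
  "a.nc",    "tp",       "precipitation", "mm",
  "a.nc",    "latitude", NA,              "degrees_north",
  "b.nc",    "ws",       NA,              NA
)

missing_by_file <- vars_tbl %>%
  group_by(FileName) %>%
  summarise(
    n_variables       = n(),
    missing_long_name = sum(is.na(long_name)),
    missing_units     = sum(is.na(units)),
    .groups = "drop"
  )

missing_by_file
```

Comparing the missing counts with `n_variables` shows whether a file lacks attributes entirely or only for a few variables.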

Beyond R, the following external tools are useful for inspecting and repairing NetCDF files:

  • CDO (Climate Data Operators): A command-line suite for manipulating and analyzing NetCDF data, widely used for regridding and statistical aggregation.

  • Panoply: A cross-platform application from NASA that plots geo-referenced arrays from NetCDF files. Excellent for “Visual QC”.

  • NCO (NetCDF Operators): A toolkit to perform arithmetic and attribute editing on NetCDF files (Zender 2008).

17.11 Using the Non-Interactive R Script

For users who want to run this analysis on a server, in a batch job, or from the command line, here is a pure R script that performs the same process.

17.11.1 The Inspect_nc_Script.R Script

Download the R Script: Inspect_nc_Script.R
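
The script itself is not reproduced here. As a rough sketch of how such a script might read the directory passed on the command line (the helper `parse_target_dir` is hypothetical, and the default path is taken from the notebook run above), consider:

```r
# Hypothetical helper: take the first trailing argument as the target directory.
parse_target_dir <- function(args = commandArgs(trailingOnly = TRUE),
                             default = "data/Inspect_nc/") {
  if (length(args) >= 1 && nzchar(args[1])) args[1] else default
}

target_dir <- parse_target_dir()
message("Analyzing directory: ", target_dir)
```

With this approach, `Rscript Inspect_nc_Script.R /path/to/data` would analyze `/path/to/data`, and running the script with no argument would fall back to the default.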

17.11.2 Example HPC Submission Script (Inspect_nc_submit.sh)

#!/bin/bash
#SBATCH --job-name=nc_inspect
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=2
#SBATCH --mem=8G
#SBATCH --time=00:30:00
#SBATCH --output=logs/nc_inspect_%j.log

# 1. Load R Module
module load R

# 2. Define Directory
TARGET_DIR="/scratch/user/project_data/climate_models"

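# 3. Prepare Environment
# Note: Slurm opens the --output log file before this script starts, so the
# logs/ directory must already exist at submission time.
# Run `mkdir -p logs` once before calling sbatch.
mkdir -p Results/Inspect_nc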

# 4. Run Analysis
echo "Starting NetCDF Inspection on $TARGET_DIR"
Rscript Inspect_nc_Script.R "$TARGET_DIR"

17.12 References