Documents & Unstructured Text
Overview
Documentation is the backbone of reproducible research. README files, codebooks, and methodology reports provide the context necessary to understand data. However, for these documents to remain accessible over the long term, they must be curated just like the data itself.
Key Objectives:
- Accessibility (OCR): Ensure PDF documents contain searchable text, not just scanned images.
- Archival Formats: Verify that documents are saved in stable, open formats (e.g., PDF/A for PDFs, UTF-8 text for READMEs).
- Metadata: Extract embedded metadata (Author, Title) and confirm that it matches the dataset description.
- Encoding: Ensure plain text files use standard UTF-8 encoding to prevent “mojibake” (garbled characters).
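The OCR and metadata objectives above can be sketched in Python. This is a minimal illustration, not a definitive implementation: it assumes the third-party pypdf package is installed, and the function names and the 20-character threshold are hypothetical choices made for this example. A page whose extracted text is nearly empty is treated as a likely scanned image.

```python
from typing import Dict, Iterable, List, Optional


def extract_page_texts(pdf_path: str) -> List[str]:
    """Return the extracted text of each page (assumes pypdf is installed)."""
    from pypdf import PdfReader  # lazy import: the helper below works without pypdf

    reader = PdfReader(pdf_path)
    return [page.extract_text() or "" for page in reader.pages]


def pages_without_text(page_texts: Iterable[str], min_chars: int = 20) -> List[int]:
    """Return 1-based numbers of pages with fewer than min_chars visible characters.

    min_chars is an arbitrary threshold chosen for this sketch.
    """
    flagged = []
    for number, text in enumerate(page_texts, start=1):
        visible = "".join(text.split())  # drop all whitespace before counting
        if len(visible) < min_chars:
            flagged.append(number)
    return flagged


def embedded_metadata(pdf_path: str) -> Dict[str, Optional[str]]:
    """Return the embedded Title and Author, or None where they are missing."""
    from pypdf import PdfReader

    info = PdfReader(pdf_path).metadata  # None when the PDF has no info dictionary
    if info is None:
        return {"title": None, "author": None}
    return {"title": info.title, "author": info.author}
```

Calling `pages_without_text(extract_page_texts("report.pdf"))` (with a hypothetical file name) lists the pages that probably need OCR. The values from `embedded_metadata` can then be compared by hand against the dataset description.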
Supported Formats
- Portable Document Format (.pdf): Standard for fixed-layout documents.
- Plain Text (.txt, .md, .csv): Standard for READMEs, codebooks, and simple data.
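For the plain-text formats listed above, a strict decode is one simple way to confirm that a file is valid UTF-8. The sketch below uses only the Python standard library. The function names are hypothetical, and the source encoding passed to `transcode_to_utf8` is assumed to be known already, for example from the depositor.

```python
from typing import Optional


def utf8_error(data: bytes) -> Optional[str]:
    """Return None if data is valid UTF-8, else a short description of the problem."""
    try:
        data.decode("utf-8")  # strict decoding raises on any invalid byte sequence
    except UnicodeDecodeError as exc:
        return f"invalid byte at offset {exc.start}: {exc.reason}"
    return None


def transcode_to_utf8(data: bytes, source_encoding: str) -> bytes:
    """Re-encode data from a known source encoding (e.g. 'cp1252') to UTF-8."""
    return data.decode(source_encoding).encode("utf-8")


def check_file(path: str) -> Optional[str]:
    """Read a file as raw bytes and report any UTF-8 problem."""
    with open(path, "rb") as handle:
        return utf8_error(handle.read())
```

Pure-ASCII files pass this check, since ASCII is a subset of UTF-8. A failed check shows only that the file is not UTF-8. It does not identify the encoding that was actually used.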
Common Curation Challenges
- “Dead” PDFs: Scanned documents that look like text but are actually page images. Without an OCR text layer, they cannot be searched or indexed.
- Proprietary Encodings: Text files saved in platform-specific encodings such as Windows-1252 or MacRoman often display accented characters and curly quotes incorrectly when another system reads them as UTF-8.
- Broken Links: Documentation often references external URLs that may no longer exist.
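The broken-link challenge can be approached by pulling URLs out of a document and requesting each one. The following is a rough sketch using only the Python standard library. The regular expression is a deliberately simple approximation, and the function names and the `User-Agent` string are hypothetical.

```python
import http.client
import re
import urllib.request
from typing import List

# Simplified pattern: an http(s) URL runs until whitespace, a quote, or a bracket.
URL_PATTERN = re.compile(r"https?://[^\s<>()\[\]'\"]+")


def extract_urls(text: str) -> List[str]:
    """Return the unique URLs found in text, in order of first appearance."""
    urls: List[str] = []
    for match in URL_PATTERN.findall(text):
        url = match.rstrip(".,;:")  # drop trailing sentence punctuation
        if url not in urls:
            urls.append(url)
    return urls


def is_reachable(url: str, timeout: float = 10.0) -> bool:
    """Return True if a HEAD request to url succeeds with a status below 400."""
    request = urllib.request.Request(
        url, method="HEAD", headers={"User-Agent": "link-check-sketch"}
    )
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status < 400
    except (OSError, ValueError, http.client.HTTPException):
        # Covers HTTP errors, DNS failures, timeouts, and malformed URLs.
        return False
```

Some servers reject HEAD requests even when the page exists, so a `False` result is best treated as a prompt for manual review, not as proof that the link is dead.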