CDM Parquet → linkml-store Loading Guide
Overview
This guide explains how to load KBase Common Data Model (CDM) parquet files into a queryable linkml-store DuckDB database for efficient analysis and querying.
What is CDM? The KBase Common Data Model is a comprehensive data warehouse containing:
- 44 parquet tables (157 MB total)
- 515K rows across static entities and system tables
- 82.6M additional rows in dynamic brick tables
- Complete ENIGMA genomic and experimental data

Why linkml-store? linkml-store provides:
- Fast DuckDB-based columnar storage
- Schema validation against LinkML models
- A Pythonic query interface
- Easy integration with the LinkML ecosystem
Quick Start
1. Load CDM Data (Static + System Tables)
Load the core CDM tables (23 tables, 515K rows):
# Using default paths
just load-cdm-store
# Or specify paths
just load-cdm-store data/enigma_coral.db output.db
This loads:
- Static entity tables (sdt_*): Location, Sample, Reads, Assembly, Genome, Gene, etc. (17 tables, 273K rows)
- System tables (sys_*): Ontology terms, Type definitions, Process records (6 tables, 242K rows)
Expected output:
📦 Loading CDM parquet data into linkml-store...
📦 Connecting to database: cdm_store.db
📋 Loaded schema: kbase-cdm
============================================================
📦 Loading Static Entity Tables (sdt_*)
============================================================
📥 Loading sdt_location as Location...
📊 Total rows: 42
✅ Loaded 42 records in 0.15s
📥 Loading sdt_sample as Sample...
📊 Total rows: 4,330
✅ Loaded 4,330 records in 0.45s
... (continues for all tables)
📊 Summary: Loaded 515,109 total records across 23 collections
⏱️ Total time: 45.2s (11,399 records/sec)
💾 Database saved to: cdm_store.db
Size: 48.23 MB
2. Query the Database
Show database statistics:
just cdm-store-stats
Find samples from a location:
just cdm-find-samples Location0000001
Search ontology terms:
just cdm-search-oterm "soil"
Trace provenance lineage:
just cdm-lineage Assembly Assembly0000001
Loading Options
Option 1: Core Tables Only (Default)
- Tables: Static entities (sdt_*) + System tables (sys_*)
- Size: ~50 MB database, 515K rows
- Time: ~1 minute
just load-cdm-store
Option 2: Include Dynamic Brick Tables (Sampled)
- Tables: Core + Dynamic bricks (sampled at 10K rows each)
- Size: ~100 MB database
- Time: ~2-3 minutes
just load-cdm-store-full
Option 3: Custom Loading
Use the Python script directly for fine-grained control:
# Load only static tables
uv run python scripts/cdm_analysis/load_cdm_parquet_to_store.py \
data/enigma_coral.db \
--output my_cdm.db \
--include-static \
--no-system
# Load with custom dynamic brick sampling
uv run python scripts/cdm_analysis/load_cdm_parquet_to_store.py \
data/enigma_coral.db \
--include-dynamic \
--max-dynamic-rows 50000
# Verbose output for debugging
uv run python scripts/cdm_analysis/load_cdm_parquet_to_store.py \
data/enigma_coral.db \
--verbose
CDM Table Structure
Static Entity Tables (sdt_*)
17 tables representing core scientific entities:
| Table | LinkML Class | Rows | Description |
|---|---|---|---|
| sdt_location | Location | 42 | Sampling locations with coordinates |
| sdt_sample | Sample | 4,330 | Environmental samples |
| sdt_community | Community | 2,150 | Microbial communities |
| sdt_reads | Reads | 19,307 | Sequencing reads datasets |
| sdt_assembly | Assembly | 3,427 | Genome assemblies |
| sdt_bin | Bin | 1,234 | Metagenomic bins |
| sdt_genome | Genome | 6,688 | Annotated genomes |
| sdt_gene | Gene | 15,015 | Annotated genes |
| sdt_strain | Strain | 8,901 | Microbial strains |
| sdt_taxon | Taxon | 1,234 | Taxonomic classifications |
| sdt_asv | ASV | 213,044 | Amplicon sequence variants |
| sdt_protocol | Protocol | 42 | Experimental protocols |
| sdt_image | Image | 15 | Microscopy images |
| sdt_condition | Condition | 234 | Growth conditions |
| sdt_dubseq_library | DubSeqLibrary | 12 | DubSeq libraries |
| sdt_tnseq_library | TnSeqLibrary | 8 | TnSeq libraries |
| sdt_enigma | ENIGMA | 1 | Root entity |
System Tables (sys_*)
6 tables for metadata and provenance:
| Table | LinkML Class | Rows | Description |
|---|---|---|---|
| sys_typedef | SystemTypedef | 118 | Type definitions |
| sys_ddt_typedef | SystemDDTTypedef | 101 | Dynamic data type defs |
| sys_oterm | SystemOntologyTerm | 10,594 | Ontology term catalog |
| sys_process | SystemProcess | 142,958 | Provenance records |
| sys_process_input | SystemProcessInput | 90,395 | Process inputs (denormalized) |
| sys_process_output | SystemProcessOutput | 38,228 | Process outputs (denormalized) |
Dynamic Data Tables (ddt_*)
21 tables for measurement arrays:
| Table | Rows | Default Strategy |
|---|---|---|
| ddt_ndarray | 20 | Full load |
| ddt_brick* (20 tables) | 82.6M total | Sampled or skipped |
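The sampling strategy can be reproduced outside the loader if you need an ad-hoc slice of a brick. The following is a minimal sketch, not the loader's actual code: it assumes the brick directories are readable as plain parquet by pyarrow, and the `ddt_brick_example` path and 10,000-row cap are illustrative (mirroring the default described above).

```python
# Minimal sketch: sample the first N rows of a large dynamic brick table.
# Path and row cap are illustrative; adjust to your CDM layout.
import pyarrow.dataset as ds

MAX_DYNAMIC_ROWS = 10_000  # default sampling cap described in this guide

brick = ds.dataset("data/enigma_coral.db/ddt_brick_example", format="parquet")
sampled = brick.head(MAX_DYNAMIC_ROWS)   # reads only the first rows, not all 82.6M
records = sampled.to_pylist()            # list of dicts, ready to insert into a collection
print(f"Sampled {len(records):,} of {brick.count_rows():,} rows")
```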
CDM Naming Conventions
The CDM uses specific naming patterns different from the original CORAL schema:
Primary Keys
- Pattern: `sdt_{entity}_id` (e.g., `sdt_sample_id`)
- Example: `Sample0000001`
Entity Names
- Pattern: `sdt_{entity}_name` (e.g., `sdt_sample_name`)
- Used in foreign key references instead of IDs
Foreign Keys
- Use the `_name` suffix, not `_id`
- Example: `location_ref` references `Location.sdt_location_name`
Ontology Terms
- Split into ID + name pairs
- Pattern: `{field}_sys_oterm_id` + `{field}_sys_oterm_name`
- Example: `material_sys_oterm_id` + `material_sys_oterm_name`
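Putting the foreign-key convention into practice: joins must match on `*_name` fields rather than IDs. Below is a minimal sketch using the same client/collection pattern shown later in this guide; the field names follow the patterns above, so verify them against your schema.

```python
# Sketch: resolve a Sample's location via the *_name foreign-key convention.
from linkml_store import Client

db = Client().attach_database("duckdb:///cdm_store.db", alias="cdm")
samples = db.get_collection("Sample")
locations = db.get_collection("Location")

sample = list(samples.find(limit=1))[0]
# location_ref holds the referenced Location's *name*, not its ID
matches = list(locations.find({"sdt_location_name": sample["location_ref"]}))
print(sample["sdt_sample_name"], "->",
      matches[0]["sdt_location_name"] if matches else "no match")
```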
Query Interface
Python API
from scripts.cdm_analysis.query_cdm_store import CDMStoreQuery
# Initialize
query = CDMStoreQuery('cdm_store.db')
# Get statistics
stats = query.stats()
print(f"Total records: {stats['total_records']:,}")
# Find samples by location
samples = query.find_samples_by_location('Location0000001')
# Search ontology terms
terms = query.search_ontology_terms('soil')
# Trace provenance
lineage = query.trace_lineage('Assembly', 'Assembly0000001')
Command Line
# Show database stats
python scripts/cdm_analysis/query_cdm_store.py --db cdm_store.db stats
# Find samples
python scripts/cdm_analysis/query_cdm_store.py --db cdm_store.db \
find-samples --location Location0000001
# Search ontology terms
python scripts/cdm_analysis/query_cdm_store.py --db cdm_store.db \
search-oterm "soil"
# Trace lineage
python scripts/cdm_analysis/query_cdm_store.py --db cdm_store.db \
lineage Assembly Assembly0000001
# Export results to JSON
python scripts/cdm_analysis/query_cdm_store.py --db cdm_store.db \
stats --export stats.json
Justfile Commands
# Statistics
just cdm-store-stats
# Find samples
just cdm-find-samples Location0000001
# Search ontology terms
just cdm-search-oterm "soil"
# Trace lineage
just cdm-lineage Assembly Assembly0000001
Common Queries
1. Get All Samples from a Location
query = CDMStoreQuery('cdm_store.db')
samples = query.find_samples_by_location('Location0000001')
for sample in samples:
print(f"Sample: {sample['sdt_sample_name']}")
print(f" Depth: {sample.get('depth')}m")
print(f" Date: {sample.get('date')}")
2. Search Ontology Terms
# Find all soil-related terms
terms = query.search_ontology_terms('soil', limit=50)
for term in terms:
print(f"{term['sys_oterm_id']}: {term['sys_oterm_name']}")
3. Trace Assembly Provenance
lineage = query.trace_lineage('Assembly', 'Assembly0000001')
# What reads were used to create this assembly?
for process in lineage['upstream']:
print(f"Process: {process['process_type']}")
for input_ref in process['inputs']:
print(f" Input: {input_ref}")
4. Get Database Statistics
stats = query.stats()
print(f"Database: {stats['database']}")
print(f"Collections: {stats['total_collections']}")
print(f"Total records: {stats['total_records']:,}")
for coll_name, count in stats['collections'].items():
print(f" {coll_name}: {count:,}")
Performance Considerations
Loading Time
| Dataset | Tables | Rows | Size | Time |
|---|---|---|---|---|
| Static + System | 23 | 515K | 50 MB | ~1 min |
| + Dynamic (sampled) | 28 | ~615K | 100 MB | ~2-3 min |
Query Performance
- Small tables (<10K rows): Instantaneous
- Medium tables (10-100K rows): <1 second
- Large tables (>100K rows): 1-5 seconds
- Full table scans: May require optimization
Optimization Tips
- Use indexes (created automatically with `--create-indexes`)
- Limit result sets (use `limit` parameters)
- Filter early (query by ID/name when possible)
- Export large results (use JSON export for downstream processing)
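As a small illustration of the "filter early" and "limit result sets" tips, here is a hedged sketch. It assumes `find()` accepts a filter together with a `limit` (they appear in separate examples elsewhere in this guide), and the `sample_ref` field name is purely illustrative.

```python
# Sketch: push the filter and the limit into the query instead of
# scanning a whole collection and filtering client-side.
from linkml_store import Client

db = Client().attach_database("duckdb:///cdm_store.db", alias="cdm")
asvs = db.get_collection("ASV")

# Good: filter on a name/reference field and cap the result set
hits = list(asvs.find({"sample_ref": "Sample0000001"}, limit=100))

# Avoid: pulling all 213K ASV rows and filtering in Python
# hits = [r for r in list(asvs.find()) if r.get("sample_ref") == "Sample0000001"][:100]
```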
Data Quality Notes
Known Issues
- NULL Values in Required Fields
  - Some CDM tables have NULL values in fields marked as required
  - Example: `sdt_enigma.sdt_enigma_id` may be NULL
  - Loader handles these gracefully (converts pandas NaN → None)
- Foreign Key References
  - CDM uses `_name` fields for foreign keys, not `_id`
  - Example: `Sample.location_ref` → `Location.sdt_location_name`
  - Ensure proper join queries
- Ontology Term Splitting
  - Single fields in original CORAL split into ID + name
  - Example: `material` → `material_sys_oterm_id` + `material_sys_oterm_name`
  - Both fields preserved in linkml-store
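The NaN → None conversion mentioned above is a standard pandas idiom; here is a minimal sketch of what such a conversion can look like (not necessarily the loader's exact implementation):

```python
# Sketch: convert pandas NaN/NaT to explicit None before inserting records.
import pandas as pd

df = pd.read_parquet("data/enigma_coral.db/sdt_enigma")
records = df.astype(object).where(df.notna(), None).to_dict(orient="records")
# NULLs in required fields are now plain None values rather than float('nan')
```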
Validation
Validate parquet data before loading:
# Validate single table
just validate-cdm-parquet /path/to/sdt_sample Sample
# Validate all tables (sample mode)
just validate-all-cdm-parquet
# Full validation with detailed report
just validate-cdm-full
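The same checks can be scripted in Python. Below is a hedged sketch using LinkML's validator, assuming a recent linkml release that exposes `linkml.validator.validate`; NaN values may first need the None conversion described under Known Issues.

```python
# Sketch: validate one parquet record against the CDM schema's Sample class.
import pandas as pd
from linkml.validator import validate

df = pd.read_parquet("data/enigma_coral.db/sdt_sample")
record = df.astype(object).where(df.notna(), None).iloc[0].to_dict()

report = validate(record, "src/linkml_coral/schema/cdm/linkml_coral_cdm.yaml", "Sample")
for result in report.results:
    print(result.severity, result.message)
```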
Troubleshooting
"Database not found"
# Create database first
just load-cdm-store
"Collection not found"
Check collection names:
just cdm-store-stats
Common collection names:
- Static entities: Location, Sample, Reads, Assembly, Genome
- System tables: SystemOntologyTerm, SystemProcess, SystemTypedef
"Error reading parquet"
Verify CDM database path:
ls -lh data/enigma_coral.db/
Check for Delta Lake format (directories with _delta_log/):
ls -la data/enigma_coral.db/sdt_sample/
Memory Issues
For very large queries, use sampling or export:
# Sample large tables during load
python scripts/cdm_analysis/load_cdm_parquet_to_store.py \
data/enigma_coral.db \
--max-dynamic-rows 5000
# Or query in chunks
collection = db.get_collection("ASV")
results = list(collection.find(limit=1000))
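Alternatively, for bulk or export-style access you can stream rows in fixed-size chunks straight from the DuckDB file underneath linkml-store. This is a sketch: it assumes the collection is stored as a DuckDB table whose name matches the collection (check with `just cdm-store-stats` or DuckDB's `SHOW TABLES`).

```python
# Sketch: stream a large collection in chunks directly from DuckDB.
import duckdb

con = duckdb.connect("cdm_store.db", read_only=True)
cur = con.execute('SELECT * FROM "ASV"')  # table name assumed to match the collection
while True:
    chunk = cur.fetchmany(10_000)
    if not chunk:
        break
    # ... process the chunk, e.g. write to JSON/TSV ...
    print(f"processed {len(chunk):,} rows")
```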
Integration with CDM Analysis Tools
CDM Schema
The loader uses the CDM LinkML schema:
src/linkml_coral/schema/cdm/linkml_coral_cdm.yaml
├── cdm_base.yaml # Base types and mixins
├── cdm_static_entities.yaml # 17 entity classes
├── cdm_system_tables.yaml # 6 system classes
└── cdm_dynamic_data.yaml # Brick infrastructure
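To inspect the schema programmatically, linkml-runtime's `SchemaView` can load the top-level file and resolve the imports listed above. A minimal sketch; the `Sample` class name is taken from the tables earlier in this guide.

```python
# Sketch: explore the CDM LinkML schema with SchemaView.
from linkml_runtime import SchemaView

sv = SchemaView("src/linkml_coral/schema/cdm/linkml_coral_cdm.yaml")
print(len(sv.all_classes()), "classes")        # entity, system, and brick classes
print(sorted(sv.class_slots("Sample"))[:10])   # a few slots defined on Sample
```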
CDM Analysis Scripts
Related tools:
# Analyze parquet structure
just analyze-cdm
# Generate schema reports
just cdm-report
# Validate parquet data
just validate-all-cdm-parquet
# Full validation with error report
just validate-cdm-full
Comparison with TSV Loader
| Aspect | TSV Loader | CDM Parquet Loader |
|---|---|---|
| Input | 10 TSV files | 44 parquet tables |
| Schema | linkml_coral.yaml | linkml_coral_cdm.yaml |
| Tables | 10 collections | 23-44 collections |
| Rows | 280K | 515K (static+system) |
| Size | ~10 MB | ~50 MB |
| Ontology | Single fields | Split ID+name fields |
| Foreign Keys | Use IDs | Use names |
Advanced Usage
Custom Computed Fields
The loader adds computed fields automatically:
# Reads: read_count_category
# - 'very_high': >= 100K reads
# - 'high': >= 50K reads
# - 'medium': >= 10K reads
# - 'low': < 10K reads
# Assembly: contig_count_category
# - 'high': >= 1000 contigs
# - 'medium': >= 100 contigs
# - 'low': < 100 contigs
Query by computed fields:
collection = db.get_collection("Reads")
high_quality = list(collection.find({'read_count_category': 'very_high'}))
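The thresholds above amount to a simple binning rule. Here is a hedged sketch of equivalent logic; it mirrors the documented categories but is not the loader's source.

```python
# Sketch: bin a read count into the categories listed above.
def read_count_category(read_count: int) -> str:
    if read_count >= 100_000:
        return "very_high"
    if read_count >= 50_000:
        return "high"
    if read_count >= 10_000:
        return "medium"
    return "low"

assert read_count_category(120_000) == "very_high"
assert read_count_category(9_999) == "low"
```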
Provenance Parsing
SystemProcess records have parsed provenance arrays:
collection = db.get_collection("SystemProcess")
processes = list(collection.find(limit=10))
for proc in processes:
    # Original arrays
    print(f"Inputs: {proc['sys_process_input_objects']}")
    print(f"Outputs: {proc['sys_process_output_objects']}")
    # Parsed fields (added by loader)
    print(f"Input types: {proc['input_entity_types']}")
    print(f"Input IDs: {proc['input_entity_ids']}")
    print(f"Output types: {proc['output_entity_types']}")
    print(f"Output IDs: {proc['output_entity_ids']}")
Direct linkml-store API
For advanced queries, use linkml-store directly:
from linkml_store import Client
client = Client()
db = client.attach_database("duckdb:///cdm_store.db", alias="cdm")
# Get collection
samples = db.get_collection("Sample")
# Query with filters
results = list(samples.find({'depth': {'$gt': 100}}))
# Complex queries
for result in results:
print(f"Sample {result['sdt_sample_name']} at {result['depth']}m")
Next Steps
- Load the data: `just load-cdm-store`
- Explore collections: `just cdm-store-stats`
- Try queries: `just cdm-find-samples Location0000001`
- Integrate into workflows: Use the Python API in your scripts
- Contribute queries: Add new query types to `query_cdm_store.py`
Related Documentation
- CDM Parquet Validation Guide
- CDM Schema Implementation Summary
- CDM Parquet Analysis Report
- linkml-store Documentation
- LinkML Documentation
Support
For issues or questions:
- Check validation reports: `just validate-cdm-full`
- Review CDM analysis: `just analyze-cdm`
- Examine schema: `src/linkml_coral/schema/cdm/linkml_coral_cdm.yaml`
- Open issue: https://github.com/linkml/linkml-coral/issues