CDM Store Quick Start Guide
Load and query KBase Common Data Model (CDM) parquet files using linkml-store with a DuckDB backend.
Load CDM Data into linkml-store
# Load core CDM tables (static entities + system tables, ~1.1M rows)
just load-cdm-store
# This will create a file called cdm_store.db (~44 MB)
# Takes approximately 60-90 seconds
What gets loaded:
- 17 static entity tables: Location, Sample, Reads, Assembly, Genome, Gene, ASV, Bin, Community, Strain, Taxon, Protocol, Image, Condition, DubSeqLibrary, TnSeqLibrary, ENIGMA
- 6 system tables: Ontology terms, Type definitions, Process records, Process inputs/outputs
- Total: 1,110,656 records across 23 collections
- Database size: ~44 MB (highly compressed columnar storage)
- Load time: ~60-90 seconds (12,000+ records/sec)
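The resulting cdm_store.db is a regular linkml-store database, so the collections listed above can also be opened directly from Python. A minimal sketch, assuming linkml-store's standard `Client` API and the collection names shown in the statistics below:
```python
# Minimal sketch: open the generated store with linkml-store's Client API
# (assumed standard API; adjust the path if your database lives elsewhere).
from linkml_store import Client

client = Client()
db = client.attach_database("duckdb:///cdm_store.db", alias="cdm")

# Peek at a few Sample records (collection names match the list above).
samples = db.get_collection("Sample")
result = samples.find({}, limit=3)
for row in result.rows:
    print(row)
```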
Example Queries
1. Show Database Statistics
just cdm-store-stats
Output:
📊 CDM Store Database Statistics
============================================================
📂 Database: cdm_store.db
📚 Total collections: 23
📄 Total records: 1,110,656
Collections:
• ASV 426,088 records (100K+)
• Assembly 6,854 records
• Bin 1,246 records
• Community 4,418 records
• Condition 2,092 records
• DubSeqLibrary 6 records
• ENIGMA 2 records
• Gene 30,030 records (10K+)
• Genome 13,376 records (10K+)
• Image 436 records
• Location 1,188 records
• Protocol 84 records
• Reads 38,614 records (10K+)
• Sample 8,660 records
• Strain 6,220 records
• SystemDDTTypedef 202 records
• SystemOntologyTerm 21,188 records (10K+)
• SystemProcess 285,916 records (100K+)
• SystemProcessInput 180,790 records (100K+)
• SystemProcessOutput 76,456 records (10K+)
• SystemTypedef 236 records
• Taxon 6,552 records
• TnSeqLibrary 2 records
2. Find Samples from a Location
# Find all samples from a specific location
just cdm-find-samples EU02
Output:
🔍 Finding samples from location: EU02
============================================================
Found 100 sample(s):
1. EU02-D01 (Sample0000001)
Depth: 5.4m
Date: 2019-07-29
2. EU02-D02 (Sample0000033)
Depth: 5.4m
Date: 2019-08-05
3. EU02-D03 (Sample0000065)
Depth: 5.4m
Date: 2019-08-06
... (continues)
3. Search Ontology Terms
# Search for soil-related terms
just cdm-search-oterm "soil"
Output:
🔍 Searching ontology terms for: 'soil'
============================================================
Found 50 term(s):
1. ENVO:00001998: soil
Soil is an environmental material which is primarily composed of minerals...
2. ENVO:00002116: contaminated soil
A portion of contaminated soil is a portion of soil with elevated levels...
3. ENVO:00002117: creosote contaminated soil
Soil which has elevated concentrations of creosote.
4. ENVO:00002145: chromate contaminated soil
Soil which has elevated concentrations of chromate.
5. ENVO:00002259: agricultural soil
6. ENVO:00002260: dune soil
7. ENVO:00002261: forest soil
A portion of soil which is found in a forested area.
... (continues)
4. Trace Provenance Lineage
# Trace what created an assembly and what it produced
just cdm-lineage Assembly Assembly0000001
Output:
🔗 Tracing lineage for: Assembly:Assembly0000001
============================================================
⬆️ Upstream (inputs that produced this entity):
1. Process: Process0006710 (None)
Inputs: Reads:Reads0000868
⬇️ Downstream (outputs produced by this entity):
1. Process: Process0005950 (None)
Outputs: Genome:Genome0000001
5. Using Python API Directly
#!/usr/bin/env python3
import sys
from pathlib import Path
sys.path.insert(0, str(Path.cwd() / "scripts" / "cdm_analysis"))
from query_cdm_store import CDMStoreQuery
# Initialize
query = CDMStoreQuery('cdm_store.db')
# Get statistics
stats = query.stats()
print(f"Total records: {stats['total_records']:,}")
print(f"Collections: {stats['total_collections']}")
# Find samples by location
samples = query.find_samples_by_location('EU02', limit=10)
for sample in samples:
    print(f"Sample: {sample['sdt_sample_name']}")
    print(f" ID: {sample['sdt_sample_id']}")
    print(f" Depth: {sample.get('depth')}m")
# Search ontology terms
terms = query.search_ontology_terms('soil', limit=20)
for term in terms:
    print(f"{term['sys_oterm_id']}: {term['sys_oterm_name']}")
# Trace lineage
lineage = query.trace_lineage('Assembly', 'Assembly0000001')
print(f"Upstream processes: {len(lineage['upstream'])}")
print(f"Downstream processes: {len(lineage['downstream'])}")
# Access detailed provenance
for proc in lineage['upstream']:
    print(f"Process: {proc['process_id']}")
    print(f" Type: {proc['process_type']}")
    print(f" Inputs: {', '.join(proc['inputs'])}")
6. Export Query Results to JSON
# Export statistics to JSON
uv run python scripts/cdm_analysis/query_cdm_store.py \
--db cdm_store.db stats --export stats.json
# Export search results
uv run python scripts/cdm_analysis/query_cdm_store.py \
--db cdm_store.db search-oterm "soil" --export soil_terms.json
# Export lineage
uv run python scripts/cdm_analysis/query_cdm_store.py \
--db cdm_store.db lineage Assembly Assembly0000001 \
--export assembly_lineage.json
# Export samples
uv run python scripts/cdm_analysis/query_cdm_store.py \
--db cdm_store.db find-samples --location EU02 \
--export eu02_samples.json
Quick Reference
| Command | Description |
|---|---|
| `just load-cdm-store` | Load all core CDM tables |
| `just cdm-store-stats` | Show database statistics |
| `just cdm-find-samples <location>` | Find samples by location |
| `just cdm-search-oterm <term>` | Search ontology terms |
| `just cdm-lineage <type> <id>` | Trace provenance lineage |
| `just clean-cdm-store` | Delete database files |
Advanced Loading Options
Include Dynamic Brick Tables
# Include dynamic brick tables (sampled at 10K rows each)
just load-cdm-store-full
Custom Loading with Python
# Use Python directly with custom options
uv run python scripts/cdm_analysis/load_cdm_parquet_to_store.py \
data/enigma_coral.db \
--output my_cdm.db \
--include-static \
--include-system \
--include-dynamic \
--max-dynamic-rows 50000 \
--create-indexes \
--show-info \
--verbose
Available options:
- `--output, -o` - Output database path (default: `cdm_store.db`)
- `--schema` - Path to CDM LinkML schema
- `--include-static` - Load static entity tables (default: yes)
- `--no-static` - Skip static entity tables
- `--include-system` - Load system tables (default: yes)
- `--no-system` - Skip system tables
- `--include-dynamic` - Load dynamic brick tables (default: no; 82.6M rows)
- `--max-dynamic-rows` - Max rows per dynamic table (default: 10000)
- `--create-indexes` - Create indexes after loading
- `--show-info` - Show database information after loading
- `--verbose` - Verbose output
Load Only Specific Tables
# Load only static tables
uv run python scripts/cdm_analysis/load_cdm_parquet_to_store.py \
data/enigma_coral.db \
--output static_only.db \
--include-static \
--no-system
# Load only system tables
uv run python scripts/cdm_analysis/load_cdm_parquet_to_store.py \
data/enigma_coral.db \
--output system_only.db \
--no-static \
--include-system
CDM Naming Conventions
The CDM uses specific naming patterns different from the original CORAL schema:
Primary Keys
- Pattern: `sdt_{entity}_id` (e.g., `sdt_sample_id`)
- Example: `Sample0000001`
Entity Names
- Pattern: `sdt_{entity}_name` (e.g., `sdt_sample_name`)
- Used in foreign key references instead of IDs
Foreign Keys
- Use the `_name` suffix, not `_id`
- Example: `sdt_location_name` references `Location.sdt_location_name`
Ontology Terms
- Split into ID + name pairs
- Pattern: `{field}_sys_oterm_id` + `{field}_sys_oterm_name`
- Example: `material_sys_oterm_id` + `material_sys_oterm_name`
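To see these conventions in action, here is a hypothetical join against the store, assuming linkml-store materializes each collection as a DuckDB table of the same name (Sample, Location) with columns following the patterns above:
```python
# Hypothetical illustration of the *_name foreign-key convention; assumes each
# collection is stored as a DuckDB table of the same name inside cdm_store.db.
import duckdb

con = duckdb.connect("cdm_store.db", read_only=True)
rows = con.execute(
    """
    SELECT s.sdt_sample_id, s.sdt_sample_name, l.sdt_location_name
    FROM Sample AS s
    JOIN Location AS l
      ON s.sdt_location_name = l.sdt_location_name   -- FK uses *_name, not *_id
    LIMIT 5
    """
).fetchall()
for row in rows:
    print(row)
```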
Performance
Loading Performance
- Core tables (static + system): ~60-90 seconds for 1.1M records
- Load rate: 12,000+ records/second
- Database size: 44 MB (highly compressed)
Query Performance
- Small tables (<10K rows): Instantaneous
- Medium tables (10K-100K rows): <1 second
- Large tables (>100K rows): 1-2 seconds
- Provenance queries: <1 second with indexes
Architecture
CDM Parquet Files (Delta Lake format)
↓
load_cdm_parquet_to_store.py
↓
linkml-store (DuckDB backend)
↓
query_cdm_store.py (Python API)
↓
Justfile commands (CLI)
Key Features
- Delta Lake support: Reads parquet files in Delta Lake directory format
- NaN handling: Converts pandas NaN to None for database compatibility
- Array processing: Converts numpy arrays to Python lists for SQL storage
- Computed fields: Automatic categorization (read_count_category, contig_count_category)
- Provenance parsing: Extracts entity types and IDs from process arrays
- Indexing: Automatic index creation for primary keys and foreign keys
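The NaN and array handling described above amounts to roughly the following normalization step (a simplified sketch, not the loader's actual code):
```python
# Simplified sketch of the normalization described above: pandas NaN -> None
# and numpy arrays -> plain Python lists so rows can be inserted into the
# DuckDB-backed store. Illustrative only; not the loader's implementation.
import numpy as np
import pandas as pd

def normalize_record(record: dict) -> dict:
    clean = {}
    for key, value in record.items():
        if isinstance(value, np.ndarray):
            clean[key] = value.tolist()   # numpy arrays -> Python lists
        elif isinstance(value, float) and pd.isna(value):
            clean[key] = None             # NaN -> None (SQL NULL)
        else:
            clean[key] = value
    return clean

df = pd.DataFrame({"depth": [5.4, np.nan], "tags": [np.array([1, 2]), np.array([])]})
print([normalize_record(rec) for rec in df.to_dict(orient="records")])
```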
Troubleshooting
Database not found
# Create database first
just load-cdm-store
Collection not found
# Check available collections
just cdm-store-stats
Common collection names:
- Static entities: Location, Sample, Reads, Assembly, Genome
- System tables: SystemOntologyTerm, SystemProcess, SystemTypedef
Memory issues
# Use sampling for large tables
uv run python scripts/cdm_analysis/load_cdm_parquet_to_store.py \
data/enigma_coral.db \
--max-dynamic-rows 5000
Related Documentation
- CDM Parquet Store Guide - Comprehensive guide
- CDM Parquet Validation Guide - Data validation
- CDM Schema Implementation - Schema details
- linkml-store Documentation - linkml-store docs
Support
For issues or questions:
- Check validation reports: `just validate-cdm-full`
- Review CDM analysis: `just analyze-cdm`
- Examine schema: `src/linkml_coral/schema/cdm/linkml_coral_cdm.yaml`
- Open an issue: https://github.com/linkml/linkml-coral/issues