KBase CDM Parquet Database Analysis Report

Database Location: data/enigma_coral.db

Analysis Date: 2025-12-01

Executive Summary

The KBase CDM (Common Data Model) database contains 44 tables organized into three categories: - 6 system tables (sys_*): Metadata, type definitions, provenance, and ontology catalogs - 17 static data tables (sdt_*): Core scientific entities (samples, assemblies, genomes, etc.) - 21 dynamic data tables (ddt_*): Measurement arrays stored as "bricks" with flexible schemas

Total Data Volume: - 272,934 rows across static data tables - 82.6 million rows across dynamic data tables (bricks) - 142,958 process records tracking provenance - 10,594 ontology terms in catalog

1. System Tables Analysis

1.1 sys_typedef (118 rows)

Purpose: Maps CORAL static types to CDM table/column names with constraints

Key Columns: - type_name: CORAL entity type (e.g., "Gene", "Bin", "Assembly") - field_name: Original CORAL field name - cdm_column_name: Mapped CDM column name (snake_case with prefixes) - scalar_type: Data type (text, int, float, [text] for arrays) - pk, upk, fk: Primary key, unique key, foreign key flags - constraint: Validation patterns or ontology constraints - units_sys_oterm_id, type_sys_oterm_id: Ontology term references

Key Findings: - Defines schema for 18 entity types - Documents field transformations from CORAL to CDM - Includes validation constraints and ontology mappings - Foreign key relationships explicitly defined

1.2 sys_ddt_typedef (101 rows)

Purpose: Defines dynamic data table schemas (bricks and microtypes)

Key Columns: - ddt_ndarray_id: Brick identifier (e.g., "Brick0000010") - cdm_column_name: Column name in brick table - cdm_column_data_type: "variable", "dimension_variable", or "dimension_index" - scalar_type: Data type (float, object_ref, oterm_ref, text, int, bool) - dimension_number, variable_number: Position in N-dimensional array - dimension_oterm_id/name, variable_oterm_id/name: Semantic metadata - unit_sys_oterm_id/name: Measurement units

Key Findings: - 20 brick types defined with varying dimensionality - Supports complex multi-dimensional arrays (e.g., [209, 52, 3, 3]) - Dimension semantics (Environmental Sample, Molecule, State, Statistic) - Variable semantics (Concentration, Molecular Weight, etc.)

1.3 sys_oterm (10,594 rows)

Purpose: Central ontology term catalog

Key Columns: - sys_oterm_id: CURIE (e.g., "ME:0000129", "ENVO:00002041") - sys_oterm_name: Human-readable term name - sys_oterm_ontology: Source ontology - parent_sys_oterm_id: Hierarchical relationships - sys_oterm_definition: Term definitions - sys_oterm_synonyms, sys_oterm_links, sys_oterm_properties: Additional metadata

Key Findings: - Centralized ontology management - Primary ontology: context_measurement_ontology (custom ENIGMA ontology) - Supports ENVO, UO (units), PROCESS, CONTINENT, COUNTRY, MIxS, and other standard ontologies - Hierarchical structure with parent references

1.4 sys_process (142,958 rows)

Purpose: Provenance tracking for all data transformations

Key Columns: - sys_process_id: Unique process identifier - process_sys_oterm_id/name: Process type (e.g., "Assay Growth", "Sequencing") - person_sys_oterm_id/name: Person who performed the process - campaign_sys_oterm_id/name: Research campaign - sdt_protocol_name: Protocol used - date_start, date_end: Temporal metadata - input_objects, output_objects: Arrays of entity references (type:id format)

Key Findings: - Complete provenance lineage for all data - Links processes to protocols, people, and campaigns - Input/output tracking enables full dependency graph - Supports multiple inputs and outputs per process

1.5 sys_process_input (90,395 rows)

Purpose: Normalized process input relationships

Key Columns: - sys_process_id: Process reference - sdt_<entity>_id: Foreign keys to input entities (Assembly, Bin, Community, Genome, Location, Reads, Sample, Strain, TnSeq_Library)

Key Findings: - One row per (process, input) pair - Denormalized for query performance - Null columns for entities not involved in specific process

1.6 sys_process_output (38,228 rows)

Purpose: Normalized process output relationships

Key Columns: - sys_process_id: Process reference - Entity ID columns for outputs (including ddt_ndarray_id for brick outputs)

Key Findings: - Similar structure to sys_process_input - Includes dynamic data (bricks) as outputs - Enables forward and backward provenance queries

2. Static Data Tables (sdt_*)

2.1 Naming Conventions

Table Names: sdt_<snake_case_entity_name> - Original CORAL "OTU" → sdt_asv (preferred name) - "TnSeq Library" → sdt_tnseq_library

Column Names: - Primary keys: sdt_<entity>_id (e.g., sdt_sample_id) - Names: sdt_<entity>_name (e.g., sdt_sample_name) - Foreign keys: sdt_<referenced_entity>_name (uses name, not ID for FKs) - Ontology term splitting: <field>_sys_oterm_id + <field>_sys_oterm_name

2.2 Ontology Term Splitting Pattern

Original CORAL fields with ontology constraints are split into two columns:

Example from sdt_sample: - material → material_sys_oterm_id (CURIE) + material_sys_oterm_name (label) - env_package → env_package_sys_oterm_id + env_package_sys_oterm_name

Example from sdt_location: - continent → continent_sys_oterm_id + continent_sys_oterm_name - country → country_sys_oterm_id + country_sys_oterm_name - biome → biome_sys_oterm_id + biome_sys_oterm_name - feature → feature_sys_oterm_id + feature_sys_oterm_name

Example from sdt_reads: - read_type → read_type_sys_oterm_id + read_type_sys_oterm_name - sequencing_technology → sequencing_technology_sys_oterm_id + sequencing_technology_sys_oterm_name

Example from sdt_community: - community_type → community_type_sys_oterm_id + community_type_sys_oterm_name

Benefits: - Enables ontology validation via FK to sys_oterm - Preserves human-readable labels for queries - Supports ontology evolution without data migration

2.3 Table Inventory and Sizes

Table	Rows	Key Columns	Notes
sdt_asv	213,044	sdt_asv_id, sdt_asv_name	ASV (Amplicon Sequence Variant) sequences
sdt_reads	19,307	sdt_reads_id, sdt_reads_name, read_count, read_type_, sequencing_technology_, link	Sequencing reads with ontology terms
sdt_gene	15,015	sdt_gene_id, sdt_gene_name, sdt_genome_name, contig_number, strand, start, stop, function	Gene annotations
sdt_genome	6,688	sdt_genome_id, sdt_genome_name, sdt_strain_name, n_contigs, n_features, link	Assembled genomes
sdt_sample	4,330	sdt_sample_id, sdt_sample_name, sdt_location_name, depth, elevation, date, material_, env_package_	Environmental samples with spatial/temporal metadata
sdt_assembly	3,427	sdt_assembly_id, sdt_assembly_name, sdt_strain_name, n_contigs, link	Sequence assemblies
sdt_taxon	3,276	sdt_taxon_id, sdt_taxon_name, ncbi_taxid	Taxonomic classifications
sdt_strain	3,110	sdt_strain_id, sdt_strain_name, sdt_genome_name, derived_from_sdt_strain_name, sdt_gene_names_changed	Microbial strains
sdt_community	2,209	sdt_community_id, sdt_community_name, community_type_*, sdt_sample_name, parent_sdt_community_name, defined_sdt_strain_names	Microbial communities
sdt_condition	1,046	sdt_condition_id, sdt_condition_name	Growth conditions
sdt_location	594	sdt_location_id, sdt_location_name, latitude, longitude, continent_, country_, region, biome_, feature_	Sampling locations
sdt_bin	623	sdt_bin_id, sdt_bin_name, sdt_assembly_name, contigs	Genome bins from metagenomes
sdt_image	218	sdt_image_id, sdt_image_name, mime_type, size, dimensions, link	Microscopy and other images
sdt_protocol	42	sdt_protocol_id, sdt_protocol_name, description, link	Experimental protocols
sdt_dubseq_library	3	sdt_dubseq_library_id, sdt_dubseq_library_name, sdt_genome_name, n_fragments	DubSeq libraries
sdt_tnseq_library	1	sdt_tnseq_library_id, sdt_tnseq_library_name, sdt_genome_name, primers_model, n_mapped_reads, etc.	TnSeq library
sdt_enigma	1	sdt_enigma_id	Root entity (database singleton)

2.4 Foreign Key Relationships

Key Patterns: - FKs use _name suffix, not _id (e.g., sdt_location_name in Sample references Location.name) - Self-referential relationships: Community.parent_sdt_community_name, Strain.derived_from_sdt_strain_name - Many-to-many via arrays: Community.defined_sdt_strain_names (array of strain names)

Relationship Graph:

Location
  └─ Sample
      ├─ Community
      │   └─ (parent_community, defined_strains)
      └─ Reads
          └─ Assembly
              ├─ Bin
              │   └─ Genome
              │       ├─ Gene
              │       ├─ DubSeq_Library
              │       └─ TnSeq_Library
              └─ Genome

Strain
  ├─ Assembly
  ├─ Genome
  └─ (derived_from relationship)

Image (standalone, linked via sys_process)
Protocol (referenced by sys_process)
Condition (referenced by Community)
Taxon (standalone)
ASV (standalone, linked via bricks)

2.5 Schema Differences from Original CORAL

New Fields: - link: External URLs/references (Assembly, Genome, Reads, Image, Protocol) - Ontology term splitting (see section 2.2) - sdt_ prefix on all ID and name columns

Renamed Fields: - id → sdt_<entity>_id - name → sdt_<entity>_name - Foreign key references use sdt_<entity>_name convention

Split Fields: - All ontology-constrained fields split into _sys_oterm_id + _sys_oterm_name pairs

Normalized Structures: - Process inputs/outputs denormalized into separate tables for query performance

3. Dynamic Data Tables (ddt_*)

3.1 ddt_ndarray (Brick Index) - 20 rows

Purpose: Index of all measurement bricks with metadata

Key Columns: - ddt_ndarray_id: Brick identifier (e.g., "Brick0000010") - ddt_ndarray_name: Descriptive name (e.g., "adams_metals_100ws.ndarray") - ddt_ndarray_description: Full description - ddt_ndarray_type_sys_oterm_id/name: Brick type (e.g., "Chemical Measurement") - ddt_ndarray_shape: Array dimensions (e.g., [209, 52, 3, 3]) - ddt_ndarray_dimension_types_sys_oterm_id/name: Dimension semantics (arrays) - ddt_ndarray_dimension_variable_types_sys_oterm_id/name: Variables per dimension - ddt_ndarray_variable_types_sys_oterm_id/name: Value variables - withdrawn_date, superceded_by_ddt_ndarray_id: Versioning metadata

Example Brick:

Brick0000010: adams_metals_100ws.ndarray
  Description: Adams Lab Metals Measurements for 100 Well Survey
  Type: Chemical Measurement (DA:0000005)
  Shape: [209, 52, 3, 3]
  Dimensions:
    1. Environmental Sample (DA:0000042) - variables: Environmental Sample ID
    2. Molecule (ME:0000027) - variables: Molecule from list, Molecular Weight, Algorithm Parameter, Detection Limit
    3. State (ME:0000037) - (no variables listed)
    4. Replicate Series (ME:0000???) - variables: Count Unit
  Values: Concentration (ME:0000129) in micromolar (UO:0000064)

3.2 Brick Tables (20 tables)

Table Naming: ddt_brick<7-digit-id> (e.g., ddt_brick0000010)

Column Structure: - Dimension variables: Reference columns (e.g., sdt_sample_name, molecule_from_list_sys_oterm_id) - Dimension metadata: Properties (e.g., molecule_molecular_weight_dalton, state) - Value variables: Measurements (e.g., concentration_micromolar)

Example Brick Schemas:

ddt_brick0000010 (52,884 rows):

- sdt_sample_name (dimension 1: Environmental Sample)
- molecule_from_list_sys_oterm_id (dimension 2: Molecule)
- molecule_from_list_sys_oterm_name
- molecule_molecular_weight_dalton
- molecule_algorithm_parameter
- molecule_detection_limit_micromolar
- state (dimension 3: State)
- replicate_series_count_unit (dimension 4)
- concentration_micromolar (value variable)

ddt_brick0000452 (113,741 rows):

- sdt_asv_name
- sequence_sequence_type_16s_sequence

ddt_brick0000080 (98,176 rows):

- sdt_sample_name
- molecule_from_list_sys_oterm_id
- molecule_from_list_sys_oterm_name
- molecule_molecular_weight_dalton
- molecule_presence_molecule_from_list_helium_0
- concentration_statistic_average_parts_per_billion
- concentration_statistic_standard_deviation_parts_per_billion
- detection_limit_parts_per_billion

Key Findings: - Flexible schema supports different measurement types - Ontology terms embedded for molecules and other categorical dimensions - Statistical aggregates (mean, std dev) stored in same row - Detection limits and quality metadata included - Large data volume: 82+ million rows across 20 bricks

4. Key Schema Patterns

4.1 Identifier Conventions

System IDs: sys_<entity>_id (e.g., sys_process_id, sys_oterm_id)
Static Data IDs: sdt_<entity>_id (e.g., sdt_sample_id)
Dynamic Data IDs: ddt_ndarray_id (for bricks)
All IDs: Zero-padded sequential (e.g., "Assembly0000001", "Process0122921")

4.2 Ontology Integration

Three Types of Ontology References: 1. Direct term usage: field_sys_oterm_id + field_sys_oterm_name pairs 2. Metadata annotations: units_sys_oterm_id, type_sys_oterm_id in sys_typedef 3. Semantic dimensions: Dimension and variable ontology terms in sys_ddt_typedef

Ontology Namespaces: - ME:* - ENIGMA Measurement/Context Ontology (custom) - ENVO:* - Environment Ontology - UO:* - Units of Measurement Ontology - DA:* - Data Array Ontology (custom) - PROCESS:* - Process Types (custom) - ENIGMA:* - People and Campaigns (custom) - CONTINENT:*, COUNTRY:* - Geography (custom) - MIxS:* - Minimum Information Standards

4.3 Provenance Model

Hierarchical Tracking:

sys_process (workflow step)
  ├─ inputs (sys_process_input)
  │   ├─ Static entities (sdt_*)
  │   └─ Dynamic arrays (ddt_*)
  ├─ outputs (sys_process_output)
  │   ├─ Static entities (sdt_*)
  │   └─ Dynamic arrays (ddt_*)
  └─ metadata
      ├─ Process type (ontology)
      ├─ Person (ontology)
      ├─ Campaign (ontology)
      ├─ Protocol (reference)
      └─ Dates

Query Capabilities: - Forward lineage: "What was derived from this sample?" - Backward lineage: "What inputs created this assembly?" - Process-centric: "Show all sequencing processes by person X" - Campaign tracking: "All data from Metal Metabolism campaign"

4.4 Delta Lake Format

All tables stored as Delta Lake: - Directory per table with _delta_log/ subdirectory - Parquet files with UUID naming: part-00000-<uuid>-c000.snappy.parquet - ACID transactions and time travel support - Schema evolution capability

5. Data Volume Summary

5.1 Row Counts by Category

Category	Table Count	Total Rows
System Tables	6	242,176
Static Data	17	272,934
Dynamic Data (Bricks)	20	82,627,111
Total	43	83,142,221

5.2 Top 10 Tables by Size

ddt_brick0000452 - 113,741 rows (ASV sequences)
ddt_brick0000080 - 98,176 rows (molecule concentrations)
ddt_brick0000010 - 52,884 rows (metals measurements)
sdt_asv - 213,044 rows
sys_process - 142,958 rows
sys_process_input - 90,395 rows
sys_process_output - 38,228 rows
sdt_reads - 19,307 rows
sdt_gene - 15,015 rows
sys_oterm - 10,594 rows

6. Comparison to Original CORAL Schema

6.1 Structural Changes

Table-Level Changes: - Added sys_* prefix to all system tables - Added sdt_* prefix to all static data tables - Added ddt_* prefix to all dynamic data tables - Renamed "OTU" → "ASV" (Amplicon Sequence Variant)

Column-Level Changes: - All ID columns: id → sdt_<entity>_id - All name columns: name → sdt_<entity>_name - All FK columns: <entity> → sdt_<entity>_name - Ontology fields: Split into _sys_oterm_id + _sys_oterm_name pairs

6.2 New Capabilities

Enhanced Provenance: - Denormalized sys_process_input and sys_process_output tables for query performance - Process linking to protocols, people, and campaigns

Ontology Management: - Centralized sys_oterm catalog - Foreign key constraints ensure ontology term validity - Embedded ontology names eliminate joins for display

Measurement Versioning: - withdrawn_date and superceded_by_ddt_ndarray_id in brick index - Supports data quality improvements without deletion

External Linking: - link fields for URLs to sequence files, images, protocols

6.3 Preserved Features

Core entity model (Sample, Location, Reads, Assembly, Genome, Gene, etc.)
Multi-dimensional array structure (bricks)
Process-based provenance
Flexible ontology annotations

7. Recommendations for LinkML Schema

7.1 Schema Organization

Proposed Module Structure:

linkml_coral_kbase/
  ├── core/
  │   ├── ontology.yaml          # sys_oterm, ontology term references
  │   ├── provenance.yaml        # sys_process, sys_process_input/output
  │   └── types.yaml             # sys_typedef, sys_ddt_typedef
  ├── static_entities/
  │   ├── location_sample.yaml   # Location, Sample
  │   ├── genomics.yaml          # Assembly, Genome, Gene, Bin
  │   ├── microbiology.yaml      # Strain, Community, Taxon
  │   ├── sequencing.yaml        # Reads, ASV
  │   ├── libraries.yaml         # TnSeq_Library, DubSeq_Library
  │   └── metadata.yaml          # Image, Protocol, Condition
  └── dynamic_arrays/
      └── bricks.yaml            # ddt_ndarray, brick metadata

7.2 Key Patterns to Implement

1. Ontology Term Mixin:

mixins:
  OntologyTermPair:
    attributes:
      sys_oterm_id:
        range: OntologyTerm
        required: true
      sys_oterm_name:
        range: string
        required: true

2. Entity Naming Convention:

classes:
  Sample:
    attributes:
      sdt_sample_id:
        identifier: true
        pattern: "^Sample\\d{7}$"
      sdt_sample_name:
        required: true
        unique_key: true

3. Foreign Key Conventions:

attributes:
  sdt_location_name:
    range: Location
    inlined: false  # Reference by name, not inline

4. Process Tracking:

classes:
  Process:
    attributes:
      sys_process_id:
        identifier: true
      input_objects:
        multivalued: true
        range: string  # Format: "EntityType:EntityID"
      output_objects:
        multivalued: true
        range: string

7.3 Validation Rules

Required Constraints: - ID format patterns (e.g., ^Assembly\\d{7}$) - Latitude range: [-90, 90] - Longitude range: [-180, 180] - Ontology term format: ^[A-Z_]+:\\d+$ - Date format: ISO 8601

Recommended Enhancements: - FK validation against sys_oterm - Process input/output entity type validation - Brick dimension/shape validation - Unit compatibility checks

7.4 Documentation Needs

Critical Documentation: - Ontology term splitting pattern explanation - FK naming convention (uses _name, not _id) - Process provenance model - Brick structure and query patterns - Delta Lake storage format

Migration Guide: - Original CORAL → KBase CDM mapping - Field renaming dictionary - Ontology term migration paths

8. Next Steps

8.1 Immediate Actions

Create LinkML schema modules following proposed structure
Document ontology term catalog with term counts and coverage
Map sys_typedef records to LinkML class definitions
Define process provenance classes with input/output tracking
Create brick schema templates from sys_ddt_typedef

8.2 Validation Tasks

Cross-validate FKs between tables
Check ontology term usage against sys_oterm catalog
Verify process lineage completeness
Test brick dimension consistency
Validate ID format patterns

8.3 Query Examples to Develop

Sample → Reads → Assembly → Genome lineage
Provenance chain for a specific measurement brick
All data from a specific geographic location
Strain derivation tree
Process attribution (who did what when)

Appendix A: Complete Table Listing

System Tables

sys_typedef (118 rows)
sys_ddt_typedef (101 rows)
sys_oterm (10,594 rows)
sys_process (142,958 rows)
sys_process_input (90,395 rows)
sys_process_output (38,228 rows)

Static Data Tables

sdt_assembly (3,427 rows)
sdt_asv (213,044 rows)
sdt_bin (623 rows)
sdt_community (2,209 rows)
sdt_condition (1,046 rows)
sdt_dubseq_library (3 rows)
sdt_enigma (1 row)
sdt_gene (15,015 rows)
sdt_genome (6,688 rows)
sdt_image (218 rows)
sdt_location (594 rows)
sdt_protocol (42 rows)
sdt_reads (19,307 rows)
sdt_sample (4,330 rows)
sdt_strain (3,110 rows)
sdt_taxon (3,276 rows)
sdt_tnseq_library (1 row)

Dynamic Data Tables

ddt_ndarray (20 rows)
ddt_brick0000010, 0000072, 0000073, 0000080
ddt_brick0000452, 0000454, 0000457, 0000458, 0000459
ddt_brick0000460, 0000461, 0000462
ddt_brick0000476, 0000477, 0000478, 0000479
ddt_brick0000495, 0000501, 0000507, 0000508

Appendix B: Sample Queries

Query 1: Find all samples from a location

SELECT s.*
FROM sdt_sample s
WHERE s.sdt_location_name = 'CPT1'

Query 2: Trace assembly provenance

SELECT
  p.sys_process_id,
  p.process_sys_oterm_name,
  pi.sdt_reads_id AS input_reads,
  po.sdt_assembly_id AS output_assembly
FROM sys_process p
JOIN sys_process_input pi ON p.sys_process_id = pi.sys_process_id
JOIN sys_process_output po ON p.sys_process_id = po.sys_process_id
WHERE po.sdt_assembly_id = 'Assembly0000001'

Query 3: Find measurements for a sample

SELECT
  b.*
FROM ddt_brick0000010 b
WHERE b.sdt_sample_name = 'EU02-D01'

End of Report