scalable-lore-expert

Scalable Lore Expert System

CI · Coverage · License: MIT · ORGAN-I: Theoria · Status: DESIGN_ONLY · Python · Neo4j · Kafka · Docker · Kubernetes

An ontological knowledge management system that applies Dante Alighieri’s epistemological framework from the Divine Comedy to the problem of modeling, querying, and exploring complex fictional universes at scale.

This repository is part of ORGAN-I (Theoria) — the epistemological and theoretical foundations layer of the eight-organ system. It treats narrative lore not as a flat database problem but as an ontological one: the structure of knowledge about a fictional universe mirrors the structure of knowledge itself, and the medieval epistemological architecture Dante inherited from Aquinas, Aristotle, and the Neoplatonists provides a surprisingly effective framework for organizing that knowledge into layers of increasing abstraction. The system ingests raw narrative data (character entries, episode summaries, timeline events), transforms it through entity resolution and relationship extraction, and surfaces it as a semantic graph supporting natural-language queries, spoiler-aware progressive disclosure, and cross-narrative thematic analysis.


Table of Contents

  1. Epistemological Foundation
  2. Solution Overview
  3. Technical Architecture
  4. Installation and Setup
  5. Usage and API
  6. Working Examples
  7. Research Foundation
  8. Testing and Quality
  9. Cross-References
  10. Contributing
  11. License and Author

1. Epistemological Foundation

The system is organized around a three-layer ontological model derived from the tripartite structure of Dante’s Commedia. This is not an arbitrary metaphor. Dante’s poem encodes a complete medieval epistemology — a theory of how knowledge is structured, layered, and progressively refined as the knower ascends from particulars to universals. That epistemological architecture maps directly onto the problem of narrative knowledge management.

Inferno: Descent into Detail

The Inferno is a catalogue of particulars. Dante descends through nine concentric circles, each more specific and more granular than the last, encountering individual sinners with individual stories in individual circumstances. There is no abstraction here — only data points.

In the Scalable Lore Expert, the Inferno layer is the granular data layer. It stores individual narrative facts at the highest resolution available: character names, birthdates, and physical descriptions; specific events with episode numbers and timestamps; location coordinates within a fictional geography; exact dialogue quotations with source citations. This layer makes no interpretive claims. It records what happened, to whom, where, and when — and nothing more.

The Inferno layer is implemented as the ingestion and raw storage tier. Structured data (wiki dumps, episode guides, fan databases) and unstructured data (novel text, screenplay dialogue, fan wikis) enter through dedicated connectors and land in a staging area before any transformation occurs. Every fact carries a provenance record: where it came from, when it was ingested, what confidence level it carries, and what verification chain connects it to a primary source.
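A fact at this layer might be represented as a value plus a provenance envelope. The sketch below is illustrative only; the `RawFact` and `ProvenanceRecord` types and their field names are assumptions, not the project's actual schema:

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class ProvenanceRecord:
    """Provenance envelope attached to every ingested fact."""
    source: str                  # e.g. "mediawiki:example-wiki" (illustrative)
    source_tier: str             # "primary" | "secondary" | "tertiary"
    ingested_at: datetime
    confidence: float            # 0.0-1.0, assigned at ingest time
    verification_chain: list = field(default_factory=list)

@dataclass
class RawFact:
    subject: str
    predicate: str
    value: str
    provenance: ProvenanceRecord

fact = RawFact(
    subject="Roronoa Zoro",
    predicate="first_appearance",
    value="chapter 3",
    provenance=ProvenanceRecord(
        source="mediawiki:example-wiki",
        source_tier="secondary",
        ingested_at=datetime.now(timezone.utc),
        confidence=0.8,
        verification_chain=["wiki revision 1234", "manga chapter 3"],
    ),
)
```

The point is that the fact itself makes no interpretive claim; everything interpretive (tier, confidence) lives in the envelope and can be revised without touching the recorded value.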

Purgatorio: Transformation

The Purgatorio is the realm of process. Souls do not merely exist there — they are actively transformed. The seven terraces of Mount Purgatory correspond to stages of purification, each one refining the soul further, resolving contradictions, shedding accretions, preparing the essential form for what comes next.

In the system, the Purgatorio layer is the ETL and transformation layer. Raw facts from the Inferno undergo entity resolution (is “Monkey D. Luffy” the same entity as “Straw Hat Luffy” and “The Fifth Emperor”?), relationship extraction (what connects this character to that event?), temporal ordering (does this event precede or follow that one?), conflict resolution (two sources disagree about a character’s age — which is canonical?), and spoiler classification (at what narrative checkpoint does this fact become safe to reveal?).

This layer is where the ontological work happens. Raw data becomes structured knowledge. Isolated facts become nodes in a graph. The Purgatorio is computationally the most intensive layer and conceptually the most important — it is where the system’s understanding of the narrative universe actually forms.
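The entity-resolution step can be illustrated with a minimal alias table. Both the table and the `resolve_entity` helper are hypothetical; a real resolver would learn aliases from the corpus rather than hard-code them:

```python
# Hypothetical alias table mapping surface mentions to canonical entities.
ALIASES = {
    "Monkey D. Luffy": "Monkey D. Luffy",
    "Straw Hat Luffy": "Monkey D. Luffy",
    "The Fifth Emperor": "Monkey D. Luffy",
}

def resolve_entity(mention: str, aliases: dict) -> str:
    """Map a mention to its canonical entity; fall back to a
    case-insensitive match, then to the mention itself."""
    if mention in aliases:
        return aliases[mention]
    lowered = {k.lower(): v for k, v in aliases.items()}
    return lowered.get(mention.lower(), mention)

canonical = resolve_entity("straw hat luffy", ALIASES)  # "Monkey D. Luffy"
```

An unknown mention passes through unchanged, which lets the pipeline flag it for human review instead of silently dropping it.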

Paradiso: Ascent to Understanding

The Paradiso is the realm of universals. Dante ascends through the celestial spheres and encounters not individuals but principles — justice, temperance, wisdom, love. The particular gives way to the general. The soul that has been purified in Purgatory can now perceive patterns invisible from below.

In the system, the Paradiso layer is the semantic and analytical layer. It operates on the structured graph produced by the Purgatorio and enables capabilities that require understanding, not just retrieval: thematic analysis (what narrative themes connect the Impel Down arc to the Marineford arc?), pattern recognition (which character arcs across the entire corpus follow a death-and-rebirth pattern?), cross-narrative comparison (how does the mentor-student relationship in Final Fantasy VII compare to the same archetype in One Piece?), and generative insight (what narrative tensions remain unresolved as of chapter N?).

The Paradiso layer is where vector embeddings, semantic search, and LLM-assisted reasoning operate. Queries at this level are not lookups — they are acts of interpretation. The system does not merely retrieve facts; it synthesizes understanding from the structured knowledge below.

Recursive Ontology

The three layers are not a pipeline that data flows through once. They form a recursive loop. Insights generated at the Paradiso level can trigger re-evaluation at the Purgatorio level (a newly discovered thematic parallel causes the system to re-examine entity relationships) or even new ingestion at the Inferno level (a pattern detected across arcs prompts the system to seek additional source material). This recursive self-refinement is central to Dante’s epistemology — understanding is not a destination but an ongoing ascent — and it is central to how the Scalable Lore Expert evolves its knowledge over time.
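The loop can be sketched as three stage functions with a fixed-point check. Everything here (the stage signatures, the request mechanism, the toy stages) is an assumed interface for illustration, not the system's real API:

```python
def recursive_refinement(ingest, transform, analyze, max_rounds=3):
    """Inferno -> Purgatorio -> Paradiso, looped: analytical insights can
    request new ingestion, which re-triggers transformation."""
    facts = ingest(None)                      # initial Inferno pass
    insights = []
    for _ in range(max_rounds):
        graph = transform(facts)              # Purgatorio
        insights, requests = analyze(graph)   # Paradiso
        if not requests:                      # fixed point: nothing to re-examine
            break
        facts = facts + ingest(requests)      # loop back to Inferno
    return insights

# Toy stages demonstrating a single feedback round:
def ingest(requests):
    return ["fact-a"] if requests is None else ["fact-b"]

def transform(facts):
    return set(facts)

def analyze(graph):
    return sorted(graph), ([] if "fact-b" in graph else ["need fact-b"])

result = recursive_refinement(ingest, transform, analyze)  # ["fact-a", "fact-b"]
```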


2. Solution Overview

The Scalable Lore Expert addresses a specific problem: fictional universes of sufficient complexity (One Piece’s 1,100+ chapters, Final Fantasy’s 16 mainline entries plus spinoffs, the Marvel Cinematic Universe’s 30+ films) generate knowledge corpora that exceed any individual’s ability to hold in memory. Existing solutions — fan wikis, episode guides, Reddit threads — are flat, unsearchable by meaning, and indifferent to spoiler boundaries.

Three-Layer Capabilities

Layer 1 (Inferno) — Storage and Provenance: raw-fact ingestion through dedicated connectors, a staging area that precedes any transformation, and a per-fact provenance record (source, ingest time, confidence level, verification chain back to a primary source).

Layer 2 (Purgatorio) — Transformation and Resolution: entity resolution across aliases, relationship extraction, temporal ordering, conflict resolution between disagreeing sources, and spoiler classification at narrative checkpoints.

Layer 3 (Paradiso) — Semantic Understanding: thematic analysis, pattern recognition across the corpus, cross-narrative comparison, and LLM-assisted synthesis over the structured graph.

Multiverse Logic

The system supports multiple canonical universes simultaneously. Each universe (Final Fantasy VII, One Piece, Elden Ring) is a self-contained graph with its own entity namespace, timeline, and spoiler boundaries. Universes can be linked through explicit cross-reference edges (is_thematic_parallel_to, is_adaptation_of, shares_archetype_with) without collapsing their internal consistency. Canon hierarchy within a single universe distinguishes primary sources (the manga itself) from secondary sources (databooks, interviews) from tertiary sources (fan translations, wiki summaries), and query results respect this hierarchy by default.
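Default ranking by canon tier can be sketched as a sort key. The tier names follow the hierarchy above; the result shape is a hypothetical simplification of a query response:

```python
CANON_RANK = {"primary": 0, "secondary": 1, "tertiary": 2}

def order_by_canon(results):
    """Primary-source assertions first; ties broken by descending confidence."""
    return sorted(results, key=lambda r: (CANON_RANK[r["tier"]], -r["confidence"]))

hits = [
    {"fact": "stated in a fan wiki", "tier": "tertiary", "confidence": 0.6},
    {"fact": "stated in the manga", "tier": "primary", "confidence": 0.95},
    {"fact": "stated in a databook", "tier": "secondary", "confidence": 0.9},
]
ranked = order_by_canon(hits)
# ranked[0] is the primary-source (manga) assertion
```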


3. Technical Architecture

Microservices Overview

┌─────────────────────────────────────────────────────────────────────────┐
│                               API Gateway                               │
│               (FastAPI / auth / rate limiting / routing)                │
├──────────┬──────────┬──────────┬──────────┬──────────┬──────────────────┤
│          │          │          │          │          │                  │
│  Data    │  Graph   │ Semantic │  Query   │ Spoiler  │  Citation        │
│ Ingest   │ Database │  Search  │ Orchestr │ Filter   │  Tracker         │
│ Service  │ Service  │ Service  │ Service  │ Service  │  Service         │
│          │          │          │          │          │                  │
│ Airflow  │ Neo4j /  │ Pinecone │ Ranking  │ Episode/ │  Provenance      │
│ + Kafka  │ ArangoDB │ Weaviate │ + Merge  │ Volume   │  + Confidence    │
│          │          │ + HNSW   │          │ Markers  │  Scoring         │
└──────┬───┴────┬─────┴────┬─────┴────┬─────┴────┬─────┴──────┬───────────┘
       │        │          │          │          │            │
       ▼        ▼          ▼          ▼          ▼            ▼
┌─────────────────────────────────────────────────────────────────────────┐
│                          Shared Infrastructure                          │
│   Docker / Kubernetes / Prometheus / Grafana / Redis (cache)            │
└─────────────────────────────────────────────────────────────────────────┘

API Gateway: FastAPI application handling authentication (JWT), rate limiting (token bucket), request routing, and response aggregation. All client interactions enter through this single endpoint.

Data Ingestion Service: Apache Airflow DAGs for batch ingestion (wiki dumps, structured databases) and Apache Kafka consumers for streaming ingestion (real-time updates from monitored sources). Connectors exist for MediaWiki XML exports, structured JSON/CSV, and raw text with configurable chunking strategies.

Graph Database Service: Neo4j (primary) or ArangoDB (alternative) storing the entity-relationship graph. Node types: Character, Event, Location, Group, Artifact, Theme, Arc, Universe. Edge types: participated_in, member_of, located_at, possesses, spoils_in, caused, is_thematic_parallel_to, preceded_by, succeeded_by.

Semantic Search Service: Vector embeddings generated via Sentence Transformers (all-MiniLM-L6-v2 for speed, all-mpnet-base-v2 for quality), BERT, or OpenAI ada-002. Stored in Pinecone or Weaviate with HNSW indexing for approximate nearest-neighbor retrieval. Hybrid scoring combines semantic similarity with graph-distance relevance.

Query Orchestration Service: Receives parsed natural-language queries, decomposes them into sub-queries (entity lookup + relationship traversal + semantic search), dispatches to appropriate services, merges results, applies ranking, and returns unified responses.

Spoiler Filtering Service: Maintains per-universe spoiler boundaries at episode/volume/chapter granularity. Every query carries a spoiler context (e.g., “I have read One Piece through chapter 800”) and every result is filtered against that boundary before delivery. Supports progressive disclosure — results are not binary (show/hide) but can be returned with redacted fields and a “contains spoilers beyond chapter X” annotation.
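A minimal sketch of field-level redaction, assuming each field carries the chapter at which it becomes known. The tuple representation and chapter values are illustrative, not the service's wire format:

```python
def filter_result(entity: dict, boundary_chapter: int) -> dict:
    """Progressive disclosure: redact fields revealed beyond the user's
    boundary instead of hiding the whole result, and say how much is held back."""
    out, hidden = {}, 0
    for key, (value, revealed_in) in entity.items():
        if revealed_in <= boundary_chapter:
            out[key] = value
        else:
            out[key] = f"[REDACTED — spoiler beyond chapter {boundary_chapter}]"
            hidden += 1
    if hidden:
        out["spoiler_note"] = f"{hidden} field(s) lie beyond your current boundary."
    return out

# Each field is (value, chapter_where_revealed); chapter numbers illustrative.
entity = {
    "name": ("Trafalgar Law", 52),
    "full_name": ("Trafalgar D. Water Law", 763),
}
visible = filter_result(entity, 800)   # both fields visible
partial = filter_result(entity, 500)   # full_name redacted, note attached
```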

Citation Tracker Service: Maintains provenance chains from every assertion back to its source. Each fact carries a confidence score (0.0-1.0) based on source tier, corroboration count, and recency. Query results include citation metadata so users can verify claims.
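One way such a score could combine the three signals. The base values, weights, and decay curve below are assumptions chosen for illustration, not the service's actual model:

```python
TIER_BASE = {"primary": 0.9, "secondary": 0.7, "tertiary": 0.5}

def confidence_score(tier: str, corroborations: int, age_days: float,
                     half_life_days: float = 365.0) -> float:
    """Blend a source-tier base (raised by corroboration) with a recency term."""
    base = TIER_BASE[tier]
    corroborated = 1 - (1 - base) * 0.5 ** corroborations  # each corroboration halves residual doubt
    recency = 0.5 ** (age_days / half_life_days)            # exponential staleness decay
    return round(min(1.0, 0.8 * corroborated + 0.2 * recency), 3)
```

Under this model a fresh, twice-corroborated primary-source fact scores near 1.0, while an old, uncorroborated tertiary-source fact scores below 0.5.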

Graph Schema

(:Character {name, aliases[], universe, first_appearance, status})
    -[:PARTICIPATED_IN {role, chapter_range}]->
(:Event {name, universe, arc, chapter, timestamp_narrative})
    -[:LOCATED_AT]->
(:Location {name, universe, region, coordinates_narrative})

(:Character)-[:MEMBER_OF {joined, left, role}]->(:Group)
(:Character)-[:POSSESSES {acquired, lost}]->(:Artifact)
(:Event)-[:SPOILS_IN {chapter, volume, episode}]->(:SpoilerBoundary)
(:Arc)-[:CONTAINS]->(:Event)
(:Theme)-[:MANIFESTS_IN {strength: float}]->(:Arc)
(:Character)-[:IS_THEMATIC_PARALLEL_TO {dimension}]->(:Character)

Semantic Search Pipeline

  1. Query embedding: Natural-language query is encoded into a dense vector using the same model that embedded the corpus
  2. Approximate nearest neighbor: HNSW index returns top-K candidate nodes by cosine similarity
  3. Graph expansion: For each candidate, traverse 1-2 hops in the graph to gather context (related events, connected characters, containing arcs)
  4. Hybrid scoring: Final score = (α × semantic_similarity) + (β × graph_relevance) + (γ × source_confidence) where α, β, γ are tunable weights
  5. Spoiler filtering: Remove or redact results that exceed the user’s declared spoiler boundary
  6. Citation attachment: Each result is annotated with provenance metadata and confidence score
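Step 4 of the pipeline above can be sketched in a few lines. The graph-relevance decay (1 / (1 + hops)) and the default weights are assumptions; only the weighted-sum form comes from the formula above:

```python
import math

def cosine(u, v):
    """Cosine similarity between two dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def hybrid_score(query_vec, candidate, alpha=0.6, beta=0.3, gamma=0.1):
    """score = alpha*semantic_similarity + beta*graph_relevance + gamma*source_confidence"""
    semantic = cosine(query_vec, candidate["embedding"])
    graph = 1.0 / (1 + candidate["graph_distance"])  # assumed decay with hop count
    return alpha * semantic + beta * graph + gamma * candidate["confidence"]

query = [1.0, 0.0]
close = {"embedding": [1.0, 0.0], "graph_distance": 0, "confidence": 0.9}
far   = {"embedding": [0.0, 1.0], "graph_distance": 2, "confidence": 0.5}
```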

4. Installation and Setup

Prerequisites

  - Python 3.10 or newer, with venv and pip
  - Docker and Docker Compose (for Neo4j, Redis, and the vector store)
  - Git

Quick Start

# Clone the repository
git clone https://github.com/organvm-i-theoria/scalable-lore-expert.git
cd scalable-lore-expert

# Create virtual environment
python -m venv .venv
source .venv/bin/activate

# Install dependencies
pip install -e ".[dev]"

# Start infrastructure (Neo4j, Redis, vector store)
docker compose up -d

# Run database migrations
python -m lore_expert.migrations.setup

# Start the API server
uvicorn lore_expert.gateway.app:app --reload --port 8000

Environment Configuration

# .env (copy from .env.example)
NEO4J_URI=bolt://localhost:7687
NEO4J_USER=neo4j
NEO4J_PASSWORD=your-password
VECTOR_STORE=weaviate          # or "pinecone"
WEAVIATE_URL=http://localhost:8080
EMBEDDING_MODEL=all-MiniLM-L6-v2
KAFKA_BOOTSTRAP=localhost:9092
REDIS_URL=redis://localhost:6379
JWT_SECRET=your-secret-key

Docker Compose (Full Stack)

# Start all services including the API
docker compose --profile full up -d

# Verify health
curl http://localhost:8000/health

5. Usage and API

Natural-Language Queries

# Simple entity lookup
curl -X POST http://localhost:8000/api/v1/query \
  -H "Content-Type: application/json" \
  -d '{
    "query": "Who is Roronoa Zoro?",
    "universe": "one-piece",
    "spoiler_boundary": {"chapter": 800}
  }'

# Temporal range query
curl -X POST http://localhost:8000/api/v1/query \
  -H "Content-Type: application/json" \
  -d '{
    "query": "What happened to Luffy between Marineford and Fishman Island?",
    "universe": "one-piece",
    "spoiler_boundary": {"chapter": 653}
  }'

# Thematic analysis
curl -X POST http://localhost:8000/api/v1/query \
  -H "Content-Type: application/json" \
  -d '{
    "query": "What themes connect the Impel Down and Marineford arcs?",
    "universe": "one-piece",
    "spoiler_boundary": {"chapter": 600},
    "mode": "thematic"
  }'

Spoiler-Aware Retrieval

Every query accepts a spoiler_boundary object that defines the user’s current progress. The system guarantees that no result will reveal information beyond that boundary:

{
  "spoiler_boundary": {
    "chapter": 800,
    "include_databooks": true,
    "include_sbs": false
  }
}

Results that touch spoiler-sensitive material return with redacted fields:

{
  "character": "Trafalgar D. Water Law",
  "status": "[REDACTED — spoiler beyond chapter 800]",
  "known_aliases": ["Surgeon of Death", "Trafalgar Law"],
  "spoiler_note": "This entity has 3 additional facts beyond your current boundary."
}

Relationship Exploration

# Find connections between two characters
curl -X POST http://localhost:8000/api/v1/relationships \
  -d '{
    "source": "Monkey D. Luffy",
    "target": "Shanks",
    "universe": "one-piece",
    "max_depth": 3,
    "spoiler_boundary": {"chapter": 1000}
  }'

Cross-Narrative Comparison

# Compare archetypes across universes
curl -X POST http://localhost:8000/api/v1/compare \
  -d '{
    "query": "How does the mentor-student relationship differ between Cloud/Zack and Luffy/Shanks?",
    "universes": ["final-fantasy-vii", "one-piece"],
    "mode": "comparative"
  }'

6. Working Examples

Example 1: Character Timeline Reconstruction

Query: “Trace Nico Robin’s journey from Ohara to joining the Straw Hats.”

The system traverses the graph starting from the Character node for Nico Robin, follows PARTICIPATED_IN edges filtered to events occurring between the Ohara Incident and the Enies Lobby arc, orders results by narrative timestamp, and returns a chronological sequence:

1. Ohara Incident (chapters 391-398) — Sole survivor, age 8
2. 20 years of flight — Bounty established, alliance with Crocodile
3. Alabasta Arc (chapters 155-217) — Encounter with Straw Hats
4. Joins crew (chapter 218) — Self-invitation after Alabasta
5. Water 7 / Enies Lobby (chapters 322-430) — Departure and rescue
6. "I want to live!" (chapter 398) — Full commitment to crew

Each entry carries citations back to specific chapters and confidence scores based on source tier.
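The traversal behind this example can be sketched over a toy adjacency structure. The dictionary shape and the narrative-order timestamps are illustrative stand-ins for the Neo4j query:

```python
def reconstruct_timeline(graph, character, start, end):
    """Follow PARTICIPATED_IN edges from a character, keep events whose
    narrative timestamp falls inside [start, end], return them in order."""
    event_ids = graph["participated_in"].get(character, [])
    events = [graph["events"][e] for e in event_ids]
    in_window = [ev for ev in events if start <= ev["t"] <= end]
    return sorted(in_window, key=lambda ev: ev["t"])

# "t" is a narrative-order timestamp; values here are illustrative.
graph = {
    "events": {
        "ohara": {"name": "Ohara Incident", "t": 1},
        "alabasta": {"name": "Alabasta Arc", "t": 2},
        "enies": {"name": "Enies Lobby", "t": 3},
    },
    "participated_in": {"Nico Robin": ["enies", "ohara", "alabasta"]},
}
timeline = reconstruct_timeline(graph, "Nico Robin", 0, 10)
# → Ohara Incident, Alabasta Arc, Enies Lobby
```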

Example 2: Thematic Clustering

Query: “What narrative themes recur across the Final Fantasy series?”

The system queries all Theme nodes in the final-fantasy universe cluster, aggregates MANIFESTS_IN edge weights across all arcs, and returns ranked thematic clusters:

Cluster 1: "Resistance to Authority" (strength: 0.91)
  — Manifests in: FF6 (Empire), FF7 (Shinra), FF8 (Galbadia),
    FF10 (Yevon), FF12 (Archadia), FF15 (Niflheim), FF16 (Bearers)

Cluster 2: "Memory and Identity" (strength: 0.87)
  — Manifests in: FF7 (Cloud's false memories), FF8 (GF-induced amnesia),
    FF9 (Vivi's existential questioning), FF10 (dream Zanarkand)

Cluster 3: "Sacrifice and Renewal" (strength: 0.84)
  — Manifests in: FF4 (Palom/Porom), FF6 (Celes), FF7 (Aerith),
    FF10 (Tidus), FF15 (Noctis)

Example 3: Spoiler-Filtered Progressive Disclosure

A user who has read One Piece through chapter 500 asks: “Tell me about the Yonko.”

The system returns information available as of chapter 500 — the four Emperors as known at that narrative point — while withholding all developments from chapters 501 onward (the post-timeskip power shifts, new Emperor appointments, defeats). The response notes: “12 additional facts are available beyond your current spoiler boundary.”


7. Research Foundation

Dante Scholarship

The epistemological framework draws on a tradition of reading the Commedia as a structured theory of knowledge, not merely as literary allegory:

Knowledge Graph Research

Narrative Informatics


8. Testing and Quality

Test Architecture

# Run full test suite
pytest tests/ -v

# Run by layer
pytest tests/inferno/ -v       # Ingestion and storage tests
pytest tests/purgatorio/ -v    # Transformation and resolution tests
pytest tests/paradiso/ -v      # Semantic and analytical tests

# Integration tests (requires Docker services running)
pytest tests/integration/ -v --docker

# Spoiler boundary compliance tests
pytest tests/spoiler/ -v

Quality Gates

Linting and Static Analysis

# Code quality
ruff check src/ tests/
mypy src/ --strict

# Schema validation
python -m lore_expert.validation.schema_check

9. Cross-References

Within ORGAN-I (Theoria)

This repository connects to several other ORGAN-I projects that provide foundational frameworks:

System Context

The Scalable Lore Expert is a pure ORGAN-I project: it formalizes an epistemological framework (Dante’s tripartite knowledge model) and applies it to a knowledge management domain (narrative universes). It does not depend on any ORGAN-II or ORGAN-III project, respecting the system’s no-back-edges invariant. However, the interactive exploration interface (query UI, visualization) is a natural candidate for ORGAN-II promotion if the project reaches CANDIDATE status.


10. Contributing

Contributions are welcome. Areas where help is particularly valuable:

Please open an issue before submitting large changes to discuss the approach. All contributions must include tests and maintain spoiler boundary compliance.


11. License and Author

MIT — Anthony Padavano

Anthony Padavano (@4444j99)

Part of the ORGAN-I: Theoria epistemological foundations layer within the eight-organ system.

The Eight-Organ System

Organ   Domain                   GitHub Organization
I       Theory (Theoria)         organvm-i-theoria
II      Art (Poiesis)            organvm-ii-poiesis
III     Commerce (Ergon)         organvm-iii-ergon
IV      Orchestration (Taxis)    organvm-iv-taxis
V       Public Process (Logos)   organvm-v-logos
VI      Community (Koinonia)     organvm-vi-koinonia
VII     Marketing (Kerygma)      organvm-vii-kerygma
Meta    Governance               meta-organvm

Scalable Lore Expert System is part of the ORGAN-I epistemological research layer — formalizing how knowledge is structured, layered, and recursively refined.