An epistemological infrastructure for converting AI conversations into durable, searchable, interconnected knowledge — with multi-modal retrieval and LLM-powered intelligence extraction.
“Knowledge base” — from the Greek epistēmē (knowledge) and basis (foundation). Where conventional knowledge management systems store what you put in, this system excavates what you did not know you had: latent patterns, implicit connections, and higher-order insights buried across hundreds of AI conversation threads. It treats each exchange with Claude, Gemini, or ChatGPT not as disposable dialogue but as raw epistemological material awaiting decomposition into atomic units of reusable understanding.
Every meaningful AI conversation generates knowledge that evaporates. You spend ninety minutes with Claude working through a recursive data structure design, or an afternoon with Gemini prototyping a deployment strategy, and afterward that knowledge exists only inside a vendor-specific chat interface — unsearchable, unstructured, disconnected from every other conversation you have ever had.
This is not a minor inconvenience. It is a structural epistemological failure. The problem decomposes into four discrete failures that existing tools do not address:
1. No usable export pathway. Claude.app, Gemini, and ChatGPT offer little or no bulk export. Your knowledge lives behind their UIs, inaccessible to external systems. The few export formats that do exist (ChatGPT’s JSON dump, for instance) are monolithic and unprocessed: they give you raw data, not usable knowledge.
2. No atomization. Conversations are monolithic blobs. A single session might contain a database schema discussion, a deployment decision, three code patterns, and a conceptual insight about state machines — all stored as one undifferentiated unit. Traditional note-taking tools (Notion, Obsidian, Logseq) require you to manually decompose conversations into discrete notes. This is precisely the labor that AI should automate.
3. No cross-conversation search. There is no way to query across sessions, across platforms, or across time. You cannot ask “What have I learned about recursive architectures across all my Claude conversations this quarter?” Tools like Readwise, Mem, or Reflect offer search, but none ingest AI conversations natively, none atomize automatically, and none provide multi-modal retrieval that combines keyword precision with semantic understanding.
4. No intelligence extraction. Conversations contain implicit insights — recurring themes, latent contradictions, evolving understanding — that are invisible without systematic analysis. No existing tool applies LLM intelligence to a corpus of AI conversations to surface what you did not explicitly articulate.
My Knowledge Base addresses each failure with a dedicated subsystem: a multi-source export engine, a five-strategy atomizer, a three-modal search index, and an LLM-powered intelligence layer.
The central epistemological operation is atomization: decomposing raw conversations and documents into the smallest self-contained units of knowledge that can stand alone, be retrieved independently, and form connections with other units. The KnowledgeAtomizer class (src/atomizer.ts) implements this through a strategy pattern, with five distinct chunking strategies dispatched based on content characteristics.
For conversations, two complementary passes run in parallel. Message-level atomization (atomizeByMessage) processes each message independently: messages shorter than 20 characters are discarded, each surviving message is classified by type (insight, code, question, reference, or decision), assigned auto-generated tags via content analysis, categorized into a controlled vocabulary (programming, writing, research, design, devops, data, general), and enriched with extracted keywords ranked by frequency. Code extraction (atomizeCodeBlocks) uses regex-based parsing to identify fenced code blocks, extracting them as discrete code-typed units with language metadata.
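
The message-level pass described above can be sketched as follows. This is a minimal illustration, not the project's atomizeByMessage: the Message and MessageUnit shapes, the classification heuristics, and the helper names are all assumptions; only the 20-character cutoff, the five unit types, and frequency-ranked keywords come from the text.

```typescript
// Hypothetical shapes for illustration; the project's real types live in src/types.ts.
type UnitType = 'insight' | 'code' | 'question' | 'reference' | 'decision';

interface Message {
  role: 'user' | 'assistant';
  content: string;
}

interface MessageUnit {
  type: UnitType;
  content: string;
  keywords: string[];
}

const MIN_MESSAGE_LENGTH = 20; // from the text: shorter messages are discarded
const FENCE = '`'.repeat(3); // three backticks, built at runtime so this example stays fence-safe

// Assumed heuristics; the real classifier may rely on different signals.
function classifyMessage(content: string): UnitType {
  if (content.includes(FENCE)) return 'code';
  if (content.trim().endsWith('?')) return 'question';
  if (/https?:\/\//.test(content)) return 'reference';
  if (/\b(decided|we will|going with)\b/i.test(content)) return 'decision';
  return 'insight';
}

// Rank words of four or more letters by frequency, most frequent first (ties alphabetical).
function extractKeywords(content: string, limit = 5): string[] {
  const counts = new Map<string, number>();
  for (const word of content.toLowerCase().match(/[a-z]{4,}/g) ?? []) {
    counts.set(word, (counts.get(word) ?? 0) + 1);
  }
  return Array.from(counts.entries())
    .sort((a, b) => b[1] - a[1] || a[0].localeCompare(b[0]))
    .slice(0, limit)
    .map(([word]) => word);
}

function atomizeMessages(messages: Message[]): MessageUnit[] {
  return messages
    .filter((m) => m.content.trim().length >= MIN_MESSAGE_LENGTH)
    .map((m) => ({
      type: classifyMessage(m.content),
      content: m.content,
      keywords: extractKeywords(m.content),
    }));
}
```

A short acknowledgement such as "ok thanks" is dropped by the length filter, while a longer message ending in a question mark becomes a question-typed unit.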
For documents — markdown, plain text, HTML, and PDF — the atomizer delegates to format-specific ChunkingStrategy implementations defined in src/chunking-strategies.ts:
| Strategy | Class | Target |
|---|---|---|
| Markdown Semantic | MarkdownSemanticChunkingStrategy | .md, .txt, .html: splits on headings and semantic boundaries using SemanticChunker |
| PDF Sliding Window | PdfSlidingWindowChunkingStrategy | .pdf: configurable token windows (default 500 tokens, 50 overlap) with page range estimation |
| Single Chunk | SingleChunkStrategy | Fallback for any format: passes content through as one unit |
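
The sliding-window idea behind the PDF strategy can be illustrated with a short sketch. It assumes whitespace splitting as a stand-in for a real tokenizer and omits page range estimation; slidingWindowChunks and WindowChunk are hypothetical names, and only the 500/50 defaults come from the table.

```typescript
interface WindowChunk {
  index: number;
  startToken: number;
  endToken: number; // exclusive
  text: string;
}

// Whitespace splitting stands in for a real tokenizer (an assumption).
function slidingWindowChunks(text: string, windowSize = 500, overlap = 50): WindowChunk[] {
  if (overlap < 0 || overlap >= windowSize) {
    throw new Error('overlap must be non-negative and smaller than windowSize');
  }
  const tokens = text.split(/\s+/).filter((t) => t.length > 0);
  const step = windowSize - overlap; // each window starts this many tokens after the previous one
  const chunks: WindowChunk[] = [];
  for (let start = 0; start < tokens.length; start += step) {
    const end = Math.min(start + windowSize, tokens.length);
    chunks.push({
      index: chunks.length,
      startToken: start,
      endToken: end,
      text: tokens.slice(start, end).join(' '),
    });
    if (end === tokens.length) break; // the last window reached the end of the document
  }
  return chunks;
}
```

With the defaults, a 1,000-token document yields three windows covering tokens 0-500, 450-950, and 900-1000, so every boundary is seen twice with 50 tokens of shared context.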
The SemanticChunker (src/semantic-chunker.ts) performs content-type detection (markdown, code, or plain text), splits on structural boundaries (headings, code fences, list blocks), merges chunks below a configurable minimum size, and calculates confidence scores for each resulting segment. Chunk guardrails enforce maximum chunk counts per document (CHUNK_MAX_PER_DOC, default 40) and minimum token thresholds (CHUNK_MIN_TOKENS, default 160), merging small fragments and capping excessive fragmentation.
Every atomic unit carries a typed AtomicUnit interface (src/types.ts) with provenance fields (conversationId, documentId, parentSectionId), hierarchical metadata (sectionType, hierarchyLevel), and a RedactionMetadata payload when the built-in RedactionService has processed the content.
```typescript
interface AtomicUnit {
  id: string;
  type: 'insight' | 'code' | 'question' | 'reference' | 'decision';
  timestamp: Date;
  title: string;
  content: string;
  context: string;
  tags: string[];
  category: string;
  conversationId?: string;
  documentId?: string;
  keywords: string[];
  sectionType?: 'list' | 'table' | 'blockquote' | 'heading' | 'code' | 'paragraph';
  hierarchyLevel?: number;
}
```
Retrieval operates across three complementary modalities, each optimized for a different query type.
Full-Text Search uses SQLite FTS5 via better-sqlite3. The KnowledgeDatabase class (src/database.ts) maintains a virtual FTS5 table (units_fts) indexed on title, content, context, and tags. FTS5 provides BM25 ranking, prefix matching, boolean operators, and sub-millisecond latency for keyword-exact queries. The database runs in WAL (Write-Ahead Logging) mode for concurrent read/write performance.
Semantic Search uses ChromaDB as a vector store with configurable embedding providers. The VectorDatabase class (src/vector-database.ts) manages profile-aware collections — each embedding model gets its own ChromaDB collection (e.g., knowledge_units_text-embedding-3-small), with legacy fallback support for migration scenarios. The EmbeddingsService generates vectors via the EmbeddingFactory (src/embedding-factory.ts), which supports OpenAI (text-embedding-3-small at 1536 dimensions, text-embedding-3-large at 3072), Ollama (nomic-embed-text at 768, mxbai-embed-large at 1024), and Hugging Face Sentence Transformers (384 dimensions). An EmbeddingCache (src/embedding-cache.ts) avoids redundant API calls for previously embedded content.
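
As a rough sketch of what profile-aware collections imply, the helpers below derive a collection name from the embedding model, fall back to a legacy collection during migration, and check vector dimensions. The function names and the legacy collection name knowledge_units are assumptions; the naming pattern and the dimensions are taken from the paragraph above.

```typescript
// Dimensions as listed in the text; models missing from this table are treated as unknown.
const EMBEDDING_DIMENSIONS = new Map<string, number>([
  ['text-embedding-3-small', 1536],
  ['text-embedding-3-large', 3072],
  ['nomic-embed-text', 768],
  ['mxbai-embed-large', 1024],
]);

const LEGACY_COLLECTION = 'knowledge_units'; // assumed name for the pre-profile collection

// One collection per embedding model, following the naming example in the text.
function collectionNameFor(model: string): string {
  return `knowledge_units_${model}`;
}

// Prefer the model-specific collection; fall back to the legacy one only if it is all that exists.
function resolveCollection(model: string, existing: string[]): string {
  const preferred = collectionNameFor(model);
  if (existing.indexOf(preferred) !== -1) return preferred;
  if (existing.indexOf(LEGACY_COLLECTION) !== -1) return LEGACY_COLLECTION;
  return preferred; // nothing exists yet, so the caller would create the preferred collection
}

// Guard against mixing vectors from different models in one collection.
function assertDimensions(model: string, vector: number[]): void {
  const expected = EMBEDDING_DIMENSIONS.get(model);
  if (expected !== undefined && vector.length !== expected) {
    throw new Error(`${model} should produce ${expected} dimensions, got ${vector.length}`);
  }
}
```

Keeping one collection per model matters because vectors of different dimensionality, or from different embedding spaces, cannot be compared meaningfully.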
Hybrid Search (src/hybrid-search.ts) combines both modalities using Reciprocal Rank Fusion (RRF). The HybridSearch.search() method executes FTS and embedding generation in parallel, then merges ranked lists with the formula:
score(unit) = w_fts / (k + rank_fts + 1) + w_semantic / (k + rank_semantic + 1)
where k=60 (the RRF smoothing constant), and weights default to fts=0.6, semantic=0.4 but are caller-configurable. Post-fusion boosts are applied for chunked content (+0.05) and visual-intent queries (+0.02 for units tagged has-image). The system supports filtering by date range, source, and document format, with visual intent detection that recognizes 18 query terms (e.g., “diagram”, “screenshot”, “wireframe”).
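
The fusion formula can be sketched in a few lines. This is an illustration under stated assumptions rather than the code in src/hybrid-search.ts: ranks are taken to be zero-based list positions (which would explain the +1 in the formula), a unit missing from one list contributes nothing for that list, and the post-fusion boosts are left out. The name fuseRanks is hypothetical.

```typescript
interface FusionOptions {
  k?: number; // RRF smoothing constant
  ftsWeight?: number;
  semanticWeight?: number;
}

interface FusedHit {
  id: string;
  score: number;
}

// Each argument is a list of unit ids, best match first.
function fuseRanks(fts: string[], semantic: string[], options: FusionOptions = {}): FusedHit[] {
  const { k = 60, ftsWeight = 0.6, semanticWeight = 0.4 } = options;
  const scores = new Map<string, number>();
  const accumulate = (ids: string[], weight: number): void => {
    ids.forEach((id, rank) => {
      // rank is the zero-based position, matching w / (k + rank + 1)
      scores.set(id, (scores.get(id) ?? 0) + weight / (k + rank + 1));
    });
  };
  accumulate(fts, ftsWeight);
  accumulate(semantic, semanticWeight);
  return Array.from(scores.entries())
    .map(([id, score]) => ({ id, score }))
    .sort((a, b) => b.score - a.score);
}
```

For example, fusing the FTS list [a, b, c] with the semantic list [b, d] ranks b first, because it is the only unit that collects a contribution from both lists.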
Four LLM-powered analysis modules transform the knowledge base from a retrieval system into an analytical instrument. All use the AIFactory (src/ai-factory.ts) abstraction, which supports Anthropic, OpenAI, and Ollama backends interchangeably.
The InsightExtractor (src/insight-extractor.ts) analyzes conversation transcripts through a structured system prompt that instructs the LLM to identify reusable technical insights, decision rationale, code patterns, and actionable recommendations. Each extraction produces 3-10 AtomicUnit objects tagged with claude-extracted and importance levels (high, medium, low). Batch processing with checkpoint support (BatchProcessor) enables resumable extraction across large corpora.
The SmartTagger (src/smart-tagger.ts) generates controlled-vocabulary tags using LLM classification. Given a unit’s title and first 1,000 characters of content, it produces 3-8 hyphenated tags, a category assignment from the canonical taxonomy, extracted keywords, and a confidence score. The taxonomy.ts module enforces normalization: aliases like technical or tooling resolve to programming; infrastructure resolves to devops.
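
Alias normalization of this kind might look like the sketch below. The alias table contains only the three aliases named above, the category list is the controlled vocabulary given earlier, and the fallback to general for unknown input is an assumption; normalizeCategory and normalizeTags are hypothetical names, not the taxonomy.ts API.

```typescript
const CATEGORIES = ['programming', 'writing', 'research', 'design', 'devops', 'data', 'general'] as const;
type Category = (typeof CATEGORIES)[number];

// Only these three aliases are named in the text; a real alias table would be longer.
const CATEGORY_ALIASES = new Map<string, Category>([
  ['technical', 'programming'],
  ['tooling', 'programming'],
  ['infrastructure', 'devops'],
]);

function normalizeCategory(raw: string): Category {
  const key = raw.trim().toLowerCase();
  if ((CATEGORIES as readonly string[]).indexOf(key) !== -1) return key as Category;
  return CATEGORY_ALIASES.get(key) ?? 'general'; // falling back to general is an assumption
}

// Lowercase, hyphenate, and de-duplicate tags (the text says tags are hyphenated).
function normalizeTags(tags: string[]): string[] {
  const seen = new Set<string>();
  for (const tag of tags) {
    const cleaned = tag
      .trim()
      .toLowerCase()
      .replace(/[^a-z0-9]+/g, '-')
      .replace(/^-+|-+$/g, '');
    if (cleaned) seen.add(cleaned);
  }
  return Array.from(seen);
}
```

Normalizing after the LLM responds keeps the stored vocabulary closed even when the model returns near-synonyms or inconsistent casing.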
The RelationshipDetector (src/relationship-detector.ts) maps connections between units using a two-phase approach: first, vector similarity identifies candidate pairs above a configurable threshold (default 0.7); then the LLM validates and classifies each relationship using an OpenMetadata-inspired type system — references, builds_on, contradicts, implements, derived_from, prerequisite, or the fallback related. Each relationship carries a confidence score and human-readable explanation.
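
The first phase, candidate selection by vector similarity, can be approximated as below. This sketch assumes cosine similarity and a brute-force pairwise comparison; a vector store would normally answer the same question with a nearest-neighbour query, and the LLM validation phase is not shown. EmbeddedUnit, CandidatePair, and findCandidatePairs are hypothetical names.

```typescript
interface EmbeddedUnit {
  id: string;
  vector: number[];
}

interface CandidatePair {
  a: string;
  b: string;
  similarity: number;
}

function cosineSimilarity(x: number[], y: number[]): number {
  if (x.length !== y.length) throw new Error('vectors must have the same length');
  let dot = 0;
  let normX = 0;
  let normY = 0;
  for (let i = 0; i < x.length; i++) {
    dot += x[i] * y[i];
    normX += x[i] * x[i];
    normY += y[i] * y[i];
  }
  if (normX === 0 || normY === 0) return 0; // a zero vector has no direction
  return dot / Math.sqrt(normX * normY);
}

// Phase one only: O(n^2) comparison, keeping pairs at or above the threshold.
function findCandidatePairs(units: EmbeddedUnit[], threshold = 0.7): CandidatePair[] {
  const pairs: CandidatePair[] = [];
  for (let i = 0; i < units.length; i++) {
    for (let j = i + 1; j < units.length; j++) {
      const similarity = cosineSimilarity(units[i].vector, units[j].vector);
      if (similarity >= threshold) {
        pairs.push({ a: units[i].id, b: units[j].id, similarity });
      }
    }
  }
  return pairs.sort((p, q) => q.similarity - p.similarity);
}
```

Filtering by similarity first keeps the number of LLM calls proportional to the number of plausible pairs rather than to every possible pair.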
The ConversationSummarizer (src/conversation-summarizer.ts) produces structured summaries with title, overview, key points, topics, outcome, action items, code snippet counts, and technology mentions — providing a high-level map of conversational content before deep retrieval.
The KnowledgeGraph class (src/knowledge-graph.ts) implements a directed graph with typed nodes and weighted edges. Nodes correspond to atomic units; edges represent typed relationships (RelationshipType enum: RELATED, SIMILAR, CONTRADICTS, EXTENDS, REFERENCES, DEPENDS_ON, PART_OF, FOLLOWS, PRECEDES, SAME_CATEGORY, SAME_TOPIC).
Graph operations include BFS-based shortest path finding (findShortestPath), configurable neighborhood extraction (getNeighborhood with hop-depth control), type-based and category-based filtering, and graph statistics (node count, edge count, density, average/max degree). The GraphBuilder.detectRelationships static method computes Jaccard similarity over keyword sets to auto-detect RELATED edges above a configurable threshold.
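
Two of these operations are easy to show in isolation. The sketch below is a generic illustration of Jaccard similarity over keyword sets and BFS shortest path on a directed adjacency list; it does not reproduce the KnowledgeGraph or GraphBuilder classes, and the names jaccard, shortestPath, and Adjacency are assumptions.

```typescript
// |A ∩ B| / |A ∪ B| over de-duplicated keyword sets.
function jaccard(a: string[], b: string[]): number {
  const setA = new Set(a);
  const setB = new Set(b);
  if (setA.size === 0 && setB.size === 0) return 0;
  let shared = 0;
  setA.forEach((word) => {
    if (setB.has(word)) shared++;
  });
  return shared / (setA.size + setB.size - shared);
}

// Directed adjacency list: node id -> ids reachable over one outgoing edge.
type Adjacency = Map<string, string[]>;

// Breadth-first search returns a path with the fewest edges, or null if none exists.
function shortestPath(graph: Adjacency, from: string, to: string): string[] | null {
  if (from === to) return [from];
  const previous = new Map<string, string>();
  const visited = new Set<string>([from]);
  const queue: string[] = [from];
  for (let head = 0; head < queue.length; head++) {
    const current = queue[head];
    for (const next of graph.get(current) ?? []) {
      if (visited.has(next)) continue;
      visited.add(next);
      previous.set(next, current);
      if (next === to) {
        // Walk the predecessor links back to the start to rebuild the path.
        const path = [to];
        let node = to;
        while (node !== from) {
          node = previous.get(node) as string;
          path.unshift(node);
        }
        return path;
      }
      queue.push(next);
    }
  }
  return null;
}
```

Because the graph is directed, a path from one unit to another does not imply a path back; the search follows outgoing edges only.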
The graph exports to both raw JSON and vis.js-compatible format (toVisFormat) for direct rendering in the web frontend’s graph visualization tab.
The SourceManager (src/sources/manager.ts) orchestrates ingestion from nine source adapters, each implementing the KnowledgeSource interface:
| Source | Class | Type | Method |
|---|---|---|---|
| Claude (browser) | ClaudeSource | chat | Playwright browser automation |
| Claude (export) | ClaudeExportSource | chat | JSON export file parsing |
| ChatGPT (browser) | ChatGPTSource | chat | Playwright browser automation |
| ChatGPT (export) | ChatGPTExportSource | chat | JSON export file parsing |
| Gemini | GeminiSource | chat | Playwright browser automation |
| Local files | LocalFileSource | file | Filesystem glob (.md, .txt, .pdf, .html) |
| Google Docs | GoogleDocsSource | file | OAuth2 API integration |
| Apple Notes | AppleNotesSource | file | macOS-only native bridge |
| Bookmarks | BookmarkSource | file | Browser bookmark parsing |
A Federation layer (src/federation/) extends this with persistent, schedulable source management. FederatedSourceRecord entries define filesystem paths with include/exclude glob patterns; the FederatedIndexer performs incremental scans, content hashing for change detection, and scan history tracking via FederatedScanRunRecord. Federated search (src/federation/search.ts) queries across all indexed federation sources with unified scoring.
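
Hash-based change detection in an incremental scan can be sketched as follows. This is a simplified illustration, not the FederatedIndexer: files are passed in as an in-memory map instead of being read from disk, glob filtering and scan history are omitted, and FNV-1a over UTF-16 code units is used only to keep the example dependency-free (a real indexer would more likely use a cryptographic hash such as SHA-256). All names here are hypothetical.

```typescript
// 32-bit FNV-1a over UTF-16 code units, rendered as 8 hex digits.
function fnv1a(text: string): string {
  let hash = 0x811c9dc5;
  for (let i = 0; i < text.length; i++) {
    hash ^= text.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193) >>> 0;
  }
  return ('00000000' + hash.toString(16)).slice(-8);
}

interface ScanResult {
  added: string[];
  changed: string[];
  removed: string[];
  unchanged: string[];
}

// files maps path -> current content; knownHashes is the state saved by the
// previous scan and is updated in place so the next scan can compare against it.
function incrementalScan(files: Map<string, string>, knownHashes: Map<string, string>): ScanResult {
  const result: ScanResult = { added: [], changed: [], removed: [], unchanged: [] };
  files.forEach((content, path) => {
    const hash = fnv1a(content);
    const previous = knownHashes.get(path);
    if (previous === undefined) result.added.push(path);
    else if (previous !== hash) result.changed.push(path);
    else result.unchanged.push(path);
    knownHashes.set(path, hash);
  });
  knownHashes.forEach((_hash, path) => {
    if (!files.has(path)) result.removed.push(path);
  });
  result.removed.forEach((path) => {
    knownHashes.delete(path);
  });
  return result;
}
```

Only paths reported as added or changed would need to be re-atomized and re-embedded, which is what makes repeated scans of a large directory cheap.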
The system also supports real-time ingestion via watch mode — chokidar-based filesystem monitoring that triggers atomization and indexing as new files appear.
┌────────────────────────────────────────────────────────────────────────────┐
│ Source Layer │
│ ┌──────────┐ ┌──────────┐ ┌──────────┐ ┌──────────┐ ┌──────────────┐ │
│ │ Claude │ │ ChatGPT │ │ Gemini │ │ Local FS │ │ Google Docs │ │
│ │(Playwrgt)│ │(Playwrgt)│ │(Playwrgt)│ │(glob/pdf)│ │ (OAuth2) │ │
│ └────┬─────┘ └────┬─────┘ └────┬─────┘ └────┬─────┘ └──────┬───────┘ │
│ └────────────┬┴───────────┬┴────────────┘ │ │
│ ▼ ▼ ▼ │
│ SourceManager.ingestAll() Federation Indexer │
└────────────────────────┬─────────────────────────────┬───────────────────┘
▼ ▼
┌────────────────────────────────────────────────────────────────────────────┐
│ Atomization Layer │
│ ┌─────────────────────────────────────────────────────────────────────┐ │
│ │ KnowledgeAtomizer │ │
│ │ ├── atomizeConversation() → message-level + code extraction │ │
│ │ └── atomizeDocument() → strategy dispatch: │ │
│ │ ├── MarkdownSemanticChunkingStrategy (md/txt/html) │ │
│ │ ├── PdfSlidingWindowChunkingStrategy (pdf) │ │
│ │ └── SingleChunkStrategy (fallback) │ │
│ │ + RedactionService (PII/secret masking before storage) │ │
│ └─────────────────────────────────────────────────────────────────────┘ │
└────────────────────────┬──────────────────────────────────────────────────┘
▼
┌────────────────────────────────────────────────────────────────────────────┐
│ Storage + Index Layer │
│ ┌─────────────────┐ ┌─────────────────┐ ┌───────────────────────┐ │
│ │ SQLite + FTS5 │ │ ChromaDB │ │ Knowledge Graph │ │
│ │ (better-sqlite3)│ │ (vector store) │ │ (in-memory directed) │ │
│ │ atomic_units │ │ profile-aware │ │ BFS/neighborhood/ │ │
│ │ units_fts │ │ collections │ │ Jaccard detection │ │
│ │ documents │ │ legacy fallback │ │ vis.js export │ │
│ │ tags, rels │ │ EmbeddingCache │ │ │ │
│ └────────┬────────┘ └────────┬────────┘ └───────────┬───────────┘ │
│ └──────────────┬───────┘ │ │
│ ▼ │ │
│ ┌──────────────────────────────────┐ │ │
│ │ HybridSearch (RRF fusion) │◀─────────────────────┘ │
│ │ w_fts / (k + rank) + w_vec /.. │ │
│ │ + date/source/format filters │ │
│ │ + visual intent boost │ │
│ └──────────────────────────────────┘ │
└────────────────────────┬──────────────────────────────────────────────────┘
▼
┌────────────────────────────────────────────────────────────────────────────┐
│ Intelligence Layer (LLM) │
│ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ ┌─────────────────┐ │
│ │ Insight │ │ Smart │ │ Relationship │ │ Conversation │ │
│ │ Extractor │ │ Tagger │ │ Detector │ │ Summarizer │ │
│ │ (3-10/conv) │ │ (3-8 tags) │ │ (vec+LLM) │ │ (structured) │ │
│ └──────────────┘ └──────────────┘ └──────────────┘ └─────────────────┘ │
│ AIFactory: Anthropic | OpenAI | Ollama │
└────────────────────────┬──────────────────────────────────────────────────┘
▼
┌────────────────────────────────────────────────────────────────────────────┐
│ Presentation Layer │
│ ┌─────────────────────────┐ ┌──────────────────────────────────────┐ │
│ │ Express.js REST API │ │ React Frontend (web-react/) │ │
│ │ + WebSocket (real-time) │ │ SearchTab, GraphTab, TagsTab, │ │
│ │ + Auth middleware │ │ ConversationsTab, FederationTab, │ │
│ │ + Rate limiting │ │ AdminTab, SettingsTab, ExportsTab │ │
│ │ + Collections API │ │ + Zustand stores + keyboard nav │ │
│ │ + Saved Searches API │ │ │ │
│ └─────────────────────────┘ └──────────────────────────────────────┘ │
│ ┌─────────────────────────────────────────────────────────────────────┐ │
│ │ CLI (30+ commands): search, search:semantic, search:hybrid, │ │
│ │ extract-insights, smart-tag, find-relationships, summarize, │ │
│ │ export-obsidian, watch, probe:runtime, redact:scan, ... │ │
│ └─────────────────────────────────────────────────────────────────────┘ │
└────────────────────────────────────────────────────────────────────────────┘
```bash
# Clone the repository
git clone https://github.com/organvm-i-theoria/my-knowledge-base.git
cd my-knowledge-base

# Install dependencies
npm install

# Configure environment
cp .env.example .env
# Edit .env: set ANTHROPIC_API_KEY, OPENAI_API_KEY, or configure Ollama

# (Optional) Configure for local-only operation via config.yaml:
#   llm.provider: ollama
#   embedding.provider: local

# Initialize database (runs migrations + seeds)
npm run prepare-db

# Run the test suite
npm test

# Start the web server (React UI at http://localhost:3000)
npm run web

# Or start in dev mode with hot reload
npm run dev
```
Edit config.yaml to use Ollama for both LLM and embeddings:
```yaml
llm:
  provider: ollama
  model: gemma3:4b
  baseUrl: http://localhost:11434/v1
embedding:
  provider: local
  model: nomic-embed-text
search:
  semanticPolicy: degrade
  hybridPolicy: degrade
  maxSearchWindow: 2000
  enforceVectorSqlParity: true
```
With semanticPolicy: degrade, the system gracefully falls back to FTS-only search when ChromaDB is unavailable.
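
The degrade behaviour can be pictured as a try-then-fall-back wrapper. The sketch below is an assumption about the shape of that logic, not the project's code: the two search functions are injected by the caller, and the strict policy value is a hypothetical counterpart included only to show the contrast.

```typescript
// 'degrade' comes from the config above; 'strict' is a hypothetical counterpart for illustration.
type SearchPolicy = 'degrade' | 'strict';

interface PolicyResult<T> {
  results: T[];
  mode: 'semantic' | 'fts-fallback';
}

async function searchWithPolicy<T>(
  policy: SearchPolicy,
  semanticSearch: () => Promise<T[]>,
  ftsSearch: () => Promise<T[]>,
): Promise<PolicyResult<T>> {
  try {
    return { results: await semanticSearch(), mode: 'semantic' };
  } catch (error) {
    // Under any policy other than degrade, surface the outage instead of hiding it.
    if (policy !== 'degrade') throw error;
    return { results: await ftsSearch(), mode: 'fts-fallback' };
  }
}
```

Returning the mode alongside the results lets a caller tell the user that results came from keyword search only, rather than degrading silently.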
```bash
# Export all conversations from configured sources
npm run ingest:all

# Full-text keyword search
npm run search -- "recursive data structures"

# Semantic search (requires embeddings)
npm run search:semantic -- "approaches to handling state in deeply nested systems"

# Hybrid search (FTS + vector, RRF fusion)
npm run search:hybrid -- "deployment patterns for microservices"

# Extract insights from all conversations (LLM-powered)
npm run extract-insights

# Generate smart tags for all untagged units
npm run smart-tag:backfill

# Discover relationships between knowledge units
npm run find-relationships

# Summarize a batch of conversations
npm run summarize

# Watch for new files and auto-ingest with embeddings
npm run watch:semantic

# Export entire knowledge base to Obsidian vault
npm run export-obsidian -- --vault-path ~/Documents/ObsidianVault

# Export processed knowledge base as structured JSON
npm run export-processed-pkb

# Scan for secrets and PII before committing data
npm run redact:scan

# Apply redaction to all detected items
npm run redact:apply

# Validate no secrets remain
npm run redact:validate
```
The RedactionService (src/redaction-service.ts) detects 17 secret types (OpenAI, Anthropic, AWS, GitHub, Stripe, Slack, Discord API keys; JWTs; private keys; bearer tokens; connection strings) and 6 PII types (SSN, phone, email, credit card, IPv4, IPv6) with sophisticated false-positive filtering to avoid flagging code patterns like process.env.API_KEY.
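
A pattern-based scanner with a false-positive filter might be structured like the sketch below. The three patterns and the filter rules are illustrative assumptions and cover far less than the RedactionService described above; the point is only to show how a match on an environment-variable reference such as process.env.API_KEY can be skipped. Overlapping matches between patterns are not handled here.

```typescript
interface Finding {
  kind: string;
  match: string;
  index: number;
}

// Illustrative patterns only; the real service covers many more types with its own rules.
const PATTERNS: Array<{ kind: string; regex: RegExp }> = [
  { kind: 'email', regex: /[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}/g },
  { kind: 'aws-access-key-id', regex: /\bAKIA[0-9A-Z]{16}\b/g },
  {
    kind: 'generic-secret-assignment',
    // The capture group holds the assigned value so it can be checked separately.
    regex: /\b[A-Z0-9_]*(?:API_KEY|SECRET|TOKEN)\s*=\s*["']?([^\s"']+)/g,
  },
];

// Values that refer to a secret without containing one (assumed filter rules).
function isLikelyFalsePositive(value: string): boolean {
  return /^process\.env\./.test(value) || /^\$\{?[A-Za-z_][A-Za-z0-9_]*\}?$/.test(value);
}

function scanForSecrets(text: string): Finding[] {
  const findings: Finding[] = [];
  for (const { kind, regex } of PATTERNS) {
    regex.lastIndex = 0; // global regexes keep state between calls
    let m: RegExpExecArray | null;
    while ((m = regex.exec(text)) !== null) {
      const value = m[1] ?? m[0]; // use the captured value when the pattern has one
      if (isLikelyFalsePositive(value)) continue;
      findings.push({ kind, match: m[0], index: m.index });
    }
  }
  return findings;
}
```

Checking the captured value rather than the whole match is what allows an assignment like API_KEY=process.env.API_KEY to pass while an assignment of a literal value is still flagged.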
As an ORGAN-I (Theory) repository, My Knowledge Base provides foundational infrastructure that downstream organs build upon.
The unidirectional dependency constraint (I -> II -> III) means this repository is consumed by, but never depends on, any downstream organ.
The system is validated through multiple layers:
- Unit tests (src/**/*.test.ts) and integration tests (tests/), including search parity smoke tests, federation endpoint tests, WebSocket integration, and end-to-end auth and export flows.
- scripts/probe-search-runtime.ts: automated health checks for local, staging, and production environments that verify search latency, semantic readiness, and cross-modal parity.
- benchmarks/search-latency.bench.ts: search latency measurement under controlled conditions.
- scripts/test-stability.sh: repeated test runs to detect flaky tests and non-deterministic behavior.
- scripts/verify-alerts.ts: validates that Prometheus alert rules (ops/alerts/search-runtime-alerts.yaml) fire correctly under simulated failure conditions.
- docs/evidence/: kept for audit trail purposes.

- providers, provider_accounts, chat_threads, chat_turns, term_lexicon, term_occurrences, thematic_edges, ingest_runs.
- /api/universe/*: macro→micro traversal and term occurrence search.
- npm run ingest:universe -- --root=intake
- npm run ingest:universe:save -- --root=intake --report-dir=intake/reports
- npm run test:universe
- Universe tab for provider→chat→turn drill-down and click-to-global-term exploration.

- /apps/mobile (Expo/React Native scaffold)
- /apps/desktop (Tauri scaffold)
- /packages/contracts
- apps/mobile/src/*, apps/desktop/src/*, validated by tests/native-parity.test.ts.
- npm run build:all compiles server + web + native-core modules in one gate.

- text-embedding-3-small to text-embedding-3-large
- RSSBuilder already implemented; wire to API endpoint for knowledge feed subscription.

This repository is part of ORGAN-I: Theoria, the theoretical and epistemological organ of the ORGAN system.
| Repository | Organ | Relationship |
|---|---|---|
| recursive-engine | I | Recursive self-modeling framework — the theoretical foundation for the knowledge graph’s self-referential structure |
| organon-noumenon | I | Ontological category system — informs the taxonomy and category normalization used in atomization |
| metasystem-master | II | Generative art system that consumes extracted insights as creative seed material |
| public-process | V | Public essays drawing on intelligence extraction outputs from this system |
| agentic-titan | IV | Orchestration agent that coordinates automated knowledge processing workflows |
| organvm-i-theoria | I | Parent organization — all theory, epistemology, and ontology repositories |
| meta-organvm | VIII | Umbrella organization coordinating all eight organs |
Contributions are welcome. This project follows standard GitHub workflow:
1. Create a feature branch (git checkout -b feature/your-feature)
2. Add tests (src/ for unit tests or tests/ for integration)
3. Ensure npm test passes

See CONTRIBUTING.md for detailed guidelines, including code style, commit conventions, and the pull request template.
Part of ORGAN-I: Theoria — the theoretical organ of the ORGAN system.