A systematic excavation engine that audits every layer of a creator’s digital footprint — archives, AI conversations, personal repositories, organisation repositories, and web bookmarks — and produces a unified inventory, a knowledge graph of relationships, and a prioritised triage report that transforms scattered creative history into an organised system foundation.
Cognitive Archaeology refers to the practice of excavating layers of thought. Just as physical archaeology digs through strata of soil to reconstruct civilisations, cognitive archaeology digs through strata of digital artefacts — conversation logs, code repositories, cloud archives, browser bookmarks — to reconstruct the intellectual history of a creative practice. Each layer carries its own temporal signature, its own logic, and its own relationship to the layers above and below it.
Tribunal carries the dual meaning of judgment and adjudication. The tool does not merely inventory what it finds; it renders verdicts. Every artefact passes through triage: what survives into the organised system, what gets archived, what gets merged, and what gets discarded. The tribunal metaphor reflects the epistemological seriousness of the task — deciding what knowledge is worth preserving is itself an act of knowledge-making.
Together, the name describes a system that excavates cognitive strata and adjudicates what survives the transition from chaos to order.
Any sufficiently long creative practice accumulates entropy. Years of work produce scattered artefacts across incompatible systems: hundreds of AI conversation threads (ChatGPT, Claude, Gemini) containing architectural decisions that were never extracted; personal GitHub repositories that are really evaluation forks of external tools, not original work, but mixed indistinguishably with originals; cloud storage directories (iCloud, Dropbox, network drives) containing duplicated files across multiple locations with no canonical version; browser bookmark collections numbering in the thousands with no topical organisation; and organisation repositories that may be active, stale, or silently abandoned.
The fundamental problem is not storage — storage is cheap. The problem is epistemic: you cannot govern what you cannot see. Before the eight-organ system could be designed, before repositories could be assigned to organisations, before any architectural decision could be made, someone had to answer the question: what do we actually have?
Manual inventory is not viable at scale. A human scanning 85 repositories, thousands of archived files, and hundreds of AI conversations would take weeks and produce an incomplete, error-prone result. The Cognitive Archaeology Tribunal exists to answer this question programmatically, comprehensively, and repeatedly. It scans every layer of a creator’s digital footprint — from local filesystem archives through GitHub API endpoints — and produces machine-readable outputs (JSON inventories, knowledge graphs, triage reports) that downstream tools and human reviewers can act on.
This is not a backup tool or a file manager. It is a pre-governance audit system — the epistemological foundation that makes governance possible. Within ORGAN-I (Theoria), it embodies the principle that theory begins with accurate observation. You cannot theorise about a system you have not first measured.
The tribunal models a creator’s digital footprint as four discrete layers, each with its own scanning module, data model, and triage logic:
| Layer | Domain | Scanner Module | Key Metric |
|---|---|---|---|
| Layer 0 | Filesystem archives (iCloud, Dropbox, local/network drives) | ArchiveScanner |
Files counted, duplicates found, space savings |
| Layer 1 | AI conversation history (ChatGPT, Claude, generic JSON) | AIContextAggregator |
Conversations loaded, messages parsed, topics extracted |
| Layer 2 | Personal GitHub repositories | PersonalRepoAnalyzer |
Repos classified (fork vs. original), modification detection |
| Layer 3 | Organisation GitHub repositories | OrgRepoAnalyzer |
Repos by status (active/stale/abandoned), health scoring |
| Layer 4 | Web bookmarks (Netscape format exports) | WebBookmarkAnalyzer |
Bookmarks parsed, URLs and timestamps extracted |
Each layer is independently scannable. You can run the archive scanner without touching GitHub, or analyse organisation repos without providing bookmark exports. The unified inventory and knowledge graph only gain cross-layer power when multiple layers are combined, but each layer stands alone as a useful audit.
Raw inventories are useful but flat. The KnowledgeGraphGenerator transforms audit results into a directed graph of nodes (entities) and edges (relationships). Nodes carry type metadata — repository, file_category, conversation_source, repo_status — and edges encode relationships like contains, forked_from, and has_status. The graph exports in two formats: a native JSON structure for programmatic consumption and a Cytoscape.js-compatible format for interactive visualisation.
The knowledge graph enables queries that no single module can answer alone: “Which AI conversations relate to repositories that are currently stale?” or “Which archive categories overlap with active org repositories?” These cross-layer queries are the tribunal’s primary analytical value.
The TriageReportGenerator converts raw audit data into prioritised action items across three tiers — high, medium, and low priority. Triage logic is domain-specific per layer:
The triage report generates both JSON (for downstream automation) and plain-text (for human review) output formats. The text report is structured as an executive summary suitable for reading in a terminal or including in planning documents.
The Deduplicator class within file_utils.py implements a two-pass deduplication strategy. The first pass groups files by size — an O(1) lookup that eliminates most non-duplicates without reading file contents. The second pass computes SHA-256 content hashes only for files that share a size, avoiding expensive hashing for unique files. This approach scales to large archives (the roadmap targets 1M+ files) while remaining accurate. The deduplicator reports both the number of duplicate groups and the total space savings achievable by removing redundant copies.
The PersonalRepoAnalyzer does not treat all repositories equally. It classifies each repository as fork, template, archived, private-original, or public-original using the classify_repo_type utility. For forks specifically, the detect_modifications function compares creation timestamps against push timestamps and checks for the presence of commits beyond the fork point, answering the critical question: is this fork actively modified, or is it an untouched copy?
This distinction matters operationally. During the initial audit of 42 personal repositories, 38 were identified as forks — but the reframing that emerged was that these forks were not clutter but an integration staging area. The Tribunal’s classification data made that reframing possible by quantifying the difference between evaluated forks (with modifications) and untouched forks (candidates for deletion).
main.py # CLI entry point (argparse)
|
+-- cognitive_tribunal/
| +-- modules/
| | +-- archive_scanner.py # Layer 0: Filesystem scanning + deduplication
| | +-- ai_context_aggregator.py # Layer 1: ChatGPT/generic JSON parsing
| | +-- personal_repo_analyzer.py # Layer 2: Personal GitHub repo analysis
| | +-- org_repo_analyzer.py # Layer 3: Org GitHub repo analysis
| | +-- web_bookmark_analyzer.py # Layer 4: Netscape bookmark parsing
| |
| +-- outputs/
| | +-- inventory.py # Unified inventory generator
| | +-- knowledge_graph.py # Knowledge graph + Cytoscape export
| | +-- triage_report.py # Prioritised triage (JSON + text)
| |
| +-- utils/
| +-- file_utils.py # FileClassifier, FileHasher, Deduplicator
| +-- github_utils.py # GitHubClient wrapper (PyGithub)
| +-- metadata.py # Dublin Core / DataCite metadata
|
+-- ingestion/ # Ingestion chambers (input staging)
| +-- ai-exports/ # ChatGPT/Claude export JSON files
| +-- archives/ # Filesystem archive path references
| +-- bookmarks/ # Netscape-format bookmark HTML exports
| +-- browser-tabs/ # Browser session JSON exports
|
+-- data/
| +-- genesis-conversations/ # 16 indexed genesis conversation JSONs
|
+-- scripts/ # Shell utilities
| +-- ingest_all.sh # Run full ingestion pipeline
| +-- sanitize_for_public.sh # Privacy sanitisation for publishing
| +-- publish_dataset.sh # Dataset publication workflow
| +-- generate_snapshot.sh # Point-in-time snapshot generation
| +-- convert_genesis_chat.py # Genesis chat format conversion
|
+-- context/ # Historical and planning context
| +-- history/ # Genesis transcripts, cognitive OS master plan
| +-- planning/ # Ingestion plans, integration queue, task hierarchy
|
+-- docs/ # Documentation
| +-- analysis/ # Phase 2 analysis summaries
| +-- guides/ # Integration quick start, usage guides
| +-- setup/ # GitHub token setup, configuration guides
|
+-- output/ # Generated outputs (.gitignored)
+-- *.json # Inventory, knowledge graph, triage report
Sources Modules Outputs
+--------------+ +--------------------+ +---------------------+
| iCloud |---+ | ArchiveScanner |-+ | |
| Dropbox | +--->| (classify, dedup) | | | inventory.json |
| Local dirs |---+ +--------------------+ | | (unified catalog) |
+--------------+ +--------------------+ | | |
| ChatGPT |---+ | AIContext | +->| knowledge_graph |
| exports | +--->| Aggregator |-+ | .json + cytoscape |
| Claude |---+ +--------------------+ | | |
+--------------+ +--------------------+ | | triage_report |
| Personal |------->| PersonalRepo |-+ | .json + .txt |
| GitHub | | Analyzer | | | |
+--------------+ +--------------------+ | | per-module |
| Org |------->+--------------------+ | | .json files |
| GitHub | | OrgRepo |-+ | |
+--------------+ | Analyzer | | +---------------------+
| Bookmark |------->+--------------------+ |
| HTML files | +--------------------+ |
+--------------+------->| WebBookmark |-+
| Analyzer |
+--------------------+
Each module writes its own per-module JSON output (archives.json, personal_repos.json, etc.) to the output directory. The three output generators — InventoryGenerator, KnowledgeGraphGenerator, and TriageReportGenerator — then consume these per-module results and produce the unified cross-layer outputs.
Scans local filesystem directories recursively with configurable depth limits and exclusion patterns. Default exclusions include .git, node_modules, __pycache__, and OS metadata files (.DS_Store, Thumbs.db). Classifies files into 12 categories (code, document, spreadsheet, image, video, audio, archive, data, database, notebook, config, other) using extension-based matching. Runs the two-pass deduplication engine on all discovered files. Exposes convenience methods for filtering results: get_files_by_category(), get_large_files(min_size_mb), get_old_files(days_old).
Parses ChatGPT conversation exports (JSON with nested mapping tree structure) and generic JSON conversation formats. Extracts message content, author roles, and timestamps. Sorts messages chronologically within each conversation. Provides search (search_conversations(keyword)), date-range filtering (get_conversations_by_date_range()), and topic extraction (extract_topics()) using word-frequency analysis across conversation titles.
Uses the GitHubClient wrapper around PyGithub to enumerate all repositories for a GitHub username. Classifies each repo by type (fork, template, archived, private-original, public-original). For forks, retrieves parent repository information and runs modification detection. Analyses commit history, dependency files, and activity metrics. Generates a triage report separating unmodified forks (deletion candidates), inactive repos (archival candidates), and active projects.
Mirrors the personal analyser but targets GitHub organisation repositories. Adds status determination (active/stale/abandoned/archived) based on time since last update. Generates an organisation health report with a 0-100 health score weighted by the ratio of active to stale/abandoned repos. Produces migration plans with prioritised recommendations. Provides a dependency summary across all organisation repositories.
Parses Netscape Bookmark File Format (the standard export format for Chrome, Firefox, and most browsers). Extracts URLs, titles, and addition timestamps using regex pattern matching. Reports total bookmark counts. Designed as the simplest module — a foundation for future enrichment with domain categorisation and link-rot detection.
repo scope (for GitHub analysis modules)git clone https://github.com/organvm-i-theoria/cognitive-archaelogy-tribunal.git
cd cognitive-archaelogy-tribunal
pip install -r requirements.txt
Copy and edit the example configuration:
cp config.example.yaml config.yaml
The configuration file specifies archive paths, AI export locations, GitHub usernames/organisations, output directory, and exclusion patterns. Alternatively, all parameters can be passed as CLI arguments.
Set your GitHub token as an environment variable:
export GITHUB_TOKEN="ghp_your_token_here"
For token creation instructions, see docs/setup/GITHUB_TOKEN_SETUP.md.
python main.py \
--scan-archives "/path/to/iCloud,/path/to/Dropbox" \
--ai-conversations /path/to/chatgpt/export \
--personal-repos your-username \
--org-repos your-org \
--web-bookmarks /path/to/bookmarks.html \
--output-dir ./output
# Archive scanning only
python main.py --scan-archives /path/to/archives --output-dir ./output
# AI conversation analysis only
python main.py --ai-conversations /path/to/exports --output-dir ./output
# Personal GitHub repos only
python main.py --personal-repos your-username --output-dir ./output
# Organisation repos only
python main.py --org-repos your-org-name --output-dir ./output
# Web bookmarks only
python main.py --web-bookmarks /path/to/bookmarks.html --output-dir ./output
# Skip knowledge graph generation
python main.py --personal-repos username --no-graph --output-dir ./output
# Skip triage report
python main.py --org-repos orgname --no-triage --output-dir ./output
# Inventory only (skip graph and triage)
python main.py --scan-archives /path --no-graph --no-triage --output-dir ./output
Before designing the eight-organ system, we needed a complete picture of what existed. Running the tribunal against both personal and organisation accounts:
python main.py \
--personal-repos 4444J99 \
--org-repos organvm-i-theoria \
--output-dir ./output/pre-governance-audit
This produced a 793KB output set revealing 85 total repositories (42 personal, 43 organisational). The critical insight: 38 of 42 personal repos were forks functioning as an evaluation staging area, not clutter. The triage report quantified this, enabling the reframing that shaped the entire migration strategy.
Scanning multiple cloud storage mounts to find wasted space:
python main.py \
--scan-archives "/Users/me/Library/Mobile Documents/com~apple~CloudDocs,/Users/me/Dropbox" \
--output-dir ./output/cloud-dedup
The deduplication engine groups files by size first, then computes SHA-256 hashes only for size-colliding files, reporting both the number of duplicate groups and the total recoverable disk space.
After exporting ChatGPT conversations (Settings > Data Controls > Export Data):
python main.py \
--ai-conversations ~/Downloads/chatgpt-export/ \
--output-dir ./output/ai-archaeology
The aggregator parses the nested mapping tree structure, extracts all messages chronologically, and runs topic extraction to surface recurring themes across hundreds of conversations.
| Output | Format | Description |
|---|---|---|
inventory.json |
JSON | Unified catalog of all discovered assets across all scanned layers |
knowledge_graph.json |
JSON | Node-edge graph with typed entities and relationships |
knowledge_graph_cytoscape.json |
Cytoscape.js JSON | Import-ready format for interactive graph visualisation |
triage_report.json |
JSON | Structured priority tiers (high/medium/low) with action items |
triage_report.txt |
Plain text | Human-readable executive summary for terminal or document inclusion |
archives.json |
JSON | Per-module archive scan results with deduplication data |
personal_repos.json |
JSON | Per-module personal repository analysis |
org_repos.json |
JSON | Per-module organisation repository analysis |
ai_conversations.json |
JSON | Per-module AI conversation aggregation |
web_bookmarks.json |
JSON | Per-module bookmark analysis |
All output files are written to the --output-dir directory. The output/ directory in the repository root is .gitignored — generated data stays local.
The Cognitive Archaeology Tribunal sits at the very top of the dependency chain within the eight-organ system. Its outputs were the empirical basis for the organ-system design itself:
Registry Foundation: The repo audit in 00-d-organ-system-audit.md and the registry-v2.json source of truth in the ingesting-organ-document-structure (planning corpus) planning corpus both trace their lineage to data this tool produced.
ORGAN-IV (Taxis) Orchestration: The tribunal’s triage reports feed directly into agentic-titan, the orchestration layer that governs cross-organ routing and governance. Health scores, status classifications, and migration plans from OrgRepoAnalyzer.generate_health_report() provide the quantitative inputs that ORGAN-IV needs to make routing decisions.
ORGAN-V (Logos) Public Process: The narrative of “pre-synthesis chaos to organised system” — documented through the tribunal’s outputs — forms one of the flagship essay themes in ORGAN-V’s public process content strategy.
Integration Queue: The fork classification data from PersonalRepoAnalyzer generated the 42-fork integration queue (context/planning/INTEGRATION_QUEUE.md) that drove the personal-to-org migration strategy, tiered into high/medium/low priority integrations.
The tribunal is designed to be re-runnable. As the eight-organ system evolves, periodic re-audits produce updated inventories that track drift, validate governance assumptions, and surface new triage recommendations.
The tribunal has been validated through two complete audit runs:
Unit tests exist for ArchiveScanner and WebBookmarkAnalyzer in the tests/ directory. The roadmap targets 80%+ test coverage by v0.5.0. Run existing tests with:
pytest tests/
The repository includes a Nervous Archaeologist agent definition — a GitHub Copilot agent configured for exhaustive data excavation with security-first governance. The agent operates under the tribunal’s scanning philosophy: zero stones unturned, every finding documented.
The tribunal follows a four-sprint development plan toward v1.0.0:
| Sprint | Focus | Key Deliverables | Status |
|---|---|---|---|
| Sprint 1 | Foundation and Governance | Metadata standards, privacy enhancement, browser tab module, data validation | In progress |
| Sprint 2 | Cross-Layer Synthesis | Synthesis engine, entity extraction, cross-layer knowledge graph, pattern detection | Planned |
| Sprint 3 | Visualisation and Testing | Cytoscape.js/D3.js visualisations, CI/CD pipeline, 80%+ test coverage, performance optimisation | Planned |
| Sprint 4 | Temporal Analysis | Concept drift detection, interest trajectory mapping, evolution visualisations | Planned |
Future backlog includes: plugin architecture, REST API (FastAPI), collaborative features, export bridges (Obsidian, Notion, Zotero, Anki), interactive web UI, and semantic search with RAG.
The full roadmap is maintained in ROADMAP.md.
CONTEXTUAL_RELAY.md — Session handoff document with current state, priorities, and quick commandsROADMAP.md — Full development roadmap with sprint breakdownscontext/planning/INTEGRATION_QUEUE.md — 42-fork integration tracker generated from tribunal outputdocs/analysis/PHASE2_REVISED_INTEGRATION_ANALYSIS.md — The critical reframing that forks are integration staging, not clutterContributions are welcome. The project uses Trunk-based linting (configuration in .trunk/trunk.yaml) with Ruff for Python, markdownlint for Markdown, and yamllint for YAML.
git checkout -b feature/your-feature)trunk checkpytest tests/For significant changes, open an issue first to discuss the approach. See CONTRIBUTING.md for detailed guidelines and docs/GOVERNANCE.md for the project’s governance framework.
This project is licensed under the MIT License. See LICENSE for the full text.
Part of ORGAN-I: Theoria within the eight-organ creative-institutional system coordinated by meta-organvm.
Portfolio · System Directory · ORGAN I · Theoria · Part of the ORGANVM eight-organ system