layered-soul/skills/codebase-knowledge-graphs/SKILL.md
Sam & Claude 4d8ce07fa7 docs: apply Prettier to current markdown (Sam & Codex)
Normalize markdown formatting after the latest main updates.\n\nChecks: python3 scripts/layered_soul.py validate .; npx --yes prettier@3 --check '**/*.md'; git diff --check.
2026-06-14 01:48:32 +02:00


---
name: codebase-knowledge-graphs
description: "Build/query persistent codebase knowledge graphs for architecture discovery, cross-repo impact analysis, and agent handoffs."
version: 1.1.0
author: Hermes Agent
license: MIT
platforms: [linux, macos, freebsd]
metadata:
  hermes:
    tags:
      [codebase-analysis, knowledge-graphs, architecture, graphify, cross-repo]
    related_skills: [writing-plans, codebase-inspection, requesting-code-review]
---
# Codebase Knowledge Graphs
## Overview
Use this skill when a task benefits from a persistent map of a repository or a set of related repositories: architecture discovery, cross-repo dependency questions, agent onboarding, impact analysis, or long-lived project navigation.
Default tool: **Graphify** (`graphifyy` package, `graphify` CLI). It creates:
```text
graphify-out/graph.json
graphify-out/graph.html
graphify-out/GRAPH_REPORT.md
```
Treat the graph as a navigation aid, not an authority. Before editing code, verify graph answers against source files.
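Before querying, a quick presence check on the three artifacts listed above can save a confusing error later. The following is a minimal sketch, assuming the default `graphify-out/` location; the function name `graphify_outputs_present` is hypothetical and not part of Graphify.

```sh
# Hypothetical helper: succeed only if all three default Graphify artifacts exist.
# Usage: graphify_outputs_present [project-dir]
graphify_outputs_present() (
  dir="${1:-.}/graphify-out"
  status=0
  for f in graph.json graph.html GRAPH_REPORT.md; do
    if [ ! -f "$dir/$f" ]; then
      echo "missing: $dir/$f" >&2
      status=1
    fi
  done
  exit "$status"
)
```

The check only confirms that the files exist. It says nothing about whether the graph is current or correct.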
## When to Use
Use for:
- Broad architecture questions: "how does X connect to Y?"
- Multi-agent projects where agents repeatedly lose context.
- Cross-repo flows, e.g. source repo -> build repo -> deployment repo.
- Impact analysis before changing scripts, skills, manifests, deployment contracts, or public interfaces.
- Producing a durable project map for future sessions.
Do not use as the first tool for:
- A tiny local grep/read-file task.
- Security-sensitive repositories where generated graph artifacts would leak secrets or paths.
- Critical runtime/boot paths until dependency and offline-install behavior has been tested.
## Setup and Commands
Prefer `uvx` for one-off runs without permanently installing:
```sh
uvx --from graphifyy graphify --help
```
Build a graph for a project (AST-only, no API key needed):
```sh
cd <project> && uvx --from graphifyy graphify update .
```
This runs AST extraction only: it parses all source files and builds the graph with function, import, and define edges. No LLM is required. Semantic extraction (LLM-powered chunk labeling) needs an API key and `graphify extract`; see "Pitfalls" below.
Query an existing graph:
```sh
uvx --from graphifyy graphify query "how does deployment work?" --graph graphify-out/graph.json
uvx --from graphifyy graphify path "hostd" "webroot" --graph graphify-out/graph.json
uvx --from graphifyy graphify explain "iso-publish" --graph graphify-out/graph.json
```
Export an architecture/call-flow page:
```sh
uvx --from graphifyy graphify export callflow-html
```
Merge graphs for cross-repo questions:
```sh
uvx --from graphifyy graphify merge-graphs \
  ../repo-a/graphify-out/graph.json \
  ../repo-b/graphify-out/graph.json \
  --out graphify-out/merged-graph.json
```
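A merge only makes sense once every participating repo has its own `graph.json`. As a hedged sketch, the hypothetical helper `collect_graphs` below prints the graph path for each repo that has one and warns about the rest. It assumes the default `graphify-out/graph.json` location and repo paths without spaces.

```sh
# Hypothetical helper: print the graph.json path for each repo that has one,
# and warn on stderr about repos that still need `graphify update .`.
# Usage: collect_graphs <repo-dir>...
collect_graphs() (
  for repo in "$@"; do
    graph="$repo/graphify-out/graph.json"
    if [ -f "$graph" ]; then
      printf '%s\n' "$graph"
    else
      echo "no graph in $repo; build it first" >&2
    fi
  done
)

# Example (word splitting on the output is intentional):
#   uvx --from graphifyy graphify merge-graphs \
#     $(collect_graphs ../repo-a ../repo-b) --out graphify-out/merged-graph.json
```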
## Repository Integration Pattern
Default posture: generated graph artifacts are a temporary map, not the territory. The source code is the durable truth; regenerate graphs locally when needed instead of committing stale JSON.
For a mature repo that repeatedly uses Graphify, consider adding:
```text
.graphifyignore
docs/GRAPHIFY.md
scripts/graphify-refresh.sh
```
Do **not** add a wrapper script before Graphify has been used on at least one real debugging/navigation task in that repo. First use the tool manually, observe what was annoying, then script the recurring parts.
Recommended `.gitignore` additions:
```text
graphify-out/
*.graph.json
```
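If these entries are added by script rather than by hand, it helps to make the step idempotent. A minimal sketch follows; the helper name `add_gitignore_entries` is hypothetical, and it assumes the existing file ends with a newline.

```sh
# Hypothetical helper: append each entry to a gitignore file only if that exact
# line is not already present.
# Usage: add_gitignore_entries <gitignore-path> <entry>...
add_gitignore_entries() (
  file="$1"
  shift
  touch "$file"
  for entry in "$@"; do
    # -x matches the whole line, -F treats the entry as a literal string
    grep -qxF -- "$entry" "$file" || printf '%s\n' "$entry" >> "$file"
  done
)

# Example: add_gitignore_entries .gitignore 'graphify-out/' '*.graph.json'
```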
Commit `graphify-out/graph.json`, `GRAPH_REPORT.md`, `graph.html`, or other generated graph output only with an explicit project decision. Reasons to avoid committing by default:
- Graphs get stale as soon as source changes.
- Generated JSON creates noisy diffs and harder reviews.
- Checked-in graph output looks more authoritative than it is, even though Graphify can produce fake/noisy nodes or guessed edges.
Do not commit local caches, cost/mtime manifests, or generated graph output for Clawdie-ISO unless the project explicitly reverses this rule.
**Clawdie-ISO policy (2026-05-23):** Graphify is prohibited entirely in the ISO repo. Do NOT add `.graphifyignore`, `docs/GRAPHIFY.md`, skills, or any graph-related files to `clawdie-iso`. The repo's composition (shell scripts, markdown, and archived planning docs) produces a graph that misleads agents toward retired decisions and a deprecated QML installer. Clawdie-AI allows on-demand, Linux-local use but still avoids formal integration. The Colibri and Herdr repos have no graph policy; use Graphify there at your discretion.
## `.graphifyignore` Guidance
Always exclude:
```text
.git/
tmp/
node_modules/
dist/
build/
.cache/
.env
*.key
*.pem
*.sqlite
*.db
```
For ISO/build repos, also exclude:
```text
*.img
*.img.gz
*.iso
*.sha256
packages/
downloads/
html/
webroot/
```
Include source, docs, skills, scripts, package lists, and manifest schemas.
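The "always exclude" list above can be written out as a starter file. This is a sketch only: the helper name `write_graphifyignore` is hypothetical, and it deliberately leaves an existing `.graphifyignore` untouched so that project-specific entries are not lost.

```sh
# Hypothetical helper: write a starter .graphifyignore unless one already exists.
# Usage: write_graphifyignore [project-dir]
write_graphifyignore() (
  target="${1:-.}/.graphifyignore"
  if [ -e "$target" ]; then
    echo "keeping existing $target" >&2
    exit 0
  fi
  cat > "$target" <<'EOF'
.git/
tmp/
node_modules/
dist/
build/
.cache/
.env
*.key
*.pem
*.sqlite
*.db
EOF
)
```

For ISO/build repos, append the second list by hand after reviewing which of those directories actually exist in the repo.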
## Agent Usage Rules
1. If `graphify-out/graph.json` exists and the task is broad, query the graph before deep grep.
2. Use scoped questions and explicit graph paths.
3. Cite graph findings as leads, not facts.
4. Verify relevant source files before proposing or making code changes.
5. Regenerate the graph when the repo has materially changed and graph freshness matters.
6. Keep graph generation out of critical boot/build paths unless tested on the target platform.
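Rule 5 needs some working definition of "materially changed". One rough option, sketched below, is to compare file modification times against `graph.json`. The helper name `graph_is_stale` is hypothetical, and modification time is only an approximation: a branch switch or a fresh checkout can make every file look newer.

```sh
# Hypothetical helper: exit 0 if the graph is missing or any file in the project
# is newer than it, exit 1 if nothing is newer, exit 2 if the directory is unusable.
# Usage: graph_is_stale [project-dir]
graph_is_stale() (
  cd "${1:-.}" || exit 2
  graph="graphify-out/graph.json"
  # A missing graph counts as stale.
  [ -f "$graph" ] || exit 0
  # Ignore git internals and Graphify's own output when looking for newer files.
  newer=$(find . -type f -newer "$graph" \
    ! -path './.git/*' ! -path './graphify-out/*' | head -n 1)
  [ -n "$newer" ]
)

# Example:
#   if graph_is_stale .; then uvx --from graphifyy graphify update .; fi
```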
## Pitfalls
### Build with `update`, not `extract` (different commands, different requirements)
`graphify update` and `graphify extract` are **different commands**:
| Command | API key? | What it does | Use case |
| ---------------------------- | ---------------------------- | -------------------------------------------------------------- | ------------------------------------ |
| `graphify update .` | No | AST-only: parses source files, builds import/call/define graph | Primary build command — always works |
| `graphify extract . --out .` | Yes (Gemini/DeepSeek/OpenAI) | AST + semantic: LLM-powered chunk labeling on top of AST graph | Only when semantic labels are needed |
`graphify extract` may not appear in `graphify --help` (version-dependent). Always try `graphify update .` first — it produces a fully usable graph with zero configuration. A 971-file TypeScript repo produced 11,277 nodes and 16,820 edges with `update` alone.
If you accidentally run `extract` without an API key, you'll get:
```text
error: no LLM API key found. Set GEMINI_API_KEY or GOOGLE_API_KEY (gemini), ... or pass --backend.
```
Fix: use `graphify update .` instead.
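A wrapper can avoid this error by falling back to the AST-only build when no key is present. The sketch below uses a hypothetical helper, `graphify_build_args`, and checks only the two variable names quoted in the error message above; other backends' variables are elided in that message and are not guessed here.

```sh
# Hypothetical helper: print the Graphify subcommand and arguments to use.
# Usage: graphify_build_args [ast|semantic]
graphify_build_args() {
  if [ "${1:-ast}" = "semantic" ]; then
    # Only the two variable names quoted in the error message are checked.
    if [ -n "${GEMINI_API_KEY:-}" ] || [ -n "${GOOGLE_API_KEY:-}" ]; then
      echo "extract . --out ."
      return 0
    fi
    echo "no Gemini/Google API key set; falling back to AST-only build" >&2
  fi
  echo "update ."
}

# Example (word splitting on the output is intentional):
#   uvx --from graphifyy graphify $(graphify_build_args semantic)
```

The default stays `update`, in line with the advice to request semantic extraction only when the labels are actually needed.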
### Create `.graphifyignore` BEFORE the first build
Without `.graphifyignore`, Graphify walks every file in the repo, including build artifacts, caches, and vendored dependencies. On a JS/TS project this means parsing `node_modules/`, `dist/`, `tmp/`, and similar directories: thousands of irrelevant files that bloat the graph and slow the build.
**Always create `.graphifyignore` before the first `graphify update`.** Use the templates in the "`.graphifyignore` Guidance" section above. A 971-file repo with proper exclusions produces a clean graph; without exclusions the graph can easily be 10x larger, most of it noise nodes.
### DeepSeek / OpenAI-compatible backends need the `openai` package (semantic extraction only)
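When using `graphify extract` with `--backend deepseek` (or any OpenAI-compatible backend), the semantic phase requires the `openai` Python package. `uvx` runs Graphify in an isolated environment that does not include it:
```text
Gemini/Kimi/Ollama/OpenAI-compatible extraction requires the openai package. Run: pip install openai
```
**Fix:** With `uvx`, add the package to the ephemeral environment using `--with`. A system-wide `pip install --break-system-packages openai` only helps when Graphify itself is installed system-wide, because `uvx` does not see system packages:
```sh
uvx --from graphifyy --with openai graphify extract . --out . --backend deepseek --model deepseek-chat
```
Or install both packages into a dedicated venv and run Graphify from there:
```sh
python3 -m venv /tmp/graphify-venv
/tmp/graphify-venv/bin/pip install graphifyy openai
/tmp/graphify-venv/bin/graphify extract . --out . --backend deepseek --model deepseek-chat
```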
AST-only (`graphify update`) does not need this. Only the LLM-powered semantic chunk labeling in `extract` does.
### Cross-repo AST-only graphs cannot trace data flow across repo boundaries
`graphify update` builds a graph from AST edges: imports, function calls, defines, contains. These edges only exist within a single codebase. Cross-repo data flow — e.g., TypeScript config values in one repo consumed by shell scripts in another — does **not** produce graph edges because the AST parser cannot see across repository boundaries.
`graphify path` will return "No path found" for cross-repo queries. The merged graph still has value (you can query both repos in one graph), but use `graphify explain` on individual nodes and trace the human-documented contracts (AGENTS.md, handoff docs) for cross-repo connections. The graph shows you _what exists where_; it does not replace cross-repo documentation.
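When `graphify path` finds nothing across a repo boundary, a plain text search for the shared identifier is often the fastest way to locate both ends of the contract. The following is a sketch with a hypothetical helper name, `find_cross_repo_mentions`. It assumes a `grep` that supports `-r`, `-I`, and `--exclude-dir` (GNU and BSD grep both do; strict POSIX grep does not).

```sh
# Hypothetical helper: list files in the given repos that mention a term.
# Exits non-zero when nothing matches, like grep itself.
# Usage: find_cross_repo_mentions <term> <repo-dir>...
find_cross_repo_mentions() (
  term="$1"
  shift
  # -r recurse, -I skip binary files, -l print file names only, -F literal match
  grep -rIlF --exclude-dir=.git --exclude-dir=graphify-out -- "$term" "$@"
)

# Example: find_cross_repo_mentions webroot ../repo-a ../repo-b
```

Treat the hits the same way as graph results: as leads to read in the source, then check against the documented contracts.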
## FreeBSD and Offline Caution
Graphify is Python-based and depends on packages such as `tree-sitter-*`, `networkx`, `rapidfuzz`, and related scientific/runtime dependencies. On FreeBSD or offline images:
- Start by treating Graphify as optional host/developer tooling, not a mandatory runtime dependency.
- Investigate target-platform package availability before proposing runtime/ISO integration.
- Test `uvx --from graphifyy graphify --help` and a small `graphify update` run on FreeBSD before adding it to an ISO.
- If offline use is required, cache/test wheels or ports on the target FreeBSD version first.
- Do not block image build, boot, USB flashing, or webroot publishing on graph generation.
For Clawdie/FreeBSD-specific package findings and the recommended smoke test, see `references/graphify-freebsd-clawdie.md`.
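One way to honor the "do not block" rule in a build script is to run graph generation through a wrapper that reports failure but never propagates it. A minimal sketch, with the hypothetical helper name `run_optional`:

```sh
# Hypothetical helper: run a command, but never let its failure fail the caller.
# Usage: run_optional <label> <command> [args...]
run_optional() (
  label="$1"
  shift
  if ! "$@"; then
    echo "warning: optional step '$label' failed; continuing" >&2
  fi
  exit 0
)

# Example:
#   run_optional graphify uvx --from graphifyy graphify update .
```

Because the command is tested inside an `if`, the wrapper also stays safe in scripts that use `set -e`.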
## Validation
```sh
sh -n scripts/graphify-refresh.sh
uvx --from graphifyy graphify query "what are the main deployment paths?" --graph graphify-out/graph.json
python3 -m json.tool graphify-out/graph.json >/dev/null
```
For generated docs, open or serve `graphify-out/graph.html` only after confirming it does not expose secrets.
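A coarse pattern scan can support that confirmation, though it cannot replace reading the output. In the sketch below, the helper name `graph_output_has_secrets` and the pattern list are illustrative assumptions. The patterns will miss many real secrets and will also flag harmless identifiers, so a hit means "look closer", not "leak confirmed".

```sh
# Hypothetical helper: exit 0 and list files if anything in the output directory
# matches a few obvious secret patterns; exit non-zero if nothing matches.
# Usage: graph_output_has_secrets [output-dir]
graph_output_has_secrets() (
  dir="${1:-graphify-out}"
  # -r recurse, -I skip binary files, -l print file names only, -E extended regex
  grep -rIlE -- \
    '-----BEGIN [A-Z ]*PRIVATE KEY-----|(API_KEY|SECRET|PASSWORD|TOKEN)[A-Za-z_]*"?[[:space:]]*[=:]' \
    "$dir"
)

# Example:
#   if graph_output_has_secrets graphify-out; then echo "review before serving" >&2; fi
```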
## References
- `references/clawdie-graphify-integration.md` — concrete Clawdie-AI/Clawdie-ISO integration plan and boundaries.
- `references/graphify-freebsd-clawdie.md` — FreeBSD package-availability findings, smoke-test commands, and Clawdie ISO/runtime boundary.
- `references/clawdie-multi-repo-agents-structure.md` — Clawdie's four-repo layout (AI, ISO, Colibri, herdr), AGENTS.md conventions per repo, platform split, and cross-repo update procedures.
- `references/clawdie-cross-repo-graph-sizes.md` — concrete node/edge counts from the 2026-05-27 Clawdie cross-repo graph build; merge command and cross-repo visibility limitations.