Builtin Knowledge Spec
Date: 14.mar.2026
Status: Draft planning spec
Purpose
Define the technical contract for Clawdie's built-in local knowledge package.
This document covers:
- what content belongs in built-in knowledge
- how that content is chunked
- how embeddings are versioned
- what metadata the package must carry
- what bootstrap import depends on
This document does not define user memory.
Scope
Built-in knowledge is the shipped local knowledge layer used during install, onboarding, and operator support before any production LLM provider is added.
It should include:
- install/setup guides
- onboarding locale/help text used by the FreeBSD bsddialog path
- operator/system docs needed during bootstrap
- built-in product skills
- curated support/reference material that Clawdie ships intentionally
It should not include:
- user notes
- user preferences
- runtime conversation history
- agent-specific personal memory
Source Allowlist
Initial source allowlist:
- .agent/skills/
- docs/ files used directly by install/onboarding/operator flows
- setup/locale/onboarding metadata that the FreeBSD-native selector needs
- generated or checked-in locale inventory metadata derived from FreeBSD sources such as locale -a, login.conf, and /usr/share/locale/ when onboarding consumes that data directly
Initial exclusions:
- docs/sessions/
- generated logs
- screenshots and media-only files
- examples unrelated to core Clawdie install/runtime behavior
The allowlist should stay explicit. Do not index the entire repo by default.
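An explicit allowlist can be expressed as a small include/exclude matcher. The sketch below is illustrative only: the function name `is_indexable`, the prefix tuples, and the denied suffixes are assumptions, and the `docs/` prefix is broader than the spec intends (it would still need narrowing to the files that install, onboarding, and operator flows actually use).

```python
from pathlib import PurePosixPath

# Hypothetical prefixes mirroring the initial allowlist and exclusions above.
# "docs/" is deliberately coarse here; a real list would name files or subdirs.
ALLOW_PREFIXES = (".agent/skills/", "docs/")
DENY_PREFIXES = ("docs/sessions/",)
# Assumed suffixes standing in for logs, screenshots, and media-only files.
DENY_SUFFIXES = {".log", ".png", ".jpg", ".jpeg", ".gif", ".mp4"}


def is_indexable(path: str) -> bool:
    """Return True only for repo-relative paths the allowlist names explicitly."""
    posix = PurePosixPath(path).as_posix()
    # Exclusions win over inclusions, so docs/sessions/ never slips in via docs/.
    if any(posix.startswith(prefix) for prefix in DENY_PREFIXES):
        return False
    if PurePosixPath(posix).suffix.lower() in DENY_SUFFIXES:
        return False
    # Anything not matched by an allow prefix is rejected by default.
    return any(posix.startswith(prefix) for prefix in ALLOW_PREFIXES)
```

Because the final check is an allow match rather than a deny match, a new top-level directory stays out of the package until someone adds it on purpose.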
Package Structure
The built-in knowledge package should be shipped as:
- SQL import artifact
- metadata JSON
Recommended location:
- bootstrap/skills-memory/artifact.sql
- bootstrap/skills-memory/metadata.json
This can later move to a release asset if size becomes a problem.
Metadata Contract
The metadata file must include at least:
- artifact_version
- schema_version
- source_snapshot
- chunker_name
- chunker_version
- chunk_size
- chunk_overlap
- embedding_provider
- embedding_model
- embedding_dimensions
- generated_at
Optional but useful:
- document_count
- chunk_count
- embedding_count
- git_commit
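A bootstrap step could check the required fields before trusting the package. This is a minimal sketch: the key names come from the list above, while the function name `missing_metadata_keys` and the choice to report every missing key at once are assumptions.

```python
import json

# Required keys taken from the metadata contract above.
REQUIRED_KEYS = {
    "artifact_version", "schema_version", "source_snapshot",
    "chunker_name", "chunker_version", "chunk_size", "chunk_overlap",
    "embedding_provider", "embedding_model", "embedding_dimensions",
    "generated_at",
}


def missing_metadata_keys(raw_json: str) -> list[str]:
    """Return the sorted required keys that are absent from a metadata document."""
    data = json.loads(raw_json)
    if not isinstance(data, dict):
        raise ValueError("metadata must be a JSON object")
    return sorted(REQUIRED_KEYS - set(data))
```

An empty result means the required part of the contract is satisfied. The optional count fields are not checked here.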
Chunking Contract
Chunking is part of the package definition and must be versioned.
Initial policy:
- split by heading/section first
- enforce maximum chunk size after structural split
- keep overlap minimal
Do not make chunking user-configurable during install.
If chunking changes, build a new package version.
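The initial policy can be sketched as a two-stage splitter. The code below assumes Markdown-style `#` headings and measures size in characters; both are assumptions, since the spec has not yet chosen a chunk size, an overlap, or a unit. The function names are hypothetical.

```python
import re

# Assumed heading syntax: Markdown ATX headings at the start of a line.
HEADING = re.compile(r"^#{1,6} ", re.MULTILINE)


def split_sections(text: str) -> list[str]:
    """Split text at heading lines, keeping each heading with its body."""
    starts = [m.start() for m in HEADING.finditer(text)]
    if not starts or starts[0] != 0:
        starts.insert(0, 0)  # keep any preamble before the first heading
    bounds = starts + [len(text)]
    sections = [text[a:b].strip() for a, b in zip(bounds, bounds[1:])]
    return [s for s in sections if s]


def chunk_text(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Structural split first, then enforce the maximum size per section."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for section in split_sections(text):
        if len(section) <= chunk_size:
            chunks.append(section)
            continue
        # Oversized section: slide a fixed window with the configured overlap.
        for start in range(0, len(section), step):
            chunks.append(section[start:start + chunk_size])
            if start + chunk_size >= len(section):
                break
    return chunks
```

Overlap only applies inside an oversized section, so short sections are never duplicated across chunks. Any change to `HEADING`, `chunk_size`, or `chunk_overlap` would count as a chunking change and call for a new package version.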
Embedding Contract
Embeddings are generated before user install.
Allowed generation paths:
- maintainer machine
- CI pipeline
- later: local rebuild path on target host
Initial product rule:
- install must not depend on live embedding generation
Preferred long-term local rebuild direction:
- llama.cpp
Database Contract
The db jail stores and serves built-in knowledge.
The db jail should not be treated as the primary place where embeddings are
generated during first install.
Bootstrap must be able to:
- create schema
- import SQL package
- read metadata
- verify row counts
Current schema file:
- docs/sql/builtin-knowledge-base.sql
Current base tables:
- builtin_knowledge_artifacts
- builtin_knowledge_documents
- builtin_knowledge_chunks
- builtin_knowledge_embeddings
Current base lookup function:
search_builtin_knowledge(query_text, max_results)
Import Contract
Default bootstrap flow:
- create {agent}-db
- install PostgreSQL + pgvector
- apply schema
- import built-in knowledge SQL
- verify metadata and row counts
If import fails:
- stop bootstrap
- report the missing or invalid package
- do not silently fall back to live embedding generation
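The verification step and the stop-on-failure rule might look like the sketch below. It assumes the optional count fields in the metadata correspond to the documents, chunks, and embeddings tables, and that the caller has already read the actual row counts from the database. The exception and function names are hypothetical.

```python
class BootstrapImportError(RuntimeError):
    """Raised to stop bootstrap when the built-in package is missing or invalid."""


# Assumed mapping from optional metadata count fields to the base tables.
COUNT_FIELDS = {
    "document_count": "builtin_knowledge_documents",
    "chunk_count": "builtin_knowledge_chunks",
    "embedding_count": "builtin_knowledge_embeddings",
}


def verify_row_counts(metadata: dict, table_counts: dict) -> None:
    """Compare counts recorded in metadata with counts read back after import."""
    problems = []
    for field, table in COUNT_FIELDS.items():
        if field not in metadata:
            continue  # count fields are optional in the metadata contract
        actual = table_counts.get(table)
        if actual != metadata[field]:
            problems.append(f"{table}: expected {metadata[field]}, found {actual}")
    if problems:
        # No fallback path: the caller is expected to abort and report this.
        raise BootstrapImportError("; ".join(problems))
```

Raising instead of returning a status makes it hard for a caller to ignore a bad import and continue into live embedding generation by accident.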
Separation From User Memory
Built-in knowledge and user memory must remain distinct.
At minimum, keep separate lifecycle and provenance.
Later implementation options:
- separate tables
- separate namespaces
- base + overlay model for built-in knowledge updates
Update Model
Built-in knowledge starts as a shipped package.
Locale-aware onboarding adds one more requirement: the packaged install
knowledge must stay aligned with the reviewed locale set and the
bsdinstall-style onboarding screens. If the locale selector or onboarding copy
changes, the built-in package should refresh alongside it.
Later updates can come from:
- local git pull
- local Gitea mirror
- maintainer rebuild and shipped package refresh
Recommended future model:
- detect changed source files
- re-chunk changed files
- re-embed changed chunks
- update built-in knowledge without mixing with user memory
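The first step of the future model, detecting changed source files, could rest on content digests stored with each build. This is a sketch under stated assumptions: SHA-256 as the fingerprint, a path-to-digest map kept from the previous build, and the hypothetical names `content_digest` and `diff_sources`.

```python
import hashlib


def content_digest(data: bytes) -> str:
    """Stable fingerprint of a source file's bytes (SHA-256 is an assumption)."""
    return hashlib.sha256(data).hexdigest()


def diff_sources(previous: dict, current: dict) -> dict:
    """Compare path -> digest maps from the last build and the current tree."""
    changed = sorted(p for p in current if p in previous and current[p] != previous[p])
    added = sorted(p for p in current if p not in previous)
    removed = sorted(p for p in previous if p not in current)
    return {"changed": changed, "added": added, "removed": removed}
```

Only paths listed under `changed` or `added` would need re-chunking and re-embedding. Paths under `removed` would have their built-in rows dropped, and user memory is never touched.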
Initial Non-Goals
- no mandatory online provider during install
- no mandatory local model during install
- no indexing of all repo content
- no committing raw PostgreSQL data directories to git
Next Decisions
- Finalize the exact source allowlist
- Choose the first chunk size and overlap
- Decide the initial embedding provider for package generation
- Decide whether built-in knowledge starts as base only or base + overlay