Builtin Knowledge Spec

Date: 14.mar.2026

Status: Draft planning spec

Purpose

Define the technical contract for Clawdie's built-in local knowledge package.

This document covers:

  • what content belongs in built-in knowledge
  • how that content is chunked
  • how embeddings are versioned
  • what metadata the package must carry
  • what bootstrap import depends on

This document does not define user memory.

Scope

Builtin knowledge is the shipped local knowledge layer used during install, onboarding, and operator support before any production LLM provider is added.

It should include:

  • install/setup guides
  • onboarding locale/help text used by the FreeBSD bsddialog path
  • operator/system docs needed during bootstrap
  • built-in product skills
  • curated support/reference material that Clawdie ships intentionally

It should not include:

  • user notes
  • user preferences
  • runtime conversation history
  • agent-specific personal memory

Source Allowlist

Initial source allowlist:

  • .agent/skills/
  • docs/ files used directly by install/onboarding/operator flows
  • setup/ locale/onboarding metadata that the FreeBSD-native selector needs
  • generated or checked-in locale inventory metadata derived from FreeBSD sources such as locale -a, login.conf, and /usr/share/locale/ when onboarding consumes that data directly

Initial exclusions:

  • docs/sessions/
  • generated logs
  • screenshots and media-only files
  • examples unrelated to core Clawdie install/runtime behavior

The allowlist should stay explicit. Do not index the entire repo by default.
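
As a rough illustration of what "explicit" can mean in practice, the sketch below filters repository paths against an allowlist and an exclusion list. The `ALLOWED_FILES` entries and the excluded file extensions are hypothetical placeholders, since the spec leaves the exact allowlist as an open decision; only `.agent/skills/` and `docs/sessions/` come from the lists above.

```python
from pathlib import PurePosixPath

# From the spec: the skills directory is allowed as a whole.
ALLOWED_PREFIXES = (".agent/skills/",)
# Hypothetical file names: the spec only says "docs/ files used directly by
# install/onboarding/operator flows" and does not list them.
ALLOWED_FILES = {"docs/INSTALL.md", "docs/ONBOARDING.md"}
# From the spec: session notes are excluded.
EXCLUDED_PREFIXES = ("docs/sessions/",)
# Assumed extensions standing in for "generated logs" and "media-only files".
EXCLUDED_SUFFIXES = {".log", ".png", ".jpg", ".jpeg", ".gif", ".mp4"}


def is_indexable(repo_path: str) -> bool:
    """Return True only for paths that are explicitly allowed and not excluded."""
    path = PurePosixPath(repo_path)
    text = path.as_posix()
    if path.suffix.lower() in EXCLUDED_SUFFIXES:
        return False
    if any(text.startswith(prefix) for prefix in EXCLUDED_PREFIXES):
        return False
    if text in ALLOWED_FILES:
        return True
    return any(text.startswith(prefix) for prefix in ALLOWED_PREFIXES)
```

Anything not matched by an allow rule is rejected, so new directories stay out of the package until someone adds them deliberately.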

Package Structure

The built-in knowledge package should be shipped as:

  • SQL import artifact
  • metadata JSON

Recommended location:

  • bootstrap/skills-memory/artifact.sql
  • bootstrap/skills-memory/metadata.json

This can later move to a release asset if size becomes a problem.

Metadata Contract

The metadata file must include at least:

  • artifact_version
  • schema_version
  • source_snapshot
  • chunker_name
  • chunker_version
  • chunk_size
  • chunk_overlap
  • embedding_provider
  • embedding_model
  • embedding_dimensions
  • generated_at

Optional but useful:

  • document_count
  • chunk_count
  • embedding_count
  • git_commit
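
A bootstrap step could check the required fields before importing anything. The following is a minimal sketch, assuming the metadata file is a flat JSON object; the function name `missing_metadata_keys` is illustrative and not part of the spec.

```python
import json

# Required fields, copied from the metadata contract above.
REQUIRED_KEYS = (
    "artifact_version",
    "schema_version",
    "source_snapshot",
    "chunker_name",
    "chunker_version",
    "chunk_size",
    "chunk_overlap",
    "embedding_provider",
    "embedding_model",
    "embedding_dimensions",
    "generated_at",
)


def missing_metadata_keys(raw_json: str) -> list[str]:
    """Return the required keys that are absent from the metadata JSON text."""
    metadata = json.loads(raw_json)
    return [key for key in REQUIRED_KEYS if key not in metadata]
```

An empty result means the required part of the contract is satisfied; the optional count fields are deliberately not checked here.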

Chunking Contract

Chunking is part of the package definition and must be versioned.

Initial policy:

  • split by heading/section first
  • enforce maximum chunk size after structural split
  • keep overlap minimal

Do not make chunking user-configurable during install.

If chunking changes, build a new package version.
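
The initial policy can be sketched as a two-pass chunker: split on Markdown headings, then cap oversized sections with a small sliding overlap. This is only an illustration. The default `max_chars` and `overlap` values are placeholders (the spec has not chosen them yet), and measuring size in characters rather than tokens is an assumption.

```python
import re


def chunk_markdown(text: str, max_chars: int = 1200, overlap: int = 100) -> list[str]:
    """Split on headings first, then cap each section at max_chars."""
    if not 0 <= overlap < max_chars:
        raise ValueError("overlap must be >= 0 and smaller than max_chars")
    # Structural split: begin a new section at each line that starts a heading.
    sections = re.split(r"(?m)^(?=#{1,6} )", text)
    step = max_chars - overlap
    chunks = []
    for section in sections:
        section = section.strip()
        if not section:
            continue
        if len(section) <= max_chars:
            chunks.append(section)
            continue
        # Size cap: slide a window so consecutive pieces share `overlap` chars.
        for start in range(0, len(section), step):
            chunks.append(section[start:start + max_chars])
            if start + max_chars >= len(section):
                break
    return chunks
```

Because the output depends on both parameters, any change to them would count as a chunking change and call for a new package version.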

Embedding Contract

Embeddings are generated before user install.

Allowed generation paths:

  • maintainer machine
  • CI pipeline
  • later: local rebuild path on target host

Initial product rule:

  • install must not depend on live embedding generation

Preferred long-term local rebuild direction:

  • llama.cpp

Database Contract

The db jail stores and serves built-in knowledge.

The db jail is not the primary place to generate embeddings during first install. It imports embeddings that were generated earlier and shipped in the package.

Bootstrap must be able to:

  1. create schema
  2. import SQL package
  3. read metadata
  4. verify row counts

Current schema file:

  • docs/sql/builtin-knowledge-base.sql

Current base tables:

  • builtin_knowledge_artifacts
  • builtin_knowledge_documents
  • builtin_knowledge_chunks
  • builtin_knowledge_embeddings

Current base lookup function:

  • search_builtin_knowledge(query_text, max_results)

Import Contract

Default bootstrap flow:

  1. create {agent}-db
  2. install PostgreSQL + pgvector
  3. apply schema
  4. import built-in knowledge SQL
  5. verify metadata and row counts

If import fails:

  • stop bootstrap
  • report the missing or invalid package
  • do not silently fall back to live embedding generation
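
The verification step and the fail-stop rule might look like the sketch below. It assumes the optional count fields in the metadata map one-to-one onto the base tables (the spec does not state this mapping explicitly), and that the caller has already queried the actual row counts. `BootstrapImportError` and `verify_row_counts` are hypothetical names.

```python
class BootstrapImportError(RuntimeError):
    """Raised to stop bootstrap when the package is missing or invalid."""


# Assumed mapping from optional metadata count fields to base tables.
COUNT_FIELDS = {
    "document_count": "builtin_knowledge_documents",
    "chunk_count": "builtin_knowledge_chunks",
    "embedding_count": "builtin_knowledge_embeddings",
}


def verify_row_counts(metadata: dict, table_counts: dict) -> None:
    """Compare declared counts with imported row counts; raise on mismatch."""
    for field, table in COUNT_FIELDS.items():
        expected = metadata.get(field)
        if expected is None:
            continue  # count fields are optional in the metadata contract
        actual = table_counts.get(table)
        if actual != expected:
            raise BootstrapImportError(
                f"{table}: expected {expected} rows, found {actual}"
            )
```

Raising instead of returning a flag keeps the caller from continuing by accident, which matches the rule that bootstrap must stop rather than fall back to live embedding generation.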

Separation From User Memory

Built-in knowledge and user memory must remain distinct.

At minimum, the two must have separate lifecycles and separate provenance.

Later implementation options:

  • separate tables
  • separate namespaces
  • base + overlay model for built-in knowledge updates

Update Model

Built-in knowledge starts as a shipped package.

Locale-aware onboarding adds one more requirement: the packaged install knowledge must stay aligned with the reviewed locale set and the bsdinstall-style onboarding screens. If the locale selector or onboarding copy changes, the built-in package should refresh alongside it.

Later updates can come from:

  • local git pull
  • local Gitea mirror
  • maintainer rebuild and shipped package refresh

Recommended future model:

  1. detect changed source files
  2. re-chunk changed files
  3. re-embed changed chunks
  4. update built-in knowledge without mixing with user memory
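
Step 1 of this model could be implemented by comparing content hashes against the previous build. The sketch below assumes the previous package recorded one SHA-256 digest per source path. That record is not part of the current metadata contract, and `plan_update` is a hypothetical helper.

```python
import hashlib


def content_hash(data: bytes) -> str:
    """Return the SHA-256 hex digest of a file's bytes."""
    return hashlib.sha256(data).hexdigest()


def plan_update(
    previous: dict[str, str], current_files: dict[str, bytes]
) -> dict[str, list[str]]:
    """Decide which sources to re-chunk and which to drop from the package.

    previous maps path -> hash recorded at the last build (assumed to exist);
    current_files maps path -> current file contents.
    """
    current = {path: content_hash(data) for path, data in current_files.items()}
    # New or modified files need re-chunking and re-embedding.
    changed = sorted(p for p, h in current.items() if previous.get(p) != h)
    # Files that disappeared should have their chunks removed.
    removed = sorted(p for p in previous if p not in current)
    return {"rechunk": changed, "delete": removed}
```

Only the paths under `rechunk` would go on to steps 2 and 3, so unchanged documents keep their existing chunks and embeddings.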

Initial Non-Goals

  • no mandatory online provider during install
  • no mandatory local model during install
  • no indexing of all repo content
  • no committing raw PostgreSQL data directories to git

Next Decisions

  1. Finalize the exact source allowlist
  2. Choose the first chunk size and overlap
  3. Decide the initial embedding provider for package generation
  4. Decide whether built-in knowledge starts as base only or base + overlay