clawdie/clawdie-ai

Sam & Claude 4a61b08ca5 feat(skills): make built-in knowledge part of default db setup

2026-03-14 17:59:20 +01:00


Builtin Knowledge Spec

Date: 14.mar.2026

Status: Draft planning spec

Purpose

Define the technical contract for Clawdie's built-in local knowledge package.

This document covers:

what content belongs in built-in knowledge
how that content is chunked
how embeddings are versioned
what metadata the package must carry
what bootstrap import depends on

This document does not define user memory.

Scope

Built-in knowledge is the local knowledge layer shipped with Clawdie. It is used during install, onboarding, and operator support before any production LLM provider is added.

It should include:

install/setup guides
onboarding locale/help text used by the FreeBSD bsddialog path
operator/system docs needed during bootstrap
built-in product skills
curated support/reference material that Clawdie ships intentionally

It should not include:

user notes
user preferences
runtime conversation history
agent-specific personal memory

Source Allowlist

Initial source allowlist:

.agent/skills/
docs/ files used directly by install/onboarding/operator flows
setup/ locale/onboarding metadata that the FreeBSD-native selector needs
generated or checked-in locale inventory metadata derived from FreeBSD sources such as locale -a, login.conf, and /usr/share/locale/ when onboarding consumes that data directly

Initial exclusions:

docs/sessions/
generated logs
screenshots and media-only files
examples unrelated to core Clawdie install/runtime behavior

The allowlist should stay explicit. Do not index the entire repo by default.
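
A minimal sketch of an explicit allowlist check, assuming prefix matching on repo-relative paths. Only `.agent/skills/` and `docs/sessions/` come from the lists above; the `docs/install/` and `setup/locale/` prefixes, the suffix list, and the function name `is_allowed` are hypothetical placeholders for whatever the final allowlist names.

```python
# Explicit allowlist: a path is indexed only if it matches an allowed prefix
# and is not excluded. Nothing is indexed by default.
ALLOW_PREFIXES = (
    ".agent/skills/",   # from the spec
    "docs/install/",    # hypothetical: stands in for the install/onboarding docs
    "setup/locale/",    # hypothetical: stands in for locale/onboarding metadata
)
DENY_PREFIXES = ("docs/sessions/",)  # from the spec
# Hypothetical suffixes standing in for "generated logs" and "media-only files".
DENY_SUFFIXES = (".log", ".png", ".jpg", ".jpeg", ".gif", ".mp4")


def is_allowed(path: str) -> bool:
    """Return True if a repo-relative path belongs in built-in knowledge."""
    if path.startswith(DENY_PREFIXES):
        return False
    if path.lower().endswith(DENY_SUFFIXES):
        return False
    return path.startswith(ALLOW_PREFIXES)
```

Exclusions are checked first, so an excluded file stays out even if it sits under an allowed prefix.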

Package Structure

The built-in knowledge package should be shipped as:

SQL import artifact
metadata JSON

Recommended location:

bootstrap/skills-memory/artifact.sql
bootstrap/skills-memory/metadata.json

This can later move to a release asset if size becomes a problem.

Metadata Contract

The metadata file must include at least:

artifact_version
schema_version
source_snapshot
chunker_name
chunker_version
chunk_size
chunk_overlap
embedding_provider
embedding_model
embedding_dimensions
generated_at

Optional but useful:

document_count
chunk_count
embedding_count
git_commit
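
The required fields above can be checked mechanically before import. The following is a sketch only: the key names are taken from this spec, while the function names `missing_metadata_keys` and `load_metadata` are assumptions made for illustration.

```python
import json

# Required keys, as listed in the Metadata Contract.
REQUIRED_KEYS = (
    "artifact_version",
    "schema_version",
    "source_snapshot",
    "chunker_name",
    "chunker_version",
    "chunk_size",
    "chunk_overlap",
    "embedding_provider",
    "embedding_model",
    "embedding_dimensions",
    "generated_at",
)


def missing_metadata_keys(metadata: dict) -> list:
    """Return the required keys that are absent, in contract order."""
    return [key for key in REQUIRED_KEYS if key not in metadata]


def load_metadata(text: str) -> dict:
    """Parse metadata JSON and reject it if any required key is missing."""
    metadata = json.loads(text)
    if not isinstance(metadata, dict):
        raise ValueError("metadata must be a JSON object")
    missing = missing_metadata_keys(metadata)
    if missing:
        raise ValueError("metadata missing required keys: " + ", ".join(missing))
    return metadata
```

The check only covers presence. Value types and formats are not defined by this spec yet, so the sketch does not validate them.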

Chunking Contract

Chunking is part of the package definition and must be versioned.

Initial policy:

split by heading/section first
enforce maximum chunk size after structural split
keep overlap minimal

Do not make chunking user-configurable during install.

If chunking changes, build a new package version.
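
The initial policy could look roughly like the sketch below. It assumes Markdown-style `#` headings and measures `chunk_size` and `chunk_overlap` in characters; the spec has not yet fixed the unit or the values, and the function names are hypothetical.

```python
import re

HEADING = re.compile(r"#{1,6} ")  # assumption: Markdown ATX headings


def split_sections(text: str) -> list:
    """Structural split: start a new section at every heading line."""
    sections, current = [], []
    for line in text.splitlines():
        if HEADING.match(line) and current:
            sections.append("\n".join(current))
            current = []
        current.append(line)
    if current:
        sections.append("\n".join(current))
    return [s for s in sections if s.strip()]


def enforce_max_size(section: str, chunk_size: int, chunk_overlap: int) -> list:
    """Cut an oversized section into windows that share chunk_overlap characters."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    if len(section) <= chunk_size:
        return [section]
    step = chunk_size - chunk_overlap
    return [
        section[start:start + chunk_size]
        for start in range(0, len(section) - chunk_overlap, step)
    ]


def chunk_document(text: str, chunk_size: int, chunk_overlap: int) -> list:
    """Split by heading first, then enforce the maximum chunk size."""
    chunks = []
    for section in split_sections(text):
        chunks.extend(enforce_max_size(section, chunk_size, chunk_overlap))
    return chunks
```

Because the output depends on `chunk_size`, `chunk_overlap`, and the splitting rules, any change to them produces different chunks, which is why the chunker name and version are recorded in the package metadata.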

Embedding Contract

Embeddings are generated before user install.

Allowed generation paths:

maintainer machine
CI pipeline
later: local rebuild path on target host

Initial product rule:

install must not depend on live embedding generation

Preferred long-term local rebuild direction:

llama.cpp

Database Contract

The db jail stores and serves built-in knowledge.

The db jail should not be treated as the primary place where embeddings are generated during first install.

Bootstrap must be able to:

create schema
import SQL package
read metadata
verify row counts

Current schema file:

docs/sql/builtin-knowledge-base.sql

Current base tables:

builtin_knowledge_artifacts
builtin_knowledge_documents
builtin_knowledge_chunks
builtin_knowledge_embeddings

Current base lookup function:

search_builtin_knowledge(query_text, max_results)

Import Contract

Default bootstrap flow:

create {agent}-db
install PostgreSQL + pgvector
apply schema
import built-in knowledge SQL
verify metadata and row counts

If import fails:

stop bootstrap
report the missing or invalid package
do not silently fall back to live embedding generation
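
The final verification step and the failure rule might be sketched as follows. This is an assumption-heavy illustration: the mapping from the optional metadata count fields to the base tables is a guess, `BootstrapError` and `verify_row_counts` are hypothetical names, and the caller is expected to supply the actual counts (for example from `SELECT count(*)` queries against the db jail).

```python
class BootstrapError(RuntimeError):
    """Raised to stop bootstrap when the built-in knowledge import is invalid."""


# Assumed mapping from optional metadata count fields to the base tables.
COUNT_FIELDS = {
    "document_count": "builtin_knowledge_documents",
    "chunk_count": "builtin_knowledge_chunks",
    "embedding_count": "builtin_knowledge_embeddings",
}


def verify_row_counts(metadata: dict, actual_counts: dict) -> None:
    """Compare imported row counts with the counts recorded in metadata.

    Count fields are optional in the metadata contract, so absent fields
    are skipped. Any mismatch raises instead of falling back to live
    embedding generation.
    """
    for field, table in COUNT_FIELDS.items():
        if field not in metadata:
            continue
        expected = metadata[field]
        actual = actual_counts.get(table)
        if actual != expected:
            raise BootstrapError(
                f"{table}: expected {expected} rows, found {actual}"
            )
```

Raising an exception keeps the failure loud: the bootstrap caller has to handle it explicitly, and there is no code path that quietly regenerates embeddings.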

Separation From User Memory

Built-in knowledge and user memory must remain distinct.

At minimum, keep separate lifecycle and provenance.

Later implementation options:

separate tables
separate namespaces
base + overlay model for built-in knowledge updates

Update Model

Built-in knowledge starts as a shipped package.

Locale-aware onboarding adds one more requirement: the packaged install knowledge must stay aligned with the reviewed locale set and the bsdinstall-style onboarding screens. If the locale selector or onboarding copy changes, the built-in package should refresh alongside it.

Later updates can come from:

local git pull
local Gitea mirror
maintainer rebuild and shipped package refresh

Recommended future model:

detect changed source files
re-chunk changed files
re-embed changed chunks
update built-in knowledge without mixing with user memory
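
One way to detect changed source files is to compare content hashes against the previous build. The sketch below assumes SHA-256 digests keyed by repo-relative path; the spec does not prescribe a detection mechanism, and `plan_refresh` is a hypothetical name.

```python
import hashlib


def content_digest(data: bytes) -> str:
    """Return a stable hex digest for a file's content."""
    return hashlib.sha256(data).hexdigest()


def plan_refresh(previous: dict, current_files: dict):
    """Work out which source files need re-chunking and re-embedding.

    previous maps path -> digest from the last package build.
    current_files maps path -> current file content as bytes.
    Returns (changed, removed, current_digests); changed includes new files.
    """
    current = {path: content_digest(data) for path, data in current_files.items()}
    changed = sorted(p for p, d in current.items() if previous.get(p) != d)
    removed = sorted(p for p in previous if p not in current)
    return changed, removed, current
```

Only the paths in `changed` would be re-chunked and re-embedded, and rows for paths in `removed` would be dropped from the built-in tables. User memory is never touched by this plan.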

Initial Non-Goals

no mandatory online provider during install
no mandatory local model during install
no indexing of all repo content
no committing raw PostgreSQL data directories to git

Next Decisions

Finalize the exact source allowlist
Choose the first chunk size and overlap
Decide the initial embedding provider for package generation
Decide whether built-in knowledge starts as base only or base + overlay
