hermes-bsd/scripts/LIVETEST_README.md

# Tool Search live test harness

Runs five scenarios against a real model (Claude Haiku 4.5 via OpenRouter) to
verify that the bridge tools work end-to-end. Records transcripts in
`scripts/out/`.

## Running

```bash
cd "$(git rev-parse --show-toplevel)"          # repo root
python3 scripts/tool_search_livetest.py        # runs all 5 scenarios x 2 modes
python3 scripts/analyze_livetest.py            # side-by-side report
```

Requires `OPENROUTER_API_KEY` to be set in the environment or present in `~/.hermes/.env`.

## What it verifies

| Scenario | Tests |
|----------|-------|
| A obvious_single | BM25 retrieval on an obvious tool name (github_create_issue) |
| B vague_paraphrased | Retrieval when the model has to paraphrase ("schedule meeting" → evt_create) |
| C multi_tool_chain | Multi-step task chaining two deferred tools (GitHub + Slack) |
| D core_plus_deferred | Mixed: core tool (read_file) called directly, deferred tool (Slack) via bridge |
| E no_tool_needed | Pure-knowledge prompt; verify no spurious tool_search invocations |
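Scenarios A and B both exercise BM25 ranking over tool names and descriptions. The following is a minimal sketch of that idea, not the harness's or the bridge's actual implementation: the catalog descriptions, the `slack_post_message` name, the tokenizer, and the `k1`/`b` values are all assumptions made for illustration.

```python
import math
import re
from collections import Counter

# Hypothetical catalog: names echo the scenarios above, descriptions are invented.
CATALOG = {
    "github_create_issue": "Create a new issue in a GitHub repository",
    "evt_create": "Create a calendar event or schedule a meeting with attendees",
    "slack_post_message": "Post a message to a Slack channel",
    "read_file": "Read the contents of a file from disk",
}


def tokenize(text):
    # Lowercase and split on anything that is not a letter or digit,
    # so "github_create_issue" becomes ["github", "create", "issue"].
    return re.findall(r"[a-z0-9]+", text.lower())


def bm25_rank(query, catalog, k1=1.5, b=0.75):
    """Return tool names sorted by BM25 score, best match first."""
    docs = {name: tokenize(f"{name} {desc}") for name, desc in catalog.items()}
    n_docs = len(docs)
    avg_len = sum(len(toks) for toks in docs.values()) / n_docs
    # Document frequency: number of tools whose text contains each term.
    df = Counter(term for toks in docs.values() for term in set(toks))
    scores = {}
    for name, toks in docs.items():
        tf = Counter(toks)
        score = 0.0
        for term in tokenize(query):
            if term not in tf:
                continue
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(toks) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores[name] = score
    return sorted(scores, key=scores.get, reverse=True)


print(bm25_rank("open a github issue", CATALOG)[0])  # github_create_issue
print(bm25_rank("schedule meeting", CATALOG)[0])     # evt_create
```

In this toy catalog the paraphrased query only reaches `evt_create` because its description mentions scheduling a meeting; the name alone would not match, which is the situation scenario B is meant to probe.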

Each scenario runs with `tool_search.enabled = on` and again with `off` for an
A/B baseline. The harness records:

- `bridge_calls` (the `tool_search` / `tool_describe` / `tool_call` sequence the model emitted)
- `underlying_tool_calls` (what actually ran through the registry dispatcher)
- `final_response`, iteration count, elapsed time, and any errors
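As a rough picture of what one recorded run might look like, here is a sketch of a record type carrying the fields listed above. Only `bridge_calls`, `underlying_tool_calls`, and `final_response` are names taken from this README; `RunRecord`, `scenario`, `tool_search_enabled`, `iterations`, `elapsed_seconds`, and `errors` are hypothetical, as are the example values.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class RunRecord:
    scenario: str                      # hypothetical field name
    tool_search_enabled: bool          # hypothetical field name
    bridge_calls: list = field(default_factory=list)
    underlying_tool_calls: list = field(default_factory=list)
    final_response: str = ""
    iterations: int = 0                # hypothetical field name
    elapsed_seconds: float = 0.0       # hypothetical field name
    errors: list = field(default_factory=list)

    def to_json(self):
        # ensure_ascii=False keeps non-ASCII model output readable in the file.
        return json.dumps(asdict(self), indent=2, ensure_ascii=False)


# Illustrative values only, not taken from a real run.
record = RunRecord(
    scenario="A_obvious_single",
    tool_search_enabled=True,
    bridge_calls=["tool_search", "tool_describe", "tool_call"],
    underlying_tool_calls=["github_create_issue"],
    final_response="(model reply goes here)",
    iterations=3,
)
print(record.to_json())
```

Keeping `bridge_calls` and `underlying_tool_calls` as separate lists makes it easy to tell a run where the model searched but never dispatched a tool from one where the dispatch actually happened.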

## Output structure

```
scripts/out/
  <scenario>__enabled.json    # tool_search ON
  <scenario>__disabled.json   # tool_search OFF
  _summary.json               # one-line summary across all runs
```
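To compare the two modes by hand, the files can be paired by scenario using the `__enabled` / `__disabled` suffixes shown above. This is a sketch that assumes only the naming convention; `load_pairs` is a hypothetical helper, not part of `analyze_livetest.py`, and it makes no assumption about the keys inside each JSON file.

```python
import json
from pathlib import Path


def load_pairs(out_dir):
    """Group transcripts as {scenario: {"enabled": data, "disabled": data}}."""
    pairs = {}
    # "_summary.json" has no double underscore, so this pattern skips it.
    for path in sorted(Path(out_dir).glob("*__*.json")):
        scenario, _, mode = path.stem.rpartition("__")
        if mode not in ("enabled", "disabled"):
            continue
        data = json.loads(path.read_text(encoding="utf-8"))
        pairs.setdefault(scenario, {})[mode] = data
    return pairs


if __name__ == "__main__":
    for scenario, modes in load_pairs("scripts/out").items():
        print(scenario, sorted(modes))
```

Reading with an explicit `encoding="utf-8"` avoids depending on the platform default, which matters on Windows.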

Transcripts are not checked in: `scripts/out/` is gitignored because model
output is non-deterministic and goes stale when the model changes. Re-running
produces slightly different transcripts each time, but the
`expected_underlying_tools` assertions should remain satisfied.
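One way to read the `expected_underlying_tools` assertions is as a subset check: every expected tool must appear among the tools that actually ran, regardless of order or extra calls. The sketch below encodes that reading, plus the scenario E rule that no `tool_search` call should appear. Both the subset semantics and the `check_run` helper are assumptions; the harness may compare more strictly.

```python
def check_run(record, expected_underlying_tools, allow_tool_search=True):
    """Return a list of problems; an empty list means the run passed.

    Assumes `record` is a dict whose "underlying_tool_calls" and
    "bridge_calls" values are lists of tool names.
    """
    problems = []
    called = set(record.get("underlying_tool_calls", []))
    missing = [tool for tool in expected_underlying_tools if tool not in called]
    if missing:
        problems.append(f"missing expected tools: {missing}")
    if not allow_tool_search and "tool_search" in record.get("bridge_calls", []):
        problems.append("spurious tool_search call")
    return problems


# Illustrative records, not taken from a real run.
ok_run = {
    "bridge_calls": ["tool_search", "tool_call"],
    "underlying_tool_calls": ["github_create_issue"],
}
print(check_run(ok_run, ["github_create_issue"]))          # []
print(check_run(ok_run, [], allow_tool_search=False))      # ['spurious tool_search call']
```

A subset check tolerates the run-to-run variation in transcripts while still catching a run that never reached the intended tool.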

test(tool-search): add live A/B harness, drop checked-in transcripts

Brings in the tool_search live-test harness from the original PR but leaves out the 11 checked-in `scripts/out/*.json` transcript files. Those are non-deterministic model output that goes stale the moment the model changes, and they were the bulk of the diff. `scripts/out/` is now gitignored so a harness run never re-commits them.

Fixes on top:

- API-key loading goes through `hermes_cli.env_loader.load_hermes_dotenv` instead of hand-parsing `~/.hermes/.env` and assigning the value to a local. The canonical loader never materializes the secret in a local variable in this module, which clears the four CodeQL high alerts (`py/clear-text-storage` / `py/clear-text-logging-sensitive-data` at the transcript write/print sites, which were tracing the key from the hand-rolled parser into the records) and removes a hand-rolled parser.
- `encoding='utf-8'` on every `write_text`/`read_text` in both harness scripts (Windows-footgun hygiene).

Co-authored-by: teknium1 <127238744+teknium1@users.noreply.github.com>

2026-05-29 01:28:22 -07:00
