hermes-bsd/agent
teknium1 592a4ffb6b fix(kanban): close three blocked/iteration-exhausted handling gaps (#29747)
Reporter diagnosed three independent gaps that together allowed infinite
'unblock → re-stuck' loops with no surfacing or escalation:

GAP 1: `_rule_stuck_in_blocked` resets timer on any `commented`/`unblocked`
event, so a task that cycles every few minutes is invisible to it
regardless of how many times it cycles.

Fix: new `_rule_block_unblock_cycling` rule (`hermes_cli/kanban_diagnostics.py`)
that counts block→unblock cycles in a sliding window. Default threshold
3 cycles within 24h, configurable via `block_cycle_threshold` /
`block_cycle_window_seconds`. Walks events in arrival order (event id)
since multiple events can share the same `created_at` second. Fires as a
warning with a CLI hint to inspect the block reasons.

GAP 2: Iteration-budget-exhausted runs in kanban workers map to
`kanban_block` (status=blocked, but a clean exit from the kernel's
perspective). `_rule_repeated_failures` reads `consecutive_failures`,
which `_record_task_failure` increments only for crashed/timed_out/
spawn_failed — `blocked` outcome bypasses the failure counter, so the
`kanban.failure_limit` circuit breaker never trips on budget-exhaustion
loops.

Fix: `agent/conversation_loop.py` budget-exhaustion path now calls
`_record_task_failure(outcome="timed_out")` instead of `kanban_block`.
Budget exhaustion is genuinely a timeout-shaped failure (the task ran out
of allowed iterations), so this is more honest semantics; it also routes
through the unified failure counter, so repeated budget exhaustions trip
the circuit breaker and the task auto-blocks with `gave_up` after
`failure_limit` retries.

GAP 3: `release_stale_claims` uses `_pid_alive(worker_pid)` only and
ignores `last_heartbeat_at`. Reporter observed a 91-min run that held
its claim with frozen heartbeat because the worker entered a logic loop
with no tool calls — `_pid_alive` kept returning True so the claim was
extended every 15 minutes indefinitely.

Fix: heartbeat-stale backstop. If `last_heartbeat_at` is set AND older
than `DEFAULT_CLAIM_HEARTBEAT_MAX_STALE_SECONDS` (default 1h), reclaim
even if the PID is alive. NULL `last_heartbeat_at` preserves backward
compatibility (no heartbeat yet = extend, as before). The reclaim event
payload now includes a `heartbeat_stale` boolean so operators see why a
live-PID worker was reclaimed.

This works cleanly in concert with PR #34418 (#31752 runtime → heartbeat
bridge): once `_touch_activity` keeps `last_heartbeat_at` fresh as a
side effect of normal API traffic, the backstop only fires for genuinely
wedged workers (no chunks, no tool results, no progress at all).

Co-authored-by: baofuen <45189813+baofuen@users.noreply.github.com>
2026-05-29 00:13:29 -07:00
..
lsp chore: prune unused imports and duplicate import redefinitions 2026-05-28 22:26:25 -07:00
secret_sources chore: prune unused imports and duplicate import redefinitions 2026-05-28 22:26:25 -07:00
transports chore: prune unused imports and duplicate import redefinitions 2026-05-28 22:26:25 -07:00
__init__.py fix(agent): preload jiter native parser 2026-05-28 00:20:11 -07:00
account_usage.py
agent_init.py chore: prune unused imports and duplicate import redefinitions 2026-05-28 22:26:25 -07:00
agent_runtime_helpers.py chore: prune unused imports and duplicate import redefinitions 2026-05-28 22:26:25 -07:00
anthropic_adapter.py feat: add claude-opus-4.8 and claude-opus-4.8-fast (#34003) 2026-05-28 10:31:59 -07:00
async_utils.py fix(async): close unscheduled coroutines in all threadsafe bridges (#26584) 2026-05-15 14:00:01 -07:00
auxiliary_client.py fix(xai-sanitize): deepcopy tools_for_api before in-place mutation (#27907) 2026-05-28 23:29:59 -07:00
azure_identity_adapter.py feat(azure-foundry): add Microsoft Entra ID auth 2026-05-18 10:14:38 -07:00
background_review.py test: cover ci-unblocker production regressions 2026-05-27 22:14:53 -07:00
bedrock_adapter.py chore(deps): lazy-install boto3/botocore for bedrock adapter 2026-05-17 02:31:18 -07:00
browser_provider.py fix(browser): self-review pass — dead-import, log levels, future-proofing 2026-05-17 04:04:15 -07:00
browser_registry.py fix(browser): self-review pass — dead-import, log levels, future-proofing 2026-05-17 04:04:15 -07:00
chat_completion_helpers.py fix(xai-sanitize): deepcopy tools_for_api before in-place mutation (#27907) 2026-05-28 23:29:59 -07:00
codex_responses_adapter.py feat(prompt): universal task-completion guidance + local Python toolchain probe (#34340) 2026-05-28 22:26:09 -07:00
codex_runtime.py chore: prune unused imports and duplicate import redefinitions 2026-05-28 22:26:25 -07:00
context_compressor.py test: cover fallback dropped-turn handoff 2026-05-28 20:34:40 -07:00
context_engine.py feat(context-engine): host contract for external context engines 2026-05-28 01:45:30 -07:00
context_references.py
conversation_compression.py chore: prune unused imports and duplicate import redefinitions 2026-05-28 22:26:25 -07:00
conversation_loop.py fix(kanban): close three blocked/iteration-exhausted handling gaps (#29747) 2026-05-29 00:13:29 -07:00
copilot_acp_client.py fix: guard yaml.safe_load, flock unlock, TOCTOU races, and atomic writes 2026-05-19 00:12:41 -07:00
credential_persistence.py fix: avoid persisting borrowed credential secrets (#31416) 2026-05-25 00:32:08 -07:00
credential_pool.py fix(credential-pool): STATUS_DEAD for terminal OAuth failures (#32849) (#34412) 2026-05-28 23:45:42 -07:00
credential_sources.py docs(auth): replace stale 'hermes login' references with 'hermes auth add' 2026-05-26 15:41:11 -07:00
curator.py fix: preserve skill packages during curator consolidation 2026-05-27 13:39:58 -07:00
curator_backup.py chore: prune unused imports and duplicate import redefinitions 2026-05-28 22:26:25 -07:00
display.py chore(web): remove web_crawl tool + provider crawl plumbing (#33824) 2026-05-28 04:52:42 -07:00
error_classifier.py fix(agent): fallback immediately on provider content-policy blocks (#33883) 2026-05-28 07:28:24 -07:00
file_safety.py fix(security): add bws_cache.json to file_safety read guard 2026-05-28 23:31:20 -07:00
gemini_cloudcode_adapter.py
gemini_native_adapter.py
gemini_schema.py
google_code_assist.py chore: prune unused imports and duplicate import redefinitions 2026-05-28 22:26:25 -07:00
google_oauth.py docs(auth): replace stale 'hermes login' references with 'hermes auth add' 2026-05-26 15:41:11 -07:00
i18n.py
image_gen_provider.py fix(image_gen): cache xAI ephemeral URL responses to disk (#26942) (#31759) 2026-05-24 18:10:47 -07:00
image_gen_registry.py
image_routing.py feat(kanban): attach images referenced in task bodies to worker vision (#34210) 2026-05-28 17:50:42 -07:00
insights.py
iteration_budget.py refactor(run_agent): extract OpenAI proxy, safe stdio, IterationBudget 2026-05-16 17:59:32 -07:00
jiter_preload.py fix(agent): preload jiter native parser 2026-05-28 00:20:11 -07:00
lmstudio_reasoning.py
manual_compression_feedback.py
markdown_tables.py
memory_manager.py feat: expose completed-turn message context to memory providers 2026-05-29 02:16:43 +05:30
memory_provider.py feat: expose completed-turn message context to memory providers 2026-05-29 02:16:43 +05:30
message_sanitization.py refactor(run_agent): extract message sanitization to agent/message_sanitization.py 2026-05-16 17:41:09 -07:00
model_metadata.py fix: stop probe stepdown without provider context limit 2026-05-28 12:26:53 -07:00
models_dev.py remove Vercel AI Gateway and Vercel Sandbox (#33067) 2026-05-27 00:43:32 -07:00
moonshot_schema.py fix(moonshot): strip $ref siblings and collapse tuple items in tool schemas (#27104) 2026-05-16 13:02:19 -07:00
nous_rate_guard.py
onboarding.py
plugin_llm.py
portal_tags.py
process_bootstrap.py refactor(run_agent): extract OpenAI proxy, safe stdio, IterationBudget 2026-05-16 17:59:32 -07:00
prompt_builder.py fix(kanban): tell workers not to use clarify; route to kanban_block instead (#32167) 2026-05-28 23:57:20 -07:00
prompt_caching.py
rate_limit_tracker.py
redact.py fix(redact): pass web URLs through unchanged (#34029) 2026-05-28 11:32:39 -07:00
retry_utils.py
shell_hooks.py fix: guard yaml.safe_load, flock unlock, TOCTOU races, and atomic writes 2026-05-19 00:12:41 -07:00
skill_bundles.py feat(skills): add skill bundles — alias /<name> loads multiple skills (#28373) 2026-05-18 21:38:05 -07:00
skill_commands.py fix(skills): load symlinked skill slash commands 2026-05-18 00:34:29 -07:00
skill_preprocessing.py fix: treat inline-shell timeout guard as timeout 2026-05-18 19:36:04 -07:00
skill_utils.py fix(skills): load Linux-tagged skills on Termux (android sys.platform) 2026-05-21 19:08:38 -07:00
stream_diag.py feat(agent): buffer retry/fallback status, surface only on terminal failure (#33816) 2026-05-28 04:53:27 -07:00
subdirectory_hints.py fix(subdirectory_hints): prevent loading AGENTS.md outside workspace 2026-05-25 23:17:33 -07:00
system_prompt.py feat(prompt): universal task-completion guidance + local Python toolchain probe (#34340) 2026-05-28 22:26:09 -07:00
think_scrubber.py
title_generator.py
tool_dispatch_helpers.py feat(security): promptware defense — shared threat patterns + memory load-time scan + tool-result delimiters (#32269) 2026-05-25 14:52:24 -07:00
tool_executor.py chore: prune unused imports and duplicate import redefinitions 2026-05-28 22:26:25 -07:00
tool_guardrails.py fix: add recovery hints to loop guard warnings 2026-05-19 00:12:12 -07:00
tool_result_classification.py
trajectory.py
transcription_provider.py feat(stt): add register_transcription_provider() plugin hook 2026-05-25 01:41:19 -07:00
transcription_registry.py feat(stt): add register_transcription_provider() plugin hook 2026-05-25 01:41:19 -07:00
tts_provider.py feat(tts): add register_tts_provider() plugin hook (closes #30398) 2026-05-24 18:04:54 -07:00
tts_registry.py feat(tts): add register_tts_provider() plugin hook (closes #30398) 2026-05-24 18:04:54 -07:00
usage_pricing.py feat: add claude-opus-4.8 and claude-opus-4.8-fast (#34003) 2026-05-28 10:31:59 -07:00
video_gen_provider.py
video_gen_registry.py
web_search_provider.py chore(web): remove web_crawl tool + provider crawl plumbing (#33824) 2026-05-28 04:52:42 -07:00
web_search_registry.py chore(web): remove web_crawl tool + provider crawl plumbing (#33824) 2026-05-28 04:52:42 -07:00