fix(gateway): return tuple from voice transcription on placeholder caption (#42090)

## What does this PR do? The voice-during-active-run feature (#41984) changed `_enrich_message_with_transcription` so that it returns a `(enriched_text, successful_transcripts)` tuple instead of a bare string, which lets callers echo the raw transcript back to the user. The signature and every other return path were updated to match, but one branch was missed: when a successfully transcribed clip arrives with the Discord "empty content" placeholder as its caption, the method still returned the prefix string on its own. All four call sites unpack the result with `text, transcripts = await self._enrich_message_with_transcription(...)`, so that path raised `ValueError: too many values to unpack (expected 2)` and the inbound voice message was dropped instead of reaching the agent. This is a real user-facing path rather than a corner case: a Discord voice note sent without a caption is delivered as exactly that placeholder, so a captionless voice message that transcribed correctly would crash the handler precisely when transcription had worked. The fix returns the proper tuple from that branch so the placeholder is still stripped while the transcripts continue to flow back to the caller for the echo. ## Related Issue N/A ## Type of Change - [x] 🐛 Bug fix (non-breaking change that fixes an issue) - [ ] ✨ New feature (non-breaking change that adds functionality) - [ ] 🔒 Security fix - [ ] 📝 Documentation update - [ ] ✅ Tests (adding or improving test coverage) - [ ] ♻️ Refactor (no behavior change) - [ ] 🎯 New skill (bundled or hub) ## Changes Made - `gateway/run.py`: in `_enrich_message_with_transcription`, return `(prefix, successful_transcripts)` instead of a bare `prefix` from the empty-content-placeholder branch, so the contract matches the signature and the other return paths. - `tests/gateway/test_stt_config.py`: add `test_enrich_message_with_transcription_returns_tuple_for_empty_content_placeholder`, which drives a successful transcription with the placeholder caption and asserts the placeholder is stripped while the transcript is still returned. ## How to Test 1. Check out `main` and run the new test — it fails with `ValueError: too many values to unpack (expected 2)`, reproducing the crash a captionless Discord voice note would trigger. 2. Apply this change and re-run `pytest tests/gateway/test_stt_config.py -q` — all tests pass. 3. `ruff check gateway/run.py tests/gateway/test_stt_config.py` and `python scripts/check-windows-footguns.py gateway/run.py tests/gateway/test_stt_config.py` both pass. ## Checklist ### Code - [x] I've read the [Contributing Guide](https://github.com/NousResearch/hermes-agent/blob/main/CONTRIBUTING.md) - [x] My commit messages follow [Conventional Commits](https://www.conventionalcommits.org/) (`fix(scope):`, `feat(scope):`, etc.) - [x] I searched for [existing PRs](https://github.com/NousResearch/hermes-agent/pulls) to make sure this isn't a duplicate - [x] My PR contains **only** changes related to this fix/feature (no unrelated commits) - [x] I've run `pytest tests/ -q` and all tests pass - [x] I've added tests for my changes (required for bug fixes, strongly encouraged for features) - [x] I've tested on my platform: macOS 15 (Darwin 25.5) ### Documentation & Housekeeping - [x] I've updated relevant documentation (README, `docs/`, docstrings) — or N/A - [x] I've updated `cli-config.yaml.example` if I added/changed config keys — or N/A - [x] I've updated `CONTRIBUTING.md` or `AGENTS.md` if I changed architecture or workflows — or N/A - [x] I've considered cross-platform impact (Windows, macOS) per the [compatibility guide](https://github.com/NousResearch/hermes-agent/blob/main/CONTRIBUTING.md#cross-platform-compatibility) — or N/A - [x] I've updated tool descriptions/schemas if I changed tool behavior — or N/A
2026-06-10 09:16:23 +03:00 · 2026-06-10 09:16:23 +03:00 · 9ca9697342
commit 9ca9697342
parent 63a421d4c0
2 changed files with 40 additions and 1 deletions
--- a/gateway/run.py
+++ b/gateway/run.py
@ -11538,7 +11538,7 @@ class GatewayRunner(GatewayAuthorizationMixin, GatewayKanbanWatchersMixin, Gatew
            # when we successfully transcribed the audio — it's redundant.
            _placeholder = "(The user sent a message with no text content)"
            if user_text and user_text.strip() == _placeholder:
-                return prefix
+                return prefix, successful_transcripts
            if user_text:
                return f"{prefix}\n\n{user_text}", successful_transcripts
            return prefix, successful_transcripts
--- a/tests/gateway/test_stt_config.py
+++ b/tests/gateway/test_stt_config.py
@ -102,6 +102,45 @@ async def test_enrich_message_with_transcription_avoids_bogus_no_provider_messag
    assert transcripts == []


+@pytest.mark.asyncio
+async def test_enrich_message_with_transcription_returns_tuple_for_empty_content_placeholder():
+    """A successful transcription whose caption is the empty-content placeholder
+    must still return the ``(text, transcripts)`` tuple.
+
+    The Discord adapter delivers a captionless voice note as the literal
+    ``"(The user sent a message with no text content)"`` placeholder. When STT
+    succeeds we strip that redundant placeholder and return just the transcript
+    prefix — but the method's contract (and every caller, which unpacks the
+    result as ``text, transcripts = ...``) requires a 2-tuple. Returning a bare
+    string here raised ``ValueError: too many values to unpack`` and dropped the
+    whole voice message on the floor.
+    """
+    from gateway.run import GatewayRunner
+
+    runner = GatewayRunner.__new__(GatewayRunner)
+    runner.config = GatewayConfig(stt_enabled=True)
+    runner._has_setup_skill = lambda: False
+
+    with patch(
+        "tools.transcription_tools.transcribe_audio",
+        return_value={
+            "success": True,
+            "transcript": "hello from a captionless voice note",
+            "provider": "local_command",
+        },
+    ):
+        result, transcripts = await runner._enrich_message_with_transcription(
+            "(The user sent a message with no text content)",
+            ["/tmp/voice.ogg"],
+        )
+
+    # The redundant placeholder is stripped, leaving only the transcript prefix.
+    assert "hello from a captionless voice note" in result
+    assert "(The user sent a message with no text content)" not in result
+    # Crucially, the transcripts are still surfaced so callers can echo them.
+    assert transcripts == ["hello from a captionless voice note"]
+
+
@pytest.mark.asyncio
 async def test_prepare_inbound_message_text_transcribes_queued_voice_event():
    from gateway.run import GatewayRunner