layered-soul/skills/systematic-debugging/references/network-live-diagnostics.md
Hermes & Sam 5c5df32101 Populate layered-soul: identity, memories, skills, plan (Hermes & Sam)
2026-06-14 00:21:26 +02:00

Network live diagnostics patterns

Use this reference for recurring Wi-Fi/Tailscale/SSH/tmux lag investigations, especially when the user is switching networks or running a large download.

Log location and clutter rule

Prefer a single bounded log under:

~/.local/state/hermes/net-tests/

Avoid placing diagnostic files on the Desktop unless the user explicitly asks. Desktop files become clutter quickly; dashboards and generated artifacts fit better under:

~/.local/share/hermes/net-dashboard/

Safe live monitoring during large downloads

When a large download is active and disk space is limited, do not start unbounded packet captures. First collect lightweight evidence:

  • df -h / /home /tmp for disk headroom
  • ss -tinp for per-TCP socket RTT, retransmits, reorder, bytes_received
  • ip -s link show <iface> for interface counters
  • short ping samples to gateway, public internet, and relevant Tailscale peers
  • nmcli ... dev wifi for SSID/channel/signal
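The short ping samples in the list above are easier to log when each run is reduced to a handful of numbers. The following is a minimal sketch, assuming the summary format printed by iputils `ping` on Linux; the function name `parse_ping_summary` and the record keys are illustrative and not taken from any existing Hermes script.

```python
import re

# Matches e.g. "5 packets transmitted, 5 received, 0% packet loss, time 4006ms"
# and also "3 packets transmitted, 0 received, +3 errors, 100% packet loss, ..."
_LOSS_RE = re.compile(
    r"(\d+) packets transmitted, (\d+) (?:packets )?received,.*?([\d.]+)% packet loss"
)
# Matches e.g. "rtt min/avg/max/mdev = 11.9/13.1/15.0/1.1 ms"
_RTT_RE = re.compile(r"= ([\d.]+)/([\d.]+)/([\d.]+)/([\d.]+) ms")


def parse_ping_summary(output: str) -> dict:
    """Reduce the text output of `ping -c N <host>` to a compact record."""
    record = {
        "sent": None, "received": None, "loss_pct": None,
        "rtt_min_ms": None, "rtt_avg_ms": None,
        "rtt_max_ms": None, "rtt_mdev_ms": None,
    }
    loss = _LOSS_RE.search(output)
    if loss:
        record["sent"] = int(loss.group(1))
        record["received"] = int(loss.group(2))
        record["loss_pct"] = float(loss.group(3))
    rtt = _RTT_RE.search(output)
    if rtt:  # the rtt line is absent when every probe was lost
        keys = ("rtt_min_ms", "rtt_avg_ms", "rtt_max_ms", "rtt_mdev_ms")
        for key, value in zip(keys, rtt.groups()):
            record[key] = float(value)
    return record
```

The input would typically come from something like `subprocess.run(["ping", "-c", "5", host], capture_output=True, text=True).stdout`, with one record appended to the log per target (gateway, public internet, Tailscale peer).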

Use JSONL or compact text summaries. Bound collection by all of:

  • max runtime
  • sample interval
  • max log size
  • disk free-space stop threshold

Example safety shape:

mkdir -p ~/.local/state/hermes/net-tests
MON_INTERVAL=10 MON_MAX_SECONDS=1800 MON_MAX_BYTES=2097152 \
MON_WARN_FREE_GB=15 MON_STOP_FREE_GB=10 \
  ~/.local/share/hermes/net-dashboard/live_download_monitor.py
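To make the four bounds concrete, here is a minimal sketch of how a collector loop could enforce them together. It is not the contents of live_download_monitor.py, which is not shown here. The `MON_*` variable names come from the example above; `run_bounded`, `bounds_from_env`, the `collect_sample` callback, and the defaults (which mirror the example values) are assumptions for illustration.

```python
import json
import os
import shutil
import time
from pathlib import Path


def bounds_from_env(env=os.environ):
    """Read the MON_* bounds; the defaults mirror the example values above."""
    return {
        "interval": float(env.get("MON_INTERVAL", 10)),
        "max_seconds": float(env.get("MON_MAX_SECONDS", 1800)),
        "max_bytes": int(env.get("MON_MAX_BYTES", 2097152)),
        "stop_free_gb": float(env.get("MON_STOP_FREE_GB", 10)),
    }


def run_bounded(log_path, collect_sample, *, interval=10.0, max_seconds=1800.0,
                max_bytes=2097152, stop_free_gb=10.0,
                sleep=time.sleep, clock=time.monotonic):
    """Append one JSON line per sample until any bound is hit.

    Returns the reason collection stopped. `collect_sample` is a
    hypothetical callback that returns a dict of measurements.
    """
    log_path = Path(log_path)
    log_path.parent.mkdir(parents=True, exist_ok=True)
    start = clock()
    while True:
        # Every bound is checked before each write, so none can be skipped.
        if clock() - start >= max_seconds:
            return "max_runtime"
        if log_path.exists() and log_path.stat().st_size >= max_bytes:
            return "max_log_size"
        free_gb = shutil.disk_usage(log_path.parent).free / 1e9
        if free_gb <= stop_free_gb:
            return "low_disk"
        record = {"ts": time.time(), "free_gb": round(free_gb, 2)}
        record.update(collect_sample())
        with log_path.open("a", encoding="utf-8") as fh:
            fh.write(json.dumps(record) + "\n")
        sleep(interval)
```

A caller could run `run_bounded(path, collect, **bounds_from_env())`. Passing `sleep` and `clock` as parameters is only there so the loop can be exercised in tests without waiting.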

What to inspect in ss -tinp

Relevant fields:

  • rtt:<avg>/<rttvar>: smoothed round-trip time and its mean deviation in ms, i.e. path latency and jitter
  • bytes_received: confirms a download is progressing
  • bytes_retrans and retrans:<active>/<total>: TCP loss/retry evidence
  • reord_seen / rcv_ooopack: packet reordering/out-of-order delivery
  • Send-Q: a stuck send queue can explain a frozen SSH session
  • sockets still bound to the old local IP after a Wi-Fi switch: SSH sessions to public addresses may die while Tailscale sessions survive
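The fields listed above appear as whitespace-separated `key:value` tokens on the indented detail line that `ss -ti` prints under each socket. A minimal sketch of pulling them into a dict follows. The function name `parse_tcp_info` and the output keys are illustrative assumptions, and the exact set of tokens depends on the kernel and iproute2 version, so missing fields are simply left out.

```python
def parse_tcp_info(info_line: str) -> dict:
    """Extract selected fields from one `ss -ti` detail line."""
    int_fields = {"bytes_received", "bytes_retrans", "reord_seen", "rcv_ooopack"}
    out = {}
    for token in info_line.split():
        key, sep, value = token.partition(":")
        if not sep:
            continue  # tokens such as "cubic" or "send" carry no key:value pair
        if key == "rtt":
            # rtt:<avg>/<rttvar>, both in milliseconds
            avg, _, dev = value.partition("/")
            out["rtt_ms"] = float(avg)
            out["rttvar_ms"] = float(dev) if dev else None
        elif key == "retrans":
            # retrans:<active>/<total>
            active, _, total = value.partition("/")
            out["retrans_active"] = int(active)
            out["retrans_total"] = int(total) if total else None
        elif key in int_fields:
            out[key] = int(value)
    return out
```

Logging this dict for the same socket at each sample interval makes it easy to see whether `bytes_received` keeps growing while `retrans_total` or `rtt_ms` climbs.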

Large download interpretation

If gateway ping stays clean but internet/Tailscale pings jump to hundreds or thousands of ms during a large download, suspect uplink/downlink saturation or bufferbloat on the hotspot/ISP path, not local Wi-Fi driver failure.

Useful pattern observed:

  • hotspot gateway: ~13 ms, 0% loss
  • internet/Tailscale peers: high latency / intermittent ping loss during download
  • active HTTPS socket bytes_received increasing
  • SSH sockets remain established but interactive use feels laggy

This indicates the download is filling the pipe and interactive packets wait behind it.
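The interpretation rule above can be written down as a small classifier. This is a sketch only: the function name `classify_lag` and all thresholds (30 ms, 300 ms, 1% loss) are hypothetical values chosen for illustration and would need tuning for a given network.

```python
def classify_lag(gateway_ms, gateway_loss_pct, remote_ms, remote_loss_pct,
                 *, clean_ms=30.0, laggy_ms=300.0, max_clean_loss=1.0):
    """Compare the first hop with a remote target and name the likely cause.

    gateway_* describe pings to the hotspot/router, remote_* describe pings
    to the public internet or a Tailscale peer over the same window.
    """
    gateway_clean = gateway_ms <= clean_ms and gateway_loss_pct <= max_clean_loss
    remote_laggy = remote_ms >= laggy_ms or remote_loss_pct > max_clean_loss
    if gateway_clean and remote_laggy:
        # Clean first hop, slow path beyond it: the pipe is full upstream.
        return "upstream saturation or bufferbloat suspected"
    if not gateway_clean:
        return "local hop degraded"
    return "no lag detected"
```

With the observed pattern (gateway around 13 ms and 0% loss, remote latency in the hundreds or thousands of ms), this returns the upstream verdict.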

Wireshark/tshark inclusion

Only add packet capture after lightweight counters show what to focus on. Avoid full unfiltered captures during large downloads. If tshark is available, prefer short, filtered, capped captures and extract summaries into the dashboard/log:

CAP=~/.local/state/hermes/net-tests/capture.pcapng
# Stop after 60 s or about 50 MB (filesize is in kB), whichever comes first; the size is an example value.
sudo tshark -i wlp1s0 -a duration:60 -a filesize:51200 -w "$CAP" \
  -f 'host <download-ip> or host <tailnet-ip> or port 22'

tshark -r "$CAP" \
  -Y 'tcp.analysis.retransmission or tcp.flags.reset==1 or dns or tcp.port==22' \
  -T fields -e frame.time -e ip.src -e ip.dst \
  -e tcp.analysis.retransmission -e tcp.flags.reset -e dns.qry.name

Keep raw pcaps under ~/.local/state/hermes/net-tests/ and summarize them; do not dump large packet logs into chat.
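One way to summarize instead of dumping is to count events per address pair from the `-T fields` output. The sketch below assumes the six columns come in the order used in the command above, separated by tabs (the tshark default). It also assumes a present `tcp.analysis.retransmission` field prints as some non-empty value and that the reset flag prints as `1` or `True`, since the exact rendering differs between tshark versions. The function name `summarize_tshark_fields` is illustrative.

```python
from collections import Counter


def summarize_tshark_fields(lines):
    """Count retransmissions, resets, and DNS names from `tshark -T fields` rows."""
    retrans = Counter()
    resets = Counter()
    dns_names = Counter()
    for line in lines:
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 3:
            continue  # not a data row
        cols += [""] * (6 - len(cols))  # tolerate missing trailing columns
        _time, src, dst, retrans_flag, reset_flag, qname = cols[:6]
        pair = f"{src} -> {dst}"
        if retrans_flag not in ("", "0", "False"):
            retrans[pair] += 1
        if reset_flag in ("1", "True"):
            resets[pair] += 1
        if qname:
            dns_names[qname] += 1
    return {
        "retransmissions": dict(retrans),
        "resets": dict(resets),
        "dns_queries": dict(dns_names),
    }
```

The resulting dict is small enough to append to the JSONL log or to paste into chat, while the raw pcap stays on disk.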

Local dashboard pattern

For recurring investigations, a static dashboard is often simpler than a full app. A starter implementation is available at scripts/network_story_dashboard.py. The pattern has three parts:

  • JSONL/text logs in ~/.local/state/hermes/net-tests/
  • generated HTML in ~/.local/share/hermes/net-dashboard/dashboard.html
  • served by python3 -m http.server <port> --directory ~/.local/share/hermes/net-dashboard

This shows how measurements change over time without requiring Node, npm, databases, or MCP. MCP can be added later if the workflow grows into a reusable tool API (start capture, summarize latest, append event, etc.).
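The log-to-HTML step of this pattern can be quite small. The following is a minimal sketch under stated assumptions, not the contents of scripts/network_story_dashboard.py: the function names `load_jsonl` and `write_dashboard`, the `*.jsonl` file pattern, and the page markup are all hypothetical.

```python
import json
from pathlib import Path


def load_jsonl(log_dir):
    """Read every *.jsonl file in log_dir, skipping blank or truncated lines."""
    records = []
    for path in sorted(Path(log_dir).glob("*.jsonl")):
        for line in path.read_text(encoding="utf-8").splitlines():
            line = line.strip()
            if not line:
                continue
            try:
                records.append(json.loads(line))
            except json.JSONDecodeError:
                continue  # an interrupted monitor can leave a partial last line
    return records


def write_dashboard(log_dir, out_path):
    """Write one static HTML file with the records embedded as inert JSON."""
    records = load_jsonl(log_dir)
    # Escape "</" so no string value can close the script element early.
    data_json = json.dumps(records, ensure_ascii=False).replace("</", "<\\/")
    page = (
        '<!doctype html>\n<meta charset="utf-8">\n'
        "<title>Network story</title>\n"
        f"<p>{len(records)} samples loaded.</p>\n"
        f'<script id="data" type="application/json">{data_json}</script>\n'
    )
    out = Path(out_path)
    out.parent.mkdir(parents=True, exist_ok=True)
    out.write_text(page, encoding="utf-8")
    return len(records)
```

Regenerating the file after each monitoring run and serving the directory with `python3 -m http.server` keeps the whole setup to two Python commands.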

Make dashboards understandable to non-technical viewers

When the dashboard is meant for a roommate/family member or another non-technical observer, do not lead with numeric tables. Lead with a story:

  • one big chart that answers "did the network spike?"
  • color-coded lines: local gateway/Wi-Fi hop, public internet, Tailscale peer, download/progress, free disk
  • checkboxes to turn lines on/off
  • plain-language event cards: "Phone hotspot stayed clean", "Internet got laggy under load"
  • hide raw ping/SSH/TCP tables under <details><summary>technical details</summary>
  • filters for event classes: Download stress, Projector/Epson, Tailscale/SSH, Wi-Fi changes

For interference testing such as "turn on Epson/projector", run at least two comparable bounded windows:

  1. before event / baseline
  2. event active / projector on
  3. optional after event / projector off recovery

Then visualize the event as a clear marker or separate cards so the viewer can say either "lag started exactly when Epson turned on" or "Epson did not correlate with the spikes."
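Comparing the windows can be reduced to a median check that also produces the plain-language card. This is a sketch under assumptions: `compare_windows` is a hypothetical helper, and the thresholds (event median at least 2x the baseline and at least 50 ms higher) are illustrative rather than measured values.

```python
from statistics import median


def compare_windows(baseline_ms, event_ms, *, label="the event",
                    ratio=2.0, min_delta_ms=50.0):
    """Compare ping latencies from a baseline window and an event window."""
    base = median(baseline_ms)
    event = median(event_ms)
    # Require both a relative and an absolute jump so small noise is ignored.
    correlated = event >= base * ratio and (event - base) >= min_delta_ms
    if correlated:
        card = f"Lag rose while {label} was active"
    else:
        card = f"{label} did not correlate with the spikes"
    return {
        "baseline_median_ms": base,
        "event_median_ms": event,
        "correlated": correlated,
        "card": card,
    }
```

Medians are used rather than means so that one stray spike in either window does not decide the verdict. Both windows should use the same target, interval, and duration to stay comparable.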

Static HTML JSON pitfall

If embedding event data in HTML, use raw JSON in an inert script block:

import json
# Escape "</" so no string in the data can close the script element early.
data_json = json.dumps(data, ensure_ascii=False).replace("</", "<\\/")
html = f'<script id="data" type="application/json">{data_json}</script>'

Do not wrap the JSON with html.escape(). Browsers treat the contents of a script element as raw text and do not decode character references there, so in <script type="application/json"> the element's textContent still contains literal entities such as &quot;. JSON.parse() then fails, and the dashboard may show only static summary counts with a blank chart/timeline.
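The difference is easy to check from Python, because the browser hands the script contents to JSON.parse() unchanged. The sketch below uses `json.loads` as a stand-in for JSON.parse(); the helper names `embed_json` and `survives_html_escape` are illustrative.

```python
import html
import json


def embed_json(data) -> str:
    """Serialize data for a <script type="application/json"> block."""
    # "<\/" is a valid JSON escape for "</", so the payload still parses,
    # but the HTML parser no longer sees a closing script tag inside it.
    return json.dumps(data, ensure_ascii=False).replace("</", "<\\/")


def survives_html_escape(data) -> bool:
    """Return True if the html.escape()-wrapped JSON would still parse."""
    escaped = html.escape(json.dumps(data, ensure_ascii=False))
    try:
        json.loads(escaped)
    except json.JSONDecodeError:
        return False
    return True


events = [{"card": 'Internet got "laggy" under load </script>', "ms": 1200}]
payload = embed_json(events)
round_trips = json.loads(payload) == events
```

For any object or string payload, `survives_html_escape` returns False because every `"` has become `&quot;`, while the `embed_json` output round-trips and contains no literal `</script>`.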