fix/skills-pf-validate-cleanup #250

Merged
clawdie merged 2 commits from fix/skills-pf-validate-cleanup into main 2026-06-28 00:23:58 +02:00
2 changed files with 160 additions and 2 deletions

@@ -0,0 +1,123 @@
---
name: fail2ban-tailscale
description: "Prevent fail2ban from banning fleet SSH traffic. Root cause: password auth enabled triggers password-fallback failures during key negotiation. Fix: disable password auth or whitelist fleet IPs."
platforms: [linux]
---
# fail2ban & Fleet SSH Reliability
## Root cause
When a fleet node connects over SSH and the first key it offers is
rejected, the client falls back to password authentication, which `sshd`
still offers while password auth is enabled. Each failed attempt is
logged and counted by fail2ban. After `maxretry = 5` failures, the
source Tailscale IP is banned, breaking all fleet SSH to that node.
The trigger is NOT a brute-force attack. It's the key negotiation
sequence between trusted nodes during normal fleet operation.
## Fix — choose one path
### Path A: Disable password auth (recommended if key-only)
One setting, permanent. It removes the password attack surface entirely:
no password attempts means no password failures for fail2ban to count:
```sh
sudo sed -i 's/^#*PasswordAuthentication.*/PasswordAuthentication no/' /etc/ssh/sshd_config
# Validate the config before reloading; the unit is named "ssh" on Debian/Ubuntu.
sudo sshd -t && sudo systemctl reload sshd
```
Pros: Zero ongoing maintenance. Works for all hosts, known or unknown.
No IP lists to update. fail2ban becomes irrelevant for SSH.
Cons: Password login is disabled. If a node loses its private key,
physical/console access is needed. Not suitable for OOTB setups that
need password auth.
Verification:
```sh
ssh -o PreferredAuthentications=password localhost
# Should fail: "Permission denied (publickey)"
```
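The `sed` edit only touches `/etc/ssh/sshd_config`. If that file includes drop-ins (for example under `/etc/ssh/sshd_config.d/`), another file may still set the option. A minimal sketch for checking the effective value, assuming `sshd -T` is available (it prints the effective configuration with lowercase keywords). The helper name `password_auth_disabled` is hypothetical:

```sh
# Hypothetical helper: reads `sshd -T` output on stdin and succeeds only if
# the effective setting is "passwordauthentication no".
password_auth_disabled() {
  awk '$1 == "passwordauthentication" { value = $2 }
       END { exit (value == "no") ? 0 : 1 }'
}

# Usage (root is needed so sshd can read its host keys):
#   sudo sshd -T | password_auth_disabled && echo "password auth is off"
```

A missing keyword is treated as "not disabled", so the check fails closed.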
### Path B: Whitelist specific fleet IPs (if password auth must stay on)
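For nodes that need password auth (OOTB state, temporary access, shared
machines). Whitelist only known fleet nodes. Do NOT whitelist the
entire `100.64.0.0/10`: that is the whole address range Tailscale
assigns from, so it would exempt every device that can reach this node
over Tailscale, including shared and newly joined ones: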
```sh
# List fleet IPs (run on any node):
tailscale status | awk '/active|idle/{print $1}'
# Overwrites any existing jail.local. Replace the example IPs with your fleet's.
sudo tee /etc/fail2ban/jail.local >/dev/null <<'EOF'
[DEFAULT]
ignoreip = 127.0.0.1/8 ::1 100.72.229.63 100.103.255.41 100.73.44.93 100.108.235.54
[sshd]
enabled = true
EOF
sudo systemctl reload fail2ban
```
Pros: Password auth stays usable for operators.
Cons: Manual maintenance — add new node IPs on join. IP changes
require updates. Forgetting to update → ban returns.
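To reduce the manual maintenance, the `ignoreip` line can be generated from `tailscale status` instead of typed by hand. A minimal sketch, using the same `active|idle` filter as above. The helper name `fleet_ignoreip` is hypothetical, and the output should be reviewed before it is written to `jail.local`:

```sh
# Hypothetical helper: reads `tailscale status` output on stdin and prints an
# ignoreip line with loopback plus every peer whose line matches active or idle.
fleet_ignoreip() {
  awk 'BEGIN { line = "ignoreip = 127.0.0.1/8 ::1" }
       /active|idle/ { line = line " " $1 }
       END { print line }'
}

# Usage:
#   tailscale status | fleet_ignoreip
```

Peers that are offline when the command runs are skipped, so run it while the whole fleet is up, or merge the result with the previous list.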
### Path C: Both (production hardening)
Two independent controls — if someone accidentally re-enables passwords,
the whitelist still protects; if the whitelist misses a node, key-only
auth still blocks brute-force. Apply both Path A and Path B.
## What happens without this
The symptom is `Connection refused` on port 22, even when:
- `sshd` is running and listening on `0.0.0.0:22`
- `ufw`/`iptables` allows port 22
- `tailscale ping` works (35ms pong)
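The fail2ban ban targets only the source Tailscale IP, so the node still
appears reachable while the firewall rule fail2ban inserted rejects that
IP's SSH packets in the kernel. fail2ban's default action rejects rather
than silently drops, which is why the client sees `Connection refused`
instead of a timeout.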
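To confirm that a ban is the cause, check the jail on the affected node. A minimal sketch, assuming the jail is named `sshd` as in Path B. The helper name `banned_tailscale_ips` is hypothetical, and it only filters the output of `fail2ban-client status sshd`:

```sh
# Hypothetical helper: reads `fail2ban-client status sshd` output on stdin and
# prints each banned address that falls inside 100.64.0.0/10.
banned_tailscale_ips() {
  awk '/Banned IP list:/ {
         sub(/.*Banned IP list:[ \t]*/, "")
         n = split($0, ips, /[ \t]+/)
         for (i = 1; i <= n; i++) {
           split(ips[i], octet, /\./)
           if (octet[1] == 100 && octet[2] >= 64 && octet[2] <= 127) print ips[i]
         }
       }'
}

# Usage:
#   sudo fail2ban-client status sshd | banned_tailscale_ips
#   sudo fail2ban-client set sshd unbanip 100.72.229.63   # lift one ban
```

Unbanning only clears the symptom. The ban returns unless Path A or Path B is applied.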
## FreeBSD equivalent — PF rate limiting
FreeBSD nodes don't use fail2ban. The equivalent is PF SSH rate limiting
with `max-src-conn-rate` and an overload table:
```pf
# /etc/pf.conf
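table <ssh_brutes> persist
# Block listed sources first: with "quick" the first matching rule wins,
# so this must come before the pass rule below.
block quick from <ssh_brutes>
pass in quick on tailscale0 proto tcp from any to any port = ssh \
    flags S/SA keep state \
    (max-src-conn-rate 5/60, overload <ssh_brutes> flush global)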
```
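The limit is 5 new connections per 60 seconds per source IP. A source
that exceeds it is added to `<ssh_brutes>`, and `flush global` kills its
existing states. PF does not expire table entries on its own: the entry
stays until it is deleted or aged out, for example by running
`pfctl -t ssh_brutes -T expire 600` periodically for a 10-minute ban.
Established connections aren't counted, only newly completed TCP
handshakes.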
Manual unban:
```sh
sudo pfctl -t ssh_brutes -T delete 100.72.229.63
```
## Platform summary
| Platform | Tool | Fix |
| ------------ | -------- | ---------------------------------------------- |
| Linux | fail2ban | Path A (password off) or Path B (IP whitelist) |
| FreeBSD | PF | `max-src-conn-rate` + overload table |
| Mother (osa) | PF | `max-src-conn-rate` on tailscale0 SSH rule |
## Related
- `freebsd-admin` — PF rule management, `max-src-conn-rate` SSH rate limiting
- `mother-hive` wiki — per-node SSH key strategy, forced-command confinement
- `hive-routing` wiki — fleet communication reliability

@@ -56,11 +56,46 @@ For update-status questions, use the existing read-only hostd audit ops
the sysadmin update-report path. Do not expose `freebsd-update fetch` or run
mutating update commands for a status report.
## Tailscale controlplane exposure
## SSH & service exposure (PF rules)
When the controlplane API/dashboard is only exposed on Tailscale:
### Controlplane service ports
When the controlplane API/dashboard is exposed on Tailscale:
- allow `tailscale0` ingress to ports `3100` (direct API) and `443` (nginx proxy)
### SSH rate limiting (FreeBSD equivalent of fail2ban)
FreeBSD doesn't use fail2ban. PF handles SSH brute-force protection with
`max-src-conn-rate` and an overload table:
```pf
# /etc/pf.conf
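table <ssh_brutes> persist
# Block listed sources first: with "quick" the first matching rule wins,
# so this must come before the pass rule below.
block quick from <ssh_brutes>
pass in quick on tailscale0 proto tcp from any to any port = ssh \
    flags S/SA keep state \
    (max-src-conn-rate 5/60, overload <ssh_brutes> flush global)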
```
- `5/60`: 5 new connections per 60 seconds per source IP
- `overload`: source added to `<ssh_brutes>` table on exceed
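- `flush global`: kills all existing states from the offending source; table entries do not expire on their own (a 10-minute ban needs a periodic `pfctl -t ssh_brutes -T expire 600`)
- `keep state`: tracks connections statefully; the rate limit counts only newly completed TCP handshakes, so existing sessions are free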
Manual operations:
```sh
sudo pfctl -t ssh_brutes -T show # list banned IPs
sudo pfctl -t ssh_brutes -T delete 100.72.229.63 # unban specific IP
sudo pfctl -t ssh_brutes -T flush # clear all bans
```
For the Linux fleet fail2ban equivalent, see
[fail2ban-tailscale skill](../fail2ban-tailscale/SKILL.md).
- validate PF before reload (`sudo pfctl -nf /etc/pf.conf`) and then `sudo service pf reload`
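The validate-then-reload step can be wrapped so that a failed syntax check never reaches the reload. A minimal sketch, meant to be run as root. The function name `pf_reload_safe` and the `PFCTL`/`SERVICE` override variables are hypothetical and exist only so the function can be dry-run with stub commands:

```sh
# Hypothetical wrapper: reload PF only if the ruleset parses cleanly.
pf_reload_safe() {
  conf=${1:-/etc/pf.conf}
  # -n parses the ruleset without loading it; -f names the file to parse.
  if "${PFCTL:-pfctl}" -nf "$conf"; then
    "${SERVICE:-service}" pf reload
  else
    echo "pf: $conf failed validation; not reloading" >&2
    return 1
  fi
}

# Usage:
#   pf_reload_safe                 # checks /etc/pf.conf
#   pf_reload_safe /tmp/pf.conf.new
```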
## Workflow