tstest/natlab/vmtest: add TestDiscoKeyChange

Add a vmtest that brings up two gokrazy nodes A and B behind two One2OneNAT networks (so direct UDP works in both directions and any slowness can't be blamed on NAT traversal), establishes a WireGuard tunnel A → B with TSMP, then rotates B's disco key four times and asserts that the data plane recovers in both directions after each rotation. All pings are TSMP (the data-plane ping; disco pings would not exercise the WireGuard tunnel itself). The five pings: 1. A → B (initial; brings up the tunnel; 30s budget) 2. B → A after rotate (LocalAPI rotate-disco-key debug action) 3. A → B after rotate (LocalAPI) 4. B → A after restart (SIGKILL; gokrazy supervisor respawns) 5. A → B after restart (SIGKILL) Each post-rotation ping gets a 15-second budget. Two unavoidable multi-second waits dominate today: - The rotate-then-a→b phase takes ~10s on main because of LazyWG. After B's WantRunning bounce, B's wgengine resets its sentActivityAt/recvActivityAt maps and trims A out of the wireguard-go config as an "idle peer"; B only re-adds A on inbound activity, by which point A's first few TSMP packets have been silently dropped at B's tundev. The bradfitz/rm_lazy_wg branch removes that trimming entirely (verified locally: this phase drops to <100ms there). - The restart phases take ~5s for wireguard-go's RekeyTimeout handshake retry. After SIGKILL+respawn the first WG handshake init from the restarted node sometimes goes into the void (likely the brief peer-removed window in the receiver's two-step maybeReconfigWireguardLocked reconfig during which the peer is absent from wireguard-go), and wg-go's 5s+jitter retransmit timer is the next opportunity to retry. That retry succeeds and the staged TSMP packet flushes. Intrinsic to the protocol's retransmit policy. Once LazyWG is removed and the first-handshake-after-reconfig race is fixed, the budget should drop to 5s. Supporting changes: ipn/ipnlocal: DebugRotateDiscoKey now toggles WantRunning off and back on after rotating the disco key. magicsock.Conn.RotateDiscoKey only resets local disco state; without also dropping wireguard-go session keys, peers keep encrypting with their stale per-peer session against us until their rekey timer fires (WireGuard has no data-plane signaling to invalidate sessions). Bouncing WantRunning runs the engine through Reconfig(empty) → authReconfig, which drops every peer's WG session so the next packet either way triggers a fresh handshake. ipn/ipnlocal, ipn/localapi: add a debug-only "peer-disco-keys" LocalAPI action ([LocalBackend.DebugPeerDiscoKeys]) that returns a map[NodePublic]DiscoPublic from the current netmap. Tests reach it via [local.Client.DebugResultJSON]. We do not surface disco keys via [ipnstate.PeerStatus] because adding a non-comparable [key.DiscoPublic] field there breaks reflect-based test helpers (e.g. TestFilterFormatAndSortExitNodes' use of cmp.Diff), and general LocalAPI clients have no need for disco keys. Since the debug LocalAPI is gated behind the ts_omit_debug build tag, this endpoint is automatically stripped from small binaries. cmd/tta: add /restart-tailscaled handler (Linux-only, via /proc walk) to drive the SIGKILL phase. On gokrazy the supervisor respawns tailscaled within a second. tstest/integration/testcontrol: add Server.AllOnline. When set, every peer entry in MapResponses is marked Online=true. Several disco-key handling fast paths in controlclient and wgengine (removeUnwantedDiscoUpdates, removeUnwantedDiscoUpdatesFromFull NetmapUpdate, the wgengine tsmpLearnedDisco fast path) only fire for online peers; without this flag, tests exercising disco-key rotation only hit the offline-peer code paths, which mask issues and are several seconds slower in this scenario. Finer-grained per-node online tracking can be added later. tstest/natlab/vmtest: add Env.RotateDiscoKey, Env.RestartTailscaled, Env.PeerDiscoKey, Node.Name, an [AllOnline] EnvOption that plumbs through to testcontrol.Server.AllOnline, and an exported Env.Ping(from, to, type, timeout). Ping replaces the unexported helper so callers can specify both a ping type (PingDisco for warming peer state, PingTSMP for asserting end-to-end connectivity) and a deadline. PeerDiscoKey returns its LocalAPI error so callers inside tstest.WaitFor can retry transient failures rather than fataling the test. Updates #12639 Updates #13038 Change-Id: I3644f27fc30e52990ba25a3983498cc582ddb958 Signed-off-by: Brad Fitzpatrick <bradfitz@tailscale.com>
2026-04-28 21:29:05 +00:00
parent 22ff402da9
commit 15cba0a3f6
7 changed files with 519 additions and 11 deletions
@@ -19,6 +19,7 @@ import (
 	"bytes"
 	"context"
 	"encoding/base64"
+	"encoding/json"
 	"flag"
 	"fmt"
 	"io"
@@ -43,6 +44,7 @@ import (
 	"tailscale.com/ipn"
 	"tailscale.com/ipn/ipnstate"
 	"tailscale.com/tailcfg"
+	"tailscale.com/tstest"
 	"tailscale.com/tstest/integration/testcontrol"
 	"tailscale.com/tstest/natlab/vnet"
 	"tailscale.com/types/key"
@@ -85,6 +87,7 @@ type Env struct {
 	qemuProcs []*exec.Cmd // launched QEMU processes

 	sameTailnetUser bool // all nodes register as the same Tailnet user
+	allOnline       bool // mark every peer as Online=true in MapResponses

 	// Shared resource initialization (sync.Once for things multiple nodes share).
 	vnetOnce      sync.Once
@@ -346,6 +349,16 @@ func SameTailnetUser() EnvOption {
 	return envOptFunc(func(e *Env) { e.sameTailnetUser = true })
 }

+// AllOnline returns an [EnvOption] that makes the test control server mark
+// every peer as Online=true in MapResponses (testcontrol.Server.AllOnline).
+// Several disco-key handling fast paths in the controlclient and wgengine
+// only fire when the peer is reported online; without this option those
+// paths are silently skipped, which can mask bugs and slow down recovery
+// from disco-key rotations.
+func AllOnline() EnvOption {
+	return envOptFunc(func(e *Env) { e.allOnline = true })
+}
+
 // AddNetwork creates a new virtual network. Arguments follow the same pattern as
 // vnet.Config.AddNetwork (string IPs, NAT types, NetworkService values).
 func (e *Env) AddNetwork(opts ...any) *vnet.Network {
@@ -414,6 +427,11 @@ func (e *Env) AddNode(name string, opts ...any) *Node {

 // LanIP returns the LAN IPv4 address of this node on the given network.
 // This is only valid after Env.Start() has been called.
+// Name returns the node's name as set in [Env.AddNode].
+func (n *Node) Name() string {
+	return n.name
+}
+
 func (n *Node) LanIP(net *vnet.Network) netip.Addr {
 	return n.vnetNode.LanIP(net)
 }
@@ -864,33 +882,172 @@ func (e *Env) ApproveRoutes(n *Node, routes ...string) {
 	}
 }

-// ping pings from one node to another's Tailscale IP, retrying until it succeeds
-// or the timeout expires. This establishes the WireGuard tunnel between the nodes.
+// ping does a disco ping from one node to another's Tailscale IP, retrying
+// for up to 30 seconds, fataling on failure. It is used internally to wake
+// up magicsock peer state before a test runs; tests that want to assert
+// connectivity should use [Env.Ping] with the appropriate ping type and
+// timeout.
 func (e *Env) ping(from, to *Node) {
-	ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
+	e.t.Helper()
+	if err := e.Ping(from, to, tailcfg.PingDisco, 30*time.Second); err != nil {
+		e.t.Fatal(err)
+	}
+}
+
+// Ping pings from one node to another's Tailscale IP using the given ping
+// type, retrying until it succeeds or timeout expires. It returns the error
+// from the last attempt if the timeout expires. Unlike the internal ping
+// helper, it does not fatal the test on failure; callers can check the error
+// to assert on timing.
+//
+// [tailcfg.PingTSMP] actually flows packets across the WireGuard tunnel and is
+// the right choice for asserting end-to-end connectivity.
+// [tailcfg.PingDisco] only exchanges disco messages between magicsock layers
+// and is useful for warming up peer state without requiring a working tunnel.
+func (e *Env) Ping(from, to *Node, ptype tailcfg.PingType, timeout time.Duration) error {
+	e.t.Helper()
+	ctx, cancel := context.WithTimeout(context.Background(), timeout)
 	defer cancel()

 	toSt, err := to.agent.Status(ctx)
 	if err != nil {
-		e.t.Fatalf("ping: can't get %s status: %v", to.name, err)
+		return fmt.Errorf("ping: can't get %s status: %w", to.name, err)
 	}
 	if len(toSt.Self.TailscaleIPs) == 0 {
-		e.t.Fatalf("ping: %s has no Tailscale IPs", to.name)
+		return fmt.Errorf("ping: %s has no Tailscale IPs", to.name)
 	}
 	targetIP := toSt.Self.TailscaleIPs[0]

+	var lastErr error
 	for {
-		pingCtx, pingCancel := context.WithTimeout(ctx, 3*time.Second)
-		pr, err := from.agent.PingWithOpts(pingCtx, targetIP, tailcfg.PingDisco, local.PingOpts{})
+		// Per-attempt timeout: cap at 3s but never exceed the remaining budget.
+		attemptTimeout := 3 * time.Second
+		if d := time.Until(deadline(ctx)); d < attemptTimeout {
+			attemptTimeout = d
+		}
+		if attemptTimeout <= 0 {
+			break
+		}
+		pingCtx, pingCancel := context.WithTimeout(ctx, attemptTimeout)
+		pr, err := from.agent.PingWithOpts(pingCtx, targetIP, ptype, local.PingOpts{})
 		pingCancel()
 		if err == nil && pr.Err == "" {
-			e.logVerbosef("ping: %s -> %s OK", from.name, targetIP)
-			return
+			e.logVerbosef("ping(%s): %s -> %s OK", ptype, from.name, targetIP)
+			return nil
+		}
+		switch {
+		case err != nil:
+			lastErr = err
+		case pr.Err != "":
+			lastErr = fmt.Errorf("%s", pr.Err)
 		}
 		if ctx.Err() != nil {
-			e.t.Fatalf("ping: %s -> %s timed out", from.name, targetIP)
+			break
 		}
-		time.Sleep(time.Second)
+		time.Sleep(500 * time.Millisecond)
+	}
+	if lastErr == nil {
+		lastErr = ctx.Err()
+	}
+	return fmt.Errorf("ping(%s): %s -> %s (%s) timed out after %v: %w", ptype, from.name, to.name, targetIP, timeout, lastErr)
+}
+
+// deadline returns ctx's deadline, or a zero Time if it has none.
+func deadline(ctx context.Context) time.Time {
+	d, _ := ctx.Deadline()
+	return d
+}
+
+// PeerDiscoKey returns n's view of the given peer's disco key. It returns a
+// non-nil error if the LocalAPI request fails (e.g. tailscaled briefly
+// unavailable during a restart). It returns (zero, false, nil) if n is
+// reachable but has no record of the given peer in its current netmap.
+//
+// PeerDiscoKey is suitable for use inside a [tstest.WaitFor] poll loop: it
+// does not fatal the test on transient errors.
+//
+// The disco key is fetched from the debug-only "peer-disco-keys" LocalAPI
+// action ([ipnlocal.LocalBackend.DebugPeerDiscoKeys]) rather than via
+// [ipnstate.Status], to keep the production PeerStatus struct free of disco
+// keys (and free of non-comparable fields like [key.DiscoPublic] that break
+// reflect-based test helpers).
+func (e *Env) PeerDiscoKey(n *Node, peer key.NodePublic) (key.DiscoPublic, bool, error) {
+	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
+	defer cancel()
+	got, err := n.agent.DebugResultJSON(ctx, "peer-disco-keys")
+	if err != nil {
+		return key.DiscoPublic{}, false, err
+	}
+	// DebugResultJSON returns the result as a generic any (the body is
+	// re-decoded into any), so the map comes back keyed by string text-
+	// encoded node keys. Re-marshal+unmarshal into a typed map for cleaner
+	// lookup. (Roundtripping through JSON is fine for a test helper.)
+	raw, err := json.Marshal(got)
+	if err != nil {
+		return key.DiscoPublic{}, false, fmt.Errorf("re-marshal: %w", err)
+	}
+	var m map[key.NodePublic]key.DiscoPublic
+	if err := json.Unmarshal(raw, &m); err != nil {
+		return key.DiscoPublic{}, false, fmt.Errorf("unmarshal peer-disco-keys: %w", err)
+	}
+	d, ok := m[peer]
+	return d, ok, nil
+}
+
+// RotateDiscoKey asks tailscaled on n to rotate its discovery (magicsock) key
+// in place via the LocalAPI debug action. The node key, control connection,
+// and other tailscaled state are unaffected. It fatals the test on error.
+func (e *Env) RotateDiscoKey(n *Node) {
+	e.t.Helper()
+	ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
+	defer cancel()
+	if err := n.agent.DebugAction(ctx, "rotate-disco-key"); err != nil {
+		e.t.Fatalf("RotateDiscoKey(%s): %v", n.name, err)
+	}
+}
+
+// RestartTailscaled signals tailscaled on n to die so that its supervisor
+// (gokrazy) restarts it. It then waits for tailscaled to come back to the
+// "Running" backend state. It fatals the test on error.
+//
+// Restarting tailscaled is currently only supported on gokrazy nodes.
+func (e *Env) RestartTailscaled(n *Node) {
+	e.t.Helper()
+	if !n.os.IsGokrazy {
+		e.t.Fatalf("RestartTailscaled(%s): only supported on gokrazy nodes (have %q)", n.name, n.os.Name)
+	}
+	ctx, cancel := context.WithTimeout(context.Background(), 60*time.Second)
+	defer cancel()
+
+	req, err := http.NewRequestWithContext(ctx, "GET", "http://unused/restart-tailscaled", nil)
+	if err != nil {
+		e.t.Fatalf("RestartTailscaled(%s): %v", n.name, err)
+	}
+	res, err := n.agent.HTTPClient.Do(req)
+	if err != nil {
+		e.t.Fatalf("RestartTailscaled(%s): %v", n.name, err)
+	}
+	body, _ := io.ReadAll(res.Body)
+	res.Body.Close()
+	if res.StatusCode != 200 {
+		e.t.Fatalf("RestartTailscaled(%s): %s: %s", n.name, res.Status, body)
+	}
+	e.t.Logf("[%s] %s", n.name, strings.TrimSpace(string(body)))
+
+	// Wait for tailscaled to come back. Status calls will fail while the unix
+	// socket is gone, then return Starting/NeedsLogin briefly before settling
+	// on Running.
+	if err := tstest.WaitFor(45*time.Second, func() error {
+		st, err := n.agent.Status(ctx)
+		if err != nil {
+			return err
+		}
+		if st.BackendState != "Running" {
+			return fmt.Errorf("backend state = %q", st.BackendState)
+		}
+		return nil
+	}); err != nil {
+		e.t.Fatalf("RestartTailscaled(%s): waiting for Running: %v", n.name, err)
 	}
 }

@@ -1094,6 +1251,9 @@ func (e *Env) initVnet() {
 		if e.sameTailnetUser {
 			e.server.ControlServer().AllNodesSameUser = true
 		}
+		if e.allOnline {
+			e.server.ControlServer().AllOnline = true
+		}
 	})
 }