tstest/natlab/vmtest: add TestDiscoKeyChange

Add a vmtest that brings up two gokrazy nodes A and B behind two
One2OneNAT networks (so direct UDP works in both directions and any
slowness can't be blamed on NAT traversal), establishes a WireGuard
tunnel A → B with TSMP, then rotates B's disco key four times and
asserts that the data plane recovers in both directions after each
rotation. All pings are TSMP (the data-plane ping; disco pings would
not exercise the WireGuard tunnel itself).

The five pings:

  1. A → B  (initial; brings up the tunnel; 30s budget)
  2. B → A  after rotate (LocalAPI rotate-disco-key debug action)
  3. A → B  after rotate (LocalAPI)
  4. B → A  after restart (SIGKILL; gokrazy supervisor respawns)
  5. A → B  after restart (SIGKILL)

Each post-rotation ping gets a 15-second budget. Two multi-second
waits, unavoidable for now, dominate that budget:

  - The rotate-then-A→B phase takes ~10s on main because of LazyWG.
    After B's WantRunning bounce, B's wgengine resets its
    sentActivityAt/recvActivityAt maps and trims A out of the
    wireguard-go config as an "idle peer"; B only re-adds A on
    inbound activity, by which point A's first few TSMP packets
    have been silently dropped at B's tundev. The
    bradfitz/rm_lazy_wg branch removes that trimming entirely
    (verified locally: this phase drops to <100ms there).

  - The restart phases take ~5s for wireguard-go's RekeyTimeout
    handshake retry. After SIGKILL+respawn the first WG handshake
    init from the restarted node sometimes goes into the void
    (likely the brief peer-removed window in the receiver's
    two-step maybeReconfigWireguardLocked reconfig during which
    the peer is absent from wireguard-go), and wg-go's 5s+jitter
    retransmit timer is the next opportunity to retry. That retry
    succeeds and the staged TSMP packet flushes. The wait is
    intrinsic to the protocol's retransmit policy.

Once LazyWG is removed and the first-handshake-after-reconfig race
is fixed, the budget should drop to 5s.

Supporting changes:

  ipn/ipnlocal: DebugRotateDiscoKey now toggles WantRunning off and
  back on after rotating the disco key. magicsock.Conn.RotateDiscoKey
  only resets local disco state; without also dropping wireguard-go
  session keys, peers keep encrypting with their stale per-peer
  session against us until their rekey timer fires (WireGuard has no
  data-plane signaling to invalidate sessions). Bouncing WantRunning
  runs the engine through Reconfig(empty) → authReconfig, which
  drops every peer's WG session so the next packet either way
  triggers a fresh handshake.

  ipn/ipnlocal, ipn/localapi: add a debug-only "peer-disco-keys"
  LocalAPI action ([LocalBackend.DebugPeerDiscoKeys]) that returns
  a map[NodePublic]DiscoPublic from the current netmap. Tests reach
  it via [local.Client.DebugResultJSON]. We do not surface disco
  keys via [ipnstate.PeerStatus] because adding a non-comparable
  [key.DiscoPublic] field there breaks reflect-based test helpers
  (e.g. TestFilterFormatAndSortExitNodes' use of cmp.Diff), and
  general LocalAPI clients have no need for disco keys. Since the
  debug LocalAPI is compiled out under the ts_omit_debug build tag,
  this endpoint is automatically stripped from small binaries.

  cmd/tta: add /restart-tailscaled handler (Linux-only, via /proc walk)
  to drive the SIGKILL phase. On gokrazy the supervisor respawns
  tailscaled within a second.

  tstest/integration/testcontrol: add Server.AllOnline. When set,
  every peer entry in MapResponses is marked Online=true. Several
  disco-key handling fast paths in controlclient and wgengine
  (removeUnwantedDiscoUpdates,
  removeUnwantedDiscoUpdatesFromFullNetmapUpdate, the wgengine
  tsmpLearnedDisco fast path) only fire
  for online peers; without this flag, tests exercising disco-key
  rotation only hit the offline-peer code paths, which mask issues
  and are several seconds slower in this scenario. Finer-grained
  per-node online tracking can be added later.

  tstest/natlab/vmtest: add Env.RotateDiscoKey,
  Env.RestartTailscaled, Env.PeerDiscoKey, Node.Name, an
  [AllOnline] EnvOption that plumbs through to
  testcontrol.Server.AllOnline, and an exported
  Env.Ping(from, to, type, timeout). Ping replaces the unexported
  helper so callers can specify both a ping type (PingDisco for
  warming peer state, PingTSMP for asserting end-to-end
  connectivity) and a deadline. PeerDiscoKey returns its LocalAPI
  error so callers inside tstest.WaitFor can retry transient
  failures rather than fataling the test.

Updates #12639
Updates #13038

Change-Id: I3644f27fc30e52990ba25a3983498cc582ddb958
Signed-off-by: Brad Fitzpatrick <bradfitz@tailscale.com>
Brad Fitzpatrick
2026-04-28 21:29:05 +00:00
committed by Brad Fitzpatrick
parent 22ff402da9
commit 15cba0a3f6
7 changed files with 519 additions and 11 deletions
+171 -11
@@ -19,6 +19,7 @@ import (
"bytes"
"context"
"encoding/base64"
"encoding/json"
"flag"
"fmt"
"io"
@@ -43,6 +44,7 @@ import (
"tailscale.com/ipn"
"tailscale.com/ipn/ipnstate"
"tailscale.com/tailcfg"
"tailscale.com/tstest"
"tailscale.com/tstest/integration/testcontrol"
"tailscale.com/tstest/natlab/vnet"
"tailscale.com/types/key"
@@ -85,6 +87,7 @@ type Env struct {
qemuProcs []*exec.Cmd // launched QEMU processes
sameTailnetUser bool // all nodes register as the same Tailnet user
allOnline bool // mark every peer as Online=true in MapResponses
// Shared resource initialization (sync.Once for things multiple nodes share).
vnetOnce sync.Once
@@ -346,6 +349,16 @@ func SameTailnetUser() EnvOption {
return envOptFunc(func(e *Env) { e.sameTailnetUser = true })
}
// AllOnline returns an [EnvOption] that makes the test control server mark
// every peer as Online=true in MapResponses (testcontrol.Server.AllOnline).
// Several disco-key handling fast paths in the controlclient and wgengine
// only fire when the peer is reported online; without this option those
// paths are silently skipped, which can mask bugs and slow down recovery
// from disco-key rotations.
func AllOnline() EnvOption {
return envOptFunc(func(e *Env) { e.allOnline = true })
}
// AddNetwork creates a new virtual network. Arguments follow the same pattern as
// vnet.Config.AddNetwork (string IPs, NAT types, NetworkService values).
func (e *Env) AddNetwork(opts ...any) *vnet.Network {
@@ -414,6 +427,11 @@ func (e *Env) AddNode(name string, opts ...any) *Node {
// Name returns the node's name as set in [Env.AddNode].
func (n *Node) Name() string {
return n.name
}
// LanIP returns the LAN IPv4 address of this node on the given network.
// This is only valid after Env.Start() has been called.
func (n *Node) LanIP(net *vnet.Network) netip.Addr {
return n.vnetNode.LanIP(net)
}
@@ -864,33 +882,172 @@ func (e *Env) ApproveRoutes(n *Node, routes ...string) {
}
}
-// ping pings from one node to another's Tailscale IP, retrying until it succeeds
-// or the timeout expires. This establishes the WireGuard tunnel between the nodes.
+// ping does a disco ping from one node to another's Tailscale IP, retrying
+// for up to 30 seconds, fataling on failure. It is used internally to wake
+// up magicsock peer state before a test runs; tests that want to assert
+// connectivity should use [Env.Ping] with the appropriate ping type and
+// timeout.
 func (e *Env) ping(from, to *Node) {
-ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
+e.t.Helper()
+if err := e.Ping(from, to, tailcfg.PingDisco, 30*time.Second); err != nil {
+e.t.Fatal(err)
+}
+}
+// Ping pings from one node to another's Tailscale IP using the given ping
+// type, retrying until it succeeds or timeout expires. It returns the error
+// from the last attempt if the timeout expires. Unlike the internal ping
+// helper, it does not fatal the test on failure; callers can check the error
+// to assert on timing.
+//
+// [tailcfg.PingTSMP] actually flows packets across the WireGuard tunnel and is
+// the right choice for asserting end-to-end connectivity.
+// [tailcfg.PingDisco] only exchanges disco messages between magicsock layers
+// and is useful for warming up peer state without requiring a working tunnel.
+func (e *Env) Ping(from, to *Node, ptype tailcfg.PingType, timeout time.Duration) error {
+e.t.Helper()
+ctx, cancel := context.WithTimeout(context.Background(), timeout)
 defer cancel()
 toSt, err := to.agent.Status(ctx)
 if err != nil {
-e.t.Fatalf("ping: can't get %s status: %v", to.name, err)
+return fmt.Errorf("ping: can't get %s status: %w", to.name, err)
 }
 if len(toSt.Self.TailscaleIPs) == 0 {
-e.t.Fatalf("ping: %s has no Tailscale IPs", to.name)
+return fmt.Errorf("ping: %s has no Tailscale IPs", to.name)
 }
 targetIP := toSt.Self.TailscaleIPs[0]
+var lastErr error
 for {
-pingCtx, pingCancel := context.WithTimeout(ctx, 3*time.Second)
-pr, err := from.agent.PingWithOpts(pingCtx, targetIP, tailcfg.PingDisco, local.PingOpts{})
+// Per-attempt timeout: cap at 3s but never exceed the remaining budget.
+attemptTimeout := 3 * time.Second
+if d := time.Until(deadline(ctx)); d < attemptTimeout {
+attemptTimeout = d
+}
+if attemptTimeout <= 0 {
+break
+}
+pingCtx, pingCancel := context.WithTimeout(ctx, attemptTimeout)
+pr, err := from.agent.PingWithOpts(pingCtx, targetIP, ptype, local.PingOpts{})
 pingCancel()
 if err == nil && pr.Err == "" {
-e.logVerbosef("ping: %s -> %s OK", from.name, targetIP)
-return
+e.logVerbosef("ping(%s): %s -> %s OK", ptype, from.name, targetIP)
+return nil
 }
+switch {
+case err != nil:
+lastErr = err
+case pr.Err != "":
+lastErr = fmt.Errorf("%s", pr.Err)
+}
 if ctx.Err() != nil {
-e.t.Fatalf("ping: %s -> %s timed out", from.name, targetIP)
+break
 }
-time.Sleep(time.Second)
+time.Sleep(500 * time.Millisecond)
 }
+if lastErr == nil {
+lastErr = ctx.Err()
+}
+return fmt.Errorf("ping(%s): %s -> %s (%s) timed out after %v: %w", ptype, from.name, to.name, targetIP, timeout, lastErr)
 }
// deadline returns ctx's deadline, or a zero Time if it has none.
func deadline(ctx context.Context) time.Time {
d, _ := ctx.Deadline()
return d
}
// PeerDiscoKey returns n's view of the given peer's disco key. It returns a
// non-nil error if the LocalAPI request fails (e.g. tailscaled briefly
// unavailable during a restart). It returns (zero, false, nil) if n is
// reachable but has no record of the given peer in its current netmap.
//
// PeerDiscoKey is suitable for use inside a [tstest.WaitFor] poll loop: it
// does not fatal the test on transient errors.
//
// The disco key is fetched from the debug-only "peer-disco-keys" LocalAPI
// action ([ipnlocal.LocalBackend.DebugPeerDiscoKeys]) rather than via
// [ipnstate.Status], to keep the production PeerStatus struct free of disco
// keys (and free of non-comparable fields like [key.DiscoPublic] that break
// reflect-based test helpers).
func (e *Env) PeerDiscoKey(n *Node, peer key.NodePublic) (key.DiscoPublic, bool, error) {
ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
defer cancel()
got, err := n.agent.DebugResultJSON(ctx, "peer-disco-keys")
if err != nil {
return key.DiscoPublic{}, false, err
}
// DebugResultJSON returns the result as a generic any (the body is
// re-decoded into any), so the map comes back keyed by string text-
// encoded node keys. Re-marshal+unmarshal into a typed map for cleaner
// lookup. (Roundtripping through JSON is fine for a test helper.)
raw, err := json.Marshal(got)
if err != nil {
return key.DiscoPublic{}, false, fmt.Errorf("re-marshal: %w", err)
}
var m map[key.NodePublic]key.DiscoPublic
if err := json.Unmarshal(raw, &m); err != nil {
return key.DiscoPublic{}, false, fmt.Errorf("unmarshal peer-disco-keys: %w", err)
}
d, ok := m[peer]
return d, ok, nil
}
// RotateDiscoKey asks tailscaled on n to rotate its discovery (magicsock) key
// in place via the LocalAPI debug action. The node key, control connection,
// and other tailscaled state are unaffected. It fatals the test on error.
func (e *Env) RotateDiscoKey(n *Node) {
e.t.Helper()
ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
defer cancel()
if err := n.agent.DebugAction(ctx, "rotate-disco-key"); err != nil {
e.t.Fatalf("RotateDiscoKey(%s): %v", n.name, err)
}
}
// RestartTailscaled signals tailscaled on n to die so that its supervisor
// (gokrazy) restarts it. It then waits for tailscaled to come back to the
// "Running" backend state. It fatals the test on error.
//
// Restarting tailscaled is currently only supported on gokrazy nodes.
func (e *Env) RestartTailscaled(n *Node) {
e.t.Helper()
if !n.os.IsGokrazy {
e.t.Fatalf("RestartTailscaled(%s): only supported on gokrazy nodes (have %q)", n.name, n.os.Name)
}
ctx, cancel := context.WithTimeout(context.Background(), 60*time.Second)
defer cancel()
req, err := http.NewRequestWithContext(ctx, "GET", "http://unused/restart-tailscaled", nil)
if err != nil {
e.t.Fatalf("RestartTailscaled(%s): %v", n.name, err)
}
res, err := n.agent.HTTPClient.Do(req)
if err != nil {
e.t.Fatalf("RestartTailscaled(%s): %v", n.name, err)
}
body, _ := io.ReadAll(res.Body)
res.Body.Close()
if res.StatusCode != 200 {
e.t.Fatalf("RestartTailscaled(%s): %s: %s", n.name, res.Status, body)
}
e.t.Logf("[%s] %s", n.name, strings.TrimSpace(string(body)))
// Wait for tailscaled to come back. Status calls will fail while the unix
// socket is gone, then return Starting/NeedsLogin briefly before settling
// on Running.
if err := tstest.WaitFor(45*time.Second, func() error {
st, err := n.agent.Status(ctx)
if err != nil {
return err
}
if st.BackendState != "Running" {
return fmt.Errorf("backend state = %q", st.BackendState)
}
return nil
}); err != nil {
e.t.Fatalf("RestartTailscaled(%s): waiting for Running: %v", n.name, err)
}
}
@@ -1094,6 +1251,9 @@ func (e *Env) initVnet() {
if e.sameTailnetUser {
e.server.ControlServer().AllNodesSameUser = true
}
if e.allOnline {
e.server.ControlServer().AllOnline = true
}
})
}