tstest/natlab, .github/workflows: add opt-in natlab CI workflow

The natlab vmtest suite (tstest/natlab/vmtest) and the integration nat
tests are gated behind --run-vm-tests because they need KVM and are
slow. Until now nothing in CI exercised them apart from a single
canary TestEasyEasy run on every PR.

Add .github/workflows/natlab-test.yml that runs the full opt-in suite
on demand (workflow_dispatch), on PRs labeled "natlab", and on main
every 12 hours via cron. The workflow has two phases:

  - "prepare" builds the gokrazy VM image, downloads the Ubuntu and
    FreeBSD cloud images once via the new natlabprep tool, and emits
    a dynamic JSON matrix of every TestX function it finds in the two
    opt-in packages.
  - "test" is a per-test matrix that depends on prepare. Each matrix
    job restores the shared caches and runs a single test, so adding
    a new TestFoo is automatically picked up on the next run without
    any workflow edits.
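
The test discovery step above can be sketched roughly as follows. This is
not the natlabprep implementation (its source is not part of this
excerpt); it is a minimal illustration that parses a test file, collects
top-level TestXxx function names, and prints a JSON matrix. The
"include"/"pkg"/"test" shape, the helper names, and the sample source are
all assumptions made for the example.

```go
package main

import (
	"encoding/json"
	"fmt"
	"go/ast"
	"go/parser"
	"go/token"
	"sort"
	"strings"
	"unicode"
	"unicode/utf8"
)

// sampleSrc is a made-up test file used only for this illustration.
const sampleSrc = `package vmtest

import "testing"

func TestTwoMacOSVMsCanPing(t *testing.T) {}
func TestEasyEasy(t *testing.T)           {}
func TestMain(m *testing.M)               {}
func Testify()                            {}
func helper()                             {}
`

// matrixEntry is one job in the assumed (hypothetical) matrix shape.
type matrixEntry struct {
	Pkg  string `json:"pkg"`
	Test string `json:"test"`
}

// isTestName reports whether name follows the "go test" naming rule:
// "Test" followed by nothing or by a rune that is not lowercase.
// TestMain is excluded because it is the package's test entry point.
func isTestName(name string) bool {
	if !strings.HasPrefix(name, "Test") || name == "TestMain" {
		return false
	}
	rest := strings.TrimPrefix(name, "Test")
	if rest == "" {
		return true
	}
	r, _ := utf8.DecodeRuneInString(rest)
	return !unicode.IsLower(r)
}

// testFuncs returns the sorted names of top-level test functions in src.
// It checks names only, not signatures.
func testFuncs(src string) ([]string, error) {
	f, err := parser.ParseFile(token.NewFileSet(), "x_test.go", src, 0)
	if err != nil {
		return nil, err
	}
	var names []string
	for _, d := range f.Decls {
		fd, ok := d.(*ast.FuncDecl)
		if !ok || fd.Recv != nil { // skip non-functions and methods
			continue
		}
		if isTestName(fd.Name.Name) {
			names = append(names, fd.Name.Name)
		}
	}
	sort.Strings(names)
	return names, nil
}

// matrixJSON renders one matrix entry per test in the assumed shape.
func matrixJSON(pkg string, tests []string) (string, error) {
	m := struct {
		Include []matrixEntry `json:"include"`
	}{}
	for _, t := range tests {
		m.Include = append(m.Include, matrixEntry{Pkg: pkg, Test: t})
	}
	b, err := json.Marshal(m)
	return string(b), err
}

func main() {
	tests, err := testFuncs(sampleSrc)
	if err != nil {
		panic(err)
	}
	out, err := matrixJSON("./tstest/natlab/vmtest", tests)
	if err != nil {
		panic(err)
	}
	fmt.Println(out)
}
```

Because the matrix is derived from the parsed source rather than a
hand-maintained list, a newly added TestFoo shows up in the output with
no other change, which is the property the "test" phase relies on.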

Rename the existing natlab-integrationtest.yml to natlab-basic.yml
since it's the small smoke variant (just TestEasyEasy on every PR);
the new natlab-test.yml is the bigger suite. The job inside is
renamed to EasyEasy for the same reason.

Move the macOS arm64 host check from vmtest.Env.Start into
vmtest.Env.AddNode so a test that adds a vmtest.MacOS node skips
immediately on a non-macOS host, and add an explicit
skipIfNotMacOSArm64 helper at the top of the two macOS-only tests
so the platform requirement is obvious to readers.

Quiet the takeAgentConnOne miss log in tstest/natlab/vnet by default
(it was the overwhelming majority of bytes in CI logs, with no signal
in healthy runs) and replace it with a periodic "still waiting" line
that only fires after 10s, so a truly stuck agent connection still
surfaces.
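
The quiet-then-periodic logging pattern described above could look
something like the sketch below. It is not the vnet code from this
commit (that hunk is not shown here); waitLogger, its fields, and the
10s repeat interval are hypothetical. Only the 10s quiet period and the
"still waiting" wording come from the description above.

```go
package main

import (
	"fmt"
	"time"
)

// waitLogger (hypothetical) swallows per-miss log lines and instead
// emits a "still waiting" line once the wait has lasted at least quiet,
// then at most once per interval after that.
type waitLogger struct {
	start    time.Time
	lastLog  time.Time
	quiet    time.Duration
	interval time.Duration
	logf     func(format string, args ...any)
}

// miss records one failed attempt at time now and reports whether a
// line was logged for it.
func (w *waitLogger) miss(now time.Time) bool {
	waited := now.Sub(w.start)
	if waited < w.quiet {
		return false // healthy runs resolve before this and stay silent
	}
	if !w.lastLog.IsZero() && now.Sub(w.lastLog) < w.interval {
		return false // already reported recently
	}
	w.lastLog = now
	w.logf("still waiting for agent connection after %v", waited.Round(time.Second))
	return true
}

// simulateSeconds drives a waitLogger with a fake clock: one miss every
// stepSec seconds until totalSec, and returns the lines it logged.
func simulateSeconds(stepSec, totalSec int) []string {
	var lines []string
	start := time.Unix(0, 0)
	w := &waitLogger{
		start:    start,
		quiet:    10 * time.Second,
		interval: 10 * time.Second, // assumed repeat interval
		logf: func(format string, args ...any) {
			lines = append(lines, fmt.Sprintf(format, args...))
		},
	}
	for t := stepSec; t <= totalSec; t += stepSec {
		w.miss(start.Add(time.Duration(t) * time.Second))
	}
	return lines
}

func main() {
	for _, l := range simulateSeconds(1, 25) {
		fmt.Println(l)
	}
}
```

Passing the current time into miss, rather than calling time.Now inside
it, keeps the sketch deterministic: 25 simulated misses one second apart
produce two lines (at 10s and 20s) instead of 25.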

Updates #13038

Change-Id: I4582098d8865200fd5a73a9b696942319ccf3bf0
Signed-off-by: Brad Fitzpatrick <bradfitz@tailscale.com>
Brad Fitzpatrick
2026-05-06 20:07:45 +00:00
committed by Brad Fitzpatrick
parent 4eec4423b4
commit e062b46984
9 changed files with 454 additions and 78 deletions
+26 -1
@@ -7,6 +7,7 @@ import (
"bytes"
"fmt"
"net/netip"
"runtime"
"strings"
"testing"
"time"
@@ -21,7 +22,20 @@ import (
"tailscale.com/types/netmap"
)
// skipIfNotMacOSArm64 skips the test when the host isn't a macOS arm64 host.
// macOS VM tests require Apple Virtualization.framework via tailmac.
// AddNode also enforces this when a macOS node is added, but having an
// explicit skip at the top of macOS-only tests makes the requirement
// obvious to readers.
func skipIfNotMacOSArm64(t *testing.T) {
	t.Helper()
	if runtime.GOOS != "darwin" || runtime.GOARCH != "arm64" {
		t.Skipf("macOS VM tests require a macOS arm64 host (got %s/%s)", runtime.GOOS, runtime.GOARCH)
	}
}
func TestMacOSAndLinuxCanPing(t *testing.T) {
skipIfNotMacOSArm64(t)
env := vmtest.New(t)
lan := env.AddNetwork("192.168.1.1/24")
@@ -39,6 +53,7 @@ func TestMacOSAndLinuxCanPing(t *testing.T) {
}
func TestTwoMacOSVMsCanPing(t *testing.T) {
skipIfNotMacOSArm64(t)
env := vmtest.New(t)
lan := env.AddNetwork("192.168.1.1/24")
@@ -969,8 +984,18 @@ func TestCachedNetmapAfterRestart(t *testing.T) {
}
netmapCheckStep.End(nil)
// 90s is generous on purpose. After both nodes restart with stale cached
// netmap entries, a's first WG handshake to b's pre-restart endpoint
// hits the dead NAT mapping on b's side and is silently dropped (we
// see this as "no recent outgoing packet" NAT drops in the vnet log).
// Recovery then waits on wireguard-go's REKEY_TIMEOUT (~5s) before the
// next handshake attempt, and on disco-via-DERP to teach each side the
// other's new endpoint. On an idle host this converges in well under
// 15s; on a contended host (a 14/16-CPU-loaded local repro, or any
// shared CI runner) the same sequence has been observed at 50-60s
// because every timer fires multiple times under scheduling jitter.
pingStep.Begin()
-if err := env.Ping(a, b, tailcfg.PingTSMP, 30*time.Second); err != nil {
+if err := env.Ping(a, b, tailcfg.PingTSMP, 90*time.Second); err != nil {
pingStep.End(err)
t.Fatal(err)
}