smoke-tests-leak

Smoke tests leak service daemons + tmp dirs — happened twice today, 70+ orphaned processes total

Metadata

Status: done
Assigned: agent-818
Agent identity: f51439356729d112a6c404803d88015d5b44832c6c584c62b96732b63c2b0c7e
Created: 2026-04-27T17:18:53.253456318+00:00
Started: 2026-04-27T18:40:56.230050650+00:00
Completed: 2026-04-27T19:02:38.322414952+00:00
Tags: eval-scheduled
Tokens: 10180014 in / 62630 out

Description

Today (2026-04-27) we found and killed two batches of leaked smoke-test daemons:

  • Batch 1 (15-19h old, ~67 processes): wg --dir /tmp/.tmpXXX/.workgraph service daemon --max-agents 0 --interval 600
  • Batch 2 (2-21h old, ~11 processes): wg --dir /tmp/wgsmoke.XXX/.wg service daemon --max-agents 0|1 [--no-coordinator-agent]

In addition, 213 /tmp/.tmp*/.workgraph and 31 /tmp/wgsmoke.* directories were left behind.

These are unambiguously test-fixture daemons (ephemeral tmp paths, max-agents=0 or 1, --no-coordinator-agent flag). They survive past test exit and accumulate over weeks, eventually contributing to:

  • Disk fill (each daemon's tmp dir holds graph state, logs, sockets)
  • Process clutter
  • In one case today, a runaway loop at 125% CPU (the process was actually wg nex --chat coordinator-0 rather than a service daemon, but it was still fixture-related)

Required

Find the smoke test harness(es) that spawn these daemons and ensure cleanup:

  1. Setup/teardown discipline: every test that spawns wg service daemon must register a teardown that kills the daemon and removes the tmp dir, even on test failure / panic / signal.
  2. Defense in depth: a top-level test runner that finds and kills any wg service daemon running against /tmp/wgsmoke.* or /tmp/.tmp*/.workgraph paths before AND after each test session, so a leak in test N doesn't accumulate into test N+1.
  3. Fixture path discipline: make smoke test fixtures use a single well-known parent dir (e.g. $XDG_RUNTIME_DIR/wgsmoke/ or /tmp/wgsmoke/$session/) so cleanup is one rm -rf of the parent, not a glob hunt.
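The per-test teardown in item 1 could take the shape of an RAII guard. The following is a minimal sketch using only the standard library; the names DaemonGuard, spawn, and pid are hypothetical and do not come from the existing harness. It assumes the daemon stays in the foreground as a direct child of the test process; a daemon that forks itself into the background would not be reached by Child::kill.

```rust
use std::fs;
use std::io;
use std::path::{Path, PathBuf};
use std::process::{Child, Command};

/// Hypothetical guard (not part of the existing harness): owns one
/// fixture daemon process and the directory created for it.
pub struct DaemonGuard {
    child: Child,
    dir: PathBuf,
}

impl DaemonGuard {
    /// Creates `dir`, then spawns `cmd` as the fixture daemon.
    /// If the spawn fails, the directory is removed again.
    pub fn spawn(mut cmd: Command, dir: &Path) -> io::Result<Self> {
        fs::create_dir_all(dir)?;
        match cmd.spawn() {
            Ok(child) => Ok(Self { child, dir: dir.to_path_buf() }),
            Err(e) => {
                let _ = fs::remove_dir_all(dir);
                Err(e)
            }
        }
    }

    /// PID of the spawned daemon, e.g. for assertions in tests.
    pub fn pid(&self) -> u32 {
        self.child.id()
    }
}

impl Drop for DaemonGuard {
    fn drop(&mut self) {
        // Best-effort cleanup: errors are ignored because Drop cannot
        // return them and the process may already have exited.
        let _ = self.child.kill();
        let _ = self.child.wait(); // reap so no zombie is left behind
        let _ = fs::remove_dir_all(&self.dir);
    }
}
```

A test would bind the guard to a local (for example let _guard = DaemonGuard::spawn(cmd, &dir)?;) so that it is dropped on normal return and on an unwinding panic. Drop does not run when the test process is killed by a signal, aborts, or calls std::process::exit, which is the gap the sweep in item 2 is meant to cover.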

Files likely to touch

  • tests/smoke/ — the smoke harness; scripts/smoke/ if shell-based
  • tests/integration_*.rs — any test that spawns a daemon (look for spawn_daemon / Command::new("wg").arg("service").arg("daemon") patterns)
  • Test helper crate / module if shared setup exists

Validation

  • Failing test first: integration test that spawns a smoke daemon, kills its parent process abruptly (mid-execution panic), then asserts the daemon process and tmp dir are still cleaned up by the next test session start
  • Implementation makes test pass (likely via a session-start sweep + per-test-Drop teardown)
  • cargo build + cargo test pass with no regressions
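  • Manual: run cargo test --test smoke (or whatever runs smoke), then ps -ef | grep 'wgsmoke.*[s]ervice daemon\|--max-agents [0]', which should return zero matches. (The fixture path precedes "service daemon" on the command line, so the pattern must match in that order; the bracketed characters keep grep from matching its own process.) ls /tmp/wgsmoke.* /tmp/.tmp*/.workgraph 2>/dev/null should also print nothing.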
  • Document the cleanup contract in CONTRIBUTING / smoke test README so future test authors know to use the helper
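The session-start sweep mentioned above could identify leaked daemons by their command line. The sketch below is Linux-only (it reads /proc) and uses hypothetical names (is_fixture_dir, is_fixture_daemon, find_fixture_daemons, sweep); the path patterns are taken from the two leaked batches described in this task. It sends SIGTERM through the kill utility to avoid extra dependencies, and leaves directory removal to a separate step.

```rust
use std::fs;
use std::process::Command;

/// True if `dir` matches one of the two leaked fixture layouts
/// from this task: /tmp/wgsmoke.* or /tmp/.tmp*/.workgraph.
fn is_fixture_dir(dir: &str) -> bool {
    dir.starts_with("/tmp/wgsmoke.")
        || (dir.starts_with("/tmp/.tmp") && dir.ends_with("/.workgraph"))
}

/// True if `args` contains `service daemon` and a `--dir` value
/// that points at a fixture path.
pub fn is_fixture_daemon(args: &[String]) -> bool {
    let is_daemon = args
        .windows(2)
        .any(|w| w[0] == "service" && w[1] == "daemon");
    let has_fixture_dir = args
        .windows(2)
        .any(|w| w[0] == "--dir" && is_fixture_dir(&w[1]));
    is_daemon && has_fixture_dir
}

/// Linux-only: scans /proc/<pid>/cmdline and returns matching PIDs.
pub fn find_fixture_daemons() -> Vec<u32> {
    let mut pids = Vec::new();
    let entries = match fs::read_dir("/proc") {
        Ok(entries) => entries,
        Err(_) => return pids, // no /proc: nothing to report
    };
    for entry in entries.flatten() {
        let name = entry.file_name();
        let pid: u32 = match name.to_string_lossy().parse() {
            Ok(pid) => pid,
            Err(_) => continue, // not a process directory
        };
        let raw = match fs::read(entry.path().join("cmdline")) {
            Ok(raw) => raw,
            Err(_) => continue, // process exited or is unreadable
        };
        // cmdline is a NUL-separated argument list.
        let args: Vec<String> = raw
            .split(|b| *b == 0)
            .filter(|s| !s.is_empty())
            .map(|s| String::from_utf8_lossy(s).into_owned())
            .collect();
        if is_fixture_daemon(&args) {
            pids.push(pid);
        }
    }
    pids
}

/// Sends SIGTERM to every match; returns how many were found.
pub fn sweep() -> usize {
    let pids = find_fixture_daemons();
    for pid in &pids {
        let _ = Command::new("kill").arg(pid.to_string()).status();
    }
    pids.len()
}
```

Keeping the classifier separate from the /proc scan lets it be unit-tested without spawning anything. One caveat under this design: a sweep keyed only on path patterns would also terminate fixture daemons belonging to a smoke session that is running concurrently, which is another argument for a per-session parent directory.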

Depends on

Required by

Log