Bounded DeliveredEvent queues reduce memory usage, but they can deadlock under load.
Two common scenarios trigger deadlocks when the number of events published in a short
period exceeds twice the queue capacity (there's a PublishedEvent queue of the same size):
- a subscriber tries to acquire the same mutex as held by a publisher, or
- a subscriber for A events publishes B events
Avoiding these scenarios is not practical and would limit eventbus usefulness and reduce its adoption,
pushing us back to callbacks and other legacy mechanisms. These deadlocks already occurred in customer
devices, dev machines, and tests. They also make it harder to identify and fix slow subscribers and similar
issues we have been seeing recently.
Choosing an arbitrary large fixed queue capacity would only mask the problem. A client running
on a sufficiently large and complex customer environment can exceed any meaningful constant limit,
since event volume depends on the number of peers and other factors. Behavior also changes
based on scheduling of publishers and subscribers by the Go runtime, OS, and hardware, as the issue
is essentially a race between publishers and subscribers. Additionally, on lower-end devices,
an unreasonably high constant capacity is practically the same as using unbounded queues.
Therefore, this PR changes the event queue implementation to be unbounded by default.
The PublishedEvent queue keeps its existing capacity of 16 items, while subscribers'
DeliveredEvent queues become unbounded.
This change fixes known deadlocks and makes the system stable under load,
at the cost of higher potential memory usage, including cases where a queue grows
during an event burst and does not shrink when load decreases.
Further improvements can be implemented in the future as needed.
Fixes#17973Fixes#18012
Signed-off-by: Nick Khyl <nickk@tailscale.com>
Add options to the eventbus.Bus to plumb in a logger.
Route that logger in to the subscriber machinery, and trigger a log message to
it when a subscriber fails to respond to its delivered events for 5s or more.
The log message includes the package, filename, and line number of the call
site that created the subscription.
Add tests that verify this works.
Updates #17680
Change-Id: I0546516476b1e13e6a9cf79f19db2fe55e56c698
Signed-off-by: M. J. Fromberger <fromberger@tailscale.com>
If any debugging hook might see an event, Publisher.ShouldPublish should
tell its caller to publish even if there are no ordinary subscribers.
Updates #15160
Signed-off-by: David Anderson <dave@tailscale.com>
Enables monitoring events as they flow, listing bus clients, and
snapshotting internal queues to troubleshoot stalls.
Updates #15160
Signed-off-by: David Anderson <dave@tailscale.com>
This makes the helpers closer in behavior to cancelable contexts
and taskgroup.Single, and makes the worker code use a more normal
and easier to reason about context.Context for shutdown.
Updates #15160
Signed-off-by: David Anderson <dave@tailscale.com>
The Client carries both publishers and subscribers for a single
actor. This makes the APIs for publish and subscribe look more
similar, and this structure is a better fit for upcoming debug
facilities.
Updates #15160
Signed-off-by: David Anderson <dave@tailscale.com>