[bug] Audio capture acts as an active participant in the CoreAudio graph (hijacks default device, re-opens inputs on tap-timeout, forces BT headsets out of A2DP) — should be a passive post-virtual-device listener #3750
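Summary
On macOS, screenpipe's audio capture does not behave like a passive listener. It re-opens devices and re-asserts itself in the CoreAudio graph on tap-timeouts and device changes, which causes three observable problems:
1. It steals the default output/system-output slot. After a coreaudiod restart (or graph rebuild), Screenpipe Capture (Undocked) ends up flagged as Default Output + Default System Output. Because that virtual device is 2-in/2-out, system audio gets routed into it and monitored back out, so the user hears their own mic looped into their headphones until they manually re-select a real output device.
2. It forces Bluetooth headsets out of A2DP into HFP/SCO. When screenpipe (re-)opens a device for input that references the BT headset's microphone, macOS drops the headset from A2DP (stereo) to HFP (mono call-mode). This happens without the user starting a call, purely from screenpipe re-acquiring inputs. Cf. "Audio recording silently stops mid-session when Bluetooth device (AirPods) is reused by another app, despite "ON" status" #3144 (closed, AirPods-specific) and "Windows: audio stream build fails with E_INVALIDARG (0x80070057) on Bluetooth headsets" #3020 (Windows BT); those are symptoms of the same root behavior.
3. The capture tap silently times out and re-establishes. /health shows stream_timeouts climbing and audio_status: active_no_data, with the System Audio device going inactive for long stretches while only the virtual capture device stays live. Each re-establish is another device re-open, i.e. another chance to hijack the default slot or re-trigger the HFP switch.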
Note on the Docked vs Undocked variants: to be clear, those two are not screenpipe internals — they are devices I built in Loopback as an attempt to auto-identify my hardware stack (e.g. dock attached vs not) and route screenpipe to the right sources/outputs depending on context. That self-built routing logic only made things worse: it multiplied the corner cases and created its own dilemmas (which variant is "current"? what happens on a dock hot-plug mid-capture? which one does screenpipe latch onto?), on top of the underlying hijack behavior. It is more evidence that users are being forced to hand-build brittle routing schemes to compensate for capture not being a passive listener — not a screenpipe feature.
Environment
screenpipe 0.3.351
macOS (Apple Silicon)
Output device: Sony BT headphones (A2DP)
Wired mic available (Shure MV7+) + MacBook mic
Also running SoundSource (per-app routing) and Loopback (Rogue Amoeba)
Why this matters / the real-world struggle
To even diagnose this, I had to build a Loopback virtual device as a stable buffer in front of screenpipe — mic + app sources mixed, monitored out to the headphones, with screenpipe listening on the Loopback device. I want to be explicit: Loopback here is a test probe, not a solution. It was a way to isolate and confirm the root cause (screenpipe stops hijacking the real devices once it's pointed at a virtual one it can't fight over), not something I want in my setup. It is not an answer to the actual bug — it only papers over it by adding a whole extra layer of routing complexity. For normal daily work I should not need Loopback at all, and the goal of this issue is to make it deletable.
As a workaround it's also genuinely fragile:
It's extremely easy to accidentally include the BT headset's mic as a Loopback source (especially when the Loopback device is named after and built on the headset), which guarantees the A2DP→HFP drop the moment anything opens it.
The monitor/mute-when-capturing matrix is error-prone and produces echo/double-audio.
It fights with SoundSource: I want screenpipe to sit downstream of SoundSource's per-app routing on the output stream, not to grab inputs ahead of it and take control away.
It will also collide with superwhisper and chat apps (Discord/Meet/etc.) that need the mic input. When screenpipe overtakes the input device, those either crash or capture nothing.
In short: Loopback was the diagnostic that proved the root cause; it should not be the long-term fix, and a correct passive-listener implementation would let me remove it entirely.
Requested behavior
Be a passive listener. Capture system audio via a process/output tap that sits downstream of the user's existing chain (post-SoundSource) and never sets itself as the default output / system-output device.
Never re-assert default device on tap-timeout or device-change. On a CoreAudio tap timeout, reconnect to the current default output without mutating the graph or claiming the default slot.
Never open a physical/BT input that forces A2DP→HFP. Don't acquire the BT headset's microphone for system-audio capture. Mic capture should be an explicitly chosen input only, and ideally avoid HFP-forcing devices unless the user picks them.
Co-exist with other audio apps. Capture should not exclusively claim input/output devices so that superwhisper, Discord, Meet, SoundSource, and Loopback can all run concurrently.
Nice-to-have: first-class SoundSource / Loopback integration / docs so users don't have to hand-build a virtual device just to get a stable capture point. A documented "capture from this virtual device, output stays on your real device" recipe would remove most of this pain — though the priority is that capture works without any such workaround.
If capture were a well-behaved passive listener, the entire Loopback workaround could be deleted — which is the point: it was only ever a test rig to find this bug, never a setup I want to keep.
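To make the "explicitly chosen input only" request concrete, here is a minimal sketch of the selection policy I have in mind. It is not screenpipe's code: the InputDevice type, the transport labels, and select_mic_inputs are hypothetical names invented for illustration. The point it shows is that nothing is auto-added, a missing device is skipped rather than replaced, and a Bluetooth input is only opened (with a warning) when the user named it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class InputDevice:
    name: str
    transport: str  # hypothetical label: "bluetooth", "usb", "builtin", "virtual"


def select_mic_inputs(available, user_selected):
    """Open only inputs the user named explicitly; never auto-add anything."""
    by_name = {d.name: d for d in available}
    chosen, warnings = [], []
    for name in user_selected:
        dev = by_name.get(name)
        if dev is None:
            # A missing device is skipped; no fallback device is substituted.
            warnings.append(f"{name}: not present, skipped")
            continue
        if dev.transport == "bluetooth":
            # Allowed only because the user picked it, but flag the side effect.
            warnings.append(f"{name}: Bluetooth input, may switch A2DP to HFP")
        chosen.append(dev)
    return chosen, warnings


available = [
    InputDevice("Sony BT headphones", "bluetooth"),
    InputDevice("Shure MV7+", "usb"),
    InputDevice("MacBook mic", "builtin"),
]
chosen, warnings = select_mic_inputs(available, ["Shure MV7+"])
print([d.name for d in chosen], warnings)
```

With a system-audio-only config (an empty selection) the function returns no inputs at all, which is the case where the BT headset should never leave A2DP.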
The manual-pin path is not a resolution either
The obvious suggestion is "just turn off Auto-select audio devices and pin specific devices." That stops screenpipe from wandering onto the BT mic / stealing the default slot, but it does not solve the problem — it trades random hijacking for a permanent manual-routing burden:
I now have to decide which virtual device variant screenpipe listens on (docked vs undocked), and re-decide it every time I dock/undock.
I have to hand-manage the monitor / self-passthrough so I still hear app audio without doubling or echo, while making sure my own mic is monitored nowhere.
All of this has to be re-jiggered on every hardware-stack change.
In other words, device-selection is the wrong abstraction. Forcing the user to express capture intent as "pick these exact devices" pushes a docked/undocked + monitor-routing decision matrix onto them that has to be re-managed continuously. The correct abstraction is intent: "listen to whatever I am already hearing, downstream of my routing, passively" — no device modelling required. A passive post-chain listener is the only thing that removes the burden entirely; pinning just freezes one fragile snapshot of it.
Reproduction / testing
Steps to reproduce the hijack on macOS (Apple Silicon, BT headset as default output):
1. Set a Bluetooth headset (e.g. Sony WH-series) as the default output; confirm it is in A2DP (stereo, full-quality).
2. In screenpipe settings, enable Auto-select audio devices.
3. Let screenpipe run; trigger a graph rebuild (toggle output device, sleep/wake, or sudo killall coreaudiod).
4. Observe: the BT headset drops A2DP → HFP/SCO (audio goes mono/muffled) the moment screenpipe re-acquires inputs, without the user starting any call. And/or the screenpipe virtual capture device grabs the Default-Output slot, looping mic into the headphones.
Health-endpoint probe to confirm the timeout/re-acquire churn (no call active):
# Watch the tap re-establishing itself over time: stream_timeouts climbs and
# System Audio goes 'inactive' for long stretches while only the virtual capture device stays live.
for i in 1 2 3 4 5 6; do
  curl -s http://localhost:3030/health -H "Authorization: Bearer $SCREENPIPE_LOCAL_API_KEY" \
  | python3 -c "import sys,json;d=json.load(sys.stdin);ap=d['audio_pipeline'];print(d.get('audio_status'),'| stream_timeouts=',ap.get('stream_timeouts'),'|',d.get('device_status_details'))"
  sleep 30
done
What a fixed build should show under the same probe: audio_status stays active, stream_timeouts does not climb, the System Audio tap never goes inactive while audio is playing, the BT headset stays in A2DP across a graph rebuild, and the screenpipe virtual device never appears as Default Output. Bonus check: superwhisper / Discord can hold the mic input concurrently with screenpipe capture without either crashing or going silent.
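The health-visible part of those criteria can be checked mechanically. The sketch below is an assumption-laden helper, not part of screenpipe: check_fixed_build is a hypothetical name, it takes already-parsed /health responses as dicts, and it assumes device_status_details renders as text containing "System Audio (output): inactive" when the tap is down. Both "active" and "ok" are accepted as healthy because both values appear in this report. The A2DP and Default Output checks are not visible in /health and still need to be verified by hand.

```python
def check_fixed_build(samples, healthy_statuses=("active", "ok")):
    """Return a list of violations; an empty list means the run passed."""
    if not samples:
        return ["no samples collected"]
    violations = []
    first_timeouts = samples[0]["audio_pipeline"].get("stream_timeouts", 0)
    for i, s in enumerate(samples):
        status = s.get("audio_status")
        if status not in healthy_statuses:
            violations.append(f"sample {i}: audio_status={status!r}")
        timeouts = s["audio_pipeline"].get("stream_timeouts", 0)
        if timeouts > first_timeouts:
            violations.append(
                f"sample {i}: stream_timeouts climbed {first_timeouts} -> {timeouts}"
            )
        # Assumption: the details field stringifies to "<device>: <state> ...".
        details = str(s.get("device_status_details", ""))
        if "System Audio (output): inactive" in details:
            violations.append(f"sample {i}: System Audio tap inactive")
    return violations


good = [
    {"audio_status": "ok", "audio_pipeline": {"stream_timeouts": 0},
     "device_status_details": "System Audio (output): active"},
    {"audio_status": "ok", "audio_pipeline": {"stream_timeouts": 0},
     "device_status_details": "System Audio (output): active"},
]
print(check_fixed_build(good))
```

Feeding it the samples collected by the probe loop across a forced graph rebuild gives a pass/fail answer instead of eyeballing the output.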
Second, distinct failure mode: silent dead tap after a graph rebuild (no reconnect, no timeout increment)
The repro above describes the churn case (re-acquire / HFP drop / stream_timeouts climbing). But with the clean no-Loopback, system-audio-only config (all mics unchecked, only System Audio ticked), a graph rebuild produces a quieter and arguably worse failure: screenpipe keeps a dead System Audio tap forever and never reconnects. The liveness watchdog does not fire — stream_timeouts does not even increment, so nothing notices the tap died.
Captured test run (screenpipe v0.3.351, macOS Apple Silicon, config = audio_devices: ['System Audio (output)'], no mic in pipeline):
Run sudo killall coreaudiod to force a graph rebuild (step 3 above).
Poll /health every 10s. The tap never re-attaches; "last activity" just keeps climbing while stream_timeouts stays frozen:
stale | stream_timeouts=21 | rms=0.0 | System Audio (output): inactive (last activity: 171s ago)
stale | stream_timeouts=21 | rms=0.0 | System Audio (output): inactive (last activity: 181s ago)
stale | stream_timeouts=21 | rms=0.0 | System Audio (output): inactive (last activity: 191s ago)
stale | stream_timeouts=21 | rms=0.0 | System Audio (output): inactive (last activity: 201s ago)
stale | stream_timeouts=21 | rms=0.0 | System Audio (output): inactive (last activity: 211s ago)
A follow-up poll ~8 minutes later still showed inactive (last activity: 471s ago), stream_timeouts: 21, rms: 0.0 — i.e. capture was effectively off until screenpipe was restarted; it never self-healed.
Recovery confirmed only via full restart. After a manual restart of screenpipe, /health returned to status: healthy, audio_status: ok, System Audio (output): active, stream_timeouts: 0. Capture was then verified to be genuinely live (not just reporting active with silence): with a YouTube video playing, per_device_audio_level_rms for System Audio (output) read 0.0105 — a clear non-zero signal, vs. the flat 0.0 held the entire 8 minutes the tap was dead. So the only recovery path that worked was a full app restart; the watchdog never performed the equivalent reconnect on its own.
Interpretation: this is not "the tap re-acquires and trips." It's "after a CoreAudio graph rebuild, screenpipe sits on a stale/dead tap handle, never detects it's dead, never retries, and never increments its own timeout counter." The watchdog/liveness check does not catch a post-rebuild dead handle. A fixed build must detect the dead tap, attempt reconnection to the current graph, and recover to active within seconds — without a manual restart.
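The signature of this failure mode is simple enough to detect from the outside, which suggests an internal watchdog could catch it too. The sketch below parses probe lines in the exact format captured above and flags the pattern "idle time keeps growing, timeout counter frozen, rms flat at zero". The names (Sample, parse_line, is_silent_dead_tap) and the three-sample threshold are my own illustrative choices, not screenpipe internals.

```python
import re
from dataclasses import dataclass

# One probe line, e.g.:
# "stale | stream_timeouts=21 | rms=0.0 | System Audio (output): inactive (last activity: 171s ago)"
LINE_RE = re.compile(
    r"^(?P<status>\S+) \| stream_timeouts=(?P<timeouts>\d+) \| rms=(?P<rms>[\d.]+) \| "
    r"(?P<device>.+): (?P<state>\w+) \(last activity: (?P<idle>\d+)s ago\)$"
)


@dataclass
class Sample:
    status: str
    timeouts: int
    rms: float
    device: str
    state: str
    idle_s: int


def parse_line(line: str) -> Sample:
    m = LINE_RE.match(line.strip())
    if m is None:
        raise ValueError(f"unrecognized probe line: {line!r}")
    return Sample(m["status"], int(m["timeouts"]), float(m["rms"]),
                  m["device"], m["state"], int(m["idle"]))


def is_silent_dead_tap(samples, min_samples=3):
    """True when idle time keeps growing while the timeout counter never moves."""
    if len(samples) < min_samples:
        return False
    idle_growing = all(b.idle_s > a.idle_s for a, b in zip(samples, samples[1:]))
    counter_frozen = len({s.timeouts for s in samples}) == 1
    silent = all(s.state == "inactive" and s.rms == 0.0 for s in samples)
    return idle_growing and counter_frozen and silent


captured = """\
stale | stream_timeouts=21 | rms=0.0 | System Audio (output): inactive (last activity: 171s ago)
stale | stream_timeouts=21 | rms=0.0 | System Audio (output): inactive (last activity: 181s ago)
stale | stream_timeouts=21 | rms=0.0 | System Audio (output): inactive (last activity: 191s ago)
"""
samples = [parse_line(l) for l in captured.splitlines()]
print(is_silent_dead_tap(samples))
```

The churn case from the first repro does not match this check, because there stream_timeouts keeps incrementing; the two failure modes need separate detection.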
Proposed integration (concrete mechanisms — not yet decided which)
The "first-class integration" nice-to-have above can be made concrete. Any of the following would let screenpipe be fed passively by an existing audio-routing layer instead of acquiring devices itself. Listing options without committing to one — the maintainers are best placed to judge feasibility:
Read from a user-designated virtual capture device (SoundSource + Loopback). The least-new-code path: keep capture as "listen on a chosen input," but make it rock-solid (never re-assert default, never re-enumerate, reconnect to the current device on timeout). The user routes per-app output via SoundSource, mixes into a Loopback virtual device, and screenpipe just reads that one device. This already works today as a workaround; the ask is that it stay stable. Note this is the user's preferred minimal stack — no Audio Hijack required, and nothing screenpipe needs to run itself.
Register as an ARK capture target. Rogue Amoeba's shared engine (com.rogueamoeba.ARK.driver) underlies SoundSource, Loopback, and Audio Hijack. If screenpipe could appear as an ARK capture endpoint, the Rogue Amoeba engine would do the tapping and stream audio in — zero device hijacking on screenpipe's side. This is the cleanest "feeds in naturally" option.
Be an Audio Hijack output target. For users who run Audio Hijack, screenpipe could appear as an output block so a session pipes audio straight to it. (Optional path — should not be a requirement; most users won't and shouldn't have to run a third app just to capture.)
All three converge on the same principle: screenpipe consumes a stream handed to it, downstream of the user's routing, and never grabs or re-asserts physical devices. The default experience should require none of these — but supporting one or more would make capture robust for power users with existing audio stacks.
Related
--audio-device overridden by persisted store (related capture-config control gap).