We built an open-source macOS menu bar app that turns speech into text
and pastes it into the active app — using SpeechAnalyzer for on-device
transcription, ScreenCaptureKit + Vision for screen-aware context,
and FluidAudio for speaker diarization in meeting mode.
Here's what we learned shipping it on macOS 26.
GitHub: github.com/Marvinngg/ambient-voice
Architecture
The app has two modes: hotkey dictation (press to talk, release to inject)
and meeting recording (continuous transcription with a floating panel).
Dictation Mode
Audio capture uses AVCaptureSession (more on why below).
The captured audio feeds into SpeechAnalyzer via an
AsyncStream:
let transcriber = SpeechTranscriber(
    locale: locale,
    transcriptionOptions: [],
    reportingOptions: [.volatileResults, .alternativeTranscriptions],
    attributeOptions: [.audioTimeRange, .transcriptionConfidence]
)
let analyzer = SpeechAnalyzer(modules: [transcriber])
let (inputSequence, inputBuilder) =
    AsyncStream<AnalyzerInput>.makeStream()
try await analyzer.start(inputSequence: inputSequence)
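Once the analyzer is started, one task pumps captured buffers into the stream and another consumes the transcriber's results. A minimal sketch, assuming `audioBuffers` is an `AsyncStream<AVAudioPCMBuffer>` built from the capture callback and `handleFinal` is a hypothetical handler:

```swift
// Pump converted audio buffers into the analyzer's input stream.
Task {
    for await buffer in audioBuffers {
        inputBuilder.yield(AnalyzerInput(buffer: buffer))
    }
    inputBuilder.finish()
}

// Consume transcription results; volatile results arrive first,
// then are superseded by final ones.
Task {
    for try await result in transcriber.results {
        if result.isFinal {
            // result.text is an AttributedString carrying the
            // audioTimeRange and confidence attributes we requested.
            handleFinal(String(result.text.characters))
        }
    }
}
```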
While recording, we capture a screenshot of the focused window using
ScreenCaptureKit, run Vision OCR (VNRecognizeTextRequest), extract keywords,
and inject them into SpeechAnalyzer as contextual bias:
let context = AnalysisContext()
context.contextualStrings[.general] = ocrKeywords
try await analyzer.setContext(context)
This improves accuracy for technical terms and proper nouns visible on
screen.
If your screen shows "SpeechAnalyzer", saying it out loud is more likely
to be transcribed correctly.
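The OCR step can be sketched like this; `cgImage` is assumed to be the focused-window screenshot from ScreenCaptureKit, and the keyword filter is a deliberately naive stand-in for whatever extraction heuristic you prefer:

```swift
import Vision
import CoreGraphics

// Run Vision OCR on a window screenshot and pull out candidate keywords.
func ocrKeywords(from cgImage: CGImage) throws -> [String] {
    let request = VNRecognizeTextRequest()
    request.recognitionLevel = .accurate

    let handler = VNImageRequestHandler(cgImage: cgImage, options: [:])
    try handler.perform([request])

    let lines = (request.results ?? [])
        .compactMap { $0.topCandidates(1).first?.string }

    // Naive keyword extraction: unique tokens longer than 3 characters.
    let tokens = lines.flatMap { $0.split(separator: " ").map(String.init) }
    return Array(Set(tokens.filter { $0.count > 3 }))
}
```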
After transcription, an optional L2 step sends the text through
a local LLM (Ollama) for spoken-to-written cleanup;
a CGEvent-simulated Cmd+V then pastes the result into the active app.
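The paste step is plain Cocoa plus Quartz. A sketch (synthesized key events require the Accessibility permission):

```swift
import AppKit
import Carbon.HIToolbox  // for kVK_ANSI_V

// Put the cleaned text on the pasteboard, then synthesize Cmd+V
// into the frontmost app.
func paste(_ text: String) {
    let pasteboard = NSPasteboard.general
    pasteboard.clearContents()
    pasteboard.setString(text, forType: .string)

    let source = CGEventSource(stateID: .combinedSessionState)
    let vKey = CGKeyCode(kVK_ANSI_V)
    let keyDown = CGEvent(keyboardEventSource: source, virtualKey: vKey, keyDown: true)
    let keyUp = CGEvent(keyboardEventSource: source, virtualKey: vKey, keyDown: false)
    keyDown?.flags = .maskCommand
    keyUp?.flags = .maskCommand
    keyDown?.post(tap: .cghidEventTap)
    keyUp?.post(tap: .cghidEventTap)
}
```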
Meeting Mode
Meeting mode forks the same audio stream to two consumers:
• SpeechAnalyzer — real-time streaming transcription,
displayed in a floating NSPanel
• FluidAudio buffer — accumulates 16 kHz Float32 mono samples
for batch speaker diarization after recording stops
When the user ends the meeting, FluidAudio's performCompleteDiarization()
runs on the accumulated audio. We align transcription segments with
speaker segments using audioTimeRange overlap matching —
each transcription segment gets assigned the speaker ID
with the most time overlap. Results export to Markdown.
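The overlap matching reduces to interval arithmetic. A sketch with illustrative types (`SpeakerSegment` is not FluidAudio's API; its real output similarly carries speaker IDs and time ranges):

```swift
struct SpeakerSegment {
    let speaker: String
    let start: Double  // seconds
    let end: Double
}

// Return the speaker whose diarization segment overlaps the
// transcription segment's time range the most, or nil if none overlap.
func assignSpeaker(transcriptStart: Double, transcriptEnd: Double,
                   speakers: [SpeakerSegment]) -> String? {
    var best: (id: String, overlap: Double)? = nil
    for s in speakers {
        // Overlap of [transcriptStart, transcriptEnd] with [s.start, s.end].
        let overlap = min(transcriptEnd, s.end) - max(transcriptStart, s.start)
        if overlap > 0, overlap > (best?.overlap ?? 0) {
            best = (s.speaker, overlap)
        }
    }
    return best?.id
}
```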
Pitfalls We Hit on macOS 26
1. AVAudioEngine installTap doesn't fire with Bluetooth devices
We started with AVAudioEngine.inputNode.installTap() for audio capture.
It worked fine with built-in mics, but the tap callback never fired
with Bluetooth devices (tested with vivo TWS 4 Hi-Fi).
Fix: switched to AVCaptureSession. The delegate callback
captureOutput(_:didOutput:from:) fires reliably regardless of audio device.
The tradeoff is you get CMSampleBuffer instead of AVAudioPCMBuffer,
so you need a conversion step.
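The conversion can be sketched as follows, assuming the capture output delivers linear PCM:

```swift
import AVFoundation

// Convert the CMSampleBuffer from captureOutput(_:didOutput:from:)
// into an AVAudioPCMBuffer suitable for downstream consumers.
func pcmBuffer(from sampleBuffer: CMSampleBuffer) -> AVAudioPCMBuffer? {
    guard let desc = CMSampleBufferGetFormatDescription(sampleBuffer),
          let asbd = CMAudioFormatDescriptionGetStreamBasicDescription(desc),
          let format = AVAudioFormat(streamDescription: asbd) else {
        return nil
    }
    let frames = AVAudioFrameCount(CMSampleBufferGetNumSamples(sampleBuffer))
    guard let buffer = AVAudioPCMBuffer(pcmFormat: format,
                                        frameCapacity: frames) else {
        return nil
    }
    buffer.frameLength = frames
    // Copy the sample data straight into the buffer's AudioBufferList.
    let status = CMSampleBufferCopyPCMDataIntoAudioBufferList(
        sampleBuffer,
        at: 0,
        frameCount: Int32(frames),
        into: buffer.mutableAudioBufferList)
    return status == noErr ? buffer : nil
}
```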
2. NSEvent addGlobalMonitorForEvents crashes
Our global hotkey listener used NSEvent.addGlobalMonitorForEvents.
On macOS 26, this crashes with a Bus error inside
GlobalObserverHandler — appears to be a Swift actor runtime issue.
Fix: switched to CGEventTap. It works reliably,
but the callback runs in a CFRunLoop context
that Swift concurrency doesn't treat as MainActor.
3. CGEventTap callbacks aren't on MainActor
If your CGEventTap callback touches any @MainActor state,
you'll get concurrency violations. The callback runs on
whatever thread owns the CFRunLoop.
Fix: bridge with DispatchQueue.main.async {}
inside the tap callback before touching any MainActor state.
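Putting pitfalls 2 and 3 together, a sketch of the tap setup (the `HotkeyController` mentioned in the comment is a hypothetical handler, not code from the app):

```swift
import CoreGraphics
import Foundation

// Install a listen-only event tap for key events. The C callback runs on
// whatever thread owns the run loop, so MainActor state is only touched
// inside the main-queue hop.
func installHotkeyTap() -> CFMachPort? {
    let mask: CGEventMask = (1 << CGEventType.keyDown.rawValue)
                          | (1 << CGEventType.keyUp.rawValue)
    let tap = CGEvent.tapCreate(
        tap: .cgSessionEventTap,
        place: .headInsertEventTap,
        options: .listenOnly,
        eventsOfInterest: mask,
        callback: { _, type, event, _ in
            let keyCode = event.getIntegerValueField(.keyboardEventKeycode)
            DispatchQueue.main.async {
                // Safe to touch @MainActor state from here, e.g.:
                // HotkeyController.shared.handle(type: type, keyCode: keyCode)
                _ = keyCode
            }
            return Unmanaged.passUnretained(event)
        },
        userInfo: nil)
    if let tap {
        let source = CFMachPortCreateRunLoopSource(kCFAllocatorDefault, tap, 0)
        CFRunLoopAddSource(CFRunLoopGetMain(), source, .commonModes)
        CGEvent.tapEnable(tap: tap, enable: true)
    }
    return tap
}
```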
4. CGPreflightScreenCaptureAccess doesn't request permission
We used CGPreflightScreenCaptureAccess() as a guard before
calling ScreenCaptureKit. If it returned false, we'd bail out.
The problem: this function only checks — it never triggers macOS
to add your app to the Screen Recording permission list.
Chicken-and-egg: you can't get permission
because you never ask for it.
Fix: call CGRequestScreenCaptureAccess() at app startup.
This adds your app to System Settings → Screen Recording.
Then let ScreenCaptureKit calls proceed without the preflight guard —
SCShareableContent will also trigger the permission prompt on first use.
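The startup check we ended up with is two lines; preflight only reports the current state, while the request call actually registers the app:

```swift
import CoreGraphics

// At app startup: if we don't have Screen Recording access yet, ask for it.
// This adds the app to System Settings → Screen Recording.
if !CGPreflightScreenCaptureAccess() {
    CGRequestScreenCaptureAccess()
}
```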
5. Ad-hoc signing breaks TCC permissions on every rebuild
During development, codesign --sign - (ad-hoc) generates
a different code directory hash on every build.
macOS TCC tracks permissions by this hash,
so every rebuild = new app identity = all permissions reset.
Fix: sign with a stable certificate. If you have an Apple Development
certificate, use that. The TeamIdentifier stays constant across rebuilds,
so TCC permissions persist. We also discovered that launching via
open WE.app (LaunchServices) instead of directly executing the binary
is required — otherwise macOS attributes TCC permissions
to Terminal, not your app.
Benchmarks
We ran end-to-end benchmarks on public datasets
(Mac Mini M4 16GB, macOS 26):
Transcription (SpeechAnalyzer, AliMeeting Chinese):
• Near-field CER 34% (excluding outliers ~25%)
• Far-field CER 40% (single channel, no beamforming, >30% overlap)
• Processing speed 74-89x real-time
Speaker diarization (FluidAudio offline):
• AMI English 16 meetings: avg DER 23.2% (collar=0.25s, ignoreOverlap=True)
• AliMeeting Chinese 8 meetings: DER 48.5% (including overlap regions)
• Memory: RSS ~500MB, peak 730-930MB
Full evaluation methodology, scripts, and raw results are in the repo.
Open Source
The project is MIT licensed: github.com/Marvinngg/ambient-voice
It includes the macOS client (Swift 6.2, SPM),
server-side distillation/training scripts (Python),
and a complete evaluation framework with reproducible benchmarks.
Feedback and contributions welcome.
Topic: Machine Learning & AI
SubTopic: Apple Intelligence
Tags: macOS, Speech, ScreenCaptureKit, Apple Intelligence