Hardening Speech-to-Text in Your Terminal
Obsessing over UX and how Option + Space should feel really, really boring.
Hey folks!
There is a version of speech-to-text that looks great in a demo and still feels wrong in daily use. You hold Option + Space. You speak. You release.
Wow. So much wow.
And it worked. YEN has had a strong speech-to-text foundation for a while now. It is on-device. It is native. It works across macOS apps. On macOS 26 and newer it uses Apple’s modern speech stack with unlimited sessions.
That part was never the problem.
The problem was simpler and more important: the workflow itself needed to be tightened. That’s what I’ve been working on, since we’re coming up on our official v1.000 release and everything needs to be perfect.
There were some over-complications I’ve wanted to fix, and those fixes should now be in, because the right version of speech-to-text in a terminal-first IDE should not feel clever. It should feel dependable.
It should feel boring in the best possible way.
The rest of this post may be boring for most (it is engineering and tech-heavy, so forgive me), but at the very least it shows how much I care about this feature, which is a model for how I think about a Terminal-first IDE (platform).
If you want to geek out, continue on, fellow nerd.
Hotkeys and Session Contracts
The first part of this hardening was the hotkey session contract.
The old problem space here was subtle. Global hotkeys on macOS are not just about catching one keyDown and one keyUp. You have to care about modifier-release ordering, event-tap rebuilds, trust loss, key repeat noise, and stop or restart churn when the user presses again quickly.
That means a sloppy implementation can accidentally treat keyboard-state drift as a real release. It can stop capture because polling noticed a mismatch instead of because the user actually let go. That is exactly the kind of thing that makes a feature feel haunted.
I tightened that.
Hotkey-up synthesis is now restricted to explicit recovery paths instead of loose session-state drift. The exact hold contract is deterministic whether Option or Space releases first. Event-tap rebuilds while the keys are still physically down no longer get to silently terminate a valid capture. Re-press during stop is treated as a queued restart instead of an invitation for stale state to bleed across sessions.
That sounds like implementation trivia until you use the feature all day.
Then it becomes the whole feature.
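To make the hold contract concrete, here is a toy model in Python. It is not YEN's actual code: the class, the method names, and the three phases are invented for illustration, and real event-tap handling on macOS is far messier. It only shows the four rules above: either key releasing ends the hold, polling drift is ignored, a tap rebuild resyncs state without ending a valid capture, and a re-press during stop is queued.

```python
from enum import Enum, auto


class Phase(Enum):
    IDLE = auto()
    CAPTURING = auto()
    STOPPING = auto()


class HoldSession:
    """Toy model of a hold-to-talk session for the Option + Space chord."""

    CHORD = {"option", "space"}

    def __init__(self):
        self.phase = Phase.IDLE
        self.down = set()            # keys we believe are physically held
        self.restart_queued = False  # chord re-pressed while a stop is in flight
        self.session_id = 0

    def _chord_held(self):
        return self.down == self.CHORD

    def _start(self):
        self.session_id += 1         # every capture gets a fresh session
        self.phase = Phase.CAPTURING

    def key_down(self, key):
        if key in self.down:
            return                   # key-repeat noise, not a new press
        self.down.add(key)
        if not self._chord_held():
            return
        if self.phase is Phase.IDLE:
            self._start()
        elif self.phase is Phase.STOPPING:
            self.restart_queued = True   # queue it; never reuse the old session

    def key_up(self, key):
        self.down.discard(key)
        # Whichever key releases first ends the hold; the second is a no-op.
        if self.phase is Phase.CAPTURING:
            self.phase = Phase.STOPPING

    def poll_mismatch(self):
        # Polling saw keyboard-state drift. That is not a release, so do nothing.
        pass

    def tap_rebuilt(self, physically_down):
        # Explicit recovery path: resync from the real key state after a rebuild.
        self.down = set(physically_down)
        if self.phase is Phase.CAPTURING and not self._chord_held():
            self.phase = Phase.STOPPING  # keys really are up, so synthesize the release

    def stop_finished(self):
        if self.phase is not Phase.STOPPING:
            return
        self.phase = Phase.IDLE
        if self.restart_queued:
            self.restart_queued = False
            if self._chord_held():
                self._start()
```

One design choice in this sketch: a queued restart only fires if the chord is still held when the stop finishes. A quick tap-and-release during stop is dropped rather than turned into an empty session.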
Finalization Contract for Acceptable Input
The second part was the finalization contract.
Speech-to-text gets weird when people talk about “the transcript” as if there is always only one obvious version of it. In reality there are partials, finalized segments, stop callbacks, error paths, watchdogs, and all the little ways asynchronous systems try to complete more than once or complete in the wrong order.
That is not acceptable for input.
YEN now treats stop completion as a proper per-session contract with a single authoritative result. There is an explicit stop token. Completion happens exactly once across normal release, recognizer error, results-stream failure, watchdog fallback, and forced reset paths. I also added more explicit stop telemetry so transcript loss can be diagnosed from logs instead of guessed from UI behavior.
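Here is a minimal sketch of what "exactly once, guarded by a stop token" can look like, again in Python and again not YEN's real implementation. The class name, the `begin_stop` and `complete` methods, and the in-memory `log` list (a stand-in for stop telemetry) are all assumptions made for the example.

```python
import threading
import uuid


class StopContract:
    """One authoritative result per stop, however many paths try to report."""

    def __init__(self, on_complete):
        self._lock = threading.Lock()
        self._token = None           # live stop token, or None when nothing is pending
        self._on_complete = on_complete
        self.log = []                # stand-in for stop telemetry

    def begin_stop(self):
        """Issue a fresh token for the session that is stopping."""
        with self._lock:
            self._token = uuid.uuid4().hex
            return self._token

    def complete(self, token, source, transcript):
        """Called by any path; only the first caller holding the live token wins."""
        with self._lock:
            if token != self._token:
                self.log.append((source, "ignored"))
                return False
            self._token = None       # consume the token so later callers lose
            self.log.append((source, "delivered"))
        # Deliver outside the lock so the callback cannot deadlock the contract.
        self._on_complete(transcript)
        return True
```

Used like this, a late watchdog or a callback from an older session cannot overwrite the result:

```python
results = []
contract = StopContract(results.append)
token = contract.begin_stop()
contract.complete(token, "normal_release", "hello world")  # delivered
contract.complete(token, "watchdog", "")                   # ignored
```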
Just as importantly, I removed “experimental” work from the default release-to-paste critical path. That part needed to happen.
If the user’s baseline expectation is “hold Option + Space, speak, release, paste,” then the core path cannot be hostage to optional experiments.
At that stage, Live Transcript Preview and Translate-on-Dictate were clearly labeled EXPERIMENTAL in Settings, and the default stop path was no longer forced to wait behind translation behavior when translation was off.
That product boundary was the right intermediate step: stabilize the core workflow first, then graduate optional surfaces only after the contract is clear.
Oh, fun.
The Contract for The Clipboard
The third part was the clipboard and paste contract. This was the most important user-facing fix.
The easy version of cross-app dictation is: Put text on the clipboard, fire Cmd + V, restore the old clipboard on a timer, hope for the best.
That is not a serious delivery contract.
If the paste target changes, if the synthetic paste does nothing, if the target app is slow, or if delivery simply cannot be proven, a timer-based clipboard restore turns into a data-loss risk. You get the worst combination: no inserted text and no recoverable text. No bueno. So I tightened that too.
YEN now captures the insertion target at stop time and treats three cases separately: direct terminal-surface paste inside YEN, same-app responder-chain paste, and cross-app synthetic Cmd + V. Those are not the same operation and they should not pretend to be.
Inside YEN, the old clipboard can be restored after paste with a guarded delay. Same-app responder behavior stays explicit. Cross-app synthetic paste is treated more conservatively because delivery cannot actually be proven by the sender.
That means when YEN pastes into another app through synthetic Cmd + V, the dictated text now stays recoverable on the clipboard until the user replaces it.
That is a much better failure mode.
The goal is not to hide every edge case. The goal is to make the safe path the default path and the recovery path obvious when automation cannot honestly guarantee delivery.
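The three-way split can be written down as a small policy table. This is a sketch under assumptions, not YEN's code: the names are invented, the 0.25 second guard delay is a placeholder value, and the post does not say what happens to the clipboard on the same-app responder route, so the sketch assumes it restores like the in-YEN case.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Route(Enum):
    TERMINAL_SURFACE = "direct terminal-surface paste inside YEN"
    SAME_APP_RESPONDER = "same-app responder-chain paste"
    CROSS_APP_SYNTHETIC = "cross-app synthetic Cmd + V"


@dataclass(frozen=True)
class ClipboardPlan:
    restore_previous: bool            # put the user's old clipboard back?
    restore_delay_s: Optional[float]  # guard delay before restoring, if any


GUARD_DELAY_S = 0.25  # hypothetical value, chosen only for the example

POLICY = {
    # Delivery happens inside YEN, so restoring after a guarded delay is safe.
    Route.TERMINAL_SURFACE: ClipboardPlan(True, GUARD_DELAY_S),
    # Assumption: treated like the in-YEN case (not stated in the post).
    Route.SAME_APP_RESPONDER: ClipboardPlan(True, GUARD_DELAY_S),
    # The sender cannot prove delivery, so the transcript stays on the
    # clipboard until the user replaces it.
    Route.CROSS_APP_SYNTHETIC: ClipboardPlan(False, None),
}


def plan_clipboard(route):
    """Look up what to do with the clipboard after a paste on this route."""
    return POLICY[route]
```

The point of spelling it out as data is that there is no shared timer: the cross-app route simply never schedules a restore.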
I also tightened stale-target handling so the transcript is bound to the app or split that was focused when you released the hotkey. If that target changes or becomes invalid before insertion, YEN fails closed into manual recovery instead of pasting into the wrong place.
That is another example of the kind of boring behavior I want.
Wrong-field paste is worse than no paste.
Naturally.
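The fail-closed rule for stale targets fits in a few lines. As before, this is an illustrative Python sketch rather than YEN's implementation: `Target`, `deliver`, the `paste` callback, and the list standing in for the clipboard are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Target:
    app_id: str
    split_id: Optional[str] = None  # set for a split inside YEN, None elsewhere


def deliver(transcript, captured, current_focus, paste, clipboard):
    """Paste only into the target captured at hotkey release.

    If that target is gone or focus moved, keep the text recoverable
    and report manual recovery instead of pasting somewhere else.
    """
    if captured is None or current_focus != captured:
        clipboard.append(transcript)  # fail closed: text stays recoverable
        return "manual_recovery"
    paste(captured, transcript)
    return "pasted"
```

For example, if focus moved from one split to another between release and insertion, nothing is pasted and the transcript is still there to paste by hand:

```python
clipboard, pasted = [], []
at_release = Target("yen", "split-1")
now_focused = Target("yen", "split-2")
deliver("ls -la", at_release, now_focused, lambda t, s: pasted.append((t, s)), clipboard)
# pasted is still empty; clipboard holds "ls -la"
```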
Truth-Seeking in Great, Simple Product
The fourth part of this pass was simply telling the truth in the product and docs.
There was some confusion around Apple’s newer speech APIs, so I rechecked the recognizer shape carefully against Apple’s own documentation. On macOS 26+, YEN still uses SpeechAnalyzer with DictationTranscriber for the main dictation path.
That is the right pairing for what YEN is doing.
Apple also exposes SpeechTranscriber as part of the broader Speech framework stack, and it is useful, but it is not some magic replacement for session-logic bugs. SpeechAnalyzer is the container.
The real question is which speech module belongs in the workflow you are building. For YEN’s hold-to-talk dictation flow, DictationTranscriber remains the correct transcription surface.
That distinction matters because it keeps the engineering work honest.
If the problem is race conditions, stop semantics, target routing, and clipboard durability, the answer is not to wave at a nearby API name and hope the complexity disappears. The answer is to harden the workflow contract itself, and that’s what the latest release has done.
And that ties directly back to YEN’s larger mission.
I keep saying YEN is a Terminal-first IDE, and I mean it in a very literal way.
The terminal should not just be fast. It should be trustworthy.
If a feature belongs in the terminal workflow, it has to clear a higher bar than novelty. It has to feel like something you can build real muscle memory around. It has to be explicit about what is experimental, explicit about what is guaranteed, and conservative when the system cannot actually prove success.
That is especially true for speech-to-text. Speech-to-text inside a terminal is easy to market as a fun trick. But in practice it is an input primitive.
It becomes part of how you move through the machine. Part of how you write. Part of how you reply in chat, search, command, annotate, and capture thoughts without breaking flow.
When something occupies that layer, reliability is the feature and everything else is secondary. So, this hardening pass did not add a dramatic launch trailer bullet point.
It did something better: It made the core path legible.
Hold Option + Space. Speak. Release. Done.
YEN should either paste the right final text into the right place exactly once, or leave that text recoverable so nothing is lost.
That is the bar. That is the contract.
And that is the kind of boring I want more of in software. Boring software is great software because it does precisely what I, and the community, want.
Who would have thought.
— 8




