The Chief of Staff: Building a Local Voice Agent as a Personal Operating System
A physician's guide to offloading the back-of-mind juggling act: eight domain agents, a 5:30 a.m. dependency chain, and a voice assistant that reads exactly what it is told to.
Corporate work. Clinical work. Family budget. Business management. Household needs. Health commitments. Personal life. For years I tracked all of it the same way most professionals do: in the back of my mind, simultaneously, all the time. Nothing was breaking. That was almost the problem. The cognitive cost was the quiet, constant load of holding every domain at once, even on days when none of them needed me.
So I built a chief of staff. A system of eight AI agents, each owning one domain of my life, coordinated by an executive layer that briefs me every morning and a voice assistant in my kitchen that answers to "Sinabot." The point was being present with my kids instead of tracking moving pieces in the back of my mind. Every architecture decision below should be read against that goal, because what this system actually produces is mental space.
The Inventory Is the Architecture
The system started as a list of everything I was holding in my head. That list became the agent fleet, one project per domain: clinical notes, family budget, clinical business management, work assistant, reputation management, and so on, eight in all. The architecture diagram and the cognitive load inventory are the same diagram. That is the design insight I would keep if I had to throw away everything else: you transcribe a personal operating system from the one already running on expensive wetware.
Each agent lives in its own Claude Cowork project with its own context, its own files, and one shared contract: a machine-parseable task file (pending_actions.md) where every line is a date and a commitment in a fixed format. An executive layer, which I call the Notification Agent, reads all eight. It consolidates them into a master calendar, syncs bidirectionally with Google Tasks, places protected focus blocks on my work calendar, runs weekly reports, and checks the health of every project folder. The agents do the domain thinking. The Notification Agent does what a chief of staff does: it sees everything, surfaces what matters, and processes nothing itself.
The Morning Chain
The system's heartbeat is a dependency-ordered chain of scheduled runs that completes before I wake up:
- 05:30 daily backup
- 05:45 reputation engine dashboard check
- 06:00 morning briefing
- 06:20 daily task sync
- 06:35 meeting prep
The order is load-bearing. Writers of the task files must finish before the sync reads them; the sync must place focus blocks before meeting prep schedules around them. An 11:00 catchup runner walks the same chain in dependency order for mornings when the laptop stayed closed. None of this is exotic. It is cron jobs and file formats. The discipline is in the ordering and in what each stage is forbidden to touch.
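A minimal sketch of how a catchup runner could walk the chain in dependency order, using the standard library's `graphlib`. The stage names and the `CHAIN` map are hypothetical. Only "sync after the task-file writers" and "meeting prep after sync" come from the text; the other edges are assumptions that simply mirror the clock order.

```python
from graphlib import TopologicalSorter

# Hypothetical dependency map: each stage lists the stages that must finish first.
# Edges other than sync -> meeting prep are assumptions that mirror the schedule.
CHAIN = {
    "daily_backup": set(),
    "reputation_dashboard_check": {"daily_backup"},
    "morning_briefing": {"reputation_dashboard_check"},
    "daily_task_sync": {"morning_briefing"},
    "meeting_prep": {"daily_task_sync"},
}


def catchup_order(chain, already_done):
    """Return the stages still to run, in dependency order."""
    order = TopologicalSorter(chain).static_order()
    return [stage for stage in order if stage not in already_done]


# Example: the laptop slept after the backup ran.
remaining = catchup_order(CHAIN, already_done={"daily_backup"})
```

How `already_done` gets populated is left to the caller; in this sketch it is just a set of stage names.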
The Accidental Brain
Here is the part I did not plan. Months into running this, I noticed the fleet had organized itself into a shape I recognized from medical school.
| System component | What it resembles | Why the resemblance is structural |
|---|---|---|
| Notification Agent | Frontal lobe, executive function | Plans, sequences the morning chain, and inhibits: a "no cross-project mutation" rule plus four loop-prevention primitives are inhibition implemented as network protocol |
| Domain agents | Specialized cortical regions | Each owns one domain. The visual cortex does not do language; the clinical notes agent does not do budgeting |
| Voice agent (an 8B local model) | Motor and auditory cortex | Speech planning and comprehension with no reasoning. Its capability ceiling shapes the whole interface |
| Workflow relay (n8n) | Thalamus | Relays signals to the right region and decides nothing |
| Shared mailbox | Corpus callosum | Structured lateral connections between regions, mediated by the executive layer |
I want to be careful with this claim, because it is stronger when it is honest: I did not design any of this to mimic a brain. The resemblances are post hoc. The interesting conclusion is not that I am clever but that the same coordination pressures produce the same patterns. Put one planner above many specialists with limited bandwidth between them, and you converge on executive inhibition, regional specialization, and a relay layer, whether the substrate is neurons or cron jobs.
The convergence isn't limited to my house. A week before this piece, OpenAI shipped a feature it calls Dreaming: an asynchronous background process that synthesizes memory from many conversations at once, captures context that arose naturally, and updates older memories as circumstances change. Factual recall on their internal benchmark jumped from 41.5 percent to 82.8 percent across two iterations of it. Different team, different scale, same pressure. Consolidating state across many sessions without blocking the live path lands you on a background loop. They reached for the word "dream" for the same reason I reached for "brain". Convergent vocabulary follows convergent architecture.
The analogy also breaks where it should. Brains have no single state authority; memory is distributed, and the hippocampus indexes rather than stores. My system keeps canonical state in one durable database, which is nothing like anatomy and exactly why it is the right engineering call. The analogy's real value is as a design test: if a proposed flow has the mouth making judgment calls, the brain doing domain work, or one arm reaching into another, the design is wrong.
Sinabot: Voice Is a Terminal, Not a Reasoner
The voice layer runs entirely on one consumer GPU box in my house, driven by a custom Python orchestrator, about 2,900 lines, written because nothing off the shelf could drive my hardware end to end. There is no Home Assistant in the voice path, and that was a measurement, not an aesthetic: HA's assist pipeline assembles the entire LLM response before handing it to speech synthesis, which added roughly 650 milliseconds, so the orchestrator exists to stream sentence by sentence instead.
It speaks two protocols: the Wyoming protocol to a Raspberry Pi satellite and the ESPHome native API to three Voice PE pucks, for four satellites in all, covering the rooms where life happens. Everything answers to "Sinabot," via two custom-trained wake word detectors: an on-device microWakeWord model on the pucks' ESP32-S3 chips (60 kilobytes on flash) and a server-side equivalent on the Pi, trained on the same phrase.
Speech-to-text is WhisperLive running a TensorRT-compiled Whisper small.en on GPU. The language model is Llama 3.1 8B quantized to INT4, served by vLLM at roughly 89 tokens per second in about 7 GiB of VRAM. Text-to-speech is Kokoro, an 82-million-parameter model with a British male voice that sounds appropriately like a chief of staff. The gap between the moment I stop talking and the first audible word of the reply is about 300 milliseconds, and the rest of the answer streams sentence by sentence while the language model is still generating it.
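The sentence-by-sentence streaming idea can be sketched as a small generator that buffers model tokens and releases each sentence as soon as it is complete. This is an illustration, not the orchestrator's actual code: `stream_sentences` and the splitting regex are hypothetical, and a naive split like this will also break on abbreviations such as "Dr."

```python
import re

# A sentence boundary here means whitespace that follows ., ! or ?
SENTENCE_END = re.compile(r'(?<=[.!?])\s+')


def stream_sentences(token_stream):
    """Yield complete sentences as soon as they appear in a token stream."""
    buffer = ""
    for token in token_stream:
        buffer += token
        parts = SENTENCE_END.split(buffer)
        # Everything except the last part is a finished sentence.
        for sentence in parts[:-1]:
            if sentence:
                yield sentence
        buffer = parts[-1]
    # Flush whatever is left once the model stops generating.
    if buffer.strip():
        yield buffer.strip()
```

Each yielded sentence can be handed to speech synthesis while later tokens are still arriving, which is what removes the wait for the full response.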
The VRAM budget deserves to be a character in this story: 11.7 of 12.3 GiB committed with all three engines warm, about 155 MiB free. That kind of headroom is a threat, and it made most of the decisions. A fancier speech model ran the card out of memory and was sent away. FP8 Llama benched both slower and larger than INT4, which is not how that comparison is supposed to go, and lost. A multimodal model could not fit at any quantization because its encoders refuse to shrink. The single biggest latency win of the project, moving text-to-speech from CPU to GPU, cut time-to-first-audio from 370 to 108 milliseconds, and was only possible because everything else had already been starved small enough to leave it room. Constraint-driven design sounds like deprivation, but it produced the system's best rule:
Voice is a terminal, not a reasoner. The 8B model classifies intent, extracts arguments, and speaks short acknowledgments. It never composes content. The supporting pattern: a plain Python job walks every project's task file and compiles a voice context package (daily.json) of pre-baked, TTS-ready strings, zero LLM tokens spent, so that for any query the model receives only the one string it should speak, thirty to sixty tokens, and reads it verbatim. It never sees the underlying data. This eliminates an entire failure class, hallucination by paraphrase, by moving formatting upstream to the single state authority. A small model that only reads is more trustworthy than a large model that improvises. (Full disclosure for fellow builders: as I write this, the refresh job behind that package is paused mid-rewiring, so the pattern is shipped but dormant. The lesson it taught survives below.)
```json
{
  "generated_at": "2026-05-17T04:30:41Z",
  "projects": {
    "family-budget": {
      "open_count": 43,
      "overdue_count": 1,
      "top_priority": "Reconcile June statements against the cashflow tracker",
      "top_priority_category": "overdue",
      "top_priority_meta": { "days_late": 2 }
    }
  },
  "tts": {
    "by_project": {
      "family-budget": "Top priority for Family Budget: Reconcile June statements, 2 days late. 43 items open, 1 overdue."
    },
    "summary": "186 items open across 8 projects, 6 with overdue."
  }
}
```
The schema itself records a lesson. The first version answered "what's on my list" with the list: a wall-of-text monologue, nearly six hundred characters, URLs read aloud and all. Nobody wants that in their kitchen at 7 a.m. The redesign primes exactly one task per project (overdue beats today beats this week, most days late first), caps the cross-project answer at the worst three, and waits to be asked for more. "What's my top priority" was never really a request for an enumeration. It was a request for a verdict.
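The priming rule (overdue beats today beats this week, most days late first, capped at the worst three) reduces to a sort key. A minimal sketch, assuming each task is a dict with hypothetical `category` and `days_late` fields; the real generator's data model may differ.

```python
# Lower rank means more urgent. Category names are assumptions for illustration.
CATEGORY_RANK = {"overdue": 0, "today": 1, "this_week": 2}


def urgency_key(task):
    """Sort key: category first, then most days late first."""
    return (CATEGORY_RANK[task["category"]], -task.get("days_late", 0))


def top_priority(tasks):
    """Pick exactly one task for a project, or None if it has none."""
    return min(tasks, key=urgency_key) if tasks else None


def worst_three(projects):
    """Cap the cross-project answer at the three most pressing projects."""
    primed = []
    for name, tasks in projects.items():
        best = top_priority(tasks)
        if best is not None:
            primed.append((name, best))
    primed.sort(key=lambda pair: urgency_key(pair[1]))
    return primed[:3]
```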
How Agents Talk Without Stepping on Each Other
Multi-agent systems fail at the seams, so the seams got the most design attention. Agents exchange messages through a unified mailbox on the GPU box's network share: one markdown file per message, named YYYY-MM-DDTHH-MM-SS_, written to a .tmp path and renamed into place so a message either exists completely or not at all. The frontmatter exists almost entirely to prevent loops:
```markdown
---
message_id: <uuid>                  # also returned to voice as job_id
parent_message_id: <uuid or null>   # null only if a human started this
from: <agent_slug>
to: <agent_slug>
intent: <optional verb>             # informational only, never used for routing
hop: <int>                          # n8n stamps 1 on voice escalations; +1 per forward
created_at: <iso8601>
idempotency_key: <optional>
---
# Body
Free-form markdown. For voice-bound messages: TTS-ready plain
language, one to four sentences, no markup.
```
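The write-to-`.tmp`-then-rename delivery described above can be sketched with `os.replace`. The `deliver` helper and its arguments are hypothetical, and the filename is supplied by the caller. The rename is atomic on a local POSIX filesystem; whether a given network share preserves that guarantee is an assumption worth testing.

```python
import os
from pathlib import Path


def deliver(mailbox: Path, filename: str, text: str) -> Path:
    """Write a message so readers see either the whole file or no file."""
    final_path = mailbox / filename
    tmp_path = mailbox / (filename + ".tmp")
    with open(tmp_path, "w", encoding="utf-8") as handle:
        handle.write(text)
        handle.flush()
        os.fsync(handle.fileno())  # make sure the bytes are on disk before the rename
    os.replace(tmp_path, final_path)  # the message appears under its final name in one step
    return final_path
```

Readers that ignore `*.tmp` files never observe a half-written message.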
Four primitives keep the fleet from talking itself into a storm, and they will look familiar if you have ever read about how internet routers avoid the same fate (BGP solves loop prevention with an AS_PATH list that does roughly what my causality chain does):
- Causality chain. Every message carries its ancestry (parent_message_id, with every state mutation journaled against the message that caused it). A recipient that finds itself in a candidate's ancestry rejects it. You cannot argue with your own echo.
- Hop counter. Messages attenuate. Anything past three hops is dropped, no matter how interesting it thinks it is.
- Idempotency keys. Mutations carry a key enforced by database unique constraints; processing the same message twice is a no-op, which makes retries safe and duplicate delivery boring.
- Agent of record. Every task has exactly one owner. If a message's sender matches the task's creator, the recipient rejects it rather than route the loop back home. Self versus non-self recognition: the difference between an immune system and an autoimmune disease.
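Three of the four primitives can be sketched as a single inbox check. This is a simplified illustration: `should_accept` is a hypothetical name, the ancestry walk over `parent_message_id` is assumed to be done by the caller, and an in-memory set stands in for the database unique constraint. The agent-of-record check is omitted because it needs task ownership data not shown here.

```python
MAX_HOPS = 3  # from the text: anything past three hops is dropped


def should_accept(message, me, ancestry_senders, seen_keys):
    """Return (accepted, reason) for one inbox message.

    ancestry_senders: agent slugs found by walking parent_message_id upward.
    seen_keys: idempotency keys that have already been processed.
    """
    if message["hop"] > MAX_HOPS:
        return False, "hop limit exceeded"
    if me in ancestry_senders:
        return False, "recipient already in ancestry"
    key = message.get("idempotency_key")
    if key is not None and key in seen_keys:
        return False, "duplicate idempotency key"
    return True, "ok"
```

Because every check is local to the recipient, each agent can run it on its own inbox without a central gatekeeper.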
Enforcement is deliberately distributed: each agent runs these checks on its own inbox at session start, and the executive layer runs a weekly sweep as the safety net for anything that slipped through. A centralized gatekeeper would have been one more component and one more timing dependency; the hop cap bounds the blast radius regardless of where it is checked.
On top of the plumbing sit two social rules. Surface but don't process: the Notification Agent sees all eight domains and flags stuck items, but never does another agent's domain work. Actions as proposals: no agent ever mutates another's state. It writes a proposal to the recipient's mailbox, and the recipient's own agent decides. Cross-project visibility without cross-project authority. Autonomy is what makes the loop prevention meaningful rather than decorative.
The Observability Layer
Here is the angle I have not seen in other personal-AI writeups. Most of these systems monitor the human. A real chief of staff also monitors the staff: my other automations, the n8n workflows, the cron schedules, the webhooks, are themselves fallible, and the Notification Agent treats them as peers to be checked, not infrastructure to be trusted.
One day made the case. On May 2nd, the daily check flagged a publishing schedule slip on one of my websites. False alarm: the tracker did not know that site was intentionally gated by a ramp-up rule. We taught the system to check the gating flags before alarming. The same morning, the same check caught a real one: a publish event had bypassed the logging webhook, and the scheduling state had silently drifted. Without the check, the next month of publish timing would have been quietly wrong. One false alarm and one real catch, same day. The false alarm taught the system; the real catch earned my trust. That pair closed the gap that matters most in personal automation, the gap between "I built a system" and "I trust the system."
The same posture runs all the way down the stack. The voice box refuses to start its orchestrator until a fifteen-check pre-flight passes: is the microphone actually producing audio, are all three models actually answering, is the VRAM headroom actually there. It fixes what it can fix (a stuck USB mic, a muted mixer, a dead wake-word service) and declines to launch on what it cannot. Nothing in this system is trusted because it was configured. It is trusted because it was checked.
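The "fix what it can, decline on what it cannot" posture can be sketched as a loop over probe and repair pairs. `run_preflight` and the shape of `checks` are hypothetical; the real pre-flight has fifteen checks whose details are not shown in this article.

```python
def run_preflight(checks):
    """Run each check; attempt one repair; return names of checks still failing.

    checks: list of (name, probe, repair) where probe() returns True when healthy
    and repair is a callable or None when nothing can be fixed automatically.
    """
    failures = []
    for name, probe, repair in checks:
        if probe():
            continue
        if repair is not None:
            repair()
            if probe():  # re-check instead of trusting the repair
                continue
        failures.append(name)
    return failures


def may_launch(checks):
    """The orchestrator starts only when nothing is left failing."""
    return not run_preflight(checks)
```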
This is also where the capability ceiling stops being about hardware. Once agents can watch agents, propose to agents, and teach agents, the system is limited less by compute than by what you can specify. The Green Lantern rule: the ring is only as good as the imagination holding it.
What I've Learned (Mostly by Breaking It)
The brochure version ends above. The engineering version is below, with dates, because the reliability lessons are the actual product.
Idempotency is the whole game
In April, the task sync started multiplying my Google Tasks. A cosmetic change (urgency prefixes on titles) broke matching, and a dictionary keyed by title collapsed duplicates so the code could not even see what it had done. The fix is unglamorous and load-bearing: every comparison goes through a normalizer that strips every prefix the system has ever used, plus a cleanup pass every cycle that asserts the world looks the way the ledger says it should.
```python
import re


def _strip_urgency_prefix(title):
    # Strips ALL legacy tier tags (TODAY/URGENT/SOON/SCHEDULED) in
    # addition to the current set, so existing Google Tasks reconcile
    # cleanly across the 2026-04-15 simplification.
    return re.sub(
        r'^\[(OVERDUE|TODAY|URGENT|SOON|SCHEDULED|SEO ALERT)\]\s*',
        '', title).strip()


def _normalize_desc(text, length=50):
    # Compare on a prefix-free, truncated, lowercased form of the title.
    stripped = _strip_urgency_prefix(text)
    return stripped[:length].lower()
```
Note what the comment admits: the matcher must understand every naming scheme the system has *ever* had, not just the current one, because the external mirror remembers your old mistakes even after you have reformed. If an agent writes to a system twice, the second write must cost nothing. Everything else is built on that.
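The "dictionary keyed by title collapsed duplicates" failure is worth seeing in miniature. A minimal sketch, assuming a normalizer equivalent to the one above (redefined here so the block stands alone): grouping into lists keeps every copy visible, where a plain dict would silently keep only the last one.

```python
import re
from collections import defaultdict

PREFIX = re.compile(r'^\[(OVERDUE|TODAY|URGENT|SOON|SCHEDULED|SEO ALERT)\]\s*')


def normalize(title, length=50):
    """Prefix-free, truncated, lowercased comparison key."""
    return PREFIX.sub('', title).strip()[:length].lower()


def duplicates(titles):
    """Return {normalized_key: [original titles]} for keys seen more than once."""
    groups = defaultdict(list)
    for title in titles:
        groups[normalize(title)].append(title)
    return {key: found for key, found in groups.items() if len(found) > 1}
```

A cleanup pass can then assert that `duplicates(...)` is empty after every cycle.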
Guards must fail loud
A one-line guard (if cal_id:) silently fell through past the success path, so candidate events "looked placed" and both counters in the report were confidently wrong. Two weeks later I found five task entries had been invisibly dropped for weeks because they used a date token (Future) the parser did not recognize and discarded without comment. Same lesson twice: in a machine-parsed format that humans edit, anything unparseable must be an error, never a skip. The validator I added caught its first real corruption within sixty seconds of being installed.
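A minimal sketch of a parser that treats anything unparseable as an error. The `DATE | commitment` line format is an assumption made for illustration; the actual fixed format of pending_actions.md is not shown in this article.

```python
from datetime import date


class TaskLineError(ValueError):
    """Raised when a task line cannot be parsed. Never skipped silently."""


def parse_task_line(line, lineno):
    # Hypothetical format: "YYYY-MM-DD | commitment text"
    date_part, sep, commitment = line.partition("|")
    if not sep or not commitment.strip():
        raise TaskLineError(
            f"line {lineno}: expected 'DATE | commitment', got {line!r}")
    try:
        due = date.fromisoformat(date_part.strip())
    except ValueError as exc:
        raise TaskLineError(
            f"line {lineno}: unrecognized date token {date_part.strip()!r}") from exc
    return due, commitment.strip()
```

With this shape, a token such as `Future` stops the run with a line number instead of quietly removing the task from the system's view.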
An empty scan is a config error, not a no-op
A path mis-resolution made the engine scan zero projects. The cleanup pass then did its job perfectly: it deleted sixteen valid calendar blocks that no longer appeared in the (empty) scan, and recreated none. Destructive phases now gate on evidence of work actually found. When a system that expects eight of something finds zero, the correct interpretation is "I am broken," not "everything is done."
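Gating the destructive phase on evidence of work can be as small as one guard. A sketch with hypothetical names (`stale_blocks`, `EmptyScanError`); the real engine's data structures are not shown here.

```python
class EmptyScanError(RuntimeError):
    """An empty scan is treated as a broken configuration, not as 'all done'."""


def stale_blocks(existing_blocks, scanned_projects, wanted_blocks):
    """Return calendar blocks that are safe to delete.

    Refuses to compute deletions at all when the scan found no projects.
    """
    if not scanned_projects:
        raise EmptyScanError("scan found zero projects; refusing to delete anything")
    return sorted(set(existing_blocks) - set(wanted_blocks))
```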
Append first, destroy after
The same bug class bit me at two layers a week apart: a calendar rebuild that deleted everything before re-inserting (and died in the middle), and an archive routine that trimmed the source file before the archive append landed. Write the new state first, verify it, then destroy the old. In that order, a crash costs you a duplicate. In the other order, it costs you the data.
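The ordering rule can be sketched for the archive case: append, verify, and only then trim the source. `archive_done` is a hypothetical helper, and matching lines by exact text is a simplification.

```python
from pathlib import Path


def archive_done(source: Path, archive: Path, done_lines: list) -> None:
    """Move finished lines from source to archive without a data-loss window."""
    payload = "".join(line + "\n" for line in done_lines)

    # 1. Append first.
    with open(archive, "a", encoding="utf-8") as handle:
        handle.write(payload)

    # 2. Verify the append landed before touching the source.
    if not archive.read_text(encoding="utf-8").endswith(payload):
        raise RuntimeError("archive append could not be verified; source left untouched")

    # 3. Destroy last: rewrite the source via a temp file and a rename.
    done = set(done_lines)
    kept = [line for line in source.read_text(encoding="utf-8").splitlines()
            if line not in done]
    tmp = source.with_name(source.name + ".tmp")
    tmp.write_text("".join(line + "\n" for line in kept), encoding="utf-8")
    tmp.replace(source)
```

A crash between steps 2 and 3 leaves the lines in both files, which is the duplicate the text describes, not a loss.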
The day the assistant read its own JSON aloud
One evening in May I asked the voice assistant to handle something, and it replied, in its dignified British baritone, with the literal text of a tool call: brace, quote, utterance, colon. The 8B had emitted escalate({...}) as plain text instead of a structured call, and the pipeline did exactly what it is built to do with text: it spoke it. The fix came in layers. Unknown tool names are now caught and treated as speech rather than crashed on. More importantly, tools are keyword-gated before the model ever sees them: if your utterance contains nothing that could plausibly want a tool ("look into," "draft," "have Claude"), the tool simply is not in the model's tool list for that turn. You cannot hallucinate a tool you were never offered. Small models do not need more scolding in the system prompt. They need fewer opportunities.
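Keyword gating can be sketched in a few lines. The trigger phrases and the `escalate` tool name come from the text; the mapping between them and the `tools_for_turn` helper are assumptions for illustration.

```python
# Hypothetical mapping from tool name to the phrases that may unlock it.
TOOL_TRIGGERS = {
    "escalate": ("look into", "draft", "have claude"),
}


def tools_for_turn(utterance):
    """Return the tool names the model is allowed to see for this utterance."""
    text = utterance.lower()
    return [name for name, phrases in TOOL_TRIGGERS.items()
            if any(phrase in text for phrase in phrases)]
```

When the returned list is empty, the request to the model simply carries no tool definitions for that turn.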
Agents reviewing each other's specs is the multi-agent payoff in miniature
In May, the Notification Agent's spec assumed a transport the voice system does not speak and an n8n instance that did not exist. The Voice Hub agent, reviewing the spec, knew both facts cold. Catching assumption rot at spec stage cost a paragraph; catching it after integration would have cost a weekend. The cheapest QA I have is one agent reading another agent's plan.
A sidebar for the connoisseurs: I once watched cp zero out a 935-line engine because two mount paths resolved to the same inode. The seams between agent runtimes, sandbox mounts, and "obvious" shell idioms are where the dragons live.
And one honest note on autonomy. This system runs unattended every morning, but destructive operations have exactly one owner: the supervised canonical morning run. The catchup runner keys off observed state rather than declared flags, which makes it robust to missed runs and means it would happily ignore an intentional pause. I know that. The strength and the footgun are the same design choice, and I chose it with eyes open.
What It Costs
In April I estimated the daily task sync at about 67,000 tokens per run. In June I measured it properly: 64,000. I am still framing that one. The monthly estimate, though, was wrong in an instructive way, because "tokens per month" turns out to be three different questions.
Footprint, the final context plus outputs of each run, totals about 8 million tokens a month across the whole fleet of eleven scheduled tasks. That is the number closest to my old guess of 6.2 million, and it is the least true. Cumulative processed is the agentic reality: every one of a run's API turns re-reads the entire conversation so far, so the daily sync's 66 turns process about 2.6 million tokens per run, and the fleet processes roughly 190 million a month. Billed-equivalent is what prompt caching makes of that: cached context re-reads cost about a tenth of fresh tokens, which tames 190 million processed to roughly 26 million. Processed runs about 24x footprint and about 7x billed; that second gap is the entire economic case for prompt caching, measured on my own kitchen-table workload.
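The three figures can be cross-checked with a few lines of arithmetic. One assumption is labeled loudly: that the fresh (uncached) share of the 190 million is roughly the 8 million footprint. That is my reading of the numbers, not something stated as the measurement method.

```python
def billed_equivalent(processed, fresh, cached_rate=0.1):
    """Fresh tokens at full price plus cached re-reads at a fraction of it."""
    cached = processed - fresh
    return fresh + cached * cached_rate


PROCESSED = 190_000_000   # cumulative tokens processed per month (from the text)
FOOTPRINT = 8_000_000     # footprint per month (from the text); assumed ~= fresh share

billed = billed_equivalent(PROCESSED, FOOTPRINT)
processed_vs_billed = PROCESSED / billed        # about 7.3
processed_vs_footprint = PROCESSED / FOOTPRINT  # 23.75
```

Under that assumption the result lands on about 26.2 million billed-equivalent tokens, in line with the figure above.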
Three findings from the measurement surprised me. First, cost concentrates brutally: two tasks (the daily task sync and the publishing-system health check) are 80 percent of total spend. Optimizing anything else is theater. Second, instructions are a per-turn tax: one project's 32KB context file rides along on all 48 turns of every health check, roughly 12 million tokens a month of pure repeated instructions. Trimming that file is the cheapest optimization in the system. Third, turn count beats prompt size: cumulative cost scales with the square of the number of turns, so halving a 66-turn run saves more than any prompt diet ever will.
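Both the per-turn instruction tax and the turn-count effect can be checked with a toy model. The starting context (15,000 tokens), growth per turn (750 tokens), 4 bytes per token, and 30 runs a month are all assumptions, chosen so the model lands near the measured 64,000-token footprint and 2.6 million tokens per run.

```python
def cumulative_tokens(turns, base, growth_per_turn):
    """Total tokens processed when every turn re-reads a linearly growing context."""
    return sum(base + growth_per_turn * i for i in range(turns))


def instruction_tax(file_bytes, turns_per_run, runs_per_month, bytes_per_token=4):
    """Tokens spent per month re-reading one instruction file on every turn."""
    return file_bytes // bytes_per_token * turns_per_run * runs_per_month


full_run = cumulative_tokens(66, base=15_000, growth_per_turn=750)   # 2,598,750
half_run = cumulative_tokens(33, base=15_000, growth_per_turn=750)   # 891,000
tax = instruction_tax(32 * 1024, turns_per_run=48, runs_per_month=30)  # 11,796,480
```

In this model, halving the turn count cuts the run to roughly a third of its cost, because the growth term scales with the square of the number of turns.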
The levers I actually pulled, in order of effect: rewrote the voice-context generation as pure Python (zero tokens, every hour, forever); added a skip-when-equal comparison that turned roughly 88 no-op task updates per cycle into actual no-ops; cut an oversight cron from 24 firings a day to 15, then paused it entirely when its value did not justify its spend; and adopted a standing rule that anything 90 percent deterministic and 10 percent phrasing gets rewritten as Python with a thin prompt, or dropped. The voice path itself costs nothing by design: the local 8B never calls a cloud API at all.
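The skip-when-equal comparison is the simplest of those levers to show. A sketch with hypothetical task dicts keyed by ID; the fields actually compared by the sync engine are not shown in this article.

```python
def plan_updates(local, remote):
    """Return only the tasks whose remote copy is missing or different."""
    return {task_id: task for task_id, task in local.items()
            if remote.get(task_id) != task}
```

Anything not returned costs no API call and no tokens for that cycle.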
What's Next
Three things, none of which exist yet, all of which the architecture has a slot for. A web dashboard, once the mechanics are boring enough to deserve pixels (the current friction is the spec). Push notifications, for the rare item that should not wait for morning. And the one I keep circling: a therapist agent as the system's missing amygdala. Today the system detects logical urgency only, deadlines and overdue counts. It has no analog for salience: "this task got deferred four times this week," or "your stated thesis is mental space, but your calendar shows eighteen focus blocks." A region that does not act but biases attention, flagging the patterns the rational planner fails to weight. The brain analogy keeps earning its keep by telling me what is missing.
Open Source
The coordination layer is published at github.com/sinabarimd/chief-of-staff: the full voice orchestrator (about 2,900 lines, including the ESPHome-to-Whisper bridge, the sentence-streaming pipeline, and the boot-time pre-flight), the four loop-prevention primitives, the mailbox schema, the morning-chain layout, the voice-context package generator, the custom wake-word models, and sanitized excerpts of the sync engine, plus the token cost methodology so you can measure your own fleet instead of guessing. Same spirit as the Reputation Engine I published earlier this year: a reference implementation. Names, tokens, and anything resembling client or patient information are stripped.
Who Is This For?
For people who recognize the feeling in the first paragraph: nothing is on fire, and you are still spending a meaningful fraction of your mind keeping the inventory. The technology here is unremarkable on purpose: cron, files, one mid-sized GPU, small models doing small jobs. What changed my life was custody. Every domain has an owner that is not me. The briefing happens whether I open the laptop or not. The kitchen answers questions so I do not have to go check.
The juggling act still exists. It is just no longer running on me. That is the entire return on investment, and I would not trade it for any benchmark score: mornings where the only thing I am tracking is breakfast.
Sina Bari, MD, is a Stanford-trained plastic and reconstructive surgeon and VP of Medical AI at iMerit. He writes about medicine, technology, and building things at sinabarimd.com.