mic

Type: Software
Status: In Progress
Deadline: TBD
Created: 2026-04-24

What is this?

A lightweight native iOS app for on-device voice transcription. Tap a Lock Screen control to start recording, tap again to stop. Transcriptions are timestamped, stored locally as markdown files, and synced via iCloud Documents. Works fully offline — no internet required. Automatically identifies speakers when multiple people are talking.

Motivation

Quick, frictionless voice capture with no cloud dependency for recording or transcription; iCloud is used only to sync the finished files. The flashlight metaphor: always available, one tap, no app to open. Transcriptions persist and sync silently across devices.

What does success look like?

One-tap recording from Lock Screen (no app launch needed)
Accurate on-device transcription via WhisperKit (Whisper Small)
Speaker diarization — identifies and labels different speakers
Transcriptions stored as timestamped markdown files in iCloud Documents
Works completely offline
Nothing-inspired UI: monochrome, industrial minimalism, Space Grotesk and Space Mono typography
Battery-efficient recording

Technical

Stack: SwiftUI, WhisperKit (Whisper Small ~460MB), SpeakerKit (Pyannote ~30MB), AVFoundation
Repo: github.com/djt53/mic
Deployment target: iOS 18+ (Lock Screen controls require the ControlWidget API)
Intended users: Personal use, power users who want fast voice-to-text

Architecture

Storage Format

Transcription files stored in iCloud Documents container (iCloud~com~davidtingle~mic):

Documents/
  2026-04-24T14-30-00.md
  2026-04-24T15-45-22.md

Single speaker:

---
date: 2026-04-24T14:30:00Z
duration: 45s
title: Groceries
---

Transcribed text goes here.

Multi-speaker:

---
date: 2026-04-24T10:15:00Z
duration: 2m 8s
title: Launch Sync
speakers: true
---

**Speaker 1:** Let's push the launch to next Thursday.

**Speaker 2:** Sure, I'll have a draft by Monday.
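
The two layouts above can be produced by one serializer. The following is a minimal sketch, not the contents of Transcription.swift: the type and member names (TranscriptionFile, SpeakerTurn, formatDuration, fileName, markdown) are hypothetical, and it assumes file names are derived from the recording date in UTC, which matches the single-speaker example but is not stated outright.

```swift
import Foundation

// Hypothetical model for illustration; the real Transcription.swift may differ.
struct SpeakerTurn {
    let speaker: Int      // 1-based speaker number
    let text: String
}

struct TranscriptionFile {
    let date: Date
    let durationSeconds: Int
    let title: String
    let body: String            // used when there are no speaker turns
    let turns: [SpeakerTurn]    // empty for single-speaker recordings

    // Formats "45s" or "2m 8s", matching the examples above.
    static func formatDuration(_ seconds: Int) -> String {
        let m = seconds / 60, s = seconds % 60
        return m == 0 ? "\(s)s" : "\(m)m \(s)s"
    }

    // File name such as 2026-04-24T14-30-00.md (assumption: UTC, colons replaced by hyphens).
    var fileName: String {
        let f = DateFormatter()
        f.locale = Locale(identifier: "en_US_POSIX")
        f.timeZone = TimeZone(identifier: "UTC")
        f.dateFormat = "yyyy-MM-dd'T'HH-mm-ss"
        return f.string(from: date) + ".md"
    }

    // Front matter followed by either the plain body or one block per speaker turn.
    func markdown() -> String {
        var lines = ["---",
                     "date: \(ISO8601DateFormatter().string(from: date))",
                     "duration: \(Self.formatDuration(durationSeconds))",
                     "title: \(title)"]
        if !turns.isEmpty { lines.append("speakers: true") }
        lines.append("---")
        lines.append("")
        if turns.isEmpty {
            lines.append(body)
        } else {
            let blocks = turns.map { "**Speaker \($0.speaker):** \($0.text)" }
            lines.append(blocks.joined(separator: "\n\n"))
        }
        return lines.joined(separator: "\n") + "\n"
    }
}
```

Keeping the front matter to flat key: value pairs means the files stay readable in any markdown editor and can be parsed back without a YAML library.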

App Structure

mic/
├── micApp.swift                    # Entry point
├── Models/
│   └── Transcription.swift         # Model + markdown serialization + speaker turns
├── Services/
│   ├── AudioRecorder.swift         # AVAudioEngine recording (native rate, post-convert)
│   ├── TranscriptionEngine.swift   # WhisperKit + SpeakerKit wrapper
│   ├── StorageService.swift        # iCloud Documents read/write + local eviction
│   ├── RecordingManager.swift      # Lock Screen control ↔ app bridge
│   └── SharedContainer.swift       # iCloud container paths
├── Views/
│   ├── HomeView.swift              # List + search + record button
│   ├── TranscriptionView.swift     # Detail with speaker turn rendering
│   ├── TranscriptionRow.swift      # List row
│   ├── RecordingCard.swift         # Active recording state
│   └── ModelDownloadView.swift     # First-launch model download
├── Theme/
│   └── Theme.swift                 # Nothing-inspired design tokens
├── micControl/
│   ├── RecordControl.swift         # Lock Screen ControlWidget
│   └── micControlBundle.swift      # Widget bundle
└── Resources/Fonts/
    ├── SpaceGrotesk-Variable.ttf
    ├── SpaceMono-Regular.ttf
    └── SpaceMono-Bold.ttf

Key Components

Lock Screen Control — ControlWidget (iOS 18+) toggles recording via App Group UserDefaults
AudioRecorder — Records at native sample rate, converts to 16kHz mono WAV after stop (battery optimization)
TranscriptionEngine — WhisperKit (transcription) + SpeakerKit (diarization), both on-device
StorageService — FileManager reads/writes markdown to iCloud Documents; delete = local eviction (file stays in iCloud)
Search — Filters transcriptions by title and body text
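
As an illustration of the Lock Screen control bridge listed above, here is a rough sketch of a flag shared through App Group UserDefaults. It is not the code in RecordingManager.swift: the RecordingFlag type, the key name, and the suite name in the usage note are all assumptions made for the example.

```swift
import Foundation

// Hypothetical bridge; identifiers are illustrative, not taken from the repo.
struct RecordingFlag {
    static let key = "recordingRequested"   // assumed key name
    private let defaults: UserDefaults

    // Returns nil if the suite cannot be opened (for example, a misconfigured App Group).
    init?(suiteName: String) {
        guard let defaults = UserDefaults(suiteName: suiteName) else { return nil }
        self.defaults = defaults
    }

    // True when the control has asked for recording to be active.
    var isRequested: Bool { defaults.bool(forKey: Self.key) }

    // Flips the flag and returns the new value.
    @discardableResult
    func toggle() -> Bool {
        let next = !isRequested
        defaults.set(next, forKey: Self.key)
        return next
    }
}
```

Under these assumptions, the control extension would call `RecordingFlag(suiteName: "group.com.example.mic")?.toggle()` (a placeholder group identifier) and the app would read `isRequested` to decide whether to start or stop the recorder.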

Battery Optimizations

Record at native sample rate → single batch conversion after stop (no real-time resampling)
Elapsed-time timer ticks at 1 Hz (not 10 Hz)
Audio level computed every 4th buffer via vDSP (not every buffer, not manual loop)
Larger audio buffer (8192 frames) for fewer callbacks
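
The "every 4th buffer" metering above could look roughly like the sketch below. LevelMeter and its members are hypothetical names, and the choice of RMS as the level measure is an assumption; the source only says the level is computed via vDSP. The block uses vDSP.rootMeanSquare where Accelerate is available and falls back to a plain loop elsewhere so it stays runnable off-device.

```swift
import Foundation
#if canImport(Accelerate)
import Accelerate
#endif

// Hypothetical helper: computes a level only on every Nth buffer it is given.
struct LevelMeter {
    let interval: Int                 // 4 means every 4th buffer
    private var counter = 0
    private(set) var lastLevel: Float = 0

    init(interval: Int = 4) { self.interval = max(1, interval) }

    // Root mean square of the samples (assumed level measure).
    static func rms(_ samples: [Float]) -> Float {
        guard !samples.isEmpty else { return 0 }
        #if canImport(Accelerate)
        return vDSP.rootMeanSquare(samples)
        #else
        let sumOfSquares = samples.reduce(Float(0)) { $0 + $1 * $1 }
        return (sumOfSquares / Float(samples.count)).squareRoot()
        #endif
    }

    // Returns a fresh level on every `interval`-th call, nil otherwise.
    mutating func process(_ samples: [Float]) -> Float? {
        counter += 1
        guard counter % interval == 0 else { return nil }
        lastLevel = Self.rms(samples)
        return lastLevel
    }
}
```

In the recorder, `process` would be fed the samples from each tap callback; with 8192-frame buffers and an interval of 4, the level is refreshed only once per 32768 frames, which is still frequent enough for a level indicator.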

Speaker Diarization

Uses SpeakerKit (Pyannote) — bundled with WhisperKit, ~30MB additional models
Models download in background after WhisperKit loads (non-blocking)
WhisperKit runs with word-level timestamps + VAD chunking
SpeakerKit diarizes the audio array, then aligns speaker segments with the transcription using the subsegment strategy
Consecutive segments from the same speaker are merged into turns
Speaker turns stored in markdown as **Speaker N:** text blocks
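
The merge step above (consecutive segments from the same speaker become one turn) can be sketched independently of SpeakerKit. The Segment and Turn types and the mergeIntoTurns function are hypothetical stand-ins for whatever the alignment step actually produces.

```swift
// Hypothetical types for illustration; the real aligned-segment type may differ.
struct Segment {
    let speaker: Int
    let text: String
    let start: Double   // seconds
    let end: Double     // seconds
}

struct Turn {
    var speaker: Int
    var text: String
    var start: Double
    var end: Double
}

// Collapses runs of adjacent segments that share a speaker into single turns.
func mergeIntoTurns(_ segments: [Segment]) -> [Turn] {
    var turns: [Turn] = []
    for segment in segments {
        if var last = turns.last, last.speaker == segment.speaker {
            // Same speaker as the previous turn: extend it.
            last.text += " " + segment.text
            last.end = segment.end
            turns[turns.count - 1] = last
        } else {
            // Speaker changed (or first segment): start a new turn.
            turns.append(Turn(speaker: segment.speaker, text: segment.text,
                              start: segment.start, end: segment.end))
        }
    }
    return turns
}
```

Each resulting turn maps directly onto one **Speaker N:** block in the stored markdown.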

Design System (Nothing-inspired)

Typography: Space Grotesk (body), Space Mono (timestamps/metadata)
Palette: Monochrome — OLED black backgrounds, white/gray text
Hierarchy: Display → Body → Metadata (three layers, strictly enforced)
Accent: Signal red (#D71921) — recording state, speaker 1 label
Speaker colors: Red, Blue, Green, Amber, Purple (cycled when there are more than five speakers)
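
Cycling the five speaker colors comes down to a modulo lookup. This is a small sketch with a hypothetical SpeakerColor enum; the actual tokens in Theme.swift are presumably SwiftUI colors rather than plain cases.

```swift
// Hypothetical token list in the order given above.
enum SpeakerColor: String, CaseIterable {
    case red, blue, green, amber, purple

    // Maps a 1-based speaker number to a color, wrapping after the last one.
    static func forSpeaker(_ number: Int) -> SpeakerColor {
        let all = allCases
        let index = (max(1, number) - 1) % all.count
        return all[index]
    }
}
```

Speaker 1 maps to red, consistent with the accent color being used for the speaker 1 label, and speaker 6 wraps back to red.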