mic

Type: Software
Status: In Progress
Deadline: TBD
Created: 2026-04-24

What is this?

A lightweight native iOS app for on-device voice transcription. Tap a Lock Screen control to start recording, tap again to stop. Transcriptions are timestamped, stored locally as markdown files, and synced via iCloud Documents. After the first-launch model download, recording and transcription work fully offline; only iCloud sync needs a connection. Speakers are identified automatically when multiple people are talking.

Motivation

Quick, frictionless voice capture with no cloud dependency for recording or transcription. The flashlight metaphor: always available, one tap, no app to open. Transcriptions persist and sync silently across devices.

What does success look like?

  • One-tap recording from Lock Screen (no app launch needed)
  • Accurate on-device transcription via WhisperKit (Whisper Small)
  • Speaker diarization — identifies and labels different speakers
  • Transcriptions stored as timestamped markdown files in iCloud Documents
  • Works completely offline
  • Nothing-inspired UI: monochrome, industrial minimalism, Space Grotesk/Mono typography
  • Battery-efficient recording

Technical

  • Stack: SwiftUI, WhisperKit (Whisper Small ~460MB), SpeakerKit (Pyannote ~30MB), AVFoundation
  • Repo: github.com/djt53/mic
  • Deploy target: iOS 18+ (Lock Screen controls require ControlWidget API)
  • Intended users: Personal use, power users who want fast voice-to-text

Architecture

Storage Format

Transcription files are stored in the iCloud Documents container (iCloud~com~davidtingle~mic):

Documents/
  2026-04-24T14-30-00.md
  2026-04-24T15-45-22.md

Single speaker:

---
date: 2026-04-24T14:30:00Z
duration: 45s
title: Groceries
---

Transcribed text goes here.

Multi-speaker:

---
date: 2026-04-24T10:15:00Z
duration: 2m 8s
title: Launch Sync
speakers: true
---

**Speaker 1:** Let's push the launch to next Thursday.

**Speaker 2:** Sure, I'll have a draft by Monday.
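
The two layouts above can be produced by a small serializer. The following is a minimal sketch, not the contents of Transcription.swift: the type names (SpeakerTurn, TranscriptionDoc) and function names are hypothetical, and it assumes file names are derived from the UTC recording time with colons replaced by hyphens, which matches the examples but is not stated explicitly.

```swift
import Foundation

// Hypothetical model types; the real Transcription.swift may differ.
struct SpeakerTurn {
    let speaker: Int      // 1-based speaker number
    let text: String
}

struct TranscriptionDoc {
    let date: Date
    let durationSeconds: Int
    let title: String
    let body: String            // used when there are no speaker turns
    let turns: [SpeakerTurn]    // empty for single-speaker recordings
}

// Formats a duration as "45s" or "2m 8s", matching the examples above.
func formatDuration(_ seconds: Int) -> String {
    let m = seconds / 60, s = seconds % 60
    return m > 0 ? "\(m)m \(s)s" : "\(s)s"
}

// Builds a file name such as "2026-04-24T14-30-00.md" (assumption: UTC time).
func fileName(for date: Date) -> String {
    let f = DateFormatter()
    f.locale = Locale(identifier: "en_US_POSIX")
    f.timeZone = TimeZone(identifier: "UTC")
    f.dateFormat = "yyyy-MM-dd'T'HH-mm-ss"
    return f.string(from: date) + ".md"
}

// Renders front matter plus either the plain body or the speaker turns.
func markdown(for doc: TranscriptionDoc) -> String {
    let iso = ISO8601DateFormatter()   // UTC by default, e.g. 2026-04-24T14:30:00Z
    var lines = ["---",
                 "date: \(iso.string(from: doc.date))",
                 "duration: \(formatDuration(doc.durationSeconds))",
                 "title: \(doc.title)"]
    if !doc.turns.isEmpty { lines.append("speakers: true") }
    lines.append("---")
    lines.append("")
    if doc.turns.isEmpty {
        lines.append(doc.body)
    } else {
        lines.append(doc.turns
            .map { "**Speaker \($0.speaker):** \($0.text)" }
            .joined(separator: "\n\n"))
    }
    return lines.joined(separator: "\n") + "\n"
}
```

Keeping the front matter to flat key: value pairs means the files stay readable in any markdown editor and can be parsed back line by line without a YAML library.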

App Structure

mic/
├── micApp.swift                    # Entry point
├── Models/
│   └── Transcription.swift         # Model + markdown serialization + speaker turns
├── Services/
│   ├── AudioRecorder.swift         # AVAudioEngine recording (native rate, post-convert)
│   ├── TranscriptionEngine.swift   # WhisperKit + SpeakerKit wrapper
│   ├── StorageService.swift        # iCloud Documents read/write + local eviction
│   ├── RecordingManager.swift      # Lock Screen control ↔ app bridge
│   └── SharedContainer.swift       # iCloud container paths
├── Views/
│   ├── HomeView.swift              # List + search + record button
│   ├── TranscriptionView.swift     # Detail with speaker turn rendering
│   ├── TranscriptionRow.swift      # List row
│   ├── RecordingCard.swift         # Active recording state
│   └── ModelDownloadView.swift     # First-launch model download
├── Theme/
│   └── Theme.swift                 # Nothing-inspired design tokens
├── micControl/
│   ├── RecordControl.swift         # Lock Screen ControlWidget
│   └── micControlBundle.swift      # Widget bundle
└── Resources/Fonts/
    ├── SpaceGrotesk-Variable.ttf
    ├── SpaceMono-Regular.ttf
    └── SpaceMono-Bold.ttf

Key Components

  1. Lock Screen Control — ControlWidget (iOS 18+) toggles recording via App Group UserDefaults
  2. AudioRecorder — Records at native sample rate, converts to 16kHz mono WAV after stop (battery optimization)
  3. TranscriptionEngine — WhisperKit (transcription) + SpeakerKit (diarization), both on-device
  4. StorageService — FileManager reads/writes markdown to iCloud Documents; delete = local eviction (file stays in iCloud)
  5. Search — Filters transcriptions by title and body text
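
As an illustration of the first component, the sketch below shows one way a recording flag could be shared through App Group UserDefaults. It is a sketch under assumptions: the key name, the RecordingFlag helper, and the suite name group.example.mic are all hypothetical, and it leaves out the ControlWidget itself and however the app observes the change.

```swift
import Foundation

// Hypothetical key and helper; the real identifiers are not given in this document.
enum RecordingFlag {
    static let key = "isRecording"

    // Flip the flag and return the new value (what a toggle control would do on tap).
    @discardableResult
    static func toggle(in defaults: UserDefaults) -> Bool {
        let next = !defaults.bool(forKey: key)
        defaults.set(next, forKey: key)
        return next
    }

    // Read the current value (what the app side would check before starting or stopping).
    static func isOn(in defaults: UserDefaults) -> Bool {
        defaults.bool(forKey: key)
    }
}

// The app and the control extension would both open the same App Group suite.
// "group.example.mic" is a placeholder, not the project's real group identifier.
let shared = UserDefaults(suiteName: "group.example.mic") ?? .standard
```

Because the store is passed in as a parameter, the same helper can be exercised in tests with a throwaway suite instead of the real App Group.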

Battery Optimizations

  • Record at native sample rate → single batch conversion after stop (no real-time resampling)
  • Timer at 1Hz (not 10Hz) for elapsed time
  • Audio level computed every 4th buffer via vDSP (not every buffer, not manual loop)
  • Larger audio buffer (8192 frames) for fewer callbacks
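
The "every 4th buffer" metering can be sketched as follows. This is an illustrative version, not the code in AudioRecorder.swift: LevelMeter and rms are hypothetical names, the level is assumed to be an RMS value, and samples are passed as a Swift array for simplicity. vDSP_rmsqv is used where Accelerate is available, with a plain fallback so the sketch also builds on other platforms.

```swift
import Foundation
#if canImport(Accelerate)
import Accelerate
#endif

// Root-mean-square level of one buffer of samples.
func rms(_ samples: [Float]) -> Float {
    guard !samples.isEmpty else { return 0 }
    #if canImport(Accelerate)
    var result: Float = 0
    vDSP_rmsqv(samples, 1, &result, vDSP_Length(samples.count))
    return result
    #else
    // Portable fallback for platforms without Accelerate.
    let sumOfSquares = samples.reduce(Float(0)) { $0 + $1 * $1 }
    return (sumOfSquares / Float(samples.count)).squareRoot()
    #endif
}

// Recomputes the level only on every `interval`-th buffer and skips the rest.
struct LevelMeter {
    let interval: Int
    private(set) var level: Float = 0
    private var counter = 0

    init(interval: Int = 4) { self.interval = max(1, interval) }

    // Returns true when the level was recomputed for this buffer.
    mutating func process(_ samples: [Float]) -> Bool {
        counter += 1
        guard counter % interval == 0 else { return false }
        level = rms(samples)
        return true
    }
}
```

In a real audio tap the samples would more likely be read straight from the buffer's channel data pointer, which avoids copying into an array on each callback.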

Speaker Diarization

  • Uses SpeakerKit (Pyannote) — bundled with WhisperKit, ~30MB additional models
  • Models download in background after WhisperKit loads (non-blocking)
  • WhisperKit runs with word-level timestamps + VAD chunking
  • SpeakerKit diarizes the audio array, then aligns speaker segments with transcription using subsegment strategy
  • Consecutive segments from the same speaker are merged into turns
  • Speaker turns stored in markdown as **Speaker N:** text blocks
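
The merge step can be sketched in a few lines. The Segment and Turn types here are hypothetical stand-ins for whatever the alignment stage actually produces, and the sketch assumes segments arrive in chronological order with a speaker already assigned.

```swift
import Foundation

// Hypothetical input: a span of text already attributed to a speaker.
struct Segment {
    let speaker: Int
    let text: String
}

// One uninterrupted stretch of speech from a single speaker.
struct Turn: Equatable {
    let speaker: Int
    var text: String
}

// Merges consecutive segments from the same speaker into one turn.
// Segments that are empty after trimming are skipped entirely.
func mergeIntoTurns(_ segments: [Segment]) -> [Turn] {
    var turns: [Turn] = []
    for segment in segments {
        let text = segment.text.trimmingCharacters(in: .whitespaces)
        guard !text.isEmpty else { continue }
        if let last = turns.last, last.speaker == segment.speaker {
            // Same speaker as the previous turn: extend it.
            turns[turns.count - 1].text += " " + text
        } else {
            // Speaker changed (or first segment): start a new turn.
            turns.append(Turn(speaker: segment.speaker, text: text))
        }
    }
    return turns
}
```

Each resulting turn maps directly onto one **Speaker N:** block in the stored markdown.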

Design System (Nothing-inspired)

  • Typography: Space Grotesk (body), Space Mono (timestamps/metadata)
  • Palette: Monochrome — OLED black backgrounds, white/gray text
  • Hierarchy: Display → Body → Metadata (three layers, strictly enforced)
  • Accent: Signal red (#D71921) — recording state, speaker 1 label
  • Speaker colors: Red, Blue, Green, Amber, Purple (cycled when there are more than five speakers)
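
Cycling through the five speaker colors amounts to a modulo lookup. The sketch below assumes 1-based speaker numbers and uses a hypothetical SpeakerColor enum; actual color values other than the signal red are not specified in this document, so only the names are modeled.

```swift
// Palette order taken from the list above; names only, no color values.
enum SpeakerColor: String, CaseIterable {
    case red, blue, green, amber, purple
}

// Maps a 1-based speaker number to a palette entry, wrapping after the fifth.
func color(forSpeaker number: Int) -> SpeakerColor {
    let palette = SpeakerColor.allCases
    let index = (max(number, 1) - 1) % palette.count
    return palette[index]
}
```

With this mapping, speaker 1 is always red, which keeps it aligned with the accent color, and speaker 6 wraps back to red.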