Peak rate of speech is ~220 WPM. A 64-core TPUv4 can decode 1920 context tokens + 64 prompt tokens, then generate 64 response tokens, all within 1.9 s for a 540B-parameter model (roughly 1516 WPM).
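Back-of-envelope check of that figure, assuming the common rough conversion of ~0.75 words per token (the conversion factor is an assumption, not stated in the text):

```python
# Rough sanity check: 64 generated tokens in 1.9 s, at an assumed
# ~0.75 words/token, expressed as words per minute.
tokens = 64
seconds = 1.9
words_per_token = 0.75  # assumed conversion, not from the source

wpm = tokens * words_per_token / seconds * 60
print(round(wpm))  # 1516
```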

I think this will be possible soon:

  1. Wear a directional microphone during a conversation
  2. Filter out noise, transcribe speech into text with Whisper
  3. Predict completion of text based on context "chunks"
  4. Compress essential details into key points
  5. Use key points and context to infer suggested reply
  6. Wait till you reach a certain confidence threshold (or trigger signal)
  7. Play reply (or personalized summary) into earphones at 2-3x speed
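Steps 3-6 above could be sketched as a small control loop: accumulate transcript chunks, compress them into key points, and only surface a suggested reply once confidence crosses a threshold. This is a toy sketch; the model calls are stubs, and the `ReplyAssistant` class, its confidence heuristic, and the threshold value are all hypothetical.

```python
from dataclasses import dataclass, field

@dataclass
class ReplyAssistant:
    """Toy sketch of steps 3-6: accumulate transcript chunks, compress
    them into key points, and emit a suggested reply once confidence
    crosses a threshold. All model calls are stand-in stubs."""
    threshold: float = 0.8
    chunks: list = field(default_factory=list)

    def add_chunk(self, text: str):
        self.chunks.append(text)                       # step 3: context chunks
        key_points = self._compress(self.chunks)       # step 4 (stub)
        reply, confidence = self._suggest(key_points)  # step 5 (stub)
        if confidence >= self.threshold:               # step 6: threshold gate
            return reply                               # step 7 would TTS this
        return None

    def _compress(self, chunks):
        # Stand-in for an LLM summarizer: keep only the last few chunks.
        return chunks[-3:]

    def _suggest(self, key_points):
        # Stand-in for an LLM: confidence grows with accumulated context.
        confidence = min(1.0, 0.3 * len(self.chunks))
        return f"(reply based on {len(key_points)} key points)", confidence

assistant = ReplyAssistant()
assistant.add_chunk("Hi, quick question about the deadline.")   # conf ~0.3 -> None
assistant.add_chunk("Can we push the launch to Friday?")        # conf ~0.6 -> None
print(assistant.add_chunk("The vendor confirmed Thursday."))    # conf ~0.9 -> reply
```

The interesting design decision is step 6: holding the reply until confidence (or an explicit trigger signal) is reached keeps the system from interrupting with half-formed suggestions mid-sentence.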

Short-term memory lasts ~18 s without rehearsal, so the reply would need to fit into that length. Baseline memory capacity is 4 ± 1 items, so it's easy to lose track of things. Recording and replaying voice notes is possible but is generally only feasible for post-discussion review.
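That 18 s window plus the 2-3x playback speed from step 7 implies a word budget for the reply. Assuming a base TTS rate of ~150 WPM (an assumed figure, not from the text):

```python
# Word budget for a reply that fits the ~18 s rehearsal-free window
# when played back at 2-3x speed. 150 WPM base rate is an assumption.
base_wpm = 150
window_s = 18

for speedup in (2, 3):
    budget = base_wpm * speedup * window_s / 60
    print(f"{speedup}x playback: ~{budget:.0f} words")  # 90 and 135 words
```

So a usable reply is on the order of one to two short sentences, which also keeps it within the 4 ± 1 item capacity limit.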

With low-latency transformer inference, you can organize/predict/compress/synthesize information while talking to someone, which augments human intelligence. The value is not in the transcription that can be saved for later, but in the immediate processing into a simpler representation for live access.

Auditory signals reach central processing systems within 8-10 ms, while visual stimuli can take around 20-40 ms. We'd likely see latency improvements of 1-2 orders of magnitude before widespread adoption, even if that requires distillation into smaller models. Real-time translation has been a thing for a while, but LLM capabilities are now broader.
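To make the "1-2 orders of magnitude" concrete against the 1.9 s inference figure from the opening paragraph:

```python
# Scaling the 1.9 s end-to-end inference time down by 1-2 orders of
# magnitude puts it near the perceptual latencies cited above.
inference_s = 1.9

for factor in (10, 100):
    print(f"{factor}x faster: {inference_s / factor * 1000:.0f} ms")  # 190 ms, 19 ms
```

At 100x, inference lands around 19 ms, comparable to the 20-40 ms visual pathway and within a few multiples of the 8-10 ms auditory one.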

Baddeley's model of working memory splits it into the phonological loop and the visuo-spatial sketchpad; later revisions add the central executive and episodic buffer. Most people talking about VR think about vision, but it seems plausible that wearable tech might disrupt language first.

What does this look like?

Not sure yet. Some possibilities:

  • AirPods become AIPods.
  • Less notetaking in schools.
  • Initial outrage/controversy, concerns around free will, control, atrophy.
  • Eventually considered as benign as using glasses to correct myopia.
  • Reduced social stigma around temporary audio recordings, modulo wiretapping laws.
  • We say "Sorry, wasn't paying attention, could you repeat what you said?" less often.
  • A few people use this to improve their effective IQ/communication skills.
  • Some people parrot what the AI says verbatim, even when it's obviously wrong.
