Plain AI in plain English

Multimodal AI

30-second gist

Multimodal AI can read text, look at pictures, and listen — sometimes all in the same conversation. You can show it a photo of your fridge and ask "what can I cook?" or hold a voice chat with it on your morning walk.

Up until around 2023, AI was mostly single-mode: text in, text out, or image in, label out. Multimodal models combine those skills in one brain.

If you want more

Why this is a bigger deal than it sounds

A lot of useful tasks span senses. Reading a doctor's note and a chart. Translating a menu while looking at the dish. Helping a blind person navigate a station. Walking through a contract while explaining clauses out loud.

A multimodal model handles those without bouncing between separate apps — the same model sees, hears, and writes. That makes the AI feel less like a chatbot and more like a quiet assistant on your shoulder.
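For the curious, here is a rough sketch of what "one model, several senses" looks like from a programmer's side. The field names and the helper function below are made up for illustration; real services differ in the details, but the idea is the same: the photo and the question travel together in a single request to a single model.

```python
import base64
import json

def build_multimodal_request(question: str, photo_path: str) -> str:
    """Pack a text question and a photo into one request body.

    This mirrors the general shape of multimodal chat APIs;
    the exact field names here are invented for illustration.
    """
    with open(photo_path, "rb") as f:
        # Images are usually encoded as base64 text so they can ride
        # inside the same JSON message as the words.
        photo_b64 = base64.b64encode(f.read()).decode("ascii")

    message = {
        "role": "user",
        "content": [
            {"type": "text", "text": question},
            {"type": "image", "data": photo_b64, "format": "jpeg"},
        ],
    }
    return json.dumps({"messages": [message]})

# Example: the fridge photo from the gist above.
# request_body = build_multimodal_request(
#     "What can I cook with what you see here?", "fridge.jpg"
# )
# A real app would now send request_body to the model provider
# and read back an ordinary text (or voice) reply.
```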

A real example

In May 2024, OpenAI demoed GPT-4o by having it tutor a teenager through a math problem in real time — looking at his work on screen, listening to his voice, and gently nudging him toward the answer. Google's Gemini was multimodal from launch. By 2026, "voice + vision" is standard in most consumer AI apps.