How AI works · seeing and hearing
Multimodal AI
30-second gist
Multimodal AI can read text, look at pictures, and listen — sometimes all in the same conversation. You can show it a photo of your fridge and ask "what can I cook?" or hold a voice chat with it on your morning walk.
Until around 2023, AI was mostly single-mode: text in, text out, or image in, image label out. Multimodal models combine those skills in one brain.
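If you're curious what "one brain" means in practice, here is a toy sketch in Python. It is not a real neural network — the function names and the fake "pixels" are made up for illustration — but it shows the core trick: every modality gets turned into the same kind of token, so a single model can process a photo and a question in one sequence.

```python
# Toy sketch of the multimodal idea (illustrative only, not a real model):
# text and images both become tokens, and ONE model reads the whole sequence.

def text_tokens(text):
    # Real systems use learned tokenizers; here each word is a token.
    return [("text", word) for word in text.split()]

def image_tokens(pixels):
    # Real systems cut an image into patches and embed each one;
    # here each "patch" is just a chunk of the pixel list.
    patch_size = 4
    return [("image", tuple(pixels[i:i + patch_size]))
            for i in range(0, len(pixels), patch_size)]

def one_model(tokens):
    # A real multimodal model is one neural network over the whole
    # sequence; this stand-in just reports what it received.
    counts = {}
    for kind, _ in tokens:
        counts[kind] = counts.get(kind, 0) + 1
    return counts

# A photo (fake pixels) and a question enter the SAME sequence:
sequence = image_tokens(list(range(8))) + text_tokens("what can I cook?")
print(one_model(sequence))  # → {'image': 2, 'text': 4}
```

The older single-mode world would have needed two separate programs here — one for the image, one for the text — with no way for them to share context.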
If you want more
Why this is a bigger deal than it sounds
A lot of useful tasks span senses. Reading a doctor's note and a chart. Translating a menu while looking at the dish. Helping a blind person navigate a station. Walking through a contract while explaining clauses out loud.
A multimodal model handles those without bouncing between separate apps — the same model sees, hears, and writes. That makes the AI feel less like a chatbot and more like a quiet assistant on your shoulder.