How do LLMs see?
LLMs understand text; VLMs let them see. Discover how images become tokens AI can process and reason about.
Refresher on LLMs and VLMs
Large Language Models (LLMs) like ChatGPT and Gemini have quickly become household names. But have you ever wondered what’s happening under the hood?
At the core, these models are built on the breakthrough architecture introduced in the paper “Attention Is All You Need.” Here’s the simple idea:
Text is broken down into tokens (think of them as small pieces of words).
Each token is turned into a vector: a set of numbers that captures its meaning.
The model then uses attention and other layers to connect these tokens, enrich their meaning, and ultimately predict the next word.
This is how an LLM “understands” language.
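The tokenize-then-embed step above can be sketched in a few lines. This is a toy illustration in NumPy, not a real tokenizer: the three-word vocabulary, the 4-dimensional embeddings, and the random initialization are all placeholders for what a trained model would learn.

```python
import numpy as np

# Toy vocabulary: every word maps to an integer token id.
# (Real tokenizers split text into sub-word pieces, not whole words.)
vocab = {"the": 0, "cat": 1, "sat": 2}

# Embedding table: one vector per token. Real models learn these
# during training; here they are random and only 4-dimensional.
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(len(vocab), 4))

# "Tokenize" a sentence and look up each token's vector.
tokens = [vocab[w] for w in "the cat sat".split()]
vectors = embeddings[tokens]
print(vectors.shape)  # (3, 4): three tokens, each a 4-dim vector
```

From here, attention layers mix these vectors together before the model predicts the next token.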
But what happens when we ask these models to process an image instead of text? That’s where Vision-Language Models (VLMs) come in.
What is a VLM?
A Vision-Language Model is essentially an LLM with eyes. It can take in images as well as text. But here’s the tricky part: while words are already neatly converted into tokens and vectors, images don’t come pre-packaged that way.
So, the challenge is: how do we turn an image into vectors the model can understand?
From Pixels to Vectors: The Vision Encoder
If you’ve ever played with simple computer vision projects, like training a neural network to recognize handwritten digits (MNIST), you’ve already seen the core idea: Convolutional Neural Networks (CNNs) take an image, process it, and turn it into numerical representations (vectors).
The same idea can be used here:
A CNN, like ResNet, encodes an image into vectors.
These vectors can then be passed to the language model.
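To make "a CNN encodes an image into vectors" concrete, here is a minimal sketch of a single convolution in NumPy. A real encoder like ResNet stacks many learned filters; this uses one hand-written edge-detecting kernel just to show how pixels become a numeric feature map.

```python
import numpy as np

def convolve2d(image, kernel):
    """Slide a kernel over a grayscale image, producing a feature map."""
    h, w = image.shape
    kh, kw = kernel.shape
    out = np.zeros((h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

# An 8x8 image with a vertical edge down the middle.
image = np.zeros((8, 8))
image[:, 4:] = 1.0

# A classic vertical-edge filter (Sobel). ResNet learns its filters instead.
sobel_x = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]])

feature_map = convolve2d(image, sobel_x)
feature_vector = feature_map.flatten()  # the "vector" for the language side
print(feature_vector.shape)  # (36,)
```

The feature map responds strongly where the edge is, which is exactly the kind of information the language model will later reason about.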
But there’s a catch. The “language” of image vectors (from CNNs) doesn’t naturally align with the “language” of word vectors (from LLMs). In other words, the two speak different dialects.
Bridging the Gap: Projecting Vision into Text Space
To make the image vectors understandable to the LLM, we need a projection layer. A projection layer is a simple learnable matrix that maps image concepts into the same space as text concepts. Once aligned, the LLM can treat these image vectors as if they were just another sequence of word tokens.
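At its simplest, that projection is one matrix multiplication. The sketch below assumes illustrative sizes (a 512-dim vision encoder and a 768-dim LLM embedding space); in a real model the matrix W is learned during training rather than random.

```python
import numpy as np

rng = np.random.default_rng(0)

# Suppose the vision encoder outputs 512-dim vectors, but the LLM
# expects 768-dim token embeddings (sizes here are illustrative).
image_features = rng.normal(size=(16, 512))  # 16 image vectors

# The projection layer is just a learnable matrix W.
# (Real models train W; small random values stand in here.)
W = rng.normal(size=(512, 768)) * 0.02

projected = image_features @ W  # now shaped like text token embeddings
print(projected.shape)  # (16, 768)
```

After this step, the LLM can consume the 16 projected vectors exactly as it would 16 word tokens.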
However, real-world images are messy. Take a photo of a bus station: there might be buses, cars, people, and signs, all in one frame. A single vector isn’t enough to capture that richness.
Patches, Attention, and Spatial Awareness
To solve this, images are divided into small patches (like cutting the image into a grid). Each patch gets its own vector.
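Cutting an image into a grid of patches is simple enough to show directly. This NumPy sketch uses a common setup (224x224 images, 16x16 patches, as in many vision transformers), though the exact numbers vary by model.

```python
import numpy as np

def patchify(image, patch_size):
    """Cut an image of shape (H, W, C) into a grid of square patches."""
    h, w, c = image.shape
    patches = []
    for i in range(0, h, patch_size):
        for j in range(0, w, patch_size):
            patch = image[i:i + patch_size, j:j + patch_size]
            patches.append(patch.flatten())  # each patch becomes one vector
    return np.stack(patches)

image = np.random.rand(224, 224, 3)  # a typical input resolution
patches = patchify(image, patch_size=16)
print(patches.shape)  # (196, 768): a 14x14 grid, each patch 16*16*3 numbers
```

Each of those 196 patch vectors then gets its own embedding, the same way a word token would.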
But what if an object, say a tall person, spans multiple patches? This is where attention comes in. The model allows patches to “communicate,” sharing context so that each patch knows about the others.
Next, we add positional embeddings, which give each patch a sense of location. This helps the model answer questions like: “Is there a person to the right of the car?”
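Positional information is typically injected by simply adding a per-position vector to each patch vector. The sketch below assumes learned positional embeddings (as in ViT-style models); the random values stand in for what training would produce.

```python
import numpy as np

rng = np.random.default_rng(0)

num_patches, dim = 196, 768  # e.g. a 14x14 patch grid

patch_vectors = rng.normal(size=(num_patches, dim))

# One position vector per grid cell, added element-wise, so each patch
# vector now also encodes *where* in the image it came from.
positional_embeddings = rng.normal(size=(num_patches, dim)) * 0.02
patch_vectors = patch_vectors + positional_embeddings
print(patch_vectors.shape)  # (196, 768): same shape, now position-aware
```

Without this addition, shuffling the patches would leave the model’s input unchanged, and "left of" or "above" would be meaningless to it.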
Putting It All Together
Here’s the pipeline of a Vision-Language Model in simple terms:
Break the image into patches.
Use CNNs or transformers to turn each patch into vectors.
Let patches share information through attention.
Add positional information so the model knows where things are.
Project these image vectors into the same space as text tokens.
Hand everything off to the LLM, which now treats the image just like text.
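The whole pipeline above can be strung together as one toy NumPy sketch. Every matrix here is random where a real model would have trained weights, the single attention head is a bare-bones stand-in for a full transformer, and all dimensions are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

# 1. Break the image into patches (a 32x32 image, 8x8 patches).
image = rng.random((32, 32, 3))
p = 8
patches = np.stack([
    image[i:i + p, j:j + p].flatten()
    for i in range(0, 32, p) for j in range(0, 32, p)
])                                            # (16, 192)

# 2. Turn each patch into a vector (one linear layer stands in
#    for a full CNN/transformer encoder).
dim = 64
W_encode = rng.normal(size=(patches.shape[1], dim)) * 0.02
x = patches @ W_encode                        # (16, 64)

# 3. Let patches share information: one head of self-attention.
Wq, Wk, Wv = (rng.normal(size=(dim, dim)) * 0.02 for _ in range(3))
q, k, v = x @ Wq, x @ Wk, x @ Wv
x = softmax(q @ k.T / np.sqrt(dim)) @ v

# 4. Add positional information.
x = x + rng.normal(size=x.shape) * 0.02

# 5. Project into the LLM's token-embedding space.
llm_dim = 128
x = x @ (rng.normal(size=(dim, llm_dim)) * 0.02)

# 6. These 16 "image tokens" can now sit alongside text tokens.
print(x.shape)  # (16, 128)
```

Sixteen patches in, sixteen LLM-ready vectors out: from the language model’s point of view, the image has become just another short sequence of tokens.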
And just like that, the model can “see” and “talk” about images.
Vision-Language Models are a big step toward making AI more versatile, bridging words and images into one shared understanding.
Thanks for reading!
For more such insights, subscribe to Clinically Speaking, your weekly Health AI digest from HAINet.