Text to Vector to Match: How Embedding-Based Targeting Works

Type an audience description into a box and a model turns it into 3,072 numbers. To decide whether a user matches, you compare that list against another one. Everything new in audience targeting runs on what happens in between.

Targeting used to be one question: is this user in segment 12345? A label matched a label, and the user was in the bucket or not. The IAB Tech Lab's Agentic Audiences work (formerly UCP, donated by LiveRamp) changes the question to: how close is this user to what the campaign wants? The label is gone, the user is a list of numbers, and the relationship between the two is a distance.

That list of numbers is an embedding. This post walks the whole path: what the model is, the five steps it runs to turn an audience description into a vector, how you compare two vectors, and the one rule that makes any of it work. I ran each step on real models and pulled out the actual figures, so you can see what the model keeps and what it drops at every stage.

What an embedding model actually is

The best embedding models today are the same kind of model that powers a chatbot, with the final step removed.

This is recent. Until a couple of years ago, embedding models were smaller systems built for one job: turning text into numbers. Then researchers showed you get better numbers by starting from a full LLM (the kind that runs a chatbot) and repurposing it. Google's current embedding model is built from Gemini, others from Qwen or Llama. They are ranked against each other on a public benchmark called MTEB, and its leaders today are almost all repurposed LLMs.

Underneath, turning text into an embedding runs as a fixed sequence of steps. Picture a data provider's audience description going in: "luxury car buyers looking at a Jaguar lease." Here is what the model does with it, one step at a time.

Step 1: the model never sees "Saturn"

Before the model can make sense of anything, it breaks your words into pieces.

First comes the tokenizer. At heart it is a fixed dictionary: a long list of text-pieces, each paired with a number. It breaks your text into pieces drawn only from that dictionary, then swaps each piece for its number, because the model works in numbers, not words. The tokenizer does no thinking and it is not the model. How big the dictionary is varies: e5-mistral, a free model anyone can download, has about 32,000 pieces; Qwen's has about 150,000.

I ran "car buyers shopping for a used Saturn" through e5-mistral's tokenizer. Seven words go in, eight pieces come out. The extra piece is "Saturn," the defunct GM car brand. It is not in the dictionary as a whole word, so the tokenizer builds it from "Sat" and "urn," which are. The brand never reaches the model as one word.
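The split described above can be sketched in a few lines. This is a toy, not e5-mistral's tokenizer: the dictionary `VOCAB` and its ID numbers are made up for illustration, and the greedy longest-match rule is a simplification of the subword algorithms real tokenizers use. It only shows the idea that a word missing from the dictionary gets built from pieces that are in it.

```python
# Toy greedy longest-match tokenizer.
# VOCAB is a hypothetical stand-in for a real model's dictionary;
# the pieces and ID numbers are invented for this example.
VOCAB = {"car": 0, "buyers": 1, "shopping": 2, "for": 3, "a": 4,
         "used": 5, "Sat": 6, "urn": 7}


def split_word(word, vocab):
    """Cut one word into the longest dictionary pieces, left to right."""
    pieces = []
    start = 0
    while start < len(word):
        # Try the longest remaining substring first, then shrink it.
        for end in range(len(word), start, -1):
            if word[start:end] in vocab:
                pieces.append(word[start:end])
                start = end
                break
        else:
            raise ValueError(f"no dictionary piece covers {word[start:]!r}")
    return pieces


def tokenize(text, vocab):
    """Return (pieces, ids) for a whitespace-separated string."""
    pieces = [p for word in text.split() for p in split_word(word, vocab)]
    ids = [vocab[p] for p in pieces]
    return pieces, ids


pieces, ids = tokenize("car buyers shopping for a used Saturn", VOCAB)
print(pieces)  # the word pieces the model would receive
print(ids)     # the numbers that actually go in
```

Seven words produce eight pieces here because "Saturn" is not a key in the toy dictionary, while "Sat" and "urn" are.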

What stays in one piece comes down to how often the word appeared in the model's training text, not how much it matters to you. "Car," "buyers," and "used" pass through whole. The terms a targeting team leans on often do not: "lookalike" comes back as "look," "al," "ike," and "HHI" splits into "H" and "HI."

This is the most basic reason a buyer and a seller doing embedding-based targeting have to run the exact same model. If two models do not cut your audience into the same pieces, nothing downstream can line up. The gap opens right here, before any of the clever math.

Step 2: which "Coach" you mean

Is "Coach" a designer label or a sports trainer? The model starts blind. The word goes in as a single number, identical for both senses. Attention is the step that settles which one you meant.

Every word in the sentence looks at the others. "Coach" sends out what the model calls a query, which is really a question: which of you tells me what I mean here? Every other word answers with a key, a short summary of what it offers. "Coach" compares its query against each key, scores how relevant each word is, and then rebuilds itself by pulling in a little of each word's content, called its value, weighted by those scores. The words that score high get pulled in hardest; filler like "the" and "to" barely registers. Query, key, value: those three are the heart of attention.

I ran two sentences through e5-mistral and pulled out its internal vector for "Coach" in each, once before the first layer and once after the last:

A: the boutique sells designer purses by Coach to wealthy shoppers
B: the football team listened closely to their Coach before the match

Before the layers, the two "Coach" vectors are identical, scoring 1.0, the same arrow exactly. After the model has run, they score 0.534. The word moved. "Boutique, designer, purses" pulled it toward the fashion brand; "football, team, listened" pulled it toward the sports trainer. Nobody labelled which Coach was meant. The neighbours settled it.
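The mechanism behind that shift can be sketched with a single attention step on toy data. Everything in the example is an assumption made for illustration: the word vectors have three made-up numbers each, and each word's own vector serves as its query, key, and value, whereas a real model learns separate projections for all three and repeats the step across many layers and heads. The point is only that the same starting vector for "Coach" comes out different once its neighbours are mixed in.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def cosine(a, b):
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


def softmax(scores):
    """Turn raw relevance scores into weights that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, keys, values):
    """One scaled dot-product attention step for a single query."""
    scale = math.sqrt(len(query))
    weights = softmax([dot(query, k) / scale for k in keys])
    # Rebuild the word as a weighted blend of every word's value.
    return [sum(w * v[i] for w, v in zip(weights, values))
            for i in range(len(query))]


# Hypothetical vectors: axis 0 ~ fashion, axis 1 ~ sport, axis 2 ~ filler.
COACH = [1.0, 1.0, 0.0]  # identical starting point in both sentences
THE = [0.0, 0.0, 1.0]
SENTENCE_A = [THE, [2.0, 0.0, 0.0], [2.0, 0.0, 0.0], COACH]  # fashion words
SENTENCE_B = [THE, [0.0, 2.0, 0.0], [0.0, 2.0, 0.0], COACH]  # sport words

# Simplification: keys and values are the raw word vectors themselves.
coach_a = attend(COACH, SENTENCE_A, SENTENCE_A)
coach_b = attend(COACH, SENTENCE_B, SENTENCE_B)

print(cosine(COACH, COACH))      # before attention: same arrow
print(cosine(coach_a, coach_b))  # after attention: the arrows have parted
```

In this toy the filler word scores lowest against the query, so it contributes least to the rebuilt vector, and the two outputs lean toward whichever context words surrounded "Coach".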

This is why "luxury handbag shoppers" and "premium leather goods buyers" can share almost no words and still land in the same place, and why a lone tag carries far less than a full description. One word gives the model almost no context to read, and the context is what builds the meaning.

Step 3: many layers

The model does not read your text once. It stacks the same read-and-adjust step into a deep pile of layers, 32 in the model I ran, each working on the output of the one below.

Each layer does a different kind of job. The early layers handle the surface: which pieces join into a word, whether each is a noun or a verb. The middle layers assemble the sentence. The higher layers hold the meaning of the whole phrase, the part you actually care about. The model climbs from raw pieces to real meaning.

This is why depth, not size alone, sets how fine your targeting can get. Telling "mentions a car" apart from "in-market for a car" is built up across many layers, not decided in one pass. A deeper model can draw that line; a shallower one blurs the two together, however carefully you word the audience. So the model you pick puts a ceiling on how precise the match can be.

Step 4: squash it all into one

After the last layer, every piece of your description sits at its own spot. That is still a whole row of spots, one per piece. To compare two descriptions you want one vector each, not a row against a row.

Step four is the squash. Its real name is pooling, which is a word for combining many values into one. There are a few ways to do it; the two you meet on today's models are to average every piece's spot into one, or to take the last piece's, which in a left-to-right model has already read everything before it. Either way, many become one, and the result is a fixed-length list of numbers: the embedding. That length never changes. Two words and a full paragraph come out exactly the same size, 4,096 numbers for e5-mistral, 3,072 for Gemini's model.
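Both pooling methods fit in a few lines. The sketch below uses made-up four-number vectors in place of real model output; the function names are illustrative, not any library's API. It shows that the pooled result has the same length whether two pieces or three went in.

```python
def mean_pool(token_vectors):
    """Average every piece's vector into one vector."""
    count = len(token_vectors)
    dim = len(token_vectors[0])
    return [sum(vec[i] for vec in token_vectors) / count for i in range(dim)]


def last_token_pool(token_vectors):
    """Use the final piece's vector as the summary of the whole text."""
    return list(token_vectors[-1])


# Hypothetical per-piece vectors after the last layer (4 numbers each).
short_text = [[1.0, 0.0, 2.0, 0.0],
              [3.0, 2.0, 0.0, 4.0]]
long_text = [[1.0, 1.0, 1.0, 1.0],
             [0.0, 2.0, 4.0, 6.0],
             [2.0, 0.0, 1.0, 5.0]]

for name, vectors in (("short", short_text), ("long", long_text)):
    pooled_mean = mean_pool(vectors)
    pooled_last = last_token_pool(vectors)
    # The output length depends on the model's width, not the text length.
    print(name, len(vectors), "pieces ->", len(pooled_mean), "numbers")
    print("  mean:", pooled_mean)
    print("  last:", pooled_last)
```

Which method applies is a property of the model, so a pipeline should use whatever pooling the model was trained with rather than choosing one freely.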

I pooled three versions of the same audience through e5-mistral and scored them against the original. Reword it completely, "wealthy buyers who like Coach bags," and it lands at 0.962. Shuffle the words into a jumble, "Coach handbags interested in luxury shoppers," and it lands at 0.986. The single vector barely moves either way. The squash holds on to what the phrase means and lets go of the exact words and the order they came in.

That is also what frees you from a fixed taxonomy. Instead of picking from buckets someone defined in advance, you can aim at anything you can put into a sentence, at whatever level of detail the words carry.

Step 5: read off the embedding

That single fixed-length list is the embedding: a position on the model's hidden map of meaning, where things that mean similar things sit close together and things that mean different things sit far apart.

So an audience description has become one vector, a few thousand numbers, marking where it sits on the map. That is the object that rides in the bid stream and gets scored against the campaign's vector. Text in, context applied, one location out.

How you compare two vectors

You have two vectors. The standard way to compare them is cosine similarity. Picture each vector as an arrow in space. The direction the arrow points captures what the audience is about. Cosine measures the angle between two arrows: arrows pointing the same way score 1, arrows at right angles score 0, arrows pointing opposite score -1. One math operation, fast enough to run at bid time (I clocked one comparison at well under a microsecond).
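Cosine similarity is the dot product of the two vectors divided by the product of their lengths. A minimal pure-Python sketch, with two-number vectors chosen so the three reference cases are easy to check by hand (a production system would more likely use a vectorized library):

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine is undefined for a zero vector")
    return dot / (norm_a * norm_b)


same_direction = cosine_similarity([3.0, 4.0], [6.0, 8.0])
right_angle = cosine_similarity([1.0, 0.0], [0.0, 1.0])
opposite = cosine_similarity([3.0, 4.0], [-3.0, -4.0])

print(same_direction, right_angle, opposite)
```

Note that the second vector in the first case is twice as long as the first and still scores as identical: cosine looks only at direction, not at length.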

I took a library of 24 audience descriptions covering eight concepts (luxury auto, EV lease, NFL fans, NFL bettors, B2B CTOs, data platform buyers, luxury travel, adventure), each written three ways: IAB taxonomy path, provider segment, prose. When I embedded them with gemini-embedding-001, three tiers came out cleanly:

0.96: Same audience, reworded
0.88: Same vertical, different audience
0.82: Different verticals

A buyer brief like "Adults 40-65, HHI $300K+, booking premium international vacations with elite frequent-flyer status," embedded against that library, returns all three phrasings of luxury travel in the top three (0.91, 0.88, 0.87). With a taxonomy the match is in or out. With vectors the match is a number, and the number ranks every candidate by closeness.
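Ranking by closeness is a sort over cosine scores. The sketch below assumes the vectors already exist; the three-number vectors and library entries are invented for illustration and are not the measurements quoted above, and `rank` is a hypothetical helper name.

```python
import math


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) *
                  math.sqrt(sum(y * y for y in b)))


def rank(brief_vector, library, top_k=3):
    """Score every library entry against the brief, closest first."""
    scored = [(cosine(brief_vector, vec), name)
              for name, vec in library.items()]
    scored.sort(reverse=True)
    return [(name, score) for score, name in scored[:top_k]]


# Hypothetical embeddings; real ones would have thousands of numbers.
LIBRARY = {
    "luxury travel (prose)": [0.9, 0.1, 0.1],
    "luxury travel (taxonomy)": [0.8, 0.2, 0.1],
    "adventure travel": [0.5, 0.6, 0.1],
    "NFL fans": [0.1, 0.1, 0.9],
}
brief = [1.0, 0.1, 0.0]  # stand-in for the embedded buyer brief

results = rank(brief, LIBRARY)
for name, score in results:
    print(f"{score:.3f}  {name}")
```

Every candidate gets a score, so the buyer can pick a cutoff or a top-k instead of receiving a fixed in-or-out answer.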

The one rule that makes it work

That number only means something if both sides used the same model to produce the vectors.

Each model is trained to place related meanings close and unrelated ones far apart, and each training run produces its own arrangement of that space. Model A learns one layout, model B learns another, and the two share no common orientation. Cosine measures the angle between two arrows on one map. Change the map and the angle says nothing.

I tested it directly. Embed the same audience description with two different models, then cosine the two resulting vectors. If the spaces shared any structure, the number would come back high. It does not. Mean across 24 audiences: -0.005. That is statistical noise. The two spaces are unrelated coordinate systems, so a vector made by one model is meaningless to the other.
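A mean that close to zero is what unrelated directions look like in a space with thousands of dimensions. The simulation below is not the experiment described above: it swaps in random vectors as a stand-in for two unrelated coordinate systems and assumes both vectors have the same length (3,072 here, an arbitrary choice). It is a sketch of the statistical intuition only.

```python
import math
import random


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) *
                  math.sqrt(sum(y * y for y in b)))


def mean_cosine_of_unrelated_pairs(pairs=24, dim=3072, seed=0):
    """Average cosine across pairs of independently drawn random vectors."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(pairs):
        a = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        b = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        total += cosine(a, b)
    return total / pairs


mean_score = mean_cosine_of_unrelated_pairs()
print(mean_score)  # a value close to zero
```

With this many dimensions, two directions that share no structure are almost always close to perpendicular, so a score near zero across models carries no information about whether the audiences match.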

This is the rule that ties the whole pipeline together. The tokenizer cuts differently per model, the layers arrange meaning differently per model, the pooled vector lands on a different map per model. Agreement on the model is what lets two parties trade a single number and have it mean the same thing on both ends.

What this changes

Targeting stops being label-matching and becomes distance-comparison. An audience nobody named in advance exists the moment you describe it and embed it, and the buyer ranks inventory by closeness instead of a yes or no. The personalization stacks behind most of the consumer internet (search, recommendations, feed ranking) already work this way.

What that buys you in precision, it charges back in operations. The model is now a shared dependency both sides have to agree on and migrate together. The vector is heavier on the wire than the whole bid request it rides in. And a vector is not as anonymous as a wall of numbers looks. Those are the next posts in this series.

The shift underneath all of it is small to state and large to run: targeting used to mean both sides agreed on a list of segment IDs, and now it means both sides agree on a model.

If you are weighing embedding-based targeting for your own stack and want a sounding board on what to build versus what to integrate, we are happy to talk.

Sources

Gemini Embedding technical report (initialized from Gemini, 3,072 dimensions, mean pooling): https://arxiv.org/abs/2503.07891
E5-Mistral, "Improving Text Embeddings with Large Language Models," the model used for the tokenizer, attention, and pooling measurements: https://arxiv.org/abs/2401.00368
MTEB, the public leaderboard that ranks embedding models: https://huggingface.co/spaces/mteb/leaderboard
