Rutvik Acharya

Most embedding models give you one vector and you use the whole thing. If the model produces 1536 dimensions, you store 1536 floats, compute similarity over all 1536, and pay for all 1536 in your vector index. The only way to get a smaller representation is to train a different model.

Matryoshka Representation Learning changes that. It trains a single model so that the first 64 dimensions, the first 256, the first 512, and the full dimension all independently carry useful semantic information. You get one model that behaves like several, and you decide at query time how much of the vector to use.

The name comes from Russian nesting dolls. The small doll lives inside the medium doll, which lives inside the large doll. Same idea: the small embedding lives inside the medium one, which lives inside the full one.

The core problem MRL solves

Standard embedding training optimizes for one thing: the quality of the full-dimensional vector. The structure of the intermediate dimensions is ignored. If you keep only the first 64 dimensions of a standard 768-dim embedding, quality drops sharply, because the information is spread across all 768 dimensions in no particular order. The model never learned to concentrate anything useful in the prefix.

This creates a hard tradeoff. You want high retrieval quality, which means large embeddings. You want low storage and fast search, which means small embeddings. The only escape was to train separate models at each target dimension, which is expensive and annoying to maintain.

MRL breaks this tradeoff by restructuring the training objective itself.

How MRL training works

MRL applies a classification or contrastive loss at multiple embedding sizes simultaneously and backpropagates all of them through a single shared backbone.

Given an input, the model produces a full-dimensional embedding z of size d (e.g., 1024). MRL evaluates that embedding at a set of nested sizes M = {8, 16, 32, 64, 128, 256, 512, 1024}. For each size m, it takes the first m dimensions z[1:m], applies a linear classification head W_m, and computes the task loss.

The total MRL loss is:

```plaintext
L_MRL = Σ_{m ∈ M}  c_m · L(W_m · z[1:m], y)
```

Where:

  • z[1:m] is the prefix slice of the full embedding
  • W_m is a linear head specific to dimension m (learned during training, discarded at inference)
  • L is your task loss: cross-entropy for classification, contrastive loss for retrieval
  • c_m is a per-scale weight (usually uniform: c_m = 1 for all m)
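The summed loss above can be sketched as a forward pass in NumPy. This is a minimal illustration, not the paper's implementation: the names `mrl_classification_loss` and `heads`, the toy sizes, and the random inputs are all assumptions made for the example, and a real training loop would use an autodiff framework so the gradients reach the backbone.

```python
import numpy as np

def cross_entropy(logits: np.ndarray, labels: np.ndarray) -> float:
    """Mean cross-entropy for integer class labels."""
    shifted = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return float(-log_probs[np.arange(len(labels)), labels].mean())

def mrl_classification_loss(z, labels, heads, weights=None):
    """Sum of c_m * L(W_m . z[1:m], y) over the nested sizes in `heads`.

    z:      (batch, d) full-dimensional embeddings
    heads:  dict mapping m -> W_m with shape (m, num_classes)
    """
    total = 0.0
    for m, W_m in heads.items():
        c_m = 1.0 if weights is None else weights[m]
        logits = z[:, :m] @ W_m  # the head only ever sees the first m dimensions
        total += c_m * cross_entropy(logits, labels)
    return total

# Toy data (hypothetical sizes, random values standing in for a backbone's output).
rng = np.random.default_rng(0)
d, num_classes, batch = 64, 10, 32
nesting = [8, 16, 32, 64]
z = rng.normal(size=(batch, d))
labels = rng.integers(0, num_classes, size=batch)
heads = {m: rng.normal(scale=0.01, size=(m, num_classes)) for m in nesting}

loss = mrl_classification_loss(z, labels, heads)
print(f"MRL loss over {nesting}: {loss:.3f}")
```

Note that the first 8 dimensions appear in all four terms of the sum, while the last 32 appear in only one. That asymmetry is what pushes information toward the front of the vector.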

Every forward pass contributes gradients from every scale at once. The backbone learns to front-load information: the first few dimensions get pushed to capture the most discriminative signal possible, because they are the only dimensions evaluated at the smallest scales. Larger prefixes pick up progressively more detail.

The linear heads are auxiliary. They exist to create a gradient signal at each scale. You throw them away after training.

The loss in more detail

For retrieval models, the task loss L is typically a contrastive loss like MultipleNegativesRankingLoss or InfoNCE:

```plaintext
L_contrastive = -log [ exp(sim(q, d+) / τ) / Σ_j exp(sim(q, dj) / τ) ]
```

Where q is the query embedding prefix, d+ is the positive document embedding prefix, dj are all documents in the batch (including negatives), and τ is a temperature.

MRL wraps this: for each scale m, compute the contrastive loss using only the first m dimensions of both query and document embeddings. Sum them up. The backbone sees gradients from all scales simultaneously.
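As a rough sketch of that wrapping, the following NumPy code computes an in-batch InfoNCE loss on each prefix and sums the results. The function names, the temperature value, and the synthetic query/document pairs are assumptions for illustration; it shows the forward computation only, and a real trainer would compute the same quantity in an autodiff framework.

```python
import numpy as np

def info_nce(q: np.ndarray, d: np.ndarray, temperature: float = 0.05) -> float:
    """In-batch InfoNCE with cosine similarity: row i of q is paired with row i of d."""
    q = q / np.linalg.norm(q, axis=1, keepdims=True)
    d = d / np.linalg.norm(d, axis=1, keepdims=True)
    logits = (q @ d.T) / temperature               # (batch, batch) similarity matrix
    logits = logits - logits.max(axis=1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.diag(log_probs).mean())       # positives sit on the diagonal

def matryoshka_info_nce(q, d, dims, temperature=0.05):
    """Sum the contrastive loss over prefixes of both query and document embeddings."""
    return sum(info_nce(q[:, :m], d[:, :m], temperature) for m in dims)

# Synthetic pairs: each "document" is its query plus a little noise.
rng = np.random.default_rng(0)
queries = rng.normal(size=(8, 64))
docs = queries + 0.1 * rng.normal(size=(8, 64))

matched = matryoshka_info_nce(queries, docs, dims=[16, 32, 64])
shuffled = matryoshka_info_nce(queries, np.roll(docs, 1, axis=0), dims=[16, 32, 64])
print(f"matched pairs: {matched:.4f}, mismatched pairs: {shuffled:.4f}")
```

Each prefix is renormalized before the similarity is computed, so every scale is scored the same way it would be used at inference.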

Every trained prefix, up to and including the full vector, ends up as a strong representation for its size. The training objective makes it genuinely costly to bury signal deep in the vector where the small-scale losses can't see it.

Training MRL with sentence-transformers

The sentence-transformers library has native MRL support through MatryoshkaLoss.

The MatryoshkaLoss wrapper handles slicing z[1:m] and summing the scaled losses automatically. You write the base loss once and MRL does the rest.

Using MRL embeddings at inference

After training, you embed your corpus once at full dimension and store the results. At query time, you can truncate to any size in your trained set.
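A minimal sketch of that workflow. The random arrays below stand in for the output of a single `model.encode(...)` call over the corpus and the query, and the helper name `truncate` is an assumption for the example.

```python
import numpy as np

def truncate(embeddings: np.ndarray, dim: int) -> np.ndarray:
    """Keep the first `dim` dimensions and rescale each row to unit length."""
    prefix = embeddings[:, :dim]
    norms = np.linalg.norm(prefix, axis=1, keepdims=True)
    return prefix / np.clip(norms, 1e-12, None)

# Stand-ins for stored full-dimension embeddings (random values, not real model output).
rng = np.random.default_rng(0)
corpus_full = rng.normal(size=(1000, 768)).astype(np.float32)
query_full = rng.normal(size=(1, 768)).astype(np.float32)

for dim in (768, 256, 64):
    corpus = truncate(corpus_full, dim)
    query = truncate(query_full, dim)
    scores = (query @ corpus.T)[0]      # cosine similarity, since rows are unit length
    top5 = np.argsort(-scores)[:5]
    print(dim, top5)
```

The renormalization step matters: a prefix of a unit-length vector is no longer unit length, so dot products on raw slices are not cosine similarities.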

You encoded once. The truncation happens in NumPy. No re-encoding, no second model.

MixedBread and Jina models

MixedBread’s mxbai-embed-large-v1 and Jina’s jina-embeddings-v3 both ship MRL-trained. With sentence-transformers, truncation is a single constructor argument.

```python
from sentence_transformers import SentenceTransformer

# MixedBread: 1024-dim model, truncate to 512
model = SentenceTransformer(
    "mixedbread-ai/mxbai-embed-large-v1",
    truncate_dim=512,
)

texts = ["What is retrieval augmented generation?"]
embeddings = model.encode(texts, convert_to_numpy=True)
print(embeddings.shape)  # (1, 512)
```

Jina’s v3 model supports even smaller sizes and works across multiple languages:

```python
from sentence_transformers import SentenceTransformer

# Jina v3: 1024-dim, supports 32/64/128/256/512/1024
model = SentenceTransformer(
    "jinaai/jina-embeddings-v3",
    trust_remote_code=True,
    truncate_dim=256,
)

texts = ["What is retrieval augmented generation?"]
embeddings = model.encode(texts, task="retrieval.query", convert_to_numpy=True)
print(embeddings.shape)  # (1, 256)
```

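The truncate_dim parameter handles the slice inside the model, so you get the same vectors as slicing the full output manually. A truncated vector is no longer unit length, so if you score with raw dot products, pass normalize_embeddings=True to encode() so the vectors are renormalized after truncation. Cosine similarity is unaffected.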

Quality at each scale

MRL embeddings lose quality as you reduce dimensions, but the dropoff is gradual and often surprisingly small at moderate reductions.

As a rough guide, cutting from 768 to 256 costs on the order of 2-5% on NDCG@10, though the exact figure depends on the model and the dataset. Cutting to 64 starts hurting, but it is still meaningfully better than a random projection at the same size.

Models that ship MRL by default

Several widely used models now come MRL-optimized out of the box. No retraining needed.

| Model | Provider | Full dim | MRL dims available |
|---|---|---|---|
| mxbai-embed-large-v1 | MixedBread | 1024 | 64, 128, 256, 512, 1024 |
| jina-embeddings-v3 | Jina AI | 1024 | 32, 64, 128, 256, 512, 1024 |
| nomic-embed-text-v1.5 | Nomic AI | 768 | 64, 128, 256, 512, 768 |
| snowflake-arctic-embed-m-v1.5 | Snowflake | 768 | 256, 768 |
| bge-m3 | BAAI | 1024 | Partial (FlagEmbedding) |

For any of these, you get the MRL property for free. Truncate the output to whichever of the listed dimensions fits your storage budget.

Adaptive retrieval

MRL also enables two-stage retrieval with a single model. First pass uses small embeddings for fast ANN search over a large corpus. Second pass re-scores the top candidates with the full embedding.
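The two passes can be sketched as follows. This is an illustration under stated assumptions: a brute-force NumPy scan stands in for the ANN index, the corpus is random vectors rather than real MRL embeddings, and `two_stage_search` with its parameters is a hypothetical helper, not a library API.

```python
import numpy as np

def normalize(x: np.ndarray) -> np.ndarray:
    """Rescale vectors to unit length along the last axis."""
    return x / np.clip(np.linalg.norm(x, axis=-1, keepdims=True), 1e-12, None)

def two_stage_search(query, corpus_full, corpus_small, small_dim=64,
                     n_candidates=100, top_k=10):
    """Coarse search on truncated vectors, then re-score candidates at full dimension."""
    # Stage 1: score every document using only the first `small_dim` dimensions.
    q_small = normalize(query[:small_dim])
    coarse = corpus_small @ q_small
    n_candidates = min(n_candidates, len(coarse))
    candidates = np.argpartition(-coarse, n_candidates - 1)[:n_candidates]

    # Stage 2: re-score only the shortlisted documents with the full vectors.
    q_full = normalize(query)
    fine = corpus_full[candidates] @ q_full
    order = np.argsort(-fine)[:top_k]
    return candidates[order], fine[order]

# Random stand-in corpus; a real setup would store MRL embeddings here.
rng = np.random.default_rng(0)
corpus_full = normalize(rng.normal(size=(5000, 768)))
corpus_small = normalize(corpus_full[:, :64])        # what the stage-1 index would hold
query = corpus_full[42] + 0.005 * rng.normal(size=768)  # near-duplicate of document 42

ids, scores = two_stage_search(query, corpus_full, corpus_small)
print(ids[:3], scores[:3])
```

In production, the stage-1 scan would be replaced by a lookup in an ANN index built over the truncated vectors, with the full vectors fetched only for the shortlist.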

Stage 1 operates on 64-dim vectors, so your FAISS index is 12x smaller and search is correspondingly faster. Stage 2 re-scores 100 candidates with 768-dim vectors, which is cheap. The accuracy lands close to full-dimension search at a fraction of the index cost.

This is the pattern Vespa, Weaviate, and Qdrant have started building natively: store the full vector, index a compressed version, use the full vector for final scoring.

Where MRL actually helps

MRL is not useful in every situation. Storage-constrained vector indexes are the clearest case. If you are indexing millions of documents and cannot afford a 1536-dim index, MRL lets you use 256 or 512 without swapping models. First-stage retrieval at scale is another: smaller vectors mean faster dot products in ANN search, and the speedup compounds as corpus size grows. It is also handy when sweeping dimension sizes for ablations, since you can test 64, 128, and 256 from a single model without retraining anything.

It does not help when your performance bottleneck is somewhere else entirely. Switching from 768 to 256 dimensions saves storage but will not fix a bad chunking scheme or a reranker that is not calibrated for your domain.
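For the storage-constrained case, the raw-vector footprint is simple arithmetic. The sketch below assumes float32 storage and a hypothetical corpus of 10 million vectors, and it ignores index overhead such as graph links or metadata.

```python
def index_size_gb(n_vectors: int, dim: int, bytes_per_value: int = 4) -> float:
    """Raw storage for the vectors alone, in gigabytes (10^9 bytes)."""
    return n_vectors * dim * bytes_per_value / 1e9

n_docs = 10_000_000  # hypothetical corpus size
for dim in (1536, 768, 256, 64):
    print(f"{dim:>5} dims: {index_size_gb(n_docs, dim):6.2f} GB")
```

At this corpus size, the full 1536-dim vectors need about 61 GB, while 256-dim prefixes need about 10 GB.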

What MRL does not change

The backbone architecture is unchanged. MRL is a training strategy, not a model architecture. You can apply it on top of any transformer encoder. The only structural addition is the per-scale linear heads during training, and those are discarded afterward.

Inference code does not change either. You call model.encode() exactly as before. The only difference is you can optionally slice the output.

There is one real cost worth knowing about. Optimizing for the 64-dim prefix means the model pushes the most discriminative signal into the first 64 dimensions. This can theoretically hurt full-dimension performance slightly compared to a model trained only for 768 dimensions. In practice the difference is small: the full-dimension loss is still one of the summed terms, so the complete vector keeps being optimized directly alongside every prefix. The original paper reports MRL representations that are about as accurate as independently trained fixed-size baselines at each dimension, so the tradeoff is usually worth it.


A 4x reduction in embedding dimension translates to roughly 4x less RAM for the vectors in your index and 4x lower storage cost for them, all from the same model. That is a real savings, not a paper one. The fact that MixedBread, Jina, Nomic, and Snowflake all shipped MRL in their flagship models within roughly the same two-year window suggests the field decided this was worth the training overhead. If you are picking a new embedding model and MRL is not mentioned in the model card, it is worth finding out why.