Part 4: From Representations to World Models

Part 3 trained a joint-embedding encoder on the bouncing-ball toy with VICReg, with Barlow Twins motivating the cross-correlation matrix loss. The toy was small enough to read in fifty lines. The encoder it produced is also small enough to be useless: there is nothing irrelevant in a 64x64 single-direction bouncing ball for the encoder to discard.

This part takes a production-grade joint-embedding encoder, runs it on a real image, and shows what the same architectural family learns at scale. Code for this part is in part3_production/ of the repo.

DINO

DINO (Caron et al., 2021) is a joint-embedding self-supervised method from Meta. Like Barlow Twins, it trains an encoder so that two views of the same image get similar embeddings, but it uses teacher-student distillation with centering and sharpening rather than a cross-correlation matrix penalty. Yann LeCun groups DINO under the JEPA umbrella in talks: same philosophical core (predict in embedding space, no pixel reconstruction), different recipe.
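To make "teacher-student distillation with centering and sharpening" concrete, here is a minimal sketch of the loss for a single pair of views. It is an illustration under assumptions, not the reference implementation: the function names are hypothetical, the real method averages over many crop pairs, and the temperatures and momenta are the commonly cited defaults from the DINO paper.

```python
import torch
import torch.nn.functional as F


def dino_loss(student_logits, teacher_logits, center,
              student_temp=0.1, teacher_temp=0.04):
    """Cross-entropy between a centered, sharpened teacher and the student."""
    # Centering subtracts a running mean so no single dimension dominates;
    # sharpening is the low teacher temperature. The teacher gets no gradient.
    teacher_probs = F.softmax((teacher_logits - center) / teacher_temp, dim=-1).detach()
    student_logp = F.log_softmax(student_logits / student_temp, dim=-1)
    return -(teacher_probs * student_logp).sum(dim=-1).mean()


@torch.no_grad()
def update_center(center, teacher_logits, momentum=0.9):
    """Exponential moving average of the teacher's batch-mean output."""
    batch_mean = teacher_logits.mean(dim=0, keepdim=True)
    return momentum * center + (1 - momentum) * batch_mean


@torch.no_grad()
def ema_update(teacher, student, momentum=0.996):
    """The teacher's weights follow the student's as a moving average."""
    for p_teacher, p_student in zip(teacher.parameters(), student.parameters()):
        p_teacher.mul_(momentum).add_(p_student.detach(), alpha=1 - momentum)
```

Only the student is trained by gradient descent. The teacher is never optimised directly, which is part of how the recipe avoids collapse without a cross-correlation penalty.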

DINOv2 (April 2023) and DINOv3 (August 2025) extend the original. DINOv3 reaches 88.4% on ImageNet linear probing, the first self-supervised model to match weakly-supervised baselines. The DINOv3 weights are gated on Hugging Face at the time of writing, so the demo here uses DINOv2-base. The architecture and the demonstration are identical at this level; DINOv3 is mostly a scale and training-recipe upgrade.

The hover demo

DINO produces one embedding vector per image patch. For DINOv2-base, a Vision Transformer with patch size 14, a 448x448 input gives a 32x32 grid of patches (448 / 14 = 32), each a 768-dim vector.
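The grid arithmetic is easy to check by hand. The helper below is a small sketch, not code from the repo: `patch_grid` and `pixel_to_patch_index` are hypothetical names, and the row-major patch ordering is an assumption that matches how ViT implementations usually flatten patches. It uses the patch size of 14 that the released DINOv2 checkpoints use; a patch size of 16 would give a 28x28 grid for the same input.

```python
def patch_grid(image_size: int, patch_size: int) -> tuple[int, int]:
    """Return (patches per side, total patches) for a square input."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be a multiple of patch_size")
    side = image_size // patch_size
    return side, side * side


def pixel_to_patch_index(x: int, y: int, image_size: int, patch_size: int) -> int:
    """Map a pixel (x = column, y = row) to a flat patch index.

    Assumes patches are flattened row by row (row-major order).
    """
    side, _ = patch_grid(image_size, patch_size)
    if not (0 <= x < image_size and 0 <= y < image_size):
        raise ValueError("pixel lies outside the image")
    return (y // patch_size) * side + (x // patch_size)


side, total = patch_grid(448, 14)
print(side, total)                              # 32 1024
print(pixel_to_patch_index(224, 224, 448, 14))  # 528
```

The second helper is what a hover or click handler would need: it turns a cursor position into the index of the patch to query.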

Pick any patch. Compute its cosine similarity to every other patch in the same image. Plot the result as a heatmap on top of the image. The result tells you which other patches the encoder considers semantically similar to the one you picked.

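import torch.nn.functional as F
# patch_tokens: patch embeddings from the frozen encoder, shape (1, N, 768)
patches = F.normalize(patch_tokens.squeeze(0), dim=-1)  # (N, 768), unit length
heatmap = patches @ patches[query_idx]                  # (N,) cosine similarities
heatmap = heatmap.reshape(grid, grid)                   # (32, 32)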

That is the whole demo. The encoder is frozen, with no fine-tuning or labels, and there’s no segmentation head on top.
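For completeness, here is one way `patch_tokens` could be obtained. This is a sketch under assumptions rather than the repo's code: it assumes the Hugging Face `facebook/dinov2-base` checkpoint loaded through `transformers.AutoModel`, a plain resize to 448x448, ImageNet mean/std normalisation, and that the first token of `last_hidden_state` is the CLS token. `load_patch_tokens` and `similarity_heatmap` are hypothetical helper names.

```python
import numpy as np
import torch
import torch.nn.functional as F

IMAGENET_MEAN = torch.tensor([0.485, 0.456, 0.406]).view(1, 3, 1, 1)
IMAGENET_STD = torch.tensor([0.229, 0.224, 0.225]).view(1, 3, 1, 1)


def load_patch_tokens(image_path: str, size: int = 448) -> torch.Tensor:
    """Run a frozen DINOv2-base encoder and return (1, N, 768) patch tokens."""
    # Imported lazily so the similarity helper below works without these packages.
    from PIL import Image
    from transformers import AutoModel

    model = AutoModel.from_pretrained("facebook/dinov2-base").eval()
    image = Image.open(image_path).convert("RGB").resize((size, size))
    pixels = torch.from_numpy(np.asarray(image, dtype=np.float32) / 255.0)
    pixels = pixels.permute(2, 0, 1).unsqueeze(0)   # (1, 3, H, W)
    pixels = (pixels - IMAGENET_MEAN) / IMAGENET_STD
    with torch.no_grad():                           # frozen encoder: no gradients
        hidden = model(pixel_values=pixels).last_hidden_state
    return hidden[:, 1:, :]                         # drop the CLS token


def similarity_heatmap(patch_tokens: torch.Tensor, query_idx: int) -> torch.Tensor:
    """Cosine similarity of one patch to all patches, as a square grid."""
    patches = F.normalize(patch_tokens.squeeze(0), dim=-1)  # (N, D)
    grid = int(round(patches.shape[0] ** 0.5))
    if grid * grid != patches.shape[0]:
        raise ValueError("expected a square number of patches")
    return (patches @ patches[query_idx]).reshape(grid, grid)
```

A call such as `similarity_heatmap(load_patch_tokens("labrador.jpg"), query_idx=528)` would then give the grid for the centre patch; the filename is a placeholder. The query patch always scores 1.0 against itself, which is a quick sanity check.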

What it does on a labrador

I ran the demo on a public-domain photo of a yellow labrador at 448x448. Three query patches: one on the dog’s body, one on the dog’s face, one on the grass background. Red is high similarity to the query, blue is low.
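Rendering the overlay takes one more step, because the heatmap has one value per patch rather than one per pixel. The sketch below shows one way to do it and is not necessarily how the figure was produced: bilinear upsampling, the `jet` colormap, and the 0.5 alpha are all assumptions, and both function names are hypothetical.

```python
import torch
import torch.nn.functional as F


def upsample_heatmap(heatmap: torch.Tensor, image_size: int) -> torch.Tensor:
    """Bilinearly resize a (grid, grid) heatmap to (image_size, image_size)."""
    resized = F.interpolate(
        heatmap[None, None].float(),   # (1, 1, grid, grid), as interpolate expects
        size=(image_size, image_size),
        mode="bilinear",
        align_corners=False,
    )
    return resized[0, 0]


def show_overlay(image, heatmap: torch.Tensor, image_size: int = 448) -> None:
    """Draw the heatmap over the image; red = high similarity, blue = low."""
    import matplotlib.pyplot as plt    # imported lazily; only needed for plotting

    plt.imshow(image)
    plt.imshow(upsample_heatmap(heatmap, image_size).numpy(), cmap="jet", alpha=0.5)
    plt.axis("off")
    plt.show()
```

Here `image` is expected to be the same 448x448 picture that went through the encoder, as a PIL image or an array.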

Three DINOv2 patch-similarity heatmaps: query on body, face, background

Body query: the dog’s torso and legs light up red. Face and background stay cooler.
Face query: the dog’s head, ears, and snout light up. The rest of the dog is intermediate; the background is blue.
Background query: the grass and surrounding non-dog regions go red. The dog itself goes blue.

The encoder has carved the image into “dog body”, “dog face”, and “background” without ever being told there is a dog in the image, what a face is, or what grass looks like. It learned this from the joint-embedding objective alone, applied to roughly 142 million images from LVD-142M.

Why this matters for the bouncing ball

The bouncing ball can’t show this. A 64x64 frame of a single white ball on black has no semantic structure to discover. The encoder either keeps position information or drops it; nothing else is in the image.

The labrador image has exactly the kind of structure the joint-embedding loss is designed for. Most pixels are unpredictable detail (exact fur colour, blade orientations of grass, shadow edges). A small amount of structure is consistent across views and useful downstream (this is dog-body, this is dog-face, this is grass). The DINO loss pushes the encoder to keep the latter and drop the former. Run the encoder on any new image, click a patch, and the same kind of structure falls out.

This is the JEPA-family payoff that the toy in Parts 2 and 3 cannot demonstrate: not “sharper next-frame predictions” but “the encoder discovers semantic structure from images alone, and that structure transfers to any downstream task without labels”. It is what Yann LeCun has been arguing for since the cake metaphor in 2016.

What’s left

DINOv2 produces the heatmap but does not act on it. To get an agent that plans, the encoder needs a predictor on top, and the predictor needs to know which actions change which embeddings. That is V-JEPA 2 / V-JEPA 2-AC (Meta FAIR, 2025): the same kind of encoder, trained on video, with an action-conditioned predictor that lets a robot plan pick-and-place from image goals.
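The shape of that idea fits in a few lines. The sketch below is a toy, not V-JEPA 2-AC's architecture: a small MLP stands in for the predictor, planning is a one-step greedy search over a handful of candidate actions (the real system plans over sequences of actions), the L1 distance to the goal embedding is an assumption, and every name and dimension is hypothetical.

```python
import torch
import torch.nn as nn


class ActionConditionedPredictor(nn.Module):
    """Toy predictor: (current embedding, action) -> predicted next embedding."""

    def __init__(self, embed_dim: int = 768, action_dim: int = 2, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(embed_dim + action_dim, hidden),
            nn.GELU(),
            nn.Linear(hidden, embed_dim),
        )

    def forward(self, z: torch.Tensor, action: torch.Tensor) -> torch.Tensor:
        # Concatenate the action onto the embedding so the prediction depends on it.
        return self.net(torch.cat([z, action], dim=-1))


@torch.no_grad()
def pick_action(predictor, z, z_goal, candidates):
    """Return the candidate action whose predicted embedding is closest to the goal."""
    z_rep = z.expand(candidates.shape[0], -1)             # (K, D): one copy per candidate
    predicted = predictor(z_rep, candidates)              # (K, D)
    distances = (predicted - z_goal).abs().mean(dim=-1)   # L1 distance per candidate
    return candidates[distances.argmin()]
```

The goal is given as an embedding, so "plan from an image goal" means: encode the goal image, then search for actions whose predicted embeddings move toward it.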

Part 5 ports the action-conditioned predictor idea back to the bouncing-ball toy. Single-frame input was the wrong test for JEPA; (frame, action) input is the right one.