LLM vs JEPA

Part 1: Yann LeCun's Bet Against LLMs

Yann LeCun was Meta’s Chief AI Scientist and head of FAIR for a decade. In November 2025 he left to found his own lab, publicly calling LLMs “a dead end when it comes to superintelligence” and arguing that the field has bet on scaling an architecture that is wrong for everything outside language. He has been making versions of this argument since at least 2015, and the papers are getting more specific.

I previously built a tiny LLM trained on the writings of Edgar Allan Poe. In this series I do a similar exploration of Yann LeCun’s recommended architecture in five parts. Part 1 maps the argument; Parts 2-5 build what he recommends. The companion repo is at github.com/danieljohnmorris/tiny-bouncing-jepa, with folders for each blog part.

What he is claiming

Three claims, in his words and in the order he tends to make them:

  1. LLMs are good in domains where language itself is the substrate of reasoning, and not much else. Quoted from Welch Labs’ “$1B Bet Against LLMs”, which is the most accessible recent summary of his position.
  2. Agentic systems cannot plan reliably without a world model. From the same talk: “I do not understand how you can even think of building an agentic system without it having the ability of predicting the consequences of its actions.”
  3. The right architecture for that world model is a Joint-Embedding Predictive Architecture (JEPA): an encoder paired with a predictor that predicts the embedding of the next frame instead of its pixels. Pixel-level prediction goes blurry on ambiguous futures; embedding-level prediction can avoid this, because the encoder is free to drop the detail that cannot be predicted. Set out in his 2022 position paper, listed below.

The series tests claim 3 directly (Parts 2 and 3), claim 1 by implication (the LLM/JEPA contrast runs through the whole series), and claim 2 in the final post (Part 5, action-conditioned planning).

Where the argument comes from

Yann LeCun’s cake metaphor, popularised in his NIPS 2016 keynote, was the early version: most of intelligence is self-supervised learning (the cake), supervised learning is the icing, reinforcement learning is the cherry on top. The next decade vindicated the cake half. GPT-1 pretrained on next-token prediction without labels and, after fine-tuning, beat task-specific supervised baselines on most of the benchmarks it was evaluated on. Self-supervised learning for language went mainstream.

The same did not happen for vision. Generative video models from 2015-2019 produced blurry frames that degraded further as predictions were rolled out over long horizons. Under a squared-error loss, pixel-level next-frame prediction is an averaging problem: when several futures are plausible, the loss-minimising output is their mean, which looks like blur. Ambiguity is everywhere in real video. Yann LeCun’s 2022 paper formalised the alternative: skip the pixels, predict in embedding space, train an encoder that throws away the unpredictable detail and keeps what matters.
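The averaging effect is easy to see in a toy case. The sketch below is my own illustration, not code from any of the papers: a 5-pixel "frame", two equally likely futures, and a hand-written `encode` function standing in for a learned encoder. All names and numbers are assumptions chosen for clarity.

```python
# Toy setup (an assumption for illustration): a 1-D "frame" of 5 pixels.
# From the same starting frame the ball ends up far left or far right,
# each with probability 0.5.
future_left = [1.0, 0.0, 0.0, 0.0, 0.0]
future_right = [0.0, 0.0, 0.0, 0.0, 1.0]
futures = [future_left, future_right]


def mse(pred, target):
    """Mean squared error between two frames."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def expected_mse(pred):
    """Average MSE of one fixed prediction over both possible futures."""
    return sum(mse(pred, f) for f in futures) / len(futures)


# The single prediction that minimises expected MSE is the per-pixel mean
# of the possible futures: half a ball on each side, i.e. a smear.
blurry = [sum(pixels) / len(futures) for pixels in zip(*futures)]


def encode(frame):
    """Hypothetical hand-written encoder: distance of the ball from the centre.

    It deliberately discards the unpredictable part (left versus right).
    """
    centre = (len(frame) - 1) / 2
    return sum(value * abs(i - centre) for i, value in enumerate(frame))


print("blurry prediction:", blurry)
print("expected MSE of the blurry prediction:", expected_mse(blurry))
print("expected MSE when committing to 'left':", expected_mse(future_left))
print("embeddings of the two futures:", [encode(f) for f in futures])
```

The smeared frame scores better under MSE than committing to either sharp future, which is why a pixel-level model trained this way drifts toward blur. In the hand-written embedding both futures map to the same value, so a predictor in that space has a single sharp target. A real JEPA has to learn such an encoder, and keeping it from discarding everything is the collapse problem the papers below address.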

That is JEPA. The paper introduced the architecture as a position; subsequent papers turned it into trained models.

The papers

Seven papers carry most of the weight, in publication order:

  • Barlow Twins (2021), Zbontar, Jing, Misra, LeCun, Deny. The cross-correlation trick that fixed a long-standing problem with joint-embedding training: collapse, where the encoder learns to output a constant vector. Yann LeCun calls this his “epiphany” in the talk.
  • VICReg (2022), Bardes, Ponce, LeCun. Replaces the cross-correlation matrix with three explicit terms: variance per dimension, invariance between views, covariance off-diagonal. Simpler, more stable.
  • A Path Towards Autonomous Machine Intelligence (2022), LeCun. The position paper. Introduces JEPA as the proposed architecture for autonomous machine intelligence and sets out the broader hierarchical model in which it sits.
  • I-JEPA (2023), Assran et al. The first JEPA model trained on images. Predicts embeddings of masked image regions. Vision Transformer encoder.
  • V-JEPA 2 (2025), Meta FAIR. Video version, trained on a million hours of internet video. The action-conditioned variant V-JEPA 2-AC enables zero-shot robot pick-and-place from image goals.
  • LeJEPA (Nov 2025), Balestriero, LeCun. Drops the heuristics that previous JEPA recipes accumulated (stop-gradient, teacher-student, hyperparameter schedulers). One regularizer (SIGReg) pushes embeddings toward an isotropic Gaussian distribution. ~50 lines of loss code, single hyperparameter.
  • LeWorldModel (2026), Maes et al. The first paper from LeCun’s post-Meta lab. Builds a world model end-to-end from pixels using LeJEPA’s regularization. 15M parameters, planning up to 48x faster than the previous DINO-based world model.
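To make the collapse problem and the three VICReg terms concrete, here is a minimal pure-Python sketch of a VICReg-style loss. It follows the description in the list above (variance per dimension, invariance between views, off-diagonal covariance), but it is not the authors' code: the function names, the weights and the tiny hand-made batches are assumptions for illustration.

```python
import math


def column(batch, j):
    """All values of embedding dimension j across the batch."""
    return [row[j] for row in batch]


def mean(xs):
    return sum(xs) / len(xs)


def variance_term(batch, gamma=1.0, eps=1e-4):
    """Hinge on each dimension's standard deviation: penalise std below gamma."""
    dims = len(batch[0])
    total = 0.0
    for j in range(dims):
        col = column(batch, j)
        m = mean(col)
        var = sum((x - m) ** 2 for x in col) / (len(col) - 1)
        total += max(0.0, gamma - math.sqrt(var + eps))
    return total / dims


def invariance_term(batch_a, batch_b):
    """Mean squared distance between the embeddings of the two views."""
    return sum(
        sum((a - b) ** 2 for a, b in zip(za, zb))
        for za, zb in zip(batch_a, batch_b)
    ) / len(batch_a)


def covariance_term(batch):
    """Sum of squared off-diagonal covariances, divided by the dimension."""
    n, dims = len(batch), len(batch[0])
    means = [mean(column(batch, j)) for j in range(dims)]
    total = 0.0
    for j in range(dims):
        for k in range(dims):
            if j != k:
                cov = sum(
                    (row[j] - means[j]) * (row[k] - means[k]) for row in batch
                ) / (n - 1)
                total += cov ** 2
    return total / dims


def vicreg_style_loss(za, zb, inv_w=25.0, var_w=25.0, cov_w=1.0):
    """Weighted sum of the three terms; the weights here are illustrative."""
    return (
        inv_w * invariance_term(za, zb)
        + var_w * (variance_term(za) + variance_term(zb))
        + cov_w * (covariance_term(za) + covariance_term(zb))
    )


# A collapsed encoder outputs the same vector for every input.
collapsed = [[0.5, 0.5], [0.5, 0.5], [0.5, 0.5], [0.5, 0.5]]
# A healthy batch: each dimension varies and the two are uncorrelated.
spread = [[1.0, 1.0], [1.0, -1.0], [-1.0, 1.0], [-1.0, -1.0]]

print("collapsed:", vicreg_style_loss(collapsed, collapsed))
print("spread:   ", vicreg_style_loss(spread, spread))
```

With invariance alone, the collapsed batch would score a perfect zero, because both views agree exactly. The variance term is what makes the constant output expensive, and the covariance term discourages dimensions from duplicating each other.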

Welch Labs’ video covers this same trajectory through Barlow Twins, VICReg, DINO, I-JEPA, V-JEPA 2, and previews world models in part two.

The JEPA family today

A recent r/newAIParadigms thread counts: JEPA, H-JEPA, I-JEPA, MC-JEPA, V-JEPA, Audio-JEPA, Point-JEPA, 3D-JEPA, ACT-JEPA, V-JEPA 2, LeJEPA, Causal-JEPA, V-JEPA 2.1, LeWorldModel, ThinkJEPA. DINO/DINOv2/DINOv3 sit in the same family without the name: they use distillation instead of a predictor on masked regions. Yann LeCun groups them with JEPA in talks, and DINOv3 is the production encoder Part 4 uses.

Almost every entry comes from Meta or has Yann LeCun as a co-author. The recipes are converging: LeJEPA replaces the older heuristic stack (stop-gradient, teacher-student, EMA) with a single regularizer, which LeWorldModel then ties into an end-to-end world-model objective.
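For readers who have not met the older stack, the teacher-student part usually works like this: the student is trained by gradient descent, while the teacher receives no gradients (the stop-gradient) and instead tracks the student through an exponential moving average (EMA). The sketch below is a generic illustration of that update with plain lists standing in for weight tensors. The function name and the momentum value are assumptions, not taken from any of the papers.

```python
def ema_update(teacher, student, momentum=0.99):
    """Move each teacher weight a small step toward the matching student weight.

    No gradient flows into the teacher; this update is the only way it changes.
    """
    return [momentum * t + (1.0 - momentum) * s for t, s in zip(teacher, student)]


teacher = [0.0, 0.0]
student = [1.0, -2.0]  # pretend these came out of a gradient step on the student

# A deliberately low momentum so the drift is visible after a few steps.
for _ in range(3):
    teacher = ema_update(teacher, student, momentum=0.5)

print(teacher)
```

The momentum, and often a schedule for it, is one of the knobs that has to be tuned in these recipes, which is part of the appeal of replacing the whole mechanism with a single regularizer.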

Next

Part 2, “Why Pixel Prediction Goes Blurry”, tests claim 3 directly: a 700K-parameter PyTorch model trained on a synthetic bouncing ball produces the smear Yann LeCun predicts, while the JEPA-trained version does not.