LLM vs JEPA

Part 2: Why Pixel Prediction Goes Blurry

Part one of this series maps Yann LeCun’s argument and the papers behind it. In this part I test his most concrete claim about generative prediction with a 700K-parameter PyTorch model on a synthetic bouncing ball.

The claim is that pixel-space generative prediction goes blurry on ambiguous futures, and that the blur is not a model-size problem. Code for this part is in part1_pixel/ of the repo.

A bouncing ball with no velocity

The dataset is one-dimensional bouncing-ball frames. 64x64 grayscale, ball radius 8, deterministic bouncing physics. Each training pair is (frame_t, frame_t+1) with single-frame input.

def simulate_episode(steps, rng):
    x = rng.uniform(BALL_RADIUS + 1, FRAME_SIZE - BALL_RADIUS - 1)
    vx = float(rng.choice([-SPEED, SPEED]))
    frames = np.empty((steps, FRAME_SIZE, FRAME_SIZE), dtype=np.float32)
    for t in range(steps):
        frames[t] = render_frame(x, FRAME_SIZE // 2)
        x_next = x + vx
        if x_next - BALL_RADIUS < 0:
            vx = abs(vx); x_next = x + vx
        elif x_next + BALL_RADIUS >= FRAME_SIZE:
            vx = -abs(vx); x_next = x + vx
        x = x_next
    return frames

A single frame at position x does not tell you which direction the ball is moving. The training set contains many trajectories: some passing through x going left, others going right. So the next-frame distribution conditioned on a single input frame is bimodal.

This is a 700K-parameter toy on synthetic data. Model size is not the variable being tested; the principle is.

Predicting the next frame in pixel space

Encoder + decoder, trained end-to-end with mean-squared error against the next frame.

class PixelPredictor(nn.Module):
    def __init__(self):
        super().__init__()
        self.encoder = Encoder()   # 4-layer ConvNet, 1 -> 16 -> 32 -> 64 -> 128
        self.decoder = Decoder()   # mirror image, 128 -> 64x8x8 -> 64x64

    def forward(self, x):
        return self.decoder(self.encoder(x))

# Training loop, 2000 steps, ~25s on M3 MacBook Pro (MPS)
loss = F.mse_loss(model(x_t), x_next)

Loss converges to about 0.018, well below the all-zeros floor of 0.048 for a frame with ~200 ball pixels. The model is learning something.

The smear

Input frame, actual next frame, pixel-space prediction

Three panels, left to right: the input frame, the actual next frame, and the pixel-space prediction. Look at the third panel.

The ball is in the middle, but it is brighter on the side the ball moved to and trails a faint halo in the opposite direction. The model has produced a weighted average of two ball positions: one going left, one going right. The brightness ratio reflects the training data’s bias for that specific input position.

The same pattern shows up across the whole episode:

Pixel-space prediction across an 80-frame episode

The third panel smears most at mid-frame positions where both directions are equally plausible in the training set, and tightens up near the walls where the bounce constrains direction.

The smear is the optimal output for this loss function on a bimodal target distribution.

Why the smear is mathematical

Let f(x) be the model’s prediction for input frame x. The training objective is E[||f(x) - y||²] where y is the true next frame. Take the gradient with respect to f(x), set to zero. The MSE-optimal output for any input is the conditional mean of y given x:

f*(x) = E[y | x]

When y | x is bimodal (ball goes left or right with roughly equal probability from this position), the conditional mean is the pixelwise average of the two outcomes. Two ball-shaped peaks of intensity 0.5 instead of one peak of intensity 1.

What does not fix the blur

The blur is not a function of model capacity. The conditional mean argument holds for any sufficient-capacity predictor on this loss. So:

  • Bigger encoder. Same blur.
  • Longer training. Same blur, just better-converged.
  • Different architecture (transformer in place of ConvNet). Same blur.
  • Diffusion-style outputs. Side-step the issue by sampling instead of point-predicting, but introduce their own trade-offs LeCun discusses elsewhere.

The output target is wrong. Training a model to commit to a single pixel-level future under MSE is asking it to commit to something the input does not contain enough information to specify.

Where this leads

If you cannot specify the next pixel because the future is ambiguous, the natural move is to specify less. Predict the embedding of the next frame instead, and let the encoder throw away the unpredictable details (which way the ball bounced) while keeping what is predictable.

That is the JEPA recipe and the subject of Part 3, which trains the same encoder with three loss functions in succession: Barlow Twins, VICReg, and the November 2025 LeJEPA paper.