Tagged: attention
2 articles
Three Bets on Long-Context Attention
Gemma 4, Qwen 3.6, and DeepSeek V3 each take a different path to long context. A reverse-engineering test shows where the trade hurts.

DeepSeek V4: Don't Look at What You Don't Need
DeepSeek V4 reads a million tokens on roughly a quarter of V3.2's compute. It does this by selectively attending to the parts of context the prompt asks about, the same way humans skim a long book.