todai — Friday, 5 June

TODAY'S ITEMS

1. Anthropic Publishes Recursive Self-Improvement Data via New Institute

The newly established Anthropic Institute published "When AI Builds Itself," revealing that Mythos Preview achieved a ~52× speedup on a code-training benchmark where skilled humans take 4-8 hours — up from ~3× with Claude Opus 4 in May 2025. On research judgment calls, Mythos Preview improved on human researchers 64% of the time versus 51% with Claude Opus 4.5 in November 2025.
Source: Anthropic Institute
Why it matters: First hard numbers Anthropic has published on AI self-improvement velocity — the new Institute signals they consider this trajectory significant enough to warrant its own research body.
Verified

2. NVIDIA Drops Nemotron 3 Ultra: 550B Open-Weights Model With 1M Context

NVIDIA released Nemotron 3 Ultra, a 550B parameter model with 55B active parameters (the rest are dormant experts activated on demand), supporting 1M token contexts at 300+ tokens/sec. Available now on HuggingFace under a commercial-OK open licence, but requires 8× H200 GPUs minimum.
Source: NVIDIA / r/LocalLLaMA (227 upvotes)
Why it matters: Tops US open-weights intelligence rankings — if your team evaluates self-hosted alternatives to Claude for agentic work, this is the new ceiling.
Verified

3. Berkeley CS Failure Rates Triple as Students Lean on AI

UC Berkeley's Daily Californian reports 35.3% of CS 10 students received F's in Spring 2026, up from under 10% in 2024 and 2025. CS 61A saw 10.6% F's. Instructors point to increased AI reliance and declining mathematical preparedness.
Source: Daily Californian via HN (624 points)
Why it matters: The gap between "Claude can do this" and "I understand why" is where debugging fails — a reminder that AI-assisted work still needs underlying skills.
Verified

YOUR STACK — UPDATES

CC v2.1.162: explicit WebFetch deny/allow rules now override preapproved-domain auto-allow (security fix), MCP timeout values under 1 second no longer abort every tool call, claude agents --json includes waitingFor field, /effort persists as default for new sessions, Windsurf renamed to Devin Desktop | https://github.com/anthropics/claude-code/releases/tag/v2.1.162

REDDIT SIGNAL

Tips — 9 practical lessons from 8 months of daily Claude use: editing beats generating, long context needs pre-filtering, and "what's the strongest argument against this" gets sharper pushback than "is this right?" — 226 upvotes, 22 comments | r/ClaudeAI | https://old.reddit.com/r/ClaudeAI/comments/1twkht5/

Workflow — Run Claude Code in Docker with your existing subscription: mount your project + ~/.claude login, one shell alias, full isolation with Node/Python/Go/Rust in the container — 56 upvotes, 19 comments | r/ClaudeCode | https://old.reddit.com/r/ClaudeCode/comments/1tw01pp/

open-notebook — 482★ today (24,802★ total) — Open-source NotebookLM alternative: upload docs, generate AI podcast episodes, search across notebooks | docker compose up | https://github.com/lfnovo/open-notebook

NEW TOOL / PRODUCT SPOTLIGHT

last30days — AI agent skill that searches Reddit, X, YouTube, HN, Polymarket, and the web in parallel, then synthesises a grounded summary scored by real engagement data. Zero config for Reddit/HN/GitHub; X and YouTube unlock in 30 seconds. 27,457★ | Claude Code: /plugin marketplace add mvanhorn/last30days-skill | Others: npx skills add mvanhorn/last30days-skill -g | https://github.com/mvanhorn/last30days-skill

PROMPT OF THE DAY

Before responding, identify the strongest argument against what I'm about to propose.
Then tell me:
1. The counterargument in 2-3 sentences
2. What evidence would change your mind
3. Your recommendation given both sides

Here's what I'm thinking: [paste your idea, plan, or decision]

Forces Claude to push back instead of agreeing — turns "yes and" into genuine evaluation. Use in Chat or Code when making architecture, hiring, or strategy decisions. Source: r/ClaudeAI https://old.reddit.com/r/ClaudeAI/comments/1twkht5/

LANDSCAPE NOTES

Google employees internally share memes about their AI quality issues — Google then revised a press statement to 404 Media, removing "it's critical that we maintain humans in the loop." https://simonwillison.net/2026/Jun/4/a-slightly-different-version/
Researcher spent $1,500 testing LLMs on a vulnerable app: Claude scored lower than competitors because its guardrails blocked exploitation — not a capability gap, a safety design choice. HN 351 points. https://news.ycombinator.com/item?id=48392343