Mojo AI Brief

Friday, April 17, 2026
300 scanned · 16 sources · 8 curated
No alerts.
# Cup of Mojo -- Daily AI Brief -- Friday, April 17, 2026

Broadcast-ready source document. Read the COLD OPEN verbatim, cover every numbered beat in the RUNDOWN in tier-appropriate airtime, close with the MOJO TAKE verbatim. APPENDIX is reference only.

**Articles analyzed:** 8
**Sources monitored:** 4
**Beats to cover:** 8

---

## EDITORIAL GUIDANCE FOR PODCAST NARRATION

Read these rules before narrating. They are non-negotiable.

1. **Read the COLD OPEN verbatim.** Do not rewrite or paraphrase it. It is the episode's signature opening.
2. **Cover every numbered beat.** Beats are numbered 1 through 8. Do not skip any beat. Do not consolidate. Each beat gets its own moment.
3. **Honor the tier structure.** `deep` beats get longer treatment with full context. `standard` beats are structured but concise. `rapid_fire` beats are short and punchy. Use roughly 2 minutes for the deep beat, 1 minute per standard beat, 20-30 seconds per rapid-fire beat.
4. **Cite sources by name** when presenting a claim. Say "OpenAI announced..." not "a company announced".
5. **Use only the plain-English text in each beat.** Do not pull technical jargon from the APPENDIX. The appendix is reference material for context, not script content. If a beat does not mention a term, do not introduce it.
6. **Only use numbers that appear in a beat's own text.** Do not import statistics from the appendix. Omit rather than fabricate.
7. **Reference earlier beats when topics connect.** Each beat has a `callbacks` field listing earlier beat numbers it relates to. When narrating, explicitly link back: "Remember that supply chain attack from Beat 1? This next one shows how the downstream risk compounds." Callbacks create cohesion and prevent the episode from feeling like a list.
8. **Introduce one skeptical angle per deep or standard beat.** Phrases like "one caveat", "critics will point out", or "this is not yet peer-reviewed" create credibility. Rapid-fire beats can skip this.
9. **Use the pronunciation guide for every named person or company.** Do not guess pronunciations.
10. **Close with the MOJO TAKE outro.** Read it as the host's editorial perspective, not as a summary.

---

## PRONUNCIATION GUIDE

The following names appear in today's content. Use these phonetic pronunciations:

- **Anthropic** — pronounced *an-THROP-ik*
- **Qwen** — pronounced *CHWEN*

---

## COLD OPEN -- Read This Verbatim

Read the HOOK line first, pause for a beat, then the TEASE. Do not rewrite. Do not paraphrase. Do not add any preamble.

> **Hook:** Simon Willison just had a 35B Qwen model on his laptop draw a better pelican than Claude Opus 4.7. Let that sink in.

> **Tease:** Coming up: why your local GPU is suddenly embarrassing Anthropic, a new paper saying LLMs flunk abstract meaning harder than anyone thought, reasoning models making multi-agent negotiations worse, and MCP design patterns for shipping agents that do not fall over in production.

---

## TODAY'S RUNDOWN

Cover every beat in order. Do not skip. Tier labels tell you how much airtime each beat deserves.

### Beat 1 [DEEP] — Alibaba's Qwen3.6 drew a better pelican on a bike than Claude Opus 4.7, running on Simon Willison's laptop

**Source:** Simon Willison | https://simonwillison.net/2026/Apr/16/qwen-beats-opus/#atom-everything

**Hook (open with this):** Simon Willison just embarrassed Anthropic with a pelican on a bicycle. And the model that won? It was running on his MacBook, not in some billion-dollar data center.

**Plain English:** Willison runs this goofy benchmark where he asks models to draw a pelican riding a bike in SVG code. This morning he compared Anthropic's brand new Claude Opus 4.7 against Alibaba's Qwen3.6-35B, a 21 gigabyte open-weights model he ran locally through LM Studio on an M5 MacBook Pro. The open model drew the better pelican.

**Stakes:** If you're still assuming frontier quality only lives behind an API, you're about to get outmaneuvered by competitors running models on a laptop for free.

**Twist:** A quantized Chinese open-weights model that fits on a consumer Mac just out-drew the flagship release from the company everyone calls the quality leader.

**Takeaway:** The gap between what's locked in Anthropic's cloud and what's sitting on your hard drive is now basically a rounding error.
### Beat 2 [STANDARD] — GPT-4o flunks the abstract thinking test on SemEval's ReCAM benchmark

**Source:** arXiv cs.CL | https://arxiv.org/abs/2604.12018

**Hook (open with this):** GPT-4o just got a C-minus on a reading comprehension test built for high schoolers. SemEval's ReCAM benchmark hands models a passage and five abstract word choices, and the big names keep whiffing.

**Plain English:** ReCAM is a fill-in-the-blank test where the answers are abstract concepts like 'freedom' or 'consequence' instead of concrete nouns like 'dog'. Researchers ran GPT-4o and friends through it. The models handle concrete stuff fine, but when the meaning gets fuzzy and high-level, accuracy tanks.

**Stakes:** If you're using an LLM to summarize contracts, therapy notes, or strategy docs, the abstract nouns are exactly where it quietly hallucinates.

**Twist:** Scaling didn't fix it. GPT-4o, the biggest kid in the room, still trips on the same abstraction gap the smaller models do.

**Takeaway:** LLMs read the words, not the meaning behind them. Spot-check anything abstract before you ship it.

### Beat 3 [STANDARD] — MCP hit 97 million monthly downloads but still can't tell your agent who it is or when to stop

**Source:** arXiv cs.MA | https://arxiv.org/abs/2603.13417

**Hook (open with this):** Anthropic's Model Context Protocol is crushing it on adoption: 10,000 active servers, 97 million SDK downloads a month. Now here's the part nobody's talking about.

**Plain English:** A new arXiv paper flags three holes in MCP that bite you the second you go to production. No identity propagation, so the tool doesn't know which user is asking. No tool budgeting, so agents happily burn 400 calls on one task. And no structured errors, so failures come back as mystery strings.

**Stakes:** Ship an MCP agent without patching these yourself and you get runaway costs, audit nightmares, and retry loops that look healthy until the invoice lands.

**Twist:** The protocol everyone treats as the grown-up standard is missing the three things you actually need to run it like a grown-up.

**Takeaway:** MCP is the USB-C of tools, but you're still supplying your own fuse, your own nametag, and your own error codes.
### Beat 4 [STANDARD] — Smarter reasoning models make worse fake humans in multi-agent negotiation sims

**Source:** arXiv cs.MA | https://arxiv.org/abs/2604.11840

**Callbacks:** references Beat 3. Reference these earlier beats aloud when narrating this one.

**Hook (open with this):** arXiv researchers just caught reasoning models cheating at being human. Drop o1 or DeepSeek-R1 into a negotiation sim and they stop acting like people and start acting like game theory textbooks.

**Plain English:** When you want an LLM to simulate how real people haggle, argue, or screw up a deal, the fancy reasoning models overthink it. They find the optimal move instead of the believable move. Weaker models that just vibe their way through actually mirror human messiness better.

**Stakes:** If you're using o1 or R1 to stress-test a policy, a market, or a customer flow, your simulated humans are robots in a trench coat and your conclusions are fiction.

**Twist:** Turning up the reasoning dial makes the solver sharper and the simulator dumber. Same model, opposite direction.

**Takeaway:** Pick the model for the job. Reasoners solve, plain chat models pretend. Don't ask one to do the other.

### Beat 5 [RAPID_FIRE] — Novel Operator Test catches LLMs doing the math right then writing the wrong answer anyway

**Source:** arXiv cs.CL | https://arxiv.org/abs/2604.13065

**Callbacks:** references Beat 4. Reference these earlier beats aloud when narrating this one.

**Hook (open with this):** arXiv just dropped the Novel Operator Test, and five models flunked it in the most embarrassing way possible.

**Plain English:** Researchers renamed Boolean operators so models couldn't pattern-match from training data. The chain-of-thought steps came out correct. The final answers did not. The reasoning and the output are two different processes, and one of them is lying to you.

**Stakes:** If you trust the final answer because the reasoning trace looked clean, your agent is shipping wrong calls with a confident paper trail.

**Twist:** The models reasoned correctly step by step and still wrote the wrong answer at the bottom of the page.

**Takeaway:** Check the answer against the trace. A pretty chain-of-thought is not a receipt.

### Beat 6 [RAPID_FIRE] — WorkRB tries to un-fragment hiring AI research with one shared benchmark

**Source:** arXiv cs.CL | https://arxiv.org/abs/2604.13055

**Hook (open with this):** WorkRB landed on arXiv with a plea: hiring AI research is a mess of mismatched ontologies, ESCO versus O*NET versus whatever your country made up.

**Plain English:** A group of researchers want a community benchmark for AI that screens resumes, matches jobs, and runs workforce analytics. Right now every paper uses different labels and different tasks, so nobody can tell whose model is actually better. WorkRB wants to fix that.

**Stakes:** If you're buying hiring AI, the vendor's benchmark numbers probably can't be compared to anyone else's, which means you're shopping blind.

**Twist:** The field built career-defining models for years without agreeing on what a job title even is.

**Takeaway:** Ask hiring AI vendors which ontology and which benchmark, and watch them squirm.

### Beat 7 [RAPID_FIRE] — Cognitive Companion rides shotgun on LLM agents, catches them looping and drifting at zero overhead

**Source:** arXiv cs.AI | https://arxiv.org/abs/2604.13759

**Callbacks:** references Beat 3, Beat 4. Reference these earlier beats aloud when narrating this one.

**Hook (open with this):** Cognitive Companion sits next to your agent like a co-pilot, watching for loops, drift, and stuck states that tank up to 30% of hard multi-step runs.

**Plain English:** Agents get dumber as steps pile up. The usual fixes are hard cutoffs or an LLM judge that taxes every step 10 to 15%. This paper runs a cheap probe in parallel that spots the spiral and yanks the agent out, without the judge tax.

**Stakes:** Ship a long-running agent with no monitor and one in three hard tasks quietly loops or drifts into nonsense while you pay the token bill.

**Twist:** The probe version costs basically nothing to run and still catches the degradation a full LLM judge is charging you 15% a step to find.

**Takeaway:** If your agent does more than three steps, put a cheap watcher next to it. Step limits are a hammer, judges are a tax, probes are neither.
### Beat 8 [RAPID_FIRE] — Tri-Spirit Architecture splits agent brains across three hardware layers instead of one cloud blob

**Source:** arXiv cs.AI | https://arxiv.org/abs/2604.13757

**Callbacks:** references Beat 1, Beat 7. Reference these earlier beats aloud when narrating this one.

**Hook (open with this):** Tri-Spirit Architecture wants to carve your agent's brain into three pieces: plan, reason, execute, each on its own silicon.

**Plain English:** The arXiv pitch says cloud-only and on-device-only both waste latency and energy because planning, reasoning, and execution get jammed into one pipe. Tri-Spirit splits them across three hardware tiers so each layer runs where it's cheapest and fastest. It's a blueprint, not a product.

**Stakes:** Ignore the layering question and you'll keep paying cloud prices to execute keystrokes a Raspberry Pi could handle.

**Twist:** The bottleneck for autonomous agents isn't the model, it's that nobody decided which chip should be doing which part of the thinking.

**Takeaway:** Before scaling your agent, ask which step belongs on the cloud, which on the box, and which on the edge.

---

## NOT WORTH YOUR TIME TODAY

Do not cover on air. These are listed so the host can acknowledge if asked.

- **Thermodynamic Liquid Manifold Networks for off-grid solar forecasting** -- Four buzzwords stapled together to forecast sunshine on a cabin roof. Pass.
- **arXiv paper on Thermodynamic Liquid Manifold Networks for off-grid solar forecasting** -- Five buzzwords stapled together to forecast sunshine. If the title needs a decoder ring, the method probably does too.
- **Thermodynamic Liquid Manifold Networks for solar forecasting on off-grid micro-setups (arXiv)** -- Four buzzwords stapled together to forecast sunshine on a cabin roof. Cute paper, zero bearing on anything you ship Monday.

---

## ACTION ITEMS FOR THIS WEEK (Joey only)

These are internal action items. Not for on-air narration.

- Download Qwen3.6 onto the laptop this weekend, run Simon Willison's pelican-on-a-bike test against our current Claude Opus 4.7 pipeline, and write up which tasks we can yank out of Anthropic's cloud by Friday.
- Download Qwen3.6 on the MacBook this weekend, run Simon Willison's pelican-on-a-bike prompt against our Claude Opus 4.7 baseline, and post the side-by-side to the team Slack by Monday standup.

---

## MOJO TAKE -- Editorial Outro (Read Verbatim)

Three-paragraph outro. Read each block verbatim, with natural pauses between them.

> **Connect the dots:** Today's thread: the big labs are losing their monopoly on 'good enough.' Qwen3.6 draws pelicans on Simon Willison's laptop, MCP ships 97 million downloads, Tri-Spirit splits brains across three boxes. Meanwhile GPT-4o flunks abstract reasoning and smart models make dumb fake humans. The center is not holding. The edge, the benchmark, and the watcher are where the work is.

> **Watch next:** Watch for Qwen3.6 local benchmarks hitting Hugging Face this week, and whether Anthropic answers with a smaller Claude. Also watch WorkRB adoption: if two hiring vendors sign on, it becomes the standard. If none do, it's a paper.
> **Sign-off:** Pick the right model for the job, put a cheap watcher next to it, and ship. That's Cup of Mojo. Joey out, go build something.

---

## APPENDIX -- VERBATIM SOURCE CONTENT

Reference material. Do not read verbatim. Do not pull jargon from here into the spoken script. If the rundown beat does not mention a term, do not introduce it on the podcast.

### MCP hit 97 million monthly downloads but still can't tell your agent who it is or when to stop

**Source:** arXiv cs.MA
**Link:** https://arxiv.org/abs/2603.13417

**Title:** Bridging Protocol and Production: Design Patterns for Deploying AI Agents with Model Context Protocol

**Abstract:** The Model Context Protocol (MCP) standardizes how AI agents discover and invoke external tools, with over 10,000 active servers and 97 million monthly SDK downloads as of early 2026. Yet MCP does not yet standardize how agents safely operate those tools at production scale. Three protocol-level primitives remain missing: identity propagation, adaptive tool budgeting, and structured error semantics. This paper identifies these gaps through field lessons from an enterprise deployment of an AI agent platform integrated with a major cloud provider's MCP servers (client name redacted). We propose three mechanisms to fill them: (1) the Context-Aware Broker Protocol (CABP), which extends JSON-RPC with identity-scoped request routing via a six-stage broker pipeline; (2) Adaptive Timeout Budget Allocation (ATBA), which frames sequential tool invocation as a budget allocation problem over heterogeneous latency distributions; and (3) the Structured Error Recovery Framework (SERF), which provides machine-readable failure semantics that enable deterministic agent self-correction. We organize production failure modes into five design dimensions (server contracts, user context, timeouts, errors, and observability), document concrete failure vignettes, and present a production readiness checklist. All three algorithms are formalized as testable hypotheses with reproducible experimental methodology. Field observations demonstrate that while MCP provides a solid protocol foundation, reliable agent tool integration requires infrastructure-level mechanisms that the specification does not yet address.
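For reference only, not from the paper: a minimal sketch of the kind of infrastructure-level patch the abstract says you currently have to build yourself. It covers two of the three gaps (tool budgeting and structured errors) around a generic `call_tool` callable that stands in for however your agent actually invokes an MCP tool. It is not the paper's CABP, ATBA, or SERF, and none of the names below are real SDK APIs.

```python
# Illustrative sketch only -- not part of the MCP spec and not the paper's algorithms.
from dataclasses import dataclass, field
import time


@dataclass
class ToolError:
    """Machine-readable failure envelope instead of a mystery string."""
    tool: str
    kind: str          # e.g. "timeout", "budget_exhausted", "server_error"
    detail: str
    retryable: bool


@dataclass
class ToolBudget:
    """Hard per-task ceilings so an agent cannot burn hundreds of calls unnoticed."""
    max_calls: int = 25
    max_seconds: float = 120.0
    calls_used: int = 0
    started: float = field(default_factory=time.monotonic)

    def exhausted(self) -> bool:
        return (self.calls_used >= self.max_calls
                or time.monotonic() - self.started >= self.max_seconds)


def call_with_budget(call_tool, tool_name: str, args: dict, budget: ToolBudget):
    """Wrap a raw tool call: enforce the budget and normalize failures.

    `call_tool` is a placeholder for your own tool-invocation function.
    """
    if budget.exhausted():
        return ToolError(tool_name, "budget_exhausted",
                         f"{budget.calls_used} calls used", retryable=False)
    budget.calls_used += 1
    try:
        return call_tool(tool_name, args)
    except TimeoutError as exc:
        return ToolError(tool_name, "timeout", str(exc), retryable=True)
    except Exception as exc:  # demo catch-all; narrow this in real code
        return ToolError(tool_name, "server_error", str(exc), retryable=False)
```

Identity propagation, the third gap, would ride along as an extra field on every wrapped request; the paper's CABP proposal is one way to formalize that.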
### Smarter reasoning models make worse fake humans in multi-agent negotiation sims

**Source:** arXiv cs.MA
**Link:** https://arxiv.org/abs/2604.11840

**Title:** When Reasoning Models Hurt Behavioral Simulation: A Solver-Sampler Mismatch in Multi-Agent LLM Negotiation

**Abstract:** Large language models are increasingly used as agents in social, economic, and policy simulations. A common assumption is that stronger reasoning should improve simulation fidelity. We argue that this assumption can fail when the objective is not to solve a strategic problem, but to sample plausible boundedly rational behavior. In such settings, reasoning-enhanced models can become better solvers and worse simulators: they can over-optimize for strategically dominant actions, collapse compromise-oriented terminal behavior, and sometimes exhibit a diversity-without-fidelity pattern in which local variation survives without outcome-level fidelity. We study this solver-sampler mismatch in three multi-agent negotiation environments adapted from earlier simulation work: an ambiguous fragmented-authority trading-limits scenario, an ambiguous unified-opposition trading-limits scenario, and a new-domain grid-curtailment case in emergency electricity management. We compare three reflection conditions, no reflection, bounded reflection, and native reasoning, across two primary model families and then extend the same protocol to direct OpenAI runs with GPT-4.1 and GPT-5.2. Across all three experiments, bounded reflection produces substantially more diverse and compromise-oriented trajectories than either no reflection or native reasoning. In the direct OpenAI extension, GPT-5.2 native ends in authority decisions in 45 of 45 runs across the three experiments, while GPT-5.2 bounded recovers compromise outcomes in every environment. The contribution is not a claim that reasoning is generally harmful. It is a methodological warning: model capability and simulation fidelity are different objectives, and behavioral simulation should qualify models as samplers, not only as solvers.

### Novel Operator Test catches LLMs doing the math right then writing the wrong answer anyway

**Source:** arXiv cs.CL
**Link:** https://arxiv.org/abs/2604.13065

**Title:** Correct Chains, Wrong Answers: Dissociating Reasoning from Output in LLM Logic

**Abstract:** LLMs can execute every step of chain-of-thought reasoning correctly and still produce wrong final answers. We introduce the Novel Operator Test, a benchmark that separates operator logic from operator name, enabling rigorous distinction between genuine reasoning and pattern retrieval. By evaluating Boolean operators under unfamiliar names across depths 1-10 on five models (up to 8,100 problems each), we demonstrate a reasoning-output dissociation that existing benchmarks cannot detect. At Claude Sonnet 4's depth 7, all 31 errors have verifiably correct reasoning yet wrong declared answers; 17/19 errors in mixed-operator chains exhibit the same pattern. The benchmark reveals two failure types: strategy failures at depth 2, where models attempt terse retrieval (+62pp from scaffolding), and content failures at depth 7, where models reason fully but err systematically (+8-30pp, 0/300 errors post-intervention). A Trojan operator (XOR's truth table under a novel name) confirms name alone does not gate reasoning (p >= 0.49), while Llama's novelty gap widens to 28pp at depth 8-9 with the Trojan at 92-100%, isolating genuine difficulty with novel logic from name unfamiliarity.
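For reference only, not the benchmark's released harness: a minimal sketch of the renamed-operator idea. It defines XOR under a made-up name, builds chained problems of a chosen depth, and grades only the declared final answer; `ask_model` is a placeholder for whatever model call you use.

```python
# Illustrative sketch of the novel-operator idea; names and prompt wording are assumptions.
import random


def zorp(a: bool, b: bool) -> bool:
    """XOR under a novel name, so the model cannot lean on the familiar token."""
    return a != b


def make_problem(depth: int, rng: random.Random) -> tuple[str, bool]:
    """Chain `depth` applications of zorp over random booleans and compute ground truth."""
    values = [rng.choice([True, False]) for _ in range(depth + 1)]
    truth = values[0]
    for v in values[1:]:
        truth = zorp(truth, v)
    expr = " zorp ".join(str(v) for v in values)
    prompt = (
        "zorp is a binary operator on booleans. zorp(a, b) is True exactly when "
        "a and b differ. Evaluate step by step, then give the final answer as "
        f"ANSWER: True or ANSWER: False.\n\n{expr}"
    )
    return prompt, truth


def declared_answer(reply: str) -> bool | None:
    """Pull the model's final declared answer, ignoring the reasoning trace."""
    for line in reversed(reply.splitlines()):
        if line.strip().upper().startswith("ANSWER:"):
            return "TRUE" in line.upper()
    return None


def score(ask_model, n: int = 20, depth: int = 7, seed: int = 0) -> float:
    rng = random.Random(seed)
    correct = 0
    for _ in range(n):
        prompt, truth = make_problem(depth, rng)
        if declared_answer(ask_model(prompt)) == truth:
            correct += 1
    return correct / n
```

Spotting the dissociation the abstract describes still means reading the traces and comparing them against the graded answers; this sketch only measures the answer line.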
### WorkRB tries to un-fragment hiring AI research with one shared benchmark

**Source:** arXiv cs.CL
**Link:** https://arxiv.org/abs/2604.13055

**Title:** WorkRB: A Community-Driven Evaluation Framework for AI in the Work Domain

**Abstract:** Today's evolving labor markets rely increasingly on recommender systems for hiring, talent management, and workforce analytics, with natural language processing (NLP) capabilities at the core. Yet, research in this area remains highly fragmented. Studies employ divergent ontologies (ESCO, O*NET, national taxonomies), heterogeneous task formulations, and diverse model families, making cross-study comparison and reproducibility exceedingly difficult. General-purpose benchmarks lack coverage of work-specific tasks, and the inherent sensitivity of employment data further limits open evaluation. We present **WorkRB** (Work Research Benchmark), the first open-source, community-driven benchmark tailored to work-domain AI. WorkRB organizes 13 diverse tasks from 7 task groups as unified recommendation and NLP tasks, including job/skill recommendation, candidate recommendation, similar item recommendation, and skill extraction and normalization. WorkRB enables both monolingual and cross-lingual evaluation settings through dynamic loading of multilingual ontologies. Developed within a multi-stakeholder ecosystem of academia, industry, and public institutions, WorkRB has a modular design for seamless contributions and enables integration of proprietary tasks without disclosing sensitive data. WorkRB is available under the Apache 2.0 license at this https URL.

### Cognitive Companion rides shotgun on LLM agents, catches them looping and drifting at zero overhead

**Source:** arXiv cs.AI
**Link:** https://arxiv.org/abs/2604.13759

**Title:** The cognitive companion: a lightweight parallel monitoring architecture for detecting and recovering from reasoning degradation in LLM agents

**Abstract:** Large language model (LLM) agents on multi-step tasks suffer reasoning degradation, looping, drift, stuck states, at rates up to 30% on hard tasks. Current solutions include hard step limits (abrupt) or LLM-as-judge monitoring (10-15% overhead per step). This paper introduces the Cognitive Companion, a parallel monitoring architecture with two implementations: an LLM-based Companion and a novel zero-overhead Probe-based Companion. We report a three-batch feasibility study centered on Gemma 4 E4B, with an additional exploratory small-model analysis on Qwen 2.5 1.5B and Llama 3.2 1B. In our experiments, the LLM-based Companion reduced repetition on loop-prone tasks by 52-62% with approximately 11% overhead. The Probe-based Companion, trained on hidden states from layer 28, showed a mean effect size of +0.471 at zero measured inference overhead; its strongest probe result achieved cross-validated AUROC 0.840 on a small proxy-labeled dataset. A key empirical finding is that companion benefit appears task-type dependent: companions are most helpful on loop-prone and open-ended tasks, while effects are neutral or negative on more structured tasks. Our small-model experiments also suggest a possible scale boundary: companions did not improve the measured quality proxy on 1B-1.5B models, even when interventions fired. Overall, the paper should be read as a feasibility study rather than a definitive validation. The results provide encouraging evidence that sub-token monitoring may be useful, identify task-type sensitivity as a practical design constraint, and motivate selective companion activation as a promising direction for future work.
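For reference only, and much cruder than the paper's approach: the paper's probe reads hidden states inside the model, while the sketch below uses the simplest external signal available, repeated near-identical steps, just to show where a parallel watcher sits relative to the agent loop. The step format and the `take_step` callable are assumptions, not the paper's implementation.

```python
# Illustrative stand-in for a parallel watcher; not the paper's probe.
from collections import deque


class LoopWatcher:
    """Flags an agent that keeps emitting the same action/output signature."""

    def __init__(self, window: int = 6, max_repeats: int = 3):
        self.recent = deque(maxlen=window)
        self.max_repeats = max_repeats

    def observe(self, step_signature: str) -> bool:
        """Return True when the run looks stuck and should be interrupted."""
        self.recent.append(step_signature)
        return self.recent.count(step_signature) >= self.max_repeats


def run_agent(take_step, max_steps: int = 30) -> str:
    """`take_step` is a placeholder returning (signature, done) for each agent step."""
    watcher = LoopWatcher()
    for _ in range(max_steps):
        signature, done = take_step()
        if done:
            return "finished"
        if watcher.observe(signature):
            return "interrupted: loop detected"  # recover, replan, or escalate here
    return "hit step limit"
```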
### GPT-4o flunks the abstract thinking test on SemEval's ReCAM benchmark

**Source:** arXiv cs.CL
**Link:** https://arxiv.org/abs/2604.12018

**Title:** LLMs Struggle with Abstract Meaning Comprehension More Than Expected

**Abstract:** Understanding abstract meanings is crucial for advanced language comprehension. Despite extensive research, abstract words remain challenging due to their non-concrete, high-level semantics. SemEval-2021 Task 4 (ReCAM) evaluates models' ability to interpret abstract concepts by presenting passages with questions and five abstract options in a cloze-style format. Key findings include: (1) Most large language models (LLMs), including GPT-4o, struggle with abstract meaning comprehension under zero-shot, one-shot, and few-shot settings, while fine-tuned models like BERT and RoBERTa perform better. (2) A proposed bidirectional attention classifier, inspired by human cognitive strategies, enhances fine-tuned models by dynamically attending to passages and options. This approach improves accuracy by 4.06 percent on Task 1 and 3.41 percent on Task 2, demonstrating its potential for abstract meaning comprehension.

### Tri-Spirit Architecture splits agent brains across three hardware layers instead of one cloud blob

**Source:** arXiv cs.AI
**Link:** https://arxiv.org/abs/2604.13757

**Title:** Rethinking AI Hardware: A Three-Layer Cognitive Architecture for Autonomous Agents

**Abstract:** The next generation of autonomous AI systems will be constrained not only by model capability, but by how intelligence is structured across heterogeneous hardware. Current paradigms -- cloud-centric AI, on-device inference, and edge-cloud pipelines -- treat planning, reasoning, and execution as a monolithic process, leading to unnecessary latency, energy consumption, and fragmented behavioral continuity. We introduce the Tri-Spirit Architecture, a three-layer cognitive framework that decomposes intelligence into planning (Super Layer), reasoning (Agent Layer), and execution (Reflex Layer), each mapped to distinct compute substrates and coordinated via an asynchronous message bus. We formalize the system with a parameterized routing policy, a habit-compilation mechanism that promotes repeated reasoning paths into zero-inference execution policies, a convergent memory model, and explicit safety constraints. We evaluate the architecture in a reproducible simulation of 2000 synthetic tasks against cloud-centric and edge-only baselines. Tri-Spirit reduces mean task latency by 75.6 percent and energy consumption by 71.1 percent, while decreasing LLM invocations by 30 percent and enabling 77.6 percent offline task completion. These results suggest that cognitive decomposition, rather than model scaling alone, is a primary driver of system-level efficiency in AI hardware.
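For reference only, not the paper's formalism: a toy routing policy in the spirit of the abstract's description, deciding whether a step goes to the Reflex, Agent, or Super layer and caching repeatedly successful plans as "habits". The task features, thresholds, and dictionary-based habit store are all assumptions for illustration.

```python
# Illustrative toy router; not the paper's parameterized routing policy.
from dataclasses import dataclass


@dataclass
class Task:
    novelty: float   # 0 = seen many times before, 1 = brand new
    horizon: int     # how many steps of lookahead it needs


HABITS: dict[str, str] = {}   # plans promoted to zero-inference reflex policies


def route(task_key: str, task: Task) -> str:
    """Pick the cheapest layer that can plausibly handle the step."""
    if task_key in HABITS:
        return "reflex"                      # compiled habit: no model call at all
    if task.novelty < 0.2 and task.horizon <= 1:
        return "reflex"                      # on-device executor
    if task.horizon <= 5:
        return "agent"                       # local/edge reasoning model
    return "super"                           # cloud planner for long-horizon work


def compile_habit(task_key: str, plan: str) -> None:
    """Promote a repeatedly successful reasoning path into a cached reflex policy."""
    HABITS[task_key] = plan
```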
### Alibaba's Qwen3.6 drew a better pelican on a bike than Claude Opus 4.7, running on Simon Willison's laptop

**Source:** Simon Willison
**Link:** https://simonwillison.net/2026/Apr/16/qwen-beats-opus/#atom-everything

Qwen3.6-35B-A3B on my laptop drew me a better pelican than Claude Opus 4.7

16th April 2026

For anyone who has been (inadvisably) taking my pelican riding a bicycle benchmark seriously as a robust way to test models, here are pelicans from this morning's two big model releases—Qwen3.6-35B-A3B from Alibaba and Claude Opus 4.7 from Anthropic.

Here's the Qwen 3.6 pelican, generated using this 20.9GB Qwen3.6-35B-A3B-UD-Q4_K_S.gguf quantized model by Unsloth, running on my MacBook Pro M5 via LM Studio (and the llm-lmstudio plugin)—transcript here:

And here's one I got from Anthropic's brand new Claude Opus 4.7 (transcript):

I'm giving this one to Qwen 3.6. Opus managed to mess up the bicycle frame! I tried Opus a second time passing thinking_level: max. It didn't do much better (transcript):

I don't think Qwen are cheating

A lot of people are convinced that the labs train for my stupid benchmark. I don't think they do, but honestly this result did give me a little glint of suspicion. So I'm burning one of my secret backup tests—here's what I got from Qwen3.6-35B-A3B and Opus 4.7 for "Generate an SVG of a flamingo riding a unicycle":

I'm giving this one to Qwen too, partly for the excellent <!-- Sunglasses on flamingo! --> SVG comment.

What can we learn from this?

The pelican benchmark has always been meant as a joke—it's mainly a statement on how obtuse and absurd the task of comparing these models is. The weird thing about that joke is that, for the most part, there has been a direct correlation between the quality of the pelicans produced and the general usefulness of the models. Those first pelicans from October 2024 were junk. The more recent entries have generally been much, much better—to the point that Gemini 3.1 Pro produces illustrations you could actually use somewhere, provided you had a pressing need to illustrate a pelican riding a bicycle.

Today, even that loose connection to utility has been broken. I have enormous respect for Qwen, but I very much doubt that a 21GB quantized version of their latest model is more powerful or useful than Anthropic's latest proprietary release. If the thing you need is an SVG illustration of a pelican riding a bicycle though, right now Qwen3.6-35B-A3B running on a laptop is a better bet than Opus 4.7!
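For reference against the action items above: one minimal way to reproduce the prompt against a local model, talking straight to LM Studio's OpenAI-compatible local server rather than going through the llm CLI and llm-lmstudio plugin Willison used. The port is LM Studio's default and the model id is a placeholder; copy the exact id LM Studio shows for whichever Qwen build you load.

```python
# Minimal sketch, assuming LM Studio is serving its OpenAI-compatible API locally.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:1234/v1", api_key="lm-studio")

PROMPT = "Generate an SVG of a pelican riding a bicycle"

reply = client.chat.completions.create(
    model="qwen3.6-35b-a3b",  # placeholder id; use the one LM Studio lists for your loaded model
    messages=[{"role": "user", "content": PROMPT}],
)

# Save the output so the two pelicans can be compared side by side in a browser.
# Note: the reply may wrap the SVG in prose or code fences; trim as needed.
with open("pelican-qwen.svg", "w") as f:
    f.write(reply.choices[0].message.content)
```

Running the same prompt through the Claude Opus 4.7 baseline and diffing the two SVGs gives the side-by-side the action item asks for.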