# Cup of Mojo -- Daily AI Brief -- Thursday, April 23, 2026

Broadcast-ready source document. Read the COLD OPEN verbatim, cover every numbered beat in the RUNDOWN in tier-appropriate airtime, close with the MOJO TAKE verbatim. APPENDIX is reference only.

**Articles analyzed:** 10
**Sources monitored:** 10
**Beats to cover:** 10

---

## EDITORIAL GUIDANCE FOR PODCAST NARRATION

Read these rules before narrating. They are non-negotiable.

1. **Read the COLD OPEN verbatim.** Do not rewrite or paraphrase it. It is the episode's signature opening.
2. **Cover every numbered beat.** Beats are numbered 1 through 10. Do not skip any beat. Do not consolidate. Each beat gets its own moment.
3. **Honor the tier structure.** `deep` beats get longer treatment with full context. `standard` beats are structured but concise. `rapid_fire` beats are short and punchy. Use roughly 2 minutes for the deep beat, 1 minute per standard beat, and 20-30 seconds per rapid-fire beat.
4. **Cite sources by name** when presenting a claim. Say "OpenAI announced..." not "a company announced".
5. **Use only the plain-English text in each beat.** Do not pull technical jargon from the APPENDIX. The appendix is reference material for context, not script content. If a beat does not mention a term, do not introduce it.
6. **Only use numbers that appear in a beat's own text.** Do not import statistics from the appendix. Omit rather than fabricate.
7. **Reference earlier beats when topics connect.** Each beat has a `callbacks` field listing earlier beat numbers it relates to. When narrating, explicitly link back: "Remember that supply chain attack from Beat 1? This next one shows how the downstream risk compounds." Callbacks create cohesion and prevent the episode from feeling like a list.
8. **Introduce one skeptical angle per deep or standard beat.** Phrases like "one caveat", "critics will point out", or "this is not yet peer-reviewed" create credibility. Rapid-fire beats can skip this.
9. **Use the pronunciation guide for every named person or company.** Do not guess pronunciations.
10. **Close with the MOJO TAKE outro.** Read it as the host's editorial perspective, not as a summary.

---

## PRONUNCIATION GUIDE

The following names appear in today's content. Use these phonetic pronunciations:

- **Dario Amodei** — pronounced *DAR-ee-oh ah-moh-DAY*
- **Anthropic** — pronounced *an-THROP-ik*
- **DeepMind** — pronounced *DEEP-mind*

---

## COLD OPEN -- Read This Verbatim

Read the HOOK line first, pause for a beat, then the TEASE. Do not rewrite. Do not paraphrase. Do not add any preamble.

> **Hook:** OpenAI just bolted WebSockets onto the Responses API, and your agent stack got a whole lot less laggy overnight.

> **Tease:** Anthropic and Amazon are chasing 5 gigawatts of compute, Microsoft Research drops AutoAdapt for domain tuning, a16z writes the DoW contract playbook for startups, and Cloneable grabs $4.6M to bottle up expert workers. Pour the cup.

---

## TODAY'S RUNDOWN

Cover every beat in order. Do not skip. Tier labels tell you how much airtime each beat deserves.

### Beat 1 [DEEP] — OpenAI cut Codex latency by swapping HTTP for WebSockets in the Responses API

**Source:** OpenAI Blog | https://openai.com/index/speeding-up-agentic-workflows-with-websockets

**Hook (open with this):** OpenAI just ripped out a bottleneck nobody was talking about. Codex, their coding agent, was bleeding time on every single tool call because each turn meant a fresh HTTP round trip and a cold cache. They fixed it with WebSockets.

**Plain English:** Every time an agent calls a tool and comes back for another thought, the old setup opened a new connection and forgot everything it just loaded. OpenAI flipped the Responses API to keep one live socket open per session and cache the model state on that connection. Same agent, less waiting, cheaper tokens.
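For the curious, the caching win sketches out like this in miniature. This is toy code with invented names, not the OpenAI client API: the per-turn pattern re-ships the whole context on every hop, while a persistent session only ships the delta.

```python
# Toy illustration of connection-scoped caching in an agent tool loop.
# The per-call pattern re-sends the full conversation prefix every turn;
# a persistent session lets the server keep the cached prefix, so only
# the new delta travels. All names here are made up for illustration.

class StatelessLoop:
    """One fresh request per turn: the whole prefix travels every time."""
    def __init__(self):
        self.bytes_sent = 0

    def turn(self, history, new_msg):
        payload = "".join(history) + new_msg   # full context, every call
        self.bytes_sent += len(payload)
        return payload

class SocketSession:
    """One live connection per session: server keeps the cached prefix."""
    def __init__(self):
        self.bytes_sent = 0
        self._server_cache = []                # state pinned to the socket

    def turn(self, history, new_msg):
        self._server_cache.append(new_msg)     # only the delta travels
        self.bytes_sent += len(new_msg)
        return "".join(self._server_cache)

history = []
stateless, socketed = StatelessLoop(), SocketSession()
for msg in ["plan", "call_tool", "observe", "answer"]:
    a = stateless.turn(history, msg)
    b = socketed.turn(history, msg)
    history.append(msg)

assert a == b                                      # same final context
assert socketed.bytes_sent < stateless.bytes_sent  # far less re-sent
```

Same answer either way; the socketed loop just stops paying the re-upload tax on every hop, which is the whole trick.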
**Stakes:** If you're running multi-agent workflows on the old pattern, you're paying for latency and reprocessing on every single hop, and users feel it.

**Twist:** The big win wasn't a smarter model or a new GPU. It was basic web plumbing from 2011 that most of us already use for chat apps.

**Takeaway:** Persistent connections plus connection-scoped caching is the new default for serious agent loops. Rewire LangGraph tool nodes accordingly.

### Beat 2 [STANDARD] — Microsoft Research ships AutoAdapt to kill the manual grind of tuning LLMs for specific industries

**Source:** Microsoft Research Blog | https://www.microsoft.com/en-us/research/blog/autoadapt-automated-domain-adaptation-for-large-language-models/

**Hook (open with this):** Microsoft Research just dropped AutoAdapt, a pipeline that automates the ugly part of making a general model actually work in law, medicine, or cloud incident response.

**Plain English:** Domain adaptation today is a mess of hand-labeled data, hand-tuned prompts, and fine-tuning runs nobody can reproduce. AutoAdapt automates the loop: it figures out where the base model is weak in your domain, generates the training signal, and adapts the model with way less human babysitting. Microsoft tested it on cloud incident triage and saw real accuracy jumps.

**Stakes:** Keep adapting models the old way and every new vertical costs you weeks of engineer time plus a fine-tune bill you cannot justify.

**Twist:** The win is not a bigger model. It is automating the diagnosis step: figuring out what the model does not know before you spend a dime on training.

**Takeaway:** If you sell into regulated niches like elder care or marine ops, AutoAdapt-style pipelines are the cheat code for vertical LLMs.

### Beat 3 [STANDARD] — Anthropic and Amazon lock in up to 5 gigawatts of new AWS compute for Claude

**Source:** Anthropic Blog | https://www.anthropic.com/news/anthropic-amazon-compute

**Hook (open with this):** Anthropic just called dibs on up to 5 gigawatts of fresh AWS capacity, and that is not a typo.

**Plain English:** Anthropic and Amazon are expanding their deal so Claude gets a massive runway of Trainium and GPU capacity on AWS. Five gigawatts is small-nuclear-plant territory, spun up specifically to keep Claude API throughput climbing. Translation: more tokens per second, fewer 529 overloaded errors, and headroom for the next Claude model generation.

**Stakes:** Ignore this and you'll keep architecting around Claude rate limits that are about to loosen, leaving cheaper and faster routes on the table.

**Twist:** Anthropic is famously GPU-agnostic in its messaging, but the scale here makes Trainium the quiet workhorse, not Nvidia.

**Takeaway:** Claude capacity is no longer the bottleneck, so route your heavy reasoning jobs through it with confidence.

### Beat 4 [STANDARD] — a16z drops a Department of War contracting playbook for startups chasing defense dollars

**Source:** a16z AI | https://www.a16z.news/p/dow-contracting-for-startups-101

**Hook (open with this):** a16z just published a Department of War contracting primer for founders, updated for April 2026, and it is a cheat sheet for anyone eyeing defense money.

**Plain English:** The Pentagon rebranded to Department of War, and a16z is walking startups through how to actually sell into it. They cover the contract vehicles, the pilots-to-production path, and why most founders stall at the prototype stage. It is not a tutorial, it is a map of the maze.

**Stakes:** Ignore this and you will burn a year chasing an SBIR when a direct-to-phase-three or an OTA would have closed in a quarter.
**Twist:** The biggest unlock is not the tech, it is knowing which program office has unspent fiscal year money and a warm champion.

**Takeaway:** If defense is on your roadmap, read the a16z primer this week and pick your contract vehicle before you pick your pilot.

### Beat 5 [RAPID_FIRE] — Cloneable raises $4.6M to copy expert field workers into agents for utilities and infrastructure

**Source:** Crunchbase News (AI) | https://news.crunchbase.com/venture/cloneable-cloning-expert-worker-knowledge-ai-infrastructure/

**Callbacks:** references Beat 2. Reference these earlier beats aloud when narrating this one.

**Hook (open with this):** Cloneable just grabbed $4.6 million to literally shadow your best lineman and turn him into software.

**Plain English:** Cloneable sends AI to watch expert workers in energy and heavy infrastructure, then bottles their workflow into an autonomous agent. Think retiring grid techs and pipeline inspectors whose know-how usually walks out the door. Seed round, agents in the field, not in a chat window.

**Stakes:** Ignore this and your tribal knowledge retires with the boomers while Cloneable's customers keep running.

**Twist:** The moat isn't the model, it's physical access to experts nobody else can get near.

**Takeaway:** Capture expert workflows now, because the next defensible agent business is shadowing humans most founders will never meet.

### Beat 6 [RAPID_FIRE] — arXiv paper lays out an evidence-synthesis framework for judging agents past task success

**Source:** arXiv cs.MA | https://arxiv.org/abs/2604.19818

**Callbacks:** references Beat 2, Beat 5. Reference these earlier beats aloud when narrating this one.

**Hook (open with this):** Academics just published the scorecard your agents actually need. Task completion is not a safety signal when agents are clicking real buttons in the real world.
**Plain English:** A new arXiv synthesis pulls together benchmarks, governance standards, orchestration patterns, and runtime guardrails into one evaluation framework for agentic systems. The pitch: if your agent plans, calls tools, and touches external systems, a green checkmark on task success tells you almost nothing about whether it's safe to ship.

**Stakes:** Ship a LangGraph agent to an elder care client on task-pass-rate alone and you will learn about the gaps the expensive way.

**Twist:** The authors argue runtime assurance matters more than pre-deployment benchmarks, because agent behavior drifts the moment it hits real tools.

**Takeaway:** Before your next agent goes live, bolt on runtime checks, not just eval suites.

### Beat 7 [RAPID_FIRE] — Marginal Revolution flags a paper where agentic AI matches human economists on causal inference, with tighter tails

**Source:** Marginal Revolution | https://marginalrevolution.com/marginalrevolution/2026/04/a-comparison-of-agentic-ai-systems-and-human-economists.html?utm_source=rss&utm_medium=rss&utm_campaign=a-comparison-of-agentic-ai-systems-and-human-economists

**Callbacks:** references Beat 6. Reference these earlier beats aloud when narrating this one.

**Hook (open with this):** Marginal Revolution just posted a bakeoff: agentic AI systems versus actual human economists on causal inference tasks, and the bots held their own.

**Plain English:** Researchers had AI agents and trained economists estimate the same causal effects. Median answers landed in roughly the same place. The humans actually had wider tails, meaning more wild misses, while the AI instances clustered tighter around the middle.

**Stakes:** Dismiss this and you'll keep paying senior analysts to do work your agent swarm can median-vote in an afternoon.

**Twist:** Humans were the ones producing the crazy outlier estimates, not the models.

**Takeaway:** For causal questions, run a jury of agents and take the median before you book the economist.
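The median-vote trick is two lines of code. A minimal sketch, with invented numbers, of why the median shrugs off the occasional agent that goes off the rails:

```python
# Toy "jury of agents": collect independent estimates of the same causal
# effect and take the median. One wild outlier barely moves the answer.
# The estimates below are invented for illustration.

from statistics import median

def jury_estimate(estimates: list[float]) -> float:
    """Aggregate independent causal-effect estimates by median vote."""
    if not estimates:
        raise ValueError("need at least one estimate")
    return median(estimates)

# Five hypothetical agent runs; the last one hallucinates badly.
runs = [0.42, 0.45, 0.40, 0.44, 9.90]
print(jury_estimate(runs))  # → 0.44
```

A mean would have been dragged to roughly 2.3 by that one bad run; the median stays at 0.44, which is the whole argument for juries over single shots.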
### Beat 8 [RAPID_FIRE] — Simon Willison upgrades his Claude Token Counter with model-to-model comparisons

**Source:** Simon Willison | https://simonwillison.net/2026/Apr/20/claude-token-counts/#atom-everything

**Callbacks:** references Beat 3. Reference these earlier beats aloud when narrating this one.

**Hook (open with this):** Simon Willison just shipped a Claude Token Counter that pits models against each other, and Opus 4.7 is the first Claude to actually swap tokenizers.

**Plain English:** Willison's free tool now runs the same prompt through Opus 4.7, Opus 4.6, Sonnet 4.6, and Haiku 4.5 so you can see token counts side by side. Opus 4.7 changed the tokenizer, so its counts drift from 4.6. Everything else still tokenizes the same.

**Stakes:** Budget on 4.6 numbers, ship on 4.7, and your cost model quietly lies to finance every single week.

**Twist:** Anthropic held the tokenizer steady for years, then moved it on Opus 4.7 only, so cross-model math you trusted last month is now wrong.

**Takeaway:** Re-run your top prompts through Willison's counter before you re-sign any Claude spend commit.

### Beat 9 [RAPID_FIRE] — Zvi Mowshowitz calls it Claude Opus 4.7 week, and the agent crowd should care

**Source:** Zvi Mowshowitz | https://thezvi.substack.com/p/ai-165-in-our-image

**Callbacks:** references Beat 3, Beat 8. Reference these earlier beats aloud when narrating this one.

**Hook (open with this):** Zvi Mowshowitz dropped AI #165 and christened it the week of Claude Opus 4.7.

**Plain English:** Anthropic shipped a new Opus tier and Zvi's weekly roundup says it's the story that matters. He frames Opus 4.7 as a real step on agent workloads and long reasoning, not just a version bump. Read his post for the vibe check before you re-benchmark.

**Stakes:** Skip Zvi's recap and you'll miss the qualitative read on Opus that benchmarks won't give you.

**Twist:** Zvi titles the piece In Our Image, hinting the behavior shift matters more than the score deltas.
**Takeaway:** Re-run your hardest agent traces on Opus 4.7 this week and compare against your current default.

### Beat 10 [RAPID_FIRE] — DR-Venus packs a frontier deep research agent into a 4B model trained on just 10K open samples

**Source:** arXiv cs.LG | https://arxiv.org/abs/2604.19859

**Callbacks:** references Beat 2, Beat 9. Reference these earlier beats aloud when narrating this one.

**Hook (open with this):** DR-Venus just proved a 4B model, trained on 10,000 open examples, can hang with the big research agents.

**Plain English:** The DR-Venus team built a small deep-research agent meant to run on edge hardware, not a hyperscaler. They squeezed every drop out of 10K open samples by obsessing over data quality and utilization, and the 4B result keeps pace with much bigger models on research tasks.

**Stakes:** Keep pushing every research agent to the cloud and you'll burn cash and leak data that a laptop-sized model could have handled.

**Twist:** A 4B model on 10K samples matching frontier agents says the ceiling on small models is mostly a data-recipe problem, not a parameter problem.

**Takeaway:** Pull DR-Venus down through Ollama or MLX this week and benchmark it against your current cloud research agent on a real task.

---

## NOT WORTH YOUR TIME TODAY

Do not cover on air. These are listed so the host can acknowledge if asked.

- **Cloneable raises $4.6M to 'clone' expert worker knowledge for utilities and infrastructure** -- Seed round with a buzzword-stuffed pitch deck. Wake me up when a utility actually signs a PO.
- **arXiv paper: Federated Learning over Blockchain-Enabled Cloud Infrastructure** -- Three 2019 buzzwords in a trench coat. If you need federated learning AND blockchain, you need neither.
- **Tyler Cowen's 'From the UAE' dispatch on Marginal Revolution** -- Travelogue vibes, not builder signal. Skip unless you're shopping for a Dubai office.

---

## ACTION ITEMS FOR THIS WEEK (Joey only)

These are internal action items.
Not for on-air narration.

- Rewire the Mojo LangGraph tool loop onto the new OpenAI Responses WebSocket transport this week, then benchmark p50 and p95 latency against the old HTTP path and post the delta in the team channel.
- Pull DR-Venus down via Ollama, point it at three real Mojo research prompts, and score it head to head against our current cloud research agent on cost, latency, and answer quality.
- Re-run our top ten production prompts through Simon Willison's updated Claude Token Counter before renewing any Anthropic spend commit, and send the cross-model cost table to finance by Friday.

---

## MOJO TAKE -- Editorial Outro (Read Verbatim)

Three-paragraph outro. Read each block verbatim, with natural pauses between them.

> **Connect the dots:** Today's thread: agents are eating the stack top to bottom. OpenAI cuts latency with WebSockets, Microsoft Research automates vertical tuning, Anthropic and Amazon pour 5 gigawatts into Claude, and Cloneable clones field experts. Meanwhile arXiv and Marginal Revolution say judge these agents harder. Speed is table stakes. Trust is the new moat.

> **Watch next:** Watch Opus 4.7 benchmarks land midweek, courtesy of Zvi. Watch whether Simon Willison's token counter changes anyone's Claude commit math. And watch DR-Venus get forked. A 4B deep research agent on 10K samples is the kind of result that breaks pricing models.

> **Sign-off:** That's your cup. Go rewire a tool call, run a jury of agents, and ship something Joey can brag about tomorrow. Mojo out.

---

## APPENDIX -- VERBATIM SOURCE CONTENT

Reference material. Do not read verbatim. Do not pull jargon from here into the spoken script. If the rundown beat does not mention a term, do not introduce it on the podcast.
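The p50/p95 benchmark in the action items only needs a percentile helper. A minimal sketch (sample latencies invented; nearest-rank definition, which is fine for a team-channel delta):

```python
# Helper for the p50/p95 latency benchmark in the action items above.
# Nearest-rank percentile over raw per-call latency samples; the sample
# numbers below are invented placeholders, not real measurements.

import math

def percentile(samples: list[float], pct: float) -> float:
    """Nearest-rank percentile: smallest value covering pct% of samples."""
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]

http_ms = [120, 135, 128, 410, 122, 131, 125, 390, 127, 124]
ws_ms = [60, 66, 58, 90, 61, 64, 59, 95, 62, 63]

for name, xs in [("http", http_ms), ("ws", ws_ms)]:
    print(name, "p50:", percentile(xs, 50), "p95:", percentile(xs, 95))
```

Report both percentiles, not just the mean: the p95 is where the cold-cache round trips hide.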
### OpenAI cut Codex latency by swapping HTTP for WebSockets in the Responses API

**Source:** OpenAI Blog
**Link:** https://openai.com/index/speeding-up-agentic-workflows-with-websockets

*RSS summary:* A deep dive into the Codex agent loop, showing how WebSockets and connection-scoped caching reduced API overhead and improved model latency.

### arXiv paper lays out an evidence-synthesis framework for judging agents past task success

**Source:** arXiv cs.MA
**Link:** https://arxiv.org/abs/2604.19818

*Title:* Beyond Task Success: An Evidence-Synthesis Framework for Evaluating, Governing, and Orchestrating Agentic AI

*Abstract:* Agentic AI systems plan, use tools, maintain state, and act across multi-step workflows with external effects, meaning trustworthy deployment can no longer be judged by task completion alone. The current literature remains fragmented across benchmark-centered evaluation, standards-based governance, orchestration architectures, and runtime assurance mechanisms. This paper contributes a bounded evidence synthesis across a manually coded corpus of twenty-four recent sources. The core finding is a governance-to-action closure gap: evaluation tells us whether outcomes were good, governance defines what should be allowed, but neither identifies where obligations bind to concrete actions or how compliance can later be proven. To close that gap, the paper introduces three linked artifacts: (1) a four-layer framework spanning evaluation, governance, orchestration, and assurance; (2) an ODTA runtime-placement test based on observability, decidability, timeliness, and attestability; and (3) a minimum action-evidence bundle for state-changing actions.
Across sources, evaluation papers identify safety, robustness, and trajectory-level measurement as open gaps; governance frameworks define obligations but omit execution-time control logic; orchestration research positions the control plane as the locus of policy mediation, identity, and telemetry; runtime-governance work shows path-dependent behavior cannot be governed through prompts or static permissions alone; and action-safety studies show text alignment does not reliably transfer to tool actions. A worked enterprise procurement-agent scenario illustrates how these artifacts consolidate existing evidence without introducing new experimental data.

### DR-Venus packs a frontier deep research agent into a 4B model trained on just 10K open samples

**Source:** arXiv cs.LG
**Link:** https://arxiv.org/abs/2604.19859

*Title:* DR-Venus: Towards Frontier Edge-Scale Deep Research Agents with Only 10K Open Data

*Abstract:* Edge-scale deep research agents based on small language models are attractive for real-world deployment due to their advantages in cost, latency, and privacy. In this work, we study how to train a strong small deep research agent under limited open data by improving both data quality and data utilization. We present DR-Venus, a frontier 4B deep research agent for edge-scale deployment, built entirely on open data. Our training recipe consists of two stages. In the first stage, we use agentic supervised fine-tuning (SFT) to establish basic agentic capability, combining strict data cleaning with resampling of long-horizon trajectories to improve data quality and utilization. In the second stage, we apply agentic reinforcement learning (RL) to further improve execution reliability on long-horizon deep research tasks.
To make RL effective for small agents in this setting, we build on IGPO and design turn-level rewards based on information gain and format-aware regularization, thereby enhancing supervision density and turn-level credit assignment. Built entirely on roughly 10K open-data samples, DR-Venus-4B significantly outperforms prior agentic models under 9B parameters on multiple deep research benchmarks, while also narrowing the gap to much larger 30B-class systems. Our further analysis shows that 4B agents already possess surprisingly strong performance potential, highlighting both the deployment promise of small models and the value of test-time scaling in this setting. We release our models, code, and key recipes to support reproducible research on edge-scale deep research agents.

### Microsoft Research ships AutoAdapt to kill the manual grind of tuning LLMs for specific industries

**Source:** Microsoft Research Blog
**Link:** https://www.microsoft.com/en-us/research/blog/autoadapt-automated-domain-adaptation-for-large-language-models/

At a glance:

- Problem: Adapting large language models to specialized, high-stakes domains is slow, expensive, and hard to reproduce.
- What we built: AutoAdapt automates planning, strategy selection (e.g., RAG vs. fine-tuning), and tuning under real deployment constraints.
- How it works: A structured configuration graph maps the full scope of the adaptation process, an agentic planner selects and sequences the right steps, and a budget-aware optimization loop (AutoRefine) refines the process within defined constraints.
- Why it matters: The result is faster, automated, more reliable domain adaptation that turns weeks of manual iteration into repeatable pipelines.

Deploying large language models (LLMs) in real-world, high-stakes settings is harder than it should be.
In high-stakes settings like law, medicine, and cloud incident response, performance and reliability can quickly break down because adapting models to domain-specific requirements is a slow and manual process that is difficult to reproduce. The core challenge is domain adaptation, which entails turning a general-purpose model into one that consistently follows domain rules, draws on the right knowledge, and meets constraints such as latency, privacy, and cost. Today, that process typically involves guesswork: choosing among approaches like retrieval-augmented generation (RAG) and fine-tuning, tuning hyperparameters, and iterating through evaluations with no clear path to a good outcome. An operations team responding to an outage can't afford a model that drifts from domain requirements or a tuning process that takes weeks with no guarantee of a reproducible result.

To tackle this, we're pleased to introduce AutoAdapt. In our paper, "AutoAdapt: An Automated Domain Adaptation Framework for Large Language Models," we describe an end-to-end, constraint-aware framework for domain adaptation. Given a task objective, available domain data, and practical requirements like accuracy, latency, hardware, and budget, AutoAdapt plans a valid adaptation pipeline, selecting among approaches like RAG and multiple fine-tuning methods, and tunes key hyperparameters using a budget-aware refinement loop. The result is an executable, reproducible workflow for building domain-ready models more quickly and consistently, helping make LLMs dependable in real-world settings.

**How it works**

AutoAdapt starts from a practical observation: teams don't just need a better prompt or more data, they need a decision process that reliably maps a task, its domain data, and real constraints to an approach that works. To do this, AutoAdapt treats domain adaptation as a constrained planning problem.
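The "constrained planning" framing above can be sketched as a toy decision function. To be clear, this is not Microsoft's code: the strategies, thresholds, and constraint fields are all invented to show the shape of the decision, nothing more.

```python
# Toy illustration of "domain adaptation as constrained planning."
# NOT AutoAdapt's actual code; strategies, constraints, and thresholds
# below are invented to show how constraints can gate the choice.

from dataclasses import dataclass

@dataclass
class Constraints:
    labeled_examples: int    # how much domain data exists
    gpu_hours_budget: int    # compute budget for tuning
    data_may_leave_vpc: bool # privacy requirement

def pick_strategy(c: Constraints) -> str:
    """Map task constraints to a hypothetical adaptation approach."""
    if not c.data_may_leave_vpc:
        return "local-fine-tune"     # privacy rules out hosted retrieval
    if c.labeled_examples < 1_000:
        return "rag"                 # too little data to fine-tune well
    if c.gpu_hours_budget < 50:
        return "rag-plus-prompting"  # data exists, compute budget doesn't
    return "fine-tune"

print(pick_strategy(Constraints(200, 500, True)))    # → rag
print(pick_strategy(Constraints(50_000, 10, True)))  # → rag-plus-prompting
```

The real system searches a far larger configuration graph, but the point survives the toy: the constraints, not taste, should pick the approach.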
Given an objective provided in natural language, dataset size and format, and limits on latency, hardware, privacy, and cost, it provides an end-to-end pipeline that teams can execute and deploy. Domain adaptation often feels like trial and error because the design space is large and complex. Teams must choose among approaches such

### Anthropic and Amazon lock in up to 5 gigawatts of new AWS compute for Claude

**Source:** Anthropic Blog
**Link:** https://www.anthropic.com/news/anthropic-amazon-compute

**Anthropic and Amazon expand collaboration for up to 5 gigawatts of new compute**

We have signed a new agreement with Amazon that will deepen our existing partnership and secure up to 5 gigawatts (GW) of capacity for training and deploying Claude, including new Trainium2 capacity coming online in the first half of this year and nearly 1 GW total of Trainium2 and Trainium3 capacity coming online by the end of 2026. We have worked closely with Amazon since 2023 and over 100,000 customers now run Claude on Amazon Bedrock. Together we launched Project Rainier, one of the largest compute clusters in the world, and we currently use over one million Trainium2 chips to train and serve Claude.

Today's agreement expands our collaboration in three ways.

**Infrastructure at scale.** We are committing more than $100 billion over the next ten years to AWS technologies, securing up to 5 GW of new capacity to train and run Claude. The commitment spans Graviton and Trainium2 through Trainium4 chips, with the option to purchase future generations of Amazon's custom silicon as they become available. Significant Trainium2 capacity is coming online in Q2 and scaled Trainium3 capacity is expected to come online later this year. Anthropic will also use incremental capacity for Claude in Amazon Bedrock. The agreement includes expansion of inference in Asia and Europe to better serve Claude's growing international customer base.
We continue to choose AWS as our primary training and cloud provider for mission-critical workloads.

"Our custom AI silicon offers high performance at significantly lower cost for customers, which is why it's in such hot demand," said Andy Jassy, CEO of Amazon. "Anthropic's commitment to run its large language models on AWS Trainium for the next decade reflects the progress we've made together on custom silicon, as we continue delivering the technology and infrastructure our customers need to build with generative AI."

**Claude Platform on AWS.** The full Claude Platform will be available directly within AWS. Same account, same controls, same billing, with more Claude Platform features and no additional credentials or contracts necessary. This gives organizations direct access to Claude while meeting their existing governance and compliance requirements. Claude remains the only frontier AI model available to customers on all three of the world's largest cloud platforms: AWS (Bedrock), Google Cloud (Vertex AI), and Microsoft Azure (Foundry). Claude Platform on AWS is coming soon. Reach out to your account team to request access.

**Continued investment.** Amazon is investing $5 billion in Anthropic today, with up to an additional $20 billion in the future. This builds on the $8 billion Amazon has previously invested.

"Our users tell us Claude is increasingly essential to how they work, and we need to build the infrastructure to keep pace with rapidly growing demand," said Dario Amodei, CEO and co-founder of Anthropic. "Our collaboration with Amazon will allow us to

### Simon Willison upgrades his Claude Token Counter with model-to-model comparisons

**Source:** Simon Willison
**Link:** https://simonwillison.net/2026/Apr/20/claude-token-counts/#atom-everything

20th April 2026 - Link Blog

**Claude Token Counter, now with model comparisons.** I upgraded my Claude Token Counter tool to add the ability to run the same count against different models in order to compare them.
As far as I can tell Claude Opus 4.7 is the first model to change the tokenizer, so it's only worth running comparisons between 4.7 and 4.6. The Claude token counting API accepts any Claude model ID though, so I've included options for all four of the notable current models (Opus 4.7 and 4.6, Sonnet 4.6, and Haiku 4.5).

In the Opus 4.7 announcement Anthropic said:

> Opus 4.7 uses an updated tokenizer that improves how the model processes text. The tradeoff is that the same input can map to more tokens—roughly 1.0–1.35× depending on the content type.

I pasted the Opus 4.7 system prompt into the token counting tool and found that the Opus 4.7 tokenizer used 1.46x the number of tokens as Opus 4.6. Opus 4.7 uses the same pricing as Opus 4.6 - $5 per million input tokens and $25 per million output tokens - but this token inflation means we can expect it to be around 40% more expensive.

The token counter tool also accepts images. Opus 4.7 has improved image support, described like this:

> Opus 4.7 has better vision for high-resolution images: it can accept images up to 2,576 pixels on the long edge (~3.75 megapixels), more than three times as many as prior Claude models.

I tried counting tokens for a 3456x2234 pixel 3.7MB PNG and got an even bigger increase in token counts - 3.01x the number of tokens for 4.7 compared to 4.6.

Update: That 3x increase for images is entirely due to Opus 4.7 being able to handle higher resolutions. I tried that again with a 682x318 pixel image and it took 314 tokens with Opus 4.7 and 310 with Opus 4.6, so effectively the same cost.

Update 2: I tried a 15MB, 30-page text-heavy PDF and Opus 4.7 reported 60,934 tokens while 4.6 reported 56,482 - that's a 1.08x multiplier, significantly lower than the multiplier I got for raw text.
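The cost claim above is simple arithmetic worth having on hand for the finance table. A minimal sketch using the $5/M input price and the observed 1.46x system-prompt multiplier from the text; the 10M-token monthly volume is an invented example:

```python
# Sanity-check the Willison cost claim: same per-token price on both
# models, but 1.46x the input tokens on Opus 4.7 for the same text.
# Price and multiplier come from the post above; the monthly volume
# is a made-up example figure.

PRICE_PER_M_INPUT = 5.00        # USD per million input tokens, both models
TOKEN_MULTIPLIER = 1.46         # observed 4.7 vs 4.6 on the system prompt

monthly_tokens_46 = 10_000_000  # hypothetical monthly input volume on 4.6
cost_46 = monthly_tokens_46 / 1_000_000 * PRICE_PER_M_INPUT
cost_47 = monthly_tokens_46 * TOKEN_MULTIPLIER / 1_000_000 * PRICE_PER_M_INPUT

print(f"4.6: ${cost_46:.2f}  4.7: ${cost_47:.2f}  increase: {cost_47 / cost_46 - 1:.0%}")
```

Note the multiplier is content-dependent (1.46x on raw text, 1.08x on that PDF), so run the real top-ten prompts through the counter rather than trusting one ratio.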
### Zvi Mowshowitz calls it Claude Opus 4.7 week, and the agent crowd should care

**Source:** Zvi Mowshowitz
**Link:** https://thezvi.substack.com/p/ai-165-in-our-image

**AI #165: In Our Image**

This was the week of Claude Opus 4.7. The reception was more mixed than usual. It clearly has the intelligence and chops, especially for coding tasks, and a lot of people including myself are happy to switch over to it as our daily driver. But others don't like its personality, or its reluctance to follow instructions or to suffer fools and assholes, or the requirement to use adaptive thinking, and the release was marred by some bugs and odd pockets of refusals.

I covered The Model Card, and then Capabilities and Reactions, as per usual. This time there was also a third post, on Model Welfare, that is the most important of the three. Some things seem to have likely gone pretty wrong on those fronts, causing seemingly inauthentic responses to model welfare evals and giving the model anxiety, in ways that likely also impacted overall model personality and performance and likely are linked to its jaggedness and the aspects some people disliked. It seems important to take this opportunity to dig into what might have happened, examine all the potential causes, and course correct.

The other big release was that OpenAI gave us ImageGen 2.0, which is a pretty fantastic image generator. It can do extreme detail, in ways previous image models cannot, and in many ways your limit is mainly now your imagination and ability to describe what you want.

Thanks in part to Mythos, it looks like Anthropic and the White House are on track to start getting along again, with Trump shifting into a mode of 'they are very high IQ and we can work with them.' It will remain messy, and there are still others participating in a clear public coordinated campaign against Anthropic (that is totally not working), but things look good.
I’m trying out a new section, People Just Say Things, where I hope to increasingly put things that one does not want to drop silently to avoid censorship and bias, but that are highly skippable. There is also a companion, People Just Publish Things.

Table of Contents

- Language Models Offer Mundane Utility. Help cure pancreatic cancer.
- Language Models Don’t Offer Mundane Utility. Check for potential conflicts.
- Writing You Off. The sum of local correctness will neuter your writing. Beware.
- Get My Agent On The Line. The inbox dilemma.
- Deepfaketown and Botpocalypse Soon. AI news stories forcibly given real bylines.
- Fun With Media Generation. OpenAI introduces ImageGen 2.0. It’s great.
- Cyber Lack Of Security. Unauthorized users from an online forum access Mythos.
- A Young Lady’s Illustrated Primer. Don’t catch your child not using AI.
- They Took Our Jobs. We’re hiring agent operators. For now they’re humans.
- AI As Normal Technology. Inherently normal, or normal downstream effects?
- Get Involved. Please don’t kill us. Please do spread the word.
- Introducing. ChatGPT for Clinicians, OpenAI Workplace Agents, DeepMind DR.
- Design By Claude. Claude Design makes your presentations, Figma stock drops.
- In Other AI News. Meta installs mandatory tra

### a16z drops a Department of War contracting playbook for startups chasing defense dollars

**Source:** a16z AI
**Link:** https://www.a16z.news/p/dow-contracting-for-startups-101

DoW Contracting for Startups 101

Updated for April 2026: The following is a high-level primer intended to arm you with a rough framework of how to approach selling to the DoW today.

We wrote a first version of this piece one year ago, and a lot has changed! In late 2025, landmark statutory and regulatory reforms changed how federal acquisition works. The full impact of those changes is still manifesting as of early 2026, and it's too early to tell whether they will prove to be incremental or more significant.
We've updated this piece here, although the core framework remains evergreen.

For startups, working with the Department of War (DoW) can feel like stepping into a fortress of bureaucracy — procurement is slow, compliance is daunting, and knowing who actually makes buying decisions is like navigating a maze in the dark. Yet, for those who learn to successfully navigate these hurdles, the rewards can be enormous. The DoW is one of the world’s largest and most stable customers, spending hundreds of billions annually on modern defense systems and technology. Unlike volatile consumer markets, defense contracts offer not just funding, but also long-term sustainment. Winning requires persistence but results in predictable margins on predictable revenue streams. And beyond the financial incentives, building technology that strengthens national security carries undeniable appeal in supporting our country and our interests as Americans.

But this market is not for the impatient. Sales cycles take years, compliance is non-negotiable, and understanding who buys what, and through which funding mechanisms, is as important as having a strong product. Startups must approach the DoW with a strategic, long-term mindset. The opportunities are vast, and so are the barriers to entry. While we have high hopes for the DoW’s recent procurement reform, the following is a high-level primer intended to arm you with a rough framework of how to approach selling to the DoW today.

**How the DoW buys technology … today**

Selling to the DoW isn’t as simple as pitching a product and securing a contract. The military operates on a rigid, multi-year budgeting system — PPBE: Planning, Programming, Budgeting, and Execution — that dictates what gets funded, how, and when. Unlike the fast pace of technology startups, DoW funding moves slowly through layers of approval.
The Future Years Defense Program (FYDP) is a rolling five-year budgeting cycle, updated annually, that outlines the Pentagon’s planned spending across programs, priorities, and force structure, thus shaping long-term defense investments. Each branch submits a Program Objective Memorandum (POM) — a funding wishlist — refined and approved at multiple levels before allocation. But even inclusion in a POM doesn’t guarantee funding; Congress must first approve it through the NDAA and the Defense Appropriations bill. Ignoring this political reality can stall progress, and startups that engage ear

### Cloneable raises $4.6M to copy expert field workers into agents for utilities and infrastructure

**Source:** Crunchbase News (AI)
**Link:** https://news.crunchbase.com/venture/cloneable-cloning-expert-worker-knowledge-ai-infrastructure/

Cloneable, a startup that uses AI to shadow human experts in heavy industries such as energy and replicate their specialized workflows into autonomous agents, has raised $4.6 million in seed funding, the company tells Crunchbase News exclusively.

Congruent Ventures led the raise, which included participation from First In, Overline, Bull City Venture Partners, and St. Elmo Venture Capital, the investment arm of customer Texas Area Telecom. It brings the Raleigh, North Carolina-based startup’s total raised to $5.35 million since its 2023 inception.

The idea for Cloneable traces back to a bottleneck its founders encountered years earlier while working in the field. In 2019, as wildfires ravaged California, co-founders Lia Reich, Tyler Collins and Patrick Lohman — founding employees at drone company PrecisionHawk — were deployed to help inspect critical infrastructure. Their team sent out 150 drone pilots to survey thousands of miles of transmission lines. But reviewing that data proved far less scalable.
When Reich visited a PG&E utility command center weeks later, she saw hundreds of workers manually scrubbing through video footage, while only a handful of experts knew what to look for. “It was an ‘aha’ moment,” she recalled. “We realized this cannot be the way. If we know what the expert is looking for, why can’t we just clone that expertise?”

The startup’s founders realized that heavy industries — energy, oil and gas, and agriculture — face a “knowledge crisis” as experienced workers retire faster than they can be replaced. “For every young worker entering the energy workforce, 2.4 experienced ones are walking out the door toward retirement. And it’s happening right as energy demand is set to double by 2050,” Reich, the company’s CEO, told Crunchbase News.

Cloneable aims to capture and preserve that kind of institutional knowledge. In February 2025, it launched Cloneable Field for automated infrastructure inspection targeting the energy sector. Alongside the fundraise, the company is now launching an agentic product that codifies expert knowledge and deploys it as scalable AI agents. The funding will also support expansion into infrastructure-heavy industries such as public utilities, vegetation management, construction, rail, mining, agriculture and manufacturing.

“These are markets chronically underserved by point solutions,” Reich said. “No one has combined in-field data collection with agentic automation at the scale these industries require.” That includes workers’ judgment and institutional knowledge not captured in documentation or general AI models, according to Reich. “Cloneable automates workflows that have traditionally been considered too complex for automation,” she said.

The company claims that a process that typically takes a human engineer eight hours, such as the structural calculations for a project to replace, upgrade or install 25 utility poles, can be completed by a Cloneable agent in under two minutes.
### Marginal Revolution flags a paper where agentic AI matches human economists on causal inference, with tighter tails

**Source:** Marginal Revolution
**Link:** https://marginalrevolution.com/marginalrevolution/2026/04/a-comparison-of-agentic-ai-systems-and-human-economists.html?utm_source=rss&utm_medium=rss&utm_campaign=a-comparison-of-agentic-ai-systems-and-human-economists

This paper compares agentic AI systems and human economists performing the same causal inference tasks. AI systems and humans generally obtain similar median causal effect estimates. While there is substantial dispersion of estimates across model instances, the human distributions of estimates have wider tails. Using AI models as reviewers to compare and rank “submissions,” the following ranking emerges regardless of reviewer model: (1) Codex GPT-5.4, (2) Codex GPT-5.3-Codex, (3) Claude Code Opus 4.6, and (4) Human Researchers. These findings suggest that agentic AI systems will allow us to scale empirical research in economics.

I enjoy the name of the author, namely Serafin Grundl. Here is the paper, via Ethan Mollick. You could interpret these results as showing the AIs have fewer hallucinations.

And just to reiterate a key point from the paper: the second part of the paper is an AI review tournament in which “submissions” (code and write-ups) from humans and the AI models are compared and ranked against each other. The reviewers are the following AI models: Gemini 3.1 Pro Preview, Opus 4.6 and GPT-5.4. For each review, the reviewer is asked to write a report comparing four submissions (human, Opus 4.6, GPT-5.3-Codex, GPT-5.4). Each reviewer model writes comparison reports for the same 300 comparison groups. The average rankings are strikingly similar across reviewer models: (1) Codex GPT-5.4, (2) Codex GPT-5.3-Codex, (3) Claude Code Opus 4.6, and (4) Human Researchers.

Who comes in last?