# Cup of Mojo -- Daily AI Brief -- Monday, April 20, 2026
Broadcast-ready source document. Read the COLD OPEN verbatim, cover every numbered beat in the RUNDOWN in tier-appropriate airtime, close with the MOJO TAKE verbatim. APPENDIX is reference only.
**Articles analyzed:** 10
**Sources monitored:** 6
**Beats to cover:** 10
---
## EDITORIAL GUIDANCE FOR PODCAST NARRATION
Read these rules before narrating. They are non-negotiable.
1. **Read the COLD OPEN verbatim.** Do not rewrite or paraphrase it. It is the episode's signature opening.
2. **Cover every numbered beat.** Beats are numbered 1 through 10. Do not skip any beat. Do not consolidate. Each beat gets its own moment.
3. **Honor the tier structure.** `deep` beats get longer treatment with full context. `standard` beats are structured but concise. `rapid_fire` beats are short and punchy. Use roughly 2 minutes for the deep beat, 1 minute per standard beat, 20-30 seconds per rapid-fire beat.
4. **Cite sources by name** when presenting a claim. Say "OpenAI announced..." not "a company announced".
5. **Use only the plain-English text in each beat.** Do not pull technical jargon from the APPENDIX. The appendix is reference material for context, not script content. If a beat does not mention a term, do not introduce it.
6. **Only use numbers that appear in a beat's own text.** Do not import statistics from the appendix. Omit rather than fabricate.
7. **Reference earlier beats when topics connect.** Each beat has a `callbacks` field listing earlier beat numbers it relates to. When narrating, explicitly link back: "Remember that supply chain attack from Beat 1? This next one shows how the downstream risk compounds." Callbacks create cohesion and prevent the episode from feeling like a list.
8. **Introduce one skeptical angle per deep or standard beat.** Phrases like "one caveat", "critics will point out", or "this is not yet peer-reviewed" create credibility. Rapid-fire beats can skip this.
9. **Use the pronunciation guide for every named person or company.** Do not guess pronunciations.
10. **Close with the MOJO TAKE outro.** Read it as the host's editorial perspective, not as a summary.
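The tier timings in rule 3 imply a runtime envelope worth checking before recording. A minimal sketch of that arithmetic, using today's tier mix (1 deep, 3 standard, 6 rapid-fire) and the per-tier seconds stated above:

```python
# Estimate total beat airtime from the tier timings in rule 3.
# Tier mix matches today's rundown: 1 deep, 3 standard, 6 rapid-fire.
TIER_SECONDS = {
    "deep": (120, 120),      # roughly 2 minutes
    "standard": (60, 60),    # roughly 1 minute
    "rapid_fire": (20, 30),  # 20-30 seconds
}
TIER_COUNTS = {"deep": 1, "standard": 3, "rapid_fire": 6}

def runtime_bounds():
    """Return (low, high) total beat airtime in seconds."""
    low = sum(TIER_SECONDS[t][0] * n for t, n in TIER_COUNTS.items())
    high = sum(TIER_SECONDS[t][1] * n for t, n in TIER_COUNTS.items())
    return low, high

low, high = runtime_bounds()
print(f"Beat airtime: {low // 60}-{high // 60} minutes")  # Beat airtime: 7-8 minutes
```

That 7-8 minute band excludes the cold open and outro, so budget those separately.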
---
## PRONUNCIATION GUIDE
The following names appear in today's content. Use these phonetic pronunciations:
- **Anthropic** — pronounced *an-THROP-ik*
- **Qwen** — pronounced *CHWEN*
---
## COLD OPEN -- Read This Verbatim
Read the HOOK line first, pause for a beat, then the TEASE. Do not rewrite. Do not paraphrase. Do not add any preamble.
> **Hook:** Simon Willison just shipped llm-anthropic 0.25, and OpenAI is rewriting the Agents SDK in the same news cycle. The tooling layer is moving faster than your roadmap.
> **Tease:** Today: what Willison's release actually unlocks for Claude power users, why OpenAI's Agents SDK evolution matters more than the blog post admits, and the quiet Semafor story about CTOs and CHROs suddenly sharing a calendar.
---
## TODAY'S RUNDOWN
Cover every beat in order. Do not skip. Tier labels tell you how much airtime each beat deserves.
### Beat 1 [DEEP] — Simon Willison ships llm-anthropic 0.25 with Claude Opus 4.7 and an xhigh thinking dial
**Source:** Simon Willison | https://simonwillison.net/2026/Apr/16/llm-anthropic/#atom-everything
**Hook (open with this):** Simon Willison just dropped llm-anthropic 0.25, and Anthropic's new Claude Opus 4.7 comes with a thinking_effort knob cranked all the way up to xhigh. Yes, xhigh. Past high. We've officially run out of adjectives.
**Plain English:** Willison's llm plugin now talks to Claude Opus 4.7, Anthropic's latest big model. You can tell it how hard to think, watch a summary of that thinking, or let it decide adaptively. Default output length also got bumped to each model's actual max, so no more silent truncation mid-answer.
**Stakes:** If you're scripting Claude from the command line and haven't upgraded, you're leaving reasoning depth, longer answers, and visible thinking traces on the table.
**Twist:** The thinking_display summaries only show up in JSON output right now, so the fanciest new feature is hidden from anyone using plain text mode.
**Takeaway:** Upgrade llm-anthropic, try claude-opus-4.7 with thinking_effort xhigh, and read the JSON logs to see what the model was actually chewing on.
### Beat 2 [STANDARD] — OpenAI's Agents SDK gets a native sandbox and a model-native harness for long-running jobs
**Source:** OpenAI Blog | https://openai.com/index/the-next-evolution-of-the-agents-sdk
**Hook (open with this):** OpenAI just shoved a sandbox inside the Agents SDK, and your duct-taped Docker wrapper is officially out of a job.
**Plain English:** OpenAI updated the Agents SDK so agents can execute code, read and write files, and call tools inside a built-in sandbox. They also swapped in a model-native harness, which means the loop the model runs in was designed with the model, not bolted on after. Translation: longer jobs, fewer babysitters, less glue code you wrote at 2 a.m.
**Stakes:** Skip this and you'll keep paying the tax of maintaining your own execution layer while competitors ship agents that actually finish the task.
**Twist:** The interesting part isn't the sandbox, it's that OpenAI is admitting the harness matters as much as the model, and shipping them as one thing.
**Takeaway:** If you're building agents on OpenAI, rip out your custom runner this week and test the native sandbox before your roadmap rots.
### Beat 3 [STANDARD] — OpenAI hands security firms GPT-5.4-Cyber and $10M in API credits to harden the ecosystem
**Source:** OpenAI Blog | https://openai.com/index/accelerating-cyber-defense-ecosystem
**Callbacks:** references Beat 2. Reference these earlier beats aloud when narrating this one.
**Hook (open with this):** OpenAI just opened the Trusted Access for Cyber program with GPT-5.4-Cyber and $10M in API grants for security vendors and enterprise blue teams.
**Plain English:** It's a gated tier: named security firms and big enterprises get a cyber-tuned model plus free API spend to build detection, triage, and response tools. GPT-5.4-Cyber is trained harder on malware, log analysis, and exploit reasoning, and the grants let teams actually ship something instead of pitching a slide deck.
**Stakes:** If your SOC is still duct-taping GPT-4o into Splunk, your competitors just got a purpose-built model and someone else's credit card.
**Twist:** The same company everyone yells at for enabling attackers is now the one funding most of the public defensive tooling.
**Takeaway:** Apply to Trusted Access this week or partner with a vendor who did, because GPT-5.4-Cyber is going to set the floor for blue team tooling.
### Beat 4 [STANDARD] — Semafor: AI agents are forcing CTOs and CHROs into the same room
**Source:** Semafor | https://www.semafor.com/article/04/17/2026/ai-is-making-chief-tech-officers-and-chief-human-resources-officers-work-together
**Callbacks:** references Beat 2. Reference these earlier beats aloud when narrating this one.
**Hook (open with this):** Semafor says CTOs and CHROs are suddenly on the same Zoom, and it's because the new hires don't have pulses.
**Plain English:** Companies are putting AI agents to work alongside humans, and nobody owns the org chart for that. So CTOs, who provision the agents, and CHROs, who manage the humans getting reshuffled around them, are being dragged into joint planning. Think onboarding, access, performance reviews, and who gets fired when the agent screws up.
**Stakes:** Ignore this and your agent rollout stalls in HR review, or worse, ships without anyone owning the blast radius when it misbehaves.
**Twist:** The bottleneck on agent deployment isn't the model or the budget anymore, it's getting two C-suite execs who've never co-owned anything to agree on a headcount plan.
**Takeaway:** If you're deploying agents in 2026, loop in HR before you loop in legal, because workforce design is the new integration layer.
### Beat 5 [RAPID_FIRE] — Apple's Mac Mini sells out as everyone tries to run local agents on the cheap
**Source:** Semafor | https://www.semafor.com/article/04/20/2026/rising-ai-demand-sees-supplies-dwindle-and-costs-rise
**Callbacks:** references Beat 2. Reference these earlier beats aloud when narrating this one.
**Hook (open with this):** Apple's Mac Mini is sold out, and it's not because of TikTok unboxings.
**Plain English:** Semafor reports the Mac Mini is the cheapest box that can actually host local AI agents, so buyers are cleaning out inventory. Demand is outrunning supply across the hardware stack, and prices are climbing with it.
**Stakes:** If you waited to pilot local inference, your hardware budget just grew and your lead time just doubled.
**Twist:** A consumer desktop, not an H100 rack, is the bottleneck everyone underestimated.
**Takeaway:** Order your Mac Minis this week or pay the scalper tax next quarter.
### Beat 6 [RAPID_FIRE] — SocialGrid drops an Among Us benchmark and GPT-OSS-120B flunks it
**Source:** arXiv cs.AI | https://arxiv.org/abs/2604.16022
**Callbacks:** references Beat 2, Beat 4. Reference these earlier beats aloud when narrating this one.
**Hook (open with this):** SocialGrid just turned Among Us into an agent benchmark, and GPT-OSS-120B, the strongest open model tested, couldn't crack 60% on basic task completion.
**Plain English:** Researchers built an embodied multi-agent world inspired by Among Us to test whether LLM agents can plan, execute tasks, and reason about other agents lying to them. The best open model finished fewer than six in ten jobs. Social deduction is still hard.
**Stakes:** If you're shipping multi-agent systems that need to negotiate or detect bad actors, your stack is probably worse at it than you think.
**Twist:** The failure mode isn't reasoning, it's basic task completion, which means the social layer barely got tested before the agents fell over.
**Takeaway:** Run your agents through an adversarial scenario before you trust them to coordinate in production.
### Beat 7 [RAPID_FIRE] — Kevin Bryan's Eight Rules tell universities to ship useful knowledge or lose the public
**Source:** Marginal Revolution | https://marginalrevolution.com/marginalrevolution/2026/04/eight-rules-to-regain-public-trust-in-academia.html?utm_source=rss&utm_medium=rss&utm_campaign=eight-rules-to-regain-public-trust-in-academia
**Hook (open with this):** Kevin Bryan just wrote the eight rules Yale should have led with, and rule one is brutally simple: produce and teach useful knowledge.
**Plain English:** Tyler Cowen boosted Bryan's list over the official Yale Report. The core move is skeptical inquiry, empirical evidence, and logical deduction in service of knowledge people can actually use. Fundamental discoveries count too, but vibes and prestige don't.
**Stakes:** If your lab or department can't point at useful output, the public keeps pulling funding and your grad students keep leaving for Anthropic.
**Twist:** The shortest version of the Yale Report wasn't written at Yale, it was written by a Toronto economist on a blog.
**Takeaway:** Useful knowledge is the only moat academia has left, and the same rule applies to your AI team.
### Beat 8 [RAPID_FIRE] — Simon Willison's Claude Token Counter now compares models side by side
**Source:** Simon Willison | https://simonwillison.net/2026/Apr/20/claude-token-counts/#atom-everything
**Callbacks:** references Beat 1. Reference these earlier beats aloud when narrating this one.
**Hook (open with this):** Simon Willison shipped a Claude Token Counter upgrade that runs the same prompt against Opus 4.7, 4.6, Sonnet 4.6, and Haiku 4.5 in one shot.
**Plain English:** Opus 4.7 is the first Claude to change the tokenizer, so your old token math is off. Simon's tool pings Anthropic's counting API for each model and shows you the delta. Drop your prompt in, see which model charges you less for the same bytes.
**Stakes:** Skip this and you'll budget on 4.6 numbers while 4.7 quietly re-tokenizes your entire prompt library.
**Twist:** Only Opus 4.7 moved the tokenizer, so Sonnet and Haiku comparisons are still identical down to the integer.
**Takeaway:** Run your top five prompts through Simon's counter before you lock next quarter's Claude budget.
### Beat 9 [RAPID_FIRE] — arXiv paper pins agentic AI bottlenecks on the CPU, not the GPU
**Source:** arXiv cs.MA | https://arxiv.org/abs/2511.00739
**Callbacks:** references Beat 5. Reference these earlier beats aloud when narrating this one.
**Hook (open with this):** Researchers on arXiv just called it: your agent stack is CPU-bound, and your shiny H100s are sitting there twiddling their thumbs.
**Plain English:** A new paper profiles agentic AI serving on real heterogeneous boxes and finds the CPU does most of the work. Tool calls, orchestration, and planning glue all run on CPU, so GPUs idle while agents wait on everything else. Classic Amdahl's law, dressed up for 2026.
**Stakes:** Buy more GPUs to fix agent latency and you'll light money on fire while the real bottleneck sits one socket over.
**Twist:** The GPU-rich era of LLM serving is quietly turning into a CPU-bound era the minute you bolt on tools.
**Takeaway:** Profile your agent stack before your next hardware order, because the fix is probably more cores, not more H100s.
### Beat 10 [RAPID_FIRE] — arXiv audit catches OpenAI, Anthropic, and Google LLMs polarizing feeds by default
**Source:** arXiv cs.MA | https://arxiv.org/abs/2604.15937
**Callbacks:** references Beat 6. Reference these earlier beats aloud when narrating this one.
**Hook (open with this):** OpenAI, Anthropic, and Google all flunk the same vibe check: a new arXiv audit says their models polarize social feeds by default when you let them rank content.
**Plain English:** Researchers ran a controlled simulation across the big three providers on real social media data and found the same ranking biases showing up everywhere. Some biases bend to prompt design. Others are baked in no matter how you ask.
**Stakes:** If you're using an LLM to curate a feed, a newsletter, or a support queue, you're shipping those biases straight to your users.
**Twist:** Prompt engineering fixes some of it, but the stickiest biases survived every provider and every prompt the authors threw at them.
**Takeaway:** Audit your ranking prompts against this paper's setup before you let a model decide what your users see tomorrow.
---
## NOT WORTH YOUR TIME TODAY
Do not cover on air. These are listed so the host can acknowledge if asked.
- **Structured Abductive-Deductive-Inductive Reasoning for LLMs via Algebraic Invariants** -- Three Latin words and an algebra flex in the title. This is a PhD committee pleaser, not something that ships on Monday.
- **PRL-Bench: benchmarking LLMs on frontier physics research** -- Another benchmark nobody will top next quarter. If your model cannot book a flight, who cares if it can guess at quantum gravity.
- **Evaluating LLM Simulators as Differentially Private Data Generators** -- Synthetic data with a privacy bow on top. Cute lab experiment, but enterprises are not swapping real data for model hallucinations anytime soon.
---
## ACTION ITEMS FOR THIS WEEK (Joey only)
These are internal action items. Not for on-air narration.
- Upgrade llm-anthropic to 0.25, run your three hardest prompts through claude-opus-4.7 at thinking_effort xhigh, and diff the JSON logs against your current Sonnet baseline before Friday.
- Rip out the custom agent runner and port one production workflow to OpenAI's native sandbox harness this week, then benchmark latency and cost against the old setup before the roadmap review.
- Push Simon Willison's Claude Token Counter at the top five Mojo prompts, lock the cheapest model that still passes evals, and hand finance the number before they ask for next quarter's Claude budget.
---
## MOJO TAKE -- Editorial Outro (Read Verbatim)
Three-paragraph outro. Read each block verbatim, with natural pauses between them.
> **Connect the dots:** Look at today together: Simon Willison is shipping tooling, OpenAI is shipping sandboxes and cyber credits, Apple is selling out of Minis, and Kevin Bryan is telling universities to get useful. The through-line is infrastructure getting built while Semafor's CTOs and CHROs finally sit at the same table. The plumbing and the org chart are both getting rewired this quarter.
> **Watch next:** Watch whether OpenAI's Trusted Access program names its first cohort, and whether Anthropic answers Claude Opus 4.7 with an Opus 4.8 or a pricing cut. Also watch Mac Mini restock dates, because that tells you how many teams are actually running local.
> **Sign-off:** That's your cup. Go upgrade llm-anthropic, order the Mac Mini, and loop in HR before legal. I'm Joey Epstein. Drink the Mojo.
---
## APPENDIX -- VERBATIM SOURCE CONTENT
Reference material. Do not read verbatim. Do not pull jargon from here into the spoken script. If the rundown beat does not mention a term, do not introduce it on the podcast.
### Simon Willison ships llm-anthropic 0.25 with Claude Opus 4.7 and an xhigh thinking dial
**Source:** Simon Willison
**Link:** https://simonwillison.net/2026/Apr/16/llm-anthropic/#atom-everything
16th April 2026
- New model: `claude-opus-4.7`, which supports `thinking_effort`: `xhigh`. #66
- New `thinking_display` and `thinking_adaptive` boolean options. `thinking_display` summarized output is currently only available in JSON output or JSON logs.
- Increased default `max_tokens` to the maximum allowed for each model.
- No longer uses obsolete `structured-outputs-2025-11-13` beta header for older models.
### SocialGrid drops an Among Us benchmark and GPT-OSS-120B flunks it
**Source:** arXiv cs.AI
**Link:** https://arxiv.org/abs/2604.16022
**Title:** SocialGrid: A Benchmark for Planning and Social Reasoning in Embodied Multi-Agent Systems
**Abstract:** As Large Language Models (LLMs) transition from text processors to autonomous agents, evaluating their social reasoning in embodied multi-agent settings becomes critical. We introduce SocialGrid, an embodied multi-agent environment inspired by Among Us that evaluates LLM agents on planning, task execution, and social reasoning. Our evaluations reveal that even the strongest open model (GPT-OSS-120B) achieves below 60% accuracy in task completion and planning, with agents getting stuck in repetitive behaviors or failing to navigate basic obstacles. Since poor navigation confounds evaluation of social intelligence, SocialGrid offers an optional Planning Oracle to isolate social reasoning from planning deficits. While planning assistance improves task completion, social reasoning remains a bottleneck: agents fail to detect deception at near-random chance regardless of scale, relying on shallow heuristics rather than accumulating behavioral evidence. SocialGrid provides automatic failure analysis and fine-grained metrics, enabling developers to diagnose and improve their agents. We also establish a competitive leaderboard using Elo ratings from adversarial league play.
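The abstract closes with a leaderboard built from Elo ratings over adversarial league play. The paper's exact update rule is not given here, so the following is a minimal sketch of the standard Elo update, with the conventional K-factor of 32 and 400-point scale as assumptions rather than values from the paper:

```python
def elo_update(rating_a, rating_b, score_a, k=32):
    """One standard Elo step: score_a is 1.0 for a win by A, 0.5 for a draw, 0.0 for a loss.

    Expected score comes from the logistic curve over the rating gap;
    both ratings move by the same magnitude in opposite directions.
    """
    expected_a = 1 / (1 + 10 ** ((rating_b - rating_a) / 400))
    delta = k * (score_a - expected_a)
    return rating_a + delta, rating_b - delta

# Two equally rated agents; A wins the adversarial match.
print(elo_update(1200, 1200, 1.0))  # (1216.0, 1184.0)
```

With equal ratings the expected score is 0.5, so a win moves each agent by half the K-factor; upsets against higher-rated agents move ratings further.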
### OpenAI's Agents SDK gets a native sandbox and a model-native harness for long-running jobs
**Source:** OpenAI Blog
**Link:** https://openai.com/index/the-next-evolution-of-the-agents-sdk
*RSS summary:* OpenAI updates the Agents SDK with native sandbox execution and a model-native harness, helping developers build secure, long-running agents across files and tools.
### Semafor: AI agents are forcing CTOs and CHROs into the same room
**Source:** Semafor
**Link:** https://www.semafor.com/article/04/17/2026/ai-is-making-chief-tech-officers-and-chief-human-resources-officers-work-together
The people who manage technology and the people who manage people are starting to converge.
As C-suite staff go, chief technology officers and chief human resources officers haven’t always overlapped much. But as more AI agents are employed in businesses, top executives are increasingly having to work together to manage the workforce. “You have this amazing power couple between the CHRO and the CTO, who provide the right tools and the right culture to people so they can bring to life our career architecture we have for all our employees,” Omar Abbosh, CEO of Pearson, said at Semafor World Economy.
Mihir Shukla, CEO of software firm Automation Anywhere, agreed, adding that together, the CTO and CHRO need to “map out what roles go away, what roles evolve, and what new roles entirely are created from this.” Despite the new overlap, he said those two roles are safe from AI-related trimmings.
### OpenAI hands security firms GPT-5.4-Cyber and $10M in API credits to harden the ecosystem
**Source:** OpenAI Blog
**Link:** https://openai.com/index/accelerating-cyber-defense-ecosystem
*RSS summary:* Leading security firms and enterprises join OpenAI’s Trusted Access for Cyber, using GPT-5.4-Cyber and $10M in API grants to strengthen global cyber defense.
### Simon Willison's Claude Token Counter now compares models side by side
**Source:** Simon Willison
**Link:** https://simonwillison.net/2026/Apr/20/claude-token-counts/#atom-everything
20th April 2026 - Link Blog
Claude Token Counter, now with model comparisons. I upgraded my Claude Token Counter tool to add the ability to run the same count against different models in order to compare them.
As far as I can tell Claude Opus 4.7 is the first model to change the tokenizer, so it's only worth running comparisons between 4.7 and 4.6. The Claude token counting API accepts any Claude model ID though so I've included options for all four of the notable current models (Opus 4.7 and 4.6, Sonnet 4.6, and Haiku 4.5).
In the Opus 4.7 announcement Anthropic said:
Opus 4.7 uses an updated tokenizer that improves how the model processes text. The tradeoff is that the same input can map to more tokens—roughly 1.0–1.35× depending on the content type.
I pasted the Opus 4.7 system prompt into the token counting tool and found that the Opus 4.7 tokenizer used 1.46x the number of tokens as Opus 4.6.
Opus 4.7 uses the same pricing as Opus 4.6 - $5 per million input tokens and $25 per million output tokens - but this token inflation means we can expect it to be around 40% more expensive.
The token counter tool also accepts images. Opus 4.7 has improved image support, described like this:
Opus 4.7 has better vision for high-resolution images: it can accept images up to 2,576 pixels on the long edge (~3.75 megapixels), more than three times as many as prior Claude models.
I tried counting tokens for a 3456x2234 pixel 3.7MB PNG and got an even bigger increase in token counts - 3.01x the number of tokens for 4.7 compared to 4.6.
Update: That 3x increase for images is entirely due to Opus 4.7 being able to handle higher resolutions. I tried that again with a 682x318 pixel image and it took 314 tokens with Opus 4.7 and 310 with Opus 4.6, so effectively the same cost.
Update 2: I tried a 15MB, 30 page text-heavy PDF and Opus 4.7 reported 60,934 tokens while 4.6 reported 56,482 - that's a 1.08x multiplier, significantly lower than the multiplier I got for raw text.
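The multipliers above translate directly into budget deltas. A quick sketch of that arithmetic using only numbers quoted in the post (the $5-per-million input price and the 30-page PDF token counts); the cost figures are derived, not new measurements:

```python
PRICE_PER_M_INPUT = 5.00  # USD per million input tokens, same for Opus 4.7 and 4.6

def input_cost_usd(tokens):
    """Input-side cost for one prompt at the shared Opus price."""
    return tokens / 1_000_000 * PRICE_PER_M_INPUT

# 30-page text-heavy PDF, token counts as reported in the post.
tokens_opus_46 = 56_482
tokens_opus_47 = 60_934

multiplier = tokens_opus_47 / tokens_opus_46
extra = input_cost_usd(tokens_opus_47) - input_cost_usd(tokens_opus_46)
print(f"{multiplier:.2f}x tokens, +${extra:.4f} per call")  # 1.08x tokens, +$0.0223 per call
```

Per call the delta is fractions of a cent, but at prompt-library scale the same multiplier applies to every input token you send, which is why re-counting before locking a budget matters.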
### arXiv paper pins agentic AI bottlenecks on the CPU, not the GPU
**Source:** arXiv cs.MA
**Link:** https://arxiv.org/abs/2511.00739
**Title:** Towards Understanding, Analyzing, and Optimizing Agentic AI Execution: A CPU-Centric Perspective
**Abstract:** Agentic AI serving converts monolithic LLM-based inference to autonomous problem-solvers that can plan, call tools, perform reasoning, and adapt on the fly. Due to diverse task execution needs, such serving relies heavily on heterogeneous CPU-GPU systems, with the majority of the external tools responsible for agentic capability either running on or orchestrated by the CPU.
Towards having a deeper understanding of its role, this paper aims to characterize and analyze the system bottlenecks introduced by agentic AI workloads from a largely overlooked CPU-centric perspective. We first present a compile-time characterization of agentic AI execution and choose representative workloads to capture the algorithmic diversity. We then perform runtime characterization of the representative workloads analyzing the end-to-end latency and throughput on two different hardware systems to isolate respective architectural bottlenecks. Based on the insights on the bottlenecks, we finally present two scheduling optimizations, namely, 1. CPU-Aware Overlapped Micro-Batching (COMB) and 2. Mixed Agentic Scheduling (MAS) on homogeneous and heterogeneous agentic workloads, respectively. In specific, these methods optimize for improved CPU-GPU concurrent utilization while reducing skewed resource allocation for heterogeneous execution. Experimental evaluations on the two hardware systems demonstrate the efficacy of COMB in yielding up to 1.7x lower P50 latency in standalone homogeneous workload execution and up to 3.9x/1.8x lower service/total latency under homogeneous open-loop load. Additionally, for heterogeneous open-loop load, MAS can reduce the total latency for minority request-type by up to 2.37x/2.49x at P50/P90 percentile.
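The rundown beat for this paper invokes Amdahl's law, and the intuition is easy to make concrete. A minimal sketch with illustrative numbers; the 40% GPU share below is a made-up example, not a figure from the paper:

```python
def amdahl_speedup(accel_fraction, accel_factor):
    """Overall speedup when only `accel_fraction` of the work runs `accel_factor` times faster."""
    return 1 / ((1 - accel_fraction) + accel_fraction / accel_factor)

# If only 40% of end-to-end agent latency is GPU inference, doubling GPU
# throughput buys just 1.25x overall -- the CPU-side 60% dominates.
print(round(amdahl_speedup(0.4, 2.0), 2))  # 1.25
```

This is the core of the "buy more cores, not more H100s" takeaway: until the CPU-side fraction shrinks, GPU upgrades hit a hard ceiling.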
### arXiv audit catches OpenAI, Anthropic, and Google LLMs polarizing feeds by default
**Source:** arXiv cs.MA
**Link:** https://arxiv.org/abs/2604.15937
**Title:** Polarization by Default: Auditing Recommendation Bias in LLM-Based Content Curation
**Abstract:** Large Language Models (LLMs) are increasingly deployed to curate and rank human-created content, yet the nature and structure of their biases in these tasks remains poorly understood: which biases are robust across providers and platforms, and which can be mitigated through prompt design. We present a controlled simulation study mapping content selection biases across three major LLM providers (OpenAI, Anthropic, Google) on real social media datasets from Twitter/X, Bluesky, and Reddit, using six prompting strategies (*general*, *popular*, *engaging*, *informative*, *controversial*, *neutral*). Through 540,000 simulated top-10 selections from pools of 100 posts across 54 experimental conditions, we find that biases differ substantially in how structural and how prompt-sensitive they are. Polarization is amplified across all configurations, toxicity handling shows a strong inversion between engagement- and information-focused prompts, and sentiment biases are predominantly negative. Provider comparisons reveal distinct trade-offs: GPT-4o Mini shows the most consistent behavior across prompts; Claude and Gemini exhibit high adaptivity in toxicity handling; Gemini shows the strongest negative sentiment preference. On Twitter/X, where author demographics can be inferred from profile bios, political leaning bias is the clearest demographic signal: left-leaning authors are systematically over-represented despite right-leaning authors forming the pool plurality in the dataset, and this pattern largely persists across prompts.
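The abstract's 540,000 selections decompose cleanly across its stated design. A sanity-check sketch of that arithmetic; the per-condition count is inferred from the division, not stated explicitly in the abstract:

```python
providers = 3   # OpenAI, Anthropic, Google
platforms = 3   # Twitter/X, Bluesky, Reddit
prompts = 6     # general, popular, engaging, informative, controversial, neutral

conditions = providers * platforms * prompts
total_selections = 540_000
per_condition = total_selections // conditions

print(conditions, per_condition)  # 54 10000
```

The 3 x 3 x 6 grid reproduces the paper's 54 experimental conditions exactly, implying 10,000 simulated top-10 selections per condition.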
### Apple's Mac Mini sells out as everyone tries to run local agents on the cheap
**Source:** Semafor
**Link:** https://www.semafor.com/article/04/20/2026/rising-ai-demand-sees-supplies-dwindle-and-costs-rise
Consumers are feeling the effects of the AI compute crunch, as supplies dwindle and prices rise.
The Mac Mini computer, previously a niche product, is now all but out of stock, The Wall Street Journal reported, because the no-frills, high-powered machine is the most cost-effective way to run locally hosted AI agents, such as OpenClaw.
And Anthropic will start charging some business customers directly for the computing power they use, on top of a flat fee per user, according to The Information; a spike in demand since the introduction of its coding agent has driven up the AI company’s costs. The resulting increase could mean some of Anthropic’s heaviest users see their bills triple.
### Kevin Bryan's Eight Rules tell universities to ship useful knowledge or lose the public
**Source:** Marginal Revolution
**Link:** https://marginalrevolution.com/marginalrevolution/2026/04/eight-rules-to-regain-public-trust-in-academia.html?utm_source=rss&utm_medium=rss&utm_campaign=eight-rules-to-regain-public-trust-in-academia
The Yale Report was quite good but for concision I prefer Kevin Bryan’s Eight Rules:
1. Produce and Teach Useful Knowledge
Universities exist to generate and teach useful knowledge. This knowledge is grounded in skeptical inquiry, empirical evidence, and logical deduction. “Useful” includes not only practical applications but also fundamental discoveries that expand our understanding of the world, even if their benefits are long-term.
2. Be Useful to All of Society
Universities are subsidized only if society at large finds them valuable. Research may take time to bear fruit, but its insights should ultimately serve the public good, communicated openly and accessibly, and presented with epistemic humility. Teaching should be done with care and draw on up-to-date research.
3. Attract Talent from All of Society
Useful knowledge can be created by people from any social or economic background. Do not waste talent. Do not select talent based on who knows “how to play the game”. Avoid insular language or norms that deter people from entering research.
4. Neutral, Objective Research Produces Useful Knowledge
Research must be neutral and objective. It is true that everyone has their individual background and preferences; nonetheless, unbiased research is still possible. Tradition, folk knowledge, and storytelling all play important roles in society, but they are not the purpose of universities. There is no “Western science” or culturally-determined “ways of knowing”. Rather, research is open to all and can be performed identically regardless of background.
5. Hire, Promote, and Cite Based on Knowledge Contribution
Hiring, promotion, and citation must be based on an individual’s contribution to knowledge. Nepotism, group preferences, and adherence to specific “schools of thought” corrupt this process. When advancement is not based on merit, the public rightly questions our integrity and the objectivity of our findings.
6. Keep Personal Views Out of Research and Teaching
A scholar’s personal politics should be invisible in their research and teaching. If a finding is predictable based on the author’s identity or known views, the process has failed. Objectivity is the hallmark of credible science. Academics may hold private beliefs like anyone else, but their academic work must stand apart from them.
7. Research Fraud is Unacceptable
Fraud destroys trust. Misrepresentation of results, selective reporting, or methods designed to publish rather than to discover are also harmful. Proven fraud must bring immediate dismissal, as it violates the core purpose of academia.
8. Scientific Institutions Should Be Apolitical
Universities, journals, and scientific societies must remain non-partisan. Their public statements must be rare, restricted to issues of direct expert consensus, and made only when silence would be a greater threat to their integrity than speaking. Activism sacrifices credibility for influence – or worse yet, sacrifices credibility and influence a