Signup  |  Work With Us  |  Follow on X  |  Read on Web


Hey,

Welcome to AlphaSignal – the most read newsletter by AI developers. 


We bring you the top 1% of news, papers, models, and repos, all summarized to keep you updated on the latest in AI.

IN TODAY'S SIGNAL

Read time: 6 min 10 sec

🎖️ Top News

⚡️ Trending Signals

📌  aiOla

💻  Top Lectures




🧠  Deep Dive

If you're enjoying AlphaSignal please forward this email to a colleague. 

It helps us keep this content free.

TOP NEWS

AI Research

OpenAI introduces PaperBench to test AI agents on replicating papers from scratch; Claude 3.5 leads at 21%


⇧ 7,035 Likes

What's New

OpenAI introduces PaperBench, a benchmark testing AI agents on replicating 20 ICML 2024 Spotlight and Oral papers. Claude 3.5 Sonnet (new) scores 21.0%, the highest among tested models. Human ML PhDs achieve 41.4%, outperforming AI on a subset of tasks.


Key Results
PaperBench evaluates the agents against a structured grading system.

  • Claude 3.5 Sonnet (new) scores 21.0%, the highest recorded performance.

  • OpenAI’s o1 model improves from 13.2% to 24.4% with optimized prompting.

  • Agents fail to match human researchers, who achieve 41.4% in 48 hours.

  • Tested models struggle with long-horizon planning and execution.


How PaperBench Works

The agents receive a paper and must reproduce its experiments from scratch.

  • Each agent reads the ICML 2024 paper and an accompanying addendum.

  • Each paper has a rubric co-developed with its original authors; across the benchmark, the rubrics total 8,316 evaluation points.

  • Agents must generate a complete codebase and reproduce all experiments from scratch.

  • They generate a reproduce script as the execution entry point.

  • The script runs in a sandboxed VM or Docker container with GPU support.

  • SimpleJudge grades the results using a detailed rubric.
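The weighted, hierarchical grading that SimpleJudge applies can be sketched as a rubric tree. The node names, weights, and pass flags below are invented for illustration; the real rubrics ship with the benchmark.

```python
# Hypothetical sketch of PaperBench-style hierarchical rubric scoring.
# Node names and weights are made up; real rubrics are released with
# the benchmark and were co-developed with the papers' authors.

def score(node):
    """Return the weighted score of a rubric node in [0, 1].

    Leaf nodes carry a binary `passed` flag assigned by the judge;
    internal nodes average their children, weighted by `weight`.
    """
    children = node.get("children")
    if not children:
        return 1.0 if node["passed"] else 0.0
    total_weight = sum(c["weight"] for c in children)
    return sum(c["weight"] * score(c) for c in children) / total_weight

rubric = {
    "name": "replicate-paper",
    "children": [
        {"name": "code-development", "weight": 2,
         "children": [
             {"name": "model-implemented", "weight": 1, "passed": True},
             {"name": "training-loop", "weight": 1, "passed": False},
         ]},
        {"name": "experiment-results", "weight": 1, "passed": True},
    ],
}

print(round(score(rubric), 3))  # prints 0.667
```

Aggregating leaf judgments through weighted parents is what lets a single percentage summarize thousands of fine-grained checks.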


Evaluation Method
Submissions are graded using an LLM-based automated judge called SimpleJudge.

  • The judge uses the o3-mini-high model to assess individual rubric items.

  • It executes the submitted reproduce script and compares outputs to rubric expectations.

  • You can inspect judge outputs to understand scoring decisions.

  • Evaluation includes script correctness, result quality, and reproducibility.
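The judging loop above can be sketched with the model call stubbed out. The prompt wording and the `call_judge` hook are assumptions for illustration; PaperBench's actual SimpleJudge prompts o3-mini-high per rubric item.

```python
# Sketch of an LLM-judge pass over rubric items, with the model call
# stubbed. The prompt template and `call_judge` hook are hypothetical.

JUDGE_TEMPLATE = (
    "You are grading one rubric item for a paper replication.\n"
    "Rubric item: {item}\n"
    "Submission excerpt:\n{excerpt}\n"
    "Answer PASS or FAIL."
)

def grade_submission(items, excerpt, call_judge):
    """Grade each rubric item; `call_judge(prompt) -> 'PASS' | 'FAIL'`."""
    results = {}
    for item in items:
        prompt = JUDGE_TEMPLATE.format(item=item, excerpt=excerpt)
        results[item] = call_judge(prompt) == "PASS"
    return results

# Stub judge for demonstration: passes any item mentioning "script".
stub = lambda p: "PASS" if "script" in p else "FAIL"
print(grade_submission(
    ["reproduce script runs end-to-end", "loss curve matches paper"],
    "repo excerpt...", stub))
```

Swapping the stub for a real model call is the only change needed to turn this into an automated grader, which is why inspecting per-item judge outputs is enough to audit the final score.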


Access

You can use PaperBench or its lighter variant, Code-Dev, through the released GitHub repo.

  • PaperBench Code-Dev focuses on code only and skips experiment execution.

  • It does not require GPU access and supports faster evaluation cycles.

  • Full benchmark setup includes Docker, GPUs, and specific dependencies.

  • You can use the same rubrics for custom internal evaluations.


Community Feedback

Virgile Blais
"How did 3.5 Sonnet obliterate even highly inference scaled models like o3-mini-high on this benchmark but not in others? Seems like quite a unique benchmark where the top performers differ significantly from others’ like Humanity’s Last Exam"

Simon Frieder
"An apparent paradox: Since the datasets papers are top ML ones, it could well be that they were intensely discussed in forum on the internet. This may make it actually _easier_ to reproduce them, rather than harder. _That_ is probably what should be accounted for in more technical detail in the limitations section, for example by using a control dataset of obscure papers and (re-)implementing that too."

Julian J. Neuss
"this is how you know it’s getting real agents aren’t just writing blog posts anymore they’re reading top research, running code, and replicating experiments soon, “I read the paper” won’t mean human anymore"

READ MORE

TRENDING SIGNALS

AI in Education

Anthropic launches Claude for Education with learning mode, guiding students' reasoning instead of giving direct answers

⇧ 5,047 Likes

AI Benchmark

UCSD researchers demonstrate GPT-4.5 passes the Turing test, fooling judges 73% of the time and outperforming real humans

⇧ 1,535 Likes

Coding Assistant

Codeium releases Wave 6, adds one-click app deployment, commit message generation, and Jupyter Notebook integration

⇧ 1,395 Likes

AI Music Generation

Mureka introduces the first chain-of-thought AI music model, supporting audio-based prompting for music creation

⇧ 11,495 Likes

AI Safety

Google DeepMind publishes a 145-page AGI safety blueprint, contrasting its approach with OpenAI's and Anthropic's and proposing cybersecurity measures

⇧ 829 Likes

ASR That Understands Noise, Accents & Jargon

Jargonic is built for real-world enterprise speech. Trained on 1M+ hours of diverse audio, it transcribes any language, accent, or acoustic setting—no retraining needed.

  • Built for Enterprises: Captures jargon and industry-specific terms.

  • Low-Latency Transcription: Works in real-time for automation and AI assistants.

  • Easy Integration: Runs with aiOla’s Conversational AI stack.

Scale voice-led workflows with Jargonic today.

Book a Demo Today ↗️

TOP LECTURES

LLM Development

Getting Structured LLM Output

⇧ 1,046 Likes

In this lecture by Andrew Ng's deeplearning.ai, you will learn how to generate structured LLM outputs using APIs, re-prompting libraries, and constrained decoding. Use OpenAI’s structured output API with Pydantic. Validate outputs with the “instructor” library. Apply constrained decoding using the “outlines” library and regex-based finite-state machines. Build a social media agent and parse outputs into pandas data frames.
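The core idea behind these tools, constraining and validating model output against a schema, can be sketched with nothing but the standard library. The lecture itself uses Pydantic, “instructor”, and “outlines”; the regex-plus-`json.loads` pipeline below is a dependency-free stand-in.

```python
# Dependency-free sketch of validating structured LLM output.
# A regex plus json.loads stands in for Pydantic/instructor/outlines.

import json
import re

# Pattern the raw model text must match before we try to parse it:
# a flat JSON object whose "sentiment" field is one of three labels.
PATTERN = re.compile(
    r'\{"sentiment":\s*"(positive|negative|neutral)",\s*"confidence":\s*0\.\d+\}'
)

def parse_output(raw: str) -> dict:
    """Reject malformed output early, then parse it into a dict."""
    if not PATTERN.fullmatch(raw.strip()):
        raise ValueError(f"output does not match schema: {raw!r}")
    return json.loads(raw)

good = '{"sentiment": "positive", "confidence": 0.92}'
print(parse_output(good)["sentiment"])  # prints "positive"
```

Libraries like “outlines” go one step further: instead of rejecting bad output after generation, they compile a pattern like this into a finite-state machine that masks invalid tokens during decoding.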

LLM

Gemini 2.5 Pro Release Note

⇧ 908 Likes

In this podcast, Sr. Product Manager Logan Kilpatrick and Gemini Product Lead Tulsee Doshi break down Gemini 2.5 Pro’s reasoning, coding, and multimodal improvements. They discuss its 1M token context, evaluation methods, pre/post-training optimization, and test-time compute. Learn how Google coordinates cross-stack updates, embeds safety, and advances Gemini’s architecture for real-world applications.

Video Generation

Getting Started with Sora

⇧ 451 Likes

With OpenAI Academy's series learn how to generate 20-second videos using Sora from text, images, or clips. Use tools to storyboard, recut, blend, remix, and loop videos. Understand how editing steps affect output. Apply structured inputs for precise control. Build repeatable workflows to streamline video generation using Sora’s core features for consistent results.

DEEP DIVE

Data Science

Master Data Analytics with AI and Earn a Professional Certificate

⇧ 908 Likes

The Data Analytics Professional Certificate from DeepLearning.AI provides a structured, five-course program to help you build practical analytics skills using Python, SQL, spreadsheets, Tableau, and generative AI.


It covers the complete analytics pipeline from defining problems to delivering actionable insights through real-world, project-based learning led by Netflix data science leader Sean Barnes.


You will learn how to:

  • Classify data types and understand their analytic uses

  • Trace how data flows across roles and systems in an organization

  • Clean, preprocess, and validate data using Python and SQL

  • Calculate and apply descriptive and inferential statistics

  • Build interactive dashboards with Tableau

  • Use LLMs for stakeholder analysis, visualization, and simulation

  • Apply analytics to real-world problems in business, science, and conservation
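To give a flavor of the descriptive statistics covered early in the program, here is a minimal standard-library sketch; the course itself works in Python with pandas, SQL, and Tableau, and the sample data below is made up.

```python
# Minimal descriptive-statistics sketch using only the standard
# library. The daily_signups figures are invented sample data.

import statistics

daily_signups = [120, 135, 128, 160, 142, 151, 138]

summary = {
    "mean": statistics.mean(daily_signups),
    "median": statistics.median(daily_signups),
    "stdev": round(statistics.stdev(daily_signups), 2),
}
print(summary)
```

The same three numbers (center, middle, spread) are the starting point for the inferential statistics and dashboarding work later in the program.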

START NOW

Stop receiving emails here