IN TODAY'S SIGNAL
Read time: 6 min 10 sec

🎖️ Top News
⚡️ Trending Signals
📌 aiOla
💻 Top Lectures
🧠 Deep Dive
If you're enjoying AlphaSignal, please forward this email to a colleague.
It helps us keep this content free.
TOP NEWS
AI Research
OpenAI introduces PaperBench to test AI agents on replicating papers from scratch; Claude 3.5 leads at 21%
⇧ 7,035 Likes
What's New
OpenAI introduces PaperBench, a benchmark testing AI agents on replicating 20 ICML 2024 Spotlight and Oral papers. Claude 3.5 Sonnet (new) scores 21.0%, the highest among tested models. Human ML PhDs achieve 41.4%, outperforming AI on a subset of tasks.
Key Results
PaperBench evaluates agents against a structured grading system. Claude 3.5 Sonnet (new) scores 21.0%, the highest recorded performance.
OpenAI’s o1 model improves from 13.2% to 24.4% with optimized prompting.
Agents fail to match human researchers, who achieve 41.4% in 48 hours.
Tested models struggle with long-horizon planning and execution.
How PaperBench Works
Agents receive a paper and must reproduce its experiments from scratch.
Each agent reads the ICML 2024 paper and an accompanying addendum.
Each paper has a rubric with 8,316 total evaluation points across the benchmark. Rubrics were co-developed with the original authors of the ICML papers.
Agents must generate a complete codebase and reproduce all experiments from scratch.
They generate a reproduce script as the execution entry point.
The script runs in a sandboxed VM or Docker container with GPU support.
SimpleJudge grades the results using a detailed rubric.
Evaluation Method
Submissions are graded using an LLM-based automated judge called SimpleJudge. The judge uses the o3-mini-high model to assess individual rubric items.
It executes the submitted reproduce script and compares outputs to rubric expectations.
You can inspect judge outputs to understand scoring decisions.
Evaluation includes script correctness, result quality, and reproducibility.
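Rubric-based grading like this can be pictured as a weighted aggregation over judged requirements. The sketch below is an illustrative, simplified model; the item names, weights, and flat list structure are assumptions, not PaperBench's actual hierarchical rubric format:

```python
from dataclasses import dataclass

@dataclass
class RubricItem:
    """One leaf requirement from a paper's rubric (hypothetical example)."""
    description: str
    weight: float      # relative importance within the rubric
    passed: bool       # binary judgment produced by the LLM judge

def weighted_score(items: list[RubricItem]) -> float:
    """Aggregate leaf judgments into a single replication score in [0, 1]."""
    total = sum(i.weight for i in items)
    if total == 0:
        return 0.0
    return sum(i.weight for i in items if i.passed) / total

items = [
    RubricItem("Training loop matches paper's Algorithm 1", 2.0, True),
    RubricItem("Reproduces Table 2 within tolerance", 3.0, False),
    RubricItem("Ablation over learning rates runs end to end", 1.0, True),
]
print(f"Replication score: {weighted_score(items):.2%}")  # → 50.00%
```

The weights make partial credit possible: an agent that nails the core algorithm but misses one results table still earns a nonzero score.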
Access
You can use PaperBench or its lighter variant, PaperBench Code-Dev, through the released GitHub repo.
PaperBench Code-Dev focuses on code only and skips experiment execution.
It does not require GPU access and supports faster evaluation cycles.
Full benchmark setup includes Docker, GPUs, and specific dependencies.
You can use the same rubrics for custom internal evaluations.
Community Feedback
Virgile Blais "How did 3.5 Sonnet obliterate even highly inference scaled models like o3-mini-high on this benchmark but not in others? Seems like quite a unique benchmark where the top performers differ significantly from others’ like Humanity’s Last Exam"
Simon Frieder "An apparent paradox: Since the datasets papers are top ML ones, it could well be that they were intensely discussed in forum on the internet. This may make it actually _easier_ to reproduce them, rather than harder. _That_ is probably what should be accounted for in more technical detail in the limitations section, for example by using a control dataset of obscure papers and (re-)implementing that too."
Julian J. Neuss "this is how you know it’s getting real agents aren’t just writing blog posts anymore they’re reading top research, running code, and replicating experiments soon, “I read the paper” won’t mean human anymore"
READ MORE
TRENDING SIGNALS
AI in Education
⇧ 5,047 Likes

AI Benchmark
⇧ 1,535 Likes

Coding Assistant
⇧ 1,395 Likes

AI Music Generation
⇧ 11,495 Likes

AI Safety
⇧ 829 Likes
ASR That Understands Noise, Accents & Jargon
Jargonic is built for real-world enterprise speech. Trained on 1M+ hours of diverse audio, it transcribes any language, accent, or acoustic setting, with no retraining needed.
Built for Enterprises: Captures jargon and industry-specific terms.
Low-Latency Transcription: Works in real time for automation and AI assistants.
Easy Integration: Runs with aiOla’s Conversational AI stack.
Scale voice-led workflows with Jargonic today.
Book a Demo Today ↗️
TOP LECTURES
LLM Development
⇧ 1,046 Likes
In this lecture by Andrew Ng's deeplearning.ai, you will learn how to generate structured LLM outputs using APIs, re-prompting libraries, and constrained decoding. Use OpenAI’s structured output API with Pydantic. Validate outputs with the “instructor” library. Apply constrained decoding using the “outlines” library and regex-based finite-state machines. Build a social media agent and parse outputs into pandas data frames.
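The re-prompting idea behind libraries like “instructor” can be sketched with no dependencies at all: parse the model's reply as JSON, validate it against an expected schema, and raise an error the caller can feed back into the next prompt. Everything below, including the SocialPost schema and its field names, is a hypothetical stand-in, not the lecture's actual code:

```python
import json
from dataclasses import dataclass

@dataclass
class SocialPost:
    platform: str
    text: str
    hashtags: list[str]

def parse_post(raw: str) -> SocialPost:
    """Validate a model's JSON reply against the expected schema.

    Raises ValueError with a specific message so a caller can re-prompt
    the model, including the error text in the follow-up request.
    """
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as e:
        raise ValueError(f"not valid JSON: {e}") from e
    missing = {"platform", "text", "hashtags"} - data.keys()
    if missing:
        raise ValueError(f"missing fields: {sorted(missing)}")
    if not isinstance(data["hashtags"], list):
        raise ValueError("hashtags must be a list")
    return SocialPost(data["platform"], data["text"], data["hashtags"])

reply = '{"platform": "x", "text": "PaperBench is live", "hashtags": ["AI"]}'
post = parse_post(reply)
print(post.platform)  # → x
```

Constrained decoding takes the opposite approach: rather than validating after generation, it restricts the tokens the model can emit so only schema-conforming output is possible.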
LLM
⇧ 908 Likes
In this podcast, Sr. Product Manager Logan Kilpatrick and Gemini Product Lead Tulsee Doshi break down Gemini 2.5 Pro’s reasoning, coding, and multimodal improvements. They discuss its 1M token context, evaluation methods, pre/post-training optimization, and test-time compute. Learn how Google coordinates cross-stack updates, embeds safety, and advances Gemini’s architecture for real-world applications.
Video Generation
⇧ 451 Likes
With OpenAI Academy's series, learn how to generate 20-second videos using Sora from text, images, or clips. Use tools to storyboard, recut, blend, remix, and loop videos. Understand how editing steps affect output. Apply structured inputs for precise control. Build repeatable workflows to streamline video generation using Sora’s core features for consistent results.
DEEP DIVE
Data Science
Master Data Analytics with AI and Earn a Professional Certificate
⇧ 908 Likes
The Data Analytics Professional Certificate from DeepLearning.AI provides a structured, five-course program to help you build practical analytics skills using Python, SQL, spreadsheets, Tableau, and generative AI.
It covers the complete analytics pipeline from defining problems to delivering actionable insights through real-world, project-based learning led by Netflix data science leader Sean Barnes.
You will learn how:
To classify data types and understand their analytic uses
Data flows across roles and systems in an organization
To clean, preprocess, and validate data using Python and SQL
To calculate and apply descriptive and inferential statistics
To build interactive dashboards with Tableau
To use LLMs for stakeholder analysis, visualization, and simulation
To apply analytics to real-world problems in business, science, and conservation
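As a taste of the descriptive-statistics portion, Python's standard library alone covers the basics; the dataset below is made up purely for illustration:

```python
import statistics

# Hypothetical sample: daily active users over one week of a small experiment.
daily_active_users = [120, 135, 128, 150, 142, 138, 131]

mean = statistics.mean(daily_active_users)
median = statistics.median(daily_active_users)
stdev = statistics.stdev(daily_active_users)  # sample standard deviation (n - 1)

print(f"mean={mean:.1f} median={median} stdev={stdev:.1f}")
# → mean=134.9 median=135 stdev=9.8
```

Inferential statistics then asks what this sample says about the wider population, which is where the course's later material on estimation and hypothesis testing comes in.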
START NOW