Dragonase needed a way to validate hundreds of AI-generated experiments without creating a manual bottleneck for their lead scientists. We built a system to capture and scale scientific judgement, creating an automated judge and quality benchmark. This allowed the lab to scale its research safely while maintaining high scientific standards and full client ownership.
Context
Dragonase is a biotech startup building a self-driving lab to accelerate research in longevity and cell therapy. Their models generate complex experiments at a volume that would overwhelm a lean scientific team, especially when quality varies widely across proposals.
Challenge
The primary obstacle was the lack of a scalable validation signal. The AI could generate thousands of ideas, but many were low value. Manual review by human experts could not keep up, and early proxy metrics like paper quality were unreliable. The team also lacked a rigorous way to measure whether changes to the generator improved experiment quality.
Approach
We treated the problem as a measurement failure rather than a generation failure. The focus shifted from making the generator better to building a ruler that defines what good actually looks like. The goal was to amplify the lead scientist’s judgement through a scalable architecture.
System design
The Judgement Amplification System ingests proposed experiments and routes them through a reasoning engine trained on expert judgement. Inputs are AI-generated proposals; outputs are a 1–5 score paired with a detailed reasoning trace. Feedback loops run through a correction interface that lets scientists refine the prompts over time.
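As a rough sketch of the judge's output contract described above (all names and fields here are hypothetical, not the client's actual schema), each evaluation could be captured as a small structured record:

```python
from dataclasses import dataclass

@dataclass
class JudgeVerdict:
    """Hypothetical shape of one evaluation produced by the AI judge."""
    proposal_id: str   # which AI-generated experiment was scored
    score: int         # 1-5, per the rubric
    reasoning: str     # detailed trace explaining the score
    needs_review: bool # flagged for the human correction interface

    def __post_init__(self) -> None:
        # Guard the rubric scale so downstream triage can trust the range.
        if not 1 <= self.score <= 5:
            raise ValueError("score must be on the 1-5 rubric scale")
```

Keeping the reasoning trace alongside the score is what makes the verdict auditable rather than a bare number.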
Implementation
Discovery began by analyzing the reasoning for approved and rejected ideas. We mapped those heuristics into a formal rubric that could be processed by a machine. The build focused on an AI judge and a low-friction correction interface so the team could see immediate triage value without sacrificing intellectual property. Handoff emphasized durability, with AI-readable documentation and decision logs that keep the system maintainable without ongoing consultant support.
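To illustrate what "mapping heuristics into a formal rubric a machine can process" might look like, here is a minimal sketch; the criteria names and weights are invented for illustration, not the client's actual rubric:

```python
# Illustrative rubric: weighted criteria, each scored 1-5 by the judge.
# Criteria and weights are hypothetical examples.
RUBRIC = {
    "novelty": 0.2,
    "feasibility": 0.4,
    "safety": 0.4,
}

def overall_score(criterion_scores: dict[str, int]) -> float:
    """Weighted average of per-criterion scores, still on the 1-5 scale."""
    assert set(criterion_scores) == set(RUBRIC), "score every criterion"
    return sum(RUBRIC[c] * s for c, s in criterion_scores.items())
```

Encoding the rubric as data rather than prose is what lets the judge apply the same standard to thousands of proposals consistently.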
Results
The system now evaluates thousands of experiments per day, well beyond current throughput requirements. Lead scientists can focus on high-stakes decisions while the judge handles routine triage.
“The judge is actually probably going to be a lot better for triage than the paper evaluator. The fact that you’ve got a great paper and a bad experiment means that if you rank order stuff by paper quality, it’s a waste of time. The judge ranking is much more promising.”
— Dragonase Lead Scientist
Measurement plan: future phases will track alignment between the AI judge and human experts and use that data to refine prompts automatically.
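Judge–human alignment of the kind described in the measurement plan could be tracked with a simple agreement metric; this is a sketch of one possible approach, not the system's actual implementation:

```python
def agreement(judge: list[int], human: list[int], tolerance: int = 1) -> float:
    """Fraction of proposals where the judge's 1-5 score lands within
    `tolerance` points of the human expert's score for the same proposal."""
    assert judge and len(judge) == len(human), "need paired, non-empty scores"
    hits = sum(abs(j - h) <= tolerance for j, h in zip(judge, human))
    return hits / len(judge)
```

Tracking both exact agreement (`tolerance=0`) and within-one-point agreement over time would give the feedback signal needed to refine prompts automatically.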
What made it work
The pivot to validation created a permanent asset instead of a temporary fix. Reasoning traces made judgements auditable and trustworthy. AI-ready documentation made the system maintainable by the client's internal team. A low-friction feedback interface kept the human expert in the loop without causing burnout.
Next steps
Next phases include automated prompt optimization using historical feedback and hardening the system for 24/7 operation in the live laboratory environment.