Building Dual-Model Scoring Systems for Quality Control
Why One Model Is Not Enough
When I first built content scoring systems, I used the same model for both generation and evaluation. It seemed logical, but I quickly discovered a fundamental problem: models tend to rate their own output favorably. A model that generates verbose, hedging content will also score that style highly when evaluating it.
The solution is dual-model scoring, where you use two different models to evaluate the same content. The disagreements between models are where the most interesting quality signals live.
How Dual-Model Scoring Works
The concept is simple. Each piece of content gets scored independently by two different LLMs. The scores are then compared and combined:
- If both models agree the content is good, it passes
- If both models agree it is bad, it fails
- If the models disagree, the content gets flagged for human review
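That routing rule can be sketched as a small helper; the thresholds here are illustrative placeholders, not values from any particular system:

```python
def route(score_a: float, score_b: float,
          pass_threshold: float = 6.5,
          disagreement_gap: float = 2.0) -> str:
    """Decide what happens to content given two independent model scores."""
    if abs(score_a - score_b) > disagreement_gap:
        return "human_review"              # models disagree: flag for a person
    if (score_a + score_b) / 2 >= pass_threshold:
        return "pass"                      # both models agree it is good
    return "fail"                          # both models agree it is bad
```

The order of the checks matters: disagreement is tested first, so a high average cannot mask a large gap between the two judges.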
This approach catches two categories of problems that single-model scoring misses:
- Model-specific blind spots: Each model has different weaknesses in evaluation
- Style bias: Models tend to prefer their own writing style
Implementation Architecture
I run both scoring requests in parallel to minimize latency. Here is the core implementation:
```python
import asyncio

from anthropic import AsyncAnthropic
from openai import AsyncOpenAI

# score_with_claude and score_with_gpt (defined elsewhere) each call their
# provider's client with the rubric and return a dict of the form
# {'scores': {dimension: numeric_score, ...}}.

async def dual_score(content: str, rubric: str) -> dict:
    claude_task = score_with_claude(content, rubric)
    gpt_task = score_with_gpt(content, rubric)
    claude_result, gpt_result = await asyncio.gather(
        claude_task, gpt_task
    )
    return reconcile_scores(claude_result, gpt_result)

def reconcile_scores(claude: dict, gpt: dict) -> dict:
    combined = {}
    disagreements = []
    for dimension in claude['scores']:
        c_score = claude['scores'][dimension]
        g_score = gpt['scores'][dimension]
        combined[dimension] = (c_score + g_score) / 2
        # A gap of more than 2 points on any dimension flags the content.
        if abs(c_score - g_score) > 2:
            disagreements.append({
                'dimension': dimension,
                'claude': c_score,
                'gpt': g_score,
                'delta': abs(c_score - g_score)
            })
    overall = sum(combined.values()) / len(combined)
    return {
        'scores': combined,
        'overall': overall,
        'disagreements': disagreements,
        'needs_review': len(disagreements) > 0,
        'pass': overall >= 6.5 and len(disagreements) == 0
    }
```
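To make the reconciliation concrete, here is a worked example with two hypothetical score dicts; the per-dimension average and the disagreement check mirror what reconcile_scores computes:

```python
claude = {'scores': {'accuracy': 8, 'clarity': 7}}
gpt = {'scores': {'accuracy': 5, 'clarity': 7}}

# Per-dimension averages: accuracy averages to 6.5, clarity to 7.0.
combined = {d: (claude['scores'][d] + gpt['scores'][d]) / 2
            for d in claude['scores']}

# Accuracy differs by 3 points, exceeding the 2-point threshold, so the
# content is routed to human review even though its overall average passes.
disagreements = [d for d in claude['scores']
                 if abs(claude['scores'][d] - gpt['scores'][d]) > 2]
```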
What I Have Learned About Model Disagreements
After running dual-model scoring on thousands of pieces of content, clear patterns have emerged in where Claude and GPT-4o disagree:
- Technical accuracy: Claude tends to be stricter on technical claims, catching subtle errors that GPT-4o lets slide
- Writing style: GPT-4o penalizes repetition more heavily, while Claude focuses more on logical flow
- Completeness: Claude is more likely to flag missing context or caveats
- Originality: GPT-4o scores originality more generously overall
These differences are the system's strength: each model compensates for the other's weaknesses.
Weighting the Models
I do not weight both models equally for all dimensions. Based on my observations, I give Claude more weight on accuracy and completeness, while GPT-4o gets more weight on style and readability. The weights are configurable per project:
```python
DIMENSION_WEIGHTS = {
    'accuracy':     {'claude': 0.6, 'gpt': 0.4},
    'relevance':    {'claude': 0.5, 'gpt': 0.5},
    'clarity':      {'claude': 0.4, 'gpt': 0.6},
    'originality':  {'claude': 0.5, 'gpt': 0.5},
    'completeness': {'claude': 0.6, 'gpt': 0.4},
}
```
Cost Considerations
Dual-model scoring doubles your evaluation costs. For high-volume pipelines, this matters. I mitigate this in a few ways:
- Use cheaper model tiers for scoring (Claude Haiku and GPT-4o-mini work surprisingly well as judges)
- Only run dual scoring on content that passes rule-based pre-filters
- Cache scoring results so identical or near-identical content is not re-scored
- Sample a percentage of content for dual scoring rather than scoring everything
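The caching idea can be as simple as keying on a hash of lightly normalized content; a sketch, where `scorer` stands in for the dual-scoring call and the in-memory dict would be a real cache in production:

```python
import hashlib

_score_cache: dict = {}  # content hash -> cached scoring result

def cached_score(content: str, scorer) -> dict:
    """Skip re-scoring when identical content was already evaluated."""
    # Normalize before hashing so trivial variants hit the same cache entry.
    key = hashlib.sha256(content.strip().lower().encode()).hexdigest()
    if key not in _score_cache:
        _score_cache[key] = scorer(content)
    return _score_cache[key]
```

Stripping whitespace and lowercasing catches only the most trivial near-duplicates; fuzzier matching (for example, hashing a normalized token sequence) trades cache hits against the risk of serving a stale score for genuinely different content.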
Calibration and Feedback Loops
The system improves over time through feedback loops. When a human reviewer overrides the automated score, that data point feeds back into the calibration. I track the correlation between automated scores and human judgments, and adjust the thresholds quarterly.
The goal of dual-model scoring is not perfect agreement between models. It is using their disagreements as a signal for where human attention is most needed.
When to Use Dual-Model Scoring
Dual-model scoring is not always necessary. For quick internal drafts or low-stakes content, single-model scoring is fine. But for anything that will be published publicly, sent to customers, or used in decision-making, the extra cost of dual scoring is well worth the quality improvement. It has become a standard part of my production content pipelines, and I would not go back to single-model evaluation.