How to Score and Filter AI-Generated Content Automatically
The Quality Problem with AI Content
AI can generate content fast, but speed without quality control is a recipe for publishing garbage. I learned this the hard way when an early version of one of my content pipelines pushed out several posts with factual errors and awkward phrasing. Since then, I have built automated scoring into every content generation system I operate.
The principle is simple: every piece of AI-generated content goes through a scoring pipeline before it reaches any human or gets published. Content that scores below a threshold gets flagged for review or rejected entirely.
Designing a Scoring Rubric
Before you can score content, you need to define what good content looks like. I use a rubric with five dimensions, each scored from 1 to 10:
- Relevance: Does the content address the intended topic or query?
- Accuracy: Are the claims factually correct and well-supported?
- Clarity: Is the writing clear, well-structured, and easy to follow?
- Originality: Does it offer unique insights rather than generic filler?
- Completeness: Does it cover the topic thoroughly enough to be useful?
The overall score is a weighted average. For technical content, I weight accuracy and completeness more heavily. For marketing copy, relevance and clarity get higher weights.
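As a sketch, the weighted average reduces to a few lines; the weights below are illustrative stand-ins for the per-project values described above, not a prescription:

```python
# Illustrative weights for technical content: accuracy and
# completeness dominate. Example values only.
TECHNICAL_WEIGHTS = {
    "relevance": 0.15,
    "accuracy": 0.30,
    "clarity": 0.15,
    "originality": 0.10,
    "completeness": 0.30,
}

def overall_score(scores: dict[str, float], weights: dict[str, float]) -> float:
    """Weighted average of per-dimension scores (each on a 1-10 scale)."""
    total = sum(weights.values())
    return sum(scores[dim] * w for dim, w in weights.items()) / total
```

Swapping in a marketing-weighted dict changes the emphasis without touching the scoring code.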
Using an LLM as a Scoring Judge
The most effective approach I have found is using a different LLM to judge the output. If I generate content with GPT-4o, I score it with Claude, and vice versa. This cross-model evaluation catches blind spots that a single model might miss.
import json
from anthropic import AsyncAnthropic

claude_client = AsyncAnthropic()  # reads ANTHROPIC_API_KEY from the environment

async def score_content(content: str, rubric: dict) -> dict:
    prompt = f"""Score this content on each dimension (1-10).
Provide a brief justification for each score.

Rubric:
{json.dumps(rubric, indent=2)}

Content:
{content}

Return JSON: {{"scores": {{"relevance": n, ...}},
"justifications": {{"relevance": "...", ...}},
"overall": n,
"pass": true/false}}"""
    response = await claude_client.messages.create(
        model="claude-sonnet-4-20250514",
        max_tokens=1024,
        messages=[{"role": "user", "content": prompt}],
    )
    # parse_scores pulls the JSON object out of the model's reply text
    return parse_scores(response.content[0].text)
Calibrating the Threshold
I started with a pass threshold of 7.0 out of 10 and adjusted based on real-world feedback. For my current projects, a score of 6.5 or above passes automatically, 5.0 to 6.5 gets flagged for human review, and below 5.0 is rejected outright. These thresholds vary by project and content type.
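The three-way routing described here reduces to a small function. The defaults match the numbers above, but as noted, they should be tuned per project:

```python
def route(overall: float,
          auto_pass: float = 6.5,
          review_floor: float = 5.0) -> str:
    """Map an overall score to an action: publish automatically,
    flag for human review, or reject outright."""
    if overall >= auto_pass:
        return "publish"
    if overall >= review_floor:
        return "human_review"
    return "reject"
```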
Rule-Based Pre-Filters
Before spending API credits on LLM scoring, I run content through cheap rule-based checks that catch obvious problems:
- Length check: Is the content within the expected word count range?
- Repetition detection: Are sentences or phrases repeated verbatim?
- Banned phrase filter: common AI clichés like "dive into" or "it's important to note"
- Structure validation: Does the content have proper headings, paragraphs, and formatting?
- Link checking: Are any referenced URLs valid and accessible?
def pre_filter(content: str) -> tuple[bool, list[str]]:
    issues = []
    words = content.split()
    if len(words) < 300:
        issues.append("Content too short")
    if len(words) > 3000:
        issues.append("Content too long")
    # Check for verbatim repeated sentences, normalizing whitespace
    # and discarding empty fragments so they don't trigger false positives
    sentences = [s.strip() for s in content.split('.') if s.strip()]
    if len(sentences) != len(set(sentences)):
        issues.append("Duplicate sentences detected")
    banned = ["dive into", "in today's world", "it's worth noting"]
    lowered = content.lower()
    for phrase in banned:
        if phrase in lowered:
            issues.append(f"Banned phrase: {phrase}")
    return len(issues) == 0, issues
Building the Pipeline
The full scoring pipeline runs in stages, from cheapest to most expensive:
- Stage 1: Rule-based pre-filters (free, instant)
- Stage 2: Embedding-based similarity check against existing content (cheap, fast)
- Stage 3: LLM-based scoring with rubric (more expensive, slower)
- Stage 4: Optional human review for borderline cases
Content that fails at any stage does not proceed to the next one. This saves significant API costs because most poor content gets caught in stages 1 or 2.
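The fail-fast flow can be sketched as below. To keep the sketch self-contained, the stage functions are injected as parameters: `pre_filter` and an LLM scorer like `score_content` from earlier would slot in, while the similarity check is a hypothetical helper assumed to return True for near-duplicates. Here `llm_score` returns just the overall number; in practice you would pull it out of the scorer's full result.

```python
async def score_pipeline(content, pre_filter, similarity_check, llm_score,
                         auto_pass=6.5, review_floor=5.0) -> dict:
    """Run stages cheapest-first and stop at the first failure."""
    # Stage 1: free rule-based checks
    ok, issues = pre_filter(content)
    if not ok:
        return {"verdict": "rejected", "stage": "rules", "issues": issues}
    # Stage 2: embedding-based near-duplicate check
    if await similarity_check(content):
        return {"verdict": "rejected", "stage": "similarity",
                "issues": ["too similar to existing content"]}
    # Stage 3: LLM scoring, then route by threshold
    overall = await llm_score(content)
    if overall >= auto_pass:
        return {"verdict": "pass", "stage": "llm", "overall": overall}
    if overall >= review_floor:
        return {"verdict": "human_review", "stage": "llm", "overall": overall}
    return {"verdict": "rejected", "stage": "llm", "overall": overall}
```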
Tracking Scores Over Time
I store every score in a Supabase database, which lets me track quality trends over time. If average scores start dropping, it usually means my generation prompts need updating or the model has changed behavior after an update.
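One way to flag such a drop automatically, assuming the stored overall scores come back from the database in chronological order (the window size and drop threshold here are illustrative):

```python
from statistics import mean

def quality_drift(scores: list[float], window: int = 20,
                  drop_threshold: float = 0.5) -> bool:
    """Flag drift when the mean of the most recent `window` scores
    falls more than `drop_threshold` below the historical mean."""
    if len(scores) < 2 * window:
        return False  # not enough history to compare
    baseline = mean(scores[:-window])
    recent = mean(scores[-window:])
    return baseline - recent > drop_threshold
```

A True result is a signal to revisit the generation prompts or re-check the model version, not an automatic action by itself.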
Automated scoring is not about replacing human judgment. It is about making human judgment scalable by focusing attention where it matters most.
Results
Since implementing this system, the rejection rate for my content pipelines has dropped from about 30% (caught by manual review) to under 5% (caught automatically before any human sees it). The content that does make it through consistently meets quality standards. More importantly, I can now generate and publish content at a pace that would be impossible with manual review alone.