
Prompt Engineering for Brand Compliance Scoring

Tags: prompt engineering · brand compliance · LLM · content moderation · AI applications

The Brand Compliance Challenge

A client came to me with a problem: they had 50 freelance writers producing content across 10 brands, and quality was wildly inconsistent. Some articles matched brand voice perfectly, while others read like they were written for a completely different company. They needed an automated scoring system that could evaluate every piece of content against brand guidelines before publication.

This is a perfect use case for LLM-based scoring, but getting the prompts right took serious iteration. Here is what I learned.

Designing the Scoring Rubric

The first mistake people make is asking the LLM to give a single score. "Rate this content for brand compliance on a scale of 1-10" produces inconsistent, unreliable scores. Instead, I break compliance into discrete, measurable dimensions.

For a typical brand, I score across six dimensions:

  • Voice and tone: Does the writing match the brand personality?
  • Vocabulary compliance: Are approved terms used? Are banned terms avoided?
  • Audience alignment: Is the content appropriate for the target demographic?
  • Style guide adherence: Capitalization, formatting, structural rules
  • Messaging accuracy: Are key brand messages and value propositions correctly represented?
  • Competitor mentions: Are competitor references handled according to policy?
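
Not every dimension matters equally, so the per-dimension scores get rolled into a weighted overall. A minimal sketch of that rollup — the weights here are illustrative placeholders, not a real client configuration:

```python
# Illustrative per-dimension weights (hypothetical values; tune per brand).
# They should sum to 1.0 so the overall stays on the same 1-5 scale.
DIMENSION_WEIGHTS = {
    "voice_and_tone": 0.25,
    "vocabulary_compliance": 0.20,
    "audience_alignment": 0.15,
    "style_guide": 0.15,
    "messaging_accuracy": 0.15,
    "competitor_mentions": 0.10,
}

def weighted_overall(scores: dict) -> float:
    """Collapse per-dimension 1-5 scores into one weighted overall score."""
    total = sum(DIMENSION_WEIGHTS[d] * scores[d] for d in DIMENSION_WEIGHTS)
    return round(total, 2)
```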

The Scoring Prompt Structure

SCORING_PROMPT = """
You are a brand compliance auditor for {brand_name}.

BRAND GUIDELINES:
{brand_guidelines}

SCORING RUBRIC:
For each dimension, assign a score from 1-5 using these exact criteria:

## Voice and Tone
5: Perfectly matches brand voice. Every sentence sounds authentically {brand_name}.
4: Mostly on-brand with minor tonal inconsistencies.
3: Recognizably attempts brand voice but with notable deviations.
2: Significant tonal mismatch. Reads as generic or off-brand.
1: Completely wrong tone. Could be mistaken for a different brand.

## Vocabulary Compliance  
5: All approved terms used correctly. Zero banned terms.
4: Approved terms used correctly. One minor vocabulary deviation.
3: Generally compliant but 2-3 vocabulary issues.
2: Multiple banned terms or misused brand terminology.
1: Widespread vocabulary violations.

[... similar rubrics for other dimensions ...]

CONTENT TO EVALUATE:
{content}

Respond in this exact JSON format:
{{
  "scores": {{
    "voice_and_tone": <1-5>,
    "vocabulary_compliance": <1-5>,
    "audience_alignment": <1-5>,
    "style_guide": <1-5>,
    "messaging_accuracy": <1-5>,
    "competitor_mentions": <1-5>
  }},
  "overall": ,
  "issues": ["specific issue 1", "specific issue 2"],
  "suggestions": ["specific fix 1", "specific fix 2"]
}}
"""

Calibration Is Everything

The raw prompt will produce scores, but they will not be consistent until you calibrate. Here is my calibration process:

Step 1: Create a reference set. Have a human brand expert score 30 pieces of content across all dimensions. These are your ground truth scores.

Step 2: Run the LLM scorer on the same content. Compare LLM scores to human scores and identify systematic biases.

Step 3: Adjust the rubric descriptions. If the LLM consistently scores voice and tone 1 point higher than humans, add more stringent criteria to the 4 and 5 descriptions.

import numpy as np

def calibrate_scorer(human_scores: list, llm_scores: list) -> dict:
    """Calculate bias and correlation per dimension."""
    calibration = {}
    for dimension in human_scores[0].keys():
        human = [s[dimension] for s in human_scores]
        llm = [s[dimension] for s in llm_scores]
        
        # Positive bias means the LLM scores this dimension higher than humans do.
        bias = np.mean(np.array(llm) - np.array(human))
        correlation = np.corrcoef(human, llm)[0, 1]
        
        calibration[dimension] = {
            'bias': round(bias, 2),
            'correlation': round(correlation, 3)
        }
    return calibration

Few-Shot Examples Drive Consistency

Adding 2-3 scored examples to the prompt dramatically improves consistency. I include one high-scoring example, one low-scoring example, and one borderline case.

FEW_SHOT_EXAMPLES = """
EXAMPLE 1 (High compliance):
Content: "{example_high}"
Scores: voice_and_tone: 5, vocabulary_compliance: 5, ...
Rationale: Perfect brand voice, all approved terminology used correctly.

EXAMPLE 2 (Low compliance):
Content: "{example_low}"
Scores: voice_and_tone: 2, vocabulary_compliance: 1, ...
Rationale: Generic corporate tone, multiple banned terms detected.

EXAMPLE 3 (Borderline):
Content: "{example_borderline}"
Scores: voice_and_tone: 3, vocabulary_compliance: 4, ...
Rationale: Attempts brand voice but lapses into formal tone in technical sections.
"""

Handling Multi-Brand Scoring

When scoring across multiple brands, I maintain separate prompt configurations per brand. Each brand has its own guidelines document, vocabulary lists, and calibrated examples.

import json
from openai import OpenAI

class BrandComplianceScorer:
    def __init__(self, brand_configs: dict):
        self.brands = brand_configs
        self.client = OpenAI()
    
    def score(self, content: str, brand_id: str) -> dict:
        config = self.brands[brand_id]
        prompt = SCORING_PROMPT.format(
            brand_name=config['name'],
            brand_guidelines=config['guidelines'],
            content=content
        )
        
        response = self.client.chat.completions.create(
            model="gpt-4o",
            messages=[{"role": "user", "content": prompt}],
            response_format={"type": "json_object"},
            temperature=0.1  # Low temperature for consistency
        )
        
        return json.loads(response.choices[0].message.content)

Production Patterns

In production, I run every score through three checks:

  • Range validation: All scores must be between 1 and 5
  • Consistency check: If overall score differs from weighted average by more than 0.5, re-run the evaluation
  • Confidence interval: Run the scorer 3 times with slightly varied temperature and take the median. If the spread exceeds 1.5 points on any dimension, flag for human review
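
The three checks above can be sketched in a few lines of plain Python. The weight dict and thresholds are illustrative, not the client's actual configuration:

```python
from statistics import median

def validate(result: dict, weights: dict) -> list:
    """Return a list of validation failures; empty means the score passes."""
    issues = []
    scores = result["scores"]
    # Range validation: every dimension score must fall in 1..5.
    if any(not 1 <= s <= 5 for s in scores.values()):
        issues.append("score out of range")
    # Consistency check: overall should track the weighted average within 0.5.
    expected = sum(weights[d] * scores[d] for d in weights)
    if abs(result["overall"] - expected) > 0.5:
        issues.append("overall diverges from weighted average")
    return issues

def median_scores(runs: list, spread_limit: float = 1.5) -> tuple:
    """Median per dimension across runs; flag dimensions whose spread exceeds the limit."""
    dims = runs[0]["scores"].keys()
    med = {d: median(r["scores"][d] for r in runs) for d in dims}
    flagged = [
        d for d in dims
        if max(r["scores"][d] for r in runs) - min(r["scores"][d] for r in runs) > spread_limit
    ]
    return med, flagged
```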

Results

After deploying this system, the client saw brand compliance scores across their content portfolio increase from an average of 3.1 to 4.2 within 8 weeks. Writers received immediate, specific feedback on every submission, and the editorial team saved roughly 20 hours per week on manual reviews.

The key takeaway: do not ask LLMs for vague scores. Give them specific rubrics, calibrate against human judgment, and validate the outputs. The prompts are the product.