Enforcing Data Quality in Web Scrapers: Pydantic and the Dead Letter Queue

Building a web scraper is often the easy part. The real challenge starts on day two, when the website changes its layout, a field that was always a number suddenly contains "Price on Request," or an unexpected null value crashes your entire downstream ETL pipeline.

This phenomenon is known as Schema Drift. In production, silent data corruption is often worse than a total scraper failure because it poisons your database with "dirty" data that is difficult to clean later.

This guide explains how to transform a standard Python scraper into a robust data pipeline. We'll use the Dermstore.com Playwright scraper as a base and upgrade it using Pydantic for strict schema validation and the Dead Letter Queue (DLQ) pattern for error handling.

Prerequisites

To follow along, you'll need Python 3.8+ installed. We'll build upon the existing ScrapeOps Dermstore repository.

First, clone the repository and install the dependencies:

git clone https://github.com/scraper-bank/Dermstore.com-Scrapers.git
cd Dermstore.com-Scrapers
pip install playwright "pydantic>=2"
playwright install chromium

We will modify the logic found in python/playwright/product_data/scraper/dermstore_scraper_product_data_v1.py.

The Problem: Why dataclasses Aren't Enough

The current implementation in the repository uses standard Python dataclasses to structure scraped data:

from dataclasses import dataclass

@dataclass
class ScrapedData:
    name: str = ""
    price: float = 0.0
    url: str = ""
    # ... other fields

While dataclasses provide a clean way to group data, they have a major flaw: they do not enforce types at runtime.

If your extraction logic accidentally grabs a string like "Call for Price" and assigns it to the price field (hinted as a float), Python will not complain. The object will be created, serialized to JSON, and sent to your database. When your pricing algorithm later tries to calculate a discount on that string, the system crashes.

You need a way to ensure that if data doesn't match your requirements, it's caught before it leaves the scraper.
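
To make the problem concrete, here is a minimal sketch that uses only the standard library. The `LooseProduct` class is a hypothetical stand-in for the repository's ScrapedData, not the actual class, and the product name is made up for illustration.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class LooseProduct:
    """Hypothetical stand-in for the repo's ScrapedData dataclass."""
    name: str = ""
    price: float = 0.0


# The type hint says float, but nothing checks it at runtime.
item = LooseProduct(name="Vitamin C Serum", price="Call for Price")

# The bad value serializes without complaint.
serialized = json.dumps(asdict(item))
print(serialized)

# The failure only surfaces later, far away from the scraper.
try:
    discounted = item.price * 0.9
except TypeError as exc:
    failure = f"Downstream failure: {exc}"
    print(failure)
```

The error appears at the point of use rather than at the point of extraction, which is what makes this kind of corruption hard to trace back.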

Step 1: Defining a Strict Schema with Pydantic

Pydantic allows you to create models that validate data the moment they are initialized. If the data is wrong, Pydantic raises an error immediately.

Let's redefine the ScrapedData model using Pydantic’s BaseModel. We'll use typed, constrained fields so that dirty data is rejected at construction time.

from pydantic import BaseModel, Field, HttpUrl, field_validator
from typing import List, Optional, Dict, Any

class ProductSchema(BaseModel):
    name: str = Field(..., min_length=1)
    productId: str = Field(..., alias="sku")
    price: float = Field(..., gt=0) # Must be a float greater than 0
    currency: str = Field(default="USD")
    url: HttpUrl # Validates that the string is a proper URL
    availability: str
    brand: Optional[str] = None
    
    @field_validator('availability')
    @classmethod
    def validate_availability(cls, v: str) -> str:
        allowed = {'in_stock', 'out_of_stock'}
        if v.lower() not in allowed:
            raise ValueError(f"Availability must be one of {allowed}")
        return v.lower()

Why this is better:

Type Enforcement: If price is passed as a string, Pydantic's default (lax) mode will attempt to convert it (e.g., "10.99" becomes 10.99). If it cannot (e.g., "TBD"), it raises a ValidationError.
Field Validation: The validate_availability method lowercases the value and accepts only the allowed states, so a variant such as "InStock" is rejected instead of sitting next to "in_stock" in your database.
Required Fields: Using ... (the Ellipsis) marks a field as required. If the scraper fails to find the product name, the record is rejected.
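
The difference between coercion and rejection is easy to check in isolation. The sketch below assumes Pydantic v2; `LaxPrice`, `StrictPrice`, and `is_valid` are hypothetical names used only for this illustration. It also shows `strict=True` on a field, which is an option if you would rather reject numeric strings than convert them.

```python
from pydantic import BaseModel, Field, ValidationError


class LaxPrice(BaseModel):
    # Default (lax) mode: numeric strings are coerced to float.
    price: float = Field(..., gt=0)


class StrictPrice(BaseModel):
    # strict=True disables string-to-float coercion for this field.
    price: float = Field(..., gt=0, strict=True)


def is_valid(model, **fields) -> bool:
    """Return True if the model accepts the given fields."""
    try:
        model(**fields)
        return True
    except ValidationError:
        return False


coerced = LaxPrice(price="10.99")
print(coerced.price)                        # a float, not a string

print(is_valid(LaxPrice, price="TBD"))      # cannot be parsed as a number
print(is_valid(LaxPrice, price=-5))         # violates gt=0
print(is_valid(LaxPrice))                   # required field missing
print(is_valid(StrictPrice, price="10.99")) # strict mode refuses the string
print(is_valid(StrictPrice, price=10.99))   # a real float passes
```

Whether to use lax or strict fields is a design choice: lax mode tolerates prices that arrive as clean strings, while strict mode forces the extraction code to do the conversion explicitly.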

Step 2: Validating Data During Extraction

Now we update the extract_data function. Instead of returning a dictionary or a loose dataclass, we will attempt to instantiate our ProductSchema.

In the original script, data is extracted into a dictionary. We wrap the final return in our Pydantic model:

async def extract_data(page: Page) -> Optional[ProductSchema]:
    # ... (extraction logic from the repo)
    
    raw_data = {
        "name": await page.locator("#product-title").inner_text(),
        "sku": await page.locator("#ratingSummary").get_attribute("data-sku"),
        "price": price_extracted_from_page,
        "url": page.url,
        "availability": "in_stock" if "in stock" in avail_text else "out_of_stock"
    }

    try:
        validated_product = ProductSchema(**raw_data)
        return validated_product
    except ValidationError as e:  # requires: from pydantic import ValidationError
        logger.error(f"Validation failed for {page.url}: {e}")
        raise
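
The snippet above leaves price_extracted_from_page to the repository's extraction logic. As one possible way to produce it, the sketch below defines a hypothetical `parse_price` helper (not part of the repository) that assumes US-style price text such as "$1,299.00". It returns None when no number is present, so a value like "Price on Request" fails the required price field instead of silently becoming 0.0.

```python
import re
from typing import Optional

# Matches a run of digits with optional thousands separators and decimals.
_PRICE_RE = re.compile(r"\d[\d,]*(?:\.\d+)?")


def parse_price(text: Optional[str]) -> Optional[float]:
    """Extract the first number from price text, or None if there is none.

    Assumes "," is a thousands separator and "." is the decimal point.
    """
    if not text:
        return None
    match = _PRICE_RE.search(text)
    if match is None:
        return None
    return float(match.group().replace(",", ""))


print(parse_price("$1,299.00"))
print(parse_price("Price on Request"))
```

Returning None rather than a default number keeps the decision about bad prices in the schema, where it is logged and routed consistently.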

Step 3: Handling Complex Nested Fields

Dermstore products contain nested lists for reviews and specifications. Pydantic handles this using nested models.

class ReviewSchema(BaseModel):
    author: str
    content: str
    rating: float = Field(..., ge=1, le=5)
    date: str

class SpecificationSchema(BaseModel):
    key: str
    value: str

class EnhancedProductSchema(ProductSchema):
    reviews: List[ReviewSchema] = []
    specifications: List[SpecificationSchema] = []

By defining List[ReviewSchema], Pydantic automatically validates every review object in the list. If one review has a rating of 6.0, the entire product record is flagged as invalid.
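
A small sketch, assuming Pydantic v2, shows how the error pinpoints the offending review. `Review` and `Product` are trimmed-down, hypothetical versions of the schemas above, and the sample data is invented.

```python
from typing import List

from pydantic import BaseModel, Field, ValidationError


class Review(BaseModel):
    author: str
    rating: float = Field(..., ge=1, le=5)


class Product(BaseModel):
    name: str
    reviews: List[Review] = []


raw = {
    "name": "Vitamin C Serum",
    "reviews": [
        {"author": "A", "rating": 4.5},
        {"author": "B", "rating": 6.0},  # out of range
    ],
}

try:
    Product(**raw)
    error_locations = []
except ValidationError as exc:
    # Each error carries a path into the nested structure.
    error_locations = [err["loc"] for err in exc.errors()]

print(error_locations)  # [('reviews', 1, 'rating')]
```

The location tuple is worth storing alongside the failed record, because it tells you which selector to inspect.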

Step 4: Implementing the Dead Letter Queue (DLQ)

If a product fails validation, the scraper shouldn't crash. You might have 1,000 other successful products to process. However, you shouldn't ignore the error either.

The Dead Letter Queue (DLQ) pattern involves saving invalid records to a separate file (dlq.jsonl) along with the reason they failed. This allows you to fix your selectors later without losing data.

Modify the DataPipeline class as follows:

import json
from datetime import datetime

from pydantic import ValidationError

# `logger` and EnhancedProductSchema are assumed to be defined earlier in the script.

class DataPipeline:
    def __init__(self, output_file="products.jsonl", dlq_file="dlq.jsonl"):
        self.output_file = output_file
        self.dlq_file = dlq_file

    def save_to_jsonl(self, filename: str, data: dict):
        # default=str is a fallback so unusual raw values cannot break the DLQ write
        with open(filename, "a", encoding="utf-8") as f:
            f.write(json.dumps(data, default=str) + "\n")

    def add_data(self, raw_extracted_data: dict):
        try:
            # Attempt validation
            validated_item = EnhancedProductSchema(**raw_extracted_data)
        except ValidationError as e:
            # If validation fails, move to DLQ
            error_entry = {
                "url": raw_extracted_data.get("url"),
                "error": str(e),
                "raw_data": raw_extracted_data,
                "timestamp": datetime.now().isoformat()
            }
            self.save_to_jsonl(self.dlq_file, error_entry)
            logger.warning(f"Validation failed. Record moved to DLQ: {self.dlq_file}")
        else:
            # If successful, save to main output. mode="json" converts HttpUrl
            # to a plain string so json.dumps can serialize it.
            self.save_to_jsonl(self.output_file, validated_item.model_dump(mode="json"))
            logger.info(f"Successfully saved {validated_item.name}")
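
Because each DLQ entry keeps its raw_data, failed records can be re-validated after you adjust the schema or a cleaning step. The sketch below is one way to do that with the standard library; `replay_dlq` is a hypothetical helper, and `validate` is any callable you supply that raises on bad input (for example, a lambda that constructs EnhancedProductSchema). If the fix was to a selector, you would re-scrape the stored URLs instead, since the raw data itself was extracted incorrectly.

```python
import json
from typing import Callable, List, Tuple


def replay_dlq(
    dlq_path: str, validate: Callable[[dict], object]
) -> Tuple[List[dict], List[dict]]:
    """Re-validate the raw_data of every DLQ entry.

    Returns (recovered, still_failing) lists of raw records.
    """
    recovered: List[dict] = []
    still_failing: List[dict] = []
    with open(dlq_path, encoding="utf-8") as f:
        for line in f:
            if not line.strip():
                continue  # tolerate blank lines
            raw = json.loads(line)["raw_data"]
            try:
                validate(raw)
                recovered.append(raw)
            except Exception:  # broad on purpose: `validate` is caller-supplied
                still_failing.append(raw)
    return recovered, still_failing
```

A call might look like `replay_dlq("dlq.jsonl", lambda raw: EnhancedProductSchema(**raw))`, with the recovered records then passed back through the pipeline.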

Recommended Approaches for Production

Implementing strict validation is a major step forward, but maintenance is key to long-term success.

Monitor your DLQ: A few items in the DLQ are normal, such as a single product page with a weird edge case. If the DLQ starts growing rapidly, it is a clear signal that the website’s schema has shifted and your selectors need updating.
Use Optional Sparingly: It is tempting to make every field optional to prevent errors. Avoid this. Force your scraper to fail on critical fields like price or name. It is better to have no data than wrong data.
Node.js Alternatives: If you use the Node.js implementations from the repository, use Zod. It provides a nearly identical experience to Pydantic for TypeScript and JavaScript.
Automated Alerts: Set up a script to count the records in your dlq.jsonl after a run. If the count exceeds 5% of the number of successfully saved records, trigger an alert.
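
The alert check can be sketched in a few lines of standard-library Python. The function names `count_records` and `dlq_exceeds_threshold` are hypothetical, the file names match the DataPipeline defaults above, and how you deliver the alert (email, Slack, a non-zero exit code) is left open.

```python
from pathlib import Path


def count_records(path: str) -> int:
    """Count non-empty lines in a JSONL file; a missing file counts as 0."""
    p = Path(path)
    if not p.exists():
        return 0
    with p.open(encoding="utf-8") as f:
        return sum(1 for line in f if line.strip())


def dlq_exceeds_threshold(
    output_file: str = "products.jsonl",
    dlq_file: str = "dlq.jsonl",
    threshold: float = 0.05,
) -> bool:
    """Return True when DLQ records exceed `threshold` of successful records."""
    succeeded = count_records(output_file)
    failed = count_records(dlq_file)
    if succeeded == 0:
        # Nothing succeeded: any failure at all is worth an alert.
        return failed > 0
    return failed / succeeded > threshold
```

Since both files are opened in append mode by the pipeline, this check assumes they are rotated or cleared between runs; otherwise the ratio reflects every run so far rather than the latest one.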

To Wrap Up

By moving from loose dictionaries and dataclasses to strict Pydantic models, you transform a scraper from a brittle script into a reliable data pipeline.

Key Takeaways:

Schema Drift is inevitable; plan for it with runtime validation.
Pydantic ensures only data meeting your specifications enters your database.
Dead Letter Queues prevent scraper crashes while ensuring bad data is preserved for debugging rather than silently discarded.

If you want a head start on building your next scraper with these patterns, use the ScrapeOps AI Scraper Generator to create the base extraction logic, then layer in Pydantic for production-grade reliability.