Enforcing Data Quality in Web Scrapers: Pydantic and the Dead Letter Queue

Building a web scraper is often the easy part. The real challenge starts on day two, when the website changes its layout, a field that was always a number suddenly contains "Price on Request," or an unexpected null value crashes your entire downstream ETL pipeline.
This phenomenon is known as Schema Drift. In production, silent data corruption is often worse than a total scraper failure because it poisons your database with "dirty" data that is difficult to clean later.
This guide explains how to transform a standard Python scraper into a robust data pipeline. We'll use the Dermstore.com Playwright scraper as a base and upgrade it using Pydantic for strict schema validation and the Dead Letter Queue (DLQ) pattern for error handling.
Prerequisites
To follow along, you'll need Python 3.8+ and Pydantic v2 installed; the examples use field_validator and model_dump, which are v2 APIs. We'll build upon the existing ScrapeOps Dermstore repository.
First, clone the repository and install the dependencies:
git clone https://github.com/scraper-bank/Dermstore.com-Scrapers.git
cd Dermstore.com-Scrapers
pip install playwright "pydantic[email]"
playwright install chromium
We will modify the logic found in python/playwright/product_data/scraper/dermstore_scraper_product_data_v1.py.
The Problem: Why dataclasses Aren't Enough
The current implementation in the repository uses standard Python dataclasses to structure scraped data:
from dataclasses import dataclass

@dataclass
class ScrapedData:
    name: str = ""
    price: float = 0.0
    url: str = ""
    # ... other fields
While dataclasses provide a clean way to group data, they have a major flaw: they do not enforce types at runtime.
If your extraction logic accidentally grabs a string like "Call for Price" and assigns it to the price field (hinted as a float), Python will not complain. The object will be created, serialized to JSON, and sent to your database. When your pricing algorithm later tries to calculate a discount on that string, the system crashes.
You need a way to ensure that if data doesn't match your requirements, it's caught before it leaves the scraper.
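The failure mode is easy to reproduce in isolation. The sketch below uses a trimmed-down copy of the ScrapedData dataclass (three fields only, with a placeholder product name and URL chosen for illustration) to show that the bad value is accepted at construction time, survives serialization, and only fails when something finally does arithmetic on it:

```python
from dataclasses import dataclass, asdict
import json


@dataclass
class ScrapedData:
    name: str = ""
    price: float = 0.0
    url: str = ""


# The type hint says float, but nothing checks it at runtime.
item = ScrapedData(name="Serum", price="Call for Price", url="https://example.com/serum")

# Serialization also succeeds, so the bad value travels downstream.
payload = json.dumps(asdict(item))

# The error only appears when the value is finally used as a number.
try:
    discounted = item.price * 0.9
except TypeError as exc:
    failure = f"TypeError: {exc}"
    print(failure)
```

By the time the TypeError surfaces, the record has already left the scraper, which is why the check needs to move to the point where the object is created.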
Step 1: Defining a Strict Schema with Pydantic
Pydantic allows you to create models that validate data the moment they are initialized. If the data is wrong, Pydantic raises an error immediately.
Let's redefine the ScrapedData model using Pydantic’s BaseModel. We'll use typed, constrained fields so that dirty data is rejected the moment the model is constructed.
from pydantic import BaseModel, Field, HttpUrl, field_validator
from typing import List, Optional

class ProductSchema(BaseModel):
    name: str = Field(..., min_length=1)
    productId: str = Field(..., alias="sku")
    price: float = Field(..., gt=0)  # Must be a float greater than 0
    currency: str = Field(default="USD")
    url: HttpUrl  # Validates that the string is a proper URL
    availability: str
    brand: Optional[str] = None

    @field_validator('availability')
    @classmethod
    def validate_availability(cls, v: str) -> str:
        # Normalize case, then reject anything outside the allowed states
        allowed = {'in_stock', 'out_of_stock'}
        if v.lower() not in allowed:
            raise ValueError(f"Availability must be one of {allowed}")
        return v.lower()
Why this is better:
Type Validation: If price is passed as a string, Pydantic's default (lax) mode attempts to convert it (e.g., "10.99" becomes 10.99). If it cannot (e.g., "TBD"), it raises a ValidationError.
Field Validation: The validate_availability method ensures you only accept specific states, preventing "InStock" vs "in_stock" inconsistencies.
Required Fields: Using ... (the Ellipsis) marks a field as required. If the scraper fails to find the product name, the record is rejected.
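To see these rules in action without the rest of the scraper, here is a minimal sketch. PriceOnly is a hypothetical one-field model (not part of the repository) that isolates the price constraint from ProductSchema:

```python
from pydantic import BaseModel, Field, ValidationError


class PriceOnly(BaseModel):
    # Hypothetical one-field model, used only to isolate the price rules.
    price: float = Field(..., gt=0)


# A numeric string is coerced to a float in Pydantic's default (lax) mode.
ok = PriceOnly(price="10.99")
print(ok.price)

# Non-numeric text, a non-positive value, and a missing field are all rejected.
rejected = []
for raw in ({"price": "TBD"}, {"price": 0}, {}):
    try:
        PriceOnly(**raw)
    except ValidationError as exc:
        # errors() returns one dict per problem, including a machine-readable type.
        rejected.append(exc.errors()[0]["type"])

print(rejected)
```

The machine-readable error type is useful later on: it lets you group failures by cause instead of parsing error messages.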
Step 2: Validating Data During Extraction
Now we update the extract_data function. Instead of returning a dictionary or a loose dataclass, we will attempt to instantiate our ProductSchema.
In the original script, data is extracted into a dictionary. We wrap the final return in our Pydantic model:
from pydantic import ValidationError

async def extract_data(page: Page) -> Optional[ProductSchema]:
    # ... (extraction logic from the repo)
    raw_data = {
        "name": await page.locator("#product-title").inner_text(),
        "sku": await page.locator("#ratingSummary").get_attribute("data-sku"),
        "price": price_extracted_from_page,
        "url": page.url,
        "availability": "in_stock" if "in stock" in avail_text else "out_of_stock"
    }
    try:
        # Validate before anything leaves the scraper
        validated_product = ProductSchema(**raw_data)
        return validated_product
    except ValidationError as e:
        # Log the failing URL, then re-raise so the caller can route the record
        logger.error(f"Validation failed for {page.url}: {e}")
        raise
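One detail in raw_data is worth a closer look: the dictionary uses the key sku, not productId. Because the field declares alias="sku", the model expects the alias on input but exposes the field name on the instance and, by default, in its serialized output. The sketch below illustrates this with a hypothetical two-field model (AliasDemo) and a placeholder URL:

```python
from pydantic import BaseModel, Field, HttpUrl


class AliasDemo(BaseModel):
    # Hypothetical stand-in for the aliased field in ProductSchema.
    productId: str = Field(..., alias="sku")
    url: HttpUrl


# Input uses the alias ("sku"); the attribute uses the field name.
item = AliasDemo(sku="12345", url="https://example.com/product/123")
print(item.productId)

# mode="json" turns the validated URL object into a plain string.
dumped = item.model_dump(mode="json")

# by_alias=True writes the alias back out instead of the field name.
by_alias = item.model_dump(mode="json", by_alias=True)
print(sorted(dumped), sorted(by_alias))
```

Whichever key you choose for output, keep it consistent, since downstream consumers will see either productId or sku depending on this flag.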
Step 3: Handling Complex Nested Fields
Dermstore products contain nested lists for reviews and specifications. Pydantic handles this using nested models.
class ReviewSchema(BaseModel):
    author: str
    content: str
    rating: float = Field(..., ge=1, le=5)
    date: str

class SpecificationSchema(BaseModel):
    key: str
    value: str

class EnhancedProductSchema(ProductSchema):
    reviews: List[ReviewSchema] = []
    specifications: List[SpecificationSchema] = []
By defining List[ReviewSchema], Pydantic automatically validates every review object in the list. If one review has a rating of 6.0, the entire product record is flagged as invalid.
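When a nested item fails, the error also tells you where in the input the problem sits. A minimal sketch, using trimmed-down stand-ins for the schemas above (Review and Product are hypothetical names with fewer fields):

```python
from typing import List

from pydantic import BaseModel, Field, ValidationError


class Review(BaseModel):
    # Trimmed-down stand-in for ReviewSchema.
    author: str
    rating: float = Field(..., ge=1, le=5)


class Product(BaseModel):
    # Trimmed-down stand-in for EnhancedProductSchema.
    name: str
    reviews: List[Review] = []


raw = {
    "name": "Serum",
    "reviews": [
        {"author": "A", "rating": 4.5},
        {"author": "B", "rating": 6.0},  # out of range
    ],
}

try:
    Product(**raw)
except ValidationError as exc:
    # "loc" is a path into the input: field name, list index, nested field.
    locations = [err["loc"] for err in exc.errors()]
    print(locations)
```

The path pinpoints the second review's rating, which makes it much faster to tell a one-off bad review from a selector that now grabs the wrong element.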
Step 4: Implementing the Dead Letter Queue (DLQ)
If a product fails validation, the scraper shouldn't crash. You might have 1,000 other successful products to process. However, you shouldn't ignore the error either.
The Dead Letter Queue (DLQ) pattern involves saving invalid records to a separate file (dlq.jsonl) along with the reason they failed. This allows you to fix your selectors later without losing data.
Modify the DataPipeline class as follows:
import json
from datetime import datetime

from pydantic import ValidationError

# `logger` is assumed to be the logger already configured in the script.

class DataPipeline:
    def __init__(self, output_file="products.jsonl", dlq_file="dlq.jsonl"):
        self.output_file = output_file
        self.dlq_file = dlq_file

    def save_to_jsonl(self, filename: str, data: dict):
        # default=str keeps an unexpected raw value from breaking the DLQ write
        with open(filename, "a", encoding="utf-8") as f:
            f.write(json.dumps(data, default=str) + "\n")

    def add_data(self, raw_extracted_data: dict):
        try:
            # Attempt validation
            validated_item = EnhancedProductSchema(**raw_extracted_data)
        except ValidationError as e:
            # If validation fails, move to DLQ
            error_entry = {
                "url": raw_extracted_data.get("url"),
                "error": str(e),
                "raw_data": raw_extracted_data,
                "timestamp": datetime.now().isoformat()
            }
            self.save_to_jsonl(self.dlq_file, error_entry)
            logger.warning(f"Validation failed. Record moved to DLQ: {self.dlq_file}")
            return
        # If successful, save to main output.
        # mode="json" converts HttpUrl to a plain string so json.dumps can handle it.
        self.save_to_jsonl(self.output_file, validated_item.model_dump(mode="json"))
        logger.info(f"Successfully saved {validated_item.name}")
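To make the routing behaviour concrete, here is a self-contained sketch of the same pattern run against a small batch. MiniProduct and route_record are hypothetical names, the schema is cut down to two fields, and the files are written to a temporary directory rather than the project folder:

```python
import json
import tempfile
from datetime import datetime
from pathlib import Path

from pydantic import BaseModel, Field, ValidationError


class MiniProduct(BaseModel):
    # Hypothetical two-field schema standing in for EnhancedProductSchema.
    name: str = Field(..., min_length=1)
    price: float = Field(..., gt=0)


def route_record(raw: dict, output_file: Path, dlq_file: Path) -> bool:
    """Append a valid record to output_file, or the raw record plus the error to dlq_file."""
    try:
        item = MiniProduct(**raw)
    except ValidationError as exc:
        line = {
            "error": str(exc),
            "raw_data": raw,
            "timestamp": datetime.now().isoformat(),
        }
        target, ok = dlq_file, False
    else:
        line = item.model_dump(mode="json")
        target, ok = output_file, True
    with open(target, "a", encoding="utf-8") as f:
        f.write(json.dumps(line, default=str) + "\n")
    return ok


workdir = Path(tempfile.mkdtemp())
products, dlq = workdir / "products.jsonl", workdir / "dlq.jsonl"

batch = [
    {"name": "Serum", "price": "24.50"},
    {"name": "Cleanser", "price": "Price on Request"},
    {"name": "Toner", "price": 18},
]
# The invalid middle record does not stop the records after it.
results = [route_record(raw, products, dlq) for raw in batch]
print(results)
```

The invalid record lands in the DLQ file with its raw payload intact, so it can be re-validated once the selector or schema is fixed.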
Recommended Approaches for Production
Implementing strict validation is a major step forward, but maintenance is key to long-term success.
Monitor your DLQ: A few items in the DLQ are normal, such as a single product page with a weird edge case. If the DLQ starts growing rapidly, it is a clear signal that the website’s schema has shifted and your selectors need updating.
Use Optional Sparingly: It is tempting to make every field optional to prevent errors. Avoid this. Force your scraper to fail on critical fields like price or name. It is better to have no data than wrong data.
Node.js Alternatives: If you use the Node.js implementations from the repository, use Zod. It provides a nearly identical experience to Pydantic for TypeScript and JavaScript.
Automated Alerts: Set up a script to check the size of your dlq.jsonl after a run. If it exceeds 5% of your total successful scrapes, trigger an alert.
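The automated alert can be sketched with the standard library alone. The function names below are hypothetical, and the sketch assumes both JSONL files contain records from a single run (the pipeline appends, so rotate or clear the files between runs for the ratio to be meaningful):

```python
from pathlib import Path


def count_records(path: Path) -> int:
    """Count non-empty lines in a JSONL file; a missing file counts as zero."""
    if not path.exists():
        return 0
    with open(path, encoding="utf-8") as f:
        return sum(1 for line in f if line.strip())


def dlq_ratio(output_file: Path, dlq_file: Path) -> float:
    """Return DLQ records divided by successful records."""
    good = count_records(output_file)
    bad = count_records(dlq_file)
    if good == 0:
        # No successes at all: treat any DLQ entry as a full failure.
        return float("inf") if bad else 0.0
    return bad / good


def should_alert(output_file: Path, dlq_file: Path, threshold: float = 0.05) -> bool:
    """True when the DLQ is larger than `threshold` times the successful output."""
    return dlq_ratio(output_file, dlq_file) > threshold


if __name__ == "__main__":
    if should_alert(Path("products.jsonl"), Path("dlq.jsonl")):
        print("ALERT: DLQ exceeds 5% of successful scrapes")
```

In practice the print call would be replaced by whatever notification channel you already use; the threshold is a parameter so it can be tuned per site.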
To Wrap Up
By moving from loose dictionaries to strict Pydantic models, you transform a scraper from a brittle script into a reliable data pipeline.
Key Takeaways:
Schema Drift is inevitable; plan for it with runtime validation.
Pydantic ensures only data meeting your specifications enters your database.
Dead Letter Queues prevent scraper crashes while ensuring bad data is preserved for debugging rather than silently discarded.
If you want a head start on building your next scraper with these patterns, use the ScrapeOps AI Scraper Generator to create the base extraction logic, then layer in Pydantic for production-grade reliability.



