Hardening Your Costco Scraper: Detecting Soft Bans and Enforcing Data Quality with Pydantic

When scraping high-value e-commerce targets like Costco, an HTTP 200 OK status code is often a lie. While many developers rely on status codes to trigger retries, Costco frequently employs soft bans. Instead of a blunt 403 Forbidden, the server serves a valid HTML page containing a "Please verify you are human" challenge or a skeleton page devoid of product data.
If your scraper doesn't differentiate between a real product page and these "ghost" responses, your database will quickly fill with null values and zeroed-out prices. To prevent this, you need to move beyond simple status checks and implement Schema Validation.
This guide explains how to upgrade the Costco.com-Scrapers repository from basic Python dataclasses to Pydantic models. This allows your scraper to fail fast, rotate proxies, and retry until it secures high-quality data.
Prerequisites
To follow along, you should have:
Python 3.8+ installed.
Familiarity with BeautifulSoup or Playwright.
A ScrapeOps API Key for proxy rotation.
Pydantic v2 installed (the examples use field_validator and pattern, which are v2 APIs): pip install "pydantic>=2"
Identifying Soft Bans and Data Drift
A soft ban occurs when Costco's anti-bot system—such as Akamai or PerimeterX—suspects automated activity but hasn't completely severed the connection. There are three common scenarios:
The "Human Check" Page: The response is a 200 OK, but the body contains a CAPTCHA or a "Verify you are human" script.
The Empty Skeleton: You receive a page with headers and footers, but the main product JSON-LD or DOM elements containing the price and SKU are missing.
Geo-Location Shifts: Sometimes a proxy in a restricted region causes the site to hide pricing or show "Out of Stock" nationally, even if the item is available.
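The first two scenarios can often be caught with a cheap check on the response body before any parsing happens. The sketch below shows one way to do that. The marker phrases and the looks_like_soft_ban helper are illustrative assumptions, not strings confirmed from Costco's pages, so adjust them to match what your own blocked responses contain.

```python
import re

# Hypothetical marker phrases; replace with text seen in your own blocked responses.
CHALLENGE_MARKERS = ("verify you are human", "captcha")

# Matches an opening JSON-LD script tag, where product data is commonly embedded.
JSON_LD_TAG = re.compile(r'<script[^>]+type=["\']application/ld\+json["\']', re.IGNORECASE)


def looks_like_soft_ban(html: str) -> bool:
    """Return True when a 200 OK body looks like a challenge or an empty skeleton."""
    lowered = html.lower()
    if any(marker in lowered for marker in CHALLENGE_MARKERS):
        return True  # "Human Check" page
    # Empty skeleton: no structured product data anywhere in the document
    return JSON_LD_TAG.search(html) is None


challenge = "<html><body><h1>Please verify you are human</h1></body></html>"
skeleton = "<html><header>Costco</header><footer>(c)</footer></html>"
product = '<html><script type="application/ld+json">{"name": "TV"}</script></html>'

print(looks_like_soft_ban(challenge), looks_like_soft_ban(skeleton), looks_like_soft_ban(product))
# True True False
```

A string heuristic like this is only a first filter: a legitimate page that happens to mention a CAPTCHA script would be flagged too, which is why schema validation remains the real gate.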
In the current repository scripts, such as python/BeautifulSoup/product_data/scraper/costco_scraper_product_data_v1.py, data is stored in a standard dataclass. If the scraper fails to find a price, it might default to 0.0. Without validation, your pipeline saves this empty data as if it were a success.
Replacing Dataclasses with Pydantic
The existing repository uses a ScrapedData dataclass that looks like this:
from dataclasses import dataclass


@dataclass
class ScrapedData:
    name: str = ""
    price: float = 0.0
    productId: str = ""
    # ... other fields
The problem is that ScrapedData(name="", price=0.0) is considered a valid object. We want to ensure that a product is only valid if it has a name, a non-zero price, and a SKU.
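To make the problem concrete, here is a small runnable sketch using a trimmed copy of that dataclass (the remaining fields are omitted). It shows that construction succeeds with no data at all, and that dataclasses do not check types at runtime either.

```python
from dataclasses import dataclass


@dataclass
class ScrapedData:
    # Trimmed copy of the repository dataclass; other fields omitted
    name: str = ""
    price: float = 0.0
    productId: str = ""


# A ghost page yields no values, yet construction still succeeds silently.
ghost = ScrapedData()
print(ghost)  # ScrapedData(name='', price=0.0, productId='')

# Type hints are not enforced: a string lands in the float field unchanged.
wrong_type = ScrapedData(price="N/A")
print(repr(wrong_type.price))  # 'N/A'
```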
Create a Pydantic model in a new file, models.py, to define what a valid Costco product looks like:
from pydantic import BaseModel, Field, field_validator


class CostcoProduct(BaseModel):
    name: str = Field(..., min_length=2)  # Must exist and be at least 2 chars
    price: float = Field(..., gt=0)  # Price must be greater than 0
    product_id: str = Field(..., pattern=r"^\d+$")  # Must be a numeric string
    availability: str
    url: str

    @field_validator("price", mode="before")
    @classmethod
    def parse_comma_price(cls, v):
        # Runs before type coercion: strip "$" and "," so "$1,299.99" parses as a float
        if isinstance(v, str):
            return float(v.replace(",", "").replace("$", "").strip())
        return v
Key Benefits
Required Fields: The ... (Ellipsis) indicates a field is mandatory. If Costco sends a page without a price, Pydantic raises a ValidationError.
Type Coercion: The parse_comma_price validator strips the currency symbol and thousands separators, so string prices like "$1,299.99" become floats before the gt=0 check runs.
Data Integrity: The pattern check ensures the product_id matches Costco's internal numeric format.
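The following sketch exercises these three behaviors end to end. The model is repeated so the example runs on its own, and both payloads are made-up values for illustration, not real Costco data.

```python
from pydantic import BaseModel, Field, ValidationError, field_validator


class CostcoProduct(BaseModel):
    # Same model as above, repeated so the example is self-contained
    name: str = Field(..., min_length=2)
    price: float = Field(..., gt=0)
    product_id: str = Field(..., pattern=r"^\d+$")
    availability: str
    url: str

    @field_validator("price", mode="before")
    @classmethod
    def parse_comma_price(cls, v):
        if isinstance(v, str):
            return float(v.replace(",", "").replace("$", ""))
        return v


# Made-up payloads for illustration only
good = {"name": "Example TV", "price": "$1,299.99", "product_id": "1234567",
        "availability": "In Stock", "url": "https://www.costco.com/example.html"}
ghost = {"name": "", "price": "0.00", "product_id": "",
         "availability": "", "url": "https://www.costco.com/example.html"}

product = CostcoProduct(**good)
print(product.price)  # 1299.99

try:
    CostcoProduct(**ghost)
    failed_fields = set()
except ValidationError as e:
    # Each error entry carries the path to the offending field in "loc"
    failed_fields = {err["loc"][0] for err in e.errors()}
print(sorted(failed_fields))  # ['name', 'price', 'product_id']
```

Note that availability passes even when empty, because the model only requires it to be a string. Tighten that field with min_length if an empty value should also count as a failure.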
Implementing the Validation Logic
Now we need to integrate this into the extraction flow. We will modify the extract_data function found in the repository to wrap the raw dictionary in our new model.
First, define a custom exception in models.py to distinguish between a bad page (soft ban) and a bad script (bug):
class SoftBanError(Exception):
    """Raised when the page returns 200 OK but contains no valid data."""
Next, update the extraction wrapper:
from pydantic import ValidationError

from models import CostcoProduct, SoftBanError


def get_validated_data(soup, url):
    # This calls the original repository logic to get a raw dict.
    # Its keys must match the model's field names (product_id, not productId).
    raw_data = extract_raw_dict_from_soup(soup, url)
    try:
        # Attempt to create the Pydantic model
        return CostcoProduct(**raw_data)
    except ValidationError as e:
        # If price or name is missing, it's likely a soft ban
        raise SoftBanError(f"Validation failed: {e.json()}") from e
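One detail to watch when wiring this up: the dataclass shown earlier names its SKU field productId, while CostcoProduct expects product_id. Assuming the repository's raw dictionary uses the dataclass field names (an assumption, so check the actual extraction output), every page would fail validation and be misread as a soft ban. A small adapter along these lines could rename the keys first; KEY_MAP and to_model_payload are hypothetical names introduced for this sketch.

```python
# Hypothetical mapping from the repository's dataclass field names to the model's.
KEY_MAP = {"productId": "product_id"}


def to_model_payload(raw: dict) -> dict:
    """Rename known keys; leave every other key untouched."""
    return {KEY_MAP.get(key, key): value for key, value in raw.items()}


# Made-up raw dictionary for illustration
raw = {"name": "Example TV", "price": "1,299.99", "productId": "1234567"}
payload = to_model_payload(raw)
print(payload)
# {'name': 'Example TV', 'price': '1,299.99', 'product_id': '1234567'}
```

Passing the adapted payload to the model keeps a naming mismatch (a bug in the script) from being reported as a SoftBanError (a bad page).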
The "Retry-on-Validation-Failure" Loop
This is the core architectural change. We must treat a SoftBanError exactly like a 403 Forbidden. If validation fails, we rotate the proxy and try again.
Here is how to refactor the main execution loop:
import requests
from bs4 import BeautifulSoup

from models import SoftBanError
# get_validated_data is the wrapper defined in the previous step


def scrape_with_validation(url, api_key):
    max_retries = 5
    for attempt in range(max_retries):
        # Using ScrapeOps Proxy Port
        proxy_url = f"http://scrapeops:{api_key}@residential-proxy.scrapeops.io:8181"
        proxies = {"http": proxy_url, "https": proxy_url}
        try:
            response = requests.get(url, proxies=proxies, timeout=30)
            if response.status_code != 200:
                # Hard blocks (403 and similar) are retried the same way as soft bans
                print(f"Attempt {attempt + 1} failed: HTTP {response.status_code}. Rotating proxy...")
                continue
            soup = BeautifulSoup(response.text, "lxml")
            # VALIDATION STEP
            product_data = get_validated_data(soup, url)
            return product_data  # Success!
        except (SoftBanError, requests.RequestException) as e:
            print(f"Attempt {attempt + 1} failed: {e}. Rotating proxy...")
    return None  # Permanent failure after retries
By adding SoftBanError to the except block, the scraper no longer accepts empty results. It retries, up to max_retries times, until the ScrapeOps residential proxy finds a clean route to the data, and returns None if every attempt fails.
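The retry logic can also be exercised without touching the network. The sketch below isolates the loop by passing the fetch and validation steps in as callables. The retry_until_valid, fake_fetch, and fake_validate names are hypothetical stand-ins, not functions from the repository.

```python
class SoftBanError(Exception):
    """Raised when a 200 OK response contains no valid data."""


def retry_until_valid(fetch, validate, max_retries=5):
    """Call fetch() then validate(); retry whenever validate raises SoftBanError."""
    for attempt in range(1, max_retries + 1):
        page = fetch()
        try:
            return validate(page), attempt
        except SoftBanError:
            continue  # Ghost page: the next fetch() stands in for a rotated proxy
    return None, max_retries


# Stand-ins for the network and the parser: two ghost pages, then a real one.
pages = iter(["", "", "Example TV|1299.99"])


def fake_fetch():
    return next(pages)


def fake_validate(page):
    if "|" not in page:
        raise SoftBanError("no product data")
    name, price = page.split("|")
    return {"name": name, "price": float(price)}


result, attempts = retry_until_valid(fake_fetch, fake_validate)
print(result, attempts)  # {'name': 'Example TV', 'price': 1299.99} 3
```

Separating the loop from the HTTP client this way makes it easy to unit test the give-up path as well as the success path.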
Monitoring Data Quality Trends
Once you implement schema validation, you'll likely notice that your success rate in terms of HTTP codes stays high, but your data quality rate might fluctuate.
You can use the ScrapeOps Dashboard to monitor these trends. By logging custom metrics, you can see if a specific Costco layout change has broken your selectors globally (100% validation failure) or if your proxy pool is simply being throttled (intermittent failures).
Pro Tip: If you see a spike in ValidationError for the price field but not the name field, Costco may have moved the price into a protected JavaScript element. This signals it is time to switch from BeautifulSoup to the Playwright implementation in the repository.
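Whatever monitoring tool you use, the numbers it needs are simple to collect. As a minimal sketch, the hypothetical QualityStats class below counts validation outcomes overall and per field, which is enough to tell a global selector break from intermittent throttling, or a price-only failure from a full ghost page. How the counts are shipped to a dashboard is left out, since that depends on the monitoring SDK.

```python
from collections import Counter


class QualityStats:
    """Counts validation outcomes so failure patterns become visible."""

    def __init__(self):
        self.attempts = 0
        self.failures = 0
        self.failed_fields = Counter()

    def record(self, failed_fields=()):
        """Record one attempt; pass the names of fields that failed validation."""
        self.attempts += 1
        if failed_fields:
            self.failures += 1
            self.failed_fields.update(failed_fields)

    def failure_rate(self):
        return self.failures / self.attempts if self.attempts else 0.0


stats = QualityStats()
stats.record()                   # clean page
stats.record(["price"])          # price missing, name present
stats.record(["price"])
stats.record(["name", "price"])  # full ghost page

print(stats.failure_rate())               # 0.75
print(stats.failed_fields.most_common())  # [('price', 3), ('name', 1)]
```

The field names can be taken from the "loc" entries of ValidationError.errors() at the point where SoftBanError is raised.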
To Wrap Up
Hardening your scraper against soft bans is the difference between a production-grade data pipeline and a fragile script. By moving to Pydantic, you achieve several wins:
Accuracy: You no longer save 0.0 prices or empty product names.
Resilience: The scraper automatically retries and rotates proxies when it detects ghost pages.
Clarity: Each ValidationError names the exact field that failed, which points you to the part of the page that changed and makes maintenance significantly faster.
To further improve your Costco scraping stack, consider exploring the Playwright or Selenium versions in the Costco.com-Scrapers repository to handle pages where data is rendered dynamically after the initial load.



