Hardening Your Costco Scraper: Detecting Soft Bans and Enforcing Data Quality with Pydantic

Technical Writing for Modern Development

When scraping high-value e-commerce targets like Costco, an HTTP 200 OK status code is often a lie. While many developers rely on status codes to trigger retries, Costco frequently employs soft bans. Instead of a blunt 403 Forbidden, the server serves a valid HTML page containing a "Please verify you are human" challenge or a skeleton page devoid of product data.

If your scraper doesn't differentiate between a real product page and these "ghost" responses, your database will quickly fill with null values and zeroed-out prices. To prevent this, you need to move beyond simple status checks and implement Schema Validation.

This guide explains how to upgrade the Costco.com-Scrapers repository from basic Python dataclasses to Pydantic models. This allows your scraper to fail fast, rotate proxies, and retry until it secures high-quality data.

Prerequisites

To follow along, you should have:

  • Python 3.8+ installed.

  • Familiarity with BeautifulSoup or Playwright.

  • A ScrapeOps API Key for proxy rotation.

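  • Pydantic v2 installed: pip install "pydantic>=2". The examples use the v2 API (field_validator and the pattern argument), which does not exist in Pydantic v1.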

Identifying Soft Bans and Data Drift

A soft ban occurs when Costco's anti-bot system—such as Akamai or PerimeterX—suspects automated activity but hasn't completely severed the connection. There are three common scenarios:

  1. The "Human Check" Page: The response is a 200 OK, but the body contains a CAPTCHA or a "Verify you are human" script.

  2. The Empty Skeleton: You receive a page with headers and footers, but the main product JSON-LD or DOM elements containing the price and SKU are missing.

  3. Geo-Location Shifts: Sometimes a proxy in a restricted region causes the site to hide pricing or show "Out of Stock" nationally, even if the item is available.
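
The first two scenarios can often be caught with a cheap check before any parsing. The sketch below is illustrative only: the marker strings and the looks_like_soft_ban helper are assumptions, not text confirmed from Costco's real challenge pages, and it assumes a genuine product page always carries a JSON-LD block. Tune both against responses you capture yourself.

```python
import re

# Hypothetical phrases; replace them with text observed in real challenge pages.
CHALLENGE_MARKERS = ("verify you are human", "captcha", "access denied")

def looks_like_soft_ban(html: str) -> bool:
    """Return True if a 200 OK body looks like a challenge or an empty skeleton."""
    lowered = html.lower()
    # Scenario 1: the body contains a human-check phrase.
    if any(marker in lowered for marker in CHALLENGE_MARKERS):
        return True
    # Scenario 2 (assumption): a page with no JSON-LD block is treated as a skeleton.
    has_json_ld = re.search(r"<script[^>]+application/ld\+json", lowered) is not None
    return not has_json_ld

print(looks_like_soft_ban("<html><body>Please verify you are human</body></html>"))
```

A string check like this is only a first filter. It cannot catch the geo-location case, where the page is structurally complete but the values are wrong, which is why the schema validation described next is still needed.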

In the current repository scripts, such as python/BeautifulSoup/product_data/scraper/costco_scraper_product_data_v1.py, data is stored in a standard dataclass. If the scraper fails to find a price, it might default to 0.0. Without validation, your pipeline saves this empty data as if it were a success.

Replacing Dataclasses with Pydantic

The existing repository uses a ScrapedData dataclass that looks like this:

@dataclass
class ScrapedData:
    name: str = ""
    price: float = 0.0
    productId: str = ""
    # ... other fields

The problem is that ScrapedData(name="", price=0.0) is considered a valid object. We want a product to count as valid only if it has a name, a price greater than zero, and a product ID (SKU).
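
To see the problem concretely, here is a minimal sketch that rebuilds only the three fields shown above (the remaining fields of the repository's dataclass are omitted) and constructs an object from a page that yielded nothing:

```python
from dataclasses import dataclass

@dataclass
class ScrapedData:
    name: str = ""
    price: float = 0.0
    productId: str = ""

# A "ghost" page yields no fields, yet the object is built without complaint.
ghost = ScrapedData()
print(ghost)  # ScrapedData(name='', price=0.0, productId='')
```

Nothing in the dataclass distinguishes this object from a real product, so it flows into storage like any other record.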

Create a Pydantic model in a new file, models.py, to define what a valid Costco product looks like:

from pydantic import BaseModel, Field, field_validator

class CostcoProduct(BaseModel):
    name: str = Field(..., min_length=2)  # Must exist and be at least 2 chars
    price: float = Field(..., gt=0)       # Price must be greater than 0
    product_id: str = Field(..., pattern=r"^\d+$") # Must be numeric string
    availability: str
    url: str

    @field_validator('price', mode='before')
    @classmethod
    def parse_comma_price(cls, v):
        if isinstance(v, str):
            return float(v.replace(',', '').replace('$', ''))
        return v

Key Benefits

  • Required Fields: The ... (Ellipsis) indicates a field is mandatory. If Costco sends a page without a price, Pydantic raises a ValidationError.

  • Type Coercion: The parse_comma_price validator runs before type checking and strips the "$" and "," from strings like "$1,299.99", so Pydantic can store the value as a float.

  • Data Integrity: The pattern check rejects any product_id that is not made up entirely of digits, which is the format this guide assumes for Costco product IDs.
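
As a quick check of these behaviors, the sketch below repeats the model and feeds it one plausible payload and one "ghost" payload. The failed_fields helper, the sample values, and the example URL are made up for illustration and are not part of the repository.

```python
from pydantic import BaseModel, Field, ValidationError, field_validator

class CostcoProduct(BaseModel):
    name: str = Field(..., min_length=2)
    price: float = Field(..., gt=0)
    product_id: str = Field(..., pattern=r"^\d+$")
    availability: str
    url: str

    @field_validator("price", mode="before")
    @classmethod
    def parse_comma_price(cls, v):
        if isinstance(v, str):
            return float(v.replace(",", "").replace("$", ""))
        return v

def failed_fields(raw: dict) -> list:
    """Return the names of the fields that fail validation (empty if valid)."""
    try:
        CostcoProduct(**raw)
        return []
    except ValidationError as exc:
        # Each error dict carries a "loc" tuple whose first item is the field name.
        return sorted({str(err["loc"][0]) for err in exc.errors()})

# Hypothetical payloads for illustration only.
good = {"name": "4K TV", "price": "$1,299.99", "product_id": "100123456",
        "availability": "InStock", "url": "https://www.costco.com/example.html"}
ghost = {"name": "", "price": 0.0, "product_id": "", "availability": "", "url": ""}

print(failed_fields(good))   # []
print(failed_fields(ghost))  # ['name', 'price', 'product_id']
```

Note that availability and url pass even when empty, because the model only requires them to be strings. Add min_length constraints there if empty values should also count as failures.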

Implementing the Validation Logic

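Now we need to integrate this into the extraction flow. We will modify the extract_data function found in the repository to wrap the raw dictionary in our new model. The dictionary keys must match the model's field names, so a key such as productId from the original dataclass has to be renamed to product_id before validation.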

First, define a custom exception in models.py, next to CostcoProduct, to distinguish between a bad page (soft ban) and a bad script (bug):

class SoftBanError(Exception):
    """Raised when the page returns 200 OK but contains no valid data."""
    pass

Next, update the extraction wrapper:

from pydantic import ValidationError
from models import CostcoProduct, SoftBanError

def get_validated_data(soup, url):
    # extract_raw_dict_from_soup stands in for the repository's existing
    # extraction logic, which returns a raw dict of scraped fields
    raw_data = extract_raw_dict_from_soup(soup, url)

    try:
        # Attempt to create the Pydantic model
        return CostcoProduct(**raw_data)
    except ValidationError as e:
        # If price or name is missing, it's likely a soft ban.
        # "from e" keeps the original validation traceback for debugging.
        raise SoftBanError(f"Validation failed: {e.json()}") from e

The "Retry-on-Validation-Failure" Loop

This is the core architectural change. We must treat a SoftBanError exactly like a 403 Forbidden. If validation fails, we rotate the proxy and try again.

Here is how to refactor the main execution loop:

import requests
from bs4 import BeautifulSoup
from models import SoftBanError
# get_validated_data is the wrapper defined in the previous step

def scrape_with_validation(url, api_key):
    max_retries = 5
    for attempt in range(max_retries):
        # Using ScrapeOps Proxy Port
        proxy_url = f"http://scrapeops:{api_key}@residential-proxy.scrapeops.io:8181"
        proxies = {"http": proxy_url, "https": proxy_url}

        try:
            response = requests.get(url, proxies=proxies, timeout=30)

            if response.status_code == 200:
                soup = BeautifulSoup(response.text, "lxml")
                # VALIDATION STEP: raises SoftBanError on a ghost page
                product_data = get_validated_data(soup, url)
                return product_data  # Success!

            # Hard block (e.g. 403): log it and move on to the next attempt
            print(f"Attempt {attempt + 1} failed: HTTP {response.status_code}. Rotating proxy...")

        except (SoftBanError, requests.RequestException) as e:
            print(f"Attempt {attempt + 1} failed: {e}. Rotating proxy...")

    return None  # Permanent failure after retries

By adding SoftBanError to the except block, the scraper no longer accepts empty results. It retries, up to max_retries times, until a request through the ScrapeOps residential proxy returns a page that passes validation, and it returns None if no attempt does.
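
The loop above retries immediately after each failure. A common refinement is to wait between attempts using exponential backoff with jitter, so repeated requests do not arrive in a tight burst. The sketch below uses the "full jitter" variant; the backoff_delay name and the base and cap values are assumptions chosen for illustration, not settings from the repository.

```python
import random

def backoff_delay(attempt: int, base: float = 1.0, cap: float = 60.0) -> float:
    """Full-jitter backoff: a random delay in [0, min(cap, base * 2**attempt)]."""
    ceiling = min(cap, base * (2 ** attempt))
    return random.uniform(0, ceiling)

# Show how the upper bound grows across five attempts (0-indexed).
for attempt in range(5):
    ceiling = min(60.0, 1.0 * 2 ** attempt)
    print(f"attempt {attempt}: up to {ceiling:.0f}s, drew {backoff_delay(attempt):.2f}s")
```

To use it, call time.sleep(backoff_delay(attempt)) after logging a failed attempt and before the next iteration of the retry loop.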

Once you implement schema validation, you'll likely notice that your HTTP success rate stays high while the share of responses that actually pass validation fluctuates.

You can use the ScrapeOps Dashboard to monitor these trends. By logging custom metrics, you can see if a specific Costco layout change has broken your selectors globally (100% validation failure) or if your proxy pool is simply being throttled (intermittent failures).
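
The distinction between a global selector break and intermittent throttling can be computed from per-URL outcomes before they are logged anywhere. This is a minimal sketch with a hypothetical diagnose helper and made-up outcome labels; it does not call any ScrapeOps API, and how you ship the result to a dashboard is left open.

```python
from collections import Counter

def diagnose(outcomes: list) -> str:
    """Classify a batch of per-URL outcomes, each 'ok' or 'validation_failed'."""
    if not outcomes:
        return "no data"
    counts = Counter(outcomes)
    failure_rate = counts["validation_failed"] / len(outcomes)
    if failure_rate == 1.0:
        # Every page failed: selectors are probably broken by a layout change.
        return "likely layout change: every page failed validation"
    if failure_rate > 0:
        # Only some pages failed: more consistent with throttling or soft bans.
        return f"likely throttling: {failure_rate:.0%} of pages failed validation"
    return "healthy"

print(diagnose(["ok", "validation_failed", "ok", "ok"]))
# likely throttling: 25% of pages failed validation
```

The 100% threshold mirrors the rule of thumb in the text. On small batches a run of bad luck can also produce 100% failures, so treat the result as a hint rather than a verdict.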

Pro Tip: If you see a spike in ValidationError for the price field but not the name field, Costco may have moved the price into an element that is rendered by JavaScript. This signals it is time to switch from BeautifulSoup to the Playwright implementation in the repository.
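
One way to make that switch explicit is a small rule over the fields that failed on recent attempts. The sketch below is an assumption about how such a trigger could look: the should_switch_to_playwright name and the threshold of two consecutive attempts are illustrative choices, not behavior taken from the repository.

```python
def should_switch_to_playwright(failures: list, threshold: int = 2) -> bool:
    """Decide whether to fall back to a browser-based scraper.

    failures holds one set of failed field names per attempt, oldest first.
    Returns True when each of the last `threshold` attempts lost the price
    while still finding the name.
    """
    recent = failures[-threshold:]
    if len(recent) < threshold:
        return False  # not enough evidence yet
    return all("price" in failed and "name" not in failed for failed in recent)

# Price missing twice in a row while the name still parses: switch.
print(should_switch_to_playwright([{"price"}, {"price"}]))  # True
# Name and price both missing looks like a soft ban instead: keep retrying.
print(should_switch_to_playwright([{"name", "price"}, {"name", "price"}]))  # False
```

The sets of failed field names can be built from the first element of each error's loc tuple in ValidationError.errors().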

To Wrap Up

Hardening your scraper against soft bans is the difference between a production-grade data pipeline and a fragile script. By moving to Pydantic, you achieve several wins:

  • Accuracy: You no longer save 0.0 prices or empty product names.

  • Resilience: The scraper automatically retries and rotates proxies when it detects ghost pages.

  • Clarity: ValidationErrors tell you exactly which part of the page changed, making maintenance significantly faster.

To further improve your Costco scraping stack, consider exploring the Playwright or Selenium versions in the Costco.com-Scrapers repository to handle pages where data is rendered dynamically after the initial load.


First off, you'd want batch-level validators that can spot those honeypot red flags - like when everything's priced the same or inventory's flat-lining across a whole product line. Also you gotta nail down the difference between schema drift and soft bans because they need totally different fixes, one's just weird data that needs a human look, the other means rotate your proxy and try again. And throw in exponential backoff with some jitter on retries - hammering the same IP over and over is basically begging to stay blocked. Also you mention Playwright as a fallback, which is smart, but it's kinda vague. Maybe flesh out exactly when you'd flip the switch, like okay, price field disappeared after two attempts, time to spin up Playwright. So you're treating data validation as your actual anti-bot defense layer, and most people straight-up sleep on that