esedark

scraping / reliability / parsers / observability / production

How to keep a scraper from breaking every week

Most scraper instability is not caused by scraping itself. It comes from weak boundaries between navigation, extraction, parsing, storage and monitoring.

If your scraper breaks every week, the problem is usually not one bad selector. The problem is that the whole pipeline depends on brittle assumptions and gives you very little evidence when those assumptions fail.

A robust scraper is a controlled production system. It needs parser versioning, stable retry rules, public-data boundaries, change detection and enough traceability to explain why one record was collected or rejected. That is the difference between occasional data extraction and a pipeline a business can actually use.

Why scrapers break so often

Sites change markup, class names, pagination rules, lazy-loading behavior and anti-abuse flows. That part is normal. What turns a normal change into an expensive incident is poor system design.

  • selectors are tightly coupled to CSS noise instead of stable structure
  • navigation, parsing and database writes happen in one script
  • failures do not save screenshots or HTML evidence
  • operators cannot replay one broken job safely
  • all errors share the same retry logic

This is why teams that start from a quick browser script often end up rebuilding around the ideas found in a clean Puppeteer production architecture or other controlled automation stacks.

Build the scraper as layers, not as one script

The easiest way to make a scraper robust is to separate the responsibilities clearly.

scheduler
  -> decides what URLs to crawl

worker
  -> loads page and captures evidence

extractor
  -> returns raw fields

parser
  -> normalizes values and validates schema

storage
  -> writes accepted records and audit logs

Once those layers exist, a parser fix does not require touching browser code, and a DOM change does not risk corrupting storage. Stability improves because each failure has a smaller blast radius.
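As a minimal sketch of those boundaries, the layers can be plain functions that only pass typed values to each other. Everything here is an assumption for illustration: the names PageCapture, Record, extract, parse and run_job are hypothetical, and the toy "key = value" page format stands in for real DOM queries.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class PageCapture:
    """What the worker hands over: the URL and the raw page body."""
    url: str
    html: str


@dataclass(frozen=True)
class Record:
    """What storage accepts: normalized, validated fields only."""
    url: str
    title: str
    price_cents: int


def extract(capture: PageCapture) -> dict[str, str]:
    """Extractor: return raw strings, no cleanup and no validation."""
    raw: dict[str, str] = {}
    for line in capture.html.splitlines():
        key, sep, value = line.partition("=")
        if sep:
            raw[key.strip()] = value.strip()
    return raw


def parse(url: str, raw: dict[str, str]) -> Record:
    """Parser: normalize values and raise ValueError on schema problems."""
    title = raw.get("title", "")
    if not title:
        raise ValueError("missing title")
    price_text = raw.get("price", "").replace("$", "").replace(",", "")
    price_cents = round(float(price_text) * 100)  # float() raises ValueError on bad input
    return Record(url=url, title=title, price_cents=price_cents)


def run_job(url: str,
            fetch: Callable[[str], str],
            store: Callable[[Record], None]) -> Record:
    """Wire the layers together; each one can be swapped or tested alone."""
    capture = PageCapture(url=url, html=fetch(url))
    record = parse(url, extract(capture))
    store(record)
    return record
```

Because fetch and store are injected, a parser fix can be tested against saved HTML without launching a browser or touching the database.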

Use selector strategy, not random selectors

A scraper that depends on long chains of generated class names is asking to fail. Prefer selectors based on structure, labels, repeated layout blocks or nearby static text. When that is not possible, keep selector profiles versioned so you can roll forward cleanly.

That versioning matters for traceability. If a customer asks why a field started failing on Monday, you want to answer with evidence instead of guesswork.
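One way to keep that evidence is to store selectors as versioned profiles and stamp every record with the profile that produced it. This is a sketch under assumptions: the source name, the version labels and the selectors themselves are made up, and SelectorProfile, active_profile and tag_record are hypothetical helpers.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class SelectorProfile:
    version: str
    selectors: dict  # field name -> CSS selector


# Oldest first, newest last. Old versions stay in place so a pinned job
# can still run against them.
PROFILES = {
    "marketplace-a": [
        SelectorProfile("v5", {"title": "h1.product-title", "price": "span.price"}),
        SelectorProfile("v6", {"title": "[data-testid='title']", "price": "[itemprop='price']"}),
    ],
}


def active_profile(source: str, pinned_version: Optional[str] = None) -> SelectorProfile:
    """Return the newest profile, or a pinned one for replays and rollbacks."""
    profiles = PROFILES[source]
    if pinned_version is None:
        return profiles[-1]
    for profile in profiles:
        if profile.version == pinned_version:
            return profile
    raise KeyError(f"{source} has no selector profile {pinned_version}")


def tag_record(record: dict, profile: SelectorProfile) -> dict:
    """Stamp the record so later questions can be answered from data."""
    return {**record, "parser_version": profile.version}
```

With the version stored on each record, "what changed on Monday" becomes a query over parser_version instead of a search through commit history.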

Common mistakes

The first mistake is treating screenshots as optional. If a page fails to parse, you want the screenshot, URL, parser version and a fragment of the raw HTML.
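A minimal sketch of that evidence bundle, using only the standard library, could look like this. The function name save_evidence, the file names and the directory layout are assumptions; the screenshot bytes would come from whatever browser tool the worker uses.

```python
import json
from datetime import datetime, timezone
from pathlib import Path
from typing import Optional


def save_evidence(root: Path, job_id: str, url: str, parser_version: str,
                  html: str, screenshot: Optional[bytes] = None,
                  html_limit: int = 20_000) -> Path:
    """Write one folder per failed job: HTML fragment, screenshot, metadata."""
    job_dir = root / job_id
    job_dir.mkdir(parents=True, exist_ok=True)

    # Keep only a fragment of the page so evidence storage stays bounded.
    (job_dir / "page.html").write_text(html[:html_limit], encoding="utf-8")

    if screenshot is not None:
        (job_dir / "screenshot.png").write_bytes(screenshot)

    meta = {
        "job_id": job_id,
        "url": url,
        "parser_version": parser_version,
        "captured_at": datetime.now(timezone.utc).isoformat(),
        "html_truncated": len(html) > html_limit,
    }
    (job_dir / "meta.json").write_text(json.dumps(meta, indent=2), encoding="utf-8")
    return job_dir
```

The returned path is what a failure log entry would point to, so an operator can open the exact page state that broke the parser.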

The second mistake is retrying schema errors the same way you retry timeouts. A timeout may recover on the next attempt. A broken selector usually will not.
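The split can be made explicit by mapping each error to a retry class before deciding what to do next. The sketch below is one possible shape, not a fixed recipe: SelectorMiss, SchemaError, the two class names and the backoff numbers are all assumptions.

```python
import enum
from typing import Optional, Tuple


class RetryClass(enum.Enum):
    RETRY_BACKOFF = "retry_backoff"  # transient, worth another attempt
    MANUAL_FIX = "manual_fix"        # needs a human or a parser change


class SelectorMiss(Exception):
    """Hypothetical error: an expected element was not found on the page."""


class SchemaError(Exception):
    """Hypothetical error: extracted values failed validation."""


def classify(error: Exception) -> RetryClass:
    """Only network-style failures are retried automatically."""
    if isinstance(error, (TimeoutError, ConnectionError)):
        return RetryClass.RETRY_BACKOFF
    # Selector misses, schema errors and anything unknown go to a human
    # instead of looping against the source site.
    return RetryClass.MANUAL_FIX


def backoff_seconds(attempt: int, base: float = 2.0, cap: float = 300.0) -> float:
    """Exponential backoff: 2s, 4s, 8s, ... capped at five minutes."""
    return min(cap, base * 2 ** attempt)


def next_action(error: Exception, attempt: int,
                max_attempts: int = 4) -> Tuple[str, Optional[float]]:
    """Return (retry_class, delay_seconds); the delay is None when not retrying."""
    if classify(error) is RetryClass.RETRY_BACKOFF and attempt + 1 < max_attempts:
        return RetryClass.RETRY_BACKOFF.value, backoff_seconds(attempt)
    # Either not retryable, or transient retries are exhausted.
    return RetryClass.MANUAL_FIX.value, None
```

A timeout on the first attempt comes back as a retry with a two-second delay, while a selector miss goes straight to manual_fix without spending more requests on a page that will not parse.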

The third mistake is scraping public data without operational limits. Public data still needs sensible pacing, source review, documented storage and a clear business purpose.

The fourth mistake is letting scrapers write directly to the final table without validation or duplicate checks. One markup change can pollute the whole dataset.
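A small gate in front of the final table is often enough to contain that damage. The following sketch assumes a record with url, title and price_cents fields and keeps state in memory; StagingGate, validate and fingerprint are hypothetical names, and a real pipeline would back them with a staging table.

```python
import hashlib

REQUIRED_FIELDS = ("url", "title", "price_cents")


def validate(record: dict) -> list:
    """Return a list of schema problems; an empty list means the record is acceptable."""
    problems = [f"missing {name}" for name in REQUIRED_FIELDS
                if record.get(name) in (None, "")]
    price = record.get("price_cents")
    if price is not None and (not isinstance(price, int) or price < 0):
        problems.append("price_cents must be a non-negative integer")
    return problems


def fingerprint(record: dict) -> str:
    """Stable hash of the fields that define 'the same record'."""
    key = f"{record['url']}|{record['title']}|{record['price_cents']}"
    return hashlib.sha256(key.encode("utf-8")).hexdigest()


class StagingGate:
    """Records must pass validation and duplicate checks before becoming final."""

    def __init__(self) -> None:
        self.accepted = []
        self.rejected = []  # (record, problems) pairs kept for audit
        self._seen = set()

    def submit(self, record: dict) -> str:
        problems = validate(record)
        if problems:
            self.rejected.append((record, problems))
            return "rejected"
        digest = fingerprint(record)
        if digest in self._seen:
            return "duplicate"
        self._seen.add(digest)
        self.accepted.append(record)
        return "accepted"
```

If a markup change makes every price come back empty, the records pile up in rejected with a reason attached instead of overwriting good rows.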

The fifth mistake is assuming more proxies will fix weak engineering. Network stability matters, but the data pipeline still needs clean architecture and predictable behavior.

Practical checklist for a robust scraper

  • jobs are queued with crawl profile and retry class
  • browser workers save screenshots and HTML evidence on failure
  • selectors and parser rules are versioned
  • schema validation runs before database writes
  • duplicate detection exists before records become final
  • public-data usage is documented and rate-limited where relevant
  • operators can replay one failed job in isolation
  • success rate, parse rate and latency are visible in logs
  • manual review exists for ambiguous or high-value records
  • weekly maintenance focuses on diffs, not guesswork

Detect breakage before the business does

You do not want your first alert to come from sales or operations. Good scraper maintenance watches for leading indicators: selector miss rate, empty-field rate, pagination depth changes, unexpected duplicate spikes and storage rejects.

{
  "job_id": "listing-2841",
  "source": "marketplace-a",
  "parser_version": "v6",
  "result": "selector_miss",
  "evidence_path": "s3://scrapers/evidence/listing-2841.png",
  "retry_class": "manual_fix"
}

That kind of evidence turns maintenance into engineering instead of panic.
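Records shaped like the one above can be rolled up into rates and compared against limits. This sketch assumes each event carries a "result" field as in the sample; the function names and the 5% threshold are illustrative, not recommended values.

```python
from collections import Counter


def result_rates(events: list) -> dict:
    """Share of jobs per result value, e.g. {'ok': 0.8, 'selector_miss': 0.2}."""
    counts = Counter(event["result"] for event in events)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {result: count / total for result, count in counts.items()}


def breached(rates: dict, thresholds: dict) -> list:
    """Names of results whose rate is above the configured limit."""
    return sorted(result for result, limit in thresholds.items()
                  if rates.get(result, 0.0) > limit)


# Example: 2 selector misses out of 10 jobs against a 5% limit.
events = [{"result": "ok"}] * 8 + [{"result": "selector_miss"}] * 2
alerts = breached(result_rates(events), {"selector_miss": 0.05})
# alerts == ["selector_miss"]
```

Running this per source and per parser version narrows an alert down to the profile that regressed.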

When hiring a technical person makes sense

If your team already has scripts, some records reach the database and the business depends on the output, but nobody trusts the pipeline after layout changes, the real issue is no longer scraping syntax. The issue is production ownership.

That is where focused help, whether through technical services or direct architecture support from a fractional CTO, can save time and money. The value is in designing parser boundaries, evidence collection, safe public-data workflows and a stable maintenance process.

Final takeaway

If you want to stop a scraper from breaking every week, stop thinking about one script and start thinking about one system. The browser is only one layer. Stability comes from traceability, isolation, validation and operational discipline.

If you need help auditing or redesigning an unstable scraping pipeline, get in touch and bring the current sources, failure patterns, storage flow and the limits you need to respect. That is where the useful work starts.