Build Agents That Collect Data at Scale
Key Takeaways
What is the biggest mistake when scaling web scrapers? Starting with a single-threaded script that works for 10 pages and expecting it to handle 10,000. The jump from prototype to production requires designing for failure, tuning concurrency, and building data validation into the pipeline from day one.
What concurrency settings should I use for web scraping? It depends on the target site. Small sites (under 10K pages) handle 5-10 concurrent requests. Medium sites (10K-100K) need 10-25 with proxy rotation. Large sites (100K+) require 25-50 concurrent requests with residential proxies and browser fingerprint rotation.
How do I ensure data quality in scraped datasets? Every record should pass through four validation stages: schema validation (required fields present), deduplication (no duplicate entries), freshness checks (data is current), and normalization (consistent formats for prices, dates, currencies).
The Single-Script Trap
Every web scraping project with tools like Crawlee (Apify's open-source library) or Playwright (Microsoft) starts the same way: a quick script that fetches a page, parses some HTML with CSS selectors, and dumps the results to a JSON file. It works perfectly for 10 pages. It breaks catastrophically at 1,000.
The jump from prototype scraper to production data pipeline is where most projects fail. The challenges aren't algorithmic — they're operational: network timeouts, IP bans, markup changes, and rate limiting.
Design for Failure
At scale, everything fails. Target pages return 403 Forbidden errors. Proxy IP addresses get banned by anti-bot systems like Cloudflare. Websites change their HTML markup without warning. Your production scraper built with Crawlee and Playwright needs to handle all of these failure modes gracefully with automatic retries and structured error logging.
```javascript
import { PlaywrightCrawler, Dataset, log } from 'crawlee';

const crawler = new PlaywrightCrawler({
  maxRequestRetries: 3,          // retry transient failures automatically
  requestHandlerTimeoutSecs: 60, // kill handlers that hang
  maxConcurrency: 10,
  async requestHandler({ page, request }) {
    try {
      // Wait for the content to render before extracting
      await page.waitForSelector('.data-container', {
        timeout: 15000,
      });
      const data = await extractData(page); // your site-specific extraction logic
      await Dataset.pushData(data);
    } catch (error) {
      log.warning(`Failed to process ${request.url}: ${error.message}`);
      throw error; // Triggers automatic retry
    }
  },
});
```
The Concurrency Sweet Spot
More concurrent requests don't always mean faster web scraping. Too many simultaneous connections from the same IP address range trigger rate limiting and IP bans; too few waste time and compute resources.
The optimal concurrency depends on the target site's infrastructure, but here are proven starting points based on site size:
| Site Size | Page Count | Concurrency | Proxy Type | Estimated Speed |
|---|---|---|---|---|
| Small | Under 10K pages | 5-10 requests | Datacenter proxies | ~500 pages/hour |
| Medium | 10K-100K pages | 10-25 requests | Rotating residential | ~2,000 pages/hour |
| Large | 100K+ pages | 25-50 requests | Residential + fingerprint rotation | ~5,000 pages/hour |
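The tiers in the table above can be encoded as a small helper that picks crawler settings from the estimated page count. This is an illustrative sketch: the function name `concurrencyFor` and the proxy labels are assumptions, and the thresholds simply mirror the table — tune them against the target site's actual tolerance.

```javascript
// Hypothetical helper: map estimated page count to the tiers in the table.
// Thresholds and proxy labels are illustrative, not a universal recipe.
function concurrencyFor(pageCount) {
  if (pageCount < 10_000) {
    return { minConcurrency: 5, maxConcurrency: 10, proxy: 'datacenter' };
  }
  if (pageCount < 100_000) {
    return { minConcurrency: 10, maxConcurrency: 25, proxy: 'residential-rotating' };
  }
  return { minConcurrency: 25, maxConcurrency: 50, proxy: 'residential-fingerprint' };
}
```

The returned `minConcurrency`/`maxConcurrency` values can be passed straight into a Crawlee crawler's options, letting its autoscaling stay within the chosen band.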
Data Quality Over Quantity
Raw scraped data is inherently messy. Duplicate entries, missing required fields, character encoding issues, and stale content are the norm in any large-scale data collection operation. Build validation into your Apify Actor pipeline from day one, not as an afterthought.
Every scraped record should pass through these four validation stages:
- Schema validation — Does the record contain all required fields with correct data types?
- Deduplication — Has this exact record (or a near-duplicate) been collected before?
- Freshness check — Is this data still current, or has the source page been updated?
- Normalization — Are prices converted to a consistent currency? Are dates in ISO 8601 format?
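The four stages above can be sketched as a single validation function. This is a minimal illustration, assuming a record shape with `url`, `title`, `price`, and `scrapedAt` fields — your schema will differ, and a production deduplicator would key on a content hash rather than an in-memory `Set`.

```javascript
// Minimal sketch of the four validation stages. Field names are assumptions.
const seen = new Set(); // swap for a persistent store / content hash in production

function validateRecord(record, maxAgeMs = 24 * 60 * 60 * 1000) {
  // 1. Schema validation: required fields present with correct types
  if (typeof record.url !== 'string' || typeof record.title !== 'string') {
    return { ok: false, reason: 'schema' };
  }
  // 4. Normalization: "$1,299.00" -> 1299, timestamp -> ISO 8601
  const price = Number(String(record.price).replace(/[^0-9.]/g, ''));
  const scrapedAt = new Date(record.scrapedAt);
  if (Number.isNaN(price) || Number.isNaN(scrapedAt.getTime())) {
    return { ok: false, reason: 'schema' };
  }
  // 3. Freshness check: reject records older than maxAgeMs
  if (Date.now() - scrapedAt.getTime() > maxAgeMs) {
    return { ok: false, reason: 'stale' };
  }
  // 2. Deduplication: keyed on URL here for simplicity
  if (seen.has(record.url)) {
    return { ok: false, reason: 'duplicate' };
  }
  seen.add(record.url);
  return { ok: true, record: { ...record, price, scrapedAt: scrapedAt.toISOString() } };
}
```

Running every record through a gate like this before it reaches the dataset means downstream consumers never have to re-clean the same data.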
Monitoring and Alerting
A production web scraper without monitoring is a ticking time bomb. Track these four key metrics using your platform's built-in tools or services like Grafana and PagerDuty:
- Success rate — What percentage of HTTP requests return valid data? Alert below 90%.
- Data volume — Are you collecting the expected number of records per run? Alert on >20% deviation.
- Latency — How long does each page take to load and process? Alert above 30 seconds average.
- Error distribution — What types of failures (timeout, 403, parsing error) are occurring? Track trends.
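A run-level tracker for these four metrics can be sketched in a few dozen lines. The class name `ScrapeMetrics` and its method names are assumptions for illustration; in a real pipeline `checkAlerts()` would feed a notifier such as PagerDuty rather than return an array.

```javascript
// Illustrative metrics tracker with the alert thresholds from the list above.
class ScrapeMetrics {
  constructor() {
    this.total = 0;
    this.succeeded = 0;
    this.latencies = [];
    this.errors = {}; // error distribution, e.g. { timeout: 3, http_403: 1 }
  }

  record({ ok, latencyMs, errorType }) {
    this.total += 1;
    if (ok) this.succeeded += 1;
    else this.errors[errorType] = (this.errors[errorType] || 0) + 1;
    this.latencies.push(latencyMs);
  }

  checkAlerts(expectedRecords) {
    const successRate = this.total ? this.succeeded / this.total : 1;
    const avgLatency =
      this.latencies.reduce((a, b) => a + b, 0) / (this.latencies.length || 1);
    const deviation = expectedRecords
      ? Math.abs(this.succeeded - expectedRecords) / expectedRecords
      : 0;

    const alerts = [];
    if (successRate < 0.9) alerts.push('success_rate_below_90');   // <90% valid
    if (deviation > 0.2) alerts.push('volume_deviation_over_20pct'); // >20% off
    if (avgLatency > 30_000) alerts.push('avg_latency_over_30s');  // >30s average
    return alerts;
  }
}
```

Calling `checkAlerts()` at the end of each run (or on a timer mid-run) surfaces drift immediately instead of leaving it to downstream consumers to discover.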
When any metric drifts outside normal bounds, you want to know immediately — not three days later when your downstream data pipeline produces garbage output.
The Platform Advantage
Building all of this infrastructure from scratch — browser pools, proxy management, job scheduling, dataset storage, monitoring dashboards — is possible but expensive in engineering time. Platforms like Apify provide the complete infrastructure stack as a managed service, so you can focus on the extraction logic that's unique to your use case.
That's the philosophy behind ParseBird's Apify Actors: production-grade data collection tools, tested against real anti-bot systems, ready to deploy and integrate into your workflow.
Related: Web Scraping in 2026 covers the current technical landscape including headless browsers and AI extraction. The Agentic Stack explains how data collection fits into the broader AI agent architecture.