How to Structure Web Scraped Data for AI Pipelines
Key Takeaways
What output format should web scrapers use for AI pipelines?
JSON Lines (JSONL) with flat, typed fields is the standard for AI-ready scraper output. Each line is a self-contained record with consistent field names, explicit types (string, number, boolean), and null values for missing data rather than omitted keys. This format streams efficiently, validates easily, and loads directly into vector databases and RAG pipelines.
How should scraper output schemas handle missing or inconsistent data?
Use explicit null values instead of omitting keys, normalize formats (ISO 8601 for dates, consistent currency codes for prices), and include a scrapedAt timestamp on every record. Schema validation at the scraper level — before data enters the pipeline — prevents silent corruption downstream.
What is the difference between scraping for RAG and scraping for structured analysis?
RAG pipelines need chunked text with metadata (source URL, title, date) for embedding and retrieval. Structured analysis pipelines need typed fields (price as number, rating as float, categories as arrays) for filtering, aggregation, and direct LLM reasoning. The best scraper schemas support both by including both raw text content and parsed structured fields.
The Schema Is the Product
Most web scraping tutorials end at "extract the data." They show you how to select DOM elements, handle pagination, and write results to a file. What they skip is the part that determines whether your AI pipeline actually works: the output schema.
A scraper that dumps inconsistent JSON — sometimes including a price field, sometimes not, sometimes as a string with a dollar sign, sometimes as a bare number — creates a data quality problem that compounds through every downstream system. Your vector database indexes garbage. Your RAG pipeline retrieves irrelevant chunks. Your LLM reasons over malformed input and produces confident, wrong output.
The output schema isn't an afterthought. For AI pipelines, it's the most important design decision in the entire scraper.
Flat JSON with Typed Fields
The foundation of AI-ready scraper output is flat JSON with explicitly typed fields. Nested objects create parsing complexity. Dynamic keys create schema drift. Untyped fields create silent conversion errors.
```json
{
  "contractorName": "Cf Construction and Remodeling, Inc",
  "slug": "cf-construction-and-remodeling-inc",
  "url": "https://www.buildzoom.com/contractor/cf-construction-and-remodeling-inc",
  "bzScore": 180,
  "priceRange": "$25,000 - $1,000,000",
  "fullAddress": "3532 SW 113th Ct, Miami, FL 33165",
  "city": "Miami",
  "state": "FL",
  "zipCode": "33165",
  "phoneNumber": "(904) 513-9494",
  "totalProjects": 149,
  "rating": 5.0,
  "reviewCount": 6,
  "description": "Premier residential General Contractor...",
  "scrapedAt": "2026-04-06T12:00:00.000Z"
}
```
This is the output format from ParseBird's BuildZoom Scraper. Every field has a consistent type across all records. Numbers are numbers, not strings. The address is decomposed into components (city, state, zip) for filtering. The scrapedAt timestamp establishes data freshness.
Compare this to what a naive scraper might produce:
```json
{
  "name": "Cf Construction",
  "info": "180 score, Miami FL, (904) 513-9494, 149 projects",
  "details": { "rating": "5 stars", "reviews": "6 reviews" }
}
```
The second format is usable by a human reading a spreadsheet. It's nearly useless for an AI pipeline. The score is buried in a free-text string. The rating is a string with a unit. The phone number isn't in its own field. Every downstream consumer has to parse, guess, and hope.
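To see the cost concretely, here is roughly what every consumer of the naive format would have to write. This is a hedged sketch: `parseNaiveRecord` is hypothetical, and its regexes only handle the exact shapes shown in the example above.

```typescript
// Hypothetical cleanup for the naive format above. Every downstream
// consumer would otherwise repeat this guessing game.
interface TypedRecord {
  name: string;
  bzScore: number | null;
  phoneNumber: string | null;
  totalProjects: number | null;
  rating: number | null;
}

function parseNaiveRecord(raw: {
  name: string;
  info: string;
  details: { rating: string; reviews: string };
}): TypedRecord {
  // Each field must be fished out of free text with a fragile regex.
  const score = raw.info.match(/(\d+)\s*score/);
  const phone = raw.info.match(/\(\d{3}\)\s*\d{3}-\d{4}/);
  const projects = raw.info.match(/(\d+)\s*projects/);
  const rating = raw.details.rating.match(/[\d.]+/);
  return {
    name: raw.name,
    bzScore: score ? Number(score[1]) : null,
    phoneNumber: phone ? phone[0] : null,
    totalProjects: projects ? Number(projects[1]) : null,
    rating: rating ? Number(rating[0]) : null,
  };
}
```

All of this code disappears when the scraper emits typed fields in the first place.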
Schema Design Principles for AI Consumption
Five principles separate scraper schemas that work in AI pipelines from schemas that create debugging nightmares:
1. Every field has one type, always. If rating is a number, it's a number in every record. Never 5.0 in one record and "5.0 stars" in another. Type inconsistency is the single most common cause of pipeline failures.
2. Missing data is null, not absent. When a business listing doesn't have an email, the output should include "email": null, not omit the key entirely. This lets downstream systems distinguish between "we checked and there's no email" and "we didn't check."
3. Dates are ISO 8601. Always. "2026-04-06T12:00:00.000Z" — not "April 6, 2026", not "04/06/2026", not "6 days ago". Relative dates are meaningless once the data leaves the scraper.
4. Arrays for multi-value fields. Categories, tags, skills, and similar multi-value fields should be arrays of strings, not comma-separated strings. ["Plumbers", "Water Heaters"] — not "Plumbers, Water Heaters".
5. Include provenance metadata. Every record needs, at minimum, `url` (the source page) and `scrapedAt` (the collection timestamp). For AI pipelines that need to cite sources or assess freshness, this metadata is essential.
| Principle | Bad Example | Good Example |
|---|---|---|
| Consistent types | "rating": "5 stars" | "rating": 5.0 |
| Explicit nulls | {} (key omitted) | "email": null |
| ISO dates | "6 days ago" | "2026-04-06T12:00:00.000Z" |
| Array values | "Plumbers, Roofers" | ["Plumbers", "Roofers"] |
| Provenance | No source URL | "url": "https://...", "scrapedAt": "..." |
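Applied together, the five principles amount to one normalization pass at the end of the scraper. A minimal sketch, assuming raw values arrive as strings; `normalize` and its field names are illustrative, not from any real scraper.

```typescript
// Illustrative normalization pass applying the five principles.
interface NormalizedListing {
  url: string;            // principle 5: provenance
  rating: number | null;  // principle 1: one type, always
  email: string | null;   // principle 2: explicit null, never omitted
  postedAt: string | null; // principle 3: ISO 8601 or null
  categories: string[];   // principle 4: arrays, not delimited strings
  scrapedAt: string;      // principle 5: collection timestamp
}

function normalize(
  raw: Record<string, string | undefined>,
  url: string,
): NormalizedListing {
  const ratingMatch = raw.rating?.match(/[\d.]+/); // "5 stars" -> 5
  const posted = raw.postedDate ? new Date(raw.postedDate) : null;
  return {
    url,
    rating: ratingMatch ? Number(ratingMatch[0]) : null,
    email: raw.email ?? null,
    postedAt:
      posted && !Number.isNaN(posted.getTime()) ? posted.toISOString() : null,
    categories: raw.categories
      ? raw.categories.split(",").map((s) => s.trim())
      : [],
    scrapedAt: new Date().toISOString(),
  };
}
```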
Structuring Data for RAG Pipelines
RAG (Retrieval-Augmented Generation) pipelines have specific requirements that differ from traditional data analysis. The pipeline needs to embed text chunks into a vector database, retrieve relevant chunks at query time, and pass them to an LLM as context.
For RAG, your scraper output needs two things: chunkable text content and filterable metadata.
```json
{
  "title": "Senior React Developer at Acme Corp",
  "content": "We're looking for a senior React developer with 5+ years of experience in building production web applications. You'll work on our core product team, building features used by millions of users. Requirements: React, TypeScript, Node.js, PostgreSQL. Benefits include remote work, equity, and unlimited PTO.",
  "metadata": {
    "source": "ycombinator",
    "url": "https://www.ycombinator.com/companies/acme-corp/jobs/senior-react-developer",
    "company": "Acme Corp",
    "location": "Remote",
    "salary_min": 150000,
    "salary_max": 200000,
    "posted_date": "2026-04-01T00:00:00.000Z",
    "scraped_at": "2026-04-06T12:00:00.000Z"
  }
}
```
The content field gets chunked and embedded. The metadata fields enable filtered retrieval — "find job listings from Y Combinator posted in the last week with salary above $150K." Without structured metadata, your RAG pipeline can only do semantic search over raw text, missing the precision that structured filters provide.
ParseBird's Y Combinator Jobs Scraper and We Work Remotely Scraper produce output in this pattern: structured fields for filtering and full text content for embedding.
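The split between embeddable text and filterable metadata can be sketched as a simple chunker. This is a hedged sketch assuming fixed word windows with overlap; `chunkForRag` and the window sizes are illustrative, and production pipelines often chunk on sentence or token boundaries instead.

```typescript
// Illustrative chunker: split `content` into overlapping word windows,
// carrying the record's metadata onto every chunk so filtered
// retrieval still works after embedding.
interface Chunk {
  text: string;
  metadata: Record<string, unknown>;
}

function chunkForRag(
  record: { content: string; metadata: Record<string, unknown> },
  chunkWords = 200,
  overlapWords = 40,
): Chunk[] {
  const words = record.content.split(/\s+/);
  const chunks: Chunk[] = [];
  const step = chunkWords - overlapWords;
  for (let start = 0; start < words.length; start += step) {
    chunks.push({
      text: words.slice(start, start + chunkWords).join(" "),
      // Every chunk keeps the full metadata plus its position.
      metadata: { ...record.metadata, chunkIndex: chunks.length },
    });
    if (start + chunkWords >= words.length) break;
  }
  return chunks;
}
```

Because each chunk carries the full metadata object, a vector store can apply filters like `salary_min >= 150000` before semantic similarity ever runs.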
Structuring Data for LLM Tool Calls
When an LLM calls a scraping tool via MCP or a function call, the output format matters even more. The model needs to parse the response, extract relevant information, and incorporate it into its reasoning — all within a single inference step.
The optimal format for tool call responses is compact JSON with self-documenting field names:
```jsonc
// What the LLM receives from a tool call
{
  "results": [
    {
      "businessName": "Joe's Plumbing LLC",
      "phone": "(512) 555-0142",
      "email": "joe@joesplumbing.com",
      "rating": 4.5,
      "reviewCount": 47,
      "city": "Austin",
      "state": "TX"
    }
  ],
  "totalResults": 47,
  "query": { "searchQuery": "plumbers", "location": "Austin, TX" }
}
```
Three things make this format LLM-friendly:
- Self-documenting field names. `businessName` is unambiguous. `name` could be anything. `biz_nm` requires documentation.
- Included query echo. The `query` field tells the model what it asked for, preventing confusion when multiple tool calls are in flight.
- Summary statistics. `totalResults` lets the model know whether it has a complete picture or needs to paginate.
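The envelope can be produced by a small builder on the scraper side. A hedged sketch: `toToolResponse` and its parameters are illustrative, not part of any real MCP SDK.

```typescript
// Illustrative builder for a tool-call response envelope:
// truncated results plus the query echo and summary count the
// model needs to reason about completeness.
function toToolResponse<T>(
  allResults: T[],
  query: Record<string, string>,
  maxItems = 20,
) {
  return {
    results: allResults.slice(0, maxItems), // cap the context-window cost
    totalResults: allResults.length,        // signals whether pagination is needed
    query,                                  // echo: disambiguates parallel tool calls
  };
}
```

Capping `results` while still reporting `totalResults` lets the model ask for the next page instead of silently reasoning over a truncated list.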
Validation at the Source
Schema validation should happen inside the scraper, not downstream. By the time malformed data reaches your vector database or LLM context window, the damage is done.
```typescript
import { log } from 'crawlee'; // Crawlee's logger; outside Apify, console.warn works too
import { z } from 'zod';

const BusinessListingSchema = z.object({
  businessName: z.string().min(1),
  phone: z.string().nullable(),
  email: z.string().email().nullable(),
  address: z.string().min(1),
  city: z.string().min(1),
  state: z.string().length(2),               // two-letter state code
  zip: z.string().regex(/^\d{5}(-\d{4})?$/), // ZIP or ZIP+4
  rating: z.number().min(0).max(5).nullable(),
  reviewCount: z.number().int().min(0),
  scrapedAt: z.string().datetime(),          // ISO 8601
});

// Returns the typed record, or null if validation fails.
function validateRecord(raw: unknown) {
  const result = BusinessListingSchema.safeParse(raw);
  if (!result.success) {
    log.warning(`Validation failed: ${result.error.message}`);
    return null;
  }
  return result.data;
}
```
Validation at the scraper level catches problems when they're cheapest to fix: before the data enters any pipeline, before it's embedded into vectors, before an LLM reasons over it. A record that fails validation is a handled error. A record that passes validation but contains garbage is silent corruption.
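A batch wrapper around a per-record validator keeps the failure rate visible as a metric rather than letting bad records vanish one at a time. A minimal sketch; `validateBatch` is illustrative.

```typescript
// Illustrative batch validation: collect the records that pass and
// count the ones that fail, so schema drift shows up as a number.
function validateBatch<T>(
  rawRecords: unknown[],
  validate: (raw: unknown) => T | null,
): { valid: T[]; failed: number } {
  const valid: T[] = [];
  let failed = 0;
  for (const raw of rawRecords) {
    const record = validate(raw);
    if (record === null) failed += 1;
    else valid.push(record);
  }
  return { valid, failed };
}
```

A sudden jump in `failed` usually means the target site changed its markup, which is far cheaper to notice here than in a vector database.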
The Output Schema Checklist
Before connecting a scraper to an AI pipeline, verify these properties:
- Every field has a consistent type across all records
- Missing values are explicit nulls, not omitted keys
- Dates use ISO 8601 format
- Multi-value fields use arrays, not delimited strings
- Every record includes source URL and scrape timestamp
- Numeric fields are numbers, not strings with units
- Text content is clean (no HTML tags, no navigation text, no boilerplate)
- Schema validation runs inside the scraper before output
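Records that pass the checklist serialize naturally to the JSON Lines format recommended in the Key Takeaways: one self-contained object per line. A minimal sketch of the round trip; `toJsonl` and `fromJsonl` are illustrative helpers.

```typescript
// Illustrative JSONL round trip: one JSON object per line, so output
// can be streamed and consumed a record at a time.
function toJsonl(records: object[]): string {
  return records.map((r) => JSON.stringify(r)).join("\n");
}

function fromJsonl(text: string): object[] {
  return text
    .split("\n")
    .filter((line) => line.trim().length > 0) // tolerate trailing newline
    .map((line) => JSON.parse(line));
}
```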
Production Apify Actors like ParseBird's scrapers enforce these properties by design. The output schema is the contract between the scraper and every downstream system that consumes its data.
Related: Why Your AI Agent Keeps Hallucinating and How Fresh Data Fixes It explains why structured, fresh data is the primary defense against hallucination. Build Agents That Collect Data at Scale covers the operational challenges of running scrapers in production.