What is the Crawl API?
The Crawl API is a smart web crawler that automatically:
- Starts from a seed URL (e.g., a homepage)
- Follows internal links to discover more pages
- Extracts content from each page in Markdown format
- Streams results as pages are scraped (no waiting!)
Multi-Page Scraping
Scrape dozens of pages in one request
Real-Time Streaming
Get results as pages are scraped, not at the end
Smart crawling: The API automatically discovers pages by following links, respects depth limits, and avoids duplicates.
Crawl vs Map vs Scrape
Understand when to use each API:

| Feature | Map API | Scrape API | Crawl API |
|---|---|---|---|
| What it does | Discovers URLs | Scrapes 1 page | Scrapes multiple pages |
| Content returned | Titles only | Full content | Full content |
| Speed | Very fast (1-5s) | Fast (2-7s) | Depends on pages |
| Cost | $0.002 flat | $0.001 per page | $0.001 per success |
| Use when | Planning | Single page | Bulk extraction |
| Streaming | No | No | ✅ Yes |
Pricing (Pay Per Success)
Per-Page Success Model
$0.001 per successfully scraped page = $1 for 1,000 pages
You only pay for pages that succeed! Failed pages are free. This is better than competitors who charge per attempt.
Advanced Proxy Pricing
Advanced Proxy (Optional)
Additional $0.004 per successful page. Total cost with advanced proxy: $0.005 per page.
Use advanced proxy for:
- Sites with aggressive bot detection
- Sites that block standard requests
- Enterprise websites with strict security
- E-commerce sites with protection
Only pay when you need it: The advanced proxy adds $0.004/page but significantly improves success rates on protected sites.
Pricing Examples
Standard crawling: for example, 40 successful pages cost 40 × $0.001 = $0.04; the same crawl with advanced proxy would cost 40 × $0.005 = $0.20.
Before You Start
Authentication
All requests require your API key in the Authorization header:
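A minimal Python sketch; the Bearer scheme, base URL, and seed-URL field name below are assumptions, so check the API reference for the exact values:

```python
import requests

headers = {"Authorization": "Bearer YOUR_API_KEY"}   # Bearer scheme assumed
response = requests.post(
    "https://api.example.com/api/v2/crawl_stream",   # placeholder base URL
    headers=headers,
    json={"url": "https://docs.example.com"},        # seed URL; field name assumed
)
```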
Important Limits
Max Pages
100 pages per request. Hard limit enforced by the API.
Timeout
60 seconds default. Configurable to longer values.
Your First Crawl (2-Minute Start)
Let’s crawl a website and get results in real time! The sketch below streams results as pages are scraped; when it finishes, you’ve crawled a website and received content from multiple pages in real time, paying only for the pages that succeeded, not the ones that failed.
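A minimal Python sketch, assuming a Bearer token scheme, a placeholder base URL, a url field for the seed, and data:-prefixed SSE lines; adjust names to match the API reference:

```python
import json
import requests

API_KEY = "YOUR_API_KEY"
ENDPOINT = "https://api.example.com/api/v2/crawl_stream"  # placeholder base URL

payload = {
    "url": "https://docs.example.com",  # seed URL; field name assumed
    "maxPages": 10,
    "maxDepth": 2,
}

with requests.post(
    ENDPOINT,
    headers={"Authorization": f"Bearer {API_KEY}"},  # Bearer scheme assumed
    json=payload,
    stream=True,   # keep the connection open and read SSE frames as they arrive
    timeout=90,
) as resp:
    resp.raise_for_status()
    for line in resp.iter_lines(decode_unicode=True):
        # SSE data lines look like: data: {"requested_url": "...", "success": true, ...}
        if not line or not line.startswith("data:"):
            continue
        frame = json.loads(line[len("data:"):].strip())
        if frame.get("success"):
            print(f"Scraped {frame['final_url']}: {frame['title']}")
        elif frame.get("error"):
            print(f"Failed {frame.get('requested_url')}: {frame['error']}")
```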
How Streaming Works
The Crawl API uses Server-Sent Events (SSE) to stream results as they happen.
Event Types
- page
- usage
- done
- error
The page event carries an individual page result and is sent each time a page is scraped (success or failure).
Why streaming? Get results immediately as pages are scraped instead of waiting for all pages to complete. Perfect for long-running crawls!
Basic Crawling
Crawl with Depth Control
Control how many “clicks” away from the seed URL to crawl; a request sketch follows.
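For example, a request body limiting the crawl to pages linked directly from the seed might look like this (the seed-URL field name is an assumption):

```python
payload = {
    "url": "https://docs.example.com",  # seed URL; field name assumed
    "maxDepth": 1,                      # only pages linked directly from the seed
    "maxPages": 20,
}
```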
Clean Content with Main Content Only
Extract only the main article/content without navigation, headers, or footers; a request sketch follows the lists below. Perfect for:
- Blog posts (without sidebar clutter)
- News articles (just the story)
- Documentation (pure content)
- Research papers (main text only)
- AI training data (cleaner input)
What gets removed:
- ❌ Navigation bars
- ❌ Sidebars
- ❌ Headers and footers
- ❌ Advertisement sections
- ❌ Related posts widgets
What stays:
- ✅ Main article content
- ✅ Embedded images in content
- ✅ Code blocks and tables
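A request body sketch with main_content_only enabled (the seed-URL field name is an assumption):

```python
payload = {
    "url": "https://blog.example.com",  # seed URL; field name assumed
    "maxDepth": 1,
    "main_content_only": True,          # strip navigation, headers, footers, sidebars
}
```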
Advanced Proxy for Protected Sites
Use advanced proxy infrastructure for sites with strict bot protection; a request sketch follows the list below.
When to use advanced proxy:
- Site returns 403 Forbidden
- Getting CAPTCHA challenges
- High-security enterprise sites
- E-commerce platforms
- Sites that block datacenter IPs
- After standard crawl fails
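A request body sketch with the advanced proxy enabled (the seed-URL field name is an assumption):

```python
payload = {
    "url": "https://shop.example.com",  # seed URL; field name assumed
    "maxPages": 10,
    "advanced_proxy": True,             # $0.005 per successful page instead of $0.001
}
```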
Combine Both Features
Get clean content from protected sites by combining both options, as sketched below.
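A request body sketch combining both options (the seed-URL field name is an assumption):

```python
payload = {
    "url": "https://news.example.com",  # seed URL; field name assumed
    "main_content_only": True,          # clean article content
    "advanced_proxy": True,             # bypass bot protection; $0.005 per successful page
}
```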
Include Subdomains
Crawl across all subdomains (blog.*, docs.*, api.*, etc.), as sketched below.
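A request body sketch enabling subdomain crawling; the include_subdomains field name is an assumption:

```python
payload = {
    "url": "https://example.com",   # seed URL; field name assumed
    "include_subdomains": True,     # field name assumed; follows blog.*, docs.*, api.*
    "maxPages": 30,
}
```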
Request Parameters (Complete Reference)
Endpoint: POST /api/v2/crawl_stream
Required Parameters
Starting URL (seed) for the crawl. Must be a valid HTTP(S) URL. Examples:
- ✅ https://docs.example.com
- ✅ https://blog.example.com/posts
- ❌ example.com (missing protocol)
Optional Parameters
maxPages
Maximum number of pages to crawl.
Range: 1 - 50 (hard limit)
Default: 25
maxDepth
Maximum depth to crawl from the seed URL.
- 1 - Only pages directly linked from the seed
- 2 - Seed + 1st level + 2nd level (default)
- 3+ - Deeper crawling
Total timeout for the entire crawl operation in seconds.
Default: 60 seconds
This is the total crawl timeout, not per-page timeout. The crawl stops when time runs out.
main_content_only
Extract only the main content, removing navigation, headers, footers, and sidebars.
Perfect for: Blog posts, news articles, documentation, and AI training data where you want clean, focused content.
advanced_proxy
Enable advanced proxy infrastructure for sites with bot protection.
Use when sites return 403 errors, CAPTCHA challenges, or have aggressive bot detection.
Follow links to subdomains (blog.*, docs.*, api.*, etc.)
Include images in markdown output.
Include hyperlinks in markdown output.
Response Format (SSE Frames)
The API streams Server-Sent Events (SSE) with JSON payloads.
Frame Types
- page
- usage
- done
- error
The page frame is an individual page result with the following fields:
- requested_url - Original URL requested
- final_url - URL after redirects
- title - Page title
- hash_sha256 - Content hash for deduplication
- markdown - Markdown content
- success - Boolean indicating success
- error - Error message if failed
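An illustrative parsed page frame, shown as a Python dict; all values are made up for demonstration:

```python
page_frame = {
    "requested_url": "https://docs.example.com/guide",
    "final_url": "https://docs.example.com/guide/",   # URL after redirects
    "title": "Getting Started Guide",
    "hash_sha256": "9f2c6a...",                        # content hash for deduplication
    "markdown": "# Getting Started\n\nWelcome to ...",
    "success": True,
    "error": None,
}
```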
Best Practices
💰 Cost Optimization
Use Map first
- Map the site ($0.002)
- Filter URLs you need
- Crawl only those sections
Set sensible limits
- Don’t set maxPages: 50 if you only need 10
- Use maxDepth: 1 for focused crawling
- Stop when you have enough
Don’t worry about failures
- Failed pages are free
- No penalty for timeouts
Use advanced proxy selectively
- Only for protected sites
- Costs 5x more per page
- But significantly improves success rate
⚡ Performance Tips
Use shallow depths
- maxDepth: 1 is fastest
- Deep crawls take longer
- Balance depth vs coverage
Stream incrementally
- Process pages as they stream
- Don’t wait for completion
- Save incrementally
Use main_content_only
- Faster extraction
- Smaller markdown output
- Better for AI processing
✨ Better Results
Choose right depth
- Docs sites: maxDepth: 2-3
- Blogs: maxDepth: 1-2
- Large sites: maxDepth: 1
Use main_content_only for
- Blog posts
- News articles
- Documentation
- AI training data
Handle redirects and duplicates (see the sketch below)
- Use final_url, not requested_url
- Track URL mappings
- Avoid duplicates with hash_sha256
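A hypothetical sketch of deduplicating by content hash while keying results on final_url; page_frames stands in for whatever iterable of parsed page frames your SSE reader produces:

```python
def deduplicate(page_frames):
    """Keep one copy per hash_sha256, keyed on final_url (not requested_url)."""
    seen_hashes = set()
    unique_pages = {}
    for frame in page_frames:
        if not frame.get("success"):
            continue                               # failed pages carry no content
        if frame["hash_sha256"] in seen_hashes:
            continue                               # identical content already stored
        seen_hashes.add(frame["hash_sha256"])
        unique_pages[frame["final_url"]] = frame["markdown"]
    return unique_pages
```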
🛡️ Reliability
Track failures
- Count successful vs failed
- Log errors for review
- Retry failed URLs if needed
Switch to advanced proxy when
- Getting 403 errors
- Site blocks requests
- Standard crawl fails
- Need higher success rate
Save results incrementally (sketched below)
- Write to disk as frames arrive
- Don’t hold everything in memory
- Survive interruptions
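A hypothetical sketch that appends each frame to a JSONL file as it arrives, so partial results survive interruptions; the frames argument stands in for your own SSE reader:

```python
import json

def save_frames(frames, path="crawl_results.jsonl"):
    """Append each parsed frame to a JSONL file as soon as it arrives."""
    with open(path, "a", encoding="utf-8") as out:
        for frame in frames:                # frames: iterator of parsed SSE frames
            out.write(json.dumps(frame) + "\n")
            out.flush()                     # persist immediately instead of buffering in memory
```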
Important Limitations
Frequently Asked Questions
When should I use main_content_only?
Use main_content_only: true when:
- You want clean article/blog content
- You’re training AI models (cleaner data)
- You need to remove sidebars and navigation
- You’re processing news articles
- You want focused documentation content
Skip it (keep the full page) when:
- You need the full page structure
- Navigation menus are important
- You want sidebar information
- Page layout matters
What gets removed:
- Headers and footers
- Navigation bars
- Sidebars
- Advertisement sections
- Related posts widgets
What stays:
- Main article content
- Images within content
- Code blocks
- Tables
When should I use advanced_proxy?
Use advanced_proxy: true when:
- Standard crawl returns 403 Forbidden
- Site shows CAPTCHA challenges
- E-commerce sites with protection
- Enterprise websites with strict security
- Datacenter IPs are blocked
- You need higher success rates on protected sites
Pricing:
- Standard: $0.001 per successful page
- With proxy: $0.005 per successful page (5x more)
- Worth it if standard crawl fails entirely
Do I pay for failed pages?
No! You only pay for successfully scraped pages. Example:
- Attempted: 50 pages
- Successful: 42 pages
- Failed: 8 pages
- Cost: 42 × $0.001 = $0.042 (without proxy)
- Cost: 42 × $0.005 = $0.21 (with proxy)
Can I use both main_content_only and advanced_proxy?
Yes! You can combine both features in one request, as shown in the Combine Both Features section above.
Cost: $0.005 per successful page (advanced proxy pricing)
Perfect for: Protected news sites, paywalled blogs, enterprise documentation
How does main_content_only affect the output?
Without main_content_only, the markdown includes the page’s navigation, headers, footers, and sidebars alongside the article. With main_content_only, only the main article content (plus its images, code blocks, and tables) remains. The extracted content is cleaner and more focused on the actual article.
When should I use Crawl vs Scrape?
Use Crawl when:
- You need multiple related pages
- You want automatic link following
- You’re downloading a section/category
- You want streaming results
Use Scrape when:
- You need exactly one page
- You know the specific URL
- You need non-streaming response
Use Map + Scrape when:
- You need specific pages (not sequential)
- You want full control over which pages
- Pages are scattered across the site
Next Steps
Map API
Discover URLs before crawling
Scraper API
Scrape individual pages
Answer API
Search + AI-powered answers
Need Help?
Found a bug or have a feature request? We’d love to hear from you! Join our Discord or email us at [email protected]
