
What is the Crawl API?

The Crawl API is like a smart web crawler that automatically:
  1. Starts from a seed URL (e.g., homepage)
  2. Follows internal links to discover more pages
  3. Extracts content from each page in Markdown format
  4. Streams results as pages are scraped (no waiting!)

Multi-Page Scraping

Scrape dozens of pages in one request

Real-Time Streaming

Get results as pages are scraped, not at the end
Perfect for: Documentation downloads, blog archives, content backups, site migrations, and bulk content extraction.
Smart crawling: The API automatically discovers pages by following links, respects depth limits, and avoids duplicates.

Crawl vs Map vs Scrape

Understand when to use each API:
Feature            Map API             Scrape API         Crawl API
What it does       Discovers URLs      Scrapes 1 page     Scrapes multiple pages
Content returned   Titles only         Full content       Full content
Speed              Very fast (1-5s)    Fast (2-7s)        Depends on pages
Cost               $0.002 flat         $0.001 per page    $0.001 per success
Use when           Planning            Single page        Bulk extraction
Streaming          No                  No                 ✅ Yes
Best workflow:
  1. Map the site to discover URLs ($0.002)
  2. Filter to pages you need
  3. Crawl selected sections ($0.001/page) OR Scrape individual pages
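A rough sketch of that workflow in TypeScript. The crawlStream call is the method documented on this page; the map() call and its response shape are hypothetical placeholders, so check the Map API reference for the real method name:
import { LLMLayerClient } from 'llmlayer';

const client = new LLMLayerClient({ apiKey: process.env.LLMLAYER_API_KEY });

// 1. Map the site first. NOTE: map() and siteMap.urls are hypothetical here --
//    see the Map API reference for the actual method and response shape.
const siteMap = await (client as any).map({ url: 'https://docs.example.com' });

// 2. Filter to the URLs you actually need
const guideSeeds: string[] = siteMap.urls.filter((u: string) => u.includes('/guides/'));

// 3. Crawl each selected section (or scrape individual pages instead)
for (const seed of guideSeeds) {
  for await (const frame of client.crawlStream({ url: seed, maxPages: 10, maxDepth: 1 })) {
    if (frame.type === 'page' && frame.page.success) {
      console.log(`Crawled: ${frame.page.final_url}`);
    }
  }
}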

Pricing (Pay Per Success)

Per-Page Success Model

$0.001 per successfully scraped page = $1 for 1,000 pages
You only pay for pages that succeed! Failed pages are free. This is better than competitors who charge per attempt.

Advanced Proxy Pricing

Advanced Proxy (Optional)

Additional $0.004 per successful page. Total cost with advanced proxy: $0.005 per page.
Use advanced proxy for:
  • Sites with aggressive bot detection
  • Sites that block standard requests
  • Enterprise websites with strict security
  • E-commerce sites with protection
Only pay when you need it: The advanced proxy adds $0.004/page but significantly improves success rates on protected sites.

Pricing Examples

Standard crawling:
You request: max_pages = 50
Successfully scraped: 45 pages
Failed (timeouts/errors): 5 pages

Cost: 45 × $0.001 = $0.045 ✅
NOT: 50 × $0.001 = $0.050 ❌
With advanced proxy:
You request: max_pages = 50, advanced_proxy = true
Successfully scraped: 45 pages
Failed: 5 pages

Base cost: 45 × $0.001 = $0.045
Proxy cost: 45 × $0.004 = $0.180
Total cost: $0.225 ✅

(Only successful pages are charged)
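To sanity-check a bill, here is a tiny helper that applies the same per-success arithmetic (rates copied from this page; the function name is just for illustration):
// Estimate crawl cost under the pay-per-success model above.
// Base rate $0.001/page; advanced proxy adds $0.004/page (successful pages only).
function estimateCrawlCost(successfulPages: number, advancedProxy = false): number {
  const perPage = advancedProxy ? 0.001 + 0.004 : 0.001;
  return successfulPages * perPage;
}

console.log(estimateCrawlCost(45).toFixed(3));       // "0.045" -- standard example above
console.log(estimateCrawlCost(45, true).toFixed(3)); // "0.225" -- advanced proxy example above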

Before You Start

Authentication

All requests require your API key in the Authorization header:
Authorization: Bearer YOUR_LLMLAYER_API_KEY
Keep your API key secure! Never expose it in client-side code. Always call from your backend.

Important Limits

Max Pages

100 pages per request (hard limit enforced by the API)

Timeout

60 seconds by default; configurable via the timeout parameter

Your First Crawl (2-Minute Start)

Let’s crawl a website and get results in real-time!
import { LLMLayerClient } from 'llmlayer';

const client = new LLMLayerClient({
  apiKey: process.env.LLMLAYER_API_KEY
});

// Crawl with real-time streaming
for await (const frame of client.crawlStream({
  url: 'https://docs.example.com',
  maxPages: 5,         // Crawl up to 5 pages
})) {

  // Each frame has a type
  if (frame.type === 'page') {
    const page = frame.page;

    if (page.success) {
      console.log(`✅ ${page.title}`);
      console.log(`   URL: ${page.final_url}`);
      console.log(`   Content length: ${page.markdown?.length || 0} chars\n`);
    } else {
      console.log(`❌ Failed: ${page.final_url}`);
      console.log(`   Error: ${page.error}\n`);
    }
  }

  else if (frame.type === 'usage') {
    console.log(`\n💰 Billing:`);
    console.log(`   Successful pages: ${frame.billed_count}`);
    console.log(`   Cost: $${frame.cost}`);
  }

  else if (frame.type === 'done') {
    console.log(`\n⏱️  Completed in ${frame.response_time}s`);
  }
}
Output:
✅ Getting Started
   URL: https://docs.example.com/getting-started
   Content length: 1234 chars

✅ API Reference
   URL: https://docs.example.com/api-reference
   Content length: 3456 chars

❌ Failed: https://docs.example.com/broken-link
   Error: 404 Not Found

✅ Examples
   URL: https://docs.example.com/examples
   Content length: 2345 chars

💰 Billing:
   Successful pages: 3
   Cost: $0.003

⏱️  Completed in 8.34s
Done! You just crawled a website and got content from multiple pages in real-time. Notice you only paid for the 3 successful pages, not the failed one!

How Streaming Works

The Crawl API uses Server-Sent Events (SSE) to stream results as they happen.

Event Types

page: an individual page result, sent each time a page is scraped (success or failure)
{
  "type": "page",
  "page": {
    "requested_url": "https://...",
    "final_url": "https://...",
    "title": "Page Title",
    "hash_sha256": "abc123...",
    "markdown": "# Content...",
    "success": true
  }
}
Why streaming? Get results immediately as pages are scraped instead of waiting for all pages to complete. Perfect for long-running crawls!

Basic Crawling

Crawl with Depth Control

Control how many “clicks” away from the seed URL to crawl.
// Depth 1: Only crawl pages linked from the homepage
for await (const frame of client.crawlStream({
  url: 'https://example.com',
  maxPages: 10,
  maxDepth: 1,  // Only 1 level deep
})) {
  if (frame.type === 'page' && frame.page.success) {
    console.log(`Crawled: ${frame.page.title}`);
  }
}

// Depth 2: Crawl homepage + pages it links to + pages those link to
for await (const frame of client.crawlStream({
  url: 'https://example.com',
  maxPages: 20,
  maxDepth: 2,  // 2 levels deep
})) {
  if (frame.type === 'page' && frame.page.success) {
    console.log(`Crawled: ${frame.page.title}`);
  }
}
Depth visualization:
Depth 0 (seed):  https://example.com/
                         |
          +--------------+---------------+
          |              |               |
Depth 1:  /about      /products      /contact
                         |
               +---------+---------+
               |                   |
Depth 2:  /products/item1    /products/item2
Depth tips:
  • maxDepth: 1 - Fast, focused crawling
  • maxDepth: 2 - Good balance (default)
  • maxDepth: 3+ - May crawl too much

Clean Content with Main Content Only

Extract only the main article/content without navigation, headers, or footers.
// Get clean content without navigation elements
for await (const frame of client.crawlStream({
  url: 'https://blog.example.com',
  maxPages: 20,
  mainContentOnly: true,  // Extract only main content
})) {

  if (frame.type === 'page' && frame.page.success) {
    const page = frame.page;
    console.log(`✅ ${page.title}`);
    console.log(`   Clean content: ${page.markdown?.length || 0} chars`);
    // Content without header, footer, sidebar, navigation
  }
}
Perfect for:
  • Blog posts (without sidebar clutter)
  • News articles (just the story)
  • Documentation (pure content)
  • Research papers (main text only)
  • AI training data (cleaner input)
What gets removed:
  • ❌ Navigation bars
  • ❌ Sidebars
  • ❌ Headers and footers
  • ❌ Advertisement sections
  • ❌ Related posts widgets
What stays:
  • ✅ Main article content
  • ✅ Embedded images in content
  • ✅ Code blocks and tables

Advanced Proxy for Protected Sites

Use advanced proxy infrastructure for sites with strict bot protection.
// Crawl sites with bot protection
for await (const frame of client.crawlStream({
  url: 'https://protected-site.com',
  maxPages: 10,
  advancedProxy: true,  // Enable advanced proxy (+$0.004/page)
})) {

  if (frame.type === 'page' && frame.page.success) {
    console.log(`✅ Successfully scraped: ${frame.page.title}`);
  }

  else if (frame.type === 'usage') {
    console.log(`💰 Total cost: $${frame.cost}`);
    // Cost includes base ($0.001) + proxy ($0.004) per successful page
  }
}
Additional cost: Advanced proxy adds $0.004 per successful page (total $0.005/page instead of $0.001/page).
When to use advanced proxy:
  • Site returns 403 Forbidden
  • Getting CAPTCHA challenges
  • High-security enterprise sites
  • E-commerce platforms
  • Sites that block datacenter IPs
  • After standard crawl fails
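As the last bullet suggests, a common pattern is to crawl without the proxy first and only re-run the failures with advancedProxy enabled, so the surcharge applies only where it is needed. A sketch using the same crawlStream call from the examples above (retrying each failed URL as its own one-page crawl is just one possible approach):
import { LLMLayerClient } from 'llmlayer';

const client = new LLMLayerClient({ apiKey: process.env.LLMLAYER_API_KEY });

// 1) Standard crawl first; remember any pages that fail.
const failedUrls: string[] = [];

for await (const frame of client.crawlStream({
  url: 'https://protected-site.com',
  maxPages: 20,
})) {
  if (frame.type === 'page' && !frame.page.success) {
    failedUrls.push(frame.page.requested_url);
  }
}

// 2) Retry only the failed URLs through the advanced proxy (+$0.004 per success).
for (const url of failedUrls) {
  for await (const frame of client.crawlStream({
    url,
    maxPages: 1,          // treat each failed URL as its own single-page crawl
    advancedProxy: true,
  })) {
    if (frame.type === 'page') {
      console.log(frame.page.success ? `✅ Recovered: ${url}` : `❌ Still failing: ${url}`);
    }
  }
}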

Combine Both Features

Get clean content from protected sites.
// Best of both worlds: clean content from protected sites
for await (const frame of client.crawlStream({
  url: 'https://protected-news-site.com',
  maxPages: 30,
  mainContentOnly: true,    // Clean content
  advancedProxy: true,      // Better success rate
  maxDepth: 2
})) {

  if (frame.type === 'page' && frame.page.success) {
    const page = frame.page;
    console.log(`✅ ${page.title}`);
    console.log(`   Clean article: ${page.markdown?.length || 0} chars`);
  }

  else if (frame.type === 'usage') {
    console.log(`\n📊 Results:`);
    console.log(`   Successful: ${frame.billed_count} pages`);
    console.log(`   Cost: $${frame.cost} (includes proxy fees)`);
  }
}

Include Subdomains

Crawl across all subdomains (blog.*, docs.*, api.*, etc.)
for await (const frame of client.crawlStream({
  url: 'https://example.com',
  maxPages: 20,
  includeSubdomains: true,  // Crawl blog.*, docs.*, etc.
})) {

  if (frame.type === 'page' && frame.page.success) {
    const hostname = new URL(frame.page.final_url).hostname;
    console.log(`${hostname}: ${frame.page.title}`);
  }
}

Request Parameters (Complete Reference)

Endpoint: POST /api/v2/crawl_stream

Required Parameters

url
string
required
Starting URL (seed) for the crawl. Must be a valid HTTP(S) URL.
Examples:
  • ✅ https://docs.example.com
  • ✅ https://blog.example.com/posts
  • ❌ example.com (missing protocol)

Optional Parameters

max_pages
integer
default:"25"
Maximum number of pages to crawl. Range: 1 - 50 (hard limit). Default: 25.
maxPages: 10  // Stop after 10 pages
Hard limit of 50 pages per request is enforced by the API.
max_depth
integer
default:"2"
Maximum depth to crawl from seed URL.
  • 1 - Only pages directly linked from seed
  • 2 - Seed + 1st level + 2nd level (default)
  • 3+ - Deeper crawling
maxDepth: 1  // Shallow crawl
timeout
number
default:"60.0"
Total timeout for the entire crawl operation, in seconds. Default: 60 seconds.
timeoutSeconds: 120  // 2 minutes
This is the total crawl timeout, not per-page timeout. The crawl stops when time runs out.
main_content_only
boolean
default:"false"
Extract only the main content, removing navigation, headers, footers, and sidebars.
mainContentOnly: true  // Clean article content
Perfect for: Blog posts, news articles, documentation, and AI training data where you want clean, focused content.
advanced_proxy
boolean
default:"false"
Enable advanced proxy infrastructure for sites with bot protection.
advancedProxy: true  // Bypass bot detection (+$0.004/page)
Additional cost: Adds $0.004 per successful page (total $0.005/page instead of $0.001/page).
Use when sites return 403 errors, CAPTCHA challenges, or have aggressive bot detection.
include_subdomains
boolean
default:"false"
Follow links to subdomains (blog.*, docs.*, api.*, etc.)
includeSubdomains: true  // Crawl across subdomains
include_images
boolean
default:"true"
Include images in markdown output.
includeImages: false  // Text only
include_links
boolean
Include hyperlinks in markdown output.
includeLinks: false  // Plain text

Response Format (SSE Frames)

The API streams Server-Sent Events (SSE) with JSON payloads.
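If you are not using the SDK, the stream can be consumed directly with fetch. A minimal sketch, assuming the snake_case request fields documented above and a placeholder API host (BASE_URL is an assumption you must replace with the actual host for your account):
// Raw SSE consumption sketch (no SDK). BASE_URL is a placeholder -- this page
// does not document the API host, so set it yourself.
const BASE_URL = process.env.LLMLAYER_BASE_URL!;

const res = await fetch(`${BASE_URL}/api/v2/crawl_stream`, {
  method: 'POST',
  headers: {
    Authorization: `Bearer ${process.env.LLMLAYER_API_KEY}`,
    'Content-Type': 'application/json',
  },
  body: JSON.stringify({ url: 'https://docs.example.com', max_pages: 5 }),
});

const reader = res.body!.getReader();
const decoder = new TextDecoder();
let buffer = '';

while (true) {
  const { done, value } = await reader.read();
  if (done) break;
  buffer += decoder.decode(value, { stream: true });

  // SSE events are separated by a blank line; each "data:" line carries a JSON frame.
  const events = buffer.split('\n\n');
  buffer = events.pop() ?? '';
  for (const event of events) {
    for (const line of event.split('\n')) {
      if (line.startsWith('data:')) {
        const frame = JSON.parse(line.slice(5).trim());
        console.log(frame.type, frame.type === 'page' ? frame.page.final_url : '');
      }
    }
  }
}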

Frame Types

Individual page result
{
  "type": "page",
  "page": {
    "requested_url": "https://example.com/page1",
    "final_url": "https://example.com/page1",
    "title": "Page Title",
    "hash_sha256": "abc123...",
    "markdown": "# Content...",
    "success": true,
    "error": null
  }
}
Fields:
  • requested_url - Original URL requested
  • final_url - URL after redirects
  • title - Page title
  • hash_sha256 - Content hash for deduplication
  • markdown - Markdown content
  • success - Boolean indicating success
  • error - Error message if failed
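The quickstart example also reads usage and done frames. Their full payloads are not reproduced on this page, so the sketch below types only the fields that example accesses (billed_count, cost, response_time); treat those two interfaces as assumptions rather than a complete schema:
// Frame shapes as used in the examples on this page. PageFrame fields come from the
// documented payload above; UsageFrame and DoneFrame list only the fields the
// quickstart reads and may be incomplete.
interface PageFrame {
  type: 'page';
  page: {
    requested_url: string;
    final_url: string;
    title: string;
    hash_sha256: string;
    markdown: string | null;
    success: boolean;
    error: string | null;
  };
}

interface UsageFrame {
  type: 'usage';
  billed_count: number; // successful pages billed
  cost: number;         // total cost in USD
}

interface DoneFrame {
  type: 'done';
  response_time: number; // seconds for the whole crawl
}

type CrawlFrame = PageFrame | UsageFrame | DoneFrame;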

Best Practices

💰 Cost Optimization

Use Map first
  • Map the site ($0.002)
  • Filter URLs you need
  • Crawl only those sections
Set appropriate limits
  • Don’t set maxPages: 50 if you only need 10
  • Use maxDepth: 1 for focused crawling
  • Stop when you have enough
Remember: Pay per success
  • Failed pages are free
  • No penalty for timeouts
Use advanced proxy wisely
  • Only for protected sites
  • Costs 5x more per page
  • But significantly improves success rate

⚡ Performance Tips

Use shallow depths
  • maxDepth: 1 is fastest
  • Deep crawls take longer
  • Balance depth vs coverage
Handle partial results
  • Process pages as they stream
  • Don’t wait for completion
  • Save incrementally
Main content only for speed
  • Faster extraction
  • Smaller markdown output
  • Better for AI processing

✨ Better Results

Choose right depth
  • Docs sites: maxDepth: 2-3
  • Blogs: maxDepth: 1-2
  • Large sites: maxDepth: 1
Use main_content_only for:
  • Blog posts
  • News articles
  • Documentation
  • AI training data
Handle redirects
  • Use final_url not requested_url
  • Track URL mappings
  • Avoid duplicates with hash_sha256
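A small sketch of the redirect and deduplication tips above: key each page by hash_sha256 so duplicate content is stored once, and keep final_url as the canonical address.
import { LLMLayerClient } from 'llmlayer';

const client = new LLMLayerClient({ apiKey: process.env.LLMLAYER_API_KEY });

// Deduplicate streamed pages by content hash; store final_url, not requested_url.
const seenHashes = new Set<string>();
const pages: { url: string; markdown: string }[] = [];

for await (const frame of client.crawlStream({ url: 'https://example.com', maxPages: 30 })) {
  if (frame.type !== 'page' || !frame.page.success) continue;

  const { hash_sha256, final_url, markdown } = frame.page;
  if (seenHashes.has(hash_sha256)) continue; // same content already captured
  seenHashes.add(hash_sha256);

  pages.push({ url: final_url, markdown: markdown ?? '' });
}

console.log(`Kept ${pages.length} unique pages`);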

🛡️ Reliability

Track failures
  • Count successful vs failed
  • Log errors for review
  • Retry failed URLs if needed
Use advanced proxy when:
  • Getting 403 errors
  • Site blocks requests
  • Standard crawl fails
  • Need higher success rate
Save progressively
  • Write to disk as frames arrive
  • Don’t hold everything in memory
  • Survive interruptions
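A minimal Node.js sketch of the two patterns above: write each successful page to disk as its frame arrives and collect failures for a later retry pass (the file-naming scheme is just one possibility).
import { mkdir, writeFile } from 'node:fs/promises';
import { LLMLayerClient } from 'llmlayer';

const client = new LLMLayerClient({ apiKey: process.env.LLMLAYER_API_KEY });

// Save pages to disk as they stream in; collect failures for a later retry pass.
await mkdir('crawl-output', { recursive: true });
const failures: { url: string; error: string | null }[] = [];

for await (const frame of client.crawlStream({ url: 'https://docs.example.com', maxPages: 50 })) {
  if (frame.type !== 'page') continue;

  if (frame.page.success) {
    // Derive a filesystem-safe name from the final URL (one possible scheme).
    const name = new URL(frame.page.final_url).pathname.replace(/[^a-z0-9]+/gi, '_') || 'index';
    await writeFile(`crawl-output/${name}.md`, frame.page.markdown ?? '');
  } else {
    failures.push({ url: frame.page.requested_url, error: frame.page.error });
  }
}

console.log(`Failed pages to retry later: ${failures.length}`);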

Important Limitations

Hard limits:
  • Maximum 100 pages per crawl request
  • Default timeout of 60 seconds (configurable)
  • Streaming only (no blocking version)

Frequently Asked Questions

When should I use main_content_only?
Use main_content_only: true when:
  • You want clean article/blog content
  • You’re training AI models (cleaner data)
  • You need to remove sidebars and navigation
  • You’re processing news articles
  • You want focused documentation content
Don’t use when:
  • You need the full page structure
  • Navigation menus are important
  • You want sidebar information
  • Page layout matters
What it removes:
  • Headers and footers
  • Navigation bars
  • Sidebars
  • Advertisement sections
  • Related posts widgets
What it keeps:
  • Main article content
  • Images within content
  • Code blocks
  • Tables
When should I use advanced_proxy?
Use advanced_proxy: true when:
  • Standard crawl returns 403 Forbidden
  • Site shows CAPTCHA challenges
  • E-commerce sites with protection
  • Enterprise websites with strict security
  • Datacenter IPs are blocked
  • You need higher success rates on protected sites
Cost consideration:
  • Standard: $0.001 per successful page
  • With proxy: $0.005 per successful page (5x more)
  • Worth it if standard crawl fails entirely
Do I pay for failed pages?
No! You only pay for successfully scraped pages. Example:
  • Attempted: 50 pages
  • Successful: 42 pages
  • Failed: 8 pages
  • Cost: 42 × $0.001 = $0.042 (without proxy)
  • Cost: 42 × $0.005 = $0.210 (with proxy)
Failed pages (404s, timeouts, errors) are completely free.
Can I combine main_content_only and advanced_proxy?
Yes! You can combine both features:
{
  url: 'https://protected-news-site.com',
  mainContentOnly: true,   // Clean content
  advancedProxy: true      // Bypass protection
}
Cost: $0.005 per successful page (advanced proxy pricing)
Perfect for: Protected news sites, paywalled blogs, enterprise documentation
What difference does main_content_only make to the output?
Without main_content_only:
# Navigation
- Home
- About
- Contact

# Main Article
This is the article content...

# Sidebar
- Related Posts
- Advertisement

# Footer
Copyright 2024
With main_content_only:
# Main Article
This is the article content...
The extracted content is cleaner and more focused on the actual article.
When should I use Crawl vs Scrape vs Map?
Use Crawl when:
  • You need multiple related pages
  • You want automatic link following
  • You’re downloading a section/category
  • You want streaming results
Use Scrape when:
  • You need exactly one page
  • You know the specific URL
  • You need non-streaming response
Use Map + Scrape when:
  • You need specific pages (not sequential)
  • You want full control over which pages
  • Pages are scattered across the site



Need Help?

Found a bug or have a feature request? We’d love to hear from you! Join our Discord or email us at [email protected]