
What is the Crawl API?

The Crawl API is like a smart web crawler that automatically:
  1. Starts from a seed URL (e.g., homepage)
  2. Follows internal links to discover more pages
  3. Extracts content from each page in Markdown format
  4. Streams results as pages are scraped (no waiting!)

Multi-Page Scraping

Scrape dozens of pages in one request

Real-Time Streaming

Get results as pages are scraped, not at the end
Perfect for: Documentation downloads, blog archives, content backups, site migrations, and bulk content extraction.
Smart crawling: The API automatically discovers pages by following links, respects depth limits, and avoids duplicates.

Crawl vs Map vs Scrape

Understand when to use each API:
Feature            Map API             Scrape API         Crawl API
What it does       Discovers URLs      Scrapes 1 page     Scrapes multiple pages
Content returned   Titles only         Full content       Full content
Speed              Very fast (1-5s)    Fast (2-7s)        Depends on pages
Cost               $0.002 flat         $0.001 per page    $0.001 per success
Use when           Planning            Single page        Bulk extraction
Streaming          No                  No                 ✅ Yes
Best workflow:
  1. Map the site to discover URLs ($0.002)
  2. Filter to pages you need
  3. Crawl selected sections ($0.001/page) OR Scrape individual pages
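A rough sketch of that workflow in TypeScript. The crawlStream call is the method documented on this page; the map() call and its response shape are hypothetical placeholders, so check the Map API reference for the real method name:
import { LLMLayerClient } from 'llmlayer';

const client = new LLMLayerClient({ apiKey: process.env.LLMLAYER_API_KEY });

// 1. Map the site first. NOTE: map() and siteMap.urls are hypothetical here --
//    see the Map API reference for the actual method and response shape.
const siteMap = await (client as any).map({ url: 'https://docs.example.com' });

// 2. Filter to the URLs you actually need
const guideSeeds: string[] = siteMap.urls.filter((u: string) => u.includes('/guides/'));

// 3. Crawl each selected section (or scrape individual pages instead)
for (const seed of guideSeeds) {
  for await (const frame of client.crawlStream({ url: seed, maxPages: 10, maxDepth: 1 })) {
    if (frame.type === 'page' && frame.page.success) {
      console.log(`Crawled: ${frame.page.final_url}`);
    }
  }
}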

Pricing (Pay Per Success)

Per-Page Success Model

$0.001 per successfully scraped page = $1 for 1,000 pages
You only pay for pages that succeed! Failed pages are free. This is better than competitors who charge per attempt.

Advanced Proxy Pricing

Advanced Proxy (Optional)

Additional $0.004 per successful page. Total cost with advanced proxy: $0.005 per page.
Use advanced proxy for:
  • Sites with aggressive bot detection
  • Sites that block standard requests
  • Enterprise websites with strict security
  • E-commerce sites with protection
Only pay when you need it: The advanced proxy adds $0.004/page but significantly improves success rates on protected sites.

Pricing Examples

Standard crawling:
You request: max_pages = 50
Successfully scraped: 45 pages
Failed (timeouts/errors): 5 pages

Cost: 45 × $0.001 = $0.045 ✅
NOT: 50 × $0.001 = $0.050 ❌
With advanced proxy:
You request: max_pages = 50, advanced_proxy = true
Successfully scraped: 45 pages
Failed: 5 pages

Base cost: 45 × $0.001 = $0.045
Proxy cost: 45 × $0.004 = $0.180
Total cost: $0.225 ✅

(Only successful pages are charged)
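To sanity-check a bill, here is a tiny helper that applies the same per-success arithmetic (rates copied from this page; the function name is just for illustration):
// Estimate crawl cost under the pay-per-success model above.
// Base rate $0.001/page; advanced proxy adds $0.004/page (successful pages only).
function estimateCrawlCost(successfulPages: number, advancedProxy = false): number {
  const perPage = advancedProxy ? 0.001 + 0.004 : 0.001;
  return successfulPages * perPage;
}

console.log(estimateCrawlCost(45).toFixed(3));       // "0.045" -- standard example above
console.log(estimateCrawlCost(45, true).toFixed(3)); // "0.225" -- advanced proxy example above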

Before You Start

Authentication

All requests require your API key in the Authorization header:
Authorization: Bearer YOUR_LLMLAYER_API_KEY
Keep your API key secure! Never expose it in client-side code. Always call from your backend.

Important Limits

Max Pages

100 pages per request (hard limit enforced by the API)

Timeout

60 seconds by default; configurable via the timeout parameter

Your First Crawl (2-Minute Start)

Let’s crawl a website and get results in real-time!
import { LLMLayerClient } from 'llmlayer';

const client = new LLMLayerClient({
  apiKey: process.env.LLMLAYER_API_KEY
});

// Crawl with real-time streaming
for await (const frame of client.crawlStream({
  url: 'https://docs.example.com',
  maxPages: 5,         // Crawl up to 5 pages
})) {

  // Each frame has a type
  if (frame.type === 'page') {
    const page = frame.page;

    if (page.success) {
      console.log(`✅ ${page.title}`);
      console.log(`   URL: ${page.final_url}`);
      console.log(`   Content length: ${page.markdown?.length || 0} chars\n`);
    } else {
      console.log(`❌ Failed: ${page.final_url}`);
      console.log(`   Error: ${page.error}\n`);
    }
  }

  else if (frame.type === 'usage') {
    console.log(`\n💰 Billing:`);
    console.log(`   Successful pages: ${frame.billed_count}`);
    console.log(`   Cost: $${frame.cost}`);
  }

  else if (frame.type === 'done') {
    console.log(`\n⏱️  Completed in ${frame.response_time}s`);
  }
}
Output:
✅ Getting Started
   URL: https://docs.example.com/getting-started
   Content length: 1234 chars

✅ API Reference
   URL: https://docs.example.com/api-reference
   Content length: 3456 chars

❌ Failed: https://docs.example.com/broken-link
   Error: 404 Not Found

✅ Examples
   URL: https://docs.example.com/examples
   Content length: 2345 chars

💰 Billing:
   Successful pages: 3
   Cost: $0.003

⏱️  Completed in 8.34s
Done! You just crawled a website and got content from multiple pages in real-time. Notice you only paid for the 3 successful pages, not the failed one!

How Streaming Works

The Crawl API uses Server-Sent Events (SSE) to stream results as they happen.

Event Types

page: an individual page result, sent each time a page is scraped (success or failure)
{
  "type": "page",
  "page": {
    "requested_url": "https://...",
    "final_url": "https://...",
    "title": "Page Title",
    "hash_sha256": "abc123...",
    "markdown": "# Content...",
    "success": true
  }
}
Why streaming? Get results immediately as pages are scraped instead of waiting for all pages to complete. Perfect for long-running crawls!

Basic Crawling

Crawl with Depth Control

Control how many “clicks” away from the seed URL to crawl.
// Depth 1: Only crawl pages linked from the homepage
for await (const frame of client.crawlStream({
  url: 'https://example.com',
  maxPages: 10,
  maxDepth: 1,  // Only 1 level deep
})) {
  if (frame.type === 'page' && frame.page.success) {
    console.log(`Crawled: ${frame.page.title}`);
  }
}

// Depth 2: Crawl homepage + pages it links to + pages those link to
for await (const frame of client.crawlStream({
  url: 'https://example.com',
  maxPages: 20,
  maxDepth: 2,  // 2 levels deep
})) {
  if (frame.type === 'page' && frame.page.success) {
    console.log(`Crawled: ${frame.page.title}`);
  }
}
Depth visualization:
Depth 0 (seed):  https://example.com/
                         |
          +--------------+---------------+
          |              |               |
Depth 1:  /about      /products      /contact
                         |
               +---------+---------+
               |                   |
Depth 2:  /products/item1    /products/item2
Depth tips:
  • maxDepth: 1 - Fast, focused crawling
  • maxDepth: 2 - Good balance (default)
  • maxDepth: 3+ - May crawl too much

Clean Content with Main Content Only

Extract only the main article/content without navigation, headers, or footers.
// Get clean content without navigation elements
for await (const frame of client.crawlStream({
  url: 'https://blog.example.com',
  maxPages: 20,
  mainContentOnly: true,  // Extract only main content
})) {

  if (frame.type === 'page' && frame.page.success) {
    const page = frame.page;
    console.log(`✅ ${page.title}`);
    console.log(`   Clean content: ${page.markdown?.length || 0} chars`);
    // Content without header, footer, sidebar, navigation
  }
}
Perfect for:
  • Blog posts (without sidebar clutter)
  • News articles (just the story)
  • Documentation (pure content)
  • Research papers (main text only)
  • AI training data (cleaner input)
What gets removed:
  • ❌ Navigation bars
  • ❌ Sidebars
  • ❌ Headers and footers
  • ❌ Advertisement sections
  • ❌ Related posts widgets
What stays:
  • ✅ Main article content
  • ✅ Embedded images in content
  • ✅ Code blocks and tables

Advanced Proxy for Protected Sites

Use advanced proxy infrastructure for sites with strict bot protection.
// Crawl sites with bot protection
for await (const frame of client.crawlStream({
  url: 'https://protected-site.com',
  maxPages: 10,
  advancedProxy: true,  // Enable advanced proxy (+$0.004/page)
})) {

  if (frame.type === 'page' && frame.page.success) {
    console.log(`✅ Successfully scraped: ${frame.page.title}`);
  }

  else if (frame.type === 'usage') {
    console.log(`💰 Total cost: $${frame.cost}`);
    // Cost includes base ($0.001) + proxy ($0.004) per successful page
  }
}
Additional cost: Advanced proxy adds $0.004 per successful page (total $0.005/page instead of $0.001/page).
When to use advanced proxy:
  • Site returns 403 Forbidden
  • Getting CAPTCHA challenges
  • High-security enterprise sites
  • E-commerce platforms
  • Sites that block datacenter IPs
  • After standard crawl fails
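As the last bullet suggests, a common pattern is to crawl without the proxy first and only re-run the failures with advancedProxy enabled, so the surcharge applies only where it is needed. A sketch using the same crawlStream call from the examples above (retrying each failed URL as its own one-page crawl is just one possible approach):
import { LLMLayerClient } from 'llmlayer';

const client = new LLMLayerClient({ apiKey: process.env.LLMLAYER_API_KEY });

// 1) Standard crawl first; remember any pages that fail.
const failedUrls: string[] = [];

for await (const frame of client.crawlStream({
  url: 'https://protected-site.com',
  maxPages: 20,
})) {
  if (frame.type === 'page' && !frame.page.success) {
    failedUrls.push(frame.page.requested_url);
  }
}

// 2) Retry only the failed URLs through the advanced proxy (+$0.004 per success).
for (const url of failedUrls) {
  for await (const frame of client.crawlStream({
    url,
    maxPages: 1,          // treat each failed URL as its own single-page crawl
    advancedProxy: true,
  })) {
    if (frame.type === 'page') {
      console.log(frame.page.success ? `✅ Recovered: ${url}` : `❌ Still failing: ${url}`);
    }
  }
}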

Combine Both Features

Get clean content from protected sites.
// Best of both worlds: clean content from protected sites
for await (const frame of client.crawlStream({
  url: 'https://protected-news-site.com',
  maxPages: 30,
  mainContentOnly: true,    // Clean content
  advancedProxy: true,      // Better success rate
  maxDepth: 2
})) {

  if (frame.type === 'page' && frame.page.success) {
    const page = frame.page;
    console.log(`✅ ${page.title}`);
    console.log(`   Clean article: ${page.markdown?.length || 0} chars`);
  }

  else if (frame.type === 'usage') {
    console.log(`\n📊 Results:`);
    console.log(`   Successful: ${frame.billed_count} pages`);
    console.log(`   Cost: $${frame.cost} (includes proxy fees)`);
  }
}

Include Subdomains

Crawl across all subdomains (blog.*, docs.*, api.*, etc.)
for await (const frame of client.crawlStream({
  url: 'https://example.com',
  maxPages: 20,
  includeSubdomains: true,  // Crawl blog.*, docs.*, etc.
})) {

  if (frame.type === 'page' && frame.page.success) {
    const hostname = new URL(frame.page.final_url).hostname;
    console.log(`${hostname}: ${frame.page.title}`);
  }
}

Request Parameters (Complete Reference)

Endpoint: POST /api/v2/crawl_stream

Required Parameters

url
string
required
Starting URL (seed) for the crawl. Must be a valid HTTP(S) URL.
Examples:
  • ✅ https://docs.example.com
  • ✅ https://blog.example.com/posts
  • ❌ example.com (missing protocol)

Optional Parameters

max_pages
integer
default:"25"
Maximum number of pages to crawl. Range: 1 - 50 (hard limit). Default: 25.
maxPages: 10  // Stop after 10 pages
Hard limit of 50 pages per request is enforced by the API.
max_depth
integer
default:"2"
Maximum depth to crawl from seed URL.
  • 1 - Only pages directly linked from seed
  • 2 - Seed + 1st level + 2nd level (default)
  • 3+ - Deeper crawling
maxDepth: 1  // Shallow crawl
timeout
number
default:"60.0"
Total timeout for the entire crawl operation, in seconds. Default: 60 seconds.
timeoutSeconds: 120  // 2 minutes
This is the total crawl timeout, not per-page timeout. The crawl stops when time runs out.
main_content_only
boolean
default:"false"
Extract only the main content, removing navigation, headers, footers, and sidebars.
mainContentOnly: true  // Clean article content
Perfect for: Blog posts, news articles, documentation, and AI training data where you want clean, focused content.
advanced_proxy
boolean
default:"false"
Enable advanced proxy infrastructure for sites with bot protection.
advancedProxy: true  // Bypass bot detection (+$0.004/page)
Additional cost: Adds $0.004 per successful page (total $0.005/page instead of $0.001/page).
Use when sites return 403 errors, CAPTCHA challenges, or have aggressive bot detection.
include_subdomains
boolean
default:"false"
Follow links to subdomains (blog.*, docs.*, api.*, etc.)
includeSubdomains: true  // Crawl across subdomains
include_images
boolean
default:"true"
Include images in markdown output.
includeImages: false  // Text only
include_links
boolean
Include hyperlinks in markdown output.
includeLinks: false  // Plain text

Response Format (SSE Frames)

The API streams Server-Sent Events (SSE) with JSON payloads.
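If you are not using the SDK, the stream can be consumed directly with fetch. A minimal sketch, assuming the snake_case request fields documented above and a placeholder API host (BASE_URL is an assumption you must replace with the actual host for your account):
// Raw SSE consumption sketch (no SDK). BASE_URL is a placeholder -- this page
// does not document the API host, so set it yourself.
const BASE_URL = process.env.LLMLAYER_BASE_URL!;

const res = await fetch(`${BASE_URL}/api/v2/crawl_stream`, {
  method: 'POST',
  headers: {
    Authorization: `Bearer ${process.env.LLMLAYER_API_KEY}`,
    'Content-Type': 'application/json',
  },
  body: JSON.stringify({ url: 'https://docs.example.com', max_pages: 5 }),
});

const reader = res.body!.getReader();
const decoder = new TextDecoder();
let buffer = '';

while (true) {
  const { done, value } = await reader.read();
  if (done) break;
  buffer += decoder.decode(value, { stream: true });

  // SSE events are separated by a blank line; each "data:" line carries a JSON frame.
  const events = buffer.split('\n\n');
  buffer = events.pop() ?? '';
  for (const event of events) {
    for (const line of event.split('\n')) {
      if (line.startsWith('data:')) {
        const frame = JSON.parse(line.slice(5).trim());
        console.log(frame.type, frame.type === 'page' ? frame.page.final_url : '');
      }
    }
  }
}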

Frame Types

Individual page result
{
  "type": "page",
  "page": {
    "requested_url": "https://example.com/page1",
    "final_url": "https://example.com/page1",
    "title": "Page Title",
    "hash_sha256": "abc123...",
    "markdown": "# Content...",
    "success": true,
    "error": null
  }
}
Fields:
  • requested_url - Original URL requested
  • final_url - URL after redirects
  • title - Page title
  • hash_sha256 - Content hash for deduplication
  • markdown - Markdown content
  • success - Boolean indicating success
  • error - Error message if failed
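The quickstart example also reads usage and done frames. Their full payloads are not reproduced on this page, so the sketch below types only the fields that example accesses (billed_count, cost, response_time); treat those two interfaces as assumptions rather than a complete schema:
// Frame shapes as used in the examples on this page. PageFrame fields come from the
// documented payload above; UsageFrame and DoneFrame list only the fields the
// quickstart reads and may be incomplete.
interface PageFrame {
  type: 'page';
  page: {
    requested_url: string;
    final_url: string;
    title: string;
    hash_sha256: string;
    markdown: string | null;
    success: boolean;
    error: string | null;
  };
}

interface UsageFrame {
  type: 'usage';
  billed_count: number; // successful pages billed
  cost: number;         // total cost in USD
}

interface DoneFrame {
  type: 'done';
  response_time: number; // seconds for the whole crawl
}

type CrawlFrame = PageFrame | UsageFrame | DoneFrame;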

Best Practices

💰 Cost Optimization

Use Map first
  • Map the site ($0.002)
  • Filter URLs you need
  • Crawl only those sections
Set appropriate limits
  • Don’t set maxPages: 50 if you only need 10
  • Use maxDepth: 1 for focused crawling
  • Stop when you have enough
Remember: Pay per success
  • Failed pages are free
  • No penalty for timeouts
Use advanced proxy wisely
  • Only for protected sites
  • Costs 5x more per page
  • But significantly improves success rate

⚡ Performance Tips

Use shallow depths
  • maxDepth: 1 is fastest
  • Deep crawls take longer
  • Balance depth vs coverage
Handle partial results
  • Process pages as they stream
  • Don’t wait for completion
  • Save incrementally
Main content only for speed
  • Faster extraction
  • Smaller markdown output
  • Better for AI processing

✨ Better Results

Choose right depth
  • Docs sites: maxDepth: 2-3
  • Blogs: maxDepth: 1-2
  • Large sites: maxDepth: 1
Use main_content_only for:
  • Blog posts
  • News articles
  • Documentation
  • AI training data
Handle redirects
  • Use final_url not requested_url
  • Track URL mappings
  • Avoid duplicates with hash_sha256
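A small sketch of the redirect and deduplication tips above: key each page by hash_sha256 so duplicate content is stored once, and keep final_url as the canonical address.
import { LLMLayerClient } from 'llmlayer';

const client = new LLMLayerClient({ apiKey: process.env.LLMLAYER_API_KEY });

// Deduplicate streamed pages by content hash; store final_url, not requested_url.
const seenHashes = new Set<string>();
const pages: { url: string; markdown: string }[] = [];

for await (const frame of client.crawlStream({ url: 'https://example.com', maxPages: 30 })) {
  if (frame.type !== 'page' || !frame.page.success) continue;

  const { hash_sha256, final_url, markdown } = frame.page;
  if (seenHashes.has(hash_sha256)) continue; // same content already captured
  seenHashes.add(hash_sha256);

  pages.push({ url: final_url, markdown: markdown ?? '' });
}

console.log(`Kept ${pages.length} unique pages`);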

🛡️ Reliability

Track failures
  • Count successful vs failed
  • Log errors for review
  • Retry failed URLs if needed
Use advanced proxy when:
  • Getting 403 errors
  • Site blocks requests
  • Standard crawl fails
  • Need higher success rate
Save progressively
  • Write to disk as frames arrive
  • Don’t hold everything in memory
  • Survive interruptions
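A minimal Node.js sketch of the two patterns above: write each successful page to disk as its frame arrives and collect failures for a later retry pass (the file-naming scheme is just one possibility).
import { mkdir, writeFile } from 'node:fs/promises';
import { LLMLayerClient } from 'llmlayer';

const client = new LLMLayerClient({ apiKey: process.env.LLMLAYER_API_KEY });

// Save pages to disk as they stream in; collect failures for a later retry pass.
await mkdir('crawl-output', { recursive: true });
const failures: { url: string; error: string | null }[] = [];

for await (const frame of client.crawlStream({ url: 'https://docs.example.com', maxPages: 50 })) {
  if (frame.type !== 'page') continue;

  if (frame.page.success) {
    // Derive a filesystem-safe name from the final URL (one possible scheme).
    const name = new URL(frame.page.final_url).pathname.replace(/[^a-z0-9]+/gi, '_') || 'index';
    await writeFile(`crawl-output/${name}.md`, frame.page.markdown ?? '');
  } else {
    failures.push({ url: frame.page.requested_url, error: frame.page.error });
  }
}

console.log(`Failed pages to retry later: ${failures.length}`);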

Important Limitations

Hard limits:
  • Maximum 100 pages per crawl request
  • Default timeout of 60 seconds (configurable)
  • Streaming only (no blocking version)

Frequently Asked Questions

When should I use main_content_only?
Use main_content_only: true when:
  • You want clean article/blog content
  • You’re training AI models (cleaner data)
  • You need to remove sidebars and navigation
  • You’re processing news articles
  • You want focused documentation content
Don’t use when:
  • You need the full page structure
  • Navigation menus are important
  • You want sidebar information
  • Page layout matters
What it removes:
  • Headers and footers
  • Navigation bars
  • Sidebars
  • Advertisement sections
  • Related posts widgets
What it keeps:
  • Main article content
  • Images within content
  • Code blocks
  • Tables
When should I use advanced_proxy?
Use advanced_proxy: true when:
  • Standard crawl returns 403 Forbidden
  • Site shows CAPTCHA challenges
  • E-commerce sites with protection
  • Enterprise websites with strict security
  • Datacenter IPs are blocked
  • You need higher success rates on protected sites
Cost consideration:
  • Standard: $0.001 per successful page
  • With proxy: $0.005 per successful page (5x more)
  • Worth it if standard crawl fails entirely
Do I pay for failed pages?
No! You only pay for successfully scraped pages. Example:
  • Attempted: 50 pages
  • Successful: 42 pages
  • Failed: 8 pages
  • Cost: 42 × $0.001 = $0.042 (without proxy)
  • Cost: 42 × $0.005 = $0.210 (with proxy)
Failed pages (404s, timeouts, errors) are completely free.
Can I combine main_content_only and advanced_proxy?
Yes! You can combine both features:
{
  url: 'https://protected-news-site.com',
  mainContentOnly: true,   // Clean content
  advancedProxy: true      // Bypass protection
}
Cost: $0.005 per successful page (advanced proxy pricing)
Perfect for: Protected news sites, paywalled blogs, enterprise documentation
What difference does main_content_only make to the output?
Without main_content_only:
# Navigation
- Home
- About
- Contact

# Main Article
This is the article content...

# Sidebar
- Related Posts
- Advertisement

# Footer
Copyright 2024
With main_content_only:
# Main Article
This is the article content...
The extracted content is cleaner and more focused on the actual article.
When should I use Crawl vs Scrape vs Map?
Use Crawl when:
  • You need multiple related pages
  • You want automatic link following
  • You’re downloading a section/category
  • You want streaming results
Use Scrape when:
  • You need exactly one page
  • You know the specific URL
  • You need non-streaming response
Use Map + Scrape when:
  • You need specific pages (not sequential)
  • You want full control over which pages
  • Pages are scattered across the site



Need Help?

Found a bug or have a feature request? We’d love to hear from you! Join our Discord or email us at [email protected]