
What is the Scraper API?

The Scraper API converts any web page into clean, usable formats. Point it at a URL and get back:

Clean Text

Markdown or HTML
Extract the main content without ads, popups, or navigation clutter.

Visual Captures

Screenshot
Get a visual snapshot of the entire page as a PNG image.
Perfect for: Content extraction, web archiving, data collection, AI training data, visual testing, and automated documentation.
Multi-format support: Request multiple formats in one API call! Each format costs $0.001.

Pricing (Pay Per Format)

Per-Format Pricing Model

$0.001 per format = $1 for 1,000 formats
Each format you request costs $0.001. If you request markdown + html + screenshot, that's $0.003 total.

Advanced Proxy Pricing

Advanced Proxy (Optional)

Additional $0.004 per request. Use advanced proxy for:
  • Sites with aggressive bot detection
  • Sites that block standard requests
  • Enterprise websites with strict security
  • E-commerce sites with protection
One-time fee per request: The advanced proxy adds $0.004 per request, regardless of how many formats you request.

Pricing Examples

Single format:
Request: formats = ["markdown"]
Cost: 1 format × $0.001 = $0.001 ✅
Multiple formats:
Request: formats = ["markdown", "html", "screenshot"]
Cost: 3 formats × $0.001 = $0.003 ✅
With advanced proxy:
Request: formats = ["markdown", "screenshot"], advanced_proxy = true
Format cost: 2 × $0.001 = $0.002
Proxy cost: $0.004
Total cost: $0.006 ✅
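The examples above reduce to a one-line formula. Here is a small helper (illustrative only, not part of the SDK) that estimates the cost of a request before you send it:

```typescript
type ScrapeFormat = 'markdown' | 'html' | 'screenshot';

const FORMAT_COST = 0.001;  // $ per requested format
const PROXY_COST = 0.004;   // flat $ per request when advanced proxy is enabled

// Estimate the cost of a scrape request before sending it.
function estimateCost(formats: ScrapeFormat[], advancedProxy = false): number {
  const base = formats.length * FORMAT_COST;
  const proxy = advancedProxy ? PROXY_COST : 0;
  // Round to 3 decimals to avoid floating-point noise like 0.006000000000000001
  return Math.round((base + proxy) * 1000) / 1000;
}

console.log(estimateCost(['markdown']));                        // 0.001
console.log(estimateCost(['markdown', 'html', 'screenshot']));  // 0.003
console.log(estimateCost(['markdown', 'screenshot'], true));    // 0.006
```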

Before You Start

Authentication

All requests require your API key in the Authorization header:
Authorization: Bearer YOUR_LLMLAYER_API_KEY
Keep your API key secure! Never expose it in client-side code. Always call from your backend.
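If you are not using the SDK, the same header works with plain `fetch`. The sketch below builds the request for `POST /api/v2/scrape`; the base URL is an assumption (check your dashboard for the exact host):

```typescript
// Hypothetical base URL - confirm the host in your dashboard.
const endpoint = 'https://api.llmlayer.ai/api/v2/scrape';

// Build the request options for a scrape call without the SDK.
function buildScrapeRequest(apiKey: string, url: string, formats: string[]) {
  return {
    method: 'POST',
    headers: {
      Authorization: `Bearer ${apiKey}`,  // key goes in the header, never in the URL
      'Content-Type': 'application/json',
    },
    body: JSON.stringify({ url, formats }),
  };
}

// Usage (from your backend only):
// const res = await fetch(endpoint, buildScrapeRequest(process.env.LLMLAYER_API_KEY!, 'https://example.com', ['markdown']));
```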

Your First Scrape (2-Minute Start)

Let’s scrape a website in under 2 minutes!
import { LLMLayerClient } from 'llmlayer';

// 1. Create a client
const client = new LLMLayerClient({
  apiKey: process.env.LLMLAYER_API_KEY
});

// 2. Scrape a website
const response = await client.scrape({
  url: 'https://example.com/article',
  formats: ['markdown']  // Get clean markdown
});

// 3. Use the content
console.log(response.markdown);
console.log(`\nTitle: ${response.title}`);
console.log(`Cost: $${response.cost}`);  // $0.001
Done! You just scraped a website and got clean markdown. The API removed all ads, navigation, and clutter - leaving only the main content.

Output Formats Explained

The Scraper API supports 3 output formats. You can request one or multiple formats in a single call.

Quick Reference

Format | Returns | Best For | Output Field | Cost
markdown | Clean text with formatting | AI processing, content extraction | markdown | $0.001
html | Raw HTML | Preserving structure, custom parsing | html | $0.001
screenshot | PNG image (base64) | Visual testing, archiving | screenshot | $0.001
Total cost = number of formats × $0.001. Request 2 formats? Pay $0.002. Request all 3? Pay $0.003.

Markdown Format (Clean Text)

Get clean, readable text without ads, popups, or navigation. Best for: Content extraction, AI training data, reading apps, RSS feeds

Basic Example

const response = await client.scrape({
  url: 'https://techcrunch.com/category/startups/',
  formats: ['markdown'],
  includeImages: true,   // Keep images
  includeLinks: true     // Keep hyperlinks
});

console.log(response.markdown);
console.log(`Cost: $${response.cost}`);  // $0.001
// Output: Clean markdown with the article content
// Text only - no images, no links
const textOnly = await client.scrape({
  url: 'https://example.com',
  formats: ['markdown'],
  includeImages: false,  // Remove all images
  includeLinks: false    // Remove all links
});

console.log(textOnly.markdown);
// Output: Pure text content
Markdown output is always clean:
  • Removes ads, popups, cookie banners
  • Removes navigation menus and sidebars
  • Extracts only the main content
  • Preserves formatting (headers, lists, code blocks)

Clean Content with Main Content Only

Extract only the main article/content without navigation, headers, or footers.
// Get clean content without navigation elements
const response = await client.scrape({
  url: 'https://example.com/article',
  formats: ['markdown'],
  mainContentOnly: true,  // Extract only main content
});

console.log(response.markdown);
console.log(`Cost: $${response.cost}`);  // $0.001
// Content without header, footer, sidebar, navigation
Perfect for:
  • Blog posts (without sidebar clutter)
  • News articles (just the story)
  • Documentation (pure content)
  • Research papers (main text only)
  • AI training data (cleaner input)
What gets removed:
  • ❌ Navigation bars
  • ❌ Sidebars
  • ❌ Headers and footers
  • ❌ Advertisement sections
  • ❌ Related posts widgets
  • ✅ Main article content
  • ✅ Embedded images in content
  • ✅ Code blocks and tables

HTML Format (Raw HTML)

Get the complete HTML structure of the page. Best for: Custom parsing, preserving exact structure, web scraping frameworks

Example

const response = await client.scrape({
  url: 'https://example.com',
  formats: ['html']
});

console.log(response.html);
console.log(`Cost: $${response.cost}`);  // $0.001
// Output: Full HTML content

// Parse it yourself if needed
const cheerio = require('cheerio');
const $ = cheerio.load(response.html);
const title = $('h1').first().text();
console.log(`Main heading: ${title}`);
HTML format returns everything - including ads, scripts, and navigation. Use markdown for clean content extraction.

Screenshot Format (PNG Image)

Capture a visual snapshot of the page as a PNG image. Best for: Visual testing, documentation, archiving how a page looks, change detection

Example

import fs from 'fs';

const response = await client.scrape({
  url: 'https://example.com',
  formats: ['screenshot']
});

// The screenshot is base64-encoded
console.log('Screenshot captured!');
console.log(`Cost: $${response.cost}`);  // $0.001

// Save to file
const buffer = Buffer.from(response.screenshot, 'base64');
fs.writeFileSync('page-screenshot.png', buffer);

console.log('Saved to page-screenshot.png');
Screenshot details:
  • Full-page screenshot (not just viewport)
  • PNG format
  • Base64 encoded in the response
  • Typical size: 100KB - 2MB depending on page

Advanced Proxy for Protected Sites

Use advanced proxy infrastructure for sites with strict bot protection.
// Scrape sites with bot protection
const response = await client.scrape({
  url: 'https://protected-site.com',
  formats: ['markdown'],
  advancedProxy: true,  // Enable advanced proxy (+$0.004/request)
});

console.log(`✅ Successfully scraped: ${response.title}`);
console.log(`💰 Total cost: $${response.cost}`);
// Cost includes base ($0.001) + proxy ($0.004) = $0.005
Additional cost: Advanced proxy adds $0.004 per request (not per format).
When to use advanced proxy:
  • Site returns 403 Forbidden
  • Getting CAPTCHA challenges
  • High-security enterprise sites
  • E-commerce platforms
  • Sites that block datacenter IPs
  • After standard scrape fails

Multi-Format Scraping

Request multiple formats in one API call!
import fs from 'fs';

// Get all formats at once!
const response = await client.scrape({
  url: 'https://example.com/article',
  formats: ['markdown', 'html', 'screenshot'],
  includeImages: true,
  includeLinks: true
});

// Save markdown
fs.writeFileSync('article.md', response.markdown);

// Save HTML
fs.writeFileSync('article.html', response.html);

// Save screenshot
const screenshotBuffer = Buffer.from(response.screenshot, 'base64');
fs.writeFileSync('article.png', screenshotBuffer);

console.log('✅ Saved all formats!');
console.log(`📊 Title: ${response.title}`);
console.log(`💰 Total cost: $${response.cost}`);  // $0.003 (3 formats)
Cost calculation: 3 formats × $0.001 = $0.003 total

Combine All Features

Get clean content from protected sites with all formats.
// Best of everything: clean content from protected sites, all formats
const response = await client.scrape({
  url: 'https://example.com/article',
  formats: ['markdown', 'html', 'screenshot'],
  mainContentOnly: true,    // Clean content
  advancedProxy: true,      // Better success rate
});

console.log(`✅ ${response.title}`);
console.log(`   Clean article: ${response.markdown?.length || 0} chars`);
console.log(`💰 Total cost: $${response.cost}`);
// Cost: (3 formats × $0.001) + $0.004 proxy = $0.007

Real-World Examples

Example 1: Content Aggregator

Build a news aggregator that saves articles in multiple formats.
import fs from 'fs';

async function archiveArticle(url: string, title: string) {
  console.log(`📥 Archiving: ${title}`);

  const response = await client.scrape({
    url,
    formats: ['markdown', 'screenshot'],  // Text + visual
    mainContentOnly: true  // Clean content
  });

  // Create safe filename
  const filename = title
    .toLowerCase()
    .replace(/[^a-z0-9]+/g, '-')
    .replace(/^-|-$/g, '');

  // Save markdown for reading/searching
  fs.writeFileSync(`articles/${filename}.md`, response.markdown);

  // Save screenshot for visual archive
  const screenshotBuffer = Buffer.from(response.screenshot, 'base64');
  fs.writeFileSync(`articles/${filename}.png`, screenshotBuffer);

  console.log(`✅ Saved to articles/${filename}.*`);
  console.log(`   Cost: $${response.cost}\n`);  // $0.002 (2 formats)

  return response;
}

// Archive multiple articles
const articles = [
  { url: 'https://...', title: 'AI Breakthrough' },
  { url: 'https://...', title: 'Climate Report' },
  { url: 'https://...', title: 'Tech News' }
];

for (const article of articles) {
  await archiveArticle(article.url, article.title);
  await new Promise(resolve => setTimeout(resolve, 100)); // Rate limiting
}

console.log('📚 Archive complete!');

Example 2: Visual Testing Tool

Monitor website changes by comparing screenshots.
import fs from 'fs';
import crypto from 'crypto';

async function checkForChanges(url: string, previousHash: string | null) {
  // Get screenshot
  const response = await client.scrape({
    url,
    formats: ['screenshot']
  });

  // Calculate hash of screenshot
  const screenshotBuffer = Buffer.from(response.screenshot, 'base64');
  const currentHash = crypto
    .createHash('sha256')
    .update(screenshotBuffer)
    .digest('hex');

  // Compare with previous
  if (previousHash === null) {
    console.log('🆕 First capture - no previous screenshot');
    fs.writeFileSync(`screenshots/${currentHash}.png`, screenshotBuffer);
    return { changed: false, hash: currentHash };
  }

  if (currentHash !== previousHash) {
    console.log('⚠️  CHANGE DETECTED!');
    fs.writeFileSync(`screenshots/${currentHash}.png`, screenshotBuffer);
    return { changed: true, hash: currentHash };
  }

  console.log('✅ No changes detected');
  return { changed: false, hash: currentHash };
}

// Monitor a website
let lastHash: string | null = null;

// Check every hour
setInterval(async () => {
  const result = await checkForChanges('https://example.com/pricing', lastHash);
  lastHash = result.hash;

  if (result.changed) {
    // Send alert (email, Slack, etc.)
    console.log('🚨 Send alert: Pricing page changed!');
  }
}, 3600000); // 1 hour

Example 3: Protected Site Scraper

Scrape content from protected e-commerce sites.
async function scrapeProtectedProduct(url: string) {
  console.log(`🛒 Scraping product from ${url}\n`);

  try {
    const response = await client.scrape({
      url,
      formats: ['markdown', 'html'],
      advancedProxy: true,      // Bypass bot protection
      mainContentOnly: true,    // Clean product descriptions
    });

    console.log(`✅ ${response.title}`);
    console.log(`   Content: ${response.markdown?.length || 0} chars`);
    console.log(`💰 Cost: $${response.cost}`);
    // Cost: (2 formats × $0.001) + $0.004 proxy = $0.006

    return response;

  } catch (error) {
    console.error(`❌ Failed: ${error.message}`);
    return null;
  }
}

await scrapeProtectedProduct('https://shop.example.com/product/123');

Request Parameters (Complete Reference)

Endpoint: POST /api/v2/scrape

Required Parameters

url
string
required
The URL to scrape. Must be a valid HTTP(S) URL. Examples:
  • ✅ https://example.com/article
  • ✅ https://blog.com/post?id=123
  • ❌ example.com (missing protocol)
  • ❌ ftp://example.com (unsupported protocol)
formats
array
required
List of output formats to generate. You can request one or multiple. Options: "markdown", "html", "screenshot". Examples:
formats: ['markdown']                    // Just markdown ($0.001)
formats: ['markdown', 'html']            // Two formats ($0.002)
formats: ['markdown', 'html', 'screenshot']  // All formats ($0.003)
Cost = number of formats × $0.001

Optional Parameters

main_content_only
boolean
default:"false"
Extract only the main content, removing navigation, headers, footers, and sidebars.
mainContentOnly: true  // Clean article content
Perfect for: Blog posts, news articles, documentation, and AI training data where you want clean, focused content.
advanced_proxy
boolean
default:"false"
Enable advanced proxy infrastructure for sites with bot protection.
advancedProxy: true  // Bypass bot detection (+$0.004/request)
Additional cost: Adds $0.004 per request (not per format).
Use when sites return 403 errors, CAPTCHA challenges, or have aggressive bot detection.
include_images
boolean
default:"true"
Include images in markdown output. Only affects markdown format.
  • true - Keep image links in markdown
  • false - Remove all images, text only
includeImages: false  // Text-only markdown
include_links
boolean
default:"true"
Include hyperlinks in markdown output. Only affects markdown format.
  • true - Keep hyperlinks
  • false - Remove all links, plain text
includeLinks: false  // No hyperlinks

Response Format

Response Structure

{
  "markdown": "# Clean content...",
  "html": "<html>...</html>",
  "screenshot": "iVBORw0KGgo...",  // base64
  "url": "https://example.com/final-url",
  "title": "Page Title",
  "statusCode": 200,
  "cost": 0.002,
  "metadata": {
    "description": "Page description",
    "author": "Author name"
  }
}

Response Fields

markdown
string
Clean markdown content (when "markdown" in formats)
# Article Title

This is the clean content without ads...
html
string
Raw HTML content (when "html" in formats)
<!DOCTYPE html>
<html>...</html>
screenshot
string
Base64-encoded PNG image (when "screenshot" in formats). Decode to save:
Buffer.from(response.screenshot, 'base64')
url
string
Final URL after following redirects
"https://example.com/final-url"
title
string
Page title extracted from metadata
"Article Title - Example Blog"
statusCode
integer
HTTP status code (200 for success)
200
cost
number
Cost in USD (number of formats × $0.001, plus $0.004 if using advanced proxy)
0.002  // 2 formats
metadata
object
Additional page metadata (when available)
{
  "description": "Page description",
  "author": "Author name",
  "keywords": ["keyword1", "keyword2"]
}

Error Handling

Error Format

All errors use this structure:
{
  "detail": {
    "error_type": "scraping_error",
    "error_code": "url_scrape_failed",
    "message": "Failed to scrape content from the provided URL",
    "details": {
      "url": "https://example.com",
      "error": "Connection timeout"
    }
  }
}

Common Errors

Missing or invalid API key
{
  "error_code": "missing_llmlayer_api_key",
  "message": "Provide LLMLayer API key via 'Authorization: Bearer <token>'"
}
Fix: Add your API key to the Authorization header.
Invalid or malformed URL
{
  "error_code": "invalid_url",
  "message": "The provided URL is not valid"
}
Fix: Ensure URL includes protocol (https://) and is properly formatted.
Failed to scrape the website
{
  "error_code": "url_scrape_failed",
  "message": "Failed to scrape content from the provided URL",
  "details": {
    "url": "https://example.com",
    "status_code": 500
  }
}
Common causes:
  • Website is down
  • Page requires authentication
  • JavaScript-heavy site didn’t render
  • Connection timeout
Fix: Retry the request. If it persists, try enabling advancedProxy: true.
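That retry-then-upgrade advice can be wrapped in a small helper: attempt a standard scrape first, and only pay the proxy surcharge when the first attempt actually fails. The helper takes the scrape call as a parameter so the fallback logic stays testable; `client.scrape` is the SDK method used throughout this page.

```typescript
type ScrapeFn = (opts: { url: string; formats: string[]; advancedProxy?: boolean }) => Promise<any>;

// Try a standard scrape first; fall back to the advanced proxy (+$0.004) only on failure.
async function scrapeWithFallback(scrape: ScrapeFn, url: string, formats: string[]) {
  try {
    return await scrape({ url, formats });
  } catch {
    // Standard attempt failed (e.g. 403 or bot detection): retry once with the proxy.
    return await scrape({ url, formats, advancedProxy: true });
  }
}

// Usage with the SDK:
// const response = await scrapeWithFallback((o) => client.scrape(o), 'https://protected-site.com', ['markdown']);
```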
Request took too long
{
  "error_code": "scrape_timeout",
  "message": "Scraping exceeded the timeout limit"
}
Fix: The page took too long to load. Retry the request; the website may be temporarily slow.

Robust Error Handling

import {
  LLMLayerClient,
  AuthenticationError,
  InvalidRequest,
  InternalServerError
} from 'llmlayer';

const client = new LLMLayerClient({
  apiKey: process.env.LLMLAYER_API_KEY
});

async function robustScrape(url: string, maxRetries = 3) {
  for (let attempt = 0; attempt < maxRetries; attempt++) {
    try {
      return await client.scrape({
        url,
        formats: ['markdown']
      });

    } catch (error) {
      // Don't retry authentication errors
      if (error instanceof AuthenticationError) {
        console.error('❌ Fix your API key');
        throw error;
      }

      // Don't retry invalid URLs
      if (error instanceof InvalidRequest) {
        console.error('❌ Invalid URL:', url);
        throw error;
      }

      // Retry server errors
      if (error instanceof InternalServerError) {
        const waitTime = Math.pow(2, attempt) * 1000;
        console.log(`⏳ Scraping failed. Waiting ${waitTime}ms...`);
        await new Promise(resolve => setTimeout(resolve, waitTime));

        // Last attempt
        if (attempt === maxRetries - 1) {
          console.error('❌ Max retries exceeded');
          throw error;
        }
        continue;
      }

      throw error;
    }
  }
}

// Usage
try {
  const response = await robustScrape('https://example.com');
  console.log(`✅ Scraped: ${response.title}`);
  console.log(response.markdown);
} catch (error) {
  console.error('Scraping failed:', error);
}

Best Practices

💰 Cost Optimization

Request only what you need
  • Need just text? Request only markdown
  • Need visual verification? Add screenshot
  • Each format costs $0.001
Cache results
  • Web pages don’t change every second
  • Cache for hours or days depending on content
  • Save money on re-scraping
Use advanced proxy wisely
  • Only for protected sites
  • Costs $0.004 extra per request
  • But significantly improves success rate
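The caching advice can be as simple as an in-memory map with a TTL. Below is a minimal sketch that wraps any async fetcher (such as a call to `client.scrape`); it is not part of the SDK:

```typescript
// Minimal TTL cache around any async fetcher, keyed by URL.
function withCache<T>(fetcher: (url: string) => Promise<T>, ttlMs: number) {
  const cache = new Map<string, { value: T; expires: number }>();
  return async (url: string): Promise<T> => {
    const hit = cache.get(url);
    if (hit && hit.expires > Date.now()) return hit.value;   // fresh hit: no API cost
    const value = await fetcher(url);                         // miss: pay for one scrape
    cache.set(url, { value, expires: Date.now() + ttlMs });
    return value;
  };
}

// Usage: cache scrape results for 6 hours
// const cachedScrape = withCache((url) => client.scrape({ url, formats: ['markdown'] }), 6 * 3600 * 1000);
```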

⚡ Performance Tips

Choose the right format
  • Markdown: Fastest, smallest
  • HTML: Fast, larger
  • Screenshot: Slower, largest
Optimize markdown options
  • includeImages: false = smaller, faster
  • includeLinks: false = cleaner text
Use main_content_only
  • Faster extraction
  • Smaller markdown output
  • Better for AI processing
Parallel processing
  • Scrape multiple URLs simultaneously
  • Use Promise.all() or asyncio.gather()
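In practice, the parallel-processing tip looks like a bounded batch runner built on `Promise.allSettled`, so one blocked site does not sink the whole batch. The scrape call is passed in as a parameter; `client.scrape` from the earlier examples would be the usual choice.

```typescript
// Scrape many URLs with a concurrency cap; failures are reported, not thrown.
async function scrapeMany<T>(
  urls: string[],
  scrapeOne: (url: string) => Promise<T>,
  concurrency = 5
): Promise<Array<{ url: string; result?: T; error?: string }>> {
  const out: Array<{ url: string; result?: T; error?: string }> = [];
  for (let i = 0; i < urls.length; i += concurrency) {
    const batch = urls.slice(i, i + concurrency);
    const settled = await Promise.allSettled(batch.map(scrapeOne));
    settled.forEach((s, j) => {
      out.push(s.status === 'fulfilled'
        ? { url: batch[j], result: s.value }
        : { url: batch[j], error: String(s.reason) });
    });
  }
  return out;
}

// Usage:
// const results = await scrapeMany(urls, (url) => client.scrape({ url, formats: ['markdown'] }));
```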

🛡️ Reliability

Always handle errors
  • Some sites block scrapers
  • Some pages require auth
  • Network issues happen
Use advanced proxy when:
  • Getting 403 errors
  • Site blocks requests
  • Standard scrape fails
  • Need higher success rate
Implement retries
  • Exponential backoff for failures
  • Don’t retry bad URLs
  • Max 3 retries recommended
Validate URLs first
  • Check protocol (https://)
  • Ensure proper formatting
  • Handle user input carefully
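The URL checks above fit in a few lines using the built-in `URL` parser:

```typescript
// Validate user-supplied URLs before spending a scrape credit on them.
function isScrapableUrl(input: string): boolean {
  try {
    const u = new URL(input);
    return u.protocol === 'http:' || u.protocol === 'https:';  // reject ftp:, file:, etc.
  } catch {
    return false;  // not a parseable URL at all (e.g. missing protocol)
  }
}

console.log(isScrapableUrl('https://example.com/article'));  // true
console.log(isScrapableUrl('example.com'));                  // false (missing protocol)
console.log(isScrapableUrl('ftp://example.com'));            // false (unsupported protocol)
```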

Quick Tips

Starting out? Use this config for most scrapes:
{
  formats: ['markdown'],
  includeImages: true,
  includeLinks: true
}
Cost: $0.001
Need text only? Remove images and links:
{
  formats: ['markdown'],
  includeImages: false,
  includeLinks: false
}
Cost: $0.001
Building an archive? Get all formats:
{
  formats: ['markdown', 'html', 'screenshot']
}
Cost: $0.003 (3 formats × $0.001)
Monitoring changes? Use screenshots:
{
  formats: ['screenshot']
}
Cost: $0.001 - Compare hashes to detect changes
Highly protected site? Enable advanced proxy:
{
  formats: ['markdown'],
  advancedProxy: true
}
Cost: $0.005 ($0.001 + $0.004 proxy)

Frequently Asked Questions

How is the cost calculated?
Simple formula:
  • Base cost = number of formats × $0.001
  • If advanced_proxy: true, add $0.004
Examples:
1 format: $0.001
2 formats: $0.002
3 formats: $0.003
3 formats + proxy: $0.007 ($0.003 + $0.004)
The advanced proxy fee ($0.004) is charged once per request, regardless of how many formats you request.
When should I use main_content_only?
Use main_content_only: true when:
  • You’re training AI models (cleaner data)
  • You need to remove sidebars and navigation
  • You want focused documentation content
Don’t use when:
  • You need the full page structure
  • Navigation menus are important
  • You want sidebar information
  • Page layout matters
What it removes:
  • Headers and footers
  • Navigation bars
  • Sidebars
  • Advertisement sections
  • Related posts widgets
What it keeps:
  • Main article content
  • Images within content
  • Code blocks
  • Tables
When should I use the advanced proxy?
Use advanced_proxy: true when:
  • Standard scrape returns 403 Forbidden
  • Site shows CAPTCHA challenges
  • E-commerce sites with protection
  • Enterprise websites with strict security
  • Datacenter IPs are blocked
  • You need higher success rates on protected sites
Cost consideration:
  • Standard: $0.001 per format
  • With proxy: $0.001 per format + $0.004 proxy fee
Example:
Without proxy (1 format): $0.001 ❌ (fails)
With proxy (1 format):    $0.005 ✅ (succeeds)

Without proxy (3 formats): $0.003 ❌ (fails)
With proxy (3 formats):    $0.007 ✅ (succeeds)
Even though it costs more, you actually get the data instead of a failure!
What’s the difference between the markdown and html formats?
Markdown:
  • Clean, readable text
  • Removes ads, navigation, clutter
  • Preserves formatting (headers, lists, etc.)
  • Perfect for content extraction
  • Cost: $0.001
HTML:
  • Complete page structure
  • Includes everything (ads, scripts, etc.)
  • For custom parsing or preservation
  • Larger file size
  • Cost: $0.001
Can I request multiple formats in one call?
Yes! Request as many formats as you want:
formats: ['markdown', 'html', 'screenshot']
Cost: 3 formats × $0.001 = $0.003. Each format adds $0.001 to your total cost.
Why might a screenshot look different from my browser?
Screenshots are taken in a headless browser environment, which may:
  • Have different viewport size
  • Not load some JavaScript elements
  • Use default fonts/settings
  • Not include certain animations
The core content should still be captured accurately.
Can I combine main_content_only with advanced_proxy?
Yes! You can combine both features:
{
  formats: ['markdown'],
  mainContentOnly: true,   // Clean content
  advancedProxy: true      // Bypass protection
}
Cost: $0.001 (1 format) + $0.004 (proxy) = $0.005. Perfect for: protected news sites, paywalled blogs, enterprise documentation.
How large are responses?
Response sizes vary by format:
  • Markdown: Usually 10-200 KB
  • HTML: Usually 50-500 KB
  • Screenshot: Usually 100KB-2MB
Large pages may be truncated or fail to scrape.
What if a website blocks scraping?
Some websites use bot detection that may block scraping. Signs:
  • 403 Forbidden errors
  • Captcha pages
  • Empty/incomplete content
Solutions:
  • Enable advancedProxy: true (+$0.004)
  • Try again later
  • Check if the site has an official API
  • Contact the website owner for permission

Next Steps

Crawl API

Scrape multiple pages automatically

Answer API

Scrape + AI-powered answers in one call

Web Search

Find URLs to scrape with web search

Python SDK

Python client documentation

TypeScript SDK

TypeScript client documentation

Need Help?

Discord Community

Chat with other developers

Email Support

Found a bug or have a feature request? We’d love to hear from you! Join our Discord or email us at support@llmlayer.ai