What is the Scraper API?

The Scraper API converts any web page into clean, usable formats. Point it at a URL and get back:

Clean Text (Markdown or HTML)

Extract the main content without ads, popups, or navigation clutter.

Visual Captures (Screenshot)

Get a visual snapshot of the entire page as a PNG image.
Perfect for: Content extraction, web archiving, data collection, AI training data, visual testing, and automated documentation.
Multi-format support: Request multiple formats in one API call! Each format costs $0.001.

Pricing (Pay Per Format)

Per-Format Pricing Model

$0.001 per format = $1 for 1,000 formats
Each format you request costs $0.001. If you request markdown + html + screenshot, that's $0.003 total.

Advanced Proxy Pricing

Advanced Proxy (Optional)

Additional $0.004 per request. Use advanced proxy for:
  • Sites with aggressive bot detection
  • Sites that block standard requests
  • Enterprise websites with strict security
  • E-commerce sites with protection
One-time fee per request: The advanced proxy adds $0.004 per request, regardless of how many formats you request.

Pricing Examples

Single format:
Request: formats = ["markdown"]
Cost: 1 format × $0.001 = $0.001 ✅
Multiple formats:
Request: formats = ["markdown", "html", "screenshot"]
Cost: 3 formats × $0.001 = $0.003 ✅
With advanced proxy:
Request: formats = ["markdown", "screenshot"], advanced_proxy = true
Format cost: 2 × $0.001 = $0.002
Proxy cost: $0.004
Total cost: $0.006 ✅
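Because the formula is so simple, you can estimate costs up front. Here's a hypothetical helper (not part of the SDK) that mirrors the pricing rules above:
// Hypothetical helper mirroring the documented pricing rules - not part of the llmlayer SDK
type ScrapeFormat = 'markdown' | 'html' | 'screenshot';

function estimateScrapeCost(formats: ScrapeFormat[], advancedProxy = false): number {
  // Work in $0.001 units (integers) to avoid floating-point drift
  const units = formats.length + (advancedProxy ? 4 : 0);
  return units / 1000;
}

estimateScrapeCost(['markdown']);                        // 0.001
estimateScrapeCost(['markdown', 'html', 'screenshot']);  // 0.003
estimateScrapeCost(['markdown', 'screenshot'], true);    // 0.006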

Before You Start

Authentication

All requests require your API key in the Authorization header:
Authorization: Bearer YOUR_LLMLAYER_API_KEY
Keep your API key secure! Never expose it in client-side code. Always call from your backend.
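The SDK sets this header for you. If you call the REST endpoint directly, the request looks roughly like this (a sketch; the base URL below is a placeholder, not the real host):
// Direct REST call sketch - BASE_URL is a placeholder for the LLMLayer API host
const BASE_URL = 'https://<your-llmlayer-api-host>';

const res = await fetch(`${BASE_URL}/api/v2/scrape`, {
  method: 'POST',
  headers: {
    'Authorization': `Bearer ${process.env.LLMLAYER_API_KEY}`,  // your API key
    'Content-Type': 'application/json'
  },
  body: JSON.stringify({
    url: 'https://example.com/article',
    formats: ['markdown']
  })
});

const data = await res.json();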

Your First Scrape (2-Minute Start)

Let’s scrape a website in under 2 minutes!
import { LLMLayerClient } from 'llmlayer';

// 1. Create a client
const client = new LLMLayerClient({
  apiKey: process.env.LLMLAYER_API_KEY
});

// 2. Scrape a website
const response = await client.scrape({
  url: 'https://example.com/article',
  formats: ['markdown']  // Get clean markdown
});

// 3. Use the content
console.log(response.markdown);
console.log(`\nTitle: ${response.title}`);
console.log(`Cost: $${response.cost}`);  // $0.001
Done! You just scraped a website and got clean markdown. The API removed all ads, navigation, and clutter - leaving only the main content.

Output Formats Explained

The Scraper API supports 3 output formats. You can request one or multiple formats in a single call.

Quick Reference

Format      Returns                      Best For                               Output Field   Cost
markdown    Clean text with formatting   AI processing, content extraction      markdown       $0.001
html        Raw HTML                     Preserving structure, custom parsing   html           $0.001
screenshot  PNG image (base64)           Visual testing, archiving              screenshot     $0.001

Total cost = number of formats × $0.001. Request 2 formats? Pay $0.002. Request all 3? Pay $0.003.

Markdown Format (Clean Text)

Get clean, readable text without ads, popups, or navigation. Best for: Content extraction, AI training data, reading apps, RSS feeds

Basic Example

const response = await client.scrape({
  url: 'https://techcrunch.com/category/startups/',
  formats: ['markdown'],
  includeImages: true,   // Keep images
  includeLinks: true     // Keep hyperlinks
});

console.log(response.markdown);
console.log(`Cost: $${response.cost}`);  // $0.001
// Output: Clean markdown with the article content
// Text only - no images, no links
const textOnly = await client.scrape({
  url: 'https://example.com',
  formats: ['markdown'],
  includeImages: false,  // Remove all images
  includeLinks: false    // Remove all links
});

console.log(textOnly.markdown);
// Output: Pure text content
Markdown output is always clean:
  • Removes ads, popups, cookie banners
  • Removes navigation menus and sidebars
  • Extracts only the main content
  • Preserves formatting (headers, lists, code blocks)

Clean Content with Main Content Only

Extract only the main article/content without navigation, headers, or footers.
// Get clean content without navigation elements
const response = await client.scrape({
  url: 'https://example.com/article',
  formats: ['markdown'],
  mainContentOnly: true,  // Extract only main content
});

console.log(response.markdown);
console.log(`Cost: $${response.cost}`);  // $0.001
// Content without header, footer, sidebar, navigation
Perfect for:
  • Blog posts (without sidebar clutter)
  • News articles (just the story)
  • Documentation (pure content)
  • Research papers (main text only)
  • AI training data (cleaner input)
What gets removed:
  • ❌ Navigation bars
  • ❌ Sidebars
  • ❌ Headers and footers
  • ❌ Advertisement sections
  • ❌ Related posts widgets
What stays:
  • ✅ Main article content
  • ✅ Embedded images in content
  • ✅ Code blocks and tables

HTML Format (Raw HTML)

Get the complete HTML structure of the page. Best for: Custom parsing, preserving exact structure, web scraping frameworks

Example

const response = await client.scrape({
  url: 'https://example.com',
  formats: ['html']
});

console.log(response.html);
console.log(`Cost: $${response.cost}`);  // $0.001
// Output: Full HTML content

// Parse it yourself if needed
const cheerio = require('cheerio');
const $ = cheerio.load(response.html);
const title = $('h1').first().text();
console.log(`Main heading: ${title}`);
HTML format returns everything - including ads, scripts, and navigation. Use markdown for clean content extraction.

Screenshot Format (PNG Image)

Capture a visual snapshot of the page as a PNG image. Best for: Visual testing, documentation, archiving how a page looks, change detection

Example

import fs from 'fs';

const response = await client.scrape({
  url: 'https://example.com',
  formats: ['screenshot']
});

// The screenshot is base64-encoded
console.log('Screenshot captured!');
console.log(`Cost: $${response.cost}`);  // $0.001

// Save to file
const buffer = Buffer.from(response.screenshot, 'base64');
fs.writeFileSync('page-screenshot.png', buffer);

console.log('Saved to page-screenshot.png');
Screenshot details:
  • Full-page screenshot (not just viewport)
  • PNG format
  • Base64 encoded in the response
  • Typical size: 100KB - 2MB depending on page
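Besides saving to disk, the base64 payload can be embedded directly, for example in server-rendered HTML via a data URL (a small sketch using the response from above):
// Embed the screenshot without writing a file
const dataUrl = `data:image/png;base64,${response.screenshot}`;
const imgTag = `<img src="${dataUrl}" alt="Screenshot of ${response.url}" />`;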

Advanced Proxy for Protected Sites

Use advanced proxy infrastructure for sites with strict bot protection.
// Scrape sites with bot protection
const response = await client.scrape({
  url: 'https://protected-site.com',
  formats: ['markdown'],
  advancedProxy: true,  // Enable advanced proxy (+$0.004/request)
});

console.log(`✅ Successfully scraped: ${response.title}`);
console.log(`💰 Total cost: $${response.cost}`);
// Cost includes base ($0.001) + proxy ($0.004) = $0.005
Additional cost: Advanced proxy adds $0.004 per request (not per format).
When to use advanced proxy:
  • Site returns 403 Forbidden
  • Getting CAPTCHA challenges
  • High-security enterprise sites
  • E-commerce platforms
  • Sites that block datacenter IPs
  • After standard scrape fails
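A common pattern, sketched below with the client from the Quick Start, is to try a cheap standard scrape first and fall back to the advanced proxy only when it fails, so you pay the extra $0.004 only when it's actually needed:
// Sketch: standard scrape first, advanced proxy as fallback
async function scrapeWithFallback(url: string) {
  try {
    return await client.scrape({ url, formats: ['markdown'] });  // $0.001
  } catch (error) {
    console.log('Standard scrape failed, retrying with advanced proxy...');
    return await client.scrape({
      url,
      formats: ['markdown'],
      advancedProxy: true  // +$0.004, total $0.005
    });
  }
}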

Multi-Format Scraping

Request multiple formats in one API call!
import fs from 'fs';

// Get all formats at once!
const response = await client.scrape({
  url: 'https://example.com/article',
  formats: ['markdown', 'html', 'screenshot'],
  includeImages: true,
  includeLinks: true
});

// Save markdown
fs.writeFileSync('article.md', response.markdown);

// Save HTML
fs.writeFileSync('article.html', response.html);

// Save screenshot
const screenshotBuffer = Buffer.from(response.screenshot, 'base64');
fs.writeFileSync('article.png', screenshotBuffer);

console.log('✅ Saved all formats!');
console.log(`📊 Title: ${response.title}`);
console.log(`💰 Total cost: $${response.cost}`);  // $0.003 (3 formats)
Cost calculation: 3 formats × $0.001 = $0.003 total

Combine All Features

Get clean content from protected sites with all formats.
// Best of everything: clean content from protected sites, all formats
const response = await client.scrape({
  url: 'https://example.com/article',
  formats: ['markdown', 'html', 'screenshot'],
  mainContentOnly: true,    // Clean content
  advancedProxy: true,      // Better success rate
});

console.log(`✅ ${response.title}`);
console.log(`   Clean article: ${response.markdown?.length || 0} chars`);
console.log(`💰 Total cost: $${response.cost}`);
// Cost: (3 formats × $0.001) + $0.004 proxy = $0.007

Real-World Examples

Example 1: Content Aggregator

Build a news aggregator that saves articles in multiple formats.
import fs from 'fs';

async function archiveArticle(url: string, title: string) {
  console.log(`📥 Archiving: ${title}`);

  const response = await client.scrape({
    url,
    formats: ['markdown', 'screenshot'],  // Text + visual
    mainContentOnly: true  // Clean content
  });

  // Create safe filename
  const filename = title
    .toLowerCase()
    .replace(/[^a-z0-9]+/g, '-')
    .replace(/^-|-$/g, '');

  // Save markdown for reading/searching
  fs.writeFileSync(`articles/${filename}.md`, response.markdown);

  // Save screenshot for visual archive
  const screenshotBuffer = Buffer.from(response.screenshot, 'base64');
  fs.writeFileSync(`articles/${filename}.png`, screenshotBuffer);

  console.log(`✅ Saved to articles/${filename}.*`);
  console.log(`   Cost: $${response.cost}\n`);  // $0.002 (2 formats)

  return response;
}

// Archive multiple articles
const articles = [
  { url: 'https://...', title: 'AI Breakthrough' },
  { url: 'https://...', title: 'Climate Report' },
  { url: 'https://...', title: 'Tech News' }
];

for (const article of articles) {
  await archiveArticle(article.url, article.title);
  await new Promise(resolve => setTimeout(resolve, 100)); // Rate limiting
}

console.log('📚 Archive complete!');

Example 2: Visual Testing Tool

Monitor website changes by comparing screenshots.
import fs from 'fs';
import crypto from 'crypto';

async function checkForChanges(url: string, previousHash: string | null) {
  // Get screenshot
  const response = await client.scrape({
    url,
    formats: ['screenshot']
  });

  // Calculate hash of screenshot
  const screenshotBuffer = Buffer.from(response.screenshot, 'base64');
  const currentHash = crypto
    .createHash('sha256')
    .update(screenshotBuffer)
    .digest('hex');

  // Compare with previous
  if (previousHash === null) {
    console.log('🆕 First capture - no previous screenshot');
    fs.writeFileSync(`screenshots/${currentHash}.png`, screenshotBuffer);
    return { changed: false, hash: currentHash };
  }

  if (currentHash !== previousHash) {
    console.log('⚠️  CHANGE DETECTED!');
    fs.writeFileSync(`screenshots/${currentHash}.png`, screenshotBuffer);
    return { changed: true, hash: currentHash };
  }

  console.log('✅ No changes detected');
  return { changed: false, hash: currentHash };
}

// Monitor a website
let lastHash: string | null = null;

// Check every hour
setInterval(async () => {
  const result = await checkForChanges('https://example.com/pricing', lastHash);
  lastHash = result.hash;

  if (result.changed) {
    // Send alert (email, Slack, etc.)
    console.log('🚨 Send alert: Pricing page changed!');
  }
}, 3600000); // 1 hour

Example 3: Protected Site Scraper

Scrape content from protected e-commerce sites.
async function scrapeProtectedProduct(url: string) {
  console.log(`🛒 Scraping product from ${url}\n`);

  try {
    const response = await client.scrape({
      url,
      formats: ['markdown', 'html'],
      advancedProxy: true,      // Bypass bot protection
      mainContentOnly: true,    // Clean product descriptions
    });

    console.log(`✅ ${response.title}`);
    console.log(`   Content: ${response.markdown?.length || 0} chars`);
    console.log(`💰 Cost: $${response.cost}`);
    // Cost: (2 formats × $0.001) + $0.004 proxy = $0.006

    return response;

  } catch (error: any) {
    console.error(`❌ Failed: ${error.message}`);
    return null;
  }
}

await scrapeProtectedProduct('https://shop.example.com/product/123');

Request Parameters (Complete Reference)

Endpoint: POST /api/v2/scrape

Required Parameters

url
string
required
The URL to scrape. Must be a valid HTTP(S) URL. Examples:
  • ✅ https://example.com/article
  • ✅ https://blog.com/post?id=123
  • ❌ example.com (missing protocol)
  • ❌ ftp://example.com (unsupported protocol)
formats
array
required
List of output formats to generate. Can request one or multiple.
Options: "markdown", "html", "screenshot"
Examples:
formats: ['markdown']                    // Just markdown ($0.001)
formats: ['markdown', 'html']            // Two formats ($0.002)
formats: ['markdown', 'html', 'screenshot']  // All formats ($0.003)
Cost = number of formats × $0.001

Optional Parameters

main_content_only
boolean
default:"false"
Extract only the main content, removing navigation, headers, footers, and sidebars.
mainContentOnly: true  // Clean article content
Perfect for: Blog posts, news articles, documentation, and AI training data where you want clean, focused content.
advanced_proxy
boolean
default: false
Enable advanced proxy infrastructure for sites with bot protection.
advancedProxy: true  // Bypass bot detection (+$0.004/request)
Additional cost: Adds $0.004 per request (not per format).
Use when sites return 403 errors, CAPTCHA challenges, or have aggressive bot detection.
include_images
boolean
default:"true"
Include images in markdown output. Only affects markdown format.
  • true - Keep image links in markdown
  • false - Remove all images, text only
includeImages: false  // Text-only markdown
include_links
boolean
default: true
Include hyperlinks in markdown output. Only affects markdown format.
  • true - Keep hyperlinks
  • false - Remove all links, plain text
includeLinks: false  // No hyperlinks

Response Format

Response Structure

{
  "markdown": "# Clean content...",
  "html": "<html>...</html>",
  "screenshot": "iVBORw0KGgo...",  // base64
  "url": "https://example.com/final-url",
  "title": "Page Title",
  "statusCode": 200,
  "cost": 0.002,
  "metadata": {
    "description": "Page description",
    "author": "Author name"
  }
}

Response Fields

markdown
string
Clean markdown content (when "markdown" in formats)
# Article Title

This is the clean content without ads...
html
string
Raw HTML content (when "html" in formats)
<!DOCTYPE html>
<html>...</html>
screenshot
string
Base64-encoded PNG image (when "screenshot" in formats). Decode to save:
Buffer.from(response.screenshot, 'base64')
url
string
Final URL after following redirects
"https://example.com/final-url"
title
string
Page title extracted from metadata
"Article Title - Example Blog"
statusCode
integer
HTTP status code (200 for success)
200
cost
number
Cost in USD (number of formats × $0.001, plus $0.004 if using advanced proxy)
0.002  // 2 formats
metadata
object
Additional page metadata (when available)
{
  "description": "Page description",
  "author": "Author name",
  "keywords": ["keyword1", "keyword2"]
}

Error Handling

Error Format

All errors use this structure:
{
  "detail": {
    "error_type": "scraping_error",
    "error_code": "url_scrape_failed",
    "message": "Failed to scrape content from the provided URL",
    "details": {
      "url": "https://example.com",
      "error": "Connection timeout"
    }
  }
}
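If you call the REST endpoint directly, you can inspect this structure from the raw response (a sketch; res is the fetch response from the authentication example above). The SDK raises typed errors instead - see Robust Error Handling below.
// Reading the documented error structure from a raw HTTP response
if (!res.ok) {
  const body = await res.json();
  const { error_type, error_code, message } = body.detail;
  console.error(`${error_type} / ${error_code}: ${message}`);
}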

Common Errors

Missing or invalid API key
{
  "error_code": "missing_llmlayer_api_key",
  "message": "Provide LLMLayer API key via 'Authorization: Bearer <token>'"
}
Fix: Add your API key to the Authorization header.
Invalid or malformed URL
{
  "error_code": "invalid_url",
  "message": "The provided URL is not valid"
}
Fix: Ensure URL includes protocol (https://) and is properly formatted.
Failed to scrape the website
{
  "error_code": "url_scrape_failed",
  "message": "Failed to scrape content from the provided URL",
  "details": {
    "url": "https://example.com",
    "status_code": 500
  }
}
Common causes:
  • Website is down
  • Page requires authentication
  • JavaScript-heavy site didn’t render
  • Connection timeout
Fix: Retry the request. If it persists, try enabling advancedProxy: true.
Request took too long
{
  "error_code": "scrape_timeout",
  "message": "Scraping exceeded the timeout limit"
}
Fix: The page took too long to load. Retry the request; the website may simply be slow.

Robust Error Handling

import {
  LLMLayerClient,
  AuthenticationError,
  InvalidRequest,
  InternalServerError
} from 'llmlayer';

const client = new LLMLayerClient({
  apiKey: process.env.LLMLAYER_API_KEY
});

async function robustScrape(url: string, maxRetries = 3) {
  for (let attempt = 0; attempt < maxRetries; attempt++) {
    try {
      return await client.scrape({
        url,
        formats: ['markdown']
      });

    } catch (error) {
      // Don't retry authentication errors
      if (error instanceof AuthenticationError) {
        console.error('❌ Fix your API key');
        throw error;
      }

      // Don't retry invalid URLs
      if (error instanceof InvalidRequest) {
        console.error('❌ Invalid URL:', url);
        throw error;
      }

      // Retry server errors
      if (error instanceof InternalServerError) {
        const waitTime = Math.pow(2, attempt) * 1000;
        console.log(`⏳ Scraping failed. Waiting ${waitTime}ms...`);
        await new Promise(resolve => setTimeout(resolve, waitTime));

        // Last attempt
        if (attempt === maxRetries - 1) {
          console.error('❌ Max retries exceeded');
          throw error;
        }
        continue;
      }

      throw error;
    }
  }
}

// Usage
try {
  const response = await robustScrape('https://example.com');
  console.log(`✅ Scraped: ${response.title}`);
  console.log(response.markdown);
} catch (error) {
  console.error('Scraping failed:', error);
}

Best Practices

💰 Cost Optimization

Request only what you need
  • Need just text? Request only markdown
  • Need visual verification? Add screenshot
  • Each format costs $0.001
Cache results
  • Web pages don’t change every second
  • Cache for hours or days depending on content
  • Save money on re-scraping (see the cache sketch after this list)
Use advanced proxy wisely
  • Only for protected sites
  • Costs $0.004 extra per request
  • But significantly improves success rate
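As referenced above, here is a minimal in-memory caching sketch: a Map keyed by URL with a TTL. A production system would more likely use Redis or a database:
// Minimal cache sketch - skip the API call if we scraped this URL recently
const cache = new Map<string, { response: any; fetchedAt: number }>();
const TTL_MS = 6 * 60 * 60 * 1000;  // 6 hours - tune per content type

async function cachedScrape(url: string) {
  const hit = cache.get(url);
  if (hit && Date.now() - hit.fetchedAt < TTL_MS) {
    return hit.response;  // cache hit - no API call, no cost
  }
  const response = await client.scrape({ url, formats: ['markdown'] });
  cache.set(url, { response, fetchedAt: Date.now() });
  return response;
}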

⚡ Performance Tips

Choose the right format
  • Markdown: Fastest, smallest
  • HTML: Fast, larger
  • Screenshot: Slower, largest
Optimize markdown options
  • includeImages: false = smaller, faster
  • includeLinks: false = cleaner text
Use main_content_only
  • Faster extraction
  • Smaller markdown output
  • Better for AI processing
Parallel processing
  • Scrape multiple URLs simultaneously
  • Use Promise.all() or asyncio.gather()
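For example, a batch of URLs can be scraped concurrently; Promise.allSettled keeps one failure from rejecting the whole batch:
// Scrape several URLs at once
const urls = [
  'https://example.com/a',
  'https://example.com/b',
  'https://example.com/c'
];

const results = await Promise.allSettled(
  urls.map(url => client.scrape({ url, formats: ['markdown'] }))
);

results.forEach((result, i) => {
  if (result.status === 'fulfilled') {
    console.log(`✅ ${urls[i]}: ${result.value.title}`);
  } else {
    console.error(`❌ ${urls[i]}: ${result.reason}`);
  }
});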

🛡️ Reliability

Always handle errors
  • Some sites block scrapers
  • Some pages require auth
  • Network issues happen
Use advanced proxy when:
  • Getting 403 errors
  • Site blocks requests
  • Standard scrape fails
  • Need higher success rate
Implement retries
  • Exponential backoff for failures
  • Don’t retry bad URLs
  • Max 3 retries recommended
Validate URLs first
  • Check protocol (https://)
  • Ensure proper formatting
  • Handle user input carefully
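A quick pass with the built-in URL constructor catches both problems before you spend an API call:
// Reject malformed URLs and unsupported protocols up front
function isScrapableUrl(input: string): boolean {
  try {
    const parsed = new URL(input);
    return parsed.protocol === 'https:' || parsed.protocol === 'http:';
  } catch {
    return false;  // not a parseable URL at all
  }
}

isScrapableUrl('https://example.com/article');  // true
isScrapableUrl('example.com');                  // false (missing protocol)
isScrapableUrl('ftp://example.com');            // false (unsupported protocol)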

Quick Tips

Starting out? Use this config for most scrapes:
{
  formats: ['markdown'],
  includeImages: true,
  includeLinks: true
}
Cost: $0.001
Need text only? Remove images and links:
{
  formats: ['markdown'],
  includeImages: false,
  includeLinks: false
}
Cost: $0.001
Building an archive? Get all formats:
{
  formats: ['markdown', 'html', 'screenshot']
}
Cost: $0.003 (3 formats × $0.001)
Monitoring changes? Use screenshots:
{
  formats: ['screenshot']
}
Cost: $0.001 - Compare hashes to detect changes
Highly protected site? Enable advanced proxy:
{
  formats: ['markdown'],
  advancedProxy: true
}
Cost: $0.005 ($0.001 + $0.004 proxy)

Frequently Asked Questions

How is the cost calculated?

Simple formula:
  • Base cost = number of formats × $0.001
  • If advanced_proxy: true, add $0.004
Examples:
1 format: $0.001
2 formats: $0.002
3 formats: $0.003
3 formats + proxy: $0.007 ($0.003 + $0.004)
The advanced proxy fee ($0.004) is charged once per request, regardless of how many formats you request.

When should I use main_content_only?

Use main_content_only: true when:
  • You’re training AI models (cleaner data)
  • You need to remove sidebars and navigation
  • You want focused documentation content
Don’t use when:
  • You need the full page structure
  • Navigation menus are important
  • You want sidebar information
  • Page layout matters
What it removes:
  • Headers and footers
  • Navigation bars
  • Sidebars
  • Advertisement sections
  • Related posts widgets
What it keeps:
  • Main article content
  • Images within content
  • Code blocks
  • Tables

When should I use the advanced proxy?

Use advanced_proxy: true when:
  • Standard scrape returns 403 Forbidden
  • Site shows CAPTCHA challenges
  • E-commerce sites with protection
  • Enterprise websites with strict security
  • Datacenter IPs are blocked
  • You need higher success rates on protected sites
Cost consideration:
  • Standard: $0.001 per format
  • With proxy: $0.001 per format + $0.004 proxy fee
Example:
Without proxy (1 format): $0.001 ❌ (fails)
With proxy (1 format):    $0.005 ✅ (succeeds)

Without proxy (3 formats): $0.003 ❌ (fails)
With proxy (3 formats):    $0.007 ✅ (succeeds)
Even though it costs more, you actually get the data instead of a failure!

What's the difference between the markdown and html formats?

Markdown:
  • Clean, readable text
  • Removes ads, navigation, clutter
  • Preserves formatting (headers, lists, etc.)
  • Perfect for content extraction
  • Cost: $0.001
HTML:
  • Complete page structure
  • Includes everything (ads, scripts, etc.)
  • For custom parsing or preservation
  • Larger file size
  • Cost: $0.001

Can I request multiple formats in one call?

Yes! Request as many formats as you want:
formats: ['markdown', 'html', 'screenshot']
Cost: 3 formats × $0.001 = $0.003. Each format adds $0.001 to your total cost.

Why does a screenshot look different from my browser?

Screenshots are taken in a headless browser environment which may:
  • Have different viewport size
  • Not load some JavaScript elements
  • Use default fonts/settings
  • Not include certain animations
The core content should still be captured accurately.

Can I combine main_content_only with advanced_proxy?

Yes! You can combine both features:
{
  formats: ['markdown'],
  mainContentOnly: true,   // Clean content
  advancedProxy: true      // Bypass protection
}
Cost: $0.001 (1 format) + $0.004 (proxy) = $0.005. Perfect for: Protected news sites, paywalled blogs, enterprise documentation.

How large are responses?

Response sizes vary by format:
  • Markdown: Usually 10-200 KB
  • HTML: Usually 50-500 KB
  • Screenshot: Usually 100KB-2MB
Large pages may be truncated or fail to scrape.

What if a website blocks scraping?

Some websites use bot detection that may block scraping. Signs:
  • 403 Forbidden errors
  • Captcha pages
  • Empty/incomplete content
Solutions:
  • Enable advancedProxy: true (+$0.004)
  • Try again later
  • Check if the site has an official API
  • Contact the website owner for permission

Need Help?

Found a bug or have a feature request? We’d love to hear from you! Join our Discord or email us at [email protected]