The Crawl feature allows you to scrape all pages of a website with a single request. It automatically discovers URLs, respects your limits and filters, and scrapes each page according to your specifications.

When to Use Crawl

Use Crawl when you need to:
  • Extract content from an entire website or documentation site
  • Build a knowledge base from web content
  • Index website content for search
  • Monitor website changes over time
  • Create datasets from multi-page websites

Basic Usage

1. Start a Crawl

from firecrawl import Firecrawl
from firecrawl.types import ScrapeOptions

app = Firecrawl(api_key="fc-YOUR_API_KEY")

# Crawl a website (automatically waits for completion)
result = app.crawl(
    'https://docs.firecrawl.dev',
    limit=100,
    scrape_options=ScrapeOptions(formats=['markdown', 'html'])
)

for doc in result.data:
    print(doc.metadata.source_url, doc.markdown[:100])
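
If the goal is a knowledge base or dataset, each page usually ends up in its own file. The helper below is a minimal sketch of one way to derive a filesystem-safe filename from a page's source URL. The function name `url_to_filename` and the naming scheme are illustrative assumptions, not part of the Firecrawl SDK.

```python
import re
from urllib.parse import urlparse


def url_to_filename(source_url: str) -> str:
    """Turn a page URL into a filesystem-safe Markdown filename (hypothetical helper)."""
    parsed = urlparse(source_url)
    # Join host and path, then drop leading and trailing slashes
    raw = f"{parsed.netloc}{parsed.path}".strip("/")
    # Replace every run of characters outside [A-Za-z0-9._-] with one underscore
    safe = re.sub(r"[^A-Za-z0-9._-]+", "_", raw)
    # Fall back to "index" when nothing usable is left
    return f"{safe or 'index'}.md"
```

Inside the loop above, the result could be used as the target path, for example by writing `doc.markdown` to `url_to_filename(doc.metadata.source_url)`.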

2. Response

When a crawl is started without waiting for completion, the API's initial response contains a job ID:
{
  "success": true,
  "id": "123-456-789",
  "url": "https://api.firecrawl.dev/v2/crawl/123-456-789"
}

3. Check Status (if using async)

crawl_status = app.get_crawl_status("<crawl_id>")
print(crawl_status)
Status Response:
{
  "status": "completed",
  "total": 50,
  "completed": 50,
  "creditsUsed": 50,
  "data": [
    {
      "markdown": "# Page Title\n\nContent...",
      "metadata": {
        "title": "Page Title",
        "sourceURL": "https://..."
      }
    }
  ]
}
The SDKs' crawl method handles polling automatically: it waits for the crawl to complete before returning results.
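
To illustrate how the fields of the status response fit together, here is a minimal sketch that turns such a payload into a one-line progress summary. It assumes only the keys shown in the sample response (`status`, `total`, `completed`, `creditsUsed`); the function name `summarize_status` is hypothetical.

```python
def summarize_status(payload: dict) -> str:
    """Build a one-line progress summary from a crawl status payload."""
    total = payload.get("total", 0)
    completed = payload.get("completed", 0)
    # Guard against division by zero before any page has been discovered
    percent = 100 * completed / total if total else 0.0
    status = payload.get("status", "unknown")
    credits = payload.get("creditsUsed", 0)
    return f"{status}: {completed}/{total} pages ({percent:.0f}%), {credits} credits"


sample = {"status": "completed", "total": 50, "completed": 50, "creditsUsed": 50}
print(summarize_status(sample))
```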

Asynchronous Crawling

For long-running crawls, start the job asynchronously and poll for status:
from firecrawl import Firecrawl
from firecrawl.types import ScrapeOptions

app = Firecrawl(api_key="fc-YOUR_API_KEY")

# Start crawl asynchronously
crawl_job = app.start_crawl(
    'https://docs.firecrawl.dev',
    limit=100,
    scrape_options=ScrapeOptions(formats=['markdown', 'html'])
)

print(f"Crawl started with ID: {crawl_job.id}")

# Check status later
status = app.get_crawl_status(crawl_job.id)
print(f"Status: {status.status}")
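
"Check status later" usually means a polling loop. The sketch below shows one way to write it, with the status lookup passed in as a callable so the loop does not depend on any particular SDK call. The helper name `wait_for_crawl`, the interval and timeout defaults, and the terminal state names other than `completed` are assumptions for illustration.

```python
import time
from typing import Callable


def wait_for_crawl(
    get_status: Callable[[], str],
    poll_interval: float = 2.0,
    timeout: float = 600.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Call get_status until it returns a terminal state or the timeout passes."""
    # "completed" appears in the sample response; the other names are assumed
    terminal_states = {"completed", "failed", "cancelled"}
    deadline = time.monotonic() + timeout
    while True:
        status = get_status()
        if status in terminal_states:
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError(f"crawl still '{status}' after {timeout} seconds")
        # Pause between checks so the API is not queried in a tight loop
        sleep(poll_interval)
```

With the job started above, a call could look like `wait_for_crawl(lambda: app.get_crawl_status(crawl_job.id).status)`.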

Crawl Options

Control Crawl Depth

result = app.crawl(
    'https://docs.firecrawl.dev',
    max_depth=2,  # Max URL depth
    limit=100
)
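
To get a feel for what a depth limit might include, it can help to compute URL depth locally. The sketch below counts non-empty path segments, which is one common definition of URL depth; whether Firecrawl measures depth in exactly this way is an assumption here, and `url_depth` is a hypothetical helper.

```python
from urllib.parse import urlparse


def url_depth(url: str) -> int:
    """Count the non-empty path segments of a URL (assumed depth definition)."""
    path = urlparse(url).path
    return len([segment for segment in path.split("/") if segment])


for candidate in (
    "https://docs.firecrawl.dev",
    "https://docs.firecrawl.dev/features",
    "https://docs.firecrawl.dev/features/crawl/",
):
    print(url_depth(candidate), candidate)
```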

Include/Exclude Paths

Use regex patterns to filter URLs:
result = app.crawl(
    'https://firecrawl.dev',
    include_paths=['blog/.*'],  # Only crawl blog pages
    exclude_paths=['blog/.*-draft'],  # Exclude drafts
    limit=50
)
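
Before spending credits, the patterns can be tried against a few sample URLs locally. This is a minimal sketch using Python's `re` module. It assumes patterns are searched within the URL path (without the leading slash) and that exclusions take precedence over inclusions; the service's exact matching rules may differ, and `path_allowed` is a hypothetical helper.

```python
import re
from urllib.parse import urlparse


def path_allowed(url: str, include_paths: list, exclude_paths: list) -> bool:
    """Return True if the URL path passes the include/exclude regex filters."""
    path = urlparse(url).path.lstrip("/")
    # Assumption: any matching exclude pattern rejects the URL outright
    if any(re.search(pattern, path) for pattern in exclude_paths):
        return False
    # Assumption: when include patterns are given, at least one must match
    if include_paths and not any(re.search(pattern, path) for pattern in include_paths):
        return False
    return True


include, exclude = ["blog/.*"], ["blog/.*-draft"]
for url in (
    "https://firecrawl.dev/blog/launch-week",
    "https://firecrawl.dev/blog/new-post-draft",
    "https://firecrawl.dev/pricing",
):
    print(path_allowed(url, include, exclude), url)
```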

Backward and External Links

result = app.crawl(
    'https://firecrawl.dev',
    allow_backward_links=True,  # Crawl sibling/parent URLs
    allow_external_links=True,  # Follow external links
    limit=100
)
The snake_case arguments above correspond to these API options:
  • allowBackwardLinks: false - Crawls only deeper (child) URLs
  • allowBackwardLinks: true - Crawls any internal link, including siblings and parents
  • allowExternalLinks: true - Follows links to external websites
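
The three categories in the list above (child, sibling or parent, external) can be made concrete with a small classifier. This is a sketch under simplifying assumptions: hosts are compared exactly, so a different subdomain counts as external, and a link is a child when its path sits under the start URL's path. The function name `classify_link` is hypothetical.

```python
from urllib.parse import urlparse


def classify_link(start_url: str, link: str) -> str:
    """Classify a link as 'child', 'backward', or 'external' relative to start_url."""
    start, target = urlparse(start_url), urlparse(link)
    # Assumption: any host other than the start host is external
    if target.netloc != start.netloc:
        return "external"
    # Normalize both paths to end with "/" so "/docs" does not match "/docs-old"
    base = start.path.rstrip("/") + "/"
    if (target.path.rstrip("/") + "/").startswith(base):
        return "child"
    # Same host but outside the start path: a sibling or parent page
    return "backward"


start = "https://firecrawl.dev/docs/features"
for link in (
    "https://firecrawl.dev/docs/features/crawl",
    "https://firecrawl.dev/docs/pricing",
    "https://example.com/page",
):
    print(classify_link(start, link), link)
```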

Ignore Sitemap

result = app.crawl(
    'https://firecrawl.dev',
    ignore_sitemap=True,  # Don't use sitemap.xml
    limit=50
)

Real-Time Updates with WebSockets

Get updates as pages are crawled:
import asyncio

import nest_asyncio

# Let asyncio.run work inside an already-running loop (e.g. in a notebook)
nest_asyncio.apply()

# Define event handlers
def on_document(detail):
    print("DOC", detail)

def on_error(detail):
    print("ERR", detail['error'])

def on_done(detail):
    print("DONE", detail['status'])

# Start crawl with watcher (`app` is the client created in Basic Usage)
async def start_crawl_and_watch():
    watcher = app.crawl_url_and_watch(
        'firecrawl.dev',
        exclude_paths=['blog/.*'],  # Regex: skip everything under blog/
        limit=5
    )

    # Add event listeners
    watcher.add_event_listener("document", on_document)
    watcher.add_event_listener("error", on_error)
    watcher.add_event_listener("done", on_done)

    # Start the watcher
    await watcher.connect()

# Run the event loop; a bare top-level `await` only works in notebooks
asyncio.run(start_crawl_and_watch())
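
Printing is fine for a demo, but handlers often need to keep what they receive. The sketch below collects events into an object instead. It assumes the same `detail` shapes used by the handlers above (`detail['error']`, `detail['status']`); the class name `CrawlCollector` is hypothetical and not part of the SDK.

```python
class CrawlCollector:
    """Accumulate watcher events so they can be inspected after the crawl."""

    def __init__(self):
        self.documents = []
        self.errors = []
        self.final_status = None

    def on_document(self, detail):
        # Keep every document payload as it arrives
        self.documents.append(detail)

    def on_error(self, detail):
        # Store only the error message, mirroring the print-based handler
        self.errors.append(detail["error"])

    def on_done(self, detail):
        self.final_status = detail["status"]


# Simulated events standing in for what a watcher would deliver
collector = CrawlCollector()
collector.on_document({"markdown": "# Page Title"})
collector.on_error({"error": "timeout"})
collector.on_done({"status": "completed"})
print(len(collector.documents), collector.errors, collector.final_status)
```

The bound methods can be registered the same way as plain functions, for example `watcher.add_event_listener("document", collector.on_document)`.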

Cancel a Crawl

# crawl_id is the job ID returned when the crawl was started (e.g. crawl_job.id)
cancel_result = app.cancel_crawl(crawl_id)
print(cancel_result)

Best Practices

  • Use limit to control costs and crawl time
  • Set appropriate maxDepth to avoid crawling too deep
  • Use includePaths and excludePaths to focus on relevant content
  • Enable WebSocket watchers for real-time monitoring of large crawls
  • Set a delay between scrapes to respect website rate limits
  • Start with a small limit to test your configuration before scaling up

Next Steps

  • Learn about Map to discover URLs before crawling
  • Use Batch Scrape for known lists of URLs
  • Try Search to find and scrape search results