Docker Setup for Firecrawl
I recently came across Firecrawl while looking for a crawler project. The official documentation is fairly brief, and I ran into quite a few problems in practice, so this post records my setup process and the fixes that worked for me.
Clone the code locally
Just clone the repository from Firecrawl's GitHub to your local machine.
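If you need the exact command, something like the following works (mendableai/firecrawl is the upstream repository path at the time of writing; double-check it on GitHub):
git clone https://github.com/mendableai/firecrawl.git
cd firecrawl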
Download and configure Docker
Following the example in the official docs, create a .env file in the project root.
Docs: https://docs.firecrawl.dev/contributing/self-host
# .env
# ===== Required ENVS ======
NUM_WORKERS_PER_QUEUE=8
PORT=3002
HOST=0.0.0.0
#for self-hosting using docker, use redis://redis:6379. For running locally, use redis://localhost:6379
REDIS_URL=redis://redis:6379
#for self-hosting using docker, use redis://redis:6379. For running locally, use redis://localhost:6379
REDIS_RATE_LIMIT_URL=redis://redis:6379
PLAYWRIGHT_MICROSERVICE_URL=http://playwright-service:3000/html
## To turn on DB authentication, you need to set up supabase.
USE_DB_AUTHENTICATION=false
# ===== Optional ENVS ======
# Supabase Setup (used to support DB authentication, advanced logging, etc.)
SUPABASE_ANON_TOKEN=
SUPABASE_URL=
SUPABASE_SERVICE_TOKEN=
# Other Optionals
# use if you've set up authentication and want to test with a real API key
TEST_API_KEY=
# set if you'd like to test the scraping rate limit
RATE_LIMIT_TEST_API_KEY_SCRAPE=
# set if you'd like to test the crawling rate limit
RATE_LIMIT_TEST_API_KEY_CRAWL=
# set if you'd like to use ScrapingBee to handle JS blocking
SCRAPING_BEE_API_KEY=
# add for LLM-dependent features (image alt generation, etc.)
OPENAI_API_KEY=
BULL_AUTH_KEY=@
# use if you're configuring basic logging with logtail
LOGTAIL_KEY=
# set if you have a llamaparse key you'd like to use to parse pdfs
LLAMAPARSE_API_KEY=
# set if you'd like to send slack server health status messages
SLACK_WEBHOOK_URL=
# set if you'd like to send posthog events like job logs
POSTHOG_API_KEY=
# set if you'd like to send posthog events like job logs
POSTHOG_HOST=
# set if you'd like to use the fire engine closed beta
FIRE_ENGINE_BETA_URL=
# Proxy Settings for Playwright (alternatively, you can use a proxy service like Oxylabs, which rotates IPs for you on every request)
PROXY_SERVER=
PROXY_USERNAME=
PROXY_PASSWORD=
# set if you'd like to block media requests to save proxy bandwidth
BLOCK_MEDIA=
# Set this to the URL of your webhook when using the self-hosted version of FireCrawl
SELF_HOSTED_WEBHOOK_URL=
# Resend API Key for transactional emails
RESEND_API_KEY=
# LOGGING_LEVEL determines the verbosity of logs that the system will output.
# Available levels are:
# NONE - No logs will be output.
# ERROR - For logging error messages that indicate a failure in a specific operation.
# WARN - For logging potentially harmful situations that are not necessarily errors.
# INFO - For logging informational messages that highlight the progress of the application.
# DEBUG - For logging detailed information on the flow through the system, primarily used for debugging.
# TRACE - For logging more detailed information than the DEBUG level.
# Set LOGGING_LEVEL to one of the above options to control logging output.
LOGGING_LEVEL=INFO
If you want to use Supabase, set USE_DB_AUTHENTICATION to true; if you only need the crawler, false is fine.
If you want the AI features, such as the official Extraction endpoint or page-interaction actions, you also need to set OPENAI_API_KEY. The remaining parameters can be filled in depending on which features you need; figuring them out takes some time.
Build and run the Docker containers
This was the most troublesome part; it took me about a day to sort out.
- Build the containers
cd into the Firecrawl folder, then run:
docker compose build
At this point you may hit a permission denied error.
I found two possible fixes while researching:
Option 1 (did not work for me):
# log out first
docker logout
# then log back in
docker login
Option 2 (worked):
Run the build with sudo:
sudo docker compose build
Enter your system password and the build starts.
You may then hit an even more annoying problem: many Docker registry mirrors in China have been taken down, so image pulls fail with timeout or EOF errors. You have to find a mirror that still works, switch to it, and rebuild.
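For reference, on Linux switching mirrors means pointing the Docker daemon at the new registry and restarting it; a minimal sketch, with the mirror URL as a placeholder you must replace with one that currently works:
# write /etc/docker/daemon.json (<your-mirror-host> is a placeholder)
sudo tee /etc/docker/daemon.json <<'EOF'
{
  "registry-mirrors": ["https://<your-mirror-host>"]
}
EOF
# restart the daemon so the mirror takes effect
sudo systemctl restart docker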
Even after switching mirrors, rebuilding still takes a long time; each image pull took me ten to twenty minutes. The bottleneck is the three base images referenced in the Dockerfiles: python:3.11-slim, node:20-slim, and golang:1.19.
So my approach was to pull these images to the local machine first and then build the containers:
docker pull python:3.11-slim
docker pull node:20-slim
docker pull golang:1.19
With these three images pulled locally, the build goes much faster:
sudo docker compose build
Run the containers
Once the build finishes, start the containers:
docker compose up
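If you'd rather not tie up the terminal, Compose can also run in the background; these are standard docker compose flags:
# start detached
docker compose up -d
# confirm all services are running
docker compose ps
# follow the logs
docker compose logs -f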
Testing
Once all the Docker services are up, send a request from the terminal to test:
% curl -X GET http://localhost:3002/test
# returns Hello, world!
% curl -X POST http://localhost:3002/v0/crawl \
  -H 'Content-Type: application/json' \
  -d '{
    "url": "https://docs.firecrawl.dev"
  }'
# returns a jobId, e.g. {"jobId":"092aec62-18f3-4cc3-acc9-83ff65d36b9a"}
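You can then poll the job with the v0 status endpoint (per the Firecrawl v0 API; substitute your own jobId for the example one):
% curl -X GET http://localhost:3002/v0/crawl/status/092aec62-18f3-4cc3-acc9-83ff65d36b9a
# returns the job status and, once it completes, the crawled data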
Calling the service from code
If you want to use the service from code, just set the api_url parameter of FirecrawlApp.
Here is an example that scrapes a page through the local service:
from firecrawl import FirecrawlApp

# point the SDK at the local self-hosted service
app = FirecrawlApp(api_key="fc-YOUR_API_KEY", api_url="http://localhost:3002/")

# Scrape a single page:
scrape_result = app.scrape_url('https://firecrawl.dev', params={'formats': ['markdown', 'html']})
print(scrape_result['markdown'])
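One note on api_key: since USE_DB_AUTHENTICATION is false in our .env, the self-hosted API should not validate the key, so a placeholder like "fc-YOUR_API_KEY" is enough; the SDK just requires the parameter to be non-empty.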