Deploying Scrapy Spiders with scrapyd

Introduction

Scrapyd is a service for running Scrapy spiders.

It allows you to deploy your Scrapy projects and control their spiders using an HTTP JSON API.

Installation

Install the scrapyd server (requires a Scrapy project environment):

pip install scrapyd

Install scrapyd-deploy (no Scrapy environment required):

pip install scrapyd-client

On Windows, installation generates scrapyd-deploy (with no file extension) under C:\Python27\Scripts, so scrapyd-deploy cannot be run directly from the command line.

Configuration file location:

sudo find / -name default_scrapyd.conf

/home/ubuntu/.virtualenvs/secarticle/lib/python2.7/site-packages/scrapyd/default_scrapyd.conf

Workaround:

Create a file named scrapyd-deploy.bat in C:\Python27\Scripts with the following contents:

@echo off
C:\Python27\python C:\Python27\Scripts\scrapyd-deploy %*

Usage

Start scrapyd

Run scrapyd in a terminal to start the scrapyd service.

Deploy a project to scrapyd

Switch to the Scrapy project's root directory and edit scrapy.cfg:

uncomment the line
# url = http://localhost:6800/ by removing the leading #
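
For reference, a minimal scrapy.cfg with a named deploy target might look like the following (the target name secarticle and project name SecArticleSpider are illustrative, matching the examples later in this article):

```ini
[settings]
default = SecArticleSpider.settings

; the deploy target name (after the colon) is what scrapyd-deploy refers to
[deploy:secarticle]
url = http://localhost:6800/
project = SecArticleSpider
```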


Package and deploy the project to scrapyd:

scrapyd-deploy <target> -p <project>
  • <target> is the deploy name from the configuration file; for the configuration above it is secarticle
  • <project> is the scrapyd project name; if omitted, it defaults to the deploy name

After deployment, an eggs folder appears in the Scrapy project directory; it holds the project that scrapyd-deploy packaged into a .egg file.

API

Check scrapyd's running status

curl http://localhost:6800/daemonstatus.json
Response:
{"status": "ok", "running": 0, "finished": 42, "pending": 0, "node_name": "VM-42-98-ubuntu"}

Create a crawl job

curl http://localhost:6800/schedule.json -d project=myproject -d spider=somespider

Parameters:
project (string, required) - the project name
spider (string, required) - the spider name within the project
setting (string, optional) - a Scrapy setting to use when running the spider
jobid (string, optional) - a job id to assign to the run
_version (string, optional) - the project version to use


Example:
curl http://localhost:6800/schedule.json -d project=SecArticleSpider -d spider=freebuf

Response:
{"status": "ok", "jobid": "7ea5ddc0bd7d11e7b256525400dc407e", "node_name": "VM-42-98-ubuntu"}
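
If you prefer scripting over curl, the same request can be assembled in Python. A minimal sketch (the helper name build_schedule_payload is illustrative, and it assumes scrapyd is listening at localhost:6800 as above); scrapyd accepts optional Scrapy settings as repeated setting fields in KEY=VALUE form:

```python
def build_schedule_payload(project, spider, settings=None, jobid=None, version=None):
    """Return the (field, value) pairs for a POST to scrapyd's schedule.json."""
    payload = [("project", project), ("spider", spider)]
    # Each Scrapy setting is sent as a separate "setting" field: KEY=VALUE
    for key, value in (settings or {}).items():
        payload.append(("setting", f"{key}={value}"))
    if jobid:
        payload.append(("jobid", jobid))
    if version:
        payload.append(("_version", version))
    return payload

if __name__ == "__main__":
    # Equivalent of:
    # curl http://localhost:6800/schedule.json -d project=SecArticleSpider -d spider=freebuf
    payload = build_schedule_payload("SecArticleSpider", "freebuf",
                                     settings={"DOWNLOAD_DELAY": 2})
    print(payload)
    # Sending it could then be done with e.g. the requests library:
    # import requests
    # requests.post("http://localhost:6800/schedule.json", data=payload)
```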

Cancel a running job

curl http://localhost:6800/cancel.json -d project=myproject -d job=6487ec79947edab326d6db28a2d86511e8247444

Response:
{"status": "ok", "prevstate": "running"}

List projects:

curl http://localhost:6800/listprojects.json

Response:
{"status": "ok", "projects": ["myproject", "otherproject"]}

List spiders:

curl http://localhost:6800/listspiders.json?project=myproject

Response:
{"status": "ok", "spiders": ["spider1", "spider2", "spider3"]}

List jobs:

curl http://localhost:6800/listjobs.json?project=myproject

Example response:
{"status": "ok",
 "pending": [{"id": "78391cc0fcaf11e1b0090800272a6d06", "spider": "spider1"}],
 "running": [{"id": "422e608f9f28cef127b3d5ef93fe9399", "spider": "spider2", "start_time": "2012-09-12 10:14:03.594664"}],
 "finished": [{"id": "2f16646cfcaf11e1b0090800272a6d06", "spider": "spider3", "start_time": "2012-09-12 10:14:03.594664", "end_time": "2012-09-12 10:24:03.594664"}]}
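
The listjobs.json response is easy to post-process. A small sketch that counts jobs per state, using the sample response above (pure JSON parsing, no network call; the helper name job_counts is illustrative):

```python
import json

def job_counts(body):
    """Count jobs per state from a listjobs.json response body."""
    data = json.loads(body)
    return {state: len(data.get(state, []))
            for state in ("pending", "running", "finished")}

# The sample response from the article:
sample = '''{"status": "ok",
 "pending": [{"id": "78391cc0fcaf11e1b0090800272a6d06", "spider": "spider1"}],
 "running": [{"id": "422e608f9f28cef127b3d5ef93fe9399", "spider": "spider2", "start_time": "2012-09-12 10:14:03.594664"}],
 "finished": [{"id": "2f16646cfcaf11e1b0090800272a6d06", "spider": "spider3", "start_time": "2012-09-12 10:14:03.594664", "end_time": "2012-09-12 10:24:03.594664"}]}'''

print(job_counts(sample))  # → {'pending': 1, 'running': 1, 'finished': 1}
```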

Delete a project:

curl http://localhost:6800/delproject.json -d project=myproject

Example response:
{"status": "ok"}

For more supported APIs, see:
the official Scrapyd documentation

Do not repost without permission: 晨飞小窝 » Deploying Scrapy Spiders with scrapyd
