
Deploying Scrapy Spiders with scrapyd

Introduction
--

Scrapyd is a service for running Scrapy spiders. It allows you to deploy your Scrapy projects and control their spiders using an HTTP JSON API.

Installation
--

Install the scrapyd server (requires a Scrapy project environment):

```bash
pip install scrapyd
```

Install `scrapyd-client`, which provides `scrapyd-deploy` (no Scrapy environment required):

```bash
pip install scrapyd-client
```

On Windows, the install puts a `scrapyd-deploy` script (with no file extension) under `c:\python27\Scripts`, so `scrapyd-deploy` cannot be run directly from the command line.

##### Solution:

Create a new file `scrapyd-deploy.bat` under `C:\python27\Scripts` with the following content:

```bat
@echo off
C:\Python27\python C:\Python27\Scripts\scrapyd-deploy %*
```


Usage
--

### Start scrapyd

Run `scrapyd` in a terminal to start the `scrapyd` service.
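
As a quick sanity check, you can hit the status endpoint from another terminal (a minimal sketch, assuming scrapyd's default port 6800):

```bash
# scrapyd serves its HTTP JSON API on http://localhost:6800/ by default
curl http://localhost:6800/daemonstatus.json
```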

### Deploy a project to scrapyd

Switch to the root directory of the Scrapy project and edit `scrapy.cfg`:

Uncomment the line `# url = http://localhost:6800/` by removing the leading `#`.


![](https://www.ichenfei.com/wp-content/uploads/2017/10/0678af8c66e671232462916041dee39d.png)
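
The deploy section then looks roughly like this (a sketch reconstructed from the `scearticle` target and `SecArticleSpider` project names used later in this post):

```ini
[deploy:scearticle]
url = http://localhost:6800/
project = SecArticleSpider
```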

Package and deploy the project to `scrapyd`:

```bash
scrapyd-deploy <target> -p <project>
```


* `<target>` is the `deploy` name from the config file; for the configuration above it is `scearticle`
* `<project>` is the scrapyd project name; if omitted, it defaults to the `deploy` name (see the example below)
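
Putting the two together for the configuration above, the command would presumably be:

```bash
scrapyd-deploy scearticle -p SecArticleSpider
```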

After deployment, an `eggs` folder appears in the Scrapy project directory; it holds the project that `scrapyd-deploy` packaged as a `.egg` file.

![](https://www.ichenfei.com/wp-content/uploads/2017/10/2e3e75aea3ef2cde9615576f22ee531e.png)

API
---

### Check the `scrapyd` status


```bash
curl http://localhost:6800/daemonstatus.json
```

Returns:

```json
{"status": "ok", "running": 0, "finished": 42, "pending": 0, "node_name": "VM-42-98-ubuntu"}
```

### Create a crawl job

```bash
curl http://localhost:6800/schedule.json -d project=myproject -d spider=somespider
```

Parameters:

* `project` (string, required) - the project name
* `spider` (string, required) - the spider name within the project
* `setting` (string, optional) - a Scrapy setting to use when running the spider
* `jobid` (string, optional) - a job id to assign to the run
* `_version` (string, optional) - the version of the project to use

Example:

```bash
curl http://localhost:6800/schedule.json -d project=SecArticleSpider -d spider=freebuf
```

Returns:

```json
{"status": "ok", "jobid": "7ea5ddc0bd7d11e7b256525400dc407e", "node_name": "VM-42-98-ubuntu"}
```
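
The optional parameters go in as extra `-d` fields; a hedged sketch with illustrative values (the setting value and job id here are made up):

```bash
curl http://localhost:6800/schedule.json \
  -d project=SecArticleSpider -d spider=freebuf \
  -d setting=DOWNLOAD_DELAY=2 \
  -d jobid=freebuf-manual-run
```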

### Cancel a running job

```bash
curl http://localhost:6800/cancel.json -d project=myproject -d job=6487ec79947edab326d6db28a2d86511e8247444
```

Returns:

```json
{"status": "ok", "prevstate": "running"}
```

### List projects

```bash
curl http://localhost:6800/listprojects.json
```

Returns:

```json
{"status": "ok", "projects": ["myproject", "otherproject"]}
```

### List spiders

```bash
curl http://localhost:6800/listspiders.json?project=myproject
```

Returns:

```json
{"status": "ok", "spiders": ["spider1", "spider2", "spider3"]}
```

### List jobs

```bash
curl http://localhost:6800/listjobs.json?project=myproject
```

Example return:

```json
{"status": "ok",
 "pending": [{"id": "78391cc0fcaf11e1b0090800272a6d06", "spider": "spider1"}],
 "running": [{"id": "422e608f9f28cef127b3d5ef93fe9399", "spider": "spider2", "start_time": "2012-09-12 10:14:03.594664"}],
 "finished": [{"id": "2f16646cfcaf11e1b0090800272a6d06", "spider": "spider3", "start_time": "2012-09-12 10:14:03.594664", "end_time": "2012-09-12 10:24:03.594664"}]}
```
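
One rough way to wait for a job to complete is to poll `listjobs.json` until the job id shows up in the `finished` list (a sketch assuming `python` is on the PATH, reusing the hypothetical job id returned by `schedule.json` above):

```bash
JOB=7ea5ddc0bd7d11e7b256525400dc407e  # job id returned by schedule.json

# Poll every 5 seconds until $JOB appears among the finished jobs
until curl -s "http://localhost:6800/listjobs.json?project=SecArticleSpider" \
    | python -c "import sys, json; sys.exit(0 if '$JOB' in [j['id'] for j in json.load(sys.stdin)['finished']] else 1)"; do
  sleep 5
done
echo "job $JOB finished"
```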

### Delete a project

```bash
curl http://localhost:6800/delproject.json -d project=myproject
```

Example return:

```json
{"status": "ok"}
```

For more supported APIs, see the official Scrapyd documentation.