Day 60, 61, 62: Python Scrapy_韭菜盒子123

Online submission 02-07 931

Contents: I. Scrapy; 1. Structure; 2. Steps; 3. API endpoints; 4. Reading HTML with XPath; 2.1 Commands; Miscellaneous

I know, I know, on the other side of the Earth you are there with me

I. Scrapy

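Scrapy is an application framework written in Python for crawling websites and extracting structured data.

It is commonly used in programs for data mining, information processing, and archiving historical data.

With the Scrapy framework it is easy to build a crawler that fetches the content or images of a given website.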

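1. Structure

Scrapy Engine: handles the communication between the Spider, Item Pipeline, Downloader, and Scheduler, passing signals and data between them.

Scheduler: accepts the Requests sent by the engine, arranges them in some order, and puts them in a queue (internally a Request queue). When the engine asks for one, the Scheduler hands it back.

Downloader: downloads every Request sent by the Scrapy Engine and returns the resulting Responses to the engine, which passes them on to the Spider.

Spider: submits the URLs that need to be followed to the engine, which puts them in the Scheduler. It also processes every Response, parsing it and extracting the data that the Item fields need, and then submits further URLs.

Item Pipeline: the place where the Items produced by the Spider are post-processed (detailed analysis, filtering, storage, and so on). An Item is the collection of useful data taken from a response (for example, one object).

Downloader Middlewares: think of them as components for customising and extending the download behaviour.

Spider Middlewares: think of them as components for customising and extending the communication between the engine and the Spider (for example, the Responses going into the Spider and the Requests coming out of it).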
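The hand-offs described above can be sketched as a toy loop in plain Python. This is not how Scrapy is implemented: `toy_downloader`, `toy_spider_parse`, and `toy_engine` are hypothetical names made up for illustration, no network access happens, and a plain string plays the role of a Request.

```python
from collections import deque


def toy_downloader(url):
    # Stand-in for the Downloader: builds a fake response instead of doing network I/O
    return {"url": url, "body": f"content of {url}"}


def toy_spider_parse(response):
    # Stand-in for the Spider: yields one item, plus a follow-up URL for the first page only
    yield {"item": response["body"]}
    if response["url"] == "page/1":
        yield "page/2"  # a plain string plays the role of a new Request


def toy_engine(start_urls):
    scheduler = deque(start_urls)  # Scheduler: a FIFO queue of pending requests
    seen = set(start_urls)         # remember what was already queued
    stored = []                    # Item Pipeline: here it only collects the items
    while scheduler:
        url = scheduler.popleft()                  # engine takes the next request
        response = toy_downloader(url)             # engine hands it to the downloader
        for result in toy_spider_parse(response):  # engine passes the response to the spider
            if isinstance(result, dict):
                stored.append(result)              # items go to the pipeline
            elif result not in seen:
                seen.add(result)
                scheduler.append(result)           # new requests go back to the scheduler
    return stored


items = toy_engine(["page/1"])
print(items)
```

The loop ends when the queue is empty, which mirrors the way a crawl finishes once no requests are left.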

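2. Steps

1 Install the Scrapy framework

pip install scrapy

2 Open a cmd window, or the Terminal in PyCharm, and run scrapy to confirm the installation

scrapy

3 Quickly create a Scrapy project

scrapy startproject fghscrapy

4 Change into the project directory

cd fghscrapy

5 Generate an example spider

scrapy genspider example example.com

3. API endpoints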

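Some large websites separate the front end from the back end. If you can find the API endpoint behind a page, you can usually get the data you want directly from its JSON response.

JD.com is used as the example here.

1 Inspect the page to get the url and the request headers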

url = 'https://club.jd.com/comment/productPageComments.action?callback=fetchJSON_comment98&productId=23443207137&score=0&sortType=5&page={}&pageSize=10&isShadowSku=0&fold=1'

headers = {
    "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.93 Safari/537.36 Edg/96.0.1054.53"
}
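To see what the placeholder in the url controls, the query string can be taken apart with the standard library. The sketch below only builds and inspects the url and sends no request; `page_params` is a hypothetical helper name.

```python
from urllib.parse import parse_qs, urlsplit

url_format = ('https://club.jd.com/comment/productPageComments.action'
              '?callback=fetchJSON_comment98&productId=23443207137&score=0'
              '&sortType=5&page={}&pageSize=10&isShadowSku=0&fold=1')


def page_params(page):
    """Fill in the page number and return the query parameters as a dict."""
    url = url_format.format(page)
    query = parse_qs(urlsplit(url).query)
    # parse_qs returns a list per key; each key appears once here, so take element 0
    return {key: values[0] for key, values in query.items()}


params = page_params(2)
print(params["page"], params["pageSize"], params["callback"])
```

The `callback` parameter is the name that later wraps the JSON in the response body, which is why the spider has to strip it before parsing.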

items

# Define here the models for your scraped items
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html

import scrapy


# Define the shape of the data to scrape, i.e. one object per comment
class CommentItem(scrapy.Item):
    # Each field has to be assigned scrapy.Field()
    id = scrapy.Field()
    content = scrapy.Field()
    creationTime = scrapy.Field()
    score = scrapy.Field()
    nickname = scrapy.Field()

spider

import re
import json

import scrapy

from ..items import CommentItem


class JDScriper(scrapy.Spider):
    name = 'JDScriper'
    allowed_domains = ['jd.com']

    # Set the request header so the request looks like it comes from a browser
    headers = {
        "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.93 Safari/537.36 Edg/96.0.1054.53"
    }

    # Issue the requests manually
    def start_requests(self):
        # JD.com
        url_format = 'https://club.jd.com/comment/productPageComments.action?callback=fetchJSON_comment98&productId=23443207137&score=0&sortType=5&page={}&pageSize=10&isShadowSku=0&fold=1'
        # headers: the request headers
        # callback: the function that parses the response, so the parsing
        # can be customised (parse is the default)
        # yield is similar to return, except that it pauses the function instead of
        # ending it, so the function can hand back many values. This saves memory,
        # because the results do not all have to be stored before returning
        for i in range(8):
            url = url_format.format(i)
            yield scrapy.Request(url, headers=self.headers, callback=self.parse)

    # Parsing function
    def parse(self, response):
        # print(response.text)
        # The body has to be parsed: first cut the JSON out with a regular expression.
        # The r prefix makes the pattern a raw string, so the backslashes reach the
        # regex engine unchanged.
        # Note: by default .* matches everything except a newline, and the body may
        # contain newlines, so re.DOTALL is passed to let . match them as well.
        # ( 'fetchJSON_comment98(' )( .* )( ');' )
        match = re.search(r"(fetchJSON_comment98\()(.*)(\);)", response.text, re.DOTALL)
        if match is None:
            return
        json_str = match.group(2)
        # print(json_str)
        # Parse the JSON into a dict
        commentDict = json.loads(json_str)
        # This gives a list
        comments = commentDict.get('comments', None)
        # Check it, then iterate
        if comments is not None:
            for comment in comments:
                # Each element is again a dict; the site's format is regular enough
                # to read the fields directly.
                # Build an Item object from the data parsed out of the response.
                # CommentItem has to be imported; .. refers to the parent package
                commentItem = CommentItem()
                # Item fields are assigned with this dict-style syntax
                commentItem['id'] = comment['id']
                commentItem['content'] = comment['content']
                commentItem['creationTime'] = comment['creationTime']
                commentItem['score'] = comment['score']
                commentItem['nickname'] = comment['nickname']
                # Send the finished Item to the pipeline for further processing
                yield commentItem
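The unwrapping step in parse can be tried on its own, without Scrapy. The sketch below uses a made-up sample body (not real data from the site) that contains newlines, and a hypothetical helper named `strip_jsonp`. It also shows why the newline matters: without `re.DOTALL` the same pattern finds nothing.

```python
import json
import re

# Hypothetical sample body shaped like a JSONP reply; the \n are real newlines
sample_body = 'fetchJSON_comment98({"comments": [\n {"id": 1, "content": "ok"}\n]});'


def strip_jsonp(body, callback_name="fetchJSON_comment98"):
    """Return the JSON wrapped in callback_name(...); as Python data, or None if absent."""
    pattern = re.escape(callback_name) + r"\((.*)\);"
    match = re.search(pattern, body, re.DOTALL)  # DOTALL lets . match newlines too
    if match is None:
        return None
    return json.loads(match.group(1))


data = strip_jsonp(sample_body)
print(data["comments"][0]["content"])

# The same pattern without DOTALL cannot cross the newline, so nothing matches
no_dotall = re.search(r"fetchJSON_comment98\((.*)\);", sample_body)
print(no_dotall)
```

Returning None for a body that does not match lets the caller skip the page instead of failing with an AttributeError on `.group()`.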

pipeline

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html

# useful for handling different item types with a single interface
from itemadapter import ItemAdapter
import pymysql


class fghscrapyPipeline:
    # Only one connection is needed; this method is called once, when the spider is opened
    def open_spider(self, spider):
        self.conn = pymysql.connect(user='root', password='123456', host='master', port=3306, database='scrapy')

    # Process each item sent over by the Spider
    def process_item(self, item, spider):
        # Write to MySQL
        id = item['id']
        content = item['content']
        creationTime = item['creationTime']
        score = item['score']
        nickname = item['nickname']
        # Guard against problems such as missing information with try/except
        try:
            with self.conn.cursor() as cursor:
                cursor.execute('insert into comments values(%s,%s,%s,%s,%s)',
                               (id, nickname, score, creationTime, content))
        except Exception as e:
            print(e)
            # Roll back if an exception occurred
            self.conn.rollback()
        else:
            # Commit the written data if no exception occurred
            self.conn.commit()
        return item

    # Closing the connection works the same way as opening it
    def close_spider(self, spider):
        self.conn.close()
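The rollback-or-commit pattern in process_item can be tried without a MySQL server. The sketch below swaps in the standard library's in-memory SQLite database, so the connection call and the placeholder style differ from pymysql. The table layout and the `save_comment` helper are assumptions made for illustration.

```python
import sqlite3

# In-memory SQLite stands in for MySQL so the sketch runs without a server
conn = sqlite3.connect(":memory:")
conn.execute("create table comments (id integer primary key, nickname text, content text)")
conn.commit()


def save_comment(conn, comment):
    """Insert one comment; roll back on failure, commit on success."""
    try:
        conn.execute(
            "insert into comments values (?, ?, ?)",  # sqlite3 uses ?, pymysql uses %s
            (comment["id"], comment["nickname"], comment["content"]),
        )
    except Exception as exc:
        print(exc)
        conn.rollback()
        return False
    else:
        conn.commit()
        return True


first = save_comment(conn, {"id": 1, "nickname": "a", "content": "good"})
# Same primary key again, so the insert fails and is rolled back
duplicate = save_comment(conn, {"id": 1, "nickname": "b", "content": "again"})
rows = conn.execute("select id, nickname from comments").fetchall()
print(first, duplicate, rows)
conn.close()
```

One failed row therefore does not stop the crawl: the error is printed, the transaction is rolled back, and the next item is processed as usual.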

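Some settings

# Whether to obey robots.txt (the "gentlemen's agreement")
ROBOTSTXT_OBEY = False

# Download delay in seconds; it can be enabled to avoid being detected
DOWNLOAD_DELAY = 3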
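The comment at the top of the pipeline file is a reminder that a pipeline only runs once it is listed in ITEM_PIPELINES. A sketch of that entry, assuming the class sits in pipelines.py inside the fghscrapy package (the layout that startproject normally creates). The number sets the running order among pipelines, and 300 is just a commonly used value.

```python
# settings.py (fragment): enable the pipeline so process_item is actually called
ITEM_PIPELINES = {
    "fghscrapy.pipelines.fghscrapyPipeline": 300,
}
```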

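Running

1 Type in the console

scrapy crawl JDScriper

2 Or put the following in the project's init file and run that file

from scrapy import cmdline

cmdline.execute("scrapy crawl JDScriper".split())

4. Reading HTML with XPath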

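Some data has to be taken directly from the HTML (HyperText Markup Language).

An HTML document consists of a head and a body, inside which tags are nested within one another like Russian dolls. The data is usually buried among them and has to be unpacked layer by layer, much like the nesting of namespaces.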

html be like:

/html/body/div[3]/div[1]/div/div[1]/ol/li[1]/div/div[2]/div[1]

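The outermost level is /html, similar to a root directory. [n] means that the parent contains several children with the same name, and n selects the n-th of them from the top, counting from 1.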
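The way such a path walks down the tree can be tried with the standard library's ElementTree, which supports a small subset of XPath. This swaps in ElementTree for Scrapy's selectors so the sketch runs without Scrapy, and the page fragment is made up and well-formed (real pages usually need an HTML parser).

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed page fragment for illustration
page = """
<html>
  <body>
    <div>header</div>
    <div>
      <ol>
        <li>first</li>
        <li>second</li>
      </ol>
    </div>
  </body>
</html>
""".strip()

root = ET.fromstring(page)  # root is the <html> element itself

# Counterpart of /html/body/div[2]/ol/li[1]; ElementTree paths start below the root
first_li = root.find("./body/div[2]/ol/li[1]")
second_li = root.find("./body/div[2]/ol/li[2]")
print(first_li.text, second_li.text)
```

Here div[2] picks the second div under body, and li[1] picks the first li under ol, which matches the counting-from-1 rule described above.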

2.1 Commands

Open the shell (a request header can be added as needed)

scrapy shell -s USER_AGENT="Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.93 Safari/537.36 Edg/96.0.1054.53" https://movie.douban.com/top250

View the HTML of the current page

response.text

Get data through an absolute path with / (extract() turns the selected nodes into strings)

response.xpath('/html/body/div[3]/div[1]/div/div[1]/ol/li[1]/div/div[2]/div[1]/a/span[1]').extract()

Search for a tag anywhere in the document with //

response.xpath("//ol")
# Return all the li elements under ol as a list
# Note that the result is a list, so indexing starts from 0 again
response.xpath("//ol/li")
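The difference between the two counting schemes is easy to mix up, so here is a small check. It again uses the standard library's ElementTree as a stand-in for Scrapy's selectors, on a made-up fragment.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment for illustration
root = ET.fromstring("<html><body><ol><li>a</li><li>b</li><li>c</li></ol></body></html>")

all_li = root.findall(".//ol/li")             # a Python list of elements
by_list_index = all_li[0]                     # Python lists count from 0
by_xpath_position = root.find(".//ol/li[1]")  # XPath positions count from 1

print(by_list_index.text, by_xpath_position.text, [li.text for li in all_li])
```

Both lookups land on the same first element: index 0 in the returned list corresponds to position 1 inside the path.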

Filter by tag attribute (when several tags share a name, attributes narrow the match)

response.xpath("//ol[@class='grid_view']")
# Match on several attribute conditions
response.xpath('//ol[contains(@class,"li")]')
response.xpath('//ol[contains(@class,"li") and @name="item"]')

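Searching among descendant nodes

li_1 = response.xpath("//ol[@class='grid_view']/li")[0]
# .// searches among the descendants of the current node, for example all span tags
li_1.xpath(".//span")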

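Data extraction

# text() outputs only the text part, without attributes and so on
# Both lines select the same text; the final /text() step picks the text nodes rather than another tag
li_1.xpath(".//span[@class='title']")[0].xpath("text()")
li_1.xpath(".//span[@class='title']/text()")[0]
# extract() pulls out the content, i.e. only the value part
# On a single selector it returns a string; on a list of selectors it returns a list of strings
li_1.xpath(".//span[@class='title']")[0].xpath("text()").extract()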
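The same ideas (descendant search, attribute filter, reading the text, reading an attribute) can be tried offline. This sketch uses the standard library's ElementTree instead of Scrapy's selectors, so `.text` and `.get()` take the place of `text()`, `@href`, and `extract()`. The list item is a made-up example.

```python
import xml.etree.ElementTree as ET

# Hypothetical list item, loosely shaped like one movie entry
li = ET.fromstring(
    "<li><div><a href='/subject/1'>"
    "<span class='title'>Movie A</span>"
    "<span class='other'>alias</span>"
    "</a></div></li>"
)

# .// searches every descendant of li; the predicate filters on the class attribute
titles = li.findall(".//span[@class='title']")
title_text = titles[0].text         # plays the role of /text() followed by extract()
link = li.find(".//a").get("href")  # plays the role of /@href

print(title_text, link)
```

Only the span whose class is title is returned, so the alias span is left out of the result.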

Getting a tag's attribute, which is sometimes needed, for example when turning pages

response.xpath("//div[@class='paginator']/a[1]/@href")

Miscellaneous

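1. Visual Studio multiple cursors: alt + left click

2. Visual Studio format document: alt + shift + f

3. When starting the project, if one of its modules defines open_spider() and that method calls an external service, either start the service first or comment out the body of the method. open_spider() is similar to a constructor or a static initializer block: it runs once, when the spider is opened.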