Tencent Careers Crawler (Scrapy Framework), by 诸葛琴魔q


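I am a beginner learning web scraping. As a practice project, I used the Scrapy framework to build a crawler for Tencent Careers that saves its results to a MongoDB database.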

Tencent Careers link: Search | Tencent Careers

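The page source turned out to contain no job data, so I captured the requests the page sends to the backend to find where the data comes from.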

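After locating the API endpoint and examining it, it is easy to see that the number following timestamp is the current timestamp in milliseconds. To crawl up-to-date job postings, the spider therefore has to generate the current timestamp dynamically at run time.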
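As a minimal sketch, the URL can be assembled with a fresh timestamp on every call. The endpoint and query parameters are the ones used in the spider later in this post; the helper name build_query_url is hypothetical and not part of the original project.

```python
import time
from urllib.parse import urlencode

# Listing endpoint, as used by the spider in this post
API_BASE = 'https://careers.tencent.com/tencentcareer/api/post/Query'


def build_query_url(page_index, page_size=10):
    """Return the listing URL for one page, stamped with the current time.

    Hypothetical helper for illustration only.
    """
    params = {
        'timestamp': int(time.time() * 1000),  # current Unix time in milliseconds
        'countryId': '',
        'cityId': '',
        'bgIds': '',
        'productId': '',
        'categoryId': '',
        'parentCategoryId': '',
        'attrId': '',
        'keyword': '',
        'pageIndex': page_index,
        'pageSize': page_size,
        'language': 'zh-cn',
        'area': 'cn',
    }
    # urlencode keeps the dict order and renders empty values as "name="
    return API_BASE + '?' + urlencode(params)


if __name__ == '__main__':
    print(build_query_url(1))
```

Because the timestamp is computed inside the function, every call produces a URL that reflects the moment the request is made.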

Next, create the project.

Command to create the project: scrapy startproject tencent

Command to create the spider: scrapy genspider hr tencent.com

Item:

import scrapy
from scrapy import Field


class TencentItem(scrapy.Item):
    title = Field()    # job title
    country = Field()  # city
    type = Field()     # job category
    text = Field()     # job description
    time = Field()     # posting time
    url = Field()      # link to the job detail page
    collection = 'hr'

Spider:

import time

import scrapy


class HrSpider(scrapy.Spider):
    name = 'hr'
    allowed_domains = ['tencent.com']
    # Listing API; timestamp and pageIndex are filled in for each request
    api_url = ('https://careers.tencent.com/tencentcareer/api/post/Query?timestamp={}'
               '&countryId=&cityId=&bgIds=&productId=&categoryId=&parentCategoryId=&attrId=&keyword='
               '&pageIndex={}&pageSize=10&language=zh-cn&area=cn')

    def start_requests(self):
        # Request every page up front. Yielding the page requests from parse()
        # would re-queue all pages on every response, because the changing
        # timestamp defeats Scrapy's duplicate filter.
        for i in range(1, 941):
            yield scrapy.Request(
                url=self.api_url.format(int(time.time() * 1000), i),
                callback=self.parse
            )

    def parse(self, response):
        # print(response.json()['Data']['Posts'][0])
        content = response.json()['Data']['Posts'] or []  # guard against an empty page
        for tr in content:
            item = {}
            item['title'] = tr['RecruitPostName']
            item['country'] = tr['CountryName'] + ' ' + tr['LocationName']
            item['type'] = tr['CategoryName']
            item['text'] = tr['Responsibility']
            item['time'] = tr['LastUpdateTime']
            item['url'] = 'https://careers.tencent.com/jobdesc.html?postId=' + tr['PostId']
            print(item)
            yield item

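The code first reads the post data from the JSON response, then creates an empty dict item and fills in one key-value pair per field. To find the URL of the next page, I paged through the site and watched which URL parameters changed: apart from the timestamp, pageIndex went from 1 to 2, so this parameter controls pagination. A for loop then generates the URL of each page (940 pages of job postings are crawled here).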

Finally, a request is created for each generated URL and handed to the callback parse, until all pages have been crawled.
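The field mapping can be tried offline, without running the spider. The sketch below applies the same mapping to a hand-written sample payload. The field names come from the spider above; the sample values and the helper name extract_items are invented for illustration.

```python
import json

# Hypothetical sample payload: real field names, made-up values
SAMPLE_RESPONSE = json.dumps({
    'Data': {
        'Posts': [
            {
                'RecruitPostName': 'Backend Engineer',
                'CountryName': 'China',
                'LocationName': 'Shenzhen',
                'CategoryName': 'Technology',
                'Responsibility': 'Build and maintain services. ',
                'LastUpdateTime': '2021-08-01',
                'PostId': '123456',
            }
        ]
    }
})

DETAIL_URL = 'https://careers.tencent.com/jobdesc.html?postId='


def extract_items(body):
    """Turn the JSON text of one result page into a list of item dicts."""
    posts = json.loads(body)['Data']['Posts'] or []  # treat a null Posts as empty
    items = []
    for post in posts:
        items.append({
            'title': post['RecruitPostName'],
            'country': post['CountryName'] + ' ' + post['LocationName'],
            'type': post['CategoryName'],
            'text': post['Responsibility'],
            'time': post['LastUpdateTime'],
            'url': DETAIL_URL + post['PostId'],
        })
    return items


if __name__ == '__main__':
    for item in extract_items(SAMPLE_RESPONSE):
        print(item)
```

Keeping the mapping in a plain function like this makes it easy to check against a saved response before pointing the spider at the live API.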

Data cleaning and saving to the database:

from pymongo import MongoClient


class MongoPipeline(object):
    def __init__(self, mongo_uri, mongo_db):
        self.mongo_uri = mongo_uri
        self.mongo_db = mongo_db

    @classmethod
    def from_crawler(cls, crawler):
        # Read the connection settings from settings.py
        return cls(
            mongo_uri=crawler.settings.get('MONGO_URI'),
            mongo_db=crawler.settings.get('MONGO_DB')
        )

    def open_spider(self, spider):
        self.client = MongoClient(self.mongo_uri)
        self.db = self.client[self.mongo_db]

    def process_item(self, item, spider):
        # The collection is named after the item's class, so the plain dict
        # items yielded by the spider end up in a collection called 'dict'
        name = item.__class__.__name__
        item['text'] = item['text'].strip()  # strip leading and trailing whitespace
        self.db[name].insert_one(dict(item))  # insert() was removed in PyMongo 4
        return item

    def close_spider(self, spider):
        self.client.close()

This last step cleans the scraped information and saves it to the database.
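The pipeline above only strips the text field. As a sketch, the same cleaning could be applied to every string field with a small helper. The name clean_item is hypothetical and not part of the original project.

```python
def clean_item(item):
    """Return a copy of item with surrounding whitespace stripped from string values.

    Non-string values are passed through unchanged. Hypothetical helper.
    """
    return {
        key: value.strip() if isinstance(value, str) else value
        for key, value in item.items()
    }


if __name__ == '__main__':
    raw = {'title': ' Backend Engineer ', 'text': 'Build services.\n', 'pages': 3}
    print(clean_item(raw))
```

Returning a new dict rather than modifying the item in place keeps the original data available if a later pipeline stage needs it.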

Settings:

BOT_NAME = 'tencent'

SPIDER_MODULES = ['tencent.spiders']
NEWSPIDER_MODULE = 'tencent.spiders'

# MongoDB connection settings; the pipeline reads MONGO_URI and MONGO_DB
MONGO_URI = 'localhost'
MONGO_DB = 'tencent'

USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.159 Safari/537.36'

ROBOTSTXT_OBEY = False

LOG_LEVEL = 'WARNING'

ITEM_PIPELINES = {
    'tencent.pipelines.MongoPipeline': 300,
}

Results:

This is my first practice project as a web scraping beginner, so there is still plenty to improve. Suggestions are welcome.