Performance comparison of crawler parsers: xpath, re, bs4, etc.
Approach
Test page: http://baijiahao.baidu.com/s?id=1644707202199076031
Each parser extracts the same data from the same page; the extraction is repeated 500 times, the per-iteration times are summed, and the totals are compared.
Test code
```python
import re
import time

import scrapy
from bs4 import BeautifulSoup


class NewsSpider(scrapy.Spider):
    name = 'news'
    allowed_domains = ['baidu.com']
    start_urls = ['http://baijiahao.baidu.com/s?id=1644707202199076031']

    def parse(self, response):
        re_time_list = []
        xpath_time_list = []
        lxml_time_list = []
        bs4_lxml_time_list = []
        html5lib_time_list = []
        bs4_html5lib_time_list = []

        for _ in range(500):
            # re: regex against the raw response text
            re_start_time = time.time()
            news_title = re.findall(pattern="<title>(.*?)</title>", string=response.text)[0]
            news_content = "".join(re.findall(pattern='<span class="bjh-p">(.*?)</span>', string=response.text))
            re_time_list.append(time.time() - re_start_time)

            # xpath: Scrapy's built-in selector
            xpath_start_time = time.time()
            news_title = response.xpath("//div[@class='article-title']/h2/text()").extract_first()
            news_content = response.xpath('string(//*[@id="article"])').extract_first()
            xpath_time_list.append(time.time() - xpath_start_time)

            # bs4 + html5lib, selection only (soup is built outside the timer)
            soup = BeautifulSoup(response.text, "html5lib")
            html5lib_start_time = time.time()
            news_title = soup.select_one("div.article-title > h2").text
            news_content = soup.select_one("#article").text
            html5lib_time_list.append(time.time() - html5lib_start_time)

            # bs4 + html5lib, parse + selection (soup building is timed too)
            bs4_html5lib_start_time = time.time()
            soup = BeautifulSoup(response.text, "html5lib")
            news_title = soup.select_one("div.article-title > h2").text
            news_content = soup.select_one("#article").text
            bs4_html5lib_time_list.append(time.time() - bs4_html5lib_start_time)

            # bs4 + lxml, selection only
            soup = BeautifulSoup(response.text, "lxml")
            lxml_start_time = time.time()
            news_title = soup.select_one("div.article-title > h2").text
            news_content = soup.select_one("#article").text
            lxml_time_list.append(time.time() - lxml_start_time)

            # bs4 + lxml, parse + selection
            bs4_lxml_start_time = time.time()
            soup = BeautifulSoup(response.text, "lxml")
            news_title = soup.select_one("div.article-title > h2").text
            news_content = soup.select_one("#article").text
            bs4_lxml_time_list.append(time.time() - bs4_lxml_start_time)

        re_result = sum(re_time_list)
        xpath_result = sum(xpath_time_list)
        lxml_result = sum(lxml_time_list)
        html5lib_result = sum(html5lib_time_list)
        bs4_lxml_result = sum(bs4_lxml_time_list)
        bs4_html5lib_result = sum(bs4_html5lib_time_list)

        print(">>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>\n")
        print(f"re time: {re_result}")
        print(f"xpath time: {xpath_result}")
        print(f"lxml selection-only time: {lxml_result}")
        print(f"html5lib selection-only time: {html5lib_result}")
        print(f"bs4_lxml parse + selection time: {bs4_lxml_result}")
        print(f"bs4_html5lib parse + selection time: {bs4_html5lib_result}")
        print("\n>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>\n")
        print(f"xpath/re: {xpath_result / re_result}")
        print(f"lxml/re: {lxml_result / re_result}")
        print(f"html5lib/re: {html5lib_result / re_result}")
        print(f"bs4_lxml/re: {bs4_lxml_result / re_result}")
        print(f"bs4_html5lib/re: {bs4_html5lib_result / re_result}")
        print(f"lxml/xpath: {lxml_result / xpath_result}")
        print(f"html5lib/xpath: {html5lib_result / xpath_result}")
        print("\n>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>")
```
Test results:
Run 1
```
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>

re time: 0.018010616302490234
xpath time: 0.19927382469177246
lxml selection-only time: 0.3410227298736572
html5lib selection-only time: 0.3842911720275879
bs4_lxml parse + selection time: 1.6482152938842773
bs4_html5lib parse + selection time: 6.744122505187988

>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>

xpath/re: 11.064242408196765
lxml/re: 18.934539726245003
html5lib/re: 21.336925154218847
bs4_lxml/re: 91.51354213550078
bs4_html5lib/re: 374.4526223822509
lxml/xpath: 1.7113272673976896
html5lib/xpath: 1.9284578525152096

>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
```
Run 2
```
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>

re time: 0.023047208786010742
xpath time: 0.18992280960083008
lxml selection-only time: 0.3522317409515381
html5lib selection-only time: 0.418229341506958
bs4_lxml parse + selection time: 1.710503101348877
bs4_html5lib parse + selection time: 7.1153998374938965

>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>

xpath/re: 8.24059917034769
lxml/re: 15.28305419636484
html5lib/re: 18.14663742538819
bs4_lxml/re: 74.21736476770769
bs4_html5lib/re: 308.7315216154427
lxml/xpath: 1.8546047296364272
html5lib/xpath: 2.2021016979791463

>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
```
Run 3
```
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>

re time: 0.014002561569213867
xpath time: 0.18992352485656738
lxml selection-only time: 0.3783881664276123
html5lib selection-only time: 0.39995455741882324
bs4_lxml parse + selection time: 1.751767873764038
bs4_html5lib parse + selection time: 7.1871068477630615

>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>

xpath/re: 13.563484360899695
lxml/re: 27.022781835827757
html5lib/re: 28.56295653062267
bs4_lxml/re: 125.10338662716453
bs4_html5lib/re: 513.2708620660298
lxml/xpath: 1.9923185751389976
html5lib/xpath: 2.1058716013241323

>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
```
Result analysis
Averaging the three runs gives the ratio matrix below; each cell is the column parser's mean time divided by the row parser's mean time. A short script reproducing these numbers follows the list.
|               | re | xpath | lxml  | html5lib | lxml(bs4) | html5lib(bs4) |
|---------------|----|-------|-------|----------|-----------|---------------|
| re            | 1  | 10.52 | 19.46 | 21.84    | 92.82     | 382.25        |
| xpath         |    | 1     | 1.85  | 2.08     | 8.82      | 36.34         |
| lxml          |    |       | 1     | 1.12     | 4.77      | 19.64         |
| html5lib      |    |       |       | 1        | 4.25      | 17.50         |
| lxml(bs4)     |    |       |       |          | 1         | 4.12          |
| html5lib(bs4) |    |       |       |          |           | 1             |
- xpath/re: 10.52
- lxml/re: 19.46
- html5lib/re: 21.84
- bs4_lxml/re: 92.82
- bs4_html5lib/re: 382.25
- lxml/xpath: 1.85
- html5lib/xpath: 2.08
- bs4_lxml/xpath: 8.82
- bs4_html5lib/xpath: 36.34
- html5lib/lxml: 1.12
- bs4_lxml/lxml: 4.77
- bs4_html5lib/lxml: 19.64
- bs4_lxml/html5lib: 4.25
- bs4_html5lib/html5lib: 17.50
- bs4_html5lib/bs4_lxml: 4.12
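These averaged ratios can be reproduced directly from the raw totals quoted in the three outputs above. A minimal sketch (the timing numbers are hard-coded copies of the three runs; nothing else is assumed):

```python
# Reproduce the averaged ratios: mean each parser's total over the three runs,
# then divide the slower parser's mean by the faster parser's mean.
names = ["re", "xpath", "lxml", "html5lib", "bs4_lxml", "bs4_html5lib"]
runs = [  # totals copied from the three outputs above, in the order of `names`
    (0.018010616302490234, 0.19927382469177246, 0.3410227298736572,
     0.3842911720275879, 1.6482152938842773, 6.744122505187988),
    (0.023047208786010742, 0.18992280960083008, 0.3522317409515381,
     0.418229341506958, 1.710503101348877, 7.1153998374938965),
    (0.014002561569213867, 0.18992352485656738, 0.3783881664276123,
     0.39995455741882324, 1.751767873764038, 7.1871068477630615),
]

# Mean total per parser across the three runs.
means = [sum(col) / len(runs) for col in zip(*runs)]

# Print every slower/faster ratio, matching the table and list above.
for i, fast in enumerate(names):
    for j in range(i + 1, len(names)):
        print(f"{names[j]}/{fast}: {means[j] / means[i]:.2f}")
```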
Comparison of the three approaches
|              | re                  | xpath           | bs4             |
|--------------|---------------------|-----------------|-----------------|
| Installation | built-in            | third-party     | third-party     |
| Syntax       | regular expressions | path expressions | object-oriented |
| Ease of use  | hard                | fairly hard     | easy            |
| Performance  | highest             | moderate        | lowest          |
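To make the syntax row concrete, here is a small side-by-side sketch on a toy snippet. The `doc` string and the selectors are illustrative only (not taken from the benchmark), and lxml stands in for Scrapy's XPath engine:

```python
import re
from bs4 import BeautifulSoup
from lxml import html

doc = '<div class="article-title"><h2>Hello</h2></div>'

# re: regular expressions over the raw string
title_re = re.search(r"<h2>(.*?)</h2>", doc).group(1)

# xpath: path expressions over a parsed tree
title_xpath = html.fromstring(doc).xpath("//div[@class='article-title']/h2/text()")[0]

# bs4: object-oriented traversal / CSS selection
title_bs4 = BeautifulSoup(doc, "lxml").select_one("div.article-title > h2").text

assert title_re == title_xpath == title_bs4 == "Hello"
```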
Conclusion
Performance: re > xpath > bs4
Overall, xpath combined with scrapy-redis for distributed crawling is more than fast enough for most requirements, so xpath is the recommended choice.
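For reference, a minimal sketch of what the recommended xpath + scrapy-redis combination could look like. The spider name, Redis key, and settings values are illustrative assumptions, not part of the benchmark:

```python
# Sketch of an XPath-based spider driven by scrapy-redis (names/keys are illustrative).
from scrapy_redis.spiders import RedisSpider


class NewsSpider(RedisSpider):
    name = "news_distributed"       # hypothetical spider name
    redis_key = "news:start_urls"   # Redis list that workers pop start URLs from

    def parse(self, response):
        yield {
            "title": response.xpath("//div[@class='article-title']/h2/text()").get(),
            "content": response.xpath('string(//*[@id="article"])').get(),
        }


# settings.py (illustrative): route scheduling and dedup through Redis
# SCHEDULER = "scrapy_redis.scheduler.Scheduler"
# DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"
# REDIS_URL = "redis://localhost:6379"
```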