Python Web Scraping Basics: XPath Extraction with the lxml Library

Author: 郭然  Published: 2022-09-18

Getting Started with lxml

Common Rules

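Expression   Description
nodename     Selects the child nodes of the current node that are named nodename
/            Selects direct children of the current node
//           Selects descendants of the current node
.            Selects the current node
..           Selects the parent of the current node
@            Selects attributes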
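
The rules in the table can be tried out on a small tree. The following is a minimal sketch; the markup and the variable names are hypothetical and chosen only for illustration. It evaluates each expression relative to a single element, so the difference between the current node, its children, and its descendants is visible.

```python
from lxml import etree

# Hypothetical sample markup, used only for this illustration.
doc = etree.HTML('<div id="box"><p class="intro">hello <b>world</b></p><p>bye</p></div>')

# '//' searches the descendants of the document root; take the first <div>.
div = doc.xpath('//div')[0]

child_tags = [el.tag for el in div.xpath('p')]    # nodename: children named p
same_tags = [el.tag for el in div.xpath('./p')]   # '.' makes the current node explicit
bold = div.xpath('.//b')[0]                       # './/': any descendant of div
parent_tag = bold.xpath('..')[0].tag              # '..': parent of the current node
div_id = div.xpath('@id')                         # '@': attribute of the current node
p_classes = div.xpath('./p/@class')               # attribute values of the direct children

print(child_tags, same_tags, parent_tag, div_id, p_classes)
```

Calling xpath() on an element rather than on the whole document is what makes '.' and '..' meaningful: both are resolved against that element.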

Common Usage

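  • All nodes: //*
  • Child nodes: / for direct children, // for all descendants
  • Parent node: ..
  • Attribute matching: li[@class="xxx"]
  • Text retrieval: /text()
  • Attribute retrieval: @href
  • Matching an attribute with multiple values: li[contains(@class, "xxx")]
  • Matching on multiple attributes: li[contains(@class, "li_test") and @tag="tag"]
  • Selecting by position (counting starts at 1): li[1], li[last()]
  • Selecting by node axis: li[1]/ancestor::ul, li[1]/attribute::*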
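
A few of the patterns above can be sketched as follows. The markup, the example.com URLs, and the variable names are assumptions made for this illustration and do not come from the article. The sketch covers selecting every node, reading the href attribute, and the difference between an exact class match and contains().

```python
from lxml import etree

# Hypothetical markup for illustration only.
snippet = '''
<ul>
<li class="item"><a href="https://example.com/a">A</a></li>
<li class="item active"><a href="https://example.com/b">B</a></li>
</ul>
'''
root = etree.HTML(snippet)

# All nodes: every element, in document order (lxml adds html and body).
all_tags = [el.tag for el in root.xpath('//*')]

# Child nodes: the <li> elements directly under <ul>.
child_tags = [el.tag for el in root.xpath('//ul/li')]

# Attribute retrieval: the href value of every link.
hrefs = root.xpath('//li/a/@href')

# Exact match compares the whole attribute string, so "item active" is skipped.
exact = root.xpath('//li[@class="item"]/a/text()')

# contains() matches when the attribute string contains the given text.
multi = root.xpath('//li[contains(@class, "active")]/a/text()')

print(all_tags, child_tags, hrefs, exact, multi, sep='\n')
```

Note that contains() is a plain substring test, so contains(@class, "item") would match both list items here.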

A Simple Example

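from lxml import etree

# Sample HTML. The first <li> is never closed and </ul> is missing;
# etree.HTML() repairs both while parsing, as the output further down shows.
text = '''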
<div>
<ul class="ul_test">
<li class="first_li"><a href="www.baidu.com">one</a>
<li class="li_test"><a href="www.csdn.net" tag="tag">two</a></li>
<li class="li li_test"><a href="www.163.com">three</a></li>
<li class="li li_test" tag="tag"><a href="www.163.com">four</a></li>
</div>
'''
html = etree.HTML(text)
# Alternatively, parse from a file: html = etree.parse('./test.html', etree.HTMLParser())
result = etree.tostring(html)  # serialize the repaired tree to bytes
print(result.decode('utf-8') + '\n')

# Attribute matching, text retrieval, and stepping back to the parent with '..'
print(html.xpath('//li/a[@tag="tag"]/text()'))
print(html.xpath('//li/a[@tag="tag"]/../@class'))

# Exact class match versus contains() on a multi-valued class attribute
print(html.xpath('//li[@class="li_test"]/a/text()'))
print(html.xpath('//li[contains(@class,"li_test")]/a/text()'))
print(html.xpath('//li[contains(@class,"li_test") and @tag="tag"]/a/text()'))

# Selecting by position: first, last, first two, second to last
print(html.xpath('//li[1]/a/text()'),
      html.xpath('//li[last()]/a/text()'),
      html.xpath('//li[position()<3]/a/text()'),
      html.xpath('//li[last()-1]/a/text()'))
print('*'*20)

# Node axes, each evaluated from the first <li>
print(html.xpath('//li[1]/ancestor::*'),
      html.xpath('//li[1]/ancestor::ul'),
      html.xpath('//li[1]/attribute::*'),
      html.xpath('//li[1]/child::*'),
      html.xpath('//li[1]/descendant::*'),
      html.xpath('//li[1]/following::*'),
      html.xpath('//li[1]/following-sibling::*'), sep="\n")

"""

Output:

<html><body><div>
<ul class="ul_test">
<li class="first_li"><a href="www.baidu.com">one</a>
</li><li class="li_test"><a href="www.csdn.net" tag="tag">two</a></li>
<li class="li li_test"><a href="www.163.com">three</a></li>
<li class="li li_test" tag="tag"><a href="www.163.com">four</a></li>
</ul></div>
</body></html>

['two']
['li_test']
['two']
['two', 'three', 'four']
['four']
['one'] ['four'] ['one', 'two'] ['three']
********************
[<Element html at 0x1d4f8dba688>, <Element body at 0x1d4f8dba6c8>, <Element div at 0x1d4f8dba5c8>, <Element ul at 0x1d4f8dba748>]
[<Element ul at 0x1d4f8dba748>]
['first_li']
[<Element a at 0x1d4f8dba608>]
[<Element a at 0x1d4f8dba608>]
[<Element li at 0x1d4f8dba808>, <Element a at 0x1d4f8dba848>, <Element li at 0x1d4f8dba888>, <Element a at 0x1d4f8dba8c8>, <Element li at 0x1d4f8dba908>, <Element a at 0x1d4f8dba988>]
[<Element li at 0x1d4f8dba808>, <Element li at 0x1d4f8dba888>, <Element li at 0x1d4f8dba908>]
"""
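
The example mentions etree.parse() with etree.HTMLParser() as an alternative to etree.HTML(). The sketch below tries that route without a file on disk by wrapping a string in io.StringIO, then pairs each link's text with its href. The shortened markup and the names html_text, tree, and links are assumptions made for this illustration.

```python
from io import StringIO
from lxml import etree

# Shortened, hypothetical variant of the sample markup; the first <li> is left unclosed.
html_text = ('<ul><li><a href="www.baidu.com">one</a>'
             '<li><a href="www.csdn.net">two</a></li></ul>')

# etree.parse() accepts a file path or a file-like object; HTMLParser tolerates broken HTML.
tree = etree.parse(StringIO(html_text), etree.HTMLParser())

# Collect (text, href) pairs by visiting each <a> element under an <li>.
links = []
for a in tree.xpath('//li/a'):
    links.append((a.text, a.get('href')))

print(links)
```

Iterating over the elements keeps each text paired with its own href, which two separate /text() and /@href queries would not guarantee when some links lack an attribute.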

 
