Python Web Scraping Basics: XPath Extraction with the lxml Library
Getting Started with lxml
Common Rules
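Expression | Description |
---|---|
nodename | Selects the child elements of the current node named nodename |
/ | Selects direct children of the current node (a leading / starts at the document root) |
// | Selects descendants at any depth (a leading // searches the whole document; use .// to start from the current node) |
. | Selects the current node |
.. | Selects the parent of the current node |
@ | Selects attributes |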
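The rules in the table can be tried out one at a time. The following is a minimal sketch, assuming lxml is installed; the markup and the names `doc` and `ul` are made up purely for illustration.

```python
from lxml import etree

# Hypothetical sample markup, used only to illustrate the expressions above.
doc = etree.HTML(
    '<div id="box"><ul>'
    '<li><a href="/a">A</a></li>'
    '<li><a href="/b">B</a></li>'
    '</ul></div>'
)

# // : search the whole document for <ul> elements
ul = doc.xpath('//ul')[0]

# nodename : child elements of the current node with that name
child_tags = [el.tag for el in ul.xpath('li')]

# . and / : start at the current node and step through direct children
link_tags = [el.tag for el in ul.xpath('./li/a')]

# .. : parent of the current node
parent_tag = ul.xpath('..')[0].tag

# @ : attribute values (.// keeps the search below the current node)
hrefs = ul.xpath('.//a/@href')

print(child_tags, link_tags, parent_tag, hrefs)
```

Calling `xpath()` on an element rather than on the whole tree makes that element the current node, which is why the relative expressions `li`, `./li/a` and `..` are evaluated against `ul` here.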
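Common Usage
- All nodes: //*
- Child nodes: / for direct children, // for descendants (for example //li/a)
- Parent node: .. or parent::*
- Attribute matching: li[@class="xxx"]
- Text extraction: /text()
- Attribute extraction: @href
- Matching an attribute that holds several values: li[contains(@class,"xxx")]
- Matching on multiple attributes: li[contains(@class,"li_test") and @tag="tag"]
- Selecting by position: li[1], li[last()]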
- Axis selection: li[1]/ancestor::ul, li[1]/attribute::*
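A few of the usages listed above can be sketched briefly, including `//*` and `@href`. This assumes lxml is installed; the snippet and its example.com URLs are hypothetical and serve only as illustration.

```python
from lxml import etree

# Hypothetical snippet for illustration only.
html = etree.HTML('''
<ul>
  <li class="item active"><a href="https://example.com/1">first</a></li>
  <li class="item" tag="x"><a href="https://example.com/2">second</a></li>
</ul>
''')

# All element nodes, in document order (etree.HTML adds <html> and <body>)
all_tags = [el.tag for el in html.xpath('//*')]

# Attribute extraction and text extraction
hrefs = html.xpath('//li/a/@href')
texts = html.xpath('//li/a/text()')

# contains() matches a class attribute that holds several values,
# while = requires the whole attribute value to be equal
active = html.xpath('//li[contains(@class, "active")]/a/text()')
exact = html.xpath('//li[@class="item"]/a/text()')

# Positional selection
last_href = html.xpath('//li[last()]/a/@href')

print(all_tags, hrefs, texts, active, exact, last_href, sep='\n')
```

Note that `contains()` is a plain substring test, so `contains(@class, "item")` would also match a class such as `items`.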
Simple Example
from lxml import etree
text = '''
<div>
<ul class="ul_test">
<li class="first_li"><a href="www.baidu.com">one</a>
<li class="li_test"><a href="www.csdn.net" tag="tag">two</a></li>
<li class="li li_test"><a href="www.163.com">three</a></li>
<li class="li li_test" tag="tag"><a href="www.163.com">four</a></li>
</div>
'''
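# etree.HTML() parses the string and repairs the malformed markup
# (the unclosed first <li> and the missing </ul> above).
html = etree.HTML(text)
# Alternatively, parse from a file: html = etree.parse('./test.html', etree.HTMLParser())
# Serialize the repaired tree; tostring() returns bytes, so decode before printing.
result = etree.tostring(html)
print(result.decode('utf-8')+'\n')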
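# <a> elements with tag="tag": their text, then the class of the parent <li>
print(html.xpath('//li/a[@tag="tag"]/text()'))
print(html.xpath('//li/a[@tag="tag"]/../@class'))
# Exact attribute match versus contains() for a class attribute with several values
print(html.xpath('//li[@class="li_test"]/a/text()'))
print(html.xpath('//li[contains(@class,"li_test")]/a/text()'))
# Combine several attribute conditions with "and"
print(html.xpath('//li[contains(@class,"li_test") and @tag="tag"]/a/text()'))
# Positional selection (XPath positions start at 1)
print(html.xpath('//li[1]/a/text()'),
      html.xpath('//li[last()]/a/text()'),
      html.xpath('//li[position()<3]/a/text()'),
      html.xpath('//li[last()-1]/a/text()'))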
print('*'*20)
# Axes: ancestors, attributes, children, descendants, following nodes, following siblings
print(html.xpath('//li[1]/ancestor::*'),
      html.xpath('//li[1]/ancestor::ul'),
      html.xpath('//li[1]/attribute::*'),
      html.xpath('//li[1]/child::*'),
      html.xpath('//li[1]/descendant::*'),
      html.xpath('//li[1]/following::*'),
      html.xpath('//li[1]/following-sibling::*'), sep="\n")
"""
Output:
<html><body><div>
<ul class="ul_test">
<li class="first_li"><a href="www.baidu.com">one</a>
</li><li class="li_test"><a href="www.csdn.net" tag="tag">two</a></li>
<li class="li li_test"><a href="www.163.com">three</a></li>
<li class="li li_test" tag="tag"><a href="www.163.com">four</a></li>
</ul></div>
</body></html>
['two']
['li_test']
['two']
['two', 'three', 'four']
['four']
['one'] ['four'] ['one', 'two'] ['three']
********************
[<Element html at 0x1d4f8dba688>, <Element body at 0x1d4f8dba6c8>, <Element div at 0x1d4f8dba5c8>, <Element ul at 0x1d4f8dba748>]
[<Element ul at 0x1d4f8dba748>]
['first_li']
[<Element a at 0x1d4f8dba608>]
[<Element a at 0x1d4f8dba608>]
[<Element li at 0x1d4f8dba808>, <Element a at 0x1d4f8dba848>, <Element li at 0x1d4f8dba888>, <Element a at 0x1d4f8dba8c8>, <Element li at 0x1d4f8dba908>, <Element a at 0x1d4f8dba988>]
[<Element li at 0x1d4f8dba808>, <Element li at 0x1d4f8dba888>, <Element li at 0x1d4f8dba908>]
"""
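The axis queries above print Element objects such as `<Element ul at 0x...>`, which are hard to read. The following is a minimal sketch of how such results might be inspected, assuming a shortened, well-formed version of the sample markup; `tag`, `text` and `get()` are standard attributes and methods of lxml elements.

```python
from lxml import etree

# Shortened version of the sample markup (an assumption made for brevity).
text = '''
<div>
<ul class="ul_test">
<li class="first_li"><a href="www.baidu.com">one</a></li>
<li class="li_test"><a href="www.csdn.net" tag="tag">two</a></li>
</ul>
</div>
'''
html = etree.HTML(text)

# Turn the Element objects returned by an axis query into readable tag names
ancestors = [el.tag for el in html.xpath('//li[1]/ancestor::*')]

# Read one attribute from each following sibling
sibling_classes = [el.get('class') for el in html.xpath('//li[1]/following-sibling::*')]

# Inspect a single element: tag name, text content, and an attribute value
link = html.xpath('//li[1]/child::a')[0]
link_info = (link.tag, link.text, link.get('href'))

print(ancestors)
print(sibling_classes)
print(link_info)
```

Extracting `el.tag` or `el.get(...)` in a list comprehension keeps the XPath expression unchanged while giving output that does not depend on memory addresses.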