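I. Introduction to Web Crawlers

A web crawler is an automated program that fetches information from the internet and extracts the parts that are valuable to us. In principle, any language that supports network communication can be used to write one. The choice of language matters little to the crawler itself, but some languages are simply more convenient than others. Today most crawlers are written in server-side scripting languages, and Python is probably the most widely used of them.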

II. Basic Crawler Operations

- Installing and using the Requests module

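Requests is an HTTP library for Python released under the Apache2 License. It wraps lower-level HTTP machinery (it is built on urllib3) in a much friendlier API, which makes sending network requests from Python far more pleasant. With Requests you can easily send virtually any HTTP request a browser can, although it does not execute JavaScript.

Installing the Requests module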

pip3 install requests

1. GET requests

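# 1. Without parameters

import requests

R = requests.get('https://mp.csdn.net/')

print(R.url)   # final URL of the request
print(R.text)  # response body decoded as text



# 2. With parameters

import requests

# params is encoded into the URL's query string
payload = {'key1': 'value1', 'key2': 'value2'}
R = requests.get("https://mp.csdn.net/", params=payload)

print(R.url)
print(R.text)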

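These calls send a GET request to https://mp.csdn.net/. The returned Response object R wraps the response, and the request that was actually sent is available as R.request.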
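To see how params ends up in the URL without contacting the server, one option is to prepare the request locally. This is only a minimal sketch: requests.Request(...).prepare() builds the request but sends nothing, and the variable name prepared is chosen here for illustration.

```python
import requests

payload = {'key1': 'value1', 'key2': 'value2'}

# Build the request and prepare it locally; nothing goes over the network.
prepared = requests.Request('GET', 'https://mp.csdn.net/', params=payload).prepare()

print(prepared.method)  # GET
print(prepared.url)     # https://mp.csdn.net/?key1=value1&key2=value2
```

The same URL is what R.url reports after a real requests.get call, provided the server does not redirect.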

2. POST requests

# 1. Basic POST example

import requests

# A dict passed as data is sent form-encoded in the request body
payload = {'key1': 'value1', 'key2': 'value2'}
R = requests.post("http://www.qwerty.com/", data=payload)

print(R.text)

# 2. Sending request headers together with data

import requests
import json

payload = {'some': 'data'}
headers = {'content-type': 'application/json'}

# The payload is serialized to a JSON string by hand and sent as the body
R = requests.post('https://api.github.com/', data=json.dumps(payload), headers=headers)

print(R.text)
print(R.cookies)
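The two POST examples differ mainly in how the body is encoded. The sketch below compares data= with the json= parameter by preparing both requests locally, so nothing is sent; the names form and as_json are illustrative only.

```python
import json
import requests

url = 'https://api.github.com/'  # same URL as above; never contacted here

# data=<dict> is form-encoded, like an HTML form submission.
form = requests.Request('POST', url, data={'key1': 'value1', 'key2': 'value2'}).prepare()
print(form.headers['Content-Type'])  # application/x-www-form-urlencoded
print(form.body)                     # key1=value1&key2=value2

# json=<dict> is serialized to JSON and the Content-Type header is set for you.
as_json = requests.Request('POST', url, json={'some': 'data'}).prepare()
print(as_json.headers['Content-Type'])  # application/json
print(json.loads(as_json.body))         # {'some': 'data'}
```

Using json= saves the manual json.dumps call and the explicit content-type header shown in the second example.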

3. Other methods of the Requests module

requests.get(url, params=None, **kwargs)
requests.post(url, data=None, json=None, **kwargs)
requests.put(url, data=None, **kwargs)
requests.head(url, **kwargs)
requests.delete(url, **kwargs)
requests.patch(url, data=None, **kwargs)
requests.options(url, **kwargs)
 
# All of the methods above are built on top of this one
requests.request(method, url, **kwargs)

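The requests module already wraps the common HTTP methods, so you can simply call the matching function. The full set of parameters is documented in the source of requests.request: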

def request(method, url, **kwargs):
    """Constructs and sends a :class:`Request <Request>`.

    :param method: method for the new :class:`Request` object.
    :param url: URL for the new :class:`Request` object.
    :param params: (optional) Dictionary or bytes to be sent in the query string for the :class:`Request`.
    :param data: (optional) Dictionary, bytes, or file-like object to send in the body of the :class:`Request`.
    :param json: (optional) json data to send in the body of the :class:`Request`.
    :param headers: (optional) Dictionary of HTTP Headers to send with the :class:`Request`.
    :param cookies: (optional) Dict or CookieJar object to send with the :class:`Request`.
    :param files: (optional) Dictionary of ``'name': file-like-objects`` (or ``{'name': ('filename', fileobj)}``) for multipart encoding upload.
    :param auth: (optional) Auth tuple to enable Basic/Digest/Custom HTTP Auth.
    :param timeout: (optional) How long to wait for the server to send data
        before giving up, as a float, or a :ref:`(connect timeout, read
        timeout) <timeouts>` tuple.
    :type timeout: float or tuple
    :param allow_redirects: (optional) Boolean. Set to True if POST/PUT/DELETE redirect following is allowed.
    :type allow_redirects: bool
    :param proxies: (optional) Dictionary mapping protocol to the URL of the proxy.
    :param verify: (optional) whether the SSL cert will be verified. A CA_BUNDLE path can also be provided. Defaults to ``True``.
    :param stream: (optional) if ``False``, the response content will be immediately downloaded.
    :param cert: (optional) if String, path to ssl client cert file (.pem). If Tuple, ('cert', 'key') pair.
    :return: :class:`Response <Response>` object
    :rtype: requests.Response

    Usage::

      >>> import requests
      >>> req = requests.request('GET', 'http://httpbin.org/get')
      <Response [200]>
    """

    # By using the 'with' statement we are sure the session is closed, thus we
    # avoid leaving sockets open which can trigger a ResourceWarning in some
    # cases, and look like a memory leak in others.
    with sessions.Session() as session:
        return session.request(method=method, url=url, **kwargs)

4. Using requests.request directly

The requests.request(method, url, **kwargs) method

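method: HTTP method to use
url: URL to request
params: parameters passed in the URL query string (typically with GET)
data: data sent in the request body (form-encoded when given a dict)
json: data sent in the request body, serialized as JSON
headers: request headers
cookies: cookies
files: files to upload
auth: basic authentication (adds an Authorization header carrying the Base64-encoded username and password)
timeout: how long to wait for the connection and for the response, in seconds
allow_redirects: whether to follow redirects
proxies: proxies
verify: whether to verify the server's TLS certificate (a boolean, or a path to a CA bundle)
cert: client certificate file
stream: when True, the response body is not downloaded immediately, which suits large files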
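With HTTP Basic authentication, the auth tuple ends up in an Authorization header that is Base64-encoded, not encrypted. The following sketch reproduces that encoding with the standard library only; basic_auth_header is a hypothetical helper written for this illustration, not part of Requests.

```python
import base64


def basic_auth_header(username, password):
    """Build the value of an HTTP Basic Authorization header."""
    raw = f'{username}:{password}'.encode('utf-8')
    token = base64.b64encode(raw).decode('ascii')
    return f'Basic {token}'


header = basic_auth_header('user', 'pass')
print(header)  # Basic dXNlcjpwYXNz

# Base64 is reversible, so anyone who sees the header can recover the credentials.
decoded = base64.b64decode(header.split(' ', 1)[1]).decode('utf-8')
print(decoded)  # user:pass
```

Because the encoding is trivially reversible, Basic authentication is only reasonable over HTTPS.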

import requests

requests.request(
        method='GET',
        url='http://www.baidu.com',
        params={'k1': 'v1', 'k2': 'v2'},
        # data and json both fill the request body; when both are given, Requests uses data
        data={'use': 'alex', 'pwd': '123', 'x': [11, 2, 3]},
        json={'use': 'alex', 'pwd': '123'},
        headers={
                'Referer': 'http://dig.chouti.com/',
                'User-Agent': "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.87 Safari/537.36"
        }

)

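5. The Session object

When testing an API we often call several endpoints and send several requests, and some of those requests need to share data such as cookies.

The Session object in Requests persists certain parameters across requests, and it keeps cookies across all requests made from the same Session instance.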

import requests

# Create a Session object
s = requests.Session()
# Send a GET request through the session; the server sets a cookie
s.get('http://httpbin.org/cookies/set/sessioncookie/123456789')
# Send another GET request through the same session; the cookie is sent back
r = s.get("http://httpbin.org/cookies")

'''
# Result
r.text
 '{"cookies": {"sessioncookie": "123456789"}}'
'''
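The cookie persistence can also be observed offline. In this sketch the cookie is placed in the session's jar by hand, as a stand-in for the Set-Cookie header the server would send, and a later request is prepared without being sent. The domain and path values are assumptions chosen to match the URL used above.

```python
import requests

s = requests.Session()

# Stand-in for the cookie the server would set (assumed domain and path).
s.cookies.set('sessioncookie', '123456789', domain='httpbin.org', path='/')

# Prepare a later request on the same session without sending it.
prepared = s.prepare_request(requests.Request('GET', 'http://httpbin.org/cookies'))
print(prepared.headers['Cookie'])  # sessioncookie=123456789
```

Every request made through s to that domain carries the cookie until it expires or is removed from s.cookies.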

A Session object can also supply default values for the request methods; you set them through attributes of the Session object.

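# Create a Session object
s = requests.Session()
# Set the session's auth attribute; it becomes the default for every request
s.auth = ('user', 'pass')
# Add a default header to the session; headers passed to an individual request are merged with these
s.headers.update({'x-test': 'true'})
# Send a request: no auth is given, so the session's auth is used, and the headers argument is merged with the session's headers
r = s.get('http://httpbin.org/headers', headers={'x-test2': 'true'})
# Inspect the headers that were actually sent
r.request.headers

'''
The request headers include (besides defaults such as User-Agent):
{'Authorization': 'Basic dXNlcjpwYXNz', 'x-test': 'true', 'x-test2': 'true'}
'''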
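The merge of session defaults and per-request values can be checked without a network call. The sketch below relies on Session.prepare_request, which applies the session's settings to a Request object and returns the prepared request; the names req and prepared are illustrative.

```python
import requests

s = requests.Session()
s.auth = ('user', 'pass')
s.headers.update({'x-test': 'true'})

# Merge session defaults with per-request headers; the request is not sent.
req = requests.Request('GET', 'http://httpbin.org/headers', headers={'x-test2': 'true'})
prepared = s.prepare_request(req)

print(prepared.headers['x-test'])         # true
print(prepared.headers['x-test2'])        # true
print(prepared.headers['Authorization'])  # Basic dXNlcjpwYXNz
```

Both custom headers are present, along with the Authorization header derived from the session's auth tuple.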

Method-level parameters override session parameters.

Add an auth argument to the request above:

r = s.get('http://httpbin.org/headers', auth=('user','hah'), headers={'x-test2': 'true'})


'''

The headers of this request now include:
{'Authorization': 'Basic dXNlcjpoYWg=', 'x-test': 'true', 'x-test2': 'true'}

'''

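To leave out a session-level value for a single request, set that key to None in the method-level dict argument (for example headers={'x-test': None}); the key is then omitted from that request.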
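Both rules, overriding and omitting, can be illustrated offline with Session.prepare_request. This is a sketch under the same session settings as above; overridden and omitted are names introduced here.

```python
import requests

s = requests.Session()
s.auth = ('user', 'pass')
s.headers.update({'x-test': 'true'})

# A request-level auth tuple replaces the session's auth for this request only.
overridden = s.prepare_request(
    requests.Request('GET', 'http://httpbin.org/headers', auth=('user', 'hah'))
)
print(overridden.headers['Authorization'])  # Basic dXNlcjpoYWg=

# Setting a header to None at request level drops the session default.
omitted = s.prepare_request(
    requests.Request('GET', 'http://httpbin.org/headers', headers={'x-test': None})
)
print('x-test' in omitted.headers)  # False
print(s.headers['x-test'])          # true (the session itself is unchanged)
```

Neither call changes the session, so later requests fall back to the original defaults.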


A Detailed Introduction to Python Crawlers (Requests)
