
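I. Introduction to Beautiful Soup

Beautiful Soup is a powerful parsing tool that extracts data from a web page by using the page's structure and attributes.

It provides functions for navigating, searching, and modifying the parse tree. You rarely need to think about the document's encoding, because Beautiful Soup converts input to Unicode and output to UTF-8 automatically. Parsing itself is delegated to a parser; a common choice is lxml.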
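To make the parser dependency concrete, here is a minimal sketch. It assumes the built-in "html.parser" backend so that it runs without lxml installed, and the broken_html string is invented for illustration; pass "lxml" instead if that package is available.

```python
from bs4 import BeautifulSoup

# A tiny, deliberately incomplete document: the <p> tag is never closed.
# (Hypothetical sample, not part of test03.html.)
broken_html = "<p>Hello, <b>Beautiful Soup</b>"

# The second argument names the parser. "html.parser" ships with Python;
# "lxml" is a separate package that must be installed first.
soup = BeautifulSoup(broken_html, "html.parser")

# The parser repairs the tree, so navigation still works.
bold_text = soup.b.string
paragraph_text = soup.p.get_text()

print(bold_text)
print(paragraph_text)
```

Different parsers may repair invalid markup in slightly different ways, so it is worth naming the parser explicitly rather than relying on the default.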

II. Using Beautiful Soup

The test file test03.html:

    <!DOCTYPE html>
    <html>
    <head>
        <meta content="text/html;charset=utf-8" http-equiv="content-type" />
        <meta content="IE=Edge" http-equiv="X-UA-Compatible" />
        <meta content="always" name="referrer" />
        <link href="https://ss1.bdstatic.com/5eN1bjq8AAUYm2zgoY3K/r/www/cache/bdorz/baidu.min.css" rel="stylesheet" type="text/css" />
        <title>百度一下，你就知道 </title>
    </head>
    <body link="#0000cc">
      <div id="wrapper">
        <div id="head">
            <div class="head_wrapper">
              <div id="u1">
                <a class="mnav" href="http://news.baidu.com" name="tj_trnews">新聞 </a>
                <a class="mnav" href="https://www.hao123.com" name="tj_trhao123">hao123 </a>
                <a class="mnav" href="http://map.baidu.com" name="tj_trmap">地圖 </a>
                <a class="mnav" href="http://v.baidu.com" name="tj_trvideo">視頻 </a>
                <a class="mnav" href="http://tieba.baidu.com" name="tj_trtieba">貼吧 </a>
                <a class="bri" href="//www.baidu.com/more/" name="tj_briicon" style="display: block;">更多產品 </a>
              </div>
            </div>
        </div>
      </div>
    </body>
    </html>

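1. Node selectors

As we saw earlier, a web page is made up of element nodes, and extracting the content of a given node yields the data shown on the page. Node selectors simplify this process: they let us retrieve the data precisely without resorting to regular expressions.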

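    from bs4 import BeautifulSoup

    # Read the sample page as bytes; the with block closes the file for us.
    with open("./test03.html", "rb") as file:
        html = file.read()
    soup = BeautifulSoup(html, "lxml")

    print(soup.head)        # the head node
    print(soup.head.title)  # the title node nested inside head
    print(soup.a)           # the first a node only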

[Output]

<head>
<meta content="text/html;charset=utf-8" http-equiv="content-type"/>
<meta content="IE=Edge" http-equiv="X-UA-Compatible"/>
<meta content="always" name="referrer"/>
<link href="https://ss1.bdstatic.com/5eN1bjq8AAUYm2zgoY3K/r/www/cache/bdorz/baidu.min.css" rel="stylesheet" type="text/css"/>
<title>百度一下，你就知道 </title>
</head>
<title>百度一下，你就知道 </title>
<a class="mnav" href="http://news.baidu.com" name="tj_trnews">新聞 </a>

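Analysis:

The first print statement returns the head node of the page.

The second print statement returns the title node inside head. It uses a nested selection, because the title node is nested inside the head node.

The third print statement returns an a node. The source contains many a nodes, yet only the first one is returned. When several nodes match, this style of selection returns only the first match and ignores the rest.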
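The first-match behaviour can be checked with a small sketch. The inline HTML is a shortened, hypothetical stand-in for test03.html, and "html.parser" is assumed so that the example runs without lxml.

```python
from bs4 import BeautifulSoup

# Shortened stand-in for the sample page (hypothetical link texts).
html = """
<div id="u1">
  <a class="mnav" href="http://news.baidu.com">news</a>
  <a class="mnav" href="http://map.baidu.com">map</a>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

first_link = soup.a         # attribute access stops at the first <a>
same_link = soup.find("a")  # find() returns that same first match
missing = soup.table        # a tag that is absent yields None

print(first_link["href"])
print(first_link is same_link)
print(missing)
```

Because a missing tag yields None, chaining such as soup.table.tr raises AttributeError; check for None first when a node may be absent.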

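2. Extracting information

The data we need usually lives in a node's name, its attribute values, or its text. The following code shows how to read each of the three: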

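    from bs4 import BeautifulSoup

    # Read the sample page as bytes and parse it with lxml.
    with open("./test03.html", "rb") as file:
        html = file.read()
    soup = BeautifulSoup(html, "lxml")

    print(soup.body.name)              # node name
    print(soup.body.a.attrs["class"])  # class attribute (returned as a list)
    print(soup.body.a.attrs["href"])   # href attribute
    print(soup.body.a.string)          # text content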

[Output]

body
['mnav']
http://news.baidu.com
新聞

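Analysis:

The first statement gets the name of the body node.

The second statement gets the class attribute of the a node. The value is a list, because class can hold several values.

The third statement gets the href attribute of the a node.

The fourth statement gets the text of the a node.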
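The following sketch expands on attribute access. It assumes a single hand-written a tag (with a second class value added for illustration) and the built-in "html.parser"; it is not taken from the original article.

```python
from bs4 import BeautifulSoup

# Hypothetical tag modelled on the sample page, with two class values.
html = '<a class="mnav bri" href="http://news.baidu.com" name="tj_trnews">news </a>'
soup = BeautifulSoup(html, "html.parser")
link = soup.a

classes = link["class"]      # multi-valued attribute, returned as a list
href = link.get("href")      # single-valued attribute, returned as a string
target = link.get("target")  # missing attribute: get() returns None
text = link.string.strip()   # text content without the trailing space

print(classes, href, target, text)
```

Indexing with link["target"] would raise KeyError for the missing attribute, so get() is the safer choice when an attribute is optional.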

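3. Related-node selection

(1) Children and descendants

Direct children are available through the contents and children attributes, and all descendants through the descendants attribute. contents returns a list, children returns an iterator, and descendants returns a generator; all three can be traversed with a for loop.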

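    from bs4 import BeautifulSoup

    # Read the sample page as bytes and parse it with lxml.
    with open("./test03.html", "rb") as file:
        html = file.read()
    soup = BeautifulSoup(html, "lxml")

    # print(soup.body.contents)
    # Number each direct child of body; whitespace between tags counts too.
    for i, content in enumerate(soup.body.contents):
        print(i, content)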

[Output]

0

1 <div id="wrapper">
<div id="head">
<div class="head_wrapper">
<div id="u1">
<a class="mnav" href="http://news.baidu.com" name="tj_trnews">新聞 </a>
<a class="mnav" href="https://www.hao123.com" name="tj_trhao123">hao123 </a>
<a class="mnav" href="http://map.baidu.com" name="tj_trmap">地圖 </a>
<a class="mnav" href="http://v.baidu.com" name="tj_trvideo">視頻 </a>
<a class="mnav" href="http://tieba.baidu.com" name="tj_trtieba">貼吧 </a>
<a class="bri" href="//www.baidu.com/more/" name="tj_briicon" style="display: block;">更多產品 </a>
</div>
</div>
</div>
</div>
2
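Entries 0 and 2 in the output above are the newline text nodes around the div. A minimal sketch of the three attributes, using a hypothetical three-line document and the built-in "html.parser", shows how to keep only the tag children:

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Hypothetical miniature document; the newlines become text nodes.
html = '<body>\n<div id="wrapper"><a href="#1">one</a></div>\n</body>'
soup = BeautifulSoup(html, "html.parser")
body = soup.body

direct_children = body.contents                    # a plain list
tag_children = [child for child in body.children   # iterate, keep tags only
                if isinstance(child, Tag)]
all_descendants = list(body.descendants)           # every nested node, in order

print(len(direct_children))
print([tag["id"] for tag in tag_children])
print(len(all_descendants))
```

Here the descendants are the newline, the div, the a tag, the string "one", and the final newline.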

(2) Parent and ancestors

To get the parent of a node, use the parent attribute. For example, to get the parent of the title node in the sample:

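    from bs4 import BeautifulSoup

    # Read the sample page as bytes and parse it with lxml.
    with open("./test03.html", "rb") as file:
        html = file.read()
    soup = BeautifulSoup(html, "lxml")

    print(soup.title.parent)  # the head node that contains title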

[Output]

<head>
<meta content="text/html;charset=utf-8" http-equiv="content-type"/>
<meta content="IE=Edge" http-equiv="X-UA-Compatible"/>
<meta content="always" name="referrer"/>
<link href="https://ss1.bdstatic.com/5eN1bjq8AAUYm2zgoY3K/r/www/cache/bdorz/baidu.min.css" rel="stylesheet" type="text/css"/>
<title>百度一下，你就知道 </title>
</head>

Similarly, to get all ancestors of a node, use the parents attribute, which yields each ancestor in turn from the nearest outward.
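A short sketch of parents, assuming a hypothetical one-line document and the built-in "html.parser":

```python
from bs4 import BeautifulSoup

# Hypothetical minimal page with a title nested in head.
html = "<html><head><title>demo</title></head><body></body></html>"
soup = BeautifulSoup(html, "html.parser")

title = soup.title
parent_name = title.parent.name
# parents walks upward: head, then html, then the BeautifulSoup object itself.
ancestor_names = [ancestor.name for ancestor in title.parents]

print(parent_name)
print(ancestor_names)
```

The last ancestor is the BeautifulSoup object that represents the whole document; its name is the special string "[document]".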

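(3) Siblings

next_sibling returns the node's next sibling;

previous_sibling returns the node's previous sibling;

next_siblings yields all siblings that follow the node;

previous_siblings yields all siblings that precede the node.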
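One detail worth illustrating is that a sibling can be a whitespace text node rather than a tag. The sketch below assumes a hypothetical div with three links and the built-in "html.parser"; find_next_sibling() and find_next_siblings() are the tag-filtering counterparts of the attributes listed above.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: three links separated by newlines.
html = '<div><a id="first">one</a>\n<a id="second">two</a>\n<a id="third">three</a></div>'
soup = BeautifulSoup(html, "html.parser")
first = soup.find("a", id="first")

raw_next = first.next_sibling             # the newline text node, not a tag
next_tag = first.find_next_sibling("a")   # skips text, returns the next <a>
all_following = list(first.next_siblings) # newline, <a>, newline, <a>
later_ids = [tag["id"] for tag in first.find_next_siblings("a")]

print(repr(raw_next))
print(next_tag["id"])
print(len(all_following))
print(later_ids)
```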

4. Method selectors

find_all()

find_all() finds every element that meets the given conditions. Its signature is:

    find_all(name, attrs, recursive, string, limit, **kwargs)  # string was named text before Beautiful Soup 4.4.0

(1) name

Queries elements by node name. For example, to find the a tags in the sample:

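    from bs4 import BeautifulSoup

    # Read the sample page as bytes and parse it with lxml.
    with open("./test03.html", "rb") as file:
        html = file.read()
    soup = BeautifulSoup(html, "lxml")

    print(soup.find_all(name="a"))  # the whole result list at once

    for a in soup.find_all(name="a"):  # one a tag per line
        print(a)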

[Output]

[<a class="mnav" href="http://news.baidu.com" name="tj_trnews">新聞 </a>, <a class="mnav" href="https://www.hao123.com" name="tj_trhao123">hao123 </a>, <a class="mnav" href="http://map.baidu.com" name="tj_trmap">地圖 </a>, <a class="mnav" href="http://v.baidu.com" name="tj_trvideo">視頻 </a>, <a class="mnav" href="http://tieba.baidu.com" name="tj_trtieba">貼吧 </a>, <a class="bri" href="//www.baidu.com/more/" name="tj_briicon" style="display: block;">更多產品 </a>]
<a class="mnav" href="http://news.baidu.com" name="tj_trnews">新聞 </a>
<a class="mnav" href="https://www.hao123.com" name="tj_trhao123">hao123 </a>
<a class="mnav" href="http://map.baidu.com" name="tj_trmap">地圖 </a>
<a class="mnav" href="http://v.baidu.com" name="tj_trvideo">視頻 </a>
<a class="mnav" href="http://tieba.baidu.com" name="tj_trtieba">貼吧 </a>
<a class="bri" href="//www.baidu.com/more/" name="tj_briicon" style="display: block;">更多產品 </a>

(2) attrs

A query can also filter on tag attributes. The attrs parameter takes a dictionary that maps attribute names to the values to match.

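    from bs4 import BeautifulSoup

    # Read the sample page as bytes and parse it with lxml.
    with open("./test03.html", "rb") as file:
        html = file.read()
    soup = BeautifulSoup(html, "lxml")

    # Keep only the a tags whose class attribute contains "bri".
    print(soup.find_all(name="a", attrs={"class": "bri"}))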

[Output]

[<a class="bri" href="//www.baidu.com/more/" name="tj_briicon" style="display: block;">更多產品 </a>]

As you can see, once the class="bri" condition is added, only one a tag remains in the result.
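Attribute filters can be written in more than one way. The sketch below assumes two hypothetical a tags and the built-in "html.parser". It compares the attrs dictionary with the class_ keyword argument, and shows why an attribute called name has to go through attrs: the name keyword is already taken by the tag-name parameter.

```python
from bs4 import BeautifulSoup

# Hypothetical pair of links modelled on the sample page.
html = (
    '<a class="mnav" href="/news" name="tj_trnews">news</a>'
    '<a class="bri" href="/more" name="tj_briicon">more</a>'
)
soup = BeautifulSoup(html, "html.parser")

by_dict = soup.find_all("a", attrs={"class": "bri"})
by_keyword = soup.find_all("a", class_="bri")  # class is reserved, hence class_
by_name_attr = soup.find_all("a", attrs={"name": "tj_briicon"})

print(by_dict[0]["href"])
print(len(by_keyword))
print(by_name_attr[0].string)
```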

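(3) string (named text before Beautiful Soup 4.4.0)

The string parameter (text in older versions) matches against a node's text. It accepts either a plain string or a compiled regular expression object.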

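    import re
    from bs4 import BeautifulSoup

    # Read the sample page as bytes and parse it with lxml.
    with open("./test03.html", "rb") as file:
        html = file.read()
    soup = BeautifulSoup(html, "lxml")

    # string= is the current name of the older text= argument.
    print(soup.find_all(name="a", string=re.compile("新聞")))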

[Output]

[<a class="mnav" href="http://news.baidu.com" name="tj_trnews">新聞 </a>]

The result contains only the a tag whose text matches "新聞".
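The difference between a plain string and a regular expression matters here because the link text in the sample ends with a space. A minimal sketch, assuming two hypothetical links with ASCII text and the built-in "html.parser":

```python
import re
from bs4 import BeautifulSoup

# Hypothetical links; note the trailing space inside each tag.
html = '<a href="/news">news </a><a href="/map">map </a>'
soup = BeautifulSoup(html, "html.parser")

# A plain string must equal the node text exactly, trailing space included.
exact_without_space = soup.find_all("a", string="news")
exact_with_space = soup.find_all("a", string="news ")
# A compiled pattern is applied with search(), so a partial match is enough.
regex_hits = soup.find_all("a", string=re.compile("news"))

print(len(exact_without_space), len(exact_with_space), len(regex_hits))
```

This is one reason a regular expression is a convenient choice when the page text carries stray whitespace.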

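find()

find() is used in the same way as find_all(). The only difference is that find() returns just the first matching element as a single element (or None when nothing matches), whereas find_all() returns a list of all matching elements.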
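The contrast between the two methods can be sketched as follows, assuming a hypothetical three-item list and the built-in "html.parser". The limit argument of find_all() is shown as well, since it caps the number of results.

```python
from bs4 import BeautifulSoup

# Hypothetical list used only for this comparison.
html = "<ul><li>one</li><li>two</li><li>three</li></ul>"
soup = BeautifulSoup(html, "html.parser")

first_item = soup.find("li")              # a single Tag
all_items = soup.find_all("li")           # a list of Tags
first_two = soup.find_all("li", limit=2)  # stop after two matches
missing_one = soup.find("table")          # None when nothing matches
missing_all = soup.find_all("table")      # empty list when nothing matches

print(first_item.string)
print([li.string for li in all_items])
print([li.string for li in first_two])
print(missing_one, len(missing_all))
```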

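5. CSS selectors

To use a CSS selector, call the select() method and pass it the selector as a string.

For example, to get the a tags in the sample with a CSS selector: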

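    from bs4 import BeautifulSoup

    # Read the sample page as bytes and parse it with lxml.
    with open("./test03.html", "rb") as file:
        html = file.read()
    soup = BeautifulSoup(html, "lxml")

    print(soup.select("a"))  # the whole result list at once

    for a in soup.select("a"):  # one a tag per line
        print(a)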

[Output]

[<a class="mnav" href="http://news.baidu.com" name="tj_trnews">新聞 </a>, <a class="mnav" href="https://www.hao123.com" name="tj_trhao123">hao123 </a>, <a class="mnav" href="http://map.baidu.com" name="tj_trmap">地圖 </a>, <a class="mnav" href="http://v.baidu.com" name="tj_trvideo">視頻 </a>, <a class="mnav" href="http://tieba.baidu.com" name="tj_trtieba">貼吧 </a>, <a class="bri" href="//www.baidu.com/more/" name="tj_briicon" style="display: block;">更多產品 </a>]
<a class="mnav" href="http://news.baidu.com" name="tj_trnews">新聞 </a>
<a class="mnav" href="https://www.hao123.com" name="tj_trhao123">hao123 </a>
<a class="mnav" href="http://map.baidu.com" name="tj_trmap">地圖 </a>
<a class="mnav" href="http://v.baidu.com" name="tj_trvideo">視頻 </a>
<a class="mnav" href="http://tieba.baidu.com" name="tj_trtieba">貼吧 </a>
<a class="bri" href="//www.baidu.com/more/" name="tj_briicon" style="display: block;">更多產品 </a>
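Beyond a bare tag name, select() accepts the usual CSS syntax for classes, ids, child combinators, and attribute values. The sketch below assumes a shortened, hypothetical copy of the u1 block and the built-in "html.parser"; select_one() is the single-result counterpart of select().

```python
from bs4 import BeautifulSoup

# Shortened stand-in for the u1 block of the sample page.
html = """
<div id="u1">
  <a class="mnav" href="http://news.baidu.com" name="tj_trnews">news</a>
  <a class="mnav" href="http://map.baidu.com" name="tj_trmap">map</a>
  <a class="bri" href="//www.baidu.com/more/" name="tj_briicon">more</a>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

by_class = soup.select("a.mnav")                # tag plus class
by_id_child = soup.select("#u1 > a")            # direct children of id="u1"
by_attr = soup.select('a[name="tj_briicon"]')   # attribute value match
first_only = soup.select_one("a")               # first match or None

print(len(by_class), len(by_id_child), len(by_attr))
print(first_only["href"])
```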

Getting attributes

To get the href attribute of the a tags above:

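    from bs4 import BeautifulSoup

    # Read the sample page as bytes and parse it with lxml.
    with open("./test03.html", "rb") as file:
        html = file.read()
    soup = BeautifulSoup(html, "lxml")

    for a in soup.select("a"):
        print(a["href"])  # index a tag like a dictionary to read an attribute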

[Output]

http://news.baidu.com
https://www.hao123.com
http://map.baidu.com
http://v.baidu.com
http://tieba.baidu.com
//www.baidu.com/more/

Getting text

To get the text of the a tags above, use either the get_text() method or the string attribute.

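    from bs4 import BeautifulSoup

    # Read the sample page as bytes and parse it with lxml.
    with open("./test03.html", "rb") as file:
        html = file.read()
    soup = BeautifulSoup(html, "lxml")

    for a in soup.select("a"):
        print(a.get_text())  # all text inside the tag, joined together
        print(a.string)      # the tag's single string child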
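For the sample page the two calls print the same text, because every a tag holds exactly one string. They differ once a tag contains nested markup, as the following sketch suggests. It assumes two hypothetical links and the built-in "html.parser".

```python
from bs4 import BeautifulSoup

# Hypothetical links: the second one mixes a nested tag with plain text.
html = '<a href="/news">news </a><a href="/more"><b>more</b> products</a>'
soup = BeautifulSoup(html, "html.parser")
simple, nested = soup.select("a")

simple_string = simple.string                 # "news " (single string child)
simple_clean = simple.get_text(strip=True)    # "news" with whitespace stripped
nested_string = nested.string                 # None: more than one child
nested_text = nested.get_text()               # text of all descendants joined

print(repr(simple_string), repr(simple_clean))
print(nested_string, repr(nested_text))
```

When a tag may contain nested markup, get_text() is the more robust choice, since string returns None in that case.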