Python Web Crawling in Depth: A Guide to Using Beautiful Soup

2022-01-12 00:39 · 小狐貍夢想去童話鎮 · Python

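In short, Beautiful Soup is a Python library whose main job is extracting data from web pages. It provides simple, Pythonic functions for navigating, searching, and modifying the parse tree.

I. Introduction to Beautiful Soup

Beautiful Soup is a powerful parsing tool that uses the structure and attributes of a web page to parse it.

It provides functions for navigating, searching, and modifying the parse tree. You rarely need to think about a document's encoding: Beautiful Soup converts incoming documents to Unicode and outgoing documents to UTF-8. Beautiful Soup does not parse markup itself; it relies on a parser, and lxml is a common choice.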
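As a small illustration of the encoding handling and the parser dependency described above, the sketch below feeds raw UTF-8 bytes to Beautiful Soup. It assumes Python's built-in "html.parser" so that it runs without lxml installed, and the sample markup is made up for this example.

```python
from bs4 import BeautifulSoup

# Hypothetical sample page, encoded to bytes the way a downloaded file would be.
raw = '<html><head><meta charset="utf-8"><title>百度一下</title></head></html>'.encode("utf-8")

# The second argument names the parser. "html.parser" is bundled with Python;
# pass "lxml" instead if that package is installed.
soup = BeautifulSoup(raw, "html.parser")

title_text = soup.title.string
print(title_text)                   # decoded text, no manual .decode() call needed
print(isinstance(title_text, str))  # the text node behaves like a Python str
```

Different parsers can build slightly different trees from malformed markup, so it is worth naming the parser explicitly rather than relying on the default.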

II. Using Beautiful Soup

The examples below parse a test file, test03.html:

<!DOCTYPE html>
<html>
<head>
    <meta content="text/html;charset=utf-8" http-equiv="content-type" />
    <meta content="IE=Edge" http-equiv="X-UA-Compatible" />
    <meta content="always" name="referrer" />
    <link href="https://ss1.bdstatic.com/5eN1bjq8AAUYm2zgoY3K/r/www/cache/bdorz/baidu.min.css" rel="stylesheet" type="text/css" />
    <title>百度一下,你就知道 </title>
</head>
<body link="#0000cc">
  <div id="wrapper">
    <div id="head">
        <div class="head_wrapper">
          <div id="u1">
            <a class="mnav" href="http://news.baidu.com" name="tj_trnews">新聞 </a>
            <a class="mnav" href="https://www.hao123.com" name="tj_trhao123">hao123 </a>
            <a class="mnav" href="http://map.baidu.com" name="tj_trmap">地圖 </a>
            <a class="mnav" href="http://v.baidu.com" name="tj_trvideo">視頻 </a>
            <a class="mnav" href="http://tieba.baidu.com" name="tj_trtieba">貼吧 </a>
            <a class="bri" href="//www.baidu.com/more/" name="tj_briicon" style="display: block;">更多產品 </a>
          </div>
        </div>
    </div>
  </div>
</body>
</html>

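1. Node selectors

As covered earlier, a web page is made up of element nodes, and extracting the content of a particular node gives you the data shown on the page. Node selectors simplify this: they let you pick out data precisely without writing regular expressions.

from bs4 import BeautifulSoup

# Read the test page as bytes; Beautiful Soup works out the encoding itself.
with open("./test03.html", "rb") as file:
    html = file.read()

soup = BeautifulSoup(html, "lxml")
print(soup.head)        # the <head> node
print(soup.head.title)  # nested selection: <title> inside <head>
print(soup.a)           # only the first <a> node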

[Output]

<head>
<meta content="text/html;charset=utf-8" http-equiv="content-type"/>
<meta content="IE=Edge" http-equiv="X-UA-Compatible"/>
<meta content="always" name="referrer"/>
<link href="https://ss1.bdstatic.com/5eN1bjq8AAUYm2zgoY3K/r/www/cache/bdorz/baidu.min.css" rel="stylesheet" type="text/css"/>
<title>百度一下,你就知道 </title>
</head>
<title>百度一下,你就知道 </title>
<a class="mnav" href="http://news.baidu.com" name="tj_trnews">新聞 </a>

Analysis:

The first print statement outputs the page's head node.

The second outputs the title node inside the head node. It uses a nested selection, because the title node sits inside the head node.

The third outputs an a node. The source contains several a nodes, but only the first one is returned. When several nodes share a name, this style of selection picks the first match and ignores the rest.
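The first-match behavior is easy to check on a tiny document. The sketch below uses made-up markup and the built-in "html.parser" (an assumption, chosen so that lxml is not required); it also shows that selecting a tag name that does not occur gives None rather than raising an error.

```python
from bs4 import BeautifulSoup

# Hypothetical two-link snippet, not the article's test03.html.
html = '<div><a href="/first">one</a><a href="/second">two</a></div>'
soup = BeautifulSoup(html, "html.parser")

first_link = soup.a   # attribute-style access stops at the first <a>
missing = soup.img    # there is no <img> in the document, so this is None

print(first_link["href"])
print(missing)
```

Because a missing node comes back as None, chained selections such as soup.img.attrs fail with an AttributeError, so it can be worth checking for None before going further.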

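2. Extracting information

The data we need usually lives in a node's name, its attribute values, or its text. The following code shows how to read all three:

from bs4 import BeautifulSoup

with open("./test03.html", "rb") as file:
    html = file.read()

soup = BeautifulSoup(html, "lxml")
print(soup.body.name)              # node name
print(soup.body.a.attrs['class'])  # class attribute of the first <a>
print(soup.body.a.attrs['href'])   # href attribute of the first <a>
print(soup.body.a.string)          # text of the first <a>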

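[Output]

body
['mnav']
http://news.baidu.com
新聞 

Analysis:

The first line is the name of the body node.

The second is the class attribute of the a node. It is a list, because class can hold several values.

The third is the href attribute of the a node.

The fourth is the text of the a node.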
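Attribute access can be made a little more defensive. The following sketch, on made-up markup with the built-in "html.parser" (both assumptions), uses Tag.get(), which returns None for an attribute that is absent, whereas indexing with attrs['...'] raises KeyError.

```python
from bs4 import BeautifulSoup

# Hypothetical link with two class values.
html = '<a class="mnav top" href="http://news.baidu.com">新聞 </a>'
soup = BeautifulSoup(html, "html.parser")
link = soup.a

classes = link.get("class")   # class is multi-valued, so a list comes back
href = link.get("href")       # ordinary attributes come back as strings
title = link.get("title")     # absent attribute: None instead of KeyError
text = link.string.strip()    # drop the trailing space present in the markup

print(classes, href, title, text)
```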

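3. Selecting related nodes

(1) Children and descendants

Use the contents or children attribute to get a node's direct children, and the descendants attribute to get all of its descendants. contents returns a list, while children and descendants return iterators; all three can be traversed with a for loop.

from bs4 import BeautifulSoup

with open("./test03.html", "rb") as file:
    html = file.read()

soup = BeautifulSoup(html, "lxml")
# print(soup.body.contents)  # uncomment to see the whole list at once
# Number each direct child of <body>, including whitespace text nodes.
for i, content in enumerate(soup.body.contents):
    print(i, content)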

[Output]

0

1 <div id="wrapper">
<div id="head">
<div class="head_wrapper">
<div id="u1">
<a class="mnav" href="http://news.baidu.com" name="tj_trnews">新聞 </a>
<a class="mnav" href="https://www.hao123.com" name="tj_trhao123">hao123 </a>
<a class="mnav" href="http://map.baidu.com" name="tj_trmap">地圖 </a>
<a class="mnav" href="http://v.baidu.com" name="tj_trvideo">視頻 </a>
<a class="mnav" href="http://tieba.baidu.com" name="tj_trtieba">貼吧 </a>
<a class="bri" href="//www.baidu.com/more/" name="tj_briicon" style="display: block;">更多產品 </a>
</div>
</div>
</div>
</div>
2
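
To make the difference between contents, children, and descendants concrete, here is a minimal sketch on a compact, made-up document (no whitespace between tags, parsed with the built-in "html.parser"; both are assumptions for the example).

```python
from bs4 import BeautifulSoup

# Hypothetical compact markup, so no whitespace text nodes appear.
html = '<body><div id="u1"><a href="/a">A</a><a href="/b">B</a></div></body>'
soup = BeautifulSoup(html, "html.parser")
body = soup.body

direct = body.contents                   # a plain list of direct children
child_count = len(list(body.children))   # children iterates over the same nodes
all_nodes = list(body.descendants)       # every nested tag and text node

print(type(direct).__name__, len(direct))
print(child_count)
print(len(all_nodes))
```

Here body has one direct child (the div), while descendants also walks into the div and yields both a tags and their text nodes.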

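(2) Parent and ancestors

Use the parent attribute to get a node's parent. For example, to get the parent of the title node in the test file:

from bs4 import BeautifulSoup

with open("./test03.html", "rb") as file:
    html = file.read()

soup = BeautifulSoup(html, "lxml")
print(soup.title.parent)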

[Output]

<head>
<meta content="text/html;charset=utf-8" http-equiv="content-type"/>
<meta content="IE=Edge" http-equiv="X-UA-Compatible"/>
<meta content="always" name="referrer"/>
<link href="https://ss1.bdstatic.com/5eN1bjq8AAUYm2zgoY3K/r/www/cache/bdorz/baidu.min.css" rel="stylesheet" type="text/css"/>
<title>百度一下,你就知道 </title>
</head>

Similarly, to get a node's ancestors, use the parents attribute, which yields each ancestor in turn, starting from the nearest.
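A short sketch of walking the ancestors, assuming a minimal made-up document and the built-in "html.parser":

```python
from bs4 import BeautifulSoup

# Hypothetical minimal page.
html = "<html><head><title>T</title></head></html>"
soup = BeautifulSoup(html, "html.parser")

# parents is a generator; collect the tag names from nearest to farthest.
ancestor_names = [p.name for p in soup.title.parents]
print(ancestor_names)
```

The last entry is the BeautifulSoup object itself, which reports its name as "[document]".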

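(3) Siblings

next_sibling returns the node's next sibling;

previous_sibling returns the node's previous sibling;

next_siblings returns an iterator over all siblings that follow the node;

previous_siblings returns an iterator over all siblings that precede the node.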
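
One detail that often surprises people is that whitespace between tags counts as a sibling too. The sketch below, on made-up markup with the built-in "html.parser" (assumptions for the example), shows next_sibling returning a newline text node, and two ways to reach the next tag instead.

```python
from bs4 import BeautifulSoup, Tag

# Hypothetical markup with newlines between the links, as in a formatted page.
html = '<div><a id="n1">one</a>\n<a id="n2">two</a>\n<a id="n3">three</a></div>'
soup = BeautifulSoup(html, "html.parser")
first = soup.find("a", id="n1")

raw_next = first.next_sibling             # the "\n" text node, not a tag
next_tag = first.find_next_sibling("a")   # skips text nodes, returns a tag

# next_siblings yields text nodes as well, so keep only Tag objects.
following_ids = [s["id"] for s in first.next_siblings if isinstance(s, Tag)]

print(repr(raw_next), next_tag["id"], following_ids)
```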

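4. Method selectors

find_all()

Finds all elements that match the given conditions. Its signature is as follows (the string argument was called text in older releases, and that name is still accepted):

find_all(name, attrs, recursive, string, limit, **kwargs)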

(1) name

Searches by node name. For example, to find the a tags in the test file:

from bs4 import BeautifulSoup

with open("./test03.html", "rb") as file:
    html = file.read()

soup = BeautifulSoup(html, "lxml")
print(soup.find_all(name="a"))   # the whole result list
for a in soup.find_all(name="a"):
    print(a)                     # one matching tag per line

[Output]

[<a class="mnav" href="http://news.baidu.com" name="tj_trnews">新聞 </a>, <a class="mnav" href="https://www.hao123.com" name="tj_trhao123">hao123 </a>, <a class="mnav" href="http://map.baidu.com" name="tj_trmap">地圖 </a>, <a class="mnav" href="http://v.baidu.com" name="tj_trvideo">視頻 </a>, <a class="mnav" href="http://tieba.baidu.com" name="tj_trtieba">貼吧 </a>, <a class="bri" href="//www.baidu.com/more/" name="tj_briicon" style="display: block;">更多產品 </a>]
<a class="mnav" href="http://news.baidu.com" name="tj_trnews">新聞 </a>
<a class="mnav" href="https://www.hao123.com" name="tj_trhao123">hao123 </a>
<a class="mnav" href="http://map.baidu.com" name="tj_trmap">地圖 </a>
<a class="mnav" href="http://v.baidu.com" name="tj_trvideo">視頻 </a>
<a class="mnav" href="http://tieba.baidu.com" name="tj_trtieba">貼吧 </a>
<a class="bri" href="//www.baidu.com/more/" name="tj_briicon" style="display: block;">更多產品 </a>

(2) attrs

You can also filter by tag attributes. The attrs argument is a dictionary.

from bs4 import BeautifulSoup

with open("./test03.html", "rb") as file:
    html = file.read()

soup = BeautifulSoup(html, "lxml")
# Keep only the <a> tags whose class attribute is "bri".
print(soup.find_all(name="a", attrs={"class": "bri"}))

[Output]

[<a class="bri" href="//www.baidu.com/more/" name="tj_briicon" style="display: block;">更多產品 </a>]

With the class="bri" filter added, only one a tag remains in the result.
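A few variations on attribute filtering, sketched on a cut-down, made-up version of the link list and parsed with the built-in "html.parser" (assumptions for the example). The class_ keyword is a shorthand for filtering on class, the attrs dictionary is needed for an HTML attribute called name (because name= already means the tag name), and limit caps the number of results.

```python
from bs4 import BeautifulSoup

# Hypothetical excerpt modeled on the links in the test file.
html = (
    '<div id="u1">'
    '<a class="mnav" href="http://news.baidu.com" name="tj_trnews">新聞 </a>'
    '<a class="mnav" href="http://map.baidu.com" name="tj_trmap">地圖 </a>'
    '<a class="bri" href="//www.baidu.com/more/" name="tj_briicon">更多產品 </a>'
    '</div>'
)
soup = BeautifulSoup(html, "html.parser")

by_class = soup.find_all("a", class_="mnav")                    # class shorthand
by_name_attr = soup.find_all("a", attrs={"name": "tj_trmap"})   # HTML name attribute
first_only = soup.find_all("a", limit=1)                        # stop after one match

print(len(by_class), by_name_attr[0]["href"], len(first_only))
```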

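(3) string (formerly text)

The string argument (called text in older releases, a name that is still accepted) matches on a node's text. It can be a string or a compiled regular expression.

import re
from bs4 import BeautifulSoup

with open("./test03.html", "rb") as file:
    html = file.read()

soup = BeautifulSoup(html, "lxml")
# Match <a> tags whose text contains "新聞".
print(soup.find_all(name="a", string=re.compile('新聞')))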

[Output]

[<a class="mnav" href="http://news.baidu.com" name="tj_trnews">新聞 </a>]

The result contains only the a tag whose text contains "新聞".
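A regular expression is used above for a practical reason: the link text in the test file ends with a space, and a plain string must match the text exactly. The sketch below illustrates this on made-up markup with the built-in "html.parser" (assumptions for the example), and also shows a function as a third kind of filter.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical excerpt; note the trailing space inside each link.
html = '<a href="http://news.baidu.com">新聞 </a><a href="http://map.baidu.com">地圖 </a>'
soup = BeautifulSoup(html, "html.parser")

# A plain string must equal the text exactly, so the trailing space prevents a match.
exact = soup.find_all("a", string="新聞")

# A compiled pattern is searched for inside the text.
by_regex = soup.find_all("a", string=re.compile("新聞"))

# A function receives the tag's string (possibly None) and returns True or False.
by_func = soup.find_all("a", string=lambda s: s is not None and s.strip() == "新聞")

print(exact, len(by_regex), by_func[0]["href"])
```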

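find()

find() is used the same way as find_all(). The only difference is that find() matches just the first element found and returns that single element (or None if nothing matches), whereas find_all() matches every qualifying element and returns a list.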
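
A minimal sketch of find(), including the no-match case, on made-up markup with the built-in "html.parser" (assumptions for the example):

```python
from bs4 import BeautifulSoup

# Hypothetical two-link snippet.
html = '<div><a class="mnav" href="/news">新聞 </a><a class="bri" href="/more">更多產品 </a></div>'
soup = BeautifulSoup(html, "html.parser")

first = soup.find("a")                    # the first matching tag
missing = soup.find("a", class_="none")   # nothing matches, so None

# Guard against None before reading attributes.
missing_href = missing["href"] if missing is not None else None

print(first["href"], missing_href)
```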

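5. CSS selectors

To use CSS selectors, call the select() method and pass in a CSS selector.

For example, to get the a tags in the test file with a CSS selector:

from bs4 import BeautifulSoup

with open("./test03.html", "rb") as file:
    html = file.read()

soup = BeautifulSoup(html, "lxml")
print(soup.select('a'))   # the whole result list
for a in soup.select('a'):
    print(a)              # one matching tag per line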

[Output]

[<a class="mnav" href="http://news.baidu.com" name="tj_trnews">新聞 </a>, <a class="mnav" href="https://www.hao123.com" name="tj_trhao123">hao123 </a>, <a class="mnav" href="http://map.baidu.com" name="tj_trmap">地圖 </a>, <a class="mnav" href="http://v.baidu.com" name="tj_trvideo">視頻 </a>, <a class="mnav" href="http://tieba.baidu.com" name="tj_trtieba">貼吧 </a>, <a class="bri" href="//www.baidu.com/more/" name="tj_briicon" style="display: block;">更多產品 </a>]
<a class="mnav" href="http://news.baidu.com" name="tj_trnews">新聞 </a>
<a class="mnav" href="https://www.hao123.com" name="tj_trhao123">hao123 </a>
<a class="mnav" href="http://map.baidu.com" name="tj_trmap">地圖 </a>
<a class="mnav" href="http://v.baidu.com" name="tj_trvideo">視頻 </a>
<a class="mnav" href="http://tieba.baidu.com" name="tj_trtieba">貼吧 </a>
<a class="bri" href="//www.baidu.com/more/" name="tj_briicon" style="display: block;">更多產品 </a>
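
select() accepts more than bare tag names. The sketch below tries an id plus class selector, an attribute-prefix selector, and select_one(), which returns the first match or None. The markup is a made-up excerpt and the parser is the built-in "html.parser" (assumptions for the example).

```python
from bs4 import BeautifulSoup

# Hypothetical excerpt modeled on the links in the test file.
html = (
    '<div id="u1">'
    '<a class="mnav" href="http://news.baidu.com">新聞 </a>'
    '<a class="mnav" href="https://www.hao123.com">hao123 </a>'
    '<a class="bri" href="//www.baidu.com/more/">更多產品 </a>'
    '</div>'
)
soup = BeautifulSoup(html, "html.parser")

mnav_links = soup.select("#u1 > a.mnav")        # direct children of #u1 with class mnav
https_links = soup.select('a[href^="https"]')   # href starts with "https"
more = soup.select_one("a.bri")                 # first match only, or None

print(len(mnav_links), https_links[0]["href"], more["href"])
```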

Getting attributes

To get the href attribute of the a tags above:

from bs4 import BeautifulSoup

with open("./test03.html", "rb") as file:
    html = file.read()

soup = BeautifulSoup(html, "lxml")
for a in soup.select('a'):
    print(a['href'])   # index a tag like a dictionary to read an attribute

[Output]

http://news.baidu.com
https://www.hao123.com
http://map.baidu.com
http://v.baidu.com
http://tieba.baidu.com
//www.baidu.com/more/
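
The last href in this output is protocol-relative (it starts with //), so it cannot be requested as it stands. One possible way to normalize such links is urllib.parse.urljoin from the standard library. In the sketch below the base URL is an assumption, since the article does not say where the page was downloaded from, and the markup is a made-up excerpt.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

# Assumed address of the page; not stated in the article.
base_url = "https://www.baidu.com/"

# Hypothetical excerpt with one absolute and one protocol-relative link.
html = '<a href="http://news.baidu.com">新聞 </a><a href="//www.baidu.com/more/">更多產品 </a>'
soup = BeautifulSoup(html, "html.parser")

# urljoin leaves absolute URLs alone and fills in the scheme for "//..." links.
absolute = [urljoin(base_url, a["href"]) for a in soup.select("a")]
print(absolute)
```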

Getting text

To get the text of the a tags above, use the get_text() method or the string attribute:

from bs4 import BeautifulSoup

with open("./test03.html", "rb") as file:
    html = file.read()

soup = BeautifulSoup(html, "lxml")
for a in soup.select('a'):
    print(a.get_text())   # all text inside the tag
    print(a.string)       # the tag's single text child

[Output]

新聞
新聞
hao123
hao123
地圖
地圖
視頻
視頻
貼吧
貼吧
更多產品
更多產品
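
The two approaches print the same thing here because every a tag holds a single text node. They differ once a tag has nested children, as the following sketch shows on made-up markup parsed with the built-in "html.parser" (assumptions for the example).

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the second link contains a nested <b> tag.
html = '<p><a href="/news">新聞 </a><a href="/more">更多 <b>產品</b></a></p>'
soup = BeautifulSoup(html, "html.parser")
simple, nested = soup.find_all("a")

simple_string = simple.string                # single text child, returned as-is
nested_string = nested.string                # more than one child, so None
nested_text = nested.get_text()              # concatenates all nested text
nested_clean = nested.get_text(strip=True)   # strips each piece before joining

print(repr(simple_string), nested_string, repr(nested_text), repr(nested_clean))
```

In short, get_text() is the safer choice when a tag may contain further markup, and strip=True removes stray whitespace such as the trailing spaces in the test file.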


Original article: https://blog.csdn.net/gets_s/article/details/120372061
