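Getting the home page element information:
Target test_URL: http://www.xxx.com.cn/
First inspect the page elements. The links we want to crawl sit under the a tags, so we take the selector path to those links and use it to locate the information we need.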
    soup = Bs4(response.text, "lxml")
    urls_li = soup.select("#mainmenu_top > div > div > ul > li")
Getting the URL links on the home page:
The code below collects the URL links found in the home page menu:
    # Assumes requests, Bs4 (BeautifulSoup), head_url and headers are defined
    # as in the complete listing at the end.
    def get_first_url():
        list_href = []
        response = requests.get("http://www.xxx.com.cn", headers=headers)
        soup = Bs4(response.text, "lxml")
        urls_li = soup.select("#mainmenu_top > div > div > ul > li")
        for url_li in urls_li:
            urls = url_li.select("a")
            for url in urls:
                url_href = url.get("href")
                if url_href:  # skip <a> tags that have no href
                    list_href.append(head_url + url_href)
        out_url = list(set(list_href))  # de-duplicate the links
        for reg in out_url:
            print(reg)
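The function above builds absolute URLs by concatenating head_url and each href, which assumes every menu link is site-relative. As a hedged alternative, the standard library's urllib.parse.urljoin resolves relative, root-relative and already-absolute links alike. This is only a minimal sketch: the helper name to_absolute and the sample hrefs are made up for illustration and do not come from the original article.

```python
from urllib.parse import urljoin

head_url = "http://www.xxx.com.cn"  # same placeholder site as in the article


def to_absolute(base, hrefs):
    """Resolve each href against base and drop duplicates (hypothetical helper)."""
    absolute = set()
    for href in hrefs:
        if href:  # skip None or empty href values
            absolute.add(urljoin(base + "/", href))
    return sorted(absolute)


# Made-up href values of the kinds a menu might contain.
sample_hrefs = ["/news/", "about.html", "http://www.xxx.com.cn/news/", None]
for link in to_absolute(head_url, sample_hrefs):
    print(link)
```

Because the results are collected in a set, the root-relative "/news/" and its absolute form end up as a single entry.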
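Traversing the results of the first pass:
Building on the URLs obtained in step two, request each page in turn, collect the URL links on it, and filter out the ones we do not need.
The code is as follows: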
    def get_next_url(urllist):
        url_list = []
        for url in urllist:
            response = requests.get(url, headers=headers)
            soup = Bs4(response.text, "lxml")
            for url2 in soup.find_all("a"):
                url2_1 = url2.get("href")
                if not url2_1:
                    continue  # <a> tag without an href
                if url2_1[0] == "/":
                    # site-relative link: prefix it with the site root
                    url_list.append(head_url + url2_1)
                elif url2_1.startswith(head_url):
                    # absolute link that stays on the same site
                    url_list.append(url2_1)
        url_list2 = set(url_list)  # de-duplicate the links
        for url_ in url_list2:
            res = requests.get(url_, headers=headers)
            if res.status_code == 200:
                print(url_)
        print(len(url_list2))
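The filtering step keeps a link only if it is site-relative or begins with the site root. A slightly more robust variant, sketched here under stated assumptions, compares the host name with urllib.parse.urlparse and also drops fragments and non-HTTP schemes such as mailto: and javascript:. The helper name same_site_links, the page URL and the sample hrefs are hypothetical.

```python
from urllib.parse import urldefrag, urljoin, urlparse

head_url = "http://www.xxx.com.cn"  # same placeholder site as in the article


def same_site_links(page_url, hrefs):
    """Return the absolute same-host links among hrefs (hypothetical helper)."""
    host = urlparse(head_url).netloc
    kept = set()
    for href in hrefs:
        if not href:
            continue
        # Resolve against the current page, then strip any "#fragment".
        absolute, _fragment = urldefrag(urljoin(page_url, href))
        parts = urlparse(absolute)
        if parts.scheme in ("http", "https") and parts.netloc == host:
            kept.add(absolute)
    return sorted(kept)


# Made-up hrefs as they might appear on one page of the site.
page = "http://www.xxx.com.cn/news/index.html"
hrefs = ["/about.html", "item1.html#top", "http://www.other.com/",
         "javascript:void(0)", "mailto:a@b.c", None]
for link in same_site_links(page, hrefs):
    print(link)
```

Comparing hosts rather than string prefixes also avoids accepting a look-alike such as http://www.xxx.com.cn.example.org.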
Recursive traversal:
To crawl every URL recursively, get_next_url() calls itself at the end of the function:
    get_next_url(url_list2)
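A recursive crawl only terminates if it remembers which pages it has already requested, because pages on a site usually link back to each other. The sketch below shows that idea in isolation, using a made-up in-memory link graph in place of real HTTP requests. The names FAKE_SITE, crawl and fetch_links are hypothetical and are not part of the original code.

```python
# Hypothetical in-memory "site": each page maps to the links found on it.
FAKE_SITE = {
    "/": ["/a", "/b"],
    "/a": ["/b", "/c"],
    "/b": ["/"],
    "/c": [],
}


def crawl(urls, fetch_links, visited=None):
    """Recursively visit urls level by level, never fetching a page twice."""
    if visited is None:
        visited = set()
    found = set()
    for url in urls:
        if url in visited:
            continue
        visited.add(url)
        found.update(fetch_links(url))
    new_urls = found - visited
    if new_urls:  # stop condition: recurse only while unseen links remain
        crawl(new_urls, fetch_links, visited)
    return visited


all_pages = crawl(["/"], lambda url: FAKE_SITE.get(url, []))
print(sorted(all_pages))
```

In a real crawler, fetch_links would download the page and return its filtered links. Each recursive call handles one whole level of links, so the recursion depth follows the link depth of the site, not the number of pages.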
The complete code is as follows:
    import requests
    from bs4 import BeautifulSoup as Bs4

    head_url = "http://www.xxx.com.cn"
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.121 Safari/537.36"
    }
    # URLs that have already been requested; without this record the
    # recursion in get_next_url() would never terminate.
    visited = set()


    def get_first_url():
        """Collect the links in the home page's top navigation menu."""
        list_href = []
        response = requests.get(head_url, headers=headers)
        soup = Bs4(response.text, "lxml")
        urls_li = soup.select("#mainmenu_top > div > div > ul > li")
        for url_li in urls_li:
            for url in url_li.select("a"):
                url_href = url.get("href")
                if url_href:  # skip <a> tags that have no href
                    list_href.append(head_url + url_href)
        return list(set(list_href))


    def get_next_url(urllist):
        """Request each URL, print the same-site links found, then recurse."""
        url_list = []
        for url in urllist:
            if url in visited:
                continue  # already crawled in an earlier pass
            visited.add(url)
            response = requests.get(url, headers=headers)
            soup = Bs4(response.text, "lxml")
            for url2 in soup.find_all("a"):
                url2_1 = url2.get("href")
                if not url2_1:
                    continue  # <a> tag without an href
                if url2_1[0] == "/":
                    # site-relative link: prefix it with the site root
                    url_list.append(head_url + url2_1)
                elif url2_1.startswith(head_url):
                    # absolute link that stays on the same site
                    url_list.append(url2_1)
        url_list2 = set(url_list) - visited  # keep only unseen links
        for url_ in url_list2:
            res = requests.get(url_, headers=headers)
            if res.status_code == 200:
                print(url_)
        print(len(url_list2))
        if url_list2:  # recurse only while new links remain
            get_next_url(url_list2)


    if __name__ == "__main__":
        urllist = get_first_url()
        get_next_url(urllist)
Original link: https://blog.csdn.net/fei347795790/article/details/99471972