
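Python crawler: efficiently scraping novels from a Biquge-style site
This code is a modification of my earlier Biquge crawler. That version used my own IP, so every chapter request needed a sleep of 4 to 5 seconds to avoid an IP ban. Counting the time spent saving, each chapter cost 6 to 7 seconds, and a long novel took a very long time. This time I use proxy IPs, so no sleep is needed and the requests can be sent in bulk.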

1. Getting free IPs

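For free IPs I chose Zdaye (站大爺). Free IPs are short-lived, so use fresh ones whenever possible. I wrote a dedicated getip.py to fetch them: it scrapes the latest thirty IPs and returns them as two lists of dicts, one of the form {'http': 'http://ip:port'} and one of the form {'https': 'https://ip:port'}.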
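
To make the return shape concrete, here is a minimal sketch of how ip/port pairs become the two lists of proxy dicts that requests accepts through its proxies argument. The helper name build_proxy_dicts and the sample addresses are illustrative assumptions, not part of getip.py.

```python
def build_proxy_dicts(pairs):
    """Turn (ip, port) string pairs into requests-style proxy dicts."""
    http_list = []
    https_list = []
    for ip, port in pairs:
        address = ip + ':' + port
        # One dict per proxy, keyed by the URL scheme it should be used for
        http_list.append({'http': 'http://' + address})
        https_list.append({'https': 'https://' + address})
    return http_list, https_list


# Hypothetical sample data (documentation-range addresses, not real proxies)
sample = [('192.0.2.10', '8080'), ('192.0.2.11', '3128')]
http_list, https_list = build_proxy_dicts(sample)
print(http_list[0])
print(https_list[1])
```

requests picks the entry whose key matches the scheme of the URL being fetched, so a dict that only has an 'http' key has no effect on an https:// request.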

[Image: Python crawler for Biquge novels, upgraded version]

Note: this is a separate .py file, which the crawler proper imports later.

    import requests
    from lxml import etree
    from time import sleep


    def getip():
        """Scrape fresh free proxies from zdaye.com; return (http_list, https_list)."""
        base_url = 'https://www.zdaye.com'
        url = 'https://www.zdaye.com/dayproxy.html'
        headers = {
            "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/93.0.4577.63 Safari/537.36"
        }
        res = requests.get(url, headers=headers)
        res.encoding = "utf-8"
        dom = etree.HTML(res.text)
        # Links to the daily proxy-list posts
        sub_urls = dom.xpath('//h3[@class ="thread_title"]/a/@href')
        sub_pages = []
        for sub_url in sub_urls:
            # Remove the '.html' suffix explicitly: str.rstrip('.html') strips
            # any trailing '.', 'h', 't', 'm', 'l' characters, not the suffix.
            stem = base_url + sub_url
            if stem.endswith('.html'):
                stem = stem[:-len('.html')]
            # Each post is paginated as <post>/1.html ... <post>/10.html
            for i in range(1, 11):
                sub_pages.append(stem + '/' + str(i) + '.html')
        http_list = []
        https_list = []
        # Only the first three pages are fetched
        for sub in sub_pages[:3]:
            sub_res = requests.get(sub, headers=headers)
            sub_res.encoding = 'utf-8'
            sub_dom = etree.HTML(sub_res.text)
            ips = sub_dom.xpath('//tbody/tr/td[1]/text()')
            ports = sub_dom.xpath('//tbody/tr/td[2]/text()')
            types = sub_dom.xpath('//tbody/tr/td[4]/text()')
            sleep(3)  # pause between requests to the proxy-list site
            sub_res.close()
            for ip, port, _proxy_type in zip(ips, ports, types):
                # Store the http and https variants separately
                http_list.append({'http': 'http://' + ip + ':' + port})
                https_list.append({'https': 'https://' + ip + ':' + port})
        return http_list, https_list


    if __name__ == '__main__':
        http_list, https_list = getip()
        print(http_list)
        print(https_list)
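
One detail worth checking when building the paginated sub-page URLs: str.rstrip('.html') removes a set of trailing characters rather than the literal suffix, so it can eat into the path when the path itself ends in one of those letters. The sketch below shows the difference and a suffix-safe alternative. The helper name page_urls and the example.com paths are hypothetical, chosen only for illustration.

```python
def page_urls(base_url, href, pages=10):
    """Build <post>/1.html ... <post>/<pages>.html from a link ending in '.html'."""
    stem = base_url + href
    # Remove the literal suffix only, never individual characters
    if stem.endswith('.html'):
        stem = stem[:-len('.html')]
    return [stem + '/' + str(i) + '.html' for i in range(1, pages + 1)]


# rstrip treats its argument as a set of characters: the trailing 'll'
# of 'small' is stripped along with '.html'
stripped = '/list/small.html'.rstrip('.html')
print(stripped)  # /list/sma

# Hypothetical link, used only to show the suffix-safe version
urls = page_urls('https://example.com', '/list/small.html', pages=2)
print(urls)
```

When the part before '.html' ends in a digit, both approaches give the same result, which is why the problem is easy to miss.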

2. Implementation

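The complete code is at the end of the post. The line from getip import getip pulls in the IP-fetching part described above.
I collected a set of common User-Agent strings and pair them at random with the thirty IPs, which gives a few hundred possible combinations (17 × 30 = 510 with the list in the final listing).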
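
The random pairing can be sketched as follows. The function name pick_identity and the placeholder values are assumptions for illustration; the real crawler draws from the user_agent list and the proxy lists returned by getip().

```python
import random


def pick_identity(user_agents, proxy_pool, rng=random):
    """Choose one User-Agent and one proxy dict independently at random."""
    headers = {'User-Agent': rng.choice(user_agents)}
    proxies = rng.choice(proxy_pool)
    return headers, proxies


# Placeholder values, not real User-Agent strings or proxies
user_agents = ['ua-1', 'ua-2']
proxy_pool = [{'http': 'http://192.0.2.%d:8080' % n} for n in (1, 2, 3)]

headers, proxies = pick_identity(user_agents, proxy_pool)
combinations = len(user_agents) * len(proxy_pool)
print(headers, proxies, combinations)
```

Because the two choices are independent, the number of distinct (User-Agent, proxy) pairs is the product of the two list lengths.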

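I defined three functions to implement this.
biquge_get(): takes the URL of the search page. The search itself is done by changing kw in the URL, as shown in the main block.
    Returns the URL of the book's index page and the book title.

get_list(): takes the URL returned by biquge_get.
    Returns the collection of chapter URLs.

info_get(): takes the chapter URLs, the IP pool, the set of User-Agent headers, and the book title.
    Saves each chapter's content to a local file.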
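
Since the search works by putting kw into the query string, a keyword containing spaces or non-ASCII characters should be percent-encoded first. The following is a sketch using the standard library. The helper name build_search_url is hypothetical, and UTF-8 is an assumption: the site may expect a different encoding (for example GBK), which can be passed through the encoding parameter.

```python
from urllib.parse import urlencode

# Search endpoint taken from the main block of the crawler
SEARCH_ENDPOINT = 'http://www.b520.cc/modules/article/search.php'


def build_search_url(kw, encoding='utf-8'):
    """Return the search URL with kw percent-encoded as the searchkey parameter."""
    return SEARCH_ENDPOINT + '?' + urlencode({'searchkey': kw}, encoding=encoding)


print(build_search_url('abc'))
print(build_search_url('two words'))
```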

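In info_get() I define four variables, a, b, c and d, to track whether each chapter returned any content; the code carries detailed comments.
Here is the idea. The for loop runs ten times the number of chapters, and a, b and c all start at 0.
url = li_list[a] requests one chapter by index, and incrementing a moves on to the next URL. With this many requests some will fail, so when the returned text1 is empty, a is decremented again (a -= 1) and the next iteration requests the same URL once more.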


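I hit a pitfall here. While testing I printed the value of a to watch progress, and it kept printing the same chapter number, 340, until the loop ended. At first I assumed the site had become unreachable. When I checked the page in a browser, it turned out that this chapter simply has no content, so the program stayed stuck on it. That is why I added two more variables, b and c.

1. b stores the value a had before it changed. If b equals a in the next iteration, the previous request did not succeed, so c is incremented. Some pages are themselves broken and contain no data, and those must be skipped.
2. If c exceeds 10, more than ten requests have failed for one reason or another, so a is incremented to skip this chapter. At the same time d is decremented by one, to avoid an index error when the loop exits later.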


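Finally, d. It starts as the number of chapters, d = len(li_list). When a has grown to equal d, every URL in li_list has been used and the loop has to exit.
After that, the extracted data is saved.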
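
The same retry-then-skip behaviour can also be expressed with a bounded inner loop per chapter, which avoids the index bookkeeping. This is an alternative sketch, not the code used in this post: download_chapters, its fetch parameter, and fake_fetch are hypothetical names, and the limit of 10 attempts mirrors the threshold described above.

```python
def download_chapters(urls, fetch, max_attempts=10):
    """Fetch each URL, retrying up to max_attempts times.

    Returns (saved, skipped): saved holds (chapter_number, paragraphs)
    pairs, skipped holds the numbers of chapters that never returned content.
    """
    saved = []
    skipped = []
    for index, url in enumerate(urls, start=1):
        for _attempt in range(max_attempts):
            try:
                paragraphs = fetch(url)
            except OSError:
                continue  # network or proxy failure: try again
            if paragraphs:
                saved.append((index, paragraphs))
                break
        else:
            # No attempt produced content (for example a genuinely empty page)
            skipped.append(index)
    return saved, skipped


# Simulated fetcher: 'flaky' fails twice before succeeding, 'empty' never has content
calls = {}

def fake_fetch(url):
    calls[url] = calls.get(url, 0) + 1
    if url == 'empty':
        return []
    if url == 'flaky' and calls[url] < 3:
        raise OSError('simulated proxy failure')
    return ['text of ' + url]


saved, skipped = download_chapters(['ok', 'flaky', 'empty'], fake_fetch)
print(saved)
print(skipped)
```

In a real run, fetch would wrap the requests.get call with a random proxy and User-Agent and return the list of paragraphs extracted from the page.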


Final test: 1676 chapters in total, at an initial speed of roughly two chapters per second.


The crawl finished in about 10 minutes in total.
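
For a sense of scale, the figures quoted in this post can be compared directly. The sketch uses only those numbers (1676 chapters, 6 to 7 seconds per chapter with sleep, about two chapters per second with proxies). Treating two chapters per second as a sustained rate is an assumption, since only the initial speed was reported.

```python
chapters = 1676

# Sleep-based version: 6 to 7 seconds per chapter
hours_low = chapters * 6 / 3600
hours_high = chapters * 7 / 3600

# Proxy version, assuming the initial rate of 2 chapters per second is sustained
minutes_proxy = chapters / 2 / 60

print(round(hours_low, 2), round(hours_high, 2), round(minutes_proxy, 1))
```

That is roughly 2.8 to 3.3 hours for the sleep-based version against about 14 minutes at a steady two chapters per second, the same order of magnitude as the 10 minutes measured.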


    import requests
    from lxml import etree
    from getip import getip
    import random
    import time

    headers = {
        "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/93.0.4577.63 Safari/537.36"
    }


    def biquge_get(url):
        """Run the search (kw is already in url), print every result,
        and return (book_url, book_name) for the book the user picks."""
        r = requests.get(url=url, headers=headers, timeout=20)
        r.encoding = r.apparent_encoding
        html = etree.HTML(r.text)
        # Titles, authors and URLs of the search results
        bookname = html.xpath('//td[@class = "odd"]/a/text()')
        bookauthor = html.xpath('//td[@class = "odd"]/text()')
        bookurl = html.xpath('//td[@class = "odd"]/a/@href')
        print('Search results:\n')
        # Results are numbered from 1; the author of result n sits at index 2*(n-1)
        for n, name in enumerate(bookname, start=1):
            print(str(n) + ':', name, '\tAuthor:', bookauthor[2 * (n - 1)])
        c = input('Choose the novel to download (enter its number): ')
        book_name = str(bookname[int(c) - 1])
        print(book_name, 'retrieving chapters')
        url2 = bookurl[int(c) - 1]
        r.close()
        return url2, book_name


    def get_list(url):
        """Take the book's URL and return the URL of every chapter."""
        r = requests.get(url=url, headers=headers, timeout=20)
        r.encoding = r.apparent_encoding
        html = etree.HTML(r.text)
        # Parse the chapter links, skipping the first 9 entries
        li_list = html.xpath('//*[@id="list"]/dl//a/@href')[9:]
        return li_list


    # Pool of User-Agent strings
    user_agent = [
        "Mozilla/5.0 (compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html)",
        "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; AcooBrowser; .NET CLR 1.1.4322; .NET CLR 2.0.50727)",
        "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0; Acoo Browser; SLCC1; .NET CLR 2.0.50727; Media Center PC 5.0; .NET CLR 3.0.04506)",
        "Mozilla/4.0 (compatible; MSIE 7.0; AOL 9.5; AOLBuild 4337.35; Windows NT 5.1; .NET CLR 1.1.4322; .NET CLR 2.0.50727)",
        "Mozilla/5.0 (Windows; U; MSIE 9.0; Windows NT 9.0; en-US)",
        "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Win64; x64; Trident/5.0; .NET CLR 3.5.30729; .NET CLR 3.0.30729; .NET CLR 2.0.50727; Media Center PC 6.0)",
        "Mozilla/5.0 (compatible; MSIE 8.0; Windows NT 6.0; Trident/4.0; WOW64; Trident/4.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; .NET CLR 1.0.3705; .NET CLR 1.1.4322)",
        "Mozilla/4.0 (compatible; MSIE 7.0b; Windows NT 5.2; .NET CLR 1.1.4322; .NET CLR 2.0.50727; InfoPath.2; .NET CLR 3.0.04506.30)",
        "Mozilla/5.0 (Windows; U; Windows NT 5.1; zh-CN) AppleWebKit/523.15 (KHTML, like Gecko, Safari/419.3) Arora/0.3 (Change: 287 c9dfb30)",
        "Mozilla/5.0 (X11; U; Linux; en-US) AppleWebKit/527+ (KHTML, like Gecko, Safari/419.3) Arora/0.6",
        "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.2pre) Gecko/20070215 K-Ninja/2.1.1",
        "Mozilla/5.0 (Windows; U; Windows NT 5.1; zh-CN; rv:1.9) Gecko/20080705 Firefox/3.0 Kapiko/3.0",
        "Mozilla/5.0 (X11; Linux i686; U;) Gecko/20070322 Kazehakase/0.4.5",
        "Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.8) Gecko Fedora/1.9.0.8-1.fc10 Kazehakase/0.5.6",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.56 Safari/535.11",
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_3) AppleWebKit/535.20 (KHTML, like Gecko) Chrome/19.0.1036.7 Safari/535.20",
        "Opera/9.80 (Macintosh; Intel Mac OS X 10.6.8; U; fr) Presto/2.9.168 Version/11.52",
    ]


    def info_get(li_list, ip_list, headers, book_name):
        """Arguments: chapter URLs, IP pool (http_list, https_list),
        User-Agent pool, book title."""
        print('Total: ' + str(len(li_list)) + ' chapters')
        # a: index of the current chapter; it only advances once the page was
        #    fetched and its content written
        # b: the value a had in the previous iteration, compared with a to
        #    detect whether a changed
        # c: incremented whenever b == a; once it exceeds 10 the chapter is
        #    skipped and d is reduced by one
        # d: number of chapters
        a = 0
        b = 0
        c = 0
        d = len(li_list)
        with open('./' + str(book_name) + '.txt', 'w', encoding='utf-8') as fp:
            # Loop up to 10x the chapter count so failed requests can be retried
            for _ in range(10 * len(li_list)):
                url = li_list[a]
                # Pick the proxy list that matches the URL scheme:
                # ip_list[0] holds {'http': ...}, ip_list[1] holds {'https': ...}.
                # requests only applies a proxy whose key matches the scheme.
                if url.startswith('https'):
                    proxies = random.choice(ip_list[1])
                else:
                    proxies = random.choice(ip_list[0])
                try:
                    r = requests.get(url=url,
                                     headers={'User-Agent': random.choice(headers)},
                                     proxies=proxies,
                                     timeout=5)
                    r.encoding = r.apparent_encoding
                    html = etree.HTML(r.text)
                    titles = html.xpath('/html/body/div/div/div/div/h1/text()')
                    title = titles[0] if titles else ''
                    text = html.xpath('//*[@id="content"]/p/text()')
                    # Drop the first two characters of every paragraph
                    text1 = [line[2:] for line in text]
                    # If b still equals a, the previous request returned nothing,
                    # so c += 1. Some pages are genuinely empty: once c exceeds 10,
                    # skip the chapter (a += 1) and reduce d by one.
                    if b == a:
                        c += 1
                    if c > 10:
                        a += 1
                        c = 0
                        d -= 1
                    b = a
                    # Advance to the next URL; undo the step if nothing was
                    # extracted so the same URL is requested again
                    a += 1
                    if len(text1) == 0:
                        a -= 1
                    else:
                        # '第N章' means 'Chapter N'; a is already 1-based here
                        fp.write('第' + str(a) + '章' + str(title) + ':\n' + '\t' + ','.join(text1) + '\n\n')
                        print('"' + str(title) + '"', 'downloaded')
                    r.close()
                except OSError:
                    # requests' network and proxy errors derive from OSError;
                    # a is unchanged, so the same URL is retried
                    pass
                # a indexes li_list; once it reaches d every URL has been used
                if a >= d:
                    break


    if __name__ == '__main__':
        kw = input('Enter the novel to search for: ')
        url = f'http://www.b520.cc/modules/article/search.php?searchkey={kw}'
        bookurl, book_name = biquge_get(url)
        li_list = get_list(bookurl)
        ip_list = getip()
        t1 = time.time()
        info_get(li_list, ip_list, user_agent, book_name)
        t2 = time.time()
        print('Elapsed: ' + str((t2 - t1) / 60) + ' min')