Continuing on from Part 1 of this series.
1. Classifying the data

Each crawled id falls into one of five categories, each written to its own file:

- Complete record: id, gender, and last-activity time are all present.
  File: file1 = 'ruisi\\correct%s-%s.txt' % (startNum, endNum)
  Record format: 293001 男 2015-5-1 19:17 (男 = male, 女 = female)
- Missing time: id and gender are present, but there is no activity time.
  File: file2 = 'ruisi\\errTime%s-%s.txt' % (startNum, endNum)
  Record format: 2566 女 notime
- Nonexistent user: no user corresponds to this id.
  File: file3 = 'ruisi\\notexist%s-%s.txt' % (startNum, endNum)
  Record format: 29005 notexist
- Unknown gender: the id exists, but the gender cannot be read from the page (on inspection, these cases have no activity time either).
  File: file4 = 'ruisi\\unkownsex%s-%s.txt' % (startNum, endNum)
  Record format: 221794 unkownsex
- Network error: the connection dropped or the server failed; these ids must be re-checked later.
  File: file5 = 'ruisi\\httperror%s-%s.txt' % (startNum, endNum)
  Record format: 271004 httperror
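As a quick illustration (not part of the original scripts), the five record shapes above can be told apart by field count and the trailing token. The helper `classify_record` and the sample lines are hypothetical:

```python
# -*- coding: utf-8 -*-
# Hypothetical helper: classify one output record by its shape,
# mirroring the five file formats listed above.
def classify_record(line):
    parts = line.split()
    if len(parts) == 2:
        # "29005 notexist", "221794 unkownsex", "271004 httperror"
        return parts[1]
    if parts[-1] == 'notime':
        return 'errTime'   # "2566 女 notime"
    return 'correct'       # "293001 男 2015-5-1 19:17"

print(classify_record('293001 男 2015-5-1 19:17'))
```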
How to crawl without interruption

- One design goal of this project is uninterrupted crawling. If the crawler simply exited whenever the network dropped or the BBS server failed, we would have to resume from wherever it stopped, or worse, start over from scratch.
- So my approach is: when a failure occurs, record the offending ids and move on; only after a full pass do I re-crawl those ids.
- Part 1 of this series defined getInfo(myurl, seWord), which fetches the given URL and extracts information with the given regular expression. It can be used to read both the gender and the last-activity time.
- On top of it we define a "safe" fetch function that never aborts the run, using try/except. The code below calls getInfo(myurl, seWord) up to twice; if the second attempt also raises, the id is recorded in file5. If the information can be fetched, it is returned.
```python
file5 = 'ruisi\\httperror%s-%s.txt' % (startNum, endNum)

def safeGet(myid, myurl, seWord):
    try:
        return getInfo(myurl, seWord)
    except:
        try:
            return getInfo(myurl, seWord)
        except:
            httperrorfile = open(file5, 'a')
            info = '%d %s\n' % (myid, 'httperror')
            httperrorfile.write(info)
            httperrorfile.close()
            return 'httperror'
```
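The nested try/except amounts to "retry once, then give up". The same idea can be written as a small generic helper (a Python 3 sketch; `retry` is my name, not from the original code):

```python
# Generic "try n times, then fall back" wrapper,
# equivalent in spirit to the nested try/except above.
def retry(fn, attempts=2, fallback=None):
    for _ in range(attempts):
        try:
            return fn()
        except Exception:
            pass               # swallow the error and try again
    return fallback            # every attempt failed

# e.g.: retry(lambda: getInfo(myurl, seWord), fallback='httperror')
```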
Iterating over user ids from 1 to 300,000

We define a function that, for each id, fetches the gender and the last-activity time. If a gender is found, we go on to check whether a time exists; if no gender is found, we determine whether the user does not exist at all or the gender is simply unreadable. Network outages and BBS server failures have to be handled along the way.
```python
url1 = 'http://rs.xidian.edu.cn/home.php?mod=space&uid=%s'
url2 = 'http://rs.xidian.edu.cn/home.php?mod=space&uid=%s&do=profile'

def searchWeb(idArr):
    for id in idArr:
        sexUrl = url1 % (id)    # substitute the id into %s
        timeUrl = url2 % (id)
        sex = safeGet(id, sexUrl, sexRe)
        if not sex:
            # if sexUrl has no gender, try timeUrl as well
            sex = safeGet(id, timeUrl, sexRe)
        time = safeGet(id, timeUrl, timeRe)
        # on httperror the id must be re-crawled later
        # (note: compare strings with ==, not 'is')
        if (sex == 'httperror') or (time == 'httperror'):
            pass
        else:
            if sex:
                info = '%d %s' % (id, sex)
                if time:
                    info = '%s %s\n' % (info, time)
                    wfile = open(file1, 'a')
                    wfile.write(info)
                    wfile.close()
                else:
                    info = '%s %s\n' % (info, 'notime')
                    errtimefile = open(file2, 'a')
                    errtimefile.write(info)
                    errtimefile.close()
            else:
                # gender is None: check whether the user exists at all;
                # the user may simply not have filled in a gender.
                # (when the network is down, this extra call produces
                #  the 4 duplicate httperror lines discussed later)
                notexist = safeGet(id, sexUrl, notexistRe)
                if notexist == 'httperror':
                    pass
                else:
                    if notexist:
                        notexistfile = open(file3, 'a')
                        info = '%d %s\n' % (id, 'notexist')
                        notexistfile.write(info)
                        notexistfile.close()
                    else:
                        unkownsexfile = open(file4, 'a')
                        info = '%d %s\n' % (id, 'unkownsex')
                        unkownsexfile.write(info)
                        unkownsexfile.close()
```
A later review turned up a problem in this part of the code:
```python
sex = safeGet(id, sexUrl, sexRe)
if not sex:
    sex = safeGet(id, timeUrl, sexRe)
time = safeGet(id, timeUrl, timeRe)
```
When the network is down, this fragment calls safeGet three times, and every failing call appends another httperror line for the same id:
```
251538 httperror
251538 httperror
251538 httperror
251538 httperror
```
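One way to fix this (a sketch, not the author's code): instead of appending to file5 inside every failing safeGet() call, collect the failed ids in a set during the pass and write each id exactly once at the end:

```python
# A set dedupes repeated failures of the same id.
failed_ids = set()

def mark_httperror(myid):
    failed_ids.add(myid)

# simulate three failing calls for the same id
for _ in range(3):
    mark_httperror(251538)

# emit each failed id once, e.g. at the end of the pass
lines = ['%d httperror\n' % i for i in sorted(failed_ids)]
print(lines)
```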
Crawling with multiple workers?

The statistics stage can safely run in parallel, because each id range writes to its own independent set of files.
1. An introduction to Popen

Popen lets you customize standard input, standard output, and standard error. During my internship at SAP, the team used Popen a lot on Linux, probably because it makes redirecting output convenient. The code below borrows that team's approach: Popen invokes a system command, and the three consecutive communicate() calls mean we wait for all three children to finish.
A puzzle: experimenting a bit, the three communicate() calls must come together, after all three Popen calls, so that the three workers start at the same time and we then wait for all of them to finish.
```python
p1 = Popen(['python', 'ruisi.py', str(s0), str(s1)], bufsize=10000, stdout=subprocess.PIPE)
p2 = Popen(['python', 'ruisi.py', str(s1), str(s2)], bufsize=10000, stdout=subprocess.PIPE)
p3 = Popen(['python', 'ruisi.py', str(s2), str(s3)], bufsize=10000, stdout=subprocess.PIPE)
p1.communicate()
p2.communicate()
p3.communicate()
```
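The ordering matters only in that all the Popen calls come before any communicate(): Popen starts each child immediately, and communicate() merely blocks the parent. The pattern generalizes to N workers (a Python 3 sketch; the inline `-c` command stands in for ruisi.py):

```python
import subprocess
import sys

def run_batch(ranges):
    # start every child first, so they run concurrently...
    procs = [subprocess.Popen(
                 [sys.executable, '-c', 'print(%d, %d)' % (a, b)],
                 stdout=subprocess.PIPE)
             for a, b in ranges]
    # ...then wait for each in turn; only the parent blocks here,
    # the children have been running in parallel all along
    return [p.communicate()[0] for p in procs]

outs = run_batch([(1, 1001), (1001, 2001), (2001, 3001)])
```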
2. The single-threaded crawler script

Usage: python ruisi.py <startNum> <endNum>

This script crawls the information for ids in [startNum, endNum) and writes it to the corresponding files. It is single-threaded; parallelism is achieved by the external driver that invokes it.
```python
# ruisi.py
# coding=utf-8
import urllib2, re, sys, threading, time, thread

# myurl: the target URL
# seWord: a compiled regular expression (unicode)
# returns the matched group, or None if there is no match
def getInfo(myurl, seWord):
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.1.6) Gecko/20091201 Firefox/3.5.6'
    }
    req = urllib2.Request(url=myurl, headers=headers)
    time.sleep(0.3)
    response = urllib2.urlopen(req)
    html = response.read()
    html = unicode(html, 'utf-8')
    timeMatch = seWord.search(html)
    if timeMatch:
        s = timeMatch.groups()
        return s[0]
    else:
        return None

# try getInfo() twice;
# after the second failure, mark the id as httperror
def safeGet(myid, myurl, seWord):
    try:
        return getInfo(myurl, seWord)
    except:
        try:
            return getInfo(myurl, seWord)
        except:
            httperrorfile = open(file5, 'a')
            info = '%d %s\n' % (myid, 'httperror')
            httperrorfile.write(info)
            httperrorfile.close()
            return 'httperror'

# crawl a range of ids, e.g. xrange(1, 1001)
def searchWeb(idArr):
    for id in idArr:
        sexUrl = url1 % (id)
        timeUrl = url2 % (id)
        sex = safeGet(id, sexUrl, sexRe)
        if not sex:
            sex = safeGet(id, timeUrl, sexRe)
        time = safeGet(id, timeUrl, timeRe)
        if (sex == 'httperror') or (time == 'httperror'):
            pass
        else:
            if sex:
                info = '%d %s' % (id, sex)
                if time:
                    info = '%s %s\n' % (info, time)
                    wfile = open(file1, 'a')
                    wfile.write(info)
                    wfile.close()
                else:
                    info = '%s %s\n' % (info, 'notime')
                    errtimefile = open(file2, 'a')
                    errtimefile.write(info)
                    errtimefile.close()
            else:
                notexist = safeGet(id, sexUrl, notexistRe)
                if notexist == 'httperror':
                    pass
                else:
                    if notexist:
                        notexistfile = open(file3, 'a')
                        info = '%d %s\n' % (id, 'notexist')
                        notexistfile.write(info)
                        notexistfile.close()
                    else:
                        unkownsexfile = open(file4, 'a')
                        info = '%d %s\n' % (id, 'unkownsex')
                        unkownsexfile.write(info)
                        unkownsexfile.close()

def main():
    reload(sys)
    sys.setdefaultencoding('utf-8')
    if len(sys.argv) != 3:
        print 'usage: python ruisi.py <startNum> <endNum>'
        sys.exit(-1)
    global sexRe, timeRe, notexistRe, url1, url2, \
        file1, file2, file3, file4, file5, startNum, endNum
    startNum = int(sys.argv[1])
    endNum = int(sys.argv[2])
    sexRe = re.compile(u'em>\u6027\u522b</em>(.*?)</li')
    timeRe = re.compile(u'em>\u4e0a\u6b21\u6d3b\u52a8\u65f6\u95f4</em>(.*?)</li')
    notexistRe = re.compile(u'(p>)\u62b1\u6b49\uff0c\u60a8\u6307\u5b9a\u7684\u7528\u6237\u7a7a\u95f4\u4e0d\u5b58\u5728<')
    url1 = 'http://rs.xidian.edu.cn/home.php?mod=space&uid=%s'
    url2 = 'http://rs.xidian.edu.cn/home.php?mod=space&uid=%s&do=profile'
    file1 = '..\\newRuisi\\correct%s-%s.txt' % (startNum, endNum)
    file2 = '..\\newRuisi\\errTime%s-%s.txt' % (startNum, endNum)
    file3 = '..\\newRuisi\\notexist%s-%s.txt' % (startNum, endNum)
    file4 = '..\\newRuisi\\unkownsex%s-%s.txt' % (startNum, endNum)
    file5 = '..\\newRuisi\\httperror%s-%s.txt' % (startNum, endNum)
    searchWeb(xrange(startNum, endNum))
    # numThread = 10
    # searchWeb(xrange(endNum))
    # total = 0
    # for i in xrange(numThread):
    #     data = xrange(1 + i, endNum, numThread)
    #     total += len(data)
    #     t = threading.Thread(target=searchWeb, args=(data,))
    #     t.start()
    # print total

main()
```
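The script above targets Python 2 (urllib2, print statements, unicode()). For readers on Python 3, a rough equivalent of getInfo() might look like this sketch; splitting out the regex step also makes it testable offline. The names get_info/extract are mine, not the original's:

```python
# -*- coding: utf-8 -*-
import re
import time
import urllib.request

def extract(html, se_word):
    # return the first capture group on a match, else None
    match = se_word.search(html)
    return match.group(1) if match else None

def get_info(myurl, se_word, pause=0.3):
    # throttle, as the original does with time.sleep(0.3)
    time.sleep(pause)
    req = urllib.request.Request(myurl, headers={'User-Agent': 'Mozilla/5.0'})
    with urllib.request.urlopen(req) as resp:
        return extract(resp.read().decode('utf-8'), se_word)

# offline check of the extraction step against a page fragment
sex_re = re.compile(u'em>性別</em>(.*?)</li')
print(extract(u'<li><em>性別</em>男</li>', sex_re))
```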
The parallel crawler

The driver script (three workers, each a child process launched via Popen):
```python
# coding=utf-8
from subprocess import Popen
import subprocess
import threading, time

startn = 1
endn = 300001
step = 1000
total = (endn - startn + 1) / step
ISOTIMEFORMAT = '%Y-%m-%d %X'

# hardcoded to 3 workers; I did not investigate whether
# 3, 4, or more workers would be optimal.
# prints a formatted timestamp for each batch,
# plus the elapsed time in seconds
for i in xrange(0, total, 3):
    startNumber = startn + step * i
    startTime = time.clock()
    s0 = startNumber
    s1 = startNumber + step
    s2 = startNumber + step * 2
    s3 = startNumber + step * 3
    p1 = Popen(['python', 'ruisi.py', str(s0), str(s1)], bufsize=10000, stdout=subprocess.PIPE)
    p2 = Popen(['python', 'ruisi.py', str(s1), str(s2)], bufsize=10000, stdout=subprocess.PIPE)
    p3 = Popen(['python', 'ruisi.py', str(s2), str(s3)], bufsize=10000, stdout=subprocess.PIPE)
    startftime = '[ ' + time.strftime(ISOTIMEFORMAT, time.localtime()) + ' ] '
    print startftime + '%s - %s download start... ' % (s0, s1)
    print startftime + '%s - %s download start... ' % (s1, s2)
    print startftime + '%s - %s download start... ' % (s2, s3)
    p1.communicate()
    p2.communicate()
    p3.communicate()
    endftime = '[ ' + time.strftime(ISOTIMEFORMAT, time.localtime()) + ' ] '
    print endftime + '%s - %s download end !!! ' % (s0, s1)
    print endftime + '%s - %s download end !!! ' % (s1, s2)
    print endftime + '%s - %s download end !!! ' % (s2, s3)
    endTime = time.clock()
    print "cost time " + str(endTime - startTime) + " s"
    time.sleep(5)
```
Here is the log with timestamps:
```
"D:\Program Files\Python27\python.exe" E:/pythonProject/webCrawler/sum.py
[ 2015-11-23 11:31:15 ] 1 - 1001 download start...
[ 2015-11-23 11:31:15 ] 1001 - 2001 download start...
[ 2015-11-23 11:31:15 ] 2001 - 3001 download start...
[ 2015-11-23 11:53:44 ] 1 - 1001 download end !!!
[ 2015-11-23 11:53:44 ] 1001 - 2001 download end !!!
[ 2015-11-23 11:53:44 ] 2001 - 3001 download end !!!
cost time 1348.99480677 s
[ 2015-11-23 11:53:50 ] 3001 - 4001 download start...
[ 2015-11-23 11:53:50 ] 4001 - 5001 download start...
[ 2015-11-23 11:53:50 ] 5001 - 6001 download start...
[ 2015-11-23 12:16:56 ] 3001 - 4001 download end !!!
[ 2015-11-23 12:16:56 ] 4001 - 5001 download end !!!
[ 2015-11-23 12:16:56 ] 5001 - 6001 download end !!!
cost time 1386.06407734 s
[ 2015-11-23 12:17:01 ] 6001 - 7001 download start...
[ 2015-11-23 12:17:01 ] 7001 - 8001 download start...
[ 2015-11-23 12:17:01 ] 8001 - 9001 download start...
```
From the multi-process log above, 1000 users take roughly 500 s of wall time (each batch of 3×1000 ids finishes in about 1350–1390 s), i.e. about 0.5 s per id. At that rate, 500 s × 300 / 3600 ≈ 41.7 hours, roughly two days for all 300,000 ids.
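The back-of-the-envelope estimate can be checked directly (plain arithmetic, nothing project-specific):

```python
seconds_per_id = 0.5      # observed: ~500 s of wall time per 1000 users
total_ids = 300000
hours = seconds_per_id * total_ids / 3600
print(round(hours, 1))    # roughly 41.7 hours, i.e. about two days
```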
For comparison, here is the timing of a single-threaded run:
```
"D:\Program Files\Python27\python.exe" E:/pythonProject/webCrawler/sum.py
1 - 1001 download start...
1 - 1001 download end !!!
cost time 1583.65911889 s
1001 - 2001 download start...
1001 - 2001 download end !!!
cost time 1342.46874278 s
2001 - 3001 download start...
2001 - 3001 download end !!!
cost time 1327.10885725 s
3001 - 4001 download start...
```
A single worker also needs about 1500 s per 1000 users, whereas the parallel version covers 3 × 1000 users in roughly the same 1500 s. Running several workers therefore saves a great deal of time over a single one.
Note:

getInfo(myurl, seWord) contains a time.sleep(0.3) call, added so the crawler does not hammer the BBS and get its access blocked. This 0.3 s delay is included in both the parallel and single-worker timings above.

Finally, here is the original log without timestamps. (The timestamps were added later so one can tell when crawling started, in case a worker hangs.)
```
"D:\Program Files\Python27\python.exe" E:/pythonProject/webCrawler/sum.py
1 - 1001 download start...
1001 - 2001 download start...
2001 - 3001 download start...
1 - 1001 download end !!!
1001 - 2001 download end !!!
2001 - 3001 download end !!!
cost time 1532.74102812 s
3001 - 4001 download start...
4001 - 5001 download start...
5001 - 6001 download start...
3001 - 4001 download end !!!
4001 - 5001 download end !!!
5001 - 6001 download end !!!
cost time 2652.01624951 s
6001 - 7001 download start...
7001 - 8001 download start...
8001 - 9001 download start...
6001 - 7001 download end !!!
7001 - 8001 download end !!!
8001 - 9001 download end !!!
cost time 1880.61513696 s
9001 - 10001 download start...
10001 - 11001 download start...
11001 - 12001 download start...
9001 - 10001 download end !!!
10001 - 11001 download end !!!
11001 - 12001 download end !!!
cost time 1634.40575553 s
12001 - 13001 download start...
13001 - 14001 download start...
14001 - 15001 download start...
12001 - 13001 download end !!!
13001 - 14001 download end !!!
14001 - 15001 download end !!!
cost time 1403.62795496 s
15001 - 16001 download start...
16001 - 17001 download start...
17001 - 18001 download start...
15001 - 16001 download end !!!
16001 - 17001 download end !!!
17001 - 18001 download end !!!
cost time 1271.42177906 s
18001 - 19001 download start...
19001 - 20001 download start...
20001 - 21001 download start...
18001 - 19001 download end !!!
19001 - 20001 download end !!!
20001 - 21001 download end !!!
cost time 1476.04122024 s
21001 - 22001 download start...
22001 - 23001 download start...
23001 - 24001 download start...
21001 - 22001 download end !!!
22001 - 23001 download end !!!
23001 - 24001 download end !!!
cost time 1431.37074164 s
24001 - 25001 download start...
25001 - 26001 download start...
26001 - 27001 download start...
24001 - 25001 download end !!!
25001 - 26001 download end !!!
26001 - 27001 download end !!!
cost time 1411.45186874 s
27001 - 28001 download start...
28001 - 29001 download start...
29001 - 30001 download start...
27001 - 28001 download end !!!
28001 - 29001 download end !!!
29001 - 30001 download end !!!
cost time 1396.88837788 s
30001 - 31001 download start...
31001 - 32001 download start...
32001 - 33001 download start...
30001 - 31001 download end !!!
31001 - 32001 download end !!!
32001 - 33001 download end !!!
cost time 1389.01316718 s
33001 - 34001 download start...
34001 - 35001 download start...
35001 - 36001 download start...
33001 - 34001 download end !!!
34001 - 35001 download end !!!
35001 - 36001 download end !!!
cost time 1318.16040825 s
36001 - 37001 download start...
37001 - 38001 download start...
38001 - 39001 download start...
36001 - 37001 download end !!!
37001 - 38001 download end !!!
38001 - 39001 download end !!!
cost time 1362.59222822 s
39001 - 40001 download start...
40001 - 41001 download start...
41001 - 42001 download start...
39001 - 40001 download end !!!
40001 - 41001 download end !!!
41001 - 42001 download end !!!
cost time 1253.62498539 s
42001 - 43001 download start...
43001 - 44001 download start...
44001 - 45001 download start...
42001 - 43001 download end !!!
43001 - 44001 download end !!!
44001 - 45001 download end !!!
cost time 1313.50461988 s
45001 - 46001 download start...
46001 - 47001 download start...
47001 - 48001 download start...
45001 - 46001 download end !!!
46001 - 47001 download end !!!
47001 - 48001 download end !!!
cost time 1322.32317331 s
48001 - 49001 download start...
49001 - 50001 download start...
50001 - 51001 download start...
48001 - 49001 download end !!!
49001 - 50001 download end !!!
50001 - 51001 download end !!!
cost time 1381.58027296 s
51001 - 52001 download start...
52001 - 53001 download start...
53001 - 54001 download start...
51001 - 52001 download end !!!
52001 - 53001 download end !!!
53001 - 54001 download end !!!
cost time 1357.78699459 s
54001 - 55001 download start...
55001 - 56001 download start...
56001 - 57001 download start...
54001 - 55001 download end !!!
55001 - 56001 download end !!!
56001 - 57001 download end !!!
cost time 1359.76377246 s
57001 - 58001 download start...
58001 - 59001 download start...
59001 - 60001 download start...
57001 - 58001 download end !!!
58001 - 59001 download end !!!
59001 - 60001 download end !!!
cost time 1335.47829775 s
60001 - 61001 download start...
61001 - 62001 download start...
62001 - 63001 download start...
60001 - 61001 download end !!!
61001 - 62001 download end !!!
62001 - 63001 download end !!!
cost time 1354.82727645 s
63001 - 64001 download start...
64001 - 65001 download start...
65001 - 66001 download start...
63001 - 64001 download end !!!
64001 - 65001 download end !!!
65001 - 66001 download end !!!
cost time 1260.54731607 s
66001 - 67001 download start...
67001 - 68001 download start...
68001 - 69001 download start...
66001 - 67001 download end !!!
67001 - 68001 download end !!!
68001 - 69001 download end !!!
cost time 1363.58255686 s
69001 - 70001 download start...
70001 - 71001 download start...
71001 - 72001 download start...
69001 - 70001 download end !!!
70001 - 71001 download end !!!
71001 - 72001 download end !!!
cost time 1354.17163074 s
72001 - 73001 download start...
73001 - 74001 download start...
74001 - 75001 download start...
72001 - 73001 download end !!!
73001 - 74001 download end !!!
74001 - 75001 download end !!!
cost time 1335.00425259 s
75001 - 76001 download start...
76001 - 77001 download start...
77001 - 78001 download start...
75001 - 76001 download end !!!
76001 - 77001 download end !!!
77001 - 78001 download end !!!
cost time 1360.44054978 s
78001 - 79001 download start...
79001 - 80001 download start...
80001 - 81001 download start...
78001 - 79001 download end !!!
79001 - 80001 download end !!!
80001 - 81001 download end !!!
cost time 1369.72662457 s
81001 - 82001 download start...
82001 - 83001 download start...
83001 - 84001 download start...
81001 - 82001 download end !!!
82001 - 83001 download end !!!
83001 - 84001 download end !!!
cost time 1369.95550676 s
84001 - 85001 download start...
85001 - 86001 download start...
86001 - 87001 download start...
84001 - 85001 download end !!!
85001 - 86001 download end !!!
86001 - 87001 download end !!!
cost time 1482.53886433 s
87001 - 88001 download start...
88001 - 89001 download start...
89001 - 90001 download start...
```
That concludes Part 2 of using a Python crawler to compute the male/female ratio on a school BBS, focusing on the parallel crawler. I hope it helps with your own study.