Continuing on from Part 1 of this series.
1. Classifying the data

Each crawled id falls into one of five categories, each written to its own file:

- Complete record: id, gender, and last-activity time are all present.
  File: file1 = 'ruisi\\correct%s-%s.txt' % (startNum, endNum)
  Record format: 293001 男 2015-5-1 19:17 (男 = male, 女 = female)
- Missing time: id and gender are present, but there is no activity time.
  File: file2 = 'ruisi\\errTime%s-%s.txt' % (startNum, endNum)
  Record format: 2566 女 notime
- Nonexistent user: no user corresponds to this id.
  File: file3 = 'ruisi\\notexist%s-%s.txt' % (startNum, endNum)
  Record format: 29005 notexist
- Unknown gender: the id exists, but the gender cannot be read from the page (on inspection, these cases have no activity time either).
  File: file4 = 'ruisi\\unkownsex%s-%s.txt' % (startNum, endNum)
  Record format: 221794 unkownsex
- Network error: the connection dropped or the server failed; these ids must be re-checked later.
  File: file5 = 'ruisi\\httperror%s-%s.txt' % (startNum, endNum)
  Record format: 271004 httperror
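As a quick illustration (not part of the original scripts), the five record shapes above can be told apart by field count and the trailing token. The helper `classify_record` and the sample lines are hypothetical:

```python
# -*- coding: utf-8 -*-
# Hypothetical helper: classify one output record by its shape,
# mirroring the five file formats listed above.
def classify_record(line):
    parts = line.split()
    if len(parts) == 2:
        # "29005 notexist", "221794 unkownsex", "271004 httperror"
        return parts[1]
    if parts[-1] == 'notime':
        return 'errTime'   # "2566 女 notime"
    return 'correct'       # "293001 男 2015-5-1 19:17"

print(classify_record('293001 男 2015-5-1 19:17'))
```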
How to crawl without interruption

- One design goal of this project is uninterrupted crawling. If the crawler simply exited whenever the network dropped or the BBS server failed, we would have to resume from wherever it stopped, or worse, start over from scratch.
- So my approach is: when a failure occurs, record the offending ids and move on; only after a full pass do I re-crawl those ids.
- Part 1 of this series defined getInfo(myurl, seWord), which fetches the given URL and extracts information with the given regular expression. It can be used to read both the gender and the last-activity time.
- On top of it we define a "safe" fetch function that never aborts the run, using try/except. The code below calls getInfo(myurl, seWord) up to twice; if the second attempt also raises, the id is recorded in file5. If the information can be fetched, it is returned.
```python
file5 = 'ruisi\\httperror%s-%s.txt' % (startNum, endNum)

def safeGet(myid, myurl, seWord):
    try:
        return getInfo(myurl, seWord)
    except:
        try:
            return getInfo(myurl, seWord)
        except:
            httperrorfile = open(file5, 'a')
            info = '%d %s\n' % (myid, 'httperror')
            httperrorfile.write(info)
            httperrorfile.close()
            return 'httperror'
```
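The nested try/except amounts to "retry once, then give up". The same idea can be written as a small generic helper (a Python 3 sketch; `retry` is my name, not from the original code):

```python
# Generic "try n times, then fall back" wrapper,
# equivalent in spirit to the nested try/except above.
def retry(fn, attempts=2, fallback=None):
    for _ in range(attempts):
        try:
            return fn()
        except Exception:
            pass               # swallow the error and try again
    return fallback            # every attempt failed

# e.g.: retry(lambda: getInfo(myurl, seWord), fallback='httperror')
```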
Iterating over user ids from 1 to 300,000

We define a function that, for each id, fetches the gender and the last-activity time. If a gender is found, we go on to check whether a time exists; if no gender is found, we determine whether the user does not exist at all or the gender is simply unreadable. Network outages and BBS server failures have to be handled along the way.
```python
url1 = 'http://rs.xidian.edu.cn/home.php?mod=space&uid=%s'
url2 = 'http://rs.xidian.edu.cn/home.php?mod=space&uid=%s&do=profile'

def searchWeb(idArr):
    for id in idArr:
        sexUrl = url1 % (id)    # substitute the id into %s
        timeUrl = url2 % (id)
        sex = safeGet(id, sexUrl, sexRe)
        if not sex:
            # if sexUrl has no gender, try timeUrl as well
            sex = safeGet(id, timeUrl, sexRe)
        time = safeGet(id, timeUrl, timeRe)
        # on httperror the id must be re-crawled later
        # (note: compare strings with ==, not 'is')
        if (sex == 'httperror') or (time == 'httperror'):
            pass
        else:
            if sex:
                info = '%d %s' % (id, sex)
                if time:
                    info = '%s %s\n' % (info, time)
                    wfile = open(file1, 'a')
                    wfile.write(info)
                    wfile.close()
                else:
                    info = '%s %s\n' % (info, 'notime')
                    errtimefile = open(file2, 'a')
                    errtimefile.write(info)
                    errtimefile.close()
            else:
                # gender is None: check whether the user exists at all;
                # the user may simply not have filled in a gender.
                # (when the network is down, this extra call produces
                #  the 4 duplicate httperror lines discussed later)
                notexist = safeGet(id, sexUrl, notexistRe)
                if notexist == 'httperror':
                    pass
                else:
                    if notexist:
                        notexistfile = open(file3, 'a')
                        info = '%d %s\n' % (id, 'notexist')
                        notexistfile.write(info)
                        notexistfile.close()
                    else:
                        unkownsexfile = open(file4, 'a')
                        info = '%d %s\n' % (id, 'unkownsex')
                        unkownsexfile.write(info)
                        unkownsexfile.close()
```
A later review turned up a problem in this part of the code:
```python
sex = safeGet(id, sexUrl, sexRe)
if not sex:
    sex = safeGet(id, timeUrl, sexRe)
time = safeGet(id, timeUrl, timeRe)
```
When the network is down, this fragment calls safeGet three times, and every failing call appends another httperror line for the same id:
```
251538 httperror
251538 httperror
251538 httperror
251538 httperror
```
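One way to fix this (a sketch, not the author's code): instead of appending to file5 inside every failing safeGet() call, collect the failed ids in a set during the pass and write each id exactly once at the end:

```python
# A set dedupes repeated failures of the same id.
failed_ids = set()

def mark_httperror(myid):
    failed_ids.add(myid)

# simulate three failing calls for the same id
for _ in range(3):
    mark_httperror(251538)

# emit each failed id once, e.g. at the end of the pass
lines = ['%d httperror\n' % i for i in sorted(failed_ids)]
print(lines)
```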
Crawling with multiple workers?

The statistics stage can safely run in parallel, because each id range writes to its own independent set of files.
1. An introduction to Popen

Popen lets you customize standard input, standard output, and standard error. During my internship at SAP, the team used Popen a lot on Linux, probably because it makes redirecting output convenient. The code below borrows that team's approach: Popen invokes a system command, and the three consecutive communicate() calls mean we wait for all three children to finish.
A puzzle: experimenting a bit, the three communicate() calls must come together, after all three Popen calls, so that the three workers start at the same time and we then wait for all of them to finish.
```python
p1 = Popen(['python', 'ruisi.py', str(s0), str(s1)], bufsize=10000, stdout=subprocess.PIPE)
p2 = Popen(['python', 'ruisi.py', str(s1), str(s2)], bufsize=10000, stdout=subprocess.PIPE)
p3 = Popen(['python', 'ruisi.py', str(s2), str(s3)], bufsize=10000, stdout=subprocess.PIPE)
p1.communicate()
p2.communicate()
p3.communicate()
```
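The ordering matters only in that all the Popen calls come before any communicate(): Popen starts each child immediately, and communicate() merely blocks the parent. The pattern generalizes to N workers (a Python 3 sketch; the inline `-c` command stands in for ruisi.py):

```python
import subprocess
import sys

def run_batch(ranges):
    # start every child first, so they run concurrently...
    procs = [subprocess.Popen(
                 [sys.executable, '-c', 'print(%d, %d)' % (a, b)],
                 stdout=subprocess.PIPE)
             for a, b in ranges]
    # ...then wait for each in turn; only the parent blocks here,
    # the children have been running in parallel all along
    return [p.communicate()[0] for p in procs]

outs = run_batch([(1, 1001), (1001, 2001), (2001, 3001)])
```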
2. The single-threaded crawler script

Usage: python ruisi.py <startNum> <endNum>

This script crawls the information for ids in [startNum, endNum) and writes it to the corresponding files. It is single-threaded; parallelism is achieved by the external driver that invokes it.
```python
# ruisi.py
# coding=utf-8
import urllib2, re, sys, threading, time, thread

# myurl: the target URL
# seWord: a compiled regular expression (unicode)
# returns the matched group, or None if there is no match
def getInfo(myurl, seWord):
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.1.6) Gecko/20091201 Firefox/3.5.6'
    }
    req = urllib2.Request(url=myurl, headers=headers)
    time.sleep(0.3)
    response = urllib2.urlopen(req)
    html = response.read()
    html = unicode(html, 'utf-8')
    timeMatch = seWord.search(html)
    if timeMatch:
        s = timeMatch.groups()
        return s[0]
    else:
        return None

# try getInfo() twice;
# after the second failure, mark the id as httperror
def safeGet(myid, myurl, seWord):
    try:
        return getInfo(myurl, seWord)
    except:
        try:
            return getInfo(myurl, seWord)
        except:
            httperrorfile = open(file5, 'a')
            info = '%d %s\n' % (myid, 'httperror')
            httperrorfile.write(info)
            httperrorfile.close()
            return 'httperror'

# crawl a range of ids, e.g. xrange(1, 1001)
def searchWeb(idArr):
    for id in idArr:
        sexUrl = url1 % (id)
        timeUrl = url2 % (id)
        sex = safeGet(id, sexUrl, sexRe)
        if not sex:
            sex = safeGet(id, timeUrl, sexRe)
        time = safeGet(id, timeUrl, timeRe)
        if (sex == 'httperror') or (time == 'httperror'):
            pass
        else:
            if sex:
                info = '%d %s' % (id, sex)
                if time:
                    info = '%s %s\n' % (info, time)
                    wfile = open(file1, 'a')
                    wfile.write(info)
                    wfile.close()
                else:
                    info = '%s %s\n' % (info, 'notime')
                    errtimefile = open(file2, 'a')
                    errtimefile.write(info)
                    errtimefile.close()
            else:
                notexist = safeGet(id, sexUrl, notexistRe)
                if notexist == 'httperror':
                    pass
                else:
                    if notexist:
                        notexistfile = open(file3, 'a')
                        info = '%d %s\n' % (id, 'notexist')
                        notexistfile.write(info)
                        notexistfile.close()
                    else:
                        unkownsexfile = open(file4, 'a')
                        info = '%d %s\n' % (id, 'unkownsex')
                        unkownsexfile.write(info)
                        unkownsexfile.close()

def main():
    reload(sys)
    sys.setdefaultencoding('utf-8')
    if len(sys.argv) != 3:
        print 'usage: python ruisi.py <startNum> <endNum>'
        sys.exit(-1)
    global sexRe, timeRe, notexistRe, url1, url2, \
        file1, file2, file3, file4, file5, startNum, endNum
    startNum = int(sys.argv[1])
    endNum = int(sys.argv[2])
    sexRe = re.compile(u'em>\u6027\u522b</em>(.*?)</li')
    timeRe = re.compile(u'em>\u4e0a\u6b21\u6d3b\u52a8\u65f6\u95f4</em>(.*?)</li')
    notexistRe = re.compile(u'(p>)\u62b1\u6b49\uff0c\u60a8\u6307\u5b9a\u7684\u7528\u6237\u7a7a\u95f4\u4e0d\u5b58\u5728<')
    url1 = 'http://rs.xidian.edu.cn/home.php?mod=space&uid=%s'
    url2 = 'http://rs.xidian.edu.cn/home.php?mod=space&uid=%s&do=profile'
    file1 = '..\\newRuisi\\correct%s-%s.txt' % (startNum, endNum)
    file2 = '..\\newRuisi\\errTime%s-%s.txt' % (startNum, endNum)
    file3 = '..\\newRuisi\\notexist%s-%s.txt' % (startNum, endNum)
    file4 = '..\\newRuisi\\unkownsex%s-%s.txt' % (startNum, endNum)
    file5 = '..\\newRuisi\\httperror%s-%s.txt' % (startNum, endNum)
    searchWeb(xrange(startNum, endNum))
    # numThread = 10
    # searchWeb(xrange(endNum))
    # total = 0
    # for i in xrange(numThread):
    #     data = xrange(1 + i, endNum, numThread)
    #     total += len(data)
    #     t = threading.Thread(target=searchWeb, args=(data,))
    #     t.start()
    # print total

main()
```
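The script above targets Python 2 (urllib2, print statements, unicode()). For readers on Python 3, a rough equivalent of getInfo() might look like this sketch; splitting out the regex step also makes it testable offline. The names get_info/extract are mine, not the original's:

```python
# -*- coding: utf-8 -*-
import re
import time
import urllib.request

def extract(html, se_word):
    # return the first capture group on a match, else None
    match = se_word.search(html)
    return match.group(1) if match else None

def get_info(myurl, se_word, pause=0.3):
    # throttle, as the original does with time.sleep(0.3)
    time.sleep(pause)
    req = urllib.request.Request(myurl, headers={'User-Agent': 'Mozilla/5.0'})
    with urllib.request.urlopen(req) as resp:
        return extract(resp.read().decode('utf-8'), se_word)

# offline check of the extraction step against a page fragment
sex_re = re.compile(u'em>性別</em>(.*?)</li')
print(extract(u'<li><em>性別</em>男</li>', sex_re))
```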
The parallel crawler

The driver script (three workers, each a child process launched via Popen):
```python
# coding=utf-8
from subprocess import Popen
import subprocess
import threading, time

startn = 1
endn = 300001
step = 1000
total = (endn - startn + 1) / step
ISOTIMEFORMAT = '%Y-%m-%d %X'

# hardcoded to 3 workers; I did not investigate whether
# 3, 4, or more workers would be optimal.
# prints a formatted timestamp for each batch,
# plus the elapsed time in seconds
for i in xrange(0, total, 3):
    startNumber = startn + step * i
    startTime = time.clock()
    s0 = startNumber
    s1 = startNumber + step
    s2 = startNumber + step * 2
    s3 = startNumber + step * 3
    p1 = Popen(['python', 'ruisi.py', str(s0), str(s1)], bufsize=10000, stdout=subprocess.PIPE)
    p2 = Popen(['python', 'ruisi.py', str(s1), str(s2)], bufsize=10000, stdout=subprocess.PIPE)
    p3 = Popen(['python', 'ruisi.py', str(s2), str(s3)], bufsize=10000, stdout=subprocess.PIPE)
    startftime = '[ ' + time.strftime(ISOTIMEFORMAT, time.localtime()) + ' ] '
    print startftime + '%s - %s download start... ' % (s0, s1)
    print startftime + '%s - %s download start... ' % (s1, s2)
    print startftime + '%s - %s download start... ' % (s2, s3)
    p1.communicate()
    p2.communicate()
    p3.communicate()
    endftime = '[ ' + time.strftime(ISOTIMEFORMAT, time.localtime()) + ' ] '
    print endftime + '%s - %s download end !!! ' % (s0, s1)
    print endftime + '%s - %s download end !!! ' % (s1, s2)
    print endftime + '%s - %s download end !!! ' % (s2, s3)
    endTime = time.clock()
    print "cost time " + str(endTime - startTime) + " s"
    time.sleep(5)
```
Here is the log with timestamps:
```
"D:\Program Files\Python27\python.exe" E:/pythonProject/webCrawler/sum.py
[ 2015-11-23 11:31:15 ] 1 - 1001 download start...
[ 2015-11-23 11:31:15 ] 1001 - 2001 download start...
[ 2015-11-23 11:31:15 ] 2001 - 3001 download start...
[ 2015-11-23 11:53:44 ] 1 - 1001 download end !!!
[ 2015-11-23 11:53:44 ] 1001 - 2001 download end !!!
[ 2015-11-23 11:53:44 ] 2001 - 3001 download end !!!
cost time 1348.99480677 s
[ 2015-11-23 11:53:50 ] 3001 - 4001 download start...
[ 2015-11-23 11:53:50 ] 4001 - 5001 download start...
[ 2015-11-23 11:53:50 ] 5001 - 6001 download start...
[ 2015-11-23 12:16:56 ] 3001 - 4001 download end !!!
[ 2015-11-23 12:16:56 ] 4001 - 5001 download end !!!
[ 2015-11-23 12:16:56 ] 5001 - 6001 download end !!!
cost time 1386.06407734 s
[ 2015-11-23 12:17:01 ] 6001 - 7001 download start...
[ 2015-11-23 12:17:01 ] 7001 - 8001 download start...
[ 2015-11-23 12:17:01 ] 8001 - 9001 download start...
```
From the multi-process log above, 1000 users take roughly 500 s of wall time (each batch of 3×1000 ids finishes in about 1350–1390 s), i.e. about 0.5 s per id. At that rate, 500 s × 300 / 3600 ≈ 41.7 hours, roughly two days for all 300,000 ids.
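The back-of-the-envelope estimate can be checked directly (plain arithmetic, nothing project-specific):

```python
seconds_per_id = 0.5      # observed: ~500 s of wall time per 1000 users
total_ids = 300000
hours = seconds_per_id * total_ids / 3600
print(round(hours, 1))    # roughly 41.7 hours, i.e. about two days
```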
For comparison, here is the timing of a single-threaded run:
```
"D:\Program Files\Python27\python.exe" E:/pythonProject/webCrawler/sum.py
1 - 1001 download start...
1 - 1001 download end !!!
cost time 1583.65911889 s
1001 - 2001 download start...
1001 - 2001 download end !!!
cost time 1342.46874278 s
2001 - 3001 download start...
2001 - 3001 download end !!!
cost time 1327.10885725 s
3001 - 4001 download start...
```
A single worker also needs about 1500 s per 1000 users, whereas the parallel version covers 3 × 1000 users in roughly the same 1500 s. Running several workers therefore saves a great deal of time over a single one.
Note:

getInfo(myurl, seWord) contains a time.sleep(0.3) call, added so the crawler does not hammer the BBS and get its access blocked. This 0.3 s delay is included in both the parallel and single-worker timings above.

Finally, here is the original log without timestamps. (The timestamps were added later so one can tell when crawling started, in case a worker hangs.)
```
"D:\Program Files\Python27\python.exe" E:/pythonProject/webCrawler/sum.py
1 - 1001 download start...
1001 - 2001 download start...
2001 - 3001 download start...
1 - 1001 download end !!!
1001 - 2001 download end !!!
2001 - 3001 download end !!!
cost time 1532.74102812 s
3001 - 4001 download start...
4001 - 5001 download start...
5001 - 6001 download start...
3001 - 4001 download end !!!
4001 - 5001 download end !!!
5001 - 6001 download end !!!
cost time 2652.01624951 s
6001 - 7001 download start...
7001 - 8001 download start...
8001 - 9001 download start...
6001 - 7001 download end !!!
7001 - 8001 download end !!!
8001 - 9001 download end !!!
cost time 1880.61513696 s
9001 - 10001 download start...
10001 - 11001 download start...
11001 - 12001 download start...
9001 - 10001 download end !!!
10001 - 11001 download end !!!
11001 - 12001 download end !!!
cost time 1634.40575553 s
12001 - 13001 download start...
13001 - 14001 download start...
14001 - 15001 download start...
12001 - 13001 download end !!!
13001 - 14001 download end !!!
14001 - 15001 download end !!!
cost time 1403.62795496 s
15001 - 16001 download start...
16001 - 17001 download start...
17001 - 18001 download start...
15001 - 16001 download end !!!
16001 - 17001 download end !!!
17001 - 18001 download end !!!
cost time 1271.42177906 s
18001 - 19001 download start...
19001 - 20001 download start...
20001 - 21001 download start...
18001 - 19001 download end !!!
19001 - 20001 download end !!!
20001 - 21001 download end !!!
cost time 1476.04122024 s
21001 - 22001 download start...
22001 - 23001 download start...
23001 - 24001 download start...
21001 - 22001 download end !!!
22001 - 23001 download end !!!
23001 - 24001 download end !!!
cost time 1431.37074164 s
24001 - 25001 download start...
25001 - 26001 download start...
26001 - 27001 download start...
24001 - 25001 download end !!!
25001 - 26001 download end !!!
26001 - 27001 download end !!!
cost time 1411.45186874 s
27001 - 28001 download start...
28001 - 29001 download start...
29001 - 30001 download start...
27001 - 28001 download end !!!
28001 - 29001 download end !!!
29001 - 30001 download end !!!
cost time 1396.88837788 s
30001 - 31001 download start...
31001 - 32001 download start...
32001 - 33001 download start...
30001 - 31001 download end !!!
31001 - 32001 download end !!!
32001 - 33001 download end !!!
cost time 1389.01316718 s
33001 - 34001 download start...
34001 - 35001 download start...
35001 - 36001 download start...
33001 - 34001 download end !!!
34001 - 35001 download end !!!
35001 - 36001 download end !!!
cost time 1318.16040825 s
36001 - 37001 download start...
37001 - 38001 download start...
38001 - 39001 download start...
36001 - 37001 download end !!!
37001 - 38001 download end !!!
38001 - 39001 download end !!!
cost time 1362.59222822 s
39001 - 40001 download start...
40001 - 41001 download start...
41001 - 42001 download start...
39001 - 40001 download end !!!
40001 - 41001 download end !!!
41001 - 42001 download end !!!
cost time 1253.62498539 s
42001 - 43001 download start...
43001 - 44001 download start...
44001 - 45001 download start...
42001 - 43001 download end !!!
43001 - 44001 download end !!!
44001 - 45001 download end !!!
cost time 1313.50461988 s
45001 - 46001 download start...
46001 - 47001 download start...
47001 - 48001 download start...
45001 - 46001 download end !!!
46001 - 47001 download end !!!
47001 - 48001 download end !!!
cost time 1322.32317331 s
48001 - 49001 download start...
49001 - 50001 download start...
50001 - 51001 download start...
48001 - 49001 download end !!!
49001 - 50001 download end !!!
50001 - 51001 download end !!!
cost time 1381.58027296 s
51001 - 52001 download start...
52001 - 53001 download start...
53001 - 54001 download start...
51001 - 52001 download end !!!
52001 - 53001 download end !!!
53001 - 54001 download end !!!
cost time 1357.78699459 s
54001 - 55001 download start...
55001 - 56001 download start...
56001 - 57001 download start...
54001 - 55001 download end !!!
55001 - 56001 download end !!!
56001 - 57001 download end !!!
cost time 1359.76377246 s
57001 - 58001 download start...
58001 - 59001 download start...
59001 - 60001 download start...
57001 - 58001 download end !!!
58001 - 59001 download end !!!
59001 - 60001 download end !!!
cost time 1335.47829775 s
60001 - 61001 download start...
61001 - 62001 download start...
62001 - 63001 download start...
60001 - 61001 download end !!!
61001 - 62001 download end !!!
62001 - 63001 download end !!!
cost time 1354.82727645 s
63001 - 64001 download start...
64001 - 65001 download start...
65001 - 66001 download start...
63001 - 64001 download end !!!
64001 - 65001 download end !!!
65001 - 66001 download end !!!
cost time 1260.54731607 s
66001 - 67001 download start...
67001 - 68001 download start...
68001 - 69001 download start...
66001 - 67001 download end !!!
67001 - 68001 download end !!!
68001 - 69001 download end !!!
cost time 1363.58255686 s
69001 - 70001 download start...
70001 - 71001 download start...
71001 - 72001 download start...
69001 - 70001 download end !!!
70001 - 71001 download end !!!
71001 - 72001 download end !!!
cost time 1354.17163074 s
72001 - 73001 download start...
73001 - 74001 download start...
74001 - 75001 download start...
72001 - 73001 download end !!!
73001 - 74001 download end !!!
74001 - 75001 download end !!!
cost time 1335.00425259 s
75001 - 76001 download start...
76001 - 77001 download start...
77001 - 78001 download start...
75001 - 76001 download end !!!
76001 - 77001 download end !!!
77001 - 78001 download end !!!
cost time 1360.44054978 s
78001 - 79001 download start...
79001 - 80001 download start...
80001 - 81001 download start...
78001 - 79001 download end !!!
79001 - 80001 download end !!!
80001 - 81001 download end !!!
cost time 1369.72662457 s
81001 - 82001 download start...
82001 - 83001 download start...
83001 - 84001 download start...
81001 - 82001 download end !!!
82001 - 83001 download end !!!
83001 - 84001 download end !!!
cost time 1369.95550676 s
84001 - 85001 download start...
85001 - 86001 download start...
86001 - 87001 download start...
84001 - 85001 download end !!!
85001 - 86001 download end !!!
86001 - 87001 download end !!!
cost time 1482.53886433 s
87001 - 88001 download start...
88001 - 89001 download start...
89001 - 90001 download start...
```
That concludes Part 2 of using a Python crawler to compute the male/female ratio on a school BBS, focusing on the parallel crawler. I hope it helps with your own study.