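1. Project requirements
Fetch the detailed data for an order in a certain system by its order number. No account or password login is required, but the page has an image CAPTCHA that must be recognized dynamically. The fetched data is then saved to a database.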
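2. Overall approach
1. Use Selenium to open the browser in headless (windowless) mode
2. Enter the order number into the input box
3. Take a screenshot of the image CAPTCHA and save it locally
4. Use Tesseract-OCR locally to convert the saved CAPTCHA into text
5. Enter the recognized CAPTCHA into its input box
6. Click the query button to fetch the list data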
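OCR in step 4 can misread the image, in which case steps 3 to 6 have to run again with a fresh CAPTCHA. The article itself does not show a retry, so the following is only a minimal sketch of how one could be wrapped around a single lookup attempt. The class `CaptchaRetry` and the method `retry` are hypothetical names, and an empty `Optional` stands for "the CAPTCHA was rejected".

```java
import java.util.Optional;
import java.util.function.Supplier;

/** Hypothetical helper: repeats one lookup attempt until it yields a result. */
final class CaptchaRetry {

    private CaptchaRetry() {}

    /**
     * Runs the attempt at most maxAttempts times and returns the first
     * non-empty result. An empty Optional means the attempt failed.
     */
    static <T> Optional<T> retry(int maxAttempts, Supplier<Optional<T>> attempt) {
        for (int i = 0; i < maxAttempts; i++) {
            Optional<T> result = attempt.get();
            if (result.isPresent()) {
                return result;
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        int[] calls = {0};
        // Simulated attempt: fails twice, then succeeds on the third call.
        Supplier<Optional<String>> attempt = () -> {
            calls[0]++;
            return calls[0] < 3 ? Optional.<String>empty() : Optional.of("order details");
        };
        Optional<String> data = retry(5, attempt);
        // prints "order details after 3 attempts"
        System.out.println(data.orElse("gave up") + " after " + calls[0] + " attempts");
    }
}
```

In a real run the supplier would reload the page, save the CAPTCHA, run OCR and submit the form, returning an empty result whenever the site reports a wrong or expired CAPTCHA.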
3. Implementation
1. Download and install Google Chrome and the matching driver, chromedriver.exe; note the driver's installation path and configure it in the project
2. Drive the browser with Selenium
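    // Imports needed by this fragment
    import java.util.List;
    import org.openqa.selenium.By;
    import org.openqa.selenium.Dimension;
    import org.openqa.selenium.WebDriver;
    import org.openqa.selenium.WebElement;
    import org.openqa.selenium.chrome.ChromeDriver;
    import org.openqa.selenium.chrome.ChromeOptions;

    // BROWSER_DRIVER, BROWSER_DRIVER_PATH and TARGET_URL are placeholders for the
    // driver property name, the chromedriver.exe location and the site to crawl.
    // code (the order number), list and log are defined elsewhere in the project.
    System.setProperty(BROWSER_DRIVER, BROWSER_DRIVER_PATH);
    ChromeOptions options = new ChromeOptions();
    options.addArguments("--headless");              // headless (windowless) mode
    options.addArguments("--disable-infobars");      // disable info bars
    options.addArguments("--disable-extensions");    // disable extensions
    options.addArguments("--disable-gpu");           // disable the GPU
    options.addArguments("--no-sandbox");            // disable the sandbox
    options.addArguments("--disable-dev-shm-usage");
    options.addArguments("--hide-scrollbars");       // hide scrollbars
    WebDriver driver = new ChromeDriver(options);
    try {
        driver.get(TARGET_URL);
        // Resize the browser window after the page has opened
        driver.manage().window().setSize(new Dimension(450, 260));
        // Save the CAPTCHA image locally
        saveImgToLocal(driver);
        Thread.sleep(2000);
        // Recognize the CAPTCHA with OCR
        String codeByOCR = getCodeByOCR();
        if (codeByOCR != null) {
            try {
                WebElement input1 = driver.findElement(By.id(TEXTBOX1));
                input1.sendKeys(code);
                WebElement input2 = driver.findElement(By.id(TEXTBOX2));
                input2.sendKeys(codeByOCR);
                // Click the query button, then read the result table
                WebElement addButton = driver.findElement(By.id(SELECT_BUTTON));
                addButton.click();
                List<WebElement> tRCollection = driver.findElement(By.id(TABLE_ID)).findElements(By.tagName("tr"));
                // Start at 1 to skip the header row
                for (int t = 1; t < tRCollection.size(); t++) {
                    List<WebElement> tDCollection = tRCollection.get(t).findElements(By.tagName("td"));
                    VipLogisticsMinHangDetailVo minHangDetailVo = new VipLogisticsMinHangDetailVo();
                    minHangDetailVo.setLogistics_number(code);
                    for (int i = 0; i < tDCollection.size(); i++) {
                        String text = tDCollection.get(i).getText();
                        // Each case needs a break; without it every later setter
                        // is also called with this cell's text.
                        switch (i) {
                            case 0: minHangDetailVo.setTime(text); break;
                            case 1: minHangDetailVo.setOutlet(text); break;
                            case 2: minHangDetailVo.setOrganization(text); break;
                            case 3: minHangDetailVo.setEvent(text); break;
                            case 4: minHangDetailVo.setDetail(text); break;
                            default: break;
                        }
                    }
                    list.add(minHangDetailVo);
                }
                log.info("CAPTCHA recognized successfully!");
            } catch (Exception e) {
                // The quoted strings are the site's own alert texts and must match them exactly.
                if (e.toString().contains("錯誤提示:驗證碼錯誤或已過期!")) {
                    log.error("CAPTCHA was recognized incorrectly! " + e.toString());
                } else if (e.toString().contains("錯誤提示:請輸入驗證碼!")) {
                    log.error("No CAPTCHA was entered!: " + e.toString());
                } else {
                    log.error("Other exception: " + e.toString());
                }
            }
        }
    } catch (Exception e) {
        e.printStackTrace();
    } finally {
        // Always close the browser, even when an exception was thrown
        driver.quit();
    }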
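The inner loop above maps each table cell to a field by its column index. In a Java `switch`, a `case` without `break` falls through into the following cases, so every case needs its own `break` for the mapping to be correct. The sketch below isolates that mapping using only the standard library; `RowMapper` and its nested `Detail` class are hypothetical stand-ins for the loop body and for `VipLogisticsMinHangDetailVo`, and the sample values are made up.

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical sketch of the column-to-field mapping for one table row. */
final class RowMapper {

    /** Stand-in for VipLogisticsMinHangDetailVo, reduced to plain fields. */
    static final class Detail {
        String time;
        String outlet;
        String organization;
        String event;
        String detail;
    }

    private RowMapper() {}

    /** Copies the cell texts of one row into a Detail, column by column. */
    static Detail mapRow(List<String> cells) {
        Detail d = new Detail();
        for (int i = 0; i < cells.size(); i++) {
            String text = cells.get(i);
            switch (i) {
                case 0: d.time = text; break;
                case 1: d.outlet = text; break;
                case 2: d.organization = text; break;
                case 3: d.event = text; break;
                case 4: d.detail = text; break;
                default: break; // ignore any extra columns
            }
        }
        return d;
    }

    public static void main(String[] args) {
        Detail d = mapRow(Arrays.asList(
                "2021-09-26 10:00", "Outlet A", "Org B", "Received", "Signed at front desk"));
        System.out.println(d.time + " | " + d.outlet + " | " + d.event);
    }
}
```

Rows with fewer than five cells simply leave the remaining fields as `null`.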
3. Save the image CAPTCHA locally as a screenshot (screenshot method)
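    // Imports needed by this method
    import java.awt.image.BufferedImage;
    import java.io.File;
    import java.io.IOException;
    import javax.imageio.ImageIO;
    import org.apache.commons.io.FileUtils;
    import org.openqa.selenium.By;
    import org.openqa.selenium.OutputType;
    import org.openqa.selenium.Point;
    import org.openqa.selenium.TakesScreenshot;
    import org.openqa.selenium.WebDriver;
    import org.openqa.selenium.WebElement;

    // IMG_ELEMENT_ID and LOCAL_SAVE_DIR are placeholders for the id of the
    // CAPTCHA <img> element and the local directory to save into.
    private void saveImgToLocal(WebDriver driver) {
        WebElement element = driver.findElement(By.id(IMG_ELEMENT_ID));
        // Take a screenshot of the whole page
        File screen = ((TakesScreenshot) driver).getScreenshotAs(OutputType.FILE);
        try {
            BufferedImage image = ImageIO.read(screen);
            // Crop a rectangle at the element's position with its width and height
            Point p = element.getLocation(); // element coordinates
            BufferedImage img = image.getSubimage(p.getX(), p.getY(),
                    element.getSize().getWidth(), element.getSize().getHeight());
            ImageIO.write(img, "png", screen);
            FileUtils.copyFile(screen, new File(LOCAL_SAVE_DIR + "imgname.png"));
        } catch (IOException e) {
            e.printStackTrace();
        }
    }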
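The crop relies on `BufferedImage.getSubimage`, which throws a `RasterFormatException` when the requested rectangle reaches outside the screenshot, for example when the element sits partly beyond the 450 x 260 window. The sketch below shows one way to guard against that by clamping the rectangle to the image bounds first. `CaptchaCrop` and `crop` are hypothetical names, and the coordinates in `main` are made up.

```java
import java.awt.Rectangle;
import java.awt.image.BufferedImage;

/** Hypothetical sketch: crops a region after clamping it to the image bounds. */
final class CaptchaCrop {

    private CaptchaCrop() {}

    /**
     * Returns the part of the requested rectangle that lies inside the page image.
     * Throws IllegalArgumentException if the rectangle is completely outside.
     */
    static BufferedImage crop(BufferedImage page, int x, int y, int width, int height) {
        Rectangle bounds = new Rectangle(0, 0, page.getWidth(), page.getHeight());
        Rectangle wanted = new Rectangle(x, y, width, height).intersection(bounds);
        if (wanted.isEmpty()) {
            throw new IllegalArgumentException("Element lies outside the screenshot");
        }
        return page.getSubimage(wanted.x, wanted.y, wanted.width, wanted.height);
    }

    public static void main(String[] args) {
        // Blank stand-in for a 450 x 260 screenshot
        BufferedImage page = new BufferedImage(450, 260, BufferedImage.TYPE_INT_RGB);
        // The requested 100 x 100 region overlaps the right and bottom edges
        BufferedImage captcha = crop(page, 400, 200, 100, 100);
        // prints "50x60"
        System.out.println(captcha.getWidth() + "x" + captcha.getHeight());
    }
}
```

A clamped crop may cut off part of the CAPTCHA, so in practice a warning or a larger window size is probably the better response than silently continuing.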
4. Save the image CAPTCHA locally (mouse method)
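    // Imports needed by this method
    import java.awt.Robot;
    import java.awt.event.KeyEvent;
    import org.openqa.selenium.By;
    import org.openqa.selenium.WebDriver;
    import org.openqa.selenium.interactions.Actions;

    // java.awt.Robot sends real key events, so this method needs a visible browser window.
    // IMG_ELEMENT_ID is a placeholder for the id of the CAPTCHA <img> element.
    private static void saveImgToLocal1(WebDriver driver) {
        Actions action = new Actions(driver);
        // Right-click the CAPTCHA image to open the context menu
        action.contextClick(driver.findElement(By.id(IMG_ELEMENT_ID))).build().perform();
        try {
            Robot robot = new Robot();
            Thread.sleep(1000);
            // Move down two menu entries, then confirm with Enter.
            // Release each key after pressing it; a key that stays pressed
            // keeps acting on the menu.
            robot.keyPress(KeyEvent.VK_DOWN);
            robot.keyRelease(KeyEvent.VK_DOWN);
            Thread.sleep(1000);
            robot.keyPress(KeyEvent.VK_DOWN);
            robot.keyRelease(KeyEvent.VK_DOWN);
            Thread.sleep(1000);
            robot.keyPress(KeyEvent.VK_ENTER);
            robot.keyRelease(KeyEvent.VK_ENTER);
            Thread.sleep(1000);
            // Run the external save program
            Runtime.getRuntime().exec(SAVE_IMG_EXE);
            Thread.sleep(10000);
        } catch (Exception e) {
            e.printStackTrace();
        }
    }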
5. Run OCR on the locally saved CAPTCHA
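    // Imports needed by this method
    import java.io.File;
    import net.sourceforge.tess4j.ITesseract;
    import net.sourceforge.tess4j.Tesseract;

    // LOCAL_IMG_DIR and TESSDATA_DIR are placeholders for the local image
    // directory and the directory holding the tessdata files.
    // systemFalg, replaceBlank and log are defined elsewhere in the project.
    private String getCodeByOCR() {
        String result = null;
        File dir = new File(LOCAL_IMG_DIR);
        if (!dir.exists()) {
            // Create the directory first; setWritable has no effect on a path that does not exist
            dir.mkdirs();
            if (systemFalg != 1) {
                dir.setWritable(true, false);
            }
        }
        File imageFile = new File(LOCAL_IMG_DIR + "imgname.png");
        if (imageFile.exists()) {
            ITesseract instance = new Tesseract();
            instance.setDatapath(TESSDATA_DIR);
            try {
                String doOCR = instance.doOCR(imageFile);
                // Strip whitespace from the raw OCR output
                result = replaceBlank(doOCR);
                log.info("Parsed CAPTCHA: {}", result != null ? result : "empty!");
            } catch (Exception e) {
                log.error("Exception while parsing the CAPTCHA!", e);
            }
        } else {
            log.error("The CAPTCHA file to parse does not exist!");
        }
        return result;
    }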
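The method above passes the raw OCR output through `replaceBlank`, which the article does not show. Raw OCR text often contains stray spaces and line breaks, so a plausible version removes all whitespace. The sketch below is an assumption about what such a helper might look like, not the author's implementation. Returning `null` for an empty result is also a guess, chosen because the caller logs "empty!" when the result is `null`. The class name `OcrText` is hypothetical.

```java
/** Hypothetical sketch of the replaceBlank helper used after OCR. */
final class OcrText {

    private OcrText() {}

    /**
     * Removes every whitespace character (spaces, tabs, line breaks) from the
     * raw OCR output. Returns null when nothing is left, or when raw is null.
     */
    static String replaceBlank(String raw) {
        if (raw == null) {
            return null;
        }
        String cleaned = raw.replaceAll("\\s+", "");
        return cleaned.isEmpty() ? null : cleaned;
    }

    public static void main(String[] args) {
        // prints "aB3d"
        System.out.println(replaceBlank(" a B3\td\n"));
        // prints "null"
        System.out.println(replaceBlank(" \n"));
    }
}
```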
With these pieces in place, the page's data can be fetched.
Original link: https://www.cnblogs.com/zhaohadoopone/p/15338813.html