
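I have used crawlers before. For example, I used Nutch to crawl a set of seed URLs and built search on top of the crawled data, and I skimmed some of its source code. Nutch, of course, handles crawling very thoroughly and carefully. Whenever I watched the fetched pages and processing messages scroll past on the screen, it felt like black magic. While working through Spring MVC this time, I decided to build a small crawler of my own. It could be simple, and a few small bugs would not matter; all I needed was something that could crawl the information I wanted from a given seed site. Whenever an exception came up, I fixed it: sometimes an API was misused, sometimes an HTTP request returned an abnormal status, sometimes database reads or writes went wrong. Through this cycle of hitting exceptions and resolving them, JewelCrawler (named after my son's nickname) became able to crawl data on its own, and it even picked up a small extra skill: sentiment analysis based on the Word2Vec algorithm.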

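There may be more unknown exceptions waiting to be solved, and some performance work remains, such as the interaction with the database and data reads and writes. But I probably will not have much time for this before the end of the year, so today I am writing a brief summary. The previous two posts focused on features and results; this one describes how JewelCrawler came to be, and the code is on GitHub (the source link is at the end of the article). Follow it if you are interested. (For learning and discussion only; please do not use it for anything else, and spare a thought for Douban. A little more sincerity, a little less harm.)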

Environment

Development tool: IntelliJ IDEA 14

Database: MySQL 5.5, plus Navicat as a database management tool (used to connect to and query the database)

A Detailed Look at a Java Douban Movie Crawler: The Growth of a Small Crawler (with Source Code)

Language: Java

JAR management: Maven

Version control: Git

Directory structure


Where

    com.ansj.vec is a Java implementation of the Word2Vec algorithm

    com.jackie.crawler.doubanmovie is the crawler module, which in turn contains several packages


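Some packages are empty because those modules are not used yet. Among them:

        constants: holds the constant classes
        crawl: holds the crawler entry point
        entity: holds the entity classes mapped to database tables
        test: holds the test classes
        utils: holds the utility classes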

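The resource module holds configuration and resource files, for example:

        beans.xml: the Spring context configuration file
        seed.properties: the seed file
        stopwords.dic: the stop-word dictionary
        comment12031715.txt: the crawled short-review data
        tokenizerResult.txt: the result file after word segmentation with IKAnalyzer
        vector.mod: the model data trained with the Word2Vec algorithm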

The test module is for writing unit tests.

Database configuration

1. Add the dependencies

JewelCrawler is managed with Maven, so you only need to add the corresponding dependencies to pom.xml.

    <dependency>
      <groupId>org.springframework</groupId>
      <artifactId>spring-jdbc</artifactId>
      <version>4.1.1.RELEASE</version>
    </dependency>
    <dependency>
      <groupId>commons-pool</groupId>
      <artifactId>commons-pool</artifactId>
      <version>1.6</version>
    </dependency>
    <dependency>
      <groupId>commons-dbcp</groupId>
      <artifactId>commons-dbcp</artifactId>
      <version>1.4</version>
    </dependency>
    <dependency>
      <groupId>mysql</groupId>
      <artifactId>mysql-connector-java</artifactId>
      <version>5.1.38</version>
    </dependency>

2. Declare the data source bean

We need to declare the data source bean in beans.xml.

    <context:property-placeholder location="classpath*:*.properties"/>

    <bean id="dataSource" class="org.apache.commons.dbcp.BasicDataSource" destroy-method="close">
      <property name="driverClassName" value="${jdbc.driver}"/>
      <property name="url" value="${jdbc.url}"/>
      <property name="username" value="${jdbc.username}"/>
      <property name="password" value="${jdbc.password}"/>
    </bean>

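Note: this binds the external configuration file jdbc.properties (the placeholder location picks up the .properties files on the classpath), and the actual data source parameters are read from that file.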
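To make the placeholder mechanism concrete, here is a minimal sketch of how a ${...} value could be filled in from a properties file. It is a simplified illustration, not Spring's actual implementation. The class name PlaceholderDemo and the JDBC URL are assumptions made up for the example, since the contents of jdbc.properties are not shown in this post.

```java
import java.util.Properties;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class PlaceholderDemo {
    // Matches ${key} and captures the key name.
    private static final Pattern PLACEHOLDER = Pattern.compile("\\$\\{([^}]+)\\}");

    /** Replaces each ${key} in the template with the matching property value. */
    public static String resolve(String template, Properties props) {
        Matcher m = PLACEHOLDER.matcher(template);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String key = m.group(1);
            String value = props.getProperty(key);
            if (value == null) {
                throw new IllegalArgumentException("Unresolved placeholder: " + key);
            }
            // quoteReplacement keeps '$' and '\' in the value from being treated specially.
            m.appendReplacement(out, Matcher.quoteReplacement(value));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        // Hypothetical stand-in for jdbc.properties; the real values are not given in this post.
        Properties props = new Properties();
        props.setProperty("jdbc.url", "jdbc:mysql://localhost:3306/test");
        System.out.println(resolve("${jdbc.url}", props));
    }
}
```

The point is only that each ${jdbc.*} value in beans.xml needs a matching key in a properties file on the classpath; a missing key shows up as an unresolved placeholder when the context starts.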

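If you run into the error "SQL [insert into user(id) values(?)]; Field 'name' doesn't have a default value;", the insert statement is leaving out a NOT NULL column (here, name) that has no default value. My fix was to change the table definition so the column no longer needs an explicit value. Setting a field to auto-increment works for integer key columns such as id; a column like name instead needs a default value, needs to allow NULL, or needs to be supplied in the insert statement.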

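Problems encountered while parsing pages

The crawled pages need their DOM structure parsed to extract the data I want. Along the way I hit the following errors.

org.htmlparser.Node cannot be resolved

Solution: add the JAR dependency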

    <dependency>
      <groupId>org.htmlparser</groupId>
      <artifactId>htmlparser</artifactId>
      <version>1.6</version>
    </dependency>

org.apache.http.HttpEntity cannot be resolved

Solution: add the JAR dependency

    <dependency>
      <groupId>org.apache.httpcomponents</groupId>
      <artifactId>httpclient</artifactId>
      <version>4.5.2</version>
    </dependency>

These were just problems I hit along the way; in the end I used Jsoup for page parsing.

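Slow downloads from the Maven repository

I had been using the default Maven Central repository, and JAR downloads were very slow; I don't know whether that was my network or something else. Later I found Alibaba Cloud's Maven mirror online. After switching, downloads were nearly instant by comparison. Highly recommended.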

    <mirrors>
      <mirror>
        <id>alimaven</id>
        <name>aliyun maven</name>
        <url>http://maven.aliyun.com/nexus/content/groups/public/</url>
        <mirrorOf>central</mirrorOf>
      </mirror>
    </mirrors>

Find Maven's settings.xml file and add this mirror to it.

One way to read files under the resource module

For example, reading the seed.properties file:

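    @Test
    public void testFile() {
        // Resolve the classpath resource to a path on disk; this assumes the
        // resource is a plain file (not packed inside a JAR).
        File seedFile = new File(this.getClass().getResource("/seed.properties").getPath());
        System.out.print("===========" + seedFile.length() + "===========");
    }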
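As an alternative sketch, the same resource can be read as a stream and parsed into key/value pairs, which avoids depending on a file path. The class name SeedLoader and the key seed.url are hypothetical; the actual contents of seed.properties are not shown in this post, so an in-memory stream stands in for the file here.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Properties;

class SeedLoader {
    /** Loads key=value pairs from a stream, for example a classpath resource. */
    public static Properties load(InputStream in) {
        if (in == null) {
            throw new IllegalArgumentException("resource not found on the classpath");
        }
        Properties props = new Properties();
        try (InputStream stream = in) {
            props.load(stream);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return props;
    }

    public static void main(String[] args) {
        // In the project this would be:
        //   InputStream in = SeedLoader.class.getResourceAsStream("/seed.properties");
        // The key name and URL below are made up for the example.
        InputStream in = new ByteArrayInputStream(
                "seed.url=https://movie.douban.com/\n".getBytes(StandardCharsets.ISO_8859_1));
        Properties props = load(in);
        System.out.println(props.getProperty("seed.url"));
    }
}
```

getResourceAsStream returns null when the resource is missing, which is why load checks for null before reading.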

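About regular expressions

When using regular expressions, even if the input matches the Pattern you defined, you must call the matcher's find method first and only then call group to get the matched substring. Calling group directly does not return the result you want; it throws an exception instead.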
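A minimal sketch of that rule, using only java.util.regex. The pattern and the URL are hypothetical examples, not taken from JewelCrawler.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class FindBeforeGroupDemo {
    // Hypothetical pattern: pulls the numeric id out of a subject URL.
    private static final Pattern SUBJECT_ID = Pattern.compile("/subject/(\\d+)/");

    /** Returns the captured id, or null when the URL does not match. */
    public static String extractId(String url) {
        Matcher m = SUBJECT_ID.matcher(url);
        // find() must succeed before group() is called.
        return m.find() ? m.group(1) : null;
    }

    /** Returns true if group() throws when no match has been attempted yet. */
    public static boolean groupThrowsWithoutFind(String url) {
        Matcher m = SUBJECT_ID.matcher(url);
        try {
            m.group();
            return false;
        } catch (IllegalStateException e) {
            return true; // "No match found"
        }
    }

    public static void main(String[] args) {
        String url = "https://movie.douban.com/subject/12031715/";
        System.out.println(extractId(url));              // prints 12031715
        System.out.println(groupThrowsWithoutFind(url)); // prints true
    }
}
```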

I took a look at the source code of the Matcher class:


    package java.util.regex;

    import java.util.Objects;

    public final class Matcher implements MatchResult {

      /**
       * The Pattern object that created this Matcher.
       */
      Pattern parentPattern;

      /**
       * The storage used by groups. They may contain invalid values if
       * a group was skipped during the matching.
       */
      int[] groups;

      /**
       * The range within the sequence that is to be matched. Anchors
       * will match at these "hard" boundaries. Changing the region
       * changes these values.
       */
      int from, to;

      /**
       * Lookbehind uses this value to ensure that the subexpression
       * match ends at the point where the lookbehind was encountered.
       */
      int lookbehindTo;

      /**
       * The original string being matched.
       */
      CharSequence text;

      /**
       * Matcher state used by the last node. NOANCHOR is used when a
       * match does not have to consume all of the input. ENDANCHOR is
       * the mode used for matching all the input.
       */
      static final int ENDANCHOR = 1;
      static final int NOANCHOR = 0;
      int acceptMode = NOANCHOR;

      /**
       * The range of string that last matched the pattern. If the last
       * match failed then first is -1; last initially holds 0 then it
       * holds the index of the end of the last match (which is where the
       * next search starts).
       */
      int first = -1, last = 0;

      /**
       * The end index of what matched in the last match operation.
       */
      int oldLast = -1;

      /**
       * The index of the last position appended in a substitution.
       */
      int lastAppendPosition = 0;

      /**
       * Storage used by nodes to tell what repetition they are on in
       * a pattern, and where groups begin. The nodes themselves are stateless,
       * so they rely on this field to hold state during a match.
       */
      int[] locals;

      /**
       * Boolean indicating whether or not more input could change
       * the results of the last match.
       *
       * If hitEnd is true, and a match was found, then more input
       * might cause a different match to be found.
       * If hitEnd is true and a match was not found, then more
       * input could cause a match to be found.
       * If hitEnd is false and a match was found, then more input
       * will not change the match.
       * If hitEnd is false and a match was not found, then more
       * input will not cause a match to be found.
       */
      boolean hitEnd;

      /**
       * Boolean indicating whether or not more input could change
       * a positive match into a negative one.
       *
       * If requireEnd is true, and a match was found, then more
       * input could cause the match to be lost.
       * If requireEnd is false and a match was found, then more
       * input might change the match but the match won't be lost.
       * If a match was not found, then requireEnd has no meaning.
       */
      boolean requireEnd;

      /**
       * If transparentBounds is true then the boundaries of this
       * matcher's region are transparent to lookahead, lookbehind,
       * and boundary matching constructs that try to see beyond them.
       */
      boolean transparentBounds = false;

      /**
       * If anchoringBounds is true then the boundaries of this
       * matcher's region match anchors such as ^ and $.
       */
      boolean anchoringBounds = true;

      /**
       * No default constructor.
       */
      Matcher() {
      }

      /**
       * All matchers have the state used by Pattern during a match.
       */
      Matcher(Pattern parent, CharSequence text) {
        this.parentPattern = parent;
        this.text = text;

        // Allocate state storage
        int parentGroupCount = Math.max(parent.capturingGroupCount, 10);
        groups = new int[parentGroupCount * 2];
        locals = new int[parent.localCount];

        // Put fields into initial states
        reset();
      }

      ....

      /**
       * Returns the input subsequence matched by the previous match.
       *
       * <p> For a matcher <i>m</i> with input sequence <i>s</i>,
       * the expressions <i>m.</i><tt>group()</tt> and
       * <i>s.</i><tt>substring(</tt><i>m.</i><tt>start(),</tt> <i>m.</i><tt>end())</tt>
       * are equivalent. </p>
       *
       * <p> Note that some patterns, for example <tt>a*</tt>, match the empty
       * string. This method will return the empty string when the pattern
       * successfully matches the empty string in the input. </p>
       *
       * @return The (possibly empty) subsequence matched by the previous match,
       *     in string form
       *
       * @throws IllegalStateException
       *     If no match has yet been attempted,
       *     or if the previous match operation failed
       */
      public String group() {
        return group(0);
      }

      /**
       * Returns the input subsequence captured by the given group during the
       * previous match operation.
       *
       * <p> For a matcher <i>m</i>, input sequence <i>s</i>, and group index
       * <i>g</i>, the expressions <i>m.</i><tt>group(</tt><i>g</i><tt>)</tt> and
       * <i>s.</i><tt>substring(</tt><i>m.</i><tt>start(</tt><i>g</i><tt>),</tt> <i>m.</i><tt>end(</tt><i>g</i><tt>))</tt>
       * are equivalent. </p>
       *
       * <p> <a href="Pattern.html#cg">Capturing groups</a> are indexed from left
       * to right, starting at one. Group zero denotes the entire pattern, so
       * the expression <tt>m.group(0)</tt> is equivalent to <tt>m.group()</tt>.
       * </p>
       *
       * <p> If the match was successful but the group specified failed to match
       * any part of the input sequence, then <tt>null</tt> is returned. Note
       * that some groups, for example <tt>(a*)</tt>, match the empty string.
       * This method will return the empty string when such a group successfully
       * matches the empty string in the input. </p>
       *
       * @param group
       *     The index of a capturing group in this matcher's pattern
       *
       * @return The (possibly empty) subsequence captured by the group
       *     during the previous match, or <tt>null</tt> if the group
       *     failed to match part of the input
       *
       * @throws IllegalStateException
       *     If no match has yet been attempted,
       *     or if the previous match operation failed
       *
       * @throws IndexOutOfBoundsException
       *     If there is no capturing group in the pattern
       *     with the given index
       */
      public String group(int group) {
        if (first < 0)
          throw new IllegalStateException("No match found");
        if (group < 0 || group > groupCount())
          throw new IndexOutOfBoundsException("No group " + group);
        if ((groups[group*2] == -1) || (groups[group*2+1] == -1))
          return null;
        return getSubSequence(groups[group * 2], groups[group * 2 + 1]).toString();
      }

      /**
       * Attempts to find the next subsequence of the input sequence that matches
       * the pattern.
       *
       * <p> This method starts at the beginning of this matcher's region, or, if
       * a previous invocation of the method was successful and the matcher has
       * not since been reset, at the first character not matched by the previous
       * match.
       *
       * <p> If the match succeeds then more information can be obtained via the
       * <tt>start</tt>, <tt>end</tt>, and <tt>group</tt> methods. </p>
       *
       * @return <tt>true</tt> if, and only if, a subsequence of the input
       *     sequence matches this matcher's pattern
       */
      public boolean find() {
        int nextSearchIndex = last;
        if (nextSearchIndex == first)
          nextSearchIndex++;

        // If next search starts before region, start it at region
        if (nextSearchIndex < from)
          nextSearchIndex = from;

        // If next search starts beyond region then it fails
        if (nextSearchIndex > to) {
          for (int i = 0; i < groups.length; i++)
            groups[i] = -1;
          return false;
        }
        return search(nextSearchIndex);
      }

      /**
       * Initiates a search to find a Pattern within the given bounds.
       * The groups are filled with default values and the match of the root
       * of the state machine is called. The state machine will hold the state
       * of the match as it proceeds in this matcher.
       *
       * Matcher.from is not set here, because it is the "hard" boundary
       * of the start of the search which anchors will set to. The from param
       * is the "soft" boundary of the start of the search, meaning that the
       * regex tries to match at that index but ^ won't match there. Subsequent
       * calls to the search methods start at a new "soft" boundary which is
       * the end of the previous match.
       */
      boolean search(int from) {
        this.hitEnd = false;
        this.requireEnd = false;
        from    = from < 0 ? 0 : from;
        this.first = from;
        this.oldLast = oldLast < 0 ? from : oldLast;
        for (int i = 0; i < groups.length; i++)
          groups[i] = -1;
        acceptMode = NOANCHOR;
        boolean result = parentPattern.root.match(this, from, text);
        if (!result)
          this.first = -1;
        this.oldLast = this.last;
        return result;
      }

      ...

    }

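Here is the reason. If you call group directly without calling find first, group() delegates to group(int group), whose body starts with the check if (first < 0). That condition holds, because the initial value of first is -1, so an IllegalStateException is thrown. If you call find first, it ends up calling search(nextSearchIndex). Note that nextSearchIndex has been assigned the value of last, which is 0. Execution then jumps into the search method: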

    boolean search(int from) {
      this.hitEnd = false;
      this.requireEnd = false;
      from    = from < 0 ? 0 : from;
      this.first = from;
      this.oldLast = oldLast < 0 ? from : oldLast;
      for (int i = 0; i < groups.length; i++)
        groups[i] = -1;
      acceptMode = NOANCHOR;
      boolean result = parentPattern.root.match(this, from, text);
      if (!result)
        this.first = -1;
      this.oldLast = this.last;
      return result;
    }
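The search listing above also shows that first is put back to -1 whenever a match attempt fails, so the same rule applies after the last match has been consumed. The sketch below illustrates both sides of this behavior through the public API only; the class name FindLoopDemo and the sample input are made up for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class FindLoopDemo {
    private static final Pattern DIGITS = Pattern.compile("\\d+");

    /** Collects every run of digits by calling find() until it returns false. */
    public static List<String> allNumbers(String input) {
        Matcher m = DIGITS.matcher(input);
        List<String> result = new ArrayList<>();
        while (m.find()) {
            // Each successful find() makes group() valid for that match.
            result.add(m.group());
        }
        return result;
    }

    /** Returns true if group() throws once find() has reported no more matches. */
    public static boolean groupThrowsAfterFailedFind(String input) {
        Matcher m = DIGITS.matcher(input);
        while (m.find()) {
            // Consume all matches; the final find() call fails.
        }
        try {
            m.group();
            return false;
        } catch (IllegalStateException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        System.out.println(allNumbers("a1b22c"));                 // prints [1, 22]
        System.out.println(groupThrowsAfterFailedFind("a1b22c")); // prints true
    }
}
```

In practice this means group should only be read inside the branch or loop guarded by a successful find.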