
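Lucene is an open-source full-text search toolkit from the Apache Software Foundation. It began as a subproject of the Jakarta project group and is now a top-level Apache project. It is not a complete full-text search engine but a framework for building one: it provides a complete query engine and indexing engine, plus a partial text-analysis engine (for two Western languages, English and German). Lucene's goal is to give developers an easy-to-use toolkit for adding full-text search to a target system, or for building a complete full-text search engine on top of it.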

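Full-text search overview

Suppose a folder or a disk holds many files (text files, Word documents, Excel spreadsheets, PDFs), and we want to find the files that contain a given keyword. For example, we type Lucene, and every file whose content contains Lucene is returned. This is what full-text search means.

The natural approach is to build a mapping from keywords to the files that contain them. The figure below, borrowed from a slide deck, shows how this mapping is implemented.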

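Inverted index

[Figure: inverted index, mapping each keyword to the files that contain it. Alt text: "Using the Java API to Call Lucene in Spring Boot, Explained"]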
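The keyword-to-file mapping can be sketched in a few lines of plain Java. This is only an illustration of the idea, not Lucene's implementation: the class name `InvertedIndexSketch`, the sample file names, and the simple lowercase-and-split tokenization are all assumptions made for the example.

```java
import java.util.Collections;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

class InvertedIndexSketch {
    // keyword -> names of the files that contain it (both kept sorted)
    private final Map<String, Set<String>> postings = new TreeMap<>();

    /** Splits the text on non-word characters and records each lowercased keyword. */
    void addFile(String fileName, String text) {
        for (String token : text.toLowerCase(Locale.ROOT).split("\\W+")) {
            if (!token.isEmpty()) {
                postings.computeIfAbsent(token, k -> new TreeSet<>()).add(fileName);
            }
        }
    }

    /** Returns the files that contain the keyword, or an empty set if there are none. */
    Set<String> search(String keyword) {
        return postings.getOrDefault(keyword.toLowerCase(Locale.ROOT), Collections.emptySet());
    }

    public static void main(String[] args) {
        InvertedIndexSketch index = new InvertedIndexSketch();
        index.addFile("notes.txt", "Lucene is a search library");
        index.addFile("report.txt", "Spark is a compute engine");
        index.addFile("todo.txt", "Try Lucene with Spring Boot");
        System.out.println(index.search("Lucene")); // prints [notes.txt, todo.txt]
    }
}
```

A search is a single map lookup, so its cost does not grow with the number of files scanned. That is the reason for building the mapping ahead of time.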

With this mapping in place, let's look at Lucene's architecture.

The figure below appears in almost every introduction to Lucene, and it does sum up the essence of the design.

[Figure: Lucene architecture overview]

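As the figure shows, using Lucene comes down to two steps:

1 Create the index: IndexWriter indexes the various files and saves the result in the location where the index files are stored.

2 Search the index for documents related to a keyword.

Lucene uses this "inverted index" technique to implement the keyword-to-document mapping.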

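Lucene mathematical model

Documents, fields, and tokens

The document is the atomic unit of indexing and search in Lucene. A document is a container for one or more fields, and the fields in turn hold the content that is "really" searched. Field values are run through tokenization, which produces one or more tokens.

For example, the information about a novel (斗破蒼穹) can be treated as one document. That document contains several fields, such as the title (斗破蒼穹), author, summary, and last-updated time. Tokenizing the title field yields one or more tokens (斗, 破, 蒼, 穹).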
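The document, field, and token relationship can be modeled roughly as follows. This is a minimal sketch in plain Java, not Lucene's own classes: `DocumentModelSketch`, the field names, and the placeholder field values are assumptions, and the one-token-per-character tokenizer simply mirrors the example above rather than any particular analyzer. The title is written with Unicode escapes so the source stays ASCII-only.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class DocumentModelSketch {
    /** A document is a container of named fields; insertion order is kept. */
    static Map<String, String> novel() {
        Map<String, String> doc = new LinkedHashMap<>();
        // The four-character novel title from the example above.
        doc.put("title", "\u6597\u7834\u84BC\u7A79");
        doc.put("author", "placeholder author");     // hypothetical value
        doc.put("summary", "placeholder summary");   // hypothetical value
        doc.put("lastUpdated", "placeholder date");  // hypothetical value
        return doc;
    }

    /** Single-character tokenization: each non-whitespace code point becomes one token. */
    static List<String> tokenize(String fieldValue) {
        List<String> tokens = new ArrayList<>();
        fieldValue.codePoints()
                  .filter(cp -> !Character.isWhitespace(cp))
                  .forEach(cp -> tokens.add(new String(Character.toChars(cp))));
        return tokens;
    }

    public static void main(String[] args) {
        Map<String, String> doc = novel();
        // One document, several fields, and the tokens of one field.
        System.out.println(doc.keySet()); // prints [title, author, summary, lastUpdated]
        System.out.println(tokenize(doc.get("title")));
    }
}
```

On a UTF-8 console the second line lists the four tokens 斗, 破, 蒼, 穹. A real analyzer would usually group characters into words instead of splitting every character.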

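Lucene file structure

Hierarchy

index
An index is stored in a single directory.

segment
An index can contain multiple segments. Segments are independent of one another. Adding new documents may create a new segment, and separate segments can be merged into a new one.

document
The document is the basic unit of indexing. Different documents are stored in different segments, and one segment can hold many documents.

field
A document carries different kinds of information, which can be split into fields and indexed separately.

term
The term is the smallest unit of the index: the data that remains after lexical analysis and language processing.

Forward information

Stores the containment relationship from index down to term, level by level: index-->segment-->document-->field-->term.

Inverse information

Stores the mapping from the terms in the dictionary to their postings lists: term-->document.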

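IndexWriter
One of the most important classes in Lucene. It adds documents to the index and controls the parameters used during indexing.

Analyzer
Analyzes the text the search engine encounters. Commonly used analyzers include StandardAnalyzer, StopAnalyzer, and WhitespaceAnalyzer.

Directory
Where the index is stored. Lucene offers two storage locations, disk and memory, through the FSDirectory and RAMDirectory classes. The index is normally kept on disk.

Document
A Document is one unit to be indexed. Any file you want indexed must first be converted into a Document object.

Field
A named field of a Document.

IndexSearcher
The most basic search tool in Lucene. Every search goes through an IndexSearcher.

Query
A query. Lucene supports fuzzy queries, phrase queries, combined (boolean) queries, and more, through classes such as TermQuery, BooleanQuery, RangeQuery, and WildcardQuery. RangeQuery belongs to old releases; from Lucene 3.0 on, range searches use TermRangeQuery or point range queries such as IntPoint.newRangeQuery.

QueryParser
A tool for parsing user input. It scans the input string and produces a Query object.

Hits
After a search completes, the results must be returned and shown to the user; only then has the search served its purpose. Old Lucene releases represented the result set with the Hits class. Hits was removed in Lucene 3.0; in the 7.1.0 release used below, a search returns a TopDocs object whose scoreDocs array holds the matches.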

Test cases

GitHub code

The code is on GitHub; import the spring-boot-lucene-demo project.

GitHub: spring-boot-lucene-demo

Adding dependencies

    <!-- Query parsing against the tokenized index -->
    <dependency>
      <groupId>org.apache.lucene</groupId>
      <artifactId>lucene-queryparser</artifactId>
      <version>7.1.0</version>
    </dependency>
    <!-- Highlighting -->
    <dependency>
      <groupId>org.apache.lucene</groupId>
      <artifactId>lucene-highlighter</artifactId>
      <version>7.1.0</version>
    </dependency>
    <!-- smartcn Chinese analyzer (SmartChineseAnalyzer); depends on Lucene and must match the Lucene version -->
    <dependency>
      <groupId>org.apache.lucene</groupId>
      <artifactId>lucene-analyzers-smartcn</artifactId>
      <version>7.1.0</version>
    </dependency>
    <!-- ik-analyzer Chinese analyzer -->
    <dependency>
      <groupId>cn.bestwu</groupId>
      <artifactId>ik-analyzers</artifactId>
      <version>5.1.0</version>
    </dependency>
    <!-- MMSeg4j analyzer -->
    <dependency>
      <groupId>com.chenlb.mmseg4j</groupId>
      <artifactId>mmseg4j-solr</artifactId>
      <version>2.4.0</version>
      <exclusions>
        <exclusion>
          <groupId>org.apache.solr</groupId>
          <artifactId>solr-core</artifactId>
        </exclusion>
      </exclusions>
    </dependency>

Configuring Lucene

    private Directory directory;
    private IndexReader indexReader;
    private IndexSearcher indexSearcher;

    @Before
    public void setUp() throws IOException {
      // Index location: the indexDir/ folder under the current directory
      directory = FSDirectory.open(Paths.get("indexDir/"));
      // Create the index reader. DirectoryReader.open throws IndexNotFoundException
      // if no index exists in the directory yet, so the index must already be there.
      indexReader = DirectoryReader.open(directory);
      // Create the searcher used to query the index
      indexSearcher = new IndexSearcher(indexReader);
    }

    @After
    public void tearDown() throws Exception {
      indexReader.close();
      directory.close();
    }

    /**
     * Runs the query and prints the number of hits and each matching document.
     *
     * @param query the query to run
     * @throws IOException if the index cannot be read
     */
    public void executeQuery(Query query) throws IOException {
      // Fetch at most the top 100 hits
      TopDocs topDocs = indexSearcher.search(query, 100);
      // Print the total number of matching documents
      System.out.println("Found " + topDocs.totalHits + " document(s)");
      for (ScoreDoc scoreDoc : topDocs.scoreDocs) {
        // Load the stored fields of the matching document
        Document document = indexSearcher.doc(scoreDoc.doc);
        System.out.println("id: " + document.get("id"));
        System.out.println("title: " + document.get("title"));
        System.out.println("content: " + document.get("content"));
      }
    }

    /**
     * Prints the tokens an analyzer produces for a piece of text.
     *
     * @param analyzer the analyzer to test (closed when the method returns)
     * @param text the text to tokenize
     * @throws IOException if tokenization fails
     */
    public void printAnalyzerDoc(Analyzer analyzer, String text) throws IOException {
      TokenStream tokenStream = analyzer.tokenStream("content", new StringReader(text));
      CharTermAttribute charTermAttribute = tokenStream.addAttribute(CharTermAttribute.class);
      try {
        // reset() must be called before the first incrementToken()
        tokenStream.reset();
        while (tokenStream.incrementToken()) {
          System.out.println(charTermAttribute.toString());
        }
        tokenStream.end();
      } finally {
        tokenStream.close();
        analyzer.close();
      }
    }

Creating the index

    @Test
    public void indexWriterTest() throws IOException {
      long start = System.currentTimeMillis();
      // Index location: the indexDir/ folder under the current directory
      Directory directory = FSDirectory.open(Paths.get("indexDir/"));
      // In versions above 6.6 the Version argument is no longer required, and the
      // analyzers have no-arg constructors, so the default StandardAnalyzer can be
      // used directly. The constant below is unused and shown only for reference.
      Version version = Version.LUCENE_7_1_0;
      //Analyzer analyzer = new StandardAnalyzer(); // standard analyzer, suited to English
      //Analyzer analyzer = new SmartChineseAnalyzer(); // Chinese analyzer (smartcn)
      //Analyzer analyzer = new ComplexAnalyzer(); // Chinese analyzer (MMSeg4j)
      Analyzer analyzer = new IKAnalyzer(); // Chinese analyzer (IK)
      // Create the index writer configuration
      IndexWriterConfig indexWriterConfig = new IndexWriterConfig(analyzer);
      // Create the index writer
      IndexWriter indexWriter = new IndexWriter(directory, indexWriterConfig);
      // Create the Document that will be indexed
      Document doc = new Document();
      int id = 1;
      // Add fields to the document: IntPoint indexes the id for numeric queries,
      // StoredField (below) stores its value so it can be read back
      doc.add(new IntPoint("id", id));
      doc.add(new StringField("title", "Spark", Field.Store.YES));
      doc.add(new TextField("content", "Apache Spark 是專為大規模數據處理而設計的快速通用的計算引擎", Field.Store.YES));
      doc.add(new StoredField("id", id));
      // Add the document to the index
      indexWriter.addDocument(doc);
      indexWriter.commit();
      // Close the writer
      indexWriter.close();
      long end = System.currentTimeMillis();
      System.out.println("Indexing took " + (end - start) + " ms");
    }

Output

    17:58:14.655 [main] DEBUG org.wltea.analyzer.dic.Dictionary - Loading extension dictionary: ext.dic
    17:58:14.660 [main] DEBUG org.wltea.analyzer.dic.Dictionary - Loading extension stopword dictionary: stopword.dic
    Indexing took 879 ms

Deleting documents

    @Test
    public void deleteDocumentsTest() throws IOException {
      //Analyzer analyzer = new StandardAnalyzer(); // standard analyzer, suited to English
      //Analyzer analyzer = new SmartChineseAnalyzer(); // Chinese analyzer (smartcn)
      //Analyzer analyzer = new ComplexAnalyzer(); // Chinese analyzer (MMSeg4j)
      Analyzer analyzer = new IKAnalyzer(); // Chinese analyzer (IK)
      // Create the index writer configuration
      IndexWriterConfig indexWriterConfig = new IndexWriterConfig(analyzer);
      // Create the index writer
      IndexWriter indexWriter = new IndexWriter(directory, indexWriterConfig);
      // Delete the documents whose title field contains the term "Spark".
      // In Lucene 7 the return value is the sequence number of the operation,
      // not the number of documents deleted.
      long seqNo = indexWriter.deleteDocuments(new Term("title", "Spark"));
      // IndexWriter also provides:
      // deleteDocuments(Query... queries): deletes the documents matching any of the queries
      // deleteDocuments(Term... terms): deletes the documents containing any of the terms
      // deleteAll(): deletes every document in the index
      // Deletes are not applied immediately. IndexWriter buffers them, and they
      // take effect when commit() or close() is called.
      indexWriter.commit();
      indexWriter.close();
      System.out.println("Delete complete: " + seqNo);
    }

Output

Delete complete: 1

Updating documents

    /**
     * Tests updating.
     * An update is in effect a delete followed by an add.
     *
     * @throws IOException if the index cannot be written
     */
    @Test
    public void updateDocumentTest() throws IOException {
      //Analyzer analyzer = new StandardAnalyzer(); // standard analyzer, suited to English
      //Analyzer analyzer = new SmartChineseAnalyzer(); // Chinese analyzer (smartcn)
      //Analyzer analyzer = new ComplexAnalyzer(); // Chinese analyzer (MMSeg4j)
      Analyzer analyzer = new IKAnalyzer(); // Chinese analyzer (IK)
      // Create the index writer configuration
      IndexWriterConfig indexWriterConfig = new IndexWriterConfig(analyzer);
      // Create the index writer
      IndexWriter indexWriter = new IndexWriter(directory, indexWriterConfig);
      Document doc = new Document();
      int id = 1;
      doc.add(new IntPoint("id", id));
      doc.add(new StringField("title", "Spark", Field.Store.YES));
      doc.add(new TextField("content", "Apache Spark 是專為大規模數據處理而設計的快速通用的計算引擎", Field.Store.YES));
      doc.add(new StoredField("id", id));
      // The term must match an indexed term. "id" is indexed only as an IntPoint,
      // so new Term("id", "1") would match nothing and the update would add a
      // duplicate. Key on the untokenized title field instead.
      // The return value is the sequence number of the operation, not a count.
      long seqNo = indexWriter.updateDocument(new Term("title", "Spark"), doc);
      System.out.println("Updated document: " + seqNo);
      indexWriter.close();
    }

Output

Updated document: 1
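Two behaviors from the delete and update tests are easy to miss: deletes stay buffered until a commit, and an update is a delete by term followed by an add. The following is a conceptual model in plain Java, not Lucene code. `ToyIndexWriter` and all of its methods are hypothetical names that only loosely mirror the IndexWriter calls used above.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class ToyIndexWriter {
    // Last committed state: what a newly opened reader would see.
    private List<Map<String, String>> committed = new ArrayList<>();
    // Working state: receives adds and deletes until commit() is called.
    private final List<Map<String, String>> pending = new ArrayList<>();

    /** Convenience factory for a two-field document (hypothetical helper). */
    static Map<String, String> doc(String id, String title) {
        Map<String, String> d = new LinkedHashMap<>();
        d.put("id", id);
        d.put("title", title);
        return d;
    }

    void addDocument(Map<String, String> doc) {
        pending.add(doc);
    }

    /** Removes every pending document whose field equals the value exactly. */
    void deleteDocuments(String field, String value) {
        pending.removeIf(d -> value.equals(d.get(field)));
    }

    /** An update is a delete by term followed by an add. */
    void updateDocument(String field, String value, Map<String, String> doc) {
        deleteDocuments(field, value);
        addDocument(doc);
    }

    /** Publishes the working state so that readers can see it. */
    void commit() {
        committed = new ArrayList<>(pending);
    }

    /** Counts committed documents whose field equals the value exactly. */
    long visibleCount(String field, String value) {
        return committed.stream().filter(d -> value.equals(d.get(field))).count();
    }

    public static void main(String[] args) {
        ToyIndexWriter writer = new ToyIndexWriter();
        writer.addDocument(doc("1", "Spark"));
        writer.commit();

        writer.deleteDocuments("title", "Spark");
        System.out.println(writer.visibleCount("title", "Spark")); // prints 1: delete not committed yet
        writer.commit();
        System.out.println(writer.visibleCount("title", "Spark")); // prints 0

        // Updating twice with the same key leaves one document, not two.
        writer.updateDocument("title", "Spark", doc("1", "Spark"));
        writer.updateDocument("title", "Spark", doc("1", "Spark"));
        writer.commit();
        System.out.println(writer.visibleCount("title", "Spark")); // prints 1
    }
}
```

The model also shows why the key passed to an update matters: if the delete step matches nothing, the update degrades into a plain add and duplicates accumulate.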

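Searching by term

    /**
     * Search by term.
     * <p>
     * TermQuery is the simplest and most commonly used Query. It can be read as
     * "term search": the most basic search in a search engine is looking up a
     * single term in the index, and TermQuery does exactly that.
     * The term is Lucene's basic unit of search. In essence a term is a name/value
     * pair, where the name is the field name and the value is a keyword contained
     * in that field.
     *
     * @throws IOException if the index cannot be read
     */
    @Test
    public void termQueryTest() throws IOException {
      String searchField = "title";
      // Build a query matching documents whose title field contains the exact term "Spark"
      TermQuery query = new TermQuery(new Term(searchField, "Spark"));
      // Run the query and print the hits
      executeQuery(query);
    }

Output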