Full-Text Search: Highlighting with Chinese Word Segmentation in dotLucene (2)

Date: 2009-12-21 11:47  Source: unknown  Author: admin

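1. Source of the problem

Adding word segmentation improved the accuracy of the results, but users reported that results came back very slowly. The cause is that when Lucene highlights the matching keywords in each document, it runs the segmentation step again at run time, many times over. This hurts performance.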

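2. Solution

A new feature in Lucene 1.4.3 solves this problem. Term Vector now supports storing the Token.getPositionIncrement(), Token.startOffset(), and Token.endOffset() information. Once this token information is stored in the index, documents no longer have to be re-analyzed at run time just for highlighting. Whether the information is stored is controlled when the Field is created. Modify HighlighterTest.java as follows: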

// Store term position and offset information when adding a document.
// FIELD_NAME, reader, hits and query are fields of HighlighterTest.
private void addDoc(IndexWriter writer, String text) throws IOException
{
    Document d = new Document();
    // Old form: Field f = new Field(FIELD_NAME, text, true, true, true);
    Field f = new Field(FIELD_NAME, text,
        Field.Store.YES, Field.Index.TOKENIZED,
        Field.TermVector.WITH_POSITIONS_OFFSETS);
    d.add(f);
    writer.addDocument(d);
}

// Use the stored term position information to save highlighting time.
void doStandardHighlights() throws Exception
{
    Highlighter highlighter = new Highlighter(this, new QueryScorer(query));
    highlighter.setTextFragmenter(new SimpleFragmenter(20));
    for (int i = 0; i < hits.length(); i++)
    {
        String text = hits.doc(i).get(FIELD_NAME);
        int maxNumFragmentsRequired = 2;
        String fragmentSeparator = "...";
        TermPositionVector tpv = (TermPositionVector) reader.getTermFreqVector(hits.id(i), FIELD_NAME);
        // If no stop words are removed, this can be changed to
        // TokenSources.getTokenStream(tpv, true) for a further speedup.
        TokenStream tokenStream = TokenSources.getTokenStream(tpv);
        // Old form: analyzer.tokenStream(FIELD_NAME, new StringReader(text));
        String result =
            highlighter.getBestFragments(
                tokenStream,
                text,
                maxNumFragmentsRequired,
                fragmentSeparator);
        System.out.println("\t" + result);
    }
}
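The saving comes from tokenizing once at index time and reusing the stored offsets at search time. The following is a minimal sketch of that idea in plain Java, with no Lucene dependency. The names Span, analyzeOnce and highlight are hypothetical and are not Lucene APIs, and whitespace splitting stands in for a real Chinese tokenizer.

```java
import java.util.ArrayList;
import java.util.List;

public class OffsetHighlightSketch {
    /** Hypothetical stand-in for one stored term with its character offsets. */
    public static final class Span {
        final String term;
        final int start;
        final int end;
        Span(String term, int start, int end) {
            this.term = term;
            this.start = start;
            this.end = end;
        }
    }

    /** "Index time": tokenize once and keep the offsets (whitespace split, for brevity). */
    public static List<Span> analyzeOnce(String text) {
        List<Span> spans = new ArrayList<Span>();
        int i = 0;
        while (i < text.length()) {
            while (i < text.length() && Character.isWhitespace(text.charAt(i))) i++;
            int start = i;
            while (i < text.length() && !Character.isWhitespace(text.charAt(i))) i++;
            if (i > start) spans.add(new Span(text.substring(start, i), start, i));
        }
        return spans;
    }

    /** "Search time": mark matches using only the stored offsets; no tokenizer runs here. */
    public static String highlight(String text, List<Span> stored, String queryTerm) {
        StringBuilder out = new StringBuilder();
        int last = 0;
        for (Span s : stored) {
            if (s.term.equals(queryTerm)) {
                out.append(text, last, s.start)
                   .append("<B>").append(text, s.start, s.end).append("</B>");
                last = s.end;
            }
        }
        out.append(text, last, text.length());
        return out.toString();
    }

    public static void main(String[] args) {
        String text = "full text search with lucene";
        List<Span> stored = analyzeOnce(text);                 // done once, when indexing
        System.out.println(highlight(text, stored, "search")); // full text <B>search</B> with lucene
    }
}
```

In the real code, the term vector stored with WITH_POSITIONS_OFFSETS plays the role of the stored list, and TokenSources.getTokenStream(tpv) rebuilds a token stream from it.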

Finally, remove one extra check in the highlight package. Chinese text has no explicit word boundaries, so the following check is wrong for it:

tokenGroup.isDistinct(token)
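To see what such a check does with Chinese tokens, here is a small sketch in plain Java. It assumes the check compares a token's start offset with the end offset of the current group, so a token that starts at or after the group's end begins a new group. The Group class and countGroups are hypothetical and only model that assumption. They are not the highlight package's code.

```java
public class DistinctCheckSketch {
    /** Hypothetical group of overlapping tokens, tracked by character offsets. */
    static final class Group {
        int end = -1;
        int count = 0;

        /** Assumed rule: a token is distinct if it starts at or after the group's end. */
        boolean isDistinct(int tokenStart) {
            return tokenStart >= end;
        }

        void add(int tokenStart, int tokenEnd) {
            end = (count == 0) ? tokenEnd : Math.max(end, tokenEnd);
            count++;
        }
    }

    /** Counts the groups produced when a new group starts at every distinct token. */
    public static int countGroups(int[][] offsets) {
        Group g = new Group();
        int groups = 0;
        for (int[] o : offsets) {
            if (g.count > 0 && g.isDistinct(o[0])) {
                groups++;        // close the current group
                g = new Group(); // and start a new one
            }
            g.add(o[0], o[1]);
        }
        return g.count > 0 ? groups + 1 : groups;
    }

    public static void main(String[] args) {
        // "中文分词" segmented as two adjacent words: 中文 [0,2), 分词 [2,4)
        int[][] words = {{0, 2}, {2, 4}};
        // The same text as overlapping bigrams: 中文 [0,2), 文分 [1,3), 分词 [2,4)
        int[][] bigrams = {{0, 2}, {1, 3}, {2, 4}};
        System.out.println(countGroups(words));   // 2
        System.out.println(countGroups(bigrams)); // 1
    }
}
```

Under this assumption, adjacent Chinese words with no whitespace between them are each split into their own group, while overlapping bigrams collapse into one. The grouping therefore depends on the tokenizer rather than on any real word boundary in the text.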

With these changes, Chinese word segmentation no longer slows down queries.
