edu.columbia.cs.cg.prdualrank.index.analyzer
Class TokenBasedAnalyzer
java.lang.Object
org.apache.lucene.analysis.Analyzer
edu.columbia.cs.cg.prdualrank.index.analyzer.TokenBasedAnalyzer
- All Implemented Interfaces:
- java.io.Closeable
public class TokenBasedAnalyzer
- extends org.apache.lucene.analysis.Analyzer
This class requires the Apache Lucene Engine.
This class is used for our implementation of:
"Searching Patterns for Relation Extraction over the Web: Rediscovering the Pattern-Relation Duality". Y. Fang and K. C.-C. Chang. In WSDM, pages 825-834, 2011.
For further information, see the WSDM 2011 Conference Website.
Description
Using the TokenBasedReader, this class tokenizes a stream so that it can be either indexed or searched.
- Since:
- 2011-10-07
- Version:
- 0.1
- Author:
- Pablo Barrio, Goncalo Simoes
- See Also:
- Apache Lucene Engine,
WSDM 2011 Conference Website
Constructor Summary
TokenBasedAnalyzer(java.util.Set<java.lang.String> stopWords)
Instantiates a new token based analyzer.
Method Summary
java.io.Reader getReader(Span[] tokenizedSpans, java.lang.String[] tokenizedString)
Creates an instance of the reader used to create the TokenStream required by Lucene.
org.apache.lucene.analysis.TokenStream tokenStream(java.lang.String fieldName, java.io.Reader reader)
Methods inherited from class org.apache.lucene.analysis.Analyzer
close, getOffsetGap, getPositionIncrementGap, reusableTokenStream
Methods inherited from class java.lang.Object
equals, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
TokenBasedAnalyzer
public TokenBasedAnalyzer(java.util.Set<java.lang.String> stopWords)
- Instantiates a new token based analyzer.
- Parameters:
stopWords
- the stop words, which are not indexed and therefore cannot be searched.
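The effect of the stopWords parameter can be illustrated with a minimal sketch that uses only the JDK. It is not the code of TokenBasedAnalyzer: the class StopWordSketch and the helper removeStopWords are hypothetical names, and exact, case-sensitive matching is an assumption made for the illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class StopWordSketch {

    // Hypothetical helper: keeps only the tokens that are not in the stop-word set.
    // Exact, case-sensitive matching is an assumption of this sketch.
    public static List<String> removeStopWords(String[] tokens, Set<String> stopWords) {
        List<String> kept = new ArrayList<>();
        for (String token : tokens) {
            if (!stopWords.contains(token)) {
                kept.add(token);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        Set<String> stopWords = new HashSet<>(Arrays.asList("the", "of", "in"));
        String[] tokens = {"the", "capital", "of", "France"};
        // Prints [capital, France]
        System.out.println(removeStopWords(tokens, stopWords));
    }
}
```

A set of the same type, java.util.Set<java.lang.String>, is what the constructor above accepts.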
tokenStream
public org.apache.lucene.analysis.TokenStream tokenStream(java.lang.String fieldName,
java.io.Reader reader)
- Specified by:
tokenStream
in class org.apache.lucene.analysis.Analyzer
getReader
public java.io.Reader getReader(Span[] tokenizedSpans,
java.lang.String[] tokenizedString)
- Creates an instance of the reader used to create the TokenStream required by Lucene.
- Parameters:
tokenizedSpans
- the spans found in the stream to be indexed or searched.
tokenizedString
- the token values matching the spans in tokenizedSpans.
- Returns:
- the reader
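The relationship between the two getReader parameters can be sketched as two parallel arrays, where tokenizedString[i] is the text covered by tokenizedSpans[i]. The example below uses only the JDK and is an illustration under assumptions, not the library's code: the nested Span class is a hypothetical stand-in for the Span type in the signature (this page does not give its package), modeled as a half-open character range [start, end), and aligned is a hypothetical helper.

```java
public class SpanAlignmentSketch {

    // Hypothetical stand-in for the Span type in the getReader signature:
    // a half-open character range [start, end) into the original text.
    public static final class Span {
        public final int start;
        public final int end;

        public Span(int start, int end) {
            this.start = start;
            this.end = end;
        }
    }

    // Returns true when both arrays have the same length and every token
    // equals the substring of the text covered by the span at the same index.
    public static boolean aligned(String text, Span[] spans, String[] tokens) {
        if (spans.length != tokens.length) {
            return false;
        }
        for (int i = 0; i < spans.length; i++) {
            Span s = spans[i];
            if (s.start < 0 || s.start > s.end || s.end > text.length()) {
                return false;
            }
            if (!text.substring(s.start, s.end).equals(tokens[i])) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        String text = "Paris is in France";
        Span[] spans = {new Span(0, 5), new Span(6, 8), new Span(9, 11), new Span(12, 18)};
        String[] tokens = {"Paris", "is", "in", "France"};
        // Prints true
        System.out.println(aligned(text, spans, tokens));
    }
}
```

Checking this alignment before calling getReader is a design choice of the sketch; whether the reader itself validates its inputs is not stated on this page.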