edu.columbia.cs.cg.prdualrank.index.analyzer
Class TokenBasedAnalyzer

java.lang.Object
  extended by org.apache.lucene.analysis.Analyzer
      extended by edu.columbia.cs.cg.prdualrank.index.analyzer.TokenBasedAnalyzer
All Implemented Interfaces:
java.io.Closeable

public class TokenBasedAnalyzer
extends org.apache.lucene.analysis.Analyzer

This class requires the Apache Lucene engine.
This class is used for our implementation of: "Searching Patterns for Relation Extraction over the Web: Rediscovering the Pattern-Relation Duality". Y. Fang and K. C.-C. Chang. In WSDM, pages 825-834, 2011. For further information, see the WSDM 2011 Conference Website.

Description

Using the TokenBasedReader, this class tokenizes the stream so that it can be either indexed or searched.

Since:
2011-10-07
Version:
0.1
Author:
Pablo Barrio, Goncalo Simoes
See Also:
Apache Lucene Engine, WSDM 2011 Conference Website

Constructor Summary
TokenBasedAnalyzer(java.util.Set<java.lang.String> stopWords)
          Instantiates a new token-based analyzer.
 
Method Summary
 java.io.Reader getReader(Span[] tokenizedSpans, java.lang.String[] tokenizedString)
          Creates an instance of the reader used to create the TokenStream required by Lucene.
 org.apache.lucene.analysis.TokenStream tokenStream(java.lang.String fieldName, java.io.Reader reader)
           
 
Methods inherited from class org.apache.lucene.analysis.Analyzer
close, getOffsetGap, getPositionIncrementGap, reusableTokenStream
 
Methods inherited from class java.lang.Object
equals, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Constructor Detail

TokenBasedAnalyzer

public TokenBasedAnalyzer(java.util.Set<java.lang.String> stopWords)
Instantiates a new token-based analyzer.

Parameters:
stopWords - the stop words that will not be indexed and therefore cannot be searched.
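
The effect of the stopWords parameter can be illustrated with a minimal, standalone sketch. It does not use Lucene and is not the code TokenBasedAnalyzer runs internally; the class StopWordSketch and the helper removeStopWords are hypothetical names introduced only for illustration, and exact (case-sensitive) matching is an assumption.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class StopWordSketch {

    // Hypothetical helper: keeps only the tokens that are not stop words.
    // Assumption: matching is exact and case-sensitive; the real analyzer
    // may normalize tokens differently.
    static List<String> removeStopWords(String[] tokens, Set<String> stopWords) {
        List<String> kept = new ArrayList<>();
        for (String token : tokens) {
            if (!stopWords.contains(token)) {
                kept.add(token);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // The same kind of Set<String> would be passed to the constructor.
        Set<String> stopWords = new HashSet<>(Arrays.asList("the", "of", "in"));
        String[] tokens = {"the", "capital", "of", "France"};
        System.out.println(removeStopWords(tokens, stopWords)); // prints [capital, France]
    }
}
```

Tokens removed this way never reach the index, so a query containing only stop words would have nothing to match against.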
Method Detail

tokenStream

public org.apache.lucene.analysis.TokenStream tokenStream(java.lang.String fieldName,
                                                          java.io.Reader reader)
Specified by:
tokenStream in class org.apache.lucene.analysis.Analyzer

getReader

public java.io.Reader getReader(Span[] tokenizedSpans,
                                java.lang.String[] tokenizedString)
Creates an instance of the reader used to create the TokenStream required by Lucene.

Parameters:
tokenizedSpans - the spans found in the stream to be indexed or searched.
tokenizedString - the token values matching the spans in tokenizedSpans.
Returns:
the reader
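
The relationship between the two getReader arguments can be sketched as two parallel arrays, where the k-th token value is the text covered by the k-th span. The sketch below is standalone and makes several assumptions: the nested Span class is a hypothetical stand-in for the Span type in the signature (this page does not state its package), spans are taken to be half-open character ranges, and whitespace splitting is used only as a simple example tokenizer.

```java
import java.util.ArrayList;
import java.util.List;

public class SpanAlignmentSketch {

    // Hypothetical stand-in for the Span type in the getReader signature:
    // a half-open character range [start, end) in the original text.
    static final class Span {
        final int start;
        final int end;

        Span(int start, int end) {
            this.start = start;
            this.end = end;
        }
    }

    // Splits text on whitespace and records where each token begins and ends.
    static Span[] whitespaceSpans(String text) {
        List<Span> spans = new ArrayList<>();
        int i = 0;
        while (i < text.length()) {
            while (i < text.length() && Character.isWhitespace(text.charAt(i))) {
                i++;
            }
            int start = i;
            while (i < text.length() && !Character.isWhitespace(text.charAt(i))) {
                i++;
            }
            if (i > start) {
                spans.add(new Span(start, i));
            }
        }
        return spans.toArray(new Span[0]);
    }

    // Builds the token values so that tokens[k] is the text covered by spans[k].
    static String[] coveredText(String text, Span[] spans) {
        String[] tokens = new String[spans.length];
        for (int k = 0; k < spans.length; k++) {
            tokens[k] = text.substring(spans[k].start, spans[k].end);
        }
        return tokens;
    }

    public static void main(String[] args) {
        String text = "Paris is in France";
        Span[] tokenizedSpans = whitespaceSpans(text);
        String[] tokenizedString = coveredText(text, tokenizedSpans);
        for (int k = 0; k < tokenizedSpans.length; k++) {
            System.out.println(tokenizedSpans[k].start + "-" + tokenizedSpans[k].end
                    + " " + tokenizedString[k]);
        }
    }
}
```

Arrays prepared this way have the same length and the same order, which is the correspondence the parameter descriptions above call for.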