edu.columbia.cs.ref.model
Class TokenizedDocument

java.lang.Object
  extended by edu.columbia.cs.ref.model.Document
      extended by edu.columbia.cs.ref.model.TokenizedDocument
All Implemented Interfaces:
Matchable, Writable, java.io.Serializable

public class TokenizedDocument
extends Document

A Document that has gone through a tokenization process.

Like a Document, a TokenizedDocument is defined by its path, its file name, a list of Segments that represent its content, and the annotations of entities and relationships in it. In addition, a TokenizedDocument holds the information produced by the tokenization.

Since:
2011-09-27
Version:
0.1
Author:
Pablo Barrio, Goncalo Simoes
See Also:
Serialized Form

Constructor Summary
TokenizedDocument(Document d, Tokenizer tokenizer)
          Constructs a TokenizedDocument by tokenizing the given Document with the given Tokenizer.
 
Method Summary
 Span getEntitySpan(Entity entity)
          Returns the start and end indexes of the given entity in the tokenization.
 Span[] getTokenizedSpans()
          Returns an array of spans where each entry holds the start and end indexes of one token in the text.
 java.lang.String[] getTokenizedString()
          Returns an array of Strings where each entry is the value of one token of the text.
 
Methods inherited from class edu.columbia.cs.ref.model.Document
addEntity, addRelationship, equals, getEntities, getEntity, getFilename, getPath, getPlainText, getRelationship, getRelationships, getSubstring, getWritableValue, setFilename, setPath, setPlainText, toString
 
Methods inherited from class java.lang.Object
getClass, hashCode, notify, notifyAll, wait, wait, wait
 

Constructor Detail

TokenizedDocument

public TokenizedDocument(Document d,
                         Tokenizer tokenizer)
Constructs a TokenizedDocument by tokenizing the given Document with the given Tokenizer.

Parameters:
d - document without tokenization
tokenizer - tokenizer used to tokenize the document
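
As a rough illustration of what tokenizing a document's text can yield, the sketch below splits a string on whitespace and records, for each token, its value and its character offsets. It is not this library's implementation: WhitespaceTokenizationSketch and CharSpan are hypothetical stand-ins for the Tokenizer and Span types, whitespace splitting is only one possible tokenization, and the half-open offset convention (start inclusive, end exclusive) is an assumption.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in; not part of edu.columbia.cs.ref.model. */
final class WhitespaceTokenizationSketch {

    /** Character offsets of one token: start inclusive, end exclusive (assumed convention). */
    static final class CharSpan {
        final int start;
        final int end;

        CharSpan(int start, int end) {
            this.start = start;
            this.end = end;
        }
    }

    /** Splits text on whitespace and records where each token starts and ends. */
    static List<CharSpan> tokenSpans(String text) {
        List<CharSpan> spans = new ArrayList<>();
        int i = 0;
        while (i < text.length()) {
            // Skip whitespace between tokens.
            while (i < text.length() && Character.isWhitespace(text.charAt(i))) {
                i++;
            }
            int start = i;
            // Consume the token itself.
            while (i < text.length() && !Character.isWhitespace(text.charAt(i))) {
                i++;
            }
            if (i > start) {
                spans.add(new CharSpan(start, i));
            }
        }
        return spans;
    }

    /** Token values, parallel to tokenSpans(text). */
    static String[] tokenStrings(String text) {
        List<CharSpan> spans = tokenSpans(text);
        String[] tokens = new String[spans.size()];
        for (int k = 0; k < tokens.length; k++) {
            tokens[k] = text.substring(spans.get(k).start, spans.get(k).end);
        }
        return tokens;
    }

    public static void main(String[] args) {
        String text = "Barack Obama visited Paris";
        String[] tokens = tokenStrings(text);
        List<CharSpan> spans = tokenSpans(text);
        for (int k = 0; k < tokens.length; k++) {
            System.out.println(tokens[k] + " [" + spans.get(k).start + ", " + spans.get(k).end + ")");
        }
    }
}
```

The two results are kept parallel, so entry k of the token array always corresponds to entry k of the span list.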
Method Detail

getEntitySpan

public Span getEntitySpan(Entity entity)
Returns the start and end indexes of the given entity in the tokenization.

Parameters:
entity - entity whose indexes are to be found
Returns:
start and end indexes of the input entity
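
One way to picture this lookup is as a mapping from an entity's character range to the range of tokens that overlap it. The sketch below is a minimal illustration under stated assumptions, not the library's code: EntityTokenSpanSketch and tokenRange are hypothetical names, the entity is assumed to be given as a half-open character range, and the result is assumed to be a pair of inclusive token indexes.

```java
/** Hypothetical stand-in; not part of edu.columbia.cs.ref.model. */
final class EntityTokenSpanSketch {

    /**
     * Returns {firstToken, lastToken} (both inclusive) for the tokens that overlap
     * the character range [entityStart, entityEnd), or null if no token overlaps it.
     * tokenStarts and tokenEnds are parallel arrays of character offsets (end exclusive).
     */
    static int[] tokenRange(int[] tokenStarts, int[] tokenEnds, int entityStart, int entityEnd) {
        int first = -1;
        int last = -1;
        for (int k = 0; k < tokenStarts.length; k++) {
            // Two half-open ranges overlap when each starts before the other ends.
            boolean overlaps = tokenStarts[k] < entityEnd && tokenEnds[k] > entityStart;
            if (overlaps) {
                if (first < 0) {
                    first = k;
                }
                last = k;
            }
        }
        return first < 0 ? null : new int[] { first, last };
    }

    public static void main(String[] args) {
        // "Barack Obama visited Paris": tokens at [0,6) [7,12) [13,20) [21,26)
        int[] starts = { 0, 7, 13, 21 };
        int[] ends = { 6, 12, 20, 26 };
        int[] range = tokenRange(starts, ends, 0, 12); // entity "Barack Obama"
        System.out.println(range[0] + ".." + range[1]); // prints 0..1
    }
}
```

An overlap test is used rather than exact boundary matching so that an entity whose boundaries fall inside a token still maps to that token.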

getTokenizedString

public java.lang.String[] getTokenizedString()
Returns an array of Strings where each entry is the value of one token of the text.

Returns:
tokens of the text

getTokenizedSpans

public Span[] getTokenizedSpans()
Returns an array of spans where each entry holds the start and end indexes of one token in the text.

Returns:
indexes of the tokens of the text