public class TreebankFormatParser extends Object
| Modifier and Type | Field and Description |
|---|---|
| static String | cleanUPRegex1 |
| static String | cleanUPRegex2 |
| static String | cleanUPRegex3 |
| static String | cleanUPRegex4 |
| static String | LEAF_NODE_REGEX: Used to identify tokens in Penn Treebank labeled constituents. |
| static String | TYPE_REGEX: Used to identify the type of a constituent in a treebank parse tree. |
| Constructor and Description |
|---|
| TreebankFormatParser() |
| Modifier and Type | Method and Description |
|---|---|
| static TreebankNode | getLeafNode(String parseFragment): Uses the leafNodePattern to identify a string as a terminal. |
| static String | getType(String parseFragment): Returns the type of a constituent of some fragment of a treebank parse. |
| static String | inferPlainText(String treebankText): A treebank parse does not preserve whitespace information. |
| static int | movePastWhiteSpaceChars(String text, int textOffset) |
| static boolean | parensMatch(String contents) |
| static TopTreebankNode | parse(String parse): Creates TreebankNode objects corresponding to the given TreeBank format parse. |
| static TopTreebankNode | parse(String parse, String text, int textOffset): Creates TreebankNode objects corresponding to the given TreeBank format parse. |
| static List<TopTreebankNode> | parseDocument(String parse, int textOffset, String text): Parses an entire document's worth of treebanked sentences. |
| static String | prepareString(String parse): Cleans up the parse string for a sentence in the treebank syntax. |
| static String[] | splitSentences(String mrgContents): Generally speaking, we expect one treebanked sentence per line. |
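The summary for inferPlainText notes that a treebank parse does not preserve whitespace, so any recovered text is an approximation. The sketch below illustrates one way such text could be inferred: collect the token of every leaf and join the tokens with single spaces. The class name, the leaf pattern, the single-space joining, and the skipping of `-NONE-` (empty element) leaves are all assumptions made for illustration; this is not the library's implementation.

```java
import java.util.StringJoiner;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch, not part of TreebankFormatParser.
final class InferPlainTextSketch {
    // Assumed leaf shape "(TAG token )": a tag and a token with no nested parentheses.
    // This is NOT the library's LEAF_NODE_REGEX, whose value is not shown on this page.
    private static final Pattern LEAF =
            Pattern.compile("\\(\\s*([^\\s()]+)\\s+([^\\s()]+)\\s*\\)");

    /** Joins the tokens of all leaves with single spaces, skipping -NONE- leaves. */
    static String inferPlainText(String treebankText) {
        StringJoiner words = new StringJoiner(" ");
        Matcher m = LEAF.matcher(treebankText);
        while (m.find()) {
            // Assumption: empty elements tagged -NONE- do not appear in the text.
            if (!"-NONE-".equals(m.group(1))) {
                words.add(m.group(2));
            }
        }
        return words.toString();
    }

    public static void main(String[] args) {
        String parse = "( (S (NP-SBJ (JJ independent ) (NNS QTLs )) "
                + "(VP (VBP modulate ) (NP (NN volume )))) )";
        System.out.println(inferPlainText(parse));
    }
}
```

Because the original spacing is lost, text produced this way may differ from the source document, for example around punctuation.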
public static final String cleanUPRegex1
public static final String cleanUPRegex2
public static final String cleanUPRegex3
public static final String cleanUPRegex4
public static final String LEAF_NODE_REGEX
public static final String TYPE_REGEX
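The values of LEAF_NODE_REGEX and TYPE_REGEX are not shown on this page. As a rough illustration of what such patterns might look like, the sketch below uses two hypothetical regular expressions: one that accepts a fragment only if it is a single `(TAG token )` leaf, and one that reads the label following the opening parenthesis. The class, method, and pattern names are invented for this example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch; the patterns are assumptions, not the library's constants.
final class LeafNodeSketch {
    // A leaf is "(TAG token )": one tag, one token, no nested parentheses.
    static final Pattern LEAF =
            Pattern.compile("\\(\\s*([^\\s()]+)\\s+([^\\s()]+)\\s*\\)");
    // The type is the first run of non-space, non-parenthesis characters after "(".
    static final Pattern TYPE = Pattern.compile("^\\s*\\(\\s*([^\\s()]+)");

    /** Returns {tag, token} if the whole fragment is a leaf, otherwise null. */
    static String[] leaf(String fragment) {
        Matcher m = LEAF.matcher(fragment.trim());
        return m.matches() ? new String[] {m.group(1), m.group(2)} : null;
    }

    /** Returns the constituent label of the fragment, or null if none is found. */
    static String type(String fragment) {
        Matcher m = TYPE.matcher(fragment);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String[] leaf = leaf("(NN mouse )");
        System.out.println(leaf[0] + " / " + leaf[1]);
        System.out.println(type("(NP-SBJ (JJ independent ) (NNS QTLs ))"));
        System.out.println(leaf("(NP (DT the ) (NN mouse ))") == null);
    }
}
```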
public TreebankFormatParser()
public static TreebankNode getLeafNode(String parseFragment)
parseFragment - some fragment of a treebank parse.
public static String getType(String parseFragment)
parseFragment - some fragment of a treebank parse.
public static String inferPlainText(String treebankText)
treebankText - One or more parses in Treebank parenthesized format.
See also: parse(String, String, int)
public static int movePastWhiteSpaceChars(String text, int textOffset)
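No description is given for movePastWhiteSpaceChars. Judging only from its name and signature, it presumably advances an offset past any whitespace in the text; the sketch below shows that behavior under this assumption. The class name and the choice to return `text.length()` when only whitespace remains are hypothetical.

```java
// Hypothetical sketch of the behavior suggested by the method name.
final class WhitespaceSkipSketch {
    /**
     * Returns the index of the first non-whitespace character at or after
     * offset, or text.length() if only whitespace remains.
     */
    static int movePastWhitespace(String text, int offset) {
        int i = offset;
        while (i < text.length() && Character.isWhitespace(text.charAt(i))) {
            i++;
        }
        return i;
    }

    public static void main(String[] args) {
        // "Complex" occupies indices 0-6, followed by three spaces.
        System.out.println(movePastWhitespace("Complex   trait", 7)); // prints 10
    }
}
```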
public static boolean parensMatch(String contents)
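parensMatch is likewise undocumented here. A common way to check that parentheses match is to track nesting depth while scanning the string, as in the sketch below. Whether the library also rejects a closing parenthesis that appears before its opener is not stated, so that check is an assumption, as is the class name.

```java
// Hypothetical sketch of a balanced-parentheses check.
final class ParensMatchSketch {
    /** Returns true if every "(" has a later matching ")" and vice versa. */
    static boolean parensMatch(String contents) {
        int depth = 0;
        for (int i = 0; i < contents.length(); i++) {
            char c = contents.charAt(i);
            if (c == '(') {
                depth++;
            } else if (c == ')') {
                depth--;
                if (depth < 0) {
                    return false; // a ")" appeared before its "("
                }
            }
        }
        return depth == 0;
    }

    public static void main(String[] args) {
        System.out.println(parensMatch("(NP (DT the ) (NN mouse ))")); // prints true
        System.out.println(parensMatch("(NP (DT the ) (NN mouse )"));  // prints false
    }
}
```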
public static TopTreebankNode parse(String parse)
Creates TreebankNode objects corresponding to the given TreeBank format parse, e.g.:
( (X (NP (NP (NML (NN Complex ) (NN trait )) (NN analysis )) (PP (IN of ) (NP (DT the ) (NN mouse ) (NN striatum )))) (: : ) (S (NP-SBJ (JJ independent ) (NNS QTLs )) (VP (VBP modulate ) (NP (NP (NN volume )) (CC and ) (NP (NN neuron ) (NN number)))))) )
The text will be inferred automatically from the words in the parse.
parse - A TreeBank formatted parse
See also: inferPlainText(String), parse(String, String, int)
public static TopTreebankNode parse(String parse, String text, int textOffset)
Creates TreebankNode objects corresponding to the given TreeBank format parse, e.g.:
( (X (NP (NP (NML (NN Complex ) (NN trait )) (NN analysis )) (PP (IN of ) (NP (DT the ) (NN mouse ) (NN striatum )))) (: : ) (S (NP-SBJ (JJ independent ) (NNS QTLs )) (VP (VBP modulate ) (NP (NP (NN volume )) (CC and ) (NP (NN neuron ) (NN number)))))) )
The start and end offsets of each TreebankNode will be aligned to the word offsets in the given text.
parse - A TreeBank formatted parse
text - The text to which the parse should be aligned
textOffset - The character offset at which the parse text should start to be aligned. For example, if the words of the parse start right at the beginning of the text, the appropriate textOffset is 0.
See also: TopTreebankNode, TreebankNode
public static List<TopTreebankNode> parseDocument(String parse, int textOffset, String text)
parse - a single document provided as treebank parenthesized parses
textOffset - a value that corresponds to the character offset of the first character of the document. The appropriate value for this method will typically be 0.
text - a single document provided as plain text. If you do not have access to the original plain text of the document, you can generate some using inferPlainText(String).
public static String prepareString(String parse)
parse - a String in the treebank format
public static String[] splitSentences(String mrgContents)
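The summary for splitSentences says only that one treebanked sentence per line is generally expected. Since line breaks are therefore not fully reliable, one possible approach is to split wherever the parenthesis depth returns to zero. The sketch below shows that idea; it is an assumption about a workable technique, not a description of how the library splits sentences, and the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: split .mrg contents into top-level parenthesized parses.
final class SplitSentencesSketch {
    /** Returns each top-level "( ... )" group, ignoring text between groups. */
    static String[] split(String mrgContents) {
        List<String> sentences = new ArrayList<>();
        int depth = 0;
        int start = -1;
        for (int i = 0; i < mrgContents.length(); i++) {
            char c = mrgContents.charAt(i);
            if (c == '(') {
                if (depth == 0) {
                    start = i; // a new top-level parse begins here
                }
                depth++;
            } else if (c == ')' && depth > 0) {
                depth--;
                if (depth == 0) {
                    sentences.add(mrgContents.substring(start, i + 1));
                }
            }
        }
        return sentences.toArray(new String[0]);
    }

    public static void main(String[] args) {
        String doc = "( (S (NN a )) )\n( (S\n  (NN b )) )";
        for (String sentence : split(doc)) {
            System.out.println(sentence.replace('\n', ' '));
        }
    }
}
```

This handles a parse that spans several lines as well as several parses on one line, at the cost of silently dropping an unterminated trailing parse.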
Copyright © 2014. All rights reserved.