Package org.languagetool.dev.dumpcheck
Class WikipediaSentenceSource
java.lang.Object
org.languagetool.dev.dumpcheck.SentenceSource
org.languagetool.dev.dumpcheck.WikipediaSentenceSource
Provides access to the sentences of a Wikipedia XML dump. Note that
conversion exceptions are logged to STDERR and are otherwise ignored.
To get an XML dump, download
pages-articles.xml.bz2 from
http://download.wikimedia.org/backup-index.html, e.g.
http://download.wikimedia.org/dewiki/latest/dewiki-latest-pages-articles.xml.bz2.
Since:
2.4
-
Method Summary
Methods inherited from class org.languagetool.dev.dumpcheck.SentenceSource
acceptSentence, remove, toString
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, wait, wait, wait
Methods inherited from interface java.util.Iterator
forEachRemaining
-
Method Details
-
hasNext
public boolean hasNext()
Specified by:
hasNext in interface Iterator<Sentence>
Specified by:
hasNext in class SentenceSource
-
next
Description copied from class: SentenceSource
Return the next sentence. Sentences from the source are filtered by length to remove very short and very long sentences.
Specified by:
next in interface Iterator<Sentence>
Specified by:
next in class SentenceSource
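The length filtering described above can be illustrated with a minimal, self-contained sketch. It is not LanguageTool's implementation: the class name LengthFilteredSource, the use of plain String instead of Sentence, and the thresholds 10 and 300 are all assumptions chosen for illustration. It only shows how an Iterator can skip items that are too short or too long while still honoring the hasNext/next contract.

```java
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;

/**
 * Hypothetical stand-in for a sentence source: wraps any Iterator<String>
 * and skips items whose length falls outside [minLength, maxLength].
 */
class LengthFilteredSource implements Iterator<String> {

  private final Iterator<String> delegate;
  private final int minLength;  // assumed threshold, not LanguageTool's actual value
  private final int maxLength;  // assumed threshold, not LanguageTool's actual value
  private String buffered;      // next accepted sentence, or null if none is buffered

  LengthFilteredSource(Iterator<String> delegate, int minLength, int maxLength) {
    this.delegate = delegate;
    this.minLength = minLength;
    this.maxLength = maxLength;
  }

  @Override
  public boolean hasNext() {
    // Read ahead until an acceptable sentence is found or the input ends.
    while (buffered == null && delegate.hasNext()) {
      String candidate = delegate.next();
      if (candidate.length() >= minLength && candidate.length() <= maxLength) {
        buffered = candidate;
      }
    }
    return buffered != null;
  }

  @Override
  public String next() {
    if (!hasNext()) {
      throw new NoSuchElementException();
    }
    String result = buffered;
    buffered = null;  // hand out each accepted sentence exactly once
    return result;
  }

  public static void main(String[] args) {
    List<String> raw = List.of(
        "Hi.",                                       // too short, skipped
        "This is a sentence of reasonable length.",  // accepted
        "x".repeat(500));                            // too long, skipped
    Iterator<String> source = new LengthFilteredSource(raw.iterator(), 10, 300);
    while (source.hasNext()) {
      System.out.println(source.next());
    }
  }
}
```

Because hasNext reads ahead into a buffer, calling it repeatedly without calling next does not consume additional sentences.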
-
getSource
Specified by:
getSource in class SentenceSource
-