java.lang.Object
org.nasdanika.rag.core.PdfTextSplitter
Extracts text from a PDF document and splits it into chunks.
This class tries to keep paragraphs together and splits a paragraph into sentences only when it cannot be kept whole.
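The paragraph-first strategy described above can be illustrated with a small, self-contained sketch. It is not the PdfTextSplitter implementation: the class name ParagraphChunkerSketch, the regex-based sentence splitting, and the reading of tolerance as "a chunk may hold up to size + tolerance tokens" are all assumptions made for illustration, and chunk overlap is left out to keep the example short.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;

/** Illustrative sketch only; NOT the actual PdfTextSplitter logic. */
final class ParagraphChunkerSketch {

    /**
     * Packs paragraphs into chunks of roughly {@code size} tokens.
     * Assumption: a chunk may grow to size + tolerance tokens so that
     * a paragraph or sentence does not have to be cut.
     */
    static List<String> chunk(List<String> paragraphs, int size, int tolerance,
                              Function<String, List<String>> tokenizer) {
        int limit = size + tolerance;

        // Whole paragraphs where they fit, sentences otherwise.
        List<String> units = new ArrayList<>();
        for (String paragraph : paragraphs) {
            if (tokenizer.apply(paragraph).size() <= limit) {
                units.add(paragraph);
            } else {
                // Naive sentence split (assumption): break after '.', '!' or '?'.
                units.addAll(Arrays.asList(paragraph.split("(?<=[.!?])\\s+")));
            }
        }

        // Greedy packing: start a new chunk when the next unit would exceed the limit.
        List<String> chunks = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        int currentTokens = 0;
        for (String unit : units) {
            int unitTokens = tokenizer.apply(unit).size();
            if (currentTokens > 0 && currentTokens + unitTokens > limit) {
                chunks.add(current.toString());
                current.setLength(0);
                currentTokens = 0;
            }
            if (current.length() > 0) {
                current.append(' ');
            }
            current.append(unit);
            currentTokens += unitTokens;
        }
        if (current.length() > 0) {
            chunks.add(current.toString());
        }
        return chunks;
    }

    public static void main(String[] args) {
        // Whitespace tokenizer used purely for the demonstration.
        Function<String, List<String>> tokenizer =
                text -> Arrays.asList(text.trim().split("\\s+"));
        List<String> paragraphs = Arrays.asList(
                "One two three.",
                "Four five. Six seven eight nine. Ten.");
        chunk(paragraphs, 4, 1, tokenizer).forEach(System.out::println);
    }
}
```

In this sketch a single sentence longer than the limit still becomes its own oversized chunk; a fuller version would fall back to splitting such a sentence into words.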
-
Nested Class Summary
Nested Classes -
Constructor Summary
Constructors
Constructor  Description
PdfTextSplitter(int size, int overlap, int tolerance, Function<String, List<String>> tokenizer)
Method Summary
Modifier and Type  Method  Description
protected String  getLineSeparator
protected String  getParagraphSeparator
protected String  getWordSeparator
split(org.nasdanika.models.pdf.Document document)  Splits the document into chunks.
splitIntoSentences(String text)
splitIntoWords(String text)
-
Constructor Details
-
PdfTextSplitter
public PdfTextSplitter(int size, int overlap, int tolerance, Function<String, List<String>> tokenizer)
Parameters:
size - Chunk size in tokens.
overlap - Chunk overlap in tokens.
tolerance - Size tolerance that allows paragraphs and sentences to be kept together where possible.
tokenizer - Function that splits text into a list of tokens.
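As a rough illustration of the tokenizer argument, the sketch below defines a whitespace-based Function<String, List<String>>. The class name WhitespaceTokenizerSketch and the constant WHITESPACE are hypothetical, and splitting on whitespace is only an approximation: token counts for a real embedding or language model depend on that model's own tokenizer. The constructor call in the comment uses arbitrary example values and assumes the library is on the classpath.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.function.Function;

/** Hypothetical helper; not part of org.nasdanika.rag.core. */
final class WhitespaceTokenizerSketch {

    /** Splits text on runs of whitespace; blank input yields an empty list. */
    static final Function<String, List<String>> WHITESPACE = text -> {
        String trimmed = text.trim();
        return trimmed.isEmpty()
                ? Collections.emptyList()
                : Arrays.asList(trimmed.split("\\s+"));
    };

    public static void main(String[] args) {
        // Print the tokens produced for a short sample string.
        System.out.println(WHITESPACE.apply("Extracts text  from PDF"));

        // With the library available, the function could be passed to the
        // constructor documented above (the numbers are arbitrary examples):
        // PdfTextSplitter splitter = new PdfTextSplitter(500, 50, 25, WHITESPACE);
    }
}
```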
-
-
Method Details
-
splitIntoSentences
-
splitIntoWords
-
getWordSeparator
-
getLineSeparator
-
getParagraphSeparator
-
split
Splits the document into chunks.
Parameters:
document -
Returns:
-