public class TextExtractor extends AbstractTextExtractor
Copyright (c) 2020 xsx All Rights Reserved. x-easypdf-pdfbox is licensed under Mulan PSL v2. You can use this software according to the terms and conditions of the Mulan PSL v2. You may obtain a copy of Mulan PSL v2 at: http://license.coscl.org.cn/MulanPSL2 THIS SOFTWARE IS PROVIDED ON AN "AS IS" BASIS, WITHOUT WARRANTIES OF ANY KIND, EITHER EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO NON-INFRINGEMENT, MERCHANTABILITY OR FIT FOR A PARTICULAR PURPOSE. See the Mulan PSL v2 for more details.
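Inherited nested classes: AbstractTextExtractor.DefaultTextStripper, AbstractTextExtractor.Function<R>

Inherited fields: TABLE_PATTERN, log, document

| Constructor and Description |
|---|
| TextExtractor(Document document)<br>Constructor that takes the document as its argument. |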

| Modifier and Type | Method and Description |
|---|---|
| Map<Integer,List<String>> | extractByRegex(String regex, int... pageIndexes)<br>Extracts text by regular expression. |
| Map<Integer,Map<String,String>> | extractByRegionArea(String wordSeparator, Map<String,Rectangle> regionArea, int... pageIndexes)<br>Extracts text by region area. |
| Map<Integer,Map<String,List<List<String>>>> | extractByTable(String wordSeparator, Map<String,Rectangle> regionArea, int... pageIndexes)<br>Extracts text by table. |

Inherited methods: extractText, processTextByRegex, processTextByRegionArea, processTextByTable, getDocument

public TextExtractor(Document document)
Parameters:
document - the document

public Map<Integer,List<String>> extractByRegex(String regex, int... pageIndexes)
extractByRegex in class AbstractTextExtractor
Parameters:
regex - the regular expression
pageIndexes - the page indexes
Returns:
key = page index, value = extracted text
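
The shape of the value returned by extractByRegex can be illustrated without a PDF. The following is a minimal sketch and not the library's implementation: the class RegexExtractionSketch, its pageTexts parameter (a stand-in for text already extracted from each page), and the rule that omitting pageIndexes selects every page are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical stand-in: it does not call x-easypdf and does not read a PDF.
final class RegexExtractionSketch {

    // pageTexts is an assumed input: page index -> text already extracted from that page.
    static Map<Integer, List<String>> extractByRegex(Map<Integer, String> pageTexts,
                                                     String regex,
                                                     int... pageIndexes) {
        Pattern pattern = Pattern.compile(regex);
        Map<Integer, List<String>> result = new LinkedHashMap<>();
        for (int pageIndex : selectPages(pageTexts, pageIndexes)) {
            List<String> matches = new ArrayList<>();
            String text = pageTexts.get(pageIndex);
            if (text != null) {
                Matcher matcher = pattern.matcher(text);
                // Collect every non-overlapping match on this page.
                while (matcher.find()) {
                    matches.add(matcher.group());
                }
            }
            // key = page index, value = extracted text
            result.put(pageIndex, matches);
        }
        return result;
    }

    // Assumption of this sketch: no explicit indexes means "all pages".
    private static List<Integer> selectPages(Map<Integer, String> pageTexts, int... pageIndexes) {
        List<Integer> selected = new ArrayList<>();
        if (pageIndexes.length == 0) {
            selected.addAll(pageTexts.keySet());
        } else {
            for (int index : pageIndexes) {
                selected.add(index);
            }
        }
        return selected;
    }

    public static void main(String[] args) {
        Map<Integer, String> pages = new LinkedHashMap<>();
        pages.put(0, "Invoice INV-001 replaces INV-002");
        pages.put(1, "No invoice numbers here");
        System.out.println(extractByRegex(pages, "INV-\\d+"));
    }
}
```

A page with no match still appears in the result with an empty list in this sketch; whether the library does the same is not stated on this page.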

public Map<Integer,Map<String,String>> extractByRegionArea(String wordSeparator, Map<String,Rectangle> regionArea, int... pageIndexes)
extractByRegionArea in class AbstractTextExtractor
Parameters:
wordSeparator - the word separator
regionArea - the regions
pageIndexes - the page indexes
Returns:
key = page index, value = extracted text
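
To make the nested return type of extractByRegionArea concrete, here is a minimal sketch of region-based selection. It is not the library's implementation. The Word class, the pageWords input, and the use of java.awt.Rectangle for the Rectangle type are assumptions (this page does not say which Rectangle class the signature refers to), and page selection via pageIndexes is omitted for brevity.

```java
import java.awt.Rectangle;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.StringJoiner;

// Hypothetical stand-in: it does not call x-easypdf and does not read a PDF.
final class RegionExtractionSketch {

    // Assumed input type: a word together with the coordinates of its anchor point.
    static final class Word {
        final String text;
        final int x;
        final int y;

        Word(String text, int x, int y) {
            this.text = text;
            this.x = x;
            this.y = y;
        }
    }

    // Returns page index -> (region name -> words inside that region, joined by wordSeparator).
    static Map<Integer, Map<String, String>> extractByRegionArea(
            Map<Integer, List<Word>> pageWords,
            String wordSeparator,
            Map<String, Rectangle> regionArea) {
        Map<Integer, Map<String, String>> result = new LinkedHashMap<>();
        for (Map.Entry<Integer, List<Word>> page : pageWords.entrySet()) {
            Map<String, String> regions = new LinkedHashMap<>();
            for (Map.Entry<String, Rectangle> region : regionArea.entrySet()) {
                StringJoiner joiner = new StringJoiner(wordSeparator);
                for (Word word : page.getValue()) {
                    // Keep only words whose anchor point lies inside the rectangle.
                    if (region.getValue().contains(word.x, word.y)) {
                        joiner.add(word.text);
                    }
                }
                regions.put(region.getKey(), joiner.toString());
            }
            result.put(page.getKey(), regions);
        }
        return result;
    }
}
```

The keys of regionArea act as labels: each label reappears as a key of the inner map, so callers can look up extracted text by the same name they gave the region.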

public Map<Integer,Map<String,List<List<String>>>> extractByTable(String wordSeparator, Map<String,Rectangle> regionArea, int... pageIndexes)
extractByTable in class AbstractTextExtractor
Parameters:
wordSeparator - the word separator
regionArea - the regions
pageIndexes - the page indexes
Returns:
key = page index, value = extracted text
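
The innermost List<List<String>> returned by extractByTable reads naturally as rows of cells. The sketch below shows one way region text could be turned into that shape. It is not the library's implementation: the class TableExtractionSketch, the regionTextByPage input, and the CELL_SEPARATOR pattern are assumptions. The value of the inherited TABLE_PATTERN field is not shown on this page, so the pattern here is illustrative only.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Pattern;

// Hypothetical stand-in: it does not call x-easypdf and does not read a PDF.
final class TableExtractionSketch {

    // Assumed cell boundary: a tab, or a run of two or more spaces.
    private static final Pattern CELL_SEPARATOR = Pattern.compile("\\t| {2,}");

    // Splits region text into rows (one per non-blank line) and cells.
    static List<List<String>> toRows(String regionText) {
        List<List<String>> rows = new ArrayList<>();
        for (String line : regionText.split("\\R")) {
            String trimmed = line.trim();
            if (trimmed.isEmpty()) {
                continue; // skip blank lines between rows
            }
            rows.add(Arrays.asList(CELL_SEPARATOR.split(trimmed)));
        }
        return rows;
    }

    // Assumed input: page index -> (region name -> raw text of that region).
    static Map<Integer, Map<String, List<List<String>>>> extractByTable(
            Map<Integer, Map<String, String>> regionTextByPage) {
        Map<Integer, Map<String, List<List<String>>>> result = new LinkedHashMap<>();
        for (Map.Entry<Integer, Map<String, String>> page : regionTextByPage.entrySet()) {
            Map<String, List<List<String>>> tables = new LinkedHashMap<>();
            for (Map.Entry<String, String> region : page.getValue().entrySet()) {
                tables.put(region.getKey(), toRows(region.getValue()));
            }
            result.put(page.getKey(), tables);
        }
        return result;
    }
}
```

With this layout, a cell is addressed as result.get(pageIndex).get(regionName).get(row).get(column).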
Copyright © 2024. All rights reserved.