public abstract class PepperImporterImpl extends PepperModuleImpl implements PepperImporter
An importer in Pepper reads data from a format A and maps its data to a Salt
model. An importer must implement the interface PepperImporter and can
extend this class. We strongly recommend extending this class, since it
contains many helpful functions and methods that control the workflow.
PepperImporter

| Modifier and Type | Field and Description |
|---|---|
| protected CorpusDesc | corpusDesc - TODO make docu |
Fields inherited from class PepperModuleImpl:
isMultithreaded, logger, moduleController, resources, saltProject, sCorpusGraph, symbolicName, temproraries

Fields inherited from interface PepperImporter:
NEGATIVE_FILE_EXTENSION_MARKER

Fields inherited from interface PepperModule:
ENDING_ALL_FILES, ENDING_FOLDER, ENDING_LEAF_FOLDER, ENDING_TAB, ENDING_TXT, ENDING_XML

| Modifier | Constructor and Description |
|---|---|
| protected | PepperImporterImpl() - Creates a PepperModule of type MODULE_TYPE.IMPORTER. |
| protected | PepperImporterImpl(String name) - Creates a PepperModule of type MODULE_TYPE.IMPORTER and sets its name to the passed one. |
| Modifier and Type | Method and Description |
|---|---|
| FormatDesc | addSupportedFormat(String formatName, String formatVersion, org.eclipse.emf.common.util.URI formatReference) |
| CorpusDesc | getCorpusDesc() - TODO docu |
| Collection<String> | getCorpusEndings() - Returns a collection of all file endings for an SCorpus object. |
| Collection<String> | getDocumentEndings() - Returns a list containing all file endings of files that are importable and can be mapped to SDocument or SDocumentGraph objects by this Pepper module. |
| Map<org.corpus_tools.salt.graph.Identifier,org.eclipse.emf.common.util.URI> | getIdentifier2ResourceTable() - Stores Identifier objects corresponding to either an SDocument or an SCorpus object that has been created during the run of PepperImporter.importCorpusStructure(SCorpusGraph). |
| Collection<String> | getIgnoreEndings() - Returns a collection of filenames that are not to be imported. |
| List<FormatDesc> | getSupportedFormats() - Returns a list of formats that are importable by this PepperImporter object. |
| void | importCorpusStructure(org.corpus_tools.salt.common.SCorpusGraph corpusGraph) - This method is called by Pepper at the start of a conversion process to create the corpus-structure. |
| protected Boolean | importCorpusStructureRec(org.eclipse.emf.common.util.URI currURI, org.corpus_tools.salt.common.SCorpus parent) - Top-down traversal of the given file structure. |
| Double | isImportable(org.eclipse.emf.common.util.URI corpusPath) - This method is called by Pepper and returns whether a corpus located at the given URI is importable by this importer. |
| protected void | readXMLResource(DefaultHandler2 contentHandler, org.eclipse.emf.common.util.URI documentLocation) - Helper method to read an XML file with a DefaultHandler2 implementation given as contentHandler. |
| protected Collection<String> | sampleFileContent(org.eclipse.emf.common.util.URI corpusPath, String... fileEndings) - Returns lines of a sampled set of files having the ending specified by fileEndings recursively from the specified corpus path. |
| void | setCorpusDesc(CorpusDesc newCorpusDefinition) - TODO docu |
| void | setCorpusPathResolver(CorpusPathResolver corpusPathResolver) - Sets a CorpusPathResolver which is used by isImportable(URI). |
| org.corpus_tools.salt.SALT_TYPE | setTypeOfResource(org.eclipse.emf.common.util.URI resource) - This method is a callback and can be overridden by derived importers. |
| void | start() - Overrides the method PepperModuleImpl.start() to add the following, before PepperModuleImpl.start() is called. |
Methods inherited from class PepperModuleImpl:
activate, createPepperMapper, done, done, end, getComponentContext, getCorpusGraph, getDesc, getDocumentId2DC, getFingerprint, getMapperControllers, getMapperThreadGroup, getModuleController, getModuleType, getName, getProgress, getProgress, getProperties, getResources, getSaltProject, getSelfTestDesc, getStartProblems, getSupplierContact, getSupplierHomepage, getSymbolicName, getTemproraries, getVersion, isMultithreaded, isReadyToStart, proposeImportOrder, setCorpusGraph, setDesc, setIsMultithreaded, setMapperThreadGroup, setName, setPepperModuleController_basic, setPepperModuleController, setProperties, setResources, setSaltProject, setSupplierContact, setSupplierHomepage, setSymbolicName, setTemproraries, setVersion, start, toString, uncaughtException

Methods inherited from class java.lang.Object:
clone, equals, finalize, getClass, hashCode, notify, notifyAll, wait, wait, wait

Methods inherited from interface PepperModule:
createPepperMapper, done, done, end, getComponentContext, getCorpusGraph, getDesc, getFingerprint, getModuleController, getModuleType, getName, getProgress, getProgress, getProperties, getResources, getSaltProject, getSelfTestDesc, getStartProblems, getSupplierContact, getSupplierHomepage, getSymbolicName, getTemproraries, getVersion, isMultithreaded, isReadyToStart, proposeImportOrder, setCorpusGraph, setDesc, setIsMultithreaded, setPepperModuleController_basic, setPepperModuleController, setProperties, setResources, setSaltProject, setSupplierContact, setSupplierHomepage, setSymbolicName, setTemproraries, setVersion, start

protected CorpusDesc corpusDesc
protected PepperImporterImpl()

Creates a PepperModule of type MODULE_TYPE.IMPORTER. The
name is set to "MyImporter".
We recommend using the constructor PepperImporterImpl(String) and passing a proper
name.

protected PepperImporterImpl(String name)

Creates a PepperModule of type MODULE_TYPE.IMPORTER and
sets its name to the passed one.

public List<FormatDesc> getSupportedFormats()
Returns a list of formats that are importable by this PepperImporter object.

Specified by:
getSupportedFormats in interface PepperImporter

public FormatDesc addSupportedFormat(String formatName, String formatVersion, org.eclipse.emf.common.util.URI formatReference)

Specified by:
addSupportedFormat in interface PepperImporter

public CorpusDesc getCorpusDesc()

Specified by:
getCorpusDesc in interface PepperImporter

public void setCorpusDesc(CorpusDesc newCorpusDefinition)

Specified by:
setCorpusDesc in interface PepperImporter

public Map<org.corpus_tools.salt.graph.Identifier,org.eclipse.emf.common.util.URI> getIdentifier2ResourceTable()
Stores Identifier objects corresponding to either an
SDocument or an SCorpus object that has been created
during the run of PepperImporter.importCorpusStructure(SCorpusGraph).
For each Identifier object, this table stores the
resource from which the element shall be imported. For example:

| Identifier | Resource |
|---|---|
| corpus_1 | /home/me/corpora/myCorpus |
| corpus_2 | /home/me/corpora/myCorpus/subcorpus |
| doc_1 | /home/me/corpora/myCorpus/subcorpus/document1.xml |
| doc_2 | /home/me/corpora/myCorpus/subcorpus/document2.xml |
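
To illustrate how a table like the one above could come about, here is a minimal sketch that walks a folder top-down and records one entry per accepted resource. It uses only the Java standard library and is not the Pepper implementation: the class `ResourceTableSketch`, the enum `Kind` (a stand-in for SALT_TYPE), the method names, and the path-like identifier scheme are all hypothetical and chosen for illustration only.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical stand-in, NOT part of the Pepper API.
final class ResourceTableSketch {

    /** Stand-in for SALT_TYPE; a null result plays the role of "ignore this resource". */
    enum Kind { CORPUS, DOCUMENT }

    /** Assumed rule: folders are corpora, files with a listed ending are documents. */
    static Kind typeOf(Path resource, Set<String> documentEndings) {
        if (Files.isDirectory(resource)) {
            return Kind.CORPUS;
        }
        String name = resource.getFileName().toString();
        int dot = name.lastIndexOf('.');
        String ending = dot < 0 ? "" : name.substring(dot + 1);
        return documentEndings.contains(ending) ? Kind.DOCUMENT : null;
    }

    /** Builds an identifier-to-location table; root must have a file name (not "/"). */
    static Map<String, Path> buildTable(Path root, Set<String> documentEndings) throws IOException {
        Map<String, Path> table = new LinkedHashMap<>();
        visit(root, root.getFileName().toString(), documentEndings, table);
        return table;
    }

    /** Top-down traversal: register the current resource, then descend into folders. */
    private static void visit(Path current, String id, Set<String> endings,
                              Map<String, Path> table) throws IOException {
        Kind kind = typeOf(current, endings);
        if (kind == null) {
            return; // ignored resource: no entry in the table
        }
        table.put(id, current);
        if (kind == Kind.CORPUS) {
            List<Path> children = new ArrayList<>();
            try (DirectoryStream<Path> stream = Files.newDirectoryStream(current)) {
                for (Path child : stream) {
                    children.add(child);
                }
            }
            Collections.sort(children); // directory order is unspecified, so sort for stable output
            for (Path child : children) {
                visit(child, id + "/" + child.getFileName(), endings, table);
            }
        }
    }
}
```

Calling `ResourceTableSketch.buildTable(root, Collections.singleton("xml"))` on a folder laid out like the example would yield one entry for each folder and each `.xml` file, while files with other endings get no entry.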
Specified by:
getIdentifier2ResourceTable in interface PepperImporter

public void importCorpusStructure(org.corpus_tools.salt.common.SCorpusGraph corpusGraph)
throws PepperModuleException

This method is called by Pepper at the start of a conversion process to
create the corpus-structure. The corpus-structure consists of corpora
(represented via the Salt element SCorpus), documents
(represented via the Salt element SDocument) and links
between corpora, and between a corpus and a document (represented via the
Salt elements SCorpusRelation and SCorpusDocumentRelation).
Each corpus can contain 0..* subcorpora and 0..* documents, but
a corpus cannot contain both documents and corpora.
For each resource, PepperImporter.setTypeOfResource(URI) is called to set the type of the
resource. If the type is SALT_TYPE.SDOCUMENT, an
SDocument object is created for the resource; if the type is
SALT_TYPE.SCORPUS, an SCorpus object is created; if the
type is null, the resource is ignored.

Specified by:
importCorpusStructure in interface PepperImporter

Parameters:
corpusGraph - an empty graph given by Pepper, which shall contain the
corpus structure

Throws:
PepperModuleException

protected Boolean importCorpusStructureRec(org.eclipse.emf.common.util.URI currURI, org.corpus_tools.salt.common.SCorpus parent)
Top-down traversal of the given file structure. This method is called by
importCorpusStructure(SCorpusGraph) and creates the
corpus-structure via a top-down traversal of the file structure. For each
found file (real file or folder), the method
setTypeOfResource(URI) is called to set the type of the
resource. If the type is SALT_TYPE.SDOCUMENT, an
SDocument object is created for the resource; if the type is
SALT_TYPE.SCORPUS, an SCorpus object is created; if the
type is null, the resource is ignored.

Parameters:
currURI -
parentsID -
endings -

Throws:
IOException

public void start()
throws PepperModuleException

Overrides the method PepperModuleImpl.start() to add the
following, before PepperModuleImpl.start() is called.

Specified by:
start in interface PepperModule

Overrides:
start in class PepperModuleImpl

Throws:
PepperModuleException

public Collection<String> getDocumentEndings()
Returns a list containing all file endings of files that are
importable and can be mapped to SDocument or
SDocumentGraph objects by this Pepper module.

Specified by:
getDocumentEndings in interface PepperImporter

public Collection<String> getCorpusEndings()

Returns a collection of all file endings for an SCorpus object.
This list contains a default value. To remove the default value, call
Collection.remove(Object) on PepperImporter.getCorpusEndings(). To add
endings to the collection, call Collection#add(Ending) and to
remove endings from the collection, call
Collection#remove(Ending).

Specified by:
getCorpusEndings in interface PepperImporter

public org.corpus_tools.salt.SALT_TYPE setTypeOfResource(org.eclipse.emf.common.util.URI resource)
This method is a callback and can be overridden by derived importers. It is
called during the import of the corpus-structure (see
PepperImporter.importCorpusStructure(SCorpusGraph)). During the traversal of
the file structure, the method
PepperImporter.importCorpusStructure(SCorpusGraph) calls this method for each
resource to determine whether the resource represents an
SCorpus object, an SDocument object, or shall be ignored. The default behavior is:

- For each file whose ending is contained in PepperImporter.getDocumentEndings(), SALT_TYPE.SDOCUMENT is returned.
- For each file whose ending is contained in PepperImporter.getCorpusEndings(), SALT_TYPE.SCORPUS is returned.
- If PepperImporter.getDocumentEndings() contains PepperModule.ENDING_ALL_FILES, SALT_TYPE.SDOCUMENT is returned for each file (which is not a folder).
- If PepperImporter.getDocumentEndings() contains PepperModule.ENDING_LEAF_FOLDER, SALT_TYPE.SDOCUMENT is returned for each leaf folder.
- If PepperImporter.getCorpusEndings() contains PepperModule.ENDING_FOLDER, SALT_TYPE.SCORPUS is returned for each folder.

Specified by:
setTypeOfResource in interface PepperImporter

Parameters:
resource - URI resource to be specified

Returns:
SALT_TYPE.SCORPUS if resource represents an
SCorpus object, SALT_TYPE.SDOCUMENT if resource
represents an SDocument object, or null if it shall be
ignored.

public Collection<String> getIgnoreEndings()
Returns a collection of filenames that are not to be imported.
To add endings to the collection, call
Collection#add(Ending) and to remove endings from the collection,
call Collection#remove(Ending).

Specified by:
getIgnoreEndings in interface PepperImporter

protected void readXMLResource(DefaultHandler2 contentHandler, org.eclipse.emf.common.util.URI documentLocation)

Helper method to read an XML file with a DefaultHandler2
implementation given as contentHandler. It is assumed that the
file encoding is set to UTF-8.

Parameters:
contentHandler - DefaultHandler2 implementation
documentLocation - location of the XML file

public Double isImportable(org.eclipse.emf.common.util.URI corpusPath)
This method is called by Pepper and returns whether a corpus located at the
given URI is importable by this importer. If yes, 1 must be
returned; if no, 0 must be returned. If it is uncertain whether the given
corpus is importable by this importer, any value between 0 and 1 can be
returned. If this method is not overridden, null is returned.

Specified by:
isImportable in interface PepperImporter

public void setCorpusPathResolver(CorpusPathResolver corpusPathResolver)
Sets a CorpusPathResolver which is used by
isImportable(URI). With a CorpusPathResolver it is
possible to share the lines read from files between multiple importers. Doing
this saves the time needed for retrieving the content of the corpus path and for
reading the first x lines of the files.

Parameters:
corpusPathResolver -

protected Collection<String> sampleFileContent(org.eclipse.emf.common.util.URI corpusPath, String... fileEndings)
Returns lines of a sampled set of files having the ending specified by
fileEndings recursively from the
specified corpus path.
This method only delegates to
IsImportableUtil#sampleFileContent(URI, int, int, String...). The
class IsImportableUtil also contains further helper methods, in
case this method is too imprecise.
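
The idea of sampling a few lines and turning them into an importability value between 0 and 1 can be sketched in plain Java. The sketch below is not IsImportableUtil and makes no claim about how Pepper samples files: the class `SamplingSketch`, the parameters `maxFiles` and `maxLines`, the deterministic "first files in sorted order" selection, and the marker-based `score` heuristic are all assumptions made for illustration.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Collection;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical stand-in, NOT part of the Pepper API.
final class SamplingSketch {

    /** Reads up to maxLines leading lines from up to maxFiles matching files below corpusPath. */
    static Collection<String> sample(Path corpusPath, int maxFiles, int maxLines,
                                     String... fileEndings) throws IOException {
        List<Path> files;
        try (Stream<Path> walk = Files.walk(corpusPath)) {
            files = walk.filter(Files::isRegularFile)
                        .filter(p -> hasEnding(p, fileEndings))
                        .sorted() // assumption: take the first files in sorted order
                        .limit(maxFiles)
                        .collect(Collectors.toList());
        }
        List<String> lines = new ArrayList<>();
        for (Path file : files) {
            // Assumes UTF-8 content, like readXMLResource does for XML files.
            try (BufferedReader reader = Files.newBufferedReader(file, StandardCharsets.UTF_8)) {
                String line;
                for (int i = 0; i < maxLines && (line = reader.readLine()) != null; i++) {
                    lines.add(line);
                }
            }
        }
        return lines;
    }

    /** If no endings are given, every file is considered. */
    private static boolean hasEnding(Path file, String... fileEndings) {
        if (fileEndings.length == 0) {
            return true;
        }
        String name = file.getFileName().toString();
        for (String ending : fileEndings) {
            if (name.endsWith("." + ending)) {
                return true;
            }
        }
        return false;
    }

    /** Share of sampled lines containing marker; null when there is nothing to judge. */
    static Double score(Collection<String> sampledLines, String marker) {
        if (sampledLines.isEmpty()) {
            return null;
        }
        long hits = sampledLines.stream().filter(l -> l.contains(marker)).count();
        return (double) hits / sampledLines.size();
    }
}
```

An overriding isImportable(URI) could follow the same pattern: sample a few lines, look for something characteristic of the format, and return 1, 0, or a value in between depending on how many sampled lines match.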
Parameters:
corpusPath - directory to be searched in
fileEndings - endings to be considered. If no endings are specified, all files
are considered

Returns:
numberOfLines lines of numberOfSampledFiles files

Copyright © 2009–2019 Humboldt-Universität zu Berlin, INRIA. All rights reserved.