public interface PepperImporter extends PepperModule
A mapping task in the Pepper workflow is not a monolithic block. It consists of several smaller steps.

    public MyModule() {
        super("Name of the module");
        setSupplierContact(URI.createURI("Contact address of the module's supplier"));
        setSupplierHomepage(URI.createURI("homepage of the module"));
        setDesc("A short description of what is the intention of this module, for instance which formats are importable. ");
        this.addSupportedFormat("The name of a format which is importable e.g. txt",
                "The version corresponding to the format name", null);
    }

    public boolean isReadyToStart() {
        return (true);
    }

    public Double isImportable(URI corpusPath) {
        return null;
    }

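The dummy isImportable(URI) above makes no statement about the corpus. As a minimal sketch of how a value between 0 and 1 could be computed, the following hypothetical helper (ImportableEstimate and estimate are illustrative names, not part of the Pepper API) returns the share of file names that carry an expected ending. It works on plain file names so that it runs without Pepper. Returning null when there is nothing to inspect is an assumption.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper for illustration; not part of the Pepper API.
final class ImportableEstimate {

    /**
     * Returns the share of file names that carry the given ending:
     * 1.0 if all match, 0.0 if none match, a value in between otherwise.
     * Returns null if there are no files to judge by (assumption: null
     * is used here to make no statement at all).
     */
    static Double estimate(List<String> fileNames, String ending) {
        if (fileNames.isEmpty()) {
            return null;
        }
        long matching = fileNames.stream()
                .filter(name -> name.endsWith("." + ending))
                .count();
        return (double) matching / fileNames.size();
    }

    public static void main(String[] args) {
        List<String> files = Arrays.asList(
                "document1.xml", "document2.xml", "notes.txt", "readme.md");
        System.out.println(estimate(files, "xml")); // prints 0.5
    }
}
```

An importer could collect the file names below corpusPath and return such an estimate from isImportable(URI).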
PepperImporterImpl and
PepperExporterImpl provide an automatic mechanism to import or export
the corpus-structure. This mechanism can be adapted step by step to
your specific purpose. Since many formats do not encode the
corpus-structure but only the document-structure, the
corpus-structure is derived from the file structure of a corpus. Pepper's
default mapping maps the root folder to a root corpus (SCorpus
object). A sub-folder then corresponds to a sub-corpus (SCorpus
object). The relation between super- and sub-corpus is represented as a
SCorpusRelation object. Following the assumption that files contain
the document-structure, there is one SDocument corresponding to each
file in a sub-folder. The SCorpus and the SDocument objects
are linked with a SCorpusDocumentRelation. The correlation between each
object and its file or folder is stored in getIdentifier2ResourceTable().
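The default mapping from file structure to corpus-structure can be sketched without Pepper as follows. This is an illustration only: FolderToCorpusSketch and relations are hypothetical names, the relations are rendered as plain strings instead of Salt objects, and the input is assumed to be relative file paths that start with the root folder and use "/" as separator.

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical illustration; it does not create real Salt objects.
final class FolderToCorpusSketch {

    /**
     * Derives the relations of the corpus-structure from file paths:
     * folder -> sub-folder becomes a SCorpusRelation,
     * folder -> file becomes a SCorpusDocumentRelation.
     */
    static Set<String> relations(List<String> filePaths) {
        Set<String> result = new LinkedHashSet<>();
        for (String filePath : filePaths) {
            String[] parts = filePath.split("/");
            String parent = parts[0]; // the root folder is the root corpus
            for (int i = 1; i < parts.length; i++) {
                String child = parent + "/" + parts[i];
                // the last path segment is the file, everything before is a folder
                boolean isFile = (i == parts.length - 1);
                String type = isFile ? "SCorpusDocumentRelation" : "SCorpusRelation";
                result.add(type + ": " + parent + " -> " + child);
                parent = child;
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> files = Arrays.asList(
                "myCorpus/subcorpus/document1.xml",
                "myCorpus/subcorpus/document2.xml");
        relations(files).forEach(System.out::println);
    }
}
```

For the two files above, the sketch yields one SCorpusRelation (myCorpus to subcorpus) and two SCorpusDocumentRelation entries, one per file.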
this.getDocumentEndings().add("file ending");
You can also add the value PepperModule.ENDING_LEAF_FOLDER to import
leaf folders instead of files as SDocument objects. Another possibility
is to add the value PepperModule.ENDING_ALL_FILES to import all files
regardless of their ending.
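How the registered endings decide the type of a resource can be sketched as follows. This is a minimal sketch of the rules listed for setTypeOfResource(URI), not Pepper's actual implementation: ResourceTypeSketch, Type and typeOf are hypothetical names, and the three marker strings are placeholders because the real values of the PepperModule.ENDING_* constants are not given here. Checking the rules in the order in which they are listed is also an assumption.

```java
import java.util.Set;

// Hypothetical illustration; not Pepper's implementation.
final class ResourceTypeSketch {

    enum Type { SCORPUS, SDOCUMENT }

    // Placeholder markers standing in for PepperModule.ENDING_ALL_FILES,
    // ENDING_LEAF_FOLDER and ENDING_FOLDER (real values not assumed).
    static final String ALL_FILES = "<all-files>";
    static final String LEAF_FOLDER = "<leaf-folder>";
    static final String FOLDER = "<folder>";

    /** Returns the type of a resource, or null if it shall be ignored. */
    static Type typeOf(String name, boolean isFolder, boolean isLeafFolder,
                       Set<String> documentEndings, Set<String> corpusEndings) {
        int dot = name.lastIndexOf('.');
        String ending = (dot >= 0) ? name.substring(dot + 1) : "";
        // 1. the ending is a registered document ending
        if (!ending.isEmpty() && documentEndings.contains(ending)) return Type.SDOCUMENT;
        // 2. the ending is a registered corpus ending
        if (!ending.isEmpty() && corpusEndings.contains(ending)) return Type.SCORPUS;
        // 3. all files are documents
        if (!isFolder && documentEndings.contains(ALL_FILES)) return Type.SDOCUMENT;
        // 4. leaf folders are documents
        if (isFolder && isLeafFolder && documentEndings.contains(LEAF_FOLDER)) return Type.SDOCUMENT;
        // 5. folders are corpora
        if (isFolder && corpusEndings.contains(FOLDER)) return Type.SCORPUS;
        return null; // ignore the resource
    }

    public static void main(String[] args) {
        Set<String> documentEndings = Set.of("xml");
        Set<String> corpusEndings = Set.of(FOLDER);
        System.out.println(typeOf("document1.xml", false, false, documentEndings, corpusEndings)); // prints SDOCUMENT
        System.out.println(typeOf("subcorpus", true, true, documentEndings, corpusEndings));       // prints SCORPUS
        System.out.println(typeOf("notes.txt", false, false, documentEndings, corpusEndings));     // prints null
    }
}
```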
In the method PepperModule.createPepperMapper(Identifier), a PepperMapper object needs
to be initialized and returned. The PepperMapper is the major part
doing the mapping. It provides the methods
PepperMapper.mapSCorpus() to handle the mapping of a single
SCorpus object and PepperMapper.mapSDocument() to handle a
single SDocument object. Both methods are invoked by the Pepper
framework. PepperMapper.getResourceURI() offers the
mapper the file or folder of the current SCorpus or SDocument
object. This field needs to be set in the
PepperModule.createPepperMapper(Identifier) method. The following snippet shows a
dummy of that method:

    public PepperMapper createPepperMapper(Identifier sElementId) {
        PepperMapper mapper = new PepperMapperImpl() {
            @Override
            public DOCUMENT_STATUS mapSCorpus() {
                // handling the mapping of a single corpus
                // accessing the current file or folder
                getResourceURI();
                // returning, that the corpus was mapped successfully
                return (DOCUMENT_STATUS.COMPLETED);
            }

            @Override
            public DOCUMENT_STATUS mapSDocument() {
                // handling the mapping of a single document
                // accessing the current file or folder
                getResourceURI();
                // returning, that the document was mapped successfully
                return (DOCUMENT_STATUS.COMPLETED);
            }
        };
        // pass current file or folder to mapper. When using
        // PepperImporter.importCorpusStructure or
        // PepperExporter.exportCorpusStructure, the mapping between file or
        // folder and SCorpus or SDocument is stored here
        mapper.setResourceURI(getIdentifier2ResourceTable().get(sElementId));
        return (mapper);
    }

    public void end() {
        super.end();
        // do some clean up like closing of streams etc.
    }

| Modifier and Type | Field and Description |
|---|---|
| static String | NEGATIVE_FILE_EXTENSION_MARKER: A character or character sequence to mark a file extension as not to be one of the imported ones. |

Fields inherited from interface PepperModule: ENDING_ALL_FILES, ENDING_FOLDER, ENDING_LEAF_FOLDER, ENDING_TAB, ENDING_TXT, ENDING_XML

| Modifier and Type | Method and Description |
|---|---|
| FormatDesc | addSupportedFormat(String formatName, String formatVersion, org.eclipse.emf.common.util.URI formatReference) |
| CorpusDesc | getCorpusDesc(): TODO docu |
| Collection<String> | getCorpusEndings(): Returns a collection of all file endings for a SCorpus object. |
| Collection<String> | getDocumentEndings(): Returns list containing all format endings for files, which are importable and could be mapped to SDocument or SDocumentGraph objects by this Pepper module. |
| Map<org.corpus_tools.salt.graph.Identifier,org.eclipse.emf.common.util.URI> | getIdentifier2ResourceTable(): Stores Identifier objects corresponding to either a SDocument or a SCorpus object, which has been created during the run of importCorpusStructure(SCorpusGraph). |
| Collection<String> | getIgnoreEndings(): Returns a collection of filenames, not to be imported. |
| List<FormatDesc> | getSupportedFormats(): Returns a list of formats, which are importable by this PepperImporter object. |
| void | importCorpusStructure(org.corpus_tools.salt.common.SCorpusGraph corpusGraph): This method is called by Pepper at the start of a conversion process to create the corpus-structure. |
| Double | isImportable(org.eclipse.emf.common.util.URI corpusPath): This method is called by Pepper and returns if a corpus located at the given URI is importable by this importer. |
| void | setCorpusDesc(CorpusDesc corpusDesc): TODO docu |
| org.corpus_tools.salt.SALT_TYPE | setTypeOfResource(org.eclipse.emf.common.util.URI resource): This method is a callback and can be overridden by derived importers. |

Methods inherited from interface PepperModule: createPepperMapper, done, done, end, getComponentContext, getCorpusGraph, getDesc, getFingerprint, getModuleController, getModuleType, getName, getProgress, getProgress, getProperties, getResources, getSaltProject, getSelfTestDesc, getStartProblems, getSupplierContact, getSupplierHomepage, getSymbolicName, getTemproraries, getVersion, isMultithreaded, isReadyToStart, proposeImportOrder, setCorpusGraph, setDesc, setIsMultithreaded, setPepperModuleController_basic, setPepperModuleController, setProperties, setResources, setSaltProject, setSupplierContact, setSupplierHomepage, setSymbolicName, setTemproraries, setVersion, start, start

static final String NEGATIVE_FILE_EXTENSION_MARKER
A character or character sequence to mark a file extension as not to be
one of the imported ones.

List<FormatDesc> getSupportedFormats()
Returns a list of formats, which are importable by this
PepperImporter object.

CorpusDesc getCorpusDesc()

void setCorpusDesc(CorpusDesc corpusDesc)

Map<org.corpus_tools.salt.graph.Identifier,org.eclipse.emf.common.util.URI> getIdentifier2ResourceTable()
Stores Identifier objects corresponding to either a
SDocument or a SCorpus object, which has been created
during the run of importCorpusStructure(SCorpusGraph).
Corresponding to the Identifier object, this table stores the
resource from where the element shall be imported. For instance:

| Identifier | Resource |
|---|---|
| corpus_1 | /home/me/corpora/myCorpus |
| corpus_2 | /home/me/corpora/myCorpus/subcorpus |
| doc_1 | /home/me/corpora/myCorpus/subcorpus/document1.xml |
| doc_2 | /home/me/corpora/myCorpus/subcorpus/document2.xml |

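The example table can be pictured as a plain map from identifier to resource. The following sketch is only an illustration: it uses String keys and java.net.URI values as stand-ins for Salt's Identifier and the EMF URI, and IdentifierTableSketch and exampleTable are hypothetical names.

```java
import java.net.URI;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical stand-in for the table returned by getIdentifier2ResourceTable().
final class IdentifierTableSketch {

    /** Builds the example table: identifier -> resource to import from. */
    static Map<String, URI> exampleTable() {
        Map<String, URI> table = new LinkedHashMap<>();
        table.put("corpus_1", URI.create("file:/home/me/corpora/myCorpus"));
        table.put("corpus_2", URI.create("file:/home/me/corpora/myCorpus/subcorpus"));
        table.put("doc_1", URI.create("file:/home/me/corpora/myCorpus/subcorpus/document1.xml"));
        table.put("doc_2", URI.create("file:/home/me/corpora/myCorpus/subcorpus/document2.xml"));
        return table;
    }

    public static void main(String[] args) {
        // look up the resource for a document, as a mapper would need it
        URI resource = exampleTable().get("doc_1");
        System.out.println(resource.getPath());
        // prints /home/me/corpora/myCorpus/subcorpus/document1.xml
    }
}
```

In createPepperMapper(Identifier), the same kind of lookup passes the file or folder of the current element to the mapper.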
Collection<String> getDocumentEndings()
Returns list containing all format endings for files, which are
importable and could be mapped to SDocument or
SDocumentGraph objects by this Pepper module.

Collection<String> getCorpusEndings()
Returns a collection of all file endings for a SCorpus object.
This collection contains a default value. To remove the default value, call
Collection.remove(Object) on getCorpusEndings(). To add
endings to the collection, call Collection.add(Object), and to
remove endings from the collection, call
Collection.remove(Object).

Collection<String> getIgnoreEndings()
Returns a collection of filenames, not to be imported. To add endings to
the collection, call Collection.add(Object), and to remove endings from
the collection, call Collection.remove(Object).

org.corpus_tools.salt.SALT_TYPE setTypeOfResource(org.eclipse.emf.common.util.URI resource)
This method is a callback and can be overridden by derived importers. It is
called during the import of the corpus-structure
(importCorpusStructure(SCorpusGraph)). During the traversal of
the file-structure the method
importCorpusStructure(SCorpusGraph) calls this method for each
resource, to determine if the resource either represents a
SCorpus, a SDocument object or shall be ignored. The following rules apply:

1. If the ending of the resource is contained in getDocumentEndings(), SALT_TYPE.SDOCUMENT is returned.
2. If the ending of the resource is contained in getCorpusEndings(), SALT_TYPE.SCORPUS is returned.
3. If getDocumentEndings() contains PepperModule.ENDING_ALL_FILES, for each file (which is not a folder) SALT_TYPE.SDOCUMENT is returned.
4. If getDocumentEndings() contains PepperModule.ENDING_LEAF_FOLDER, for each leaf folder SALT_TYPE.SDOCUMENT is returned.
5. If getCorpusEndings() contains PepperModule.ENDING_FOLDER, for each folder SALT_TYPE.SCORPUS is returned.

Parameters: resource - URI resource to be specified
Returns: SALT_TYPE.SCORPUS if resource represents a
SCorpus object, SALT_TYPE.SDOCUMENT if resource
represents a SDocument object, or null if it shall be
ignored.

void importCorpusStructure(org.corpus_tools.salt.common.SCorpusGraph corpusGraph)
throws PepperModuleException
This method is called by Pepper at the start of a conversion process to
create the corpus-structure. A corpus-structure consists of corpora
(represented via the Salt element SCorpus), documents
(represented via the Salt element SDocument) and links
between corpora and between a corpus and a document (represented via the
Salt elements SCorpusRelation and SCorpusDocumentRelation).
Each corpus can contain 0..* sub-corpora and 0..* documents, but
a corpus cannot contain both documents and corpora at the same time.
For each resource, setTypeOfResource(URI) is called to determine the type of the
resource. If the type is SALT_TYPE.SDOCUMENT, a
SDocument object is created for the resource; if the type is
SALT_TYPE.SCORPUS, a SCorpus object is created; if the
type is null, the resource is ignored.
Parameters: corpusGraph - an empty graph given by Pepper, which shall contain the
corpus structure
Throws: PepperModuleException

FormatDesc addSupportedFormat(String formatName, String formatVersion, org.eclipse.emf.common.util.URI formatReference)

Double isImportable(org.eclipse.emf.common.util.URI corpusPath)
This method is called by Pepper and returns whether a corpus located at the
given URI is importable by this importer. If yes, 1 must be
returned; if no, 0 must be returned. If it is not certain whether the given
corpus is importable by this importer, any value between 0 and 1 can be
returned. If this method is not overridden, null is returned.

Copyright © 2009–2019 Humboldt-Universität zu Berlin, INRIA. All rights reserved.