Package org.corpus_tools.pepper.modules
Interface PepperImporter
-
- All Superinterfaces:
PepperModule
- All Known Implementing Classes:
DoNothingImporter,PepperImporterImpl,SaltXMLImporter,TextImporter
public interface PepperImporter extends PepperModule
A mapping task in the Pepper workflow is not a monolithic block. It consists of several smaller steps.
- Declare the fingerprint of the module. This is part of the constructor.
- Check readiness of the module.
- Analyze whether the files in the passed corpus path are importable by this importer.
- Import the corpus structure.
- Import the document structure and create a mapper for each corpus and document.
- Clean up.
Declare the fingerprint
Initialize the module and set the module's name, its description, and the description of the data formats it can import. This is part of the constructor:
    public MyModule() {
        super("Name of the module");
        setSupplierContact(URI.createURI("Contact address of the module's supplier"));
        setSupplierHomepage(URI.createURI("homepage of the module"));
        setDesc("A short description of what is the intention of this module, for instance which formats are importable. ");
        this.addSupportedFormat("The name of a format which is importable e.g. txt", "The version corresponding to the format name", null);
    }
Check readiness of the module
This method is invoked by the Pepper framework before the mapping process is started. It must return true; otherwise, this Pepper module cannot be used in a Pepper workflow. At this point you can report to the user all problems that prevent the module from being used, for instance that a database connection could not be established.
    public boolean isReadyToStart() {
        return (true);
    }
Analyze data
Depending on the formats you want to support with your importer, the detection can be very different. In the simplest case, it is only necessary to search through the files at the given location (or to recursively traverse the directories, in case the location points to a directory) and to read their header section. Examples are XML formats such as PAULA (see: http://www.sfb632.uni-potsdam.de/en/paula.html) or TEI (see: http://www.tei-c.org/Guidelines/P5/). The method should return a value between 0 and 1, where 0 means not importable and 1 means definitely importable. If null is returned, Pepper interprets this as unknown and will never suggest this module to the user.
    public Double isImportable(URI corpusPath) {
        return null;
    }
Import corpus structure
The classes PepperImporterImpl and PepperExporterImpl provide an automatic mechanism to import or export the corpus-structure. This mechanism is adaptable step by step, according to your specific purpose. Since many formats do not care about the corpus-structure and only encode the document-structure, the corpus-structure is taken to correspond to the file structure of a corpus. Pepper's default mapping maps the root folder to a root corpus (SCorpus object). A sub-folder then corresponds to a sub-corpus (SCorpus object). The relation between super- and sub-corpus is represented as a SCorpusRelation object. Following the assumption that files contain the document-structure, there is one SDocument corresponding to each file in a sub-folder. The SCorpus and the SDocument objects are linked with a SCorpusDocumentRelation.
To keep the correspondence between the corpus-structure and the file structure, both the importer and the exporter make use of a map, which can be accessed via getIdentifier2ResourceTable().
To adapt the behavior, you can set the file endings in the constructor as follows:
    this.getDocumentEndings().add("file ending");
You can also add the value PepperModule.ENDING_LEAF_FOLDER to import not files but leaf folders as SDocument objects. Another possibility is to add the value PepperModule.ENDING_ALL_FILES to import all files no matter their ending.
Import the document structure
In the method PepperModule.createPepperMapper(Identifier) a PepperMapper object needs to be initialized and returned. The PepperMapper is the major part doing the mapping. It provides the methods PepperMapper.mapSCorpus() to handle the mapping of a single SCorpus object and PepperMapper.mapSDocument() to handle a single SDocument object. Both methods are invoked by the Pepper framework. So that PepperMapper.getResourceURI() can offer the mapper the file or folder of the current SCorpus or SDocument object, this field needs to be set in the PepperModule.createPepperMapper(Identifier) method. The following snippet shows a dummy of that method:
    public PepperMapper createPepperMapper(Identifier sElementId) {
        PepperMapper mapper = new PepperMapperImpl() {
            @Override
            public DOCUMENT_STATUS mapSCorpus() {
                // handling the mapping of a single corpus
                // accessing the current file or folder
                getResourceURI();
                // returning, that the corpus was mapped successfully
                return (DOCUMENT_STATUS.COMPLETED);
            }

            @Override
            public DOCUMENT_STATUS mapSDocument() {
                // handling the mapping of a single document
                // accessing the current file or folder
                getResourceURI();
                // returning, that the document was mapped successfully
                return (DOCUMENT_STATUS.COMPLETED);
            }
        };
        // pass current file or folder to mapper. When using
        // PepperImporter.importCorpusStructure or
        // PepperExporter.exportCorpusStructure, the mapping between file or
        // folder and SCorpus or SDocument was stored here
        mapper.setResourceURI(getIdentifier2ResourceTable().get(sElementId));
        return (mapper);
    }
Clean-up
Sometimes it might be necessary to clean up after the module has done its job. For instance, when writing an importer or an exporter, it might be necessary to close file streams, a database connection, etc. Therefore, after the processing is done, the Pepper framework calls the method shown in the following snippet:
    public void end() {
        super.end();
        // do some clean up like closing of streams etc.
    }
- Author:
- Florian Zipser
Field Summary
Fields
Modifier and Type | Field | Description
static String | NEGATIVE_FILE_EXTENSION_MARKER | A character or character sequence to mark a file extension as not to be one of the imported ones.
Fields inherited from interface org.corpus_tools.pepper.modules.PepperModule
ENDING_ALL_FILES, ENDING_FOLDER, ENDING_LEAF_FOLDER, ENDING_TAB, ENDING_TXT, ENDING_XML
Method Summary
Modifier and Type | Method | Description
FormatDesc | addSupportedFormat(String formatName, String formatVersion, org.eclipse.emf.common.util.URI formatReference) | {@inheritDoc PepperModuleDesc#addSupportedFormat(String, String, URI)}
CorpusDesc | getCorpusDesc() | TODO docu
Collection<String> | getCorpusEndings() | Returns a collection of all file endings for a SCorpus object.
Collection<String> | getDocumentEndings() | Returns list containing all format endings for files, which are importable and could be mapped to SDocument or SDocumentGraph objects by this Pepper module.
Map<org.corpus_tools.salt.graph.Identifier,org.eclipse.emf.common.util.URI> | getIdentifier2ResourceTable() | Stores Identifier objects corresponding to either a SDocument or a SCorpus object, which has been created during the run of importCorpusStructure(SCorpusGraph).
Collection<String> | getIgnoreEndings() | Returns a collection of filenames, not to be imported.
List<FormatDesc> | getSupportedFormats() | Returns a list of formats, which are importable by this PepperImporter object.
void | importCorpusStructure(org.corpus_tools.salt.common.SCorpusGraph corpusGraph) | This method is called by Pepper at the start of a conversion process to create the corpus-structure.
Double | isImportable(org.eclipse.emf.common.util.URI corpusPath) | This method is called by Pepper and returns if a corpus located at the given URI is importable by this importer.
void | setCorpusDesc(CorpusDesc corpusDesc) | TODO docu
org.corpus_tools.salt.SALT_TYPE | setTypeOfResource(org.eclipse.emf.common.util.URI resource) | This method is a callback and can be overridden by derived importers.
Methods inherited from interface org.corpus_tools.pepper.modules.PepperModule
createPepperMapper, done, done, end, getComponentContext, getCorpusGraph, getDesc, getFingerprint, getModuleController, getModuleType, getName, getProgress, getProgress, getProperties, getResources, getSaltProject, getSelfTestDesc, getStartProblems, getSupplierContact, getSupplierHomepage, getSymbolicName, getTemproraries, getVersion, isMultithreaded, isReadyToStart, proposeImportOrder, setCorpusGraph, setDesc, setIsMultithreaded, setPepperModuleController, setPepperModuleController_basic, setProperties, setResources, setSaltProject, setSupplierContact, setSupplierHomepage, setSymbolicName, setTemproraries, setVersion, start, start
Field Detail
-
NEGATIVE_FILE_EXTENSION_MARKER
static final String NEGATIVE_FILE_EXTENSION_MARKER
A character or character sequence to mark a file extension as not to be one of the imported ones.
- See Also:
- Constant Field Values
Method Detail
-
getSupportedFormats
List<FormatDesc> getSupportedFormats()
Returns a list of formats, which are importable by this PepperImporter object.
- Returns:
-
getCorpusDesc
CorpusDesc getCorpusDesc()
TODO docu
- Returns:
-
setCorpusDesc
void setCorpusDesc(CorpusDesc corpusDesc)
TODO docu
-
getIdentifier2ResourceTable
Map<org.corpus_tools.salt.graph.Identifier,org.eclipse.emf.common.util.URI> getIdentifier2ResourceTable()
Stores Identifier objects corresponding to either a SDocument or a SCorpus object, which has been created during the run of importCorpusStructure(SCorpusGraph). Corresponding to the Identifier object, this table stores the resource from where the element shall be imported.
For instance:
    corpus_1    /home/me/corpora/myCorpus
    corpus_2    /home/me/corpora/myCorpus/subcorpus
    doc_1       /home/me/corpora/myCorpus/subcorpus/document1.xml
    doc_2       /home/me/corpora/myCorpus/subcorpus/document2.xml
-
getDocumentEndings
Collection<String> getDocumentEndings()
Returns a list containing all format endings for files, which are importable and could be mapped to SDocument or SDocumentGraph objects by this Pepper module.
- Returns:
- a collection of endings
-
getCorpusEndings
Collection<String> getCorpusEndings()
Returns a collection of all file endings for a SCorpus object. See {@inheritDoc #sCorpusEndings}. By default, this list contains the value "FOLDER". To remove the default value, call Collection.remove(Object) on getCorpusEndings(). To add endings to the collection, call Collection#add(Ending), and to remove endings from the collection, call Collection#remove(Ending).
- Returns:
- a collection of endings
-
getIgnoreEndings
Collection<String> getIgnoreEndings()
Returns a collection of filenames, not to be imported. {@inheritDoc #importIgnoreList}. To add endings to the collection, call Collection#add(Ending), and to remove endings from the collection, call Collection#remove(Ending).
- Returns:
- a collection of endings to be ignored
-
setTypeOfResource
org.corpus_tools.salt.SALT_TYPE setTypeOfResource(org.eclipse.emf.common.util.URI resource)
This method is a callback and can be overridden by derived importers. It is called during the import of the corpus-structure (importCorpusStructure(SCorpusGraph)). During the traversal of the file structure, importCorpusStructure(SCorpusGraph) calls this method for each resource to determine whether the resource represents a SCorpus object, a SDocument object, or shall be ignored.
If this method is not overridden, the default behavior is:
- For each file having an ending, which is contained in getDocumentEndings(), SALT_TYPE.SDOCUMENT is returned
- For each file having an ending, which is contained in getCorpusEndings(), SALT_TYPE.SCORPUS is returned
- If getDocumentEndings() contains PepperModule.ENDING_ALL_FILES, for each file (which is not a folder) SALT_TYPE.SDOCUMENT is returned
- If getDocumentEndings() contains PepperModule.ENDING_LEAF_FOLDER, for each leaf folder SALT_TYPE.SDOCUMENT is returned
- If getCorpusEndings() contains PepperModule.ENDING_FOLDER, for each folder SALT_TYPE.SCORPUS is returned
- null otherwise
- Parameters:
resource - URI resource to be specified
- Returns:
SALT_TYPE.SCORPUS if resource represents a SCorpus object, SALT_TYPE.SDOCUMENT if resource represents a SDocument object, or null if it shall be ignored.
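The default decision rules for setTypeOfResource can be illustrated with a small stand-alone sketch. It is not Pepper code: the class DefaultResourceType, the local enum Type (standing in for SALT_TYPE), the placeholder constant values, the rule that an ending is the text after the last dot, and the order in which the rules are checked are all assumptions made for illustration.

```java
import java.util.Collection;

/** Hypothetical stand-in for the default type decision; not part of Pepper. */
final class DefaultResourceType {

    /** Local stand-in for org.corpus_tools.salt.SALT_TYPE. */
    enum Type { SCORPUS, SDOCUMENT }

    // Placeholder values (assumed); the real constants are defined in PepperModule.
    static final String ENDING_ALL_FILES = "ALL_FILES";
    static final String ENDING_LEAF_FOLDER = "LEAF_FOLDER";
    static final String ENDING_FOLDER = "FOLDER";

    private DefaultResourceType() {}

    /** Returns the text after the last dot, or null if there is none (assumed convention). */
    static String endingOf(String name) {
        int dot = name.lastIndexOf('.');
        return (dot < 0 || dot == name.length() - 1) ? null : name.substring(dot + 1);
    }

    /** Returns the type for a resource, or null if the resource shall be ignored. */
    static Type typeOf(String name, boolean isFolder, boolean isLeafFolder,
                       Collection<String> documentEndings, Collection<String> corpusEndings) {
        if (isFolder) {
            // A leaf folder can be imported as a document if explicitly requested.
            if (isLeafFolder && documentEndings.contains(ENDING_LEAF_FOLDER)) {
                return Type.SDOCUMENT;
            }
            // Any other folder becomes a corpus if folders are accepted as corpora.
            return corpusEndings.contains(ENDING_FOLDER) ? Type.SCORPUS : null;
        }
        String ending = endingOf(name);
        if (ending != null && documentEndings.contains(ending)) {
            return Type.SDOCUMENT;
        }
        if (ending != null && corpusEndings.contains(ending)) {
            return Type.SCORPUS;
        }
        // Fall back to importing every plain file if requested.
        return documentEndings.contains(ENDING_ALL_FILES) ? Type.SDOCUMENT : null;
    }
}
```

A derived importer that needs a different rule would override setTypeOfResource(URI) itself; the sketch only mirrors the listed defaults.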
-
importCorpusStructure
void importCorpusStructure(org.corpus_tools.salt.common.SCorpusGraph corpusGraph) throws PepperModuleException
This method is called by Pepper at the start of a conversion process to create the corpus-structure. A corpus-structure consists of corpora (represented via the Salt element SCorpus), documents (represented via the Salt element SDocument), and links between corpora and between a corpus and a document (represented via the Salt elements SCorpusRelation and SCorpusDocumentRelation). Each corpus can contain 0..* subcorpora and 0..* documents, but a corpus cannot contain both documents and subcorpora.
For many cases the creation of the corpus-structure can be done automatically, therefore, just adapt the two lists #gets
This method creates the corpus-structure via a top-down traversal of the file structure. For each found file (real file and folder), the method setTypeOfResource(URI) is called to set the type of the resource. If the type is SALT_TYPE.SDOCUMENT, a SDocument object is created for the resource; if the type is SALT_TYPE.SCORPUS, a SCorpus object is created; if the type is null, the resource is ignored.
- Parameters:
corpusGraph - an empty graph given by Pepper, which shall contain the corpus structure
- Throws:
PepperModuleException
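The top-down traversal can be pictured with the following minimal sketch. It does not use Pepper or Salt: the class CorpusStructureSketch, its in-memory Node type, and the string identifiers of the form corpus_N and doc_N (borrowed from the example table for getIdentifier2ResourceTable()) are hypothetical, and every folder is assumed to become a corpus and every file a document.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical illustration of a top-down traversal; not part of Pepper. */
final class CorpusStructureSketch {

    /** Minimal in-memory stand-in for a file or a folder. */
    static final class Node {
        final String name;
        final boolean folder;
        final List<Node> children = new ArrayList<>();

        Node(String name, boolean folder) {
            this.name = name;
            this.folder = folder;
        }

        /** Adds a child and returns this node to allow chaining. */
        Node add(Node child) {
            children.add(child);
            return this;
        }
    }

    private CorpusStructureSketch() {}

    /** Builds an identifier-to-path table in traversal order. */
    static Map<String, String> buildTable(Node root, String parentPath) {
        Map<String, String> table = new LinkedHashMap<>();
        // counters[0] counts corpora, counters[1] counts documents
        walk(root, parentPath, table, new int[2]);
        return table;
    }

    private static void walk(Node node, String parentPath,
                             Map<String, String> table, int[] counters) {
        String path = parentPath + "/" + node.name;
        if (node.folder) {
            // Folders are registered as corpora before their content is visited.
            table.put("corpus_" + (++counters[0]), path);
            for (Node child : node.children) {
                walk(child, path, table, counters);
            }
        } else {
            // Plain files are registered as documents.
            table.put("doc_" + (++counters[1]), path);
        }
    }
}
```

In a real importer the table maps Identifier objects to URI values and the type of each resource comes from setTypeOfResource(URI) rather than from a fixed folder/file rule.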
-
addSupportedFormat
FormatDesc addSupportedFormat(String formatName, String formatVersion, org.eclipse.emf.common.util.URI formatReference)
{@inheritDoc PepperModuleDesc#addSupportedFormat(String, String, URI)}
-
isImportable
Double isImportable(org.eclipse.emf.common.util.URI corpusPath)
This method is called by Pepper and returns whether a corpus located at the given URI is importable by this importer. If yes, 1 must be returned; if no, 0 must be returned. If it is not certain whether the given corpus is importable by this importer, any value between 0 and 1 can be returned. If this method is not overridden, null is returned.
- Returns:
- 1 if corpus is importable, 0 if corpus is not importable, 0 < X < 1 if no definitive answer is possible, null if method is not overridden
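One possible way to arrive at a value between 0 and 1 is to sample the header lines of the files under the corpus path and report the share that look like the expected format. The sketch below shows only that scoring step for an XML-based format. The class ImportableScore and its method score are hypothetical helpers, not part of Pepper, and checking for an XML declaration is an assumed, deliberately crude heuristic.

```java
import java.util.List;

/** Hypothetical scoring helper for an isImportable implementation; not part of Pepper. */
final class ImportableScore {

    private ImportableScore() {}

    /**
     * @param headerLines the first non-empty line of each sampled file
     * @return null if nothing was sampled (unknown), otherwise the share of
     *         lines that start with an XML declaration, between 0.0 and 1.0
     */
    static Double score(List<String> headerLines) {
        if (headerLines == null || headerLines.isEmpty()) {
            return null; // nothing to judge, so report "unknown"
        }
        int matches = 0;
        for (String line : headerLines) {
            if (line != null && line.trim().startsWith("<?xml")) {
                matches++;
            }
        }
        return (double) matches / headerLines.size();
    }
}
```

An importer could collect the header lines while walking the directory behind corpusPath and return the result of score from isImportable(URI).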