org.ow2.weblab.service.language
Class LanguageExtraction
java.lang.Object
org.ow2.weblab.service.language.LanguageExtraction
- All Implemented Interfaces:
- org.ow2.weblab.core.services.Analyser
public class LanguageExtraction
- extends java.lang.Object
- implements org.ow2.weblab.core.services.Analyser
This class is a WebLab Web service that identifies the language of a Text.
It wraps the NGramJ project (http://ngramj.sourceforge.net/) and uses its CNGram system, which can process character strings instead of raw
text files.
For each input text, the algorithm returns a score for every language profile previously learned (.ngp files). Each score is a double between 0 and
1: a score of 1 means the text is certainly written in that language, and a score of 0 means it is not. The scores sum
to 1.
The wrapper annotates every Text section of an input ComposedUnit (or the Text itself if the input is a Text). It fails on any other kind of input. For each
Text, it uses CNGram to determine which language profiles are the best candidates, then annotates them using the DC:language property.
It can be configured through a property file named ngram.properties, which supports 7 properties:
- minSingleValue: a double between 0 and 1. If the best language score is greater than this value, it is the only language annotated on a given
Text.
- minMultipleValue: a double between 0 and 1. Every language whose score is greater than this value is annotated on a given Text.
- maxNbValues: a positive integer. The number of languages annotated on a given Text cannot exceed this value.
- profilesFolderPath: a String representing a folder path. This folder contains the .ngp files to load instead of the 28 default CNGram
languages.
- addTopLevelAnnot: a boolean. It defines whether to annotate the whole document with the language extracted from the concatenation of
every Text content.
- addMediaUnitLevelAnnot: a boolean. It defines whether to annotate each Text section with the guessed language.
- isProducedByObject: a String that should be a valid URI. It is used as the object of every isProducedBy statement on the
annotations created by the service.
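The way the three numeric properties interact can be sketched as follows. This is a minimal illustration under assumptions, not the service's actual code: the class LanguageSelectionSketch and its select method are hypothetical names, and the order in which the thresholds are applied (best score first, then the multiple-value threshold, capped by maxNbValues) is one plausible reading of the descriptions above.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of the WebLab service: it only illustrates
// one plausible reading of minSingleValue, minMultipleValue and maxNbValues.
final class LanguageSelectionSketch {

    static List<String> select(Map<String, Double> scores, double minSingleValue,
                               double minMultipleValue, int maxNbValues) {
        // Rank candidate languages by descending score.
        List<Map.Entry<String, Double>> ranked = new ArrayList<>(scores.entrySet());
        ranked.sort(Map.Entry.<String, Double>comparingByValue().reversed());

        List<String> selected = new ArrayList<>();
        if (ranked.isEmpty() || maxNbValues < 1) {
            return selected;
        }
        // A best score greater than minSingleValue is annotated alone.
        if (ranked.get(0).getValue() > minSingleValue) {
            selected.add(ranked.get(0).getKey());
            return selected;
        }
        // Otherwise keep every score greater than minMultipleValue,
        // never exceeding maxNbValues languages.
        for (Map.Entry<String, Double> entry : ranked) {
            if (selected.size() >= maxNbValues) {
                break;
            }
            if (entry.getValue() > minMultipleValue) {
                selected.add(entry.getKey());
            }
        }
        return selected;
    }

    public static void main(String[] args) {
        Map<String, Double> scores = new HashMap<>();
        scores.put("en", 0.5);
        scores.put("fr", 0.3);
        scores.put("de", 0.2);
        // No score exceeds 0.75, so up to two languages above 0.15 are kept.
        System.out.println(select(scores, 0.75, 0.15, 2));
    }
}
```

With maxNbValues set to 1, this sketch always returns at most one language, so minMultipleValue only matters when maxNbValues is raised.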
All 7 properties are optional. Default values are:
- minSingleValue: '0.75'
- minMultipleValue: '0.15'
- maxNbValues: '1'
- profilesFolderPath: none. In this case, the default CNGram profile constructor is used, which loads the default profiles bundled in the NGramJ jar file. These 28
profiles are named using ISO 639-1 two-letter language codes, so the resulting DC:language annotations use that format. To use
another format, you have to provide a custom profiles folder (containing .ngp files).
- addTopLevelAnnot: false
- addMediaUnitLevelAnnot: true
- isProducedByObject: null. In this case, no isProducedBy annotation is created.
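Reading ngram.properties with these defaults could look like the sketch below. It is not the service's own init() code: the class NgramConfigSketch and its parse method are hypothetical. The default of true for addMediaUnitLevelAnnot is an assumption, taken from the second boolean default listed above.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

// Hypothetical reader, not the service's own init() implementation.
final class NgramConfigSketch {
    final double minSingleValue;
    final double minMultipleValue;
    final int maxNbValues;
    final String profilesFolderPath;      // null: use the profiles bundled with CNGram
    final boolean addTopLevelAnnot;
    final boolean addMediaUnitLevelAnnot; // default 'true' is an assumption
    final String isProducedByObject;      // null: no isProducedBy statement is created

    NgramConfigSketch(Properties p) {
        // Every property is optional; fall back to the documented defaults.
        minSingleValue = Double.parseDouble(p.getProperty("minSingleValue", "0.75"));
        minMultipleValue = Double.parseDouble(p.getProperty("minMultipleValue", "0.15"));
        maxNbValues = Integer.parseInt(p.getProperty("maxNbValues", "1"));
        profilesFolderPath = p.getProperty("profilesFolderPath");
        addTopLevelAnnot = Boolean.parseBoolean(p.getProperty("addTopLevelAnnot", "false"));
        addMediaUnitLevelAnnot = Boolean.parseBoolean(p.getProperty("addMediaUnitLevelAnnot", "true"));
        isProducedByObject = p.getProperty("isProducedByObject");
    }

    // Parses the text of a property file; a real service would read it from the classpath.
    static NgramConfigSketch parse(String text) {
        Properties p = new Properties();
        try {
            p.load(new StringReader(text));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return new NgramConfigSketch(p);
    }

    public static void main(String[] args) {
        NgramConfigSketch config = parse("maxNbValues=3\nminMultipleValue=0.2\n");
        System.out.println(config.maxNbValues + " " + config.minMultipleValue);
    }
}
```

Keeping the two path-like properties as null when absent mirrors the documented behaviour: no custom profiles folder and no isProducedBy annotation.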
- Author:
- EADS IPCC Team
- Date:
- 2009-11-05
Method Summary
void init()
Reads the property file to get the field values.
org.ow2.weblab.core.services.analyser.ProcessReturn process(org.ow2.weblab.core.services.analyser.ProcessArgs processArgs)
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
LanguageExtraction
public LanguageExtraction()
init
@PostConstruct
public void init()
throws LanguageExtractionException
- Reads the property file to get the field values.
- Throws:
LanguageExtractionException
process
public org.ow2.weblab.core.services.analyser.ProcessReturn process(org.ow2.weblab.core.services.analyser.ProcessArgs processArgs)
throws org.ow2.weblab.core.services.AccessDeniedException,
org.ow2.weblab.core.services.ContentNotAvailableException,
org.ow2.weblab.core.services.InsufficientResourcesException,
org.ow2.weblab.core.services.InvalidParameterException,
org.ow2.weblab.core.services.ServiceNotConfiguredException,
org.ow2.weblab.core.services.UnexpectedException,
org.ow2.weblab.core.services.UnsupportedRequestException
- Specified by:
process in interface org.ow2.weblab.core.services.Analyser
- Throws:
org.ow2.weblab.core.services.AccessDeniedException
org.ow2.weblab.core.services.ContentNotAvailableException
org.ow2.weblab.core.services.InsufficientResourcesException
org.ow2.weblab.core.services.InvalidParameterException
org.ow2.weblab.core.services.ServiceNotConfiguredException
org.ow2.weblab.core.services.UnexpectedException
org.ow2.weblab.core.services.UnsupportedRequestException
Copyright © 2004-2012. All Rights Reserved.