org.ow2.weblab.service.language
Class LanguageExtraction
java.lang.Object
org.ow2.weblab.service.language.LanguageExtraction
- All Implemented Interfaces:
- org.weblab_project.services.analyser.Analyser
public class LanguageExtraction
- extends java.lang.Object
- implements org.weblab_project.services.analyser.Analyser
This class is a WebLab Web service that identifies the language of a Text.
It wraps the NGramJ project (http://ngramj.sourceforge.net/) and uses the CNGram system, which can process character strings instead of raw text files.
For each input text, the algorithm returns a score for every language profile previously learned (.ngp files). Each score is a double between 0 and 1: 1 means that the text is certainly written in
that language, 0 means that it is not. The scores sum to 1.
The wrapper annotates every Text section of an input ComposedUnit (or the Text itself if the input is a Text). It fails if the input is anything else. For each Text it uses CNGram to determine which
language profiles are the best candidates, and annotates them using the DC:language property.
It can be configured using a property file named ngram.properties, which supports 6 properties:
- minSingleValue: A double between 0 and 1. If the best language score is greater than this value, that language is the only one annotated on a given Text.
- minMultipleValue: A double between 0 and 1. Every language whose score is greater than this value is annotated on a given Text.
- maxNbValues: A positive integer. The number of languages annotated on a given Text cannot exceed this value.
- profilesFolderPath: A String that represents a folder path. This folder contains the .ngp files that are loaded instead of the 28 default CNGram languages.
- addTopLevelAnnot: A boolean. It defines whether to annotate the whole document with the language extracted from the concatenation of every Text content.
- isProducedByObject: A String that should be a valid URI. It is used as the object of every isProducedBy statement on the annotations created by the service.
All 6 properties are optional. Default values are:
- minSingleValue: '0.75'
- minMultipleValue: '0.15'
- maxNbValues: '1'
- profilesFolderPath: when unset, the default CNGram profile constructor is used, which loads the default profiles bundled in the NGramJ jar file. These 28 profiles are named with ISO 639-1 two-letter
language codes, so the resulting DC:language annotations use that format. To use another format, you have to use a custom profiles folder (containing .ngp files).
- addTopLevelAnnot: false
- isProducedByObject: when unset, no isProducedBy annotation is created.
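The way the three threshold properties combine can be sketched as follows. This is only one plausible reading of the descriptions above, not the service's actual code: the class LanguageSelectionSketch and its select method are hypothetical names, and the input map stands in for the per-language scores returned by CNGram.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

/** Hypothetical helper for illustration; not part of the WebLab or NGramJ API. */
final class LanguageSelectionSketch {

    // Default values documented above.
    static final double MIN_SINGLE_VALUE = 0.75;
    static final double MIN_MULTIPLE_VALUE = 0.15;
    static final int MAX_NB_VALUES = 1;

    /**
     * Assumed interpretation: if the best score exceeds minSingle, keep only that
     * language; otherwise keep the languages whose score exceeds minMultiple,
     * best first, and never more than maxNb of them.
     */
    static List<String> select(Map<String, Double> scores,
                               double minSingle, double minMultiple, int maxNb) {
        // Rank the languages by decreasing score.
        List<Map.Entry<String, Double>> ranked = new ArrayList<>(scores.entrySet());
        ranked.sort(Map.Entry.<String, Double>comparingByValue().reversed());

        List<String> selected = new ArrayList<>();
        if (ranked.isEmpty() || maxNb < 1) {
            return selected;
        }
        // A sufficiently confident best guess is annotated alone.
        if (ranked.get(0).getValue() > minSingle) {
            selected.add(ranked.get(0).getKey());
            return selected;
        }
        // Otherwise collect every candidate above minMultiple, capped at maxNb.
        for (Map.Entry<String, Double> entry : ranked) {
            if (selected.size() >= maxNb || entry.getValue() <= minMultiple) {
                break;
            }
            selected.add(entry.getKey());
        }
        return selected;
    }
}
```

Under this reading, scores of en=0.5, fr=0.3, de=0.2 with maxNbValues set to 2 would yield en and fr, while the default maxNbValues of 1 would yield en only.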
- Author:
- EADS IPCC Team
- Date:
- 2009-11-05
Method Summary
void init()
Reads the property file to get the field values.
org.weblab_project.services.analyser.types.ProcessReturn process(org.weblab_project.services.analyser.types.ProcessArgs processArgs)
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
LanguageExtraction
public LanguageExtraction()
init
@PostConstruct
public void init()
throws LanguageExtractionException
- Reads the property file to get the field values.
- Throws:
LanguageExtractionException
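As an illustration of what reading such a property file involves, here is a minimal sketch based on java.util.Properties that applies the documented defaults when a key is absent. The class NgramConfigSketch and its members are hypothetical and are not the fields of LanguageExtraction; how the service locates ngram.properties and when it raises LanguageExtractionException are not described on this page, so the sketch simply parses the file content from a String.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

/** Hypothetical holder for the six documented properties; for illustration only. */
final class NgramConfigSketch {

    final double minSingleValue;
    final double minMultipleValue;
    final int maxNbValues;
    final String profilesFolderPath;  // null: use the default CNGram profiles
    final boolean addTopLevelAnnot;
    final String isProducedByObject;  // null: no isProducedBy annotation

    NgramConfigSketch(Properties props) {
        // The second argument of getProperty is the documented default value.
        minSingleValue = Double.parseDouble(props.getProperty("minSingleValue", "0.75"));
        minMultipleValue = Double.parseDouble(props.getProperty("minMultipleValue", "0.15"));
        maxNbValues = Integer.parseInt(props.getProperty("maxNbValues", "1"));
        profilesFolderPath = props.getProperty("profilesFolderPath");
        addTopLevelAnnot = Boolean.parseBoolean(props.getProperty("addTopLevelAnnot", "false"));
        isProducedByObject = props.getProperty("isProducedByObject");
    }

    /** Parses the textual content of a property file (assumed already read into memory). */
    static NgramConfigSketch parse(String fileContent) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(fileContent));
        } catch (IOException e) {
            // Not expected from an in-memory reader; rethrown unchecked for simplicity.
            throw new UncheckedIOException(e);
        }
        return new NgramConfigSketch(props);
    }
}
```

For example, parsing a file that only contains the lines minMultipleValue=0.2 and maxNbValues=3 would leave minSingleValue at 0.75 and addTopLevelAnnot at false.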
process
public org.weblab_project.services.analyser.types.ProcessReturn process(org.weblab_project.services.analyser.types.ProcessArgs processArgs)
throws org.weblab_project.services.analyser.ProcessException
- Specified by:
process in interface org.weblab_project.services.analyser.Analyser
- Throws:
org.weblab_project.services.analyser.ProcessException
Copyright © 2004-2010. All Rights Reserved.