Class LuceneCandidateRetrieval
- java.lang.Object
-
- de.julielab.gene.candidateretrieval.LuceneCandidateRetrieval
-
- All Implemented Interfaces:
CandidateRetrieval, de.julielab.geneexpbase.candidateretrieval.CandidateRetrieval, Closeable, AutoCloseable
public class LuceneCandidateRetrieval extends Object implements CandidateRetrieval
-
-
Field Summary
Fields
Modifier and Type    Field    Description
static org.slf4j.Logger candidateLog
static de.julielab.geneexpbase.candidateretrieval.QueryGenerator CONJUNCTION
static de.julielab.geneexpbase.candidateretrieval.QueryGenerator DISJUNCTION
static de.julielab.geneexpbase.candidateretrieval.QueryGenerator DISJUNCTION_MINUS_1
static de.julielab.geneexpbase.candidateretrieval.QueryGenerator DISJUNCTION_MINUS_2
static de.julielab.geneexpbase.candidateretrieval.QueryGenerator GENE_RECORDS_CNF
Conjunctive normal form query: every token must be found in at least one field.
static de.julielab.geneexpbase.candidateretrieval.QueryGenerator GENE_RECORDS_CNF_WITH_SYNONYMS
Just like GENE_RECORDS_CNF but with an additional BooleanClause.Occur.SHOULD clause only for synonym matches created via GeneRecordSynonymsQueryGenerator.
static de.julielab.geneexpbase.candidateretrieval.QueryGenerator GENE_RECORDS_DISMAX
static de.julielab.geneexpbase.candidateretrieval.QueryGenerator GENE_RECORDS_FLAT_DISJUNCTION
Puts all tokens on all fields in one large disjunction.
static de.julielab.geneexpbase.candidateretrieval.QueryGenerator GENE_RECORDS_SYNONYMS_APPROX
static de.julielab.geneexpbase.candidateretrieval.QueryGenerator GENE_RECORDS_SYNONYMS_EXACT
static int JAROWINKLER_SCORER
static int LEVENSHTEIN_SCORER
static String LOGGER_NAME_CANDIDATES
static int LUCENE_MAX_HITS
The maximum number of hits Lucene returns for a query.
static int LUCENE_SCORER
static int MAXENT_SCORER
static String MAXENT_SCORER_MODEL
Default model for MaxEntScorer.
static String NAME_PRIO_DELIMITER
static de.julielab.geneexpbase.candidateretrieval.QueryGenerator NGRAM_2_3
static int SIMPLE_SCORER
static boolean TEST_MODE
static int TFIDF
static int TOKEN_JAROWINKLER_SCORER
static Set<String> UNIT_TEST_GENE_ID_ACCUMULATION_SET
-
Constructor Summary
Constructors
Constructor    Description
LuceneCandidateRetrieval(Configuration config, ExecutorService executorService, de.julielab.geneexpbase.services.CacheService cacheService)
LuceneCandidateRetrieval(org.apache.lucene.search.IndexSearcher mentionIndexSearcher, de.julielab.geneexpbase.scoring.Scorer scorer)
Deprecated.
-
Method Summary
Modifier and Type    Method    Description
void close()
static AtomicLong getCacheHits()
static AtomicLong getCacheMisses()
List<de.julielab.geneexpbase.candidateretrieval.SynHit> getCandidates(de.julielab.geneexpbase.genemodel.GeneMention geneMention, de.julielab.geneexpbase.candidateretrieval.QueryGenerator queryGenerator)
List<de.julielab.geneexpbase.candidateretrieval.SynHit> getCandidates(de.julielab.geneexpbase.genemodel.GeneMention geneMention, String organism, de.julielab.geneexpbase.candidateretrieval.QueryGenerator queryGenerator)
List<de.julielab.geneexpbase.candidateretrieval.SynHit> getCandidates(de.julielab.geneexpbase.genemodel.GeneMention geneMention, Collection<String> organisms, de.julielab.geneexpbase.candidateretrieval.QueryGenerator queryGenerator)
List<de.julielab.geneexpbase.candidateretrieval.SynHit> getCandidates(de.julielab.geneexpbase.genemodel.GeneMention gm, Collection<String> taxId, de.julielab.geneexpbase.configuration.Parameters parameters, de.julielab.geneexpbase.candidateretrieval.QueryGenerator queryGenerator)
List<de.julielab.geneexpbase.candidateretrieval.SynHit> getCandidates(de.julielab.geneexpbase.genemodel.GeneMention geneMention, Collection<String> geneIdsFilter, Collection<String> organisms, boolean loadFields, de.julielab.geneexpbase.configuration.Parameters parameters, de.julielab.geneexpbase.candidateretrieval.QueryGenerator queryGenerator)
List<de.julielab.geneexpbase.candidateretrieval.SynHit> getCandidates(de.julielab.geneexpbase.genemodel.GeneMention geneMention, Collection<String> geneIdsFilter, Collection<String> organisms, de.julielab.geneexpbase.candidateretrieval.QueryGenerator queryGenerator)
List<de.julielab.geneexpbase.candidateretrieval.SynHit> getCandidates(String originalSearchTerm, de.julielab.geneexpbase.candidateretrieval.QueryGenerator queryGenerator)
List<de.julielab.geneexpbase.candidateretrieval.SynHit> getCandidates(String geneMentionText, String organism, de.julielab.geneexpbase.candidateretrieval.QueryGenerator queryGenerator)
List<de.julielab.geneexpbase.candidateretrieval.SynHit> getCandidates(String geneMentionText, Collection<String> organism, de.julielab.geneexpbase.candidateretrieval.QueryGenerator queryGenerator)
List<de.julielab.geneexpbase.candidateretrieval.SynHit> getCandidates(String geneMentionText, Collection<String> geneIdsFilter, Collection<String> organism, boolean loadFields, de.julielab.geneexpbase.candidateretrieval.QueryGenerator queryGenerator)
List<de.julielab.geneexpbase.candidateretrieval.SynHit> getCandidates(String geneMentionText, Collection<String> geneIdsFilter, Collection<String> organism, de.julielab.geneexpbase.candidateretrieval.QueryGenerator queryGenerator)
Configuration getConfiguration()
List<de.julielab.geneexpbase.candidateretrieval.SynHit> getFamilyNames(de.julielab.geneexpbase.genemodel.GeneMention gm, de.julielab.geneexpbase.candidateretrieval.QueryGenerator queryGenerator)
Searches the index for the given gene mention filtered for family names.
Set<GeneRecordHit> getGeneRecords(Collection<String> ids)
List<de.julielab.geneexpbase.candidateretrieval.SynHit> getIndexRecords(Collection<String> ids)
Returns GeneRecordHit instances with all fields loaded.
List<de.julielab.geneexpbase.candidateretrieval.SynHit> getIndexRecords(Collection<String> ids, de.julielab.geneexpbase.genemodel.GeneName geneName, Function<de.julielab.geneexpbase.genemodel.GeneName,String> geneNameFunc, org.apache.lucene.search.IndexSearcher indexSearcher)
List<de.julielab.geneexpbase.candidateretrieval.SynHit> getIndexRecords(Collection<String> ids, org.apache.lucene.search.IndexSearcher indexSearcher)
de.julielab.geneexpbase.TermNormalizer getNormalizer()
List<de.julielab.geneexpbase.candidateretrieval.SynHit> getOriginalNamesIndexRecords(Collection<String> ids)
List<de.julielab.geneexpbase.candidateretrieval.SynHit> getOriginalNamesIndexRecords(Collection<String> ids, de.julielab.geneexpbase.genemodel.GeneName geneName)
de.julielab.geneexpbase.scoring.Scorer getScorer()
String getScorerInfo()
int getScorerType()
org.apache.lucene.search.spell.SpellChecker getSpellingChecker()
de.julielab.geneexpbase.scoring.TFIDFScorer getTFIDFOnGeneRecordNames()
de.julielab.geneexpbase.scoring.TFIDFScorer getTFIDFOnGeneSynonyms()
static AtomicLong getTotalCacheGettime()
static AtomicLong getTotalCachePuttime()
static AtomicLong getTotalGeneRecordFieldLoadingTime()
static AtomicLong getTotalLuceneQueryTime()
String mapGeneIdToTaxId(String geneId)
List<de.julielab.geneexpbase.candidateretrieval.SynHit> scoreIdsByBoWSynonyms(Collection<String> allSynonyms, Set<String> ids, de.julielab.geneexpbase.candidateretrieval.QueryGenerator qg)
org.apache.commons.lang3.tuple.Pair<Map<String,Double>,Map<String,Set<String>>> scoreSynonymsRecordIndex(String queryType, Map<String,Collection<de.julielab.geneexpbase.genemodel.GeneName>> ids2entities, Function<GeneRecordHit,String[]> synhit2namesFunc, de.julielab.geneexpbase.candidateretrieval.QueryGenerator qg)
Scores each synonym in allSynonym against the IDs in ids.
void setFulltextFieldsToRecordHits(Collection<? extends de.julielab.geneexpbase.candidateretrieval.SynHit> recordHits, Collection<String> fieldsToLoad)
Sets the full text / gene context fields (generif, summary, interactions) to instances of GeneRecordHit.
void setNormalizer(de.julielab.geneexpbase.TermNormalizer normalizer)
de.julielab.geneexpbase.scoring.Scorer setScorerType(int type)
-
-
-
Field Detail
-
TEST_MODE
public static final boolean TEST_MODE
- See Also:
- Constant Field Values
-
CONJUNCTION
public static final de.julielab.geneexpbase.candidateretrieval.QueryGenerator CONJUNCTION
-
DISJUNCTION
public static final de.julielab.geneexpbase.candidateretrieval.QueryGenerator DISJUNCTION
-
DISJUNCTION_MINUS_1
public static final de.julielab.geneexpbase.candidateretrieval.QueryGenerator DISJUNCTION_MINUS_1
-
DISJUNCTION_MINUS_2
public static final de.julielab.geneexpbase.candidateretrieval.QueryGenerator DISJUNCTION_MINUS_2
-
NGRAM_2_3
public static final de.julielab.geneexpbase.candidateretrieval.QueryGenerator NGRAM_2_3
-
GENE_RECORDS_CNF
public static final de.julielab.geneexpbase.candidateretrieval.QueryGenerator GENE_RECORDS_CNF
Conjunctive normal form query: every token must be found in at least one field.
-
GENE_RECORDS_CNF_WITH_SYNONYMS
public static final de.julielab.geneexpbase.candidateretrieval.QueryGenerator GENE_RECORDS_CNF_WITH_SYNONYMS
Just like GENE_RECORDS_CNF but with an additional BooleanClause.Occur.SHOULD clause only for synonym matches created via GeneRecordSynonymsQueryGenerator.
-
GENE_RECORDS_FLAT_DISJUNCTION
public static final de.julielab.geneexpbase.candidateretrieval.QueryGenerator GENE_RECORDS_FLAT_DISJUNCTION
Puts all tokens on all fields in one large disjunction. Thus, not every token needs to match. Used for context scoring of gene names.
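The difference between the conjunctive normal form and the flat disjunction can be made visible by writing both shapes out in Lucene's classic query-parser syntax. The following is only a sketch of the boolean structure, not the code of the actual query generators: the class name, the method names and the field names (symbol, synonyms) are hypothetical, and the real generators may build their queries differently.

```java
import java.util.List;
import java.util.stream.Collectors;

// Illustrative only: field names are hypothetical and the returned strings merely
// spell out the boolean structure in Lucene's classic query-parser syntax.
final class QueryShapeSketch {

    /** CNF: one OR-group per token (spanning all fields), all groups joined with AND. */
    static String cnf(List<String> tokens, List<String> fields) {
        return tokens.stream()
                .map(token -> fields.stream()
                        .map(field -> field + ":" + token)
                        .collect(Collectors.joining(" OR ", "(", ")")))
                .collect(Collectors.joining(" AND "));
    }

    /** Flat disjunction: every token on every field in a single OR-group. */
    static String flatDisjunction(List<String> tokens, List<String> fields) {
        return tokens.stream()
                .flatMap(token -> fields.stream().map(field -> field + ":" + token))
                .collect(Collectors.joining(" OR "));
    }

    public static void main(String[] args) {
        List<String> tokens = List.of("interleukin", "2");
        List<String> fields = List.of("symbol", "synonyms");
        System.out.println(cnf(tokens, fields));
        System.out.println(flatDisjunction(tokens, fields));
    }
}
```

In the CNF shape a document has to match every token somewhere, whereas in the flat shape a single token hit in a single field is enough, which fits the stated use for context scoring.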
-
GENE_RECORDS_DISMAX
public static final de.julielab.geneexpbase.candidateretrieval.QueryGenerator GENE_RECORDS_DISMAX
-
GENE_RECORDS_SYNONYMS_APPROX
public static final de.julielab.geneexpbase.candidateretrieval.QueryGenerator GENE_RECORDS_SYNONYMS_APPROX
-
GENE_RECORDS_SYNONYMS_EXACT
public static final de.julielab.geneexpbase.candidateretrieval.QueryGenerator GENE_RECORDS_SYNONYMS_EXACT
-
NAME_PRIO_DELIMITER
public static final String NAME_PRIO_DELIMITER
- See Also:
- Constant Field Values
-
LOGGER_NAME_CANDIDATES
public static final String LOGGER_NAME_CANDIDATES
- See Also:
- Constant Field Values
-
SIMPLE_SCORER
public static final int SIMPLE_SCORER
- See Also:
- Constant Field Values
-
TOKEN_JAROWINKLER_SCORER
public static final int TOKEN_JAROWINKLER_SCORER
- See Also:
- Constant Field Values
-
MAXENT_SCORER
public static final int MAXENT_SCORER
- See Also:
- Constant Field Values
-
JAROWINKLER_SCORER
public static final int JAROWINKLER_SCORER
- See Also:
- Constant Field Values
-
LEVENSHTEIN_SCORER
public static final int LEVENSHTEIN_SCORER
- See Also:
- Constant Field Values
-
TFIDF
public static final int TFIDF
- See Also:
- Constant Field Values
-
LUCENE_SCORER
public static final int LUCENE_SCORER
- See Also:
- Constant Field Values
-
MAXENT_SCORER_MODEL
public static final String MAXENT_SCORER_MODEL
Default model for MaxEntScorer.
- See Also:
- Constant Field Values
-
candidateLog
public static final org.slf4j.Logger candidateLog
-
LUCENE_MAX_HITS
public static final int LUCENE_MAX_HITS
The maximum number of hits Lucene returns for a query.
- See Also:
- Constant Field Values
-
-
Constructor Detail
-
LuceneCandidateRetrieval
@Deprecated public LuceneCandidateRetrieval(org.apache.lucene.search.IndexSearcher mentionIndexSearcher, de.julielab.geneexpbase.scoring.Scorer scorer)
Deprecated.
-
LuceneCandidateRetrieval
@Inject public LuceneCandidateRetrieval(Configuration config, ExecutorService executorService, de.julielab.geneexpbase.services.CacheService cacheService) throws de.julielab.geneexpbase.candidateretrieval.GeneCandidateRetrievalException
- Throws:
de.julielab.geneexpbase.candidateretrieval.GeneCandidateRetrievalException
-
-
Method Detail
-
getTotalCacheGettime
public static AtomicLong getTotalCacheGettime()
-
getTotalGeneRecordFieldLoadingTime
public static AtomicLong getTotalGeneRecordFieldLoadingTime()
-
getTotalCachePuttime
public static AtomicLong getTotalCachePuttime()
-
getTotalLuceneQueryTime
public static AtomicLong getTotalLuceneQueryTime()
-
getCacheMisses
public static AtomicLong getCacheMisses()
-
getCacheHits
public static AtomicLong getCacheHits()
-
getTFIDFOnGeneRecordNames
public de.julielab.geneexpbase.scoring.TFIDFScorer getTFIDFOnGeneRecordNames()
- Specified by:
getTFIDFOnGeneRecordNames in interface CandidateRetrieval
-
getTFIDFOnGeneSynonyms
public de.julielab.geneexpbase.scoring.TFIDFScorer getTFIDFOnGeneSynonyms()
-
getConfiguration
public Configuration getConfiguration()
-
getNormalizer
public de.julielab.geneexpbase.TermNormalizer getNormalizer()
-
setNormalizer
public void setNormalizer(de.julielab.geneexpbase.TermNormalizer normalizer)
-
getScorer
public de.julielab.geneexpbase.scoring.Scorer getScorer()
-
getSpellingChecker
public org.apache.lucene.search.spell.SpellChecker getSpellingChecker()
- Specified by:
getSpellingChecker in interface CandidateRetrieval
-
setScorerType
public de.julielab.geneexpbase.scoring.Scorer setScorerType(int type) throws de.julielab.geneexpbase.candidateretrieval.GeneCandidateRetrievalException
- Throws:
de.julielab.geneexpbase.candidateretrieval.GeneCandidateRetrievalException
-
getScorerInfo
public String getScorerInfo()
-
getScorerType
public int getScorerType()
-
getCandidates
public List<de.julielab.geneexpbase.candidateretrieval.SynHit> getCandidates(String originalSearchTerm, de.julielab.geneexpbase.candidateretrieval.QueryGenerator queryGenerator)
- Specified by:
getCandidates in interface CandidateRetrieval
-
getCandidates
public List<de.julielab.geneexpbase.candidateretrieval.SynHit> getCandidates(de.julielab.geneexpbase.genemodel.GeneMention geneMention, de.julielab.geneexpbase.candidateretrieval.QueryGenerator queryGenerator)
- Specified by:
getCandidates in interface de.julielab.geneexpbase.candidateretrieval.CandidateRetrieval
-
getCandidates
public List<de.julielab.geneexpbase.candidateretrieval.SynHit> getCandidates(de.julielab.geneexpbase.genemodel.GeneMention geneMention, Collection<String> organisms, de.julielab.geneexpbase.candidateretrieval.QueryGenerator queryGenerator)
- Specified by:
getCandidates in interface de.julielab.geneexpbase.candidateretrieval.CandidateRetrieval
-
getCandidates
public List<de.julielab.geneexpbase.candidateretrieval.SynHit> getCandidates(de.julielab.geneexpbase.genemodel.GeneMention geneMention, Collection<String> geneIdsFilter, Collection<String> organisms, de.julielab.geneexpbase.candidateretrieval.QueryGenerator queryGenerator)
- Specified by:
getCandidates in interface de.julielab.geneexpbase.candidateretrieval.CandidateRetrieval
-
getCandidates
public List<de.julielab.geneexpbase.candidateretrieval.SynHit> getCandidates(de.julielab.geneexpbase.genemodel.GeneMention geneMention, Collection<String> geneIdsFilter, Collection<String> organisms, boolean loadFields, de.julielab.geneexpbase.configuration.Parameters parameters, de.julielab.geneexpbase.candidateretrieval.QueryGenerator queryGenerator)
- Specified by:
getCandidates in interface CandidateRetrieval
-
getCandidates
public List<de.julielab.geneexpbase.candidateretrieval.SynHit> getCandidates(String geneMentionText, Collection<String> geneIdsFilter, Collection<String> organism, de.julielab.geneexpbase.candidateretrieval.QueryGenerator queryGenerator)
- Specified by:
getCandidates in interface CandidateRetrieval
-
getCandidates
public List<de.julielab.geneexpbase.candidateretrieval.SynHit> getCandidates(String geneMentionText, Collection<String> geneIdsFilter, Collection<String> organism, boolean loadFields, de.julielab.geneexpbase.candidateretrieval.QueryGenerator queryGenerator)
-
getCandidates
public List<de.julielab.geneexpbase.candidateretrieval.SynHit> getCandidates(de.julielab.geneexpbase.genemodel.GeneMention geneMention, String organism, de.julielab.geneexpbase.candidateretrieval.QueryGenerator queryGenerator)
- Specified by:
getCandidates in interface CandidateRetrieval
-
getCandidates
public List<de.julielab.geneexpbase.candidateretrieval.SynHit> getCandidates(String geneMentionText, String organism, de.julielab.geneexpbase.candidateretrieval.QueryGenerator queryGenerator)
- Specified by:
getCandidates in interface CandidateRetrieval
-
getCandidates
public List<de.julielab.geneexpbase.candidateretrieval.SynHit> getCandidates(String geneMentionText, Collection<String> organism, de.julielab.geneexpbase.candidateretrieval.QueryGenerator queryGenerator)
- Specified by:
getCandidates in interface CandidateRetrieval
-
mapGeneIdToTaxId
public String mapGeneIdToTaxId(String geneId)
- Specified by:
mapGeneIdToTaxId in interface de.julielab.geneexpbase.candidateretrieval.CandidateRetrieval
-
getIndexRecords
public List<de.julielab.geneexpbase.candidateretrieval.SynHit> getIndexRecords(Collection<String> ids)
Returns GeneRecordHit instances with all fields loaded.
- Parameters:
ids - IDs of the gene records to return.
- Returns:
- The records for the given IDs.
-
getOriginalNamesIndexRecords
public List<de.julielab.geneexpbase.candidateretrieval.SynHit> getOriginalNamesIndexRecords(Collection<String> ids)
- Specified by:
getOriginalNamesIndexRecords in interface CandidateRetrieval
-
getOriginalNamesIndexRecords
public List<de.julielab.geneexpbase.candidateretrieval.SynHit> getOriginalNamesIndexRecords(Collection<String> ids, de.julielab.geneexpbase.genemodel.GeneName geneName)
- Specified by:
getOriginalNamesIndexRecords in interface CandidateRetrieval
-
getIndexRecords
public List<de.julielab.geneexpbase.candidateretrieval.SynHit> getIndexRecords(Collection<String> ids, org.apache.lucene.search.IndexSearcher indexSearcher)
-
getIndexRecords
public List<de.julielab.geneexpbase.candidateretrieval.SynHit> getIndexRecords(Collection<String> ids, de.julielab.geneexpbase.genemodel.GeneName geneName, Function<de.julielab.geneexpbase.genemodel.GeneName,String> geneNameFunc, org.apache.lucene.search.IndexSearcher indexSearcher)
- Parameters:
ids - The gene IDs of the index items to retrieve.
geneName - The gene name to add as the mapped mention and to use to find the synonym matching the gene name best.
geneNameFunc - The function to be applied to geneName in order to retrieve a string for comparison.
indexSearcher - The index searcher to use.
- Returns:
- The found SynHits matching the input IDs.
-
scoreIdsByBoWSynonyms
public List<de.julielab.geneexpbase.candidateretrieval.SynHit> scoreIdsByBoWSynonyms(Collection<String> allSynonyms, Set<String> ids, de.julielab.geneexpbase.candidateretrieval.QueryGenerator qg)
- Specified by:
scoreIdsByBoWSynonyms in interface CandidateRetrieval
-
scoreSynonymsRecordIndex
public org.apache.commons.lang3.tuple.Pair<Map<String,Double>,Map<String,Set<String>>> scoreSynonymsRecordIndex(String queryType, Map<String,Collection<de.julielab.geneexpbase.genemodel.GeneName>> ids2entities, Function<GeneRecordHit,String[]> synhit2namesFunc, de.julielab.geneexpbase.candidateretrieval.QueryGenerator qg)
Scores each synonym in allSynonym against the IDs in ids.
Each resulting SynHit adds its mention score to the ID represented by this SynHit.
- Specified by:
scoreSynonymsRecordIndex in interface CandidateRetrieval
- Parameters:
queryType -
ids2entities -
qg -
- Returns:
-
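The accumulation described for scoreSynonymsRecordIndex, where each hit adds its mention score to the ID it represents, can be sketched as follows. This is an illustration of the idea only, under stated assumptions: the Hit class is a hypothetical stand-in for SynHit, the gene IDs and scores are made up, and the second map of the returned Pair is not modeled because its meaning is not documented here.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class ScoreAccumulationSketch {

    /** Hypothetical stand-in for a hit: the gene ID it represents and its mention score. */
    static final class Hit {
        final String id;
        final double mentionScore;

        Hit(String id, double mentionScore) {
            this.id = id;
            this.mentionScore = mentionScore;
        }
    }

    /** Sums the mention scores of all hits per gene ID. */
    static Map<String, Double> accumulate(List<Hit> hits) {
        Map<String, Double> scoreById = new LinkedHashMap<>();
        for (Hit hit : hits) {
            // merge() inserts the score for a new ID and adds it to the running sum otherwise
            scoreById.merge(hit.id, hit.mentionScore, Double::sum);
        }
        return scoreById;
    }

    public static void main(String[] args) {
        // Made-up IDs and scores, for illustration only
        List<Hit> hits = List.of(new Hit("3558", 1.5), new Hit("3558", 0.5), new Hit("3559", 2.0));
        System.out.println(accumulate(hits));
    }
}
```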
getCandidates
public List<de.julielab.geneexpbase.candidateretrieval.SynHit> getCandidates(de.julielab.geneexpbase.genemodel.GeneMention gm, Collection<String> taxId, de.julielab.geneexpbase.configuration.Parameters parameters, de.julielab.geneexpbase.candidateretrieval.QueryGenerator queryGenerator)
- Specified by:
getCandidates in interface CandidateRetrieval
-
close
public void close()
- Specified by:
close in interface AutoCloseable
- Specified by:
close in interface CandidateRetrieval
- Specified by:
close in interface Closeable
-
getFamilyNames
public List<de.julielab.geneexpbase.candidateretrieval.SynHit> getFamilyNames(de.julielab.geneexpbase.genemodel.GeneMention gm, de.julielab.geneexpbase.candidateretrieval.QueryGenerator queryGenerator)
Description copied from interface: CandidateRetrieval
Searches the index for the given gene mention filtered for family names.
- Specified by:
getFamilyNames in interface CandidateRetrieval
- Parameters:
gm - The gene mention to check for family names.
queryGenerator - The query generator to use.
- Returns:
-
setFulltextFieldsToRecordHits
public void setFulltextFieldsToRecordHits(Collection<? extends de.julielab.geneexpbase.candidateretrieval.SynHit> recordHits, Collection<String> fieldsToLoad)
Sets the full text / gene context fields (generif, summary, interactions) on instances of GeneRecordHit.
Note that this method accepts plain SynHit instances for convenience, but the actual objects must be GeneRecordHits.
- Specified by:
setFulltextFieldsToRecordHits in interface CandidateRetrieval
- Parameters:
recordHits - The GeneRecordHits to set the full text / gene context values for.
fieldsToLoad - The gene context fields to load and set. Must be included in fullTextFieldSetter.
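Since setFulltextFieldsToRecordHits takes a collection typed as SynHit but expects GeneRecordHit objects, a caller holding a mixed collection may want to filter it first. The sketch below shows one way to do that on the caller side. The nested SynHit and GeneRecordHit classes are empty hypothetical stand-ins for the real types, and the subtype relation between them is an assumption inferred from the description above; how the real method reacts to other SynHit subtypes is not documented here.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.List;

final class RecordHitFilterSketch {

    /** Hypothetical stand-ins for the real SynHit and GeneRecordHit types. */
    static class SynHit { }
    static class GeneRecordHit extends SynHit { }

    /** Keeps only those hits that are actually GeneRecordHit instances. */
    static List<GeneRecordHit> onlyRecordHits(Collection<? extends SynHit> hits) {
        List<GeneRecordHit> recordHits = new ArrayList<>();
        for (SynHit hit : hits) {
            if (hit instanceof GeneRecordHit) {
                recordHits.add((GeneRecordHit) hit);
            }
        }
        return recordHits;
    }

    public static void main(String[] args) {
        List<SynHit> mixed = List.of(new SynHit(), new GeneRecordHit());
        System.out.println(onlyRecordHits(mixed).size());
    }
}
```

The filtered list could then be passed on as the recordHits argument.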
-
getGeneRecords
public Set<GeneRecordHit> getGeneRecords(Collection<String> ids)
-
-