Package org.miaixz.bus.core.codec.hash
Class Simhash
java.lang.Object
org.miaixz.bus.core.codec.hash.Simhash
- All Implemented Interfaces:
Encoder<Collection<? extends CharSequence>, Number>, Hash64<Collection<? extends CharSequence>>
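Simhash is a locality-sensitive hash used to deduplicate large volumes of text.
The algorithm implementation comes from: https://github.com/xlturing/Simhash4J
Definition of locality-sensitive hashing: if two strings that are similar to some degree remain similar after hashing, the hash is called locality-sensitive.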
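To make the definition concrete, the following is a minimal sketch of a 64-bit simhash over a token list. It is not the code of this class: the names `SimhashSketch`, `fnv1a64`, `simhash`, and `hammingDistance` are hypothetical, and the per-token hash (FNV-1a) is an assumption chosen only because it is short to write down.

```java
import java.nio.charset.StandardCharsets;
import java.util.List;

/** Illustrative sketch only; not the Simhash class documented on this page. */
final class SimhashSketch {

    /** 64-bit FNV-1a hash of one token (an assumed per-token hash). */
    static long fnv1a64(CharSequence token) {
        long h = 0xcbf29ce484222325L;            // FNV-1a 64-bit offset basis
        for (byte b : token.toString().getBytes(StandardCharsets.UTF_8)) {
            h ^= (b & 0xff);
            h *= 0x100000001b3L;                 // FNV 64-bit prime
        }
        return h;
    }

    /** Builds a fingerprint by letting every token vote on each of the 64 bits. */
    static long simhash(List<? extends CharSequence> tokens) {
        int[] votes = new int[64];
        for (CharSequence token : tokens) {
            long h = fnv1a64(token);
            for (int i = 0; i < 64; i++) {
                // A set bit votes +1, a clear bit votes -1.
                votes[i] += ((h >>> i) & 1L) == 1L ? 1 : -1;
            }
        }
        long fingerprint = 0L;
        for (int i = 0; i < 64; i++) {
            if (votes[i] > 0) {
                fingerprint |= 1L << i;          // majority of tokens had this bit set
            }
        }
        return fingerprint;
    }

    /** Number of bit positions in which two fingerprints differ. */
    static int hammingDistance(long a, long b) {
        return Long.bitCount(a ^ b);
    }
}
```

Because each bit is decided by a majority vote, changing a few tokens tends to flip only a few bits, so similar token lists usually end up a small Hamming distance apart. The result does not depend on token order.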
- Since:
- Java 17+
- Author:
- Kimi Liu
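Constructor Summary
Constructors
- Simhash()
  Constructor.
- Simhash(int fracCount, int hammingThresh)
  Constructor.
Method Summary
- boolean equals(Collection<? extends CharSequence> segList)
  Checks whether the text duplicates data that has already been stored.
- long hash64(Collection<? extends CharSequence> segList)
  Computes the simhash value of the given text.
- void store
  Stores the entry under the (frac, simhash, content) index.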
Constructor Details
Simhash
public Simhash()
Constructor.
Simhash
public Simhash(int fracCount, int hammingThresh)
Constructor.
- Parameters:
fracCount - number of storage segments
hammingThresh - Hamming distance threshold used as the similarity criterion
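As a rough illustration of what the two parameters control, the sketch below splits a 64-bit fingerprint into `fracCount` equal-width segments and compares two fingerprints against `hammingThresh`. The class and method names are hypothetical, equal-width segments are an assumption, and this page does not state whether the library compares with `<` or `<=`; the sketch uses `<=`.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative sketch only; segment layout and comparison are assumptions. */
final class SegmentSketch {

    /** Splits a fingerprint into fracCount equal-width segments, lowest bits first. */
    static List<Long> split(long fingerprint, int fracCount) {
        if (fracCount <= 0 || 64 % fracCount != 0) {
            throw new IllegalArgumentException("fracCount must divide 64");
        }
        int width = 64 / fracCount;
        long mask = width == 64 ? -1L : (1L << width) - 1;
        List<Long> segments = new ArrayList<>(fracCount);
        for (int i = 0; i < fracCount; i++) {
            segments.add((fingerprint >>> (i * width)) & mask);
        }
        return segments;
    }

    /** True when the fingerprints differ in at most hammingThresh bit positions. */
    static boolean withinThreshold(long a, long b, int hammingThresh) {
        return Long.bitCount(a ^ b) <= hammingThresh;
    }
}
```

The two values are related by the pigeonhole principle: if two fingerprints differ in at most k bits and are split into k + 1 segments, at least one segment must be identical in both, so that segment can serve as a lookup key.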
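Method Details
hash64
Computes the simhash value of the given text.
- Specified by:
hash64 in interface Hash64<Collection<? extends CharSequence>>
- Parameters:
segList - list of tokens produced by word segmentation
- Returns:
- the hash value
equals
Checks whether the text duplicates data that has already been stored.
- Parameters:
segList - the text after word segmentation
- Returns:
- whether the text is a duplicate
store
Stores the entry under the (frac, simhash, content) index.
- Parameters:
simhash - the simhash value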
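The interplay between storing and duplicate checking can be sketched with a small in-memory index keyed by segment position and segment value. This is an assumption-laden illustration, not the library's storage code: `SegmentIndexSketch`, `store(long)`, and `isDuplicate(long)` are hypothetical names, only fingerprints are kept (the content part of the index is omitted), and the threshold comparison uses `<=`.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Illustrative in-memory index; not the storage used by the documented class. */
final class SegmentIndexSketch {
    private final int fracCount;
    private final int hammingThresh;
    private final int width;
    // One map per segment position: segment value -> fingerprints stored under it.
    private final List<Map<Long, List<Long>>> index = new ArrayList<>();

    SegmentIndexSketch(int fracCount, int hammingThresh) {
        if (fracCount <= 0 || 64 % fracCount != 0) {
            throw new IllegalArgumentException("fracCount must divide 64");
        }
        this.fracCount = fracCount;
        this.hammingThresh = hammingThresh;
        this.width = 64 / fracCount;
        for (int i = 0; i < fracCount; i++) {
            index.add(new HashMap<>());
        }
    }

    /** Extracts the i-th segment of a fingerprint, lowest bits first. */
    private long segment(long fingerprint, int i) {
        long mask = width == 64 ? -1L : (1L << width) - 1;
        return (fingerprint >>> (i * width)) & mask;
    }

    /** Records the fingerprint under each of its segments. */
    void store(long fingerprint) {
        for (int i = 0; i < fracCount; i++) {
            index.get(i)
                 .computeIfAbsent(segment(fingerprint, i), k -> new ArrayList<>())
                 .add(fingerprint);
        }
    }

    /** Looks up candidates sharing a segment, then verifies the full Hamming distance. */
    boolean isDuplicate(long fingerprint) {
        for (int i = 0; i < fracCount; i++) {
            List<Long> candidates = index.get(i).get(segment(fingerprint, i));
            if (candidates == null) {
                continue;
            }
            for (long stored : candidates) {
                if (Long.bitCount(stored ^ fingerprint) <= hammingThresh) {
                    return true;
                }
            }
        }
        return false;
    }
}
```

Indexing by segment means a lookup compares the new fingerprint only against stored entries that share at least one identical segment, instead of scanning every stored fingerprint.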