Package org.genesys.taxonomy.checker
Class MostFrequentKChars
java.lang.Object
org.genesys.taxonomy.checker.MostFrequentKChars
Based on pseudocode at https://en.wikipedia.org/wiki/Most_frequent_k_characters and http://rosettacode.org/wiki/Most_frequent_k_chars_distance
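Does not handle digits [0-9]: the hash writes each character followed by its count (e.g. "i3b2"), so a digit in the input could not be told apart from a count.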
Constructor Summary
Constructors
MostFrequentKChars()
Method Summary
Modifier and Type | Method | Description
static int | getMostFreqKSimilarity(int[] hash1, int[] hash2) | Calculate the similarity of the two hashes.
static int | getMostFreqKSimilarity(String hash1, String hash2) | Calculate the similarity of the two hashes.
static String | getMostFrequentKHash(String string, int k) | Get the hash for an input string with at most K most frequent characters.
static double | mostFreqKSDF(String inputStr1, String inputStr2, int K) | Most freq ksdf.
static int | mostFreqKSDF(String inputStr1, String inputStr2, int K, int maxDistance) | Wrapper function.
static String | toHashString(int[] h1) | Encode a hash array to String.
Constructor Details
MostFrequentKChars
public MostFrequentKChars()
Method Details
getMostFrequentKHash
Get the hash for an input string with at most K most frequent characters.
String function MostFreqKHashing (String inputString, int K)
    def string outputString
    for each distinct character
        count occurrence of each character
    for i := 0 to K
        char c = next most freq ith character (if two chars have same frequency then get the first occurrence in inputString)
        int count = number of occurrence of the character
        append to outputString, c and count
    end for
    return outputString
Parameters:
string - the input string to hash
k - the maximum number of most frequent characters to include in the hash
Returns:
the most frequent K hash
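The hashing pseudocode can be sketched in Java roughly as follows. This is a minimal illustration, not the library's implementation: the class name `MostFreqKHashSketch` and the method `hash` are hypothetical, and the sketch does not filter out digits, so it assumes digit-free input.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch class; not part of org.genesys.taxonomy.checker.
final class MostFreqKHashSketch {

    /** Concatenates up to k (character, count) pairs, most frequent first. */
    static String hash(String input, int k) {
        // LinkedHashMap remembers the order in which characters first occur.
        Map<Character, Integer> counts = new LinkedHashMap<>();
        for (char c : input.toCharArray()) {
            counts.merge(c, 1, Integer::sum);
        }

        List<Map.Entry<Character, Integer>> entries = new ArrayList<>(counts.entrySet());
        // List.sort is stable, so characters with equal counts stay in
        // first-occurrence order, as the pseudocode requires.
        entries.sort((a, b) -> Integer.compare(b.getValue(), a.getValue()));

        StringBuilder out = new StringBuilder();
        int limit = Math.min(k, entries.size());
        for (int i = 0; i < limit; i++) {
            out.append(entries.get(i).getKey()).append(entries.get(i).getValue());
        }
        return out.toString();
    }
}
```

With this sketch, `MostFreqKHashSketch.hash("research", 2)` yields `"r2e2"` and `MostFreqKHashSketch.hash("seeking", 2)` yields `"e2s1"`.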
getMostFreqKSimilarity
Calculate the similarity of the two hashes, given in their String form.
Parameters:
hash1 - the first hash
hash2 - the second hash
Returns:
the most frequent K similarity
getMostFreqKSimilarity
public static int getMostFreqKSimilarity(int[] hash1, int[] hash2)
Calculate the similarity of the two hashes.
int function MostFreqKSimilarity (String inputStr1, String inputStr2, int limit)
    def int similarity
    for each c = next character from inputStr1
        lookup c in inputStr2
        if c is null continue
        // similarity += frequency of c in inputStr1
        similarity += frequency of c in inputStr1 + frequency of c in inputStr2
    // return limit - similarity
    return similarity
Parameters:
hash1 - the first hash
hash2 - the second hash
Returns:
the most frequent K similarity
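As an illustration of the similarity pseudocode, the sketch below works on String hashes such as "r2e2" and adds the counts from both hashes for every shared character. The class `MostFreqKSimilaritySketch` and its methods are hypothetical names, and the parser assumes a well-formed hash (each character followed by a decimal count, no digit characters hashed); a malformed hash would raise a `NumberFormatException`.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch class; not part of org.genesys.taxonomy.checker.
final class MostFreqKSimilaritySketch {

    /** Parses a hash such as "r2e2" into a character-to-count map. */
    static Map<Character, Integer> parse(String hash) {
        Map<Character, Integer> counts = new LinkedHashMap<>();
        int i = 0;
        while (i < hash.length()) {
            char c = hash.charAt(i++);
            int start = i;
            // Consume the run of digits that forms the count.
            while (i < hash.length() && Character.isDigit(hash.charAt(i))) {
                i++;
            }
            counts.put(c, Integer.parseInt(hash.substring(start, i)));
        }
        return counts;
    }

    /** Sums the counts from both hashes for each character they share. */
    static int similarity(String hash1, String hash2) {
        Map<Character, Integer> first = parse(hash1);
        Map<Character, Integer> second = parse(hash2);
        int similarity = 0;
        for (Map.Entry<Character, Integer> entry : first.entrySet()) {
            Integer other = second.get(entry.getKey());
            if (other == null) {
                continue; // character absent from the second hash
            }
            similarity += entry.getValue() + other;
        }
        return similarity;
    }
}
```

For the hashes "r2e2" and "e2s1" only `e` is shared, so this sketch returns 2 + 2 = 4.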
mostFreqKSDF
Wrapper function.
int function MostFreqKSDF (string inputStr1, string inputStr2, int K, int maxDistance)
    return maxDistance - MostFreqKSimilarity(MostFreqKHashing(inputStr1,K), MostFreqKHashing(inputStr2,K))
Parameters:
inputStr1 - the first input string
inputStr2 - the second input string
K - the maximum number of most frequent characters to hash
maxDistance - the maximum distance, from which the similarity is subtracted
Returns:
maxDistance minus the similarity of the two K-hashes
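The wrapper pseudocode composes the two earlier steps. The sketch below is one self-contained way to express that composition; it keeps the top-K counts in maps instead of hash strings, which is a simplification chosen for brevity and not necessarily how the class does it. `MostFreqKDistanceSketch`, `topK` and `distance` are hypothetical names.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.stream.Collectors;

// Hypothetical sketch class; not part of org.genesys.taxonomy.checker.
final class MostFreqKDistanceSketch {

    /** Top-k character counts; ties keep first-occurrence order. */
    static Map<Character, Integer> topK(String input, int k) {
        Map<Character, Integer> counts = new LinkedHashMap<>();
        for (char c : input.toCharArray()) {
            counts.merge(c, 1, Integer::sum);
        }
        // sorted() is stable on an ordered stream, so equal counts
        // remain in first-occurrence order.
        return counts.entrySet().stream()
                .sorted((a, b) -> Integer.compare(b.getValue(), a.getValue()))
                .limit(k)
                .collect(Collectors.toMap(
                        Map.Entry::getKey, Map.Entry::getValue,
                        (x, y) -> x, LinkedHashMap::new));
    }

    /** maxDistance minus the summed counts of characters shared by both top-k sets. */
    static int distance(String s1, String s2, int k, int maxDistance) {
        Map<Character, Integer> first = topK(s1, k);
        Map<Character, Integer> second = topK(s2, k);
        int similarity = 0;
        for (Map.Entry<Character, Integer> entry : first.entrySet()) {
            Integer other = second.get(entry.getKey());
            if (other != null) {
                similarity += entry.getValue() + other;
            }
        }
        return maxDistance - similarity;
    }
}
```

Under the summing rule of the pseudocode, `distance("research", "seeking", 2, 10)` gives 10 - 4 = 6 in this sketch.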
mostFreqKSDF
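Most frequent K distance between two input strings; this overload takes no maxDistance argument.
Parameters:
inputStr1 - the first input string
inputStr2 - the second input string
K - the maximum number of most frequent characters to hash
Returns:
the result as a double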
toHashString
Encode a hash array to String.
Parameters:
h1 - hash array as generated
Returns:
String representation of the hash array (e.g. "i3b2")
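The layout of the int[] hash is not described on this page. Purely as an assumption for illustration, the sketch below treats the array as alternating (character code, count) pairs and encodes it into the "i3b2" form; the real array layout may differ. `HashArraySketch` and `encodePairs` are hypothetical names.

```java
// Hypothetical sketch class; not part of org.genesys.taxonomy.checker.
final class HashArraySketch {

    /**
     * Encodes [char0, count0, char1, count1, ...] as a string such as "i3b2".
     * The alternating-pair layout is an assumption, not taken from the library.
     */
    static String encodePairs(int[] pairs) {
        if (pairs.length % 2 != 0) {
            throw new IllegalArgumentException("expected (character, count) pairs");
        }
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < pairs.length; i += 2) {
            // First element of the pair is the character, second is its count.
            out.appendCodePoint(pairs[i]).append(pairs[i + 1]);
        }
        return out.toString();
    }
}
```

Under that assumed layout, `encodePairs(new int[] {'i', 3, 'b', 2})` produces `"i3b2"`.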