Class MostFrequentKChars

java.lang.Object
org.genesys.taxonomy.checker.MostFrequentKChars

public class MostFrequentKChars extends Object
Based on the pseudocode at https://en.wikipedia.org/wiki/Most_frequent_k_characters and http://rosettacode.org/wiki/Most_frequent_k_chars_distance. Does not handle digits [0-9] in the input, since the hash string uses digits to encode character counts (e.g. "i3b2").
  • Constructor Details

    • MostFrequentKChars

      public MostFrequentKChars()
  • Method Details

    • getMostFrequentKHash

      public static String getMostFrequentKHash(String string, int k)
      Computes the hash of an input string: its K most frequent characters (fewer if the string has fewer distinct characters), each followed by its occurrence count.
              String function MostFreqKHashing (String inputString, int K)
                      def string outputString
                      for each distinct character
                          count occurrence of each character
                      for i := 0 to K - 1
                          char c = i-th most frequent character (if two chars have the same frequency, take the one that occurs first in inputString)
                          int count = number of occurrences of c
                          append c and count to outputString
                      end for
                      return outputString
       
      Parameters:
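      string - the input string to hash
      k - the maximum number of most frequent characters to include in the hash
      Returns:
      the hash string: each selected character followed by its occurrence count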
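      The pseudocode above can be sketched in Java as follows. This is a minimal illustration under stated assumptions, not the implementation behind getMostFrequentKHash: the class name MostFreqKHashSketch and the method mostFreqKHash are hypothetical, and ties are broken by first occurrence using an insertion-ordered map plus a stable sort.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch class; not part of org.genesys.taxonomy.checker.
final class MostFreqKHashSketch {

    static String mostFreqKHash(String input, int k) {
        // LinkedHashMap remembers first-occurrence order, which is the tie-breaker.
        Map<Character, Integer> counts = new LinkedHashMap<>();
        for (char c : input.toCharArray()) {
            counts.merge(c, 1, Integer::sum);
        }

        // List.sort is stable, so characters with equal counts keep
        // their first-occurrence order after sorting by descending count.
        List<Map.Entry<Character, Integer>> entries = new ArrayList<>(counts.entrySet());
        entries.sort((a, b) -> Integer.compare(b.getValue(), a.getValue()));

        // Emit at most k (character, count) pairs.
        StringBuilder out = new StringBuilder();
        int limit = Math.min(k, entries.size());
        for (int i = 0; i < limit; i++) {
            out.append(entries.get(i).getKey()).append(entries.get(i).getValue());
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(mostFreqKHash("research", 2)); // prints r2e2
        System.out.println(mostFreqKHash("seeking", 2));  // prints e2s1
    }
}
```

      Because counts are appended as decimal digits, an input that itself contains digits would make the resulting hash ambiguous to read back.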
    • getMostFreqKSimilarity

      public static int getMostFreqKSimilarity(String hash1, String hash2)
      Calculate the similarity of the two hashes.
      Parameters:
      hash1 - the first hash string
      hash2 - the second hash string
      Returns:
      the similarity score of the two hashes
    • getMostFreqKSimilarity

      public static int getMostFreqKSimilarity(int[] hash1, int[] hash2)
      Calculate the similarity of the two hashes.
                      int function MostFreqKSimilarity (String inputStr1, String inputStr2, int limit)
                          def int similarity
                          for each c = next character from inputStr1
                              lookup c in inputStr2
                              if c is not found in inputStr2
                                   continue
                              // similarity += frequency of c in inputStr1
                              similarity += frequency of c in inputStr1 + frequency of c in inputStr2
                          // return limit - similarity
                          return similarity
       
      Parameters:
      hash1 - the first hash array
      hash2 - the second hash array
      Returns:
      the similarity score of the two hashes
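      The similarity pseudocode above, applied to hash strings such as "r2e2", might look like the sketch below. It assumes each non-digit character in a hash is followed by its decimal count, and it follows the variant documented here, which adds the frequencies from both hashes for every shared character. The names MostFreqKSimilaritySketch, parseHash and similarity are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch class; not part of org.genesys.taxonomy.checker.
final class MostFreqKSimilaritySketch {

    // Parses a hash such as "r2e2" into character -> count.
    // Assumes every character is followed by at least one digit;
    // a malformed hash makes Integer.parseInt throw NumberFormatException.
    static Map<Character, Integer> parseHash(String hash) {
        Map<Character, Integer> counts = new LinkedHashMap<>();
        int i = 0;
        while (i < hash.length()) {
            char c = hash.charAt(i++);
            int start = i;
            while (i < hash.length() && Character.isDigit(hash.charAt(i))) {
                i++;
            }
            counts.put(c, Integer.parseInt(hash.substring(start, i)));
        }
        return counts;
    }

    static int similarity(String hash1, String hash2) {
        Map<Character, Integer> first = parseHash(hash1);
        Map<Character, Integer> second = parseHash(hash2);
        int similarity = 0;
        for (Map.Entry<Character, Integer> entry : first.entrySet()) {
            Integer other = second.get(entry.getKey());
            if (other == null) {
                continue; // character does not occur in the second hash
            }
            // Sum the frequencies from both hashes for each shared character.
            similarity += entry.getValue() + other;
        }
        return similarity;
    }

    public static void main(String[] args) {
        System.out.println(similarity("r2e2", "e2s1")); // prints 4
    }
}
```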
    • mostFreqKSDF

      public static int mostFreqKSDF(String inputStr1, String inputStr2, int K, int maxDistance)
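      Wrapper function: hashes both input strings with the same K and subtracts the similarity of the two hashes from maxDistance.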
                      int function MostFreqKSDF (string inputStr1, string inputStr2, int K, int maxDistance)
                          return maxDistance - MostFreqKSimilarity(MostFreqKHashing(inputStr1,K), MostFreqKHashing(inputStr2,K))
       
      Parameters:
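      inputStr1 - the first input string
      inputStr2 - the second input string
      K - the number of most frequent characters to include in each hash
      maxDistance - the upper bound from which the similarity is subtracted
      Returns:
      maxDistance minus the similarity of the two hashes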
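      As an end-to-end illustration of the wrapper, the sketch below chains the three steps (count, keep the top K, compare) on count maps instead of encoded hash strings. It is an assumption-laden sketch rather than the code behind mostFreqKSDF: the class MostFreqKSdfSketch and the methods topK, similarity and distance are hypothetical names.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.stream.Collectors;

// Hypothetical sketch class; not part of org.genesys.taxonomy.checker.
final class MostFreqKSdfSketch {

    // Returns the k most frequent characters with their counts,
    // ties broken by first occurrence in the input.
    static Map<Character, Integer> topK(String input, int k) {
        Map<Character, Integer> counts = new LinkedHashMap<>();
        for (char c : input.toCharArray()) {
            counts.merge(c, 1, Integer::sum);
        }
        // The entry stream of a LinkedHashMap is ordered, so sorted() is stable.
        return counts.entrySet().stream()
                .sorted((a, b) -> Integer.compare(b.getValue(), a.getValue()))
                .limit(k)
                .collect(Collectors.toMap(Map.Entry::getKey, Map.Entry::getValue,
                        (x, y) -> x, LinkedHashMap::new));
    }

    // Sums the counts from both maps for every character they share.
    static int similarity(Map<Character, Integer> first, Map<Character, Integer> second) {
        int similarity = 0;
        for (Map.Entry<Character, Integer> entry : first.entrySet()) {
            Integer other = second.get(entry.getKey());
            if (other != null) {
                similarity += entry.getValue() + other;
            }
        }
        return similarity;
    }

    // maxDistance - similarity(hash(s1, k), hash(s2, k)), as in the pseudocode.
    static int distance(String s1, String s2, int k, int maxDistance) {
        return maxDistance - similarity(topK(s1, k), topK(s2, k));
    }

    public static void main(String[] args) {
        System.out.println(distance("research", "seeking", 2, 10)); // prints 6
    }
}
```

      Note that the result can drop below zero when the summed frequencies exceed maxDistance, so callers would need to pick maxDistance with the expected input lengths in mind.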
    • mostFreqKSDF

      public static double mostFreqKSDF(String inputStr1, String inputStr2, int K)
      Computes the most frequent K string distance of two input strings without an explicit maximum distance.
      Parameters:
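      inputStr1 - the first input string
      inputStr2 - the second input string
      K - the number of most frequent characters to include in each hash
      Returns:
      the distance between the two strings, as a double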
    • toHashString

      public static String toHashString(int[] h1)
      Encode a hash array to String.
      Parameters:
      h1 - hash array as generated
      Returns:
      String representation of the hash array (e.g. "i3b2")