Class DistributedIdentifyContents


  • public class DistributedIdentifyContents
    extends Object
    Identifies expired and live contents in a distributed way, using Spark and bloom filters, by walking all references (both dead and live).
    • Constructor Detail

      • DistributedIdentifyContents

        public DistributedIdentifyContents​(org.apache.spark.sql.SparkSession session,
                                           GCParams gcParams)
    • Method Detail

      • getLiveContentsBloomFilters

        public Map<String,​ContentBloomFilter> getLiveContentsBloomFilters​(List<String> references,
                                                                                long bloomFilterSize,
                                                                                Map<String,​Instant> droppedRefTimeMap)
        Computes a bloom filter per content id by walking all live references in a distributed way using Spark.
        Parameters:
        references - list of all the references (JSON serialized)
        bloomFilterSize - size of bloom filter to be used
        droppedRefTimeMap - map of reference@hash (JSON serialized) to the time at which that reference was dropped
        Returns:
        map of ContentBloomFilter per content id.
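
        The idea of one bloom filter per content id can be illustrated with a minimal, self-contained sketch. It does not use Spark and is not the library's implementation: ToyBloomFilter, LiveFilterSketch, buildLiveFilters and mergeFrom are hypothetical names, the (content id, content value) pairs stand in for whatever the reference walk yields, and numBits is treated as a bit count, which may differ from how bloomFilterSize is interpreted.

```java
import java.util.BitSet;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Toy bloom filter; a hypothetical stand-in for ContentBloomFilter. */
class ToyBloomFilter {
  private final BitSet bits;
  private final int numBits;
  private final int numHashes;

  ToyBloomFilter(int numBits, int numHashes) {
    if (numBits <= 0 || numHashes <= 0) {
      throw new IllegalArgumentException("numBits and numHashes must be positive");
    }
    this.bits = new BitSet(numBits);
    this.numBits = numBits;
    this.numHashes = numHashes;
  }

  /** Derives the i-th bit position from the value's hash and a second value derived from it. */
  private int index(String value, int i) {
    int h1 = value.hashCode();
    int h2 = (h1 >>> 16) | 1; // forced odd so successive positions differ
    return Math.floorMod(h1 + i * h2, numBits);
  }

  void put(String value) {
    for (int i = 0; i < numHashes; i++) {
      bits.set(index(value, i));
    }
  }

  /** Returns false only if the value was never added; true may be a false positive. */
  boolean mightContain(String value) {
    for (int i = 0; i < numHashes; i++) {
      if (!bits.get(index(value, i))) {
        return false;
      }
    }
    return true;
  }

  /** Bitwise OR of two filters with identical parameters, e.g. to combine partial results. */
  void mergeFrom(ToyBloomFilter other) {
    if (other.numBits != numBits || other.numHashes != numHashes) {
      throw new IllegalArgumentException("filters must share the same parameters");
    }
    bits.or(other.bits);
  }
}

class LiveFilterSketch {
  /** Groups live content values by content id (entry key) into one filter per id. */
  static Map<String, ToyBloomFilter> buildLiveFilters(
      List<Map.Entry<String, String>> liveContents, int numBits) {
    Map<String, ToyBloomFilter> filters = new HashMap<>();
    for (Map.Entry<String, String> content : liveContents) {
      filters
          .computeIfAbsent(content.getKey(), id -> new ToyBloomFilter(numBits, 3))
          .put(content.getValue());
    }
    return filters;
  }
}
```

        Filters built with the same parameters can be combined with a bitwise OR, which is what makes this structure convenient for a distributed walk: each worker could fill its own filter and the partial filters could be merged afterwards. Whether the library merges filters this way is not stated here.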
      • identifyExpiredContents

        public String identifyExpiredContents​(Map<String,​ContentBloomFilter> liveContentsBloomFilterMap,
                                              List<String> references)
        Gets the expired contents per content id by walking all live and dead references in a distributed way using Spark and checking each content against the live bloom filters.
        Parameters:
        liveContentsBloomFilterMap - live contents bloom filter per content id.
        references - list of all the references (JSON serialized) to walk (live and dead)
        Returns:
        run id of the completed GC task
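
        The check against the live filters can be sketched as follows. This is an illustration under stated assumptions, not the library's code: ExpiredContentsSketch and findExpired are hypothetical names, a Predicate<String> stands in for ContentBloomFilter, and the sketch returns the expired entries directly, whereas the documented method returns a run id. It also assumes that a content id with no live filter has no live contents.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.function.Predicate;

class ExpiredContentsSketch {
  /**
   * Walks every (content id, content value) pair found on live and dead references and
   * reports the pairs that the live filter for that content id definitely does not contain.
   */
  static List<String> findExpired(
      Map<String, Predicate<String>> liveFilters,
      List<Map.Entry<String, String>> allContents) {
    // LinkedHashSet drops duplicates (the same content may be reachable from several
    // references) while keeping first-seen order.
    Set<String> expired = new LinkedHashSet<>();
    for (Map.Entry<String, String> content : allContents) {
      Predicate<String> mightBeLive = liveFilters.get(content.getKey());
      // Assumption: no filter for this content id means nothing under it is live.
      boolean live = mightBeLive != null && mightBeLive.test(content.getValue());
      if (!live) {
        expired.add(content.getKey() + ":" + content.getValue());
      }
    }
    return new ArrayList<>(expired);
  }
}
```

        Because a bloom filter can report false positives but never false negatives, a check of this kind errs on the safe side: an expired content may occasionally be kept as live, but a live content is never reported as expired.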