Package org.projectnessie.gc.base
Class DistributedIdentifyContents
java.lang.Object
  org.projectnessie.gc.base.DistributedIdentifyContents
public class DistributedIdentifyContents extends Object
Identifies the expired and live contents in a distributed way, using Spark and bloom filters, by walking all the references (both dead and live).
Constructor Summary
- DistributedIdentifyContents(org.apache.spark.sql.SparkSession session, GCParams gcParams)
Method Summary
- Map<String,ContentBloomFilter> getLiveContentsBloomFilters(List<String> references, long bloomFilterSize, Map<String,Instant> droppedRefTimeMap)
  Compute the bloom filter per content id by walking all the live references in a distributed way using Spark.
- String identifyExpiredContents(Map<String,ContentBloomFilter> liveContentsBloomFilterMap, List<String> references)
  Gets the expired contents per content id by walking all the live and dead references in a distributed way using Spark and checking the contents against the live bloom filter results.
Constructor Detail
DistributedIdentifyContents
public DistributedIdentifyContents(org.apache.spark.sql.SparkSession session, GCParams gcParams)
Method Detail
getLiveContentsBloomFilters
public Map<String,ContentBloomFilter> getLiveContentsBloomFilters(List<String> references, long bloomFilterSize, Map<String,Instant> droppedRefTimeMap)
Compute the bloom filter per content id by walking all the live references in a distributed way using Spark.
Parameters:
- references - list of all the references (JSON serialized)
- bloomFilterSize - size of the bloom filter to be used
- droppedRefTimeMap - map of dropped time for reference@hash (JSON serialized)
Returns:
- map of ContentBloomFilter per content-id.
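The idea of keeping one bloom filter per content id can be illustrated with a minimal, single-process sketch. Nothing here is the Nessie implementation: ToyBloomFilter is a stand-in for ContentBloomFilter, LiveEntry and buildFilters are hypothetical names, the filter size is treated as a plain bit count, and the Spark distribution step is omitted entirely.

```java
import java.util.BitSet;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class LiveFilterSketch {

  /** Toy bloom filter over strings; a stand-in, not the real ContentBloomFilter. */
  static final class ToyBloomFilter {
    private final BitSet bits;
    private final int size;

    ToyBloomFilter(int size) {
      this.size = size;
      this.bits = new BitSet(size);
    }

    /** Sets the two bit positions derived from the value. */
    void put(String value) {
      for (int i = 0; i < 2; i++) {
        bits.set(index(value, i));
      }
    }

    /** Returns false only if the value was definitely never added. */
    boolean mightContain(String value) {
      for (int i = 0; i < 2; i++) {
        if (!bits.get(index(value, i))) {
          return false;
        }
      }
      return true;
    }

    // Derives the i-th bit position from two hashes of the value (double hashing).
    private int index(String value, int i) {
      int h1 = value.hashCode();
      int h2 = (h1 >>> 16) | 1;
      return Math.floorMod(h1 + i * h2, size);
    }
  }

  /** Hypothetical (contentId, snapshotId) pair found while walking a live reference. */
  static final class LiveEntry {
    final String contentId;
    final String snapshotId;

    LiveEntry(String contentId, String snapshotId) {
      this.contentId = contentId;
      this.snapshotId = snapshotId;
    }
  }

  /** Builds one filter per content id from all entries seen on live references. */
  static Map<String, ToyBloomFilter> buildFilters(List<LiveEntry> liveEntries, int bloomFilterSize) {
    Map<String, ToyBloomFilter> filters = new HashMap<>();
    for (LiveEntry e : liveEntries) {
      filters
          .computeIfAbsent(e.contentId, id -> new ToyBloomFilter(bloomFilterSize))
          .put(e.snapshotId);
    }
    return filters;
  }

  public static void main(String[] args) {
    List<LiveEntry> live = List.of(
        new LiveEntry("table-1", "snap-1"),
        new LiveEntry("table-2", "snap-9"));
    Map<String, ToyBloomFilter> filters = buildFilters(live, 1024);
    System.out.println(filters.get("table-1").mightContain("snap-1"));
  }
}
```

A bloom filter never reports a value it has seen as absent, so any entry added from a live reference is always recognised later. The trade-off is occasional false positives, which a larger filter size makes less likely.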
identifyExpiredContents
public String identifyExpiredContents(Map<String,ContentBloomFilter> liveContentsBloomFilterMap, List<String> references)
Gets the expired contents per content id by walking all the live and dead references in a distributed way using Spark and checking the contents against the live bloom filter results.
Parameters:
- liveContentsBloomFilterMap - live contents bloom filter per content id.
- references - list of all the references (JSON serialized) to walk (live and dead)
Returns:
- current run id of the completed gc task
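The check against the live filters can be sketched as follows, under stated assumptions. Each filter is modelled as a Predicate<String> membership test (backed here by an exact HashSet rather than a real bloom filter). Entry and findExpired are hypothetical names, and a content id with no filter is assumed to have no live contents. The sketch returns the expired entries directly, whereas the real method returns a run id.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.function.Predicate;

class ExpiredContentsSketch {

  /** Hypothetical (contentId, snapshotId) pair found while walking any reference. */
  static final class Entry {
    final String contentId;
    final String snapshotId;

    Entry(String contentId, String snapshotId) {
      this.contentId = contentId;
      this.snapshotId = snapshotId;
    }
  }

  /** Collects, per content id, every entry that the live filter does not recognise. */
  static Map<String, List<String>> findExpired(
      Map<String, Predicate<String>> liveFilters, List<Entry> allEntries) {
    Map<String, List<String>> expired = new TreeMap<>();
    for (Entry e : allEntries) {
      Predicate<String> live = liveFilters.get(e.contentId);
      // Assumption: no filter for a content id means no live reference uses it.
      boolean mightBeLive = live != null && live.test(e.snapshotId);
      if (!mightBeLive) {
        expired.computeIfAbsent(e.contentId, id -> new ArrayList<>()).add(e.snapshotId);
      }
    }
    return expired;
  }

  public static void main(String[] args) {
    Set<String> liveSnapshots = new HashSet<>(List.of("snap-3"));
    Predicate<String> table1Live = liveSnapshots::contains;
    Map<String, Predicate<String>> filters = Map.of("table-1", table1Live);

    List<Entry> all = List.of(
        new Entry("table-1", "snap-1"),
        new Entry("table-1", "snap-3"),
        new Entry("table-2", "snap-9"));
    System.out.println(findExpired(filters, all));
  }
}
```

With a real bloom filter in place of the exact set, a false positive would cause an expired entry to be kept rather than a live entry to be dropped, which is the safe direction for garbage collection.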