Class GCImpl

java.lang.Object
  org.projectnessie.gc.base.GCImpl

public class GCImpl
extends java.lang.Object

Encapsulates the logic to retrieve expired contents by walking over all commits in all named references.
Method Summary

Modifier and Type   Method                                                              Description
java.lang.String    identifyExpiredContents(org.apache.spark.sql.SparkSession session)  Identify the expired contents using a two-step traversal algorithm.
Constructor Detail

GCImpl

public GCImpl(GCParams gcParams)

Instantiates a new GCImpl.

Parameters:
  gcParams - GC configuration params
Method Detail

identifyExpiredContents

public java.lang.String identifyExpiredContents(org.apache.spark.sql.SparkSession session)
                                         throws org.projectnessie.error.NessieNotFoundException

Identify the expired contents using a two-step traversal algorithm.

Algorithm for identifying the live contents and returning a bloom filter per content id:
- Walk through each reference (both live and dead) distributively (one Spark task per reference).
- While traversing from the head commit of a reference (using the DETACHED reference to fetch commits from a dead reference), add the contents of each put operation in every live commit (a commit that is not expired based on the cutoff time) to the bloom filter.
- Collect the live content keys for this reference just before the cutoff time (at the first expired commit head). These keys are used to identify the commit head of each live content key at the cutoff time, to support time travel.
- While traversing the expired commits (commits that are expired based on the cutoff time), add a content to the bloom filter if it is the head commit content for its key; otherwise move on to the next expired commit.
- Stop traversing the expired commits once a commit has been processed for each live content key. This optimization avoids traversing all the commits.
- Collect the bloom filter per content id from each task and merge them.
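The first traversal step can be sketched for a single reference as follows. This is a much-simplified, self-contained illustration, not the Nessie implementation: the `Commit` record, the assumption that the live keys at the cutoff are exactly the keys seen in live commits, and the use of a plain `Set` in place of the per-content-id bloom filter are all simplifications made for this sketch.

```java
import java.time.Instant;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical, much-simplified model of the live-contents walk over one
// reference. A commit is just a timestamp plus its put operations
// (content key -> content id); a plain Set stands in for the bloom filter.
public class LiveContentsWalk {

  record Commit(Instant commitTime, Map<String, String> puts) {}

  // Walk one reference's commits, newest first: collect contents of live
  // commits, note the live keys at the cutoff, then scan expired commits
  // until a head commit has been processed for every live key.
  static Set<String> collectLiveContentIds(List<Commit> newestFirst, Instant cutoff) {
    Set<String> live = new HashSet<>();      // stand-in for the bloom filter
    Set<String> liveKeys = new HashSet<>();  // keys live at the cutoff (assumed: keys seen in live commits)
    Set<String> headSeen = new HashSet<>();  // live keys whose head-at-cutoff commit was processed
    for (Commit c : newestFirst) {
      if (!c.commitTime().isBefore(cutoff)) {
        // live commit: every put content is live
        live.addAll(c.puts().values());
        liveKeys.addAll(c.puts().keySet());
      } else {
        // expired commit: keep a content only if it is the head commit for its key
        for (Map.Entry<String, String> e : c.puts().entrySet()) {
          if (liveKeys.contains(e.getKey()) && headSeen.add(e.getKey())) {
            live.add(e.getValue());
          }
        }
        // optimization: stop once each live key has one processed commit
        if (headSeen.containsAll(liveKeys)) {
          break;
        }
      }
    }
    return live;
  }
}
```

In the real distributed run, one such walk executes per reference as a Spark task, and the resulting per-content-id bloom filters are collected and merged on the driver.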
Algorithm for identifying the expired contents and returning the list of globally expired contents per content id per reference:
- Walk through each reference (both live and dead) distributively (one Spark task per reference).
- For each commit in the reference (using the DETACHED reference to fetch commits from a dead reference), check its put-operation contents against the bloom filter to decide whether they are globally expired. If they are, add the contents to the expired output for this content id for this reference.

Overall, the contents at or after the cutoff time, and the contents mapped to the commit head of the live keys at the cutoff timestamp, are retained.
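The second traversal step can be sketched in the same simplified style. Again this is an illustration under stated assumptions, not the Nessie API: commits are modeled as lists of hypothetical `Put` records, and a plain `Set` of live content ids stands in for the merged bloom filter.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical, much-simplified model of the expired-contents walk: given the
// merged set of live content ids (a bloom filter in the real implementation),
// walk one reference's commits and group the globally expired put contents by
// content id. Names and types here are illustrative, not the Nessie API.
public class ExpiredContentsWalk {

  record Put(String contentKey, String contentId) {}

  static Map<String, List<String>> collectExpired(
      List<List<Put>> commits, Set<String> liveContentIds) {
    Map<String, List<String>> expiredByContentId = new HashMap<>();
    for (List<Put> commit : commits) {
      for (Put p : commit) {
        // a content that misses the live filter was recorded by no
        // reference's live walk, so it is globally expired
        if (!liveContentIds.contains(p.contentId())) {
          expiredByContentId
              .computeIfAbsent(p.contentId(), id -> new ArrayList<>())
              .add(p.contentKey());
        }
      }
    }
    return expiredByContentId;
  }
}
```

Note that a real bloom filter, unlike the `Set` above, can report false positives: a genuinely expired content may be treated as live and retained. That direction of error is safe for garbage collection; the filter never reports false negatives, so live contents are never marked expired.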
Parameters:
  session - Spark session for distributed computation
Returns:
  current run id of the completed GC task
Throws:
  org.projectnessie.error.NessieNotFoundException