Class GCImpl


  • public class GCImpl
    extends java.lang.Object
    Encapsulates the logic to retrieve expired contents by walking over all commits in all named references.
    • Constructor Summary

      Constructors 
      Constructor Description
      GCImpl​(GCParams gcParams)
      Instantiates a new GCImpl.
    • Method Summary

      Modifier and Type Method Description
      java.lang.String identifyExpiredContents​(org.apache.spark.sql.SparkSession session)
      Identify the expired contents using a two-step traversal algorithm.
      • Methods inherited from class java.lang.Object

        clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
    • Constructor Detail

      • GCImpl

        public GCImpl​(GCParams gcParams)
        Instantiates a new GCImpl.
        Parameters:
        gcParams - GC configuration params
    • Method Detail

      • identifyExpiredContents

        public java.lang.String identifyExpiredContents​(org.apache.spark.sql.SparkSession session)
                                                 throws org.projectnessie.error.NessieNotFoundException
        Identify the expired contents using a two-step traversal algorithm.

        Algorithm for identifying the live contents and returning a bloom filter per content id:

        Walk through each reference (both live and dead) in a distributed manner (one Spark task per reference).

        While traversing from the head commit of a reference (using a DETACHED reference to fetch commits from dead references), for each live commit (a commit that is not expired based on the cutoff time), add the contents of its put operations to the bloom filter.

        Collect the live content keys for this reference just before the cutoff time (at the first expired commit head). These keys are used to identify the commit head for each live content key at the cutoff time, which supports time travel.

        While traversing the expired commits (commits that are expired based on the cutoff time), if a commit is the head commit for its content key, add its contents to the bloom filter; otherwise, move on to the next expired commit.

        Stop traversing the expired commits once one live commit has been processed for each live content key. This is an optimization that avoids traversing all the commits.

        Collect the bloom filters per content id from each task and merge them.
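        The live-commit filtering and per-task merge described above can be sketched as follows. This is a minimal illustration, not the actual GCImpl internals: the class and method names (LiveContentCollector, walkReference, merge) are assumptions, commits are simplified to {contentId, contentHash, commitTime} tuples, and an exact set stands in for the probabilistic bloom filter the real implementation uses.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Sketch of step one: collect live content per reference, then merge per task.
public class LiveContentCollector {

    // Walk one reference's commits and record the put contents of every
    // live commit (commit time at or after the cutoff time).
    // The Map<contentId, Set<contentHash>> is an exact-set stand-in for
    // a bloom filter per content id.
    static Map<String, Set<String>> walkReference(List<String[]> commits,
                                                  long cutoffTime) {
        Map<String, Set<String>> liveByContentId = new HashMap<>();
        for (String[] c : commits) {            // {contentId, contentHash, commitTime}
            long commitTime = Long.parseLong(c[2]);
            if (commitTime >= cutoffTime) {     // live commit: keep its put contents
                liveByContentId
                    .computeIfAbsent(c[0], k -> new HashSet<>())
                    .add(c[1]);
            }
        }
        return liveByContentId;
    }

    // Merge the per-reference results collected from each task.
    static Map<String, Set<String>> merge(List<Map<String, Set<String>>> perTask) {
        Map<String, Set<String>> merged = new HashMap<>();
        for (Map<String, Set<String>> task : perTask) {
            task.forEach((id, hashes) -> merged
                .computeIfAbsent(id, k -> new HashSet<>())
                .addAll(hashes));
        }
        return merged;
    }
}
```

        The sketch omits the expired-commit walk and its early-stop optimization; it only shows how live put contents accumulate per content id and how per-task results combine.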

        Algorithm for identifying the expired contents and returning the list of globally expired contents per content id per reference:

        Walk through each reference (both live and dead) in a distributed manner (one Spark task per reference).

        For each commit in the reference (using a DETACHED reference to fetch commits from dead references), check its put-operation contents against the bloom filter to decide whether they are globally expired. If globally expired, add the contents to the expired output for this content id for this reference.

        Overall, the contents at or after the cutoff time, and the contents mapped to the commit head of live keys at the cutoff timestamp, are retained.
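        The second traversal can be sketched as below. Again the names (ExpiredContentIdentifier, identifyExpired) and the simplified commit shape are illustrative assumptions, and an exact lookup structure stands in for the merged bloom filters, which in practice only answer "might be live" with a false-positive rate.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Sketch of step two: any put content absent from the merged live filter
// is reported as globally expired for this reference.
public class ExpiredContentIdentifier {

    // liveFilter: content id -> live content hashes (exact-set stand-in
    // for the merged per-content-id bloom filters from step one).
    static List<String> identifyExpired(List<String[]> commits,
                                        Map<String, Set<String>> liveFilter) {
        List<String> expired = new ArrayList<>();
        for (String[] c : commits) {            // {contentId, contentHash}
            Set<String> live = liveFilter.get(c[0]);
            if (live == null || !live.contains(c[1])) {
                expired.add(c[0] + ":" + c[1]); // globally expired content
            }
        }
        return expired;
    }
}
```

        With a real bloom filter, a content that "might be live" is conservatively retained, so false positives delay expiry but never delete live data.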

        Parameters:
        session - spark session for distributed computation
        Returns:
        current run id of the completed GC task
        Throws:
        org.projectnessie.error.NessieNotFoundException