Class DistributedShufflerPipe<T>

  • Type Parameters:
    T - The item data type
    All Implemented Interfaces:
    Closeable, AutoCloseable, BasePipe

    public class DistributedShufflerPipe<T>
    extends AsyncPipe<T>
    An async pipe that takes its input and shuffles it across multiple workers.

    TODO: Add logging.

    Author:
    Zacharya Haitin
    • Constructor Detail

      • DistributedShufflerPipe

        public DistributedShufflerPipe(AsyncPipe<T> input,
                                       DistributedShufflerConfig<T> config)
        Creates a shuffler pipe that distributes the items of the given input pipe between the workers, according to the given config.
        Parameters:
        input - The input pipe supplying items which will be shuffled between the workers.
        config - The config object specifying the shuffler settings
    • Method Detail

      • start

        public void start()
                   throws PipeException,
                          InterruptedException
        Description copied from interface: BasePipe
        Performs pre-processing prior to item flow through the pipe. Implementations must call the same method on all of their input pipes before accessing their items. This is typically done here.
        Throws:
        PipeException - In case of pipe errors in this pipe or somewhere up-stream.
        InterruptedException - In case that the operation has been interrupted by another thread.
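        The start() contract above can be illustrated with a minimal sketch. All types here (SketchPipe, SketchSource, SketchDownstream) are hypothetical stand-ins written for this example and are not part of the library; the sketch only shows a pipe starting each of its inputs before it considers itself started.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for the pipe contract; only start() is modeled.
interface SketchPipe {
    void start() throws Exception;
}

// Hypothetical source pipe that records whether it has been started.
class SketchSource implements SketchPipe {
    boolean started = false;

    @Override
    public void start() {
        started = true;
    }
}

// Hypothetical downstream pipe: starts every input before touching any items.
class SketchDownstream implements SketchPipe {
    private final List<SketchPipe> inputs;
    boolean started = false;

    SketchDownstream(SketchPipe... inputs) {
        this.inputs = new ArrayList<>(Arrays.asList(inputs));
    }

    @Override
    public void start() throws Exception {
        for (SketchPipe input : inputs) {
            input.start(); // an up-stream failure propagates to the caller
        }
        started = true; // only now is it safe to read items from the inputs
    }
}
```

        In this sketch a failure in any input aborts the downstream start() as well, which mirrors the documented behavior of reporting errors that occur somewhere up-stream.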
      • getProgress

        public float getProgress()
        Returns:
        The pipe flow progress, as a floating-point number between 0.0 and 1.0. Important implementation rules:
        1) Calling this method before the start() call has completed isn't allowed and results in undefined behavior.
        2) Implementations should make a best effort to estimate the progress this pipe has made (0.0 - 1.0).
        3) When the pipe is fully consumed, getProgress() should return 1.0.
        4) Results must be monotonic, i.e. the results of consecutive calls must never decrease.
        5) Thread safety: progress may be maintained by some threads but monitored by others, so implementations must be thread safe.
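        One way to satisfy the range, monotonicity and thread-safety rules is sketched below. MonotonicProgress is a hypothetical helper written for this example, not a library class; it assumes the pipe feeds it raw estimates and it only ever reports the highest clamped value seen so far.

```java
// Hypothetical helper, not part of the library.
class MonotonicProgress {
    private float reported = 0.0f;

    // Records a new estimate. Values are clamped to [0.0, 1.0], and an estimate
    // lower than the one already reported is ignored so the result never decreases.
    synchronized void update(float estimate) {
        if (Float.isNaN(estimate)) {
            return; // ignore meaningless estimates
        }
        float clamped = Math.max(0.0f, Math.min(1.0f, estimate));
        reported = Math.max(reported, clamped);
    }

    // To be called once the pipe is fully consumed.
    synchronized void markFullyConsumed() {
        reported = 1.0f;
    }

    // Safe to call from a monitoring thread while another thread updates.
    synchronized float getProgress() {
        return reported;
    }
}
```

        Plain synchronized methods are used here for simplicity; a lock-free variant is possible but is harder to get right for floating-point values.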
      • getWorkerShardId

        public static int getWorkerShardId(List<HostPort> workers,
                                           int workerIndex)
        Deprecated.
        Use DistributedShufflerConfig.getWorkerShardId(..) instead
        A utility method for determining the shard id a worker is responsible for. Each worker among the N workers exclusively "owns" a shard, whose id is in the range 0..N. This method is expected to be used only by pipelines which have full control over sharding (i.e. those which provide an explicit sharding function to the shuffler pipe). Such pipelines may need to know the exact shard the worker is working on, for example in order to fetch additional resources belonging to the same shard.
        Parameters:
        workers - The list of all workers. Order isn't important.
        workerIndex - The index of the worker in the given workers list. Must be between 0 and workers.size() - 1
        Returns:
        The shard id owned by the worker at position workerIndex in the workers list.
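        The documentation does not say how the shard id is derived. The sketch below shows one scheme that is consistent with the statement that the order of the workers list isn't important: rank the workers in a canonical (host, port) ordering and use the rank as the shard id. This scheme is an assumption made for illustration, not necessarily what the library does, and SketchHostPort and ShardIdSketch are hypothetical names. Under this assumed scheme the ids run from 0 to N-1.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical stand-in for the library's HostPort type.
final class SketchHostPort {
    final String host;
    final int port;

    SketchHostPort(String host, int port) {
        this.host = host;
        this.port = port;
    }
}

class ShardIdSketch {
    // Assumed scheme: shard id = rank of the worker in a canonical (host, port) ordering.
    static int shardIdOf(List<SketchHostPort> workers, int workerIndex) {
        if (workerIndex < 0 || workerIndex >= workers.size()) {
            throw new IndexOutOfBoundsException("workerIndex out of range: " + workerIndex);
        }
        SketchHostPort self = workers.get(workerIndex);
        // Sort a copy so the caller's list is left untouched.
        List<SketchHostPort> sorted = new ArrayList<>(workers);
        sorted.sort(Comparator.comparing((SketchHostPort w) -> w.host)
                .thenComparingInt(w -> w.port));
        // Identity lookup: SketchHostPort does not override equals().
        return sorted.indexOf(self);
    }
}
```

        Because the rank depends only on the set of workers, every worker computes the same assignment no matter how its own copy of the list is ordered.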