Class BigQueryConnector

  • All Implemented Interfaces:
    org.pipecraft.infra.monitoring.JsonMonitorable

    public class BigQueryConnector
    extends Object
    implements org.pipecraft.infra.monitoring.JsonMonitorable
    Used for interacting with Google's BigQuery. Supports:
    1) Initialization with environment credentials (credentials can't be set explicitly yet)
    2) Select and DML queries (through BQQuery and BQDMLQuery)
    3) Async query execution
    4) Writing query results to specific BQ tables
    5) Exporting results to Google Storage, asynchronously
    6) Loading local/Google Storage files into BQ tables, asynchronously
    7) Monitoring of all operations
    8) Limiting the number of concurrent BQ operations, using constructor parameters
    Author:
    Eyal Schneider
    • Constructor Detail

      • BigQueryConnector

        public BigQueryConnector​(String projectId,
                                 long connTimeoutMs,
                                 long readTimeoutMs,
                                 QueryExecutionConfig defaultExecutionConfig,
                                 Consumer<BQQueryExecutionSummary> observer,
                                 ExecutorService ex)
                          throws IOException
        Constructor
        Parameters:
        projectId - The project that this instance is bound to. All actions will be performed in the scope of this project.
        connTimeoutMs - Timeout (in milliseconds) for connection establishment
        readTimeoutMs - Socket read timeout (in milliseconds) for all requests. This low-level timeout defines how long a blocking read on the socket should wait for data. NOTE: Google's API doesn't seem to always respect this limit, and it's not always clear which timeout applies (the query-level timeout or the global one provided here in the constructor).
        defaultExecutionConfig - The default query execution configuration, in case none is specified when calling the execution methods.
        observer - A consumer that receives the execution summary of BQ queries. Can be null.
        ex - The executor to run BQ requests on. Can be a multi-threaded or a direct executor, depending on the required threading policy. It's the responsibility of the caller to shut down this executor.
        Throws:
        IOException - If the connector can't be initialized
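
        The executor ownership rule above can be sketched with plain JDK classes. This is only an illustration under stated assumptions: the class ConnectorExecutorSketch and its two helper methods are hypothetical and not part of the library, the constructor call appears only as a comment with made-up argument values, and using a fixed-size pool to bound concurrent BQ requests is an assumption about how the connector uses the executor.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Hypothetical helper class, not part of the library.
final class ConnectorExecutorSketch {

    /** Creates a fixed-size pool; at most maxConcurrentOps tasks run at the same time. */
    static ExecutorService newBoundedExecutor(int maxConcurrentOps) {
        return Executors.newFixedThreadPool(maxConcurrentOps);
    }

    /** Shuts the executor down and waits; returns true if it terminated within the timeout. */
    static boolean shutdownAndAwait(ExecutorService ex, long timeoutMs) {
        ex.shutdown(); // stop accepting new tasks, let submitted ones finish
        try {
            if (ex.awaitTermination(timeoutMs, TimeUnit.MILLISECONDS)) {
                return true;
            }
            ex.shutdownNow(); // give up waiting and interrupt running tasks
            return false;
        } catch (InterruptedException e) {
            ex.shutdownNow();
            Thread.currentThread().interrupt(); // preserve the interrupt status
            return false;
        }
    }

    public static void main(String[] args) throws Exception {
        ExecutorService ex = newBoundedExecutor(4);
        try {
            // Assumed usage, left as a comment because the connector is not on this sketch's classpath:
            // BigQueryConnector bq =
            //     new BigQueryConnector("my-project", 10_000, 60_000, defaultConfig, null, ex);
            ex.submit(() -> System.out.println("stand-in for a BQ request")).get();
        } finally {
            // The caller created the executor, so the caller shuts it down.
            shutdownAndAwait(ex, 5_000);
        }
    }
}
```

        Passing a single-threaded or direct executor instead would serialize requests; which policy fits depends on the application.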
    • Method Detail

      • getProjectId

        public String getProjectId()
        Returns:
        The project that this instance is bound to. All actions will be performed in the scope of this project.
      • getDefaultQueryExecutionConfig

        public QueryExecutionConfig getDefaultQueryExecutionConfig()
        Returns:
        the default query execution configuration. Use toBuilder().setXXX().setYYY().build() to create a copy with a few settings changed.
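
        The toBuilder() idiom mentioned above can be illustrated with a small stand-in class. The sketch below does not use QueryExecutionConfig itself: ExecConfigSketch and its two settings (timeoutMs, maxRetries) are hypothetical names chosen for illustration, since the real setter names are not listed here.

```java
// Hypothetical immutable config with a builder; a stand-in for QueryExecutionConfig.
final class ExecConfigSketch {
    private final long timeoutMs;
    private final int maxRetries;

    private ExecConfigSketch(Builder b) {
        this.timeoutMs = b.timeoutMs;
        this.maxRetries = b.maxRetries;
    }

    long getTimeoutMs() { return timeoutMs; }
    int getMaxRetries() { return maxRetries; }

    static Builder builder() { return new Builder(); }

    /** Returns a builder pre-populated with this instance's settings. */
    Builder toBuilder() {
        return new Builder().setTimeoutMs(timeoutMs).setMaxRetries(maxRetries);
    }

    static final class Builder {
        private long timeoutMs = 60_000;
        private int maxRetries = 0;

        Builder setTimeoutMs(long v) { this.timeoutMs = v; return this; }
        Builder setMaxRetries(int v) { this.maxRetries = v; return this; }
        ExecConfigSketch build() { return new ExecConfigSketch(this); }
    }

    public static void main(String[] args) {
        ExecConfigSketch defaults = builder().setTimeoutMs(30_000).setMaxRetries(2).build();
        // Copy the defaults, overriding only the timeout; the original stays unchanged.
        ExecConfigSketch custom = defaults.toBuilder().setTimeoutMs(5_000).build();
        System.out.println(custom.getTimeoutMs() + " " + custom.getMaxRetries());
    }
}
```

        With the real class, the analogous call would start from getDefaultQueryExecutionConfig().toBuilder() and pass the built copy to one of the execution methods that accept a config.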
      • getExecutorService

        public ExecutorService getExecutorService()
        Returns:
        The executor provided in the constructor (and owner by the caller)
      • executeAsync

        public <R,​F> BigQueryConnector.BQQueryResultFuture<R,​F> executeAsync​(BQQuery<R,​F> query,
                                                                                         QueryExecutionConfig config)
        Runs a query asynchronously, returning a future that is both checked and listenable. Note that even after the future completes successfully and provides its value, the result is not final: result set iteration may still produce errors, because pages are requested from the server during iteration. The caller may set a destination table reference and table expiration time in the supplied query config object.
        Parameters:
        query - The query to execute
        config - The execution configuration
        Returns:
        The future providing the query result or the query execution exception. This future is both checked and listenable (see CheckedFuture and ListenableFuture). Upon success, the future provides an iterator over the result rows. Note that the iterator's next() method may throw a QueryResultBrokenException if the connection with BQ is broken during result set streaming. In the case of a DML (Data Manipulation Language) query, the future returns an empty iterator.
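
        Because iteration can fail after the query itself has succeeded, callers may want to treat a partially consumed result set as invalid. The following is a minimal sketch of that pattern using stand-ins only: BrokenResultException is a hypothetical substitute for QueryResultBrokenException (assumed here to be unchecked), and failingAfter simulates a connection that drops mid-stream. Neither is part of the library.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.Optional;

// Hypothetical sketch class, not part of the library.
final class StreamingFailureSketch {

    /** Hypothetical stand-in for QueryResultBrokenException (assumed unchecked). */
    static final class BrokenResultException extends RuntimeException {
        BrokenResultException(String msg) { super(msg); }
    }

    /** Stand-in iterator that serves `good` rows and then fails, imitating a lost connection. */
    static Iterator<String> failingAfter(int good) {
        return new Iterator<String>() {
            private int served = 0;

            @Override public boolean hasNext() { return true; }

            @Override public String next() {
                if (served == good) {
                    throw new BrokenResultException("connection lost after " + served + " rows");
                }
                return "row-" + served++;
            }
        };
    }

    /** Drains all rows; returns empty if streaming broke, so partial results are never used. */
    static Optional<List<String>> drain(Iterator<String> rows) {
        List<String> out = new ArrayList<>();
        try {
            while (rows.hasNext()) {
                out.add(rows.next());
            }
            return Optional.of(out);
        } catch (BrokenResultException e) {
            return Optional.empty(); // discard the rows collected so far
        }
    }

    public static void main(String[] args) {
        System.out.println(drain(failingAfter(3)).isPresent());
        System.out.println(drain(List.of("a", "b").iterator()).isPresent());
    }
}
```

        Whether to discard, retry, or resume after such a failure is an application decision; the sketch only shows that success of the future alone does not guarantee a complete result set.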
      • execute

        public <R,​F> BQResultsIterator<R,​F> execute​(BQQuery<R,​F> query,
                                                                QueryExecutionConfig config)
                                                         throws InterruptedException,
                                                                BQException
        Runs a query synchronously. Note that even after the call returns successfully and provides its value, the result is not final: result set iteration may still produce errors, because pages are requested from the server during iteration. The caller may set a destination table reference and table expiration time in the supplied query config object.
        Parameters:
        query - The query to execute
        config - The execution configuration
        Returns:
        The results iterator
        Throws:
        BQException
        InterruptedException
      • executeNoStreamingAsync

        public <R,​F> BigQueryConnector.BQQueryResultFuture<R,​F> executeNoStreamingAsync​(BQQuery<R,​F> query,
                                                                                                    QueryExecutionConfig config)
        Runs a query asynchronously without streaming results back. Recommended for use only for DML queries or for queries which dump results to a table anyway.
        Parameters:
        query - The query to execute
        config - The execution configuration
        Returns:
        The future to use for determining completion and completion type (successful/failed).
      • executeNoStreaming

        public long executeNoStreaming​(BQQuery<?,​?> query,
                                       QueryExecutionConfig config)
                                throws InterruptedException,
                                       BQException
        Runs a query synchronously without streaming results back. Recommended for use only for DML queries or for queries which dump results to a table anyway.
        Parameters:
        query - The query to execute
        config - The execution configuration
        Returns:
        the record count
        Throws:
        BQException
        InterruptedException
      • executeAsync

        public <R,​F> BigQueryConnector.BQQueryResultFuture<R,​F> executeAsync​(BQQuery<R,​F> query)
        Runs a query asynchronously, returning a future that is both checked and listenable. Note that even after the future completes successfully and provides its value, the result is not final: result set iteration may still produce errors, because pages are requested from the server during iteration. Uses the default query execution config as defined in the constructor.
        Parameters:
        query - The query to execute
        Returns:
        The future providing the query result or the query execution exception. Upon success, the future provides an iterator over the result rows. Note that the iterator's next() method may throw a QueryResultBrokenException if the connection with BQ is broken during result set streaming. In the case of a DML (Data Manipulation Language) query, the future returns an empty iterator.
      • execute

        public <R,​F> BQResultsIterator<R,​F> execute​(BQQuery<R,​F> query)
                                                         throws InterruptedException,
                                                                BQException
        Runs a query synchronously. Note that even after the call returns successfully and provides its value, the result is not final: result set iteration may still produce errors, because pages are requested from the server during iteration. Uses the default query execution config as defined in the constructor.
        Parameters:
        query - The query to execute
        Returns:
        The results iterator
        Throws:
        BQException
        InterruptedException
      • executeNoStreamingAsync

        public org.pipecraft.infra.concurrent.CheckedFuture<Void,​BQException> executeNoStreamingAsync​(BQQuery<?,​?> query)
        Runs a query asynchronously without streaming results back. Recommended for use only for DML queries or for queries which dump results to a table anyway.
        Parameters:
        query - The query to execute
        Returns:
        The future to use for determining completion and completion type (successful/failed).
      • executeNoStreaming

        public void executeNoStreaming​(BQQuery<?,​?> query)
                                throws BQException,
                                       InterruptedException
        Runs a query synchronously without streaming results back. Recommended for use only for DML queries or for queries which dump results to a table anyway.
        Parameters:
        query - The query to execute
        Throws:
        BQException
        InterruptedException
      • tableExists

        public boolean tableExists​(String dataset,
                                   String table)
        Parameters:
        dataset - A dataset name
        table - A table name
        Returns:
        true iff the table exists
      • updateTableExpiration

        public void updateTableExpiration​(String datasetId,
                                          String tableName,
                                          Integer duration,
                                          TimeUnit timeUnit)
        Sets the table's expiration time. The table must exist.
        Parameters:
        datasetId - the table's dataset id
        tableName - the table's name
        duration - The number of time units, measured from now, after which the table is deleted. Must be greater than 0. Null means the table never expires.
        timeUnit - the time unit of the given duration.
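
        The duration and timeUnit semantics described above can be made concrete with a small calculation. This is a sketch under assumptions, not the connector's implementation: TableExpirationSketch and expirationEpochMs are hypothetical names, and rejecting non-positive durations with an IllegalArgumentException is an assumed behavior.

```java
import java.util.concurrent.TimeUnit;

// Hypothetical helper, not part of the library.
final class TableExpirationSketch {

    /**
     * Returns the absolute expiration time in epoch milliseconds,
     * or null when duration is null (the table never expires).
     */
    static Long expirationEpochMs(long nowMs, Integer duration, TimeUnit timeUnit) {
        if (duration == null) {
            return null; // null duration means no expiration
        }
        if (duration <= 0) {
            // Assumed behavior: the documentation only says the value must be greater than 0.
            throw new IllegalArgumentException("duration must be greater than 0");
        }
        return nowMs + timeUnit.toMillis(duration);
    }

    public static void main(String[] args) {
        long now = System.currentTimeMillis();
        // A table that should be deleted 7 days from now.
        System.out.println(expirationEpochMs(now, 7, TimeUnit.DAYS));
    }
}
```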
      • getOwnMetrics

        public net.minidev.json.JSONObject getOwnMetrics()
        Specified by:
        getOwnMetrics in interface org.pipecraft.infra.monitoring.JsonMonitorable
      • getChildren

        public Map<String,​? extends org.pipecraft.infra.monitoring.JsonMonitorable> getChildren()
        Specified by:
        getChildren in interface org.pipecraft.infra.monitoring.JsonMonitorable
      • exportTableAsync

        public BigQueryConnector.BQExportFuture exportTableAsync​(TableExportConfig config)
        Runs an export job asynchronously.
        Parameters:
        config - The export configuration
        Returns:
        The export future, which is checked and listenable
      • loadTableAsync

        public BigQueryConnector.BQTableLoadFuture loadTableAsync​(TableLoadConfig tableLoadConfig)
        Runs an async table load job
        Parameters:
        tableLoadConfig - the load job configuration. Serves for local or remote load from cloud storage.
        Returns:
        The future for this async operation. This future is both checked and listenable.