Package org.pipecraft.infra.bq
Class TableLoadConfig
- java.lang.Object
-
- org.pipecraft.infra.bq.TableLoadConfig
-
public class TableLoadConfig extends Object
BQ table load configuration. Serves both local file loading and remote (cloud storage) file loading into BQ. The two modes differ as follows:
1) Source URIs - For remote load, use "gs://..." paths only. For local load, use full local file system paths only. The URI format determines whether this is a local load or a remote one.
2) Wildcards - Both modes expect wildcards in the file name part of the path only. Both support the '*' wildcard, but only local load supports multiple occurrences of '*' and the '?' wildcard.
3) Parallelism - Local load is serial and less efficient than remote load, so prefer remote load for large data.
4) Compression - Remote load supports only gzip. Local load also supports zstd. Files must have the proper suffix to be identified correctly.
Immutable.
- Author:
- Eyal Schneider
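The wildcard and compression rules in the class description can be sketched as a small standalone check. This is only an illustration, not the library's own validation code: the class `SourceUriRules` and its methods are hypothetical, '/' is assumed to be the path separator, and `.zst` is assumed to be the zstd suffix (the description only says that files need "the proper suffix").

```java
// Hypothetical helper, NOT part of org.pipecraft.infra.bq.
final class SourceUriRules {
    private SourceUriRules() {}

    // Per the class description, remote sources are "gs://..." paths.
    static boolean isRemote(String uri) {
        return uri.startsWith("gs://");
    }

    // Checks the wildcard rules: file name part only; remote load allows a
    // single '*' and no '?'; local load allows several '*' and also '?'.
    static boolean hasValidWildcards(String uri) {
        int lastSlash = uri.lastIndexOf('/');
        String dir = uri.substring(0, lastSlash + 1);
        String file = uri.substring(lastSlash + 1);
        if (dir.indexOf('*') >= 0 || dir.indexOf('?') >= 0) {
            return false; // wildcard outside the file name part
        }
        if (isRemote(uri)) {
            long stars = file.chars().filter(c -> c == '*').count();
            return stars <= 1 && file.indexOf('?') < 0;
        }
        return true;
    }

    // Assumption: zstd files end with ".zst". Only local load accepts them.
    static boolean hasSupportedCompression(String uri) {
        if (uri.endsWith(".zst")) {
            return !isRemote(uri);
        }
        return true; // gzip or uncompressed: accepted by both modes
    }
}
```

For example, `hasValidWildcards("/data/in/part-*-??.csv")` is accepted as a local pattern, while the same file name under a "gs://" prefix is rejected.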
-
-
Nested Class Summary
Nested Classes
Modifier and Type    Class
static class         TableLoadConfig.Builder
static class         TableLoadConfig.LoadFormat
-
Method Summary
Modifier and Type, Method, Description
boolean getAllowJaggedRows()
Set<String> getClusteringFields()
com.google.cloud.bigquery.JobInfo.CreateDisposition getCreateDisposition()
String getCsvFieldDelimiter()
boolean getCSVHasHeader() - Relevant for CSV format only.
Integer getDestinationTableExpirationHs()
LocalDate getDestinationTablePartition()
com.google.cloud.bigquery.TableId getDestinationTableReference()
TableLoadConfig.LoadFormat getLoadFormat()
Set<String> getSourceURIs()
com.google.cloud.bigquery.Schema getTableSchema()
Long getTimeoutMs()
com.google.cloud.bigquery.JobInfo.WriteDisposition getWriteDisposition()
boolean isRemoteLoad()
static TableLoadConfig.Builder newBuilder(String sourceURI, com.google.cloud.bigquery.TableId destinationTableReference)
static TableLoadConfig.Builder newBuilder(Set<String> sourceURIs, com.google.cloud.bigquery.TableId destinationTableReference)
String toString()
-
-
-
Method Detail
-
newBuilder
public static TableLoadConfig.Builder newBuilder(Set<String> sourceURIs, com.google.cloud.bigquery.TableId destinationTableReference)
- Parameters:
sourceURIs - the full paths to the source data. Each URI should be fully qualified, and may contain one wildcard ('*') in the file name part of the path. When loading local files, all URIs should have the form of a full local file system path. For remote loading they should all be valid cloud storage paths.
destinationTableReference - the details of the table to load data into. In case of a partitioned table, use setPartition(..) to set the partition to load into.
- Returns:
- A builder initialized with the given sources and target, having default values in the other fields.
-
newBuilder
public static TableLoadConfig.Builder newBuilder(String sourceURI, com.google.cloud.bigquery.TableId destinationTableReference)
- Parameters:
sourceURI - the full path to the source file. The URI should be fully qualified, and may contain one wildcard ('*') in the file name part of the path. When loading a local file, the URI should have the form of a full local file system path. For remote loading it should be a valid cloud storage path.
destinationTableReference - the details of the table to load data into.
- Returns:
- A builder initialized with the given source and target, having default values in the other fields.
-
getSourceURIs
public Set<String> getSourceURIs()
- Returns:
- the full paths to the source data. Each URI should be fully qualified, and may contain one wildcard ('*') in the file name part of the path. When loading local files, all URIs should have the form of a full local file system path. For remote loading they should all be valid cloud storage paths.
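For local paths, a wildcard in the file name part can be read as a glob pattern. The sketch below uses the JDK's glob matcher to show which file names such a pattern would select. It assumes that the library's local wildcard handling agrees with JDK glob semantics for '*' and '?', which this page does not confirm; the class `LocalWildcards` is hypothetical.

```java
import java.nio.file.FileSystems;
import java.nio.file.PathMatcher;
import java.nio.file.Paths;

// Hypothetical helper, NOT part of org.pipecraft.infra.bq.
final class LocalWildcards {
    private LocalWildcards() {}

    // Matches a concrete file name against the file name part of a local
    // source path. In JDK globs '*' matches zero or more characters and '?'
    // matches exactly one. Note that globs also give special meaning to
    // '[', '{' and '\', which this sketch does not escape.
    static boolean matches(String fileNamePattern, String fileName) {
        PathMatcher matcher =
            FileSystems.getDefault().getPathMatcher("glob:" + fileNamePattern);
        return matcher.matches(Paths.get(fileName));
    }
}
```

With the pattern `part-*-??.csv`, the name `part-7-01.csv` matches, while `part-7-1.csv` does not because '?' requires exactly one character per occurrence.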
-
isRemoteLoad
public boolean isRemoteLoad()
- Returns:
- true if and only if this configuration is for a remote (cloud storage) file loading into BQ. This is determined by inspecting the form of the source URIs.
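A minimal sketch of how a load could be classified from the form of its source URIs. The class `LoadKind` is hypothetical, and rejecting an empty or mixed set is an assumption made here for illustration; this page does not say how TableLoadConfig treats those cases.

```java
import java.util.Set;

// Hypothetical helper, NOT the library's implementation of isRemoteLoad().
final class LoadKind {
    private LoadKind() {}

    // Returns true when every URI is a "gs://" path and false when none is.
    // Assumption: an empty set or a mix of local and remote paths is an error.
    static boolean isRemoteLoad(Set<String> sourceURIs) {
        if (sourceURIs.isEmpty()) {
            throw new IllegalArgumentException("No source URIs given");
        }
        long remote = sourceURIs.stream()
            .filter(uri -> uri.startsWith("gs://"))
            .count();
        if (remote != 0 && remote != sourceURIs.size()) {
            throw new IllegalArgumentException("Mixed local and remote source URIs");
        }
        return remote > 0;
    }
}
```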
-
getDestinationTableReference
public com.google.cloud.bigquery.TableId getDestinationTableReference()
- Returns:
- the destination table reference
-
getDestinationTablePartition
public LocalDate getDestinationTablePartition()
- Returns:
- The destination table partition to write to, as a date. Should be specified only for partitioned tables.
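BigQuery addresses a single daily partition with a partition decorator of the form `table$YYYYMMDD`. The sketch below shows how a LocalDate maps to such a decorator. Whether TableLoadConfig builds the decorator this way internally is an assumption, and the class `PartitionDecorator` is hypothetical.

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;

// Hypothetical helper, NOT part of org.pipecraft.infra.bq.
final class PartitionDecorator {
    private PartitionDecorator() {}

    // Appends "$YYYYMMDD" to the table name, or returns the name unchanged
    // when no partition is given (non-partitioned table).
    static String decorate(String tableName, LocalDate partition) {
        if (partition == null) {
            return tableName;
        }
        return tableName + "$" + partition.format(DateTimeFormatter.BASIC_ISO_DATE);
    }
}
```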
-
getLoadFormat
public TableLoadConfig.LoadFormat getLoadFormat()
- Returns:
- the format of the input file. Default is CSV.
-
getCsvFieldDelimiter
public String getCsvFieldDelimiter()
- Returns:
- the input file field delimiter. Applies to CSV format only. Default is ",".
-
getCSVHasHeader
public boolean getCSVHasHeader()
Relevant for CSV format only.
- Returns:
- true if and only if the CSV file has a header line to skip. Default is true.
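To illustrate what the field delimiter and header settings mean for a CSV input, here is a deliberately naive reader. It is not how BQ parses CSV: it ignores quoting and escaping, and the class `CsvSketch` is hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical illustration, NOT part of org.pipecraft.infra.bq.
final class CsvSketch {
    private CsvSketch() {}

    // Splits each line on the delimiter, skipping the first line when the
    // input has a header. The delimiter is quoted so that characters such
    // as '|' are taken literally; the -1 limit keeps trailing empty fields.
    static List<String[]> dataRows(List<String> lines, String delimiter, boolean hasHeader) {
        List<String[]> rows = new ArrayList<>();
        int first = hasHeader ? 1 : 0;
        for (int i = first; i < lines.size(); i++) {
            rows.add(lines.get(i).split(Pattern.quote(delimiter), -1));
        }
        return rows;
    }
}
```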
-
getTableSchema
public com.google.cloud.bigquery.Schema getTableSchema()
- Returns:
- the destination table schema. The schema can be omitted if the destination table already exists. If specified, it can serve for adding columns dynamically.
-
getDestinationTableExpirationHs
public Integer getDestinationTableExpirationHs()
- Returns:
- the destination table expiration in hours. Null means no expiration.
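BigQuery stores a table's expiration as an absolute time in milliseconds since the epoch, so an expiration given in hours has to be converted relative to some reference time. The following sketch shows one such conversion. The class `ExpirationSketch` is hypothetical, and using the current time as the reference is an assumption about how the value is applied.

```java
import java.time.Duration;
import java.time.Instant;

// Hypothetical helper, NOT part of org.pipecraft.infra.bq.
final class ExpirationSketch {
    private ExpirationSketch() {}

    // Converts "hours from now" into epoch milliseconds.
    // A null input means no expiration and is passed through as null.
    static Long expirationEpochMs(Integer expirationHs, Instant now) {
        if (expirationHs == null) {
            return null;
        }
        return now.plus(Duration.ofHours(expirationHs)).toEpochMilli();
    }
}
```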
-
getCreateDisposition
public com.google.cloud.bigquery.JobInfo.CreateDisposition getCreateDisposition()
- Returns:
- The table creation mode. Defines how the command deals with a situation where the table to load into does not exist yet. Default is CREATE_IF_NEEDED.
-
getWriteDisposition
public com.google.cloud.bigquery.JobInfo.WriteDisposition getWriteDisposition()
- Returns:
- The mode defining how the command deals with existing rows in the target table. Default is WRITE_APPEND.
-
getAllowJaggedRows
public boolean getAllowJaggedRows()
- Returns:
- true if and only if jagged rows are allowed. Jagged rows are rows that are missing optional columns (trailing columns only). When true, the missing values are treated as nulls. When false, missing values are considered an error. Default is false.
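The jagged row rule can be sketched on a single parsed row as follows. This is an illustration only: the class `JaggedRows` is hypothetical, and it does not check whether the missing columns are actually optional in the schema.

```java
import java.util.Arrays;

// Hypothetical illustration, NOT part of org.pipecraft.infra.bq.
final class JaggedRows {
    private JaggedRows() {}

    // Pads missing trailing columns with nulls when jagged rows are allowed,
    // and rejects short rows otherwise. Rows with too many columns are
    // always rejected in this sketch.
    static String[] normalize(String[] row, int columnCount, boolean allowJaggedRows) {
        if (row.length > columnCount) {
            throw new IllegalArgumentException("Too many columns: " + row.length);
        }
        if (row.length < columnCount && !allowJaggedRows) {
            throw new IllegalArgumentException("Missing trailing columns");
        }
        return Arrays.copyOf(row, columnCount); // new slots are null
    }
}
```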
-
getClusteringFields
public Set<String> getClusteringFields()
- Returns:
- The set of names of table fields defined as clustering fields. Mandatory when the table has clustering fields.
-
getTimeoutMs
public Long getTimeoutMs()
- Returns:
- The load execution timeout, in milliseconds.
Null means no timeout (default value).
NOTE: Google's API doesn't seem to always respect this limit, and it's not always clear which timeout applies
(the load-level timeout here or the global one provided in the
BigQueryConnector's constructor).
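Since the limit may not always be respected, a caller that needs a hard bound might add its own client-side wait around the load. The sketch below shows the null-means-no-timeout convention applied to a generic asynchronous job. It is not taken from the library: the class `TimeoutSketch` is hypothetical, and representing the load as a CompletableFuture is an assumption.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical illustration, NOT part of org.pipecraft.infra.bq.
final class TimeoutSketch {
    private TimeoutSketch() {}

    // Waits for the job. A null timeout waits indefinitely; otherwise the
    // wait is bounded by timeoutMs. Returns false if the bound was hit or
    // the waiting thread was interrupted.
    static boolean completesInTime(CompletableFuture<?> job, Long timeoutMs) {
        try {
            if (timeoutMs == null) {
                job.join();
            } else {
                job.get(timeoutMs, TimeUnit.MILLISECONDS);
            }
            return true;
        } catch (TimeoutException e) {
            return false;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt flag
            return false;
        } catch (ExecutionException e) {
            throw new IllegalStateException("Job failed", e.getCause());
        }
    }
}
```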
-
-