Moving data from BigQuery to On-premise Hive

Abhishek Raviprasad
3 min read · Dec 13, 2022


In this article, let's discuss the solution to a use case we received from our customer: "Can you help us move some 200+ tables from BigQuery back to on-prem Hive?"

Initially, when we read about this use case in an email, we thought it was a typo from our customer. We assumed the customer wanted to move data from on-prem to BigQuery, not the other way around. But on the exploratory call with the customer, it turned out our assumption was incorrect: the customer did want to move a few tables back to on-premise Hive for some analytics.

We came up with two solutions, both using a Spark compute cluster, and I will discuss them in this article.

Solution 1 (Using Spark JDBC)

Architecture

Read from BigQuery

We can use the native BigQuery Spark connector provided by Google Cloud to read the data from BigQuery. Reading data from BigQuery is as simple as reading a CSV file.

Example:

words = spark.read.format('bigquery') \
    .option('table', 'bigquery-public-data:samples.shakespeare') \
    .load()

Refer to https://cloud.google.com/dataproc/docs/tutorials/bigquery-connector-spark-example for more details.
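
Note that the connector jar has to be available to Spark. On Dataproc it is pre-installed; elsewhere, a sketch like the following can pull it in when the session is created (the connector coordinates and version below are an assumption; check the tutorial above for the current release):

from pyspark.sql import SparkSession

spark = (SparkSession.builder
    .appName("bq-to-hive")
    # Assumed connector coordinates/version; adjust to your Spark/Scala version.
    .config("spark.jars.packages",
            "com.google.cloud.spark:spark-bigquery-with-dependencies_2.12:0.32.2")
    .getOrCreate())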

Writing data to On-Prem Hive

We did not want to reinvent the wheel, so we looked for an existing JDBC driver that could help us connect to a remote Hive Thrift server via JDBC.

We did POCs with multiple Hive JDBC drivers, such as the Apache Hive JDBC and Cloudera JDBC drivers, but we faced many issues: data-type incompatibilities, JAR incompatibilities, and spark.write.jdbc failing with abrupt errors.

We then took a free trial of the CData Apache Hive JDBC driver. This driver was fully compatible with Spark, and we were able to connect to a remote Hive server via JDBC and write data to Hive tables.

Refer: https://www.cdata.com/drivers/hive/jdbc/

String HIVE_DRIVER_NAME = "cdata.jdbc.apachehive.ApacheHiveDriver";

// Write the Spark Dataset to the remote Hive table over JDBC.
dataset.write()
    .format("jdbc")
    .option("url", this.jdbc_url)       // CData Hive JDBC connection URL
    .option("dbtable", this.table)      // target Hive table
    .option("driver", HIVE_DRIVER_NAME)
    .mode("append")
    .save();

Notes:

  1. Make sure the license is installed and available on the Spark classpath.
  2. We faced issues with timestamp and date datatypes, so we used the following workaround (see the sketch after this list):
  • Convert all the datatypes to string before writing the dataset to a temp table.
  • Create the main table with the proper datatypes.
  • Issue an “INSERT INTO MAIN_TABLE SELECT * FROM TEMP_TABLE” command to populate the main table from the temp table.
  • Drop the temp table.
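
Here is a minimal PySpark sketch of that workaround. The JDBC URL, database, table, and column names are placeholders, and the post-load statements are assumed to run on the Hive side (for example via beeline or a plain JDBC connection):

from pyspark.sql import functions as F

HIVE_DRIVER_NAME = "cdata.jdbc.apachehive.ApacheHiveDriver"
jdbc_url = "jdbc:apachehive:Server=hive-host;Port=10000;"  # placeholder URL

# df is the dataset read from BigQuery.
# Cast every column to string and load it into a temp table over JDBC.
df_str = df.select([F.col(c).cast("string").alias(c) for c in df.columns])
(df_str.write
    .format("jdbc")
    .option("url", jdbc_url)
    .option("dbtable", "mydb.orders_tmp")
    .option("driver", HIVE_DRIVER_NAME)
    .mode("overwrite")
    .save())

# Then, on the Hive side, populate the main table and clean up:
post_load_sql = [
    "CREATE TABLE IF NOT EXISTS mydb.orders (id BIGINT, order_ts TIMESTAMP, amount DOUBLE)",
    "INSERT INTO mydb.orders SELECT * FROM mydb.orders_tmp",
    "DROP TABLE mydb.orders_tmp",
]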

We handled the following in the first release of this target:

  1. Support for overwrite, append, and merge modes
  2. Support for all primitive datatypes
  3. Support for primary partitioning
  4. Support for bucketing

The future enhancements planned are:

  1. Support for array and struct datatypes
  2. Support for optimised merge and partition-level merge

Solution 2 (Using Spark and Custom Python Scripts)

Architecture

Read from BigQuery

As in Solution 1, we can use the native BigQuery Spark connector provided by Google Cloud to read the data from BigQuery. Reading data from BigQuery is as simple as reading a CSV file.

Example:

words = spark.read.format('bigquery') \
    .option('table', 'bigquery-public-data:samples.shakespeare') \
    .load()

Refer to https://cloud.google.com/dataproc/docs/tutorials/bigquery-connector-spark-example for more details.

Write the data to GCS

In this step we write the dataset read from BigQuery to Google Cloud Storage, preferably in ORC or Parquet format.
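
A minimal sketch of this step, reusing the dataframe from the read example above and a placeholder staging bucket:

# Write the BigQuery dataset to GCS in Parquet format (bucket/path are illustrative).
(words.write
    .mode("overwrite")
    .parquet("gs://my-staging-bucket/bigquery_export/shakespeare/"))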

Custom Python Script to create external Hive tables

After the Spark job writes the data to GCS, we can trigger a Python script that does the following (a sketch of steps 2–4 follows the list):

  1. SSH to the master node of the on-prem Hive cluster
  2. Pull the data from Google Cloud Storage to on-prem HDFS using gsutil or the hadoop distcp command
  3. Create a Spark session and read the ORC/Parquet files. Using the metadata/schema embedded within the ORC/Parquet files, generate the CREATE TABLE statement and execute it.
  4. This creates the external table on top of the data residing in HDFS
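
Here is a minimal sketch of steps 2–4, assuming the data was staged under gs://my-staging-bucket/bigquery_export/orders/ and should land as an external table mydb.orders (all paths and names are illustrative; copying straight from a gs:// source with distcp also assumes the GCS connector is configured on the on-prem cluster):

import subprocess
from pyspark.sql import SparkSession

gcs_path = "gs://my-staging-bucket/bigquery_export/orders/"
hdfs_path = "hdfs:///data/bigquery_export/orders/"

# Step 2: copy the files from GCS to on-prem HDFS.
subprocess.run(["hadoop", "distcp", gcs_path, hdfs_path], check=True)

# Step 3: read the Parquet files and derive the DDL from their embedded schema.
spark = SparkSession.builder.enableHiveSupport().getOrCreate()
df = spark.read.parquet(hdfs_path)
columns = ", ".join(f"`{f.name}` {f.dataType.simpleString()}" for f in df.schema.fields)

# Step 4: create the external table on top of the HDFS location.
spark.sql(f"""
    CREATE EXTERNAL TABLE IF NOT EXISTS mydb.orders ({columns})
    STORED AS PARQUET
    LOCATION '{hdfs_path}'
""")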

Sample code for both the solutions is available on GitHub:

Please provide your comments on this topic. Happy reading :-)


Abhishek Raviprasad

Senior Solution Engineer at Infoworks.io, with 4+ years of big data/ETL/data warehouse experience building data pipelines.