Back up your BigQuery views in GitHub with Python

Santiago Ravotti · Published in 2Performant Tech · Aug 4, 2023

In this tutorial, I will guide you through running a Python script that creates and updates a GitHub repository holding the view definitions from your current data project. This handy procedure helps keep your data views fresh, properly stored, and well-managed.

This process involves leveraging key features from BigQuery as well as integrating API requests with GitHub. Interestingly, the code you will encounter in this guide is built upon a robust Data Catalog system that was developed previously. This system is all about centralizing information about all your data assets, making it simpler to discover, understand, and govern your data.

As a bonus, I will provide the code for the Data Catalog at the end of this guide, so make sure to follow through to the end!

Let’s now dive into the step-by-step process.

1. Creating a New Repository in GitHub

Start by creating a new repository where the project’s information will be stored.

1.1. Visit GitHub’s website and log in with your account credentials.

1.2. After logging in, click the ‘+’ icon located in the upper-right corner of the web page and select ‘New repository’.

1.3. Give your repository a unique name, choose its visibility status (either public or private), and click ‘Create repository’.
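If you prefer to script this step as well, the repository can also be created through GitHub's REST API. The snippet below is only an optional sketch: it assumes you already have a personal access token with the "repo" scope (covered in the next section), and the repository name bigquery-views-backup is just an example.

import requests

# Optional: create the backup repository programmatically via the GitHub REST API
token = "YOUR GITHUB TOKEN GOES HERE"  # personal access token with "repo" scope
response = requests.post(
    "https://api.github.com/user/repos",
    headers={"Authorization": f"token {token}"},
    json={"name": "bigquery-views-backup", "private": True},  # example repository name
)
response.raise_for_status()
print(response.json()["full_name"])  # e.g. your-account/bigquery-views-backup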

2. Generating a GitHub Token

A personal access token (PAT) is necessary for carrying out various operations on your GitHub repository. Here’s how to generate one:

2.1. Click on your profile icon, which is located in the top right corner of GitHub’s webpage.

2.2. Go to “Settings”.

2.3. From the left side menu, navigate to “Developer settings”.

2.4. Click on “Personal access tokens”.

2.5. Click on “Generate new token”.

2.6. Assign a descriptive name to your token and check the box for the “repo” scope, which grants the token permission to access and modify repository details.

2.7. Click on “Generate token”.

2.8. Once the token is generated, copy it right away. For security reasons, GitHub won’t display the token again.

For more information on how to generate your access token, see GitHub's documentation on creating a personal access token.
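If you want to confirm the token works before wiring it into the script, an optional sanity check is to call the API's authenticated-user endpoint with the token you just copied:

import requests

# Optional sanity check: a valid token returns your user profile
token = "YOUR GITHUB TOKEN GOES HERE"
response = requests.get(
    "https://api.github.com/user",
    headers={"Authorization": f"token {token}"},
)
print(response.status_code)           # 200 means the token is valid
print(response.json().get("login"))   # your GitHub username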

3. Import the Necessary Libraries

To run the code, several libraries need to be imported. Execute the following Python commands:

from google.colab import auth
from google.cloud import bigquery
import requests
import json
import base64
import os

4. Authenticating Google Colab with Google BigQuery

To connect Google Colab with Google BigQuery, use the command auth.authenticate_user().
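In a Colab cell, this is a single call that opens Google's sign-in flow and grants the notebook access to your Google Cloud resources:

from google.colab import auth

# Opens an authentication prompt in the notebook
auth.authenticate_user()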

5. Setting Up the BigQuery Client and Obtaining all Dataset Names

You need to set up your BigQuery Client with the relevant project ID and get all dataset names. Replace 'YOUR PROJECT ID GOES HERE' with your actual project ID.

project_id = 'YOUR PROJECT ID GOES HERE'
client = bigquery.Client(project=project_id)

datasets = list(client.list_datasets()) # API request.
dataset_names = [dataset.dataset_id for dataset in datasets]

# Printing all the dataset names
for dataset in dataset_names:
    print(dataset)

6. Defining Functions to Retrieve and Push Files to GitHub

We now need to define two functions that will handle retrieving files from and pushing files to our GitHub repository.

# Define headers outside the functions
HEADERS = {
    "Authorization": None,
    "Content-Type": "application/json"
}

def get_from_github(filename, repo, path, token):
    """
    Get a file from GitHub.

    Args:
        filename (str): The name of the file.
        repo (str): The repository from which to get the file.
        path (str): The path in the repository where the file is located.
        token (str): Your GitHub token.

    Returns:
        str: The content of the file, or None if the file does not exist.
    """
    url = f"https://api.github.com/repos/{repo}/contents/{path}/{filename}"
    HEADERS["Authorization"] = f"token {token}"
    response = requests.get(url, headers=HEADERS)

    if response.status_code == 200:
        file_info = response.json()
        content = base64.b64decode(file_info['content']).decode()
        return content
    elif response.status_code == 404:
        return None
    else:
        response.raise_for_status()

def push_to_github(filename, content, repo, path, token):
    """
    Push a file to GitHub, creating it or updating it if it already exists.

    Args:
        filename (str): The name of the file.
        content (str): The content of the file.
        repo (str): The repository to which to push the file.
        path (str): The path in the repository where the file should go.
        token (str): Your GitHub token.

    Returns:
        None
    """
    url = f"https://api.github.com/repos/{repo}/contents/{path}/{filename}"
    HEADERS["Authorization"] = f"token {token}"
    data = {
        "message": f"Add {filename}",
        "content": base64.b64encode(content.encode('utf-8')).decode('utf-8')
    }

    # Updating an existing file requires its current SHA, so fetch it first if the file is already there
    existing = requests.get(url, headers=HEADERS)
    if existing.status_code == 200:
        data["sha"] = existing.json()["sha"]

    response = requests.put(url, headers=HEADERS, data=json.dumps(data))
    response.raise_for_status()  # Raises stored HTTPError, if one occurred
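Before running the full backup, you may want to sanity-check the two helpers with a throwaway file; the repository, path, and token below are placeholders.

# Quick round trip with a throwaway file (placeholder values)
test_repo = "your-account/your-backup-repo"
test_token = "YOUR GITHUB TOKEN GOES HERE"

push_to_github("hello.sql", "SELECT 1;", test_repo, "test", test_token)
print(get_from_github("hello.sql", test_repo, "test", test_token))  # prints: SELECT 1;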

7. Backing Up View Definitions

Finally, use the following code to iterate through all datasets and tables, saving each view’s SQL definition to a file in the corresponding GitHub repository directory.

Remember to replace 'YOUR GITHUB TOKEN GOES HERE' and 'YOUR GITHUB BACK UP REPO GOES HERE' with your actual GitHub token and backup repository.

GITHUB_TOKEN = "YOUR GITHUB TOKEK GOES HERE" 
GITHUB_REPO = "YOUR GITHUB BACK UP REPO GOES HERE" # this should have your GitHub account name or company name, acompanied by a slash and the repo name

# Initialize a boolean to check if any changes were made
changes_made = False

for dataset_name in dataset_names:
    dataset_id = dataset_name
    tables_and_views = list(client.list_tables(dataset_id))

    for table in tables_and_views:
        print(f"Processing dataset: {dataset_id}, table: {table.table_id}")

        if table.table_type == 'VIEW':
            view = client.get_table(table)
            view_definition = view.view_query

            # Fetch the current definition from GitHub
            filename = f"{table.table_id}.sql"
            path = dataset_id.replace('/', '__')  # GitHub paths do not support '/', so replace it with '__'
            current_definition = get_from_github(filename, GITHUB_REPO, path, GITHUB_TOKEN)

            # If the view definition has changed, or the view is new, save it to GitHub
            if current_definition != view_definition:
                if current_definition is None:
                    print(f"New view added: {table.table_id}")
                else:
                    print(f"View updated: {table.table_id}")

                push_to_github(filename, view_definition, GITHUB_REPO, path, GITHUB_TOKEN)

                # Set changes_made to True, as a change was made
                changes_made = True

# After looping through all views, check if any changes were made
if not changes_made:
    print("No new views were added or existing views updated.")

print("All views have been successfully backed up to GitHub.")

I hope this guide has effectively illustrated how to automate backing up your data project's views to a GitHub repository, an important part of maintaining your data assets.

The tutorial was based on a Data Catalog system developed earlier, which centralizes metadata management for easier data discovery and governance.

Bonus: Creating a Data Catalog by Processing Metadata from Your Tables and Views

As promised, below is a Python script that generates a Data Catalog from your BigQuery data warehouse, fetching and structuring metadata from your datasets into an Excel file. This script is an asset for anyone working with numerous datasets and tables.

A particularly useful aspect of this script is its use of regular expressions (regex) to pinpoint the parent tables or views referenced in the ‘FROM’ clause of each data asset, which makes it easier to trace the lineage of your data assets and understand their dependencies. A small standalone illustration of the regex follows, and after that you will find the full script, with the steps explained as comments in the code itself.
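In this made-up view definition, every backtick-quoted reference that follows a FROM keyword becomes one entry in the resulting list (note that tables referenced only in JOIN clauses are not captured by this pattern).

import re

# Made-up view definition for illustration
view_definition = """
SELECT * FROM `my-project.sales.orders_2022`
UNION ALL
SELECT * FROM `my-project.sales.orders_2023`
"""

matches = re.findall(r'FROM `(.+?)`', view_definition, re.IGNORECASE)
print(matches)  # ['my-project.sales.orders_2022', 'my-project.sales.orders_2023']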

# 1. Import required libraries

from google.cloud import bigquery
import pandas as pd
import re
from google.colab import auth

# Authenticate the user for Colab
auth.authenticate_user()

# 2. Set up BigQuery client

# Replace this with your own project ID
project_id = 'YOUR PROJECT ID GOES HERE'
client = bigquery.Client(project=project_id)

# 3. Fetch the dataset names automatically

# Fetch all datasets within the specified project
datasets = list(client.list_datasets())

# Extract dataset names and store them in a list
dataset_names = [dataset.dataset_id for dataset in datasets]

# 4. Retrieve tables and views metadata

# Initialize empty lists to hold table properties
table_names = []
table_types = []
column_names = []
sources = []
dataset_list = []

# Loop over each dataset in the list
for dataset_name in dataset_names:
    dataset_id = dataset_name
    # Fetch all tables/views for the given dataset
    tables_and_views = list(client.list_tables(dataset_id))

    # Loop over each table/view in the list
    for table in tables_and_views:
        print(f"Processing dataset: {dataset_id}, table: {table.table_id}")

        # Append table/view name and dataset name to their respective lists
        table_names.append(table.table_id)
        dataset_list.append(dataset_id)

        # Check if the current object is a table or a view and append relevant info
        if table.table_type == 'TABLE':
            table_types.append('TABLE')
            sources.append(None)
        elif table.table_type == 'VIEW':
            view = client.get_table(table)
            view_definition = view.view_query

            # Extract the names of the datasets and tables this view is derived from
            matches = re.findall(r'FROM `(.+?)`', view_definition, re.IGNORECASE)
            sources.append(matches)
            table_types.append('VIEW')
        else:
            table_types.append(table.table_type)
            sources.append(None)

        # Fetch the schema of the current table/view
        schema = client.get_table(table).schema
        # Append the list of column names to the column_names list
        column_names.append([field.name for field in schema])

print("All datasets and tables processed.")

# 5. Create DataFrame

# Create a pandas DataFrame using the lists we filled previously
df = pd.DataFrame({
    'Dataset': dataset_list,
    'Table Name': table_names,
    'Table Type': table_types,
    'Columns': column_names,
    'Sources': sources
})

# Print the first few rows of the DataFrame to check if everything is OK
df.head()

# 6. Export DataFrame to an Excel file in Google Drive

from google.colab import drive
# Connect to the user's Google Drive
drive.mount("/content/gdrive")

# Specify where we want to save our DataFrame in Google Drive
file_path = "/content/gdrive/My Drive/DataCatalog.xlsx"
# Save the DataFrame to Google Drive as an Excel file
df.to_excel(file_path, index=False)
print('The Data Catalog has been saved to your Google Drive as an Excel file.')

Be sure to replace ‘YOUR PROJECT ID GOES HERE’ with your own project_id.

I hope these techniques enhance your data handling routines. Thank you for your time, and I look forward to sharing more insights in future posts. Happy data handling!
