Revolutionizing Data Privacy: Generate Synthetic Databases with Gretel Relational

Moez Ali
10 min readMay 10, 2023

Generate Synthetic Databases with Gretel Relational — The Future of Data Privacy

Image Credits: Tom Edlund

Introduction

Synthetic data is artificially generated data that is designed to mimic real-world data. It is created using machine learning models that replicate the patterns, characteristics, and relationships found in real-world data. My previous blog Synthetic data is the future of Artificial Intelligence, discusses the key use cases for synthetic data generation (SDG) and how it is a perfect fit for GDPR requirements.

Topic for Today: Synthetic Relational Database

This blog will demonstrate the capability of generating Synthetic Relational Databases that accurately reflect the relationships between many interrelated tables in database. This feature is particularly significant as relational databases play a vital role in enterprise data management. However, due to data privacy concerns, access to these databases is often limited, hampering innovation and necessitating reliance on proxy datasets that may lack the same statistical relationships and insights.

To demonstrate, I will use Relational API provided by Gretel.AI to generate a synthetic relational database.

Gretel.AI

Gretel.ai is a company that provides a platform for creating synthetic data. The platform uses cutting-edge machine learning techniques to generate synthetic data that mimics real-world data, allowing organizations to train machine learning models without compromising data privacy or security.

The platform can be used to create synthetic data for a variety of applications, including tabular data, natural language processing, image, time series data, or even the entire database.

I have opted for Gretel.AI due to its user-friendly interface and its ability to generate synthetic data of exceptional quality, along with metrics to evaluate the privacy and accuracy of the data.

Synthetic Relational Database use-cases in Healthcare

Synthetic data is an emerging technology that has gained significant interest in the healthcare industry. With the increasing amounts of healthcare data being generated every day, synthetic data is becoming an essential tool for researchers, healthcare providers, and data analysts to analyze and share data securely and efficiently.

In the healthcare industry, synthetic data has a wide range of use-cases, from drug discovery to clinical trials, from population health management to disease diagnosis, and from patient care to medical education. In particular, synthetic databases are proving to be a valuable resource for researchers conducting clinical trials.

Clinical trials are essential to the development of new drugs, medical devices, and treatments. However, it requires large amounts of data to be collected from participants, but this data is often sensitive and subject to privacy regulations. Researchers need to be able to analyze the data to make informed decisions about treatment options, but they also need to protect the privacy of the participants. By generating synthetic data at a database level, researchers can preserve the statistical patterns of the original data while protecting the privacy of the participants.

End to End Python Example

Database

In this demonstration, I have generated a mock patient database consisting of three tables, each with exactly 5,000 records.

Table patients as 1 to many relationship with table treatments and lab_results . Patient Id is the primary key that connects all three tables.

Table 1 — Patients

Sample rows

Table 2 — Lab Results

Sample rows

Table 3 — Treatments

Sample rows

Connect to Data source

There are two ways to define relational data — automatically via a database connector or manually with individual csv files. When defining your source data via a connector, the relationships are automatically defined. This approach simplifies the process to just two lines of code.

In this example, I have three files that I will ingest using RelationalData class from gretel_trainer.relational module but first you have to install `gretel-trainer`. Installation is pretty straight forward using pip.

# install gretel-trainer library
pip install -U gretel-trainer

Step 0 — Enter API Key

To begin, you’ll need to enter your API key, which can be found in the Gretel Console.

from gretel_client import configure_session
configure_session(api_key="prompt", cache="yes", validate=True)

Step 1 — Load Data

from gretel_trainer.relational import RelationalData
import pandas as pd

csv_dir = "csv/"

tables = [
("patients", "Patient ID"),
("lab_results", "Lab Result ID"),
("treatments", "Treatment ID"),
]

foreign_keys = [
("lab_results.Patient ID", "patients.Patient ID"),
("treatments.Patient ID", "patients.Patient ID"),
]

relational_data = RelationalData()

for table, pk in tables:
relational_data.add_table(name=table, primary_key=pk, data=pd.read_csv(f"{csv_dir}/{table}.csv"))

for fk, ref in foreign_keys:
relational_data.add_foreign_key(foreign_key=fk, referencing=ref)

All the three files are in folder csv/ . tables define the list of tuples of (table_name, primary_key) and foreign_keys is also a list of tuples (table_name.foreign_key, table_name.primary_key).

If you want to work directly with a relational database, you can use one of the many data connectors in Gretel.AI to do that. One advantage of using a data connector is that relationships between tables are automatically extracted, elimating the need to define them manually in your code. Take a look at the blog post Generate Synthetic Databases with Gretel Relational for an illustration of how to use the SQLite connector to establish a connection with a SQLite database.

Step 2 — Exploratory Analysis

To generate synthetic data we have to train a model but before that, let’s quickly examine the data.

# load dataset
patients_df = pd.read_csv('csv/patients.csv')
lab_results_df = pd.read_csv('csv/lab_results.csv')
treatments_df = pd.read_csv('csv/treatments.csv')

# check shape
print(patients_df.shape)
print(lab_results_df.shape)
print(treatments_df.shape)

Output:

(5000, 6)
(5000, 4)
(5000, 5)

Let’s examine the output when we join patients table with lab_results:

# Join the patients and lab_results tables on Patient ID
patients_lab_df = pd.merge(patients_df, lab_results_df, on='Patient ID', how = 'left')

# sample rows of joined data
patients_lab_df.head()
Sample rows of joined data (patients + lab_results)
# check the shape of patients_lab_df
print(patients_lab_df.shape)

Output:

(6916, 9)

Also, let’s examine the output when we join patients table with treatments:

# Join the patients and treatments tables on Patient ID
patients_treatments_df = pd.merge(patients_df, treatments_df, on='Patient ID', how = 'left')

# sample rows of joined data
patients_treatments_df.head()
Sample rows of joined data (patients + treatments)
# check the shape of patients_lab_df
print(patients_treatments_df.shape)

Output:

(6831, 10)

Note that the relationship of patients table with lab_results and treatments is one-to-many. This means that there could be multiple records for each patient in treatments and lab_results table and hence the output in both the above scenario is greater than original 5,000 records in patients table. After we synthesize the database, we’ll use the same method to confirm that this relational integrity has been maintained.

Age Distribution of Patients

Before we begin training our model, let’s quickly examine the Age distribution of our patients in the original data. After generating the synthetic data, we will compare the Age distribution in the synthetic data to assess how closely it resembles our original data.

import plotly.express as px

fig = px.histogram(patients_df,
x="Age",
nbins=30,
opacity=0.75,
labels={"Age": "Age (years)", "count": "Number of patients"})

fig.update_layout(
title="Distribution of Patient Age",
xaxis_title="Age (years)",
yaxis_title="Number of patients",
font=dict(size=14, family="Arial"),
plot_bgcolor="#F9F9F9",
paper_bgcolor="#F9F9F9",
)

fig.show()

Output:

Step 3 — Model Training

We will use the MultiTable class to set up our relational model and create a project. This class is used to control model training, data generation, and synthetic database evaluation. There are three model options available. In this example, we’re using Gretel Amplify.

from gretel_trainer.relational import MultiTable

multitable = MultiTable(
relational_data,
project_display_name="Clinical Trials",
gretel_model="amplify"
)

The training step is pretty simple. You just have to call train method on MultiTable instance.

# start model training
multitable.train()

This may take 5–10 minutes. When you sign up for Gretel.AI account you get some free credits which should be enough for this run. Logs can be viewed in the Notebook while the training process is in progress.

Logs while training the model

Once the training is completed, synthetic database can be generated with the same amount of data for all tables, or you can subset the existing database and adjust the record_size_ratio parameter to generate more data.

# generate synthetic database
multitable.generate(record_size_ratio=1)

To ensure that the synthetic database retains the same dimensions as the original (i.e., 5,000 records for each table), we have set the record_size_ratio to 1. When generating the synthetic database, you can monitor the live logs within the notebook, as with the training process.

Logs while generating synthetic data

You can now access the synthetic data using synthetic_output_tables attribute which is of type dictionary.

patient_df_transformed = multitable.synthetic_output_tables['patients']
lab_results_df_transformed = multitable.synthetic_output_tables['lab_results']
treatments_df_transformed = multitable.synthetic_output_tables['treatments']

Let’s check the shape of synthetic database:

print(patient_df_transformed.shape)
print(lab_results_df_transformed.shape)
print(treatments_df_transformed.shape)

Output:

(5000, 6)
(5000, 4)
(5000, 5)

BOOM! The output shape matches the input. Let’s now compare the distribution of age of patients in original data with synthetic data.

Verifying Patient Age Distribution

import plotly.figure_factory as ff
import pandas as pd

# Create data arrays for each dataframe
data1 = patients_df['Age'].values
data2 = patient_df_transformed['Age'].values

# Create density plots for each dataset
fig = ff.create_distplot([data1, data2], group_labels=['Original Data', 'Synthetic Data'], show_hist=True)

# Customize the aesthetics
fig.update_layout(
title='Age Distribution Comparison',
xaxis_title='Age',
yaxis_title='Density',
title_font=dict(size=28, family='Arial'),
font=dict(size=16, color='black'),
legend=dict(title='', x=0.5, y=1.2, font=dict(size=16)),
showlegend=True
)

# Display the figure
fig.show()

Verify Referential Integrity

The final and the most important step is what happen when we join patient_df_transformed table with lab_results_transformed and treatments_df_transformed .

What we are looking for is the same shape as original data i.e. (6916, 9) and (6831, 10). Let’s check! This will come the referential integrity of the process.

# Join the synthetic patients and lab_results tables on Patient ID
patient_lab_df_transformed = pd.merge(patient_df_transformed, lab_results_df_transformed, on='Patient ID', how = 'left')

# Join the synthetic patients and treatments tables on Patient ID
patient_treatment_transformed = pd.merge(patient_df_transformed, treatments_df_transformed, on='Patient ID', how = 'left')

# check the shape
print(patient_lab_df_transformed.shape)
print(patient_treatment_transformed.shape)

Output:

(6916, 9)
(6831, 10)

BAM! This is exactly what we were expecting. Let’s visually examine original and synthetic data once before we wrap up this blog.

Table 1 — Patients

Patient Table (Original vs. Synthetic)

Table 2 — Lab Results

Lab Results Table (Original vs. Synthetic)

Table 3 — Treatments

The Final Check

After generating each database, a Gretel Relational Report is produced. This report enables you to evaluate the privacy and accuracy of both the entire synthetic database and its individual tables.

To view the report within the notebook:

import IPython
from smart_open import open

report_path = str(multitable._working_dir / multitable._synthetics_run.identifier / "relational_report.html")
IPython.display.HTML(data=open(report_path).read())

The report includes a visual of the key relationships between tables in the database, as shown below.

Gretel Relational Reports are generated for each table, containing quality scores. The individual report assesses the statistical accuracy of the synthetic table, while the cross-table report evaluates the accuracy of the table and its ancestor tables within the database. The former provides an evaluation of the synthetic table’s quality as a stand-alone entity, while the latter provides context for the accuracy of the table within the overall database.

For this blog, I took inspiration from Generate Synthetic Databases with Gretel Relational blog by Grace King. You can also check the Official Notebook here or on Google Colab.

Conclusion

Generating Synthetic Relational Databases with Gretel Relational is a powerful solution that addresses data privacy concerns. By replicating the patterns, characteristics, and relationships found in real-world data, Gretel Relational’s cutting-edge machine learning techniques enable organizations to create synthetic data for various critical applications.

With Gretel Relational, it is possible to evaluate the quality and privacy of the synthesized database, giving users confidence in the accuracy and security of their data. The end-to-end Python example in this blog demonstrates how to generate synthetic relational databases and verify the referential integrity of the process, making it easier for organizations to innovate and drive insights without compromising data privacy.

Thank you for reading!

Liked the blog? Connect with

Moez Ali is an innovator and technologist. A data scientist turned product manager dedicated to creating modern and cutting-edge data products and growing vibrant open-source communities around them.

Creator of PyCaret, 100+ publications with 500+ citations, keynote speaker and globally recognized for open-source contributions in Python.

Let’s be friends! connect with me:

👉 LinkedIn
👉 Twitter
👉 Medium
👉 YouTube

🔥 Check out my brand new personal website: https://www.moez.ai.

To learn more about my open-source work: PyCaret, you can check out this GitHub repo or you can follow PyCaret’s Official LinkedIn page.

Listen to my talk on Time Series Forecasting with PyCaret in DATA+AI SUMMIT 2022 by Databricks.

🚀 My most read articles:

--

--