Overcoming the Limitations of Real-World Data
Introduction
Synthetic data is artificially generated data that is designed to mimic real-world data. It is created using algorithms and statistical models that replicate the patterns, characteristics, and relationships found in real-world data. Synthetic data can be used to train and test machine learning models, improve privacy and security, and reduce bias in data sets, among other things.
Synthetic data has many use-cases. One of the major ones is that it can save you millions of dollars by mitigating the privacy and data-protection risks underlined by GDPR compliance standards. GDPR stands for General Data Protection Regulation. It is a set of rules and regulations set by the European Union (EU) that dictate how companies and organizations must handle and protect the personal data of individuals who live in the EU. The GDPR came into effect on May 25th, 2018, replacing the 1995 EU Data Protection Directive.
The penalty for non-compliance with GDPR can be significant. Organizations that are found to be in violation of the regulations can be fined up to 4% of their annual global revenue or €20 million (whichever is greater).
In this blog you will learn:
- What is Synthetic Data?
- Major use-cases of synthetic data generation (SDG).
- GDPR requirements and synthetic data.
- Tools and stack to handle the problem.
- Python implementation using the Gretel.ai API
What is Synthetic Data?
The term synthetic data refers to data produced by computer algorithms or simulations rather than collected from real-world events. Synthetic data is frequently utilized as an alternative to real-world data. Although artificial, it statistically mimics the patterns, characteristics, and relationships of real-world data, and this is its key property.
Synthetic data is a powerful tool that can be used in a variety of applications, but it is particularly useful in artificial intelligence and machine learning. It lets researchers and practitioners work around the limitations of real-world data, such as bias, incompleteness, and lack of diversity.
One of the main advantages of synthetic data is that it can be generated in large quantities and with different characteristics, making it possible to create diverse data sets that can be used to train machine learning models. This is particularly useful when real-world data is scarce or hard to obtain.
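As a simple illustration of this idea, here is a minimal sketch (not how production tools work, just the principle) that fits per-column means and standard deviations on a small "real" dataset and then samples as many synthetic rows as needed from those statistics. The column names and values are hypothetical.

import numpy as np
import pandas as pd

# Hypothetical "real" data with two numeric features
real = pd.DataFrame({
    "age": [34, 45, 29, 52, 41, 38],
    "income": [52000, 74000, 48000, 91000, 67000, 60000],
})

# Learn simple per-column statistics from the real data
stats = {col: (real[col].mean(), real[col].std()) for col in real.columns}

# Sample an arbitrary number of synthetic rows from those statistics
rng = np.random.default_rng(42)
synthetic = pd.DataFrame({
    col: rng.normal(mean, std, size=1000) for col, (mean, std) in stats.items()
})

print(synthetic.describe())  # means and standard deviations track the real data

Real synthetic data generators use far more sophisticated models than independent Gaussians, but the core idea of learning statistics from real data and sampling new records from them is the same.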
One major use case of synthetic data is that it can be used to protect the privacy of individuals. Synthetic data can be generated by removing sensitive information, such as personally identifiable information (PII), from real-world data or by generating data that does not contain PII. Researchers can then use the data to train machine learning models without putting people's privacy at risk.
Use-cases of Synthetic Data
1. Training Machine Learning Models
One of the most common use cases for synthetic data in artificial intelligence is training machine learning models on artificially generated data instead of, or alongside, real-world data.
Often, obtaining large amounts of real-world data can be difficult and time-consuming, especially if the data is sensitive or protected by regulations like GDPR. Real-world data can also be biased, incomplete, or contain errors. Synthetic data can be used as an alternative to real-world data in order to train machine learning models.
Synthetic data can be used to supplement or replace real-world data, allowing machine learning models to be trained on a larger and more diverse set of data. This can improve the performance and generalization of the model.
2. Reducing bias in your data
Synthetic data can also be used to reduce bias and increase fairness in data sets. Real-world data can often be biased or unbalanced, leading to machine learning models that are biased or unfair.
A data set can be biased if it doesn't give a true picture of the population it's supposed to study. For example, if most of the information in a data set comes from one group, like a certain race or gender, it might not be a good representation of what other groups go through or how they act. This can lead to machine learning models that are not representative of the population they are intended to serve.
Synthetic data can help reduce bias in data by allowing researchers to create data sets that more accurately represent the population they are studying. With synthetic data, researchers can control how gender, race, and other demographic characteristics are spread across the data set. This can help make sure that the data set is more like the people it is meant to serve.
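As a rough sketch of what that control can look like, the snippet below up-samples an under-represented group so that each group contributes equally to the training set. The column names are hypothetical, and simple resampling is only one of many ways to rebalance a dataset.

import pandas as pd

# Hypothetical dataset where one group is heavily under-represented
df = pd.DataFrame({
    "gender": ["M"] * 90 + ["F"] * 10,
    "label": [0, 1] * 45 + [0, 1] * 5,
})

# Size of the largest group becomes the target size for every group
target_size = df["gender"].value_counts().max()

# Up-sample each group (with replacement) to the size of the largest group
balanced = (
    df.groupby("gender", group_keys=False)
      .apply(lambda g: g.sample(target_size, replace=True, random_state=0))
)

print(balanced["gender"].value_counts())  # now 90 M and 90 F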
3. Enhancing privacy by protecting Personally Identifiable Information
Another common use case for synthetic data is enhancing privacy and security. Real-world data often contains private or sensitive information that can't be shared with the public. Synthetic data can be used to represent this data in a privacy-preserving manner, allowing it to be used for research or analysis without compromising the privacy of individuals.
PII refers to any information that can be used to identify an individual, such as a name, address, phone number, email address, or Social Security number. PII can also include things like financial information, medical records, and biometric information. Under GDPR, organizations must protect this kind of information and get clear permission from people before they collect, use, or share it.
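To make the idea of PII concrete, here is a toy sketch that flags two common PII patterns (emails and US-style phone numbers) in free text with regular expressions. Real PII detection, including the kind Gretel uses, relies on named entity recognition and far broader pattern coverage; this is purely illustrative.

import re

# Very rough, illustrative patterns; real detectors are far more thorough
PII_PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "phone": re.compile(r"\b\d{3}[-.\s]?\d{3}[-.\s]?\d{4}\b"),
}

def find_pii(text):
    """Return every match of each PII pattern found in the text."""
    return {name: pattern.findall(text) for name, pattern in PII_PATTERNS.items()}

sample = "Contact Jane Doe at jane.doe@example.com or 555-123-4567."
print(find_pii(sample))
# {'email': ['jane.doe@example.com'], 'phone': ['555-123-4567']}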
General Data Protection Regulation (GDPR)
GDPR stands for General Data Protection Regulation. It is a set of rules and regulations set by the European Union (EU) that dictate how companies and organizations must handle and protect personal data of individuals who live in the EU. It gives individuals more control over their personal data and holds organizations accountable for data breaches or misuse of personal data.
The penalty for non-compliance with GDPR can be significant. Organizations that are found to be in violation of the regulations can be fined up to 4% of their annual global revenue or €20 million (whichever is greater). This is intended to be a deterrent to organizations that do not take data protection seriously. Additionally, organizations can also be subject to administrative fines and penalties, and individuals can take legal action against organizations that misuse their personal data.
Before we look at our use case, it is important to understand two major underlying concepts: pseudonymization and anonymization.
Pseudonymization and anonymization are both techniques used to protect the privacy of individuals by removing or obscuring personally identifiable information (PII) from data sets. However, there are some key differences between the two techniques.
Pseudonymization
Pseudonymization is the process of replacing sensitive information, such as names and addresses, with a pseudonym or artificial identifier. This allows the data to be used for research or analysis while still protecting the privacy of individuals. Pseudonymized data can still be linked back to the original data set, but it requires additional information, such as a key or token, to do so.
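Below is a minimal sketch of pseudonymization using a keyed hash: names are replaced with opaque identifiers, and only someone holding the secret key (or a separately stored lookup table) can relate the pseudonyms back to the originals. The column name and key are hypothetical.

import hashlib
import hmac
import pandas as pd

# Hypothetical key, kept separate from the data itself
SECRET_KEY = b"store-this-somewhere-safe"

def pseudonymize(value):
    """Replace a value with a stable, keyed pseudonym."""
    return hmac.new(SECRET_KEY, value.encode(), hashlib.sha256).hexdigest()[:12]

df = pd.DataFrame({"name": ["Alice Smith", "Bob Jones"], "purchase": [120, 85]})
df["name"] = df["name"].map(pseudonymize)
print(df)  # names replaced by pseudonyms; purchases unchanged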
Anonymization
Anonymization is the process of making data completely anonymous, so that it is impossible to identify individuals. This is typically done by removing or obscuring all information that could be used to identify individuals, such as names, addresses, and other personal information. Anonymized data is not linkable to any other data set, and it is impossible to re-identify individuals based on the anonymized data.
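A correspondingly simple sketch of anonymization is shown below: direct identifiers are dropped outright and a quasi-identifier (age) is generalized into broad bands so individual records are harder to single out. Real anonymization requires a careful re-identification risk assessment; this only illustrates the mechanics, with hypothetical column names.

import pandas as pd

df = pd.DataFrame({
    "name": ["Alice Smith", "Bob Jones"],  # direct identifier
    "email": ["a@x.com", "b@y.com"],       # direct identifier
    "age": [34, 58],                        # quasi-identifier
    "purchase": [120, 85],
})

# Drop direct identifiers entirely
anonymized = df.drop(columns=["name", "email"])

# Generalize the quasi-identifier into coarse age bands
anonymized["age_band"] = pd.cut(anonymized["age"], bins=[0, 30, 50, 120],
                                labels=["<30", "30-50", "50+"])
anonymized = anonymized.drop(columns=["age"])
print(anonymized)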
Synthetic data, when generated with appropriate privacy protections, can offer protection against potential adversarial attacks that traditional anonymization techniques such as masking or tokenization cannot guarantee.
Recital 26 of GDPR
Recital 26 of GDPR states that “the principles of data protection should therefore not apply to anonymous information, namely information which does not relate to an identified or identifiable natural person or to personal data rendered anonymous in such a manner that the data subject is not or no longer identifiable”.
It goes on to explain that pseudonymous data, which is data that has had personally identifiable information replaced with a pseudonym, can still be considered personal data if the data controller or processor can link the pseudonym to an identified or identifiable natural person through additional information they possess or can easily access.
This means that if synthetic data is generated correctly, with appropriate privacy filters, it falls outside the scope of GDPR, removing the risks you carry when working with and passing around real data containing PII.
Tools and Stack
The startup ecosystem of companies working on synthetic data problems is made up of a diverse range of companies that are using synthetic data to solve various problems and create new opportunities in a variety of industries.
These companies are generally focused on using synthetic data to improve the performance of artificial intelligence and machine learning systems, as well as to enhance privacy and security, reduce bias, increase fairness, and improve the efficiency and speed of data-driven processes.
For the demo in the next section, we will be using an API from Gretel.ai.
Gretel.AI
Gretel.ai is a company that provides a platform for creating synthetic data. The platform uses cutting-edge machine learning techniques to generate synthetic data that mimics real-world data, allowing organizations to train machine learning models without compromising data privacy or security.
The platform can be used to create synthetic data for a variety of applications, including tabular data, natural language processing, and time series data.
Gretel.AI specifically has workflows and algorithms built-in to enable organizations to comply with regulations such as GDPR by removing sensitive information, such as personally identifiable information (PII), from the data used to train models. This is the use case we will implement in this blog.
Python Implementation
In this example, we will use Gretel.AI’s anonymization workflow through an API in Python. This workflow consists of three separate steps (algorithms):
- Classify
- Transform
- Synthetics
The first step, Classify, sets up a policy to find and label sensitive data, such as personally identifiable information, credentials, and even custom regular expressions, in text, logs, and other structured data. It uses named entity recognition under the hood.
The second step, Transform, defines a policy to label and transform a dataset, with support for advanced options including custom regular expression search, date shifting, and fake entity replacements.
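To give a feel for what such a transform policy does, here is a stand-alone sketch (not the Gretel API) that shifts dates by a random offset and consistently swaps real names for fake placeholders, two of the transformations mentioned above. All names and formats here are hypothetical.

import random
from datetime import datetime, timedelta

def shift_date(date_str, max_days=30):
    """Shift a YYYY-MM-DD date by a random number of days in either direction."""
    d = datetime.strptime(date_str, "%Y-%m-%d")
    return (d + timedelta(days=random.randint(-max_days, max_days))).strftime("%Y-%m-%d")

_fake_names = ["Person A", "Person B", "Person C"]
_name_cache = {}

def replace_name(real_name):
    """Consistently map each real name to the same fake replacement."""
    if real_name not in _name_cache:
        _name_cache[real_name] = _fake_names[len(_name_cache) % len(_fake_names)]
    return _name_cache[real_name]

record = {"customer": "Jane Doe", "signup_date": "2022-03-15"}
record = {"customer": replace_name(record["customer"]),
          "signup_date": shift_date(record["signup_date"])}
print(record)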
The final step, Synthetics, generates synthetic data that cannot be linked back to the original data. Currently, it supports the following models:
- Gretel LSTM: a deep learning model that supports tabular, time-series, and natural language text data.
- Gretel ACTGAN: an adversarial model that supports tabular data, structured numerical data, and high column count data.
- Gretel Amplify: a statistical model that supports high volumes of tabular data generation.
- Gretel DGAN: an adversarial model for time series data.
- Gretel GPT: a generative pre-trained transformer for natural language text generation.
%%capture
# Install dependencies
!git clone https://github.com/gretelai/gdpr-helpers.git
!cd gdpr-helpers; pip install -Uqq .

# Change into the cloned repository so the relative config paths resolve
import os
if not os.getcwd().endswith('gdpr-helpers'):
    os.chdir('gdpr-helpers')
I am using the sample bike-buying dataset provided by Gretel.ai (included in the gdpr-helpers repository).
import glob

from gdpr_helpers import Anonymizer

# Path(s) of the dataset(s) to anonymize, relative to the gdpr-helpers repo
search_pattern = "data/adventure-works-bike-buying.csv"

# Configure the anonymization workflow (classify -> transform -> synthetics)
am = Anonymizer(
    project_name="gdpr-workflow",
    run_mode="cloud",
    transforms_config="src/config/transforms_config.yaml",
    synthetics_config="src/config/synthetics_config.yaml",
    endpoint="https://api.gretel.cloud",
)

# Run the full workflow on every file matching the search pattern
for dataset_path in glob.glob(search_pattern):
    am.anonymize(dataset_path=dataset_path)
Once you run the above code, it will ask you to provide an API key. To get one, log in to your Gretel.ai account and copy the API key from the settings page. You can sign up for free and get free credits to try this.
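If you prefer not to enter the key interactively, the Gretel client can typically pick it up from an environment variable. The variable name used below (GRETEL_API_KEY) and the key value are assumptions; confirm the exact name against the current Gretel documentation for your client version.

import os

# Assumed environment variable; the key value is a placeholder, not a real key
os.environ["GRETEL_API_KEY"] = "grtu_your_api_key_here"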
The `Anonymizer` class also requires two config files, passed as the transforms_config and synthetics_config parameters. We have passed YAML files here; if you want to see what they look like, you can clone the repo.
You can also watch the live dashboard in the Gretel console as the models train.
The process will run for a few minutes and will produce three files: 1) an anonymization report, 2) a transformed-data CSV file, and 3) a synthetic-data CSV file.
Now that I have synthetic data produced by Gretel.AI API that closely resembles the characteristics of my original data, what I want to test is how comparable the machine learning model performance is on both these datasets.
Imagine you have an idea and you want to outsource the POC within or even outside of your organization. The original data contains PII, and you do not want to run the risk of GDPR non-compliance. In situations like this, you can create synthetic data and use that instead of the original data that contains PII.
To prove this viability, I will train multiple machine learning models on both datasets separately and observe the cross-validation performance of the models. For this step I have used a Python library called PyCaret. PyCaret is an open-source, low-code machine learning library in Python for automating machine learning workflows.
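A minimal sketch of that comparison with PyCaret's classification module is shown below, followed by the results. The target column name ("BikeBuyer") and the synthetic file path are placeholders I am assuming here; adjust them to the actual column in the dataset and the synthetic CSV produced by the workflow above.

import pandas as pd
from pycaret.classification import setup, compare_models, pull

# Assumed file paths and target column; adjust to your workflow outputs
original = pd.read_csv("data/adventure-works-bike-buying.csv")
synthetic = pd.read_csv("synthetic_data.csv")  # placeholder: the synthetic CSV from the previous step

# Cross-validated comparison of models on the original data
setup(data=original, target="BikeBuyer", session_id=123)
best_original = compare_models()
results_original = pull()

# Repeat the same comparison on the synthetic data
setup(data=synthetic, target="BikeBuyer", session_id=123)
best_synthetic = compare_models()
results_synthetic = pull()

print(results_original.head())
print(results_synthetic.head())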
Performance on original data
Performance on synthetic data
Notice that the ranking of models doesn't change much. Furthermore, the best model in both scenarios is Linear Discriminant Analysis, and its accuracy differs by less than 1%. This demonstrates the viability of using synthetic data for a POC instead of original data containing PII.
See the full Google Colab Notebook.
Conclusion
In conclusion, synthetic data is a powerful tool that is becoming increasingly important in the field of artificial intelligence and machine learning. It allows researchers and practitioners to overcome the limitations of real-world data, such as bias, incompleteness, and lack of diversity.
Synthetic data can also be used to protect the privacy of individuals by removing sensitive information, such as personally identifiable information (PII), from real-world data. With the increasing importance of data privacy and regulations like GDPR, synthetic data is becoming an essential tool for building robust and accurate AI models while protecting personal data.
Thank you for reading!
Liked the blog? Connect with Moez Ali
Moez Ali is an innovator and technologist. A data scientist turned product manager dedicated to creating modern and cutting-edge data products and growing vibrant open-source communities around them.
Creator of PyCaret, 100+ publications with 500+ citations, keynote speaker and globally recognized for open-source contributions in Python.
Let's be friends! Connect with me:
👉 LinkedIn
👉 Twitter
👉 Medium
👉 YouTube
🔥 Check out my brand new personal website: https://www.moez.ai.
To learn more about my open-source work on PyCaret, you can check out the GitHub repo or follow PyCaret's official LinkedIn page.
Listen to my talk on Time Series Forecasting with PyCaret at DATA+AI SUMMIT 2022 by Databricks.