Synthetic Data
Test datasets are widely used in software verification, machine learning model training, and data analytics. Data is now a core resource for businesses in areas as varied as marketing, employment, migration patterns, consumer behavior, and professional services. However, using production data (i.e., real data records) imposes security and accuracy constraints on testing and forecasting, especially when the customer's datasets include Personally Identifiable Information (PII). Synthetic data removes these barriers by producing pre-production data environments with the required scale and realism.
Creating an alternative reality safe for testing and analysis
Synthetic data replicates the original data by maintaining its statistical integrity. The source records are not present in the synthetic dataset, thus removing any trace back to the original entries. In addition, synthetic datasets can be modified to different scales, forecasted into the future, or tweaked to represent variations of the original scenario in a controlled manner.
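As an illustration of what "maintaining statistical integrity" can mean in practice, here is a minimal sketch that fits a simple two-variable model to source records and then samples new records from it. It is not the method used by any particular product. The bivariate normal model, the age and income columns, and the function names `fit_bivariate_normal` and `sample_synthetic` are all assumptions made for this example, and the "source" records are themselves simulated stand-ins.

```python
import math
import random


def fit_bivariate_normal(records):
    """Estimate means, standard deviations, and correlation of (x, y) pairs."""
    n = len(records)
    mean_x = sum(x for x, _ in records) / n
    mean_y = sum(y for _, y in records) / n
    var_x = sum((x - mean_x) ** 2 for x, _ in records) / (n - 1)
    var_y = sum((y - mean_y) ** 2 for _, y in records) / (n - 1)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in records) / (n - 1)
    std_x, std_y = math.sqrt(var_x), math.sqrt(var_y)
    return mean_x, mean_y, std_x, std_y, cov / (std_x * std_y)


def sample_synthetic(params, count, rng):
    """Draw new (x, y) pairs from the fitted model; no source record is copied."""
    mean_x, mean_y, std_x, std_y, rho = params
    synthetic = []
    for _ in range(count):
        z1, z2 = rng.gauss(0, 1), rng.gauss(0, 1)
        x = mean_x + std_x * z1
        # Mix z1 into y so the pair has correlation rho.
        y = mean_y + std_y * (rho * z1 + math.sqrt(1 - rho ** 2) * z2)
        synthetic.append((x, y))
    return synthetic


rng = random.Random(42)

# Stand-in "source" records (hypothetical): age and an income that grows with age.
real = []
for _ in range(2000):
    age = rng.uniform(20, 65)
    real.append((age, 20000 + 800 * age + rng.gauss(0, 5000)))

params = fit_bivariate_normal(real)
synthetic = sample_synthetic(params, 2000, rng)
synthetic_params = fit_bivariate_normal(synthetic)

print("source correlation:   ", round(params[4], 3))
print("synthetic correlation:", round(synthetic_params[4], 3))
```

The synthetic records are drawn from the fitted model rather than copied, so they share the means, spreads, and correlation of the source without reproducing any source row. Passing a larger `count` scales the dataset up, and editing `params` before sampling is one simple way to produce a controlled variation of the original scenario.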
Using synthetic data for software testing/training and data analytics brings the following benefits compared to real data:
Privacy and Security
Production data should never be used for testing scenarios and policies. Obfuscation and encryption alone are not adequate safeguards, because the underlying records are still real. Synthetic data supports compliance with PII and GDPR regulations and makes it easier to share data among business and technical teams without compromising security.
Agile data generation
Surveying and sharing sensitive production data can require weeks or months of data acquisition and preparation. A synthetic dataset can be produced in minutes with full traceability, and its generation can be automated to minimize testing overhead and quality-control effort.
Flexibility
Data owners have full control over the generation of synthetic datasets. Thus, they can modify the datasets to reduce bias, generate specific test cases, or define forecast scenarios to test hypotheses.
Use cases for synthetic data
Skymantics has advised customers on security and privacy for several years. During this time, we have watched data privacy regulations evolve and multiply around the world.
Skymantics is a pioneer in the use of synthetic data. We are currently bringing early benefits of this approach to risk areas including tax administration and disaster response.
The main use cases for synthetic data are:
True data anonymization
Masking and data obfuscation techniques can be traced back to the source, and they have been exploited to infer sensitive information from customer or patient data for identity engineering and fraud. Synthetic data is a superior alternative because it contains no source records, which eliminates the risk of tracing data back to individuals. Synthetic datasets are safe to share within the organization (with data science or business analytics teams) and outside it (with customers and collaborators).
Accelerate the path to automated software testing
Traditional test data generators rely on entry randomization and rules-based pattern construction. Artificial Intelligence greatly improves the accuracy and richness of test environments by generating synthetic datasets that mirror the stochastic relationships between the variables that characterize the real data. This lets teams define specific test cases without losing accuracy.
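To see why entry randomization alone falls short, consider a minimal sketch in which each column is randomized independently of the others. The age and income columns, the simulated records, and the `pearson` helper are assumptions made for this example and do not describe any specific test data generator.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


rng = random.Random(7)

# Stand-in "real" records (hypothetical): income rises with age.
ages = [rng.uniform(20, 65) for _ in range(2000)]
incomes = [20000 + 800 * age + rng.gauss(0, 5000) for age in ages]

# Entry randomization: each column is drawn independently from its own values.
# Every value looks plausible, but the link between the columns is lost.
random_ages = [rng.choice(ages) for _ in ages]
random_incomes = [rng.choice(incomes) for _ in incomes]

corr_real = pearson(ages, incomes)
corr_randomized = pearson(random_ages, random_incomes)

print("real correlation:      ", round(corr_real, 3))
print("randomized correlation:", round(corr_randomized, 3))
```

Each randomized column keeps a realistic range, yet the correlation between age and income collapses toward zero. A generator that models the joint distribution of the variables avoids this loss, which is the property the paragraph above describes.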
Predictive analytics
A synthetic dataset can be adjusted and branched to represent hypothetical scenarios for analysis and comparison. Examples include populations that age according to different demographic patterns or that go through economic recessions. This capability makes it possible to simulate potential outcomes and perform sensitivity analysis in studies of market segment demographics, population vulnerability to disasters, or events in the air traffic network.
Fraud and Revenue Protection
Leverage built-in consumer, taxpayer, and socioeconomic patterns to train rules that detect data anomalies and protect your revenue. Extend the core models with custom aspects to cover utility theft, abnormal spending patterns, and financial transactions.
Machine Learning algorithm training
As more business operations rely on Machine Learning models for prediction and decision support, the datasets needed to train such models become a valuable and scarce resource. Synthetic training datasets are cheaper to scale, and they make it possible to include corner cases and reduce bias, thus improving the accuracy of the models.
Our approach: alive synthetic populations
Skymantics has developed DATAGENESIS, the most advanced generator of synthetic populations in the market. By replicating demographic, geographic, and socioeconomic features of the population from authoritative data sources, DATAGENESIS augments customer data with population insights while we create privacy-compliant synthetic data environments.
- Generation and aging
- Households, individuals and businesses
- Fabrication of names and addresses
- Geospatial attributes
- What-if scenarios to simulate events such as natural disasters and disease outbreaks
- 65 entity attributes
- 30 socioeconomic life events
- Automated testing for statistical validity
- Software Development Kit (SDK)
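One way to picture the automated testing for statistical validity listed above is a two-sample Kolmogorov-Smirnov check that compares a synthetic column against its source column. This is only a sketch under stated assumptions: the source does not say which tests DATAGENESIS runs, the function names `ks_statistic` and `ks_threshold` are made up for this example, and the data is simulated.

```python
import bisect
import math
import random


def ks_statistic(sample_a, sample_b):
    """Largest gap between the empirical CDFs of two samples."""
    a, b = sorted(sample_a), sorted(sample_b)
    largest_gap = 0.0
    for value in a + b:
        cdf_a = bisect.bisect_right(a, value) / len(a)
        cdf_b = bisect.bisect_right(b, value) / len(b)
        largest_gap = max(largest_gap, abs(cdf_a - cdf_b))
    return largest_gap


def ks_threshold(n, m, alpha):
    """Asymptotic critical value for the two-sample test at significance alpha."""
    c_alpha = math.sqrt(-0.5 * math.log(alpha / 2))
    return c_alpha * math.sqrt((n + m) / (n * m))


rng = random.Random(11)

# Hypothetical age columns: one source sample and two candidate synthetic samples.
source_ages = [rng.gauss(40, 12) for _ in range(1000)]
faithful_ages = [rng.gauss(40, 12) for _ in range(1000)]  # same distribution
drifted_ages = [rng.gauss(50, 12) for _ in range(1000)]   # mean shifted by 10

threshold = ks_threshold(1000, 1000, alpha=0.001)
faithful_gap = ks_statistic(source_ages, faithful_ages)
drifted_gap = ks_statistic(source_ages, drifted_ages)

print("threshold:   ", round(threshold, 4))
print("faithful gap:", round(faithful_gap, 4), "pass" if faithful_gap < threshold else "fail")
print("drifted gap: ", round(drifted_gap, 4), "pass" if drifted_gap < threshold else "fail")
```

A check like this can run automatically after every generation job, flagging any column whose distribution drifts from the source. It covers one variable at a time, so a fuller validation suite would also compare relationships between variables.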
Our synthetic populations are “alive”: they can be aged over a number of years, producing demographic changes in the population structure that replicate real statistical trends. This novel branching capability enables forecasting of multi-year test scenarios (e.g., recession, immigration, pandemics) and of their impact on populations of configurable scale and geographic area.
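The idea of aging a population and branching it into scenarios can be sketched as a small year-by-year simulation. Everything in this example is a hypothetical assumption made for illustration, including the mortality curve, the birth rate, the age range for births, and the function name `age_population`. It does not reflect the models or parameters that DATAGENESIS uses.

```python
import math
import random


def age_population(ages, years, birth_rate, mortality_scale, seed):
    """Age a list of individual ages year by year, applying deaths and births."""
    rng = random.Random(seed)
    population = list(ages)
    for _ in range(years):
        survivors = []
        births = 0
        for age in population:
            # Hypothetical mortality curve: risk grows exponentially with age.
            p_death = min(1.0, mortality_scale * 0.0005 * math.exp(0.07 * age))
            if rng.random() < p_death:
                continue
            survivors.append(age + 1)
            # Hypothetical rule: people aged 20 to 40 may add one newborn per year.
            if 20 <= age <= 40 and rng.random() < birth_rate:
                births += 1
        population = survivors + [0] * births
    return population


start_rng = random.Random(3)
start = [start_rng.randint(0, 90) for _ in range(5000)]

# Two branches from the same starting population.
baseline = age_population(start, years=10, birth_rate=0.05, mortality_scale=1.0, seed=1)
pandemic = age_population(start, years=10, birth_rate=0.05, mortality_scale=3.0, seed=1)

print("start size:   ", len(start))
print("baseline size:", len(baseline))
print("pandemic size:", len(pandemic))
```

Both branches start from the same population and differ only in `mortality_scale`, so comparing them isolates the effect of the scenario. A realistic generator would replace the toy rates with rates derived from authoritative demographic sources.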
Thanks to its modular design, DATAGENESIS can be integrated into a customer's data pipeline (cloud or on-premises) and with third-party BI tools and data platforms. Its high generative performance allows it to synthesize anywhere from hundreds to millions of households in a matter of minutes. These capabilities enable organizations in different industries to integrate synthetic demographics into their data analytics environments and make the most of their existing customer, patient, and citizen data records without compromising privacy.
Do you want to learn more about the possibilities of synthetic data? Contact us to ask about our solutions and request a demo today.