Synthetic population for tax fraud detection testing

About the client

The Internal Revenue Service (IRS) mission is to provide America’s taxpayers top quality service by helping them understand and meet their tax responsibilities, and to enforce the law with integrity and fairness to all. The Agency is currently undergoing a multi-year modernization plan to strengthen its information technology systems, technologies and processes. This includes data infrastructure (cloud, Agile, DevOps, API), process automation, and advanced data analytics and tools.

Challenge

The IRS Enterprise Services (ES) Enterprise Systems Testing (EST) unit provides enterprise-wide testing solutions and critical support across the systems and applications of the Agency. As part of its goal to acquire advanced data analytics capabilities, EST requires high-quality data to perform automated testing of tax review and enforcement software, including tax fraud detection, in pre-production environments. Currently, simulated tax returns are generated for virtual individual households and updated based on random selections and rule-based configurations. This is a largely manual process that does not cover all the use cases found in a real tax population. In addition, production (real tax) data cannot be used due to privacy compliance issues.

The customer required a simulation engine that generates synthetic tax data representing more scenarios with less effort. The engine had to simulate individuals and business entities that evolve over time and generate updated tax records. The business goal is to reduce IT delivery cycles by making synthetic tax data generation rapid, distributable, and repeatable.

The Simulation of the Nation (SimoN) tool

In response to the IRS requirements, Skymantics is providing a generator of synthetic population and eFile tax forms, with aging capabilities. The SimoN toolset is based on our synthetic population technology DataGenesis. A modular Machine Learning model architecture implements the algorithm training and generation, and a custom application creates and submits tax eFile forms for automated validation. Learn more about our experience with synthetic data.

The generation interface allows the tailoring of population size, geography, and any demographic attribute as desired for custom scenarios. Synthetic populations can then be aged over a number of years, as life events make the status of households, individuals and businesses evolve. A summary of SimoN features includes:

Skymantics is applying an Agile methodology to ensure customer requirements are covered by the SimoN solution. Open-source data science software libraries have been used to the maximum extent possible to ensure model reliability and robustness. A microservice architecture has been used, which increases configurability, reusability and explainability of the models. Data for model training is based on authoritative sources including the U.S. Census, the U.S. Bureau of Labor Statistics, academic research and other third-party sources.
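To make the idea of generating and aging a synthetic population more concrete, here is a toy sketch in Python. It is not the SimoN or DataGenesis implementation: the Person class, the generate_population and age_population functions, the uniform age range, and the fixed marriage rate are all hypothetical stand-ins for attributes and life events that a real engine would learn from sources such as Census data.

```python
import random
from dataclasses import dataclass


@dataclass
class Person:
    """Hypothetical synthetic individual; attributes are illustrative only."""
    person_id: int
    age: int
    state: str
    married: bool = False


def generate_population(size, states, seed=0):
    """Create `size` synthetic people with random ages and states.

    Assumption: uniform sampling stands in for a trained generative model.
    """
    rng = random.Random(seed)
    return [
        Person(person_id=i, age=rng.randint(18, 80), state=rng.choice(states))
        for i in range(size)
    ]


def age_population(people, years, marriage_rate=0.05, seed=0):
    """Advance every person by `years`, applying one toy life event per year.

    Assumption: a fixed yearly marriage probability stands in for
    data-driven life-event models.
    """
    rng = random.Random(seed)
    for _ in range(years):
        for person in people:
            person.age += 1
            if not person.married and rng.random() < marriage_rate:
                person.married = True
    return people


# Generate a small two-state population, then age it by three years.
population = generate_population(1000, ["VA", "MD"], seed=42)
ages_before = [p.age for p in population]
age_population(population, years=3, seed=42)
```

Seeding the random number generator makes each run repeatable, which mirrors the stated goal of test data generation that is rapid, distributable, and repeatable.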

The Skymantics difference

By pioneering the field of synthetic data in the generation of population attributes, Skymantics is leading the way in the application of Artificial Intelligence to a wider understanding and prediction of trends in demographics and psychographics. Financial, healthcare, and Government industries are some of the most mature domain areas of application. Learn more about our work in financial services / Government services.

We are currently offering an initial population synthesis capability to public and commercial organizations, which provides the following key value propositions:

Privacy and Security

Production data should never be used for testing scenarios and policies. Obfuscation and encryption are not valid security techniques for this purpose; synthetic data is.

Performance

Able to produce test data in seconds or minutes with full traceability.

Flexibility

Ability to track necessary use cases to support tests, and to define forecast scenarios for testing decision-making outcomes.

Automation

Minimize overhead time for testing, allowing teams to shift left their quality controls and automate metrics.

Integration

Ability to synthesize data in your own environment through a toolkit.

Data quality

Maintains referential integrity across people, households, jobs, and businesses, just as a real population would.
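As an illustration of what referential integrity means for a synthetic population, the following minimal sketch checks that every person points to an existing household and every job points to an existing person and business. The record layout (dictionaries with household_id, person_id, business_id and job_id keys) and the find_broken_references function are assumptions made for this example, not the SimoN data model.

```python
def find_broken_references(people, households, jobs, businesses):
    """Return (table, record_id, missing_key) tuples for dangling references.

    Assumption: records are plain dicts with hypothetical key names.
    """
    household_ids = {h["household_id"] for h in households}
    person_ids = {p["person_id"] for p in people}
    business_ids = {b["business_id"] for b in businesses}

    problems = []
    for person in people:
        if person["household_id"] not in household_ids:
            problems.append(("people", person["person_id"], "household_id"))
    for job in jobs:
        if job["person_id"] not in person_ids:
            problems.append(("jobs", job["job_id"], "person_id"))
        if job["business_id"] not in business_ids:
            problems.append(("jobs", job["job_id"], "business_id"))
    return problems


# Toy data with two deliberate errors: person 11 references a missing
# household, and job 1001 references a missing person.
households = [{"household_id": 1}]
people = [
    {"person_id": 10, "household_id": 1},
    {"person_id": 11, "household_id": 2},
]
businesses = [{"business_id": 100}]
jobs = [
    {"job_id": 1000, "person_id": 10, "business_id": 100},
    {"job_id": 1001, "person_id": 12, "business_id": 100},
]

problems = find_broken_references(people, households, jobs, businesses)
```

An empty result would indicate that all cross-references resolve, which is the property a test dataset needs before it is fed to downstream tax software.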

Interested in learning more about our SYNTHETIC DATA capabilities or requesting a demo?
Contact us
