
Predictive Analytics for Tax

  • Writer: vamshi chamala
  • Apr 3, 2019
  • 3 min read

Updated: Apr 3, 2019

Taxes! Taxes! Yes, April 15th is just around the corner, and many of us will be filing soon. So, what does big data have to do with taxes? Well, the IRS reports that about 280 million tax returns are filed every year by individuals and corporations combined. That is a lot of returns, so big data makes a lot of sense here.


While many of us have used TurboTax and similar software for our individual returns, companies like H&R Block, Liberty Tax, and others file taxes on behalf of other individuals and businesses. They need a tax software tool to file millions of returns for their customers. And rather than just file taxes for their clients, they want a value-added service: run analytics on the returns and identify which clients would be impacted by which tax law change, not for the past tax year but for the upcoming one. That way they can notify their clients ahead of time as new tax law changes are enacted throughout the year.


So here is the story of how we implemented a multi-tenant SaaS application for a major tax firm in the Dallas area, providing technical and architectural leadership along the way.


When we started the project we had no idea about this domain, though we had the technical expertise in big data. So, after several long brainstorming sessions and close partnership with the customer, we came up with an initial design that split the system into three major tiers: ingestion, transform, and model & serve, along with overall orchestration of the big data pipeline.


Ingestion


The ingestion tier was responsible for ingesting data from various legacy systems into Azure.


• Tax return data, stripped of PII, was extracted as one JSON document per return into Azure Data Lake (see the upload sketch after this list).

• Tax law changes came in two parts:

• Content metadata about each tax law change went into an Azure Cosmos DB document database.

• The actual content of the changes was stored in Azure Data Lake / Blob storage.
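
To make the ingestion step concrete, here is a minimal sketch, assuming the Azure Data Lake Storage Gen2 Java SDK (azure-storage-file-datalake) and hypothetical account, file system, and path names, of landing one de-identified return JSON document in the data lake:

import com.azure.storage.common.StorageSharedKeyCredential;
import com.azure.storage.file.datalake.DataLakeFileClient;
import com.azure.storage.file.datalake.DataLakeServiceClient;
import com.azure.storage.file.datalake.DataLakeServiceClientBuilder;

public class ReturnIngestor {
  public static void main(String[] args) {
    // Connect to the storage account (account name and key are placeholders).
    DataLakeServiceClient service = new DataLakeServiceClientBuilder()
        .endpoint("https://<account>.dfs.core.windows.net")
        .credential(new StorageSharedKeyCredential("<account>", "<key>"))
        .buildClient();

    // One JSON document per return, organized by tax year (hypothetical layout).
    DataLakeFileClient file = service
        .getFileSystemClient("tax-returns")
        .getFileClient("2018/return-000001.json");

    // Upload the de-identified return, overwriting any previous copy.
    file.uploadFromFile("/staging/return-000001.json", true);
  }
}

The same pattern applies to the tax law content files, while the content metadata documents go to Cosmos DB through its own SDK.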


Transform


The next step was to prepare these massive amounts of data to be queried as part of the big data analytics.

• The first step was to convert the JSON files into indexed sequence files, since raw JSON is not optimal for query performance while the binary sequence file format is far more efficient to scan. This was implemented as a MapReduce job that reads the JSON files and writes them out as sequence files (see the MapReduce sketch after this list).

• The next step was to convert the natural language description of a tax law change into a Pig query, so that each change could be modeled as a query against the indexed sequence files. This was implemented with ANTLR: we defined a grammar, and the generated parser performed the conversion.

• Finally, the analytics ran as another MapReduce job that executed the generated Pig scripts and identified the returns impacted by each tax law change, modeled as a Tax Event (see the Pig sketch after this list).
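
Here is a minimal sketch of the JSON-to-sequence-file conversion described above. It assumes one JSON return per input line; the key derivation is a placeholder, and a real job would parse the JSON (for example with Jackson) to extract a stable return id:

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;

public class JsonToSequenceFileJob {

  // Map-only job: each input line is the JSON document for one tax return.
  public static class JsonMapper extends Mapper<Object, Text, Text, Text> {
    @Override
    protected void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      String json = value.toString();
      // Placeholder key derivation; a real job would parse the JSON for the return id.
      String returnKey = Integer.toHexString(json.hashCode());
      context.write(new Text(returnKey), new Text(json));
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "json-to-sequencefile");
    job.setJarByClass(JsonToSequenceFileJob.class);
    job.setMapperClass(JsonMapper.class);
    job.setNumReduceTasks(0);                                 // map-only conversion
    job.setInputFormatClass(TextInputFormat.class);
    job.setOutputFormatClass(SequenceFileOutputFormat.class); // write sequence files
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(Text.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));     // raw JSON input
    FileOutputFormat.setOutputPath(job, new Path(args[1]));   // sequence file output
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}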
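
And here is a sketch of driving a generated Pig script from Java with PigServer. The script path and the "impacted" relation alias are hypothetical; the script is assumed to load the converted return files and filter them with the predicate emitted by the ANTLR-generated translator for a given Tax Event:

import org.apache.pig.ExecType;
import org.apache.pig.PigServer;

public class TaxEventPigRunner {
  public static void main(String[] args) throws Exception {
    // Run the Pig pipeline on the cluster using the MapReduce execution engine.
    PigServer pig = new PigServer(ExecType.MAPREDUCE);

    // Register the script generated for one Tax Event (hypothetical path).
    pig.registerScript("/scripts/tax-event-1234.pig");

    // Persist the impacted returns; "impacted" is the relation alias assumed
    // to be defined inside the generated script.
    pig.store("impacted", "/output/tax-event-1234");
  }
}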


Model & Serve


Now that we had our insights, we still needed a way to model and relate them so that the customer could easily view and filter them. While we started off exploring a graph database to model the relationships between Tax Events, Returns, and Users in a multi-tenant scenario, we ended up using Azure SQL. Stay tuned for a later post on that.
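
As a rough illustration of the serving side, here is a sketch that queries Azure SQL over JDBC for one tenant's returns impacted by one Tax Event. The connection string and the table and column names (TaxEvents, ImpactedReturns, TenantId) are hypothetical, not the actual schema:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public class ImpactedReturnQuery {
  public static void main(String[] args) throws Exception {
    // Hypothetical Azure SQL connection string.
    String url = "jdbc:sqlserver://<server>.database.windows.net:1433;"
        + "database=taxinsights;user=<user>;password=<password>;encrypt=true";

    try (Connection conn = DriverManager.getConnection(url);
         PreparedStatement stmt = conn.prepareStatement(
             "SELECT r.ReturnId, e.Title "
           + "FROM ImpactedReturns r JOIN TaxEvents e ON r.TaxEventId = e.TaxEventId "
           + "WHERE r.TenantId = ? AND e.TaxEventId = ?")) {
      stmt.setString(1, "tenant-001");   // every query is scoped to a tenant
      stmt.setString(2, "event-1234");   // the Tax Event being served
      try (ResultSet rs = stmt.executeQuery()) {
        while (rs.next()) {
          System.out.println(rs.getString("ReturnId") + " impacted by " + rs.getString("Title"));
        }
      }
    }
  }
}

In this sketch, scoping every query by TenantId is what keeps one tenant's data separated from another's.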





The key thing about big data is not the data itself but the analytics, and the insights we can provide businesses so that they can better serve their customers.


