Understanding Data Extraction for Recovery Audit
There are several data extraction techniques for recovery audit. In this article, we will examine one of the most prominent: Human-in-the-loop.
Human-in-the-loop (HITL) data extraction techniques have been around since the introduction of Optical Character Recognition (OCR) software. Initially, HITL processes were seen as a series of quality checks or validations to make sure the OCR tools collected data properly. Now that machine learning is becoming mainstream, the role of the human within the data extraction process is becoming far more complex and essential.
In almost every industry, a significant amount of ‘critical’ information needs to be accessible at a moment’s notice but may be buried deep in an email, PDF, or Word document. Most of these documents have no fundamental data structure or content organization, so getting at that ‘critical’ content when you need it is a tremendous challenge.
In a recovery audit, for example, essential details written in legal contracts, invoices, pricing agreements, and nearly everything in between are frequently locked away and inaccessible.
Clearly, there is a need to extract and store this valuable data in a structured format for business analytics. The question is how to do it.
In a perfect world, you might have an analyst review each document, interpret the content, and key the relevant elements into a database. In reality, this may not be feasible. If nothing else, the sheer size and scope of the content can rule out manual review.
Instead, we need to deploy a more scalable approach that utilizes both the insights of a human and the scalability of a computer application. This process is commonly called Human-in-the-loop machine learning.
For example, let’s start with a simple use case: Suppose your company has several thousand suppliers. You want to aggregate the open items listed on each supplier’s aged trial balance to track invoice numbers, due dates, reference IDs, open credits, or other details.
To accomplish this task, you will first need to create a collection of related, semantic-based, real-world logic and knowledge. This collection of knowledge is referred to as an Ontology.
An Ontology is a compilation of content libraries designed to translate and interpret your documents’ contents accurately. It holds not just definitions and knowledge but also the details of how it all connects semantically to your project. As an analogy, if the Ontology were a part of the human body, it would not be the brain; rather, it would be all of the knowledge, experiences, and connections collected inside the brain.
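To make the idea more concrete, here is a minimal sketch of what a very small Ontology might look like in Python. Everything in it is an assumption made for illustration: the concept names, the supplier labels, the `belongs_to` relation, and the helper `canonical_field` are all hypothetical, and a real Ontology would be far richer than a dictionary.

```python
from typing import Optional

# Hypothetical, tiny ontology: each canonical concept lists the labels
# suppliers might print for it, its data type, and how it relates to
# other concepts. All names here are illustrative assumptions.
ONTOLOGY = {
    "invoice_number": {
        "labels": ["invoice number", "invoice no", "inv #"],
        "type": "identifier",
    },
    "due_date": {
        "labels": ["due date", "payment due"],
        "type": "date",
        "belongs_to": "invoice_number",
    },
    "open_credit": {
        "labels": ["open credit", "credit memo"],
        "type": "amount",
        "belongs_to": "invoice_number",
    },
}


def canonical_field(label: str) -> Optional[str]:
    """Map a label as printed on a statement to its canonical concept."""
    needle = label.strip().lower()
    for concept, entry in ONTOLOGY.items():
        if needle in entry["labels"]:
            return concept
    return None  # unknown label: a candidate for human review
```

With this sketch, two suppliers that print "Inv #" and "Invoice No" would both resolve to the same `invoice_number` concept, while an unrecognized label returns `None`.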
From there, you will need to start with a basic understanding of what data you would like to collect. Let’s say you want to extract several different items, including dollar amounts, payment due dates, or even payments that are not currently offset.
With those data needs established, you will need to create a set of training data. To make the first batch of training data, an analyst can review a couple of sample documents and show the system which data is relevant and what it looks like.
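One common way to record what the analyst marks up is as labeled character spans in the sample text. The sketch below assumes that representation; the `Annotation` class, the `annotate` helper, and the sample statement line are hypothetical and not taken from any particular tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Annotation:
    """One analyst-labeled span: which field it is and where it sits."""
    label: str
    start: int
    end: int


def annotate(text: str, label: str, value: str) -> Annotation:
    """Record the first occurrence of `value` in `text` as a labeled span."""
    start = text.find(value)
    if start == -1:
        raise ValueError(f"{value!r} does not occur in the sample text")
    return Annotation(label, start, start + len(value))


# Hypothetical line from a supplier statement, labeled by an analyst.
sample = "Invoice INV-1001 due 2023-04-30 amount $1,250.00"
training_data = [
    annotate(sample, "invoice_number", "INV-1001"),
    annotate(sample, "due_date", "2023-04-30"),
    annotate(sample, "amount", "$1,250.00"),
]
```

Storing positions rather than just values keeps the surrounding context available, which is what lets a system learn what the data looks like and not only what it says.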
In essence, your analysts will be showing the ML system what patterns and information it will need to identify and collect. The application will create a set of algorithms to replicate your steps and manage the related content.
With the training data set, you can then run the bulk of your supplier statements through the system and wait to see if the machine extracts the information you desire. To be effective at scale, a machine learning system needs to have the ability to recognize and report when the target data is not successfully identified.
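The requirement that the system report its own misses can be sketched in a few lines. In the example below, regular expressions stand in for the learned model purely to keep the code short; the field names, the patterns, and the `ExtractionResult` type are assumptions for illustration.

```python
import re
from dataclasses import dataclass, field

# Stand-in "model": one hypothetical pattern per target field.
PATTERNS = {
    "invoice_number": re.compile(
        r"Invoice\s*(?:No\.?|#)\s*:?\s*([A-Z0-9-]+)", re.IGNORECASE
    ),
    "due_date": re.compile(
        r"Due\s*Date\s*:?\s*(\d{4}-\d{2}-\d{2})", re.IGNORECASE
    ),
    "amount": re.compile(r"Amount\s*:?\s*\$?([\d,]+\.\d{2})", re.IGNORECASE),
}


@dataclass
class ExtractionResult:
    values: dict = field(default_factory=dict)   # fields that were found
    missing: list = field(default_factory=list)  # fields that were not

    @property
    def needs_review(self) -> bool:
        """True when at least one target field could not be extracted."""
        return bool(self.missing)


def extract(text: str, patterns=PATTERNS) -> ExtractionResult:
    """Extract every target field and record the ones that were missed."""
    result = ExtractionResult()
    for name, pattern in patterns.items():
        match = pattern.search(text)
        if match:
            result.values[name] = match.group(1)
        else:
            result.missing.append(name)
    return result
```

A statement line that follows the expected layout comes back with an empty `missing` list, while one written as "Inv. 2002 payable by 30 April" comes back with all three fields flagged, so it can be routed to an analyst.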
When the system fails to collect the desired information, it should periodically and interactively alert its human counterpart. The human will then look at the places where the machine is hung up and demonstrate how to interpret the data.
Effectively, the human creates a new set of training data, which is then incorporated into the system to make it more robust. This perpetual cycle of testing content and then adding new data capturing methods is called ‘Human-in-the-loop.’
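The cycle described above can be sketched as a small loop. This is a simplified illustration under stated assumptions, not a real machine learning pipeline: a list of regular expressions stands in for the trained model, and `run_batch`, `human_in_the_loop`, and the `ask_human` callback are hypothetical names.

```python
import re


def run_batch(documents, patterns):
    """Try every pattern on every document; queue documents with no match."""
    extracted, review_queue = {}, []
    for doc_id, text in documents.items():
        for pattern in patterns:
            match = pattern.search(text)
            if match:
                extracted[doc_id] = match.group(1)
                break
        else:  # no pattern matched this document
            review_queue.append(doc_id)
    return extracted, review_queue


def human_in_the_loop(documents, patterns, ask_human, max_rounds=3):
    """Alternate machine extraction with human feedback on the failures.

    `ask_human` receives the text of a flagged document and returns a new
    compiled pattern, or None if the analyst cannot help with it.
    """
    extracted = {}
    pending = dict(documents)
    for _ in range(max_rounds):
        found, review_queue = run_batch(pending, patterns)
        extracted.update(found)
        pending = {doc_id: pending[doc_id] for doc_id in review_queue}
        if not pending:
            break
        # The analyst demonstrates how to read each flagged document.
        for text in pending.values():
            new_pattern = ask_human(text)
            if new_pattern is not None and new_pattern not in patterns:
                patterns.append(new_pattern)
    return extracted, list(pending)
```

In use, `ask_human` would open a review screen for the analyst; in a test it can be a plain function that returns a pattern for documents it recognizes. The second return value lists the documents that are still unresolved after `max_rounds` extraction passes.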
Human-in-the-loop data extraction techniques can also allow machine learning systems to learn shapes, sounds, and orientations. These are especially useful where the data extraction needs involve handwriting or voicemail.
For more information on data extraction for recovery audit or Human-in-the-loop machine learning, here is an article from Wikipedia.