What is Intelligent Document Processing?

This article gives an in-depth explanation of Intelligent Document Processing.

Data has become the cornerstone of most businesses. It can give meaningful insights, generate leads and sales, and help a company grow. But this abundance of data must first be captured and interpreted. The first problem is capturing data from documents. Most companies are not able to extract data from unstructured documents effectively, and therefore they do not use that data to improve their business. The second problem arises when companies do collect data but do not know how to extract and interpret it. These companies spend too much time manually extracting unstructured data, without getting the insights they want.

There are 3 types of data a company encounters:

Structured data is organized in a predictable, orderly pattern. Spreadsheets, sales data, website analytics and ERP systems are all examples of structured data. If column A represents currencies, A1 might say ‘euros’, A2 might say ‘dollars’, but A3 wouldn’t say ‘water’. Structured data is usually composed of numbers or values, which makes it easy for Optical Character Recognition to extract, interpret and classify the information.

Semi-structured data comes from documents that are generally the same but sometimes have variations. Say a bank uses a corporate KYC form. It’s a template that is structured and systematized, but it might be necessary to have different KYC templates for each industry. Sometimes templates have to be updated for new regulations, or the signature location moves from the right to the left. The data is structured, but less predictable. OCR cannot handle the variation, so humans must step in. 

Unstructured data comes in highly unsystematized, unpredictable forms because of the variety of formats: email, chat, social media posts, sensor data, IoT, audio and video files. It is qualitative rather than quantitative in nature. A document will not have a designated location to extract information from. Think of looking for vehicle information on a loan application form (structured) versus having to figure out vehicle information from a social media profile (unstructured). More effort goes into extracting “Mercedes SL AMG” from Bob’s Facebook post than from the loan application. By necessity, extraction, interpretation and classification have largely been a human job, albeit a tedious one.

The problem with all this data is that around 80% of total enterprise data is unstructured and can’t be analysed as is. Processing this unstructured data is time-consuming and demotivating for employees. How can organisations transform it into valuable, structured data without overburdening their employees?

IDP might be the answer.

What exactly is Intelligent Document Processing?

Intelligent Document Processing, or IDP, is a technology based on AI and machine learning that allows organisations to automate data extraction from complex, unstructured documents and convert it into usable data. IDP uses different technologies to extract, interpret and categorize relevant data. Before it can be implemented, an IDP system must be trained on a number of different example documents. Afterwards, the system automatically extracts the relevant data. When the trained system is not sure whether the extracted data is correct, it asks for human validation, which is used to continually improve the algorithm. The different sub-technologies used within IDP are Artificial Intelligence (AI), Machine Learning (ML), Optical Character Recognition (OCR), Computer Vision, Robotic Process Automation (RPA), and Intelligent Character Recognition (ICR). IDP can be a huge timesaver for businesses that still manually extract data from documents.

Because all these technologies work seamlessly together, an IDP system can learn from itself. This means that organisations can automate the data extraction from complex and varying document types and emails.

What is the difference between Intelligent Document Processing and Optical Character Recognition?

IDP is different from OCR or data capture. IDP does use OCR technology, but it is broader than that: it incorporates many more technologies that help the IDP system make well-considered decisions. IDP is also different from Robotic Process Automation. RPA is a separate, single task that runs on a data-driven, trained model, but this model can only perform the one repetitive task it was trained on. RPA does not have the ability to understand and interpret the data like an IDP system. For lack of a better term, RPA is a ‘dumb’ system.

The different steps within Intelligent Document Processing

1. Data input

ingesting documents with user interface or API
The IDP workflow starts with the input of documents or emails. This data can either be digital or paper based. The IDP software can interpret digital documents, but it can also extract data from photographed or scanned handwritten/physical documents. IDP can gather these documents from either a user interface or an Application Programming Interface. Once the documents are uploaded to the IDP system, the processing can begin.

2. Pre-processing

Pre-Processing OCR, cleanup, word bounding boxes

This step prepares the document before it is analysed, because the reliability of IDP depends on clean, accurate and correct data. The documents are categorized and classified during this step and made ready for conversion. The IDP system does a preliminary clean-up: it removes stains, splits larger documents into their various parts, enhances images and removes noise. This way, the IDP system has the best chance of getting an accurate result. Some of the techniques used are:

Noise removal

When the software notices stains, marks or scribbles on the page that are unnecessary, it removes them so that the OCR has the clearest possible image of the relevant information.

De-skewing

Some documents may have been scanned or photographed at an angle, causing the content to be skewed. This can make it difficult for the OCR to recognise text or other objects. The software can notice when characters or signatures are at an unusual angle, and it can rotate and straighten the image so that the scan or photo looks square.
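As a rough illustration of the idea, the sketch below estimates the skew angle from two points on a detected text baseline and rotates coordinates back so the baseline becomes horizontal. It is a minimal sketch under stated assumptions: the function names are hypothetical, the baseline points are assumed to be given already, and a real IDP system would estimate the angle from the image itself and resample the pixels.

```python
import math


def skew_angle_degrees(baseline_start, baseline_end):
    """Angle between a text baseline and the horizontal axis, in degrees."""
    dx = baseline_end[0] - baseline_start[0]
    dy = baseline_end[1] - baseline_start[1]
    return math.degrees(math.atan2(dy, dx))


def rotate_point(point, angle_degrees, origin=(0.0, 0.0)):
    """Rotate a point around origin using the standard 2D rotation matrix."""
    theta = math.radians(angle_degrees)
    x = point[0] - origin[0]
    y = point[1] - origin[1]
    return (
        origin[0] + x * math.cos(theta) - y * math.sin(theta),
        origin[1] + x * math.sin(theta) + y * math.cos(theta),
    )


def deskew_points(points, baseline_start, baseline_end):
    """Rotate points by the negative skew angle so the baseline is level."""
    angle = skew_angle_degrees(baseline_start, baseline_end)
    return [rotate_point(p, -angle, origin=baseline_start) for p in points]


# A baseline that runs from (0, 0) to (10, 10) is skewed by 45 degrees.
print(skew_angle_degrees((0, 0), (10, 10)))
```

After de-skewing, every point on the original baseline ends up with the same vertical coordinate as the baseline start, which is what makes the text lines horizontal again.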

Binarization

This is a technique that converts a coloured image into black and white pixels. After binarization, an image consists of only two types of pixels: pure black and pure white. This makes a clear distinction between the background and the foreground (text). Again, this helps the OCR to recognise the characters.
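The simplest form of binarization can be sketched in a few lines. The example assumes the image has already been converted to grayscale values between 0 and 255 and uses a fixed threshold of 128; both are assumptions made for illustration. Production systems often pick the threshold adaptively, for example with Otsu's method.

```python
def binarize(gray_image, threshold=128):
    """Map each grayscale pixel (0-255) to pure black (0) or pure white (255)."""
    return [
        [255 if pixel >= threshold else 0 for pixel in row]
        for row in gray_image
    ]


# Light background pixels become white, dark text pixels become black.
page = [
    [250, 245, 30],
    [240, 20, 25],
]
print(binarize(page))  # [[255, 255, 0], [255, 0, 0]]
```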

3. Intelligent document classification

Document Classification page management, language detection

Most of the information that organisations need to analyse comes in multi-page documents with several types and formats of information. The success of data extraction is partly based on the software’s ability to separate the different pieces of data by format and to route these pieces of information into the right workflow to be extracted.

A loan application, for instance, consists of bank statements, tax returns, pay stubs, credit history and ID information. The IDP software will analyse the text-based data and split all this information into the different, relevant parts. When it comes to classifying images, like a photograph of the ID, computer vision algorithms extract the relevant data.

This step of Intelligent Document processing is typically human-in-the-loop. The software will ask for human input when it isn’t sure about the accuracy of its determination. This human feedback is then used to further train and improve the algorithm.
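To make the routing and the human-in-the-loop idea concrete, here is a deliberately simple sketch. It swaps the trained classification model for plain keyword counting, so it is not how an IDP system actually classifies documents; the labels, keyword lists and the `min_hits` parameter are all hypothetical. What it does show is the pattern of returning a label only when the evidence is strong enough, and asking for human input otherwise.

```python
import re

# Hypothetical keyword lists per document type (illustration only).
KEYWORDS = {
    "bank_statement": {"balance", "statement", "transaction"},
    "tax_return": {"tax", "taxable", "deduction"},
    "pay_stub": {"salary", "gross", "net", "employer"},
}


def classify_page(text, min_hits=2):
    """Return the best matching document type, or ask for human review."""
    words = set(re.findall(r"[a-z]+", text.lower()))
    scores = {label: len(words & kws) for label, kws in KEYWORDS.items()}
    best_label = max(scores, key=scores.get)
    if scores[best_label] < min_hits:
        # Not enough evidence: route the page to a person instead of guessing.
        return "needs_human_review"
    return best_label


print(classify_page("Account statement, closing balance: 1200"))
print(classify_page("Hello world"))
```

In a real system the keyword score would be replaced by the model's own certainty, and the pages sent for review would later be added to the training data.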

4. Data extraction

Extract Information: names, dates, places, payments

The most important part of intelligent document processing is the data extraction. This is the (relatively self-explanatory) stage where the IDP software extracts all the needed data from the pre-processed documents. IDP software has different models (neural networks) that are trained to extract different formats and types of information. OCR extracts printed text from images and scanned documents, Intelligent Character Recognition (ICR) handles handwritten text, and Natural Language Processing (NLP) is responsible for interpreting the extracted text. The software also determines the type of data it is extracting and labels dates, addresses, order numbers, names, etc. correctly.

Data extractions are done in 2 parts:

Key value pair extraction

This is the extraction of data based on a unique identifier that accompanies the data. For example, wherever a name is located on an ID card, it will always be accompanied by the word ‘name’ (in the language of the country that issued the ID). The software can use the word ‘name’ to help identify the name of the person.

Table extraction

Table extraction is quite self-explanatory; it is the extraction of data from a table or an array.
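As a minimal sketch of what table extraction produces, the example below turns recognised text lines into one record per row. It assumes the OCR step has already returned each table row as a line of text in which columns are separated by two or more spaces; that layout, and the function name, are assumptions for illustration. Real tables usually require the word bounding boxes from pre-processing to find the columns.

```python
import re


def parse_table(lines):
    """Turn text lines (header first) into a list of row dictionaries."""
    rows = [re.split(r"\s{2,}", line.strip()) for line in lines if line.strip()]
    if not rows:
        return []
    header, *body = rows
    return [dict(zip(header, row)) for row in body]


ocr_lines = [
    "Item       Qty   Price",
    "Paper A4   10    4.50",
    "Stapler    1     12.00",
]
print(parse_table(ocr_lines))
```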

Then there are 3 ways to extract the data:

OCR: OCR is the first step in data extraction. This piece of software recognises the characters and translates them into blocks of data. This is an essential part of IDP, but there are a few things that can go wrong.

  • Word detection error: This occurs when the software fails to detect a certain word, image, or block of text. This is almost always caused by poor image quality or bad pre-processing.
  • Word segmentation error: This happens when the system fails to interpret a word correctly because the spaces between words were detected incorrectly, the text uses different alignments, or the spacing between characters isn’t perfect.
  • Character segmentation error: This is an error caused by the inability to differentiate between individual characters. It is mostly caused by cursive handwriting or scripts with connected letters.
  • Character recognition error: This occurs when the system fails to identify the right character and therefore creates an incorrect word.

Most of these errors occur when OCR is the only form of data extraction used. IDP uses AI and Machine Learning to pre-process the documents, to prevent these errors from happening.

Rule based extraction: Rule based extraction models work very well for structured texts and reasonably well for semi-structured texts. This model identifies data based on a key value pair. For example, it can look for the account number, because that will be the string of numbers following “Account Number” or “Account No.” So, the model only has to look for that key to find the data it’s looking for.
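The account number rule can be sketched with a regular expression. The pattern below is an assumption about what such a rule might look like: it accepts “Account Number” or “Account No.”, an optional colon or hash sign, and digits that may be grouped with spaces or dashes. It is an illustration, not a rule taken from any particular IDP product.

```python
import re

# Key: "Account Number" or "Account No.", value: the digits that follow it.
ACCOUNT_PATTERN = re.compile(
    r"Account\s+(?:Number|No\.?)\s*[:#]?\s*(\d[\d\s-]*\d)",
    re.IGNORECASE,
)


def extract_account_number(text):
    """Return the account number as a digit string, or None if no key is found."""
    match = ACCOUNT_PATTERN.search(text)
    if match is None:
        return None
    # Drop the spaces and dashes used for grouping the digits.
    return re.sub(r"[\s-]", "", match.group(1))


print(extract_account_number("Account No.: 1234-5678 90"))  # 1234567890
print(extract_account_number("Name: Bob"))                  # None
```

The example also shows why this approach struggles with unstructured text: as soon as the key is missing or phrased differently, the rule finds nothing.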

Learning based approach: Hybrid data extraction that combines OCR with deep learning and machine learning relies on supervised or unsupervised training to improve its neural networks. The more processed documents, training and feedback a model has had, the more accurate and efficient it will be.

5. Data validation

Data validation: validate quality and accuracy of the data

During the post-processing validation stage, ML models validate the quality and accuracy of the extracted data. They correct common misspellings or adjust the fonts to match the standard formatting. The data also has to be validated against pre-determined algorithms and rules. When the data doesn’t conform to these rules, it is flagged for human validation which, in turn, trains the software to be more accurate in the future. For example: a signed contract needs two signatures to be valid. When the document contains only one signature, the system flags it as invalid and sends it for review.

Human review: There is not yet an IDP model that can extract data with 100% accuracy. Therefore, the system always gives a certainty score for the extracted data: the model is X% sure the extracted data is correct. When this score is below the threshold set by the organisation, the system automatically sends a request for human review. When a document is reviewed, the system learns from its mistakes and gets more accurate. The more data is reviewed by a human, the more accurate the data extraction model gets.
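The two checks described above, a business rule and a certainty threshold, can be combined in a short sketch. The class, the function names and the default threshold of 0.9 are hypothetical; in practice the threshold is whatever the organisation configures.

```python
from dataclasses import dataclass


@dataclass
class ExtractedField:
    name: str
    value: str
    confidence: float  # model certainty between 0.0 and 1.0


def fields_needing_review(fields, threshold=0.9):
    """Names of fields whose certainty score is below the threshold."""
    return [f.name for f in fields if f.confidence < threshold]


def validate_contract(signature_count, fields, threshold=0.9):
    """Return the reasons a contract must go to human review (empty if none)."""
    reasons = []
    if signature_count < 2:
        # Business rule from the example: a contract needs two signatures.
        reasons.append("fewer than two signatures")
    for name in fields_needing_review(fields, threshold):
        reasons.append(f"low confidence: {name}")
    return reasons


fields = [
    ExtractedField("name", "Bob", 0.98),
    ExtractedField("date", "2021-01-05", 0.62),
]
print(validate_contract(1, fields))
# ['fewer than two signatures', 'low confidence: date']
```

An empty list means the document can move on to the output step without human intervention.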

6. Data integration/output

Output: expose results to external systems with API's/RPA

When the relevant data has been extracted and validated to make sure it’s correct, it is assembled and passed on to a database or a business process via an API.
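What gets passed on is typically a structured record. The sketch below assembles one as JSON; the field names and the overall shape are assumptions made for illustration, since every IDP product and receiving system defines its own schema.

```python
import json


def build_payload(document_id, document_type, fields):
    """Assemble validated extraction results into a JSON string."""
    payload = {
        "document_id": document_id,
        "document_type": document_type,
        "fields": fields,
    }
    return json.dumps(payload, sort_keys=True)


body = build_payload(
    "doc-001",
    "loan_application",
    {"name": "Bob", "vehicle": "Mercedes SL AMG"},
)
print(body)
```

The resulting string can then be sent to a database or business process through whatever API the receiving system exposes.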

Benefits of Intelligent Document Processing compared to manual data extraction and template-based systems

Time saver/faster document processing

    • Save up to 90% of the time spent on data extraction compared to manual extraction. An employee needs about one minute for short documents and up to hours for documents with 100+ pages, while an IDP system can do both in about 30 seconds.

Improved accuracy

    • Up to 99.99% accurate, depending on document complexity and length.

Improved productivity

    • Because IDP systems work (almost) entirely automatically, your employees have a lot more time to work on other things.

Process any document

    • IDP transforms all physical, paper documents into digital versions that are sharable with peers. This helps the digital transformation in the organisation.

Cost efficient

    • Employees no longer have to spend hours manually extracting data from documents and emails, and automated extraction reduces the risk of human error. This combination can lead to a cost reduction of up to 70%.

Business wide automation

    • IDP systems are easily integrated with existing automated systems. This means that IDP can help an organisation to create a fully integrated RPA system.

Scalability

    • Because IDP isn’t tied to one process at a time, it is scalable. This means that it can handle 1000+ documents at the same time.

How to get started with Intelligent Document Processing?

Industries where Intelligent Document Processing is relevant:

  • IDP for Banking
    • Identity & income verification documents
    • Tax returns
    • Application forms
    • Profit & loss statements
    • Mailroom automation
  • IDP for Insurance
    • Claims management
    • Detecting fraud
    • Policy administration
    • Automated underwriting process
  • IDP for Logistics
    • Bill of lading
    • Shipping labels
    • Packing list
    • Certificates
  • IDP for Commercial real estate
    • Rent rolls
    • Trailing 12 months
    • Offering memorandum
    • Operating statements
    • Certificates & receipts
  • IDP for Human Resources
    • Analyse resumes
    • Analyse key data from job applications
    • Faster employee onboarding/offboarding
    • Extracting employee contracts
    • Processing satisfaction surveys
  • IDP for Legal
    • Easier case reviews
    • Faster contract administration
    • Archive digital documents
    • Detect fraud
  • IDP for Medical
    • Better patient history
    • Easier patient admission

Why is training so important?

How is an Intelligent Document Processing model trained?

The training of the Artificial Intelligence model starts with looking at the example documents that the company wants to automate and the list of data that needs to be extracted. The IDP company will offer guidance and ideas on how to set up the project settings correctly. They make recommendations on how much data will be needed to achieve the desired automation rate.

Next up, the training data will need to be created. Then you can train your initial models, test their accuracy and make changes to the data where necessary.

Lastly, when you are ready to deploy your model, the benefits of automation start immediately. Most IDP systems have analytics modules where you can monitor the performance of the model. Where needed, human validation can correct mistakes in the training data or add training data to improve performance and accuracy.

What makes Metamaze unique is the quality of annotations. A machine learning model can only be as good as the training data that was used. If you don’t annotate all occurrences of a field, the model will be confused whether it needs to extract this certain field or not.

In the world of AI technology, annotators that don’t make mistakes don’t exist. Annotation is hard and humans make mistakes. Even the best, highly educated engineers, accountants or sales reps will make some mistakes on any specific task. Therefore, some companies resort to annotating every document twice from scratch. This is a huge drain on human capital and still leaves an error margin of 2%.

Depending on document type, after the first annotation, as much as 35% of documents can still contain (small) mistakes. Even after a first full human review, that number only drops to 10%. Therefore, Metamaze focuses on quality of annotation over quantity. Fixing annotation mistakes can easily lead to a 10-20% improvement of your model and takes considerably less effort than annotating extra documents from scratch. In addition, adding new documents will not help the accuracy of the model if the existing documents still contain mistakes. You need to add about 5 times more annotated training data to your neural network to get an accuracy improvement similar to that of just fixing the original dataset. So, we always recommend correcting the existing data first before you add new data.

Metamaze uses review tasks to help correct mistakes. These review tasks are automatically suggested based on data that might be misannotated. This is a kind of automated testing for annotations. These suggested review tasks are far more efficient than adding new documents for annotation or checking all annotated documents for mistakes. Due to the use of AI and deep learning, the system will select which documents need an extra review and which you can safely skip.
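One common way to find annotations that might be wrong is to compare them with what a trained model predicts and to flag confident disagreements. The sketch below illustrates that general idea only. It is not a description of how Metamaze selects review tasks, and the record layout, function name and `min_confidence` parameter are hypothetical.

```python
def suggest_review_tasks(documents, min_confidence=0.8):
    """Return ids of documents whose annotation the model confidently disputes."""
    tasks = []
    for doc in documents:
        disagrees = doc["annotated"] != doc["predicted"]
        if disagrees and doc["confidence"] >= min_confidence:
            tasks.append(doc["id"])
    return tasks


documents = [
    {"id": "a", "annotated": "invoice_date", "predicted": "invoice_date", "confidence": 0.95},
    {"id": "b", "annotated": "due_date", "predicted": "invoice_date", "confidence": 0.91},
    {"id": "c", "annotated": "due_date", "predicted": "invoice_date", "confidence": 0.40},
]
print(suggest_review_tasks(documents))  # ['b']
```

Only document "b" is suggested: the model disagrees with the annotation and is confident about it, so a second look is more likely to uncover an annotation mistake there than in the other two.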

How Metamaze is unique in the Intelligent Document Processing market

  • Unique intelligent data quality improvement
    • Metamaze implements data-driven AI to improve the IDP system. Data-driven AI is the use of a secondary AI to determine what documents are useful to manually train the primary AI. This way the main AI is trained in the most efficient way possible. This eliminates the need for more human validation training than necessary.
  • No-code platform
    • Our platform does not require any coding or AI knowledge to use. Everything is very intuitive and easy to use and adjust.
  • One-stop-shop
    • Metamaze offers a single solution for all your different document processing needs. Different platforms for different document types are therefore history.
  • Advanced quality assurance model
    • At Metamaze, the IDP system has an advanced model to determine when human intervention is needed.
  • Multilingual by design
    • The smart AI platform is integrated with 50+ languages. This means that documents in foreign languages do not have to be translated before they can be processed.
  • Built to integrate any document type and any other smart system within a company
    • Every document or automation software can be connected to Metamaze and work flawlessly together.

