Generative AI for Intelligent Document Processing

We believe Generative AI is the future of Intelligent Document Processing and offer a technical roadmap to get there.

Executive summary

The emergence of Generative AI and Large Language Models (LLMs) is a technological trigger with a more drastic impact than any other technology in recent history.

But LLMs are not only useful as writing assistants or in chatbot interfaces. Many document processing tasks like data entry, order booking, invoice processing, mailroom processing, … can benefit from these models.

The recent rise of “function calls” or “commands” is enabling these LLMs to connect to external resources like internet search, database lookups, or code snippet execution. This can transform task-specific or generative LLMs into goal-seeking “agents” taking actions.
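As a minimal sketch of the idea, assuming the model emits a function call as a small JSON object: the tool names (`lookup_customer`, `convert_units`), their return values, and the JSON shape are hypothetical and do not describe any specific vendor API. The point is only that the model proposes a call, and ordinary code checks it against a known set of tools and executes it.

```python
import json

# Hypothetical tools; names and return values are illustrative only.
def lookup_customer(name: str) -> dict:
    """Stand-in for a database lookup."""
    customers = {"Acme": {"customer_id": "C-001"}}
    return customers.get(name, {"customer_id": None})

def convert_units(quantity: float, per_pallet: float) -> dict:
    """Stand-in for a small code-execution tool."""
    return {"pallets": quantity / per_pallet}

# The known set of actions the model is allowed to trigger.
TOOLS = {"lookup_customer": lookup_customer, "convert_units": convert_units}

def dispatch(function_call_json: str) -> dict:
    """Execute one function call emitted by the model as JSON text."""
    call = json.loads(function_call_json)
    name = call["name"]
    if name not in TOOLS:
        # Anything outside the known set is rejected, not executed.
        return {"error": f"unknown function: {name}"}
    return TOOLS[name](**call.get("arguments", {}))

# Example model output (hand-written here; a real model would generate it).
model_output = '{"name": "lookup_customer", "arguments": {"name": "Acme"}}'
result = dispatch(model_output)
```

In a full agent, `result` would be fed back to the model so it can decide on the next step.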

For business users performing document processing tasks, generative agents will lead to a tremendous amount of value:

We estimate that the global straight-through processing (STP) rate will rise significantly, so that about 50.6% less human effort is needed to process documents.
The breadth of tasks to be automated will increase and include taking nuanced actions in external systems.
Onboarding time of new automation pipelines will be reduced by up to 90%.
Fewer - expensive - IT services will be needed, because business users get the ability to train and correct the Generative Agent themselves.

We believe Generative AI and generative agents are the future of Intelligent Document Processing and offer a technical roadmap to get there.

Who is Metamaze?

Metamaze is a platform for creating, training, evaluating, and deploying private LLMs for document and email processing pipelines. Our customers have been training and running more than 800 private LLMs in production, with thousands of versions in between. In the coming months, we aim to extend our framework from supervised to generative LLMs and agents.

Generative AI glossary

LLMs:
Large Language Models are large Transformer models. Examples include the Metamaze Hydra model, ChatGPT, GPT4, Falcon, …
Generative:
LLMs that can handle many different tasks, like ChatGPT, GPT4, Falcon, … Sometimes these are called “auto-regressive” models.
Supervised:
LLMs that are fine-tuned for a specific purpose based on a limited set of examples of the task to do, like the Metamaze Hydra model.
Generative agents:
Generative LLMs that are connected to other systems, plan steps, seek goals, … See below for more examples.
Text-based:
Models that only use text as input, like ChatGPT, GPT4, Falcon, …
Layout-aware:
Models that use text and layout as input, like the Metamaze Hydra model.
Adaptive IDP:
Intelligent Document Processing pipelines that learn from mistakes, and are easily configurable and extendable to a specific user’s needs.

What is a Generative Agent?

An AI-powered “Generative Agent” is software designed to perform tasks autonomously by using various machine learning algorithms, models, and external services. Agents learn from previous interactions to optimize their performance in achieving specific tasks or goals. Agents act in a predefined environment (known set of possible actions) based on a given input (the document or e-mail).

“Taking actions” refers to the agent’s ability to make decisions and perform tasks. These actions can range from simple tasks like

to complex tasks such as

interacting with other software systems like ERP and CRM packages,
summarization,
sending e-mails,
linking information,
nuanced understanding of language and context,
making interpretative decisions.

Crucial properties of an agent are

How generative AI can be of value for the user/business processing documents

By using Generative AI and agents, IDP platforms can offer many more capabilities besides “extraction” or “recognition”. This can greatly enhance the breadth of repetitive work that can be reliably automated – with humans still in control and in the loop.

Companies will benefit from giving their business users the capability to train, evaluate and deploy their own – private – document-processing agents based on generative AI and LLMs, with less involvement of IT needed and better results.

For companies building intelligent document processing pipelines, generative LLMs are valuable because they can

Learn nuances that are hard to analyze/describe in general rules, and therefore hard to implement in code,
Learn from business users instead of requiring - expensive - IT services,
Reduce onboarding time by up to 90% by avoiding IT services,
Reduce the need for human validation by about 50% by making better holistic decisions and taking actions,
Unlock new use cases, including requirements like

Lots of client-specific logic that is hard to describe
Dealing with bad reference data quality - a known problem for static code
Incorporating world knowledge like synonyms, writing styles and conversions of units of measurements
Dealing with edge cases

Integrating with RPA platforms would make agents even more powerful.

The needs to be filled - limitations of current IDP platforms

For companies to fully capitalize on Generative AI for their document processing needs, we believe there are three major technological gaps that need to be addressed to build a platform for creating private, generative LLMs:

1. The need for a platform for training, evaluating, and developing private LLMs in a user-friendly low-code way

Generic, one-size-fits-all solutions fall short for complex, customer-specific processes, which is why you need an Adaptive IDP platform where you can train your own models on your own data.

Document processing tasks like data entry, order booking, invoice entry, … are typically performed by users who do not have specialized IT or AI knowledge. But current open-source or commercial solutions do not allow users to train/fine-tune, evaluate, deploy and scale Generative LLMs without coding skills. We believe in democratizing access to private Generative LLMs by automating those steps and giving power to business users.

For the past 5 years, Metamaze has built an exceptional MLOps framework for training, evaluating, and deploying private LLMs. Because of that, we have both the software and the experience to scale up to private Generative LLMs.

2. The need to make LLMs layout-aware

Current state-of-the-art foundation models fall into two broad categories:

Text-only LLMs only have access to text. Converting documents with complex layouts (tables, multiple columns, forms, …) to simple, linear text clearly loses crucial layout information that is often needed to match important data elements with their context. Document processing tasks heavily rely on layout and visual information.
Multi-modal foundation models combine text and images. The current state of the art in multi-modal foundation models is focussed heavily on photographs and pictures, but not on documents. These models are notoriously bad at “reading” text, even that of logos, captions, road signs, … This is where specialized OCR models shine instead. Document processing tasks rely heavily on text that needs to be recognized with almost perfect accuracy. Much of the text to be extracted, like numbers, names, …, cannot afford even one misrecognized character.

There is a clear gap to close: text-based LLMs need access to more layout information while maintaining the accuracy of specialized OCR models.

Metamaze has built custom supervised LLMs that extend pretrained multilingual text-only models with rich layout information, drastically improving performance. We plan on using similar techniques but scaling them up to generative LLMs as foundation models.
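To make “layout information” concrete, here is a minimal sketch of one common way to attach it to OCR output: each token keeps its bounding box, normalized to a page-size-independent grid. The 0 to 1000 grid is a convention used by some published layout-aware models; the `OcrToken` type, the function names and the sample coordinates are assumptions for illustration, and this is not a description of how the Hydra model works internally.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class OcrToken:
    text: str
    # Bounding box in page pixels: left, top, right, bottom.
    box: tuple[int, int, int, int]

def normalize_box(box, page_width, page_height, scale=1000):
    """Map pixel coordinates to a page-size-independent 0..scale grid."""
    left, top, right, bottom = box
    return (
        left * scale // page_width,
        top * scale // page_height,
        right * scale // page_width,
        bottom * scale // page_height,
    )

def to_model_input(tokens, page_width, page_height):
    """Pair each token's text with its normalized box instead of dropping layout."""
    return [
        {"text": t.text, "box": normalize_box(t.box, page_width, page_height)}
        for t in tokens
    ]

# Two labels and two values in a two-column layout (hypothetical OCR output).
tokens = [
    OcrToken("Invoice", (100, 50, 300, 80)),
    OcrToken("Total", (1200, 50, 1350, 80)),
    OcrToken("INV-42", (100, 100, 280, 130)),
    OcrToken("99.00", (1200, 100, 1330, 130)),
]
features = to_model_input(tokens, page_width=2000, page_height=1000)
```

Flattened to plain text, this page reads “Invoice Total INV-42 99.00”, which no longer says which value belongs to which label. With the boxes kept, “99.00” visibly shares its left edge with “Total”.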

3. The need for agents to take actions in external systems

Every process is different. Any Adaptive IDP platform needs to be able to tailor its behavior to a company’s existing or target workflow. Often, this means custom code for external data lookup, interpretative decision-making, applying business logic, or data validation.

While it is great that you can customize every aspect of the pipeline, these steps are often hard-coded by developers and engineers. This has two big disadvantages:

These custom steps do not learn automatically when a business user corrects the mistakes made. So if the conversion of “100 bags” to “1 pallet” is not added by a developer, the business user frustratingly has to correct it over and over again.
They require IT services to implement, making the business users dependent on budget and capacity of internal or external IT teams.

Clearly, that is not how an Adaptive IDP system should work.

An end-to-end agent can learn from business users by being shown a couple of examples of which actions are appropriate, requiring no IT development.
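The feedback loop itself can be sketched in a few lines. The example below is deliberately much simpler than a generative agent: it is a plain lookup memory, and the class name `CorrectionMemory` and the threshold of two examples are assumptions made for illustration. It only shows the principle that corrections from business users, not code changes, drive future behavior.

```python
from collections import Counter, defaultdict

class CorrectionMemory:
    """Remembers how business users corrected extracted values."""

    def __init__(self, min_examples: int = 2):
        # Number of identical corrections required before one is applied.
        self.min_examples = min_examples
        self._seen = defaultdict(Counter)

    def record(self, extracted: str, corrected: str) -> None:
        """Store one user correction."""
        self._seen[extracted][corrected] += 1

    def suggest(self, extracted: str) -> str:
        """Return the most frequent correction once it has been seen often enough."""
        corrections = self._seen.get(extracted)
        if corrections:
            value, count = corrections.most_common(1)[0]
            if count >= self.min_examples:
                return value
        # Not enough evidence yet: leave the extracted value unchanged.
        return extracted

memory = CorrectionMemory()
memory.record("100 bags", "1 pallet")
first = memory.suggest("100 bags")   # one example is below the threshold
memory.record("100 bags", "1 pallet")
second = memory.suggest("100 bags")  # two examples: the correction is applied
```

A generative agent would generalize beyond exact matches, for example to “200 bags”, which a lookup table like this one cannot do.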

Estimation of impact on STP rate

The following numbers are the combined statistics of automation vs. validation reasons for a multitude of projects in a fixed time frame. These projects are of mixed difficulty: some are so hard they reach no more than 50% STP, and others are fairly straightforward and achieve up to 99.2% STP.

Overall, the conclusion is that the global STP rate would rise from 76.6% to ~88.1%. That means ~50.6% less human validation is needed. Time that can be spent on more value-adding tasks!
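The relation between an STP rate and the share of human validation can be worked out as follows. This is a sketch that assumes every document not processed straight through needs exactly one human validation; the function name is ours. With the rounded rates above, the formula gives about 49%, in the same range as the ~50.6% quoted; the unrounded per-project figures behind the quoted number are not shown here.

```python
def validation_reduction(stp_before: float, stp_after: float) -> float:
    """Relative drop in documents needing human validation, in percent."""
    manual_before = 100.0 - stp_before  # share validated by humans today
    manual_after = 100.0 - stp_after    # share validated by humans afterwards
    return 100.0 * (manual_before - manual_after) / manual_before

reduction = validation_reduction(76.6, 88.1)
```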

Generative agents can help by

Making holistic decisions: instead of applying validation and aggregation rules field by field, an agent would have the total context of a document. This would lead to better automated decisions on the correctness of an extraction, because the whole document is taken into account. Fewer manual checks need to be performed.
For parsing, this would mean that a whole document can be taken into account to decide if a date or number needs to be parsed as American or European style.
Taking actions: often certain subsets of documents or e-mails require manual action to be taken like looking up reference data in a different system, adding new data to systems, or sending an e-mail requesting more information, …
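The date-parsing point can be illustrated with a small rule-based stand-in. An agent would reach this decision from the full document context; the sketch below only mimics it by scanning all dates in the text for one that is unambiguous. The function names and the sample text are assumptions for illustration.

```python
import re
from datetime import date

DATE_PATTERN = re.compile(r"\b(\d{1,2})/(\d{1,2})/(\d{4})\b")

def infer_day_first(document_text: str):
    """Return True (European), False (American), or None if undecidable."""
    for first, second, _ in DATE_PATTERN.findall(document_text):
        if int(first) > 12:
            return True   # first number cannot be a month
        if int(second) > 12:
            return False  # second number cannot be a month
    return None

def parse_date(raw: str, day_first: bool) -> date:
    """Parse one dd/mm/yyyy or mm/dd/yyyy string with the chosen style."""
    a, b, year = (int(part) for part in raw.split("/"))
    day, month = (a, b) if day_first else (b, a)
    return date(year, month, day)

text = "Invoice date: 03/04/2023. Due date: 25/04/2023."
style = infer_day_first(text)
invoice_date = parse_date("03/04/2023", day_first=style)
```

On its own, “03/04/2023” is ambiguous. The due date elsewhere in the same document settles it as day-first, so the invoice date is read as 3 April 2023. When `infer_day_first` returns `None`, another signal is needed, such as the sender's country.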

What Metamaze is doing | Our roadmap

Our principles:

We build solutions that can be used by business users
We don’t reinvent the wheel. We take the wheel and build a Ferrari. If a secure, reliable, accurate API or product exists, we will embed it rather than trying to build it ourselves. There are no “untouchable parts”. If a new LLM API is more accurate than our – currently state-of-the-art – Hydra model, we will support it and offer it to our customers.
We maintain enterprise-level security, scalability, and reliability.

The Metamaze roadmap w.r.t. LLMs, Generative AI, and agent-based intelligent document processing

Request a Metamaze demo

Learn how Metamaze can help you automate any document and email in your organization. Book a demo with one of our experts and we’ll give you a quick tour of our product.
