When building any machine learning model – and especially in an Intelligent Document Processing platform – data and models go hand in hand. In the AI world, a lot of effort has gone into the second part: models. Model-centric AI is about optimizing the algorithms behind a model to make it as accurate as possible on a fixed dataset. But improving the model has diminishing returns: each further gain becomes harder to achieve.
Our entire machine learning team knows that our models are already state-of-the-art and that focusing solely on improving the model architectures might only result in an increase of a few (tenths of a) percentage points in accuracy. The biggest gains don’t come from the models anymore: they come from improving the data.
Our CTO Jos Polfliet usually explains it something like this: