The nonlinear approach
Our aim is for our customers to understand the task at hand in detail, including the terminology. That way, we can shape the goal of the project together. Among the first questions we ask are: What is your end goal? What are you trying to extract or classify? Where does this data come from? Is it representative?
When the whole team clearly understands what the goal is, we can start with our nonlinear approach of data annotation and see what the possibilities are.
We first need to get to know the data. Together with a data expert, we look into the data and ask a lot of questions. We have to completely understand the goal of the extraction or classification. Based on all of this, we define the names of the tags/classes together with the customer.
To make this clearer, let’s walk through an example of labelling documents for entity extraction from invoices. Since our customer doesn’t want to extract information from the invoices manually, and based on the customer’s needs, we decide to extract the following: name of the seller, name of the buyer, date of the invoice, paid amount, and amount of discount.
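To make the agreed label set concrete, it could be captured in a small schema like the sketch below. The tag names and descriptions are our own illustration, not a fixed standard:

```python
# Hypothetical label schema for the invoice example; the tag names and
# descriptions are illustrative and would be agreed with the customer.
INVOICE_LABELS = {
    "SELLER_NAME": "Name of the seller (person or company)",
    "BUYER_NAME": "Name of the buyer (person or company)",
    "INVOICE_DATE": "Date of the invoice, in any format",
    "PAID_AMOUNT": "Amount that was paid",
    "DISCOUNT_AMOUNT": "Amount of the discount",
}
```

Writing the schema down, even informally, keeps the whole team referring to the same class names throughout the project.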
We also need to investigate the data for bias at this stage: if certain layouts, languages, or vendors are over-represented in the sample, the model will inherit that skew, so the documents we annotate have to reflect the data the model will see in production.
In the next step we have to write the rules for labelling, the labelling guidelines. Good labelling guidelines leave no room for individual interpretation and include examples.
Let’s look at an example of labelling guidelines for the name of the buyer:
> Label the whole name, e.g. John Smith or María Dolores Carmen Rodríguez Martínez, as one entity. Do not include titles such as Mr, Mrs, Dhr …. If the buyer is a company, annotate the name of the company.
And for the date of the invoice:
> Label the whole date including the year but without the day of the week, e.g. 04/10/2018. Label it regardless of the format in which it appears: 2018-10-04, 4th of October 2018, 4. 10. 18, etc. Don’t forget to label all instances throughout the whole document; the date often also appears at the end of the document.
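This format variety is also where simple automated checks can support the annotators. A rough sketch of flagging date-like spans for review; the patterns below cover only some of the formats quoted above and are our assumption, not part of the guideline itself:

```python
import re

# Illustrative patterns for a few of the date formats mentioned above;
# real documents will need more variants (spelled-out months, etc.).
DATE_PATTERNS = [
    r"\d{4}-\d{2}-\d{2}",                 # 2018-10-04
    r"\d{1,2}/\d{1,2}/\d{2,4}",           # 04/10/2018
    r"\d{1,2}\.\s?\d{1,2}\.\s?\d{2,4}",   # 4. 10. 18
]

def find_date_candidates(text):
    """Return (start, end, matched_text) for every date-like span found."""
    hits = []
    for pattern in DATE_PATTERNS:
        for m in re.finditer(pattern, text):
            hits.append((m.start(), m.end(), m.group()))
    return sorted(hits)
```

Such a helper does not replace the annotator; it only surfaces candidate spans so fewer instances are overlooked.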
It is crucial to understand that if we don’t annotate all occurrences of an entity in the documents, the model will have problems learning when to extract an entity and when not to. Consequently, the model’s extractions won’t be complete and its accuracy will be low.
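A simple consistency check can catch such misses. The hypothetical helper below re-scans a document for the text of every annotated entity and reports occurrences that were left unlabelled; it is a sketch using exact string matching only:

```python
def find_missed_occurrences(text, annotations):
    """annotations: list of (start, end, label) spans.

    Returns spans where the text of an already-annotated entity
    reappears in the document without being annotated.
    """
    annotated_spans = {(start, end) for start, end, _ in annotations}
    missed = []
    for start, end, label in annotations:
        value = text[start:end]
        pos = text.find(value)
        while pos != -1:
            span = (pos, pos + len(value))
            if span not in annotated_spans:
                missed.append((pos, pos + len(value), label))
            pos = text.find(value, pos + 1)
    return missed
```

Running such a check before accepting a labelled document makes the “label all occurrences” rule enforceable rather than a matter of attention.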
The secret to creating good labelling guidelines is to anticipate the edge cases in the documents and describe them in the guidelines. Some documents may include “errors”, and the guidelines need to provide instructions on what to do with those documents.
Also include general syntactic guidelines: for example, whether to include trailing whitespace, punctuation marks, etc. You expect everyone to label in the same manner, so provide very clear instructions, even when something seems obvious.
After writing the labelling guidelines it is time to start labelling, and here the iterations begin. We advise first looking at 10 documents and trying to label them using the labelling guidelines. If at any point you have to make an additional decision on how to annotate a certain entity, add it to the labelling guidelines. If you notice one of the guidelines doesn’t make sense, now is the best time to adjust it.
Let’s return for a moment to the example of entity extraction from invoices. While labelling the first couple of documents, we noticed that names sometimes span several lines. So we decided to add the following to the labelling guidelines:
> If the name appears in two or more lines, annotate each line separately.
Another thing we noticed is that the date of the invoice sometimes appears in the header or footer, but for different reasons. We decided to describe this edge case in the labelling guidelines:
> Be careful: if a date appears in the header/footer because it was stamped there at print time, do not annotate it. Annotate dates in the header/footer only if the date is part of the document itself.
It is important to know that if you decide to change the labelling guidelines later on, in the middle of the labelling process, all the already labelled documents may have to be labelled again. That is why it is of utmost importance to define unambiguous labelling guidelines.
After annotating several documents without needing to change the guidelines, it is time for the next step. Ask a colleague or two to help you annotate around 50 documents. Use the so-called 4-eye principle, where each document gets labelled by at least two people. Afterwards, compare the results and update the labelling guidelines based on them. Repeat the process until you are confident you have seen enough documents that your labelling guidelines won’t need any more changes.
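Comparing the results of two annotators can be as simple as intersecting their labelled spans. A minimal sketch, assuming each annotation is a (start, end, label) tuple:

```python
def annotation_agreement(ann_a, ann_b):
    """ann_a, ann_b: sets of (start, end, label) spans from two annotators.

    Returns the exact-match agreement ratio and the disagreeing spans,
    which are the ones worth discussing and folding back into the
    labelling guidelines.
    """
    agreed = ann_a & ann_b
    disagreed = (ann_a | ann_b) - agreed
    total = len(ann_a | ann_b)
    ratio = len(agreed) / total if total else 1.0
    return ratio, sorted(disagreed)
```

Exact-match agreement is a deliberately strict measure; in practice you may also want to look at partial span overlaps, but even this simple ratio shows quickly whether the guidelines are converging.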
Only then start the production annotations. Now that you clearly understand the domain, articulate it to the team of labellers, repeatedly. It is advisable to have labellers first label several already-labelled documents so you can measure their performance against your gold labels. If needed, provide training to improve the accuracy of the labelling. If at any point they cannot annotate a document based on the labelling guidelines, they need to let you know so you can update the guidelines. It is also important for them to regularly check the guidelines for changes.
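Measuring a labeller’s performance against the already-labelled (gold) documents can be sketched as exact-match precision and recall over spans. This is a simplification: partial overlaps count as misses here:

```python
def precision_recall(gold, predicted):
    """gold, predicted: sets of (start, end, label) spans.

    Exact-match precision and recall of a labeller's spans
    against the gold annotations.
    """
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 1.0
    recall = true_positives / len(gold) if gold else 1.0
    return precision, recall
```

Low recall typically means the labeller is missing occurrences; low precision usually points to a guideline they have misunderstood, which tells you what the follow-up training should focus on.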
If resources allow, it is advisable to use the 4-eye principle for the majority of domains. It ensures a high quality of labelling. If the labels from the two annotators don’t match, a third annotator is needed to decide on the correct labels.
Optimisation of the labelling process
After the initial set of data is annotated, it is already possible to train a first version of the model. Including the model increases the efficiency of labelling by providing extraction/classification suggestions together with the probability that the suggested extraction or class is correct. The more data you label, the more data the model is trained on, the more accurate the suggestions become, and the more time the annotators save. Including a model early in the annotation process is also a great indicator of when the amount of labelled data is sufficient for a production model.
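One way such suggestions might be used is to split them by confidence into pre-filled annotations and mere hints. In this sketch, `model_predict` is a hypothetical stand-in for whatever model you train, and the threshold value is an assumption to be tuned per project:

```python
def preannotate(document, model_predict, threshold=0.9):
    """Split model suggestions into pre-filled annotations and hints.

    model_predict: hypothetical callable returning
    (start, end, label, probability) tuples for a document.
    Suggestions at or above the threshold are pre-filled for the
    annotator to confirm; the rest are shown only as hints.
    """
    prefilled, hints = [], []
    for start, end, label, prob in model_predict(document):
        target = prefilled if prob >= threshold else hints
        target.append((start, end, label, prob))
    return prefilled, hints
```

The annotator then mostly confirms or corrects instead of labelling from scratch, which is where the time savings come from.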
Annotating data can be a boring and time-consuming job, but it can also be efficient and fast if you know how to tackle the problem correctly.
Our customers individually decide how involved they want to be in the annotation of their data. If they decide to annotate the data themselves, we support them through the whole process: we hold data annotation workshops where we find the best solutions for annotating the customer’s specific data, help write the labelling guidelines, and perform quality checks after the labelling has started. On the other hand, some customers want to outsource the whole task, and then we take over not only the downstream ML tasks but also the whole process of data annotation.