
Arguing with Language: Inter Annotator Agreement

Marcin Lewek, Marketing Manager, edrone

When it comes to processing natural language and developing smart voice assistants for eCommerce, sooner or later you will encounter the term Inter-Annotator Agreement. This agreement is critical for creating the illusion that a machine understands words.

Language has always allowed us to communicate and cooperate in pursuit of aims critical for the common good. This cooperation and the passing on of knowledge gained over time were possible thanks to a not entirely conscious agreement. So far, so good, but it looks like we are entering times when we need to redefine this agreement because of AI. Or thanks to it.

In general, agreement is something worth striving for, right? We can list thousands of reasons why, from self-satisfaction to world peace. While deciding whether something is black or white is relatively easy and objective, some aspects of our existence are, to put it lightly, debatable.

As you can guess, speech itself is one of those debatable things. That is why computers struggle to process it, and why we tend to underestimate the importance of aeons of evolution. Over generations, we have practised communication, the process of sharing thoughts, which resulted in our mastery of language. Yet, while the meaning of words is usually clear to people, classifying those words can be fuzzy even for them!

Figure: AVA’s R&D work is conducted on real-life examples. Shown here is PHU A/B/C, an eCommerce store selling tools and power tools: a list of a cordless impact driver’s traits with annotations.

A day like any other in the AVA project. But first things first.

Let’s put some tags on terms

AI surely can go crazy with every task we assign, so we need to give it decent rails. In NLP (Natural Language Processing), one of these rails is annotation. What is an annotation? It’s the process of classification, or more precisely, labelling. You can think of labels as exactly such rails.

Another thing worth highlighting is the corpus. It’s a (usually large) piece of text that we place on the input of the algorithm. Consider it a chunk we feed to the algorithm in order to get specific results.

Human in the Loop

An annotator is the one who annotates a gargantuan text corpus. So we have a text corpus, and we want to label words, but not all of them. AVA will automate this as much as possible, so we need to teach the network how to annotate traits by itself. That is our task here, and it requires annotating a few corpora manually.

As you may have noticed, there is a lot of “it depends”, “hinging on”, and “regarding the result we are aiming for”. Machine learning, in general, is all about defining the scope in a down-to-earth way, feeding the network with proper data, and tuning it so that a given input returns the desired output. There is no magic here, even if it may appear that there is a “ghost in this machine.”

Labels depend on what precisely we want to achieve, or on which aspect of the text corpus the network will analyze. In our case, it is “What is, and what isn’t, a product trait”. And it is…

A matter of dispute

AVA, along with edrone, is focused strictly on eCommerce, and the subject matter of eCommerce is products and their traits. These traits can take values. Our task here seems simple, as features and their measurements look easy to find and annotate!

So our corpus is a list of products with their descriptions, feature tables, measurements and other sources where we, as customers, expect to find information and go looking for it. A piece of cake. But to be sure, let’s run a quick test to check whether the whole research team is on the same page.

Why is it so important? Because we have to teach the algorithm to annotate every trait properly. Agreement between annotators translates later into algorithm efficiency, tested by the team.

A matter of agreement

When it comes to assessing the quality of yes/no verdicts, you will more than likely hear the term ‘F-score’ or ‘F1-score’. It might seem complicated at first glance, but trust me, it’s actually child’s play (ok, teenager’s play). It comes down to adding, multiplying and dividing a few values.


We have two annotators who annotate the corpus independently, without seeing each other’s labels. F1 requires a reference point. Therefore, one of the annotators will be treated as ‘arbitrarily right’ and serve as the benchmark.

The first annotator defines the ‘trait/not trait’ division below (left and right side of the square). The second annotator is the one ‘being tested’ by F1. His choices are in the circle.

Does this word stand for the product’s trait?

  • True positives – “Yes”, and in fact, it is a trait.
  • True negatives – “No”, and it wasn’t.
  • False positives – “Yes”, but sadly it wasn’t a trait.
  • False negatives – “No”, well… it was.

In a given example:

TP = 11
TN = 13
FP = 2
FN = 3

Nice and simple so far? Without a doubt. Let’s calculate the precision. We can say that precision ‘focuses’ on “Yes” labels. What share of the words selected as traits (those in the circle) were indeed traits?

\[ \text{Precision} = \frac{\text{TP}}{\text{TP}+\text{FP}} = \frac{11}{13} \approx 0.846 \]

Recall, on the other hand, ‘focuses’ on all actual traits. What share of the actual traits was labelled “Yes”?

\[ \text{Recall} = \frac{\text{TP}}{\text{TP}+\text{FN}} = \frac{11}{14} \approx 0.786 \]

Now we can calculate F1-score:

\[ \text{F1} = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision}+\text{Recall}} \approx 0.815 \]

A pretty nice score! Yet, I invented this example. Life is not that easy 😉
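The three formulas above can be checked with a few lines of Python. This is only a minimal sketch using the invented counts from this example; the function name `precision_recall_f1` is made up for illustration and is not part of the AVA code.

```python
def precision_recall_f1(tp: int, fp: int, fn: int) -> tuple[float, float, float]:
    """Return (precision, recall, F1) for the "Yes" class."""
    precision = tp / (tp + fp)   # share of "Yes" verdicts that were real traits
    recall = tp / (tp + fn)      # share of real traits that got a "Yes"
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Counts from the example above. TN is not used by any of the three metrics.
precision, recall, f1 = precision_recall_f1(tp=11, fp=2, fn=3)
print(f"precision={precision:.3f} recall={recall:.3f} F1={f1:.3f}")
# prints: precision=0.846 recall=0.786 F1=0.815
```

Note that the true negatives never enter the calculation, which is one reason F1 alone says little about how much of the agreement is down to chance.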

Cohen’s 𝜅

F1 doesn’t reflect true agreement, since it does not take into account the agreement that would occur by random chance. This is where two new players enter the stage: the Kappas.

  • Cohen’s 𝜅: we use it while two annotators annotate each instance with a category.
  • Fleiss’ 𝜅: is used when each instance was annotated 𝑛 times with a category.

We are going to use Cohen’s Kappa, since it matches our case. Another convenient thing about Kappa is that it allows comparing agreement between two different experiments. A Kappa calculated for word annotations can be compared with a Kappa for image annotations.

\[ \kappa \equiv \frac{p_o - p_e}{1 - p_e} \]

\[ p_o \text{ – observed proportionate agreement} \]

Cohen’s Kappa takes into account the theoretical chance of random agreement.

\[ p_e \text{ – random agreement probability} \]

\[ p_e = p_{yes} + p_{no} \]

\[ p_{yes} \text{ – YES agreement probability} \]

\[ p_{no} \text{ – NO agreement probability} \]

The YES agreement probability is the product of the share of items annotator A tagged “yes” and the share of items annotator B tagged “yes”. Here Y_A and N_A are the numbers of items annotator A tagged “yes” and “no”, and Y_B and N_B are the same counts for annotator B.

\[ p_{yes} = \frac{Y_A}{Y_A + N_A} \cdot \frac{Y_B}{Y_B + N_B} \]

The same is true for “no”.

\[ p_{no} = \frac{N_A}{Y_A + N_A} \cdot \frac{N_B}{Y_B + N_B} \]

Cohen’s Kappa can take values between -1 and 1. However, a negative Kappa should be treated as a warning sign, and the test should be re-evaluated.

A negative Kappa basically says that the annotators disagree more than they would by chance. In other words, annotations assigned at random would be expected to match more often. Thus, in practice, we deal only with positive values.
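To see the Kappa formulas at work, here is a small sketch that applies them to the invented TP/TN/FP/FN counts from the F1 example. It assumes that the observed agreement is the share of items on which both annotators gave the same verdict; the function name `cohen_kappa_binary` is hypothetical.

```python
def cohen_kappa_binary(tp: int, tn: int, fp: int, fn: int) -> float:
    """Cohen's Kappa for two annotators and a yes/no label."""
    total = tp + tn + fp + fn
    p_o = (tp + tn) / total              # both said "yes" or both said "no"

    yes_a, no_a = tp + fn, tn + fp       # annotator A (the reference)
    yes_b, no_b = tp + fp, tn + fn       # annotator B (the one being tested)

    p_yes = (yes_a / total) * (yes_b / total)
    p_no = (no_a / total) * (no_b / total)
    p_e = p_yes + p_no                   # agreement expected by chance

    return (p_o - p_e) / (1 - p_e)


kappa = cohen_kappa_binary(tp=11, tn=13, fp=2, fn=3)
print(f"kappa={kappa:.3f}")
# prints: kappa=0.654
```

For these counts the observed agreement is 24/29 (about 0.83), the chance agreement is about 0.50, and Kappa comes out at roughly 0.65, noticeably lower than the F1 of about 0.81 for the same data.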

Kappa value    Agreement    % of reliable data
0 - 0.2        None         0-4%
0.21 - 0.39    Minimal      4-15%
0.4 - 0.59     Weak         15-35%
0.6 - 0.79     Moderate     35-63%
0.8 - 0.9      Strong       64-81%
0.91 - 1       Perfect      82-100%

Inter-Annotator Agreement

To run the test, we chose four eCommerce stores and randomly picked five products from each store’s offer. As a result, we had 20 products, and the task was to label their traits, both names and values, properly.

For more in-depth insights, the test was conducted by three AI Specialists (Łukasz, Hubert, Piotr) in three scenarios. In each one, the annotators were matched in pairs, so we had three results for each scenario. We also calculated the F1 score for each pair.

Scenario I – [ALL] – five labels

  • Name – start
  • Name – middle
  • Value – start
  • Value – middle
  • Other

Scenario II – [ATR] – three labels

  • Name
  • Value
  • Other

Scenario III – [NAM] – three labels

  • Name – start
  • Name – middle
  • Other

In scenarios I & III, traits are divided into a start and a middle part (names and values in scenario I, names only in scenario III). The details of this division are not critical to understanding this text, but they are worth describing. Both trait names and trait values consist of strings of tokens. A token might be a word, a couple of words, a chunk of letters, or a single letter. It’s a matter of the AI developer’s choice. Name-start is the first token of a name, and Name-middle covers all the remaining tokens.
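The start/middle scheme is easier to picture on a concrete line of text. The sketch below uses a made-up trait line and a made-up word-level tokenisation. It also assumes that the labels for scenarios II and III can be obtained by collapsing the five labels of scenario I, which the text does not state explicitly. The helper names `to_atr` and `to_nam` are hypothetical.

```python
# Hypothetical example line: "Max torque: 180 Nm", tokenised word by word.
tokens = ["Max", "torque", ":", "180", "Nm"]

# Scenario I [ALL]: Name-start, Name-middle, Value-start, Value-middle, Other.
labels_all = ["N-S", "N-M", "O", "V-S", "V-M"]


def to_atr(label: str) -> str:
    """Scenario II [ATR]: drop the start/middle part (N-S -> N, V-M -> V)."""
    return label.split("-")[0]


def to_nam(label: str) -> str:
    """Scenario III [NAM]: keep name labels, treat everything else as Other."""
    return label if label.startswith("N-") else "O"


for token, label in zip(tokens, labels_all):
    print(f"{token:8} ALL={label:4} ATR={to_atr(label):2} NAM={to_nam(label)}")
```

With this reading, a disagreement about where a name starts shows up in scenarios I and III but disappears in scenario II, where only the Name/Value/Other distinction counts.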

Finally, we ended up with 18 metrics (three scenarios, three annotator pairs, two scores each), which we can use to assess the annotators’ agreement.
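A pairwise comparison of this kind could be sketched as follows. This is not the team’s actual evaluation script: the label sequences are tiny made-up toy data, the functions `cohen_kappa` and `macro_f1` are plain-Python illustrations written for this example, and macro F1 is assumed to be the unweighted mean of the per-label F1 scores.

```python
from collections import Counter
from itertools import combinations


def cohen_kappa(a: list[str], b: list[str]) -> float:
    """Cohen's Kappa for two label sequences with any number of categories."""
    n = len(a)
    p_o = sum(x == y for x, y in zip(a, b)) / n
    count_a, count_b = Counter(a), Counter(b)
    # Chance agreement: sum over labels of P(A picks label) * P(B picks label).
    p_e = sum(count_a[label] * count_b[label] for label in count_a) / (n * n)
    return (p_o - p_e) / (1 - p_e)  # undefined if both use a single identical label


def macro_f1(reference: list[str], tested: list[str], labels: list[str]) -> float:
    """Unweighted mean of per-label F1, with `reference` treated as correct."""
    scores = []
    for label in labels:
        pairs = list(zip(reference, tested))
        tp = sum(r == label and t == label for r, t in pairs)
        fp = sum(r != label and t == label for r, t in pairs)
        fn = sum(r == label and t != label for r, t in pairs)
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Toy data, invented for illustration only (scenario ATR style labels).
LABELS = ["N", "V", "O"]
annotations = {
    "H": ["N", "N", "O", "V", "V", "O", "N", "V"],
    "L": ["N", "O", "O", "V", "V", "O", "N", "V"],
    "P": ["N", "O", "O", "V", "O", "O", "N", "V"],
}

for first, second in combinations(sorted(annotations), 2):
    a, b = annotations[first], annotations[second]
    print(f"Scores ({first}, {second}) macro F1={macro_f1(a, b, LABELS):.4f}, "
          f"Cohen kappa = {cohen_kappa(a, b):.4f}")
```

Both scores are symmetric here, so it does not matter which annotator of a pair is treated as the reference.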

Results for scenario=ALL, labels=['N-S', 'N-M', 'V-S', 'V-M', 'O']
Scores (H, L) macro F1=0.5359, Cohen kappa = 0.4679
Scores (H, P) macro F1=0.5509, Cohen kappa = 0.4974
Scores (L, P) macro F1=0.5960, Cohen kappa = 0.5674

Results for scenario=ATR, labels=['N', 'V', 'O']
Scores (H, L) macro F1=0.6548, Cohen kappa = 0.4918
Scores (H, P) macro F1=0.6767, Cohen kappa = 0.5284
Scores (L, P) macro F1=0.7031, Cohen kappa = 0.5918

Results for scenario=NAM, labels=['N-S', 'N-M', 'O']
Scores (H, L) macro F1=0.6017, Cohen kappa = 0.4639
Scores (H, P) macro F1=0.6292, Cohen kappa = 0.5132
Scores (L, P) macro F1=0.6552, Cohen kappa = 0.5239

Are the results good? Well, no. With Kappa values between roughly 0.46 and 0.59, the agreement turned out to be Weak.
Is that a bad thing? Well, no. It’s R&D. The results give us clues. How, then, should we interpret them?
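As a quick sanity check of the “Weak” verdict, the reported Kappa values can be mapped onto the bands from the table above. The sketch below copies the nine Kappas from the results listing. Because the table leaves small gaps between bands (for example between 0.2 and 0.21), it assumes that each band starts at its lower bound; the names `BANDS` and `agreement_level` are hypothetical.

```python
# Lower bound of each band from the Kappa table, highest band first.
BANDS = [
    (0.91, "Perfect"),
    (0.80, "Strong"),
    (0.60, "Moderate"),
    (0.40, "Weak"),
    (0.21, "Minimal"),
    (0.00, "None"),
]

# Kappa values copied from the results listing above.
REPORTED_KAPPAS = {
    "ALL": {"H-L": 0.4679, "H-P": 0.4974, "L-P": 0.5674},
    "ATR": {"H-L": 0.4918, "H-P": 0.5284, "L-P": 0.5918},
    "NAM": {"H-L": 0.4639, "H-P": 0.5132, "L-P": 0.5239},
}


def agreement_level(kappa: float) -> str:
    """Return the agreement label of the first band whose lower bound fits."""
    for lower_bound, name in BANDS:
        if kappa >= lower_bound:
            return name
    return "Negative (re-evaluate the test)"


for scenario, pairs in REPORTED_KAPPAS.items():
    for pair, kappa in pairs.items():
        print(f"{scenario} {pair}: kappa={kappa:.4f} -> {agreement_level(kappa)}")
```

Every one of the nine values lands in the Weak band, with the Łukasz and Piotr pair in scenario ATR coming closest to Moderate.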

A matter of perspective

The results represent the point of view of each annotator. Basically, it means that they have different opinions on what a product trait name is and what a product trait value is. What can we learn from it?

Annotator efficiency

We have to teach the algorithm to annotate every trait properly. Agreement between annotators later translates into algorithm efficiency, measured by F1-score.

If Łukasz annotates the training set for the annotating algorithm, it will learn to ‘think’ about trait names and values the way Łukasz does. An Inter-Annotator Agreement test between Łukasz and the algorithm he trained should then turn out well, with a high rate of agreement. But if we test the same algorithm against Hubert, we can expect IAA to drop to a value similar to the earlier Łukasz vs Hubert test.

The proper way of Annotation

In all scenarios, the test between Łukasz and Piotr had the highest agreement value. It’s a hint that an approach similar to theirs may be the proper one. On the other hand, Hubert’s approach deserves a closer look, because he may have noticed something interesting, and in the end, his approach may turn out to be better.

Keep calm, annotate and calculate kappa. We have Artificial intelligence to build!

Marcin Lewek

Marketing Manager


Digital marketer and copywriter, experienced and specialized in AI, design, and digital marketing itself. Science and holistic-approach enthusiast, after-hours musician, and sometimes actor.
