Extracting nouns in their baseform (lemmata) from German texts can be done easily using Python and the Pattern library, especially its pattern.de module. However, using pattern.de alone often leads to unsatisfying results, because the baseform is frequently not determined correctly. The results can be enhanced with libleipzig, which queries the Wortschatz Uni Leipzig database.
Both libraries can be installed via the Python package manager pip. Unfortunately, libleipzig does not work with Python 3, so it's necessary to stick with Python 2.7.
Extracting nouns using pattern.de
At first, let’s extract nouns and their baseform using the pattern.de module, only. We need to import the proper functions/modules and specify a nonsense text with nouns in singular and plural form for testing. Furthermore, let’s declare a dict
, that we will use to count the occurrences of each noun later on.
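A minimal setup could look like this (the concrete nonsense text and the variable names are my own stand-ins):

```python
# -*- coding: utf-8 -*-
from pattern.de import parse, split

# hypothetical test text: nouns in singular and plural form; "Gebell"
# and "draußen" will become interesting later on
text = u"Die Katze liegt auf der Matte. Die Katzen liegen auf den Matten. " \
       u"Draußen hört man lautes Gebell."

lemmata_count = {}  # maps each noun baseform to its number of occurrences
```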
Now we parse the text with pattern.de and split it into sentence objects, which in turn contain the word objects. We print these objects in order to understand what's going on.
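A sketch of this step, building on the setup above:

```python
# parse() tags every token; lemmata=True also annotates each word with its
# baseform. split() turns the tagged string into Sentence objects, which
# contain Word objects.
for sentence in split(parse(text, lemmata=True)):
    print sentence
    for w in sentence.words:
        print w
```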
The output will look roughly like this (abbreviated; the exact formatting depends on the Pattern version):
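```
Sentence('Die/DT/B-NP/O/die Katze/NN/I-NP/O/katze ...')
Word(u'Die/DT')
Word(u'Katze/NN')
...
```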
We can already see that single sentences and their words are correctly identified, including their respective part-of-speech tags; for example, "Katze" is identified as a singular noun ("NN"). Now we need to select each noun, get its lemma (baseform), and count it. So we update the nested for loop from above.
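One way this could look, reusing the lemmata_count dict from the setup above:

```python
for sentence in split(parse(text, lemmata=True)):
    for w in sentence.words:
        if w.type and w.type.startswith('NN'):   # NN, NNS, ... mark nouns
            # fall back to the lowercased word if no lemma was found
            lemma = w.lemma or w.string.lower()
            lemmata_count[lemma] = lemmata_count.get(lemma, 0) + 1
```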
Now we can sort the nouns by their count and print the results.
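For example:

```python
# sort by occurrence count, most frequent noun first (a sketch)
for lemma, count in sorted(lemmata_count.items(),
                           key=lambda item: item[1], reverse=True):
    print '%s: %d' % (lemma, count)
```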
With the hypothetical text from above, the output might look like this:
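```
katze: 2
matte: 1
matten: 1
draußen: 1
```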
As we can see, it basically works, but there are some problems identifying the correct baseforms ("matte" vs. "matten"), and "draußen" is also incorrectly identified as a noun.
Let’s check first with some simple examples if libleipzig might help us out:
Looks like using this library could help us improve our results.1 So let's integrate it into our script by first defining a function that fetches the baseform and type of a word from libleipzig.
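Such a function could look like this; the name baseform_and_type and the (None, None) return convention are my own, and I assume the result rows expose Grundform and Wortart attributes:

```python
from libleipzig import Baseform
from suds import WebFault

def baseform_and_type(word):
    """Fetch (baseform, word type) for <word> from libleipzig,
    or (None, None) if the query failed or returned nothing."""
    try:
        results = Baseform(word)
    except WebFault:
        # the query goes over the network and might fail
        return None, None
    if results and results[0].Grundform:
        return results[0].Grundform, results[0].Wortart
    return None, None
```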
We need to catch a possible WebFault exception, because libleipzig communicates with a server to fetch the results, and this might go wrong (e.g. when the server or the connection is down).
Now we can update our nested for loop to query libleipzig in case we enable it (constant LIBLEIPZIG_FOR_LEMMATA is True).
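A sketch of the updated loop, reusing the helper defined above; treating Wortart 'N' as the noun marker is my assumption:

```python
LIBLEIPZIG_FOR_LEMMATA = True

lemmata_count = {}  # start counting from scratch
for sentence in split(parse(text, lemmata=True)):
    for w_i, w in enumerate(sentence.words):
        if not w.string:
            continue  # we must have a proper word first
        is_noun_tag = w.type is not None and w.type.startswith('NN')
        # an uppercase word inside the sentence hints at a noun that
        # pattern.de missed, since German capitalizes all nouns
        noun_guess = (LIBLEIPZIG_FOR_LEMMATA and w_i > 0
                      and w.string[0].isupper())
        if not (is_noun_tag or noun_guess):
            continue
        lemma = None
        if LIBLEIPZIG_FOR_LEMMATA:
            baseform, word_type = baseform_and_type(w.string)
            if baseform and word_type == u'N':  # 'N' assumed to mark nouns
                lemma = baseform.lower()
        if lemma is None:
            if not is_noun_tag:
                continue  # libleipzig did not confirm our uppercase guess
            lemma = w.lemma or w.string.lower()
        lemmata_count[lemma] = lemmata_count.get(lemma, 0) + 1
```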
This is a bit more complex, because we first need to decide when we believe we have encountered a noun: we must have a proper word (w.string). Then either pattern.de told us we have a noun (w.type.startswith('NN')), or, with libleipzig enabled, the word begins with an uppercase character although it is not the first word of the sentence (LIBLEIPZIG_FOR_LEMMATA and w_i > 0 and w.string[0].isupper()). Since German capitalizes all nouns, the latter might indicate a noun that pattern.de failed to identify (for example "Gebell"), so we check it with libleipzig.
The results are still not perfect, but they are definitely better than without libleipzig. With the hypothetical text from above, they might now look like this:
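```
katze: 2
matte: 2
draußen: 1
gebell: 1
```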
So why use pattern.de at all? Because the parser works reliably, the API is clear and straightforward to use, and it is fast. As we have seen, its weaknesses in identifying word types can be lessened by using libleipzig. However, you will notice that your code runs much slower with this library, because it queries the Wortschatz server quite often and hence does not run "offline".
-
1. Of course, "evaluating" the quality of the results with three examples is completely unscientific. My assessments regarding the quality of results from pattern.de vs. libleipzig are based entirely on my own small experiments and should be verified in larger scenarios.