Extracting nouns in their baseform (lemmata) from German texts can be easily done using Python and the Pattern library, especially its pattern.de module. However, using the pattern.de library alone often leads to unsatisfying results, because the baseform is often not correctly determined. The results can be enhanced using libleipzig which queries the Wortschatz Uni Leipzig database.
Both libraries can be installed for Python via the package manager pip. Unfortunately, libleipzig does not work with Python 3, so it is necessary to stick with Python 2.7.
Extracting nouns using pattern.de
First, let’s extract nouns and their baseform using only the pattern.de module. We need to import the required functions and modules and define a nonsense text with nouns in singular and plural form for testing. Furthermore, let’s declare a dict (here a defaultdict) that we will use to count the occurrences of each noun later on.
# -*- coding: utf-8 -*-
# (the encoding line above is required by Python 2 because the text contains umlauts)
from __future__ import print_function
from collections import defaultdict
import sys

from pattern.text.de import split, parse

text = (u"Eine Katze liegt auf einer Matte. Viele Katzen liegen auf vielen Matten. "
        u"Die Katzen schlafen, die Matten nicht. Die Hunde schlafen auch nicht. "
        u"Man hört ihr lautes Gebell draußen vor dem Haus. "
        u"In vielen Häusern schlafen viele Katzen. Häuser haben Türen.")

nouns = defaultdict(int)  # will be used to count the nouns (noun -> count mapping)
Now we parse the text with pattern.de and split it into sentence objects, which in turn contain the word objects. We print these objects in order to understand what’s going on:
parsed_text = parse(text, lemmata=True)
for sentence in split(parsed_text):
    print('SENTENCE: %s' % sentence)
    for w in sentence.words:
        print('> WORD: %s' % w)
The output is as follows:
SENTENCE: Sentence('Eine/DT/B-NP/O/ein Katze/NN/I-NP/O/katze liegt/VB/B-VP/O/liegen auf/IN/B-PP/B-PNP/auf einer/DT/B-NP/I-PNP/ein Matte/NN/I-NP/I-PNP/matte ././O/O/.')
> WORD: Word(u'Eine/DT')
> WORD: Word(u'Katze/NN')
> WORD: Word(u'liegt/VB')
...
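Each token in the tagged sentence string above packs several slash-separated fields. The following is a minimal sketch of how such a string could be taken apart by hand, assuming the field order word/POS/chunk/PNP/lemma that the output suggests. The names `Token` and `split_tagged` are made up for this illustration and are not part of pattern.de, which already exposes the same information through its word objects.

```python
from collections import namedtuple

# Assumed field order, read off the printed sentence: word/POS/chunk/PNP/lemma
Token = namedtuple('Token', ['word', 'pos', 'chunk', 'pnp', 'lemma'])


def split_tagged(tagged_sentence):
    """Split a slash-separated tagged sentence string into Token tuples."""
    tokens = []
    for item in tagged_sentence.split():
        fields = item.split('/')
        if len(fields) == 5:  # skip anything that does not match the layout
            tokens.append(Token(*fields))
    return tokens


# Shortened copy of the first sentence from the output above
sample = u'Eine/DT/B-NP/O/ein Katze/NN/I-NP/O/katze liegt/VB/B-VP/O/liegen'
tokens = split_tagged(sample)

# Keep only the lemmata of tokens whose POS tag starts with "NN"
noun_lemmata = [t.lemma for t in tokens if t.pos.startswith('NN')]
```

In the actual script there is no need for this, since `w.type` and `w.lemma` provide the tag and the lemma directly.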
We can already see that the individual sentences and their words are correctly identified, including their respective part-of-speech tags. For example, “Katze” is identified as a singular noun (“NN”). Now we need to select each noun, get its lemma (baseform) and count it. So we update the nested for loop from above like this:
for sentence in split(parsed_text):
    print('SENTENCE: %s' % sentence)
    for w in sentence.words:
        print('> WORD: %s' % w)
        # noun types always start with "NN", so select them:
        if w.type.startswith('NN') and w.string:
            # get the lemma (if existent) or the original word string and save it in "l":
            l = w.lemma or w.string
            nouns[l] += 1  # count up this noun
Now we can sort the nouns by their count and print the results:
print('---')
# sort by the count (second element of each (lemma, count) pair), highest first
sorted_nouns = sorted(nouns.items(), key=lambda item: item[1], reverse=True)
for lemma, count in sorted_nouns:
    print('%s:\t\t%d' % (lemma, count))
katze: 4
matten: 2
matte: 1
haus: 1
häusern: 1
hunde: 1
türen: 1
draußen: 1
As we can see, it basically works, but there are some problems with identifying the correct baseforms (“matte” vs. “matten”), and “draußen” is incorrectly classified as a noun.
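As a side note, the counting and sorting step can also be sketched with `collections.Counter` from the standard library, which offers a `most_common()` method. This is an alternative to the `defaultdict` used in this post, not what the script above does, and the list of lemmata below is typed in by hand purely for illustration.

```python
from collections import Counter

# Hand-written sample of lemmata, standing in for what the loop above collects
lemmata = ['katze', 'matte', 'katze', 'matten', 'katze',
           'matten', 'hunde', 'katze']

counts = Counter(lemmata)

# most_common() returns (lemma, count) pairs, highest count first
ranked = counts.most_common()
for lemma, count in ranked:
    print('%s:\t\t%d' % (lemma, count))
```

The sample also shows the baseform problem from above: “matte” and “matten” are counted separately because they arrive as different strings.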
Let’s check first with some simple examples if libleipzig might help us out:
from libleipzig import Baseform

base = Baseform(u'Matten'); print(base)
> [(Grundform: u'Matte', Wortart: u'N'), (Grundform: u'Matten', Wortart: u'NN')]
base = Baseform(u'Häusern'); print(base)
> [(Grundform: u'H\xe4user', Wortart: u'N'), (Grundform: u'H\xe4user', Wortart: u'N')]
base = Baseform(u'draußen'); print(base)
> [(Grundform: u'drau\xdfen', Wortart: u'A')]
It looks like this library could help us improve our results. [1] So let’s integrate it into our script by first defining a function that fetches the baseform and type of a word from libleipzig:
from libleipzig import Baseform
from suds import WebFault


def lemma_and_type_from_leipzig(word):
    try:
        base = Baseform(word)
        # Baseform() returns a list of results (see above); use the first entry
        if base and base[0].Grundform:
            return base[0].Grundform.lower(), base[0].Wortart
        else:
            return None, None
    except WebFault:
        print('WebFault while using libleipzig', file=sys.stderr)
        return None, None
We need to catch a possible WebFault exception, because libleipzig communicates with a server to fetch the results, and this might go wrong (e.g. when the server or the connection is down).
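The example queries above show that a lookup can return several entries with different word types. One possible variation, sketched here, is to prefer the first entry tagged 'N' instead of always taking the first entry. The `Entry` tuple is only a stand-in for the objects libleipzig returns, and `pick_noun_entry` is a hypothetical helper, so the sketch runs without the web service.

```python
from collections import namedtuple

# Stand-in for the result entries; the real objects come from the web service
Entry = namedtuple('Entry', ['Grundform', 'Wortart'])


def pick_noun_entry(entries):
    """Return (lemma, type) of the first 'N' entry, else of the first entry.

    Returns (None, None) for an empty result list.
    """
    if not entries:
        return None, None
    for entry in entries:
        if entry.Wortart == 'N':
            return entry.Grundform.lower(), entry.Wortart
    first = entries[0]
    return first.Grundform.lower(), first.Wortart


# Entries modelled on the "Matten" query shown earlier
matten = [Entry(u'Matte', u'N'), Entry(u'Matten', u'NN')]
lemma, wordtype = pick_noun_entry(matten)
```

Whether this preference actually improves the results would have to be checked against real queries.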
Now we can update our nested for loop to query libleipzig if we enable it via the constant LIBLEIPZIG_FOR_LEMMATA:
LIBLEIPZIG_FOR_LEMMATA = True  # set to False to use pattern.de only

for sentence in split(parsed_text):
    print('SENTENCE: %s' % sentence)
    for w_i, w in enumerate(sentence.words):
        print('> WORD: %s' % w)
        # check if we *might* have a noun here:
        if w.string and (w.type.startswith('NN')
                         or (LIBLEIPZIG_FOR_LEMMATA and w_i > 0 and w.string[0].isupper())):
            l = None
            came_from_leipzig = False
            if LIBLEIPZIG_FOR_LEMMATA:
                l, wordtype = lemma_and_type_from_leipzig(w.string)
                if l and wordtype:
                    if wordtype != 'N':  # libleipzig says this is no noun
                        print('>> libleipzig: no noun')
                        continue
                    came_from_leipzig = True
            if not l:
                # fall back to the lemma from pattern.de or the original word string
                l = w.lemma or w.string
                came_from_leipzig = False
            print('>> NOUN: %s (%s, %s)' % (w.string, l, came_from_leipzig))
            nouns[l] += 1  # nouns is a defaultdict(int), so missing keys start at 0
This is a bit more complex, because we first need to decide when we believe we have encountered a noun. We must have a proper word (w.string). Then either pattern.de told us we have a noun (w.type.startswith('NN')), or we use libleipzig and a word that is not the first word of a sentence begins with an uppercase character (LIBLEIPZIG_FOR_LEMMATA and w_i > 0 and w.string[0].isupper()). The latter might indicate a noun that pattern.de did not identify as such (for example “Gebell”), so we check it with libleipzig.
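To make the decision rule easier to test in isolation, it could be pulled out into a small function. The following is only a sketch: `is_noun_candidate` and its parameters are hypothetical names, and the tag 'JJ' in the usage lines is a made-up example of a non-noun tag, not what pattern.de actually returned for “Gebell”.

```python
def is_noun_candidate(word, pos_tag, index, use_leipzig):
    """Decide whether a token should be treated as a possible noun.

    word        -- the token string
    pos_tag     -- the part-of-speech tag assigned by the tagger
    index       -- position of the token within its sentence (0-based)
    use_leipzig -- whether the second lookup is enabled
    """
    if not word:
        return False
    if pos_tag.startswith('NN'):
        return True
    # Capitalised words that are not sentence-initial are worth a second look
    return bool(use_leipzig and index > 0 and word[0].isupper())


# A capitalised word in mid-sentence with a non-noun tag
with_lookup = is_noun_candidate(u'Gebell', 'JJ', 4, True)
without_lookup = is_noun_candidate(u'Gebell', 'JJ', 4, False)
```

A token that only passes the capitalisation check is still just a candidate; the lookup result decides whether it is counted.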
The results now are not perfect but they are definitely better than without using libleipzig:
katze: 4
matte: 3
häuser: 1
haus: 1
tür: 1
gebell: 1
hund: 1
So why use pattern.de at all? Because the parser works reliably, the API is very clear and straightforward to use, and it’s fast. Its weaknesses in identifying the word types can be lessened by using libleipzig, as we have seen. However, you will notice that your code runs much slower when using this library, because it queries the Wortschatz server quite often and hence does not run “offline”.
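Since the same word forms tend to occur many times in a text, one way to reduce the number of server queries is to remember results that were already fetched. The following is a minimal sketch of such a cache. `make_cached_lookup` is a hypothetical helper, and `fake_remote_lookup` merely stands in for a function like the libleipzig lookup so that the example runs offline.

```python
def make_cached_lookup(lookup):
    """Wrap a word -> result function so each distinct word is looked up once."""
    cache = {}

    def cached(word):
        if word not in cache:
            cache[word] = lookup(word)
        return cache[word]

    return cached


# Stand-in for a remote lookup; it records every call it receives
calls = []


def fake_remote_lookup(word):
    calls.append(word)
    return word.lower(), 'N'


cached_lookup = make_cached_lookup(fake_remote_lookup)
for w in [u'Katzen', u'Matten', u'Katzen', u'Katzen']:
    cached_lookup(w)

# Only two distinct words were seen, so only two "remote" calls were made
number_of_calls = len(calls)
```

Note that this simple version also caches failed lookups, so a word that failed once because of a connection problem would not be retried.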
[1] Of course, “evaluating” the quality of the results with three examples is completely unscientific. My assessments regarding the quality of results from pattern.de vs. libleipzig are completely based on my own small experiments and should be verified in larger scenarios.