Welcome to the personal website of Markus Konrad. You can find blog posts about computer science related topics here, as well as information about own open-source software projects.

Blog Posts

Why You Should Never Use MongoDB

Posted at 27 Sep 2016
Tags: databases, nosql, link

Cross join / cartesian product between pandas DataFrames

Posted at 16 Apr 2016
Tags: python, pandas, datascience

Cross joins which form the cartesian product between two datasets, are a quite useful operation when you need to run calculations on all possible combinations of the rows in these datasets, for example calculating the age difference between each person in two groups of people. Another example would be calculating the distance between several origin cities and several destination cities. This is a good use-case for pandas, which helps working with large datasets efficiently in Python. But although the library supports most common join operations using DataFrames, it lacks the support for cross joins. However, cross joins can be created with a little workaround using pandas.merge(), which I will demonstrate with a small example.

Read on…

Most software already has a “golden key” backdoor: the system update

Posted at 28 Feb 2016
Tags: link, security

A Python script to convert BibTeX to BibJSON

Posted at 18 Feb 2016
Tags: python, bibtex

I made a small Python script that converts BibTeX to BibJSON. It uses BibtexParser to parse the input BibTeX data and then constructs the JSON format as proposed by the BibJSON draft.

An equation that debunks conspiracy theories

Posted at 18 Feb 2016
Tags: link, science, probability

Unexpected gender bias on GitHub

Posted at 14 Feb 2016
Tags: link, opensource, github, gender

Python implementation of Linde-Buzo-Gray algorithm

Posted at 10 Jan 2016
Tags: python, clustering, datascience, imageproc

I needed some algorithm for clustering multidimensional vectors in Python, so I remembered that I once implemented the Linde-Buzo-Gray aka Generalized Lloyd algorithm (for better explanation see this website) in Java and since it does the job well, I decided to implement it in Python. I put the result on github along with a small IPython notebook with which you can visualize the results of the clustering process – the red dots are the detected clusters of the input data (blue crosses):

IPython matplotlib output example of LBG algorithm

Removing diacritics, underlines and other marks from unicode strings in Python

Posted at 05 Jan 2016
Tags: icu, python, unicode

Sometimes it is necessary to normalize a string by removing all kinds of diacritics (accents), underlines or other “marks” that can be attached to characters in unicode. This is important for example for full-text search or text mining. Transliteration to ASCII characters is not an option because this would for example also eliminate Greek, Russian or other characters. With the help of the PyICU library, the task can easily be achieved:

from icu import UnicodeString, Transliterator, UTransDirection
u = UnicodeString(s)
t = Transliterator.createInstance("NFD; [:M:] Remove; NFC", UTransDirection.FORWARD)
normalized = str(u)

After converting a Python string to an ICU UnicodeString object, we can apply a transliteration operation that is defined as "NFD; [:M:] Remove; NFC“. This operation means the Unicode string is at first decomposed (NFD), then the character class “marks” is removed (“[:M:] Remove”) and finally the string is re-composed again (NFC). At the end, the UnicodeString object is converted back to a Python str.

After defining a function we can use it as follows and see that it works (underlines may not be displayed correctly in your browser):

> 'cafe'
normalize_string('αυτῷ μνήμ̳η̳ς')
> 'αυτω μνημης'

Finding out annual music favorites from Clementine music player

Posted at 30 Dec 2015
Tags: sql, sqlite, music, clementine

I’ve always despised iTunes and because of that I began to use Clementine several years ago, after trying out lots of unsatisfying music player alternatives on OS X. At the end of the musical year, it’s time to draw some conclusions, like what’s the top 10 songs and albums of the year. Fortunately, Clementine stores all information about your music collection in an SQLite database and hence it is possible to answer these questions with some SQL.

Read on…

Scraping data from Facebook groups and pages

Posted at 23 Dec 2015
Tags: facebook, scraping, nlp, python, php

I’ve written a small set of tools to (semi-)automatically collect and analyze data from online discussions on Facebook groups and pages. It is available on github. It can be used to collect posts and comments (including their hierarchical structure and some metadata) from public groups and pages automatically. For closed groups, manually saving the HTML output and parsing it with a provided Python script is necessary.

After collecting the data, statistical analyses can be performed on it. For now, identifying and counting nouns as described in a previous blog post is implemented.