Finding out annual music favorites from Clementine music player

Posted at 30 Dec 2015
Tags: sql, sqlite, music, clementine

I’ve always despised iTunes and because of that I began to use Clementine several years ago, after trying out lots of unsatisfying music player alternatives on OS X. At the end of the musical year, it’s time to draw some conclusions, like what’s the top 10 songs and albums of the year. Fortunately, Clementine stores all information about your music collection in an SQLite database and hence it is possible to answer these questions with some SQL.

Clementine’s main database is usually called clementine.db and is located in its user configuration directory. On OS X, this is ~/Library/Application Support/Clementine. We should create a copy of the database file before we mess with it, even if we’ll only use SELECT statements. After that, we can open the database with the Terminal command sqlite3 clementine_copy.db. First of all, for better display let’s enable “column mode” and display the column headers:

sqlite> .mode column
sqlite> .headers on

The only database table that’s interesting for us is called “songs”. It contains almost all information about each single song in your music collection. We can have a look at its fields by printing its schema:

sqlite> .schema songs
  /* Metadata from taglib */
  title TEXT,
  album TEXT,
  artist TEXT,
  albumartist TEXT,
  composer TEXT,
  track INTEGER,
  disc INTEGER,
  bpm REAL,
  year INTEGER,
  genre TEXT,
  comment TEXT,
  compilation INTEGER,
  /* Information about the file on disk */
  directory INTEGER NOT NULL,
  filename TEXT NOT NULL,
, [...], playcount INTEGER NOT NULL DEFAULT 0, lastplayed INTEGER, rating INTEGER, [...], skipcount INTEGER NOT NULL DEFAULT 0, score INTEGER NOT NULL DEFAULT 0, [...]);

For our “top of the year” statistics, the most interesting fields will be playcount and rating, the latter describing the number of “stars” in the score rating with a number between 0 and 1.0, or -1 if the song is unrated. The score field seems to be some kind of internal rating which I couldn’t find out what it’s for or how it’s calculated.

With this information, we can already make a simple top 10 list of songs by play count in year 2015:

sqlite> SELECT artist,album,title,playcount,rating
   ...> FROM songs WHERE year = 2015
   ...> ORDER BY playcount DESC LIMIT 10;
artist      album       title         playcount   rating    
----------  ----------  ------------  ----------  ----------
Aloa Input  Mars etc    Far Away Sun  14          0.8       
Will Samso  Animal Han  Your Golden   13          0.8       
Aloa Input  Mars etc    Perry         12          1.3       
Will Butle  Policy      Take My Side  12          0.8

However, this is not very satisfying, because many songs have the same play count and the rating is not considered at all. Let’s include the rating and also filter out songs that were not rated. Furthermore, we need to cap the rating to a maximum value of 1.0, because 5-star songs sometimes get higher values than 1.0 (I couldn’t find out why).

SELECT artist,album,title,playcount,MIN(rating,1.0) AS capped_rating
FROM songs WHERE year = 2015 AND rating >= 0
ORDER BY capped_rating DESC,playcount DESC LIMIT 10;

We can either primarily sort by play count or by rating, both is not very satisfying. The problem is that the play count is highly dependent on when the song was added to the collection. Songs released earlier in the year will more likely have higher play counts than winter releases. What we need is a “play count by day” since the song was added. We can calculate it by something like “play count / days(now - date_added)”. Fortunately, the “ctime” field contains the date when the song file was created. Although this is not necessarily the date the song was added to the collection, this is the most accurate we can get. Together with SQLite’s date and time functions we can get a “play count by day (ppd)” number and use it for sorting:

  playcount/(julianday('now')-julianday(datetime(ctime, 'unixepoch'))) AS ppd
FROM songs
WHERE year = 2015

The rating is still important, so we should include it in some kind of weighted score. Let’s say this score should be made up by 50% rating and 50% of the “ppd” value so we can write the following:

  playcount/(julianday('now')-julianday(datetime(ctime, 'unixepoch')))*0.5+MIN(rating,1.0)*0.5
    AS weighted_score
FROM songs
WHERE year = 2015 and rating >= 0
ORDER BY weighted_score DESC
Villagers             Darling Arithmetic    Courage
Melanie De Biasio     No Deal Remixed       I’m Gonna Leave You (C
Portable feat. Lcio   DJ-Kicks              Surrender (Kosi Edit)
Courtney Barnett      Sometimes I Sit And   Kim's Caravan 
Kamasi Washington     The Epic              Isabelle     
Kurt Vile             B'lieve I'm Goin Down Pretty Pimpin
Will Butler           Policy                Anna             
Rocko Schamoni & Le   Die Vergessenen       Angela               
Bob Moses             All In All            Far From The Tree   
Dan Mangan + Blacksm  Club Meds             Forgetery          

In order to find out the top albums, we just need to modify the query so that all songs are grouped by the album name and use the average song play count and average rating of the album. The “*1.0” multiplications are only there to force floating-point division.

artist, album,
  SUM(playcount)*1.0/COUNT(title) AS avg_playcount,
  AVG(MIN(rating,1.0)) AS avg_rating,
  SUM(playcount)*1.0/COUNT(title)/(julianday('now')-julianday(datetime(ctime, 'unixepoch')))*0.5
  AS weighted_score
FROM songs
WHERE year = 2015 and rating >= 0
GROUP BY album
ORDER BY weighted_score DESC
Viet Cong             Viet Cong                                  
Built to Spill        Untethered Moon                            
Dan Mangan + Blacksm  Club Meds
Courtney Barnett      Sometimes I Sit And Think, And Sometimes I Just Sit        
Villagers             Darling Arithmetic
Will Butler           Policy
Kamasi Washington     The Epic     
A Tribe Called Knarf  Es ist die Wahrheit obwohl es nie passierte
Kurt Vile             B'lieve I'm Goin' Down                                              
Ought                 Sun Coming Down
If you spotted a mistake or want to comment on this post, please contact me: post -at- mkonrad -dot- net.
← “Removing diacritics, underlines and other marks from unicode strings in Python
View all posts
Scraping data from Facebook groups and pages” →