Cross join / cartesian product between pandas DataFrames

Posted at 16 Apr 2016
Tags: python, pandas, datascience

Cross joins which form the cartesian product between two datasets, are a quite useful operation when you need to run calculations on all possible combinations of the rows in these datasets, for example calculating the age difference between each person in two groups of people. Another example would be calculating the distance between several origin cities and several destination cities. This is a good use-case for pandas, which helps working with large datasets efficiently in Python. But although the library supports most common join operations using DataFrames, it lacks the support for cross joins. However, cross joins can be created with a little workaround using pandas.merge(), which I will demonstrate with a small example.

Let’s start by importing some Python modules that we’ll need later:

import pandas as pd
import math

At first we need some data to work on. Let’s use the example from the beginning and calculate the distances between several origin and destination cities using their geographic coordinates:

origin_cities = ['Berlin',  'Hamburg', 'Munich']
origin_coords = {
    'lat':  [52.518611, 53.550556, 48.137222],
    'lng':  [13.408333, 9.993333,  11.575556]

destination_cities = ['London',  'New York City', 'Moscow', 'Sydney', 'Istanbul']
destination_coords = {
    'lat':  [51.50939,  40.712778, 55.75,    -33.85, 41.01],
    'lng':  [-0.11832, -74.005833, 37.616667, 151.2, 28.960278]

df_orig = pd.DataFrame(origin_coords, index=origin_cities)
df_dest = pd.DataFrame(destination_coords, index=destination_cities)

So our origins look like this:

               lat        lng
Berlin   52.518611  13.408333
Hamburg  53.550556   9.993333
Munich   48.137222  11.575556

And destinations like this:

                     lat         lng
London         51.509390   -0.118320
New York City  40.712778  -74.005833
Moscow         55.750000   37.616667
Sydney        -33.850000  151.200000
Istanbul       41.010000   28.960278

Now we need to create the cross product of those two dataframes, i.e. combining each row of df_orig with each row of df_dest so that we get 3x5=15 rows in our result set. This can be done using pd.merge() and defining a common temporary key in both dataframes as shown here. Let’s define a function for this:

def df_crossjoin(df1, df2, **kwargs):
    df1['_tmpkey'] = 1
    df2['_tmpkey'] = 1

    res = pd.merge(df1, df2, on='_tmpkey', **kwargs).drop('_tmpkey', axis=1)
    res.index = pd.MultiIndex.from_product((df1.index, df2.index))

    df1.drop('_tmpkey', axis=1, inplace=True)
    df2.drop('_tmpkey', axis=1, inplace=True)

    return res

This function creates a common key in two dataframes and merges them using this key. Since in each row the key is the same on both sides, each row from one side is combined with each row from the other side. The function also creates a MultiIndex, which reassembles the row-to-row connections between the two dataframes (you’ll see in the output below). The temporary keys are deleted from the input dataframes and the resulting dataframe. Furthermore, one can pass arguments to pd.merge() via kwargs as we will see now when we create our cross product:

dfx = df_crossjoin(df_orig, df_dest, suffixes=('_orig', '_dest'))

We pass the additional suffixes parameter to pd.merge() in order to get nicer column names as we see in the output of our cross product dataframe dfx now:

                        lat_orig   lng_orig   lat_dest    lng_dest
Berlin  London         52.518611  13.408333  51.509390   -0.118320
        New York City  52.518611  13.408333  40.712778  -74.005833
        Moscow         52.518611  13.408333  55.750000   37.616667
        Sydney         52.518611  13.408333 -33.850000  151.200000
        Istanbul       52.518611  13.408333  41.010000   28.960278
Hamburg London         53.550556   9.993333  51.509390   -0.118320
        New York City  53.550556   9.993333  40.712778  -74.005833
Munich  London         48.137222  11.575556  51.509390   -0.118320
        New York City  48.137222  11.575556  40.712778  -74.005833

The hierarchical MultiIndex nicely reflects the connection between the rows of the two dataframes: Each origin city is combined with each destination city. This can be very helpful for grouping and summarizing our data. But for now, let’s calculate the distances between the cities! At first we need to define a function that calculates the distance between two points on earth “as the crow flies”. It is called haversine formula and defined as follows:

def haversine(p1, p2):
    R = 6371     # earth radius in km
    p1 = [math.radians(v) for v in p1]
    p2 = [math.radians(v) for v in p2]

    d_lat = p2[0] - p1[0]
    d_lng = p2[1] - p1[1]
    a = math.pow(math.sin(d_lat / 2), 2) + math.cos(p1[0]) * math.cos(p2[0]) * math.pow(math.sin(d_lng / 2), 2)
    c = 2 * math.atan2(math.sqrt(a), math.sqrt(1 - a))

    return R * c   # returns distance between p1 and p2 in km

Let’s apply this function on each row of our cross product dataframe to calculate a table of distances:

dist = dfx.apply(lambda row: haversine((row['lat_orig'], row['lng_orig']), (row['lat_dest'], row['lng_dest'])),

So here we have our distances in kilometers:

Berlin   London             931.102605
         New York City     6385.265543
         Moscow            1608.611198
         Sydney           16092.237562
         Istanbul          1736.516521
Hamburg  London             719.935447
         New York City     6129.909138
Munich   London             917.520160
         New York City     6488.106293

We can now for example get the nearest three destinations for each origin along with their distance by grouping on the first level (i.e. “per origin”) and fetching the three smallest distances per group:

Berlin   London       931.102605
         Moscow      1608.611198
         Istanbul    1736.516521
Hamburg  London       719.935447
         Moscow      1781.696837
         Istanbul    1986.084508
Munich   London       917.520160
         Istanbul    1583.524250
         Moscow      1959.863440

I’ve created a gist for the whole code here:

If you spotted a mistake or want to comment on this post, please contact me: post(-at-)mkonrad(-dot-)net.
← “Why You Should Never Use MongoDB
View all posts
Most software already has a "golden key" backdoor: the system update” →