Cross joins which form the cartesian product between two datasets, are a quite useful operation when you need to run calculations on all possible combinations of the rows in these datasets, for example calculating the age difference between each person in two groups of people. Another example would be calculating the distance between several origin cities and several destination cities. This is a good use-case for pandas, which helps working with large datasets efficiently in Python. But although the library supports most common join operations using DataFrames
, it lacks the support for cross joins. However, cross joins can be created with a little workaround using pandas.merge()
, which I will demonstrate with a small example.
Let’s start by importing some Python modules that we’ll need later:
import pandas as pd
import math
At first we need some data to work on. Let’s use the example from the beginning and calculate the distances between several origin and destination cities using their geographic coordinates:
origin_cities = ['Berlin', 'Hamburg', 'Munich']
origin_coords = {
'lat': [52.518611, 53.550556, 48.137222],
'lng': [13.408333, 9.993333, 11.575556]
}
destination_cities = ['London', 'New York City', 'Moscow', 'Sydney', 'Istanbul']
destination_coords = {
'lat': [51.50939, 40.712778, 55.75, -33.85, 41.01],
'lng': [-0.11832, -74.005833, 37.616667, 151.2, 28.960278]
}
df_orig = pd.DataFrame(origin_coords, index=origin_cities)
df_dest = pd.DataFrame(destination_coords, index=destination_cities)
So our origins look like this:
lat lng
Berlin 52.518611 13.408333
Hamburg 53.550556 9.993333
Munich 48.137222 11.575556
And destinations like this:
lat lng
London 51.509390 -0.118320
New York City 40.712778 -74.005833
Moscow 55.750000 37.616667
Sydney -33.850000 151.200000
Istanbul 41.010000 28.960278
Now we need to create the cross product of those two dataframes, i.e. combining each row of df_orig
with each row of df_dest
so that we get 3x5=15 rows in our result set. This can be done using pd.merge()
and defining a common temporary key in both dataframes as shown here. Let’s define a function for this:
def df_crossjoin(df1, df2, **kwargs):
df1['_tmpkey'] = 1
df2['_tmpkey'] = 1
res = pd.merge(df1, df2, on='_tmpkey', **kwargs).drop('_tmpkey', axis=1)
res.index = pd.MultiIndex.from_product((df1.index, df2.index))
df1.drop('_tmpkey', axis=1, inplace=True)
df2.drop('_tmpkey', axis=1, inplace=True)
return res
This function creates a common key in two dataframes and merges them using this key. Since in each row the key is the same on both sides, each row from one side is combined with each row from the other side. The function also creates a MultiIndex
, which reassembles the row-to-row connections between the two dataframes (you’ll see in the output below). The temporary keys are deleted from the input dataframes and the resulting dataframe. Furthermore, one can pass arguments to pd.merge()
via kwargs
as we will see now when we create our cross product:
dfx = df_crossjoin(df_orig, df_dest, suffixes=('_orig', '_dest'))
We pass the additional suffixes
parameter to pd.merge()
in order to get nicer column names as we see in the output of our cross product dataframe dfx
now:
lat_orig lng_orig lat_dest lng_dest
Berlin London 52.518611 13.408333 51.509390 -0.118320
New York City 52.518611 13.408333 40.712778 -74.005833
Moscow 52.518611 13.408333 55.750000 37.616667
Sydney 52.518611 13.408333 -33.850000 151.200000
Istanbul 52.518611 13.408333 41.010000 28.960278
Hamburg London 53.550556 9.993333 51.509390 -0.118320
New York City 53.550556 9.993333 40.712778 -74.005833
[...]
Munich London 48.137222 11.575556 51.509390 -0.118320
New York City 48.137222 11.575556 40.712778 -74.005833
[...]
The hierarchical MultiIndex
nicely reflects the connection between the rows of the two dataframes: Each origin city is combined with each destination city. This can be very helpful for grouping and summarizing our data. But for now, let’s calculate the distances between the cities! At first we need to define a function that calculates the distance between two points on earth “as the crow flies”. It is called haversine formula and defined as follows:
def haversine(p1, p2):
R = 6371 # earth radius in km
p1 = [math.radians(v) for v in p1]
p2 = [math.radians(v) for v in p2]
d_lat = p2[0] - p1[0]
d_lng = p2[1] - p1[1]
a = math.pow(math.sin(d_lat / 2), 2) + math.cos(p1[0]) * math.cos(p2[0]) * math.pow(math.sin(d_lng / 2), 2)
c = 2 * math.atan2(math.sqrt(a), math.sqrt(1 - a))
return R * c # returns distance between p1 and p2 in km
Let’s apply
this function on each row of our cross product dataframe to calculate a table of distances:
dist = dfx.apply(lambda row: haversine((row['lat_orig'], row['lng_orig']), (row['lat_dest'], row['lng_dest'])),
axis=1)
So here we have our distances in kilometers:
Berlin London 931.102605
New York City 6385.265543
Moscow 1608.611198
Sydney 16092.237562
Istanbul 1736.516521
Hamburg London 719.935447
New York City 6129.909138
[...]
Munich London 917.520160
New York City 6488.106293
[...]
We can now for example get the nearest three destinations for each origin along with their distance by grouping on the first level (i.e. “per origin”) and fetching the three smallest distances per group:
dist.groupby(level=0).nsmallest(3)
Berlin London 931.102605
Moscow 1608.611198
Istanbul 1736.516521
Hamburg London 719.935447
Moscow 1781.696837
Istanbul 1986.084508
Munich London 917.520160
Istanbul 1583.524250
Moscow 1959.863440
I’ve created a gist for the whole code here: