Cross joins which form the cartesian product between two datasets, are a quite useful operation when you need to run calculations on all possible combinations of the rows in these datasets, for example calculating the age difference between each person in two groups of people. Another example would be calculating the distance between several origin cities and several destination cities. This is a good use-case for pandas, which helps working with large datasets efficiently in Python. But although the library supports most common join operations using DataFrames
, it lacks the support for cross joins. However, cross joins can be created with a little workaround using pandas.merge()
, which I will demonstrate with a small example.
Let’s start by importing some Python modules that we’ll need later:
At first we need some data to work on. Let’s use the example from the beginning and calculate the distances between several origin and destination cities using their geographic coordinates:
So our origins look like this:
And destinations like this:
Now we need to create the cross product of those two dataframes, i.e. combining each row of df_orig
with each row of df_dest
so that we get 3x5=15 rows in our result set. This can be done using pd.merge()
and defining a common temporary key in both dataframes as shown here. Let’s define a function for this:
This function creates a common key in two dataframes and merges them using this key. Since in each row the key is the same on both sides, each row from one side is combined with each row from the other side. The function also creates a MultiIndex
, which reassembles the row-to-row connections between the two dataframes (you’ll see in the output below). The temporary keys are deleted from the input dataframes and the resulting dataframe. Furthermore, one can pass arguments to pd.merge()
via kwargs
as we will see now when we create our cross product:
We pass the additional suffixes
parameter to pd.merge()
in order to get nicer column names as we see in the output of our cross product dataframe dfx
now:
The hierarchical MultiIndex
nicely reflects the connection between the rows of the two dataframes: Each origin city is combined with each destination city. This can be very helpful for grouping and summarizing our data. But for now, let’s calculate the distances between the cities! At first we need to define a function that calculates the distance between two points on earth “as the crow flies”. It is called haversine formula and defined as follows:
Let’s apply
this function on each row of our cross product dataframe to calculate a table of distances:
So here we have our distances in kilometers:
We can now for example get the nearest three destinations for each origin along with their distance by grouping on the first level (i.e. “per origin”) and fetching the three smallest distances per group:
I’ve created a gist for the whole code here: