It’s a fairly common use case to have a latitude and longitude and want
to find the closest objects to a given point. While there are
heavyweight solutions, such as MySQL Spatial Extensions and PostGIS,
they can be more trouble than they’re worth, especially if you’re
making use of an ORM.
Euclidean Distance (naive, flawed, but simple)
The most naive approach is to calculate the Euclidean distance between
each object and the point. It works OK so long as the distances are small and
you’re closer to the equator than you are to the poles. You’ll likely
recognize the math from high school geometry class: a^2 + b^2 = c^2.
A solution in SQL (MySQL) might look something like the following.
CREATE TABLE Restaurant (
id INTEGER AUTO_INCREMENT PRIMARY KEY,
name VARCHAR(64) NOT NULL,
lat DOUBLE NOT NULL,
lon DOUBLE NOT NULL
);
set @point_lat = 37.757001;
set @point_lon = -122.420588;
set @radius = 100;
SELECT * from (SELECT id, name, lat, lon,
sqrt(pow(@point_lat - lat, 2) + pow(@point_lon - lon, 2))
* 110574.61087757687 as distance
FROM Restaurant) i
WHERE distance < @radius ORDER BY distance ASC;
We’re computing the square root of the sum of the squared differences
between each row and the search point, scaled to meters. If it’s less
than our search radius we let the row through and sort the passing
Restaurants by their distance from the search point. Note the use of a
sub-select: you can’t reference the computed distance alias in a WHERE
clause, because WHERE is applied before the expressions in the select
list are evaluated. Alternatively you could make use of HAVING,
but I have chosen this route. The other thing that needs explaining is
the number 110574.61087757687. It’s the distance in meters of a single
degree of latitude. You’ll see it used throughout as we’ll be working in
meters.
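If you want to sanity-check that constant, a rough spherical
approximation gets you in the right neighborhood (this sketch assumes a
mean Earth radius of about 6371 km; it is not the exact source of the
value above):
-- One degree of latitude on a sphere of radius ~6371 km
SELECT 2 * PI() * 6371000 / 360 AS approx_meters_per_degree; -- ≈ 111195
The slightly smaller 110574.61… figure corresponds to a degree of
latitude measured near the equator on an ellipsoidal model of the
Earth; for our purposes either is close enough.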
There are two problems here; the first is performance.
As written, the above query will run two subtractions, square the
results, and take a square root for every single row in the
Restaurant table. As soon as you have a city’s worth of rows that’s
going to be a performance problem, and it’ll only get worse from there.
+----+-------------+------------+------+---------------+------+---------+------+-------+-----------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+------------+------+---------------+------+---------+------+-------+-----------------------------+
| 1 | PRIMARY | <derived2> | ALL | NULL | NULL | NULL | NULL | 35906 | Using where; Using filesort |
| 2 | DERIVED | Restaurant | ALL | NULL | NULL | NULL | NULL | 32160 | |
+----+-------------+------------+------+---------------+------+---------+------+-------+-----------------------------+
2 rows in set (0.08 sec)
There’s a relatively easy solution. We need to consider fewer rows and
do as little computation as possible. The trick is to create a
bounding box and only consider rows that fall inside of it. Since we are
searching within a radius, we know that anything falling outside a
box 2 * radius wide centered on our point must also fall outside a
circle of the same radius centered on that point.
set @r = (@radius / 110574.61087757687);
set @lat_min = @point_lat - @r;
set @lat_max = @point_lat + @r;
set @r = (@radius / 110574.61087757687);
set @lon_min = @point_lon - @r;
set @lon_max = @point_lon + @r;
SELECT * from (SELECT id, name, lat, lon,
sqrt(pow(@point_lat - lat, 2) + pow(@point_lon - lon, 2))
* 110574.61087757687 as distance
FROM Restaurant
WHERE lat BETWEEN @lat_min AND @lat_max AND
lon BETWEEN @lon_min AND @lon_max) i
WHERE distance < @radius ORDER BY distance ASC;
We’ve added a where clause to the inner select that filters out rows
that have no chance of being inside our radius, so when our search point
is in San Francisco and we’re looking for stuff within 1000m we won’t
bother doing the math for things in NYC, or even on the other side of
town.
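To make that concrete: with a 1000m radius, @r works out to roughly
1000 / 110574.61 ≈ 0.00904 degrees, so the box is only about 0.018
degrees on a side and only rows with lat between roughly 37.748 and
37.766 (and lon between roughly -122.430 and -122.412) ever reach the
distance calculation.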
That’s not quite enough though. If we use EXPLAIN we’ll see that a full
table scan is still occurring to do the bounding box check. We’ll need an
index to address this.
... explain query ...
+----+-------------+------------+------+---------------+------+---------+------+-------+----------+-----------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+-------------+------------+------+---------------+------+---------+------+-------+----------+-----------------------------+
| 1 | PRIMARY | <derived2> | ALL | NULL | NULL | NULL | NULL | 122 | 100.00 | Using where; Using filesort |
| 2 | DERIVED | Restaurant | ALL | NULL | NULL | NULL | NULL | 32160 | 100.00 | Using where |
+----+-------------+------------+------+---------------+------+---------+------+-------+----------+-----------------------------+
2 rows in set, 1 warning (0.05 sec)
CREATE INDEX Restaurant_lat_lon ON Restaurant(lat, lon);
... explain query ...
+----+-------------+------------+-------+--------------------+--------------------+---------+------+------+----------+-----------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+-------------+------------+-------+--------------------+--------------------+---------+------+------+----------+-----------------------------+
| 1 | PRIMARY | <derived2> | ALL | NULL | NULL | NULL | NULL | 122 | 100.00 | Using where; Using filesort |
| 2 | DERIVED | Restaurant | range | Restaurant_lat_lon | Restaurant_lat_lon | 16 | NULL | 1043 | 100.00 | Using where |
+----+-------------+------------+-------+--------------------+--------------------+---------+------+------+----------+-----------------------------+
2 rows in set, 1 warning (0.01 sec)
Even in my tiny test dataset there’s over an order of magnitude
reduction in the number of rows considered, and it sped things up
around 5x (though admittedly both are fast and two runs is hardly a
scientific comparison).
Correctness (now that’s much better)
So now that things are fast and we’re doing as little math as
possible, we can do a much better job of accurately computing the
distance between points. The Earth is not flat; it’s (roughly) a sphere.
While a degree of latitude spans essentially the same distance anywhere
it occurs, the same can’t be said for longitude: a degree of longitude
is much smaller near the poles than at the equator.
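To put rough numbers on that, here’s a quick illustrative query. It
scales the same per-degree constant used above by the cosine of the
latitude; the exact figures depend on the Earth model you assume, so
treat these as ballpark values.
SELECT 110574.61087757687 * COS(RADIANS(0))      AS at_equator,  -- ~110575 m
       110574.61087757687 * COS(RADIANS(37.757)) AS in_sf,       -- ~87400 m
       110574.61087757687 * COS(RADIANS(80))     AS near_poles;  -- ~19200 m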
Enter the Haversine Formula, which accurately computes great-circle
distances: the distance between two points on the surface of a sphere.
While the Earth isn’t a perfect sphere, it’s much closer to one than it
is to a plane. I won’t bother explaining the math; Wikipedia already
does so and I’d just mess it up. I will however translate it into SQL
for you. We’ll also adjust our bounding box calculation to account for
longitudinal variation.
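For reference, the formula the query below implements is (angles in
radians, R the Earth’s radius):
distance = 2 * R * asin(sqrt(sin^2((lat2 - lat1) / 2) + cos(lat1) * cos(lat2) * sin^2((lon2 - lon1) / 2)))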
set @r = (@radius / 110574.61087757687);
set @lat_min = @point_lat - @r;
set @lat_max = @point_lat + @r;
set @r = (@radius / abs(cos(radians(@point_lat)) * 110574.61087757687));
set @lon_min = @point_lon - @r;
set @lon_max = @point_lon + @r;
SELECT * from (SELECT id, name, lat, lon, 12756200 *
asin(sqrt(pow(sin(radians(@point_lat - lat) * 0.5), 2) +
cos(radians(@point_lat)) *
cos(radians(lat)) *
pow(sin(radians(@point_lon - lon) * 0.5), 2)))
as distance FROM Restaurant
WHERE lat BETWEEN @lat_min AND @lat_max AND
lon BETWEEN @lon_min AND @lon_max) i
WHERE distance < @radius ORDER BY distance ASC;
If you look closely you’ll notice a new magic number in there,
12756200, which is the diameter of the Earth in meters (2 * 6378100,
twice the equatorial radius; it plays the role of the 2 * R factor in
the formula). This query will perform roughly the same as the previous
performance-enhanced version even though it’s doing more complicated
math. If we weren’t filtering with a bounding box we’d see a much more
apparent slowdown.
I plan to follow this post up with one on how I implemented
this solution with a Django model, focusing on optimizing the queries
run to fetch related data (found here).
Note: the foundation for my final solution was a MySQL presentation on
the subject, but I found it to be lacking in explanation and there were
quite a few things that could (really, needed to) be optimized. If you’d
like to take a look at that document it can be found here.