I have a database of user submitted latitude/longitude points and am trying to group ‘close’ points together. ‘Close’ is relative, but for now it seems to ~500 feet.
At first it seemed I could just group by rows that have the same latitude/longitude for the first 3 decimal places (roughly a 300×300 box, understanding that it changes as you move away from the equator).
However, that method seems to be quite lacking. ‘Closeness’ can’t be significantly different than the distance each decimal place represents. It doesn’t take into account that two locations may have different digits in the 3rd (or any) decimal place, but still be within the distance that place represents (33.1239
and 33.1240
).
I’ve also mulled over the situation where Point A, and Point C are both ‘close’ to Point B (but not each other) – should they be grouped together? If so, what happens when Point D is ‘close’ to point C (and no other points) – should it be grouped as well. Certainly I have to determine the desired behavior, but how would either be implemented?
Can anyone point me in the right direction as to how this can be done and what different methods/approaches can be used?
I feel a bit like I’m missing something obvious.
Currently the data is an a MySQL database, use by a PHP application; however, I’m open to other storage methods if they’re a key part in accomplishing this. here.
Advertisement
Answer
There are a number of ways of determining the distance between two points, but for plotting points on a 2-D graph you probably want the Euclidean distance. If (x1, y1)
represents your first point and (x2, y2)
represents your second, the distance is
d = sqrt( (x2-x1)^2 + (y2-y1)^2 )
Regarding grouping, you may want to use some sort of 2-D mean to determine how “close” things are to each other. For example, if you have three points, (x1, y1)
, (x2, y2)
, (x3, y3)
, you can find the center of these three points by simple averaging:
x(mean) = (x1+x2+x3)/3 y(mean) = (y1+y2+y3)/3
You can then see how close each is to the center to determine whether it should be part of the “cluster”.
There are a number of ways one can define clusters, all of which use some variant of a clustering algorithm. I’m in a rush now and don’t have time to summarize, but check out the link and the algorithms, and hopefully other people will be able to provide more detail. Good luck!