Skip to content
Advertisement

How to get row with max dot product using MemSQL

First of all, I’ll preface this by saying I have minimal experience with SQL.

Anyways, I have a table in MemSQL of the following format:

+-----+--------+----------+----------+
| id  |  uuid  | identity | template |
+-----+--------+----------+----------+
| int | string | string   | blob     |
+-----+--------+----------+----------+

I am trying to use the MemSQL DOT_PRODUCT feature to obtain the identity of the template which generates the maximum dot product against a probe vector that I have provided. Note template is a normalized array of floats of fixed length.

My SQL statement is as follows:

SELECT id, identity, MAX(DOT_PRODUCT(template, JSON_ARRAY_PACK('[<probe template here>]'))) 
AS score FROM collection;

However, I seem to be experiencing strange behavior, where I am getting inconsistent results (1 out of 10 times that I execute the query I will get a different identity but always the same max score). Additionally, the identity is incorrect (see further below).

The result from the query is the following (9 out of 10 times):

+----+-------------+------------------+
| id |  identity   |      score       |
+----+-------------+------------------+
|  7 | armstrong_2 | 0.56488848850131 |
+----+-------------+------------------+

As a sanity check, I wrote the following SQL statement, expecting for the max to be the same. Note, I am using the exact same probe vector from before:

SELECT id, identity, DOT_PRODUCT(template, JSON_ARRAY_PACK('[<same probe template from before>]')) 
AS score FROM collection ORDER BY score DESC;

The results are as follows:

+----+--------------+--------------------+
| id |   identity   |       score        |
+----+--------------+--------------------+
|  1 | armstrong_1  | 0.56488848850131   |
| 21 | armstrong_1  | 0.56488848850131   |
|  6 | armstrong_1  | 0.56488848850131   |
| 11 | armstrong_1  | 0.56488848850131   |
| 16 | armstrong_1  | 0.56488848850131   |
| 17 | armstrong_2  | 0.534708674997091  |
|  7 | armstrong_2  | 0.534708674997091  |
| 22 | armstrong_2  | 0.534708674997091  |
|  2 | armstrong_2  | 0.534708674997091  |
| 12 | armstrong_2  | 0.534708674997091  |
| 10 | mr_bean_2    | 0.072085081599653  |
| 15 | mr_bean_2    | 0.072085081599653  |
|  5 | mr_bean_2    | 0.072085081599653  |
| 20 | mr_bean_2    | 0.072085081599653  |
| 25 | mr_bean_2    | 0.072085081599653  |
| 14 | mr_bean      | 0.037121964152902  |
|  9 | mr_bean      | 0.037121964152902  |
|  4 | mr_bean      | 0.037121964152902  |
| 19 | mr_bean      | 0.037121964152902  |
| 24 | mr_bean      | 0.037121964152902  |
| 13 | jimmy_carter | -0.011749440804124 |
| 23 | jimmy_carter | -0.011749440804124 |
| 18 | jimmy_carter | -0.011749440804124 |
|  8 | jimmy_carter | -0.011749440804124 |
|  3 | jimmy_carter | -0.011749440804124 |
+----+--------------+--------------------+

What is going on? Why is the MAX identity from the first query not the same as the max identity (top row) for the second query? Is one / both of my query statements incorrect?

Additionally, when I compute the dot product by hand (without any SQL or MemSQL), I find that armstrong_1 does indeed produce the highest score of 0.56488848850131. So why is my first SQL query (with the MAX operator) not working?

Advertisement

Answer

This is simply not valid SQL:

SELECT id, identity, MAX(DOT_PRODUCT(template, JSON_ARRAY_PACK('[<probe template here>]'))) AS score
FROM collection;

You have no GROUP BY but the query is an aggregation function (due to the MAX(). Then there are two other columns. This is not valid SQL and it is sad that some databases allow it.

The best approach is ORDER BY:

SELECT id, identity, DOT_PRODUCT(template, JSON_ARRAY_PACK('[<probe template here>]')) AS score
FROM collection
ORDER BY score DESC
LIMIT 1;   -- or whatever your database uses to limit to one row
User contributions licensed under: CC BY-SA
1 People found this is helpful
Advertisement