Skip to content
Advertisement

Query to list users who have answered similarly to questions as the specified user

I’m making a project in which users answer questions with yes/no choices and an option of must-match, e.g.:
Question 1) Stack Overflow is helpful? [Yes / No] [MustMatch]
Question 2) Hills are better than beaches? [Yes / No] [MustMatch]
etc.
Users can skip questions if they want.

I need MySQL query to calculate match percentage or match ratio between two specified users; the match ratio being number of same answers / number of common questions. (The MustMatch option should be ignored for this query.) Example:
User 3 has answered as: Q1) Yes, Q2) No, Q3) Skip, Q4) Yes
User 5 has answered as: Q1) Yes, Q2) Yes, Q3) No, Q4) Yes
User 9 has answered as: Q1) Yes, Q2) No, Q3) Yes, Q4) No
Suppose we want match ratio of user 3 and user 5. The match ratio will be same-answers / common-questions = 2/3 as they have answered same to 2 questions (Q1 and Q4) and because user 3 skipped a question so they have 3 questions in common.
If we want match ratio of user 5 and user 9, it’ll be 1/4.

I also need MySQL query to show top 20 matches in descending order of match ratio for a user. The MustMatch option should be considered for this query. This list of top 20 matches will contain userid, no. of common answers, no. of common questions and (match percentage or match ratio). For top 20 matches of example user 5: user 3 will come before user 9 in descending order of match ratio:

UserID CommonAnswers CommonQuestions Percentage  
...  
3           2             3             66%  
...  
9           1             4             25%  
...

“MustMatch” explained:
Simply, the answer must match. If the answer to a question is important compatibility match to a user then they can mark their answer as MustMatch. If a user marks their answer as MustMatch and that particular answer doesn’t match with the other user then don’t show their match in top 20 list of either of the users.
For example, if user 5 marks their answer to the 4th question as MustMatch then in user 5’s top 20 list, user 9 won’t display because of mismatching answers to 4th question. Also then in user 9’s top 20 list, user 5 won’t display.
If user 5 marks their answer to the 3rd question as MustMatch then user 5 won’t match with user 3 as user 3 has skipped that question and user 5 won’t match with user 9 because of mismatching answers to that question.
Skipped question can’t be marked MustMatch, as you can see in below database design that it’s not possible.

This is my table for storing answers:

create table `answers`(
`userid` mediumint unsigned not null,
`qid` tinyint unsigned not null, # question id
`answer` bit(1) not null,
`mustMatch` bit(1) not null default b'0',
primary key (`userid`,`qid`)
);

If a question is skipped then don’t have its record in this answers table.

I’ve a separate table for storing questions’ texts:

create table `questions`(
`qid` tinyint unsigned not null auto_increment primary key,
`question` tinytext not null
);

In calculating top 20 matches, apart from MustMatch, there’s another factor: “MinMatch”, in which users can set the minimum matching answers they want from their matches. If user 5 set their MinMatch as 3 then in top 20 list of any user, user 5 will match with only those users who have minimum 3 same answers as user 5. MinMatch is set by the user; it should be >=1 and <= total questions answered by the user.

create table `users`(
`userid` mediumint unsigned not null auto_increment primary key,
`minMatch` tinyint unsigned default 1 # minimum matching answers
);

Can you please tell how to make the two queries? (my two bolded sentences)

Here’s how the 2nd query may be: (you can skip this)

SELECT
    U2.userid,
    SUM(CASE
            WHEN A1.answer = A2.answer THEN 1
            ELSE 0
        END) AS common_answers,
    SUM(CASE
            WHEN A1.answer IS NOT NULL AND A2.answer IS NOT NULL THEN 1
            ELSE 0
        END) AS common_questions,
    common_answers/common_questions AS ratio
FROM
    questions Q
LEFT OUTER JOIN answers A1 ON
    A1.qid = Q.qid AND
    A1.userid = 1
LEFT OUTER JOIN answers A2 ON
    A2.qid = A1.qid AND
    A2.userid <> A1.userid
LEFT OUTER JOIN users U2 ON
    U2.userid = A2.userid
GROUP BY
    U2.userid
ORDER BY
    ratio DESC
LIMIT 20

Here i’ve modified someone(Tom H)’s query i found on Stack Overflow for a similar question. But my this query doesn’t work (error: Unknown column 'common_answers' in 'field list') and doesn’t include the MustMatch and MinMatch factors.


More edit: (More explanation:)

How my first example is stored in the database:
answers table:

userid qid answer mustMatch
    3   1   1       0
    3   2   0       0
    3   4   1       0
    5   1   1       0
    5   2   1       0
    5   3   0       0
    5   4   1       0
    9   1   1       0
    9   2   0       0
    9   3   1       0
    9   4   0       0   

As discussed, match ratio of user 3 and user 5 = 2/3, match ratio of user 5 and user 9 = 1/4.
Let’s add a new user: User 4:
User 4 has answered as: Q1) Yes, Q2) No, Q3) No, Q4) Yes

userid qid answer mustMatch
    4   1   1       0
    4   2   0       0
    4   3   0       0
    4   4   1       0

Let’s start calculating top 20 but before that we need to know the users’ minMatch values:
users table:

userid minMatch
    3   1
    5   1
    9   1
    4   1

Top 20 matches of user 5:
Result:

UserID CommonAnswers CommonQuestions Percentage
    4       3               4           75%
    3       2               3           66%
    9       1               4           25%

If user 5 marks their answer to 3rd question as must match, i.e. this record:

userid qid answer mustMatch
    5   3   0       0

changes to this:

    5   3   0       1

Then user 5’s top 20 list will be:

UserID CommonAnswers CommonQuestions Percentage
    4       3               4           75%

Because only user 4 matches with user 5’s answer to 3rd question.
Also then user 3’s top 20 list will be:

UserID CommonAnswers CommonQuestions Percentage
    4       3               3           100%
    9       2               3           66%

Now if user 9 increases their min match to 3, i.e. this record:

userid minMatch
    9   1

changes to this:

    9   3

Then user 3’s top 20 list will be:

UserID CommonAnswers CommonQuestions Percentage
    4       3               3           100%

User 9 got removed from user 3’s top 20 list, because user 3 and user 9 have only 2 answers in common while user 9 requires minimum 3 answers to be in common.
Also then user 9’s top 20 list will be empty as no user has at least 3 matching answers with user 9.


Instead of storing answers as i’m storing now, i can store them as strings in the users table if that’s a better method. Adding a field answers to the users table:

create table `users`(
`userid` mediumint unsigned not null auto_increment primary key,
`minMatch` tinyint unsigned default 1, # minimum matching answers
`answers` char(5) not null default 'sssss' # 5 questions
);

Then answers can be stored as strings e.g. yYNns, where y=yes to 1st and 2nd questions in this string, n=no to 3rd and 4th questions, capitalizations to 2nd and 3rd questions (‘Y’ and ‘N’) means must match, s=skip the 5th question.

Or maybe some other better storage method. You can give me the 2 SQL queries according to whichever method you think is the best.

Advertisement

Answer

For the first part where you just want to get the ratio of equally answered questions to common answered questions, I’d use this query:

select a1.userid, a2.userid, sum(a1.answer = a2.answer) / count(*) as ratio
from answers a1
join answers a2 on a2.qid = a1.qid and a2.userid > a1.userid
group by a1.userid, a2.userid
order by a1.userid, a2.userid;

I am using a2.userid > a1.userid to get each user pair only once. (If I used a2.userid <> a1.userid, I’d get users 3 and 5 twice for instance, one time as 3/5 and one time as 5/3.)

If you want to look up two particular users, we add the appropriate WHERE clause and remove and a2.userid > a1.userid. We can also get rid of GROUP BY, because there is just one row to return.

select sum(a1.answer = a2.answer) / count(*) as ratio
from answers a1
join answers a2 on a2.qid = a1.qid
where a1.userid = 3 and a2.userid = 5;

For the top 20 query we have one user given and need a HAVING clause to make sure no mustmatch and minmatch is violated.

select
  a1.userid,
  a2.userid,
  count(*) as common_questions,
  sum(a1.answer = a2.answer) as common_answers,
  sum(a1.answer = a2.answer) / count(*) as ratio
from answers a1
join answers a2 on a2.qid = a1.qid and a2.userid <> a1.userid
where a2.userid = 9
group by a1.userid, a2.userid
having sum(a1.answer <> a2.answer and (a1.mustmatch = 1 or a2.mustmatch = 1)) = 0
and sum(a1.answer = a2.answer) >=
      all (select minmatch from users u where u.userid in (a1.userid, a2.userid))
order by ratio desc
limit 20;

Demo: https://dbfiddle.uk/?rdbms=mysql_8.0&fiddle=21edc084a0843987a14cb8b49f0f0213

Both queries use sum( <boolean expresssion> ). As true is 1 and false is 0 in MySQL this counts matches. The standard SQL way of writing this would be sum(case when <boolean expresssion> then 1 else 0 end) or count(*) filter (where <boolean expresssion>).

User contributions licensed under: CC BY-SA
6 People found this is helpful
Advertisement