I have this table `documents`, based on a client’s set of data. The only key the client provides is `documentId`, but it’s non-unique (`rowId` is mine, not the client’s).
The table also tracks which documents have been processed and which have not, plus some additional data. `isProcessed` is a generated column (`isProcessed bool as (processed is not null)`) added for indexing purposes.
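A minimal sketch of the table definition, with column types and the index assumed rather than taken from the question (the exact schema is in the fiddle linked below):

```sql
-- Sketch only: column types and the index are assumptions,
-- the real schema is in the linked fiddle.
CREATE TABLE documents (
  rowId       INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
  documentId  INT UNSIGNED NOT NULL,            -- client's key, non-unique
  received    DATETIME NOT NULL,
  processed   DATETIME NULL,
  user        VARCHAR(32) NULL,
  -- virtual generated column, kept so it can be indexed
  isProcessed BOOL AS (processed IS NOT NULL),
  KEY idx_documentId_isProcessed (documentId, isProcessed)  -- assumed index
);
```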
```sql
SELECT * FROM documents;
```

| rowId | documentId | received            | processed           | user  | isProcessed |
| ----- | ---------- | ------------------- | ------------------- | ----- | ----------- |
| 1     | 10020      | 2020-11-27 15:01:00 | 2020-11-27 17:17:17 | user1 | 1           |
| 2     | 10020      | 2020-11-27 15:02:00 | 2020-11-27 17:17:17 | user1 | 1           |
| 3     | 10021      | 2020-11-27 15:03:00 | 2020-11-27 17:17:17 | user1 | 1           |
| 4     | 10022      | 2020-11-27 15:04:00 | 2020-11-27 17:17:17 | user1 | 1           |
| 5     | 10022      | 2020-11-27 15:05:00 | 2020-11-27 17:17:17 | user1 | 1           |
| 6     | 10022      | 2020-11-27 15:06:00 |                     |       | 0           |
| 7     | 10022      | 2020-11-27 15:07:00 |                     |       | 0           |
| 8     | 10022      | 2020-11-27 15:08:00 |                     |       | 0           |
| 9     | 10023      | 2020-11-27 15:09:00 |                     |       | 0           |
| 10    | 10024      | 2020-11-27 15:10:00 |                     |       | 0           |
| 11    | 10024      | 2020-11-27 15:11:00 |                     |       | 0           |
| 12    | 10025      | 2020-11-27 15:12:00 |                     |       | 0           |
| 13    | 10025      | 2020-11-27 15:13:00 |                     |       | 0           |
| 14    | 10025      | 2020-11-27 15:14:00 |                     |       | 0           |
| 15    | 10025      | 2020-11-27 15:15:00 |                     |       | 0           |
Then, when employees process some documents, they pass the `documentId`s into an app, which generates JSON like this:
{ "user": "user2", "documents": [ 10022, 10022, 10023, 10024, 10025, 10025, 10025 ] }
Now I need to figure out how to:
a) update `documents` to set `processed` and `user` on the first matching unprocessed rows, but no more than that!
That’s 2 x 10022 (rows 6, 7), 1 x 10023 (row 9), 1 x 10024 (row 10), 3 x 10025 (rows 12, 13, 14).
What I’ve got so far is:
```sql
SET @processedTS = NOW();

UPDATE documents SET user = 'user2', processed = @processedTS
WHERE documentId = 10022 AND !isProcessed ORDER BY rowId LIMIT 2;

UPDATE documents SET user = 'user2', processed = @processedTS
WHERE documentId = 10023 AND !isProcessed ORDER BY rowId LIMIT 1;

UPDATE documents SET user = 'user2', processed = @processedTS
WHERE documentId = 10024 AND !isProcessed ORDER BY rowId LIMIT 1;

UPDATE documents SET user = 'user2', processed = @processedTS
WHERE documentId = 10025 AND !isProcessed ORDER BY rowId LIMIT 3;
```
Is there a smarter way to approach this than generating multiple update statements?
b) select the updated documents to get their additional information and make a results file to send to the client:

```sql
SELECT * FROM documents WHERE processed = @processedTS;
```
Is there a smarter way to select the updated data?
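For comparison, one sketched variant (it assumes `@processedTS` together with the user uniquely identifies a run) adds `user` as a second filter and lets MySQL 8 aggregate the rows straight into JSON with `JSON_OBJECT` / `JSON_ARRAYAGG`; note that `JSON_ARRAYAGG` does not guarantee element order:

```sql
-- Sketch only: same selection with `user` as an extra guard, aggregated
-- directly into a JSON array (element order is not guaranteed here).
SELECT JSON_ARRAYAGG(
         JSON_OBJECT(
           'documentId', documentId,
           'received',   received,
           'processed',  processed,
           'user',       user
         )
       ) AS results
FROM documents
WHERE processed = @processedTS
  AND user = 'user2';
```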
Expected results
a) table `documents` after the updates:
| rowId | documentId | received            | processed           | user  | isProcessed |
| ----- | ---------- | ------------------- | ------------------- | ----- | ----------- |
| 1     | 10020      | 2020-11-27 15:01:00 | 2020-11-27 17:17:17 | user1 | 1           |
| 2     | 10020      | 2020-11-27 15:02:00 | 2020-11-27 17:17:17 | user1 | 1           |
| 3     | 10021      | 2020-11-27 15:03:00 | 2020-11-27 17:17:17 | user1 | 1           |
| 4     | 10022      | 2020-11-27 15:04:00 | 2020-11-27 17:17:17 | user1 | 1           |
| 5     | 10022      | 2020-11-27 15:05:00 | 2020-11-27 17:17:17 | user1 | 1           |
| 6     | 10022      | 2020-11-27 15:06:00 | 2020-11-27 19:34:58 | user2 | 1           |
| 7     | 10022      | 2020-11-27 15:07:00 | 2020-11-27 19:34:58 | user2 | 1           |
| 8     | 10022      | 2020-11-27 15:08:00 |                     |       | 0           |
| 9     | 10023      | 2020-11-27 15:09:00 | 2020-11-27 19:34:58 | user2 | 1           |
| 10    | 10024      | 2020-11-27 15:10:00 | 2020-11-27 19:34:58 | user2 | 1           |
| 11    | 10024      | 2020-11-27 15:11:00 |                     |       | 0           |
| 12    | 10025      | 2020-11-27 15:12:00 | 2020-11-27 19:34:58 | user2 | 1           |
| 13    | 10025      | 2020-11-27 15:13:00 | 2020-11-27 19:34:58 | user2 | 1           |
| 14    | 10025      | 2020-11-27 15:14:00 | 2020-11-27 19:34:58 | user2 | 1           |
| 15    | 10025      | 2020-11-27 15:15:00 |                     |       | 0           |
b) results file:

```json
[
  { "documentId": 10022, "received": "2020-11-27 15:06:00", "processed": "2020-11-27 19:34:58", "user": "user2" },
  { "documentId": 10022, "received": "2020-11-27 15:07:00", "processed": "2020-11-27 19:34:58", "user": "user2" },
  { "documentId": 10023, "received": "2020-11-27 15:09:00", "processed": "2020-11-27 19:34:58", "user": "user2" },
  { "documentId": 10024, "received": "2020-11-27 15:10:00", "processed": "2020-11-27 19:34:58", "user": "user2" },
  { "documentId": 10025, "received": "2020-11-27 15:12:00", "processed": "2020-11-27 19:34:58", "user": "user2" },
  { "documentId": 10025, "received": "2020-11-27 15:13:00", "processed": "2020-11-27 19:34:58", "user": "user2" },
  { "documentId": 10025, "received": "2020-11-27 15:14:00", "processed": "2020-11-27 19:34:58", "user": "user2" }
]
```
(I know how to make that JSON, I’m looking for a smart way to select the updated data)
My question is…
In production, `documents` will have some 300k+ rows, and each save run will affect hundreds of rows. Is there a smarter way to update in bulk and select the updated data?
Ideally, I’m looking for a way to pass `[10022, 10022, 10023, 10024, 10025, 10025, 10025]` to the update in a way that respects FIFO order and matches only the listed documents, no more. Running hundreds of update statements does not seem ideal.
I’m using MySQL 8.0. The app is written in JavaScript (TypeScript) and runs on Node.js v12. I’m primarily looking for a smarter MySQL solution; for suggestions that put some logic in the app layer, the answer doesn’t have to be in JS, any pseudocode is appreciated too.
Fiddle w/ table schema here: https://www.db-fiddle.com/f/ecYLwEGpwZpPtec6VwzMBF/1
Thank you very much.
Answer
A way to find the records in table `documents` that need an update:
```sql
SET @json = '{ "user": "user2", "documents": [ 10022, 10022, 10023, 10024, 10025, 10025, 10025 ]}';

WITH RECURSIVE json AS (
  SELECT 0 AS i
  UNION ALL
  SELECT i + 1 FROM json WHERE i < 10
),
to_update AS (
  SELECT JSON_EXTRACT(@json, CONCAT("$.documents[", i, "]")) AS document,
         COUNT(*) AS count
  FROM json
  WHERE JSON_EXTRACT(@json, CONCAT("$.documents[", i, "]")) IS NOT NULL
  GROUP BY document
)
SELECT *
FROM (
  SELECT documents.*,
         ROW_NUMBER() OVER (PARTITION BY documents.documentId
                            ORDER BY documents.rowId) AS nr,
         to_update.count AS count
  FROM documents
  INNER JOIN to_update ON to_update.document = documents.documentId
  WHERE NOT documents.isProcessed
) x
WHERE nr <= count;
```
This will output:
```
+-------+------------+---------------------+-----------+------+-------------+----+-------+
| rowId | documentId | received            | processed | user | isProcessed | nr | count |
+-------+------------+---------------------+-----------+------+-------------+----+-------+
|     6 |      10022 | 2020-11-27 15:06:00 | NULL      | NULL |           0 |  1 |     2 |
|     7 |      10022 | 2020-11-27 15:07:00 | NULL      | NULL |           0 |  2 |     2 |
|     9 |      10023 | 2020-11-27 15:09:00 | NULL      | NULL |           0 |  1 |     1 |
|    10 |      10024 | 2020-11-27 15:10:00 | NULL      | NULL |           0 |  1 |     1 |
|    12 |      10025 | 2020-11-27 15:12:00 | NULL      | NULL |           0 |  1 |     3 |
|    13 |      10025 | 2020-11-27 15:13:00 | NULL      | NULL |           0 |  2 |     3 |
|    14 |      10025 | 2020-11-27 15:14:00 | NULL      | NULL |           0 |  3 |     3 |
+-------+------------+---------------------+-----------+------+-------------+----+-------+
```
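The query above only finds the rows; one possible way to turn it into the actual update (a sketch building on that query, not part of the original answer) is to wrap it in a derived table and join it back to `documents` on `rowId`:

```sql
-- Sketch: reuse the finding query as a derived table and update the matched
-- rowIds. The i < 10 bound limits the payload to 11 documentIds; raise it
-- (or derive it from JSON_LENGTH(@json, '$.documents')) for bigger payloads.
SET @json = '{ "user": "user2", "documents": [ 10022, 10022, 10023, 10024, 10025, 10025, 10025 ]}';
SET @processedTS = NOW();

UPDATE documents
JOIN (
  WITH RECURSIVE json AS (
    SELECT 0 AS i
    UNION ALL
    SELECT i + 1 FROM json WHERE i < 10
  ),
  to_update AS (
    SELECT JSON_EXTRACT(@json, CONCAT('$.documents[', i, ']')) AS document,
           COUNT(*) AS cnt
    FROM json
    WHERE JSON_EXTRACT(@json, CONCAT('$.documents[', i, ']')) IS NOT NULL
    GROUP BY document
  )
  SELECT x.rowId
  FROM (
    SELECT documents.rowId,
           ROW_NUMBER() OVER (PARTITION BY documents.documentId
                              ORDER BY documents.rowId) AS nr,
           to_update.cnt
    FROM documents
    INNER JOIN to_update ON to_update.document = documents.documentId
    WHERE NOT documents.isProcessed
  ) x
  WHERE x.nr <= x.cnt
) hits ON hits.rowId = documents.rowId
SET documents.user      = JSON_UNQUOTE(JSON_EXTRACT(@json, '$.user')),
    documents.processed = @processedTS;
```

The rows for the results file can then be selected with the same `processed = @processedTS` filter as in the question.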