I’m not particularly familiar with Ruby on Rails, but I’m troubleshooting an issue we’re experiencing with a rake job that is supposed to be cleaning database tables. The tables grow very large very quickly, and the query generated by ActiveRecord doesn’t seem to be efficient enough to handle it.
The Ruby calls looks like this:
Source.where("id not IN (#{Log.select('DISTINCT source_id').to_sql})").delete_all
and this:
Log.joins(:report).where(:report_id => Report.where(cond)).delete_all
I’m trying to get at the SQL, so we can have our DBA’s attempt to optimize it better. I’ve noticed if I drop the “.delete_all” I can add a “.to_sql” which gives me the SELECT statement of the query, prior to the call to “.delete_all”. I’d like to see what SQL is being generated by that delete_all method though.
Is there a way to do that?
Advertisement
Answer
From the fine manual:
delete_all(conditions = nil)
Deletes the records matching
conditions
without instantiating the records first, and hence not calling thedestroy
method nor invoking callbacks. This is a single SQL DELETE statement that goes straight to the database, much more efficient thandestroy_all
.
So a Model.delete_all(conditions)
ends up as
delete from models where conditions
When you say Model.where(...).delete_all
, the conditions
for the delete_all
come from the where
calls so these are the same:
Model.delete_all(conditions) Model.where(conditions).delete_all
Applying that to your case:
Source.where("id not IN (#{Log.select('DISTINCT source_id').to_sql})").delete_all
you should see that you’re running:
delete from sources where id not in ( select distinct source_id from logs )
If you run your code in a development console you should see the SQL in the console or the Rails logs but it will be as above.
As far as optimization goes, my first step would be to drop the DISTINCT. DISTINCT usually isn’t cheap and IN
doesn’t care about duplicates anyway so not in (select distinct ...)
is probably pointless busy work. Then maybe an index on source_id
would help, the query optimizer might be able to slurp the source_id
list straight out of the index without having to do a table scan to find them. Of course, query optimization is a bit of a dark art so these simple steps may or may not work.