Statistics on Query Time (PostgreSQL)

I have a table with a billion rows and I would like to determine the average time and standard deviation of time for several queries of the form:

select * from mytable where col1 = '36e2ae77-43fa-4efa-aece-cd7b8b669043';
select * from mytable where col1 = '4b58c002-bea4-42c9-8f31-06a499cabc51';
select * from mytable where col1 = 'b97242ae-9f6c-4f36-ad12-baee9afae194';

....

I have a thousand random values for col1 stored in another table.

Is there some way to store how long each of these queries took (in milliseconds) in a separate table, so that I can run some statistics on them? Something like: for each col1 in my random table, execute the query, record the time, then store it in another table.
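
Concretely, I imagine something like this, where the table and column names are just placeholders I made up:

create table query_timings (
    col1        uuid,               -- placeholder: use whatever type mytable.col1 actually has
    elapsed_ms  double precision    -- how long the select took, in milliseconds
);

Each execution of one of the queries above would add a row to query_timings, and afterwards I could run avg() and stddev() over elapsed_ms.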

A completely different approach would be fine, as long as I can stay within PostgreSQL (i.e., I don’t want to write an external program to do this).

Answer

Are you aware of the EXPLAIN statement?

This command displays the execution plan that the PostgreSQL planner generates for the supplied statement. The execution plan shows how the table(s) referenced by the statement will be scanned — by plain sequential scan, index scan, etc. — and if multiple tables are referenced, what join algorithms will be used to bring together the required rows from each input table.

The most critical part of the display is the estimated statement execution cost, which is the planner’s guess at how long it will take to run the statement (measured in units of disk page fetches). Actually two numbers are shown: the start-up time before the first row can be returned, and the total time to return all the rows. For most queries the total time is what matters, but in contexts such as a subquery in EXISTS, the planner will choose the smallest start-up time instead of the smallest total time (since the executor will stop after getting one row, anyway). Also, if you limit the number of rows to return with a LIMIT clause, the planner makes an appropriate interpolation between the endpoint costs to estimate which plan is really the cheapest.
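
For illustration, plain EXPLAIN on one of your queries (using one of your sample values) only plans the statement and prints these estimated costs without actually running it:

explain select * from mytable where col1 = '36e2ae77-43fa-4efa-aece-cd7b8b669043';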

The ANALYZE option causes the statement to be actually executed, not only planned. The total elapsed time expended within each plan node (in milliseconds) and total number of rows it actually returned are added to the display. This is useful for seeing whether the planner’s estimates are close to reality.
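
Since you want real timings rather than planner estimates, the ANALYZE form is the relevant one; note that the statement genuinely executes:

explain analyze select * from mytable where col1 = '36e2ae77-43fa-4efa-aece-cd7b8b669043';

The last line of the output reports the total execution time in milliseconds (the exact label varies between PostgreSQL versions).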

You could fairly easily write a script that runs EXPLAIN ANALYZE on your query for each of the random values in your table and saves the output to a file, a table, or wherever you like.
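
If you would rather stay entirely inside PostgreSQL, a PL/pgSQL DO block can play the role of that script. The following is only a minimal sketch: it assumes your thousand values live in a table I am calling random_values (column col1) and that the timings should go into a table I am calling query_timings, and it measures elapsed time with clock_timestamp() instead of parsing EXPLAIN ANALYZE output:

create table if not exists query_timings (
    col1       text,
    elapsed_ms double precision
);

do $$
declare
    v       record;
    t_start timestamptz;
begin
    -- random_values is a placeholder name for your table of sample col1 values
    for v in select col1 from random_values loop
        t_start := clock_timestamp();
        -- execute the query and discard the result set
        perform * from mytable where col1 = v.col1;
        insert into query_timings (col1, elapsed_ms)
        values (v.col1::text,
                extract(epoch from clock_timestamp() - t_start) * 1000);
    end loop;
end;
$$;

The statistics are then a single query:

select avg(elapsed_ms), stddev(elapsed_ms) from query_timings;

Bear in mind this measures server-side execution only (no network or client overhead), and the first run of each query may be slower than later ones because of caching.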
