I’ve got a Snowflake SQL query I’m trying to execute in R via ODBC connection that looks like this
SET quiet=TRUE; USE SOMEDATABASE.SOMESCHEMA; --Select timestamp of last sale per customer DROP TABLE IF EXISTS sales; CREATE TEMPORARY TABLE sales(CustomerId VARCHAR(16777216), SaleTS TIMESTAMP_NTZ(9)); INSERT INTO sales SELECT CustomerId, SaleTS FROM SALES WHERE SaleTS>= '2020-11-19 00:00:00' AND SaleTS <= '2020-11-19 23:59:59.999' GROUP BY CustomerId; --Use temp table to get correct row from sales table SELECT SUM(SalesDetail.price) as SumPrice COUNT(*) as SoldVolume FROM sales LEFT JOIN SALES as SalesDetail ON Sales.CustomerId = SalesDetail.CustomerId AND sales.SaleTS = SalesDetail.SaleTS
Querying Microsoft SQL Server from R I’d normally include set nocount no;
at the top of the query to ensure only the last step is returned to R to avoid the error Actual statement count 6 did not match the desired statement count 1.
Error makes sense, SQL is returning 6 components when R is expecting 1 (6 one for each step in my SQL query). In Snowflake there doesn’t appear to be an option to set nocount on in the same way. My question is how do I avoid the above error. Does anyone have any experience of executing a mutli-step Snowflake SQL query via R? How can I get R to receive just the last statement from the ODBC connection. So far I’ve tried set nocount=TRUE;
, set echo=FALSE;
, set message=FALSE;
, SET quiet=TRUE
etc
Advertisement
Answer
Snowflake SQL is expressive enough and proposed code could be structured as single query:
WITH cte AS ( SELECT CustomerId, MAX(SaleTS) AS SaleTS -- here agg function is required FROM SALES WHERE SaleTS>= '2020-11-19 00:00:00' AND SaleTS <= '2020-11-19 23:59:59.999' GROUP BY CustomerId ) SELECT SUM(SalesDetail.price) as SumPrice COUNT(*) as SoldVolume FROM cte LEFT JOIN SALES as SalesDetail ON Sales.CustomerId = SalesDetail.CustomerId AND sales.SaleTS = SalesDetail.SaleTS;
The original query is using the same name for both table and temporary table differing only by case sales
vs SALES
, which is error-prone.
Second: database and schema could be set up during establishing connection, so there is no need for USE
inside script. Alternatively fully qualified name could be used in the script.
I guess the intent of query is as follow:
WITH cte AS ( SELECT * FROM SOMEDATABASE.SOMESCHEMA.SALES WHERE SaleTS BETWEEN '2020-11-19 00:00:00' AND '2020-11-19 23:59:59.999' QUALIFY ROW_NUMBER() OVER(PARTITION BY CustomerId ORDER BY SaleTS DESC) = 1 ) SELECT COUNT(*) AS SoldVolume, SUM(price) as SumPrice FROM cte;
If it is possible that a single person has two entries for exactly the same SaleTS then RANK() OVER(...)
should be used instead.