
How to do sampling in sql query to get dataframe with pandas

Note that my question is a bit different from similar ones:

I am working with pandas on a dataset that has a lot of data (10M+):

q = "SELECT COUNT(*) as total FROM `<public table>`"
df = pd.read_gbq(q, project_id=project, dialect='standard')

I know I can sample with the pandas sample function using the frac option, like

df_sample = df.sample(frac=0.01)

however, I do not want to load the full-size original df first. I wonder what the best practice is for generating a dataframe whose data is already sampled.

I’ve read some SQL posts where the sample was generated from a slice of the table; that is absolutely not acceptable in my case. The sampled data needs to be as evenly distributed as possible.

Can anyone shed some more light on this?

Thank you very much.

UPDATE:

Below is a table showing how the data looks like:

[image: a table of sample records, including a Reputation column]

Reputation is the field I am working on. You can see that the majority of records have a very small reputation.

I don’t want to work with a dataframe containing all the records; I want the sampled data to look like the un-sampled data, for example with a similar histogram. That’s what I meant by “evenly”.

I hope this clarifies a bit.


Answer

A simple random sample can be performed using the following syntax:

select * from mydata where rand()>0.9

This gives each row in the table a 10% chance of being selected. It doesn’t guarantee a certain sample size, or guarantee that every bin is represented (that would require a stratified sample). Here’s a fiddle of this approach:

http://sqlfiddle.com/#!9/21d1ee/2
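To avoid materialising the full table in pandas, the same rand() filter can be pushed into the query string passed to read_gbq, so the sampling happens in BigQuery before any data is downloaded. A minimal sketch, assuming BigQuery standard SQL (where RAND() < 0.01 keeps each row with roughly 1% probability, equivalent to the rand()>0.9 form above for a 10% sample); the table name and fraction here are placeholders:

```python
def sample_query(table: str, frac: float) -> str:
    """Build a BigQuery standard-SQL query that keeps each row with probability `frac`."""
    return f"SELECT * FROM `{table}` WHERE RAND() < {frac}"

# hypothetical table name, 1% Bernoulli sample
q = sample_query("bigquery-public-data.stackoverflow.users", 0.01)

# To actually run it (needs pandas-gbq and GCP credentials):
# import pandas as pd
# df_sample = pd.read_gbq(q, project_id=project, dialect='standard')
```

Because each row is kept or dropped independently, the returned sample size varies around 1% of the table rather than being exact.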

On average, random sampling will produce a distribution the same as that of the underlying data, so it meets your requirement. However, if you want to ‘force’ the sample to be more representative, or force it to be a certain size, we need to look at something a little more advanced.
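As a rough illustration of what that more advanced option (a stratified sample) does, here is a pure-Python sketch that takes the same fraction from each reputation bin, so even a rare high-reputation bin is guaranteed to appear. The data, bin edges, and helper names are made up for the example:

```python
import random
from collections import defaultdict

def stratified_sample(rows, key, frac, seed=0):
    """Group rows by key(row), then take ~frac of each group so every group is represented."""
    random.seed(seed)
    groups = defaultdict(list)
    for row in rows:
        groups[key(row)].append(row)
    sample = []
    for members in groups.values():
        k = max(1, round(len(members) * frac))  # keep at least one row per group
        sample.extend(random.sample(members, k))
    return sample

# toy data: heavily skewed 'reputation', like the table in the question
rows = [{"reputation": 1}] * 900 + [{"reputation": 10_000}] * 10
bin_of = lambda r: "low" if r["reputation"] < 100 else "high"
s = stratified_sample(rows, bin_of, 0.1)
```

Here the 10% stratified sample keeps 90 low-reputation rows and 1 high-reputation row, whereas a plain rand() filter could easily miss the 10 high-reputation rows entirely.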

User contributions licensed under: CC BY-SA