
How to do sampling in sql query to get dataframe with pandas

Note that my question is a bit different from similar ones:

I am working with pandas on a dataset that has a lot of data (10M+):

q = "SELECT COUNT(*) as total FROM `<public table>`"
df = pd.read_gbq(q, project_id=project, dialect='standard')

I know I can sample with the pandas sample function using the frac option, like

df_sample = df.sample(frac=0.01)

however, I do not want to load the full-size original df first. I wonder what the best practice is for generating a dataframe whose data is already sampled.

I’ve read some SQL posts where the sample was generated from a slice of the table; that is absolutely not acceptable in my case. The sampled data needs to be as evenly distributed as possible.

Can anyone shed some more light on this?

Thank you very much.

UPDATE:

Below is a table showing how the data looks like:

[image: a table of sample records, including a Reputation column]

Reputation is the field I am working on. You can see that the majority of records have a very small reputation.

I don’t want to work with a dataframe containing all the records; I want the sampled data to look like the un-sampled data, for example with a similar histogram. That’s what I meant by “evenly”.

I hope this clarifies a bit.


Answer

A simple random sample can be performed using the following syntax:

select * from mydata where rand()>0.9

This gives each row in the table a 10% chance of being selected. It doesn’t guarantee a certain sample size, or guarantee that every bin is represented (that would require a stratified sample). Here’s a fiddle of this approach:

http://sqlfiddle.com/#!9/21d1ee/2
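To avoid materialising the full table in pandas, the same rand() filter can be pushed into the query string passed to read_gbq, so the sampling happens in BigQuery before any data is downloaded. A minimal sketch, assuming BigQuery standard SQL (where RAND() < 0.01 keeps each row with roughly 1% probability, equivalent to the rand()>0.9 form above for a 10% sample); the table name and fraction here are placeholders:

```python
def sample_query(table: str, frac: float) -> str:
    """Build a BigQuery standard-SQL query that keeps each row with probability `frac`."""
    return f"SELECT * FROM `{table}` WHERE RAND() < {frac}"

# hypothetical table name, 1% Bernoulli sample
q = sample_query("bigquery-public-data.stackoverflow.users", 0.01)

# To actually run it (needs pandas-gbq and GCP credentials):
# import pandas as pd
# df_sample = pd.read_gbq(q, project_id=project, dialect='standard')
```

Because each row is kept or dropped independently, the returned sample size varies around 1% of the table rather than being exact.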

On average, random sampling will produce a distribution the same as that of the underlying data, so it meets your requirement. However, if you want to ‘force’ the sample to be more representative, or force it to be a certain size, we need to look at something a little more advanced.
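As a rough illustration of what that more advanced option (a stratified sample) does, here is a pure-Python sketch that takes the same fraction from each reputation bin, so even a rare high-reputation bin is guaranteed to appear. The data, bin edges, and helper names are made up for the example:

```python
import random
from collections import defaultdict

def stratified_sample(rows, key, frac, seed=0):
    """Group rows by key(row), then take ~frac of each group so every group is represented."""
    random.seed(seed)
    groups = defaultdict(list)
    for row in rows:
        groups[key(row)].append(row)
    sample = []
    for members in groups.values():
        k = max(1, round(len(members) * frac))  # keep at least one row per group
        sample.extend(random.sample(members, k))
    return sample

# toy data: heavily skewed 'reputation', like the table in the question
rows = [{"reputation": 1}] * 900 + [{"reputation": 10_000}] * 10
bin_of = lambda r: "low" if r["reputation"] < 100 else "high"
s = stratified_sample(rows, bin_of, 0.1)
```

Here the 10% stratified sample keeps 90 low-reputation rows and 1 high-reputation row, whereas a plain rand() filter could easily miss the 10 high-reputation rows entirely.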

User contributions licensed under: CC BY-SA