Quick question. What is the rule of thumb when deciding where to begin manipulating data? Should I do it when I am hitting the database? Or, just bring everything into my data frame and .drop from there? I also need to rearrange the columns in 4 separate data frames to union them into one data source once finished. With that in mind, is it easier to rearrange in the SQL or pandas? I know this is trivial, but I appreciate any help.
Answer
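Pandas is essentially single-threaded. No matter how much compute power you have, most operations only take advantage of a single core. SQL Server is multi-threaded. If you are dealing with large data sets, you would be better off, performance-wise, doing the processing on the SQL Server side. Selecting only the columns and rows you need in the query also means less data has to be transferred and loaded into the data frame.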
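As a rough sketch of what that could look like for the column question: name the columns in the SELECT, in the order you want, and then stack the results. Everything specific here is made up for illustration: the table names (`sales_2022`, `sales_2023`), the column names, and the in-memory SQLite database that stands in for SQL Server so the snippet runs on its own.

```python
import sqlite3

import pandas as pd

# In-memory SQLite stands in for the real database (assumption for illustration).
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE sales_2022 (region TEXT, amount REAL, id INTEGER, notes TEXT);
    CREATE TABLE sales_2023 (id INTEGER, notes TEXT, region TEXT, amount REAL);
    INSERT INTO sales_2022 VALUES ('east', 10.0, 1, 'x'), ('west', 20.0, 2, 'y');
    INSERT INTO sales_2023 VALUES (3, 'z', 'east', 30.0);
    """
)

# Name only the columns you need, in the order you want them.
# The unwanted "notes" column never leaves the database.
columns = "id, region, amount"
frames = [
    pd.read_sql_query(f"SELECT {columns} FROM {table}", conn)
    for table in ("sales_2022", "sales_2023")  # hypothetical table names
]
conn.close()

# Stack the frames into one data source with a fresh index.
combined = pd.concat(frames, ignore_index=True)
print(combined)
```

Note that `pd.concat` lines the frames up by column name, not by position, so the column order only has to match if you care about the order of the final result.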
P.S. There are projects that try to offer a pandas-like API with multi-threaded or distributed execution, such as Dask, Modin and Koalas.