In a big data scenario, where a piece of code has to run tens of thousands of times, would checking the validity of a value in the application (using switch statements, ifs, etc.) be faster than checking it with database triggers on insert? (Technically the trigger fires before the row is inserted, but from a human perspective it verifies that everything is right before the value enters the table.)
So basically a friend and I share an interest in a topic that involves gathering huge amounts of data, since the fluctuations we are trying to measure are very subtle and only noticeable to the average Joe over a long period of time.
The thing is, no one offers that data, so we decided to write some crawlers to gather the values for us. We're talking about thousands of "objects" with 7-8 properties each to extract, and since we're not only broke but impatient, waiting is an undesirable option: the code might have to run daily, over and over, before we gather anything resembling a useful extrapolation.
With that said, while we were writing the Python interface that populates the PostgreSQL database, I suggested that unit conversions should be done in a trigger, since most of the time (50-80%) the inserted data wouldn't need to be converted.
I presumed that a trigger wouldn't "waste time" when no conversion was required, because I thought a trigger was a multi-threaded procedure where one thread sleeps until something fires it, giving a small performance improvement. I also remembered a chart that was presented in my classroom,
although this is a bit of a stretch, as that pyramid is meant for checking integrity after the tables are populated, so it makes sense that an interface that has to connect to the database, read the columns, make a decision, and execute it would be the last thing to use.
My friend's argument is that developing the triggers would be excruciating and time-consuming. Not only that, it would amplify the complexity, and as we all know, simple and obvious code is always desirable. And even if triggers worked in the best-case scenario, i.e. they really were multi-threaded and perfectly efficient as I assumed, the same could be replicated in Python with threads, and it would still be faster.
His argument was solid, especially the first part: even though I'm weak with Python, correcting, changing or even adding triggers with no graphical help takes forever when you're a noob like me. So in the end, even if I decided to scrap the Python code, it wouldn't be as big a sacrifice as wasting time on a trigger that would run slower than the simple answer.
Either way, here's the code we ended up using:
for tile in tiles:
    nameElement = tile.find_element_by_css_selector('.ct-tile--description')
    priceElement = tile.find_element_by_css_selector('.ct-price-formatted')
    # "elements" (plural) because it's a sneaky way to survive a failed search, since the discount may not exist
    discountElements = tile.find_elements_by_css_selector('.ct-discount-amount')
    quantityElement = tile.find_element_by_css_selector('.ct-tile--quantity')
    # The URL has "historically" been fetched as the hyperlink behind the title
    measureKey = quantityElement.text[-2:]
    rawQuantity = float(quantityElement.text.split()[1].replace(',', '.'))
    measures = {
        "lt": (1, 'l'),
        "dl": (10, 'l'),
        "cl": (100, 'l'),
        "ml": (1000, 'l'),
        "kg": (1, 'kg'),
        "gr": (1000, 'kg'),
        "un": (1, 'un')
    }
    data = {
        "url": nameElement.get_attribute('href'),
        "name": nameElement.text,
        "price": priceElement.text[1:],
        # For this website "Desconto Imediato: " precedes the actual discount, and it comes as a
        # percentage, not the decimal value we want for analytical purposes
        "discount": 0 if len(discountElements) == 0 else float(discountElements[0].text[19:-1]) / 100,
        # TODO multithreading: there's no way the crawler is going to do a switch/if statement for
        # each element when most of the time it's already in the right measurement
        "measure": measures[measureKey][1],
        "quantity": rawQuantity / measures[measureKey][0]
    }
    # print(data)
    # urls.append(url)
    print('collected: ' + str(data))
Since this is a project between friends, the comments often aren't well written. In the comment I mention an if/switch, which in reality we only use to check the discount. We ended up using a dictionary so we could easily convert between measures (it's still not finished, as conversions may depend "on a prefix" of the quantity).
You can think of a tile as an object from which we extract the important attributes.
Answer
A trigger is executed synchronously, that is, the INSERT has to wait while the trigger is running. There is also a non-negligible overhead in calling a trigger, so INSERTs with a trigger on the table are definitely slower. But then, running application code that performs a check also takes time, so you would have to measure which is faster.
Ultimately, it is a matter of taste if you want to write something as a trigger or as application code.
I am not entirely clear on which of these two things you need (maybe both):
perform a unit conversion before data are inserted
check if the data are alright and error out if they are not
For the first, I personally would use application code, since it is just a straightforward calculation that is done no better in database code than in application code, and debugging of client code is easier.
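As an illustration, the unit normalization from the question can live in a small application-side helper. This is a sketch: the function name is mine, and the conversion table mirrors the dictionary in the question's code (a string like "Qt. 75 cl" is assumed as input).

```python
# Conversion table as in the question: two-letter unit suffix -> (divisor, base unit)
MEASURES = {
    "lt": (1, "l"), "dl": (10, "l"), "cl": (100, "l"), "ml": (1000, "l"),
    "kg": (1, "kg"), "gr": (1000, "kg"),
    "un": (1, "un"),
}

def normalize(quantity_text: str) -> tuple[str, float]:
    """Parse a quantity string like 'Qt. 75 cl' and return (base_unit, quantity_in_base_units)."""
    key = quantity_text[-2:]                                   # unit suffix, e.g. 'cl'
    raw = float(quantity_text.split()[1].replace(",", "."))    # '0,75' -> 0.75
    divisor, base_unit = MEASURES[key]
    return base_unit, raw / divisor
```

Since rows that are already in the base unit just divide by 1, the "most rows need no conversion" case costs almost nothing here, which weakens the argument for pushing this into the database.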
For the second, I would prefer a trigger, because it runs in the same transaction as the INSERT, and any error raised in the trigger guarantees that the INSERT is rolled back. This is about data integrity, and that is best checked as close to the data as possible.
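A validation trigger of that kind could look roughly like this (the table and column names here are hypothetical; PostgreSQL 11+ syntax with EXECUTE FUNCTION is assumed, older versions use EXECUTE PROCEDURE):

```sql
-- Reject bad rows; RAISE EXCEPTION aborts the INSERT's transaction.
CREATE FUNCTION check_product() RETURNS trigger AS $$
BEGIN
    IF NEW.price < 0 OR NEW.quantity <= 0 THEN
        RAISE EXCEPTION 'invalid product row: price=%, quantity=%',
            NEW.price, NEW.quantity;
    END IF;
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER product_check
    BEFORE INSERT ON products
    FOR EACH ROW EXECUTE FUNCTION check_product();
```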
If you go with the trigger, and you want to avoid calling the trigger in cases where you know it isn't necessary, you could use the WHEN clause of CREATE TRIGGER to avoid the overhead of calling the trigger function in those cases.
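For example, if only some units need converting (as in the question, where 50-80% of rows are already in the base unit), a WHEN condition keeps the trigger function from being called at all for the other rows. The trigger function, table, and column names below are hypothetical:

```sql
-- Fire the conversion trigger only for units that actually need converting.
CREATE TRIGGER convert_units
    BEFORE INSERT ON products
    FOR EACH ROW
    WHEN (NEW.measure IN ('dl', 'cl', 'ml', 'gr'))
    EXECUTE FUNCTION convert_units();
```

The WHEN expression is evaluated without the cost of a function call, so rows that don't match pay almost none of the trigger overhead.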