Teradata – How to understand data distribution?

Question

I've been looking everywhere how I can improve my Teradata views performance by choosing the right primary index in my tables. I have found multiple answers pointing to the same thing, by using this query to see how data is distributed through the AMPs : I get that I need to have an even distribution, but is it better to

Accepted Answer

Teradata is a Distributed Database running on massive parallel hardware, i.e. multiple hardware nodes (or cloud instances). There are several (approx. 20 to 50) AMPs on each hardware node. Each AMP is an instance of a database server and data is spread across all AMPs based on the PI.Working in parallel is most efficient if the work can be devided equally, thus a well distributed table results in the fastest scans. But the PI is also used for WHERE-condition and joins and this might require other columns. The final decision for the PI is based on following critera:Access: column(s) used for equi-joins (most important) and/orequaity based WHERE-conditionsDistribution: column(s) provide an equal distribution, i.e. not toomany rows per value. Too Many might be a few hundred or thousand.Some skewness (maximum vs avg space per AMP) is tolerable if the PIis needed for Access, most customers have rules for the maximum likenot more than 50% skewVolatility: PI column(s) should be stable, i.e. not updated a lot.The Primary Key is a perfect match for #2 and #3, but maybe not for the most important #1.Applying those criteria is important for larger table and usually lead to a PI based on PK or FK columns, but your table is small, only 13000 rows. Adding a date/timestamp to a PI is usually not recommended because this lowers the possibility to be used for join, but it&#8217;s commonly used for PARTITIONing big tables.Thus both PI candidates are ok, assuming this table is joined to a big one you might check the big table&#8217;s PI column if you can match them.

Advertisement

Answer