Skip to content
Advertisement

Teradata – How to understand data distribution?

I’ve been looking everywhere how I can improve my Teradata views performance by choosing the right primary index in my tables.

I have found multiple answers pointing to the same thing, by using this query to see how data is distributed through the AMPs :

SELECT HASHAMP(HASHBUCKET(HASHROW(<PRIMARY INDEX>))) AS
"AMP#",COUNT(*)
FROM <TABLENAME>
GROUP BY 1
ORDER BY 2 DESC;

I get that I need to have an even distribution, but is it better to have lots of target AMPs with few rows each or fewers AMPs but with less rows ?

Concrete example : On my table, choosing one index (product ID) says I’ll distribute through 190 different AMPs whith each having up to 83 rows, and choosing two indexes (Product ID, Date) gets me 476 AMPs with each having up to 24 rows.

Advertisement

Answer

Teradata is a Distributed Database running on massive parallel hardware, i.e. multiple hardware nodes (or cloud instances). There are several (approx. 20 to 50) AMPs on each hardware node. Each AMP is an instance of a database server and data is spread across all AMPs based on the PI.

Working in parallel is most efficient if the work can be devided equally, thus a well distributed table results in the fastest scans. But the PI is also used for WHERE-condition and joins and this might require other columns. The final decision for the PI is based on following critera:

  1. Access: column(s) used for equi-joins (most important) and/or equaity based WHERE-conditions
  2. Distribution: column(s) provide an equal distribution, i.e. not too many rows per value. Too Many might be a few hundred or thousand. Some skewness (maximum vs avg space per AMP) is tolerable if the PI is needed for Access, most customers have rules for the maximum like not more than 50% skew
  3. Volatility: PI column(s) should be stable, i.e. not updated a lot.

The Primary Key is a perfect match for #2 and #3, but maybe not for the most important #1.

Applying those criteria is important for larger table and usually lead to a PI based on PK or FK columns, but your table is small, only 13000 rows. Adding a date/timestamp to a PI is usually not recommended because this lowers the possibility to be used for join, but it’s commonly used for PARTITIONing big tables.

Thus both PI candidates are ok, assuming this table is joined to a big one you might check the big table’s PI column if you can match them.

User contributions licensed under: CC BY-SA
3 People found this is helpful
Advertisement