Skip to content
Advertisement

getting Clustering/Bucketing columns programmatically

For reference, I am connecting to amazon-athena via sqlalchemy using essentially:

create_engine(
            f'awsathena+rest://:@athena.{myRegion}.amazonaws.com:443/{athena_schema}?s3_staging_dir={myS3_staging_path}',
            echo=True)

In most relational databases that adhere to the ANSI-SQL standard, I can programmatically get the partition columns of a table by running something like the following:

select *
from information_schema.columns
where table_name='myTable' and table_schema='mySchema'
    and extra_info = 'partition key'

However the bucketing or clustering columns seem to not be similarly flagged. I know I can access this information via:

show create table mySchema.myTable

but I am interested in clean programmatical solution, if one exists. I am trying to not reinvent the wheel. Please show me how to do this or point me to the relevant documentation.

Thank you in advance.

PS: It would also be great if other information about the table, like location of files and storage format were also accessible programmatically.

Advertisement

Answer

Athena uses Glue Data Catalog to store metadata about databases and tables. I don’t know how much of this is exposed in information_schema, and there is very little documentation about it.

However, you can get everything Athena knows by querying the Glue Data Catalog directly. In this case if you call GetTable (e.g. aws glue get-table …) you will find the bucketing information in Table.StorageDescriptor.BucketColumns.

The GetTable call will also give you the storage format and the location of the files (but for a partitioned table you need to make additional calls with GetPartitions to retrieve the location of each partition’s data).

User contributions licensed under: CC BY-SA
7 People found this is helpful
Advertisement