For reference, I am connecting to amazon-athena
via sqlalchemy
using essentially:
create_engine( f'awsathena+rest://:@athena.{myRegion}.amazonaws.com:443/{athena_schema}?s3_staging_dir={myS3_staging_path}', echo=True)
In most relational databases that adhere to the ANSI-SQL
standard, I can programmatically get the partition columns of a table by running something like the following:
select * from information_schema.columns where table_name='myTable' and table_schema='mySchema' and extra_info = 'partition key'
However the bucketing or clustering columns seem to not be similarly flagged. I know I can access this information via:
show create table mySchema.myTable
but I am interested in clean programmatical solution, if one exists. I am trying to not reinvent the wheel. Please show me how to do this or point me to the relevant documentation.
Thank you in advance.
PS: It would also be great if other information about the table, like location of files and storage format were also accessible programmatically.
Advertisement
Answer
Athena uses Glue Data Catalog to store metadata about databases and tables. I don’t know how much of this is exposed in information_schema
, and there is very little documentation about it.
However, you can get everything Athena knows by querying the Glue Data Catalog directly. In this case if you call GetTable
(e.g. aws glue get-table …
) you will find the bucketing information in Table.StorageDescriptor.BucketColumns
.
The GetTable
call will also give you the storage format and the location of the files (but for a partitioned table you need to make additional calls with GetPartitions
to retrieve the location of each partition’s data).