I’d like to query my app logs on S3 with AWS Athena but I’m having trouble creating the table/specifying the data format.
This is how the log lines look:
2020-12-09T18:08:48.789Z {"reqid":"Root=1-5fd112b0-676bbf5a4d54d57d56930b17","cache":"xxxx","cacheKey":"yyyy","level":"debug","message":"cached value found"}
which is a timestamp, followed by a space, followed by the JSON payload I want to query.
Is there a way to query logs like this? I see that the CSV, TSV, JSON, Apache Web Logs, and Text File with Custom Delimiters data formats are supported, but because of the leading timestamp I can’t simply use JSON.
Answer
Define the table with a single column:
CREATE EXTERNAL TABLE your_table (
  line STRING
)
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY '\t'
  ESCAPED BY '\\'
  LINES TERMINATED BY '\n'
LOCATION 's3://mybucket/path/mylogs/';
You can extract the timestamp and the JSON with regexp_extract, then parse the JSON separately:
SELECT ts,
       json_extract(json_col, '$.reqid') AS reqid
       ...
FROM
(
  SELECT regexp_extract(line, '(.*?) +', 1)     AS ts,
         regexp_extract(line, '(.*?) +(.*)', 2) AS json_col
  FROM your_table
) s
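To sanity-check the two regexp_extract patterns before running them in Athena, here is a small local Python sketch (not Athena itself) that applies the same regexes and JSON lookup to the sample log line from the question:

```python
import re
import json

# Sample log line from the question
line = ('2020-12-09T18:08:48.789Z '
        '{"reqid":"Root=1-5fd112b0-676bbf5a4d54d57d56930b17","cache":"xxxx",'
        '"cacheKey":"yyyy","level":"debug","message":"cached value found"}')

# Mimics regexp_extract(line, '(.*?) +', 1): the lazy group stops at the
# first run of spaces, capturing just the timestamp
ts = re.match(r'(.*?) +', line).group(1)

# Mimics regexp_extract(line, '(.*?) +(.*)', 2): group 2 is everything
# after that first run of spaces, i.e. the JSON document
json_col = re.match(r'(.*?) +(.*)', line).group(2)

# Mimics json_extract(json_col, '$.reqid')
reqid = json.loads(json_col)["reqid"]

print(ts)
print(reqid)
```

Because the first group is lazy, the spaces inside the JSON (e.g. in "cached value found") do not affect the split; only the first run of spaces after the timestamp does.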
Alternatively, you can define the table with RegexSerDe and two columns; the SerDe splits each line into the two columns for you, and all you need to do is parse json_col:
CREATE EXTERNAL TABLE your_table (
  ts STRING,
  json_col STRING
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
  "input.regex" = "^(.*?) +(.*)$"
)
LOCATION 's3://mybucket/path/mylogs/';

SELECT ts,
       json_extract(json_col, '$.reqid') AS reqid
       ...
FROM your_table
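The SerDe's input.regex can be checked locally the same way: capture group 1 populates the ts column and group 2 populates json_col. A quick Python sketch (again just a local check, not Athena) against the sample line:

```python
import re
import json

# Sample log line from the question
line = ('2020-12-09T18:08:48.789Z '
        '{"reqid":"Root=1-5fd112b0-676bbf5a4d54d57d56930b17","cache":"xxxx",'
        '"cacheKey":"yyyy","level":"debug","message":"cached value found"}')

# The SerDe's input.regex: group 1 -> ts column, group 2 -> json_col column
m = re.match(r'^(.*?) +(.*)$', line)
ts, json_col = m.group(1), m.group(2)

# Once split, any JSON field is reachable, not just reqid
record = json.loads(json_col)
print(ts, record["level"], record["message"])
```

If the pattern failed to match a line, RegexSerDe would return NULL for both columns rather than raise an error, so it is worth verifying the regex against real samples like this first.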