I’d like to query my app logs on S3 with AWS Athena but I’m having trouble creating the table/specifying the data format.
This is how the log lines look:
2020-12-09T18:08:48.789Z {"reqid":"Root=1-5fd112b0-676bbf5a4d54d57d56930b17","cache":"xxxx","cacheKey":"yyyy","level":"debug","message":"cached value found"}
which is a timestamp, followed by a space, followed by the JSON payload I want to query.
Is there a way to query logs like this? I see that the CSV, TSV, JSON, Apache Web Logs, and Text File with Custom Delimiters data formats are supported, but because of the leading timestamp I can’t simply use JSON.
Answer
Define the table with a single column:
CREATE EXTERNAL TABLE your_table (
  line STRING
)
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY '\t'
  ESCAPED BY '\\'
  LINES TERMINATED BY '\n'
LOCATION 's3://mybucket/path/mylogs/';
You can extract the timestamp and the JSON with regexp_extract, then parse the JSON separately:
SELECT ts,
       json_extract(json_col, '$.reqid') AS reqid
       ...
FROM
(
  SELECT regexp_extract(line, '(.*?) +', 1)     AS ts,
         regexp_extract(line, '(.*?) +(.*)', 2) AS json_col
  FROM your_table
) s
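To sanity-check the two regexp_extract patterns before running them in Athena, here is a small local Python sketch (not Athena itself) that applies the same regexes and JSON lookup to the sample log line from the question:

```python
import re
import json

# Sample log line from the question
line = ('2020-12-09T18:08:48.789Z '
        '{"reqid":"Root=1-5fd112b0-676bbf5a4d54d57d56930b17","cache":"xxxx",'
        '"cacheKey":"yyyy","level":"debug","message":"cached value found"}')

# Mimics regexp_extract(line, '(.*?) +', 1): the lazy group stops at the
# first run of spaces, capturing just the timestamp
ts = re.match(r'(.*?) +', line).group(1)

# Mimics regexp_extract(line, '(.*?) +(.*)', 2): group 2 is everything
# after that first run of spaces, i.e. the JSON document
json_col = re.match(r'(.*?) +(.*)', line).group(2)

# Mimics json_extract(json_col, '$.reqid')
reqid = json.loads(json_col)["reqid"]

print(ts)
print(reqid)
```

Because the first group is lazy, the spaces inside the JSON (e.g. in "cached value found") do not affect the split; only the first run of spaces after the timestamp does.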
Alternatively, you can define the table with RegexSerDe and two columns; the SerDe splits each line into the two columns for you, and all you need to do is parse json_col:
CREATE EXTERNAL TABLE your_table (
  ts STRING,
  json_col STRING
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
  "input.regex" = "^(.*?) +(.*)$"
)
LOCATION 's3://mybucket/path/mylogs/';

SELECT ts,
       json_extract(json_col, '$.reqid') AS reqid
       ...
FROM your_table
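The SerDe's input.regex can be checked locally the same way: capture group 1 populates the ts column and group 2 populates json_col. A quick Python sketch (again just a local check, not Athena) against the sample line:

```python
import re
import json

# Sample log line from the question
line = ('2020-12-09T18:08:48.789Z '
        '{"reqid":"Root=1-5fd112b0-676bbf5a4d54d57d56930b17","cache":"xxxx",'
        '"cacheKey":"yyyy","level":"debug","message":"cached value found"}')

# The SerDe's input.regex: group 1 -> ts column, group 2 -> json_col column
m = re.match(r'^(.*?) +(.*)$', line)
ts, json_col = m.group(1), m.group(2)

# Once split, any JSON field is reachable, not just reqid
record = json.loads(json_col)
print(ts, record["level"], record["message"])
```

If the pattern failed to match a line, RegexSerDe would return NULL for both columns rather than raise an error, so it is worth verifying the regex against real samples like this first.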