Tag: hive

Hive SQL cast string as timestamp without losing the milliseconds

I have string data in the form 2020-10-21 12:49:27.090 and I want to cast it as a timestamp. When I do this: select cast(column_name as timestamp) as column_name from table_name, all of the milliseconds are dropped, like this: 2020-10-21 12:49:27. I also tried this: select cast(date_format(column_name, 'yyyy-MM-dd HH:mm:ss.SSS') as timestamp) as column_name from table_name, and the same problem persists: it drops the…

Failed to breakup Windowing invocations into Groups. At least 1 group must only depend on input columns

aggregation hive hiveql sql window-functions

I have a dataset of hotel bookings. date_in has the format “yyyy-MM-dd”. I need to select the top 10 most visited hotels by month. I get the following error: Error: Error while compiling statement: FAILED: SemanticException Failed to breakup Windowing invocations into Groups. At least 1 group must only depend on input columns. Also check for circular dependencies. Underlying error: org.apache.hadoop.hive.ql.parse.SemanticException: Line…

How to UPDATE a value in hive table?

hive hiveql impala sql

I have a flag column in a Hive table that I want to update after some processing. I tried the query below in both Hive and Impala, but it didn’t work, and I got an error saying that it needs to be a Kudu table …

In Hive, how to read through NULL / empty tags present within an XML using explode(XPATH(..)) function?

hive hiveql sql xml xpath

In the Hive query below, I need to read the null / empty “string” tags from the XML content as well. Currently, only the non-null “string” tags are included in the XPATH() list…

How to count all rows in raw data file using Hive?

hive sql

I am reading some raw input which looks something like this: Note that the first two rows are “good” rows and the last two rows are “bad” rows, since they are missing some data. Here is the snippet of my Hive query which reads this raw data into a read-only external table: I need to get the count of ALL…

AWS Athena custom data format?

amazon-athena amazon-web-services aws-glue hive sql

I’d like to query my app logs on S3 with AWS Athena, but I’m having trouble creating the table and specifying the data format. This is how the log lines look: 2020-12-09T18:08:48.789Z {"reqid":…

Performance difference with Where condition in subquery/cte

hive hiveql sql

Is there a performance difference between applying the where condition to a subquery data source and applying it in the joined statement? Is there a difference between these in performance? Let’s say I have two Hive tables, A and B, which are both partitioned on the field date. Is that query’s performance the same as the following? Answer: The…

Selecting most recent rows in a SQL query

hive impala sql

I want to join two tables, selecting the most recent rows for an ID value present in table 1. That is, for each ID value in table 1, only return the most recently added row. For example, table 1 looks something like this: So if the same ID value is found twice in this table, only return…
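
One common way to approach "most recent row per ID" is to number the rows within each ID with ROW_NUMBER() and keep only the first one before joining. The sketch below is not taken from the original thread. It runs the SQL through Python's built-in sqlite3 module (window functions need SQLite 3.25 or newer) purely as a stand-in for Hive or Impala, and the table names, column names (added_at, payload, label), and sample rows are all hypothetical.

```python
import sqlite3

# In-memory SQLite database used only as a stand-in for Hive/Impala.
conn = sqlite3.connect(":memory:")

# Hypothetical tables and rows; the original question's data is not shown.
conn.executescript("""
CREATE TABLE table1 (id INTEGER, added_at TEXT, payload TEXT);
INSERT INTO table1 VALUES
  (1, '2020-01-01', 'old'),
  (1, '2020-02-01', 'new'),
  (2, '2020-01-15', 'only');
CREATE TABLE table2 (id INTEGER, label TEXT);
INSERT INTO table2 VALUES (1, 'a'), (2, 'b');
""")

# Rank the rows of each id from newest to oldest, then keep rank 1 in the join.
query = """
SELECT t2.id, t2.label, t1.added_at, t1.payload
FROM table2 AS t2
JOIN (
    SELECT id, added_at, payload,
           ROW_NUMBER() OVER (PARTITION BY id ORDER BY added_at DESC) AS rn
    FROM table1
) AS t1
  ON t1.id = t2.id
WHERE t1.rn = 1
ORDER BY t2.id
"""

latest_rows = conn.execute(query).fetchall()
print(latest_rows)
```

With the made-up rows above, only the '2020-02-01' row survives for ID 1. The same SELECT shape should carry over to HiveQL and Impala, which both support ROW_NUMBER() OVER (PARTITION BY ... ORDER BY ...), but it has not been run against either engine here.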

Exclude records with certain values in Qubole

hadoop hive hiveql qubole sql

Using Qubole, I have Table A (columns in json parsed…). I need to select only the IDs that have Recommendation GOOD but Decision BAD, so the output should be 3. I tried: Answer: Use analytic functions. Demo: Result:
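
The answer's own demo is not reproduced in this excerpt, so the following is only an illustrative sketch of what "use analytic functions" could look like, not the original code. It assumes a hypothetical long-format table_a with columns id, name, and value, uses made-up rows, and runs through Python's built-in sqlite3 module (SQLite 3.25 or newer) as a stand-in for Hive on Qubole.

```python
import sqlite3

# In-memory SQLite database used only as a stand-in for Hive on Qubole.
conn = sqlite3.connect(":memory:")

# Hypothetical layout: one row per (id, attribute name, attribute value).
conn.executescript("""
CREATE TABLE table_a (id INTEGER, name TEXT, value TEXT);
INSERT INTO table_a VALUES
  (1, 'Recommendation', 'GOOD'), (1, 'Decision', 'GOOD'),
  (2, 'Recommendation', 'BAD'),  (2, 'Decision', 'BAD'),
  (3, 'Recommendation', 'GOOD'), (3, 'Decision', 'BAD');
""")

# MAX(...) OVER (PARTITION BY id) copies a per-id flag onto every row of
# that id, so the outer query can filter on both conditions at once.
query = """
SELECT DISTINCT id
FROM (
    SELECT id,
           MAX(CASE WHEN name = 'Recommendation' AND value = 'GOOD'
                    THEN 1 ELSE 0 END) OVER (PARTITION BY id) AS good_rec,
           MAX(CASE WHEN name = 'Decision' AND value = 'BAD'
                    THEN 1 ELSE 0 END) OVER (PARTITION BY id) AS bad_dec
    FROM table_a
) AS flagged
WHERE good_rec = 1 AND bad_dec = 1
ORDER BY id
"""

matching_ids = [row[0] for row in conn.execute(query)]
print(matching_ids)
```

With these sample rows, only ID 3 has both a GOOD recommendation and a BAD decision. If the real table stores Recommendation and Decision as two columns on the same row, a plain WHERE clause would be enough and no window function is needed.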