
Tag: hive

Hive SQL cast string as timestamp without losing the milliseconds

I have string data in the form 2020-10-21 12:49:27.090 that I want to cast as a timestamp. When I do this: select cast(column_name as timestamp) as column_name from table_name all of the milliseconds are dropped, like this: 2020-10-21 12:49:27. I also tried this: select cast(date_format(column_name,'yyyy-MM-dd HH:mm:ss.SSS') as timestamp) as column_name from table_name and the same problem persists, it drops the …

Failed to breakup Windowing invocations into Groups. At least 1 group must only depend on input columns

I have a dataset of hotel bookings. date_in has the format “yyyy-MM-dd”. I need to select the top 10 most visited hotels by month. I get the following error: Error: Error while compiling statement: FAILED: SemanticException Failed to breakup Windowing invocations into Groups. At least 1 group must only depend on input columns. Also check for circular dependencies. Underlying error: org.apache.hadoop.hive.ql.parse.SemanticException: Line …

Hive: group by calculated column

I need to execute a query like: select myUsualField, SOME_FUNCTION(myAnotherField) as myUnusualField from MYTABLE group by myUsualField, myUnusualField. In Hive this query fails: it cannot find field …

How to count all rows in raw data file using Hive?

I am reading some raw input which looks something like this: Note the first two rows are “good” rows and the last two rows are “bad” rows since they are missing some data. Here is the snippet of my Hive query which is reading this raw data into a read-only external table: I need to get the count of ALL …

Performance difference with Where condition in subquery/cte

Is there a performance difference between applying the where condition to a subquery data source and applying it in the joined statement? Is there a difference between these in performance? Let’s say I have two Hive tables A and B which are both partitioned on the field date. Is that query’s performance the same as the following? Answer The …

Selecting most recent rows in a SQL query

I want to join two tables, selecting the most recent rows for an ID value present in table 1, i.e., for each ID value in table 1, only return the most recently added row. For example, table 1 looks something like this: So if the same ID value is found twice in this table, only return …
