
Split large texts into chunks in separate rows

I have a table where some of the texts are atrociously big. I want to make sure the text in every row of the query output does not exceed, say, 100,000 characters. How do I do that?

Here is a quick sample:

WITH large_texts AS (
  SELECT 'humongous text goes here' AS text, '1' AS id
  UNION ALL
  SELECT 'small one' AS text, '2' AS id
  UNION ALL
  SELECT 'and another big one over here' AS text, '3' AS id
)
SELECT * FROM large_texts

Let’s say I want the output text column to be less than 10 characters, so I need this result:

+----+------------+
| id | text       |
+----+------------+
| 1  | humongous  |
+----+------------+
| 1  | text goes  |
+----+------------+
| 1  | here       |
+----+------------+
| 2  | small one  |
+----+------------+
| 3  | and anothe |
+----+------------+
| 3  | r big one  |
+----+------------+
| 3  | over here  |
+----+------------+

It would be even better if I could also avoid splitting in the middle of words.


Answer

It would be even better if I could also avoid splitting in the middle of words.

Consider the approach below:

create temp function split_with_limit(text STRING, len FLOAT64)
returns ARRAY<STRING>
language js AS r"""
    // Greedily pack words into chunks whose length stays under `len`,
    // so no word is ever split in the middle.
    let input = text.trim().split(' ');
    let [index, output] = [0, ['']];
    input.forEach(word => {
        let temp = `${output[index]} ${word}`.trim();
        if (temp.length < len || output[index] === '') {
            // The word still fits (the empty-chunk check keeps a word
            // longer than `len` from leaving an empty first chunk).
            output[index] = temp;
        } else {
            // Start a new chunk with the current word.
            index++;
            output[index] = word;
        }
    });
    return output;
""";
select id, small_chunk
from yourtable_with_large_texts,
unnest(split_with_limit(text, 10)) small_chunk with offset
order by id, offset
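
Note that a single word longer than the limit is kept whole as its own chunk, so the limit is a soft cap; with a realistic limit like your 100,000 characters this should rarely matter.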

Applied to the sample data in your question, the output is:

+----+-------------+
| id | small_chunk |
+----+-------------+
| 1  | humongous   |
+----+-------------+
| 1  | text goes   |
+----+-------------+
| 1  | here        |
+----+-------------+
| 2  | small one   |
+----+-------------+
| 3  | and         |
+----+-------------+
| 3  | another     |
+----+-------------+
| 3  | big one     |
+----+-------------+
| 3  | over here   |
+----+-------------+
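
If cutting words is acceptable and you want exact fixed-size chunks, as in the expected output table in your question, here is a minimal pure-SQL sketch with no UDF (it reuses the hypothetical yourtable_with_large_texts name from above):

select id, trim(substr(text, pos, 10)) as small_chunk
from yourtable_with_large_texts,
-- generate_array yields each chunk's start position: 1, 11, 21, ...
unnest(generate_array(1, length(text), 10)) pos
order by id, pos

The trim drops the trailing space a chunk can pick up when a split lands right after a word, which matches the table in your question.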
