Hello swarm intelligence!
I have the following use case: For every movie that is requested by a user, I create a number of tags for that specific movie, derived from several sources (actors, plot etc.. ).
I will use this data for associaton mining.
The problem: If I use the movie for rows and the tags for columns, the tags will easily exceed the technical limitations of 3000 columns ( there is even more actors, and then plot keywords etc)
Is there any way, I can organize this data to then use it for (quick) association mining?
Thanks a lot
Advertisement
Answer
Don’t put tags in columns. Instead create a separate table, named something like movie_tags
with two columns, movie_id
and tag
. Put each tag in a separate row of that table.
This is known as “normalizing” your data. Here’s a nice walkthrough with an example very similar to yours.
Edit: Let’s say you have a catalog of movies about the Italian Mafia in New York City in the 20th century. Let’s say the movies are
1 Godfather 2 Goodfellas 3 Godfather II
Then your movie_tags
table might contain these rows.
1 Gangsters 2 Gangsters 3 Gangsters 1 Francis Ford Coppola 3 Francis Ford Coppola 2 Martin Scorsese
Pro tip If you find yourself thinking about putting lots of data items with the same meaning in their own columns, you probably need to normalize the data and add appropriate tables.