Working on data acquisition and clean-up.
We started by scraping some test data (comments), then cleaned, sorted and ranked the words they contained using Python to create our “library” of words used in music comments.
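A minimal sketch of what the clean, sort and rank step could look like, assuming the scraped comments are already available as a list of strings. The function names (tokenize, rank_words) and the sample comments are made up for illustration and are not our actual code.

```python
import re
from collections import Counter

def tokenize(comment):
    """Lowercase a comment and split it into words (letters and apostrophes only)."""
    return re.findall(r"[a-z']+", comment.lower())

def rank_words(comments):
    """Count every word across all comments and return (word, count) pairs,
    most frequent first."""
    counts = Counter()
    for comment in comments:
        counts.update(tokenize(comment))
    # Sort by descending count, then alphabetically so ties come out in a stable order.
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))

# Hypothetical sample data standing in for scraped comments.
sample_comments = [
    "Love this beat!",
    "this beat is fire",
    "LOVE the drop",
]
ranking = rank_words(sample_comments)
```

The resulting ranking is a plain list of (word, count) pairs, which makes it easy to inspect by hand and to spot words that should not be in the “library”.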
It became apparent that we would need to clean and filter the data further. We made a list of stop words (“[..] stop words are words which are filtered out before or after processing of natural language data (text). […] stop words usually refer to the most common words in a language […]”, source: Wikipedia) by combining several stop word lists which we found online. Next we created a “spam filter” list to exclude words that originated from spam comments. After applying those “filters”, we refined them again, and we repeated this process several times.
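The filtering described above can be sketched roughly as follows, assuming each stop word list and the “spam filter” list is a simple list of strings and the ranking is a list of (word, count) pairs. All names and word lists here are hypothetical examples, not the lists we actually used.

```python
def merge_word_lists(*word_lists):
    """Combine several word lists into one lowercase set, dropping duplicates."""
    merged = set()
    for words in word_lists:
        merged.update(word.strip().lower() for word in words if word.strip())
    return merged

def filter_ranking(ranking, excluded):
    """Keep only the (word, count) pairs whose word is not in the excluded set."""
    return [(word, count) for word, count in ranking if word not in excluded]

# Hypothetical lists: two stop word lists found online plus our own spam words.
stop_list_a = ["the", "is", "this"]
stop_list_b = ["The", "a", "and", "is "]
spam_words = ["follow", "promo"]

excluded = merge_word_lists(stop_list_a, stop_list_b, spam_words)

# Hypothetical ranking before filtering.
ranking = [("this", 5), ("beat", 4), ("follow", 3), ("the", 2), ("fire", 1)]
filtered = filter_ranking(ranking, excluded)
```

Keeping the lists as plain data makes the repeated refining cheap: add a word to one of the lists, rerun the filter, and check the ranking again.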
Ultimately we modified the clean-up to exclude spam comments before ranking, which immensely improved the resulting list.
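To illustrate the difference, here is a rough sketch that drops whole spam comments before any words are counted. How a comment is recognised as spam is an assumption made for this example (it simply checks for marker words); SPAM_MARKERS, is_spam and rank_clean_comments are hypothetical names.

```python
import re
from collections import Counter

# Assumed marker words; a comment containing any of them is treated as spam.
SPAM_MARKERS = {"follow", "promo", "http", "www"}

def tokenize(comment):
    """Lowercase a comment and split it into words (letters and apostrophes only)."""
    return re.findall(r"[a-z']+", comment.lower())

def is_spam(comment, markers=SPAM_MARKERS):
    """Return True if the comment contains at least one spam marker word."""
    return any(word in markers for word in tokenize(comment))

def rank_clean_comments(comments, stop_words=frozenset()):
    """Skip spam comments entirely, then count the remaining non-stop words."""
    counts = Counter()
    for comment in comments:
        if is_spam(comment):
            continue  # none of this comment's words reach the ranking
        counts.update(w for w in tokenize(comment) if w not in stop_words)
    return counts.most_common()

# Hypothetical sample data.
comments = [
    "This bass is insane",
    "follow me for free promo http://example.com",
    "insane drop",
]
ranking = rank_clean_comments(comments, stop_words={"this", "is"})
```

Filtering at the comment level also removes harmless-looking words such as “free” or “me” that only entered the list through spam, which a word-level filter would leave in.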
Once the clean-up code was working well enough, we enlarged the data set to check how well the code scales. For that we chose 3 songs from SoundCloud’s charts for 3 genres (Rap/Hip-Hop, Pop, Metal), which we had decided on beforehand.