Working 21 – 31 March

Working on data acquisition and clean-up.

We started by scraping some test data (comments), then cleaned, sorted and ranked the words they contained with Python to build our “library” of words used in music comments.
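To give an idea of what the ranking step can look like, here is a minimal sketch. It is not our actual script: the function names (`tokenize`, `rank_words`) and the sample comments are made up for illustration, and we assume the comments are already available as plain strings.

```python
import re
from collections import Counter


def tokenize(comment):
    """Lowercase a comment and return its words (letters and apostrophes only)."""
    return re.findall(r"[a-z']+", comment.lower())


def rank_words(comments):
    """Count every word across all comments and return them most frequent first."""
    counts = Counter()
    for comment in comments:
        counts.update(tokenize(comment))
    return counts.most_common()


# Hypothetical sample data, standing in for scraped comments.
sample_comments = [
    "This beat is fire!",
    "Love this beat, love the drop",
    "the drop at 1:20 is insane",
]

ranking = rank_words(sample_comments)
for word, count in ranking[:5]:
    print(word, count)
```

The regular expression throws away digits and punctuation, so a timestamp like "1:20" never reaches the ranking. `Counter.most_common()` returns the (word, count) pairs sorted by count.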

It became apparent that we would need to clean and filter the data further. We made a list of stop words by combining several stop word lists that we found online. (Wikipedia describes them like this: “[...] stop words are words which are filtered out before or after processing of natural language data (text). [...] stop words usually refer to the most common words in a language [...]”) Next we created a “spam filter” list to exclude words that originated from spam comments. After applying those “filters” we refined them again, and we repeated this process several times.
Ultimately we modified the clean-up to exclude spam comments before ranking, which immensely improved the resulting list.
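The filtering described above could be sketched roughly as follows. Again this is only an illustration under assumptions: the word lists are tiny made-up examples (our real lists were much longer), the names `STOP_WORDS`, `SPAM_MARKERS`, `is_spam` and `rank_clean_words` are hypothetical, and "spam" is simplified to "the comment contains at least one word from the spam list".

```python
import re
from collections import Counter

# Hypothetical stand-ins for the stop word lists found online.
stop_list_a = {"the", "is", "a", "and"}
stop_list_b = {"this", "to", "my", "the"}
STOP_WORDS = stop_list_a | stop_list_b  # combine the lists, duplicates collapse

# Hypothetical spam marker words; one hit marks the whole comment as spam.
SPAM_MARKERS = {"follow", "check", "free", "promo"}


def tokenize(comment):
    """Lowercase a comment and return its words (letters and apostrophes only)."""
    return re.findall(r"[a-z']+", comment.lower())


def is_spam(comment):
    """Assumed rule: a comment is spam if it contains any spam marker word."""
    return any(word in SPAM_MARKERS for word in tokenize(comment))


def rank_clean_words(comments):
    """Drop spam comments first, then rank the remaining non-stop words."""
    counts = Counter()
    for comment in comments:
        if is_spam(comment):
            continue  # exclude the whole comment before ranking
        counts.update(w for w in tokenize(comment) if w not in STOP_WORDS)
    return counts.most_common()


# Hypothetical sample data.
comments = [
    "This beat is fire",
    "Check my page for free beats, follow me",
    "the beat and the drop",
]

result = rank_clean_words(comments)
print(result)
```

Skipping the whole spam comment, instead of only removing the listed spam words, also keeps the other words of that comment (here "page" and "beats") out of the ranking.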

Once the clean-up code was working well enough, we enlarged the data set to check how well the code scales. For that we chose 3 songs from SoundCloud’s charts for 3 genres (Rap/Hip-Hop, Pop, Metal), which we had decided on beforehand.

 
