Social media platforms such as Facebook, Twitter and LinkedIn have gained huge popularity and have proven to be a valuable source of user-generated content, as harvesting them allows for gathering large amounts of data from a diverse set of users. Extracting information from social media platforms has become popular, and not only among scientists, because these data sources enable new applications such as detecting real-world incidents or earthquakes, as well as recommender systems that suggest news, other users or hashtags based on information extracted from Twitter. However, research on music information retrieval (MIR) and music recommender systems hardly makes use of user-generated data gathered from online social networks. For instance, Schedl found that work leveraging social media data for MIR is virtually nonexistent, except for approaches exploiting the music service Last.fm. Similarly, Bertin-Mahieux et al. call for a large, publicly available data set that can be used to evaluate scalable algorithms in the fields of music information retrieval and music recommender systems.
To foster research in the fields of music information retrieval and music recommendation based on user-generated data retrieved from online social networks, we present the #nowplaying data set, which contains information about the music listening behavior of users gathered from Twitter. In particular, we leverage so-called #nowplaying tweets, i.e., tweets stating that a certain user listened to a specific track by a specific artist. An example of such a tweet is the following: "Like a Rolling Stone - Bob Dylan #nowplaying #listenlive". In this tweet, a user states that he/she listened to the song "Like a Rolling Stone" performed by Bob Dylan. For the #nowplaying data set, we extract the artist and the track title from each tweet and enrich this information with further metadata about the tweet, the artist and the track mentioned in it. After performing this extraction, we are able to provide a data set that currently comprises 40 million listening events described by 400 million triples.
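To make the extraction step more concrete, the following minimal Python sketch shows how the track title and artist could be pulled out of a tweet shaped like the example above. It is an illustration only and not the pipeline used to build the data set: the names `ListeningEvent` and `parse_nowplaying` are hypothetical, and the sketch assumes the simple pattern "title - artist" followed by hashtags, which real tweets frequently deviate from.

```python
import re
from typing import NamedTuple, Optional, Tuple


class ListeningEvent(NamedTuple):
    """Hypothetical record for one listening event taken from a tweet."""
    track: str
    artist: str
    hashtags: Tuple[str, ...]


# Matches hashtags such as "#nowplaying"; the group captures the tag name.
HASHTAG = re.compile(r"#(\w+)")


def parse_nowplaying(tweet: str) -> Optional[ListeningEvent]:
    """Extract track and artist, assuming the form 'title - artist #tags'.

    Returns None if the tweet lacks the #nowplaying hashtag or does not
    follow the assumed pattern.
    """
    hashtags = tuple(tag.lower() for tag in HASHTAG.findall(tweet))
    if "nowplaying" not in hashtags:
        return None

    # Remove the hashtags, then split on the first " - " separator.
    text = HASHTAG.sub("", tweet).strip()
    track, separator, artist = text.partition(" - ")
    track, artist = track.strip(), artist.strip()
    if not separator or not track or not artist:
        return None

    return ListeningEvent(track=track, artist=artist, hashtags=hashtags)


event = parse_nowplaying(
    "Like a Rolling Stone - Bob Dylan #nowplaying #listenlive"
)
print(event)
```

A heuristic like this cannot tell whether the title or the artist comes first, so in practice the extracted strings would still have to be checked, for example against a music metadata source, before further metadata is attached.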