Exploring uses of large unlabeled datasets

This project was completed for the course Data Science I in Fall 2017.

In this project I investigated ways that large unlabeled datasets can be leveraged for common binary text classification tasks. In particular, I wanted to improve upon ‘naive’ strategies such as keyword searches, as well as upon established supervised learning methods. Most of my efforts fall under ‘semi-supervised learning’ and therefore require varying levels of user input. I present an array of techniques, both successful and unsuccessful, and back up each attempt with extensive experimentation. I also provide insights into why certain methods outperform others.

See the project GitHub and full project write-up here.