A copy of the code used in this project can be found on GitHub.
In this project we aim to use a cluster computing framework (Apache Spark) to build a classifier that automatically labels a short section of text as hate speech, offensive language, or neither. Such a classifier can also be used to assess the proportion of hate speech or offensive language in an individual's social media posts.
We will analyse a selection of circa 25,000 tweets that have been manually labelled by users of the CrowdFlower platform. We will hold out a portion of the data for comparing two machine learning classifiers (Multinomial Naive Bayes and Random Forests) as well as two feature extraction techniques (Bag of Words and Term Frequency-Inverse Document Frequency (TF-IDF)). To identify the best parameters for our feature extraction, training set size, and classifiers, we will use a validation set to perform hyper-parameter optimisation.
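To make the difference between the two feature extraction techniques concrete, here is a minimal sketch in plain Python. It is not the Spark code used in the project: the function names (`tokenize`, `bag_of_words`, `tf_idf`), the whitespace tokenizer, the toy sentences, and the smoothed IDF formula log((N + 1) / (df + 1)) are all assumptions chosen for illustration.

```python
import math
from collections import Counter


def tokenize(text):
    # Naive whitespace tokenizer (an assumption; real tweets need more cleaning).
    return text.lower().split()


def bag_of_words(token_lists):
    """Return (vocabulary, count vectors) for a list of token lists."""
    vocab = sorted({tok for toks in token_lists for tok in toks})
    index = {tok: i for i, tok in enumerate(vocab)}
    vectors = []
    for toks in token_lists:
        vec = [0] * len(vocab)
        for tok, count in Counter(toks).items():
            vec[index[tok]] = count
        vectors.append(vec)
    return vocab, vectors


def tf_idf(count_vectors):
    """Reweight raw counts by a smoothed inverse document frequency."""
    n_docs = len(count_vectors)
    n_terms = len(count_vectors[0]) if count_vectors else 0
    # Document frequency: in how many documents does each term appear?
    doc_freq = [sum(1 for vec in count_vectors if vec[j] > 0) for j in range(n_terms)]
    # Smoothed IDF (assumed variant): log((N + 1) / (df + 1)).
    idf = [math.log((n_docs + 1) / (df + 1)) for df in doc_freq]
    return [[tf * w for tf, w in zip(vec, idf)] for vec in count_vectors]


# Hypothetical toy documents standing in for tweets.
texts = ["good morning everyone", "good night", "morning run"]
vocab, counts = bag_of_words([tokenize(t) for t in texts])
weights = tf_idf(counts)

print(vocab)
for text, count_vec, weight_vec in zip(texts, counts, weights):
    print(text, count_vec, [round(w, 3) for w in weight_vec])
```

Bag of Words keeps the raw term counts, whereas TF-IDF scales each count down for terms that appear in many documents, so a word shared by most tweets contributes less to the feature vector than a rarer one.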