Jason Fang - Projects

Natural Language Processing (NLP) is everywhere in daily life. Approaches range from classical methods to deep learning models such as seq2seq. This project is the first part of my NLP series and covers a classical model, bag-of-words, applied to sentiment analysis.

Data Preprocessing

I used a restaurant review dataset to demonstrate my NLP process. The dataset is in TSV format because the reviews themselves contain commas, so the delimiter is '\t'. I also chose to ignore double quotation marks by setting quoting = 3 (the value of csv.QUOTE_NONE). To clean the dataset, I used the nltk package, specifically stopwords and PorterStemmer. The stopwords list is used to drop words that carry little sentiment, such as "a" and "the". PorterStemmer reduces words to their stems, e.g. transforming "learned" to "learn", which shrinks the vocabulary and therefore the number of features.
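As a rough illustration of the loading step, here is a minimal sketch using pandas. The two sample rows and the column names Review and Liked are made up for this example and stand in for the real file, which is not shown here.

```python
import io

import pandas as pd

# Two made-up rows standing in for the real TSV file.
# The column names "Review" and "Liked" are assumptions for this sketch.
sample_tsv = (
    "Review\tLiked\n"
    'Wow, the "crust" was great\t1\n'
    "Not tasty, and too salty\t0\n"
)

# delimiter="\t" splits on tabs, so commas inside reviews are left alone.
# quoting=3 is csv.QUOTE_NONE: double quotes are treated as ordinary characters.
dataset = pd.read_csv(io.StringIO(sample_tsv), delimiter="\t", quoting=3)

print(dataset.shape)
print(dataset["Review"][0])
```

With a real file, the io.StringIO object would simply be replaced by the path to the TSV file.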

I first replaced all the commas in the dataset with spaces and changed all letters to lowercase. Then I removed the stopwords and stemmed the remaining words.
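The cleaning steps above can be sketched roughly as follows. The function name clean_review is hypothetical, and the short STOPWORDS set is only a stand-in so the sketch runs without downloading anything. The full English list would come from nltk.corpus.stopwords.words('english') after running nltk.download('stopwords').

```python
from nltk.stem.porter import PorterStemmer

# Tiny stand-in stopword set (an assumption for this sketch); the full list
# is available as nltk.corpus.stopwords.words("english").
STOPWORDS = {"a", "an", "the", "i", "this", "is", "and", "it", "was"}

stemmer = PorterStemmer()


def clean_review(text):
    """Replace commas with spaces, lowercase, drop stopwords, stem the rest."""
    text = text.replace(",", " ").lower()
    words = text.split()
    kept = [stemmer.stem(word) for word in words if word not in STOPWORDS]
    return " ".join(kept)


print(clean_review("The chef learned, a lot"))
```

Applying clean_review to every review gives the cleaned corpus that the bag-of-words step works on.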

Bag of Words Model

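CountVectorizer builds the bag-of-words matrix: each review becomes a row of word counts, with one column per word. Its max_features parameter can be used to keep only the most frequent words, which further reduces rare words in the dataset. After all this, it is time to use some classification models for sentiment analysis.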
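To make the bag-of-words step concrete, here is a small sketch with scikit-learn's CountVectorizer. The three-review corpus and the max_features value of 4 are made up for illustration and are not the settings used in the project.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Made-up, already-cleaned reviews (stemmed, stopwords removed).
corpus = [
    "love food servic great",
    "food bad servic slow",
    "great place love food",
]

# Keep only the 4 most frequent words; rarer words are dropped.
vectorizer = CountVectorizer(max_features=4)
X = vectorizer.fit_transform(corpus).toarray()

# Columns follow the alphabetical order of the kept vocabulary.
print(sorted(vectorizer.vocabulary_))
print(X)
```

Each row of X is one review and each column counts one kept word, so words that appear only once in this toy corpus ("bad", "slow", "place") no longer have a column.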

The final step is classification. Many models can be used to classify the reviews; I used SVM, Naive Bayes, and a decision tree for demonstration purposes. Each has parameters you can tune. There are also many ways to evaluate a model; I used accuracy and F1 score for demonstration purposes.
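A minimal sketch of the classification and evaluation step is shown below. The twelve toy reviews and their labels are invented, and the specific choices (a linear-kernel SVC, MultinomialNB as the Naive Bayes variant, a 25% test split) are assumptions for illustration rather than the exact configuration used in the project.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Invented, already-cleaned reviews: 1 = positive, 0 = negative.
texts = [
    "love food great servic", "great place friendli staff",
    "tasti pizza love crust", "amaz food good price",
    "good servic tasti meal", "love place amaz staff",
    "bad food slow servic", "terribl place rude staff",
    "cold pizza bad crust", "bland food high price",
    "slow servic terribl meal", "dirti place rude staff",
]
labels = [1] * 6 + [0] * 6

X = CountVectorizer().fit_transform(texts).toarray()
X_train, X_test, y_train, y_test = train_test_split(
    X, labels, test_size=0.25, random_state=0, stratify=labels
)

# Model choices and parameters here are assumptions, not the project's settings.
models = {
    "SVM": SVC(kernel="linear"),
    "Naive Bayes": MultinomialNB(),
    "Decision tree": DecisionTreeClassifier(random_state=0),
}

results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred)
    f1 = f1_score(y_test, y_pred, zero_division=0)
    results[name] = (accuracy, f1)
    print(f"{name}: accuracy={accuracy:.2f}, F1={f1:.2f}")
```

On a dataset this small the scores mean very little; the point is only the shape of the workflow, which stays the same when the toy reviews are replaced by the real bag-of-words matrix.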
