What it is

A logistic classifier trained on a corpus of 105,195 movie reviews in Portuguese, built by scraping user film reviews from Filmow (the same corpus used in the previous chapter).

The setup

Chapter 5 of Jurafsky & Martin introduces the reader to logistic regression classifiers. The chapter uses these classifiers to teach foundational machine learning concepts like loss functions, optimization, and regularization. I devoted special attention to this chapter because those concepts are so important; implementing this classifier was one way to make sure I got everything right. What I really wanted from this project was to get a feel for how a logistic classifier works (and especially how gradient descent works in practice).

The process

Challenge #1: Understanding the underlying math

The first real hurdle for me was making sure I was genuinely familiar with the mathematical foundations behind logistic regression and gradient descent. The chapter presents these concepts in a remarkably clear and succinct manner, but I didn’t possess all the mathematical background to grasp them immediately. To get up to speed, I found a YouTube channel called StatQuest that explained everything very clearly. The most important videos were their linear and logistic regression series, which introduced me to the idea of fitting a line to data:

Another important video was their gradient descent guide:

I watched each of those videos at least twice; you can find the notes I took while watching them here.
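For my own reference, here is the core math as I understood it after the chapter and the videos. This is the standard binary logistic regression setup, in the chapter’s notation: the model takes a weighted sum of the features, squashes it through the sigmoid, and the gradient of the cross-entropy loss with respect to each weight turns out to be the prediction error times that feature’s value.

```latex
z = w \cdot x + b, \qquad \hat{y} = \sigma(z) = \frac{1}{1 + e^{-z}}

L_{CE}(\hat{y}, y) = -\left[\, y \log \hat{y} + (1 - y) \log (1 - \hat{y}) \,\right]

\frac{\partial L_{CE}}{\partial w_j} = (\hat{y} - y)\, x_j
```

To make the gradient descent part concrete, here is a minimal batch-gradient-descent sketch of that update rule. It is an illustration rather than my actual implementation, and it assumes the reviews have already been turned into NumPy arrays (X holding one feature vector per review, y holding 0/1 labels):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_logistic_regression(X, y, learning_rate=0.1, epochs=1000):
    """Fit weights and bias with batch gradient descent on the cross-entropy loss.

    X: (n_examples, n_features) array of feature vectors
    y: (n_examples,) array of 0/1 labels
    """
    n_examples, n_features = X.shape
    w = np.zeros(n_features)
    b = 0.0
    for _ in range(epochs):
        y_hat = sigmoid(X @ w + b)   # current predictions
        error = y_hat - y            # the (y_hat - y) term of the gradient
        w -= learning_rate * (X.T @ error) / n_examples  # average gradient step
        b -= learning_rate * error.mean()
    return w, b

def predict(X, w, b):
    """Classify as positive (1) when the predicted probability is at least 0.5."""
    return (sigmoid(X @ w + b) >= 0.5).astype(int)
```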

Challenge #2: Choosing features

The naive Bayes classifier of the last chapter simply compared how often a word appeared in each class to make a classification decision. In contrast, logistic regression can use much more sophisticated features, so I wanted to have a go at implementing them.

I first looked for what features were commonly used in movie review classification. When I couldn’t find much information on that front, I decided to use the same features used in the chapter:

[Figure: the list of features from the chapter’s sentiment-classification example (Jurafsky & Martin, 2021, p. 5)]

Implementing features 3 through 6 above would be relatively simple, but for the first two features I would need a sentiment lexicon in Portuguese. I found an article by Freitas and Vieira (2015)1 comparing several Portuguese sentiment lexicons. I chose to use OpLexicon since it was leaner than the other lexicons.
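To give a sense of what those features look like in code, here is a rough sketch of the feature extraction. The six features mirror the chapter’s sentiment example: counts of positive and negative lexicon words, presence of a negation word, count of first- and second-person pronouns, presence of “!”, and the log of the review’s word count. The tiny lexicon sets, the pronoun list, and the use of “não” as the negation cue are placeholder assumptions for illustration; in the real project the positive and negative sets would come from OpLexicon.

```python
import math
import re

# Placeholder lexicon and pronoun sets; in practice the positive/negative
# sets would be loaded from OpLexicon rather than hard-coded.
POSITIVE_WORDS = {"bom", "ótimo", "excelente", "maravilhoso"}
NEGATIVE_WORDS = {"ruim", "péssimo", "horrível", "chato"}
PRONOUNS = {"eu", "me", "mim", "meu", "minha", "você", "te", "seu", "sua"}

def extract_features(review):
    """Map a raw review string to the six-feature vector from the chapter."""
    tokens = re.findall(r"\w+|!", review.lower())
    words = [t for t in tokens if t != "!"]
    return [
        sum(1 for w in words if w in POSITIVE_WORDS),  # x1: positive lexicon words
        sum(1 for w in words if w in NEGATIVE_WORDS),  # x2: negative lexicon words
        1.0 if "não" in words else 0.0,                # x3: negation word present
        sum(1 for w in words if w in PRONOUNS),        # x4: 1st/2nd person pronouns
        1.0 if "!" in tokens else 0.0,                 # x5: "!" present
        math.log(len(words)) if words else 0.0,        # x6: log of word count
    ]
```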

Challenge #3: Producing reliable testing data

In contrast with the naive Bayes classifier, this implementation was not successful at accurately classifying example movie reviews. One of the major reasons is that this classifier does not employ lemmatization, semantic parsing, or any other processing of the input reviews. This means that neither verbs nor negated adjectives are handled accurately, which makes the count of negative words lower than it should be, which in turn confuses the classifier when it is tested on negative reviews.
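To illustrate the problem with a made-up example (reusing the placeholder lexicon sets from the sketch above): a clearly negative review can come out of feature extraction looking mildly positive, because the negated adjective still hits the positive lexicon and the inflected verb hits nothing at all.

```python
# "não gostei, o filme não é bom" ≈ "I didn't like it, the movie is not good"
review = "não gostei, o filme não é bom"

# With no lemmatization, the inflected verb "gostei" only matches the lexicon
# if that exact surface form happens to be listed, and with no negation
# handling, "bom" ("good") is still counted as a positive word.
print(extract_features(review))
# -> [1, 0, 1.0, 0, 0.0, 1.94...]: one "positive" hit, zero negative hits
```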

The lack of lemmatization is deliberate: these projects exist to benefit my learning. I realize there are NLP libraries out there that could handle these tasks relatively easily, but my goal is to become intimately familiar with how that processing actually takes place, and I feel that using a preexisting library would not take me closer to that goal.

Even though this classifier can’t classify test data accurately, the experience of building it still taught me a great deal. Analyzing the classifier and finding explanations for why something isn’t working was very enlightening. It was also an example of a use case where naive Bayes classifiers are “good enough” (something the textbook alludes to in this chapter).

  1. Freitas, L. A., & Vieira, R. (2015). Exploring Resources for Sentiment Analysis in Portuguese Language. 2015 Brazilian Conference on Intelligent Systems (BRACIS), 152-156. https://doi.org/10.1109/BRACIS.2015.52