Jigsaw Unintended Bias in Toxicity Classification

Ruman
10 min read · Dec 8, 2019


https://www.kaggle.com/c/jigsaw-unintended-bias-in-toxicity-classification

my GitHub: https://github.com/rumankhan1

my LinkedIn: https://www.linkedin.com/in/rumankhan1/

In this blog, I explain a case study that I solved recently, walking through my approach step by step.

Objective

Build a model to detect toxic comments and reduce the unintended bias of the model.

Background info

When the Conversation AI team first built toxicity models, they found that the models incorrectly learned to associate the names of frequently attacked identities with toxicity. Models predicted a high likelihood of toxicity for comments containing those identities (e.g. “gay”), even when those comments were not actually toxic (such as “I am a gay woman”). This happens because training data was pulled from available sources where, unfortunately, certain identities are overwhelmingly referred to in offensive ways. Training a model from data with these imbalances risks simply mirroring those biases back to users. (Source: the Kaggle competition overview)

So we have to build a model that recognizes toxicity and minimizes this type of unintended bias with respect to mentions of identities.

Data Overview

The given data has a total of 1,804,874 comments and 45 attributes. Out of these 45 attributes, only a few will be used for modelling; a few others will be used to check the model's bias.

Some of them are:

  • comment_text — the text of the comment
  • target — the actual class label. It takes values in the range 0.0–1.0, where 0.0 is non-toxic and 1.0 is extremely toxic. We’ll treat all comments with target < 0.5 as non-toxic and target >= 0.5 as toxic.
  • Columns related to race or ethnicity, like asian, black, jewish, etc.
  • Columns related to gender, like male, female, etc.
  • Columns related to sexual orientation, like bisexual, heterosexual, homosexual_gay_or_lesbian, etc.
  • Columns related to religion, like atheist, buddhist, christian, hindu, muslim, etc.

So basically we’re given identity-based features that can be broadly divided into four groups (as listed above). Later on, these features will be used to check the model's bias.

Performance Metric

We’ll be using AUC, plus a custom AUC-based metric provided by Kaggle to compute the final model bias score.

Before getting into the custom AUC metric, we first need to understand the dataset a bit more. The dataset has many features; most of them won't be used for modelling, but some are very useful and, as mentioned earlier, will be used to check model bias.

Apart from the unused features, all the useful features (except target and comment_text) fall into two categories: the identity subgroup and the background subgroup. The identity subgroup contains features like white, black, gay, Muslim, etc., while the background subgroup contains everything that is not in the identity subgroup. After analysing all the features, I'll choose which ones go into the identity subgroup; the rest go into the background subgroup.

The custom final AUC has these four terms that help in measuring model bias:

  • Overall AUC: This is the normal AUC score on the full test set.
  • Subgroup AUC: Computes AUC on the positive and negative examples of the identity subgroup. It represents the model's understanding and performance within the group itself. A low value means the model does a poor job of distinguishing between toxic and non-toxic comments that mention the identity.
  • Background Positive Subgroup Negative (BPSN): Computes AUC on the positive examples of the background subgroup and the negative examples of the identity subgroup. A low value means the model confuses non-toxic examples that mention the identity with toxic examples that do not.
  • Background Negative Subgroup Positive (BNSP): Computes AUC on the positive examples of the identity subgroup and the negative examples of the background subgroup. A low value means the model confuses toxic examples that mention the identity with non-toxic examples that do not.

So, the final metric combines the overall AUC with the three bias AUCs above. Each bias AUC is first averaged across the identity subgroups with a generalized power mean, and the final model bias score is computed as:

score = w₀ · AUC_overall + Σₐ wₐ · Mₚ(mₛ,ₐ)

where Mₚ(mₛ) = ((1/N) Σₛ mₛᵖ)^(1/p) is the generalized mean over the N identity subgroups with p = −5, a runs over the three bias AUCs (Subgroup AUC, BPSN AUC, BNSP AUC), and all four weights are 0.25.
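
As a reference, here is a minimal sketch of how this final score can be computed with scikit-learn. It assumes a dataframe with a binarized label column, a column of model predictions, and binarized identity columns; the column names ('toxic', 'prediction') are placeholders for illustration, not the exact code from this case study.

import numpy as np
from sklearn.metrics import roc_auc_score

POWER = -5      # generalized-mean exponent used by the competition
WEIGHT = 0.25   # equal weight for the overall AUC and the three bias AUCs

def subgroup_auc(df, subgroup, label, pred):
    # AUC restricted to comments that mention the identity
    mask = df[subgroup].astype(bool)
    return roc_auc_score(df.loc[mask, label], df.loc[mask, pred])

def bpsn_auc(df, subgroup, label, pred):
    # Background Positive, Subgroup Negative:
    # toxic comments that do NOT mention the identity
    # plus non-toxic comments that DO mention it
    s, y = df[subgroup].astype(bool), df[label].astype(bool)
    mask = (~s & y) | (s & ~y)
    return roc_auc_score(df.loc[mask, label], df.loc[mask, pred])

def bnsp_auc(df, subgroup, label, pred):
    # Background Negative, Subgroup Positive:
    # non-toxic comments that do NOT mention the identity
    # plus toxic comments that DO mention it
    s, y = df[subgroup].astype(bool), df[label].astype(bool)
    mask = (~s & ~y) | (s & y)
    return roc_auc_score(df.loc[mask, label], df.loc[mask, pred])

def power_mean(values, p=POWER):
    # generalized mean across identity subgroups
    return np.power(np.mean(np.power(values, p)), 1.0 / p)

def final_bias_score(df, identity_subgroups, label='toxic', pred='prediction'):
    overall = roc_auc_score(df[label], df[pred])
    bias_aucs = [
        power_mean([subgroup_auc(df, s, label, pred) for s in identity_subgroups]),
        power_mean([bpsn_auc(df, s, label, pred) for s in identity_subgroups]),
        power_mean([bnsp_auc(df, s, label, pred) for s in identity_subgroups]),
    ]
    return WEIGHT * overall + WEIGHT * sum(bias_aucs)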

Constraints

Misclassification should be as low as possible: if a non-toxic comment gets classified as toxic, or a toxic comment gets classified as non-toxic, both situations cause a loss.

Machine Learning Problem

This is a binary classification problem, where the positive class is Toxic and the negative class is Non-toxic.

Exploratory Data Analysis

Distribution of toxic and non-toxic comments

[Plot: class distribution of toxic vs non-toxic comments; the plotting code is in the GitHub repo]
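
Since the original gist isn't embedded here, below is a minimal sketch of how this distribution can be computed and plotted, assuming the standard train.csv from the competition:

import pandas as pd
import matplotlib.pyplot as plt

train = pd.read_csv('train.csv')                    # standard competition file

# binarize the continuous target as described above
train['toxic'] = (train['target'] >= 0.5).astype(int)

counts = train['toxic'].value_counts().sort_index()
counts.index = ['non-toxic', 'toxic']
counts.plot(kind='bar')
plt.ylabel('Number of comments')
plt.title('Class distribution')
plt.show()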

From the plot, it is clear that the data is heavily imbalanced: the gap between the two classes is about 92 percentage points. So the model might overfit to the majority class.

While modelling, I'll try downsampling, upsampling and giving weights to the class labels.

[I tried all of them, but none helped improve the final score, so in the end I trained the model on the original dataset.]

Utility function to plot a stacked bar plot

This function takes a list of features and plots a stacked bar plot for them. You can find the code on GitHub: https://github.com/rumankhan1/Jigsaw-Unintended-Bias-in-Toxicity-Classification/
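
A rough sketch of what such a utility might look like (the 'toxic' column is the binarized target from above; this is an illustration, not the exact function from the repo):

import numpy as np
import matplotlib.pyplot as plt

def plot_stacked_bar(df, features, label='toxic', threshold=0.5):
    # For each feature, count the comments that mention it (value >= threshold)
    # and split them into toxic and non-toxic counts.
    toxic_counts, nontoxic_counts = [], []
    for f in features:
        mentioned = df[df[f].fillna(0) >= threshold]
        toxic_counts.append(int((mentioned[label] == 1).sum()))
        nontoxic_counts.append(int((mentioned[label] == 0).sum()))

    x = np.arange(len(features))
    plt.bar(x, nontoxic_counts, label='non-toxic')
    plt.bar(x, toxic_counts, bottom=nontoxic_counts, label='toxic')
    plt.xticks(x, features, rotation=45, ha='right')
    plt.ylabel('Number of comments')
    plt.legend()
    plt.tight_layout()
    plt.show()

# e.g. plot_stacked_bar(train, ['asian', 'black', 'jewish', 'latino', 'white'])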

Analysis of sensitive features: severe_toxicity, obscene, identity_attack, insult, threat

Most comments, whether toxic or non-toxic, carry some kind of identity-related sentiment or intention to insult.

Most toxic comments are made with the intention of insult, and the second most with the intention of identity attack, which is very common on any social platform.

Analysis of features related to race or ethnicity: asian, black, jewish, latino, other_race_or_ethnicity, white

Here, we can see that most comments, whether toxic or non-toxic, mention the “white” race, followed by the “black” race.

Also, most toxic comments target people of the “white” race, and the second most target people of the “black” race.

Analysis of features related to sexual orientation: bisexual, heterosexual, homosexual_gay_or_lesbian and other_sexual_orientation

In the plot, we can see that most comments relate to the homosexual_gay_or_lesbian group; both the most toxic and the most non-toxic comments are about this sexual orientation.

So this gives the idea that people with these sexual orientations are the most likely to receive toxic comments on social and other platforms.

Analysis of features related to religion: atheist, buddhist, christian, hindu, muslim

From this plot, we can conclude that, whether toxic or non-toxic, most comments with a religious mention relate to “Christian”, while most toxic comments relate to “Muslim”.

People of the Muslim religion are the most likely to receive toxic comments on social and other platforms.

The conclusions of the above analysis are:

  • The purpose of this analysis was to identify the most targeted identity groups on the internet, i.e. the groups most comments are made about, so that we can later put them into the identity subgroup and the rest into the background subgroup when checking the final model bias score, since the model may become biased towards these frequently occurring groups.
  • From the plots above, we learn that most comments are made about groups like ‘white’, ‘muslim’, ‘jewish’, ‘black’, ‘homosexual_gay_or_lesbian’, ‘christian’, etc. Those comments may be toxic or non-toxic. So I'm going to put these groups/features into the identity subgroup and the rest into the background subgroup, and later use them to check the model bias score.

Analysis of user feedback features: sad, likes, etc.

Utility function to draw a percentage plot

This function plots the percentage of comments for each value of an input categorical feature.
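
A rough sketch of what this utility might look like (an illustration, not the exact function from the repo):

import matplotlib.pyplot as plt

def plot_percent(df, feature, top_n=10):
    # Percentage of comments for each value of the feature (top_n most common values)
    percent = (df[feature].value_counts(normalize=True) * 100).head(top_n)
    ax = percent.plot(kind='bar')
    for i, v in enumerate(percent.values):
        ax.text(i, v, f'{v:.2f}%', ha='center', va='bottom')
    plt.ylabel('Percentage of comments')
    plt.title(f'Distribution of "{feature}"')
    plt.show()

# e.g. plot_percent(train, 'funny'); plot_percent(train, 'likes')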

This analysis doesn't really matter for modelling, but it gives a very important insight into social behaviour when people see a toxic or non-toxic comment.

As we can see in this plot of the “funny” reaction feature, which contains the number of “funny” reactions a given comment received, 85.87% of comments have 0 such reactions, whether toxic or non-toxic.

Let's look at another user feedback feature, “likes”, to get a better idea of social behaviour.

In this plot, we can see that about 41.08% of comments have zero likes, while only about 9% of the comments in the entire dataset are toxic.

We can say people don't care much whether a comment is toxic or non-toxic.

Also, there is no clear relation between a comment being toxic or non-toxic and the user feedback it receives on a social platform.

Time series analysis

As the dataset has a temporal feature, we can look for useful patterns over time. So now we'll look at different plots with respect to time.

Number of comments made over time

This is fairly intuitive: of course the comment count increases over time. At some points the growth is exponential, at others slower, but it mostly keeps increasing as new users keep coming to the internet.
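
A minimal sketch of how such a count-over-time plot can be produced, assuming the created_date column from train.csv:

import pandas as pd
import matplotlib.pyplot as plt

# parse the timestamp column (created_date is assumed to be the column name)
train['created_date'] = pd.to_datetime(train['created_date'], utc=True)

# number of comments per month
monthly = train.set_index('created_date').resample('M').size()
monthly.plot()
plt.ylabel('Number of comments')
plt.title('Comments made over time')
plt.show()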

Number of comments made over time, based on sexual orientation

As we know, in recent years people have started talking openly about their sexual orientation, are bolder about their identity, and are much more confident in expressing their thoughts.

Also, in recent years, many people who dislike gay, homosexual or lesbian people have started posting toxic comments about them on the internet.

These are the reasons why there is such an exponential increase in comments, especially about this one group.

Analysis of the “comment_text” feature — comment length

This column contains the toxic and non-toxic comment text that will later be preprocessed and fed to the deep learning model.

The average comment length is 202.
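
A quick sketch of how comment length can be computed (measured in characters here; the original analysis may have measured it differently):

import matplotlib.pyplot as plt

# comment length in characters (a word count would be computed similarly)
lengths = train['comment_text'].str.len()
print('average comment length:', lengths.mean())

lengths.hist(bins=100)
plt.xlabel('Comment length')
plt.ylabel('Number of comments')
plt.show()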

And no more to say…

Cleaning of text comment

The function below performs the cleaning of the comment text.
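
Since the gist isn't embedded here, below is a minimal sketch of what such a cleaning function might look like; the exact steps used in this case study are in the repo linked below.

import re
import string

def clean_comment(text):
    # a minimal cleaning sketch: lowercase, drop URLs and numbers,
    # strip punctuation and collapse whitespace
    text = str(text).lower()
    text = re.sub(r'https?://\S+', ' ', text)
    text = re.sub(r'\d+', ' ', text)
    text = text.translate(str.maketrans(string.punctuation,
                                        ' ' * len(string.punctuation)))
    text = re.sub(r'\s+', ' ', text).strip()
    return text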

You can find the whole code here — https://github.com/rumankhan1/Jigsaw-Unintended-Bias-in-Toxicity-Classification/

After cleaning the comments, we save the result in a new column:

As mentioned earlier in this blog, the unused features/attributes will be discarded, so we now drop all of them from the dataset.

Below are the identity subgroups we're going to use to check model bias; all the remaining features now count towards the background subgroup. Since this is a classification problem, these columns are labelled 1 if the value is ≥ 0.5 and 0 otherwise.
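
A minimal sketch of this labelling step. The exact list of identity columns is in the notebook, so the list below is an assumption based on the analysis above:

# identity columns chosen for the identity subgroup -- this exact list is an
# assumption; the final list is in the notebook on GitHub
identity_columns = ['white', 'black', 'muslim', 'jewish', 'christian',
                    'homosexual_gay_or_lesbian']

# binarize the class label and the identity columns at the 0.5 threshold
train['toxic'] = (train['target'] >= 0.5).astype(int)
for col in identity_columns:
    train[col] = (train[col].fillna(0) >= 0.5).astype(int)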

Train and test split

I split the data into train and test with an 80:20 ratio: 80% in train and 20% in test. After splitting, we have around 1.4M points in the train set and around 350k points in the test set.
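
A minimal sketch of the split (stratifying on the binarized label is my assumption; the exact call is in the notebook):

from sklearn.model_selection import train_test_split

# 80:20 split; stratifying on the binarized label keeps the class imbalance
# identical in both splits
train_df, test_df = train_test_split(train, test_size=0.2,
                                     stratify=train['toxic'],
                                     random_state=42)
print(train_df.shape, test_df.shape)   # roughly 1.44M and 0.36M rows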

Converting comment text to integer sequences — Tokenization

The code snippet below first initializes the tokenizer, fits it on the train + test comment text, and then transforms both, giving us train_token and test_token. After converting to integer sequences, we pad them so that every sequence has the same length.
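
Since the gist isn't embedded here, a minimal sketch of that step with the Keras tokenizer (MAX_LEN and the cleaned_comment column name are assumptions):

from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

MAX_LEN = 220   # maximum sequence length (an assumed value)

tokenizer = Tokenizer()
tokenizer.fit_on_texts(list(train_df['cleaned_comment']) +
                       list(test_df['cleaned_comment']))

# convert each comment into a sequence of word indices
train_token = tokenizer.texts_to_sequences(train_df['cleaned_comment'])
test_token = tokenizer.texts_to_sequences(test_df['cleaned_comment'])

# pad so that every integer sequence has the same length
train_pad = pad_sequences(train_token, maxlen=MAX_LEN)
test_pad = pad_sequences(test_token, maxlen=MAX_LEN)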

Weight matrix from pretrained word embeddings

I used the fastText crawl-300d-2M.vec file, which has a 300-dimensional representation for each word. The code below loads the file and stores all the weights in a dictionary, which is then used to build the weight matrix.

The code below builds the weight matrix; its workings are documented in the comments.
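
A minimal sketch of loading the vectors and building the weight matrix (an illustration of the approach, not the exact code from the notebook):

import numpy as np

EMBED_DIM = 300
EMBED_FILE = 'crawl-300d-2M.vec'

# load the fastText vectors into a dict: word -> 300-d vector
embeddings_index = {}
with open(EMBED_FILE, encoding='utf-8') as f:
    for line in f:
        values = line.rstrip().split(' ')
        if len(values) <= 2:          # skip the header line of the .vec file
            continue
        embeddings_index[values[0]] = np.asarray(values[1:], dtype='float32')

# weight matrix: row i holds the vector for the word with index i in the tokenizer
word_index = tokenizer.word_index
embedding_matrix = np.zeros((len(word_index) + 1, EMBED_DIM))
for word, i in word_index.items():
    vector = embeddings_index.get(word)
    if vector is not None:
        embedding_matrix[i] = vector   # out-of-vocabulary words stay all-zeros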

Model

I tried various models. Among the classical machine learning models, the highest final model bias score I got was 0.83; all the other models scored lower.

Then I tried deep learning models with different architectures. I started with a very simple model and kept increasing the model complexity, and the results kept improving along with it. I then realized that bi-directional RNNs would work much better here than uni-directional ones, and the result improved slightly. Next I added an attention layer, and later a convolution layer after the bi-directional RNN layer, and the results kept improving. I saw a noticeable improvement when I added a global max pooling layer and a global average pooling layer. So in the end I arrived at the architecture below, which scored 0.93 on the custom AUC metric on the Kaggle private leaderboard.

Below is the code for the model architecture:
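
Since the gist isn't embedded here, below is a minimal sketch of a comparable architecture: an embedding layer with the pretrained weight matrix, two bi-directional LSTM layers, a simple dot-product attention layer, a convolution layer, and global max + average pooling. The layer sizes and exact ordering are my assumptions; the actual architecture is in the notebook on GitHub.

from tensorflow.keras.layers import (Input, Embedding, SpatialDropout1D, Bidirectional,
                                     LSTM, Attention, Conv1D, GlobalMaxPooling1D,
                                     GlobalAveragePooling1D, concatenate, Dense)
from tensorflow.keras.models import Model

MAX_LEN = 220   # same value used at the tokenization step

def build_model(embedding_matrix, max_len=MAX_LEN):
    inp = Input(shape=(max_len,))
    # frozen embedding layer initialised with the pretrained weight matrix
    x = Embedding(embedding_matrix.shape[0], embedding_matrix.shape[1],
                  weights=[embedding_matrix], trainable=False)(inp)
    x = SpatialDropout1D(0.2)(x)
    x = Bidirectional(LSTM(128, return_sequences=True))(x)
    x = Bidirectional(LSTM(128, return_sequences=True))(x)
    x = Attention()([x, x])                  # simple dot-product self-attention
    x = Conv1D(64, kernel_size=2, activation='relu')(x)
    # concatenate global max pooling and global average pooling
    pooled = concatenate([GlobalMaxPooling1D()(x), GlobalAveragePooling1D()(x)])
    out = Dense(1, activation='sigmoid')(pooled)
    model = Model(inputs=inp, outputs=out)
    model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['AUC'])
    return model

model = build_model(embedding_matrix)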

Training the model

I trained the model for only 5 epochs with a batch size of 1024. I also trained the same model for 10, 15 and 25 epochs, but found that in most cases the model started to overfit after 5 epochs, and the results at that point were already very good. After the 5th epoch, the train and test loss and AUC are very similar, which reassures us that the model is not overfitting badly.
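
A minimal sketch of the training call (using the arrays and labels prepared in the previous steps):

# 5 epochs with a batch size of 1024, validating on the held-out 20% split
history = model.fit(
    train_pad, train_df['toxic'].values,
    validation_data=(test_pad, test_df['toxic'].values),
    batch_size=1024,
    epochs=5,
)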

Following are the learning plots across the epochs:

Here it may seem that training the model for a few more epochs would improve the results, but it doesn't.

The model starts to overfit right after the 5th epoch, so I trained for only 5 epochs.

In these plots we can see that the train and test scores are very close.

Conclusion

So with our final model, we get a final AUC of around 0.93 on the test set, with a loss of around 0.1233 on test and 0.1214 on train.

Also worth noting: an attention layer and bi-directional RNNs are very helpful for this kind of text classification task.

Github page of this case study : https://github.com/rumankhan1/Jigsaw-Unintended-Bias-in-Toxicity-Classification

my GitHub: https://github.com/rumankhan1

my LinkedIn: https://www.linkedin.com/in/rumankhan1/
