Comparison of Different Encoding Methods using Data Science


When applying models like linear regression, logistic regression, or random forest, it is often not just helpful but necessary to encode categorical variables, because most models can only take numeric inputs. Beyond the most common method, Dummy (One-Hot) Encoding, many other encoding methods are available, such as Impact Encoding and Weight-of-Evidence Encoding. In this blog, we first introduce seven different encoding methods and then apply them to four different datasets to measure their effectiveness. We find that some methods work particularly well when both accuracy and practical considerations are taken into account.

[Figure: workflow that incorporates various encoding methods]

     

    Encoding Methods

    In this section, we will talk about the seven encoding methods that we are going to compare in our experiments. We also provide a tiny dataset to show how these encoding methods work.

Here below is the data that we are going to encode, which includes one categorical variable (Type) and a boolean target variable (Failure):

ID | Type | Failure
1  | A    | True (1)
2  | A    | False (0)
3  | A    | True (1)
4  | B    | False (0)
5  | B    | True (1)
6  | C    | False (0)
7  | C    | False (0)
8  | D    | True (1)

    We will use each encoding method to encode the Type variable. If a target variable is necessary for the encoding method, then we will use the 1 and 0 values of the Failure target variable.

    Dummy (One-Hot) Encoding

For a categorical variable with n levels, this method creates n − 1 new boolean variables. Each new variable takes the value 0 or 1, where 1 indicates that the observation belongs to that level and 0 that it does not. This method does not require a target variable.

Type | Variable1 (Is Type==A?) | Variable2 (Is Type==B?) | Variable3 (Is Type==C?)
A    | 1                       | 0                       | 0
B    | 0                       | 1                       | 0
C    | 0                       | 0                       | 1
D    | 0                       | 0                       | 0
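As a minimal sketch, a similar encoding can be produced with pandas (the data frame and column names below are ours, mirroring the toy dataset above; note that get_dummies with drop_first=True drops the first level rather than the last one shown in the table):

```python
# A minimal sketch of one-hot encoding with pandas on the toy dataset above.
# drop_first=True keeps n - 1 indicator columns; pandas drops the first level
# (A) rather than the last one (D) as in the table, but the idea is the same.
import pandas as pd

df = pd.DataFrame({
    "ID": range(1, 9),
    "Type": list("AAABBCCD"),
    "Failure": [1, 0, 1, 0, 1, 0, 0, 1],
})

dummies = pd.get_dummies(df["Type"], prefix="Type", drop_first=True)
df = pd.concat([df, dummies], axis=1)
print(df)
```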

    Helmert Encoding

This method creates n − 1 new variables. The main difference from one-hot encoding is that each new variable can take more values than just 0 or 1. For each new variable, the values across levels sum to 0. This method does not require a target variable, and it can be useful for categorical variables whose levels are ordered.

Type | Variable1 | Variable2 | Variable3
A    | 3/4       | 0         | 0
B    | -1/4      | 2/3       | 0
C    | -1/4      | -1/3      | 1/2
D    | -1/4      | -1/3      | -1/2
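A hedged sketch using the third-party category_encoders package (our own illustration; the contrast values it produces follow the standard Helmert convention and may be scaled differently from the fractions in the table above):

```python
# A minimal sketch of Helmert encoding using the category_encoders package
# (pip install category_encoders). Its contrast values follow the standard
# Helmert convention and may be scaled differently from the table above.
import pandas as pd
import category_encoders as ce

df = pd.DataFrame({"Type": list("AAABBCCD")})
encoder = ce.HelmertEncoder(cols=["Type"])
encoded = encoder.fit_transform(df)
print(encoded)
```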

     

    Binary Encoding

This method creates ⌈log₂(n)⌉ new variables, each of which is 0 or 1. Each level is converted to a binary number, and each binary digit becomes a new variable. This method does not require a target variable.

Type | Variable1 | Variable2
A    | 0         | 0
B    | 0         | 1
C    | 1         | 0
D    | 1         | 1
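A minimal sketch of this done by hand with pandas and NumPy (the integer codes assigned to the levels are our own choice, so the columns may come out as a permutation of the table above):

```python
# A minimal sketch of binary encoding by hand: map each level to an integer,
# then split that integer's binary representation into separate 0/1 columns.
import numpy as np
import pandas as pd

types = pd.Series(list("AAABBCCD"), name="Type")
codes = types.astype("category").cat.codes.to_numpy()   # A->0, B->1, C->2, D->3
n_bits = int(codes.max()).bit_length()                   # number of new variables
binary_cols = pd.DataFrame(
    {f"Variable{i + 1}": (codes >> (n_bits - 1 - i)) & 1 for i in range(n_bits)}
)
print(pd.concat([types, binary_cols], axis=1).drop_duplicates())
```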

    Frequency Encoding

    This method uses the frequency of the level as the encoded value. This method does not require a target variable. Different levels of the categorical variable may have the same encoded value.

Type | EncodedValue
A    | 3/8
B    | 2/8
C    | 2/8
D    | 1/8
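A minimal sketch with pandas (the new column name is ours):

```python
# A minimal sketch of frequency encoding: each level is replaced by its
# relative frequency in the data (A -> 3/8, B -> 2/8, C -> 2/8, D -> 1/8).
import pandas as pd

df = pd.DataFrame({"Type": list("AAABBCCD")})
freq = df["Type"].value_counts(normalize=True)
df["Type_freq"] = df["Type"].map(freq)
print(df.drop_duplicates())
```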

    Hash Encoding

This encoding method hashes each level to some value using a standard hashing algorithm (e.g. MD5) and then takes the remainder modulo a specified number of buckets. The advantage of such a method is that if categories are missing from the training set, new levels found in the test set can still be encoded with the same hash function: we do not need to manually create a default value for unseen categories, and different unseen categories will generally be encoded to different values (although collisions are possible). Different levels of the categorical variable may have the same encoded value.

In this example, suppose the number of buckets that we specify is 3; the hashed values are fabricated.

Type | HashValue | EncodedValue
A    | 17        | 2
B    | 42        | 0
C    | 25        | 1
D    | 4         | 1
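A minimal sketch of this idea, hashing with MD5 and taking the remainder modulo 3 (our own illustration; the resulting values will differ from the fabricated ones in the table):

```python
# A minimal sketch of hash encoding: hash each level with MD5 and take the
# remainder modulo the chosen number of buckets (3 here, as in the example).
# The resulting values will differ from the fabricated ones in the table.
import hashlib
import pandas as pd

def hash_encode(level: str, n_buckets: int = 3) -> int:
    digest = hashlib.md5(level.encode("utf-8")).hexdigest()
    return int(digest, 16) % n_buckets

df = pd.DataFrame({"Type": list("AAABBCCD")})
df["Type_hash"] = df["Type"].map(hash_encode)
print(df.drop_duplicates())
```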

    Impact Encoding

Impact encoding takes the value of the target variable into consideration by computing the mean of the target variable for each level. A level associated with larger target values therefore receives a larger encoded value. Different levels of the categorical variable may have the same encoded value.

Type | Encoded Value
A    | 2/3
B    | 1/2
C    | 0
D    | 1
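A minimal sketch with pandas that reproduces the values above (in a real workflow the means must be computed on the training set only, as discussed later):

```python
# A minimal sketch of impact (target-mean) encoding: each level is replaced
# by the mean of the target variable for that level (A -> 2/3, B -> 1/2, ...).
import pandas as pd

df = pd.DataFrame({"Type": list("AAABBCCD"),
                   "Failure": [1, 0, 1, 0, 1, 0, 0, 1]})
level_means = df.groupby("Type")["Failure"].mean()
df["Type_impact"] = df["Type"].map(level_means)
print(df.drop_duplicates())
```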

    Weight-of-Evidence Encoding

This method of encoding is only used for binary classification problems. The encoded value is WoE = ln(# of True / # of False). Since the number of True or False values could be 0, we need some smoothing to guard against those cases. A naive approach is to manually set the number of True/False to a tiny number if its value is 0. Different levels of the categorical variable may have the same encoded value.

Type | Encoded Value
A    | 0.693147
B    | 0
C    | -13.815511
D    | 13.815511
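A minimal sketch with pandas and NumPy, using 1e-6 as the tiny smoothing constant (an assumption on our part; the encoded values for levels with a zero count depend on the constant chosen, so they may differ slightly from the table):

```python
# A minimal sketch of WoE encoding with naive smoothing: zero counts are
# replaced by a tiny constant (1e-6 here), so the values for C and D depend
# on the constant chosen and may differ slightly from the table above.
import numpy as np
import pandas as pd

df = pd.DataFrame({"Type": list("AAABBCCD"),
                   "Failure": [1, 0, 1, 0, 1, 0, 0, 1]})

eps = 1e-6
grouped = df.groupby("Type")["Failure"]
n_true = grouped.sum().astype(float).clip(lower=eps)
n_false = (grouped.count() - grouped.sum()).astype(float).clip(lower=eps)
woe = np.log(n_true / n_false)
df["Type_woe"] = df["Type"].map(woe)
print(df.drop_duplicates())
```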

    Datasets Overview

    Here below are the datasets that we will experiment on. Each of them has at least one categorical variable to be encoded. 

    • Diamonds

      • 3 categorical, 6 continuous, 53940 rows in total

      • Categorical variables: cut (5 distinct levels), color (7), clarity (8)

      • Response: price 

    • Adult Income

      • 8 categorical, 5 continuous, 48842 rows

      • Categorical: workclass (8 distinct levels), education (16), marital status (7), occupation (14), relationship (6), race (5), sex (2), native country (41)

      • Response: Income (Categorical, >50K or not)

      • Factors: age, workclass, education, marital-status, occupation, relationship, race, sex, capital gain, capital loss, hours per week, native country

      • 20% of data are positive (>50K)

    • San Francisco Crime

      • 1 categorical, 0 continuous, 1M rows

      • Categorical: Address (more than 20K levels)

      • Response: whether the crime is violent or not

    • Vancouver Crime

      • 3 categorical, 1 continuous, 500K rows

      • Categorical: Address (more than 20K levels)

      • Response: whether the crime is violent or not

    Summary

     

                             | Diamonds   | Adult Income   | SF Crime       | Vancouver Crime
Task                         | Regression | Classification | Classification | Classification
# of Categorical Variables   | 3          | 8              | 1              | many
# of Levels                  | few        | varies         | many           | many
Class Balance (True : False) | N/A        | 2 : 8          | 1 : 9          | 4 : 6

    Testing the Methods

Here below is a sample workflow, in this case for the Diamonds dataset. We also have a baseline model, which only takes in continuous variables, to compare against. A model that outperforms the baseline can be said to gain information by encoding categorical variables.

[Workflow diagram for the Diamonds dataset]

    Here is a summary of the steps shown in the above workflow:

    1. Split the dataset into Training set and Test set.

    2. Compute encoded value based on the Training set.

    3. Apply the encoding rule to Test set.

4. Rebalance the Training set if necessary. (For unbalanced datasets, we will compare the results with and without rebalancing.)

    5. Apply Linear Regression / Logistic Regression / Random Forest to Training set.

6. Measure performance by MSE / R-squared / Variance Explained / F-score.

Note: When encoding, we must calculate encoded values based only on the training set. This is because the encoding rules are established and applied before model training, when we only have access to the training data. That means we need to first split the dataset, then compute the encoded values, and finally apply the encoding rule to the test set. We therefore also need to handle the case where a level appears only in the test set; for example, we may manually create a level called other, whose encoded value is the average of the encoded values of all other levels. In the sample workflow above, we split the dataset inside a Python Notebook (called by the Python Execute operator), append a label to each row, and finally encode the training dataset. The operators called Train and Test are just filters based on that label.
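Here is a minimal sketch of steps 1-3 for impact encoding on the toy dataset from earlier (the column names and the "other" fallback rule come from the description above; everything else is our own illustration, not the exact operators from the workflow):

```python
# A minimal sketch of steps 1-3 for impact encoding: split first, compute the
# encoding on the training set only, then apply it to the test set, mapping
# any level unseen in training to an "other" value (the mean of the level
# encodings), as described above.
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.DataFrame({"Type": list("AAABBCCD"),
                   "Failure": [1, 0, 1, 0, 1, 0, 0, 1]})

# 1. Split the dataset into a training set and a test set.
train, test = train_test_split(df, test_size=0.25, random_state=0)
train, test = train.copy(), test.copy()

# 2. Compute encoded values based on the training set only.
encoding = train.groupby("Type")["Failure"].mean()
other_value = encoding.mean()          # fallback for levels unseen in training

# 3. Apply the encoding rule to both sets.
train["Type_encoded"] = train["Type"].map(encoding)
test["Type_encoded"] = test["Type"].map(encoding).fillna(other_value)
```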

    Encoding Comparison Results

    Diamonds Dataset

We try to predict the price of each diamond based on parameters like color, cut, and clarity. We use Linear Regression and Random Forest for this task.

[Charts: test-set MSE and R-squared by encoding method and model]

     

Encoding | MSE (Linear Regression) | MSE (Random Forest) | R-squared (Random Forest) | R-squared (Linear Regression)
Impact   | 1.65E+06 | 5.80E+05 | 0.9639 | 0.8953
Binary   | 1.83E+06 | 5.53E+05 | 0.9655 | 0.8843
Freq     | 2.11E+06 | 4.05E+05 | 0.9748 | 0.8664
Baseline | 2.34E+06 | 2.03E+06 | 0.8738 | 0.8515
Hash     | 1.98E+06 | 6.14E+05 | 0.9617 | 0.8749
Dummy    | 1.28E+06 | 3.60E+05 | 0.9775 | 0.9191
Helmert  | 1.30E+06 | 3.55E+05 | 0.9779 | 0.9178

The chart on the top left shows the MSE of our predictions on the test set; lower is better. All encoding methods outperform the baseline with both models, so we can say that every encoding method provides some information. The best performers here are Dummy and Helmert encoding. The R-squared values are not very informative here because the models all have very similar R-squared values.

We believe that Dummy and Helmert encoding win here because they create many variables, consuming a large number of degrees of freedom. Binary encoding, which adds only a few new variables, also shows some advantage. Impact encoding, which incorporates the target value into the encoded value, also performs reasonably well on this regression task.

    Adult Income Dataset

This dataset has many categorical variables, and the number of levels varies. We try to predict whether a person's income is greater than $50,000 or not, based on their education, age, gender, occupation, etc. Since the dataset is not very balanced, we also compare the results before and after rebalancing.


     

Encoding | Logistic Regression (Balanced) | Logistic Regression (Unbalanced) | Random Forest (Balanced) | Random Forest (Unbalanced)
WOE      | 0.684 | 0.633 | 0.692 | 0.673
Impact   | 0.676 | 0.629 | 0.694 | 0.670
Binary   | 0.660 | 0.613 | 0.683 | 0.652
Freq     | 0.625 | 0.536 | 0.680 | 0.655
Hash     | 0.604 | 0.450 | 0.671 | 0.628
Helmert  | 0.686 | 0.642 | 0.686 | 0.661
Dummy    | 0.682 | 0.652 | 0.698 | 0.670
Baseline | 0.518 | 0.407 | 0.550 | 0.425

For logistic regression, we see that Hash encoding and Frequency encoding do a poor job. In contrast, WoE and Impact encoding perform comparatively well, on par with or better than Dummy and Helmert encoding for both models.

Why do Impact and WoE encoding do so well? In this case, we can think of the Random Forest as using the encoded values to create a number of "if-else" splits, so that WoE and Impact encoding, with only one variable per categorical feature, can result in effectively the same tree as Dummy or Helmert encoding.

    San Francisco Crime Dataset

Recall that this dataset is somewhat unbalanced, with only 10% positive responses, so all models can benefit from rebalancing.

    In addition, since we have more than 20,000 levels for the Address variable, directly applying Dummy Encoding and Helmert Encoding would result in an extremely wide dataset which might lead to performance issues, or possibly running out of memory. Therefore, we do not include these two methods in our comparison.


     

Encoding | Fraction of Variance Explained (Balanced) | Fraction of Variance Explained (Unbalanced) | F-score (Balanced) | F-score (Unbalanced)
WOE      | 0.116 | 0.086 | 0.249 | 0.004
Impact   | 0.059 | 0.045 | 0.252 | 0.005
Binary   | 0.002 | 0.002 | 0.205 | 0.000
Freq     | 0.000 | 0.000 | 0.202 | 0.000
Hash     | 0.000 | 0.000 | 0.185 | 0.000

Even though all the models perform poorly on this dataset, we can still compare them. WoE and Impact encoding clearly win: they are the only methods that produce non-trivial models on the unbalanced data. If the dataset is not rebalanced, Binary, Frequency, and Hash encoding always predict the negative class, which leads to a zero F-score.

The implication here is that WoE and Impact encoding are particularly useful when the number of categorical levels is very high. However, the dataset is rather unbalanced, and there is only one independent variable, so the models are weak and the results are therefore inconclusive.

We decided to try a similar dataset that is less unbalanced and has more variables.

    Vancouver Crime Dataset

This dataset is less unbalanced. We again try to predict whether the crime is violent or not. In this dataset, we include the continuous variable Year and the categorical variables Month, Day of Week, and Address. We still consider Dummy and Helmert encoding to be impractical here, so we leave them out of the comparison.


     

Encoding | Fraction of Variance Explained | Accuracy
WOE      | 0.194 | 0.675
Impact   | 0.123 | 0.661
Binary   | 0.032 | 0.619
Freq     | 0.058 | 0.647
Hash     | 0.001 | 0.605
Baseline | 0.000 | 0.605

Unfortunately, the Hash encoding and Baseline models still always predict the negative class.

What is surprising in this dataset is that Frequency encoding does a noticeably better job. One reason for this is illustrated by the figure below.

[Chart: probability of a violent crime versus the number of cases reported at an address]

The more cases reported at an address, the larger its encoded value, but also the larger the probability of a violent crime. So Frequency encoding here gains a similar advantage to WoE or Impact encoding. However, this does not explain the poor performance of Frequency encoding on the previous datasets.

Still, the clear winners are Impact and WoE encoding, as we suspected.

    Summary

    1. WoE and Impact Encoding win by taking the target variable into consideration.

    2. Helmert and Dummy Encoding win by having lots of variables at the expense of degrees of freedom. However, when the number of levels is extremely large, we may face performance issues, and implementing them in a production setting may be impractical. (We can consider grouping some levels together before encoding.)

3. Binary, Frequency, and Hash Encoding generally do not provide as much information about the values to be predicted.

Note: This experiment comparing different categorical encoding methods was inspired by the article Modeling Trick: Impact Coding of Categorical Variables with Many Levels, in which Impact Encoding was applied to the SF Crime dataset. That article uses only the Address variable and data from June 2012. We also take only the Address variable into account, but apply multiple encoding methods and use the data from 2003 to 2018.

    Authors

This blog was written by Zongyuan Chen, who worked as an intern at TIBCO in the summer of 2020, focusing on encoding methods for categorical variables, modeling trends of COVID-19, and data visualization. He studied Computer Science and Statistics at the University of Washington, and is now pursuing a master's degree in Financial Engineering at Columbia University.

Steven Hillion works on large-scale machine learning. He was the co-founder of Alpine Data, acquired by TIBCO, and before that he built a global team of data scientists at Pivotal and developed a suite of open-source and enterprise software in machine learning. Earlier, he led engineering at a series of start-ups. Steven is originally from Guernsey, in the British Isles. He received his Ph.D. in mathematics from the University of California, Berkeley, and before that read mathematics at Oxford University.

