House price prediction 3/4: What is One Hot Encoding

A series about creating a model using Python and Tensorflow, then importing the model and making predictions using Javascript in a Vue.js application. The video is above, and below you will find some useful notes.

Here, in part 3 of this series, I will show what one hot encoding is and how it works.

In the first post, called House price prediction 1/4: Using Keras/Tensorflow and python, I talked about how to create a model in Python, pre-process a dataset I had already created, train the model, post-process and predict, and finally how to create different files that share some information about the data for use in the second part.

Then in part 2, called House price prediction 2/4: Using Tensorflow.js, Vue.js and Javascript, I took the model and the data for pre- and post-processing, and after loading everything we were finally able to predict house prices using Vue.js.

And finally, in part 4, I will show what normalizing the inputs means and why it is important.

1. Pre-reqs
- Have node.js installed
- Have Python 3.x installed
- Have the code for the previous posts: Python Real Estate - Github and Real Estate price prediction using Tensorflow.js and Vue.js
2. What is One Hot Encoding
- One hot encoding is a process that takes a set of named categories with no implicit order between their values, such as colors (red, green and blue) or fruits (apples, lemons and blueberries), and transforms them into a numeric representation so that a machine learning algorithm can perform operations on those values.
- A named category in which the order between the values is implicit does not need to be One Hot Encoded. One such category could be medals, with values like Gold, Silver or Bronze, which are awarded according to a position in an ordered ranking (who came first, second or third). In such cases we can say that gold is better than silver, which in turn is better than bronze, so the values can be mapped directly to ordered numbers.
- So for a list of neighborhoods, at a high level, the encoding or transformation process would typically entail the following steps.
  We start with the named categories:

  | Neighborhood |
  | --- |
  | medellin aranjuez |
  | medellin centro |
  | medellin belen rosales |
  | medellin la castellana |
  | medellin la castellana |

  Transforming the named categories into numbers, for example by assigning an index to each one of them:

  | Index | Neighborhood |
  | --- | --- |
  | 0 | medellin aranjuez |
  | 1 | medellin centro |
  | 2 | medellin belen rosales |
  | 3 | medellin la castellana |
  | 4 | medellin la castellana |

  Then we would create a vector representation with as many columns as there are categories: for each category we take its index from the previous step and put a “One” in the column that matches that index and “Zeros” in every other column. The column indexes are shown in the header:

  | 0 | 1 | 2 | 3 | 4 | Neighborhood |
  | --- | --- | --- | --- | --- | --- |
  | 1 | 0 | 0 | 0 | 0 | medellin aranjuez |
  | 0 | 1 | 0 | 0 | 0 | medellin centro |
  | 0 | 0 | 1 | 0 | 0 | medellin belen rosales |
  | 0 | 0 | 0 | 1 | 0 | medellin la castellana |
  | 0 | 0 | 0 | 0 | 1 | medellin la castellana |
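
The two cases described in this section can be sketched in a few lines of plain Python. This is only a minimal illustration, not the code used in the series: `MEDAL_RANK`, `NEIGHBORHOODS` and `one_hot_encode` are hypothetical names, and the neighborhood list here keeps each name once, on the assumption that every distinct category gets exactly one column.

```python
# Ordered category: the ranking is meaningful, so one number per value is enough.
MEDAL_RANK = {"bronze": 0, "silver": 1, "gold": 2}

# Unordered categories: every distinct neighborhood gets its own column.
NEIGHBORHOODS = [
    "medellin aranjuez",
    "medellin centro",
    "medellin belen rosales",
    "medellin la castellana",
]


def one_hot_encode(values, categories):
    """Return one vector per value, with a 1 in the column of its category."""
    # Step 1: assign an index to each category.
    index_of = {category: index for index, category in enumerate(categories)}
    vectors = []
    for value in values:
        # Step 2: start from all zeros and set the matching column to one.
        vector = [0] * len(categories)
        vector[index_of[value]] = 1
        vectors.append(vector)
    return vectors


print(MEDAL_RANK["gold"] > MEDAL_RANK["silver"])  # True
print(one_hot_encode(["medellin centro", "medellin la castellana"], NEIGHBORHOODS))
# [[0, 1, 0, 0], [0, 0, 0, 1]]
```

A value that is not in `categories` raises a `KeyError` in this sketch, which is one reason the encoding has to be built from the complete set of categories.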

3. Using a Scikit Learn Preprocessor

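To use a preprocessor like the label encoder, you first instantiate it, fit it on the complete dataset, and then call transform whenever you need to encode a particular set of values.

```python
from sklearn.preprocessing import OneHotEncoder, LabelEncoder

# Complete set of categories the encoder needs to know about
neighborhoods = [
    'envigado',
    'poblado',
    'centro',
    'laureles',
    'bello',
]

# Values we want to encode once the encoder is fitted
labels_to_test = [
    'laureles',
    'centro',
    'laureles',
]

# Instantiate the encoder and fit it on the complete set of categories
label_encoder = LabelEncoder()
label_encoder.fit(neighborhoods)

# Classes are stored in sorted order, so the encoded labels are [3 1 3]
print('Label Encoded String', label_encoder.transform(labels_to_test))
```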
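
To make the instantiate, fit, transform pattern more concrete, here is a minimal pure-Python sketch of what a label encoder does conceptually. `SimpleLabelEncoder` is a hypothetical class written for illustration, not part of scikit-learn; it mirrors `LabelEncoder` only in that it sorts the unique labels and uses each label's position as its code.

```python
class SimpleLabelEncoder:
    """Hypothetical stand-in that mimics the fit/transform pattern."""

    def fit(self, values):
        # Keep the sorted unique labels; a label's position is its code.
        self.classes_ = sorted(set(values))
        self._index_of = {label: i for i, label in enumerate(self.classes_)}
        return self

    def transform(self, values):
        # Labels that were never seen during fit cannot be encoded.
        unseen = [v for v in values if v not in self._index_of]
        if unseen:
            raise ValueError(f"unseen labels: {unseen}")
        return [self._index_of[v] for v in values]


encoder = SimpleLabelEncoder().fit(
    ["envigado", "poblado", "centro", "laureles", "bello"]
)
print(encoder.classes_)  # ['bello', 'centro', 'envigado', 'laureles', 'poblado']
print(encoder.transform(["laureles", "centro", "laureles"]))  # [3, 1, 3]
```

Fitting on the complete dataset matters because any label missing at fit time has no code, so transforming it later fails.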

4. Using the Label Encoder and the One Hot Encoder

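First you should configure both encoders by calling fit on each of them. The one hot encoder is fitted on the output of the label encoder's transform.

```python
# Fit the label encoder on the complete set of categories
label_encoder = LabelEncoder()
label_encoder.fit(complete_set_of_data)

# Dense output: scikit-learn 1.2+ names this parameter sparse_output,
# older versions use sparse=False instead
onehot_encoder = OneHotEncoder(sparse_output=False)

# Turn the categories into integers, reshape them into a single column
# and fit the one hot encoder on that column
categorical_column = label_encoder.transform(complete_set_of_data)
integer_encoded = categorical_column.reshape(len(categorical_column), 1)
onehot_encoder.fit(integer_encoded)
```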

Then you would call transform on both encoders when transforming a set of values.

```python
# label_encoder.transform returns a NumPy array, so it can be reshaped directly
integer_labels = label_encoder.transform(values_to_transform)
integer_encoded = integer_labels.reshape(len(integer_labels), 1)
onehot_encoded = onehot_encoder.transform(integer_encoded)
```
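
As an alternative sketch, assuming scikit-learn 0.20 or later, `OneHotEncoder` can be fitted on the string categories directly, which removes the intermediate `LabelEncoder` step. This is not the approach used in this series, only a shorter variant for comparison. Calling `.toarray()` on the default sparse result avoids depending on the `sparse`/`sparse_output` parameter name, and `handle_unknown="ignore"` is an optional choice made here so that unknown values become all zeros instead of raising an error.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# The encoder expects a 2D array: one row per sample, one column per feature.
neighborhoods = np.array(
    ["envigado", "poblado", "centro", "laureles", "bello"]
).reshape(-1, 1)

# Fit on the complete set of categories, as with the two-step approach.
encoder = OneHotEncoder(handle_unknown="ignore")
encoder.fit(neighborhoods)

# Transform new values; the result is sparse by default, so densify it.
values_to_transform = np.array(["laureles", "centro"]).reshape(-1, 1)
onehot_encoded = encoder.transform(values_to_transform).toarray()

print(encoder.categories_[0])  # the learned categories, in sorted order
print(onehot_encoded)
```

The columns follow the sorted category order stored in `encoder.categories_`, which is what the Javascript side would have to reproduce when encoding inputs for the same model.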

5. One Hot Encode in Javascript

Before using the model created in Python inside Javascript, we have to one hot encode the data we want to predict. In this case I used a Javascript library for that.

```javascript
import * as onehot from 'one-hot-enum';

// Encode every category except the first one
let reducedlist = this.completeSetOfData.slice(1);
let enumaration = onehot.enumaration(reducedlist);
let encoded = onehot.encode(reducedlist);
// All-zeros vector with the same length as the encoded vectors
let zeros = new Array(encoded[0].length).fill(0);

// Map each category name to its one hot encoded vector
this.dictionary = {};

for (let i in enumaration) {
  this.dictionary[enumaration[i]] = encoded[i];
}

// The first category is represented by the all-zeros vector
this.dictionary[this.neighborhoods[0]] = zeros;
```

Then, to transform a string into its one hot encoded vector representation, you just need to look the string up in the dictionary:
```
this.dictionary[valueToTransform]
```
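
To make explicit what the dictionary holds, here is a Python sketch of the same lookup-table idea: the first category maps to an all-zeros vector and every other category gets a one hot vector over the remaining categories. `build_dictionary` is a hypothetical helper, and the column order used here (list order) is an assumption; the order produced by one-hot-enum is not verified here.

```python
def build_dictionary(categories):
    """Map each category to its vector, with the first category as all zeros."""
    first, rest = categories[0], categories[1:]
    # The first category is represented by zeros in every column.
    dictionary = {first: [0] * len(rest)}
    for index, category in enumerate(rest):
        # Every other category gets a single one in its own column.
        vector = [0] * len(rest)
        vector[index] = 1
        dictionary[category] = vector
    return dictionary


dictionary = build_dictionary(["envigado", "poblado", "centro"])
print(dictionary["envigado"])  # [0, 0]
print(dictionary["poblado"])   # [1, 0]
print(dictionary["centro"])    # [0, 1]
```

Printing a table like this on both the Python and the Javascript side is a cheap way to check that each neighborhood ends up with the same vector before feeding it to the model.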

6. Resources
- Why One-Hot Encode Data in Machine Learning?
- What is One Hot Encoding? Why And When do you have to use it?
- sklearn.preprocessing.LabelEncoder
- sklearn.preprocessing.OneHotEncoder
- one-hot-enum
- One-hot Encoding explained
