Preprocessing data for data science (Part 1)

Starting the data cleaning process

import pandas as pd
raw_df = pd.read_csv('data.csv')
# raw_df.info() # display all the columns and their data type
# it's useful to have the columns in a list you can modify easily
useful_columns = [
    'Age',
    'Preferred Foot',
    'Height',
    'Weight',
    'Agility',
    'Strength',
    'Dribbling',
    'Jumping',
    'Marking',
    'Interceptions',
    'Position',  # I like to keep my target column at the end
]
working_df = raw_df[useful_columns]
print(working_df.head())
   Age Preferred Foot Height  Weight  Agility  Strength  Dribbling  Jumping  \
0   31           Left    5'7  159lbs     91.0      59.0       97.0     68.0
1   33          Right    6'2  183lbs     87.0      79.0       88.0     95.0
2   26          Right    5'9  150lbs     96.0      49.0       96.0     61.0
3   27          Right    6'4  168lbs     60.0      64.0       18.0     67.0
4   27          Right   5'11  154lbs     79.0      75.0       86.0     63.0

   Marking  Interceptions Position
0     33.0           22.0       RF
1     28.0           29.0       ST
2     27.0           36.0       LW
3     15.0           30.0       GK
4     68.0           61.0      RCM

  • Encoding: to get a numerical representation of categorical values (usually strings)
  • Scaling: to normalize continuous values

Encoding

  • CB -> 0
  • CM -> 1
  • GK -> 2
  • LB -> 3
  • ST -> 4
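
The mapping above can be reproduced with a minimal sketch. It assumes a toy list of five positions (not the full dataset) and relies on `LabelEncoder` sorting the classes before numbering them, which is why the codes follow alphabetical order.

```python
from sklearn.preprocessing import LabelEncoder

# Toy sample of positions (an assumption for illustration, not the real column)
positions = ['ST', 'GK', 'CB', 'LB', 'CM', 'GK']

encoder = LabelEncoder()
encoder.fit(positions)

# classes_ holds the sorted unique labels; the index of each label is its code
mapping = {str(label): code for code, label in enumerate(encoder.classes_)}
print(mapping)  # {'CB': 0, 'CM': 1, 'GK': 2, 'LB': 3, 'ST': 4}
```

On the real column there are more distinct positions, so the codes differ from this toy mapping.
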
from sklearn.preprocessing import LabelEncoder
# create the encoder for the column
position_encoder = LabelEncoder()
# learn the classes and assign a code to each
position_encoder.fit(working_df['Position'])

# get the encoded column
encoded_position = position_encoder.transform(working_df['Position'])
encoded_position
array([21, 26, 14, ..., 26, 24,  4])
position_encoder.inverse_transform(encoded_position)
array(['RF', 'ST', 'LW', ..., 'ST', 'RW', 'CM'], dtype=object)

Scaling

from sklearn.preprocessing import MinMaxScaler

# There are many types of scalers
strength_scaler = MinMaxScaler()
# Note that the scalers receive a 2D array as input
strength_scaler.fit(working_df[['Strength']])
# Get the scaled version
scaled_strength = strength_scaler.transform(working_df[['Strength']])
scaled_strength
array([[0.525 ],
[0.775 ],
[0.4 ],
...,
[0.1875],
[0.3875],
[0.5375]])
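
As a rough check of what `MinMaxScaler` does with its default `feature_range=(0, 1)`, the sketch below applies the formula (x - min) / (max - min) by hand to a few made-up values and compares the result with the scaler's output. The sample values are assumptions chosen for illustration, not taken from the dataset.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Made-up strength values as a single-column 2D array (illustrative only)
values = np.array([[17.0], [59.0], [79.0], [97.0]])

# Manual min-max scaling: (x - min) / (max - min)
manual = (values - values.min()) / (values.max() - values.min())

# Same computation through scikit-learn
scaled = MinMaxScaler().fit_transform(values)

print(np.allclose(manual, scaled))  # True
```

The smallest value maps to 0 and the largest to 1, so a single extreme value compresses everything else into a narrow range.
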

Finishing the pipeline

# New data frame
clean_df = pd.DataFrame()

# I will create a dictionary for storing all my encoders
encoders = {
'Preferred Foot': LabelEncoder(),
'Position': LabelEncoder()
}

# Encode all the categorical features
for col, encoder in encoders.items():
    encoder.fit(working_df[col])
    clean_df[col] = encoder.transform(working_df[col])

scalers = {
'Agility': MinMaxScaler(),
'Strength': MinMaxScaler(),
'Dribbling': MinMaxScaler(),
'Jumping': MinMaxScaler(),
'Marking': MinMaxScaler(),
'Interceptions': MinMaxScaler()
}

# Scale all the continuous features
for col, scaler in scalers.items():
    scaler.fit(working_df[[col]])
    # transform returns an (n, 1) array; flatten it to a 1D column
    clean_df[col] = scaler.transform(working_df[[col]]).ravel()

print(clean_df.head())
   Preferred Foot  Position   Agility  Strength  Dribbling  Jumping   Marking  \
0               0        21  0.939024    0.5250   1.000000   0.6625  0.329670
1               1        26  0.890244    0.7750   0.903226   1.0000  0.274725
2               1        14  1.000000    0.4000   0.989247   0.5750  0.263736
3               1         5  0.560976    0.5875   0.150538   0.6500  0.131868
4               1        19  0.792683    0.7250   0.881720   0.6000  0.714286

   Interceptions
0       0.213483
1       0.292135
2       0.370787
3       0.303371
4       0.651685

# index=False keeps the row index out of the file
clean_df.to_csv('clean_data.csv', index=False)
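
One benefit of keeping the fitted encoders and scalers in dictionaries is that the transformations can be reversed later, for example to read encoded positions back as names. A minimal sketch, assuming a small made-up frame in place of `working_df`; the names `toy_df`, `toy_clean` and `decoded_df` are hypothetical:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder, MinMaxScaler

# Small made-up frame standing in for working_df (illustrative values only)
toy_df = pd.DataFrame({
    'Position': ['RF', 'ST', 'GK', 'ST'],
    'Strength': [59.0, 79.0, 64.0, 75.0],
})

encoders = {'Position': LabelEncoder()}
scalers = {'Strength': MinMaxScaler()}

# Forward pass, same pattern as the pipeline above
toy_clean = pd.DataFrame()
for col, encoder in encoders.items():
    toy_clean[col] = encoder.fit_transform(toy_df[col])
for col, scaler in scalers.items():
    toy_clean[col] = scaler.fit_transform(toy_df[[col]]).ravel()

# Reverse pass: the fitted objects recover the original values
decoded_df = pd.DataFrame()
for col, encoder in encoders.items():
    decoded_df[col] = encoder.inverse_transform(toy_clean[col])
for col, scaler in scalers.items():
    decoded_df[col] = scaler.inverse_transform(toy_clean[[col]]).ravel()

print(decoded_df)
```

This only works with the same fitted objects, so a fresh `LabelEncoder()` or `MinMaxScaler()` cannot decode values produced by another instance.
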

Luis Carlos Contreras

Software and Data Engineer. TCG and video games lover.