Python Data Normalization: Complete Guide
The world runs on data. It has driven some of the most significant innovations of the last decade, with AI and machine learning at the forefront. From users developing feelings for ChatGPT to luxury retailers like Michael Kors deploying AI virtual assistants, data is the lifeblood of the digital, AI-oriented world we live in.
If you're looking to harness the potential of AI and machine learning in your business, you must first ensure that your data is in the correct format. This is where data normalization comes in. Loosely defined, it is the process of transforming numerical and text data into a standard format, which can help improve the accuracy of a machine learning model.
With this in mind, today's article will explore:
- Data normalization and why it's important
- Common data normalization techniques
- How to apply these techniques in Python
Keep reading below for more insights.
What Is Data Normalization?
Data normalization is one of the most important steps in data preprocessing. Without it, getting insights from any information becomes difficult, especially when working with varied datasets.
For example, if you're comparing the performance of students in a class across different subjects and these subjects have varying grading scales, you may use data normalization techniques to convert the grades into a similar scale for easier comparison.
Take an example of math, science, English, and history with the following grading scales:
- Math: 0 to 100
- English: 0 to 50
- Science: 0 to 80
- History: 0 to 30
A student scores the following:
- Math: 80
- English: 35
- Science: 50
- History: 20
At first glance, math might seem far more important than history simply because the subjects use different grading scales. Comparing a student's performance across subjects in this raw format is complicated. Data normalization makes it easier by rescaling the scores to a standard range, typically 0 to 1 or -1 to 1.
First, let's prepare this data for normalization by converting each score into a percentage of its maximum:
- Math: 80 / 100 = 0.80
- English: 35 / 50 = 0.70
- Science: 50 / 80 = 0.625
- History: 20 / 30 ≈ 0.667
The maximum percentage is 0.80 (math), and the minimum is 0.625 (science).
One of the simplest ways to normalize data is Min-Max Scaling. This technique rescales each value using the minimum and maximum values of the feature, mapping everything onto a 0-to-1 range. We'll use it to manually normalize the performance data above. Here is the formula:
Normalized Value = (X - X_min) / (X_max - X_min)
Where:
- X is the original value.
- X_min is the minimum value of the feature in the dataset.
- X_max is the maximum value of the feature in the dataset.
We can normalize this data as follows:
- Math:
Normalized Value = (0.80 - 0.625) / (0.80 - 0.625) = 1.000
- English:
Normalized Value = (0.70 - 0.625) / (0.80 - 0.625) = 0.429
- Science:
Normalized Value = (0.625 - 0.625) / (0.80 - 0.625) = 0.000
- History:
Normalized Value = (0.6667 - 0.625) / (0.80 - 0.625) = 0.238
Now you have:
- Math: 1.00
- English: 0.429
- Science: 0.000
- History: 0.238
The scores are now on a standard scale, making it much easier to draw insights, such as which subject the student performed best and worst in. This is exactly what data normalization seeks to achieve.
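If you want to verify these numbers yourself, here is a minimal pure-Python sketch of the same calculation, using the percentages computed above (no libraries required):
# A minimal sketch of the Min-Max formula applied to the percentage scores
percentages = {'Math': 80 / 100, 'English': 35 / 50,
               'Science': 50 / 80, 'History': 20 / 30}
x_min = min(percentages.values())  # 0.625 (science)
x_max = max(percentages.values())  # 0.80 (math)
for subject, x in percentages.items():
    print(f"{subject}: {(x - x_min) / (x_max - x_min):.3f}")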
Why Data Normalization Is Important
Data normalization is important when preprocessing data for algorithms and machine learning models. To understand why, let's return to our school performance example and recall the raw scores:
- Math: 80
- English: 35
- Science: 50
- History: 20
Suppose we want to use this data to train a machine learning algorithm. Feeding these data points into the system in their current format can cause various problems down the road. To explain why, we will offer a broad, general explanation without delving too much into the intricate details.
Think of the machine learning process as a landscape of different-sized hills; the lowest point represents the lowest error, while the highest peak represents the highest error. Naturally, the algorithm aims to descend to the lowest point possible.
Most modern AI models, especially those that utilize neural networks, function by iteratively adjusting the model parameters. At each iteration, the algorithm nudges the parameters in the direction that moves it downhill.
Features with large ranges, like math scored out of 100, create big hills in this landscape, while features with small ranges, like history scored out of 30, create small ones. All the features are equally important, but the algorithm doesn't know this, since it only sees the raw numbers. As a result, it may focus more on the big hills, neglecting the small hills, which can lead to biased and inaccurate predictions.
This is why data normalization is important. It levels the hills, ensuring none is seemingly more important than the other. In this way, the model can take balanced and accurate steps toward the lowest point possible, otherwise known as convergence.
In short, data normalization methods ensure AI models don't favor any one feature just because it has a bigger range. With your data on a standard scale, the algorithm weighs each feature equally, helping it reach convergence faster and more accurately.
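To make this concrete, here is a minimal NumPy sketch using toy, randomly generated scores (an assumption for illustration, not our table above). Two features matter equally by construction, yet before normalization the wide-range feature dominates the gradient step:
import numpy as np
# Toy data: two equally important features with very different ranges,
# like math graded out of 100 and history graded out of 30.
rng = np.random.default_rng(0)
n = 100
math_scores = rng.uniform(0, 100, n)
history_scores = rng.uniform(0, 30, n)
X = np.column_stack([math_scores, history_scores])
y = 0.5 * math_scores / 100 + 0.5 * history_scores / 30  # equal true importance
w = np.zeros(2)                   # model weights at the start of training
grad = 2 * X.T @ (X @ w - y) / n  # MSE gradient at this starting point
print(np.abs(grad))               # the wide-range column is several times larger
# After Min-Max scaling, both columns contribute comparably.
X_scaled = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))
print(np.abs(2 * X_scaled.T @ (X_scaled @ w - y) / n))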
Understanding Data Normalization in Python
We're now at the crux of our discussion, where we'll show you how to normalize data in Python. But before we get to that, why the Python programming language? There are many languages; what makes this one suitable for normalizing data? Keep reading below to find out.
Robust Data Normalization Libraries
One of the primary reasons to use Python to normalize your data is its rich ecosystem of libraries. They include:
- Pandas
Use this library for its wide array of data analysis and structuring tools.
- Scikit-learn
This is a comprehensive library that includes prebuilt functions for normalization, such as MinMaxScaler and StandardScaler.
- NumPy
This library supports working with arrays and matrices, making it useful for custom normalization tasks.
Easy-to-Use Syntax
Python, as a programming language, has a fluid syntax that makes it easy to read and understand code. With strong community support, code maintenance, and debugging are also relatively straightforward, with many free resources available online.
Integration With Machine Learning Frameworks
Python seamlessly integrates with popular machine learning frameworks such as TensorFlow and PyTorch. This naturally makes it the language of choice for normalizing input data for machine learning models.
Now that you understand the versatility of this programming language, you may be wondering how to normalize data in Python. Don’t worry—we’ve got you covered. Keep reading to find out more:
How to Normalize Data in Python
Below is a rundown of the different techniques for normalizing data in Python:
Min-Max Scaling
Remember the formula we used to normalize our student score results? It's known as Min-Max Scaling because it takes the minimum and maximum values of a given dataset and uses them to compute the normalized value. Going back to our previous example:
- Math: 80 out of 100
- English: 35 out of 50
- Science: 50 out of 80
- History: 20 out of 30
Let us normalize this data using Min-Max Scaling in Python. Run the following code in your editor of choice with all the prerequisites installed:
from sklearn.preprocessing import MinMaxScaler
import pandas as pd
# Original data
data = {'Subject': ['Math', 'English', 'Science', 'History'],
        'Score': [80, 35, 50, 20],
        'Max_Score': [100, 50, 80, 30]}
# Convert to DataFrame
df = pd.DataFrame(data)
# Calculate the percentage score
df['Percentage'] = df['Score'] / df['Max_Score']
# Initialize the MinMaxScaler
scaler = MinMaxScaler()
# Fit and transform the percentage scores
df['Normalized'] = scaler.fit_transform(df[['Percentage']])
# Display the normalized data
print(df[['Subject', 'Normalized']])
This is the response you'll get:
   Subject  Normalized
0     Math    1.000000
1  English    0.428571
2  Science    0.000000
3  History    0.238095
Notice that the results match the normalized values we calculated manually. This process is much faster and can save you a lot of time when working with large datasets.
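One practical note: in a real pipeline you fit the scaler on your training data once, then reuse it so new observations land on the same scale. A quick sketch, using hypothetical new percentage scores:
# Hypothetical new percentage scores (not from the example above)
new_scores = pd.DataFrame({'Percentage': [0.72, 0.64]})
# Reuses the min (0.625) and max (0.80) learned by the fit above
print(scaler.transform(new_scores))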
Z-Score Scaling
While the min-max method uses the minimum and maximum values in a given dataset for normalization, the Z-Score scaling method centers the data around the mean and scales it by the standard deviation, ensuring that each feature contributes equally to the model. Here is the formula:
Z = (X - μ) / σ
Where:
- Z is the Z-score (standardized value).
- X is the original value.
- μ (mu) is the mean of the feature.
- σ (sigma) is the standard deviation of the feature.
Instead of spending time calculating manually, especially for large datasets, you can use Python to normalize data using Z-score scaling. Applying our previous example (again standardizing the percentage scores so the subjects stay comparable), here is the code you should use:
from sklearn.preprocessing import StandardScaler
import pandas as pd
# Original data
data = {'Subject': ['Math', 'English', 'Science', 'History'],
        'Score': [80, 35, 50, 20],
        'Max_Score': [100, 50, 80, 30]}
# Convert to DataFrame
df = pd.DataFrame(data)
# Calculate the percentage score
df['Percentage'] = df['Score'] / df['Max_Score']
# Initialize the StandardScaler
scaler = StandardScaler()
# Fit and transform the percentage scores
df['Z-Score'] = scaler.fit_transform(df[['Percentage']])
# Display the standardized data
print(df[['Subject', 'Z-Score']])
This is the response you should get:
   Subject   Z-Score
0     Math  1.579003
1  English  0.032225
2  Science -1.127859
3  History -0.483368
Now, if you're wondering why these results differ from the Min-Max values, it's because we used a different approach: the Z-score measures how many standard deviations a value lies from the mean rather than squeezing it into a fixed range.
For example, Science, which had the lowest percentage, sits about 1.13 standard deviations below the mean, while Math, the highest-scoring subject, sits about 1.58 standard deviations above it. As you can see, the ranking of the subjects remains consistent across the two techniques.
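If you want to double-check these numbers, note that StandardScaler uses the population standard deviation (ddof=0), not the sample deviation that pandas defaults to. Reusing the df from the code above:
# Reproduce the Z-scores by hand
mean = df['Percentage'].mean()
std = df['Percentage'].std(ddof=0)  # population standard deviation, like StandardScaler
print((df['Percentage'] - mean) / std)  # matches the Z-Score column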
MaxAbs Scaling
This data normalization technique is useful for datasets containing both positive and negative values. It rescales the data to the range -1 to 1 by dividing each value by the feature's maximum absolute value, using this formula:
X_scaled = X / max(|X|)
Where:
- X_scaled is the scaled value of the feature.
- X is the original value of the feature.
- max(|X|) is the maximum absolute value of the feature.
We're going to use a different example to show you how this method works. Consider a dataset with a single feature containing the values 10, -20, 15, and -5.
We'll use the following Python code to get the normalized values using the MaxAbsScaler:
from sklearn.preprocessing import MaxAbsScaler
import pandas as pd
# Original data
data = {'Feature': [10, -20, 15, -5]}
df = pd.DataFrame(data)
# Initialize the MaxAbsScaler
scaler = MaxAbsScaler()
# Fit and transform the data
df['Scaled'] = scaler.fit_transform(df[['Feature']])
# Display the scaled data
print(df[['Feature', 'Scaled']])
This is the result you should get:
   Feature  Scaled
0       10    0.50
1      -20   -1.00
2       15    0.75
3       -5   -0.25
All the values now fall within the standard -1 to 1 range, which is much easier for an algorithm to interpret.
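You can verify this by hand: the maximum absolute value in the feature is 20, so each entry is simply divided by 20. And because MaxAbs scaling divides without shifting the data, zeros stay zeros, which makes it a good fit for sparse datasets.
# Manual check: divide each value by the maximum absolute value, 20
print([x / 20 for x in [10, -20, 15, -5]])  # [0.5, -1.0, 0.75, -0.25]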
Decimal Scaling
This technique is suitable when you want to bring values of varying magnitudes below 1. It normalizes data by shifting the decimal point, with the number of places determined by the feature's maximum absolute value. Here is the formula:
X_scaled = X / 10^j
Where:
- X_scaled is the scaled value of the feature.
- X is the original value of the feature.
- 10^j is 10 raised to the power of j, where j is the smallest integer such that every scaled absolute value is less than 1.
Let's normalize the following dataset using decimal scaling in Python:
- Values: 0.345, -1.789, 2.456, -0.678
Here is the code:
import math
import pandas as pd
# Original data with decimal points
data = {'Feature': [0.345, -1.789, 2.456, -0.678]}
df = pd.DataFrame(data)
# Find the maximum absolute value in the dataset
max_abs_value = df['Feature'].abs().max()
# Determine the scaling factor: the smallest power of 10
# greater than the maximum absolute value
scaling_factor = 10 ** math.ceil(math.log10(max_abs_value))
# Apply Decimal Scaling
df['Scaled'] = df['Feature'] / scaling_factor
# Display the original and scaled data
print(df)
This is the result:
   Feature  Scaled
0    0.345  0.0345
1   -1.789 -0.1789
2    2.456  0.2456
3   -0.678 -0.0678
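In this case, j = 1: the largest absolute value in the data is 2.456, and 10^1 = 10 is the smallest power of ten greater than it, so every value is divided by 10, shifting the decimal point one place to the left.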
How to Normalize Text Data in Python
As you perform data analysis and prepare data for your machine learning model, you may also have to normalize text. Unlike numerical normalization, text normalization typically involves lowercasing, removing punctuation, and tokenization: splitting text into units called tokens, which can be words, phrases, or symbols. Say we wanted to tokenize this sentence:
"Tokenization splits text into words."
Use the following code to normalize this text data:
import nltk
from nltk.tokenize import word_tokenize
# Download the necessary NLTK resource
nltk.download('punkt')
# Sample text
text = "Tokenization splits text into words."
# Tokenize the text
tokens = word_tokenize(text)
# Display the tokens
print(tokens)
The tokenized text should be as follows:
['Tokenization', 'splits', 'text', 'into', 'words', '.']
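Tokenization is only one step. A fuller text normalization pass typically lowercases each token and strips punctuation as well; here is a minimal sketch building on the tokens above:
import string
# Lowercase each token and drop tokens that are pure punctuation
normalized_tokens = [t.lower() for t in tokens if t not in string.punctuation]
print(normalized_tokens)  # ['tokenization', 'splits', 'text', 'into', 'words']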
Conclusion
We have come to the end of our discussion on data normalization methods. This process is essential for preparing input data for machine learning models. While you can manually normalize data, it's impractical for large datasets.
Python is preferred because it has comprehensive libraries, a clean, fluid syntax, and an open-source ecosystem. The methods we've discussed include Min-Max, Z-Score, MaxAbs, and Decimal Scaling. Now you can train your machine learning models with greater accuracy!
FAQ
What is the difference between scaling and normalization?
While both are common data preprocessing techniques, there's a subtle difference between them. Normalization attempts to adjust the range of values by fitting them into a specific range, whereas scaling attempts to adjust a range of feature values without necessarily fitting them into a specific range.
What is the difference between standardization and normalization?
Standardization transforms data to have a mean of zero and a standard deviation of one without binding it to a specific range. Normalization rescales data to fit within a specific range, say [0,1] or [-1,1].
What is the difference between normalized data and denormalized data?
Normalized data has been transformed to fit within a specific range as part of machine learning data preprocessing. Denormalized data, on the other hand, has been reverted to its original form.