8 Essential Python Libraries for Mastering Data Quality Checks

Pandas, NumPy, Scikit-learn, and more. These free and open source Python libraries will revolutionize the way you approach data quality.

Anoop Gopalam

October 25, 2023

Data Quality: A Necessity In Today’s Digital World

Data quality is the backbone of data-driven decision-making. Poor data quality costs businesses real money and leads to bad decisions: IBM estimated the annual cost to the U.S. economy at $3.1 trillion in 2016.

Python, a leading language for data science, offers a wide range of libraries to help you improve data quality. In this blog, we will explore eight of these libraries and how they help maintain data accuracy, consistency, and reliability, mitigating the costly risks of poor-quality data and transforming your data into a valuable asset.

1. Pandas

Pandas is a foundational library for data manipulation and analysis. It provides functions to quickly detect and handle missing values, duplicates, and outliers.

Examples:

Removing missing values

import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, 2, np.nan], 'B': [4, np.nan, 6]})
cleaned_df = df.dropna()  # drop every row that contains a missing value

Finding and removing duplicates

df = pd.DataFrame({'A': [1, 2, 1], 'B': [4, 5, 4]})
duplicates = df[df.duplicated()]
df_unique = df.drop_duplicates()

Filling missing values

df = pd.DataFrame({'A': [1, 2, np.nan], 'B': [4, np.nan, 6]})
df_filled = df.ffill()  # forward-fill: propagate the last valid value downwards
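
Pandas can also flag outliers directly. The snippet below is a minimal sketch of a simple interquartile-range (IQR) filter on column 'A'; the sample data and the 1.5 multiplier are illustrative conventions, not part of the pandas API.

df = pd.DataFrame({'A': [1, 2, 3, 100], 'B': [4, 5, 6, 7]})
q1, q3 = df['A'].quantile([0.25, 0.75])
iqr = q3 - q1
# Keep rows whose 'A' value falls outside 1.5 * IQR of the middle 50%
outliers = df[(df['A'] < q1 - 1.5 * iqr) | (df['A'] > q3 + 1.5 * iqr)]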

2. NumPy

NumPy is essential for numerical operations. It can be used to identify inconsistencies in data types or compute statistical measures to understand data distributions.

Examples:

Identifying NaN values

import numpy as np
array = np.array([1, 2, np.nan, 4])
nan_indices = np.where(np.isnan(array))

Calculating mean excluding NaN values

mean_val = np.nanmean(array)

Replacing NaN values

array_no_nan = np.nan_to_num(array, nan=-1)
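
NumPy's summary statistics and dtype checks can also surface suspicious distributions or mixed types, as mentioned above. A minimal sketch (the sample array is illustrative):

values = np.array([1.0, 2.0, 2.5, 4.0, 100.0])
print(np.mean(values), np.std(values))         # a large standard deviation can hint at outliers
print(np.percentile(values, [25, 50, 75]))     # quartiles describe the distribution
print(np.issubdtype(values.dtype, np.number))  # confirm the array holds numeric data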

3. Scikit-learn

While primarily a machine learning library, scikit-learn provides utilities for data preprocessing. These can be used to scale, normalize, and encode data, ensuring consistency and quality.

Examples:

Feature scaling

from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
scaled_data = scaler.fit_transform(df)

One-hot encoding categorical variables

from sklearn.preprocessing import OneHotEncoder
encoder = OneHotEncoder()
encoded_data = encoder.fit_transform(df[['A']])

Imputing missing values

from sklearn.impute import SimpleImputer
imputer = SimpleImputer(strategy='mean')
imputed_data = imputer.fit_transform(df)
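
In practice these preprocessing steps are usually chained so they are applied consistently to every batch of data. The sketch below assumes 'A' is a numeric column and 'B' a categorical one; that column split is an illustrative assumption, not part of the earlier examples.

from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder

# Impute then scale numeric columns; one-hot encode categorical columns
preprocess = ColumnTransformer([
    ('num', Pipeline([('impute', SimpleImputer(strategy='mean')),
                      ('scale', MinMaxScaler())]), ['A']),
    ('cat', OneHotEncoder(handle_unknown='ignore'), ['B']),
])
clean_features = preprocess.fit_transform(df)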

4. data-quality-check

This library offers a suite of tools specifically designed for data quality checks. It can validate data types, check for missing values, and even compare datasets.

Examples:

Checking for missing values

missing_values_report = data_quality_check.check_missing_values(df)

Validating data types

data_type_report = data_quality_check.validate_data_types(df, expected_data_types)

Comparing two datasets

comparison_report = data_quality_check.compare_datasets(df1, df2)

5. ydata_quality

ydata_quality is a comprehensive tool for data quality verification. It assesses datasets for common issues like duplicates, missing values, and outliers, providing a holistic quality score.

Examples:

Checking for duplicates

from ydata_quality import DataQuality
dq = DataQuality(df)
duplicate_report = dq.duplicates()

Checking for missing values

missing_values_report = dq.missing_values()

Running a full quality evaluation

results = dq.evaluate()
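
Once an evaluation has run, the issues it found can be pulled out as prioritized warnings. A minimal sketch, reusing the dq object created above:

warnings = dq.get_warnings()
for warning in warnings:
    print(warning)  # each warning carries a priority, category, and description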

6. Great Expectations

Great Expectations allows users to set clear expectations for their data. By defining these expectations, users can ensure that incoming data meets the required standards.

Examples:

Validating column names and order

import great_expectations as ge
ge_df = ge.from_pandas(df)  # wrap an existing pandas DataFrame for validation

result = ge_df.expect_table_columns_to_match_ordered_list(['A', 'B'])

Checking that a column's mean is within an expected range

result = ge_df.expect_column_mean_to_be_between('A', min_value=1, max_value=3)

Checking that column values are unique

result = ge_df.expect_column_values_to_be_unique('A')
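
Completeness checks follow the same pattern. The sketch below reuses the ge_df wrapper from above to expect no nulls in column 'A' and then runs every registered expectation in one pass:

# Expect no missing values in column 'A'
result = ge_df.expect_column_values_to_not_be_null('A')

# Validate all expectations recorded on this dataframe and report overall success
validation_results = ge_df.validate()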

7. deepchecks

Deepchecks is designed to validate data and models before deployment. It checks for distribution shifts, data leakage, and other common pitfalls.

Examples:

Checking for label leakage

from deepchecks.tabular import Dataset, Suite
from deepchecks.tabular.checks import FeatureLabelCorrelation

# A feature that almost perfectly predicts the label is a common sign of leakage
suite = Suite('leakage checks', FeatureLabelCorrelation())
dataset = Dataset(df, label='target_column')
results = suite.run(dataset)

Checking for train-test contamination

from deepchecks.tabular.checks import TrainTestSamplesMix

# Flags test samples that also appear in the training set
results = TrainTestSamplesMix().run(train_dataset, test_dataset)

Running the built-in train-test validation suite

from deepchecks.tabular.suites import train_test_validation

# Bundles drift, leakage, and sample-mix checks for the two splits
results = train_test_validation().run(train_dataset, test_dataset, model)
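
Deepchecks also ships a ready-made single-dataset integrity suite covering duplicates, mixed data types, nulls, and more. A minimal sketch, reusing the dataset object defined above:

from deepchecks.tabular.suites import data_integrity

integrity_results = data_integrity().run(dataset)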

8. pandera

Pandera provides a flexible and expressive API for data validation. It integrates seamlessly with pandas, allowing for schema-based dataframe validation.

Examples:

Validating a DataFrame against a schema

import pandera as pa
schema = pa.DataFrameSchema({
    'A': pa.Column(pa.Int, nullable=True),
    'B': pa.Column(pa.Float, nullable=True)
})
validated_df = schema(df)

Checking that column values are unique

schema = pa.DataFrameSchema({
    'A': pa.Column(pa.Int, unique=True)  # reject duplicate values in column 'A'
})
validated_df = schema(df)

Checking that values fall within an expected range

schema = pa.DataFrameSchema({
    'A': pa.Column(pa.Int, checks=pa.Check.in_range(min_value=1, max_value=10))
})
validated_df = schema(df)
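
By default pandera raises on the first failure; passing lazy=True collects every violation so the full set of problems can be reported in one pass. A minimal sketch, reusing the schema from the last example:

try:
    schema.validate(df, lazy=True)
except pa.errors.SchemaErrors as err:
    # One row per failing check, with the offending value and check name
    print(err.failure_cases)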

Understanding the shortcomings of Python libraries for comprehensive data quality

In concluding our deep dive into the various Python libraries for data quality checks, it is imperative to recognize that despite their extensive capabilities, these libraries alone may not suffice for comprehensive data quality assurance. Each library, including the versatile Pandas, comes with its own set of challenges and limitations. Handling large volumes of data can be demanding on system resources, and the necessity for continuous maintenance of code-based validations can be a significant overhead. The lack of native integration capabilities with a variety of data sources further complicates the data quality assurance process. Additionally, these libraries do not inherently provide functionalities for automating and scheduling data quality checks, often requiring manual intervention and additional tooling to achieve a seamless data quality workflow.

These challenges highlight the need for a more holistic and integrated approach to data quality management, ensuring not just the accuracy and consistency of data, but also the efficiency and scalability of the data quality assurance process itself.

Telmai as an alternative approach

Enter Telmai, a modern solution that redefines the approach to data quality management. Telmai offers a high-performance, intuitive low-code/no-code interface, automating data quality checks and ensuring seamless integration with a variety of data sources. It accomplishes this without imposing a burden on your databases, guaranteeing a consistent, timely, and scalable approach to data quality control. By choosing Telmai, you are not just enhancing your data quality checks; you are also freeing up valuable time and resources, allowing you to concentrate on deriving meaningful insights from your data. Discover the capabilities of Telmai’s platform today, and take a significant step towards superior data quality management.

Passionate about data quality? Get expert insights and guides delivered straight to your inbox – click here to subscribe to our newsletter now.
