Data Cleaning Python Cheat Sheet

Posted by admin on 1/29/2022

Overview


Pandas is an essential tool for working with data in Python: a fast, powerful, flexible, and easy-to-use open-source data analysis and manipulation library built on top of the Python programming language. For instance, converting a string column into a numerical column can be done with data['target'].apply(float), using the Python built-in function float. Removing duplicates is another common task in data cleaning; data.drop_duplicates() removes rows that have the exact same values. The tough thing about learning data science is remembering all the syntax, and while getting used to consulting the Python documentation is a habit worth building, sometimes it's nice to have a handy reference sheet.
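The two cleaning steps just mentioned can be sketched with a tiny made-up frame (column names and values are assumptions, not from any real dataset):

```python
import pandas as pd

# Toy frame: 'target' holds numbers stored as strings, and one row is
# an exact duplicate of another.
data = pd.DataFrame({"target": ["1.5", "2.0", "2.0"],
                     "label": ["a", "b", "b"]})

# Convert the string column to floats with the built-in float function
data["target"] = data["target"].apply(float)

# Drop rows whose values are exactly the same in every column
data = data.drop_duplicates()
```

After these two steps the frame has two rows and a numeric target column.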

Data Mining process is a sequence of the following steps:

  • Data Cleaning – removing noise and outliers
  • Data Integration – combine data from various sources
  • Data Selection – select relevant variables
  • Data Transformation – transform or consolidate data into forms appropriate for mining
  • Data Mining – apply methods to extract patterns
  • Pattern Evaluation – identify interesting patterns
  • Presentation – use visualization to present the knowledge

Types of Patterns

Data Mining tasks can be classified into two categories: Descriptive and Predictive

  1. Characterization and Discrimination
  2. Association and Correlation (frequent patterns)
  3. Classification and Regression for prediction
  4. Cluster Analysis
  5. Outlier Analysis

Interesting Patterns

Depending on the type of data mining task (as listed above), interesting patterns can be extracted based on some threshold. For instance, in association mining, measures such as ‘Support’ and ‘Confidence’ are used. In classification, measures such as ‘accuracy’, ‘precision’, and ‘recall’ are used. Subjective interestingness measures based on our knowledge of the data are also used.

1.1 Data Cleaning

Data may be incomplete, noisy and inconsistent. Data cleaning is required to deal with these issues.

  • Missing Values – One of the following solutions can be applied
    • Ignore the tuple
    • Fill in the missing value manually
    • Use a global constant to fill in the missing value (such as ‘unknown’)
    • Use a measure of central tendency (e.g. mean or median) to fill
    • Use attribute mean or median for all samples belonging to the same class as the given tuple
    • Use the most probable value (can be determined with regression or decision tree induction)
  • Noisy Data – Noise is a random error or variance in a variable. Outliers can represent noise. The goal is to smooth out the data to remove the noise. Some smoothing techniques are given below.
    • Binning – Sort the data and divide it into bins (equal frequency, bin means, bin medians)
    • Regression
    • Outlier Analysis – Identify outliers by way of clustering.
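The missing-value and binning ideas above can be sketched in pandas (the sample values are made up; fillna fills gaps with the mean, pd.qcut builds equal-frequency bins, and each value is then smoothed to its bin mean):

```python
import pandas as pd

scores = pd.Series([10.0, None, 30.0, 90.0, None, 50.0])

# Fill missing values with a measure of central tendency (the mean here)
filled = scores.fillna(scores.mean())

# Equal-frequency binning into 3 bins, then smoothing by bin means
bins = pd.qcut(filled, q=3, labels=False)
smoothed = filled.groupby(bins).transform("mean")
```

Each original value is replaced by the mean of its bin, which damps the noise while preserving the overall shape of the data.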

1.2 Data Integration

Combining data from multiple sources may be a necessary step in the data mining process. While integrating data from multiple sources, avoid redundancies and inconsistencies.

1.3 Data Selection/ Reduction

If the data set is huge, data reduction techniques such as dimensionality reduction, numerosity reduction, and data compression can be applied.

  • Dimensionality Reduction – process of reducing the number of random variables or attributes under consideration. The following techniques can be applied
    • Wavelet transforms: Linear signal processing technique to transform a data vector to another vector.
    • Principal Component Analysis: Searches for the dimensions that represent the data best.
    • Attribute subset selection: Removing irrelevant/ redundant attributes. Some techniques for attribute selection are – stepwise forward selection, stepwise backward selection, combination of forward and backward, decision tree induction (attributes that do not appear in the tree are considered to be irrelevant).
  • Numerosity Reduction – replace the original data by smaller forms of data representation.
    • Parametric techniques such as Regression and Log-Linear models are used to approximate the data and hence reduce it.
    • Non-parametric techniques include histograms, clustering, sampling, and data cube aggregation (e.g. storing total sales per quarter instead of per month).
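A minimal sketch of dimensionality reduction via PCA, written directly with NumPy's SVD rather than any particular library API (the toy data is an assumption; the third attribute is deliberately redundant):

```python
import numpy as np

# Toy data: 5 samples, 3 attributes, where the third column is redundant
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))
X[:, 2] = X[:, 0] + X[:, 1]

# PCA: centre the data, then project onto the top-k right singular vectors
Xc = X - X.mean(axis=0)
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)
k = 2
X_reduced = Xc @ Vt[:k].T   # 5 samples now described by 2 components
```

Because the third column is a linear combination of the first two, the two retained components reconstruct the centred data essentially without loss.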

1.4 Data Transformation and Discretization

Transform the data into forms appropriate for mining.

  1. Smoothing: To remove noise. Techniques such as Regression, Clustering and Binning can be applied.
  2. Attribute Construction: Add new attributes
  3. Aggregation: e.g. converting daily sales to monthly totals
  4. Normalization: Scale attributes so that they fall within a smaller range
  5. Discretization: raw values are replaced by intervals or labels.
  6. Concept Hierarchy: attributes such as street can be replaced by higher levels such as city or country.
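Normalization and discretization from the list above can be sketched in pandas (the sales figures and bin edges are made-up examples):

```python
import pandas as pd

sales = pd.DataFrame({"daily_sales": [200.0, 450.0, 300.0, 700.0]})

# Normalization: min-max scale values into the [0, 1] range
col = sales["daily_sales"]
sales["scaled"] = (col - col.min()) / (col.max() - col.min())

# Discretization: replace raw values with interval labels
sales["level"] = pd.cut(sales["daily_sales"],
                        bins=[0, 300, 500, 1000],
                        labels=["low", "mid", "high"])
```

The scaled column falls within [0, 1], and the level column is a categorical replacement for the raw numbers.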

1.5 Data Mining

Classification Techniques:

  • Decision Tree
  • Naive Bayes
  • Rule-Based Classification
  • Ensemble Methods
    • Bagging
    • Boosting
    • Random Forests
  • Neural Network
  • Support Vector Machines
  • K-Nearest-Neighbor

Clustering Techniques:

  1. K-Means
  2. Single Link (Min)
  3. Complete Link (Max)
  4. Group Average
  5. DBSCAN
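K-Means from the list above can be sketched in a few lines of NumPy (the 2-D points are toy assumptions; this is a bare-bones illustration, not a production implementation):

```python
import numpy as np

def kmeans(X, k, iters=20, seed=0):
    rng = np.random.default_rng(seed)
    # Initialise centers at k distinct random data points
    centers = X[rng.choice(len(X), k, replace=False)]
    for _ in range(iters):
        # Assign each point to its nearest center
        labels = np.argmin(((X[:, None] - centers) ** 2).sum(-1), axis=1)
        # Move each center to the mean of its assigned points
        centers = np.array([X[labels == j].mean(axis=0) for j in range(k)])
    return labels, centers

# Two well-separated blobs of two points each
X = np.array([[0.0, 0.0], [0.1, 0.2], [5.0, 5.0], [5.1, 4.9]])
labels, centers = kmeans(X, k=2)
```

On this toy data the algorithm recovers the two blobs: the first two points share one label and the last two share the other.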

1.6 Pattern Evaluation

Classification Evaluation Criteria:

  • Confusion Matrix – True Positive Rate, True Negative Rate, False Positive Rate, False Negative Rate
  • Precision
  • Recall
  • F1 Measure (combination of precision and recall)
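The relationship between the confusion-matrix counts and precision, recall, and F1 can be illustrated with toy numbers (the counts below are assumptions):

```python
# Confusion-matrix counts for a binary classifier (made-up numbers)
tp, fp, fn, tn = 40, 10, 20, 30

precision = tp / (tp + fp)   # of predicted positives, how many are right
recall = tp / (tp + fn)      # of actual positives, how many were found
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
```

F1 sits between precision and recall and punishes a large gap between the two.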

Association Evaluation Criteria:

  • Lift
  • Correlation Analysis
  • IS Measure

Clustering Evaluation Criteria:

  • Cohesion
  • Separation
  • Silhouette Coefficient

References: Introduction to Data Mining, Data Mining Concepts and Techniques

For working with data in Python, pandas is an essential tool. It is a fast, powerful, flexible, and easy-to-use open-source data analysis and manipulation library built on top of the Python programming language.

But even once you’ve learned pandas, it’s easy to forget the specific syntax for doing something. That’s why today I am giving you a cheat sheet to help you easily reference the most common pandas tasks.

It’s also a good idea to check the official pandas documentation from time to time, even if you can find what you need in the cheat sheet. Reading documentation is a skill every data professional needs, and the documentation goes into a lot more detail than we can fit in a single sheet anyway!

Importing Data:

Use these commands to import data from a variety of different sources and formats.
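A minimal example (the CSV content is made up, and an in-memory buffer stands in for a file path, which read_csv accepts in exactly the same way):

```python
import io
import pandas as pd

# Read a CSV into a DataFrame; a file path or URL works the same way
csv_text = "name,score\nann,1\nbob,2\n"
df = pd.read_csv(io.StringIO(csv_text))

# Other common readers: pd.read_excel, pd.read_json, pd.read_sql
```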

Exporting Data:

Use these commands to export a DataFrame to CSV, .xlsx, SQL, or JSON.
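A sketch of the export side (toy data; passing no path makes to_csv and to_json return strings, while passing a path writes a file):

```python
import pandas as pd

df = pd.DataFrame({"name": ["ann", "bob"], "score": [1, 2]})

# Export to CSV / JSON as strings (pass a path to write a file instead)
csv_out = df.to_csv(index=False)
json_out = df.to_json(orient="records")

# For Excel or SQL: df.to_excel("out.xlsx"), df.to_sql("table", conn)
```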

Viewing/Inspecting Data:

Use these commands to take a look at specific sections of your pandas DataFrame or Series.
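The most common inspection calls, on a made-up frame:

```python
import pandas as pd

df = pd.DataFrame({"a": range(10), "b": range(10, 20)})

first = df.head(3)        # first 3 rows
last = df.tail(2)         # last 2 rows
shape = df.shape          # (rows, columns)
summary = df.describe()   # count, mean, std, min, quartiles, max
# df.info() prints column dtypes and non-null counts
```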


Selection:


Use these commands to select a specific subset of your data.
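The main selection idioms, shown on a toy frame (column names are assumptions):

```python
import pandas as pd

df = pd.DataFrame({"name": ["ann", "bob", "cat"], "score": [1, 2, 3]})

one_col = df["score"]           # a single column as a Series
by_label = df.loc[0, "name"]    # row/column by label
by_position = df.iloc[2, 1]     # row/column by integer position
subset = df[df["score"] > 1]    # boolean-mask selection
```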

Data Cleaning:

Use these commands to perform a variety of data cleaning tasks.
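A sketch of a typical cleaning pass on a deliberately messy toy frame (a badly cased column name, a numeric column stored as strings, a missing value, and a duplicate row):

```python
import pandas as pd

df = pd.DataFrame({"Name": ["ann", "bob", None, "bob"],
                   "score": ["1", "2", "3", "2"]})

df = df.rename(columns={"Name": "name"})    # tidy a column name
df["score"] = df["score"].astype(int)       # fix the column type
df = df.drop_duplicates()                   # drop exact-duplicate rows
df["name"] = df["name"].fillna("unknown")   # fill missing values
```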

Filter, Sort, and Groupby:

Use these commands to filter, sort, and group your data.
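The three operations in one short sketch (the team/score data is made up):

```python
import pandas as pd

df = pd.DataFrame({"team": ["x", "y", "x", "y"],
                   "score": [4, 1, 2, 3]})

high = df[df["score"] >= 2]                         # filter rows
ordered = df.sort_values("score", ascending=False)  # sort by a column
totals = df.groupby("team")["score"].sum()          # aggregate per group
```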

Join/Combine:

Use these commands to combine multiple dataframes into a single one.
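Stacking rows with concat versus SQL-style joining with merge, on two toy frames:

```python
import pandas as pd

left = pd.DataFrame({"id": [1, 2], "name": ["ann", "bob"]})
right = pd.DataFrame({"id": [1, 2], "score": [10, 20]})

stacked = pd.concat([left, left], ignore_index=True)  # stack rows
merged = left.merge(right, on="id", how="inner")      # SQL-style join
```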

Statistics:


These commands compute common summary statistics. (They can be applied to a Series as well.)
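A few of the most common ones, on a toy Series:

```python
import pandas as pd

s = pd.Series([1.0, 2.0, 3.0, 4.0])

mean = s.mean()      # arithmetic mean
median = s.median()  # middle value
std = s.std()        # sample standard deviation
# df.corr() gives pairwise correlations between DataFrame columns
```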

I hope this cheat sheet will be useful to you, whether you are new to Python and learning it for data science or already a data professional. Happy programming!


You can also download the printable PDF file from here.

