The process of converting data from its initial, raw state into another format.

Importing data

  1. Import the data:
    import numpy as np
    import pandas as pd
    df = pd.read_csv(filename, names=headers)
  2. Look for placeholder symbols such as ? and replace them with NaN, which is how pandas represents a missing value:
    df.replace("?", np.nan, inplace=True)
  3. To find missing values, count True (missing) and False (present) in each column:
    missing_data = df.isnull()
    for column in missing_data.columns:
        print(missing_data[column].value_counts())
  4. Once the missing values are NaN, it is not always a good idea to drop every row or column that contains one. Sometimes we choose to replace them with a suitable value, for example the column mean:
    avg_1 = df["normalized-losses"].astype(float).mean(axis=0)
    df["normalized-losses"] = df["normalized-losses"].replace(np.nan, avg_1)
  5. Drop the rows where the target column is NaN, since a row without a target value is of no use for prediction:
    df.dropna(subset=["Target-Column-Name"], axis=0, inplace=True)
    (axis=0 means that rows are dropped.)
  6. Reset the index: dropping rows leaves gaps in the index labels, so renumber them:
    df.reset_index(drop=True, inplace=True)
    You can then check the index again with df.index.
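
The six steps above can be sketched end to end as follows. This is a minimal illustration, not the original author's script: the inline CSV text, the three column names, and the choice of "price" as the target column are all assumptions made for the example.

```python
import io

import numpy as np
import pandas as pd

# Hypothetical stand-in for a CSV file that has no header row.
raw_csv = io.StringIO(
    "164,audi,13950\n"
    "?,bmw,16430\n"
    "158,audi,?\n"
    "192,bmw,20970\n"
)
headers = ["normalized-losses", "make", "price"]  # assumed column names

# Step 1: import the data.
df = pd.read_csv(raw_csv, names=headers)

# Step 2: turn the "?" placeholder into NaN.
df = df.replace("?", np.nan)

# Step 3: count the missing values in each column.
missing_counts = df.isnull().sum()
print(missing_counts)

# Step 4: fill the feature column with its mean.
losses = df["normalized-losses"].astype(float)
df["normalized-losses"] = losses.fillna(losses.mean())

# Step 5: drop rows whose target ("price", an assumption) is missing.
df = df.dropna(subset=["price"])

# Step 6: renumber the rows after the drop.
df = df.reset_index(drop=True)
print(df)
```

If the placeholder is known in advance, passing na_values="?" to pd.read_csv marks it as missing at load time and makes step 2 unnecessary.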

Correct Data Format

  1. We can always check the data type of each column of a data frame with df.dtypes.
    Sometimes a data type needs to be changed. To change the data type of a column, use astype and assign the result back:
    df["Column-name"] = df["Column-name"].astype(float)
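
As a small illustration of the type check and conversion, the sketch below builds a data frame whose numeric columns were read as text and converts both in one call. The column names "price" and "bore" and their values are made up for the example.

```python
import pandas as pd

# Hypothetical columns that arrived as text instead of numbers.
df = pd.DataFrame({
    "price": ["13950", "16430", "20970"],
    "bore": ["3.47", "2.68", "3.19"],
})
print(df.dtypes)  # inspect the types before converting

# astype accepts a mapping of column name -> target type.
df = df.astype({"price": "int64", "bore": "float64"})
print(df.dtypes)  # inspect the types after converting
```

Passing a dictionary converts several columns at once and leaves the other columns untouched.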

Data Standardization

Transforming data into a common format so that meaningful comparisons can be made.

  1. Standardization means keeping the data in the same measuring system, such as km/h, so make sure all values of a variable are expressed in the same unit.
    This can be achieved with simple arithmetic on the column, for example by multiplying by a unit-conversion factor.
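
One way this might look in practice is a unit conversion applied to a whole column. The column names and speed values below are hypothetical; the only fixed fact is that one mile is 1.609344 km.

```python
import pandas as pd

MILES_TO_KM = 1.609344  # one international mile in kilometres

# Hypothetical column recorded in miles per hour.
df = pd.DataFrame({"top-speed-mph": [60.0, 100.0, 125.0]})

# Convert to km/h so every speed uses the same unit.
df["top-speed-kmh"] = df["top-speed-mph"] * MILES_TO_KM

# Drop the old column so only one unit remains in the data.
df = df.drop(columns=["top-speed-mph"])
print(df)
```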

Data Normalization

Transforming the values of several variables into a common range. For example, if one column (age) has values from 1 to 100 and another column (salary) has values from 100 to 100000, then salary can influence our model more strongly simply because its numbers are larger, so we need to take care of it.

  1. There are several ways to normalize the data. One possible way is to replace each original value by (original value)/(maximum value):
    # replace (original value) by (original value)/(maximum value)
    df['length'] = df['length']/df['length'].max()
    df['width'] = df['width']/df['width'].max()
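
For comparison, the sketch below applies the division-by-maximum approach next to two other common choices, min-max scaling and z-score scaling. The latter two are not described in the text above and are shown only as alternatives; the sample values and the new column names are invented for the example.

```python
import pandas as pd

# Hypothetical sample values.
df = pd.DataFrame({"length": [150.0, 175.0, 200.0]})
col = df["length"]

# Simple feature scaling: divide by the maximum value.
df["length-simple"] = col / col.max()

# Min-max scaling: map the smallest value to 0 and the largest to 1.
df["length-minmax"] = (col - col.min()) / (col.max() - col.min())

# Z-score scaling: subtract the mean, divide by the sample standard deviation.
df["length-zscore"] = (col - col.mean()) / col.std()

print(df)
```

Division by the maximum and min-max scaling keep the result within a fixed range for positive data, while z-scores are centred on 0 and are not bounded.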


Binning

Binning is the process of transforming continuous numerical variables into discrete categorical 'bins' for grouped analysis.

  1. Decide how many bins you want, say 3 (low, medium, high). You then need 4 cut points, including the start and the end, so each bin is one third of the range wide:
    binwidth = (max(df["Column-name"]) - min(df["Column-name"]))/3
  2. Build an array to store these 4 equally spaced points. np.linspace includes both the minimum and the maximum, so the largest values are not left outside the last bin:
    bins = np.linspace(min(df["Column-name"]), max(df["Column-name"]), 4)
  3. Set the group names:
    group_names = ['Low', 'Medium', 'High']
  4. Now divide the entire column values into these 3 groups:
    df['New-binned-Column-name'] = pd.cut(df['Column-name'], bins, labels=group_names, include_lowest=True)
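
A runnable sketch of the binning steps on a small, made-up horsepower column. The values and the name of the new column are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical horsepower values.
df = pd.DataFrame({"horsepower": [48, 70, 95, 110, 150, 200, 262]})

# 3 bins need 4 equally spaced edges, from the minimum to the maximum.
bins = np.linspace(df["horsepower"].min(), df["horsepower"].max(), 4)
group_names = ["Low", "Medium", "High"]

# include_lowest=True keeps the minimum value inside the first bin.
df["horsepower-binned"] = pd.cut(
    df["horsepower"], bins, labels=group_names, include_lowest=True
)

# Count how many rows fall into each bin.
counts = df["horsepower-binned"].value_counts()
print(df)
print(counts)
```

Because the last edge equals the column maximum, every row receives a label and none is left as NaN.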

Bin Visualization

A histogram is a quick way to check how the values are distributed across the bins.

  1. Import the package:
    import matplotlib.pyplot as plt
  2. Draw the visual:
    # draw a histogram of the column with bins = 3
    plt.hist(df["Column-name"], bins=3)
    # set x/y labels and plot title
    plt.xlabel("Column-name")
    plt.ylabel("count")
    plt.title("Column-name bins")
    plt.show()

Converting categorical values into numerical values (Indicator Variables)

  1. This lets us use a categorical variable in regression analysis, which needs its inputs in numerical form.
  2. To do so, we use the pandas function get_dummies:
    dummy_variable_1 = pd.get_dummies(df["Column-name-with-categorical-data"])
  3. One column is created per category. For example, if the column gender has two categories, male and female, then 2 columns are created: male and female. Pass prefix="gender" to get_dummies to name them gender_male and gender_female instead.
  4. We can always rename these columns if we want to:
    dummy_variable_1.rename(columns={'old-name1': 'new-name1', 'old-name2': 'new-name2'}, inplace=True)
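
The indicator-variable steps could be sketched like this. The "gender" and "age" columns are made-up sample data, and attaching the new columns with pd.concat before dropping the original is one possible follow-up rather than a step taken from the text.

```python
import pandas as pd

# Hypothetical data with one categorical column.
df = pd.DataFrame({
    "gender": ["male", "female", "female"],
    "age": [31, 45, 27],
})

# One 0/1 column per category; prefix gives gender_female and gender_male.
dummies = pd.get_dummies(df["gender"], prefix="gender", dtype=int)

# Attach the indicator columns and drop the original text column.
df = pd.concat([df, dummies], axis=1).drop(columns=["gender"])
print(df)
```

Without dtype=int, recent pandas versions return True/False columns, which may or may not be what a given regression routine expects.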

Saving the cleaned data

  1. Once the cleaning is done, save the data frame to a location of your choice, for example as a CSV file with df.to_csv.
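
The original does not show the saving code, so the following is only one plausible sketch. It assumes CSV output via DataFrame.to_csv; the file name clean_df.csv and the use of the system temporary directory are arbitrary choices for the example.

```python
import os
import tempfile

import pandas as pd

# Hypothetical cleaned data.
df = pd.DataFrame({"make": ["audi", "bmw"], "price": [13950.0, 16430.0]})

# Assumed output location: a file in the system temp directory.
output_path = os.path.join(tempfile.gettempdir(), "clean_df.csv")

# index=False keeps the row numbers out of the file.
df.to_csv(output_path, index=False)

# Read the file back to confirm that it round-trips.
reloaded = pd.read_csv(output_path)
print(reloaded)
```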
