What I’m about to share is less about FunctionTransformer and more about a standardised template of sorts that you can use for exploratory data analysis. I’ll start with an explanation of the dataset being used, then move on to the problem statement and the approach I’m taking to clean the dataset. The full solution for this dataset is outside the scope of this article, so I may split the material into a small series on EDA.
The dataset at hand is a traditional banking set on loan applications. The shape query gives us 45211 rows and 19 columns. The problem statement is essentially a risk assessment for clients: we need to predict the likelihood of a default, using the target variable of the same name. But first we need to clean the dataset. This usually entails handling outliers, checking whether the features are roughly Gaussian, imputing missing and null values, and fixing rows and columns, which includes combining data frames.
A quick glance at the output of head() shows the three continuous variables that I’ll be starting with. AMT_CREDIT is the credit amount of the loan, AMT_INCOME_TOTAL is the client’s income, and AMT_ANNUITY is the loan annuity amount.
I usually begin by collecting all the continuous features in a single variable and glancing at the broad statistical description. It’s amply clear that these three features have huge outliers. We can plot them to be certain.
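The step above can be sketched roughly as follows. This is a minimal illustration, not the original notebook code: the DataFrame here is synthetic stand-in data (an assumption, since the real loan file is not reproduced in this article), and the name `continuous` is just a hypothetical choice for the list of columns.

```python
import numpy as np
import pandas as pd

# Synthetic, right-skewed stand-in for the loan data (assumption: in practice
# df would be the real dataset loaded from file).
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "AMT_INCOME_TOTAL": rng.lognormal(mean=12.0, sigma=0.5, size=1000),
    "AMT_CREDIT": rng.lognormal(mean=13.0, sigma=0.7, size=1000),
    "AMT_ANNUITY": rng.lognormal(mean=10.0, sigma=0.5, size=1000),
})

# One variable holding the names of all continuous features.
continuous = ["AMT_INCOME_TOTAL", "AMT_CREDIT", "AMT_ANNUITY"]

# Broad statistical description: count, mean, std, min, quartiles, max.
summary = df[continuous].describe()
print(summary)

# A max that sits far above the 75th percentile hints at heavy upper-tail outliers.
tail_ratio = summary.loc["max"] / summary.loc["75%"]
print(tail_ratio)
```

Comparing the max with the 75th percentile is only a quick heuristic; a box plot of the same columns gives the visual confirmation.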
Clearly these features have a nasty outlier problem that can’t be swept away. In cases like these we can take a closer look at the feature description and simply ask ourselves whether capping would hurt the analysis. In this case, the answer is no, it won’t. We can safely compute upper and lower boundaries from the interquartile range (IQR) and cap the values at those boundaries.
def find_boun(df, variable, distance):
    IQR = df[variable].quantile(0.75) - df[variable].quantile(0.25)
    lower_boundary = df[variable].quantile(0.25) - (IQR * distance)
    upper_boundary = df[variable].quantile(0.75) + (IQR * distance)
    return upper_boundary, lower_boundary
Running this returns the upper and lower boundaries shown here, which we can then use to cap our features.
(AMT_INCOME_TOTAL     337500.0
 AMT_CREDIT          1616625.0
 AMT_ANNUITY           61704.0
 dtype: float64,
 AMT_INCOME_TOTAL     -22500.0
 AMT_CREDIT          -537975.0
 AMT_ANNUITY          -10584.0
 dtype: float64)
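To show how boundaries like these can be used for capping, here is a minimal sketch. It runs on synthetic stand-in data (an assumption), uses the conventional IQR multiplier of 1.5 for `distance`, and caps with pandas’ `clip`; the names `continuous` and `capped` are hypothetical, and this is one reasonable way to cap rather than necessarily the exact code used on the real dataset.

```python
import numpy as np
import pandas as pd

def find_boun(df, variable, distance):
    # Same logic as the function in the article: IQR-based boundaries.
    q1 = df[variable].quantile(0.25)
    q3 = df[variable].quantile(0.75)
    iqr = q3 - q1
    return q3 + iqr * distance, q1 - iqr * distance

# Synthetic stand-in for the loan data (assumption).
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "AMT_INCOME_TOTAL": rng.lognormal(mean=12.0, sigma=0.5, size=1000),
    "AMT_CREDIT": rng.lognormal(mean=13.0, sigma=0.7, size=1000),
    "AMT_ANNUITY": rng.lognormal(mean=10.0, sigma=0.5, size=1000),
})
continuous = ["AMT_INCOME_TOTAL", "AMT_CREDIT", "AMT_ANNUITY"]

# Passing a list of columns returns one boundary per column (two Series).
upper, lower = find_boun(df, continuous, 1.5)

# Cap on a copy: values above/below the boundaries are replaced by the boundary.
# axis=1 aligns each boundary Series with the matching column name.
capped = df.copy()
capped[continuous] = df[continuous].clip(lower=lower, upper=upper, axis=1)

print(capped[continuous].describe())
```

When a lower boundary comes out negative for an amount that cannot be negative, as in the output above, the lower cap simply never triggers and only the upper cap changes the data.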
Although it’s not strictly required that these variables be normally distributed, we can check for normality with a simple Q-Q plot and a histogram. Here the Q-Q plot shows the ordered values bending distinctly away from the straight line of theoretical quantiles, which indicates skew. We can reduce this with a simple log transform applied through FunctionTransformer. To do that we make a copy of the dataset, instantiate the transformer with the log function, and apply it to the variables at hand. Here my transformer is named ‘negatron’. Note that FunctionTransformer is stateless: fitting it learns nothing from the data, it simply applies the log to every data point, and it does so in a flash. The log is only defined for strictly positive values, so this works for amounts greater than zero.
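The transformation described above can be sketched like this. It is an illustration on synthetic stand-in data (an assumption), with `data_tf` as a hypothetical name for the copy; instead of drawing the Q-Q plot it uses `scipy.stats.probplot` only to obtain the correlation of the Q-Q fit, where a value closer to 1 means the points lie closer to the straight line.

```python
import numpy as np
import pandas as pd
from scipy import stats
from sklearn.preprocessing import FunctionTransformer

# Synthetic, strictly positive, right-skewed stand-in data (assumption).
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "AMT_INCOME_TOTAL": rng.lognormal(mean=12.0, sigma=0.5, size=1000),
    "AMT_CREDIT": rng.lognormal(mean=13.0, sigma=0.7, size=1000),
    "AMT_ANNUITY": rng.lognormal(mean=10.0, sigma=0.5, size=1000),
})
continuous = ["AMT_INCOME_TOTAL", "AMT_CREDIT", "AMT_ANNUITY"]

# Work on a copy so the original values stay available.
data_tf = df.copy()

# FunctionTransformer wraps a plain function; fit() is a no-op for np.log.
negatron = FunctionTransformer(np.log, inverse_func=np.exp)
data_tf[continuous] = negatron.fit_transform(df[continuous])

# Q-Q fit correlation before and after the transform, for one feature.
_, (_, _, r_before) = stats.probplot(df["AMT_CREDIT"], dist="norm")
_, (_, _, r_after) = stats.probplot(data_tf["AMT_CREDIT"], dist="norm")
print(r_before, r_after)

# Skewness per feature should move towards zero after the log.
print(df[continuous].skew())
print(data_tf[continuous].skew())
```

Passing `inverse_func=np.exp` is optional; it only makes `negatron.inverse_transform` available if the values need to be mapped back to the original scale later.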
The result is a distribution that is much closer to normal.
In the next article I will cover the imputation of missing values and feature scaling as part of the EDA process.