Part A - 30 marks
DOMAIN:
Medical
CONTEXT:
Medical research university X is conducting in-depth research on patients with certain conditions. The university has an internal AI team.
For confidentiality, the client has masked the patients’ details and conditions and has provided different datasets to the AI team, which is to develop an AI/ML model that predicts a patient’s condition from the received test results.
DATA DESCRIPTION:
The data consists of biomechanical features of the patients, labelled according to their current condition. Each patient is represented in the dataset by six biomechanical attributes derived from the shape and orientation of the condition in relation to the body part.
PROJECT OBJECTIVE:
To demonstrate the ability to fetch, process and leverage data to generate useful predictions by training Supervised Learning algorithms.
STEPS AND TASK [30 Marks]:
1. Data Understanding: [5 Marks]
   - Read all 3 CSV files as DataFrames and store them in 3 separate variables. [1 Mark]
   - Print the shape and columns of all 3 DataFrames. [1 Mark]
   - Compare the column names of all 3 DataFrames and clearly write observations. [1 Mark]
   - Print the data types of all 3 DataFrames. [1 Mark]
   - Observe and share the variation in the ‘Class’ feature across all 3 DataFrames. [1 Mark]
2. Data Preparation and Exploration: [5 Marks]
   - Unify all the variations in the ‘Class’ feature for all 3 DataFrames. [1 Mark]
     For example: ‘tp_s’, ‘Type_S’ and ‘type_s’ should all be converted to ‘type_s’.
   - Combine all 3 DataFrames to form a single DataFrame. [1 Mark]
     Checkpoint: expected output shape = (310, 7)
   - Print 5 random samples of this DataFrame. [1 Mark]
   - Print the feature-wise percentage of null values. [1 Mark]
   - Check the 5-point summary of the new DataFrame. [1 Mark]
3. Data Analysis: [10 Marks]
   - Visualize a heatmap to understand the correlation between all features. [2 Marks]
   - Share insights on correlation. [2 Marks]
     - Features having stronger correlation, with the correlation value.
     - Features having weaker correlation, with the correlation value.
   - Visualize a pairplot with the 3 classes distinguished by colors and share insights. [2 Marks]
   - Visualize a jointplot for ‘P_incidence’ and ‘S_slope’ and share insights. [2 Marks]
   - Visualize a boxplot to check the distribution of the features and share insights. [2 Marks]
4. Model Building: [6 Marks]
   - Split the data into X and Y. [1 Mark]
   - Split the data into train and test sets in an 80:20 proportion. [1 Mark]
   - Train a Supervised Learning Classification base model using a KNN classifier. [2 Marks]
   - Print all the possible performance metrics for both train and test data. [2 Marks]
5. Performance Improvement: [4 Marks]
   - Experiment with various parameters to improve the performance of the base model. [2 Marks]
     (Optional: Experiment with various Hyperparameters – Research required)
   - Clearly showcase the improvement in performance achieved. [1 Mark]
     For example:
     - Accuracy: +15% improvement
     - Precision: +10% improvement
   - Clearly state which parameters contributed most to improving model performance. [1 Mark]
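The data-preparation and model-building steps of Part A can be sketched as follows. This is a minimal illustration under stated assumptions, not a reference solution: the three CSV files are replaced by small synthetic DataFrames (all feature values are made up, and only two of the six features are used), the helper names make_frame and unify_class are hypothetical, and the ‘Normal’/‘normal’ labels are assumed spellings of a second class used only so the classifier has something to separate. The stratified split and the parameter grid are likewise one possible choice.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)


def make_frame(groups, n=30):
    """Synthetic stand-in for one CSV file; all feature values are made up."""
    parts = [
        pd.DataFrame({
            "P_incidence": rng.normal(center, 8.0, n),
            "S_slope": rng.normal(0.6 * center, 6.0, n),
            "Class": label,
        })
        for label, center in groups
    ]
    return pd.concat(parts, ignore_index=True)


# In the project these would be loaded with pd.read_csv(...) instead.
# 'Normal'/'normal' are assumed spellings of a second class (illustration only).
df1 = make_frame([("tp_s", 75), ("Normal", 50)])
df2 = make_frame([("Type_S", 75), ("normal", 50)])
df3 = make_frame([("type_s", 75), ("Normal", 50)])


def unify_class(df):
    """Return a copy whose 'Class' labels share one canonical spelling."""
    out = df.copy()
    labels = out["Class"].str.strip().str.lower()
    out["Class"] = labels.replace({"tp_s": "type_s"})
    return out


# Combine the three cleaned frames into a single DataFrame.
combined = pd.concat([unify_class(d) for d in (df1, df2, df3)], ignore_index=True)
print(combined.shape)
print(combined["Class"].value_counts())
print(combined.isnull().mean() * 100)  # feature-wise percentage of nulls
print(combined.describe())             # includes the 5-point summary

# Split into X and y, then 80:20 train/test (stratification is an added choice).
X = combined.drop(columns="Class")
y = combined["Class"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=42, stratify=y
)

# Base KNN model, evaluated on both train and test data.
base = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)
scores = {}
for split, Xs, ys in [("train", X_train, y_train), ("test", X_test, y_test)]:
    pred = base.predict(Xs)
    scores[split] = accuracy_score(ys, pred)
    print(split)
    print(classification_report(ys, pred))

# One way to experiment with parameters: a small cross-validated grid search.
param_grid = {"n_neighbors": [3, 5, 7, 9, 11], "weights": ["uniform", "distance"]}
search = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5, scoring="accuracy")
search.fit(X_train, y_train)
tuned_test_accuracy = accuracy_score(y_test, search.predict(X_test))
print(search.best_params_, scores["test"], tuned_test_accuracy)
```

On the real data, the same pattern applies with the three loaded DataFrames in place of df1, df2 and df3; any further spelling variants found in ‘Class’ would be added to the replacement mapping, and the combined shape can be compared against the (310, 7) checkpoint.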
Part B - 30 marks
DOMAIN:
Banking, Marketing
CONTEXT:
Bank X is undergoing a massive digital transformation across all its departments. The bank has a growing customer base, the majority of whom are liability customers (depositors) rather than borrowers (asset customers). The bank is interested in rapidly expanding its borrower base to bring in more business via loan interest. A campaign that the bank ran in the last quarter showed an average single-digit conversion rate. With digital transformation as the core strength of the business strategy, the marketing department wants to devise effective campaigns with better target marketing to raise the conversion rate to double digits on the same budget as the last campaign.
DATA DICTIONARY:
- ID: Customer ID.
- Age: Customer’s approximate age.
- CustomerSince: Customer of the bank since. [unit is masked]
- HighestSpend: Customer’s highest spend so far in one transaction. [unit is masked]
- ZipCode: Customer’s zip code.
- HiddenScore: A score associated with the customer, which is masked by the bank as an IP.
- MonthlyAverageSpend: Customer’s monthly average spend so far. [unit is masked]
- Level: A level associated with the customer, which is masked by the bank as an IP.
- Mortgage: Customer’s mortgage. [unit is masked]
- Security: Customer’s security asset with the bank. [unit is masked]
- FixedDepositAccount: Customer’s fixed deposit account with the bank. [unit is masked]
- InternetBanking: Whether the customer uses internet banking.
- CreditCard: Whether the customer uses the bank’s credit card.
- LoanOnCard: Whether the customer has a loan on the credit card.
PROJECT OBJECTIVE:
Using the historical dataset, build a Machine Learning model that supports focused marketing by predicting which potential customers will convert.
STEPS AND TASK [30 Marks]:
1. Data Understanding and Preparation: [5 Marks]
   - Read both datasets, ‘Data1’ and ‘Data 2’, as DataFrames and store them in two separate variables. [1 Mark]
   - Print the shape, column names and data types of both DataFrames. [1 Mark]
   - Merge both DataFrames on the ‘ID’ feature to form a single DataFrame. [2 Marks]
   - Change the data type of the features below to ‘Object’: [1 Mark]
     ‘CreditCard’, ‘InternetBanking’, ‘FixedDepositAccount’, ‘Security’, ‘Level’, ‘HiddenScore’.
     [Reason for this operation: the values in these features are binary, i.e. 1/0, but the data type is ‘int’/‘float’, which is not expected.]
2. Data Exploration and Analysis: [5 Marks]
   - Visualize the distribution of the target variable ‘LoanOnCard’ and clearly share insights. [2 Marks]
   - Check the percentage of missing values and impute if required. [1 Mark]
   - Check for unexpected values in each categorical variable and impute with the most suitable value. [2 Marks]
     [Unexpected values: if all values in a feature are 0/1, then ‘?’, ‘a’ and 1.5 are unexpected values that need treatment.]
3. Data Preparation and Model Building: [10 Marks]
   - Split the data into X and Y. [1 Mark]
     [Recommended to drop ID and ZipCode. LoanOnCard is the target variable.]
   - Split the data into train and test sets. Keep 25% of the data reserved for testing. [1 Mark]
   - Train a Supervised Learning Classification base model – Logistic Regression. [2 Marks]
   - Print evaluation metrics for the model and clearly share insights. [1 Mark]
   - Balance the data using the right balancing technique. [2 Marks]
     - Check the distribution of the target variable.
     - Say the output is class A: 20% and class B: 80%.
     - Here you need to balance the target variable to 50:50.
     - Try an appropriate method to achieve this.
   - Train the same model again on the balanced data. [1 Mark]
   - Print evaluation metrics and clearly share the differences observed. [2 Marks]
4. Performance Improvement: [10 Marks]
   - Train a base model each for SVM and KNN. [4 Marks]
   - Tune parameters for each of the models wherever required and finalize a model. [3 Marks]
     (Optional: Experiment with various Hyperparameters – Research required)
   - Print evaluation metrics for the final model. [1 Mark]
   - Share the improvement achieved from the base model to the final model. [2 Marks]
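The merge, unexpected-value treatment, 75:25 split and balancing steps of Part B can be sketched as follows. This is a hedged illustration, not a reference solution: ‘Data1’ and ‘Data 2’ are replaced by synthetic DataFrames, the assignment of columns to each file is an assumption, all values (including the roughly 20% positive rate) are made up, and only a few of the dictionary columns are included. Random oversampling of the minority class with sklearn.utils.resample is used here as one possible balancing technique; the brief does not prescribe a specific method.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.utils import resample

rng = np.random.default_rng(1)
n = 400
ids = np.arange(1, n + 1)

# Synthetic stand-ins for 'Data1' and 'Data 2'. Which columns live in which
# file is an assumption, and every value below is made up.
spend = rng.normal(50, 15, n)
data1 = pd.DataFrame({"ID": ids, "ZipCode": 90000 + ids, "HighestSpend": spend})
data2 = pd.DataFrame({
    "ID": ids,
    "CreditCard": rng.integers(0, 2, n).astype(object),
    # Higher spenders convert more often; roughly 20% positives.
    "LoanOnCard": (spend + rng.normal(0, 10, n) > 65).astype(int),
})
data2.loc[3, "CreditCard"] = "?"  # inject one unexpected value

# Merge both DataFrames on 'ID'.
df = data1.merge(data2, on="ID", how="inner")

# Treat unexpected categorical values: anything other than 0/1 becomes NaN
# and is then imputed with the most frequent valid value.
cc = pd.to_numeric(df["CreditCard"], errors="coerce")
cc = cc.where(cc.isin([0, 1]))
df["CreditCard"] = cc.fillna(cc.mode()[0]).astype(int)

# Split into X and y (dropping ID and ZipCode), then 75:25 train/test.
X = df.drop(columns=["ID", "ZipCode", "LoanOnCard"])
y = df["LoanOnCard"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)

# Base model on the imbalanced training data.
base = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# Random oversampling of the minority class, on the training split only,
# so that the test set keeps its original distribution.
train = X_train.assign(LoanOnCard=y_train)
minority_label = y_train.value_counts().idxmin()
minority = train[train["LoanOnCard"] == minority_label]
majority = train[train["LoanOnCard"] != minority_label]
minority_up = resample(
    minority, replace=True, n_samples=len(majority), random_state=42
)
balanced = pd.concat([majority, minority_up])
Xb = balanced.drop(columns="LoanOnCard")
yb = balanced["LoanOnCard"]
print(yb.value_counts(normalize=True))  # 50:50 after oversampling

# Same model, retrained on the balanced data.
rebalanced = LogisticRegression(max_iter=1000).fit(Xb, yb)

for name, model in [("base", base), ("balanced", rebalanced)]:
    print(name)
    print(classification_report(y_test, model.predict(X_test)))
```

Balancing only the training split keeps the test metrics comparable between the two models. The same train/evaluate loop could be reused for the SVM and KNN base models in the Performance Improvement step.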