Overview
The primary goal of this assignment is to familiarize you with the key stages of a data science project, including formulating and answering questions about data, visualizing insights, and building predictive models capable of making new predictions. You will apply the knowledge gained throughout the module. The dataset is provided in CSV format.
Learning Objectives
After this homework, students will be able to:
- Work with basic Python data structures such as dict, tuple, and list.
- Use Pandas as the primary tool to process structured data in Python with CSV files. Handle edge cases appropriately, and use appropriate methods to address missing data.
- Use PyPlot to make simple plots to investigate a specific phenomenon. Read plotting library documentation and use example plotting code to figure out how to create more complex Seaborn plots.
- Train a machine learning model and use it to make a prediction about the future using the scikit-learn library.
- Use PySpark to explore efficient approaches to handling big data.
Problem Statement
In this coursework, you will create a classification model that, given a Covid-19 patient’s current symptoms, status, and medical history, predicts whether the patient is at high risk. The dataset contains a large amount of anonymized patient-level information, including pre-existing conditions. The raw dataset consists of 21 unique features and 1,048,576 unique patients. We have applied some simple data-cleansing techniques to reduce the data to 200,031 unique patients and 21 unique features.
Objective
Using the given dataset, the goal is to determine whether a patient is at high risk and will be admitted to the ICU. You will use appropriate performance metrics to evaluate your model. The data is not clean, and you will have to apply appropriate methods to clean it. Additionally, using unsupervised clustering, you will implement a cluster-based classification model that may improve the performance of the model. The (partially) processed dataset is available to download from Blackboard.
Tasks
Your first task is to prepare the data and carry out data cleansing, bearing in mind the question you would like to answer, for example: which factor is most important in predicting whether a patient will be admitted to the ICU?
Part 1: Building up a basic predictive model
Load the dataset patients.csv into a pandas DataFrame and carry out the following tasks.
Data Cleaning and Transformation
If you take a closer look at the dataset, you will see that it contains many inconsistencies. While there are no explicit ‘null’ values, some binary attributes contain entries such as ‘?’, which represent missing values; these need to be handled appropriately. For the first task, adopt an aggressive approach to address these issues. Below is a list of steps you should consider. This list is not exhaustive, so feel free to explore additional techniques that demonstrate your understanding of exploratory data analysis (EDA).
- Check dataset shape.
- Remove irrelevant columns. Clearly justify any deletions in your report.
- Identify and handle missing values. Some missing values are represented as strings like ‘?’. Some columns may contain values that fall outside their expected range. Identify all the missing values and convert them to NaN.
- Summarize missing values before and after handling them.
- Verify and adjust data types as needed for consistency.
- Drop rows containing null values.
- Analyse numerical features. Display summary statistics and identify potential outliers.
- Remove outliers if necessary.
- Normalize features where applicable.
- Check the final dataset shape after preprocessing.
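The steps above can be sketched in pandas as follows. This is a minimal illustration on a tiny hypothetical frame standing in for patients.csv; the column names (`AGE`, `DIABETES`, `ICU`), the ‘?’ placeholder, and the 0–110 age range are assumptions you should adapt to the real data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for patients.csv; real columns will differ.
df = pd.DataFrame({
    "AGE":      [34, 67, "?", 45, 120],
    "DIABETES": [1,  2,  "?", 1,  2],
    "ICU":      [2,  1,  2,   "?", 2],
})

print("Shape before cleaning:", df.shape)

# Placeholder strings such as '?' represent missing values: convert them to NaN.
df = df.replace("?", np.nan)
print("Missing values per column:\n", df.isna().sum())

# Aggressive approach for Part 1: drop any row containing a null value.
df = df.dropna()

# Coerce columns back to numeric types for consistency.
df = df.apply(pd.to_numeric)

# Simple out-of-range rule (assumption): keep only plausible ages.
df = df[df["AGE"].between(0, 110)]

print("Shape after cleaning:", df.shape)
```

On the real dataset you would also summarize missing values before and after, inspect `df.describe()` for outliers, and normalize features where applicable.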
Data Visualisation
Consider the resulting DataFrame. This first, aggressive cleaning should yield a smaller dataset, which you can now use to explore relationships between the various features.
- Plot the distribution of unique classes of the target variable.
- Plot the count of ICU cases against age.
- Plot a graph that displays the count of the target variable against ‘CLASIFFICATION_FINAL’.
- Show the scatter-matrix plot and the correlation matrix. Can you identify pairs of highly correlated features?
- Generate additional plots that demonstrate your understanding of the problem and the data. You are free to select the plots and features for visualisation. For better visualisation and understanding of the data, consider using the seaborn library.
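Two of the plots above can be sketched as follows. The toy frame and the encoding assumption (ICU == 1 marks admission) are illustrative only; substitute your cleaned DataFrame and the dataset's actual coding.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the script runs headless
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Toy stand-in for the cleaned frame; 'ICU' and 'AGE' are assumed column names.
df = pd.DataFrame({
    "ICU": [1, 2, 2, 2, 1, 2, 2, 2],
    "AGE": [30, 45, 50, 30, 70, 45, 30, 60],
})

# Distribution of the target variable's classes.
sns.countplot(x="ICU", data=df)
plt.title("ICU class distribution")
plt.savefig("icu_distribution.png")
plt.close()

# Count of ICU cases against age (assuming ICU == 1 marks admission).
icu_by_age = df[df["ICU"] == 1].groupby("AGE").size()
icu_by_age.plot(kind="bar", title="ICU cases by age")
plt.savefig("icu_by_age.png")
plt.close()

print(icu_by_age)
```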
Model Building
Consider the resulting DataFrame:
- Select the predictors that would have an impact in predicting ICU admission.
- Build a first linear model with appropriate predictors and evaluate it. Split the data into training and test sets, and evaluate your model using a cross-validation procedure.
- Use different performance metrics to evaluate your model. You might have noticed that the data is imbalanced: positive examples make up less than 8% of the total dataset. Choose performance metrics that remain informative under this class imbalance.
- Balance your data using a data-balancing technique. Train your model again and evaluate its performance. Did you achieve better predictions with the balanced data?
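A minimal sketch of this pipeline, on synthetic data standing in for the cleaned patient features (the feature matrix, the ~8% positive rate, and the choice of F1 and class reweighting are assumptions; other metrics and balancing techniques such as resampling are equally valid):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import cross_val_score, train_test_split

rng = np.random.default_rng(0)

# Synthetic imbalanced data: roughly 8-10% positive examples.
X = rng.normal(size=(1000, 5))
y = (X[:, 0] + rng.normal(scale=0.5, size=1000) > 1.5).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

# Baseline linear model, scored with 5-fold cross-validated F1,
# which is more informative than accuracy on imbalanced data.
model = LogisticRegression()
cv_f1 = cross_val_score(model, X_train, y_train, cv=5, scoring="f1")
print("CV F1:", cv_f1.mean())

# One simple balancing strategy: reweight classes inversely to frequency.
balanced = LogisticRegression(class_weight="balanced").fit(X_train, y_train)
print(classification_report(y_test, balanced.predict(X_test)))
```

Note that with heavy imbalance a model predicting "not high risk" for everyone already scores over 90% accuracy, which is why precision, recall, and F1 are the metrics to compare before and after balancing.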
Part 2: Improved model
This is an open-ended task, allowing you to apply your problem-solving skills to develop a high-performance model. Your goal is to explore various approaches and build an effective solution. For full credit, you must use PySpark to demonstrate your understanding of handling big data in a distributed environment.
- Consider the entire dataset again. Develop an improved classification model that predicts the patient’s risk. Aim for higher performance while using as many data points as possible; this implies treating missing values differently, for example through imputation rather than dropping them. Validate your model and compare its performance with that of the model you built previously.
- Use the K-Means algorithm to cluster your cleansed dataset and compare the obtained clusters with the distribution found in the data. Justify your clustering and visualise your clusters as appropriate.
- Build local classifiers based on your clustering and discuss how this cluster-based classification compares to the improved model obtained above.
- As in Part 1, balance the data, then train and test your model with the balanced data.
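The cluster-based classification idea can be sketched as below: cluster the data, fit one "local" classifier per cluster, and route each new point to its cluster's model. The synthetic data, the choice of k=2, and the use of logistic regression as the local model are all assumptions; justify your own k (e.g. via silhouette scores) on the real data.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)

# Synthetic two-cluster data standing in for the cleansed patient features.
X = np.vstack([rng.normal(0, 1, (300, 4)), rng.normal(4, 1, (300, 4))])
y = np.concatenate([(X[:300, 0] > 0).astype(int),
                    (X[300:, 1] > 4).astype(int)])

# Step 1: cluster the data (k=2 is an assumption here).
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
labels = km.labels_

# Step 2: train one local classifier per cluster.
local_models = {c: LogisticRegression().fit(X[labels == c], y[labels == c])
                for c in np.unique(labels)}

# Step 3: route each point to its cluster's classifier for prediction.
def predict(X_new):
    clusters = km.predict(X_new)
    return np.array([local_models[c].predict(x.reshape(1, -1))[0]
                     for c, x in zip(clusters, X_new)])

acc = (predict(X) == y).mean()
print("training accuracy:", acc)
```

Compare this composite model against the single global classifier from Part 2 using the same imbalance-aware metrics as in Part 1.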