ICT233 Load all CSV files containing transacted flats in a given `data` directory and merge them all into a single Pandas DataFrame: Data Programming Assignment, SUSS, Singapore

Question 1 (46 marks)
Objectives:
● Understand datasets with a data scientist mindset.
● Understand and design computation logic and routines in Python.
● Assess the use of pure Python and Python data structures to perform extract, load, and transform operations.
● Assess the use of the Pandas DataFrame to perform extract, load, transform, and calculation operations.
● Structure code in appropriate methods (functions), looping and conditions.
● Conduct visualization in an appropriate way.

The dataset in question provides a rich overview of Housing and Development Board (HDB) flat transactions in Singapore and is derived from the national database managed by Singapore’s open data initiative.

The data captured includes vital information such as the resale price, flat type, address, lease commencement date, and floor area, among other details. These elements allow for robust analysis on a multitude of aspects such as price trends and geographical price disparities. You may refer to more information at `https://data.gov.sg/dataset/resale-flat-prices`.

Additionally, this dataset provides an invaluable resource for understanding the evolution of Singapore’s public housing landscape, the preferences of the populace, and market dynamics over time. As such, it is an essential tool for policy makers, real estate professionals, urban planners, and researchers studying Singapore’s unique public housing model.

By addressing the given tasks, you will gain data analysis competencies, including data preprocessing and manipulation, which are fundamental for preparing and managing datasets.

You will also enhance your ability to comprehend data relationships through the practice of creating data visualizations and executing correlation analysis.

(a) Load all CSV files containing transacted flats in a given `data` directory and merge them all into a single Pandas DataFrame. Drop the `remaining_lease` column from the merged DataFrame. Are there any columns that contain null values or empty strings?
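
The loading and checking steps in (a) could be sketched as follows. This is a minimal illustration, not a model answer: the function names `load_transactions` and `missing_value_report` are hypothetical, and the sketch assumes every `*.csv` file in the directory is a transaction file.

```python
from pathlib import Path

import pandas as pd


def load_transactions(data_dir="data"):
    """Read every CSV in data_dir and stack the rows into one DataFrame."""
    csv_paths = sorted(Path(data_dir).glob("*.csv"))
    frames = [pd.read_csv(path) for path in csv_paths]
    merged = pd.concat(frames, ignore_index=True)
    # errors="ignore" keeps this safe if no file carries the column.
    return merged.drop(columns=["remaining_lease"], errors="ignore")


def missing_value_report(df):
    """Count nulls and empty (or whitespace-only) strings per column."""
    nulls = df.isna().sum()
    empties = df.apply(lambda col: col.astype(str).str.strip().eq("").sum())
    return pd.DataFrame({"nulls": nulls, "empty_strings": empties})
```

Note that `pd.read_csv` already turns completely empty fields into `NaN` by default, so the `empty_strings` count mainly catches whitespace-only values.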

(b) Convert the `month` column to date-time format. Design a visualization to analyse the `month` column by considering it as a numeric date-time and share insights.
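
One possible starting point for (b) is shown below. It assumes the `month` values are strings such as `2017-01` (year and month only), which the brief does not state; the helper names are hypothetical.

```python
import pandas as pd


def add_month_datetime(df):
    """Return a copy of df with `month` parsed as a datetime column."""
    out = df.copy()
    # Assumption: values look like "2017-01"; adjust `format` otherwise.
    out["month"] = pd.to_datetime(out["month"], format="%Y-%m")
    return out


def transactions_per_year(df):
    """Count transactions per calendar year, ordered by year."""
    return df["month"].dt.year.value_counts().sort_index()
```

The resulting counts can be passed to a plotting call of your choice (for example a bar or line chart) to look at how transaction volume changes over time.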

(c) The column `storey_range` is in the format “lower TO upper” (e.g. 1 TO 3). Compute a new column called `storey_level` by calculating the average of the lower and upper storey values. Drop the `storey_range` column from the DataFrame.
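
The computation in (c) can be sketched with a regular expression that pulls out the two bounds. The function name `add_storey_level` is hypothetical, and the pattern assumes the bounds are plain integers separated by the word `TO`.

```python
import pandas as pd


def add_storey_level(df):
    """Replace `storey_range` ("lower TO upper") with the mean of its bounds."""
    out = df.copy()
    # Two capture groups give a two-column frame: lower bound and upper bound.
    bounds = out["storey_range"].str.extract(r"(\d+)\s+TO\s+(\d+)").astype(float)
    out["storey_level"] = bounds.mean(axis=1)
    return out.drop(columns=["storey_range"])
```

For the example in the task, `1 TO 3` gives a `storey_level` of 2.0.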

(d) Identify inconsistent `flat_model` and `flat_type` values and perform the standardization of the values.
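
A minimal sketch of one standardization strategy for (d) follows. It assumes the inconsistencies are differences in letter case, surrounding whitespace, and hyphen versus space; the actual inconsistencies should be confirmed first, for example with `df["flat_type"].unique()`. The function name is hypothetical.

```python
import pandas as pd


def standardise_labels(df, columns=("flat_model", "flat_type")):
    """Normalise case, whitespace and hyphens in the given text columns."""
    out = df.copy()
    for col in columns:
        out[col] = (
            out[col]
            .str.strip()
            .str.upper()
            # Treat runs of spaces and hyphens as a single space.
            .str.replace(r"[\s\-]+", " ", regex=True)
        )
    return out
```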

(e) Perform the following visualizations:
(i). Plot a histogram of the `resale_price` to understand its distribution. Is it normally distributed or skewed?
(ii). Generate a boxplot for the `floor_area_sqm` column. Are there any values that lie outside the expected range? If outliers are present, please provide an explanation for their occurrence.
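
The two questions in (e) can also be checked numerically alongside the plots. The sketch below computes sample skewness and flags values outside the 1.5 × IQR fences, which is the rule a default boxplot uses for its whiskers. The helper name `iqr_outliers` and the choice of `k=1.5` are illustrative assumptions.

```python
import pandas as pd


def iqr_outliers(series, k=1.5):
    """Return the values lying beyond k * IQR from the quartiles."""
    q1, q3 = series.quantile([0.25, 0.75])
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return series[(series < lower) | (series > upper)]


def is_right_skewed(series):
    """True when the sample skewness is positive (long right tail)."""
    return series.skew() > 0
```

For example, `iqr_outliers(df["floor_area_sqm"])` lists the floor areas a boxplot would draw as individual points.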

(f) Design and identify FIVE (5) factors that influence the resale price and offer a rationale for each of these correlations.
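
A possible first step for (f) is to rank numeric columns by their Pearson correlation with `resale_price`. This is only a sketch: the function name is hypothetical, the candidate columns are passed in by the caller, and correlation alone does not supply the rationale the task asks for.

```python
import pandas as pd


def price_correlations(df, candidates, target="resale_price"):
    """Pearson correlation of each candidate column with the target,
    ordered from strongest to weakest absolute correlation."""
    corr = df[list(candidates) + [target]].corr()[target].drop(target)
    return corr.sort_values(key=abs, ascending=False)
```

For instance, `price_correlations(df, ["floor_area_sqm", "storey_level"])` would compare two of the columns mentioned in this question.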

Question 2 (60 marks)
Objectives:
● Understand dataset with data scientist mindset
● Design computation logic and routines in Python
● Conduct visualization in an appropriate way
● Assess the design and use of database ORM / SQLite methods to perform extract, load, transformation and calculation operations

The Mass Rapid Transit (MRT) exits dataset, a spatial dataset obtained from Singapore’s open data portal, provides exit coordinates and associated metadata. It is instrumental in geography-based analysis such as the calculation of distance metrics. Harnessing this data source facilitates a deeper understanding of the impact of public transportation infrastructure on various urban phenomena, such as residential property resale prices.

(a) Use the `geopandas` and `contextily` libraries to visualize MRT exits based on the contents of the GeoJSON file named `mrt-exits.geojson`.

(b) Perform the following tasks:
● Extract the longitude and latitude values from the `geometry` field and create two new columns in the GeoPandas DataFrame.
● Use `KMeans` (https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html) clustering from the `sklearn` library to identify `5` clusters of these MRT exits based on their geographical coordinates.
● Create a plot visualizing these clusters with different colors and add the map of Singapore as the background using `geopandas` and `contextily`.
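
The coordinate extraction and clustering steps in (b) could look roughly like the sketch below. To keep the example small it reads the coordinates straight from the GeoJSON file with the standard `json` module rather than through GeoPandas; with a GeoDataFrame of point geometries the same values are available as `gdf.geometry.x` and `gdf.geometry.y`. It assumes every feature is a `Point`, and the function names and the `n_init` and `seed` values are illustrative choices.

```python
import json

import numpy as np
from sklearn.cluster import KMeans


def exit_coordinates(geojson_path):
    """Return an (n, 2) array of [longitude, latitude] pairs."""
    with open(geojson_path, encoding="utf-8") as fh:
        features = json.load(fh)["features"]
    # GeoJSON stores Point coordinates as [longitude, latitude, (elevation)].
    return np.array(
        [feature["geometry"]["coordinates"][:2] for feature in features],
        dtype=float,
    )


def cluster_exits(coords, n_clusters=5, seed=0):
    """Fit KMeans and return (label per exit, cluster centres)."""
    model = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed)
    labels = model.fit_predict(coords)
    return labels, model.cluster_centers_
```

Fixing `random_state` makes the cluster numbering reproducible between runs, which matters when the clusters are later mapped to named regions.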

(c) Perform the following tasks:
● Map each cluster of MRT exits to one of the five main regions of Singapore: Central Region, East Region, North Region, North-East Region, and West Region.
● Update the GeoPandas DataFrame by adding a new column `region` representing the region to which each MRT exit belongs.

(d) Calculate the number of MRT exits for each region using three different methods:
1) Utilize the pandas DataFrame.
2) Leverage the sqlite3 library.
3) Employ the SQLAlchemy ORM approach: first define a Python class representing the MRT exits (`longitude`, `latitude`, `region`), then use this class to insert the data into a SQLite database and execute a query to get the number of exits for each region.
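
As an illustration of the second method, the sketch below uses only the standard `sqlite3` module with an in-memory database. The table name `mrt_exits` and the function name are assumptions made for this example; the input is any iterable of `(longitude, latitude, region)` tuples.

```python
import sqlite3


def count_exits_by_region(rows):
    """Load (longitude, latitude, region) rows into SQLite and count per region."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute(
            "CREATE TABLE mrt_exits (longitude REAL, latitude REAL, region TEXT)"
        )
        conn.executemany("INSERT INTO mrt_exits VALUES (?, ?, ?)", rows)
        cursor = conn.execute(
            "SELECT region, COUNT(*) FROM mrt_exits GROUP BY region ORDER BY region"
        )
        return dict(cursor.fetchall())
    finally:
        conn.close()
```

The pandas method should give the same counts through `df.groupby("region").size()`, which is a useful cross-check between the three approaches.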

(e) Perform the following tasks:
● Draw a random sample of 100 transacted flats from Question 1 with the random seed set to 0.
● Utilize the `geopy` library’s `Nominatim` or `GoogleV3` geocoder to obtain the longitude and latitude data for the 1000 transacted flats.
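
The sampling and geocoding steps in (e) might be organised as in the sketch below. Several things here are assumptions rather than part of the brief: the address columns are taken to be `block` and `street_name`, the `user_agent` string is a placeholder, and the helper names are hypothetical. The geocoder is passed in as a callable so the logic can be tried without network access.

```python
import pandas as pd


def sample_flats(df, n, seed=0):
    """Draw a reproducible random sample of n rows."""
    return df.sample(n=n, random_state=seed)


def make_nominatim_geocoder(user_agent="ict233-example"):
    """Build a rate-limited Nominatim geocode function (needs network access)."""
    from geopy.extra.rate_limiter import RateLimiter
    from geopy.geocoders import Nominatim

    geolocator = Nominatim(user_agent=user_agent)
    # Space out requests; public Nominatim servers expect low request rates.
    return RateLimiter(geolocator.geocode, min_delay_seconds=1)


def add_coordinates(df, geocode):
    """Add latitude/longitude columns using any geocode(query) callable."""
    out = df.copy()
    # Assumption: the address is held in `block` and `street_name` columns.
    queries = out["block"].astype(str) + " " + out["street_name"] + ", Singapore"
    locations = [geocode(query) for query in queries]
    out["latitude"] = [loc.latitude if loc is not None else None for loc in locations]
    out["longitude"] = [loc.longitude if loc is not None else None for loc in locations]
    return out
```

A typical call would be `add_coordinates(sample_flats(df, n), make_nominatim_geocoder())`, where `n` is the sample size required by the task; addresses the geocoder cannot resolve end up as missing values.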

(f) Perform the following tasks:
● Incorporate the data from the `data/town_to_region_mapping.json` file to introduce a new column named `region` into the DataFrame. (Note: Disregard the `region` column present in the `addresses.csv` file during this process.)
● Based on your visualizations and data analyses, articulate two key conclusions.
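
The mapping step in (f) could be sketched as below. The structure of the JSON file is not described in the brief, so this assumes a flat object of the form `{"TOWN NAME": "Region name", ...}` and a `town` column in the DataFrame; the function name is hypothetical.

```python
import json

import pandas as pd


def add_region(df, mapping_path="data/town_to_region_mapping.json"):
    """Add a `region` column by looking up each row's town in the JSON mapping."""
    with open(mapping_path, encoding="utf-8") as fh:
        # Assumption: the file holds a flat {town: region} object.
        town_to_region = json.load(fh)
    out = df.copy()
    out["region"] = out["town"].map(town_to_region)
    return out
```

Towns missing from the mapping come out as `NaN`, so `out["region"].isna().sum()` is a quick check that every town was matched.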

(g) Perform the following tasks:
● Formulate a scatter plot to depict the correlation between the resale prices of flats and their haversine (https://scikit-learn.org/stable/modules/generated/sklearn.metrics.pairwise.haversine_distances.html) distances to the Central Business District.
● Incorporate additional dimensions into your plot: the year of the transaction (specifically 2015, 2020, and 2023) and the region of the flat’s location.
● Use distinct color codes to denote different regions.
● Also, display the town of each transaction as individual data points on the plot.
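
The distance used in (g) can be illustrated with a plain-Python version of the haversine formula. This is a stand-in for the `sklearn` function linked above, which expects `[latitude, longitude]` pairs in radians and returns distances on a unit sphere that must be multiplied by the Earth's radius. The CBD coordinates below are an assumption (approximately Raffles Place) and the mean Earth radius of 6371 km is a conventional value; neither comes from the brief.

```python
import math

EARTH_RADIUS_KM = 6371.0
# Assumption: the CBD is represented by a point near Raffles Place.
CBD_LAT, CBD_LON = 1.2840, 103.8510


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (
        math.sin(dphi / 2) ** 2
        + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    )
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def distance_to_cbd_km(lat, lon):
    """Distance in kilometres from a flat to the assumed CBD point."""
    return haversine_km(lat, lon, CBD_LAT, CBD_LON)
```

Applying `distance_to_cbd_km` to each geocoded flat gives the x-axis values for the scatter plot, with `resale_price` on the y-axis.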
