The PDF should be submitted as the primary (first) resource.

The .ipynb file and the data set (in .csv format) should be zipped and submitted as the secondary resource.

Failure to comply with the instructions will result in a grade of 0 on the relevant portions of the assignment. Your instructor will grade your submission based on what you submitted. Failing to submit an assignment, or submitting an assignment for another class, will result in a grade of 0 without the opportunity to resubmit. Make sure that you submit your original work. Suspected plagiarism cases will be treated as possible academic misconduct and will be reported to the College Academic Integrity Committee for formal investigation. As part of this procedure, your instructor may require you to meet with them for an oral exam on the assignment.

You may watch the following video, which will help you with the main setup of the assignment.

**Important note:** You can use either Anaconda or Colab to work on the Jupyter notebook that you will submit as your final project on Forum:

1 – Start by downloading this Jupyter notebook to your local machine.

2 – Open a new tab in your browser and go to https://colab.research.google.com/.

3 – This will open a small window. Choose the last option on the upper menu, “Upload”. Then choose the Jupyter notebook you saved in step 1.

4 – You can start working on your assignment by answering the questions in the corresponding cells.

5 – Sample code is provided for most of the tasks.

6 – If you have any questions, please reach out to your instructors and the CIS tutors.

Statistical Intuitions and Applications Assignment #1

This assignment uses an open dataset from the Dubai Statistics Center. You will investigate the main variables in depth using statistical tools, then prepare a report by completing the tasks and answering the questions that follow.

Main Setup:

For this assignment you will select a random sample from a dataset using the code given below with all the detailed steps. To select your random sample and save your data set on your computer follow these instructions:

Run the code below to find the sample size ‘n’ on which you will work.

Go to the 3rd line of the next code and enter your sample size n = ???

Now go to the 4th line: df.to_csv(r'**Path where you want to store the exported CSV file\File Name.csv**')

Change **Path where you want to store the exported CSV file** to where you want to store your data.

Change **File Name** to your first name.

Run the code.

Use this data set to complete your assignment. **Also include this CSV file in your assignment submission!**

import pandas

original_data = pandas.read_csv("https://raw.githubusercontent.com/zu-math/SIA-Fall-2023-Dataset/main/mod_mea_f.csv")

df = original_data.sample(n=???)

df.to_csv(r'Path where you want to store the exported CSV file\File Name.csv')

print(df)

Once the above is done, work on the tasks stated below. Please be clear in explaining your analyses and your findings. Show that you have been thorough and careful by explaining and discussing your findings, not by presenting huge amounts of computer output without appropriate interpretation. Your report should be clear and concise. (Consider how some tables might help to summarize a lot of results.) Please use normal margins and a readable font size.

Question 1.

Your first task is to briefly introduce the study and its main variables in a short, clearly worded report. Identify all the variables in the dataset. Explain to the readers what you will be analyzing in this report.

Question 2.

In this task, you will generate descriptive statistics for all the quantitative variables in the dataset using histograms and describe their distributions in terms of shape, center, spread, and presence of outliers. The code below will give you the histogram and five-number summary of the relevant column. You need to replace ‘???’ with the column name.
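
As an illustration only, here is a minimal sketch of how a histogram and five-number summary could be produced with pandas. The tiny data frame and its values are made up for the example, and the 1.5 × IQR rule used to flag outliers is a common rule of thumb, not something the assignment prescribes.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical stand-in for the sampled data; replace with your own CSV.
df = pd.DataFrame({"Masters": [12, 15, 9, 30, 22, 18, 41, 16, 14, 95]})

def five_number_summary(series):
    """Return min, Q1, median, Q3 and max of a numeric column."""
    q = series.quantile([0, 0.25, 0.5, 0.75, 1.0])
    q.index = ["min", "Q1", "median", "Q3", "max"]
    return q

def iqr_outliers(series):
    """Return values lying more than 1.5 * IQR beyond the quartiles."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    return series[(series < q1 - 1.5 * iqr) | (series > q3 + 1.5 * iqr)]

summary = five_number_summary(df["Masters"])
outliers = iqr_outliers(df["Masters"])
print(summary)
print("Possible outliers:", outliers.tolist())

# Histogram of the same column, used to judge shape and spread.
ax = df["Masters"].plot.hist(bins=8, edgecolor="black", title="Masters")
ax.set_xlabel("Masters")
plt.show()
```

With real data, compare the median against the mean and look at the tail of the histogram to decide whether the distribution is skewed.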

Question 3.

Your next task is to choose three quantitative variables and two categorical variables.

Suppose you have chosen the column ‘Masters’ as a quantitative variable and ‘Gender_EN’ as a categorical variable. Follow the steps in the task below.

3a. Generate a grouped box plot to compare the distribution of Master's degree holders among male and female students. Describe your observations, referring to the five-number summaries of both genders.

Repeat this for the other quantitative and categorical variables. This should give rise to six cases.

3b. Discuss any patterns you observe when you compare male and female students.
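
One possible way to approach 3a is sketched below, assuming the column names ‘Masters’ and ‘Gender_EN’ from the example above. The values are invented for illustration, so the printed numbers say nothing about the real dataset.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical data; in the assignment this would be your sampled CSV.
df = pd.DataFrame({
    "Gender_EN": ["Male", "Female"] * 4,
    "Masters": [10, 14, 12, 20, 18, 22, 16, 30],
})

# Five-number summary of the quantitative variable within each group.
by_gender = (
    df.groupby("Gender_EN")["Masters"]
      .describe()[["min", "25%", "50%", "75%", "max"]]
)
print(by_gender)

# Grouped box plot: one box per category of Gender_EN.
df.boxplot(column="Masters", by="Gender_EN")
plt.suptitle("")          # drop the automatic "Boxplot grouped by ..." title
plt.ylabel("Masters")
plt.show()
```

The same two steps can be repeated for each pairing of a quantitative variable with a categorical variable by changing the two column names.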

Question 4.

In this task, you will use scatterplots to examine the relationship between dependent and independent variables. Treat ‘Academic_Year’ as the independent variable and use any two of the quantitative variables you chose in Question 3 as the dependent variables.

**4a.** Create separate scatterplots to examine the relationship between the dependent variables and the independent variable. Describe the scatterplots in terms of the form, strength, and direction of the relationships. Further examine if the relationship between the independent variable and each of the dependent variables varies by gender (you will need to create scatterplots separately for each gender to answer this question.)

Repeat this for the other quantitative variables. Then change the categorical variable and repeat the analysis. This should give rise to ten cases.

**4b.** Explain in simple words what you observed by reporting your findings.
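
The following is a rough sketch of per-gender scatterplots, assuming ‘Academic_Year’ is stored as a number (for example the starting year). If your file stores it as text such as a year range, it would need converting first. All values are hypothetical.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical data with a numeric Academic_Year column.
df = pd.DataFrame({
    "Academic_Year": [2015, 2016, 2017, 2018] * 2,
    "Gender_EN": ["Male"] * 4 + ["Female"] * 4,
    "Masters": [10, 12, 14, 16, 30, 26, 27, 20],
})

# One panel per gender, sharing the y-axis so the panels are comparable.
groups = list(df.groupby("Gender_EN"))
fig, axes = plt.subplots(1, len(groups), sharey=True, figsize=(8, 3))
for ax, (gender, group) in zip(axes, groups):
    ax.scatter(group["Academic_Year"], group["Masters"])
    ax.set_title(gender)
    ax.set_xlabel("Academic_Year")
axes[0].set_ylabel("Masters")
plt.show()

# Pearson correlation per gender: sign gives direction, size gives strength
# of a linear relationship (it does not describe non-linear forms).
corr = {
    gender: group["Academic_Year"].corr(group["Masters"])
    for gender, group in groups
}
print(corr)
```

The correlation coefficient only supports the description of direction and strength; the form of the relationship still has to be judged from the plot itself.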

Question 5.

You will now fit a simple linear regression model that predicts the dependent variable. Treat ‘Academic_Year’ as the independent variable and use any two of the quantitative variables you chose in Question 3 as the dependent variables.

**5a.** Fit a simple linear regression model between your dependent and independent variables. Generate and use the residual plot, the standard error, and the R^2 to assess the fit of each linear model. If the model is a good fit, interpret the slope and the intercept.

Repeat this for the other quantitative variables. Then change the categorical variable and repeat the analysis. This should give rise to ten cases.

**5b.** Summarize and present your findings in sophisticated statistical terms.
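
For reference, a minimal least-squares sketch using only NumPy is shown below. The data are invented, and measuring the year as "years since 2015" is a choice made here so that the intercept is easier to interpret; the assignment's own sample code may do this differently.

```python
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical data: x is years since 2015, y is the dependent variable.
years = np.array([2015, 2016, 2017, 2018, 2019])
x = years - 2015
y = np.array([10.0, 13.0, 13.0, 17.0, 19.0])

# Least-squares slope and intercept from the usual closed-form formulas.
sxx = np.sum((x - x.mean()) ** 2)
sxy = np.sum((x - x.mean()) * (y - y.mean()))
slope = sxy / sxx
intercept = y.mean() - slope * x.mean()

fitted = intercept + slope * x
residuals = y - fitted

# R^2 = 1 - SSE/SST; standard error of the regression = sqrt(SSE / (n - 2)).
sse = np.sum(residuals ** 2)
sst = np.sum((y - y.mean()) ** 2)
r_squared = 1 - sse / sst
std_error = np.sqrt(sse / (len(x) - 2))

print(f"slope={slope:.3f}, intercept={intercept:.3f}")
print(f"R^2={r_squared:.3f}, standard error={std_error:.3f}")

# Residual plot: a patternless scatter around zero supports a linear model.
plt.scatter(x, residuals)
plt.axhline(0, color="gray", linestyle="--")
plt.xlabel("Years since 2015")
plt.ylabel("Residual")
plt.show()
```

In this made-up example the slope would read as an average change of about 2.2 units per year, and the intercept as the predicted value in 2015.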

Question 6.

The conservation and rehabilitation of local flora and natural habitats is part of the biodiversity program of the Environment Agency, Abu Dhabi (EAD). A sophisticated, high-tech monitoring system provides the following annual production of plants within the native plant nursery. The number of plants produced in each year is written on top of that year's column and may be used in calculations.

Answer the following questions.

**6a.** Calculate the percentage increase in plant production from year Y1 to year Y2, where both Y1 and Y2 are obtained by running the code below.

**6b.** Between which two years was the percentage increase in the annual production of plants equal to P%? Here P is found by running the code below and is given to one decimal place.
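
The arithmetic behind 6a and 6b can be sketched as follows. The production figures, the years, and the value of P are all placeholders invented for this example; the real values come from the chart and from running the assignment's code. The percentage increase from an old value to a new value is (new − old) / old × 100.

```python
from itertools import combinations

# Hypothetical annual production figures, keyed by year.
production = {2018: 200_000, 2019: 250_000, 2020: 300_000, 2021: 330_000}

def percent_increase(old, new):
    """Percentage change from old to new, relative to old."""
    return (new - old) / old * 100

# 6a-style calculation for two placeholder years Y1 and Y2.
y1, y2 = 2018, 2020
increase = percent_increase(production[y1], production[y2])
print(f"Increase from {y1} to {y2}: {increase:.1f}%")

# 6b-style search: every earlier/later pair whose increase rounds to P.
p = 10.0  # placeholder for the P produced by the assignment's code
matches = [
    (a, b)
    for a, b in combinations(sorted(production), 2)
    if round(percent_increase(production[a], production[b]), 1) == p
]
print("Year pairs matching P:", matches)
```

Rounding to one decimal place before comparing mirrors how P is stated in the question and avoids mismatches caused by floating-point noise.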

Question 7.

Under the long-term marine water quality monitoring program of the Environment Agency, Abu Dhabi (EAD), a red tide monitoring project was launched to look for harmful algal blooms (HAB) in marine water. The project recorded the number of such HAB incidents in Abu Dhabi from 2002 to 2022. Use the statistical skills you have learned in this course, IDS-103, to make two strong (non-trivial) observations from the graph below to be reported to EAD.

I need the code completed in my Google Colab notebook account.
