For this assignment, you will be using the Framingham Heart Study data. The Framingham Heart Study is a long-term prospective study of the etiology of cardiovascular disease among a population of subjects in the community of Framingham, Massachusetts. It was a landmark study in epidemiology: it was the first prospective study of cardiovascular disease, and it identified the concept of risk factors and their joint effects. We will be using this original data.

As you look over the Framingham Heart Study data and data dictionary to familiarize yourself with this data, you will notice that the study had a longitudinal design. This means that there were multiple observations on the same individuals at different points in time. You will notice variables with the same name, but with 1, 2 or 3 at the end of the name. These numbers indicate the data collection time points. For this assignment, we will only be using the primary variables and the variables at time point 1. Because of this, we can create an analysis file by retaining only the variables we want and removing the variables we do not need. This will make the data file easier to work with.

To reduce the dataset to a more manageable size, open the Framingham Heart Study data in Excel. Remove all variables whose names end in ‘2’ or ‘3’. Variables like sex2, sex3, age2, age3, etc., should all be removed. In Excel, you can simply highlight the columns you do not want and delete them. Next, remove all variables whose names start with “TIME”. These are variables like TIMEAP, TIMEMI, etc.
Save your reduced data file to your computer under a different filename. Call this reduced dataset something like: FHS_assign7.xlsx.
Check the records to see if there are missing values. Delete records with missing values. Re-save your dataset.
Read your new analysis file into R. You are good to go.
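
The same reduction can also be done in R instead of Excel. The following is only a minimal sketch on a small made-up data frame: the column names and values are hypothetical stand-ins, not the real FHS file, and the pattern matching assumes the naming convention described above.

```r
# Toy data frame standing in for the FHS file (all values are made up).
fhs <- data.frame(
  RANDID = 1:5,
  SEX1   = c(1, 2, 2, 1, 2),
  AGE1   = c(39, 46, NA, 61, 52),
  AGE2   = c(45, 52, 54, 67, 58),
  AGE3   = c(51, 58, 60, 73, 64),
  STROKE = c(0, 0, 1, 1, 0),
  TIMEAP = c(8766, 8766, 2156, 8766, 373),
  TIMEMI = c(8766, 8766, 2156, 6438, 8766)
)

# Flag columns whose names end in 2 or 3, or start with TIME.
drop <- grepl("[23]$", names(fhs)) |
  grepl("^TIME", names(fhs), ignore.case = TRUE)
fhs_small <- fhs[, !drop]

# Keep only the records that have no missing values.
fhs_clean <- na.omit(fhs_small)

names(fhs_clean)  # the columns that survive the reduction
nrow(fhs_clean)   # the number of complete records
```

If you reduce the file in Excel as described, you still need to read the .xlsx file into R; one option is the read_excel() function from the readxl package.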
ASSIGNMENT TASKS

Part A – Mechanics (25 points)

For this analysis, the variable “stroke” should be considered the response variable (Y) and the “diabetes1” variable should be considered the explanatory variable (X). Complete the following:

1) Construct a side-by-side bar graph to compare these two categorical variables. Describe what you see in this graph. Be sure to label the axes and give the graph a title.


2) Construct a contingency table, complete with marginal row and column totals, for these two variables. Then answer the following:
a) What is the conditional probability of having a stroke given diabetes is present at time 1? What is the conditional probability of having a stroke given diabetes is NOT present at time 1?
b) What are the odds of having a stroke if diabetes is present at time 1? What are the odds of having a stroke if diabetes is not present at time 1?
c) Calculate the odds ratio of having a stroke when diabetes is present relative to when it is not. Interpret this result.
d) Specify the null and alternative hypotheses, and then conduct a hypothesis test to see if diabetes is related to having a stroke. Interpret the results.
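
The mechanics of tasks 1 and 2 can be sketched in base R as follows. The counts below are hypothetical and are NOT taken from the Framingham data; they only show how the pieces fit together. With your own data you would build the table from your two variables with table() rather than typing counts in by hand.

```r
# Hypothetical 2x2 counts (rows: diabetes status, columns: stroke outcome).
counts <- matrix(c(10,  90,
                   40, 860),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(diabetes = c("present", "absent"),
                                 stroke   = c("yes", "no")))
tab <- as.table(counts)

# Side-by-side bars: one group per diabetes status, one bar per stroke outcome.
barplot(t(tab), beside = TRUE, legend.text = TRUE,
        xlab = "Diabetes at time 1", ylab = "Number of subjects",
        main = "Stroke by diabetes status (toy data)")

# Contingency table with marginal row and column totals.
addmargins(tab)

# Conditional probability of stroke within each diabetes group
# (row proportions, then the "yes" column).
p_stroke <- prop.table(tab, margin = 1)[, "yes"]

# Odds = p / (1 - p) within each group.
odds <- p_stroke / (1 - p_stroke)

# Odds ratio: diabetes present relative to diabetes absent.
odds_ratio <- odds["present"] / odds["absent"]

# Chi-square test of independence.
# H0: stroke and diabetes are unrelated; Ha: they are related.
test <- chisq.test(tab)
test$p.value
```

Note that chisq.test() applies a continuity correction to 2x2 tables by default; pass correct = FALSE if you want the uncorrected statistic.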

Part B – Open Ended Analysis (75 points)

3) In professional practice, when you have an observational dataset like the Framingham Heart Study data, you are typically looking for risk factors, that is, explanatory variables that are related to specific response variables of interest. For this task, you will identify and work with only categorical explanatory variables. The response variables of interest are ANYCHD, STROKE, and DEATH. What categorical explanatory variables seem to indicate elevated risk of Coronary Heart Disease, Stroke, or Death? Conduct an analysis. Report and interpret your results.

4) Which of the continuous explanatory variables do you think is most likely indicative of elevated risk of Coronary Heart Disease (ANYCHD), Stroke, or Death? Pick one such variable. Create a new variable that maps the continuous variable’s values into a categorical variable with at least 3 levels. Conduct contingency table analyses relating this newly created categorical variable to ANYCHD, STROKE, and DEATH. These analyses should be done separately. In other words, you will have at least 3 separate contingency tables. Do NOT attempt to build multi-dimensional contingency tables! Report the results of your analysis and discuss them.
5) Reflect on your experiences here. What are your recommendations for future analysis?

Congratulations! You’ve completed Assignment 8. Please save your R code, because you can reuse or cannibalize it in future assignments. Your write-up should address each task.
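
For task 4, the step of mapping a continuous variable into categories and building one separate two-way table per response can be sketched as below. This is an illustration under stated assumptions: the data are made up, the variable name sysbp1 is a hypothetical placeholder, and the cut points 120 and 140 are arbitrary choices for the sketch, not recommended thresholds.

```r
# Toy data, made up for illustration; not Framingham values.
toy <- data.frame(
  sysbp1 = c(105, 118, 126, 134, 142, 155, 168, 121, 139, 160),
  ANYCHD = c(0, 0, 1, 0, 1, 1, 1, 0, 0, 1),
  STROKE = c(0, 0, 0, 0, 1, 0, 1, 0, 1, 1),
  DEATH  = c(0, 1, 0, 0, 1, 1, 1, 0, 0, 1)
)

# Map the continuous variable into 3 levels.
# right = FALSE makes the intervals [-Inf,120), [120,140), [140,Inf).
toy$sbp_cat <- cut(toy$sysbp1,
                   breaks = c(-Inf, 120, 140, Inf),
                   labels = c("low", "medium", "high"),
                   right = FALSE)

# One separate two-way table per response variable.
responses <- c("ANYCHD", "STROKE", "DEATH")
tables <- lapply(responses, function(r) {
  table(toy$sbp_cat, toy[[r]], dnn = c("sbp_cat", r))
})
names(tables) <- responses

# Show one of the tables with its marginal totals.
addmargins(tables[["STROKE"]])
```

Each table in the list can then be passed to chisq.test() on its own. With only ten toy records the expected counts are far too small for that test to be meaningful, so it is not run here.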