Team-B: [Name every document and screenshot ‘Team-B’]
Overview:
This week you will work with your group on your final project.
Objectives
Apply concepts learned about the Hadoop ecosystem
Apply concepts learned in data preparation to preprocess data
Construct an external table using basic SQL commands in BigQuery
Develop queries in BigQuery
Construct a well-defined schema using basic HQL commands in Hive
Develop queries in Hive
Develop queries in Spark
Instructions
Each group will research their assigned use case. They will select a static dataset and a streaming data source from the approved list provided, or locate another and obtain the instructors’ approval.
Each group will create an executive summary. This summary should be between 400 and 550 words, not including the title page, references, or other supporting documents. It should read like a summary of your presentation: introduce the use case, step through the data lifecycle, identify the tools and applications used during each phase of the data lifecycle, and conclude with the next steps for the data science or analyst teams. Format the executive summary in Times New Roman, 12-point, with one-inch margins.
Each group will create a document with screenshots that covers the project and storage they created for their use case in GCP, setting up their Hadoop ecosystem, performing data processing with their static and streaming data, and performing queries in BigQuery, Hive, and Spark to ensure the quality of their data for the data science or analyst teams. Through each step, the team will take screenshots of their work and present them in a Word document with brief explanations of the screenshots. The description should include the application used, the task performed, and why it was performed. Do not include how-to instructions.
Each group will create a presentation that tells a story using the data lifecycle as a guide, and they will present their work during the designated time. You may be creative with the PowerPoint presentation. The presentation is a professional business presentation. Each member of the group should speak. After the presentation, the group will take questions from the audience. The presentation should be 10-15 minutes in length.
Meeting_Notes_Template:
I have provided ‘Meeting_Notes_Template.docx’; please fill in the template.
Approved Data Sources:
I have provided ‘Approved Data Sources.pdf’; please select two datasets from the approved data sources listed for any use case in the PDF.
PPT and Word:
Topic: Use cases from the discussion post
Data: Use approved data sources (two or more)
Executive Summary (25%): This paper should be between 400 and 550 words, not including the title page, code, and references.
Screenshots (25%): These screenshots should show how you applied what you learned. Create a new project in GCP for this use case.
Presentation (50%): The group will present
Grading: This project is worth 20% of your final course grade. The Executive Summary comprises 25% of the project grade, the screenshots 25%, and the presentation 50%.
Document Type: Word and PPT
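As a worked example of the grading weights above, the sketch below computes how much the project contributes to the final course grade. The function and variable names are hypothetical, and scores are assumed to be fractions between 0 and 1; only the weights come from the grading description.

```python
# Weights taken from the grading description.
PROJECT_WEIGHT = 0.20  # project share of the final course grade
COMPONENT_WEIGHTS = {
    "executive_summary": 0.25,
    "screenshots": 0.25,
    "presentation": 0.50,
}


def course_contribution(scores: dict) -> float:
    """Return the project's contribution to the final course grade.

    `scores` maps each component name to a fraction between 0 and 1
    (an assumption made for this illustration).
    """
    project_score = sum(
        weight * scores[name] for name, weight in COMPONENT_WEIGHTS.items()
    )
    return PROJECT_WEIGHT * project_score


if __name__ == "__main__":
    full_marks = {name: 1.0 for name in COMPONENT_WEIGHTS}
    print(course_contribution(full_marks))
```

With full marks on every component, the project contributes 20% of the course grade: 5% from the executive summary, 5% from the screenshots, and 10% from the presentation.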
Executive Summary Requirements:
400 to 550 words, not counting the title page, references, or supporting documents.
Title page: Organization Name, Logo, Use case, group number, and group members
Introduction: Introduce the use case and its purpose (Example: Data Engineering Request)
Body: Step through the data lifecycle with your use case and the tasks you did
Conclusion: Summarize and discuss the next steps for the data science and analyst teams
Double-spaced Word Document
References
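A quick way to check a draft against the 400 to 550 word limit is a whitespace-based count, sketched below. This is only an approximation: Word's own counter may differ slightly, and the function names are hypothetical. Pass only the body text, since the title page, references, and supporting documents do not count toward the limit.

```python
MIN_WORDS = 400
MAX_WORDS = 550


def count_words(text: str) -> int:
    """Count whitespace-separated tokens in the text."""
    return len(text.split())


def within_limit(text: str) -> bool:
    """Return True if the word count falls within the required range."""
    return MIN_WORDS <= count_words(text) <= MAX_WORDS
```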
Application Screenshots Requirements:
GCP project & storage
Hadoop
OpenRefine
BigQuery
Hive
Spark
Include an explanation (3-10 sentences) with the screenshots stating the application used and the task performed.
Supporting Documents:
Reference page
Meeting notes or Task board
Data Sheet – List of data sources and any additional information, such as the website address
Other documents
Word Document
Meeting Notes Template:
Date:
Start and End time:
Attendees:
Note-taker:
Notes:
Decisions:
Action Items:
Task board:
Create a Task board using MS Teams – Planner, Excel, or Word
Task board Columns
To-Do
In Progress
Review
Done
Task Info: Description, Owner, Due Date
Presentation Requirements:
Business casual
TELL A STORY
Every group member must present
10-15 minutes to present
2 minutes for questions
10-20 PowerPoint Slides
Title page: Organization Name, Logo, Use case, group number, and group members
Outline or Agenda
Every step of the data lifecycle – No definitions
Hive and Spark SQL comparison chart
A few of your screenshots (No more than 5)
Cite the source on the slide if not your own words
Word document regarding the rubric instructions.
Requirements: Follow everything stated in the description.