158.755-2025
Semester 1 Massey University
Project 2
Project outline:
This project requires that you apply machine learning techniques taught so far to build predictive regression models on
current, topical and original data from your chosen domain. You are expected to carry out an entire data science/analytics
workflow by: (1) acquiring data from multiple sources, (2) performing data wrangling, (3) integrating the data, (4)
conducting analysis to answer some key research questions and finally (5) perform predictive modelling.
Data Acquisition → Data Wrangling → Data Integration → Data Analysis → Predictive Modelling
The data should primarily come from sources such as web APIs and/or scraped web pages. This data can also be combined
with static datasets found in various repositories, or with some of the datasets you used in Project 1, if needed. The
important point is that you are predicting continuous-valued outputs; beyond that, you are entirely free to choose a domain
or a combination of domains that interests you.
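For example, a minimal sketch of pulling API data into pandas might look like the following; the endpoint, query parameters and JSON structure are hypothetical placeholders, not a prescribed source.

    import requests
    import pandas as pd

    # Hypothetical endpoint and query parameters; substitute your chosen API.
    resp = requests.get("https://api.example.com/v1/observations",
                        params={"from": "2025-01-01", "format": "json"},
                        timeout=30)
    resp.raise_for_status()

    # Assumes the API returns a JSON list of records; adjust for your source.
    api_df = pd.DataFrame(resp.json())

A scraped page or static dataset would be wrangled into the same tabular form before integration.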
Project Requirements:
Project details:
- Each student should aim to create a unique and distinctive data problem to work on, made original by combining
different data sources. The goal of the project is to perform prediction analysis.
Questions to consider in your experiments and tasks to perform once you have chosen your domain:
- Build multiple regression and kNN models and compare their outputs.
- Experiment with models using different features. Which features are most effective? Why?
- Experiment with kNN using different distance metrics and different values of k, and compare the outputs. Which
values of k are most robust for the size of your dataset and your problem domain? Do the differing scales of the
variables in your data affect the algorithm’s accuracy? How have you tried to overcome this? (A sketch of this
experiment appears after this list.)
- Experiment with linear, multiple linear and polynomial regression models and compare them. At what point does
a regression model become too complex and no longer capture the true relationships in the data?
- How reliable are your prediction models? What do the confidence intervals and prediction bands tell you (see the
interval sketch after this list)? Could you recommend this predictive model to a client? Would you expect this
model to preserve its accuracy on data beyond the range it was built on?
- Is your evaluation approach robust enough to draw conclusions about the utility of your models?
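As an illustration of the kNN experiment referenced above, here is a minimal sketch assuming a pandas DataFrame df with numeric feature columns and a continuous target column named "target"; the column name, k values and metrics are placeholders, not prescribed choices.

    import pandas as pd
    from sklearn.model_selection import cross_val_score
    from sklearn.neighbors import KNeighborsRegressor
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler

    X = df.drop(columns=["target"])   # "target" is a placeholder column name
    y = df["target"]

    results = []
    for metric in ["euclidean", "manhattan"]:
        for k in [1, 3, 5, 11, 21]:
            # Scaling inside the pipeline stops variables with large ranges
            # from dominating the distance calculation.
            model = make_pipeline(StandardScaler(),
                                  KNeighborsRegressor(n_neighbors=k, metric=metric))
            scores = cross_val_score(model, X, y, cv=5,
                                     scoring="neg_root_mean_squared_error")
            results.append({"metric": metric, "k": k, "rmse": -scores.mean()})

    print(pd.DataFrame(results).sort_values("rmse"))

The interval sketch referenced above could, for a simple linear model, use statsmodels; "x" below is a hypothetical predictor column.

    import statsmodels.api as sm

    X_const = sm.add_constant(df[["x"]])          # "x" is a placeholder predictor
    ols = sm.OLS(df["target"], X_const).fit()

    # mean_ci_* columns give the confidence interval for the mean response;
    # obs_ci_* columns give the wider prediction interval for new observations.
    pred = ols.get_prediction(X_const).summary_frame(alpha=0.05)
    print(pred[["mean", "mean_ci_lower", "mean_ci_upper",
                "obs_ci_lower", "obs_ci_upper"]].head())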
Submit a single Jupyter Notebook that contains the most integral parts of your analysis, together with a thorough description
of your findings. Make sure you interpret all model outputs and figures. The Python code in the notebook must be entirely
self-contained, and all experiments and graphs must be replicable.
Do not use absolute paths; use relative paths if you need to refer to files. Consider hiding away some of your Python code by
putting it into .py files that you can import. This might help the readability of your final notebook by
removing unnecessary Python code that can clutter and distract from your actual findings and discussions.
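For instance, a hypothetical helpers.py kept next to the notebook might hold a data-loading function; the file name, function and path below are placeholders.

    # helpers.py (hypothetical module kept next to the notebook)
    import pandas as pd

    def load_and_clean(path):
        """Read a CSV via a relative path and drop incomplete rows."""
        return pd.read_csv(path).dropna()

The notebook itself then stays uncluttered:

    from helpers import load_and_clean

    df = load_and_clean("data/my_dataset.csv")   # relative, not absolute, path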
You may install and use any additional Python packages that will help you with this project. When submitting your project,
include a README file that lists any additional Python packages you installed, so that your project can be reproduced on my
computer should I need to install extra modules.
Follow the general structure of the Project Notebook Template provided. Make your notebook professional and tidy
(avoid large data dumps) and run your text through an IPython Notebook spell-checker extension. You can also pretend
that you are a consultant performing an analysis for a client.
NOTE: The topics of web scraping, web APIs and the kNN algorithm will be covered in weeks 5 and 6. Therefore,
begin your assignment as soon as you can using the concepts covered so far. Once the week 6 material has been
covered, you will be able to complete the remaining components of this assignment in the week it is due.
Marking criteria:
Marks will be awarded for different components of the project using the following rubric:
Reading Log (PASS):
- The compiled reading logs up to the current period.
- The peer discussion summaries for each week.
- Any relevant connections between your readings and your analytical work in the notebook. If a research paper
influenced how you approached an implementation, mention it.
Hand-in: Make sure that the notebook you submit has all of its outputs embedded. Also, export your notebook to HTML.
Zip up your notebook (.ipynb and .html), the dataset(s) you have chosen, and any other .py files you might have written
into a single file, and submit it through Stream. Include your reading log in the zipped file as well. Do not email your
submission to the lecturer unless there are problems with the submission site.
If you have any questions or concerns about this assignment, please ask the lecturer sooner rather than later; do not
wait until just before the submission deadline.
Use of Generative AI in This Assignment
In industry, AI and online resources are commonly used to improve efficiency and productivity. However, at university,
the primary goal is to develop your understanding, analytical skills, and ability to work through problems independently.
Mastering these skills first will allow you to use AI tools more effectively and critically in the future. While AI can be a
helpful tool for learning, relying on it to generate answers directly will short-circuit your learning and development.
For this project, you are required to independently select, wrangle, analyze, and interpret datasets from your chosen
domain. You will also maintain a reading log, in which critical engagement with academic sources is expected and should
be integrated into your analyses where relevant. The use of generative AI is restricted to planning, explanation, and concept
development, as outlined below.