
158.755-2025
Semester 1 Massey University
Project 2

Deadline: Hand-in by midnight Sunday, 21 April 2025


Evaluation: 25% of your final course grade.
Late Submission: See Course Guide
Work: This assignment is to be done individually.
Purpose: Implement the entire data science/analytics workflow. Learn to correctly apply, and reason
about, different machine learning techniques to solve real-world problems. Gain
skills in extracting data from the web using APIs and web scraping. Build on the data
wrangling, data visualization and introductory data analysis skills gained up to this point,
as well as problem formulation and presentation of findings. Covers learning outcomes
1-5 from the course outline.

Project outline:

This project requires that you apply machine learning techniques taught so far to build predictive regression models on
current, topical and original data from your chosen domain. You are expected to carry out an entire data science/analytics
workflow by: (1) acquiring data from multiple sources, (2) performing data wrangling, (3) integrating the data, (4)
conducting analysis to answer some key research questions and finally (5) perform predictive modelling.

Data Acquisition:      Web Scraping, Web API, Static dataset, Other(?)
Data Wrangling:        Transform, Clean, Impute, User-defined functions, Visualisation
Data Integration:      Concatenation, Merging
Data Analysis:         Group-by, Pivot tables, Cross-tabulation, EDA, Visualisation
Predictive Modelling:  Regression, kNN, Other(?)

The data should primarily come from sources such as web APIs and/or scraped web pages. This data can also be combined
with static datasets found in various repositories if needed, or with some of the datasets you used in Project 1. The
important point is that you are predicting continuous-valued outputs; beyond that, you are entirely free to choose a domain
or a combination of domains that interests you.
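
As a starting point, the sketch below shows the two main acquisition routes using the requests, BeautifulSoup and pandas packages. The URL, endpoint, query parameters and CSS selector are hypothetical placeholders, not suggested sources:

    import requests
    import pandas as pd
    from bs4 import BeautifulSoup

    # Route 1: a web API returning JSON (hypothetical endpoint and parameters)
    resp = requests.get("https://api.example.com/v1/listings",
                        params={"region": "wellington", "limit": 100})
    resp.raise_for_status()              # fail loudly on HTTP errors
    api_df = pd.DataFrame(resp.json())   # assumes the JSON is a list of records

    # Route 2: scraping an HTML table (hypothetical page and selector);
    # check the site's terms and conditions of use before scraping
    page = requests.get("https://www.example.com/prices")
    soup = BeautifulSoup(page.text, "html.parser")
    rows = [[td.get_text(strip=True) for td in tr.find_all("td")]
            for tr in soup.select("table#prices tr")[1:]]  # skip the header row
    scraped_df = pd.DataFrame(rows, columns=["suburb", "median_price"])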

Project Requirements:

Project details:
- Each student should aim to create a unique and distinctive data-problem to work on, made original by
combining different data sources. The goal of the project is to perform prediction analysis.

Questions to consider in your experiments and tasks to perform once you have chosen your domain:
- Build multiple regression and kNN models and compare their outputs.
- Experiment with models using different features. Which features are most effective? Why?
- Experiment with kNN using different distance metrics and different values of k, and compare the outputs. Which
values of k are most robust for the size of your dataset and your problem domain? Do the differing scales of the
variables in your data affect the algorithm's accuracy? How have you tried to overcome this? (A minimal sketch of
this experiment appears after this list.)
- Experiment with linear, multiple linear and polynomial regression models and compare them. At what point does
a regression model become too complex and stop capturing the true relationships in the data?
- How reliable are your prediction models? What do the confidence intervals and prediction bands tell you? Could
you recommend this predictive model to a client? Would you expect this model to preserve its accuracy on data
beyond the range it was built on?
- Is your evaluation approach robust enough to be able to draw conclusions about the utility of your models?
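
The sketch below, referred to in the list above, illustrates two of these experiments on a small synthetic dataset using scikit-learn. The column names and data-generating process are placeholders, not a suggested domain; it compares kNN with and without feature standardisation across several values of k, and polynomial regression models of increasing degree on hold-out data:

    import numpy as np
    import pandas as pd
    from sklearn.model_selection import train_test_split
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler, PolynomialFeatures
    from sklearn.neighbors import KNeighborsRegressor
    from sklearn.linear_model import LinearRegression
    from sklearn.metrics import mean_absolute_error

    # Synthetic data: two features on very different scales (placeholder names)
    rng = np.random.default_rng(0)
    n = 500
    df = pd.DataFrame({"floor_area": rng.uniform(50, 300, n),     # hundreds
                       "dist_cbd_km": rng.uniform(0.5, 40, n)})   # tens
    df["price"] = (3000 * df["floor_area"] - 8000 * df["dist_cbd_km"]
                   + rng.normal(0, 20000, n))

    X_train, X_test, y_train, y_test = train_test_split(
        df[["floor_area", "dist_cbd_km"]], df["price"], random_state=0)

    # Effect of k and of feature scaling on kNN regression
    for k in (1, 5, 15, 50):
        raw = KNeighborsRegressor(n_neighbors=k).fit(X_train, y_train)
        scaled = make_pipeline(StandardScaler(),
                               KNeighborsRegressor(n_neighbors=k)).fit(X_train, y_train)
        print(f"k={k:2d}  MAE raw={mean_absolute_error(y_test, raw.predict(X_test)):.0f}"
              f"  MAE scaled={mean_absolute_error(y_test, scaled.predict(X_test)):.0f}")

    # Effect of polynomial degree: high degrees begin to fit noise (overfitting)
    for degree in (1, 2, 5, 9):
        poly = make_pipeline(PolynomialFeatures(degree), LinearRegression())
        poly.fit(X_train, y_train)
        print(f"degree={degree}  hold-out MAE="
              f"{mean_absolute_error(y_test, poly.predict(X_test)):.0f}")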

Submit a single Jupyter Notebook that contains the most integral parts of your analysis, together with a thorough description
of your findings. Make sure you interpret all model outputs and figures. The Python code in the notebook must be entirely
self-contained, and all of the experiments and graphs must be replicable.
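
For instance, one way to address the earlier question about confidence intervals and prediction bands is statsmodels' get_prediction method, sketched here on synthetic data (the data-generating process is illustrative only):

    import numpy as np
    import statsmodels.api as sm

    rng = np.random.default_rng(1)
    x = rng.uniform(0, 10, 200)
    y = 2.5 * x + rng.normal(0, 2.0, 200)   # synthetic data for illustration

    X = sm.add_constant(x)                  # adds the intercept column
    fit = sm.OLS(y, X).fit()

    grid = sm.add_constant(np.linspace(0, 10, 50))
    frame = fit.get_prediction(grid).summary_frame(alpha=0.05)
    # mean_ci_lower / mean_ci_upper -> 95% confidence band for the fitted mean
    # obs_ci_lower / obs_ci_upper   -> 95% prediction band for new observations
    print(frame[["mean", "mean_ci_lower", "mean_ci_upper",
                 "obs_ci_lower", "obs_ci_upper"]].head())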

Do not use absolute paths; use relative paths where paths are needed. Consider hiding away some of your Python code by
putting it into .py files that you can import. This can improve the readability of your final notebook by removing
code that would otherwise clutter and distract from your actual findings and discussions.
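
For instance, a hypothetical helpers.py kept next to the notebook might expose a cleaning function that the notebook imports (the file, function and column names are illustrative only):

    # helpers.py (hypothetical module stored next to the notebook)
    import pandas as pd

    def load_and_clean(path):
        """Read a CSV via a relative path and apply routine cleaning steps."""
        df = pd.read_csv(path)              # relative path, not absolute
        return df.dropna(subset=["price"])  # placeholder cleaning rule

    # In the notebook:
    # from helpers import load_and_clean
    # df = load_and_clean("data/listings.csv")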

You may install and use any additional Python packages that will help you with this project. When submitting
your project, include a README file that specifies which additional Python packages you have installed, so that
your project can be reproduced on my computer should I need to install extra modules.

Follow the general structure of the Project Notebook Template provided. Make your notebook professional and tidy
(avoid large data dumps) and run your text through an IPython Notebook spell-checker extension. You can also pretend
that you are a consultant performing an analysis for a client.

NOTE: The topics of web scraping, using web APIs and the kNN algorithm will be covered in weeks 5 and 6. Therefore,
begin your assignment as soon as you can using the concepts covered thus far. Once all of the week 6 material has been
covered, you will be able to complete the remaining components of this assignment in the week it is due.

Marking criteria:

Marks will be awarded for different components of the project using the following rubric:

Data Acquisition (15 marks)
- diversity of sources: data from a web API and/or data scraped from a web site should be included to get
  maximum marks
- appropriate use of merging and concatenation
- ethical data collection (make sure that the terms and conditions of use permit you to collect the data), with
  a clear statement in the notebook that you have complied with this

Data Wrangling (10 marks)
- thoroughness in data cleaning
- visualisations
- handling of missing values and outliers

Data Analysis (20 marks)
- quality of your exploratory data analysis
- presentation of the characteristics of the data
- discussion of any assumptions being made
- formulation of the problem as a machine learning problem
- diversity of techniques used to achieve this
- presentation of findings

Predictive Modelling (40 marks)
- diversity of experiments
- quality of the evaluations and testing using hold-out data
- comparisons, presentation and interpretation of results

Originality and Rigour (15 marks)
- discussion of how your academic readings have informed your analyses
- originality of the datasets
- quality of research questions
- difficulty of the problem
- degree to which the problem domain is original, challenging, topical and presented in an interesting way

Reading Log (PASS)
- the compiled reading logs up to the current period
- the peer discussion summaries for each week
- any relevant connections between your readings and your analytical work in the notebook; if a research paper
  influenced how you approached an implementation, mention it

Hand-in: Make sure that the notebook you submit has all outputs embedded. Also export your notebook to HTML.
Zip up your notebook (.ipynb and .html), the dataset(s) you have chosen, and any other .py files you have written
into a single file, and submit it through Stream. Include your reading log in the zipped file as well. Do not email your
submission to the lecturer unless there are problems with the submission site.
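
A submission archive might therefore look something like this (file names are illustrative):

    project2.zip
        project2.ipynb      (with all outputs embedded)
        project2.html       (exported copy of the notebook)
        helpers.py          (any supporting modules)
        README.txt          (extra packages needed to run the notebook)
        reading_log.pdf     (compiled reading log)
        data/               (the dataset(s) used)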

If you have any questions or concerns about this assignment, please ask the lecturer early rather than close to
the submission deadline.

Use of Generative AI in This Assignment
In industry, AI and online resources are commonly used to improve efficiency and productivity. However, at university,
the primary goal is to develop your understanding, analytical skills, and ability to work through problems independently.
Mastering these skills first will allow you to use AI tools more effectively and critically in the future. While AI can be a
helpful tool for learning, relying on it to generate answers directly will short-circuit your learning and development.

For this project, you are required to independently select, wrangle, analyze, and interpret datasets from your chosen
domain. You will also maintain a reading log, where critical engagement with academic sources is expected and
integrated into your analyses where relevant. The use of generative AI is restricted to planning, explanation, and concept
development, as outlined below.

Allowed Uses of AI for assignment 2


You may use AI for conceptual understanding, guidance, and general problem-solving strategies, but NOT for directly
completing any part of your assignment. Specifically, AI can be used to:
• Understand background knowledge relevant to data science, regression analysis, and kNN – as well as other
models.
o Example: "How does kNN differ from linear regression in terms of assumptions and use cases?"
o Example: "What are common challenges when performing web scraping at scale?"
• Seek feedback on your problem formulation and methodology without directly generating code or statistical
analysis.
o Example: "I plan to predict housing prices using data from a real estate API. Does this make sense?"
o Example: "What are some potential pitfalls in merging datasets from different sources?"
• Clarify technical concepts or get debugging hints, provided you write the code yourself.
o Example: "Why might my web scraping code be returning an empty dataset?"
o Example: "How does feature scaling affect kNN classification?"
• Explore different methods for data visualization, but without directly copying AI-generated visualizations.
o Example: "What are effective ways to visualize feature importance in regression models?"
o Example: "How can I compare multiple regression models visually?"
• Enhance critical engagement with research articles by summarizing complex concepts or suggesting alternative
interpretations.
o Example: "What are some alternative methods for assessing regression model reliability?"

Prohibited Uses of AI for assignment 2


You must NOT:
• Copy AI-generated code directly into your submission.
• Input the assignment questions directly into AI and use its responses as your own.
• Paraphrase AI-generated explanations/code and present them as original work.
• Ask AI to write step-by-step solutions to any of the assignment tasks.
