158.755-2025
Semester 1 Massey University
Project 2
Project outline:
This project requires that you apply machine learning techniques taught so far to build predictive regression models on
current, topical and original data from your chosen domain. You are expected to carry out an entire data science/analytics
workflow by: (1) acquiring data from multiple sources, (2) performing data wrangling, (3) integrating the data, (4)
conducting analysis to answer some key research questions and finally (5) perform predictive modelling.
Data Acquisition → Data Wrangling → Data Integration → Data Analysis → Predictive Modelling
The data should primarily come from sources such as web APIs and/or scraped web pages. This data can also be combined
with static datasets found in various repositories, or with some of the datasets you used in Project 1, if needed. The
important point is that you are predicting continuous-valued outputs; beyond that, you are entirely free to choose a domain
or a combination of domains that interests you.
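For example, a minimal sketch of pulling API data into pandas might look like the following; the endpoint, query parameters and JSON structure are hypothetical placeholders, not a prescribed source.

    import requests
    import pandas as pd

    # Hypothetical endpoint and query parameters; substitute your chosen API.
    resp = requests.get("https://api.example.com/v1/observations",
                        params={"from": "2025-01-01", "format": "json"},
                        timeout=30)
    resp.raise_for_status()

    # Assumes the API returns a JSON list of records; adjust for your source.
    api_df = pd.DataFrame(resp.json())

A scraped page or static dataset would be wrangled into the same tabular form before integration.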
Project Requirements:
Project details:
- Each student should aim to create a unique and distinctive data problem to work on, made original by combining
different data sources. The goal of the project is to perform prediction analysis.
Questions to consider in your experiments and tasks to perform once you have chosen your domain:
- Build multiple regression and kNN models and compare their outputs.
- Experiment with models using different features. Which features are most effective? Why?
- Experiment with kNN using different distance metrics and different values of k, and compare the outputs. Which
values of k are most robust for the size of your dataset and your problem domain? Do the differing scales of the
variables in your data affect the algorithm’s accuracy? How have you tried to overcome this? (A sketch of this
experiment appears after this list.)
- Experiment with linear, multiple linear and polynomial regression models and compare them. At what point does
a regression model become too complex and no longer capture the true relationships in the data?
- How reliable are your prediction models? What do the confidence intervals and prediction bands tell you (see the
interval sketch after this list)? Could you recommend this predictive model to a client? Would you expect this
model to preserve its accuracy on data beyond the range it was built on?
- Is your evaluation approach robust enough to draw conclusions about the utility of your models?
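As an illustration of the kNN experiment referenced above, here is a minimal sketch assuming a pandas DataFrame df with numeric feature columns and a continuous target column named "target"; the column name, k values and metrics are placeholders, not prescribed choices.

    import pandas as pd
    from sklearn.model_selection import cross_val_score
    from sklearn.neighbors import KNeighborsRegressor
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler

    X = df.drop(columns=["target"])   # "target" is a placeholder column name
    y = df["target"]

    results = []
    for metric in ["euclidean", "manhattan"]:
        for k in [1, 3, 5, 11, 21]:
            # Scaling inside the pipeline stops variables with large ranges
            # from dominating the distance calculation.
            model = make_pipeline(StandardScaler(),
                                  KNeighborsRegressor(n_neighbors=k, metric=metric))
            scores = cross_val_score(model, X, y, cv=5,
                                     scoring="neg_root_mean_squared_error")
            results.append({"metric": metric, "k": k, "rmse": -scores.mean()})

    print(pd.DataFrame(results).sort_values("rmse"))

The interval sketch referenced above could, for a simple linear model, use statsmodels; "x" below is a hypothetical predictor column.

    import statsmodels.api as sm

    X_const = sm.add_constant(df[["x"]])          # "x" is a placeholder predictor
    ols = sm.OLS(df["target"], X_const).fit()

    # mean_ci_* columns give the confidence interval for the mean response;
    # obs_ci_* columns give the wider prediction interval for new observations.
    pred = ols.get_prediction(X_const).summary_frame(alpha=0.05)
    print(pred[["mean", "mean_ci_lower", "mean_ci_upper",
                "obs_ci_lower", "obs_ci_upper"]].head())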
Submit a single Jupyter Notebook that contains the most integral parts of your analysis, together with a thorough description
of your findings. Make sure you interpret all model outputs and figures. The Python code in the notebook must be entirely
self-contained, and all experiments and graphs must be replicable.
Do not use absolute paths; use relative paths if you need to refer to files. Consider hiding away some of your Python code by
putting it into .py files that you can import. This might help the readability of your final notebook by
removing unnecessary Python code that can clutter and distract from your actual findings and discussions.
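For instance, a hypothetical helpers.py kept next to the notebook might hold a data-loading function; the file name, function and path below are placeholders.

    # helpers.py (hypothetical module kept next to the notebook)
    import pandas as pd

    def load_and_clean(path):
        """Read a CSV via a relative path and drop incomplete rows."""
        return pd.read_csv(path).dropna()

The notebook itself then stays uncluttered:

    from helpers import load_and_clean

    df = load_and_clean("data/my_dataset.csv")   # relative, not absolute, path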
You may install and use any additional Python packages that will help you with this project. When submitting your project,
include a README file that lists any additional Python packages you installed, so that your project can be reproduced on my
computer should I need to install extra modules.
Follow the general structure of the Project Notebook Template provided. Make your notebook professional and tidy
(avoid large data dumps) and run your text through an IPython Notebook spell-checker extension. You can also pretend
that you are a consultant performing an analysis for a client.
NOTE: The topics of web scraping, web APIs and the kNN algorithm will be covered in weeks 5 and 6. Therefore,
begin your assignment as soon as you can using the concepts covered so far. Once the week 6 material has been
covered, you will be able to complete the remaining components of this assignment in the week it is due.
Marking criteria:
Marks will be awarded for different components of the project using the following rubric:
Reading Log (PASS):
- The compiled reading logs up to the current period.
- The peer discussion summaries for each week.
- Any relevant connections between your readings and your analytical work in the notebook. If a research paper
influenced how you approached an implementation, mention it.
Hand-in: Make sure that the notebook you submit has all of its outputs embedded. Also, export your notebook to HTML.
Zip up your notebook (.ipynb and .html), the dataset(s) you have chosen, and any other .py files you might have written
into a single file, and submit it through Stream. Include your reading log in the zipped file as well. Do not email your
submission to the lecturer unless there are problems with the submission site.
If you have any questions or concerns about this assignment, please ask the lecturer sooner rather than later; do not
wait until just before the submission deadline.
Use of Generative AI in This Assignment
In industry, AI and online resources are commonly used to improve efficiency and productivity. However, at university,
the primary goal is to develop your understanding, analytical skills, and ability to work through problems independently.
Mastering these skills first will allow you to use AI tools more effectively and critically in the future. While AI can be a
helpful tool for learning, relying on it to generate answers directly will short-circuit your learning and development.
For this project, you are required to independently select, wrangle, analyze, and interpret datasets from your chosen
domain. You will also maintain a reading log, in which critical engagement with academic sources is expected and should
be integrated into your analyses where relevant. The use of generative AI is restricted to planning, explanation, and concept
development, as outlined below.