
FIT1043 - Lecture 1 - 2024 Data Science


FIT1043 Lecture 1

Introduction to Data Science

Mahsa Salehi*

Faculty of Information Technology, Monash University

Semester 2, 2024

* with material from © Wray Buntine


Outline

§ Unit motivation (Why data science?)


§ About this Unit
§ Teaching team
§ Unit logistics
§ Overview of data science
§ Data science definition
§ Data science process
§ Relevant fields to data science
Motivation for the Unit
from The Digital Universe of Opportunities: Rich Data and the Increasing Value
of the Internet of Things, April 2014
Data is everywhere!
► Google: processes 24 petabytes of data per day.
► Facebook: 10 million photos uploaded every hour.
► YouTube: 1 hour of video uploaded every second.
► Astronomy: satellite data is in the hundreds of petabytes.
► Twitter: 400 million tweets per day.
► In 2020, every person generated 1.7 megabytes of data every second.
► Cloud data storage around the world: 200+ zettabytes by 2025 (Cybercrime Magazine).
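
To get a feel for these magnitudes, two of the figures above can be converted to other time scales. This is only a rough sketch: it assumes decimal (SI) units, which the slide does not specify, and the helper names are made up for this example.

```python
# Decimal (SI) byte units. This is an assumption: the slide does not say
# whether decimal or binary units are meant.
MB = 10**6
GB = 10**9
PB = 10**15

SECONDS_PER_DAY = 24 * 60 * 60  # 86,400 seconds


def per_second(bytes_per_day):
    """Convert a daily data volume to an average rate in bytes per second."""
    return bytes_per_day / SECONDS_PER_DAY


def per_day(bytes_per_second):
    """Convert a per-second rate to a daily volume in bytes."""
    return bytes_per_second * SECONDS_PER_DAY


# Google figure from the slide: 24 PB per day.
google_rate_gb_s = per_second(24 * PB) / GB

# Per-person figure from the slide: 1.7 MB every second.
person_daily_gb = per_day(1.7 * MB) / GB

print(f"Google: about {google_rate_gb_s:.0f} GB per second")
print(f"One person: about {person_daily_gb:.0f} GB per day")
```

Under these assumptions, the Google figure works out to roughly 278 GB per second and the per-person figure to roughly 147 GB per day.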

image src: corporatecomplianceinsights.com


Motivation for the Unit
Data Science is in its growth phase:
► The demand for skilled data science practitioners in industry,
academia, and government is rapidly growing.
Motivation for the Unit
Data Science is a very well-paying field
► Seek.com: average annual salary 2023

We try and cover the full extent of what makes Data Science:
Teaching Team
Staff                   Role                          Email
Dr. Mahsa Salehi        Chief Examiner and Lecturer   mahsa.salehi@monash.edu
Dr. Heshan Kumarage     Admin TA                      heshan.kumarage@monash.edu
Dr. Zahra Mirzamomen    TA                            zahra.mirzamomen@monash.edu
Dr. Chris Yun           TA                            chris.yun@monash.edu
Mike Wang               TA                            mike.wang@monash.edu
Dr. Zhinoos Razavi      TA                            TBA
Naveen Kaushik          TA                            Naveen.Kaushik@monash.edu

Note: Consultation times will be posted on Moodle. There are no consultations in Week 1.


Resources
1. Moodle contains
► Unit Information, Assessments and Discussion Forums
► Lecture Notes: contain active links to recommended
videos & readings

2. Additional textbooks:
► there is no “perfect” Introduction to Data Science textbook available
► a good introductory text available for purchase is The Art of Data Science by Peng & Matsui
► Doing Data Science by Rachel Schutt and Cathy O’Neil
► Python Data Science Handbook by Jake VanderPlas
Resources
3. review of Ed Lessons (will be added)
► LOTS of additional resources and exercises
► get the big picture from articles/videos

4. Be aware also of the:
► library services available
► disability support available
► special consideration policies
Special Consideration
All extensions and special consideration requests will be processed via the
Service Console Lite (SCL) system.

Please refer to this page for more information:


https://www.monash.edu/students/admin/assessments/extensions-special-consideration
Prerequisites
You will need:
► high school level of mathematics and statistics
► basic programming

► a “critical mindset”:
► you will read/view a variety of material
► basic exposure to information technology and internet businesses:
► Amazon, Google, Twitter, ...
Getting Started

How these classes are run


► 2-hour online lecture, Monday 08:00 – 10:00 AM
Zoom Link
Passcode: 901666
► 2-hour applied session: check Allocate+
► watch videos & read background material between
classes
► participate in class activities
► prepare for applied sessions
Unit Schedule
Week  Activities                                      Assignments
1     Overview of data science
2     Introduction to Python for data science
3     Data visualisation and descriptive statistics
4     Data sources and data wrangling
5     Data analysis theory                            Assignment 1
6     Regression analysis
7     Classification and clustering
8     Introduction to R for data science
9     Characterising data and "big" data              Assignment 2
10    Big data processing
11    Issues in data management
12    Industry guest lecture                          Assignment 3


Assessment
Assessment task   Value   Due date
Assignment 1      10%     Week 5
Assignment 2      20%     Week 9
Assignment 3      20%     Week 12
Examination 1     50%     To be advised

► Assignments 1-3: coding tasks based on Python/R/bash, as covered in lectures and applied sessions
► Exam: based on material covered in lectures and applied sessions
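
The weights in the assessment table combine into a final mark as a weighted sum. The sketch below only illustrates that arithmetic: the scores are made up, the function name is hypothetical, and it ignores any hurdle requirements or other unit rules that may apply.

```python
# Assessment weights from the table above, as fractions of the final mark.
WEIGHTS = {
    "Assignment 1": 0.10,
    "Assignment 2": 0.20,
    "Assignment 3": 0.20,
    "Examination 1": 0.50,
}


def weighted_total(scores):
    """Combine per-task scores (each out of 100) into a mark out of 100."""
    return sum(WEIGHTS[task] * scores[task] for task in WEIGHTS)


# Hypothetical scores, made up purely for illustration.
example_scores = {
    "Assignment 1": 80,
    "Assignment 2": 70,
    "Assignment 3": 90,
    "Examination 1": 60,
}

print(f"Weighted total: {weighted_total(example_scores):.1f} / 100")
```

For these made-up scores the contributions are 8 + 14 + 18 + 30, giving a total of 70 out of 100.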
Instructions to participate in the poll (using Poll Everywhere)
§ Visit https://www.polleverywhere.com on your
phone, tablet or laptop
§ Or simply scan the QR code
§ Register using your Monash account
§ Join presentation with the following presenter’s
username:
§ PollEv.com/mahsasalehi868
§ Answer questions when they pop up
§ That’s it!
Poll: Your Background

1. What programming language are you most experienced in?

2. What kinds of data are you familiar with?

PollEv.com/mahsasalehi868
Learning Outcomes (Week 1)

By the end of this week, you should be able to:


► Explain what data science is and describe Drew Conway’s Venn diagram
► Comprehend the usefulness of machine learning
► Explain the different components of a data science process
► Differentiate data science from other related disciplines
► Install Python and start coding with Jupyter Notebook (to be achieved in your applied session)
Overview of Data Science

a quick overview of the context


Poll : Who are the Data Scientists?

person A person B

person C person D
What is Data Science?

how can we define data science?


Defining Data Science
What is Data Science?
“name contains the word ‘science’, so it can’t be one”
► Note: this is an old joke ...
“data science is what a data scientist does”
► a circular definition!
“data science is the technology of handling and extracting value
from data”
► less circular and a bit more useful
“machine learning on big data”
► useful, but too narrow!
What is Data Science?
Definitions: from Wikipedia

Data Science is the extraction of knowledge from data, which is a continuation of the field of data mining and predictive analytics.

Big data is a broad term for data sets so large or complex that traditional data processing applications are inadequate.
What is Data Science?
A quote from Hal Varian (From What is Data
Science?)
The ability to take data and
• to be able to understand it,
• to process it,
• to extract value from it,
• to visualize it,
• to communicate it
that’s going to be a hugely important skill in the
next decades.
Summary

narrow: machine learning on big data


broad: extraction of knowledge/value from data through the complete data lifecycle process
► broad concern with the different stages
► focus on the learning/knowledge discovery
Poll
Which of the following definitions of data science do you like most?

Data Science is
A. machine learning on big data
B. extraction of knowledge/value from data through the complete data lifecycle process
C. almost everything that has something to do with data: collecting, analyzing, modeling, etc.; yet the most important part is its applications, all sorts of applications
Data Science Venn Diagram
Drew Conway’s Venn diagram of data science

Hacking skills: creativity is needed to prepare data for analysis

Math and stat: needed to extract insight from data; a baseline familiarity with these is required

Substantive domain knowledge: what are the goals and constraints of a domain
Data Science Venn Diagram
Drew Conway’s Venn diagram of data science

Hacking skills + Math and stat: machine learning

Substantive domain knowledge + Math and stat: traditional research; PhD researchers spend a lot of their time acquiring expertise in specific areas

Hacking skills + Substantive domain knowledge: the “danger zone”; this can produce what appears to be a legitimate analysis, without any understanding of how it was reached or how to interpret the result
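
The overlaps in the diagram can be written down as a small lookup from skill combinations to regions. This is just an illustrative sketch: the skill identifiers and the `region` function are made up for this example, and the region names follow the labels in Conway's diagram.

```python
# The three circles of the Venn diagram (identifiers are made up here).
SKILLS = frozenset({"hacking", "math_stats", "domain"})

# Each overlap of circles maps to the region it forms in the diagram.
REGIONS = {
    frozenset({"hacking", "math_stats"}): "machine learning",
    frozenset({"math_stats", "domain"}): "traditional research",
    frozenset({"hacking", "domain"}): "danger zone",
    SKILLS: "data science",
}


def region(skills):
    """Return the Venn diagram region for a given set of skills."""
    skills = frozenset(skills)
    unknown = skills - SKILLS
    if unknown:
        raise ValueError(f"unknown skills: {sorted(unknown)}")
    # A single skill (or none) is not an overlap region.
    return REGIONS.get(skills, "no overlap")


print(region({"hacking", "math_stats"}))
print(region({"hacking", "math_stats", "domain"}))
```

Only the combination of all three skill sets lands in the data science region, which is the point of the diagram.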
Data Science Venn Diagram
Drew Conway’s Venn diagram of data science

Conclusion:
§ Combination of different skill sets
§ Diverse skills are needed
Data Science Examples

Some famous data science projects and investigations:


• Google’s spell checker and translation engine
Poll
Provide another data science example.

Data Science Examples

Some famous data science projects and investigations:


• Google’s spell checker and translation engine

Other examples to explore:


• Amazon.com’s recommendation engine
• Facebook’s social network analysis
• Microsoft’s predictive analytics for traffic
Defining Machine Learning
Unlike Data Science, the definition for Machine Learning
is better understood and more agreed upon:
Machine Learning is concerned with the development of
algorithms and techniques that allow computers to learn.

► concerned with building computer programs that can learn, oftentimes with computational output
► but the underlying theory is statistics

see A Gentle Guide to Machine Learning
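
As a very small illustration of a program that "learns" from data using statistics, the sketch below fits a straight line by least squares and uses it to predict an unseen case. The data is made up and the function names are hypothetical; this is one of the simplest possible learners, not a method prescribed by the unit.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept to paired data."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance-like and variance-like sums (unnormalised).
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Made-up training data: hours studied vs. quiz score.
hours = [1, 2, 3, 4, 5]
scores = [52, 55, 61, 64, 68]

# "Learning": estimate the parameters from the data.
slope, intercept = fit_line(hours, scores)


def predict(x):
    """Use the learnt parameters to predict a score for x hours."""
    return slope * x + intercept


print(f"score = {slope:.1f} * hours + {intercept:.1f}")
print(f"predicted score for 6 hours: {predict(6):.1f}")
```

Nothing in the program hard-codes the relationship between hours and scores; the parameters come entirely from the examples, which is what distinguishes learning from hand-written rules.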


Why use Machine Learning?
Machine learning is useful when:
► Human expertise is not available
e.g. Martian exploration

► Many solutions need to be adapted automatically


e.g. user personalisation

► Humans are expensive to use for the work


e.g. handwritten zipcode recognition

image src: theconversation.com, medium.com, blog.prioridata.com


Why use Machine Learning?
Machine learning is useful when:
► Situation changes over time
e.g. junk email

► There are large amounts of data


e.g. discover astronomical objects

image src: lifewire.com, clearlyexplained.com


Why use Machine Learning?

► because you do not want to be this poor guy!
► sifting through all the data by hand
Poll

Which of the following are real-world applications of machine learning?

A. Video games
B. Self-driving cars
C. Spam filtering
D. Predictions
E. All of the options
The Data Science Process

what happens in a Data Science project?


► illustrating the process
► a quick walkthrough illustrating the steps
► the standard value chain
► our model of the process
The Data Science Process:
Illustrating the Process
a quick walkthrough illustrating the steps
The Data Science Process
► Many different tasks come together to complete a
Data Science project
► a data scientist should be familiar with most, but doesn’t
need to be an expert in all
► Not all are labelled as Data Science
► some come from other fields such as computer engineering, business, ...
1. Pitching ideas for data science projects to
investors/managers.

image src: “Young Business Man Holding a Tablet” by Pic Basement


2. Collecting data: researchers preparing to x-ray
a patient.

image src: Stephen Ausmus acquired from USDA ARS, public domain.
3. Integration: Data can come from many different
sources.

icons from Openclipart.org, public domain


4. Interpretation: e.g. data can be described using
a database schema.

image src: Eric, Sql Designer


5. Governance: caring for the data and its subjects.

icons from Openclipart.org, public domain; Good and Evil by AJC ajcann.wordpress.com
5. Governance: managing data standards and formats

“The Web is Agreement” cropped, by Paul Downey


6. Engineering: Data engineers make the back-end
work

image src: by Intel Free Press


7. Wrangling: Inspecting and cleaning the data.

image src: “rstudio” by mararie


8. Modelling: Proposing a conceptual / mathematical
/ functional model.

image src: “Mathematics” by Tom Brown


8. Modelling: Analyst building models with their favourite tool.
8. Modelling: Analysis, statistics and/or machine learning are applied to the data.

image src: “From Data to Wisdom” by Nick Webb


9. Visualisation: Visualising data to interpret it
and present results.

image src: Stephen Ausmus acquired from USDA ARS, public domain
9. Visualisation: Choosing appropriate
visualizations for the data. Many different options
exist!

image src: “Visualization Matrix” cropped, by Lauren Manning


10. Operationalization: putting the results to work.

image src: “Illustration of Strategy” by Denis Fadeev


Poll
Using a short phrase or word, which activity in the data science process is the most interesting to you?
The Data Science Process:
Our Standard Value Chain
our model of the process
Data Science Project Tasks
Collection: getting the data
Engineering: storage and computational resources across full lifecycle
Governance: overall management of data such as security across full
lifecycle
Wrangling: data preprocessing, cleaning
Analysis: discovery (learning, visualisation, etc.)
Presentation: arguing the case that the results are significant and
useful
Operationalisation: putting the results to work, so as to gain benefits
or value
We call this the
Standard Value Chain
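
The stages of the value chain can be pictured as a chain of small functions, each feeding the next. The sketch below is a toy illustration only: the data, function names and alert threshold are all made up, and Engineering and Governance are not modelled because they span the whole lifecycle rather than being single steps.

```python
from statistics import mean


def collect():
    """Collection: get the data (hard-coded here instead of a real source)."""
    return ["21.5", "22.0", "", "n/a", "23.5", "22.0"]


def wrangle(raw):
    """Wrangling: drop unusable entries and convert the rest to numbers."""
    cleaned = []
    for value in raw:
        try:
            cleaned.append(float(value))
        except ValueError:
            continue  # skip missing or malformed readings
    return cleaned


def analyse(values):
    """Analysis: summarise the cleaned data."""
    return {"count": len(values), "mean": mean(values), "max": max(values)}


def present(summary):
    """Presentation: turn the summary into a statement for an audience."""
    return (f"{summary['count']} valid readings, "
            f"mean {summary['mean']:.2f}, max {summary['max']:.1f}")


def operationalise(summary, threshold=23.0):
    """Operationalisation: act on the result (a made-up alert rule)."""
    return "alert" if summary["max"] > threshold else "ok"


summary = analyse(wrangle(collect()))
print(present(summary))
print(operationalise(summary))
```

Keeping each stage as a separate function mirrors the value chain: a stage can be replaced (for example, a real data source in `collect`) without touching the others.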
Data Science Process
from Doing Data Science by Schutt and O’Neil, 2013, (available digitally through library)

Chapter 1 of the book provides the following visualisation of the standard value chain for a data science project:
Data Science Process

A typical data scientist has a different mix of skills as well as domain knowledge
Data Science Process

Data scientist ::= addresses the data science process to extract meaning/value from data
Data Science Process

Chief data scientist: a form of chief scientist who addresses data management, data engineering and data science goals.
Relationship of Data Science to
Other Disciplines
Related: Data Engineering

building scalable systems for storing and processing data

► e.g. Hadoop
► databases, distributed processing, datalakes, cloud computing, GPUs, wrangling, ...
Related: Data Analysis

performing analysis and understanding results


► e.g. R and Microsoft Azure Machine Learning
► machine learning, computational statistics, visualisation, ...
Related: Data Management

managing data through its lifecycle


► ethics, privacy, curation, backup, governance, ...
Learning Outcomes (Recap)

We learnt:
► what data science is and what Drew Conway’s Venn diagram shows
► why machine learning is useful
► the different components of a data science process
► how data science differs from other related disciplines


Applied session: Week 1

§ Guide to installing Anaconda and Python

§ Be prepared to code together in Lecture 2 (Introduction to Python)
