IDS - Lecture 1

Introduction to Data Science Lecture.

Uploaded by

HคMMคď MцʂŤคϝค

Intro To Data Science: Introduction

CUST
Fall 2024

Slide credits: MIT AI, Jeffery, IBM Data Science
Course Learning Outcomes
Course Code: SE4883
CLO1: Understand basic concepts of data science, statistics and probability
and their application in understanding behavior of data
CLO2: Apply basic tools for performing exploratory data analysis and
visualization
CLO3: Understand basic predictive modeling and data analysis methods
CLO4: Learn Python for performing different data science steps
Reference Material
● Material shared through Oddo
● Text Books
○ Cathy O'Neil and Rachel Schutt. Doing Data Science, Straight
Talk From The Frontline. O'Reilly. 2014. (Text Book)
○ Steven S. Skiena, “The Data Science Design Manual”,
Springer, 2017
● Reference Book
○ David M. Diez and Christopher D. Barr. OpenIntro Statistics.
Third Edition
○ An Introduction to Statistical Learning, by Gareth James,
Daniela Witten, Trevor Hastie, and Rob Tibshirani
Some Useful Links
● https://www.Kaggle.com/
● You will need the following installed on your computer:
• Python, version 3.8 or higher.
• Pandas, any version that's compatible with your version of Python.
• NumPy, any version that's compatible with your version of Python.
● Google Colab/ IBM Watson Studio
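A quick way to confirm the setup listed above is a short check script. This is a minimal sketch, not an official course script: the function name check_environment is hypothetical, and it only reports whether the interpreter is recent enough and whether the two packages can be found.

```python
import importlib.util
import sys


def check_environment(min_version=(3, 8), packages=("pandas", "numpy")):
    """Report whether Python is new enough and the listed packages are installed."""
    report = {"python_ok": sys.version_info >= min_version}
    for name in packages:
        # find_spec returns None when the package cannot be located.
        report[name] = importlib.util.find_spec(name) is not None
    return report


report = check_environment()
for item, ok in report.items():
    print(f"{item}: {'OK' if ok else 'missing or too old'}")
```

On Google Colab the packages are normally preinstalled, so the check is mainly useful on a local machine.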
Assessment
● Quizzes 10%
● Assignments 10%
● Project 20%
● Mid-Term 20%
● Final-Term 40%
Project
● A major component of the class: the goal is to take a real-world problem that you
are interested in and apply data science methodologies to gain insight into, or solve,
a problem in that domain

● Work to be done in groups of 3-4 students

● Class projects must focus on a real data problem (ideally using data that
you collect yourself), not an already-curated data set
How to Take the Course
● The course is interesting and challenging, and the subject is in high demand in industry
● Try to spend extra time learning about the content discussed in class
● Practice programming as much as you can
● Try to grasp the math involved (there is not too much here)

Last but not least, HAVE FUN Learning!

Defining Data Science and Data Scientists
Basically,
Data science is the field of exploring, manipulating,
and analyzing data, and using data to answer
questions or make recommendations.
Data Science
Some possible definitions

Data science is the application of

computational and statistical
techniques to address or gain insight
into some problem in the real world

Some possible definitions

Data science = statistics +

data processing +
machine learning +
scientific inquiry +
visualization +
business analytics +
big data + …

So what is the process of data science?

So Now, Who Is a Data Scientist?
Data scientists use their skills to extract knowledge and insights from data to solve real-world
problems.
In Academia:
• Background: typically a scientific background (social science, biology, etc.)
• Works with large datasets
• Addresses challenges related to data structure, size, quality, and complexity
• Applies computational methods to solve problems using data
In Industry:
• Designs experiments
• Manages the process of collecting, cleaning, and munging data
• Performs exploratory data analysis, which combines visualization and data sense
• Finds patterns and builds models and algorithms
• Uses analyses for decision making
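The cleaning and exploratory steps in the list above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the readings in raw are made-up values, and the helper names clean and summarize are hypothetical, not part of any course library.

```python
import statistics

# Hypothetical raw readings; None and "" stand in for missing values.
raw = ["12.5", "13.1", None, "11.9", "", "40.0", "12.8"]


def clean(values):
    """Drop missing entries and convert the remaining strings to floats."""
    return [float(v) for v in values if v not in (None, "")]


def summarize(values):
    """Return a few summary statistics commonly used in exploratory analysis."""
    return {
        "count": len(values),
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "stdev": statistics.stdev(values),
    }


cleaned = clean(raw)
summary = summarize(cleaned)
print(summary)
```

In this made-up sample the mean sits well above the median, which hints that a single large reading (40.0) may be an outlier worth inspecting before any modelling.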
Pop Up Quiz!!!
Question!
What is Data?

Data are the quantities, characters, or symbols on which operations are performed by a
computer, and which may be stored and transmitted in the form of electrical signals and
recorded on magnetic, optical, or mechanical recording media.
Now, let’s learn the definition of Big Data
Types of Data

The following are the types of Big Data:

→Structured
→Semi-structured
→Unstructured

Structured
Any data that can be stored, accessed, and processed in a fixed format is
termed ‘structured’ data.
Over time, computer science has developed effective techniques for working with
this kind of data (where the format is well known in advance) and for deriving
value from it. However, problems arise when the size of such data grows very
large; typical sizes are now in the range of multiple zettabytes.

Do you know? 10^21 bytes equal 1 zettabyte; in other words, one billion terabytes make a zettabyte.
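The zettabyte figure can be checked with simple arithmetic. The sketch below assumes decimal (SI) units, where 1 terabyte is 10^12 bytes; the constant names are chosen for illustration.

```python
# Decimal (SI) unit sizes, in bytes.
BYTES_PER_TERABYTE = 10**12
BYTES_PER_ZETTABYTE = 10**21

# How many terabytes fit into one zettabyte?
terabytes_per_zettabyte = BYTES_PER_ZETTABYTE // BYTES_PER_TERABYTE
print(terabytes_per_zettabyte)
```

The result is 10^9, that is, one billion terabytes per zettabyte.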
Example of Structured Data

An ‘Employee’ table in a database is an example of Structured Data

Employee_ID  Employee_Name    Gender  Department  Salary_In_lacs
2365         Rajesh Kulkarni  Male    Finance     650000
3398         Pratibha Joshi   Female  Admin       650000
7465         Shushil Roy      Male    Admin       500000
7500         Shubhojit Das    Male    Finance     500000
7699         Priya Sane       Female  Finance     550000
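Because every row of the Employee table has the same fixed columns, it can be loaded and queried directly. The sketch below uses only the Python standard library and assumes the table is supplied as comma-separated text; pandas would work equally well. The grouping by department is an illustrative query, not something the slides prescribe.

```python
import csv
import io
from collections import defaultdict

# The Employee table from the slide, written as comma-separated text.
EMPLOYEE_CSV = """Employee_ID,Employee_Name,Gender,Department,Salary_In_lacs
2365,Rajesh Kulkarni,Male,Finance,650000
3398,Pratibha Joshi,Female,Admin,650000
7465,Shushil Roy,Male,Admin,500000
7500,Shubhojit Das,Male,Finance,500000
7699,Priya Sane,Female,Finance,550000
"""

# Each row becomes a dict keyed by the column names in the header line.
rows = list(csv.DictReader(io.StringIO(EMPLOYEE_CSV)))

# Group salaries by department, then average each group.
salaries_by_department = defaultdict(list)
for row in rows:
    salaries_by_department[row["Department"]].append(int(row["Salary_In_lacs"]))

average_salary = {
    dept: sum(values) / len(values)
    for dept, values in salaries_by_department.items()
}
print(average_salary)
```

The fixed schema is what makes the query this short: every row is guaranteed to have a Department and a Salary_In_lacs field.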

Type of Data
Unstructured
Any data whose form or structure is unknown is classified as unstructured data.
In addition to its huge size, unstructured data poses multiple challenges in
processing it to derive value.
A typical example of unstructured data is a heterogeneous data source containing a
combination of simple text files, images, videos, etc.
Nowadays organizations have a wealth of data available, but unfortunately
they often don’t know how to derive value from it because the data is in its raw,
unstructured form.
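One simple way to start deriving value from raw text is to impose a little structure on it, for example by counting words. This is a minimal sketch with a made-up sentence; real unstructured sources (images, video, mixed files) need far more specialised processing.

```python
import re
from collections import Counter

# A made-up piece of free text standing in for an unstructured source.
text = "Data science uses data. Big data is data with huge size."

# Lowercase the text and extract runs of letters as words.
words = re.findall(r"[a-z]+", text.lower())

# Counter turns the flat word list into a word -> frequency table.
counts = Counter(words)
print(counts.most_common(3))
```

The frequency table is a structured view of the text: it has fixed fields (word, count) even though the input had none.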
Example of Unstructured Data

The output returned by ‘Google Search’

Type of Data
Semi-structured
Semi-structured data can contain both forms of data.
Semi-structured data looks structured in form, but it is not actually defined
by, for example, a table definition in a relational DBMS.
An example of semi-structured data is
• Personal data represented in an XML file.
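A short sketch of what personal data in an XML file can look like, and how it can be read with Python's standard xml.etree.ElementTree module. The element names (people, rec, name, sex, age) and the records themselves are hypothetical, chosen only for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical personal records; note the second record has no <age> element.
PEOPLE_XML = """
<people>
  <rec><name>Person A</name><sex>Male</sex><age>35</age></rec>
  <rec><name>Person B</name><sex>Female</sex></rec>
</people>
""".strip()

root = ET.fromstring(PEOPLE_XML)

# Turn each <rec> element into a dict of tag -> text.
records = []
for rec in root.findall("rec"):
    records.append({child.tag: child.text for child in rec})

print(records)
```

The tags give each value a label, so the data is self-describing, yet nothing forces every record to carry the same fields. That flexibility is what separates it from a relational table.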
Related Concepts
Cloud Computing

The term “cloud computing” can be used to describe applications and data that users
access over the Internet rather than on their local computer.
Cloud Computing Services/Service Providers
● Amazon Web Services
● Google Cloud
● IBM Watson Studio
Big Data

Big Data is a collection of data that is huge in volume and still growing exponentially with
time.
• Its size and complexity are so great that no traditional data
management tool can store or process it efficiently.
• Big Data is still data, just of huge size.
Example of Big Data?

The New York Stock Exchange is an example of a Big Data source: it generates about one
terabyte of new trade data per day.
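To get a feel for that figure, one terabyte per day can be converted into an average rate per second. The sketch assumes decimal units (1 TB = 10^12 bytes) and, as a simplification, spreads the data evenly over a full 24 hours; actual trading is concentrated in market hours, so peak rates would be higher.

```python
# Decimal units: 1 terabyte = 10**12 bytes, 1 megabyte = 10**6 bytes.
BYTES_PER_TERABYTE = 10**12
BYTES_PER_MEGABYTE = 10**6
SECONDS_PER_DAY = 24 * 60 * 60

# Average rate if one terabyte arrived evenly over a whole day.
bytes_per_second = BYTES_PER_TERABYTE / SECONDS_PER_DAY
megabytes_per_second = bytes_per_second / BYTES_PER_MEGABYTE
print(f"{megabytes_per_second:.2f} MB/s on average")
```

Under these assumptions the average works out to roughly 11.6 megabytes every second, around the clock.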
Characteristics of Big Data

Big data can be described by the following characteristics:
- Volume
- Variety
- Velocity

Volume – The name Big Data itself refers to an enormous size. The size of data plays a crucial
role in determining the value that can be derived from it. Whether a particular data set can be considered Big Data
also depends on its volume. Hence, ‘Volume’ is one characteristic that needs to be
considered when dealing with Big Data solutions.

Variety – The next aspect of Big Data is its variety.

- Variety refers to heterogeneous sources and the nature of data, both structured and unstructured.
- In earlier days, spreadsheets and databases were the only sources of data considered by most
applications. Nowadays, data in the form of emails, photos, videos, monitoring devices, PDFs, audio, etc.
is also considered in analysis applications.
- This variety of unstructured data poses certain issues for storing, mining, and analyzing data.

Velocity – The term ‘velocity’ refers to the speed of generation of data.

- How fast data is generated and processed to meet demand determines the real potential of the data.
- Big Data velocity deals with the speed at which data flows in from sources such as business processes,
application logs, networks, social media sites, sensors, mobile devices, etc.
- The flow of data is massive and continuous.
???

What are the 5 V’s of Big data?

Big Data Processing Tools

Big Data processing technologies provide ways to work with
large sets of structured, semi-structured, and
unstructured data so that value can be derived from
big data.
What is Hadoop
Hadoop is an open-source, Java-based framework that
manages the storage and processing of large amounts of
data for applications.

Hadoop uses distributed storage and parallel processing to
handle big data and analytics jobs, breaking workloads down
into smaller workloads that can be run at the same time.
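The idea of splitting a job into smaller pieces that run at the same time can be sketched in plain Python. This is not Hadoop code and does not use any Hadoop API: it is a single-machine analogy that counts words in separate chunks using a thread pool and then merges the partial results. The function names and sample chunks are hypothetical.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def count_words(chunk):
    """'Map' step: count the words in one chunk of text."""
    return Counter(chunk.split())


def parallel_word_count(chunks, workers=2):
    """Count words per chunk concurrently, then merge the partial counts."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partial_counts = list(pool.map(count_words, chunks))

    # 'Reduce' step: combine the per-chunk counts into one total.
    total = Counter()
    for partial in partial_counts:
        total.update(partial)
    return total


# Made-up chunks standing in for blocks of a large file.
chunks = ["big data big", "data science", "big science data"]
totals = parallel_word_count(chunks)
print(totals)
```

In a real cluster the chunks would live on different machines and the work would run where the data is stored; the thread pool here only mimics the split-then-merge structure.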
Self Learning

What is Apache Hive?

What is Apache Spark?
HomeWork
Data Science and AI
AI
AI is the branch of computer science that includes the development of systems that
can replicate tasks associated with human intelligence.
Machine learning
Machine learning is a subset of AI that uses computer algorithms to learn about data
and make predictions with it
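As a small illustration of learning from data and then predicting, the sketch below fits a straight line by ordinary least squares and uses it on an unseen input. The data (hours studied versus score) is made up, and a line fit is only one of many possible learning methods, chosen here because it needs no extra libraries.

```python
def fit_line(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Made-up training data: hours studied and the resulting score.
hours = [1, 2, 3, 4, 5]
scores = [52, 54, 56, 58, 60]

# "Learning": estimate the line's parameters from the data.
slope, intercept = fit_line(hours, scores)

# "Prediction": apply the learned line to an input not seen in training.
predicted = slope * 6 + intercept
print(slope, intercept, predicted)
```

Because the made-up points lie exactly on a line, the fit recovers a slope of 2 and an intercept of 50; real data would be noisy and the fit only approximate.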
Deep Learning
Deep learning is a subset of machine learning that uses layered neural networks to
simulate human decision-making.
Generative AI
Generative artificial intelligence refers to the use of AI to create new content, like text,
images, music, audio, and videos.
Data Science combines math and statistics,
specialized programming, advanced analytics, AI and
machine learning with specific subject matter
expertise to uncover actionable insights hidden in an
organization's data.
-- Leveraging Data Science?
Class Activity
Explore Data Science Job Listing
For this activity, find a data science job posting on a job
board of your choice, such as LinkedIn, Indeed, or Rozee.pk.
Analyze the posting by responding to the following questions and statements

Identify the following aspects of the data science job post:

1. What is the company name that is advertising the job?

2. What is the job title?
3. Where is the role located?
4. What is the expected salary or salary range?
5. What is the total number of results from the search for the job post?
6. What is one technical responsibility from the job post related to something you
learned about in this course?
7. What are two required technical skills from the job post?
8. What are at least two ideas or concepts you learned about in this course relevant
to these job posts?
