
00 Overview

This document provides an overview of an introductory course on big data analytics. It outlines the instructor and teaching assistant contact information, course description, materials, grading policy, homework and projects, and topics to be covered including data mining concepts, distributed platforms, and advanced topics. The goal of the course is to analyze large diverse data using effective algorithms efficiently. The tentative schedule covers topics before and after midterm ranging from introduction to data preprocessing, mining, classification, clustering, distributed platforms, and a term project.

Uploaded by

Musa Savage

Course Overview:

Introduction to Big Data Analytics

B. Ceesay
March 23, 2022
Instructor & TA
• Instructor
– B. Ceesay
– Senior Instructor, Dept. CS, School of ITC
– E-mail: bamfa@utg.edu.gm
– Tel: No calls; please contact the teaching assistant
– Office Hours: 08:00-10:30 am, every Wednesday
• TA
– Mr. Baldeh
– E-mail: hb21812457@utg.edu.gm

3/23/2022 UTG CS, ITC 2


Course Description
• Course Web Page:
– School portal: https://utg.gm/
– Please check often for the latest announcements and updates to the
schedule, slides, and homework
• Time: 08:00-10:30am
• Classroom: PB105



Course Materials
• Textbook:
– Jiawei Han, Micheline Kamber and Jian Pei, Data Mining: Concepts and
Techniques, 3rd ed., Morgan Kaufmann Publishers, July 2011. [DM3]
(selected chapters)
• Prerequisites:
– Basic knowledge of data structures (and algorithms), database
systems, discrete math (probability), and operating systems
– Programming experience is *required* for completing
homework & projects



Target Audience
• CSIE juniors and seniors
• Students who are interested in big data
analytics and willing to learn practical
skills



Additional Reading Materials
• Reference Books
– Tom White, Hadoop: The Definitive Guide, 4th ed., O’Reilly Media,
2015.
– Jules S. Damji, Brooke Wenig, Tathagata Das, Denny Lee,
Learning Spark: Lightning-Fast Big Data Analysis, 2nd ed., O'Reilly
Media, Jul. 2020.
– Jimmy Lin and Chris Dyer, Data-Intensive Text Processing with
MapReduce, Morgan & Claypool Publishers, 2010.
– Jure Leskovec, Anand Rajaraman, Jeffrey David Ullman, Mining
of Massive Datasets, 3rd Edition, Cambridge University Press,
2020.
• Official online documents: Hadoop, Spark, …
• Academic papers
Grading Policy
• Homework assignments and programming
exercises: ~25%
• Attendance: ~10%
– 75% attendance is required.
– Students with less than 75% attendance may not take the final exam.
• Term project: ~15%
– Including proposal, presentation, and final report
– Physical (or online) presentation
• All homework, projects, and reports must be
submitted before the deadline
– Late submissions must be made *before* the end of
the semester, unless specified otherwise
Homework and Projects
• About 3 written assignments
– Concepts
• About 3 team-based programming exercises
– Maximum number of students per team: 2
• One term project on data analysis or system
development
– Team-based (same teams as for the programming exercises)
– E.g., find interesting datasets and appropriate methods
to analyze them
• The responsibility of each member must be specified in
the document



About the Term Project
– Proposals, presentations, and reports are *required*
for each team, and will be counted in the score
• The score you’ll get depends on the functionality, difficulty,
and quality of your project
– Proposal: A one-page description of what kinds of data
you are going to analyze and how
• You can either write your own program or call existing open
source code or libraries (but NOT merely execute existing binaries)
– Presentation: 20 minutes per team
• System functions and correctness
– Reports: slides, source code, and document



Homework Submission
• Way of Submission
– Written exercises are to be submitted in class
• An electronic version submitted online is also acceptable
– Systems, programs, project proposals, and
project reports in electronic files must be
submitted online to the portal
– Late or missing homework submissions
receive a score of zero



What This Course is about
• This course can help you
– Understand the general idea of big data
– Obtain a rough idea of data mining
– Have an idea of what distributed platforms like
Hadoop/Spark can do
• And MORE than these!
– Technical and practical skills for:
• Environment setup of distributed computing
• Data analysis
• Data mining methods
• Parallel programming



Big Data: Some Examples
• Topic detection and tracking
• Trend analysis
• Social network analysis
• PageRank
• Predictive analytics
• Many others: healthcare, natural resources,
education, public sector, insurance,
transportation, finance, and crime detection
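To make one of the examples above concrete, here is a minimal sketch of PageRank computed by power iteration on a toy graph. The graph `toy_web`, the function name `pagerank`, the damping factor 0.85, and the fixed iteration count are all illustrative assumptions, not course material; a real deployment would run on a distributed platform rather than in a single Python process.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Power-iteration PageRank on a small directed graph.

    `links` maps each page to the list of pages it links to.
    """
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page receives the same "teleport" share.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # Split this page's rank evenly among its out-links.
                share = damping * rank[page] / len(targets)
                for t in targets:
                    new_rank[t] += share
            else:
                # Dangling page: spread its rank evenly over all pages.
                share = damping * rank[page] / n
                for t in pages:
                    new_rank[t] += share
        rank = new_rank
    return rank


# Hypothetical four-page web, used only for illustration.
toy_web = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
ranks = pagerank(toy_web)
for page, score in sorted(ranks.items(), key=lambda kv: -kv[1]):
    print(page, round(score, 4))
```

In this toy graph, page C collects links from three pages and ends up with the highest score, while D, which nobody links to, keeps only the teleport share.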

What is Big Data?
• Big data is a term for data sets that are so
large or complex that traditional data
processing application software is
inadequate to deal with them
• Challenges include capture, storage,
analysis, data curation, search, sharing,
transfer, visualization, querying, updating
and information privacy [source:
Wikipedia]
The Four V’s of Big Data

[Figure: the four V's of big data: volume, velocity, variety, and veracity] [source: IBM]
Data analysis vs. Data analytics
• “Analysis is the separation of a whole into its
component parts, and analytics is the method of
logical analysis.” [source: Merriam-Webster
dictionary]
• “Analysis is really a heuristic activity, where
scanning through all the data the analyst gains
some insight.” [source: Quora.com]
• “Analytics is about applying a mechanical or
algorithmic process to derive the insights for
example running through various data sets
looking for meaningful correlations between
them.” [source: Quora.com]
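The "mechanical or algorithmic process" in the last quote can be sketched as a scan over every pair of columns that reports the strongly correlated ones. The column names, the numbers, and the 0.9 threshold below are made up for illustration; the Pearson coefficient is just one possible choice of correlation measure.

```python
from itertools import combinations
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def strong_correlations(columns, threshold=0.9):
    """Return (name_a, name_b, r) for every pair with |r| >= threshold."""
    found = []
    for (name_a, a), (name_b, b) in combinations(columns.items(), 2):
        r = pearson(a, b)
        if abs(r) >= threshold:
            found.append((name_a, name_b, round(r, 3)))
    return found


# Made-up data, for illustration only.
columns = {
    "ad_spend": [1, 2, 3, 4, 5],
    "sales": [2, 4, 6, 8, 10],
    "rainfall": [3, 1, 4, 1, 5],
}
pairs = strong_correlations(columns)
print(pairs)
```

Here only the `ad_spend`/`sales` pair passes the threshold. Deciding whether such a correlation means anything is where the analyst's judgment (the "analysis" side) comes back in.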



Related Terms
• Data science, predictive analytics
• Business intelligence, FinTech
• IoT, CPS, Industry 4.0
• Smart homes, smart cities
• Data mining, machine learning, artificial
intelligence
• Cloud computing, data-intensive computing,
parallel computing, distributed computing
•…
The Major Focus in Big Data
Analytics
• Management of the complete data
lifecycle
– Collecting
– Cleansing
– Organizing
– Storing
– Analyzing
– Visualizing
– Governing



The Goal of the Course
• Big data analytics aims to analyze huge
amounts of diverse data with effective
algorithms in an efficient way
– Data preprocessing
– Data mining
– Parallel programming in distributed
platforms



The Big Picture of An Example
Workflow
[Figure: example workflow diagram. Labels shown: data source; open APIs / crawler; search; data extraction; topic-relevant data; topic similarity estimation; distributed storage; data/task dispatch; distributed analysis; feature extraction; regression/classification/clustering; machine learning; visualization & feedback; presentation; data analyst; topic; analytical need; feedback.]



The Topics to be Covered
• Introduction
• Data mining concepts
– Data preprocessing
– Frequent pattern mining
– Data classification
– Data clustering
• Distributed platform and parallel programming
– Introducing distributed platforms: Hadoop, Spark
– Parallel programming paradigm and concepts
– MapReduce programming
• Advanced topics
– Link analysis
– Mining social network graphs
– Dimensionality reduction
– Large-scale machine learning
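As a preview of the MapReduce programming topic, the classic word-count example can be sketched in plain Python. This is a single-process imitation of the map, shuffle, and reduce steps under illustrative assumptions (the function names and the sample documents are hypothetical); it does not use the Hadoop or Spark APIs, which distribute the same steps across many machines.

```python
from collections import defaultdict


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one document."""
    for word in document.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine the values of one key into a single count."""
    return key, sum(values)


def word_count(documents):
    mapped = (pair for doc in documents for pair in map_phase(doc))
    grouped = shuffle(mapped)
    return dict(reduce_phase(k, v) for k, v in grouped.items())


# Hypothetical input documents.
docs = ["big data needs big tools", "data mining finds patterns in data"]
counts = word_count(docs)
print(counts)
```

Because each map call looks at one document and each reduce call looks at one key, both phases can run in parallel on different machines; only the shuffle step needs to move data between them.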



Tentative Schedule
• Before midterm
– Introduction (1 wk)
– Data preprocessing (1 wk)
– Frequent pattern mining (2 wks)
– Data classification (2 wks)
– Data clustering (2 wks)
• After midterm
– Introducing distributed platforms: Hadoop, Spark (1 wk)
– Parallel programming paradigm and concepts (1 wk)
– MapReduce programming (1-2 wks)
– Term Project Presentation (2-3 wks)
Tentative Schedule
Week | Date | Content                                             | Note
1    | 3/16 | (Off)                                               |
     | 3/24 | Course Overview                                     |
2    | 3/30 | Introduction to Big Data Analytics                  |
     | 4/06 | Getting to Know Your Data                           |
3    | 4/13 | Data Preprocessing                                  |
     | 4/20 |                                                     | HW#1
4    | 4/20 | Frequent Pattern Mining                             |
     | 4/27 |                                                     |
5    | 5/4  | Ch.6, Frequent Pattern Mining                       |
     | 5/11 | Term Project Proposal & Team Registration           | Due: HW#1
6    | 5/18 | Ch.8, Classification: Basic Concepts                | HW#2
     | 5/25 |                                                     |
7    | 6/1  | Ch.8, Classification                                |
8    | 6/8  | Ch.9, Classification: Advanced Methods              | Due: HW#2; HW#3
9    | 6/15 | Ch.10, Cluster Analysis: Basic Concepts and Methods |
10   | 6/22 | Distributed Platforms: Hadoop, Spark                | Due: HW#3
11   | 6/30 | Parallel Programming Paradigms & Concepts           | Due: Proposal
12   | 7/6  | MapReduce Programming                               | HW#4: Hadoop & Spark
13   | 7/13 | Spark Programming                                   | Due: HW#4
14   | 7/20 | Term Project Presentation: Week 1                   |
15   | 7/27 | Term Project Presentation: Week 2                   | Due: Final report



Notes on Homework
• Rule 1: Plagiarism is prohibited.
– Near-duplicate code submissions will all receive the same
minimum basic score
• Rule 2: Clear documentation is required in
your system projects.
– Instructions on downloading, installing,
configuring, and executing your code and any
open source libraries, APIs, or code it uses must be
submitted
– Package control is recommended
More on Term Projects
• Tentative schedule for all teams:
– Proposal: *required* one week after midterm (June 6,
2022)
– Presentations (including demos): *required* in the
last two to three weeks (starting as early as July 20 & 27,
2022)
– Final report: *required* before the end of the
semester (July 27, 2022)
• Slides, source code, documentation



Notes on System Development
• You can write your own code in any
programming language
• You can also call existing open source APIs or
libraries
– E.g. Hadoop, Spark, TensorFlow, Keras, PyTorch, …
– But simply running existing binaries or
commercial tools is NOT acceptable
• Any topic relevant to big data analytics
– Analysis of various types of media (text, Web, social
media) with big data characteristics (3V’s) using data
mining methods (regression, classification, clustering)
– Public datasets: UCI ML repository, Kaggle, …



Some Example Open Source
Tools for Big Data Analytics
• Apache Hadoop, Spark (in Java, Scala,
Python, R)
– For distributed computing and data analysis
• Apache Pig, Hive, Flume, HBase, Cassandra,
Alluxio, Mahout, …
– For data flow, SQL, streaming data, distributed
databases, distributed storage, machine learning, …
• Spark SQL, Streaming, MLlib, GraphX
– For SQL, streaming, machine learning, and graph
processing
Conferences and Journals on
Data Mining
• KDD Conferences
– ACM SIGKDD Int. Conf. on Knowledge
Discovery in Databases and Data Mining (KDD)
– SIAM Data Mining Conf. (SDM)
– (IEEE) Int. Conf. on Data Mining (ICDM)
– European Conf. on Machine Learning and
Principles and Practice of Knowledge Discovery
in Databases (ECML-PKDD)
– Pacific-Asia Conf. on Knowledge Discovery and
Data Mining (PAKDD)
– Int. Conf. on Web Search and Data Mining
(WSDM)
• Other related conferences
– DB conferences: ACM SIGMOD, VLDB, ICDE, EDBT,
ICDT, …
– Web and IR conferences: WWW, SIGIR, WSDM
– ML conferences: ICML, NIPS
– PR conferences: CVPR, …
• Journals
– Data Mining and Knowledge Discovery (DAMI or
DMKD)
– IEEE Trans. on Knowledge and Data Eng. (TKDE)
– KDD Explorations
– ACM Trans. on KDD



Where to Find References?
DBLP, CiteSeer, Google
• Data mining and KDD:
– Conferences: ACM-SIGKDD, IEEE-ICDM, SIAM-DM, PKDD,
PAKDD, etc.
– Journals: Data Mining and Knowledge Discovery, KDD
Explorations, ACM TKDD, IEEE TKDE
• Database systems
– Conferences: ACM-SIGMOD, ACM-PODS, VLDB, IEEE-ICDE,
EDBT, ICDT, DASFAA
– Journals: IEEE-TKDE, ACM-TODS/TOIS, JIIS, J. ACM, VLDB J.,
Info. Sys., etc.
• AI & Machine Learning
– Conferences: Machine learning (ML), AAAI, IJCAI, COLT
(Learning Theory), CVPR, NIPS, etc.
– Journals: Machine Learning, Artificial Intelligence, Knowledge
and Information Systems, IEEE-PAMI, etc.



• Web and IR
– Conferences: ACM SIGIR, WWW, CIKM, etc.
– Journals: WWW: Internet and Web Information
Systems, etc.
• Statistics
– Conferences: Joint Stat. Meeting, etc.
– Journals: Annals of Statistics, etc.
• Visualization
– Conferences: CHI, ACM SIGGRAPH, etc.
– Journals: IEEE Trans. Visualization and Computer
Graphics, etc.
Recent Resources on Big Data
• Conferences
– IEEE International Conference on Big Data
(IEEE Big Data)
– IEEE International Congress on Big Data
(BigData Congress)
• Journals
– IEEE Transactions on Big Data
– ACM Transactions on Data Science
– Journal of Big Data
When it comes to Big Data…
• It’s all about scalability!
– Scale up (vertical scalability)
– Scale out (horizontal scalability)
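Scaling out usually starts with deciding which machine owns which piece of the data. Below is a minimal sketch of hash partitioning, assuming string keys and a fixed number of workers; the key names and worker counts are hypothetical, and `zlib.crc32` is used only because it gives the same result on every run (Python's built-in `hash` for strings is randomized per process).

```python
import zlib


def partition(keys, num_workers):
    """Assign each key to one of `num_workers` shards by hashing it."""
    shards = [[] for _ in range(num_workers)]
    for key in keys:
        index = zlib.crc32(key.encode("utf-8")) % num_workers
        shards[index].append(key)
    return shards


# Hypothetical record keys.
keys = [f"user{i}" for i in range(1000)]

# Scaling out: the same data spread over more workers.
shards_2 = partition(keys, 2)
shards_4 = partition(keys, 4)
print([len(s) for s in shards_2])
print([len(s) for s in shards_4])
```

Scaling up would instead keep a single shard and give that one machine more memory or CPU. Note that with this simple modulo scheme, changing the number of workers moves many keys to a different shard; schemes such as consistent hashing are commonly used to reduce that movement.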



Thanks for Your Attention!

