CSD 1043: Big Data Fundamentals Week1: Big Data Landscape: Definitions
Data vs information: data is raw facts; information is the meaning derived from those facts.
Data science: Data science is the study of where information comes from, what it represents,
and how it can be turned into a valuable resource in the creation of business value. Data science
offers a powerful new approach to making discoveries. By combining aspects of statistics,
computer science, applied mathematics, and visualization, data science can turn the vast
amounts of data the digital age generates into new insights and new knowledge.
Data analytics: The application of software to derive information or meaning from data. The
end result might be a report, an indication of status, or an action taken automatically based on
the information received.
Data mining: The process of deriving patterns or knowledge from large data sets.
Data marts & data warehouse: A data mart is one piece of a data warehouse in which all the
information relates to a specific business area. It is therefore a subset of the data stored in the
warehouse; taken together, all of an organization's data marts make up its data warehouse.
Data visualization: Today, data visualization has become a rapidly evolving blend of science and
art that is certain to change the corporate landscape over the next few years. Tableau Public is a
popular data visualization tool that's also completely free.
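As a minimal illustration in code (using matplotlib, an assumed library choice; the figures are hypothetical):

```python
# A minimal bar chart: monthly sales for a hypothetical retailer.
import matplotlib.pyplot as plt

months = ["Jan", "Feb", "Mar", "Apr"]
sales = [120, 135, 128, 160]  # hypothetical figures

plt.bar(months, sales)
plt.title("Monthly Sales (hypothetical data)")
plt.xlabel("Month")
plt.ylabel("Units sold")
plt.show()
```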
Data wrangling: taking messy, incomplete, or overly complex data and simplifying and/or
cleaning it so that it's usable for analysis. Pandas is one of the most popular Python libraries
for data wrangling.
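A minimal wrangling sketch with pandas; the data and column names here are hypothetical:

```python
# A small data-wrangling pass with pandas: drop duplicates, remove
# unusable rows, and fix column types.
import pandas as pd

raw = pd.DataFrame({
    "customer": ["Ann", "Ann", "Bob", None],
    "signup":   ["2021-01-04", "2021-01-04", "2021-02-11", "2021-03-02"],
    "spend":    ["100", "100", None, "42"],
})

clean = (
    raw.drop_duplicates()                                     # remove exact duplicate rows
       .dropna(subset=["customer"])                           # unusable without a customer
       .assign(signup=lambda d: pd.to_datetime(d["signup"]),  # parse date strings
               spend=lambda d: pd.to_numeric(d["spend"]))     # strings -> numbers
)
print(clean)
```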
Data governance: data governance establishes the rules of the data-use game. Data
governance becomes the function that owns the quality of data across the organization. The
participating policy makers ensure that standards are in place, that data quality is monitored,
and that new/emerging data and data sources are always tied into the rest of the data picture
for the business. Because many industries have highly refined data that is easy to capture and
consistently formatted, data governance is sometimes viewed as an information technology (IT) function.
Machine learning: The use of algorithms to allow a computer to analyze data for the purpose of
“learning” what action to take when a specific pattern or event occurs.
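As a minimal sketch of this pattern-to-action idea, the toy example below trains a classifier with scikit-learn (an assumed library choice; the feature values and labels are hypothetical):

```python
# Learn a pattern -> action mapping from labeled examples.
from sklearn.tree import DecisionTreeClassifier

# Hypothetical sensor readings (temperature, vibration) and the
# action taken for each (0 = ignore, 1 = raise an alert).
X = [[20, 0.1], [22, 0.2], [80, 0.9], [75, 0.8]]
y = [0, 0, 1, 1]

model = DecisionTreeClassifier().fit(X, y)
print(model.predict([[78, 0.85]]))  # -> [1]: the learned pattern says "alert"
```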
Data clustering: Data analysis for the purpose of identifying similarities and differences among data
sets so that similar data sets can be clustered together.
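A minimal clustering sketch using k-means from scikit-learn (an assumed tool; the points are hypothetical):

```python
# Group similar 2-D points with k-means.
from sklearn.cluster import KMeans

points = [[1, 1], [1.5, 2], [1, 0.5],   # one tight group
          [9, 9], [8.5, 10], [9, 8]]    # another tight group

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(points)
print(km.labels_)  # e.g. [0 0 0 1 1 1]: similar points share a cluster label
```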
Correlation analysis: Analysis that measures the strength and direction of the statistical
relationship between two variables (see the sketch below).
Descriptive analytics: Analytics that summarizes historical data to describe what has happened.
Predictive analytics (modelling): Analytics that uses statistical or machine learning models built
on historical data to estimate what is likely to happen.
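A minimal correlation-analysis sketch with pandas, using hypothetical ad-spend and revenue figures:

```python
# Pearson correlation between two hypothetical variables.
import pandas as pd

df = pd.DataFrame({
    "ad_spend": [10, 20, 30, 40, 50],
    "revenue":  [12, 24, 31, 38, 52],
})

print(df["ad_spend"].corr(df["revenue"]))  # close to 1.0: strong positive correlation
```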
Structured data vs. unstructured data: spreadsheets vs. emails.
Unstructured data files often include text and multimedia content. Examples include e-mail
messages, word processing documents, videos, photos, audio files, presentations, webpages
and many other kinds of business documents. Note that while these sorts of files may have an
internal structure, they are still considered "unstructured" because the data they contain
doesn't fit neatly in a database.
Experts estimate that 80 to 90 percent of the data in any organization is unstructured. And the
amount of unstructured data in enterprises is growing significantly — often many times faster
than structured databases are growing.
Semi-Structured Data: In addition to structured and unstructured data, there's also a third
category: semi-structured data. Semi-structured data is information that doesn't reside in a
relational database but that does have some organizational properties that make it easier to
analyze. Examples of semi-structured data might include XML documents.
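A short sketch of working with semi-structured data: the XML below (a hypothetical customer list) has no fixed relational schema, yet its tags are organized enough to query with Python's standard library:

```python
# XML is semi-structured: fields may vary from record to record,
# but tags give the data enough organization to query.
import xml.etree.ElementTree as ET

doc = """
<customers>
  <customer id="1"><name>Ann</name><city>Toronto</city></customer>
  <customer id="2"><name>Bob</name></customer>  <!-- no city: fields may vary -->
</customers>
"""

root = ET.fromstring(doc)
for c in root.findall("customer"):
    name = c.findtext("name")
    city = c.findtext("city", default="unknown")
    print(c.get("id"), name, city)
```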
Data integrity: The measure of trust an organization has in the accuracy, completeness,
timeliness, and validity of the data.
Data set: A collection of data, typically in tabular form.
Data security: The practice of protecting data from destruction or unauthorized access.
Petabyte: Approximately one million gigabytes, or 1,024 terabytes.
Exabyte: One million terabytes, or 1 billion gigabytes of information.
Exploratory data analysis: An approach to data analysis focused on identifying general patterns
in data, including outliers and features of the data that are not anticipated by the
experimenter’s current knowledge or preconceptions. EDA aims to uncover underlying
structure, test assumptions, detect mistakes, and understand relationships between variables.
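A minimal first EDA pass with pandas, on a hypothetical measurements dataset:

```python
# Summary statistics, pairwise relationships, and a quick outlier check.
import pandas as pd

df = pd.DataFrame({"height_cm": [158, 171, 165, 180, 240, 169],
                   "weight_kg": [55, 68, 61, 79, 80, 74]})

print(df.describe())              # count, mean, std, min/max, quartiles
print(df.corr())                  # relationships between variables
print(df[df["height_cm"] > 220])  # flag suspicious outliers for inspection
```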
Hadoop: Hadoop is an open source, Java-based programming framework that supports the
processing and storage of extremely large data sets in a distributed computing environment. It
is part of the Apache project sponsored by the Apache Software Foundation. Hadoop makes it
possible to run applications on systems with thousands of commodity hardware nodes, and to
handle thousands of terabytes of data. Its distributed file system facilitates rapid data transfer
rates among nodes and allows the system to continue operating in case of a node failure. This
approach lowers the risk of catastrophic system failure and unexpected data loss, even if a
significant number of nodes become inoperative. Consequently, Hadoop quickly emerged as a
foundation for big data processing tasks, such as scientific analytics, business and sales
planning, and processing enormous volumes of sensor data, including data from Internet of
Things (IoT) sensors.
NoSQL: A class of database management system that does not use the relational model. NoSQL
is designed to handle large data volumes that do not follow a fixed schema and is ideally suited
for very large data volumes that do not require the relational model. MongoDB is an
open-source NoSQL database.
Graph database: A type of NoSQL database that uses graph structures for semantic queries,
with nodes, edges, and properties to store, map, and query relationships in data.
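A minimal sketch of schema-free storage with MongoDB, mentioned above. It assumes the pymongo driver and a MongoDB server running on localhost; the database and collection names are hypothetical:

```python
# Documents in the same MongoDB collection need not share a schema.
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
things = client["demo"]["things"]

things.insert_one({"kind": "sensor", "model": "T-100", "reading": 21.5})
things.insert_one({"kind": "person", "name": "Ann"})  # different fields, same collection

for doc in things.find({"kind": "sensor"}):
    print(doc)
```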
IoT: A thing, in the Internet of Things, can be a person with a heart monitor implant, a farm
animal with a biochip transponder, an automobile that has built-in sensors to alert the driver
when tire pressure is low -- or any other natural or man-made object that can be assigned an IP
address and provided with the ability to transfer data over a network. The huge increase in
address space brought by IPv6, which offers 2^128 (about 3.4 × 10^38) unique addresses versus
IPv4's 2^32 (about 4.3 billion), is an important factor in the development of the Internet of
Things. According to Steve Leibson, who identifies himself as “occasional docent at the
Computer History Museum,” the address space expansion means that we could “assign an IPv6
address to every atom on the surface of the earth, and still have enough addresses left to do
another 100+ earths.” In other words, humans could easily assign an IP address to every "thing"
on the planet. An increase in
the number of smart nodes, as well as the amount of upstream data the nodes generate, is
expected to raise new concerns about data privacy, data sovereignty and security. Practical
applications of IoT technology can be found in many industries today, including precision
agriculture, building management, healthcare, energy and transportation.
Map reduce: A general term for the process of breaking a problem into pieces that are then
distributed across multiple computers on the same network or cluster, or across a grid of
disparate and possibly geographically separated systems (map), and then collecting all the
results and combining them into a report (reduce). Google’s branded framework to perform
this function is called MapReduce.
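A single-process Python sketch of the map and reduce phases, using word counting as the classic example; on a real cluster the mapped chunks would be processed in parallel on separate nodes:

```python
# Map each chunk to per-word counts, then reduce by merging the counts.
from collections import Counter
from functools import reduce

chunks = ["big data big ideas", "data beats opinion"]  # stand-ins for distributed splits

def map_chunk(chunk):
    """Map phase: emit a count per word occurrence in one chunk."""
    return Counter(chunk.split())

def reduce_counts(a, b):
    """Reduce phase: merge partial counts produced by the mappers."""
    return a + b

mapped = [map_chunk(c) for c in chunks]  # would run in parallel on a cluster
totals = reduce(reduce_counts, mapped)
print(totals)  # Counter({'big': 2, 'data': 2, ...})
```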