CSD 1043: Big Data Fundamentals Week1: Big Data Landscape: Definitions

This document provides definitions and explanations of key concepts related to big data fundamentals. It defines big data as datasets that are too large for traditional databases to handle due to their volume, velocity, and variety. It also notes that most big data is unstructured data like text files. The document then defines and explains additional big data terms like data science, data analytics, data mining, data visualization, machine learning, data clustering, structured vs unstructured data, Hadoop, NoSQL, Internet of Things (IoT), and MapReduce.

CSD 1043: Big Data Fundamentals

Week1: Big Data Landscape


Definitions:
Big data is commonly described along three dimensions, named in the title of the report “3-D Data
Management: Controlling Data Volume, Velocity and Variety.” Volume refers to the sheer size of the
datasets. The McKinsey report, “Big Data: The Next Frontier for Innovation, Competition, and
Productivity,” expands on the volume aspect by saying that “‘Big data’ refers to datasets whose
size is beyond the ability of typical database software tools to capture, store, manage, and
analyze.”
Big data can include both structured and unstructured data, but IDC estimates that 90 percent of big
data is unstructured data. Many of the tools designed to analyze big data can handle unstructured data.
Big Data Terminology

Data vs information: Data is raw facts, while information is the meaning derived from those facts.
Data science: Data science is the study of where information comes from, what it represents
and how it can be turned into a valuable resource in the creation of business. Data science
offers a powerful new approach to making discoveries. By combining aspects of statistics,
computer science, applied mathematics, and visualization, data science can turn the vast
amounts of data the digital age generates into new insights and new knowledge.
Data analytics: The application of software to derive information or meaning from data. The
end result might be a report, an indication of status, or an action taken automatically based on
the information received.
Data mining: The process of deriving patterns or knowledge from large data sets.
Data marts & data warehouse: A data mart is one piece of a data warehouse in which all the
information relates to a specific business area. It is therefore a subset of the data stored in
the warehouse; in this view, all the data marts together make up the data warehouse.
Data visualization: Today, data visualization has become a rapidly evolving blend of science and
art that is certain to change the corporate landscape over the next few years. Tableau Public is
a popular data visualization tool that is also completely free.
Data wrangling: Take messy, incomplete, or overly complex data and simplify and/or clean it so
that it is usable for analysis, and you will have done data wrangling. Pandas is one of the most
popular Python libraries for data wrangling.
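The cleaning steps described above can be sketched in a few lines. This is only an illustration using the Python standard library; the records, field names, and the clean_row helper are hypothetical, and in practice the same steps are usually written with a library such as Pandas.

```python
# Hypothetical raw records: inconsistent casing, stray whitespace, missing values.
raw_rows = [
    {"name": " Alice ", "city": "toronto", "age": "34"},
    {"name": "BOB", "city": "Toronto ", "age": ""},
    {"name": "", "city": "Ottawa", "age": "29"},
]


def clean_row(row):
    """Return a tidied copy of one record, or None if it is unusable."""
    name = row["name"].strip().title()
    if not name:  # drop records with no name at all
        return None
    city = row["city"].strip().title()
    age_text = row["age"].strip()
    age = int(age_text) if age_text.isdigit() else None  # keep missing ages as None
    return {"name": name, "city": city, "age": age}


cleaned = [c for c in (clean_row(r) for r in raw_rows) if c is not None]

for record in cleaned:
    print(record)
```

The unusable record is dropped and the missing age is kept as an explicit None, so later analysis can decide how to treat it.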
Data governance: data governance establishes the rules of the data-use game. Data
governance becomes the function that owns the quality of data across the organization. The
participating policy makers ensure that standards are in place, that data quality is monitored,
and that new/emerging data and data sources are always tied into the rest of the data picture
for the business. As many industries have very refined data that is easier to capture and always
consistent, data governance is sometimes viewed as an information technology (IT) function.
Machine learning: The use of algorithms to allow a computer to analyze data for the purpose of
“learning” what action to take when a specific pattern or event occurs.
Data clustering: Data analysis for the purpose of identifying similarities and differences among data
sets so that similar data sets can be clustered together.
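One common way to cluster is k-means, which is chosen here only as an illustration; the notes do not name a specific technique. The sketch below groups single numbers around the nearest center and then moves each center to the mean of its group. The purchase amounts and starting centers are made up.

```python
def k_means_1d(values, centers, rounds=10):
    """Group numbers around the nearest center, then move each center to its group's mean."""
    centers = list(centers)
    for _ in range(rounds):
        groups = [[] for _ in centers]
        for v in values:
            nearest = min(range(len(centers)), key=lambda i: abs(v - centers[i]))
            groups[nearest].append(v)
        # An empty group keeps its previous center.
        centers = [sum(g) / len(g) if g else c for g, c in zip(groups, centers)]
    return centers, groups


# Hypothetical purchase amounts with two visibly different groups.
amounts = [9, 10, 11, 48, 50, 52]
centers, groups = k_means_1d(amounts, centers=[0, 100])
print(centers)
print(groups)
```

With this input the small amounts end up in one cluster and the large amounts in the other.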

Correlation analysis
Descriptive analytics
Predictive analytics (modelling)
Structured data vs Unstructured data: Spreadsheets vs Emails

Unstructured data files often include text and multimedia content. Examples include e-mail
messages, word processing documents, videos, photos, audio files, presentations, webpages
and many other kinds of business documents. Note that while these sorts of files may have an
internal structure, they are still considered "unstructured" because the data they contain
doesn't fit neatly in a database.
Experts estimate that 80 to 90 percent of the data in any organization is unstructured. And the
amount of unstructured data in enterprises is growing significantly — often many times faster
than structured databases are growing.

Semi-Structured Data: In addition to structured and unstructured data, there's also a third
category: semi-structured data. Semi-structured data is information that doesn't reside in a
relational database but that does have some organizational properties that make it easier to
analyze. Examples of semi-structured data might include XML documents.
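As a rough illustration of why semi-structured data is easier to analyze, the sketch below reads a small XML document with Python's standard xml.etree.ElementTree module. The document and its tag names are hypothetical. The tags give enough organization to pull out fields, even though the records do not all carry the same fields.

```python
import xml.etree.ElementTree as ET

# Hypothetical XML: tags give some organization, but records need not share the same fields.
document = """
<customers>
  <customer id="1"><name>Alice</name><email>alice@example.com</email></customer>
  <customer id="2"><name>Bob</name></customer>
</customers>
"""

root = ET.fromstring(document)
records = []
for customer in root.findall("customer"):
    records.append({
        "id": customer.get("id"),
        "name": customer.findtext("name"),
        "email": customer.findtext("email"),  # None when the element is absent
    })

print(records)
```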

Data integrity: The measure of trust an organization has in the accuracy, completeness,
timeliness, and validity of the data.
Data set: A collection of data, typically in tabular form
Data security: The practice of protecting data from destruction or unauthorized access.
Petabyte: One million gigabytes, or 1,000 terabytes. (The figure of 1,024 terabytes applies when binary units are used.)
Exabyte: One million terabytes, or 1 billion gigabytes of information.
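The arithmetic behind these units can be checked directly. In decimal units each step is a factor of 1,000; a factor of 1,024 belongs to the binary units (tebibyte, pebibyte). The constant names below are only illustrative.

```python
# Decimal (SI) units step by 1,000; binary units step by 1,024.
GB = 10**9
TB = 10**12
PB = 10**15
EB = 10**18

print(PB // GB)  # gigabytes in a petabyte
print(PB // TB)  # terabytes in a petabyte
print(EB // TB)  # terabytes in an exabyte
print(EB // GB)  # gigabytes in an exabyte

# Binary counterparts (tebibyte, pebibyte), where the factor 1,024 comes from.
TIB = 2**40
PIB = 2**50
print(PIB // TIB)  # tebibytes in a pebibyte
```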
Exploratory data analysis: An approach to data analysis focused on identifying general patterns
in data, including outliers and features of the data that are not anticipated by the
experimenter’s current knowledge or preconceptions. EDA aims to uncover underlying
structure, test assumptions, detect mistakes, and understand relationships between variables.
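A first pass at exploratory data analysis often starts with a few summary statistics and a check for outliers. The sketch below uses Python's standard statistics module and the interquartile-range rule, which is one common convention rather than the only one. The order counts are made up.

```python
import statistics

# Hypothetical daily order counts; one value looks suspicious.
orders = [21, 19, 23, 20, 22, 18, 95, 21]

mean = statistics.mean(orders)
median = statistics.median(orders)

# Interquartile-range rule: flag values far outside the middle 50% of the data.
q1, _, q3 = statistics.quantiles(orders, n=4)
iqr = q3 - q1
outliers = [x for x in orders if x < q1 - 1.5 * iqr or x > q3 + 1.5 * iqr]

print(mean, median, outliers)
```

The gap between the mean and the median is itself a hint that something unusual is in the data, which the outlier check then pinpoints.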
Hadoop: Hadoop is an open source, Java-based programming framework that supports the
processing and storage of extremely large data sets in a distributed computing environment. It
is part of the Apache project sponsored by the Apache Software Foundation. Hadoop makes it
possible to run applications on systems with thousands of commodity hardware nodes, and to
handle thousands of terabytes of data. Its distributed file system facilitates rapid data transfer
rates among nodes and allows the system to continue operating in case of a node failure. This
approach lowers the risk of catastrophic system failure and unexpected data loss, even if a
significant number of nodes become inoperative. Consequently, Hadoop quickly emerged as a
foundation for big data processing tasks, such as scientific analytics, business and sales
planning, and processing enormous volumes of sensor data, including from internet of
things sensors.

NoSQL: A class of database management system that does not use the relational model. NoSQL is
designed to handle very large data volumes that do not follow a fixed schema and do not require
the relational model. MongoDB is a popular document-oriented NoSQL database.
Graph database: A type of NoSQL database that uses graph structures for semantic queries, with
nodes, edges, and properties to store, map, and query relationships in data.
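The idea of storing data as nodes, edges, and properties can be sketched with plain Python structures. This is not how any particular graph database works internally; the node names, relationship labels, and the two helper functions are hypothetical and only show what a relationship query looks like.

```python
# Hypothetical data: nodes carry properties, edges carry a relationship label.
nodes = {
    "alice": {"kind": "person"},
    "bob": {"kind": "person"},
    "acme": {"kind": "company"},
}
edges = [
    ("alice", "KNOWS", "bob"),
    ("alice", "WORKS_AT", "acme"),
    ("bob", "WORKS_AT", "acme"),
]


def neighbors(node, relationship):
    """Follow outgoing edges of one relationship type from a node."""
    return [dst for src, rel, dst in edges if src == node and rel == relationship]


def sources(node, relationship):
    """Find nodes that point at `node` through the given relationship."""
    return [src for src, rel, dst in edges if dst == node and rel == relationship]


print(neighbors("alice", "KNOWS"))
print(sources("acme", "WORKS_AT"))
```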

IoT: A thing, in the Internet of Things, can be a person with a heart monitor implant, a farm
animal with a biochip transponder, an automobile that has built-in sensors to alert the driver
when tire pressure is low, or any other natural or man-made object that can be assigned an IP
address and provided with the ability to transfer data over a network. IPv6's huge increase in address
space is an important factor in the development of the Internet of Things. According to Steve
Leibson, who identifies himself as “occasional docent at the Computer History Museum,” the
address space expansion means that we could “assign an IPV6 address to every atom on the
surface of the earth, and still have enough addresses left to do another 100+ earths.” In other
words, humans could easily assign an IP address to every "thing" on the planet. An increase in
the number of smart nodes, as well as the amount of upstream data the nodes generate, is
expected to raise new concerns about data privacy, data sovereignty and security. Practical
applications of IoT technology can be found in many industries today, including precision
agriculture, building management, healthcare, energy and transportation.
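The scale of the address space expansion follows from the address lengths: an IPv4 address is 32 bits long and an IPv6 address is 128 bits long. The short calculation below shows the resulting counts. It does not attempt to verify the per-atom estimate quoted above.

```python
IPV4_BITS = 32
IPV6_BITS = 128

ipv4_addresses = 2**IPV4_BITS
ipv6_addresses = 2**IPV6_BITS

print(f"IPv4: {ipv4_addresses:,} addresses")
print(f"IPv6: {ipv6_addresses:.3e} addresses")
print(f"IPv6 is 2**{IPV6_BITS - IPV4_BITS} times larger")
```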

Map reduce: A general term that refers to the process of breaking up a problem into pieces
that are then distributed across multiple computers on the same network or cluster, or across a
grid of disparate and possibly geographically separated systems (map), and then collecting all
the results and combining them into a report (reduce). Google’s branded framework for performing
this function is called MapReduce.
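The map and reduce steps can be illustrated with a word count, a common teaching example. The sketch below runs on a single machine and does not use Google's MapReduce or Hadoop; the input text and the function names are hypothetical.

```python
from collections import defaultdict

# Hypothetical input already split into pieces, one per worker.
chunks = [
    "big data is big",
    "data is everywhere",
]


def map_chunk(text):
    """Map step: turn one piece of input into (word, 1) pairs."""
    return [(word, 1) for word in text.split()]


def reduce_pairs(pairs):
    """Reduce step: combine all pairs that share a word into one total."""
    totals = defaultdict(int)
    for word, count in pairs:
        totals[word] += count
    return dict(totals)


# On a real cluster each map_chunk call would run on a different machine;
# here they simply run one after another.
mapped = [pair for chunk in chunks for pair in map_chunk(chunk)]
word_counts = reduce_pairs(mapped)
print(word_counts)
```

Because each chunk is mapped independently, the map calls could be handed to separate computers, and only the small lists of pairs would need to be collected for the reduce step.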
