CSD 1043: Big Data Fundamentals Week1: Big Data Landscape: Definitions
Data vs information: data is raw facts; information is the meaning derived from those facts.
Data science: Data science is the study of where information comes from, what it represents,
and how it can be turned into a valuable resource in the creation of business value. Data science
offers a powerful new approach to making discoveries. By combining aspects of statistics,
computer science, applied mathematics, and visualization, data science can turn the vast
amounts of data the digital age generates into new insights and new knowledge.
Data analytics: The application of software to derive information or meaning from data. The
end result might be a report, an indication of status, or an action taken automatically based on
the information received.
Data mining: The process of deriving patterns or knowledge from large data sets.
Data marts & data warehouse: A data mart is one piece of a data warehouse in which all the
information relates to a specific business area. It is therefore a subset of the data stored in the
warehouse; taken together, all of an organization's data marts make up its data warehouse.
Data visualization: Today, data visualization has become a rapidly evolving blend of science and
art that is certain to change the corporate landscape over the next few years. Tableau Public is a
popular data visualization tool that's also completely free.
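As a minimal illustration in code (using matplotlib, an assumed library choice; the figures are hypothetical):

```python
# A minimal bar chart: monthly sales for a hypothetical retailer.
import matplotlib.pyplot as plt

months = ["Jan", "Feb", "Mar", "Apr"]
sales = [120, 135, 128, 160]  # hypothetical figures

plt.bar(months, sales)
plt.title("Monthly Sales (hypothetical data)")
plt.xlabel("Month")
plt.ylabel("Units sold")
plt.show()
```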
Data wrangling: taking messy, incomplete, or overly complex data and simplifying and/or
cleaning it so that it's usable for analysis. Pandas is one of the most popular Python libraries
for data wrangling.
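A minimal wrangling sketch with pandas; the data and column names here are hypothetical:

```python
# A small data-wrangling pass with pandas: drop duplicates, remove
# unusable rows, and fix column types.
import pandas as pd

raw = pd.DataFrame({
    "customer": ["Ann", "Ann", "Bob", None],
    "signup":   ["2021-01-04", "2021-01-04", "2021-02-11", "2021-03-02"],
    "spend":    ["100", "100", None, "42"],
})

clean = (
    raw.drop_duplicates()                                     # remove exact duplicate rows
       .dropna(subset=["customer"])                           # unusable without a customer
       .assign(signup=lambda d: pd.to_datetime(d["signup"]),  # parse date strings
               spend=lambda d: pd.to_numeric(d["spend"]))     # strings -> numbers
)
print(clean)
```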
Data governance: data governance establishes the rules of the data-use game. Data
governance becomes the function that owns the quality of data across the organization. The
participating policy makers ensure that standards are in place, that data quality is monitored,
and that new/emerging data and data sources are always tied into the rest of the data picture
for the business. Because many industries have highly refined data that is easy to capture and
consistently formatted, data governance is sometimes viewed as an information technology (IT) function.
Machine learning: The use of algorithms to allow a computer to analyze data for the purpose of
“learning” what action to take when a specific pattern or event occurs.
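As a minimal sketch of this pattern-to-action idea, the toy example below trains a classifier with scikit-learn (an assumed library choice; the feature values and labels are hypothetical):

```python
# Learn a pattern -> action mapping from labeled examples.
from sklearn.tree import DecisionTreeClassifier

# Hypothetical sensor readings (temperature, vibration) and the
# action taken for each (0 = ignore, 1 = raise an alert).
X = [[20, 0.1], [22, 0.2], [80, 0.9], [75, 0.8]]
y = [0, 0, 1, 1]

model = DecisionTreeClassifier().fit(X, y)
print(model.predict([[78, 0.85]]))  # -> [1]: the learned pattern says "alert"
```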
Data clustering: Data analysis for the purpose of identifying similarities and differences among data
sets so that similar data sets can be clustered together.
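A minimal clustering sketch using k-means from scikit-learn (an assumed tool; the points are hypothetical):

```python
# Group similar 2-D points with k-means.
from sklearn.cluster import KMeans

points = [[1, 1], [1.5, 2], [1, 0.5],   # one tight group
          [9, 9], [8.5, 10], [9, 8]]    # another tight group

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(points)
print(km.labels_)  # e.g. [0 0 0 1 1 1]: similar points share a cluster label
```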
Correlation analysis: Analysis that measures the strength and direction of the statistical
relationship between two variables (see the sketch below).
Descriptive analytics: Analytics that summarizes historical data to describe what has happened.
Predictive analytics (modelling): Analytics that uses statistical or machine learning models built
on historical data to estimate what is likely to happen.
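A minimal correlation-analysis sketch with pandas, using hypothetical ad-spend and revenue figures:

```python
# Pearson correlation between two hypothetical variables.
import pandas as pd

df = pd.DataFrame({
    "ad_spend": [10, 20, 30, 40, 50],
    "revenue":  [12, 24, 31, 38, 52],
})

print(df["ad_spend"].corr(df["revenue"]))  # close to 1.0: strong positive correlation
```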
Structured data vs. unstructured data: spreadsheets vs. emails.
Unstructured data files often include text and multimedia content. Examples include e-mail
messages, word processing documents, videos, photos, audio files, presentations, webpages
and many other kinds of business documents. Note that while these sorts of files may have an
internal structure, they are still considered "unstructured" because the data they contain
doesn't fit neatly in a database.
Experts estimate that 80 to 90 percent of the data in any organization is unstructured. And the
amount of unstructured data in enterprises is growing significantly — often many times faster
than structured databases are growing.
Semi-Structured Data: In addition to structured and unstructured data, there's also a third
category: semi-structured data. Semi-structured data is information that doesn't reside in a
relational database but that does have some organizational properties that make it easier to
analyze. Examples of semi-structured data might include XML documents.
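A short sketch of working with semi-structured data: the XML below (a hypothetical customer list) has no fixed relational schema, yet its tags are organized enough to query with Python's standard library:

```python
# XML is semi-structured: fields may vary from record to record,
# but tags give the data enough organization to query.
import xml.etree.ElementTree as ET

doc = """
<customers>
  <customer id="1"><name>Ann</name><city>Toronto</city></customer>
  <customer id="2"><name>Bob</name></customer>  <!-- no city: fields may vary -->
</customers>
"""

root = ET.fromstring(doc)
for c in root.findall("customer"):
    name = c.findtext("name")
    city = c.findtext("city", default="unknown")
    print(c.get("id"), name, city)
```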
Data integrity: The measure of trust an organization has in the accuracy, completeness,
timeliness, and validity of the data.
Data set: A collection of data, typically in tabular form.
Data security: The practice of protecting data from destruction or unauthorized access.
Petabyte: Approximately one million gigabytes, or 1,024 terabytes.
Exabyte: One million terabytes, or 1 billion gigabytes of information.
Exploratory data analysis: An approach to data analysis focused on identifying general patterns
in data, including outliers and features of the data that are not anticipated by the
experimenter’s current knowledge or preconceptions. EDA aims to uncover underlying
structure, test assumptions, detect mistakes, and understand relationships between variables.
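A minimal first EDA pass with pandas, on a hypothetical measurements dataset:

```python
# Summary statistics, pairwise relationships, and a quick outlier check.
import pandas as pd

df = pd.DataFrame({"height_cm": [158, 171, 165, 180, 240, 169],
                   "weight_kg": [55, 68, 61, 79, 80, 74]})

print(df.describe())              # count, mean, std, min/max, quartiles
print(df.corr())                  # relationships between variables
print(df[df["height_cm"] > 220])  # flag suspicious outliers for inspection
```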
Hadoop: Hadoop is an open source, Java-based programming framework that supports the
processing and storage of extremely large data sets in a distributed computing environment. It
is part of the Apache project sponsored by the Apache Software Foundation. Hadoop makes it
possible to run applications on systems with thousands of commodity hardware nodes, and to
handle thousands of terabytes of data. Its distributed file system facilitates rapid data transfer
rates among nodes and allows the system to continue operating in case of a node failure. This
approach lowers the risk of catastrophic system failure and unexpected data loss, even if a
significant number of nodes become inoperative. Consequently, Hadoop quickly emerged as a
foundation for big data processing tasks, such as scientific analytics, business and sales
planning, and processing enormous volumes of sensor data, including data from Internet of
Things (IoT) sensors.
NoSQL: A class of database management system that does not use the relational model. NoSQL
is designed to handle large data volumes that do not follow a fixed schema and is ideally suited
for very large data volumes that do not require the relational model. MongoDB is an
open-source NoSQL database.
Graph database: A type of NoSQL database that uses graph structures for semantic queries,
with nodes, edges, and properties to store, map, and query relationships in data.
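A minimal sketch of schema-free storage with MongoDB, mentioned above. It assumes the pymongo driver and a MongoDB server running on localhost; the database and collection names are hypothetical:

```python
# Documents in the same MongoDB collection need not share a schema.
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
things = client["demo"]["things"]

things.insert_one({"kind": "sensor", "model": "T-100", "reading": 21.5})
things.insert_one({"kind": "person", "name": "Ann"})  # different fields, same collection

for doc in things.find({"kind": "sensor"}):
    print(doc)
```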
IoT: A thing, in the Internet of Things, can be a person with a heart monitor implant, a farm
animal with a biochip transponder, an automobile that has built-in sensors to alert the driver
when tire pressure is low -- or any other natural or man-made object that can be assigned an IP
address and provided with the ability to transfer data over a network. The huge increase in
address space brought by IPv6, which offers 2^128 (about 3.4 × 10^38) unique addresses versus
IPv4's 2^32 (about 4.3 billion), is an important factor in the development of the Internet of
Things. According to Steve Leibson, who identifies himself as “occasional docent at the
Computer History Museum,” the address space expansion means that we could “assign an IPv6
address to every atom on the surface of the earth, and still have enough addresses left to do
another 100+ earths.” In other words, humans could easily assign an IP address to every "thing"
on the planet. An increase in
the number of smart nodes, as well as the amount of upstream data the nodes generate, is
expected to raise new concerns about data privacy, data sovereignty and security. Practical
applications of IoT technology can be found in many industries today, including precision
agriculture, building management, healthcare, energy and transportation.
Map reduce: A general term for the process of breaking a problem into pieces that are then
distributed across multiple computers on the same network or cluster, or across a grid of
disparate and possibly geographically separated systems (map), and then collecting all the
results and combining them into a report (reduce). Google’s branded framework to perform
this function is called MapReduce.
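A single-process Python sketch of the map and reduce phases, using word counting as the classic example; on a real cluster the mapped chunks would be processed in parallel on separate nodes:

```python
# Map each chunk to per-word counts, then reduce by merging the counts.
from collections import Counter
from functools import reduce

chunks = ["big data big ideas", "data beats opinion"]  # stand-ins for distributed splits

def map_chunk(chunk):
    """Map phase: emit a count per word occurrence in one chunk."""
    return Counter(chunk.split())

def reduce_counts(a, b):
    """Reduce phase: merge partial counts produced by the mappers."""
    return a + b

mapped = [map_chunk(c) for c in chunks]  # would run in parallel on a cluster
totals = reduce(reduce_counts, mapped)
print(totals)  # Counter({'big': 2, 'data': 2, ...})
```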