Data Evolution Unit 1 Material
Concept of Data
States of Data
• Data exists in three states: data at rest, data in transit, and data in use.
• Data within a computer moves as parallel data, while data moving to or from a computer moves as
serial data.
• Data representing quantities, characters, or symbols are stored and recorded on various recording
media and transmitted in the form of digital signals.
UNDERSTANDING DATA
METADATA and its CHARACTERISTICS
DATA about DATA is called METADATA
• Metadata helps translate data to information.
• Data relating to physical events or processes has a temporal component.
• Computers follow a sequence of instructions given in the form of data.
• A single datum is a value stored at a specific location; because programs and their instructions are themselves stored as data, computer programs can operate on other computer programs by manipulating their programmatic data.
2000s to 2010s – Controlling Data Volume, Social Media and Cloud Computing
• Companies like Amazon, eBay, and Google generated large amounts of web traffic and unstructured
data.
• AWS launched in 2002, offering a range of cloud infrastructure services, attracting customers like
Dropbox, Netflix, and Reddit.
• Social media platforms like MySpace, Facebook, and Twitter spread unstructured data, leading to the creation of Hadoop and NoSQL databases for storing and querying it.
BIG DATA
• Traditional data analysis methods and computing systems struggle to work with very large datasets.
• The new field of data science uses machine learning and AI methods for efficient applications of
analytic methods to big data.
“Big Data refers to the dynamic, large and disparate volumes of data being created by people, tools,
and machines.
• It requires new, innovative, and scalable technology to collect, host, and analytically process
the vast amount of data gathered in order to derive real-time business insights that relate to
consumers, risk, profit, performance, productivity management, and enhanced
shareholder value.”
• There is no one definition of Big Data, but there are certain elements that are common across
the different definitions, such as velocity, volume, variety, veracity, and value. These are the
V's of Big Data.
Velocity is the speed at which data accumulates.
• Data is being generated extremely fast, in a process that never stops.
• Near or real-time streaming, local, and cloud-based technologies can process information
very quickly.
Example for Velocity: Every 60 seconds, hours of footage are uploaded to YouTube, continuously generating data.
• Think about how quickly data accumulates over hours, days, and years.
Volume is the scale of the data, or the increase in the amount of data stored.
• Drivers of volume are the increase in data sources, higher resolution sensors, and
scalable infrastructure.
Example for Volume: The world population is approximately seven billion people and the vast majority are now using digital devices: mobile phones, desktop and laptop computers, wearable devices, and so on. These devices all generate, capture, and store data -- approximately 2.5 quintillion bytes every day.
Variety is the diversity of the data.
• Structured data fits neatly into rows and columns in relational databases, while unstructured data is not organized in a pre-defined way, like tweets, blog posts, pictures, numbers, and video.
• Variety also reflects that data comes from different sources, machines, people, and
processes, both internal and external to organizations.
• Drivers are mobile technologies, social media, wearable technologies, geo technologies,
video, and many, many more.
Example for Variety: Let's think about the different types of data: text, pictures, film, sound, health data from wearable devices, and many different types of data from devices connected to the Internet of Things.
Veracity is the quality and origin of data, and its conformity to facts and accuracy.
• Attributes include consistency, completeness, integrity, and ambiguity.
• Drivers include cost and the need for traceability.
• With the large amount of data available, the debate rages on about the accuracy of data in the
digital age. Is the information real, or is it false?
• Example for Veracity: 80% of data is considered to be unstructured and we must devise ways
to produce reliable and accurate insights.
• The data must be categorized, analyzed, and visualized.
Value is our ability and need to turn data into value.
• Value isn't just profit.
• It may have medical or social benefits, as well as customer, employee, or personal
satisfaction.
• The main reason that people invest time to understand Big Data is to derive value from it.
DATA AND TYPES OF DATA
• Data is unorganized information that is processed to make it meaningful.
• Generally, data comprises facts, observations, perceptions, numbers, characters, symbols, and images that can be interpreted to derive meaning.
• One of the ways in which data can be categorized is by its structure.
Data can be:
• Structured;
• Semi-structured, or
• Unstructured.
Structured data
• Structured data has a well-defined structure or adheres to a specified data model, can be stored in well-defined schemas such as databases, and in many cases can be represented in a tabular manner with rows and columns.
• Structured data is objective facts and numbers that can be collected, exported, stored, and organized in typical databases.
• Some of the sources of structured data could include:
1. SQL Databases and Online Transaction Processing (or OLTP) Systems that focus on business transactions,
2. Spreadsheets such as Excel and Google Spreadsheets, and online forms,
3. Sensors such as Global Positioning Systems (or GPS) and Radio Frequency Identification (or RFID) tags; and
4. Network and Web server logs.
Typically, structured data is stored in relational or SQL databases.
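As a minimal sketch of this idea (using Python's built-in sqlite3 module; the table and values are purely illustrative), structured data stored as rows and columns can be queried with SQL:

```python
import sqlite3

# Illustrative example: structured data stored as rows and columns in a SQL table.
conn = sqlite3.connect(":memory:")  # in-memory database, just for demonstration
cur = conn.cursor()

cur.execute("CREATE TABLE sales (region TEXT, product TEXT, amount REAL)")
cur.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [("North", "Laptop", 1200.0), ("South", "Phone", 650.0), ("North", "Phone", 700.0)],
)

# A structured query: total sales per region.
for region, total in cur.execute("SELECT region, SUM(amount) FROM sales GROUP BY region"):
    print(region, total)

conn.close()
```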
Semi-structured data
• Semi-structured data is data that has some organizational properties but lacks a fixed or rigid
schema.
• Semi-structured data cannot be stored in the form of rows and columns as in databases.
• It contains tags and elements, or metadata, which is used to group data and organize it in a
hierarchy.
Some of the sources of semi-structured data could include:
• E-mails, XML and other markup languages, binary executables, TCP/IP packets, and zipped files,
• Integration of data from different sources.
• XML and JSON allow users to define tags and attributes to store data in a hierarchical form
and are used widely to store and exchange semi-structured data.
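A minimal sketch of semi-structured data, using Python's built-in json module with a made-up record: keys and nesting organize the data in a hierarchy, but entries need not share a rigid schema:

```python
import json

# Hypothetical semi-structured record: nested keys group the data hierarchically,
# but there is no fixed schema -- some entries have fields that others lack.
record = {
    "id": 101,
    "name": "A. Student",
    "contact": {"email": "student@example.com"},
    "courses": [
        {"code": "DS101", "grade": "A"},
        {"code": "DS102"},           # no grade yet -- fields can vary per entry
    ],
}

text = json.dumps(record, indent=2)    # serialize to a JSON document
parsed = json.loads(text)              # parse it back into nested dicts/lists
print(parsed["courses"][0]["code"])    # navigate the hierarchy: DS101
```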
Unstructured data
• Unstructured data is data that does not have an easily identifiable structure and,
therefore, cannot be organized in a mainstream relational database in the form of rows and
columns.
• It does not follow any particular format, sequence, semantics, or rules.
• Unstructured data comes from highly heterogeneous sources and has a variety of business intelligence and analytics applications.
Some of the sources of unstructured data could include:
• Web pages, Social media feeds, Images in varied file formats (such as JPEG, GIF, and PNG),
video and audio files, documents and PDF files, PowerPoint presentations, media logs; and surveys.
• Unstructured data can be stored in files and documents (such as a Word doc) for
manual analysis or in NoSQL databases that have their own analysis tools for examining this
type of data.
DATA SOURCES
Some common sources include:
Relational Databases;
Flat Files and XML Datasets;
APIs and Web Services;
Web Scraping;
Data Streams and Feeds.
Some of the standard file formats we use include:
Delimited text file formats,
Microsoft Excel Open XML Spreadsheet, or XLSX,
Extensible Markup Language, or XML,
Portable Document Format, or PDF, and
JavaScript Object Notation, or JSON.
Typically, organizations have internal applications to support them in managing their day to
day business activities, customer transactions, human resource activities, and their
workflows.
These systems use relational databases such as SQL Server, Oracle, MySQL, and IBM
DB2, to store data in a structured way. Data stored in databases and data warehouses can
be used as a source for analysis. For example, data from a retail transactions system can be
used to analyze sales in different regions, and data from a customer relationship
management system can be used for making sales projections.
External to the organization, there are other publicly and privately available datasets. For
example, government organizations release demographic and economic datasets on an
ongoing basis. Then there are companies that sell specific data, for example, Point-of-Sale
data or Financial data, or Weather data, which businesses can use to define strategy, predict
demand, and make decisions related to distribution or marketing promotions, among other
things. Such data sets are typically made available as flat files, spreadsheet files, or XML
documents.
Flat files store data in plain text format, with one record or row per line, and each
value separated by delimiters such as commas, semi-colons, or tabs.
Data in a flat file maps to a single table, unlike relational databases that contain
multiple tables.
One of the most common flat-file formats is CSV, in which values are separated by commas.
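A minimal sketch of reading a delimited flat file with Python's built-in csv module; the file name and its columns are placeholders for whatever delimited file you actually have (pandas.read_csv is a common alternative):

```python
import csv

# Read a hypothetical CSV flat file: one record per line, values separated by commas.
with open("sales.csv", newline="", encoding="utf-8") as f:
    reader = csv.DictReader(f)      # the first row is treated as the header
    for row in reader:
        print(row)                  # each row maps column names to string values
```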
Spreadsheet files are a special type of flat file that also organizes data in a tabular
format of rows and columns.
But a spreadsheet can contain multiple worksheets, and each worksheet can map to a
different table.
Although data in spreadsheets is in plain text, the files can be stored in custom formats and
include additional information such as formatting, formulas, etc.
Microsoft Excel, which stores data in .XLS or .XLSX format, is probably the most
common spreadsheet application.
Others include Google Sheets, Apple Numbers, and LibreOffice.
XML files contain data values that are identified or marked up using tags.
While data in flat files is “flat,” or maps to a single table, XML files can support
more complex data structures, such as hierarchical ones.
Some common uses of XML include data from online surveys, bank statements, and
other unstructured data sets.
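A minimal sketch of parsing a hierarchical XML document with Python's built-in xml.etree.ElementTree module; the bank-statement fragment below is made up for illustration:

```python
import xml.etree.ElementTree as ET

# Hypothetical XML document: tags mark up the values and allow a nested structure
# that a single flat table could not easily represent.
doc = """
<statements>
  <statement id="1">
    <date>2024-01-31</date>
    <balance currency="USD">1520.75</balance>
  </statement>
  <statement id="2">
    <date>2024-02-29</date>
    <balance currency="USD">1710.10</balance>
  </statement>
</statements>
"""

root = ET.fromstring(doc)
for stmt in root.findall("statement"):
    print(stmt.get("id"), stmt.findtext("date"), stmt.findtext("balance"))
```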
Many data providers and websites provide APIs, or Application Program Interfaces, and
Web Services, which multiple users or applications can interact with and obtain data for
processing or analysis.
APIs and Web Services typically listen for incoming requests, which can be in the form of
web requests from users or network requests from applications, and return data in plain text,
XML, HTML, JSON, or media files.
Let's look at some popular examples of APIs being used as a data source for data
analytics. Twitter and Facebook APIs are used to source data from tweets and posts for
tasks such as opinion mining or sentiment analysis, which summarizes the amount of
appreciation and criticism on a given subject, such as the policies of a government, a
product, a service, or customer satisfaction in general.
Stock Market APIs used for pulling data such as share and commodity prices, earnings
per share, and historical prices, for trading and analysis.
Data Lookup and Validation APIs, which can be very useful for data analysts for
cleaning and preparing data, as well as for correlating data, for example, to check which
city or state a postal or zip code belongs to.
APIs are also used for pulling data from database sources, within and external to the
organization.
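A minimal sketch of requesting data from a REST-style API over HTTP using only Python's standard library; the endpoint URL and the response fields are hypothetical, and real APIs usually also require an API key:

```python
import json
from urllib.request import urlopen

# Hypothetical REST endpoint that returns JSON (real APIs typically need authentication).
url = "https://api.example.com/v1/quotes?symbol=ACME"

with urlopen(url) as response:      # send the web request
    payload = json.load(response)   # many APIs return their data as JSON

# Navigate the (assumed) response structure.
print(payload.get("symbol"), payload.get("price"))
```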
Web Scraping is used to extract relevant data from unstructured sources.
Also known as screen scraping, web harvesting, and web data extraction, web scraping
makes it possible to download specific data from web pages based on defined parameters.
Web scrapers can, among other things, extract text, contact information, images,
videos, product items, and much more from a website.
Some popular uses of web scraping include collecting product details from retailers,
manufacturers, and eCommerce websites to provide price comparisons; generating sales
leads through public data sources; extracting data from posts and authors on various forums
and communities; and collecting training and testing datasets for machine learning
models. Some of the popular web scraping tools include BeautifulSoup, Scrapy, Pandas, and
Selenium.
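A rough sketch of web scraping with the third-party requests and BeautifulSoup libraries (pip install requests beautifulsoup4); the URL and the page structure assumed here are hypothetical, so real pages need their own selectors:

```python
import requests
from bs4 import BeautifulSoup

# Download a hypothetical product-listing page and parse its HTML.
url = "https://www.example.com/products"
html = requests.get(url, timeout=10).text
soup = BeautifulSoup(html, "html.parser")

# Extract product names and prices based on an assumed page layout.
for item in soup.find_all("div", class_="product"):
    name = item.find("h2").get_text(strip=True)
    price = item.find("span", class_="price").get_text(strip=True)
    print(name, price)
```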
Data streams are another widely used source for aggregating constant streams of data
flowing from sources such as instruments, IoT devices, and applications, GPS data from
cars, computer programs, websites, and social media posts.
This data is generally timestamped and also geo-tagged for geographical identification.
Some of the data streams and ways in which they can be leveraged include:
stock and market tickers for financial trading; retail transaction streams for predicting
demand and supply chain management; surveillance and video feeds for threat
detection; social media feeds for sentiment analysis; sensor data feeds for monitoring
industrial or farming machinery; web click feeds for monitoring web performance and
improving design; and real-time flight events for rebooking and rescheduling.
Some popular applications used to process data streams include Apache Kafka,
Apache Spark Streaming, and Apache Storm.
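As a toy illustration only (a production setup would use a platform such as Apache Kafka or Spark Streaming), the sketch below simulates a timestamped sensor stream with a Python generator and keeps a running mean:

```python
import random
import time

def sensor_stream(n=5):
    """Simulated stream of timestamped temperature readings (stand-in for a real feed)."""
    for _ in range(n):
        yield {"ts": time.time(), "temperature": 20 + random.random() * 5}
        time.sleep(0.1)

running_total, count = 0.0, 0
for reading in sensor_stream():
    running_total += reading["temperature"]
    count += 1
    print(f"reading={reading['temperature']:.2f}  running mean={running_total / count:.2f}")
```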
RSS (or Really Simple Syndication) feeds are another popular data source.
These are typically used for capturing updated data from online forums and news sites
where data is refreshed on an ongoing basis.
Using a feed reader, which is an interface that converts RSS text files into a stream
of updated data, updates are streamed to user devices.
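A minimal sketch of consuming an RSS feed with the third-party feedparser library (pip install feedparser); the feed URL is a placeholder:

```python
import feedparser

# Parse a hypothetical RSS feed and print the first few entries.
feed = feedparser.parse("https://www.example.com/news/rss.xml")

for entry in feed.entries[:5]:
    print(entry.title, entry.link)
```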
• Source code repositories (repos) are software products designed to manage source code.
• Modern repos are based on versioning systems, which track changes and allow for creation and
comparison of different versions.
• Repos and versioning take time to learn and integrate into workflows.
• Bitbucket.org and GitHub.com offer free web hosting of code repos, with Git being the most popular
versioning system.
• Remote repo-hosting services serve as a backup, ensuring code safety even in the event of a
computer crash.
• Some code-hosting services offer web interfaces for viewing code history, versions, and
development status.
• Remote repos allow code access from any location with web access, with language-specific code
highlighting and other useful features.
• Tips for repos and versioning include: using a remote source code repo; learning Git or another versioning system; committing code changes frequently; working in a location that doesn't affect the production version or development by other team members; using versioning, branching, and forking instead of copying and pasting code; and asking Git gurus for best practices.
• Data scientists possess a strong sense of awareness, which can be a strength or a weakness.
• The stereotype of introverted academics being too shy to ask for help is often misguided.
• Data scientists, including software engineers, business strategists, sales executives, marketers, and
researchers, possess vast knowledge about their domains or projects.
• In a business setting, it's important to learn about the company and the industry, or domain
knowledge.
• Nontechnical business people often treat data scientists as the smart ones, but those business people usually have a deeper understanding of the project goals and business problems.
• Engaging in discussions with those who understand the business side of the project can illuminate
projects and contribute to domain knowledge.
• The concept of being "close to the data" involves minimizing the complexity of the methods and
algorithms used.
• Complex methods, such as black-box methods, can be beneficial in certain fields, but be conscious of the possibility of mistakes.
• In cases where complex methods have advantages, the concept of being close to the data can be
adapted.
• This can involve verifying, justifying, or supporting results from complex methods with simpler,
close-to-data methods.
• Straying too far from the data without a safety line can lead to difficulties in diagnosing problems.
• Data science is the extraction of knowledge from data, a concept not found in other fields like
operations research, decision sciences, analytics, data mining, mathematical modeling, or applied
statistics.
• The term is often used to describe the unique tasks data scientists perform, which previous applied
statisticians and data-oriented software engineers did not.
• The data science process involves setting goals, preparing, building, finishing, exploring, wrapping up, wrangling, revising, assessing, delivering, planning, analyzing, engineering, optimizing, and executing.
• The Information Age, from the second half of the 20th century to the beginning of the 21st century,
is characterized by the rise of computers and the internet.
• Early computers were used for computationally intensive tasks like cracking military codes,
navigating ships, and performing simulations in applied physics.
• The internet developed in size and capacity, allowing data and results to be sent easily across a large
distance, enabling data analysts to amass larger and more varied data sets for study.
• Internet access for the average person in a developed country increased dramatically in the 1990s,
giving hundreds of millions of people access to published information and data.
• Websites and applications began collecting user data in the form of clicks, typed text, site visits, and
other actions a user might take, leading to more data production than consumption.
• The advent of mobile devices and smartphones connected to the internet allowed for an enormous
advance in the amount and specificity of user data being collected.
• The Internet of Things (IoT) includes data collection and internet connectivity in almost every
electronic device, making the online world not just a place for consuming information but a
data-collection tool in itself.
Access to Data
• Some companies, like Twitter, provide access to their data for a fee.
• An industry has developed around brokering the sale of data for profit.
• Academic and nonprofit organizations often make data sets available publicly and for free, but there
may be limitations on their use.
• There has been a trend towards consolidation of data sets within a single scientific field.
• This section discusses various data formats such as flat files, XML, and JSON.
• Each format has unique properties and idiosyncrasies; knowing them makes it easier to access and extract data.
• The discussion also covers databases and APIs, as they are essential for data science projects.
• Data can be accessed as a file on a file system, in a database, or behind an API.
• Data storage and delivery are intertwined in some systems, making them a single concept: getting
data into analysis tools.
• The goal is to provide descriptions that make readers comfortable discussing and approaching each
data format or system.
• The section aims to make data science accessible to beginners, allowing them to move on to the most important part: what the data can tell them.
Understanding Flat Files
XML Overview
• XML is a more flexible format than HTML, suitable for storing and transmitting documents and
data.
• XML documents begin with a tag declaring a specific XML version.
• XML works similarly to HTML but without most of the overhead associated with web pages.
• XML is used as a standard format for offline documents, such as those created by OpenOffice and Microsoft Office.
• XML specification is designed to be machine-readable, allowing for data transmission through APIs.
• XML is popular in applications and documents using non-tabular data and other formats requiring
flexibility.
JSON Overview
• JavaScript Object Notation (JSON) is a functionally similar format for data storage or transmission.
• JSON describes data structures like the lists, maps, or dictionaries found in programming languages.
• Unlike XML, JSON is leaner in terms of character count.
• JSON is popular for transmitting data due to its ease of use.
• It can be read directly as JavaScript code, and many programming languages like Python and Java
have natural representations of JSON.
• JSON is highly efficient for interoperability between programming languages.
• Relational databases are data storage systems optimized for efficient data storage and retrieval.
• They are designed to search for specific values or ranges of values within the table entries.
• A database query can often read almost like plain English, with the most common query language being Structured Query Language (SQL).
• A well-designed database can retrieve a set of table rows matching certain criteria much faster than a
scan of a flat file.
• The main reason for databases' quick retrieval is the database index, which is a data structure that
helps the database software find relevant data quickly.
• The administrator of the database needs to choose which columns of the tables are to be indexed, if
default settings aren't appropriate.
• Databases are also good at joining tables, which involves taking two tables of data and combining
them to create another table that contains some of the information of both the original tables.
• Joining can be a large operation if the original tables are big, so the size of those tables should be minimized before joining.
• It's a good general rule to query the data first before joining, as there might be far less matching to
do and the execution of the operation will be much faster overall.
• For more information and guidance on optimizing database operations, practical database books are
available.
• If you have a relatively large data set and your code or software tool is spending a lot of time
searching for the data it needs at any given moment, setting up a database is worth considering.
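A minimal sketch of these ideas with Python's built-in sqlite3 module, using an illustrative schema: an index on a frequently searched column, and WHERE conditions that shrink both tables before the join:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

cur.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, region TEXT)")
cur.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL)")
cur.execute("CREATE INDEX idx_orders_customer ON orders(customer_id)")  # speeds up lookups and joins

cur.executemany("INSERT INTO customers VALUES (?, ?)", [(1, "North"), (2, "South")])
cur.executemany("INSERT INTO orders VALUES (?, ?, ?)",
                [(1, 1, 99.0), (2, 2, 45.0), (3, 1, 150.0)])

# Query first, then join: the WHERE conditions leave far less matching to do.
cur.execute("""
    SELECT c.id, SUM(o.amount)
    FROM customers AS c
    JOIN orders AS o ON o.customer_id = c.id
    WHERE c.region = 'North' AND o.amount > 50
    GROUP BY c.id
""")
print(cur.fetchall())
conn.close()
```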
• NoSQL (Not only SQL) allows for database schemas outside traditional SQL-style relational
databases.
• Graph and document databases are typically classified as NoSQL databases.
• Many NoSQL databases return query results in familiar formats, like Elasticsearch and MongoDB.
• Elasticsearch is a document-oriented database that excels at indexing text contents, ideal for
operations like counting word occurrences in blog posts or books.
• NoSQL databases offer flexibility in schema, allowing for the incorporation of various data types.
• MongoDB is easy to set up but may lose performance if its indexing and schema are not rigorously optimized.
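A rough sketch of querying a document-oriented NoSQL database with the third-party pymongo driver (pip install pymongo); it assumes a MongoDB server on the default local port, and the database, collection, and fields are illustrative:

```python
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017/")
posts = client["blog"]["posts"]

# Documents in the same collection need not share an identical schema.
posts.insert_one({"author": "alice", "title": "Hello", "tags": ["intro"]})
posts.insert_one({"author": "bob", "title": "Graphs", "words": 1200})

for doc in posts.find({"author": "alice"}):   # query by field value
    print(doc["title"])
```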
Understanding APIs and Data Collection
Tumblr's API
• Tumblr's public API allows users to request and receive information about Tumblr content in JSON
format.
• The API is a REST API accessible via HTTP.
Aptitude in APIs
• Accurate API usage can be a powerful tool in data collection due to the vast amount of data
available through these gateways.
Common Bad Formats
• Data scientists generally avoid typical office software suites, such as word processing programs, spreadsheets, and mail clients, as data formats when data science is involved.
• Specialized programs are used for data analysis, as office programs are usually incapable of the analysis needed.
• OpenOffice Calc and Microsoft Excel allow individual sheets to be exported into CSV format.
• Text can be exported from Microsoft Word documents into plain text, HTML, or XML.
• Text can be exported from PDFs into plain text files for analysis.
Unusual Formats
• This category includes data formats and storage systems unfamiliar to the user.
• Some formats are archaic or have been superseded by another format.
• Some formats are highly specialized.
• When encountering unfamiliar data storage systems, the user searches online for examples of similar
systems and decides if it's worth the trouble.
• If the data is worth it, the user generalizes from similar examples and gradually expands from them.
• Dealing with unfamiliar data formats or storage systems requires exploration and seeking help.
• The decision to use existing data or seek more is complex due to the variability of data sets.
• An example of this is Uber's data sharing with the Taxi and Limousine Commission (TLC).
• The TLC required ZIP codes for pick-up and drop-off locations, but ZIP codes are not very specific, since a single ZIP code can cover a large area.
• Addresses or city blocks would be better for data analysis, but this poses legal issues regarding user
privacy.
• After initial disappointment, it's important to check if the data will suffice or if additional data is
needed.
• A simple way to assess this is to run through specific examples of your intended analyses.
• The decision should be based on the project's goals and the specific questions you're aiming to
answer.
• If your current data set is insufficient, it might be possible to combine data sets to find answers.
• This can be likened to fitting puzzle pieces together, where each piece needs to cover precisely what
the other pieces don't.
CHAPTER: Data Wrangling Overview
Common Heuristics
Comparisons in Athletics
• Armchair track and field enthusiasts often compare world records based on their age or closeness to
breaking them.
• Michael Johnson's 200 m dash world record was 12 years old when Usain Bolt broke it.
• The 100 m world record was less than a year old when Usain Bolt broke it in 2008, and he broke it again at the 2008 Olympics and the 2009 World Championships.
• The age of a world record does not necessarily indicate its strength; Bolt's 19.30 sec 200 m mark, though broken after only a year by his own 19.19 sec run, was not a weak record.
• The percentage improvement of a mark over the second-best mark is often used as evidence of good
performance, but this is not perfect due to the high variance of second-best performances.
• The IAAF Scoring Tables of Athletics are the most widely accepted method for comparing
performance between events in track and field.
• The IAAF publishes an updated set of points tables every few years.
• The tables are used in multidiscipline events like men’s decathlon and women’s heptathlon.
• The scoring tables for individual events have little effect on competition, except for certain track and
field meetings that award prizes based on the tables.
• The 2008 Scoring Tables gave Usain Bolt’s 2009 performance a score of 1374, indicating a dramatic
change in his performance.
• Because the tables are based on a relatively small set of the best performances in each event, the incredible 100 m performances of the 2008 and 2009 seasons affected the next set of scores, released in 2011.
• The author aims to use all available data to generate a scoring method that is less sensitive to
changes in the best performances and a good predictor of future performance levels.
• Understanding the starting point of valuable data within each HTML page is crucial.
• Character sequences in raw HTML can cause confusion and potential errors.
• These sequences are often HTML entity representations of characters like ü or é (a small sketch below shows one way to decode them).
• It's important to double-check everything, both manually and programmatically.
• A quick scroll through post-wrangle, clean data files can reveal obvious mistakes.
• Extra tab characters can interfere with parsing algorithms, including standard R packages and Excel
imports.
• Every case requires careful consideration of potential parsing errors.
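A small sketch of decoding such HTML entity sequences with Python's built-in html module; the sample string is made up:

```python
import html

# Entity sequences such as &uuml; or &eacute; often appear in raw scraped pages.
raw = "M&uuml;ller set a record in Z&uuml;rich; the caf&eacute; was closed."
print(html.unescape(raw))   # -> "Müller set a record in Zürich; the café was closed."
```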
• Awareness is the most important aspect of data wrangling.
• Wrangling scripts start at the beginning of the file and finish at the end, but unexpected changes can
occur in the middle.
• It's crucial to examine the wrangled data file(s) at the beginning, end, and some places in the middle
to ensure the expected state.
• Nonstandard lists of best performances can be found at the bottom of the pages.
• The HTML tag that denotes the beginning of the desired data is closed at the end of the main list.
• This tag closure is a good way to end the parsing of the data set.
• If the wrangling script ignores the end of the useful data set, it may collect nonstandard results at the
bottom of the page or fail to know what to do when the data stops fitting the established column
format.
• Looking at the end of the wrangled data file is crucial to determine if the data wrangling was
successful.
• The data scientist should decide which aspects of wrangling are most important and ensure they are
completed properly.
• The process involves imagining oneself as a wrangling script, parsing through raw data, and
extracting necessary parts.
• A potential solution is to download all web pages containing all Olympic track and field events and
parse them using HTML structure.
• However, this requires a list of web addresses for individual events to download programmatically.
• Each page has a unique address that needs to be copied or typed manually, which could be
time-consuming.
• The author decided not to go with web scraping, instead opting for the post-HTML, already rendered
web page version.
• The author would visit each of the 48 web pages, select all text, copy the text, and paste the text into
a separate flat file.
• This method eliminates the need for translating HTML or scripting the downloading of the pages.
• The choice of data wrangling plan should depend on all the information discovered during initial
investigation.
• The author suggests pretending to be a wrangling script, imagining what might happen with the data, and then writing the script later.
• Imagine being a script and reading through files to understand the complexity of the task.
• Use simple tools like the Unix command line for simpler tasks like extracting lines, converting
occurrences of a word, and splitting files.
• For more complex operations, a scripting language like Python or R is recommended.
• Writing a wrangling script is not a well-orchestrated affair; it involves trying various techniques until
finding the best one.
• The most important capability to strive for when choosing scripting languages or tools is the ability
to load, manipulate, write, and transform data quickly.
• Make informed decisions about how best to wrangle the data, or guess and check if that is more time-efficient.
• Consider the manual-versus-automate question: can you wrangle manually in a shorter time than you
can write a script?
• Stay aware of the status of the data, the script, the results, the goals, and what each wrangling step
and tool is gaining you.
• Major operating systems still have disagreements on line endings in text files.
• Unix and Linux have used line feed (LF) denotation for new lines since the 1970s.
• Mac OS versions prior to Mac OS X used the carriage return (CR) character for new lines.
• Mac OS joined the Unix derivatives in using the line feed when it moved to the Unix-based Mac OS X, but Microsoft Windows uses a hybrid CR+LF line ending.
• Improper line ending parsing can lead to various problems.
• Each programming language has its own capabilities for reading different file types.
• Symptoms of mishandled OS file formats and line endings include more lines of text than expected, too few lines of text, and interspersed weird-looking characters (the sketch below shows one way to read files robustly).
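A small sketch of reading a text file while normalizing LF, CR, and CR+LF endings, relying on Python's universal-newlines behavior; the file name is a placeholder:

```python
# newline=None (the default in text mode) translates \r, \n, and \r\n to "\n".
with open("results.txt", encoding="utf-8", newline=None) as f:
    lines = f.read().splitlines()

print(len(lines), "lines read")
```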
• Incorrect data can sneak into projects without causing an error or making itself obvious.
• Summary statistics and exploratory graphs can help catch these errors.
• Checking the range of values—minimum to maximum—could catch the error.
• Plotting histograms of all data can help check for errors and gain awareness about the data sets.
• Generating statistical or visual summaries can prevent errors and promote awareness.
• Techniques like basic descriptive statistics, summaries, and diagnostics can be used to ensure
successful data assessment.
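A small sketch of generating such summaries with the third-party pandas library (pip install pandas); the file name and column are placeholders for your own wrangled data, and the histogram call additionally needs matplotlib:

```python
import pandas as pd

df = pd.read_csv("wrangled_results.csv")

print(df.describe())                        # count, mean, min, max, quartiles per numeric column
print(df["mark"].min(), df["mark"].max())   # quick range check on an assumed column

df["mark"].hist()                           # histogram for a visual sanity check
```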
Facets of Data
• Structured data: Depends on a data model and resides in a fixed field within a record. It can be
stored in tables within databases or Excel files. SQL is the preferred way to manage and query data
that resides in databases.
• Unstructured data: Content is context-specific or varying, making it difficult to fit into a data model. Regular emails are an example: an email contains structured elements such as the sender, but its content is hard to analyze, for instance because of the numerous ways to refer to a person and the thousands of different languages and dialects in use.
• Graph data refers to mathematical structures that model pair-wise relationships between objects.
• Graph structures use nodes, edges, and properties to represent and store graphical data.
• Graph-based data is a natural representation of social networks, allowing for calculation of specific
metrics like influence and shortest path.
• Examples of graph-based data include LinkedIn's company list and Twitter's follower list.
• Power and sophistication come from multiple, overlapping graphs of the same nodes.
• Graph databases store graph-based data and are queried with specialized query languages like
SPARQL.
• Beyond graph data, audio, video, and image data pose their own challenges for computer interpretation.
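As a toy illustration of the graph data described above, the sketch below represents a small, made-up follower network as an adjacency list and finds a shortest path with breadth-first search:

```python
from collections import deque

# Nodes are people, edges are "follows" relationships (all values are made up).
graph = {
    "alice": ["bob", "carol"],
    "bob": ["dave"],
    "carol": ["dave", "erin"],
    "dave": ["erin"],
    "erin": [],
}

def shortest_path(start, goal):
    """Breadth-first search returning one shortest path, or None if unreachable."""
    queue, seen = deque([[start]]), {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        for neighbor in graph.get(path[-1], []):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(path + [neighbor])
    return None

print(shortest_path("alice", "erin"))   # e.g. ['alice', 'carol', 'erin']
```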