Data Evolution Unit 1 Material

The document outlines the syllabus for a course on data science, covering topics such as data evolution, types of data, data wrangling, and the significance of big data. It explains the definitions and characteristics of data, including structured, semi-structured, and unstructured data, as well as various data sources and collection methods. The evolution of data science as a discipline and its integration with artificial intelligence are also discussed, highlighting the ethical concerns and tools used in the field.


UNIT 1 SYLLABUS AND MATERIAL

1. Introduction. Course Overview. Objectives of the course


2. Data Evolution: Data to Data Science – Understanding data, Introduction – Type of Data
3. Data Evolution – Data Sources
4. Preparing and gathering data and knowledge - Philosophies of data science.
5. Data all around us: the virtual wilderness
7. Data wrangling: from capture to domestication
8. Data science in a big data world
9. Benefits and uses of data science and big data - facets of data
Data Evolution : Data to Data Science
Definition and Use of Data (general perspective)
• Data are discrete or continuous values conveying information.
• Data can be abstract ideas or concrete measurements.
• Data are used in scientific research, economics, and human organizational activities.

Concept of Data

• Data are the smallest units of factual information.


• Thematically connected data is viewed as information.
• Contextually connected pieces of information are described as data insights or intelligence.
• The stock of insights and intelligence resulting from the synthesis of data into information is
described as knowledge.

Definition of Data (COMPUTER SCIENCE)


• Data is any sequence of symbols, with datum being a single symbol of data.
• Data requires interpretation to become information.
• Digital data is represented using the binary number system of ones and zeros.
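As a minimal illustration (Python is used here only as an example; it is not prescribed by this material), the binary form of each character in a short piece of text can be printed directly:

    # Show the 8-bit binary representation of each character in a short text
    text = "DATA"
    for ch in text:
        print(ch, format(ord(ch), "08b"))   # e.g. D -> 01000100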

States of Data
• Data exists in three states: data at rest, data in transit, and data in use.
• Data within a computer moves as parallel data, while data moving to or from a computer moves as
serial data.
• Data representing quantities, characters, or symbols are stored and recorded on various recording
media and transmitted in the form of digital signals.

Storage of Data (Data Structures)


• Physical computer memory elements consist of an address and a byte/word of data storage.
• Digital data are often stored in relational databases and can be represented as abstract key/value
pairs.
• Data can be organized in various types of data structures, including arrays, graphs, and objects
• To store data bytes in a file, they must be serialized in a file format.
• Executable files contain programs; all other files are data files.
• The line between program and data can become blurry.
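As a brief, hedged sketch of serialization (Python and the JSON file format are chosen only for illustration; the file name and record are invented), an in-memory data structure can be written to a file and read back:

    import json

    # An in-memory data structure: a dictionary of key/value pairs
    record = {"id": 101, "name": "sensor-A", "readings": [20.5, 21.0, 19.8]}

    # Serialize the data structure to a file in a defined file format (JSON)
    with open("record.json", "w") as f:
        json.dump(record, f)

    # Deserialize it back into a data structure
    with open("record.json") as f:
        restored = json.load(f)
    print(restored["readings"])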

UNDERSTANDING DATA
META DATA and its CHARACTERISTICS
DATA about DATA is called META DATA
• Metadata helps translate data to information.
• Data relating to physical events or processes has a temporal component.
• Computers follow a sequence of instructions given in the form of data.
• A single datum is a value stored at a specific location, allowing computer programs to operate on
other computer programs by manipulating their programmatic data.

Collection and Analysis of Data


• Data are collected using techniques like measurement, observation, query, or analysis.
• Field data are collected in an uncontrolled in-situ environment.
• Experimental data are generated in a controlled scientific experiment.
• Data are analyzed using techniques such as calculation, reasoning, discussion, presentation,
visualization, or other forms of post-analysis.

Data Evolution : The Evolution of Personal Computing

1940s to 1989 – Data Warehousing and Personal Desktop Computers


• The world's first programmable computer, ENIAC, was developed by the U.S. army during World
War 2.
• Bell Labs built the first transistorized computer, TRADIC, in 1954, allowing data centers to
branch out of the military.
• The first personal desktop computer with a Graphical User Interface (GUI) was Lisa, released by
Apple Computers in 1983.
• Companies like Apple, Microsoft, and IBM released a wide range of personal desktop computers,
leading to widespread personal computer use.

1989 to 1999 – Emergence of the World Wide Web


• Between 1989 and 1993, Sir Tim Berners-Lee created the fundamental technologies for the World
Wide Web.
• The decision to make the underlying code for these web technologies free led to a massive explosion
in data access and sharing.

2000s to 2010s – Controlling Data Volume, Social Media and Cloud Computing
• Companies like Amazon, eBay, and Google generated large amounts of web traffic and unstructured
data.
• Amazon Web Services (AWS) launched in 2002 and began offering cloud infrastructure services
such as S3 and EC2 in 2006, attracting customers like Dropbox, Netflix, and Reddit.
• Social media platforms like MySpace, Facebook, and Twitter spread unstructured data, leading to
the creation of Hadoop and of NoSQL databases.

Data To Big Data and Data Science Evolution


Early Beginnings (1960s-1970s):
• Rooted in the world of statistics, with early computers enabling data analysis and visualization.
Emergence of Data Mining (1980s-1990s):
• Researchers developed algorithms to uncover meaningful patterns and insights within vast datasets.
• Worked on classification, clustering, and association rule mining.
Growth of Data Warehousing (1990s):
• Data warehousing technologies allowed organizations to centralize and manage large data volumes.
Rise of Machine Learning (1990s-Present):
• Advances in algorithms and techniques paved the way for predictive modeling and data-driven
decision-making.

Big Data Era (2000s-Present):


• The 2000s saw the explosive growth of data due to the internet, social media, and sensors.
• Technologies like Hadoop and MapReduce were developed to process and analyze these massive
datasets.

Data Science as a Discipline (2000s-Present):


• The term "data science" gained popularity in the early 2000s.
Data Science in Industry (2010s-Present):
• Data science became indispensable across various industries.

Tools and Frameworks (2010s-Present):


• The open-source ecosystem for data science tools and frameworks expanded rapidly.

Ethical and Regulatory Concerns (2010s-Present):


• Concerns about data privacy, algorithmic bias, and ethical considerations gained prominence.

Artificial Intelligence and Data Science Integration (2020s-Present):


• Data science has become tightly intertwined with AI, with machine learning and deep learning
playing central roles.

BIG DATA

• Big data refers to very large quantities of structured, semi-structured, and unstructured data.
• It arrives at a higher volume, at a faster rate, in a wider variety of file formats, and from a wider
variety of sources.
• The term was officially coined by NASA researchers in 1997 to describe processing and visualizing
vast amounts of data from supercomputers.
• Doug Laney's 2001 paper established three primary components of big data: Volume (size of data),
Velocity (speed of data growth), and Variety (number of data types and sources).

Big Data and Processing

• Traditional data analysis methods and computing systems struggle to work with such large datasets.
• The new field of data science uses machine learning and AI methods for efficient applications of
analytic methods to big data.
“Big Data refers to the dynamic, large and disparate volumes of data being created by people, tools,
and machines.
• It requires new, innovative, and scalable technology to collect, host, and analytically process
the vast amount of data gathered in order to derive real-time business insights that relate to
consumers, risk, profit, performance, productivity management, and enhanced
shareholder value.”
• There is no one definition of Big Data, but there are certain elements that are common across
the different definitions, such as velocity, volume, variety, veracity, and value. These are the
V's of Big Data.
Velocity is the speed at which data accumulates.
• Data is being generated extremely fast, in a process that never stops.
• Near or real-time streaming, local, and cloud-based technologies can process information
very quickly.
Example for Velocity: Every 60 seconds, hours of footage are uploaded to YouTube, generating data
continuously.
• Think about how quickly data accumulates over hours, days, and years.
Volume is the scale of the data, or the increase in the amount of data stored.
• Drivers of volume are the increase in data sources, higher resolution sensors, and
scalable infrastructure.
Example for Volume: The world population is approximately seven billion people and the vast
majority are now using digital devices; mobile phones, desktop and laptop computers, wearable
devices, and so on. These devices all generate, capture, and store data -- approximately 2.5 quintillion
bytes every day.
Variety is the diversity of the data.
• Structured data fits neatly into rows and columns, in relational databases while
unstructured data is not organized in a pre-defined way, like Tweets, blog posts, pictures,
numbers, and video.
• Variety also reflects that data comes from different sources, machines, people, and
processes, both internal and external to organizations.
• Drivers are mobile technologies, social media, wearable technologies, geo technologies,
video, and many, many more.
Example for Variety: Let's think about the different types of data; text, pictures, film, sound,
health data from wearable devices, and many different types of data from devices connected to
the Internet of Things.
Veracity is the quality and origin of data, and its conformity to facts and accuracy.
• Attributes include consistency, completeness, integrity, and ambiguity.
• Drivers include cost and the need for traceability.
• With the large amount of data available, the debate rages on about the accuracy of data in the
digital age. Is the information real, or is it false?
• Example for Veracity: 80% of data is considered to be unstructured and we must devise ways
to produce reliable and accurate insights.
• The data must be categorized, analyzed, and visualized.
Value is our ability and need to turn data into value.
• Value isn't just profit.
• It may have medical or social benefits, as well as customer, employee, or personal
satisfaction.
• The main reason that people invest time to understand Big Data is to derive value from it.
DATA AND TYPES OF DATA
• Data is unorganized information that is processed to make it meaningful.
• Generally, data comprises facts, observations, perceptions, numbers, characters, symbols,
and images that can be interpreted to derive meaning.
• One of the ways in which data can be categorized is by its structure.
Data can be:
• Structured;
• Semi-structured, or
• Unstructured.
Structured data
• Structured data has a well-defined structure or adheres to a specified data model, can be
stored in well-defined schemas such as databases, and in many cases can be represented in a
tabular manner with rows and columns.
• Structured data is objective facts and numbers that can be collected, exported, stored,
and organized in typical databases.
• Some of the sources of structured data could include:
1. SQL databases and Online Transaction Processing (OLTP) systems that focus on business
transactions;
2. Spreadsheets such as Excel and Google Spreadsheets, and online forms;
3. Sensors such as Global Positioning Systems (GPS) and Radio Frequency Identification (RFID)
tags; and
4. Network and web server logs.
• Structured data is stored in relational or SQL databases.
Semi-structured data
• Semi-structured data is data that has some organizational properties but lacks a fixed or rigid
schema.
• Semi-structured data cannot be stored in the form of rows and columns as in databases.
• It contains tags and elements, or metadata, which is used to group data and organize it in a
hierarchy.
Some of the sources of semi-structured data could include:
• E-mails, XML, and other markup languages, Binary executables, TCP/IP packets, Zipped
files,
• Integration of data from different sources.
• XML and JSON allow users to define tags and attributes to store data in a hierarchical form
and are used widely to store and exchange semi-structured data.
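As a small, hedged illustration (the record and its field names are invented for the example), the same contact record can be expressed in XML and in JSON and parsed with Python's standard library:

    import json
    import xml.etree.ElementTree as ET

    # The same semi-structured record expressed in XML and in JSON
    xml_text = "<contact><name>Asha</name><email>asha@example.com</email></contact>"
    json_text = '{"contact": {"name": "Asha", "email": "asha@example.com"}}'

    root = ET.fromstring(xml_text)       # parse the XML hierarchy
    print(root.find("name").text)        # -> Asha

    record = json.loads(json_text)       # parse the JSON hierarchy
    print(record["contact"]["email"])    # -> asha@example.com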
Unstructured data
• Unstructured data is data that does not have an easily identifiable structure and,
therefore, cannot be organized in a mainstream relational database in the form of rows and
columns.
• It does not follow any particular format, sequence, semantics, or rules.
• Unstructured data can deal with the heterogeneity of sources and has a variety of business
intelligence and analytics applications.
Some of the sources of unstructured data could include:
• Web pages, Social media feeds, Images in varied file formats (such as JPEG, GIF, and PNG),
video and audio files, documents and PDF files, PowerPoint presentations, media logs; and surveys.
• Unstructured data can be stored in files and documents (such as a Word doc) for
manual analysis or in NoSQL databases that have their own analysis tools for examining this
type of data.

DATA SOURCES
Some common data sources include:
• Relational databases;
• Flat files and XML datasets;
• APIs and web services;
• Web scraping;
• Data streams and feeds.
Some of the standard file formats that we use include:
• Delimited text file formats;
• Microsoft Excel Open XML Spreadsheet, or XLSX;
• Extensible Markup Language, or XML;
• Portable Document Format, or PDF;
• JavaScript Object Notation, or JSON.
Typically, organizations have internal applications to support them in managing their day to
day business activities, customer transactions, human resource activities, and their
workflows.
These systems use relational databases such as SQL Server, Oracle, MySQL, and IBM
DB2, to store data in a structured way. Data stored in databases and data warehouses can
be used as a source for analysis. For example, data from a retail transactions system can be
used to analyze sales in different regions, and data from a customer relationship
management system can be used for making sales projections.
External to the organization, there are other publicly and privately available datasets. For
example, government organizations release demographic and economic datasets on an
ongoing basis. Then there are companies that sell specific data, for example, point-of-sale
data, financial data, or weather data, which businesses can use to define strategy, predict
demand, and make decisions related to distribution or marketing promotions, among other
things. Such data sets are typically made available as flat files, spreadsheet files, or XML
documents.
Flat files, store data in plain text format, with one record or row per line, and each
value separated by delimiters such as commas, semi-colons, or tabs.
Data in a flat file maps to a single table, unlike relational databases that contain
multiple tables.
One of the most common flat-file formats is CSV, in which values are separated by commas.
Spreadsheet files are a special type of flat files, that also organize data in a tabular
format–rows and columns.
But a spreadsheet can contain multiple worksheets, and each worksheet can map to a
different table.
Although data in spreadsheets is in plain text, the files can be stored in custom formats and
include additional information such as formatting, formulas, etc.
Microsoft Excel, which stores data in .XLS or .XLSX format, is probably the most
common spreadsheet application.
Others include Google Sheets, Apple Numbers, and LibreOffice.
XML files, contain data values that are identified or marked up using tags.
While data in flat files is “flat” or maps to a single table, XML files can support
more complex data structures, such as hierarchical.
Some common uses of XML include data from online surveys, bank statements, and
other unstructured data sets.
Many data providers and websites provide APIs, or Application Program Interfaces, and
Web Services, which multiple users or applications can interact with and obtain data for
processing or analysis.
APIs and Web Services typically listen for incoming requests, which can be in the form of
web requests from users or network requests from applications, and return data in plain text,
XML, HTML, JSON, or media files.
Let’s look at some popular examples of APIs being used as a data source for data
analytics: The use of Twitter and Facebook APIs to source data from tweets and posts for
performing tasks such as opinion mining or sentiment analysis—which is to summarize the
amount of appreciation and criticism on a given subject, such as policies of a government, a
product, a service, or customer satisfaction in general.
Stock Market APIs used for pulling data such as share and commodity prices, earnings
per share, and historical prices, for trading and analysis.
Data Lookup and Validation APIs, which can be very useful for Data Analysts for
cleaning and preparing data, as well as for correlating data—for example, to check which
city or state a postal or zip code belongs to.
APIs are also used for pulling data from database sources, within and external to the
organization.
Web Scraping is used to extract relevant data from unstructured sources.
Also known as screen scraping, web harvesting, and web data extraction, web scraping
makes it possible to download specific data from web pages based on defined parameters.
Web scrapers can, among other things, extract text, contact information, images,
videos, product items, and much more from a website.
Some popular uses of web scraping include collecting product details from retailers,
manufacturers, and eCommerce websites to provide price comparisons; generating sales
leads through public data sources; extracting data from posts and authors on various forums
and communities; and collecting training and testing datasets for machine learning
models. Some of the popular web scraping tools include BeautifulSoup, Scrapy, Pandas, and
Selenium.
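A minimal, hedged sketch of web scraping with the third-party requests and BeautifulSoup packages (the URL and the tag chosen for extraction are placeholders; a real site's copyright and terms of service should be checked before scraping):

    import requests
    from bs4 import BeautifulSoup

    url = "https://example.com/products"        # placeholder page to scrape
    html = requests.get(url, timeout=10).text   # fetch the raw HTML

    soup = BeautifulSoup(html, "html.parser")   # parse the page
    for heading in soup.find_all("h2"):         # extract every <h2> element as an example
        print(heading.get_text(strip=True))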
Data streams are another widely used source for aggregating constant streams of data
flowing from sources such as instruments, IoT devices, and applications, GPS data from
cars, computer programs, websites, and social media posts.
This data is generally timestamped and also geo-tagged for geographical identification.
Some of the data streams and ways in which they can be leveraged include:
stock and market tickers for financial trading; retail transaction streams for predicting
demand and supply chain management; surveillance and video feeds for threat
detection; social media feeds for sentiment analysis; sensor data feeds for monitoring
industrial or farming machinery; web click feeds for monitoring web performance and
improving design; and real-time flight events for rebooking and rescheduling.
Some popular applications used to process data streams include Apache Kafka,
Apache Spark Streaming, and Apache Storm.
RSS (or Really Simple Syndication) feeds are another popular data source.
These are typically used for capturing updated data from online forums and news sites
where data is refreshed on an ongoing basis.
Using a feed reader, which is an interface that converts RSS text files into a stream
of updated data, updates are streamed to user devices.

PHILOSOPHIES OF DATA SCIENCE


Data science can be viewed as a set of processes and concepts that act as a guide for making progress
and decisions within a data-centric project.
This contrasts with the view of data science as merely a set of statistical and software tools and the
knowledge to use them.
The thought process of a data scientist is more important than the specific tools used, and certain
concepts pervade nearly all aspects of work in data science.
The origins of data science as a field of study or vocational pursuit lie somewhere between statistics
and software development. Statistics can be thought of as the schematic drawing and software as
the machine.
In addition to statistics and software, many folks say that data science has a third major component,
which is something along the lines of subject matter expertise or domain knowledge.
Although it certainly is important to understand a problem before you try to solve it, a good data
scientist can switch domains and begin contributing relatively soon.
Data should drive software only when that software is being built expressly for moving, storing, or
otherwise handling the data. Software that's intended to address project or business goals should not
be driven by data.
Problems and goals exist independently of any data, software, or other resources, but those
resources may serve to solve the problems and to achieve the goals. The term data-centric reflects
that data is an integral part of the solution, and we need to view the problems not from the
perspective of the data but from the perspective of the goals and problems that data can help us
address.
Locating and maintaining focus on the most important aspects of a project is one of the most
valuable skills.
Data scientists must have many hard skills—knowledge of software development and statistics
among them—but the soft skill of maintaining appropriate perspective and awareness of the many
moving parts in any data-centric problem is very difficult yet very rewarding for most data
scientists.
Common roles across the data ecosystem include:
● Data analyst
● Data engineer
● Data scientist
● Machine learning engineer
● Database administrator
● Business analyst
● Chief data officer
● Artificial intelligence engineer
Sometimes data quality becomes an important issue; sometimes the major issue is data volume,
processing speed, parameters of an algorithm, interpretability of results, or any of the many other
aspects of the problem. Ignoring any of these at the moment it becomes important can compromise
or entirely invalidate subsequent results. The goal of a data scientist is to make sure that no
important aspect of a project goes awry unnoticed. When something goes wrong—and something
will—data scientist has to notice it and can fix it. Data scientist must also maintain awareness of all
aspects of a project, particularly those in which there is uncertainty about potential outcomes.

1. DATA SCIENCE PROCESS OR LIFE CYCLE OF DATA SCIENCE


The lifecycle of a data science project can be divided into three phases: preparation, building, and finishing.
This book is organized around these phases. The first part covers preparation, emphasizing that a bit
of time and effort spent gathering information at the beginning of the project can save considerable
trouble later.
The second part covers building a product for the customer, from planning to execution, using what
you’ve learned from the first section as well as all of the tools that statistics and software can
provide.
The third and final part covers finishing a project: delivering the product, getting feedback, making
revisions, supporting the product, and wrapping up a project neatly.

CHARACTERISTICS OF A DATA SCIENTIST


2. AWARENESS IS VALUABLE
One of the most pervasive discrepancies is between the perspective of a data scientist and that of a
"pure" software developer—one who doesn't normally interact with raw or "unwrangled" data.
Data that is to be processed may come in all shapes and sizes, and parsing it for useful
information is a challenge.
Two main strategies for addressing this challenge:
manual brute force and scripting.
We could also use some mixture of the two.
Brute force would entail creating a template for each data format as well as a new template every
time the format changed. A script that could parse any data format and extract the relevant
information would be a better method, but it is extremely complex and almost impossible to write.
A compromise between the two extreme approaches seems best, as it usually does.
Compromise between brute force and pure scripting: develop some simple templates for the most
common formats, check for similarities and common structural patterns, and then write a simple
script that could match and extract data. Practically this solution may not give 100% success.
Awareness is incredibly valuable when working on problems involving data. A good developer using
good tools to address what seems like a very tractable problem can run into trouble if they haven't
considered the many possibilities that can happen when code begins to process data.
A data scientist’s main responsibility is to try to imagine all of the possibilities, address the ones that
matter, and reevaluate them all as successes and failures happen.

3. DEVELOPER VS. DATA SCIENTIST


A good software developer (or engineer) and a good data scientist have several traits in common.
Both are good at designing and building complex systems with many interconnected parts; both are
familiar with many different tools and frameworks for building these systems; both are adept at
foreseeing potential problems in those systems before they’re actualized.
---In general, software developers design systems consisting of many well-defined components,
whereas data scientists work with systems wherein at least one of the components isn’t well defined
prior to being built, and that component is usually closely involved with data processing or analysis.
The systems of software developers and those of data scientists can be compared with the
mathematical concepts of logic and probability, respectively. The logical statement “if A, then B” can
be coded easily in any programming language. The probabilistic statement “if A, then probably B”
isn’t nearly as straightforward.
Examples: Any good data-centric application contains many such statements—
consider the Google search engine (“These are probably the most relevant pages”), product
recommendations on Amazon.com (“We think you’ll probably like these things”),website analytics
(“Your site visitors are probably from North America and each views about three pages”).
Data scientists specialize in creating systems that rely on probabilistic statements about data and
results.
A software developer needs expertise, but a data scientist needs exploration.
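A tiny, hedged sketch (the function names and probability value are invented for illustration) of the difference between the logical and the probabilistic statement in code:

    import random

    def logical(a: bool) -> bool:
        # "if A, then B": deterministic and easy to code
        return a

    def probabilistic(a: bool, p: float = 0.9) -> bool:
        # "if A, then probably B": B follows A only with probability p
        return a and random.random() < p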

4. DO I NEED TO BE A SOFTWARE DEVELOPER?


Knowledge of a statistical software tool is a prerequisite for doing practical data science, but this can
be as simple as a common spreadsheet program (for example, the divisive but near-ubiquitous
Microsoft Excel). In theory, someone could be a data scientist without ever touching a computer or
other device. Understanding the problem, the data, and relevant statistical methods could be
enough, as long as someone else can follow your intentions and write the code. In practice, this
doesn’t happen often.
The focus here is on the process of thinking about and doing data science, but clearly software can't be ignored.
Software—as an industry and its products—is the data scientist’s toolbox. The tools of the craft are
the enablers of work that’s beyond the capabilities of the human mind and body alone.
5. DO I NEED TO KNOW STATISTICS?
As with software, expert knowledge of statistics certainly helps but isn't necessary. Statistics provides
a framework for theorizing about how intangible thoughts, needs, and reactions are eventually
translated into measurable actions that create data, for formulating models of the data-generating
processes, and for translating those models into statistical terminology, equations, and, ultimately, code.
6. PRIORITIES: KNOWLEDGE FIRST, TECHNOLOGY SECOND, OPINIONS THIRD
There are various concerns of every data science project—for example, software versus statistics,
changing business need versus project timeline, data quality versus accuracy of results. Each
individual concern pushes and pulls on the others as a project progresses, and we’re forced to make
choices whenever two of them disagree on a course of action.
Knowledge, technology, and opinions are typically what you have at the beginning of any project;
they are the three things that turn data into answers. Knowledge is what you know for a fact.
Technology is the set of tools you have at your disposal. Opinions are those little almost facts you
want to consider true but shouldn't quite yet. It's important to establish a hierarchy for your thought
processes so that less-important things don't steamroll more-important ones because they're easier
or more popular. In practice, the hierarchy looks like this:
• Knowledge first—Get to know your problem, your data, your approach, and your
goal before you do anything else, and keep those at the forefront of your mind.
• Technology second—Software is a tool that serves you. It both enables and constrains you.
It shouldn't dictate your approach to the problem except in extenuating circumstances.
• Opinions third—Opinions, intuition, and wishful thinking are to be used only as guides toward
theories that can be proven correct and not as the focus of any project.
Goals are not certain to be achieved but are required for any project, so it's imperative not to take
the goal and its attainability for granted. You should always consider current knowledge first and
seek to expand that knowledge incrementally until you either achieve the goal or are forced to
abandon it. In data science, a goal is much less likely to be achievable in exactly its original form.
Remember: knowledge first, then technology, and then opinion. It’s not a perfect
framework, but it will be helpful.
The Role of Data Scientists
• Data scientists are like early European explorers, accessing interesting areas, recognizing new and
interesting things, handling new, unfamiliar, or sensitive things, evaluating new and unfamiliar things,
drawing connections between familiar and unfamiliar things, and avoiding pitfalls.
• Data scientists must survey the landscape, take careful note of surroundings, and dive into unfamiliar
territory to see what happens.
• The existence of data everywhere enables us to apply the scientific method to discovery and analysis
of a preexisting world of data.
• To get real truth and useful answers from data, we must use the scientific method, or the data
scientific method: ask a question, state a hypothesis, make a testable prediction, test the prediction via
an experiment involving data, and draw the appropriate conclusions through analyses of experimental
results.

7. BEST PRACTICES by DATA SCIENTIST


7.1 Documentation
● Comment your code so that a peer unfamiliar with your work can understand
what the code does.
● For a finished piece of software—even a simple script—write a short description
of how to use it and put this in the same place (for example, the same file
folder) as the code.
● Make sure everything—files, folders, documents, and so on—has a meaningful name.

7.2 Code Repositories and Versioning

• Source code repositories (repos) are software products designed to manage source code.
• Modern repos are based on versioning systems, which track changes and allow for creation and
comparison of different versions.
• Repos and versioning take time to learn and integrate into workflows.
• Bitbucket.org and GitHub.com offer free web hosting of code repos, with Git being the most popular
versioning system.
• Remote repo-hosting services serve as a backup, ensuring code safety even in the event of a
computer crash.
• Some code-hosting services offer web interfaces for viewing code history, versions, and
development status.
• Remote repos allow code access from any location with web access, with language-specific code
highlighting and other useful features.
• Tips for repos and versioning include using a remote source code repo, learning Git or another
versioning system, committing code changes frequently, working in a location that doesn't affect
production version or development by other team members, using versioning, branching, and forking
instead of copying and pasting code, and asking Git gurus for best practices.

7.3 Code Organization

• Use common coding patterns for the programming language.


• Use meaningful names for variables and objects for better understanding.
• Use informative comments for better code organization.
• Avoid copying and pasting code.
• Code in chunks with specific functionalities for scripts and applications.
• Avoid premature optimization.
• Code in a logical, coherent fashion.
• Consider potential contributors to your project.
• Spend time organizing and commenting early to ensure understanding.

7.4 Ask Questions

Data Scientists' Awareness and Domain Knowledge

• Data scientists possess a strong sense of awareness, which can be a strength or a weakness.
• The stereotype of introverted academics being too shy to ask for help is often misguided.
• Data scientists, including software engineers, business strategists, sales executives, marketers, and
researchers, possess vast knowledge about their domains or projects.
• In a business setting, it's important to learn about the company and the industry, or domain
knowledge.
• Nontechnical business people may see data scientists as the technical experts, but they themselves
often have a deeper understanding of the project goals and business problems.
• Engaging in discussions with those who understand the business side of the project can illuminate
projects and contribute to domain knowledge.

7.5 Staying Close to Data in Data Analysis

• The concept of being "close to the data" involves minimizing the complexity of the methods and
algorithms used.
• Complex methods, such as black box methods, can be beneficial in certain fields. (but be conscious
of the possibility of mistakes)
• In cases where complex methods have advantages, the concept of being close to the data can be
adapted.
• This can involve verifying, justifying, or supporting results from complex methods with simpler,
close-to-data methods.
• Straying too far from the data without a safety line can lead to difficulties in diagnosing problems.

CHAPTER : DATA ALL AROUND US


Data Science: A Unique Approach

• Data science is the extraction of knowledge from data; the term is used to set its practice apart from
related fields like operations research, decision sciences, analytics, data mining, mathematical
modeling, and applied statistics.
statistics.
• The term is often used to describe the unique tasks data scientists perform, which previous applied
statisticians and data-oriented software engineers did not.
• The data science process involves setting goals, preparing, building, finishing, exploring, wrapping
up, wrangling, revising, assessing, delivering, planning, analyzing, engineering, optimizing, and
executing.

The Information Age: From Computing to Data Generation

• The Information Age, from the second half of the 20th century to the beginning of the 21st century,
is characterized by the rise of computers and the internet.
• Early computers were used for computationally intensive tasks like cracking military codes,
navigating ships, and performing simulations in applied physics.
• The internet developed in size and capacity, allowing data and results to be sent easily across a large
distance, enabling data analysts to amass larger and more varied data sets for study.
• Internet access for the average person in a developed country increased dramatically in the 1990s,
giving hundreds of millions of people access to published information and data.
• Websites and applications began collecting user data in the form of clicks, typed text, site visits, and
other actions a user might take, leading to more data production than consumption.
• The advent of mobile devices and smartphones connected to the internet allowed for an enormous
advance in the amount and specificity of user data being collected.
• The Internet of Things (IoT) includes data collection and internet connectivity in almost every
electronic device, making the online world not just a place for consuming information but a
data-collection tool in itself.

Data Collection and its Purpose in the Information Age

Collecting User Data


• As businesses realized the potential for data sale, they began collecting vast amounts of user data.
• Online retailers, video games, and social networks stored every item, link, and activity.

Data Collection and Its Use


• Major websites and applications use their own data to optimize user experience and effectiveness.
• Publishers often struggle to balance the value of the data sold and its internal use.
• Many keep their data to themselves, hoarding it for future use.

Facebook and Amazon's Data Collection


• Facebook and Amazon collect vast amounts of data daily, but their data is largely unexploited.
• Facebook focuses on marketing and advertising revenue, while Amazon has data that could
potentially revolutionize economic principles or change industry processes.
• Despite their vast data sets, these companies focus on their own use and do not want others to take
their profits.

Access to Data
• Some companies, like Twitter, provide access to their data for a fee.
• An industry has developed around brokering the sale of data for profit.
• Academic and nonprofit organizations often make data sets available publicly and for free, but there
may be limitations on their use.
• There has been a trend towards consolidation of data sets within a single scientific field.

Data's Role and Value


• Data is now ubiquitous and has become a purpose of its own.
• Companies collect data as an end, not a means, and many claim to be planning to use it in the future.

Data Scientist as an Explorer in the 21st Century

Collecting and Exploration of Data


• Data sets are increasingly being collected at unprecedented rates, often not for specific purposes.
• Data analysts are now collecting data first and then deciding what to do with it.
• The internet, ubiquitous electronic devices, and a fear of missing out on hidden value in data have
led to the collection of as much data as possible.

Big Data Innovations


• Big data refers to the recent movement to capture, organize, and use any and all data possible.
• Each innovation begins with a problem that needs to be addressed and then goes through four
phases of development: invention, proof/recognition, adoption, and refinement.
• The current phase of big data collection and widespread adoption of statistical analysis has created
an entire data ecosystem where the knowledge extracted is only a small portion of the total knowledge
contained.

Exceptions to the Data Ecosystem


• Companies like Google, Amazon, Facebook, and Twitter are ahead of the curve in allowing access
to their entire data set and analyzing their data rigorously.
• Google's work on search-by-image, Google Analytics, and its basic text search are examples of solid
statistics on a large scale.

Data Storage and Interaction in Data Science

• Discusses various data formats such as flat files, XML, and JSON.
• Each format has unique properties and idiosyncrasies; knowing them makes it easier to access and
extract the data.
• The discussion also covers databases and APIs, as they are essential for data science projects.
• Data can be accessed as a file on a file system, in a database, or behind an API.
• Data storage and delivery are intertwined in some systems, making them a single concept: getting
data into analysis tools.
• The goal is to provide descriptions that make readers comfortable discussing and approaching each
data format or system.
• The section aims to make data science accessible to beginners, allowing them to move on to the
most important part: what the data can tell.
Understanding Flat Files

Flat Files Overview


• Flat files are plain-vanilla data sets, the default data format.
• They are self-contained and can be viewed in a text editor.
• They contain ASCII (or UTF-8) text, each character using 8 bits of memory/storage.
• A file containing only the word DATA will be of size 32 bits.
• If there is an end-of-line character after the word DATA, the file will be 40 bits.
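As a quick check of that arithmetic (a minimal Python sketch, not part of the original text), the size in bits follows directly from the character count:

    text = "DATA"
    print(len(text.encode("ascii")) * 8)           # 4 characters x 8 bits = 32 bits
    print(len((text + "\n").encode("ascii")) * 8)  # with an end-of-line character: 40 bits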

Types of Flat Files


• Plain text: Words, numbers, and some special characters.
• Delimited: Plain text with a delimiter appearing every so often in the file.
• Table: Data in the file can be interpreted as a set of rows and columns.
• Most programs require the same number of delimiters on each line to ensure consistency in the
number of columns.

Reading Flat Files


• Any common program for manipulating text or tables can read flat files.
• Popular programming languages all include functions and methods that can read such files.
• Python (csv package) and R (read.table function and its variants) contain methods that can load a
CSV or TSV file into the most relevant data types.
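A minimal, hedged example of reading a delimited flat file with Python's csv package (the file name and columns are invented; R's read.table family plays the analogous role):

    import csv

    # Write a tiny delimited flat file, then read it back row by row.
    with open("marks.csv", "w", newline="") as f:
        f.write("athlete,event,mark\n")
        f.write("Usain Bolt,100 m,9.58\n")
        f.write("Usain Bolt,200 m,19.19\n")

    with open("marks.csv", newline="") as f:
        for row in csv.DictReader(f):   # each row becomes a dict keyed by column name
            print(row["athlete"], row["event"], float(row["mark"]))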

Limitations and Considerations


• Flat files are the smallest and simplest common file formats for text or tables.
• They provide no additional functionality other than showing the data, making them inefficient for
larger data sets.
• In cases where reading flat files is too slow, alternative data storage systems are designed to parse
through large amounts of data quickly.

Understanding HTML and Web Scraping

• HTML is a markup language used to interpret plain text.


• It is widely used on the internet and is used to create web pages.
• An HTML interpreter treats everything between the opening and closing tags as HTML, and the
content between the body tags forms the body of the document.
• HTML tags are usually of the format <TAGNAME> to begin and </TAGNAME> to end the
annotation, for an arbitrary TAGNAME.
• A class such as "column" can be applied to a div tag, allowing the interpreter to treat every instance
of that class in a special way.
• Web scraping involves writing code that can fetch and read web pages, interpret the HTML, and
scrape out specific pieces of the HTML page.
• Web scraping can be useful if the data needed isn't contained in other formats.
• It's important to check the website's copyright and terms of service before scraping.

XML Overview

• XML is a more flexible format than HTML, suitable for storing and transmitting documents and
data.
• XML documents begin with a tag declaring a specific XML version.
• XML works similarly to HTML but without most of the overhead associated with web pages.
• XML is used as a standard format for offline documents like OpenOffice and Microsoft Office.
• XML specification is designed to be machine-readable, allowing for data transmission through APIs.
• XML is popular in applications and documents using non-tabular data and other formats requiring
flexibility.

JSON Overview

• JavaScript Object Notation (JSON) is a functionally similar language for data storage or
transmission.
• JSON describes data structures such as lists, maps, or dictionaries found in programming languages.
• Unlike XML, JSON is leaner in terms of character count.
• JSON is popular for transmitting data due to its ease of use.
• It can be read directly as JavaScript code, and many programming languages like Python and Java
have natural representations of JSON.
• JSON is highly efficient for interoperability between programming languages.

Relational Databases: An Overview

• Relational databases are data storage systems optimized for efficient data storage and retrieval.
• They are designed to search for specific values or ranges of values within the table entries.
• A database query can often be expressed in terms close to plain English, with the most common
query language being Structured Query Language (SQL).
• A well-designed database can retrieve a set of table rows matching certain criteria much faster than a
scan of a flat file.
• The main reason for databases' quick retrieval is the database index, which is a data structure that
helps the database software find relevant data quickly.
• The administrator of the database needs to choose which columns of the tables are to be indexed, if
default settings aren't appropriate.
• Databases are also good at joining tables, which involves taking two tables of data and combining
them to create another table that contains some of the information of both the original tables.
• Joining can be a large operation if the original tables are big, so the size of those tables should be
minimized before joining.
• It's a good general rule to query the data first before joining, as there might be far less matching to
do and the execution of the operation will be much faster overall.
• For more information and guidance on optimizing database operations, practical database books are
available.
• If you have a relatively large data set and your code or software tool is spending a lot of time
searching for the data it needs at any given moment, setting up a database is worth considering.
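A minimal, hedged illustration with Python's built-in sqlite3 module (the tables, columns, and rows are invented for the example) of creating an index and filtering before joining:

    import sqlite3

    conn = sqlite3.connect(":memory:")
    cur = conn.cursor()
    cur.execute("CREATE TABLE sales (region TEXT, product_id INTEGER, amount REAL)")
    cur.execute("CREATE TABLE products (product_id INTEGER, name TEXT)")
    cur.executemany("INSERT INTO sales VALUES (?, ?, ?)",
                    [("North", 1, 120.0), ("South", 2, 75.5), ("North", 2, 60.0)])
    cur.executemany("INSERT INTO products VALUES (?, ?)",
                    [(1, "Widget"), (2, "Gadget")])

    # An index on the column used for searching helps the database find rows quickly.
    cur.execute("CREATE INDEX idx_sales_region ON sales(region)")

    # Query (filter) first, then join, so there is far less matching to do.
    cur.execute("""
        SELECT p.name, s.amount
        FROM (SELECT * FROM sales WHERE region = 'North') AS s
        JOIN products AS p ON p.product_id = s.product_id
    """)
    print(cur.fetchall())
    conn.close()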

Non-relational Databases: Efficiency and Flexibility

• NoSQL (Not only SQL) allows for database schemas outside traditional SQL-style relational
databases.
• Graph and document databases are typically classified as NoSQL databases.
• Many NoSQL databases return query results in familiar formats, like Elasticsearch and MongoDB.
• Elasticsearch is a document-oriented database that excels at indexing text contents, ideal for
operations like counting word occurrences in blog posts or books.
• NoSQL databases offer flexibility in schema, allowing for the incorporation of various data types.
• MongoDB is easy to set up but may lose performance if not optimized for rigorous indexing and
schema.
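A small, hedged sketch of a document-oriented workflow using the third-party pymongo driver (it assumes a MongoDB server is running locally on the default port; the database, collection, and documents are invented):

    from pymongo import MongoClient

    client = MongoClient("mongodb://localhost:27017")
    db = client["blog"]   # the database and collection are created lazily on first insert
    db.posts.insert_one({"author": "asha", "title": "Hello", "tags": ["intro", "data"]})

    # Schema flexibility: another document in the same collection can have different fields.
    db.posts.insert_one({"author": "ravi", "words": 850})

    # Query results come back in a familiar dictionary-like format.
    for post in db.posts.find({"author": "asha"}):
        print(post["title"])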
Understanding APIs and Data Collection

APIs as Communication Gateways


• APIs are rules for communicating with software.
• They define the language used in queries to receive data.
• Many websites, like Tumblr, have APIs that allow users to request and receive information about
Tumblr content.

Tumblr's API
• Tumblr's public API allows users to request and receive information about Tumblr content in JSON
format.
• The API is a REST API accessible via HTTP.

API Key and Response


• An API key is a unique string that identifies the user or application making requests to the API.
• The API key can be obtained as a developer and used to request information about a specific blog.

Programmatic API Use


• To capture the Tumblr API response programmatically, an HTTP or URL package in a programming
language is needed.
• The request URL is assembled as a string object/variable and passed to the appropriate URL
retrieval method.
• The response should contain a JSON string similar to the response shown.
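A minimal, hedged sketch of such a programmatic request using Python's standard urllib package (the blog identifier and API key are placeholders, and the URL pattern follows Tumblr's public API documentation, so it should be verified against the current docs):

    import json
    from urllib.request import urlopen

    blog = "example.tumblr.com"    # placeholder blog identifier
    api_key = "YOUR_API_KEY"       # placeholder key obtained as a developer

    # Assemble the request URL as a string and pass it to the URL retrieval method.
    url = f"https://api.tumblr.com/v2/blog/{blog}/info?api_key={api_key}"
    with urlopen(url) as response:
        payload = json.loads(response.read().decode("utf-8"))  # the response is a JSON string
    print(payload)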

Aptitude in APIs
• Accurate API usage can be a powerful tool in data collection due to the vast amount of data
available through these gateways.
Common Bad Formats
• Typical office software suites (word processing programs, spreadsheets, and mail clients) produce
formats that are best avoided when doing data science.
• These programs are usually incapable of the analysis needed, so specialized programs are used for
data analysis instead.
• OpenOffice Calc and Microsoft Excel allow individual sheets to be exported into CSV format.
• Text can be exported from Microsoft Word documents into plain text, HTML, or XML.
• Text can be exported from PDFs into plain text files for analysis.

Unusual Formats
• This category includes data formats and storage systems unfamiliar to the user.
• Some formats are archaic or have been superseded by another format.
• Some formats are highly specialized.
• When encountering unfamiliar data storage systems, the user searches online for examples of similar
systems and decides if it's worth the trouble.
• If the data is worth it, the user generalizes from similar examples and gradually expands from them.
• Dealing with unfamiliar data formats or storage systems requires exploration and seeking help.

Data Formats and Scouting for Data

Choosing Data Formats


• Data formats can be inefficient, unwieldy, or unpopular.
• Secondary data stores can be set up for easier access, but at a cost.
• For applications requiring critical access efficiency, the cost may be worth it.
• For smaller projects, it may not be necessary.

General Rules for Data Formats


• Export data from spreadsheets and office documents into more common formats.
• Choose common formats that suit the data type and the application.
• Don't overspend on converting between formats; weigh the costs and benefits first.

Some common Types of Data and Formats


• Tabular data, small amount: a delimited flat file.
• Tabular data, large amount with lots of searching/querying: a relational database.
• Plain text, small amount: a flat file.
• Plain text, large amount: a non-relational database with text search capabilities.
• Data transmission between application components: JSON.
• Document transmission: XML.

Scouting for Data


• Data can be found in various forms, from file formats to databases to APIs.
• It's important to find data that can help solve problems.
• If internal system data doesn't answer major questions, consider finding a data set that complements
it.
• There's a vast amount of data available online, and a quick search is worth it.
• Highlights the difficulty in finding data that can help solve problems.
• Emphasizes the importance of not taking internal system data for granted.
• Suggests that external data sets can complement existing data and improve results.
• Emphasizes the value of quick searches for potential data aids.
• Encourages focusing on content rather than format in data search.

EXAMPLE Google Search and Data Usage


• Google searches are not perfect and require understanding of what to search for and what to look for.
• Searches for "Tumblr data" and "Tumblr API" yield different results.
• The former returns results involving data as used on Tumblr posts and third parties selling historical
Tumblr data.
• The latter deals almost exclusively with the official Tumblr API, providing up-to-the-minute
information about Tumblr posts.
• Terms like data and API significantly impact web searches.
• When searching for data related to your project, include modifying terms like historical, API, real
time, etc.

Copyright and Licensing


• Data may have licensing, copyright, or other restrictions that can make it illegal to use it for certain
purposes.
• Academic data often has restrictions that the data can't be used for profit.
• Proprietary data, like Tumblr or Twitter, often comes with restrictions that can't be used to replicate
functionality.
• It's best to read any legal documentation offered by the data provider and search for examples of
similar use.
• Without confirming that your use case is legal, you risk losing access to the data or a lawsuit.

Data Science: Choosing the Right Data for Your Project

• The decision to use existing data or seek more is complex due to the variability of data sets.
• An example of this is Uber's data sharing with the Taxi and Limousine Commission (TLC).
• The TLC required ZIP codes for pick-up and drop-off locations, which are not very specific
because a single ZIP code can cover a large area.
• Addresses or city blocks would be better for data analysis, but this poses legal issues regarding user
privacy.
• After initial disappointment, it's important to check if the data will suffice or if additional data is
needed.
• A simple way to assess this is to run through specific examples of your intended analyses.
• The decision should be based on the project's goals and the specific questions you're aiming to
answer.

Combining Data Sources

• If your current data set is insufficient, it might be possible to combine data sets to find answers.
• This can be likened to fitting puzzle pieces together, where each piece needs to cover precisely what
the other pieces don't.
CHAPTER : Data Wrangling Overview

• The word "wrangle" is defined as "having a long and complicated dispute," an apt description of working with raw data.


• Process of converting data into a format for use by conventional software.
• A collection of strategies and techniques applied within a project strategy.
• Not a pre-defined task; each case requires problem-solving.
• Case study used to illustrate specific techniques and strategies.
World Record Comparison in Athletics

Case study: best all-time performances in track and field

• World records are often compared based on their age or on how close other athletes have come to
breaking them.
• Michael Johnson's 200 m dash world record was about 12 years old when Usain Bolt broke it, while
the 100 m world record was less than a year old when Bolt broke it in early 2008, and he broke it
again at the 2008 Olympics and the 2009 World Championships.
• The age of a world record does not necessarily indicate its strength: Bolt's 19.19 sec 200 m mark
was not worse than his 19.30 sec mark merely because it was newer.
• The percentage improvement of a mark over the second-best mark is often used as evidence of a
good performance.
• However, this is not perfect due to the high variance of second-best performances.


IAAF Scoring Tables: A Historical Perspective

• The IAAF Scoring Tables of Athletics are the most widely accepted method for comparing
performance between events in track and field.
• The IAAF publishes an updated set of points tables every few years.
• The tables are used in multidiscipline events like men’s decathlon and women’s heptathlon.
• The scoring tables for individual events have little effect on competition, except for certain track and
field meetings that award prizes based on the tables.
• The 2008 Scoring Tables gave Usain Bolt’s 2009 performance a score of 1374, indicating a dramatic
change in his performance.
• The tables are based on a relatively small set of the best performances in each event, so a few
extreme results can shift the scores in the next update.
• The incredible 100 m performances of the 2008 and 2009 track and field seasons therefore affected
the next set of scores, released in 2011.
• The author aims to use all available data to generate a scoring method that is less sensitive to
changes in the best performances and a good predictor of future performance levels.

Comparing Performances Using All Data

• Alltime-athletics.com provides a comprehensive set of elite track and field performances.


• The site contains thousands of performances in all Olympic events.
• The aim is to improve the robustness and predictive power of the IAAF’s Scoring Tables.
• Data collection involves web scraping and comparing scores with the IAAF Scoring Tables,
available only in PDF.
• Neither web pages nor PDFs are ideal for programmatic parsing: web pages because of their messy
HTML structure, and PDFs because of page headers, footers, and numbers.
• Two tasks are involved: wrangling the top performance lists from alltime-athletics.com and
wrangling the IAAF Scoring Tables from the PDF.

Getting Ready to Wrangle Data

• Advocates for a deliberate approach to data collection.


• Encourages thorough exploration before writing code or implementing strategies.
• Provides insights to aid in effective data wrangling.
• Highlights just how messy raw data can be.
• Describes steps to determine data type, actions needed, and potential issues.

Messy Data in Data Science

Types of Messy Data


• Each data set is unique, making it challenging to parse and use efficiently.
• Data scraping involves programmatically pulling selected elements from sources not designed for
programmatic access.
• Corrupted data can be found in poorly formatted or damaged files, often the result of disk errors or other low-level problems.
• A common example is a corrupted PST file, the Microsoft Outlook email archive format.
• Poorly designed databases can introduce inconsistencies such as unmatched values or keys, and data sources can differ in scope, depth, APIs, and schemas.
• As of 2016, an Era of Clean Data had yet to arrive, and it is questionable whether it ever will.

Web Scraping for Track and Field Project


• Web scraping is a useful tool for gathering the track and field data.
• The process involves examining the raw HTML and imagining the task from the perspective of a wrangling script.
• The raw data is what any code will see, and it's crucial to understand how to deal with it.
• The first step in wrangling is to look at the raw data, such as header lines and other material at the
top of the page.
• The main goal is to capture the top marks at this point.
• A wrangling script can recognize an athlete's performance by testing each line of the file.
• Document structure, particularly in HTML or XML, can provide clues about where valuable data
starts.
• The data is preceded by a specific HTML tag that appears only (or mostly) right before the data set starts.
• That tag can therefore be used to mark the beginning of the data set for each event.
• A text parser in the scripting language can watch for the tag, split each subsequent line into columns, and store each field in a variable or data structure, as in the sketch below.
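
The following minimal Python sketch illustrates the idea. It assumes, purely for illustration, that the data block begins after a <pre> tag and that each performance line contains whitespace-separated fields; the file name and the five-field threshold are hypothetical, not taken from the original source.

# Minimal sketch: collect performance lines that follow a marker tag.
# Assumptions (illustrative only): the data block begins after a "<pre>" tag
# and each performance line contains at least five whitespace-separated fields.
def extract_performances(path, start_tag="<pre>"):
    rows = []
    in_data = False
    with open(path, encoding="utf-8") as f:
        for line in f:
            if not in_data:
                if start_tag in line:
                    in_data = True          # data starts after the marker tag
                continue
            fields = line.split()           # naive whitespace split into columns
            if len(fields) < 5:             # skip blank or malformed lines
                continue
            rows.append(fields)
    return rows

# Hypothetical usage:
# performances = extract_performances("mens_100m.html")
# print(performances[:3])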

Data Wrangling Challenges and Uncertainties

• Understanding the starting point of valuable data within each HTML page is crucial.
• Character sequences in raw HTML, such as HTML entities, can cause confusion and potential errors.
• These sequences are often HTML representations of characters like ü or é; a sketch for decoding them follows this list.
• It's important to double-check everything, both manually and programmatically.
• A quick scroll through post-wrangle, clean data files can reveal obvious mistakes.
• Extra tab characters can interfere with parsing algorithms, including standard R packages and Excel
imports.
• Every case requires careful consideration of potential parsing errors.
• Awareness is the most important aspect of data wrangling.
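
For the HTML character sequences mentioned above, Python's standard html module can decode entity notation back into ordinary characters; the example string below is invented.

import html

raw = "J&uuml;rgen ran at the caf&eacute; meeting"   # invented example string
clean = html.unescape(raw)                           # -> "Jürgen ran at the café meeting"
print(clean)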

Wrangling Data and File Analysis

• Wrangling scripts start at the beginning of the file and finish at the end, but unexpected changes can
occur in the middle.
• It's crucial to examine the wrangled data file(s) at the beginning, end, and some places in the middle
to ensure the expected state.
• Nonstandard lists of best performances can be found at the bottom of the pages.
• The HTML tag that denotes the beginning of the desired data is closed at the end of the main list.
• This tag closure is a good way to end the parsing of the data set.
• If the wrangling script ignores the end of the useful data set, it may collect nonstandard results at the
bottom of the page or fail to know what to do when the data stops fitting the established column
format.
• Looking at the end of the wrangled data file is crucial for determining whether the wrangling succeeded; a quick spot-check sketch follows this list.
• The data scientist should decide which aspects of wrangling matter most and ensure they are completed properly.
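
A quick spot check of a wrangled file, along the lines described above, might look like this minimal Python sketch; the file name is hypothetical.

import random

def spot_check(path, n=5, seed=0):
    # Print the first, last, and a few random middle lines of a wrangled file.
    with open(path, encoding="utf-8") as f:
        lines = f.read().splitlines()
    print("--- first lines ---", *lines[:n], sep="\n")
    print("--- last lines ----", *lines[-n:], sep="\n")
    middle = lines[n:-n]
    random.seed(seed)
    print("--- middle sample -", *random.sample(middle, k=min(n, len(middle))), sep="\n")

# spot_check("wrangled_100m.tsv")   # hypothetical wrangled output file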

Data Wrangling Plan for Track and Field Performance Data

• The process involves imagining oneself as a wrangling script, parsing through raw data, and
extracting necessary parts.
• A potential solution is to download all web pages containing all Olympic track and field events and
parse them using HTML structure.
• However, this requires a list of web addresses for individual events to download programmatically.
• Each page has a unique address that needs to be copied or typed manually, which could be
time-consuming.
• The author decided not to go with web scraping, instead opting for the post-HTML, already rendered
web page version.
• The author would visit each of the 48 web pages, select all text, copy the text, and paste the text into
a separate flat file.
• This method eliminates the need for translating HTML or scripting the downloading of the pages.
• The choice of data wrangling plan should depend on all the information discovered during initial
investigation.
• The author suggests pretending to be a wrangling script, imagining what might happen with the data, and only then writing the script; a sketch of reading the pasted flat files follows.
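
Under that copy-and-paste plan, the wrangling script simply loops over the pasted flat files. The folder layout and file names in this sketch are hypothetical.

from pathlib import Path

# Hypothetical layout: one pasted flat file per event in a "pasted/" folder.
all_rows = {}
for path in sorted(Path("pasted").glob("*.txt")):
    rows = []
    for line in path.read_text(encoding="utf-8").splitlines():
        fields = line.split()
        if len(fields) >= 5:                # keep only lines that look like marks
            rows.append(fields)
    all_rows[path.stem] = rows              # e.g. all_rows["mens_100m"]

# print({event: len(rows) for event, rows in all_rows.items()})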

Data Wrangling Techniques and Tools

• Data wrangling is an abstract process with uncertain outcomes.


• No single tool can clean messy data.
• Tools are good for various tasks, but no single tool can wrangle arbitrary data.
• No one application can read arbitrary data with an arbitrary purpose.
• Data wrangling requires specific tools in specific circumstances.

File Format Converters Overview

• Converting from HTML, CSV, PDF, TXT to other file formats.


• PDF format is not ideal for data analysis.
• File format converters can turn PDFs into other formats such as plain text and HTML; a sketch follows this list.
• Unix applications such as pdf2txt and pdf2html are useful for data scientists.
• Numerous file format converters are available, many free or open source.
• Google search can help determine if a file format is easily convertible.
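
As one hedged example in Python, the pdfminer.six package provides a high-level extract_text function that pulls raw text out of a PDF such as the scoring tables; the file name is illustrative.

# Requires: pip install pdfminer.six
from pdfminer.high_level import extract_text

text = extract_text("iaaf_scoring_tables.pdf")   # hypothetical file name

# The raw text still contains page headers, footers, and page numbers,
# so further wrangling is needed before the tables are usable.
print(text[:500])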

Proprietary Data Wranglers in 2016


• Numerous companies offer data wrangling services at a cost.
• Many software products claim to be able to wrangle arbitrary data, but most are limited in practice.
• Some proprietary products can genuinely convert existing messy data into the desired form.
• The cost of these tools may be worthwhile for early project completion.
• The industry is young and rapidly changing, making it difficult to conduct a comprehensive survey.

Scripting: Using the Plan and Checking

• Imagine being a script and reading through files to understand the complexity of the task.
• Use simple tools such as the Unix command line for simpler tasks like extracting lines, replacing occurrences of a word, and splitting files; Python equivalents are sketched after this list.
• For more complex operations, a scripting language like Python or R is recommended.
• Writing a wrangling script is not a well-orchestrated affair; it involves trying various techniques until
finding the best one.
• The most important capability to strive for when choosing scripting languages or tools is the ability
to load, manipulate, write, and transform data quickly.
• Make informed decisions about how best to wrangle the data, or guess and check when that is more time-efficient.
• Consider the manual-versus-automate question: can you wrangle manually in a shorter time than you
can write a script?
• Stay aware of the status of the data, the script, the results, the goals, and what each wrangling step
and tool is gaining you.
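
For the simpler operations mentioned at the top of this list, a few lines of Python can stand in for the Unix command line; the search word and file names below are placeholders.

# grep-like filter: keep only lines containing a word (placeholder names throughout).
with open("raw_dump.txt", encoding="utf-8") as f:
    kept = [line for line in f if "Bolt" in line]
print(len(kept), "matching lines")              # wc -l equivalent

# sed-like replacement and a simple two-way split of the kept lines.
kept = [line.replace("GBR", "Great Britain") for line in kept]
half = len(kept) // 2
with open("part1.txt", "w", encoding="utf-8") as f1:
    f1.writelines(kept[:half])
with open("part2.txt", "w", encoding="utf-8") as f2:
    f2.writelines(kept[half:])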

Common Pitfalls in Wrangling Scripts


• Mishandling messy data can silently omit or corrupt records.
• Even with careful consideration, risks remain.
• Vigilance and thorough consideration are crucial.
• The subsections below describe symptoms that indicate a wrangling script has fallen into a common pitfall.

Data Incompatibilities in Windows/Mac/Linux

• Major operating systems still have disagreements on line endings in text files.
• Unix and Linux have used line feed (LF) denotation for new lines since the 1970s.
• Classic Mac OS (up to and including version 9) used the carriage return (CR) character for new lines.
• With the move to Mac OS X, beginning in 1999, the Mac joined the Unix derivatives in using LF, but Microsoft Windows still uses a hybrid CR+LF line ending.
• Improper line ending parsing can lead to various problems.
• Each programming language has its own capabilities for reading different file types.
• Symptoms of OS line-ending problems include more lines of text than expected, too few lines of text, and interspersed weird-looking characters; a diagnostic sketch follows this list.
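
A small diagnostic sketch for line endings: it counts CR+LF, bare LF, and bare CR sequences in the raw bytes of a file (the file name is hypothetical).

def line_ending_report(path):
    data = open(path, "rb").read()      # raw bytes, so no newline translation
    crlf = data.count(b"\r\n")          # Windows style
    lf = data.count(b"\n") - crlf       # bare LF: Unix, Linux, modern macOS
    cr = data.count(b"\r") - crlf       # bare CR: classic Mac OS
    return {"CRLF": crlf, "LF": lf, "CR": cr}

# print(line_ending_report("mens_100m.txt"))   # hypothetical file name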

Understanding Escape Characters in Text Processing

Understanding Escape Characters


• Unix and Linux shells use the character * as a wildcard, representing all files in the current directory.
• Prefixing it with a backslash, as in \*, removes this special meaning, so it represents only the literal asterisk character.
• Such escape characters also occur in text files and in text/string variables in programming languages.

Examples of Escape Characters


• Consider a text file with three lines: a line of text, a second line containing no characters, and a third line with tab characters between the letters; Python or R can read this file into a single string.
• In that string, the line breaks appear as the escaped newline character '\n' and the tabs as '\t'.
• A single string variable can therefore represent the data from an entire file, with each line separated by the escaped newline character, as shown in the sketch below.
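
The sketch below builds such a three-line file and prints the single-string representation, in which the escaped characters become visible; the content is made up.

# Write a small file: a line of text, an empty line, then tab-separated letters.
with open("tiny.txt", "w", encoding="utf-8") as f:
    f.write("first line\n\na\tb\tc\n")

# Read it back as one string; repr() shows the escaped newlines and tabs.
content = open("tiny.txt", encoding="utf-8").read()
print(repr(content))    # -> 'first line\n\na\tb\tc\n'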

Using Escape Characters Within Quotations or Quotations Within Quotations


• When the text itself contains quotation marks, they need to be escaped within the string variable.
• For example, a text file containing quoted email excerpts could be encoded as a string variable by escaping the internal quotation marks, as in the sketch below.
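
A short illustration of quotation marks nested inside a string variable; the quoted text is invented.

# Double quotes inside a double-quoted string must be escaped with backslashes...
quoted = "She wrote: \"the record can't stand for long.\""

# ...or the outer quotes can be switched to reduce the escaping needed.
also_quoted = 'She wrote: "the record can\'t stand for long."'

print(quoted == also_quoted)    # True: both encode exactly the same text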

Implications of Complex Escapes


• Complex escapes can be confusing, especially when dealing with many nested quotation marks and
newlines.
• Symptoms of escaping problems include lines or strings that are too long or too short, reading a file line by line but ending up with one long line, finding text inside quotation marks that doesn't belong there, and getting errors while reading or writing files.

Outliers in Data Analysis

• Incorrect data can sneak into projects without causing an error or making itself obvious.
• Summary statistics and exploratory graphs can help catch these errors.
• Checking the range of values—minimum to maximum—could catch the error.
• Plotting histograms of all data can help check for errors and gain awareness about the data sets.
• Generating statistical or visual summaries can prevent errors and promote awareness.
• Techniques like basic descriptive statistics, summaries, and diagnostics can be used to confirm that the data was wrangled successfully; a sketch follows this list.
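
A hedged sketch of such a check using pandas; the file and column names are hypothetical.

import pandas as pd

# Hypothetical wrangled file with a column of 100 m times, in seconds.
df = pd.read_csv("wrangled_100m.csv")

print(df["mark"].describe())                          # min/max/mean expose impossible values
print(df[(df["mark"] < 9.5) | (df["mark"] > 11.0)])   # suspicious rows, e.g. parsing slips

# df["mark"].hist(bins=40)   # quick visual check (needs matplotlib installed)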

CHAPTER : Data science in a big data world

Big Data and Data Science: A Comparative Analysis

Big Data Definition


• Big data refers to large, complex data sets that are difficult to process using traditional data
management techniques.
• Data science involves methods to analyze and extract knowledge from massive amounts of data.

Big Data Characteristics


• Big data is commonly characterized by the three Vs—volume, variety, and velocity—with veracity often added as a fourth.
• These properties distinguish big data from traditional data management tools, causing challenges in
data capture, curation, storage, search, sharing, transfer, and visualization.

Data Science and Big Data


• Data science is an evolutionary extension of statistics, incorporating methods from computer
science.
• Data scientists are distinguished from statisticians by their ability to work with big data and
experience in machine learning, computing, and algorithm building.
• Their tools include Hadoop, Pig, Spark, R, Python, and Java.

Python in Data Science


• Python is a popular language for data science because of its numerous libraries and the support it receives from specialized software.
• The ability to prototype quickly in Python while maintaining acceptable performance is driving its growth in the data science world.

BENEFITS AND USES OF DATA SCIENCE AND BIG DATA


• Commercial companies use data science and big data to gain insights into customers, processes,
staff, and products.
• Companies use data science to offer better user experience, cross-sell, up-sell, and personalize their
offerings.
• Human resource professionals use people analytics and text mining to screen candidates, monitor
employee mood, and study informal networks.
• Financial institutions use data science to predict stock markets, determine lending risk, and attract
new clients.
• Governmental organizations rely on internal data scientists to discover valuable information and
share their data with the public.
• Data scientists work on diverse projects such as detecting fraud and optimizing project funding.
• Nongovernmental organizations (NGOs) use data to raise money and defend their causes.
• Universities use data science in their research and to enhance the study experience of their students.
• Massive open online courses (MOOCs) produce a lot of data, allowing universities to study how this
type of learning can complement traditional classes.

Data Types and Their Implications in Data Science

Facets of Data
• Structured data: Depends on a data model and resides in fixed fields within a record. It can be stored in tables within databases or in Excel files. SQL is the preferred way to manage and query data that resides in databases; a minimal sketch follows this list.
• Unstructured data: Content is context-specific or varying, making it difficult to fit into a data model. A regular email is a good example: it contains structured elements such as the sender, title, and body, but extracting information from the free text is challenging because of the numerous ways to refer to a person and the thousands of different languages and dialects.
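
Returning to the structured facet above, here is a minimal sketch of querying such data with SQL from Python's standard sqlite3 module; the table and values are invented.

import sqlite3

conn = sqlite3.connect(":memory:")          # throwaway in-memory database
conn.execute("CREATE TABLE results (athlete TEXT, event TEXT, mark REAL)")
conn.executemany(
    "INSERT INTO results VALUES (?, ?, ?)",
    [("Bolt", "100m", 9.58), ("Bolt", "200m", 19.19), ("Johnson", "200m", 19.32)],
)

# SQL is a natural way to query data that lives in fixed fields within records.
for row in conn.execute(
    "SELECT athlete, MIN(mark) FROM results WHERE event = '200m' GROUP BY athlete"
):
    print(row)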

Natural Language Processing Challenges


• Natural language is a unique type of unstructured data that requires specific data science techniques
and linguistics.
• Successful methods include entity recognition, topic recognition, summarization, text completion,
and sentiment analysis.
• Techniques trained in one domain may not generalize to other domains.
• Even advanced techniques struggle to decipher the meaning of every text.
• Humans also struggle with natural language due to its ambiguity and questionable concept of
meaning.
Machine-Generated Data Overview

• Machine-generated data is information created automatically by machines without human intervention.
• It is a significant data resource, with the market value of the industrial Internet projected to be
around $540 billion in 2020.
• The internet of things is expected to grow into a network of roughly 26 times more connected things than people.
• Machine data analysis requires scalable tools due to its high volume and speed.
• Examples of machine data include web server logs, call detail records, network event logs, and
telemetry.
• Classic table-structured databases may not fit all machine data, particularly data whose value lies in interconnected relationships—which leads to graph-based data.

Graph-Based or Network Data Overview

• Graph data refers to mathematical structures that model pair-wise relationships between objects.
• Graph structures use nodes, edges, and properties to represent and store graphical data.
• Graph-based data is a natural representation of social networks, allowing calculation of specific metrics such as influence and shortest path; a sketch using a graph library follows this list.
• Examples of graph-based data include a company network on LinkedIn and follower lists on Twitter.
• Power and sophistication come from multiple, overlapping graphs of the same nodes.
• Graph databases store graph-based data and are queried with specialized query languages like
SPARQL.
• Like the other facets, graph data presents its own challenges for computer interpretation and analysis.
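
A hedged sketch of graph-based data using the networkx Python library, as mentioned a few bullets above; the small follower network is invented.

# Requires: pip install networkx
import networkx as nx

g = nx.Graph()
g.add_edges_from([
    ("alice", "bob"), ("bob", "carol"),
    ("carol", "dave"), ("alice", "erin"), ("erin", "dave"),
])

# Nodes and edges model pair-wise relationships; metrics such as shortest path
# and a simple influence proxy (degree centrality) fall out naturally.
print(nx.shortest_path(g, "alice", "dave"))     # -> ['alice', 'erin', 'dave']
print(nx.degree_centrality(g))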

Data Scientist Challenges in Audio, Image, and Video


• Audio, image, and video data types pose specific challenges for data scientists.
• Human tasks like object recognition in pictures are challenging for computers.
• MLBAM (Major League Baseball Advanced Media) plans to increase video capture for live, in-game analytics.
• High-speed cameras capture ball and athlete movements for real-time calculations.
• DeepMind developed an algorithm for learning video game play, prompting Google to purchase the
company for AI development.
• The learning algorithm takes in data as it's produced by the computer game.

Streaming Data Overview


• Streaming data flows into the system during an event, not in batches.
• It requires adaptation in processes to handle this type of information.
• Examples include "What's Trending" on Twitter, live sporting events, and the stock market.
