Unit 1 Data Analytics
• In the process of big data analysis, data collection is the initial step, carried out before analyzing the data for patterns or useful information.
• Data collection starts with asking questions such as what type of data is to be collected and what the source of collection is.
• Most collected data falls into two types: "qualitative data", which is non-numerical data such as words and sentences that mostly focus on the behavior and actions of a group, and "quantitative data", which is in numerical form and can be analyzed using scientific tools and sampling techniques.
The actual data is then further divided mainly into two types: primary data and secondary data.
1. Primary data:
• Data which is raw, original, and extracted directly from official sources is known as primary data.
1. Interview method:
• Data is collected by interviewing the target audience: the person conducting the interview is the interviewer, and the person who answers is the interviewee. Basic business- or product-related questions are asked and recorded in the form of notes, audio, or video, and this data is stored for processing. Interviews can be both structured and unstructured, e.g., personal interviews or formal interviews conducted by telephone, face to face, or over email.
2. Survey method:
• The survey method is a research process in which a list of relevant questions is asked and the answers are recorded in the form of text, audio, or video. Surveys can be conducted both online and offline, for example through website forms and email, and the answers are then stored for analysis. Examples are online surveys and surveys through social media polls.
3. Observation method:
• In the observation method, the researcher keenly observes the behavior and practices of the target audience using a data collection tool and stores the observed data as text, audio, video, or other raw formats. Data may also be collected directly by posing a few questions to the participants. For example, a group of customers can be observed for their behavior towards certain products, and the data obtained is then sent for processing.
4. Experimental method:
• The experimental method is the process of collecting data by performing experiments, research, and investigation. The most frequently used experimental designs are CRD, RBD, LSD (Latin Square Design), and FD.
• CRD: A Completely Randomized Design is a simple experimental design used in data analytics, based on randomization and replication. It is mostly used for comparing experiments.
• RBD: In a Randomized Block Design, the experiment is divided into small units called blocks. Random experiments are performed on each of the blocks, and conclusions are drawn using a technique known as analysis of variance (ANOVA); a small sketch of this step is given after this list. RBD originated in the agricultural sector.
• FD: A Factorial Design is an experimental design in which each experiment has two factors, each with several possible values, and performing trials derives the other combinations of factors.
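To make the ANOVA step concrete, here is a minimal sketch in Python using SciPy's one-way ANOVA. The three treatment groups and their yield figures are invented for illustration; they are not taken from any dataset mentioned above.

```python
# One-way ANOVA, the technique used to draw conclusions in designs
# such as CRD and RBD. The yields below are hypothetical.
from scipy import stats

# Hypothetical yields from three treatments (e.g., three fertilizers)
treatment_a = [20.1, 21.5, 19.8, 22.0]
treatment_b = [23.4, 24.1, 22.8, 23.9]
treatment_c = [19.5, 18.9, 20.2, 19.1]

f_stat, p_value = stats.f_oneway(treatment_a, treatment_b, treatment_c)
print(f"F = {f_stat:.2f}, p = {p_value:.4f}")
# A small p-value suggests the treatment means differ significantly.
```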
2. Secondary data:
Secondary data is data which has already been collected and is reused for some valid purpose. This type of data was previously recorded from primary data, and it comes from two types of sources: internal sources and external sources.
Internal source:
• These data can easily be found within the organization, such as market records, sales records, transactions, customer data, accounting resources, etc. The cost and time consumed in obtaining data from internal sources is low.
External source:
• Data which cannot be found within the organization and is gained through external third-party resources is external source data. The cost and time consumption is greater because such sources contain a huge amount of data. Examples of external sources are government publications, news publications, the Registrar General of India, the Planning Commission, the International Labour Bureau, syndicate services, and other non-governmental publications.
Other sources:
• Sensor data: With the advancement of IoT devices, the sensors on these devices collect data which can be used for sensor data analytics to track the performance and usage of products.
• Satellite data: Satellites collect terabytes of images and data on a daily basis, which can be mined for useful information.
• Web traffic: With fast and cheap internet, data in many formats is uploaded by users on different platforms and can be collected, with their permission, for data analysis. Search engines also provide data on the keywords and queries searched most often.
Classification of data
Difference between Structured, Semi-structured and Unstructured data
• Big Data includes huge volume, high velocity, and an extensible variety of data. It is classified into three categories:
• Structured data,
• Semi-structured data, and
• Unstructured data.
Structured data –
• Structured data is data whose elements are addressable for effective analysis. It is organized into a formatted repository, typically a database. It concerns all data that can be stored in an SQL database, in a table with rows and columns. Such data has relational keys and can easily be mapped into pre-designed fields. Today, this is the most processed form of data and the simplest to manage. Example: relational data.
Semi-structured data –
• Semi-structured data is information that does not reside in a relational database but has some organizational properties that make it easier to analyze. With some processing, it can be stored in a relational database (though this can be very hard for some kinds of semi-structured data), but the semi-structured form exists to save space. Example: XML data.
Unstructured data –
Unstructured data is data which is not organized in a predefined manner or does not have a predefined data model, and thus it is not a good fit for a mainstream relational database. There are alternative platforms for storing and managing unstructured data; it is increasingly prevalent in IT systems and is used by organizations in a variety of business intelligence and analytics applications. Example: Word documents, PDFs, text, media logs.
Differences between Structured, Semi-structured and Unstructured data:

Properties        | Structured data                          | Semi-structured data                               | Unstructured data
Technology        | It is based on relational database tables | It is based on XML/RDF (Resource Description Framework) | It is based on character and binary data
Query performance | Structured queries allow complex joining  | Queries over anonymous nodes are possible          | Only textual queries are possible
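As a small illustration of the first two categories in the table above, the sketch below stores structured data in a relational table and parses a semi-structured XML fragment. It uses only the Python standard library; the table, columns, and tags are invented for the example.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Structured: rows and columns with a relational key, queried via SQL.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("INSERT INTO customers VALUES (1, 'Asha'), (2, 'Ravi')")
for row in conn.execute("SELECT id, name FROM customers"):
    print(row)

# Semi-structured: no fixed schema, but tags give it the organizational
# properties that make it easier to analyze.
xml_doc = "<customers><customer id='1'>Asha</customer></customers>"
root = ET.fromstring(xml_doc)
for customer in root:
    print(customer.get("id"), customer.text)
```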
Introduction to Big Data
What Is Big Data?
• Big Data is the field of analyzing and extracting information from extremely
large data sets.
• The term also refers to large quantities of data that grow exponentially with time.
Such data is so humongous and complex that no conventional method or traditional data management tool can process and store it effectively.
• There are many examples of Big Data. From social media platforms to E-commerce
stores, organizations in various industries generate and utilize data to enhance
their processes.
• Big data includes multiple processes, such as data mining, data analysis, data storage, data visualization, etc. The term "big data" refers collectively to these processes and all the tools that we use alongside them.
Types of Big Data
1. Structured
Structured data refers to the data that you can process, store, and
retrieve in a fixed format. It is highly organized information that you can readily and
seamlessly store and access from a database by using simple algorithms. This is the
easiest type of data to manage as you know what data format you are working with in
advance. For example, the data that a company stores in its databases in the form of
tables and spreadsheets is structured data.
2. Unstructured
Data with an unknown structure is termed unstructured data. It is substantially bigger in size than structured data and is heterogeneous in nature. A great example of unstructured data is the results you get when you perform a Google search: webpages, videos, images, text, and other data formats of varying sizes.
3. Semi-structured
As the name suggests, semi-structured data contains a combination of structured and unstructured data. It is data that hasn't been classified into a specific database but contains vital tags that separate individual elements within it. For example, a table definition in a relational DBMS is semi-structured data.
Characteristics of Big Data
Following are the core characteristics of big data. Understanding them is vital to knowing how big data works and how you can use it. The main characteristics of big data analytics are:
1. Volume
Volume refers to the amount of data that you have. We measure the volume of
our data in Gigabytes, Zettabytes (ZB), and Yottabytes (YB). According to the
industry trends, the volume of data will rise substantially
in the coming years.
2. Velocity
Velocity refers to the speed of data processing. High velocity is crucial for the
performance of any big data process. It consists of the rate of
change, activity bursts, and the linking of incoming data sets.
3. Value
Value refers to the benefits that your organization derives from the data.
Does it match your organization’s goals? Does it help your organization enhance
itself? It’s among the most important big data core characteristics.
4. Variety
Variety refers to the different types of data you collect: structured, semi-structured, and unstructured data drawn from a range of sources.
5. Veracity
Veracity refers to the accuracy of your data. It is among the most important
Big Data characteristics as low veracity can greatly damage the accuracy of your
results.
6. Validity
Validity refers to how valid and relevant the data is for its intended purpose.
7. Volatility
Big data is constantly changing. The data you gathered from a source a day ago might be different from what you find today. This volatility of data affects your data homogenization.
8. Visualization
Visualization refers to showing your big data-generated insights through
visual representations such as charts and graphs. It has become prevalent
recently as big data professionals regularly share their insights with non-technical
audiences.
Main Components of Big Data
1. Ingestion
Ingestion refers to the process of gathering and preparing the data. You’d use
the ETL (extract, transform, and load) process to prepare your data. In this phase,
you have to identify your data sources, determine
whether you’ll gather the data in batches or stream it, and prepare it
through cleansing, massaging, and organization. You perform the extract process in
gathering the data and the transformation process in optimizing it.
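A minimal sketch of the extract and transform steps described above, in plain Python; the records, field names, and cleansing rules are assumptions made for illustration.

```python
# Extract: raw records gathered from a (hypothetical) source.
raw_records = [
    {"name": " alice ", "signup": "1990-05-17"},
    {"name": "BOB", "signup": "1985-11-02"},
    {"name": " alice ", "signup": "1990-05-17"},  # duplicate to be cleansed
]

# Transform: cleanse and "massage" each record.
def transform(record):
    return {"name": record["name"].strip().title(),
            "signup": record["signup"]}

# Organize: deduplicate while transforming.
seen, prepared = set(), []
for rec in map(transform, raw_records):
    key = (rec["name"], rec["signup"])
    if key not in seen:
        seen.add(key)
        prepared.append(rec)

print(prepared)  # now ready for the load step into a warehouse or lake
```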
2. Storage
Once you have gathered the necessary data, you’d need to store it. Here, you’ll
perform the final step of the ETL, the load process. You’d store
your data in a data warehouse or a data lake, depending on your requirements.
This is why it’s crucial to understand your organization’s goals while performing
any big data process.
4. Analysis
In this phase of your big data process, you’d analyze the data to generate
valuable insights for your organization. There are four kinds of big data analytics:
prescriptive, predictive, descriptive, and diagnostic.
You’d use artificial intelligence and machine learning algorithms in this
phase to analyze the data.
5. Consumption
This is the final phase of a big data process. Once you have analyzed the data and have
found the insights, you have to share them with others. Here, you’d have to utilize
data visualization and data storytelling to
share your insights effectively with a non-technical audience such as stakeholders
and project managers.
Advantages of Big Data
There are numerous advantages of Big Data for organizations. Some of the key
ones are as follows:
1. Enhanced Decision-making
Big data implementations can help businesses and organizations make better-informed decisions in less time. They allow them to use outside intelligence, such as search engines and social media platforms, to fine-tune their strategies. Big data can identify trends and patterns that would've been invisible otherwise, helping companies avoid errors.
3. Efficiency Optimization
Organizations use big data to identify the weak areas present within them. Then,
they use these findings to resolve those issues and enhance their operations
substantially. For example, Big Data has substantially
helped the manufacturing sector improve its efficiency through IoT and
robotics.
IMPORTANCE OF DATA ANALYTICS
1. Product Development
Data analytics offers both estimation and exploration capabilities for information. It allows one to understand the market's or process's current state and offers a solid base for forecasting future results. Data analysis helps companies comprehend the current business situation and either change the process or identify the need for new product creation that meets market requirements.
2. Target Content
Learning in advance what consumers want improves the consumer orientation of marketing campaigns. It enables advertisers to tailor their advertising to a subset of the entire consumer base, and to figure out which segment of the client base will respond best to an initiative. It also saves the money spent on convincing a buyer to buy and increases the overall performance of marketing activities.
3. Efficiency in Operations
Data analytics helps marketing find more viable ways to streamline operations and increase benefit levels. It helps to recognize possible issues, avoids waiting periods, and takes action on them.
The Evolution of Analytic Scalability
• Scalability: the ability of a system to handle an increasing amount of the work required to perform its task.
• Database: an organized collection of related data.
– Aggregation
• Combining data into one, e.g., a statistical summary
• Combining rows from different data sources
– Derivations
• Creating new columns of data, e.g., calculating a ratio
– Transformation
• Converting data into a useful format, e.g., taking a log, or converting date of birth to age
Ways for in-database data preparation
• SQL
https://www.youtube.com/watch?v=s8EPQpgpWVE
https://www.youtube.com/watch?v=bcjSe0xCHbE
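Below is a hedged sketch of the in-database preparation styles listed above (aggregation and derivation) expressed as SQL and run against an in-memory SQLite database from Python; the sales table and its figures are invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, revenue REAL, cost REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?, ?)",
                 [("north", 120.0, 80.0), ("north", 90.0, 60.0),
                  ("south", 200.0, 150.0)])

# Aggregation combines rows into a statistical summary; derivation
# creates new columns such as a profit ratio. (Transformations like
# taking a log or turning a date of birth into an age follow the same
# pattern with the appropriate SQL functions.)
query = """
SELECT region,
       SUM(revenue)                        AS total_revenue,  -- aggregation
       SUM(revenue - cost)                 AS total_profit,   -- derivation
       SUM(revenue - cost) / SUM(revenue)  AS margin          -- ratio
FROM sales
GROUP BY region
"""
for row in conn.execute(query):
    print(row)
```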
Data Analysis - Process
Data Collection
• Data Collection is the process of gathering information on targeted variables
identified as data requirements. The emphasis is on ensuring accurate and honest
collection of data. Data Collection ensures that
data gathered is accurate such that the related decisions are valid.
Data Collection provides both a baseline to measure and a target to
improve.
• Data is collected from various sources, ranging from organizational databases to the information in web pages. The data thus obtained may not be structured and may contain irrelevant information. Hence, the collected data is required to be subjected to Data Processing and Data Cleaning.
Data Processing
The data that is collected must be processed or organized for analysis.
This includes structuring the data as required for the relevant Analysis Tools. For
example, the data might have to be placed into rows and columns in a table within
a Spreadsheet or Statistical Application. A Data Model might have to be created.
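As a minimal sketch of this step, the snippet below structures a few collected records into rows and columns with pandas; the field names and values are invented.

```python
import pandas as pd

collected = [
    {"customer": "Asha", "amount": 250.0, "date": "2021-03-01"},
    {"customer": "Ravi", "amount": 180.5, "date": "2021-03-02"},
]

df = pd.DataFrame(collected)             # rows and columns, like a spreadsheet
df["date"] = pd.to_datetime(df["date"])  # give the date column a proper type
print(df.dtypes)
print(df)
```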
Data Cleaning
The processed and organized data may be incomplete, contain
duplicates, or contain errors. Data Cleaning is the process of preventing and
correcting these errors. There are several types of Data Cleaning that depend on the
type of data. For example, while cleaning financial data, certain totals might be compared against reliable published numbers or defined thresholds. Likewise, quantitative methods can be used to detect outliers, which are subsequently excluded from the analysis.
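A brief sketch of the cleaning step, assuming pandas: duplicates are dropped and outliers are excluded with the common 1.5 x IQR rule. The totals and the threshold are illustrative assumptions, not prescriptions.

```python
import pandas as pd

df = pd.DataFrame({"total": [100.0, 102.0, 98.0, 101.0, 99.0,
                             103.0, 97.0, 100.0, 5000.0]})

df = df.drop_duplicates()                    # remove duplicate rows
q1, q3 = df["total"].quantile([0.25, 0.75])  # quartiles of the column
iqr = q3 - q1
mask = df["total"].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)
print(df[mask])                              # 5000.0 is excluded as an outlier
```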
Data Analysis
• Data that is processed, organized and cleaned would be ready for the analysis.
Various data analysis techniques are available to understand, interpret, and
derive conclusions based on the requirements. Data
Visualization may also be used to examine the data in graphical format,
to obtain additional insight regarding the messages within the data.
• Statistical data models such as correlation and regression analysis can be used to identify the relations among the data variables. These models, being descriptive of the data, are helpful in simplifying analysis and communicating results; a brief sketch is given after this list.
• The process might require additional Data Cleaning or additional Data
Collection, and hence these activities are iterative in nature.
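Here is the sketch referred to above: computing a correlation coefficient and fitting a simple linear regression with NumPy. The x/y values are synthetic.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

r = np.corrcoef(x, y)[0, 1]             # Pearson correlation coefficient
slope, intercept = np.polyfit(x, y, 1)  # least-squares linear regression
print(f"r = {r:.3f}, y ~ {slope:.2f}x + {intercept:.2f}")
```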
Communication
• The results of the data analysis are to be reported in a format as required by the users
to support their decisions and further action. The feedback from the users might result
in additional analysis.
• The data analysts can choose data visualization techniques, such as tables and charts, which help in communicating the message clearly and efficiently to the users. The analysis tools provide facilities to highlight the required information with color codes and formatting in tables and charts.
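As a small sketch of such a visualization, the snippet below draws a bar chart with matplotlib; the regions, figures, and units are invented for illustration.

```python
import matplotlib.pyplot as plt

regions = ["North", "South", "East", "West"]
revenue = [120, 200, 150, 90]

fig, ax = plt.subplots()
ax.bar(regions, revenue, color="steelblue")
ax.set_title("Revenue by Region")      # lead with the headline finding
ax.set_ylabel("Revenue (in lakhs)")    # units assumed for the example
fig.savefig("revenue_by_region.png")   # export to share with users
```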
Differences Between Reporting and Analysis
• Living in the era of digital technology and big data has made organizations dependent on the wealth of information that data can bring. You might have seen reporting and analysis used interchangeably, especially in the manner in which outsourcing companies market their services. While both areas are part of web analytics (note that analytics isn't the same as analysis), there's a vast difference between them, and it's more than just spelling.
• It's important that we differentiate the two, because some organizations might be selling themselves short in one area and not reaping the benefits which web analytics can bring to the table. The first core component of web analytics, reporting, is merely organizing data into summaries. On the other hand, analysis is the process of inspecting, cleaning, transforming, and modeling these summaries (reports) with the goal of highlighting useful information.
• Simply put, reporting translates data into information, while analysis turns information into insights. Also, reporting should enable users to ask "What?" questions about the information, whereas analysis should answer "Why?" and "What can we do about it?"
Here are five differences between
reporting and analysis
1. Purpose
• Reporting has helped companies monitor their data since before digital technology boomed. Various organizations have depended on the information it brings to their business, as reporting extracts that information and makes it easier to understand.
• Analysis interprets data at a deeper level. While reporting can link cross-channel data, provide comparisons, and make information easier to understand (think of a dashboard with charts and graphs, which are reporting tools and not analysis reports), analysis interprets this information and provides recommendations on actions.
2. Tasks
• As reporting and analysis have a very fine line dividing them, it's sometimes easy to confuse tasks labeled as analysis when all they do is reporting. Hence, ensure that your analytics team maintains a healthy balance of both.
• Here’s a great differentiator to keep in mind if what you’re doing is reporting or
analysis:
• Reporting includes building, configuring, consolidating, organizing, formatting, and summarizing. It's very similar to the above-mentioned tasks, like turning data into charts and graphs and linking data across multiple channels.
• Analysis consists of questioning, examining, interpreting, comparing,
and confirming. With big data, predicting is possible as well.
3. Outputs
• Reporting and analysis have push and pull effects on their users through their outputs. Reporting has a push approach: it pushes information to users, and outputs come in the form of canned reports, dashboards, and alerts.
• Analysis has a pull approach, where a data analyst pulls information to probe further and to answer business questions. Outputs can take the form of ad hoc responses and analysis presentations. Analysis presentations comprise insights, recommended actions, and a forecast of their impact on the company, all in a language that's easy to understand at the level of the user who'll be reading and deciding on it.
• This is important for organizations that want to truly realize the value of data: a standard report is not the same as meaningful analytics.
4. Delivery
• Considering that reporting involves repetitive tasks, often with truckloads of data, automation has been a lifesaver, especially now with big data. It's not surprising that the first things outsourced are data entry services, since outsourcing companies are perceived as data reporting experts.
• Analysis requires a more custom approach, with human minds doing superior reasoning and analytical thinking to extract insights, and technical skills to provide efficient steps towards accomplishing a specific goal. This is why data analysts and scientists are in demand these days, as organizations depend on them to come up with recommendations that help leaders and business executives make decisions about their businesses.
5. Value
• This isn't about identifying which one brings more value, but rather about understanding that both are indispensable when looking at the big picture. Together they should help businesses grow, expand, move forward, and make more profit or increase their value.
• The Path to Value diagram illustrates how reporting and analysis together convert data into value; neither achieves this without the other.
The word 'Data' has been in existence for ages now. In an era when 2.5 quintillion bytes of data are generated every day, data plays a crucial role in decision making for business operations. But how do you think we can deal with so much data? Well, there are several roles in the industry today that deal with data to gather insights, and one such vital role is that of a Data Analyst. A Data Analyst requires many tools to gather insights from data. This article on the Top 10 Data Analytics Tools will talk about the top tools that everyone, from a budding Data Analyst to a skilled professional, must learn in 2021.
1. R and Python
2. Microsoft Excel
3. Tableau
4. RapidMiner
5. KNIME
6. Power BI
7. Apache Spark
8. QlikView
9. Talend
10. Splunk
Let's dive into this article, starting from the 10th tool on our list, i.e., Splunk.
Splunk
Products
Splunk Free
Splunk Enterprise
Splunk Cloud
All three products differ in the bandwidth of features they offer, and free downloads and trial versions are available. The pricing options for Splunk products are based on predictive pricing, infrastructure-based pricing, and rapid adoption packages.
Companies using
Trusted by 92 of the Fortune 100, companies such as Dominos, Otto Group, Intel, and Lenovo use Splunk in their day-to-day practices to discover processes and correlate data in real time.
Since almost all organizations need to deal with data across various divisions, Splunk, according to its official website, aims to bring data to every part of your organization by helping teams prevent and predict problems with its monitoring experience, detect and diagnose issues with clear visibility, explore and visualize business processes, and streamline the entire security stack.
Talend
Talend is one of the most powerful data integration (ETL) tools available in the market and is developed in the Eclipse graphical development environment. Named a Leader in Gartner's Magic Quadrant for Data Integration Tools and Data Quality Tools 2019, this tool lets you easily manage all the steps involved in the ETL process and aims to deliver compliant, accessible, and clean data for everyone.
Products
Talend offers several products; some are completely free, some are free for 14 days, and some are licensed. All these products differ in their functionalities and pricing options.
Talend is the only platform that delivers complete and clean data at the moment you need it, by maintaining data quality, providing big data integration, cloud API services, data preparation, a data catalog, and the Stitch Data Loader.
Recently, Talend has also accelerated the journey to the lakehouse paradigm and the path to revealing intelligence in data. Not only this, but Talend Cloud is now available in the Microsoft Azure Marketplace.
QlikView
Products
QlikView comes with a variety of products and services for data integration, data analytics, and developer platforms, some of which are available for a free trial period of 30 days.
Companies using
Trusted by more than 50,000 customers worldwide, some of QlikView's top customers are CISCO, NHS, KitchenAid, and SAMSUNG.
Apache Spark
Apache Spark is one of the most successful projects of the Apache Software Foundation: an open-source cluster computing framework used for real-time processing. The most active Apache project at the moment, it comes with a fantastic open-source community and a programming interface that ensures fault tolerance and implicit data parallelism.
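As a taste of that programming interface, here is a minimal PySpark sketch of the classic word count (it assumes pyspark is installed, e.g., via pip; the input lines are made up).

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("sketch").getOrCreate()

lines = spark.sparkContext.parallelize(["big data", "fast data", "big wins"])
counts = (lines.flatMap(lambda line: line.split())   # split lines into words
               .map(lambda word: (word, 1))          # pair each word with 1
               .reduceByKey(lambda a, b: a + b))     # sum counts per word
print(counts.collect())   # e.g., [('big', 2), ('data', 2), ...]
spark.stop()
```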
Products
Apache Spark regularly ships new releases with new features, and you can choose among various package types. The most recent version is 2.4.5, with 3.0.0 in preview.
Companies using
Companies such as Oracle, Hortonworks, Verizon, and Visa use Apache Spark for real-time computation on data, valuing its ease of use and speed.
Power BI
Power BI is a Microsoft product used for business analytics. Named a Leader for the 13th consecutive year in the Gartner 2020 Magic Quadrant, it provides interactive visualizations with self-service business intelligence capabilities, where end users can create dashboards and reports by themselves, without having to depend on anybody.
Products
Power BI Desktop
Power BI Pro
Power BI Premium
Power BI Mobile
Power BI Embedded
Power BI Report Server
All these products differ in the functionalities they offer. Some are free for a certain period of time, after which you have to purchase the licensed versions.
Power BI has recently come up with solutions such as Azure + Power BI and Office 365 + Power BI to help users analyze, connect, and protect data across various Office platforms.
KNIME
Konstanz Information Miner, most commonly known as KNIME, is a free, open-source data analytics, reporting, and integration platform built for analytics on a GUI-based workflow.
You do not need prior programming knowledge to use KNIME and derive insights.
You can work all the way from gathering data and creating models to deployment and
production.
RapidMiner
RapidMiner is the next tool on our list. Named a Visionary in the 2020 Gartner Magic Quadrant for Data Science and Machine Learning Platforms, RapidMiner is a platform for data processing, building machine learning models, and deployment.
INTRODUCTION
In this article, let us look into the applications of Data Analytics. Everything today runs on data, from social media to large companies. The term data refers to information about anything. Each company and each institution maintains a set of data that it has earned, collected, and maintained over a period of time. These data are collected, maintained, and analyzed to improve and evaluate the growth of the company. Analysis of data, in other words Data Analytics, is a vast field and one of the most important fields to cover today.
The term Data Analytics refers to the analysis of collected data to draw out the conclusions required by the company's objectives. It involves structuring a massive amount of irregular data and deriving the required useful information from it using statistical tools. It also involves the preparation of charts, graphs, etc. The application of data analytics is not limited to manufacturing companies or industrial areas; it is involved in almost every field of human life.
1. Transportation
Data analytics can be applied to improve transportation systems and the intelligence around them. Predictive analysis helps find transport problems such as traffic or network congestion. It helps synchronize the vast amount of data and uses it to build and design plans and strategies for alternative routes that reduce congestion and traffic, which in turn reduces the number of accidents and mishaps. Data analytics can also help optimize the buyer's travel experience by drawing on information recorded from social media. It also helps travel companies fix their packages and boost the personalized travel experience based on the data collected.
For example, during the wedding season or the holiday season, transport facilities are prepared to accommodate the heavy number of passengers traveling from one place to another, using prediction tools and techniques.
The searched data is treated as a keyword, and all the related pieces of information are presented in a sorted manner that one can easily understand. For example, when you search for a product on Amazon, it keeps showing up on your social media profiles, providing you with the details of the product to convince you to buy it.
4. Manufacturing
Data analytics helps manufacturing industries maintain their overall working through tools like predictive analysis, regression analysis, budgeting, etc. A unit can figure out the number of products it needs to manufacture based on the data collected and analyzed from demand samples, and likewise in many other operations, increasing operating capacity as well as profitability.
5. Security
Data analytics provides the utmost security to an organization. Security analytics is an approach to cybersecurity focused on the analysis of data to deliver proactive security measures. No business can foresee the future, particularly where security threats are concerned, but by deploying security analytics tools that can analyze security events, it is possible to detect a threat before it has a chance to affect your systems and your bottom line.
6. Education
Education is among the fields where data analytics is most needed in the current scenario. It is mostly used in adaptive learning, new innovations, adaptive content, etc. It is the measurement, collection, analysis, and reporting of data about learners and their specific circumstances, for the purpose of understanding and optimizing learning and the environments in which it occurs.
7. Healthcare
Applications of data analytics in healthcare can be used to filter enormous amounts of data in seconds to find treatment options or solutions for different illnesses. This not only provides accurate solutions based on historical data but may also provide precise answers to the unique concerns of specific patients.
8. Military
Military applications of data analytics bring together an assortment of technical and application-oriented use cases. They enable leaders and technologists to make connections between data analytics and such fields as augmented reality and cognitive science that are driving military organizations around the globe forward.
Life Cycle Phases of Data Analytics
In this article, we are going to discuss the life cycle phases of data analytics, covering them one by one.
Data Analytics Lifecycle: The data analytics lifecycle is designed for Big Data problems and data science projects. The cycle is iterative, to reflect a real project. To address the distinct requirements of performing analysis on Big Data, a step-by-step methodology is needed to organize the activities and tasks involved in acquiring, processing, analyzing, and repurposing data.
Phase 1: Discovery – The data science team learns about and investigates the problem.
• Develop context and understanding.
• Come to know about the data sources needed and available for the project.
• The team formulates an initial hypothesis that can later be tested with data.
Phase 4: Model Building – The team develops datasets for testing, training, and production purposes.
• The team also considers whether its existing tools will suffice for running the models or whether it needs a more robust environment for executing them.
• Free or open-source tools: R and PL/R, Octave, WEKA.
• Commercial tools: MATLAB, STATISTICA.
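The sketch below illustrates this phase with scikit-learn (one free Python option alongside the tools named above): the data is split into training and test sets and a simple model is fitted. It uses scikit-learn's bundled iris dataset, an assumption made purely for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)  # held-out test set

model = LogisticRegression(max_iter=200).fit(X_train, y_train)
print(f"test accuracy: {model.score(X_test, y_test):.2f}")
```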
Phase 5: Communicate Results – After executing the model, the team needs to compare the outcomes of the modeling to the criteria established for success and failure.
• The team considers how best to articulate the findings and outcomes to the various team members and stakeholders, taking into account caveats and assumptions.
• The team should identify key findings, quantify the business value, and develop a narrative to summarize and convey the findings to stakeholders.