Unit 1 Data Analytics
• In the process of big data analysis, data collection is the initial step, carried out before analyzing the data for patterns or useful information.
• Data collection starts with asking questions such as what type of data is to be collected and what the source of collection is.
• Most collected data falls into two types: "qualitative data", which is non-numerical data such as words and sentences that mostly focus on the behavior and actions of a group, and "quantitative data", which is in numerical form and can be analyzed using scientific tools and sampling techniques.
The actual data is then further divided mainly into two types: primary data and secondary data.
1. Primary data:
• Data which is raw, original, and extracted directly from official sources is known as primary data.
1. Interview method:
• Data is collected by interviewing the target audience: the person conducting the interview is the interviewer, and the person who answers is the interviewee. Basic business- or product-related questions are asked and recorded in the form of notes, audio, or video, and this data is stored for processing. Interviews can be both structured and unstructured, e.g., personal interviews or formal interviews conducted by telephone, face to face, or over email.
2. Survey method:
• The survey method is a research process in which a list of relevant questions is asked and the answers are recorded in the form of text, audio, or video. Surveys can be conducted both online and offline, for example through website forms and email, and the answers are then stored for analysis. Examples are online surveys and surveys through social media polls.
3. Observation method:
• In the observation method, the researcher keenly observes the behavior and practices of the target audience using a data collection tool and stores the observed data as text, audio, video, or other raw formats. Data may also be collected directly by posing a few questions to the participants. For example, a group of customers can be observed for their behavior towards certain products, and the data obtained is then sent for processing.
4. Experimental method:
• The experimental method is the process of collecting data by performing experiments, research, and investigation. The most frequently used experimental designs are CRD, RBD, LSD (Latin Square Design), and FD.
• CRD: A Completely Randomized Design is a simple experimental design used in data analytics, based on randomization and replication. It is mostly used for comparing experiments.
• RBD: In a Randomized Block Design, the experiment is divided into small units called blocks. Random experiments are performed on each of the blocks, and conclusions are drawn using a technique known as analysis of variance (ANOVA); a small sketch of this step is given after this list. RBD originated in the agricultural sector.
• FD: A Factorial Design is an experimental design in which each experiment has two factors, each with several possible values, and performing trials derives the other combinations of factors.
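To make the ANOVA step concrete, here is a minimal sketch in Python using SciPy's one-way ANOVA. The three treatment groups and their yield figures are invented for illustration; they are not taken from any dataset mentioned above.

```python
# One-way ANOVA, the technique used to draw conclusions in designs
# such as CRD and RBD. The yields below are hypothetical.
from scipy import stats

# Hypothetical yields from three treatments (e.g., three fertilizers)
treatment_a = [20.1, 21.5, 19.8, 22.0]
treatment_b = [23.4, 24.1, 22.8, 23.9]
treatment_c = [19.5, 18.9, 20.2, 19.1]

f_stat, p_value = stats.f_oneway(treatment_a, treatment_b, treatment_c)
print(f"F = {f_stat:.2f}, p = {p_value:.4f}")
# A small p-value suggests the treatment means differ significantly.
```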
2. Secondary data:
Secondary data is data which has already been collected and is reused for some valid purpose. This type of data was previously recorded from primary data, and it comes from two types of sources: internal sources and external sources.
Internal source:
• These data can easily be found within the organization, such as market records, sales records, transactions, customer data, accounting resources, etc. The cost and time consumed in obtaining data from internal sources is low.
External source:
• Data which cannot be found within the organization and is gained through external third-party resources is external source data. The cost and time consumption is greater because such sources contain a huge amount of data. Examples of external sources are government publications, news publications, the Registrar General of India, the Planning Commission, the International Labour Bureau, syndicate services, and other non-governmental publications.
Other sources:
• Sensor data: With the advancement of IoT devices, the sensors on these devices collect data which can be used for sensor data analytics to track the performance and usage of products.
• Satellite data: Satellites collect terabytes of images and data on a daily basis, which can be mined for useful information.
• Web traffic: With fast and cheap internet, data in many formats is uploaded by users on different platforms and can be collected, with their permission, for data analysis. Search engines also provide data on the keywords and queries searched most often.
Classification of data
Difference between Structured, Semi-structured and Unstructured data
• Big Data includes huge volume, high velocity, and an extensible variety of data. It is classified into three categories:
• Structured data,
• Semi-structured data, and
• Unstructured data.
Structured data –
• Structured data is data whose elements are addressable for effective analysis. It is organized into a formatted repository, typically a database. It concerns all data that can be stored in an SQL database, in a table with rows and columns. Such data has relational keys and can easily be mapped into pre-designed fields. Today, this is the most processed form of data and the simplest to manage. Example: relational data.
Semi-structured data –
• Semi-structured data is information that does not reside in a relational database but has some organizational properties that make it easier to analyze. With some processing, it can be stored in a relational database (though this can be very hard for some kinds of semi-structured data), but the semi-structured form exists to save space. Example: XML data.
Unstructured data –
Unstructured data is data which is not organized in a predefined manner or does not have a predefined data model, and thus it is not a good fit for a mainstream relational database. There are alternative platforms for storing and managing unstructured data; it is increasingly prevalent in IT systems and is used by organizations in a variety of business intelligence and analytics applications. Example: Word documents, PDFs, text, media logs.
Differences between Structured, Semi-structured and Unstructured data:

Properties        | Structured data                          | Semi-structured data                               | Unstructured data
Technology        | It is based on relational database tables | It is based on XML/RDF (Resource Description Framework) | It is based on character and binary data
Query performance | Structured queries allow complex joining  | Queries over anonymous nodes are possible          | Only textual queries are possible
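As a small illustration of the first two categories in the table above, the sketch below stores structured data in a relational table and parses a semi-structured XML fragment. It uses only the Python standard library; the table, columns, and tags are invented for the example.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Structured: rows and columns with a relational key, queried via SQL.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("INSERT INTO customers VALUES (1, 'Asha'), (2, 'Ravi')")
for row in conn.execute("SELECT id, name FROM customers"):
    print(row)

# Semi-structured: no fixed schema, but tags give it the organizational
# properties that make it easier to analyze.
xml_doc = "<customers><customer id='1'>Asha</customer></customers>"
root = ET.fromstring(xml_doc)
for customer in root:
    print(customer.get("id"), customer.text)
```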
Introduction to Big Data
What Is Big Data?
• Big Data is the field of analyzing and extracting information from extremely
large data sets.
• The term also refers to large quantities of data that grow exponentially with time.
Such data is so humongous and complex that no conventional method or traditional data management tool can process and store it effectively.
• There are many examples of Big Data. From social media platforms to E-commerce
stores, organizations in various industries generate and utilize data to enhance
their processes.
• Big data includes multiple processes, such as data mining, data analysis, data storage, data visualization, etc. The term "big data" refers collectively to these processes and all the tools that we use alongside them.
Types of Big Data
1. Structured
Structured data refers to the data that you can process, store, and
retrieve in a fixed format. It is highly organized information that you can readily and
seamlessly store and access from a database by using simple algorithms. This is the
easiest type of data to manage as you know what data format you are working with in
advance. For example, the data that a company stores in its databases in the form of
tables and spreadsheets is structured data.
2. Unstructured
Data with an unknown structure is termed unstructured data. It is substantially bigger in size than structured data and is heterogeneous in nature. A great example of unstructured data is the results you get when you perform a Google search: webpages, videos, images, text, and other data formats of varying sizes.
3. Semi-structured
As the name suggests, semi-structured data contains a combination of structured and unstructured data. It is data that hasn't been classified into a specific database but contains vital tags that separate individual elements within it. For example, a table definition in a relational DBMS is semi-structured data.
Characteristics of Big Data
Following are the core characteristics of big data. Understanding them is vital to knowing how big data works and how you can use it. The main characteristics of big data analytics are:
1. Volume
Volume refers to the amount of data that you have. We measure the volume of
our data in Gigabytes, Zettabytes (ZB), and Yottabytes (YB). According to the
industry trends, the volume of data will rise substantially
in the coming years.
2. Velocity
Velocity refers to the speed of data processing. High velocity is crucial for the
performance of any big data process. It consists of the rate of
change, activity bursts, and the linking of incoming data sets.
3. Value
Value refers to the benefits that your organization derives from the data.
Does it match your organization’s goals? Does it help your organization enhance
itself? It’s among the most important big data core characteristics.
4. Variety
Variety refers to the different types of data you collect: structured, semi-structured, and unstructured data drawn from a range of sources.
5. Veracity
Veracity refers to the accuracy of your data. It is among the most important
Big Data characteristics as low veracity can greatly damage the accuracy of your
results.
6. Validity
Validity refers to how valid and relevant the data is for its intended purpose.
7. Volatility
Big data is constantly changing. The data you gathered from a source a day ago might be different from what you find today. This volatility of data affects your data homogenization.
8. Visualization
Visualization refers to showing your big data-generated insights through
visual representations such as charts and graphs. It has become prevalent
recently as big data professionals regularly share their insights with non-technical
audiences.
Main Components of Big Data
1. Ingestion
Ingestion refers to the process of gathering and preparing the data. You’d use
the ETL (extract, transform, and load) process to prepare your data. In this phase,
you have to identify your data sources, determine
whether you’ll gather the data in batches or stream it, and prepare it
through cleansing, massaging, and organization. You perform the extract process in
gathering the data and the transformation process in optimizing it.
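A minimal sketch of the extract and transform steps described above, in plain Python; the records, field names, and cleansing rules are assumptions made for illustration.

```python
# Extract: raw records gathered from a (hypothetical) source.
raw_records = [
    {"name": " alice ", "signup": "1990-05-17"},
    {"name": "BOB", "signup": "1985-11-02"},
    {"name": " alice ", "signup": "1990-05-17"},  # duplicate to be cleansed
]

# Transform: cleanse and "massage" each record.
def transform(record):
    return {"name": record["name"].strip().title(),
            "signup": record["signup"]}

# Organize: deduplicate while transforming.
seen, prepared = set(), []
for rec in map(transform, raw_records):
    key = (rec["name"], rec["signup"])
    if key not in seen:
        seen.add(key)
        prepared.append(rec)

print(prepared)  # now ready for the load step into a warehouse or lake
```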
2. Storage
Once you have gathered the necessary data, you’d need to store it. Here, you’ll
perform the final step of the ETL, the load process. You’d store
your data in a data warehouse or a data lake, depending on your requirements.
This is why it’s crucial to understand your organization’s goals while performing
any big data process.
4. Analysis
In this phase of your big data process, you’d analyze the data to generate
valuable insights for your organization. There are four kinds of big data analytics:
prescriptive, predictive, descriptive, and diagnostic.
You’d use artificial intelligence and machine learning algorithms in this
phase to analyze the data.
5. Consumption
This is the final phase of a big data process. Once you have analyzed the data and have
found the insights, you have to share them with others. Here, you’d have to utilize
data visualization and data storytelling to
share your insights effectively with a non-technical audience such as stakeholders
and project managers.
Advantages of Big Data
There are numerous advantages of Big Data for organizations. Some of the key
ones are as follows:
1. Enhanced Decision-making
Big data implementations can help businesses and organizations make better-informed decisions in less time. They allow them to use outside intelligence, such as search engines and social media platforms, to fine-tune their strategies. Big data can identify trends and patterns that would've been invisible otherwise, helping companies avoid errors.
3. Efficiency Optimization
Organizations use big data to identify the weak areas present within them. Then,
they use these findings to resolve those issues and enhance their operations
substantially. For example, Big Data has substantially
helped the manufacturing sector improve its efficiency through IoT and
robotics.
IMPORTANCE OF DATA ANALYTICS
1. Product Development
Data analytics offers both estimation and exploration capabilities for information. It allows one to understand the market's or process's current state and offers a solid base for forecasting future results. Data analysis helps companies comprehend the current business situation and either change the process or identify the need for new product creation that meets market requirements.
2. Target Content
Learning in advance what consumers want improves the consumer orientation of marketing campaigns. It enables advertisers to tailor their advertising to a subset of the entire consumer base, and to figure out which segment of the client base will respond best to an initiative. It also saves the money spent on convincing a buyer to buy and increases the overall performance of marketing activities.
3. Efficiency in Operations
Data analytics helps marketing find more viable ways to streamline operations and increase benefit levels. It helps to recognize possible issues, avoids waiting periods, and takes action on them.
The Evolution of Analytic Scalability
• Scalability: the ability of a system to handle an increasing amount of the work required to perform its task.
• Database: an organized collection of related data.
– Aggregation
• Combining data into one, e.g., a statistical summary
• Combining rows from different data sources
– Derivations
• Creating new columns of data, e.g., calculating a ratio
– Transformation
• Converting data into a useful format, e.g., taking a log, or converting date of birth to age
Ways for in-database data preparation
• SQL
https://www.youtube.com/watch?v=s8EPQpgpWVE
https://www.youtube.com/watch?v=bcjSe0xCHbE
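Below is a hedged sketch of the in-database preparation styles listed above (aggregation and derivation) expressed as SQL and run against an in-memory SQLite database from Python; the sales table and its figures are invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, revenue REAL, cost REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?, ?)",
                 [("north", 120.0, 80.0), ("north", 90.0, 60.0),
                  ("south", 200.0, 150.0)])

# Aggregation combines rows into a statistical summary; derivation
# creates new columns such as a profit ratio. (Transformations like
# taking a log or turning a date of birth into an age follow the same
# pattern with the appropriate SQL functions.)
query = """
SELECT region,
       SUM(revenue)                        AS total_revenue,  -- aggregation
       SUM(revenue - cost)                 AS total_profit,   -- derivation
       SUM(revenue - cost) / SUM(revenue)  AS margin          -- ratio
FROM sales
GROUP BY region
"""
for row in conn.execute(query):
    print(row)
```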
Data Analysis - Process
Data Collection
• Data Collection is the process of gathering information on targeted variables
identified as data requirements. The emphasis is on ensuring accurate and honest
collection of data. Data Collection ensures that
data gathered is accurate such that the related decisions are valid.
Data Collection provides both a baseline to measure and a target to
improve.
• Data is collected from various sources, ranging from organizational databases to the information in web pages. The data thus obtained may not be structured and may contain irrelevant information. Hence, the collected data is required to be subjected to Data Processing and Data Cleaning.
Data Processing
The data that is collected must be processed or organized for analysis.
This includes structuring the data as required for the relevant Analysis Tools. For
example, the data might have to be placed into rows and columns in a table within
a Spreadsheet or Statistical Application. A Data Model might have to be created.
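As a minimal sketch of this step, the snippet below structures a few collected records into rows and columns with pandas; the field names and values are invented.

```python
import pandas as pd

collected = [
    {"customer": "Asha", "amount": 250.0, "date": "2021-03-01"},
    {"customer": "Ravi", "amount": 180.5, "date": "2021-03-02"},
]

df = pd.DataFrame(collected)             # rows and columns, like a spreadsheet
df["date"] = pd.to_datetime(df["date"])  # give the date column a proper type
print(df.dtypes)
print(df)
```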
Data Cleaning
The processed and organized data may be incomplete, contain
duplicates, or contain errors. Data Cleaning is the process of preventing and
correcting these errors. There are several types of Data Cleaning that depend on the
type of data. For example, while cleaning financial data, certain totals might be compared against reliable published numbers or defined thresholds. Likewise, quantitative methods can be used to detect outliers, which are subsequently excluded from the analysis.
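A brief sketch of the cleaning step, assuming pandas: duplicates are dropped and outliers are excluded with the common 1.5 x IQR rule. The totals and the threshold are illustrative assumptions, not prescriptions.

```python
import pandas as pd

df = pd.DataFrame({"total": [100.0, 102.0, 98.0, 101.0, 99.0,
                             103.0, 97.0, 100.0, 5000.0]})

df = df.drop_duplicates()                    # remove duplicate rows
q1, q3 = df["total"].quantile([0.25, 0.75])  # quartiles of the column
iqr = q3 - q1
mask = df["total"].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)
print(df[mask])                              # 5000.0 is excluded as an outlier
```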
Data Analysis
• Data that is processed, organized and cleaned would be ready for the analysis.
Various data analysis techniques are available to understand, interpret, and
derive conclusions based on the requirements. Data
Visualization may also be used to examine the data in graphical format,
to obtain additional insight regarding the messages within the data.
• Statistical data models such as correlation and regression analysis can be used to identify the relations among the data variables. These models, being descriptive of the data, are helpful in simplifying analysis and communicating results; a brief sketch is given after this list.
• The process might require additional Data Cleaning or additional Data
Collection, and hence these activities are iterative in nature.
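Here is the sketch referred to above: computing a correlation coefficient and fitting a simple linear regression with NumPy. The x/y values are synthetic.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

r = np.corrcoef(x, y)[0, 1]             # Pearson correlation coefficient
slope, intercept = np.polyfit(x, y, 1)  # least-squares linear regression
print(f"r = {r:.3f}, y ~ {slope:.2f}x + {intercept:.2f}")
```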
Communication
• The results of the data analysis are to be reported in a format as required by the users
to support their decisions and further action. The feedback from the users might result
in additional analysis.
• The data analysts can choose data visualization techniques, such as tables and charts, which help in communicating the message clearly and efficiently to the users. The analysis tools provide facilities to highlight the required information with color codes and formatting in tables and charts.
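As a small sketch of such a visualization, the snippet below draws a bar chart with matplotlib; the regions, figures, and units are invented for illustration.

```python
import matplotlib.pyplot as plt

regions = ["North", "South", "East", "West"]
revenue = [120, 200, 150, 90]

fig, ax = plt.subplots()
ax.bar(regions, revenue, color="steelblue")
ax.set_title("Revenue by Region")      # lead with the headline finding
ax.set_ylabel("Revenue (in lakhs)")    # units assumed for the example
fig.savefig("revenue_by_region.png")   # export to share with users
```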
Differences Between Reporting and Analysis
• Living in the era of digital technology and big data has made organizations dependent on the wealth of information that data can bring. You might have seen reporting and analysis used interchangeably, especially in the manner in which outsourcing companies market their services. While both areas are part of web analytics (note that analytics isn't the same as analysis), there's a vast difference between them, and it's more than just spelling.
• It's important that we differentiate the two, because some organizations might be selling themselves short in one area and not reaping the benefits which web analytics can bring to the table. The first core component of web analytics, reporting, is merely organizing data into summaries. On the other hand, analysis is the process of inspecting, cleaning, transforming, and modeling these summaries (reports) with the goal of highlighting useful information.
• Simply put, reporting translates data into information, while analysis turns information into insights. Also, reporting should enable users to ask "What?" questions about the information, whereas analysis should answer "Why?" and "What can we do about it?"
Here are five differences between
reporting and analysis
1. Purpose
• Reporting has helped companies monitor their data since before digital technology boomed. Various organizations have depended on the information it brings to their business, as reporting extracts that information and makes it easier to understand.
• Analysis interprets data at a deeper level. While reporting can link cross-channel data, provide comparisons, and make information easier to understand (think of a dashboard with charts and graphs, which are reporting tools and not analysis reports), analysis interprets this information and provides recommendations on actions.
2. Tasks
• As reporting and analysis have a very fine line dividing them, it's sometimes easy to confuse tasks labeled as analysis when all they do is reporting. Hence, ensure that your analytics team maintains a healthy balance of both.
• Here’s a great differentiator to keep in mind if what you’re doing is reporting or
analysis:
• Reporting includes building, configuring, consolidating, organizing, formatting, and summarizing. It's very similar to the above-mentioned tasks, like turning data into charts and graphs and linking data across multiple channels.
• Analysis consists of questioning, examining, interpreting, comparing,
and confirming. With big data, predicting is possible as well.
3. Outputs
• Reporting and analysis have push and pull effects on their users through their outputs. Reporting has a push approach: it pushes information to users, and outputs come in the form of canned reports, dashboards, and alerts.
• Analysis has a pull approach, where a data analyst pulls information to probe further and to answer business questions. Outputs can take the form of ad hoc responses and analysis presentations. Analysis presentations comprise insights, recommended actions, and a forecast of their impact on the company, all in a language that's easy to understand at the level of the user who'll be reading and deciding on it.
• This is important for organizations that want to truly realize the value of data: a standard report is not the same as meaningful analytics.
4. Delivery
• Considering that reporting involves repetitive tasks, often with truckloads of data, automation has been a lifesaver, especially now with big data. It's not surprising that the first things outsourced are data entry services, since outsourcing companies are perceived as data reporting experts.
• Analysis requires a more custom approach, with human minds doing superior reasoning and analytical thinking to extract insights, and technical skills to provide efficient steps towards accomplishing a specific goal. This is why data analysts and scientists are in demand these days, as organizations depend on them to come up with recommendations that help leaders and business executives make decisions about their businesses.
5. Value
• This isn't about identifying which one brings more value, but rather about understanding that both are indispensable when looking at the big picture. Together they should help businesses grow, expand, move forward, and make more profit or increase their value.
• The Path to Value diagram illustrates how reporting and analysis together convert data into value; neither achieves this without the other.
The word 'Data' has been in existence for ages now. In an era when 2.5 quintillion bytes of data are generated every day, data plays a crucial role in decision making for business operations. But how do you think we can deal with so much data? Well, there are several roles in the industry today that deal with data to gather insights, and one such vital role is that of a Data Analyst. A Data Analyst requires many tools to gather insights from data. This article on the Top 10 Data Analytics Tools will talk about the top tools that everyone, from a budding Data Analyst to a skilled professional, must learn in 2021.
1. R and Python
2. Microsoft Excel
3. Tableau
4. RapidMiner
5. KNIME
6. Power BI
7. Apache Spark
8. QlikView
9. Talend
10. Splunk
Let's dive into this article, starting from the 10th tool on our list, i.e., Splunk.
Splunk
Products
Splunk Free
Splunk Enterprise
Splunk Cloud
All three products differ in the bandwidth of features they offer, and free downloads and trial versions are available. The pricing options for Splunk products are based on predictive pricing, infrastructure-based pricing, and rapid adoption packages.
Companies using
Trusted by 92 of the Fortune 100, companies such as Dominos, Otto Group, Intel, and Lenovo use Splunk in their day-to-day practices to discover processes and correlate data in real time.
Since almost all organizations need to deal with data across various divisions, Splunk, according to its official website, aims to bring data to every part of your organization by helping teams prevent and predict problems with its monitoring experience, detect and diagnose issues with clear visibility, explore and visualize business processes, and streamline the entire security stack.
Talend
Talend is one of the most powerful data integration (ETL) tools available in the market and is developed in the Eclipse graphical development environment. Named a Leader in Gartner's Magic Quadrant for Data Integration Tools and Data Quality Tools 2019, this tool lets you easily manage all the steps involved in the ETL process and aims to deliver compliant, accessible, and clean data for everyone.
Products
Talend offers several products; some are completely free, some are free for 14 days, and some are licensed. All these products differ in their functionalities and pricing options.
Talend is the only platform that delivers complete and clean data at the moment you need it, by maintaining data quality, providing big data integration, cloud API services, data preparation, a data catalog, and the Stitch Data Loader.
Recently, Talend has also accelerated the journey to the lakehouse paradigm and the path to revealing intelligence in data. Not only this, but Talend Cloud is now available in the Microsoft Azure Marketplace.
QlikView
Products
QlikView comes with a variety of products and services for data integration, data analytics, and developer platforms, some of which are available for a free trial period of 30 days.
Companies using
Trusted by more than 50,000 customers worldwide, some of QlikView's top customers are CISCO, NHS, KitchenAid, and SAMSUNG.
Apache Spark
Apache Spark is one of the most successful projects of the Apache Software Foundation: an open-source cluster computing framework used for real-time processing. The most active Apache project at the moment, it comes with a fantastic open-source community and a programming interface that ensures fault tolerance and implicit data parallelism.
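As a taste of that programming interface, here is a minimal PySpark sketch of the classic word count (it assumes pyspark is installed, e.g., via pip; the input lines are made up).

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("sketch").getOrCreate()

lines = spark.sparkContext.parallelize(["big data", "fast data", "big wins"])
counts = (lines.flatMap(lambda line: line.split())   # split lines into words
               .map(lambda word: (word, 1))          # pair each word with 1
               .reduceByKey(lambda a, b: a + b))     # sum counts per word
print(counts.collect())   # e.g., [('big', 2), ('data', 2), ...]
spark.stop()
```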
Products
Apache Spark regularly ships new releases with new features, and you can choose among various package types. The most recent version is 2.4.5, with 3.0.0 in preview.
Companies using
Companies such as Oracle, Hortonworks, Verizon, and Visa use Apache Spark for real-time computation on data, valuing its ease of use and speed.
Power BI
Power BI is a Microsoft product used for business analytics. Named a Leader for the 13th consecutive year in the Gartner 2020 Magic Quadrant, it provides interactive visualizations with self-service business intelligence capabilities, where end users can create dashboards and reports by themselves, without having to depend on anybody.
Products
Power BI Desktop
Power BI Pro
Power BI Premium
Power BI Mobile
Power BI Embedded
Power BI Report Server
All these products differ in the functionalities they offer. Some are free for a certain period of time, after which you have to purchase the licensed versions.
Power BI has recently come up with solutions such as Azure + Power BI and Office 365 + Power BI to help users analyze, connect, and protect data across various Office platforms.
KNIME
Konstanz Information Miner, most commonly known as KNIME, is a free, open-source data analytics, reporting, and integration platform built for analytics on a GUI-based workflow.
You do not need prior programming knowledge to use KNIME and derive insights.
You can work all the way from gathering data and creating models to deployment and
production.
RapidMiner
RapidMiner is the next tool on our list. Named a Visionary in the 2020 Gartner Magic Quadrant for Data Science and Machine Learning Platforms, RapidMiner is a platform for data processing, building machine learning models, and deployment.
INTRODUCTION
In this article, let us look into the applications of Data Analytics. Everything today runs on data, from social media to large companies. The term data refers to information about anything. Each company and each institution maintains a set of data that it has earned, collected, and maintained over a period of time. These data are collected, maintained, and analyzed to improve and evaluate the growth of the company. Analysis of data, in other words Data Analytics, is a vast field and one of the most important fields to cover today.
The term Data Analytics refers to the analysis of collected data to draw out the conclusions required by the company's objectives. It involves structuring a massive amount of irregular data and deriving the required useful information from it using statistical tools. It also involves the preparation of charts, graphs, etc. The application of data analytics is not limited to manufacturing companies or industrial areas; it is involved in almost every field of human life.
1. Transportation
Data analytics can be applied to improve transportation systems and the intelligence around them. Predictive analysis helps find transport problems such as traffic or network congestion. It helps synchronize the vast amount of data and uses it to build and design plans and strategies for alternative routes that reduce congestion and traffic, which in turn reduces the number of accidents and mishaps. Data analytics can also help optimize the buyer's travel experience by drawing on information recorded from social media. It also helps travel companies fix their packages and boost the personalized travel experience based on the data collected.
For example, during the wedding season or the holiday season, transport facilities are prepared to accommodate the heavy number of passengers traveling from one place to another, using prediction tools and techniques.
The searched data is treated as a keyword, and all the related pieces of information are presented in a sorted manner that one can easily understand. For example, when you search for a product on Amazon, it keeps showing up on your social media profiles, providing you with the details of the product to convince you to buy it.
4. Manufacturing
Data analytics helps manufacturing industries maintain their overall working through tools like predictive analysis, regression analysis, budgeting, etc. A unit can figure out the number of products it needs to manufacture based on the data collected and analyzed from demand samples, and likewise in many other operations, increasing operating capacity as well as profitability.
5. Security
Data analytics provides the utmost security to an organization. Security analytics is an approach to cybersecurity focused on the analysis of data to deliver proactive security measures. No business can foresee the future, particularly where security threats are concerned, but by deploying security analytics tools that can analyze security events, it is possible to detect a threat before it has a chance to affect your systems and your bottom line.
6. Education
Education is among the fields where data analytics is most needed in the current scenario. It is mostly used in adaptive learning, new innovations, adaptive content, etc. It is the measurement, collection, analysis, and reporting of data about learners and their specific circumstances, for the purpose of understanding and optimizing learning and the environments in which it occurs.
7. Healthcare
Applications of data analytics in healthcare can be used to filter enormous amounts of data in seconds to find treatment options or solutions for different illnesses. This not only provides accurate solutions based on historical data but may also provide precise answers to the unique concerns of specific patients.
8. Military
Military applications of data analytics bring together an assortment of technical and application-oriented use cases. They enable leaders and technologists to make connections between data analytics and such fields as augmented reality and cognitive science that are driving military organizations around the globe forward.
Life Cycle Phases of Data Analytics
In this article, we are going to discuss the life cycle phases of data analytics, covering them one by one.
Data Analytics Lifecycle: The data analytics lifecycle is designed for Big Data problems and data science projects. The cycle is iterative, to reflect a real project. To address the distinct requirements of performing analysis on Big Data, a step-by-step methodology is needed to organize the activities and tasks involved in acquiring, processing, analyzing, and repurposing data.
Phase 1: Discovery – The data science team learns about and investigates the problem.
• Develop context and understanding.
• Come to know about the data sources needed and available for the project.
• The team formulates an initial hypothesis that can later be tested with data.
Phase 4: Model Building – The team develops datasets for testing, training, and production purposes.
• The team also considers whether its existing tools will suffice for running the models or whether it needs a more robust environment for executing them.
• Free or open-source tools: R and PL/R, Octave, WEKA.
• Commercial tools: MATLAB, STATISTICA.
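The sketch below illustrates this phase with scikit-learn (one free Python option alongside the tools named above): the data is split into training and test sets and a simple model is fitted. It uses scikit-learn's bundled iris dataset, an assumption made purely for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)  # held-out test set

model = LogisticRegression(max_iter=200).fit(X_train, y_train)
print(f"test accuracy: {model.score(X_test, y_test):.2f}")
```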
Phase 5: Communicate Results – After executing the model, the team needs to compare the outcomes of the modeling to the criteria established for success and failure.
• The team considers how best to articulate the findings and outcomes to the various team members and stakeholders, taking into account caveats and assumptions.
• The team should identify key findings, quantify the business value, and develop a narrative to summarize and convey the findings to stakeholders.