Unit 1
Introduction to Data Analytics
2. Diagnostic Analytics focuses on the reason for the occurrence of any event. It
answers questions such as “Did the weather impact the selling of beer?” or “Did the
3. Predictive Analytics focuses on the events that are expected to occur in the immediate future. Predictive
analytics tries to answer questions such as: What happened to sales during the last hot summer? How
many weather forecasts predict a hot summer this year?
4. Prescriptive Analytics indicates a plan of action. If the chance of a hot summer calculated as the average of
the five weather models is above 58%, an evening shift can be added to the brewery, and an additional tank can
be rented to maximize the production.
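The prescriptive rule above can be sketched in a few lines of Python. This is only an illustration: the five probabilities are made-up values, and the function name `recommend_actions` is hypothetical. Only the 58% threshold and the two actions come from the text.

```python
from statistics import mean

# Hypothetical hot-summer probabilities from five weather models (illustrative values only).
model_probabilities = [0.62, 0.55, 0.71, 0.48, 0.66]

HOT_SUMMER_THRESHOLD = 0.58  # the 58% cut-off mentioned in the text


def recommend_actions(probabilities, threshold=HOT_SUMMER_THRESHOLD):
    """Average the model probabilities and return the prescribed actions."""
    average = mean(probabilities)
    if average > threshold:
        actions = ["add an evening shift", "rent an additional tank"]
    else:
        actions = []
    return average, actions


average, actions = recommend_actions(model_probabilities)
print(f"Average chance of a hot summer: {average:.1%}")
print("Recommended actions:", actions or "none")
```

The point of the sketch is that prescriptive analytics ends in a decision, not only in a number: the average is computed, compared with the threshold, and mapped to a plan of action.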
Benefits of Data Analytics
• A) Primary data
• Primary data means first-hand information collected by an investigator.
• It is collected for the first time.
• It is original and more reliable.
• For example, the population census conducted by the government of India
every ten years is primary data.
• B) Secondary data
• Secondary data refers to second-hand information.
• It is not originally collected and rather obtained from already published or
unpublished sources.
• For example, the address of a person taken from the telephone directory or the
phone number of a company taken from Just Dial are secondary data.
Big Data Characteristics
Big Data refers to data sets so large that they cannot be processed by
traditional data storage or processing units. It is used by
many multinational companies to process the data and business of
many organizations. The data flow would exceed 150 exabytes per day
before replication.
5 V’s of Big Data
Tools Of Data Analytics
• 1. SAS
• SAS is a proprietary software suite written in C, with over 200 components. Because its programming language is considered
high-level, it is simple to learn. Its results, however, are published directly to an Excel worksheet. Several
businesses use it, including Twitter, Netflix, Facebook, and Google. SAS also continues to improve, showing that it remains a
significant player in the data analytics business despite competition from newer languages such as R and Python.
• 2. R
• R is among the top programming languages for creating detailed statistical visuals. It is free, open-source software that runs on
Windows, macOS, and many UNIX operating systems. It also features a simple-to-use command-line interface.
Nevertheless, learning it can be challenging, especially for those without prior coding skills. It is extremely
helpful for developing statistical software and carrying out sophisticated analyses.
• 3. Python
• Python is one of the most powerful tools available for data analytics. It offers numerous packages and
libraries. Python is a free, open-source language with libraries such as Matplotlib and Seaborn, which can be used for sophisticated
visualization. The most popular data analytics package for Python is Pandas. Owing to its efficiency and adaptability,
analysts frequently choose Python as a first programming language. Python runs on many systems and has a wide range of
applications.
• 4. Microsoft Excel
• An advanced understanding of Excel will help you clean and visualize your data. It allows you to use charts and conditional formatting to identify trends and patterns. You
can perform the following activities with Excel:
• Regression analysis
• Statistical analysis
• Inferential statistics
• Descriptive statistics
• Exploratory data analysis
• 5. RapidMiner
• As the name suggests, this tool is primarily used for data mining. But you can also use it for various statistical techniques, such as inferential statistics and descriptive statistics, to
generate summaries and conclusions.
• 6. Tableau
• Tableau is a data visualization platform that allows you to share insights, collaborate over data analysis tasks, and share reports with stakeholders. Tableau has robust analytical
features, such as limitless what-if analysis, and enables you to perform calculations with as many types of variables as you need.
• 7. Apache Spark
• Apache Spark helps with large-scale data engineering, regression analysis, and exploratory analysis, allowing you to analyze massive datasets.
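Two of the activities listed under Microsoft Excel above, descriptive statistics and regression analysis, can also be sketched outside a spreadsheet. The example below is a minimal illustration using only the Python standard library. The figures and the helper name `fit_line` are assumptions made for the example, not part of the original material.

```python
from statistics import mean, median, stdev

# Hypothetical monthly figures (illustrative values only).
ad_spend = [10.0, 20.0, 30.0, 40.0, 50.0]
sales = [25.0, 45.0, 65.0, 85.0, 105.0]

# Descriptive statistics: summarize a single variable.
summary = {"mean": mean(sales), "median": median(sales), "stdev": stdev(sales)}


def fit_line(x, y):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    x_bar, y_bar = mean(x), mean(y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar


# Regression analysis: relate one variable to another.
slope, intercept = fit_line(ad_spend, sales)
print(summary)
print(f"sales = {slope:.2f} * ad_spend + {intercept:.2f}")
```

Descriptive statistics summarize what the data looks like, while the fitted line estimates how one variable changes with another.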
ANALYZING VS REPORTING
ANALYTICS
• Analytics is the method of examining and analyzing summarized data to make business decisions.
• Questioning the data, understanding it, investigating it, and presenting it to the end user are all part of analytics.
• The purpose of analytics is to draw conclusions based on data.
• Analytics is used by data analysts, scientists, and business people to make effective decisions.
REPORTING
• Reporting is the act of putting together all the needed information from the data in an organized way.
• Identifying business events, gathering the required information, organizing, summarizing, and presenting the existing data are all part of reporting.
• The purpose of reporting is to organize the data into meaningful information.
• Reporting is provided to the appropriate business leaders so that they can perform effectively and efficiently within a firm.
MODERN DATA ANALYTICS TOOLS
• 1. RapidMiner
• Primary use: Data mining
• RapidMiner is a comprehensive package for data mining and model
development. This platform allows professionals to work with data at
many stages, including preparation, visualization, and review. This can be
beneficial for professionals who have data that isn’t in raw format or that
they have mined in the past.
• RapidMiner also offers an array of classification, regression, clustering, and association rule mining algorithms. While it has
some limitations in feature engineering and selection, it compensates for its limitations with a powerful graphical
programming language.
• 2. Orange
• Primary use: Data mining
• Orange is a package renowned for data visualization and analysis, especially appreciated for its user-friendly, color-
coordinated interface. You can find a comprehensive selection of color-coded widgets for functions like data input,
cleaning, visualization, regression, and clustering, which make it a good choice for beginners or smaller projects.
• Despite offering fewer tools compared to other platforms, Orange is still an effective data analysis tool, hosting an array
of mainstream algorithms like k-nearest neighbors, random forests, naive Bayes classification, and support vector
machines.
• 3. KNIME
• KNIME, short for KoNstanz Information MinEr, is a free and open-source data cleaning and analysis tool that makes data mining accessible even if you are a
beginner. Along with data cleaning and analysis software, KNIME has specialized algorithms for areas like sentiment analysis and social network analysis. With
KNIME, you can integrate data from various sources into a single analysis and use extensions to work with popular programming languages like R, Python, Java,
and SQL.
• If you are new to data mining, KNIME might be a great choice for you. Resources on the KNIME platform can help new data professionals learn about data
mining by guiding them through building, deploying, and maintaining large-scale data mining strategies. Because of this, many companies use KNIME to help
• 4. Tableau
• Tableau stands out as a leading data visualization software, widely utilized in business analytics and intelligence.
• Tableau is popular for its easy-to-use interface and powerful capabilities. It can connect with hundreds of different data
sources and present the information in many different visualization types. It holds a special appeal for both business users, who appreciate its simplicity
and centralized platform, and data analysts, who can use more advanced big data tools for tasks such as clustering and regression.
• 5. Google Charts
• Google Charts is a free online tool that excels in producing a wide array of interactive and engaging data visualizations. Its design caters
to user-friendliness, offering a comprehensive selection of pre-set chart types that can be embedded into web pages or applications. The
versatile nature of Google Charts allows it to work across a multitude of platforms, including iPhone, iPad, and Android,
extending its accessibility.
• Its high customizability and user-friendly nature make this tool ideal if you are looking to create compelling data visuals for web
and mobile platforms. It’s also a great option if you need to publish your charts, as the integration makes it straightforward for you to
publish on most web platforms by sharing a link or embedding the link into a website’s HTML code.
• 6. Microsoft Excel and Power BI
• Primary use: Business intelligence
• Microsoft Excel, fundamentally a spreadsheet software, also has noteworthy data analytics capabilities.
Because of the wide enterprise-level adoption of Microsoft products, many businesses find they already
have access to it.
• You can use Excel to construct at least 20 distinct chart types using spreadsheet data. These range from
standard options such as bar charts and scatter plots to more complex options like radar charts and
treemaps. Excel also has many streamlined options for businesses to find insights into their data and use
modern business analytics formulas.
• 7. Google Analytics
• Primary use: Business intelligence
• Google Analytics is a tool that helps businesses understand how people interact with their
websites and apps. To use it, you add a special piece of JavaScript code to your web pages. This
code collects information when someone visits your website, like which pages they see,
what device they’re using, and how they found your site. It then sends this data to Google
Analytics, where it is organized into reports. These reports help you see patterns, like
which products are most popular or which ads are bringing people to your site.
• 8. Spotfire
• Primary use: Business intelligence
• TIBCO Spotfire is a user-friendly platform that transforms data into actionable insights. It
allows you to analyze historical and real-time data, predict trends, and visualize results in
a single, scalable platform. Features include custom analytics apps, interactive AI and
data science tools, real-time streaming analytics, and powerful analytics for location-
based data.
APPLICATIONS OF DATA ANALYTICS
• 1. Energy
• The energy sector's applications of data analytics focus on consumption
analysis and grid optimization. In an era of rising energy demands, efficient
distribution and consumption become paramount.
• Through these analytics applications, energy distribution can be optimized,
and consumption patterns predicted.
• 2. Finance & Banking
• In finance and banking, the applications of data analytics are primarily
directed towards fraud detection and risk management. Every transaction
provides data that, when analyzed, can reveal anomalies.
• This usage of data analytics reduces fraudulent activities and helps manage
risks linked to loans and investments.
• 3. Government & Public Sector
• Governments utilize the applications of data analytics in policy formation
and resource distribution. The vast administrative data provides insights
into public needs and requirements.
• These analytics applications allow for policies that are more aligned with
public needs, ensuring resources are allocated wisely and public services
improve.
• 4. Health Care
• In the health care domain, data analytics applications play a pivotal role in
diagnosis and treatment optimization. Massive volumes of patient data are
now analyzed to detect patterns and correlations.
• These analytics applications guide health care professionals in making
decisions that lead to enhanced patient outcomes and substantial
reductions in medical expenses.
• 5. Manufacturing
• Manufacturing industries utilize data analytics applications for quality
control and process efficiency. With complex machinery and operations,
every stage provides vital data.
• Predictive analytics helps in pre-empting manufacturing defects and
refining production workflows, leading to reduced waste and superior
products.
• 6. Marketing & Advertising
• Marketing professionals harness analytics applications for precise customer
segmentation and to gauge the effectiveness of their campaigns.
• With the insights from these data analytics applications, businesses can
target audiences more effectively and assess their campaign ROI.
• 7. Real Estate
• The real estate sector's applications of data analytics involve property
valuation and tracking market trends. The fluctuating property
market generates vast amounts of data.
• Real estate professionals, armed with these insights, can more
accurately price properties and anticipate market movements.
• 8. Retail & E-Commerce
• The retail and e-commerce sector taps into analytics applications to
gain customer insights and manage inventory. The digital footprints
of online shopping are treasure troves of data.
• With data analytics applications, retailers can discern customer
preferences, hone pricing strategies, and oversee optimal stock
levels, translating to boosted sales and cost savings.
• 9. Insurance
• In the insurance sector, data analytics applications are crucial for risk
assessment and claim processing. With countless policyholders and
claims, insurers rely on analytics to make accurate predictions and
decisions.
• Benefit: These analytics applications allow insurers to set premiums
more accurately based on risk, as well as expedite claim processes,
which enhances customer satisfaction and operational efficiency.
• 10. Transport & Logistics
• In transport and logistics, the applications of data analytics involve
route optimization and demand prediction. The constant movement
of goods provides a continuous stream of data to be processed.
School of Computing Science and Engineering
Course Code: R1UC402T    Course Name: Data Analytics
UNIT 1
Data Analytics Lifecycle (Need, Importance & Key Role)
Name of the Faculty: Dr. Avinash Dwivedi    Program Name: B.Tech (CSE)
Data Analytics Lifecycle
• The data analytics lifecycle is a structure for doing data analytics that has
business objectives at its core.
• The data analytics lifecycle is a series of six phases that have each been
identified as vital for businesses doing data analytics. This lifecycle is based on
the popular CRISP-DM analytics process model, an open-standard
analytics model originally developed by an industry consortium that included
SPSS (now part of IBM). The phases of the data analytics lifecycle
include defining your business objectives, cleaning your data, building models,
and communicating with your stakeholders.
• This lifecycle runs from identifying the problem you need to solve, to running your
chosen models against some sandboxed data, to finally operationalizing the
output of these models by running them on a production dataset. This will enable
you to find the answer to your initial question and use this answer to inform
business decisions.
Requirement of the Data Analytics Lifecycle
• The data analytics lifecycle allows you to better understand the
factors that affect successes and failures in your business. It’s
especially useful for finding out why customers behave a certain
way. These customer insights are extremely valuable and can
help inform your growth strategy.
• Certain key roles are required for a data science team to function
fully and execute analytics projects successfully. There are seven
key roles.
• Each role plays a crucial part in developing a successful analytics
project. There is no hard and fast rule about the listed
seven roles; fewer or more may be used depending on the
scope of the project, the skills of the participants, and the organizational
structure.
• Example –
For a small, versatile team, the listed seven roles may be fulfilled
by only three to four people, whereas a large project may
require 20 or more people to fulfill them.
Key Roles for Data Analytics project
• Business User :
• The business user is the one who understands the domain area of the project and typically
benefits from the results.
• This user advises and consults the team working on the project about the value of the
results obtained and how the outputs will be operationalized.
• A business manager, line manager, or deep subject matter expert in the project domain usually fulfills
this role.
• Project Sponsor :
• The Project Sponsor is responsible for initiating the project. The Project Sponsor
provides the actual requirements for the project and presents the basic business issue.
• The sponsor generally provides the funds and measures the degree of value from the final output of the
team working on the project.
• This person sets the priorities for the project and clarifies the desired output.
• Project Manager :
• This person ensures that the key milestones and objectives of the project are met on time and at the
expected quality.
• Business Intelligence Analyst :
• The Business Intelligence Analyst provides business domain expertise based on a detailed and deep
understanding of the data, key performance indicators (KPIs), key metrics, and business intelligence from a
reporting point of view.
• This person generally creates dashboards and reports and knows about the data feeds and sources.
• Data Engineer :
• The data engineer applies deep technical skills to assist with tuning SQL queries for data management and
data extraction and provides support for data intake into the analytic sandbox.
• The data engineer works jointly with the data scientist to help shape data in the correct ways for analysis.
• Data Scientist :
• The data scientist provides subject matter expertise in analytical techniques, data modelling, and
applying the correct analytical techniques to a given business issue.
• The data scientist ensures that the overall analytical objectives are met.
• Data scientists design and apply analytical methods and approaches to the data available for the
concerned project.
Phases of Data Analytics Lifecycle
• The data analytics life cycle follows a scientific method that gives it a
structured framework. It is divided into six phases of data analytics
architecture.
• Phase 1: Data Discovery and Formation
• Phase 2: Data Preparation and Processing
• Phase 3: Design a Model
• Phase 4: Model Building
• Phase 5: Result Communication and Publication
• Phase 6: Measuring of Effectiveness
Phase 1: Data Discovery and Formation
• Everything begins with a defined goal. In this phase, you’ll define your data’s purpose and how to achieve it
by the time you reach the end of the data analytics lifecycle. The goal of this first phase is to make evaluations
and assessments to come up with a basic hypothesis for resolving any problems and challenges in the
business.
• The initial stage consists of mapping out the potential use and requirement of data, such as where the
information is coming from, what story you want your data to convey, and how your organization benefits
from the incoming data. As a data analyst, you will have to study the business industry domain, research case
studies that involve similar data analytics and, most importantly, scrutinize the current business trends.
• Then you also have to assess all the in-house infrastructure and resources, time and technology requirements
to match with the previously gathered data. After the evaluations are done, the team then concludes this stage
with hypotheses that will be tested with data later. This is the preliminary stage in the big data analytics
lifecycle and a very important one.
• Basically, as a data analysis expert, you’ll need to focus on enterprise requirements related to data, rather than
data itself. Additionally, your work also includes assessing the tools and systems that are necessary to read,
organize, and process all the incoming data.
• Essential activities in this phase include structuring the business problem in the form of an analytics
challenge and formulating the initial hypotheses (IHs) to test and start learning the data. The
subsequent phases are then based on achieving the goal that is drawn in this stage. So you will
need to develop an understanding and concept that will later come in handy while testing it with
data.
Phase 2: Data Preparation and Processing
• This stage consists of everything that has anything to do with data. In phase 2, the
attention of experts moves from business requirements to information requirements.
• The data preparation and processing step involves collecting, processing, and cleansing
the accumulated data. One of the essential parts of this phase is to make sure that the
data you need is actually available to you for processing. The earliest step of the data
preparation phase is to collect valuable information and proceed with the data analytics
lifecycle in a business ecosystem. Data is collected using the methods below:
• Data Acquisition: Accumulating information from external sources.
• Data Entry: Creating new data points using digital systems or manual data entry
techniques within the enterprise.
• Signal Reception: Capturing information from digital devices, such as control systems
and the Internet of Things.
• The data preparation stage in the big data analytics life cycle requires something known
as an analytical sandbox. This is a scalable platform that data analysts and data scientists
use to process data. The analytical sandbox is filled with data that has been extracted, loaded,
and transformed into the sandbox. This stage in the business analytical cycle does not
have to happen in a predetermined sequence and can be repeated later if the need
arises.
Phase 3: Design a Model
• After mapping out your business goals and collecting a glut of data
(structured, unstructured, or semi-structured), it is time to build a
model that utilizes the data to achieve the goal. This phase of the
data analytics process is known as model planning.
• There are several techniques available to load data into the
system and start studying it:
• ETL (Extract, Transform, and Load) transforms the data first using a
set of business rules, before loading it into a sandbox.
• ELT (Extract, Load, and Transform) first loads raw data into the
sandbox and then transforms it.
• ETLT (Extract, Transform, Load, Transform) is a mixture; it has two
transformation levels.
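The difference between the three loading techniques is only the order of the steps, which a small sketch can make concrete. Everything here is an assumption made for illustration: the raw records, the in-memory list standing in for the sandbox, and the function names are hypothetical, and the "business rule" is just trimming whitespace and parsing ages.

```python
# Hypothetical raw records; a plain list stands in for the sandbox (illustrative only).
RAW_RECORDS = [" Alice ,34", "bob,  29", " Carol,41 "]


def extract():
    """Extract: read raw records from the source."""
    return list(RAW_RECORDS)


def light_transform(records):
    """First-level transform used by ETLT: only trim surrounding whitespace."""
    return [record.strip() for record in records]


def transform(records):
    """Transform: apply a simple business rule (clean names, parse ages)."""
    cleaned = []
    for record in records:
        name, age = (field.strip() for field in record.split(","))
        cleaned.append({"name": name.title(), "age": int(age)})
    return cleaned


def load(sandbox, records):
    """Load: append records to the sandbox."""
    sandbox.extend(records)
    return sandbox


def run_etl():
    # Transform before loading: the sandbox only ever holds cleaned rows.
    return load([], transform(extract()))


def run_elt():
    # Load raw data first, then transform it inside the sandbox.
    sandbox = load([], extract())
    sandbox[:] = transform(sandbox)
    return sandbox


def run_etlt():
    # Light transform, load, then a second transform inside the sandbox.
    sandbox = load([], light_transform(extract()))
    sandbox[:] = transform(sandbox)
    return sandbox


print(run_etl())
```

In this toy case all three pipelines end with the same cleaned rows. What differs is whether the raw data ever sits in the sandbox, which matters when analysts want access to the untransformed records.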
Phase 4: Model Building
• This phase helps the team determine whether its current tools can
sufficiently execute the model or whether a more robust
system is needed for it to work properly.
Phase 5: Result Communication and Publication
• Remember the goal you had set for your business in phase 1?
Now is the time to check if those criteria are met by the tests
you have run in the previous phase.
• The communication step starts with a collaboration with major
stakeholders to determine if the project results are a success or
failure. The project team is required to identify the key findings
of the analysis, measure the business value associated with the
result, and produce a narrative to summarise and convey the
results to the stakeholders.
Phase 6: Measuring of Effectiveness
UNIT-1
Data Analytics Lifecycle : Discovery, Data preparations, Model Planning, Model
Building, Operationalization
Name of the Faculty: Ms. Kimmi Gupta Program Name: B.Tech(IV Sem)
Data Analytics Lifecycle
Phase 1(Discovery)
• In this phase, the purpose and goals of the data analytics project are established.
• This may involve defining the problem to be solved, determining the key metrics
to be measured, and identifying the relevant data sources.
• This first phase involves getting the context around your problem: you need to
know what problem you are solving and what business outcomes you wish to see.
• You should begin by defining your business objective and the scope of the work.
• Work out what data sources will be available and useful to you
Phase 2(Data Preparation)
• This phase involves collecting and cleaning the data to make it suitable for
analysis.
• This may include data integration, data transformation, and data profiling. It's
crucial to ensure that the data is accurate, complete, and consistent before it is
used for analysis.
• It requires the presence of an analytic sandbox; the team extracts, loads, and
transforms data to get it into the sandbox.
• Data preparation tasks are likely to be performed multiple times and not in a
predefined order.
• Several tools commonly used for this phase are Hadoop, Alpine Miner,
OpenRefine, etc.
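As a rough illustration of the profiling and cleaning described in this phase, the sketch below checks a handful of records for completeness, accuracy, and consistency. The records, field names, and cleaning rules are all assumptions chosen for the example. Real projects would typically use the tools listed above.

```python
# Hypothetical customer records with typical quality problems (illustrative only).
raw_rows = [
    {"id": 1, "city": " Delhi ", "age": "34"},
    {"id": 2, "city": "mumbai", "age": ""},        # incomplete: missing age
    {"id": 2, "city": "mumbai", "age": ""},        # duplicate of the row above
    {"id": 3, "city": "DELHI", "age": "twenty"},   # inaccurate: age is not a number
    {"id": 4, "city": "Pune", "age": "29"},
]


def profile(rows):
    """Data profiling: count missing ages, non-numeric ages, and duplicate ids."""
    ids = [row["id"] for row in rows]
    return {
        "rows": len(rows),
        "missing_age": sum(1 for row in rows if row["age"] == ""),
        "non_numeric_age": sum(
            1 for row in rows if row["age"] and not row["age"].isdigit()
        ),
        "duplicate_ids": len(ids) - len(set(ids)),
    }


def clean(rows):
    """Keep the first usable row per id and standardize city names."""
    seen, cleaned = set(), []
    for row in rows:
        if row["id"] in seen or not row["age"].isdigit():
            continue  # skip duplicates and rows with a missing or invalid age
        seen.add(row["id"])
        cleaned.append(
            {"id": row["id"], "city": row["city"].strip().title(), "age": int(row["age"])}
        )
    return cleaned


print(profile(raw_rows))
print(clean(raw_rows))
```

Profiling first and cleaning second mirrors the iterative nature of the phase: the profile shows what is wrong, and the cleaning rules can then be adjusted and rerun.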
Phase 3(Model Planning)
• In this phase, the data models are constructed and the data is analyzed to generate
insights.
• This may involve creating predictive models, performing clustering, or conducting
hypothesis testing.
• The team develops datasets for testing, training, and production purposes.
• Free or open-source tools – R and PL/R, Octave, WEKA.
• Commercial tools – MATLAB, STATISTICA.
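Developing separate training and testing datasets, as mentioned above, usually means shuffling the available records and splitting them. The following is a minimal sketch using only the Python standard library. The 80/20 ratio, the fixed seed, and the function name `split_dataset` are assumptions for illustration, not values prescribed by the lifecycle.

```python
import random


def split_dataset(rows, train_fraction=0.8, seed=42):
    """Shuffle a copy of the rows and split it into training and testing sets."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split reproducible
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


records = list(range(10))  # hypothetical stand-in for ten observations
train, test = split_dataset(records)
print(len(train), "training rows;", len(test), "testing rows")
```

Keeping the testing rows out of model fitting gives the team an honest estimate of how the model behaves on data it has not seen.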
Phase 5(Communicating Results)
• In this phase, the insights and findings from the data analysis are communicated to
stakeholders.
• This may involve creating visualizations, writing reports, or presenting the results
to key decision-makers.
• After executing the model, the team needs to compare the outcomes of modeling to the criteria
established for success and failure.
• The goal is to communicate the insights in a way that is easily understood by the
intended audience.
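Comparing modeling outcomes to the success criteria can be as simple as a table of targets and measured values. The sketch below assumes two hypothetical metrics, accuracy and recall, with made-up targets and results. The actual criteria would be whatever the team agreed on in the discovery phase.

```python
# Hypothetical success criteria agreed in the discovery phase (illustrative only).
success_criteria = {"accuracy": 0.85, "recall": 0.70}

# Hypothetical outcomes measured after executing the model.
model_outcomes = {"accuracy": 0.88, "recall": 0.64}


def evaluate(outcomes, criteria):
    """Return, per metric, whether the outcome meets or exceeds its target."""
    return {metric: outcomes[metric] >= target for metric, target in criteria.items()}


results = evaluate(model_outcomes, success_criteria)
for metric, passed in results.items():
    status = "met" if passed else "not met"
    print(f"{metric}: {model_outcomes[metric]:.2f} vs target {success_criteria[metric]:.2f} -> {status}")
print("All criteria met:", all(results.values()))
```

A per-metric summary like this is also a convenient starting point for the report or presentation given to stakeholders.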
Phase 6(Operationalization)
• This phase involves putting the insights and findings into action.
• This may involve updating processes and systems, automating data-driven
processes, or monitoring the results to ensure that they align with expectations.
• This approach enables the team to learn about the performance and related constraints of
the model in a production environment on a small scale and make adjustments before
full deployment.
• The team delivers final reports, briefings, and code.
• Free or open source tools – Octave, WEKA, SQL, MADlib.
Different Phases of Data Analytics Life Cycle
Conclusion
• The Data Analytics lifecycle is a circular process consisting of six primary stages
that define how the information is created, collected, processed, used, and
analyzed. It begins with mapping out business objectives and works towards achieving them
through the rest of the stages.
• The Data analytics lifecycle was designed to address Big Data problems and data
science projects.
• To address the specific demands of conducting analysis on Big Data, a step-by-step
methodology is required to plan the various tasks associated with the
acquisition, processing, analysis, and recycling of data.