Data Analytics and Visualization Unit-I
Big Data Platforms are designed to handle and analyse vast amounts of data efficiently. Typically, they blend various Data Management tools to handle data on a large scale, usually leveraging cloud storage. Let’s explore some key features you can expect from these platforms:
1) Scalability: They can scale horizontally to manage increasing volumes of data without compromising
performance.
2) Distributed Processing: They use distributed computing to process large datasets across multiple nodes,
ensuring faster data processing.
3) Real-time Stream Computing: Capable of processing data in real time, which is crucial for applications
requiring immediate insights.
4) Machine Learning and Advanced Analytics: They offer built-in tools for Machine Learning and
Advanced Analytics to derive actionable insights from data.
5) Data Analytics and Visualisation: Provide tools for Data Analysis and visualisation to help users make
sense of complex data.
Big Data Platforms are complex systems designed to handle vast volumes of data, process it efficiently, and
turn it into valuable insights. These platforms consist of several essential components, each playing a critical
role in overall functionality.
1) Data Ingestion
This is the first step in the Big Data journey. Data can come from various sources, including sensors,
applications, social media, and databases. The data ingestion component is responsible for gathering this
diverse data and making it ready for processing. It involves data connectors, adapters, and protocols to
ensure data from different sources can be efficiently brought into the platform.
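As a rough illustration of this step, the Python sketch below pulls records from a hypothetical REST endpoint and lands them in a local staging file; the URL, file name, and JSON layout are all assumptions for the example, not part of any particular platform.

```python
# Hedged sketch of data ingestion: pull JSON records from a REST endpoint
# and append them to a local staging file as JSON lines.
# The URL and file name are hypothetical, not from any specific platform.
import json
import urllib.request

SOURCE_URL = "https://example.com/api/events"  # hypothetical data source

def ingest(url: str, staging_path: str) -> int:
    """Fetch a JSON array of records from `url`, append to `staging_path`."""
    with urllib.request.urlopen(url) as resp:
        records = json.load(resp)
    with open(staging_path, "a", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record) + "\n")
    return len(records)

if __name__ == "__main__":
    count = ingest(SOURCE_URL, "events_staging.jsonl")
    print(f"Ingested {count} records")
```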
2) Data Storage
Once data is ingested, it needs a place to reside. Big Data Platforms employ a variety of storage solutions
designed to handle large datasets. Common storage systems include distributed file systems (e.g., Hadoop
HDFS, Amazon S3) and NoSQL databases (e.g., Apache Cassandra, MongoDB). These storage systems are
optimised for scalability, fault tolerance, and high availability, ensuring data remains accessible and reliable
even as it grows.
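As a minimal example of handing ingested data to one such store, the sketch below uploads the staged file to Amazon S3 using the boto3 client; the bucket and key names are invented, and the snippet assumes boto3 is installed and AWS credentials are configured.

```python
# Hedged sketch: land the staged file in Amazon S3 with boto3.
# The bucket and key are invented; assumes boto3 is installed and
# AWS credentials are configured in the environment.
import boto3

s3 = boto3.client("s3")
s3.upload_file(
    Filename="events_staging.jsonl",      # local file from the ingestion step
    Bucket="my-data-lake",                # hypothetical bucket name
    Key="raw/events/2024-01-01.jsonl",    # date-partitioned key layout
)
```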
3) Data Processing
This is the heart of Big Data Platforms, where data is transformed, processed, and analysed to extract
meaningful insights. Processing engines and frameworks like Apache Spark, Apache Flink, and Hadoop
MapReduce play a vital role in this component. They distribute and parallelise computations across clusters
of machines, enabling the platform to handle massive workloads efficiently.
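To make the idea concrete, here is a minimal PySpark sketch that reads a hypothetical orders dataset and computes a distributed aggregation; the path and column names are illustrative, and Spark parallelises the work across the cluster automatically.

```python
# Minimal PySpark sketch of distributed processing: read a large dataset
# and aggregate it across the cluster. Path and column names are illustrative.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("daily-revenue").getOrCreate()

# Spark splits the input into partitions and processes them in parallel.
orders = spark.read.csv("s3a://my-data-lake/raw/orders/",
                        header=True, inferSchema=True)

daily = (orders
         .groupBy("order_date")
         .agg(F.sum("amount").alias("revenue"),
              F.count(F.lit(1)).alias("orders")))

daily.show()
spark.stop()
```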
4) Data Management and Orchestration
Managing and orchestrating data processing tasks across a distributed infrastructure is a complex task. The
management layer includes components for resource allocation, job scheduling, and workflow orchestration.
This layer ensures that data processing tasks run smoothly and efficiently, optimising resource utilisation.
5) Data Visualisation and Reporting
The insights derived from Big Data analysis are only valuable if they can be understood and acted upon. This
layer includes tools and technologies for data visualisation and reporting, allowing users to create
interactive dashboards, generate reports, and visualise trends and patterns in the data.
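As a small illustration of the reporting end of the pipeline, the sketch below turns an aggregated result into a chart with pandas and matplotlib; the figures are made up.

```python
# Small illustration of the reporting end: chart an aggregated result
# with pandas and matplotlib. The figures below are made up.
import pandas as pd
import matplotlib.pyplot as plt

daily = pd.DataFrame({
    "order_date": ["2024-01-01", "2024-01-02", "2024-01-03"],
    "revenue": [1250.0, 1810.5, 1640.0],
})

daily.plot(x="order_date", y="revenue", kind="bar", legend=False)
plt.ylabel("Revenue")
plt.title("Daily revenue")
plt.tight_layout()
plt.savefig("daily_revenue.png")  # image can be embedded in a report
```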
6) Data Security and Governance
Data security and governance are paramount in Big Data Platforms, especially when dealing with sensitive
information. This layer includes components for authentication, authorisation, encryption, and auditing. It
ensures that data is protected from unauthorised access and maintains compliance with regulatory
requirements.
Big Data Platforms follow a structured process to enable companies to harness data for informed decision-
making. This process involves several key steps:
a) Data Collection: This initial step systematically gathers data from various sources such as databases,
social media, and sensors. Methods like web scraping, data feeds, APIs, and data integration tools are used to
collect data, which is then stored in a central repository, often a data lake or warehouse, for easy access and
further analysis.
b) Data Storage: After collection, data must be stored efficiently for retrieval and processing. Big Data
Platforms typically use distributed storage systems like Hadoop Distributed File System (HDFS), Google
Cloud Storage, or Amazon S3. This architecture ensures high availability, fault tolerance, and scalability.
c) Data Processing: Collected data is processed to extract valuable insights through operations such as
cleaning, transforming, and aggregating. Platforms like Apache Hadoop and Apache Spark enable rapid
computations and complex data transformations (a small pandas sketch of these operations follows this list).
d) Data Analysis: This step involves examining and interpreting large data volumes to extract meaningful
insights and patterns using machine learning algorithms, data mining techniques, or visualisation tools. The
results inform data-driven decisions, optimise processes, and identify opportunities.
e) Data Quality Assurance: Ensuring data accuracy, consistency, integrity, relevance, and security is crucial.
Techniques like data quality management, lineage tracking, and cataloguing help maintain robust data
quality, giving organisations confidence in their decision-making data.
f) Data Management: This involves organising, storing, and retrieving large data volumes. Techniques such
as data backup, recovery, and archiving ensure fault tolerance and optimised data retrieval for various use
cases.
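The pandas sketch below illustrates step (c) on a toy dataset: cleaning (dropping missing keys, coercing types), transforming, and aggregating. All column names and values are invented for the example.

```python
# Toy pandas sketch of step (c): clean, transform, and aggregate raw records.
# All column names and values are invented for the example.
import pandas as pd

raw = pd.DataFrame({
    "customer": ["a", "a", "b", None, "c"],
    "amount": ["10.5", "3.0", "7.25", "1.0", "bad"],
})

# Cleaning: drop rows with a missing key, coerce types, discard bad values.
clean = raw.dropna(subset=["customer"]).copy()
clean["amount"] = pd.to_numeric(clean["amount"], errors="coerce")
clean = clean.dropna(subset=["amount"])

# Aggregating: total spend per customer.
totals = clean.groupby("customer")["amount"].sum().reset_index()
print(totals)
```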
***************************************************************************************
Benefits of Using Big Data Platforms
There are various benefits of using Big Data Platforms, which are discussed below:
a) Big Data Integration Platforms help organisations make smarter decisions by providing insights from
vast datasets, ensuring that choices depend on facts rather than guesswork.
b) These platforms streamline data storage and processing, reducing infrastructure costs and making Data
Management more affordable.
c) Big Data Platforms enable real-time Data Analysis, allowing companies to respond quickly to changing
situations and seize opportunities as they arise.
d) They help integrate data from various sources, creating a unified view of information and facilitating
comprehensive analysis.
e) With better insights, organisations can tailor their services, products and strategies to meet customer needs
more effectively, increasing satisfaction and loyalty.
f) Big Data Platforms can expand effortlessly to accommodate growing data volumes, ensuring they remain
effective as organisations evolve.
g) Those who harness Big Data gain an edge by staying ahead of the competition and providing superior
products and services.
h) These platforms spark innovation by revealing trends, gaps, and opportunities, driving the development of
new products and services.
i) Big Data Platforms offer robust security features to protect sensitive data, mitigating risks in an
increasingly complex Cyber Security landscape.
j) They improve operations across sectors, from manufacturing to healthcare, increasing efficiency and
reducing waste.
***************************************************************************************
Popular Big Data Platforms
Big Data Platforms are capable of handling massive amounts of data and turning it into valuable
information. Here, we'll introduce you to a list of those platforms:
**************************************************************************************
a) Apache Hadoop: Apache Hadoop is an excellent platform for storing and processing large volumes of
data. It's like a robust storage and data processing system that companies use to handle and manage massive
datasets.
b) Apache Spark: Apache Spark is known for its speed and efficiency in analysing data. It's like a powerful
tool that helps organisations quickly make sense of their data and extract valuable insights from it.
c) Apache Flink: Apache Flink is another data processing platform, similar to Spark, that specialises in real-
time Data Analysis. It's used for tasks where speed and low latency are critical, like monitoring online
activities or financial transactions.
d) Amazon Web Services (AWS) Big Data services: AWS offers a suite of Big Data services that run in the
cloud. These services make it easier for companies to store, process, and analyse data without the need for
extensive infrastructure management.
e) Google Cloud Platform (GCP) Big Data services: Similar to AWS, Google Cloud Platform provides a
range of Big Data services in the cloud. These services help organisations leverage Google's computing
power and data analytics capabilities.
f) Microsoft Azure Big Data services: Microsoft Azure offers various Big Data services, including data
storage, processing, and analytics tools. These services are designed to help businesses work with their data
efficiently and effectively.
https://www.studocu.com/in/document/velammal-engineering-college/big-data/evolution-of-analytic-
scalability/22574245
TOOLS
Discover the right tool for your needs and stay ahead in the competitive world of data analytics.
1. Tableau
Tableau is an easy-to-use Data Analytics tool. Its drag-and-drop interface helps users create
interactive visuals and dashboards. Organizations can use it to quickly develop visuals that give
context and meaning to raw data, making the data very easy to understand. Thanks to the simple
interface, anyone can use the tool regardless of technical ability. Furthermore, Tableau comes with a
wide range of features and tools that help you create clear, easy-to-understand visuals.
Tableau's standout advantage is its high-quality visuals embedded with interactive information. But
this doesn't mean Tableau is perfect: it is only meant for Data Visualisation, so you can't preprocess
data with it. It also has a bit of a learning curve and is known for its high cost.
Features:
• Easy Drag and Drop Interface
• Mobile support for both iOS and Android
• The Data Discovery feature allows you to find hidden data
• You can use various Data sources like SQL Server, Oracle, etc
2. Power BI
Power BI is Microsoft's Data Analysis tool. It provides enhanced Interactive Visualisation and
Business Intelligence capabilities, all while providing a simple and intuitive User Interface.
Being a product of Microsoft, you can expect seamless
integration with various Microsoft products. It allows you to connect with Excel
spreadsheets, cloud-based data sources and on-premises data sources.
Power BI is known and loved for its groundbreaking features like Natural Language
queries, Power Query Editor Support, and intuitive User Interface. But Power BI does have
its downsides. It cannot handle records bigger than 250 MB in size. Besides, it has limited sharing
capabilities, and you would need to pay extra to scale as per your needs.
Features:
• Great connectivity with Microsoft products
• Powerful Semantic Models
• Can meet both Personal and Enterprise needs
• Ability to create beautiful paginated reports
3. Apache Spark
Apache Spark is a Data Analysis tool known for its speed in Data Processing. Spark uses in-memory
processing, which makes it incredibly fast. It is also open source, which builds trust and
interoperability. The ability to handle enormous amounts of data distinguishes Spark, and it is quite
easy and straightforward to learn thanks to its API. It also has support for Distributed Computing
frameworks.
But Apache Spark does have some drawbacks. It doesn't have an integrated File Management System and
has fewer algorithms than its competitors. It also struggles when files are tiny.
Features:
• Incredible Speed and Efficiency
• Great connectivity with support of Python, Scala, R, and SQL shells
• Ability to handle and manipulate data in real-time
• Can run on many platforms like Hadoop, Kubernetes, Cloud, and also standalone
4. TensorFlow
TensorFlow is a Machine Learning library and is counted among data analysis tools. This open-source
library was developed by Google and is a popular choice for many businesses looking to add Machine
Learning capabilities to their Data Analytics workflow, as TensorFlow can build and train Machine
Learning models. TensorFlow is the first choice of many due to its wide recognition, which means an
adequate supply of tutorials, and its support for many Programming Languages. TensorFlow can also run
on GPUs and TPUs, making tasks much faster.
But TensorFlow can be very hard for beginners to use: you need coding knowledge to use it standalone,
and it has a steep learning curve. TensorFlow can also be quite tricky to install and configure,
depending on your system.
Features:
• Supports a lot of programming languages like Python, C++, JavaScript, and Java
• Can scale as needed with support for multiple CPUs, GPUs, or TPUs
• Offers a large community to solve problems and issues
• Features a built-in visualization tool for you to see how the model is performing
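As a taste of the workflow, the hedged sketch below builds, trains, and evaluates a tiny Keras model on synthetic data; the architecture, data, and task are arbitrary, chosen only to keep the example self-contained.

```python
# Hedged sketch: build, train, and evaluate a tiny Keras model on synthetic
# data. The architecture and data are arbitrary, chosen to stay self-contained.
import numpy as np
import tensorflow as tf

# Toy task: predict whether the sum of four features is positive.
X = np.random.randn(512, 4).astype("float32")
y = (X.sum(axis=1) > 0).astype("float32")

model = tf.keras.Sequential([
    tf.keras.Input(shape=(4,)),
    tf.keras.layers.Dense(8, activation="relu"),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy",
              metrics=["accuracy"])
model.fit(X, y, epochs=5, batch_size=32, verbose=0)

loss, acc = model.evaluate(X, y, verbose=0)
print(f"accuracy: {acc:.2f}")
```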
5. Hadoop
Hadoop by Apache is a Distributed Processing and Storage solution that is also used as a data
analysis tool. It is an open-source framework that stores and processes Big Data with
the help of the MapReduce Model. Hadoop is known for its scalability. It is also fault-
tolerant and can continue even after one or more nodes fail. Being Open Source, it can be
used freely and customized to suit specific needs, and Hadoop also supports various Data
Formats.
But Hadoop does have some drawbacks. Hadoop requires powerful hardware for it to run
effectively. In addition, it features a steep learning curve, making it hard for some users. This is
partly because some users find the MapReduce Model hard to grasp (the plain-Python sketch after the
feature list below illustrates the idea).
Features:
• Free to use as it is Open Source
• Can run on commodity hardware
• Built with fault-tolerance as it can operate even when some node fails
• Highly scalable with the ability to distribute data into multiple nodes
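The MapReduce model itself is easy to demonstrate without a cluster. The plain-Python sketch below mimics the three stages (map, shuffle, reduce) for a word count; real Hadoop distributes each stage across many nodes, which this toy version does not attempt.

```python
# Plain-Python illustration of the MapReduce model (not Hadoop itself):
# map each record to (key, value) pairs, shuffle by key, then reduce.
from collections import defaultdict

documents = ["big data platforms", "data analytics", "big data"]

# Map: emit (word, 1) for every word in every document.
mapped = [(word, 1) for doc in documents for word in doc.split()]

# Shuffle: group the emitted values by key.
groups = defaultdict(list)
for word, one in mapped:
    groups[word].append(one)

# Reduce: sum the counts for each word.
word_counts = {word: sum(ones) for word, ones in groups.items()}
print(word_counts)  # {'big': 2, 'data': 3, 'platforms': 1, 'analytics': 1}
```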
6. R
R is an open-source programming language widely used for Statistical Computing and Data Analysis, and
it can be considered a data analysis tool. It is known for handling large datasets and for its
flexibility. R's package library contains a wide variety of packages that let the user manipulate and
visualize data. Besides, R also has packages for things like Data Cleaning, Machine Learning, and
Natural Language Processing. These features make R very capable.
Despite these features, R isn’t perfect. For example, R is significantly slower than
languages like C++ and Java. Besides, R is known to have a steep learning curve,
especially if you are unfamiliar with Programming.
Features:
• Ability to handle large Datasets
• Flexibility to be used in many areas like Data Visualisation, Data Processing
• Features built-in graphics capabilities for amazing visuals
• Offers an active community to answer questions and help in problem-solving
7. Python
Python is another programming language popular for Data Analysis and Machine Learning, and it is used
extensively in data analysis tools. Python is widely recognized for its easy syntax, which makes it
easy to learn. Along with the easy syntax, Python's package ecosystem features a lot of important
packages and libraries. This makes it suitable for Data Analysis and Machine Learning. Another reason
to use Python is its scalability.
This doesn’t mean Python is flawless. It is quite slow when we compare it to languages
like Java or C++; this is because Python is an interpreted language while the others are
compiled. Besides, Python is also infamous for its high memory consumption.
Features:
• Easy to learn and user-friendly
• Scalable with the ability to handle large datasets
• Extensive packages and libraries that increase the functionality
• Open Source and widely adopted which ensures problems can be fixed easily.
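As a small demonstration of that easy syntax, the sketch below computes summary statistics using only Python's standard library; the sales figures are invented.

```python
# A taste of Python's easy syntax: summary statistics with only the
# standard library. The sales figures are invented.
import statistics

sales = [120, 135, 128, 150, 142, 138]

print("mean:  ", statistics.mean(sales))
print("median:", statistics.median(sales))
print("stdev: ", round(statistics.stdev(sales), 2))
```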
8. SAS
SAS stands for Statistical Analysis System. The SAS Software was developed by the SAS
Institute, and it is widely used for Business Analytics nowadays. SAS has both a Graphical
User Interface and a Terminal Interface. So, depending on the user’s skillsets, they can
choose either one. It also has the ability to handle large datasets. In addition, SAS is
equipped with a lot of analytical tools, which makes it suitable for a lot of applications.
Although SAS is very powerful, it has a big price tag and a steep learning curve, so it is
quite hard for beginners.
Features:
• Ability to handle large datasets
• Support for graphical and non-graphical interface
• Features tools to create high-quality visualizations
• Wide range of tools for predictive and statistical analysis
9. QlikSense
QlikSense is a business and data analysis tool that provides support for Data Visualisation and Data
Analysis. QlikSense supports various data sources, from spreadsheets and databases to cloud services.
You can create amazing dashboards and visualisations. It comes with Machine Learning features and uses
AI to help the user understand the data. Furthermore, QlikSense also has features like Instant Search
and Natural Language Processing.
But QlikSense does have some drawbacks. The data extraction of QlikSense is quite inflexible. The
pricing model is quite complicated, and it is quite sluggish when it comes to large datasets.
Features:
• Tools for stunning and interactive Data Visualisation
• Conversational AI-powered analytics with Qlik Insight Bot
• Features tools to create high-quality visualizations
• Provides Qlik Big Data Index which is a Data Indexing Engine
10. KNIME
KNIME is an analytics platform and a data analysis tool. It is open source and features an intuitive
user interface. KNIME is built for scalability and also offers extensibility via a well-defined plugin
API. You can also automate spreadsheets, do Machine Learning, and much more using KNIME. The best part
is you don't even need to code to do all this.
But KNIME does have its issues. The abundance of features can be overwhelming to some users. Also, the
Data Visualisation of KNIME is not the best and could be improved.
Features:
• Intuitive User Interface with drag and drop function
• Support for extensive analytics tools like Machine Learning, Data Mining, Big Data
Processing
• Provides tools to create high-quality visualizations
**************************************************************************************
Steps for Data Analysis Process
1. Define the Problem or Research Question
2. Collect Data
3. Data Cleaning
4. Analyzing the Data
5. Data Visualization
6. Presenting Data
A compact code walk-through of these steps follows below.
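As promised above, here is a compact, hedged walk-through of the six steps on synthetic data, using pandas and matplotlib; every value and column name is made up for illustration.

```python
# Compact, hedged walk-through of the six steps on synthetic data.
# Assumes pandas and matplotlib; every value and column name is made up.
import pandas as pd
import matplotlib.pyplot as plt

# 1. Define the question: which product category sells best?
# 2. Collect data: here, an in-memory stand-in for a real source.
df = pd.DataFrame({
    "category": ["toys", "toys", "books", "books", None],
    "units": [5, 7, 3, "4", 2],
})

# 3. Clean: drop missing categories, coerce numeric types.
df = df.dropna(subset=["category"])
df["units"] = pd.to_numeric(df["units"])

# 4. Analyze: aggregate units sold per category.
summary = df.groupby("category")["units"].sum().sort_values(ascending=False)

# 5./6. Visualize and present.
summary.plot(kind="bar")
plt.title("Units sold by category")
plt.tight_layout()
plt.savefig("units_by_category.png")
print(summary)
```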
***************************************************************************************
Analytics is the technique of examining data and reports to obtain actionable insights
that can be used to comprehend and improve business performance. With analytics, business
users may gain insights from data, recognize trends, and make better decisions.
On the one hand, analytics is about finding value or making new data to help you
decide. This can be performed either manually or automatically. Next-generation
analytics uses new technologies like AI or machine learning to make predictions about
the future based on past and present data.
Knowing the differences between analytics and reporting can significantly benefit your
business. If you want to use both to their full potential and not miss out on essential
parts of either one, knowing the difference between the two is important. Some key
differences are:
Analytics: the method of examining and analyzing summarized data to make business decisions.
Reporting: an action that includes all the needed information and data, put together in an organized way.
Analytics examines report data to determine why organizational problems occur and how to fix them.
Analysts begin by asking questions that may arise as they
examine how the data in the reports has been structured. A qualified analyst can make
recommendations to improve business performance once the data analysis is
complete.
Data Analytics and Data Analysis are related yet distinct processes within data
interpretation. Data Analysis focuses on dissecting information from raw data,
identifying trends, patterns, and relationships. It involves cleaning, organizing, and
summarizing data to extract insights. On the other hand, Data Analytics goes beyond
surface-level exploration. It uses advanced techniques to model, predict, and prescribe
outcomes based on historical data, enabling businesses to make informed decisions
for the future.
Analytics and reporting go hand in hand, and you can’t have one without the other.
The raw data is the first step in the whole process. The data then needs to be put
together into accurate, readable information. Reports can be comprehensive and
employ a range of technologies. Still, their main objective is always to make it simpler
for analysts to understand what is actually happening within the organization.
********************************************************************************
Data analytics finds applications across various industries and sectors, transforming
the way organizations operate and make decisions. Here are some examples of how
data analytics is applied in different domains:
Healthcare
In healthcare, data analytics helps providers improve patient outcomes, predict disease risk, and
optimize hospital operations by analyzing patient records, treatment results, and operational data.
Finance
In the financial sector, data analytics plays a crucial role in fraud detection, risk
assessment, and investment strategies. Banks and financial institutions analyze large
volumes of data to identify suspicious transactions, predict creditworthiness, and
optimize investment portfolios. Data analytics also enables personalized financial
advice and the development of creative financial products and services.
E-commerce
E-commerce companies use data analytics to recommend products, personalize the shopping experience,
and optimize pricing by analyzing browsing behavior, purchase history, and customer reviews.
Cybersecurity
Data analytics plays a vital role in cybersecurity by detecting and preventing cyber
threats and attacks. Security systems analyze network traffic, user behavior, and
system logs to identify anomalies and potential security breaches. By leveraging data
analytics, organizations can proactively strengthen their security measures, detect and
respond to threats in real-time, and safeguard sensitive information.
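One simple (and deliberately naive) anomaly-detection idea is to flag values far from the mean. The sketch below applies a z-score rule to invented transaction amounts; production systems use far richer features and models.

```python
# Deliberately naive anomaly-detection sketch: flag transaction amounts far
# from the mean with a z-score rule. Real systems use much richer models.
import statistics

amounts = [42.0, 39.5, 41.2, 40.8, 38.9, 420.0, 41.7]  # one suspicious spike

mean = statistics.mean(amounts)
stdev = statistics.stdev(amounts)

for amount in amounts:
    z = (amount - mean) / stdev
    if abs(z) > 2:  # more than two standard deviations from the mean
        print(f"possible anomaly: {amount} (z = {z:.1f})")
```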
Banking
Banks use data analytics to gain insights into customer behavior, manage risks, and
personalize financial services. Banks can tailor their offerings, identify potential fraud,
and assess creditworthiness by analyzing transaction data, customer demographics,
and credit histories. Data analytics also helps banks detect money laundering
activities and improve regulatory compliance.
Logistics
In the logistics industry, data analytics plays a crucial role in optimizing transportation
routes, managing fleet operations, and improving overall supply chain efficiency.
Logistics companies can minimize costs, reduce delivery times, and enhance customer
satisfaction by analyzing data on routes, delivery times, and vehicle performance.
Data analytics also enables better demand forecasting and inventory management.
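As a toy illustration of demand forecasting, the sketch below produces a naive next-day forecast from a rolling average in pandas; the demand series is invented, and real forecasting models are considerably more sophisticated.

```python
# Toy illustration of demand forecasting: a naive next-day forecast from a
# three-day rolling average in pandas. The demand series is invented.
import pandas as pd

demand = pd.Series([100, 120, 110, 130, 125, 140],
                   index=pd.date_range("2024-01-01", periods=6, freq="D"))

# Forecast tomorrow as the mean of the last three observed days.
forecast = demand.rolling(window=3).mean().iloc[-1]
print(f"next-day forecast: {forecast:.1f}")  # (130 + 125 + 140) / 3
```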
Retail
Data analytics transforms the retail industry by providing insights into customer
preferences, optimizing pricing strategies, and improving inventory management.
Retailers analyze sales data, customer feedback, and market trends to identify popular
products, personalize offers, and forecast demand. Data analytics also helps retailers
enhance their marketing efforts, improve customer loyalty, and optimize store layouts.
Manufacturing
Manufacturers apply data analytics to predictive maintenance, quality control, and production
optimization by analyzing sensor data from equipment and production lines.
Internet Searching
Data analytics powers internet search engines, enabling users to find relevant
information quickly and accurately. Search engines analyze vast amounts of data,
including web pages, user queries, and click-through rates, to deliver the most
relevant search results. Data analytics algorithms continuously learn and adapt to user
behavior, providing increasingly accurate and personalized search results.
Risk Management
Data analytics plays a crucial role in risk management across various industries,
including insurance, finance, and project management. Organizations can assess risks,
develop mitigation strategies, and make informed decisions by analyzing historical
data, market trends, and external factors. Data analytics helps organizations identify
potential risks, quantify their impact, and implement risk mitigation measures.
**************************************************************************************
The Data analytic lifecycle is designed for Big Data problems and data science projects.
The cycle is iterative, to represent a real project. To address the distinct requirements for
performing analysis on Big Data, a step-by-step methodology is needed to organize the
activities and tasks involved with acquiring, processing, analyzing, and repurposing data.
•Phase 1: Discovery –
•The data science team learns and investigates the problem.
•Develop context and understanding.
•Identify the data sources needed and available for the project.
•The team formulates the initial hypothesis that can be later tested with data.
•Phase 2: Data Preparation –
•Steps to explore, preprocess, and condition data before modeling and analysis.
•It requires the presence of an analytic sandbox; the team extracts, loads, and transforms data to get
it into the sandbox.
•Data preparation tasks are likely to be performed multiple times and not in predefined
order.
•Several tools commonly used for this phase are – Hadoop, Alpine Miner, Open Refine, etc.
•Phase 3: Model Planning –
•The team explores data to learn about the relationships between variables and subsequently selects
key variables and the most suitable models.
•The team also decides on the data sets to be used for training and testing the models in the next
phase.
•Several tools commonly used for this phase are – R, SQL Analysis services, and SAS/ACCESS.
•Phase 4: Model Building –
•The team develops datasets for testing, training, and production purposes.
•The team also considers whether its existing tools will suffice for running the models or if it needs
a more robust environment for executing them.
•Free or open-source tools – R and PL/R, Octave, WEKA.
•Commercial tools – Matlab and STATISTICA.
•(A small scikit-learn sketch of Phases 3-5 follows this list.)
•Phase 5: Communicate Results –
•After executing the model, the team needs to compare the outcomes of modeling to the criteria
established for success and failure.
•The team considers how best to articulate findings and outcomes to various team members and
stakeholders, taking into account warnings and assumptions.
•The team should identify key findings, quantify the business value, and develop a narrative to
summarize and convey findings to stakeholders.
•Phase 6: Operationalize –
•The team communicates the benefits of the project more broadly and sets up a pilot project to deploy
the work in a controlled way before broadening it to a full enterprise of users.
•This approach enables the team to learn about the performance and related constraints of the model in
a production environment on a small scale and make adjustments before full deployment.
•The team delivers final reports, briefings, and code.
•Free or open-source tools – Octave, WEKA, SQL, MADlib.
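To ground Phases 3-5, here is a hedged scikit-learn sketch that splits prepared data into training and testing sets, builds a model, and compares the outcome to a success criterion; the dataset is synthetic and the 0.80 threshold is an invented example of "criteria established for success".

```python
# Hedged scikit-learn sketch of Phases 3-5: split prepared data into training
# and testing sets, build a model, and compare the outcome to a success
# criterion. The data is synthetic and the 0.80 threshold is invented.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Stand-in for conditioned data from the analytic sandbox (Phase 2).
X, y = make_classification(n_samples=500, n_features=8, random_state=0)

# Phase 3/4: training and testing sets, then model building.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# Phase 5: compare modeling outcomes to the success criterion.
accuracy = accuracy_score(y_test, model.predict(X_test))
print(f"test accuracy: {accuracy:.2f} (success criterion: >= 0.80)")
```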
**************************************************************************************