Analytical Approaches and Tools To Analyze Data
U-3 P-2
Analytical approaches
• The ways adopted to sort, analyze, or solve
problems are known as analytical approaches.
• As the size of the data to be analyzed grew,
newer analytical approaches were adopted.
• Commonly used approaches are:
1. Ensemble methods
2. Text data analysis
Ensemble methods
• It is a process of generating multiple models
and combining them to solve a specific
problem.
• The main aim is to minimize the probability of
selecting a poor model and to improve overall
predictive performance.
• Ex: Bagging, Boosting, Random forests
Bagging
• It stands for bootstrap aggregation.
• The idea behind bagging is combining the
results of multiple models (for instance, all
decision trees) to get a generalized result.
• We create subsets of observations from the
original dataset by sampling with replacement.
• A base model (weak model) is created on each
of these subsets.
• The models run in parallel and are independent
of each other.
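The bagging steps above can be sketched in plain Python. This is a minimal illustration, not a reference implementation: the one-split "stump" used as the weak base model, the function names, and the toy data are all assumptions made for the example.

```python
import random
from statistics import mean


def bootstrap_sample(data, rng):
    """Draw len(data) observations from data, with replacement."""
    return [rng.choice(data) for _ in data]


def fit_stump(sample):
    """Fit a one-split regression stump (the weak base model) on (x, y) pairs."""
    xs = sorted({x for x, _ in sample})
    best = None
    for lo, hi in zip(xs, xs[1:]):
        threshold = (lo + hi) / 2
        left = [y for x, y in sample if x <= threshold]
        right = [y for x, y in sample if x > threshold]
        left_mean, right_mean = mean(left), mean(right)
        sse = (sum((y - left_mean) ** 2 for y in left)
               + sum((y - right_mean) ** 2 for y in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    if best is None:  # the sample contained a single distinct x value
        constant = mean(y for _, y in sample)
        return lambda x: constant
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean


def bagging_fit(data, n_models=25, seed=0):
    """Train one independent stump per bootstrap sample."""
    rng = random.Random(seed)
    return [fit_stump(bootstrap_sample(data, rng)) for _ in range(n_models)]


def bagging_predict(models, x):
    """Combine the models by averaging their predictions."""
    return mean(model(x) for model in models)


# Toy data (made up for illustration): y jumps from 0 to 10 at x = 5.
data = [(x, 0.0 if x < 5 else 10.0) for x in range(10)]
models = bagging_fit(data)
low, high = bagging_predict(models, 0), bagging_predict(models, 9)
print(low, high)
```

Because each stump depends only on its own bootstrap sample, the calls inside bagging_fit could run in parallel; the list comprehension here runs them one after another only for simplicity.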
Boosting
• Boosting is a sequential process, where each
subsequent model attempts to correct the
errors of the previous model.
• A subset is created from the original dataset.
• Initially, all data points are given equal
weights.
• A base model is created on this subset.
• This model is used to make predictions on the
whole dataset.
• Errors are calculated using the actual values and
predicted values.
• Observations that are incorrectly predicted are
given higher weights.
• Another model is created and predictions are made
on the dataset.
• Similarly, multiple models are created, each
correcting the errors of the previous model.
• The final model (strong learner) is the weighted
mean of all the models (weak learners).
• Thus, the boosting algorithm combines a number of
weak learners to form a strong learner.
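The boosting loop above can be sketched as follows. This is an AdaBoost-style illustration under stated assumptions: it trains each weak learner on the full weighted dataset rather than on a subset, uses one-split stumps as weak learners, and combines them by a weighted vote. The function names and the toy labels are hypothetical.

```python
import math


def fit_weighted_stump(xs, ys, weights):
    """Return (error, threshold, polarity) of the stump with lowest weighted error."""
    best = None
    for t in sorted(set(xs)):
        for polarity in (1, -1):
            preds = [polarity if x <= t else -polarity for x in xs]
            err = sum(w for w, p, y in zip(weights, preds, ys) if p != y)
            if best is None or err < best[0]:
                best = (err, t, polarity)
    return best


def adaboost_fit(xs, ys, n_rounds=3):
    n = len(xs)
    weights = [1.0 / n] * n  # initially, all data points get equal weight
    ensemble = []
    for _ in range(n_rounds):
        err, t, polarity = fit_weighted_stump(xs, ys, weights)
        err = min(max(err, 1e-10), 1 - 1e-10)  # guard against log(0)
        alpha = 0.5 * math.log((1 - err) / err)  # this learner's vote weight
        ensemble.append((alpha, t, polarity))
        # Misclassified points (y * prediction == -1) get higher weights.
        weights = [w * math.exp(-alpha * y * (polarity if x <= t else -polarity))
                   for w, x, y in zip(weights, xs, ys)]
        total = sum(weights)
        weights = [w / total for w in weights]
    return ensemble


def adaboost_predict(ensemble, x):
    """Strong learner: sign of the weighted sum of weak-learner votes."""
    score = sum(alpha * (polarity if x <= t else -polarity)
                for alpha, t, polarity in ensemble)
    return 1 if score >= 0 else -1


# Toy labels (made up) that no single stump can separate.
xs = list(range(10))
ys = [1, 1, 1, -1, -1, -1, 1, 1, 1, -1]


def training_errors(ensemble):
    return sum(adaboost_predict(ensemble, x) != y for x, y in zip(xs, ys))


one_round = training_errors(adaboost_fit(xs, ys, n_rounds=1))
three_rounds = training_errors(adaboost_fit(xs, ys, n_rounds=3))
print(one_round, three_rounds)
```

On this toy data a single stump misclassifies three points, while three boosted stumps classify every training point correctly, which illustrates how weak learners combine into a strong learner.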
Random Forests
• In this method, random samples of the data are
generated, multiple trees are constructed, and a
random subset of the inputs (predictors) is
evaluated for each tree.
• The final prediction is calculated by averaging
the predictions from all decision trees (for
classification, by majority vote).
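A random forest adds random predictor selection on top of bootstrap sampling. The sketch below assumes one-split stumps in place of full decision trees to stay short; the function names, the feature layout, and the toy data are assumptions for illustration only.

```python
import random
from statistics import mean


def fit_stump(rows, targets, features):
    """Best one-split regression stump, restricted to the given feature indices."""
    best = None
    for f in features:
        values = sorted({row[f] for row in rows})
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2
            left = [y for row, y in zip(rows, targets) if row[f] <= t]
            right = [y for row, y in zip(rows, targets) if row[f] > t]
            lm, rm = mean(left), mean(right)
            sse = (sum((y - lm) ** 2 for y in left)
                   + sum((y - rm) ** 2 for y in right))
            if best is None or sse < best[0]:
                best = (sse, f, t, lm, rm)
    if best is None:  # no usable split among the chosen features
        constant = mean(targets)
        return lambda row: constant
    _, f, t, lm, rm = best
    return lambda row: lm if row[f] <= t else rm


def random_forest_fit(rows, targets, n_trees=30, n_features=2, seed=0):
    rng = random.Random(seed)
    n, d = len(rows), len(rows[0])
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]   # random sample of rows
        feats = rng.sample(range(d), n_features)     # random subset of predictors
        forest.append(fit_stump([rows[i] for i in idx],
                                [targets[i] for i in idx], feats))
    return forest


def random_forest_predict(forest, row):
    """Average the predictions of all trees."""
    return mean(tree(row) for tree in forest)


# Toy data (made up): only feature 0 is informative, the other two are noise.
rows = [(x, x % 3, (7 * x) % 5) for x in range(12)]
targets = [0.0 if x < 6 else 10.0 for x in range(12)]
forest = random_forest_fit(rows, targets)
pred_low = random_forest_predict(forest, rows[0])
pred_high = random_forest_predict(forest, rows[11])
print(pred_low, pred_high)
```

Trees that happen to draw only the noise features predict poorly on their own, but averaging over the whole forest still separates the low and high targets.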
Text data analysis
• The processing and modeling of textual data to
gain useful business insights is called text data
analysis.
• An essential part of text data analysis is text
mining, which extracts high-quality information.
• This information is derived by finding relationships
and patterns in massive collections of text.
• It takes text as input, for example e-mail or
media data.
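As a very small illustration of finding patterns in a text collection, the sketch below counts the most frequent terms across a few messages. The stop-word list, the function names, and the sample e-mails are made up for the example; real text mining pipelines do considerably more.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption, not a standard list).
STOP_WORDS = {"the", "a", "is", "to", "and", "of", "in", "for"}


def tokenize(text):
    """Lowercase the text, split it into words, and drop stop words."""
    return [w for w in re.findall(r"[a-z']+", text.lower())
            if w not in STOP_WORDS]


def top_terms(documents, n=3):
    """Return the n most frequent terms across all documents."""
    counts = Counter()
    for doc in documents:
        counts.update(tokenize(doc))
    return counts.most_common(n)


# Hypothetical customer e-mails used as input text.
emails = [
    "Refund request: the refund for my order is late",
    "Order arrived damaged, please send a refund",
    "Thanks for the quick delivery of my order",
]
print(top_terms(emails, 2))
```

Here the terms "refund" and "order" each appear three times, which hints at what the messages are about without reading them one by one.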
Precautions against Fraudulent practices
Advanced analytics can be used to safeguard companies from
falling into the traps of imposters and fraudsters. Financial
organizations usually take the following precautions.
• Record customer information such as contact number, password,
username, and email.
• Determine the IP addresses of customers from the mails sent or
received by them.
• Identify whether the email id is fake or real.
• Match the customer's IP address, contact number, email, and
address, and determine whether all of this information points to
the same place or to different places.
• Search the given contact number and check whether it is
reported for any kind of abuse or scam on the internet.
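The matching step in these precautions can be sketched as a simple consistency check. Everything here is hypothetical: the record fields, the country codes, and the flag messages are assumptions, and a real system would derive them from IP geolocation, phone number lookups, and email verification services.

```python
from dataclasses import dataclass


@dataclass
class CustomerRecord:
    # Hypothetical fields; how they are obtained is outside this sketch.
    ip_country: str
    phone_country: str
    address_country: str
    email_verified: bool
    phone_reported_for_abuse: bool


def fraud_flags(record):
    """Return a list of human-readable warnings for one customer record."""
    flags = []
    places = {record.ip_country, record.phone_country, record.address_country}
    if len(places) > 1:
        flags.append("location mismatch across IP, phone, and address")
    if not record.email_verified:
        flags.append("email address could not be verified")
    if record.phone_reported_for_abuse:
        flags.append("contact number reported for abuse or scam")
    return flags


consistent = CustomerRecord("IN", "IN", "IN", True, False)
suspicious = CustomerRecord("IN", "US", "IN", False, True)
print(fraud_flags(consistent))
print(fraud_flags(suspicious))
```

A record with no flags is not proof of legitimacy; the flags only mark records that deserve a closer look.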
History of analytical tools
• In the late 1980s, Job Control Language (JCL),
a scripting language, was used for analytics.
• By the late 1990s, all commercial analytical
tools offered GUIs.
• Later, data visualization tools were
introduced.
GUI
• Generates predefined code to perform a
particular task.
• Helps analytic professionals to focus on
analysis methods rather than on writing code.
• The generated code is largely free of the errors
and bugs that hand-written code can contain.
Analytic point solutions
• Refers to software packages that solve a specific
group of problems.
• Ex: price optimization applications, fraud
detection and demand forecasting applications.
• They are typically built on tool suites such as SAS.
• Implementing a point solution instead of
creating a custom solution for each problem
can help organizations save money, effort,
and time.
Data visualization tools
• Results are displayed in a user-friendly
manner for easy understanding.
• Presents data in the form of charts, graphs and
tables.
• Advanced visualization tools analyze and present
data in new ways.
• Tools such as Tableau, QlikView, JMP, Advizor,
and Spotfire have enhanced graphics.
• They also allow users to link multiple tabs of graphs
and charts to each other and to the underlying data.
Popular analytical tools
• Some open source analytical tools are as follows:
1. GridGain
2. HPCC
3. Storm
• Popular tools are as follows:
1. R project for statistical computing
2. IBM SPSS
3. SAS
R project for statistical computing
• It is a free, open-source package that is widely
used in academic, research, and development
environments.
• Features of R:
1. R is object-oriented.
2. It can be embedded in different applications.
3. It can be called from commercial analytical
tools.
4. It is an extensible language.
• Limitations of R:
1. Lack of scalability.
2. Data must fit in memory, which limits the size
of the datasets it can handle.
3. Programming in R is a fairly difficult process.
IBM SPSS
• SPSS (Statistical Package for the Social
Sciences) was introduced in 1968; after IBM
acquired it in 2009, it was renamed IBM SPSS.
• Its functionality can be accessed either through
a proprietary 4GL known as the syntax
language or through a menu-driven graphical
interface.
• Very user-friendly and quite simple to use.
Features of IBM SPSS
• SPSS commands are executed one line at a time
to update tables and add results to the output
editor window.
• Can also log executed syntax, together with its
time of execution, in the window.
• Can read data from and write to ASCII files,
databases, and tables of other statistical software.
• Provides basic data management functions, such
as sorting, aggregation, and table merge.
• Can send output directly to a file.
• The file can be in .txt, .html, or .xml format.
• The Output Management System (OMS) helps
store outputs in a single file by creating a
loop with a macro.
• IBM SPSS Statistics can be installed on
different platforms, such as Windows, Mac OS
X, and Unix.
SAS
• SAS (Statistical Analysis System) is an
information delivery system: an integrated,
hardware-independent computing package.
• Based on a 4GL programming language.
• Provides well-organized and timely information
delivery.
• SAS products, commonly known as modules,
are mostly used by social and behavioral
scientists.
Features of SAS
• Statistics
• Data and text mining
• Data visualization
• Forecasting
• Optimization
• Model management and deployment
• Quality improvement
Comparing various analytical tools
R installation
• https://www.javatpoint.com/r-installation
• https://www.javatpoint.com/rstudio-ide