

Module 1: Big Data and Analytics

Syllabus
Example Applications, Basic Nomenclature, Analytics Process Model, Analytical Model Requirements, Types of Data Sources, Sampling, Types of Data Elements, Data Exploration, Exploratory Statistical Analysis, Missing Values, Outlier Detection and Treatment, Standardizing Data Labels, Categorization

Big data is the term for a collection of data sets so large and complex that it becomes difficult
to process using on-hand database management tools or traditional data processing
applications.

Any piece of information can be considered data. Data comes in various forms and sizes, ranging from small data to very big data. Let us look at the classification of this data:

• Any data that can reside in RAM (memory) is considered small data. Small data is less than tens of gigabytes.
• Any data that can reside on a hard disk is considered medium data. Medium data is in the range of tens to thousands of gigabytes.
• Any data that cannot reside on a hard disk, or in a single system, is considered big data. Its size runs into terabytes, petabytes, and even zettabytes.

This data is so huge that it is very difficult to handle with traditional/conventional relational database approaches, which is why big data technologies evolved in the industry.

Big data is characterized in terms of the following Vs:

Volume – Data sizes have grown to terabytes and beyond in the form of records or transactions. Volume is how much data we have: what used to be measured in gigabytes is now measured in zettabytes (ZB) or even yottabytes (YB). The Internet of Things (IoT) is creating exponential growth in data.
Velocity – the speed at which data arrives and becomes accessible. It means near or real-time assimilation of data coming in huge volumes.
Variety – There is a huge variety of data based on internal, external, behavioral, and/or social types. Data can be structured, semi-structured, or unstructured. Variety describes one of the biggest challenges of big data: it can be unstructured and can include many different types of data, from XML to video to SMS. Organizing the data in a meaningful way is no simple task, especially when the data itself changes rapidly.
Veracity - is all about making sure the data is accurate, which requires processes to
keep the bad data from accumulating in your systems. The simplest example is contacts
that enter your marketing automation system with false names and inaccurate contact
information. How many times have you seen Mickey Mouse in your database? It’s the
classic “garbage in, garbage out” challenge.
Value - is the end game. After addressing volume, velocity, variety, variability,
veracity, and visualization – which takes a lot of time, effort and resources – you want
to be sure your organization is getting value from the data.

Figure 1.1 Characteristics of Big Data


1.1 Example Applications

a) Big data in Banking


With large amounts of data streaming in from countless sources, banks have to find unique and innovative ways to manage big data. It is important to analyse customers' needs and serve them according to their requirements, while minimizing risk and fraud and maintaining regulatory compliance. Big data enables financial institutions to go one step further with advanced analytics.
b) Big data in Government

When government agencies harness and apply analytics to their big data, they improve significantly at managing utilities, running agencies, dealing with traffic congestion, and preventing crime. Alongside these advantages, however, governments must also address issues of transparency and privacy.
c) Big data in Health Care
In health care, patient records, treatment plans, prescription information, and so on must all be handled quickly and accurately, and with enough transparency to satisfy stringent industry regulations. Effective management of this data uncovers hidden insights that improve patient care.

d) Big data in Manufacturing


Manufacturers can improve quality and output while minimizing waste, processes that are key factors in today's highly competitive market. Several manufacturers are using analytics to solve problems faster and make more agile business decisions.
e) Big data in Retail
Maintaining customer relationships is the biggest challenge in the retail industry, and the best way to manage it is to manage big data. Retailers need unique marketing ideas to sell their products to customers, the most effective way to handle transactions, and improved tactics that use big data innovatively to improve their business.
f) Big data in Finance sector

Financial services firms have widely adopted big data analytics to inform better investment decisions with consistent returns. For financial services, the big data pendulum has swung from passing fad to large-scale deployments.

g) Big data in Telecom

A recent report found that the use of data analytics tools in the telecom sector is expected to grow at a compound annual growth rate of 28% over the next four years.

h) Big data in retail sector

Retailers harness big data to offer consumers personalized shopping experiences, for example by analyzing how a customer came to make a purchase, or the path to purchase. 66% of retailers have made financial gains in customer relationship management through big data.

i) Big Data in tourism

Big data is transforming the global tourism industry. People know more about the world
than ever before. People have much more detailed itineraries these days with the help of
Big data.

j) Big data in Airlines

Big Data and Analytics give wings to the Aviation Industry. An airline now knows where
a plane is headed, where a passenger is sitting, and what a passenger is viewing on the IFE
or connectivity system.

k) Big data in Social Media


Big data is a driving factor behind every marketing decision made by social media
companies and it is driving personalization to the extreme.

To summarize, the relevance, importance, and impact of analytics are now bigger than ever
before and, given that more and more data are being collected and that there is strategic value
in knowing what is hidden in data, analytics will continue to grow. Without claiming to be
exhaustive, Table 1.1 presents some examples of how analytics is applied in various settings.

1.2 Basic Nomenclature

• In order to start doing analytics, some basic vocabulary needs to be defined. A first
important concept here concerns the basic unit of analysis.

• Customers can be considered from various perspectives. Customer Lifetime Value (CLV), for example, can be measured for either individual customers or at the household level.

• Another alternative is to look at account behavior. For example, consider a credit scoring exercise for which the aim is to predict whether the applicant will default on a particular mortgage loan account.

• The analysis can also be done at the transaction level. For example, in insurance fraud detection, one usually performs the analysis at the insurance claim level.

• Also, in web analytics, the basic unit of analysis is usually a web visit or session.

It is also important to note that customers can play different roles. For example, parents can
buy goods for their kids, such that there is a clear distinction between the payer and the end
user. In a banking setting, a customer can be primary account owner, secondary account owner,
main debtor of the credit, codebtor, guarantor, and so on. It is very important to clearly
distinguish between those different roles when defining and/or aggregating data for the
analytics exercise.

1.3 The Analytical Process Model

Figure 1.2 gives a high-level overview of the analytics process model. The first step, identifying the source data, is very important, as data is the key ingredient of any analytical process; the selection of data has a deterministic impact on the analytical models built from it.

The second step is data collection. All data are gathered in a staging area, which could be a data mart or data warehouse. Some basic exploratory analysis can be considered here using, for example, online analytical processing (OLAP) facilities for multidimensional data analysis (e.g., roll-up, drill-down, slicing and dicing). The third step is data cleaning, which gets rid of all inconsistencies such as missing values, outliers, and duplicate data. In the fourth step, additional transformations may be considered, such as binning, alphanumeric-to-numeric coding, geographical aggregation, and so forth. The fifth step is the analytics step, in which an analytical model is estimated on the preprocessed and transformed data. Different types of analytics can be considered here (e.g., churn prediction, fraud detection, customer segmentation, market basket analysis).

Finally, once the model has been built, it will be interpreted and evaluated by the business
experts. Usually, many trivial patterns will be detected by the model. For example, in a market
basket analysis setting, one may find that spaghetti and spaghetti sauce are often purchased
together. These patterns are interesting because they provide some validation of the model. But
of course, the key issue here is to find the unexpected yet interesting and actionable patterns
(sometimes also referred to as knowledge diamonds) that can provide added value in the
business setting. Once the analytical model has been appropriately validated and approved, it
can be put into production as an analytics application (e.g., decision support system, scoring
engine).
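As a rough illustration only, the steps above can be strung together in a short Python sketch. The file name (customers.csv) and the columns (age, income, churn) are hypothetical placeholders used to make the steps concrete, not part of the module material.

```python
# A minimal, illustrative sketch of the analytics process model.
# "customers.csv" and the columns "age", "income", "churn" are hypothetical.
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Steps 1-2: select the source data and collect it into a staging area
# (here simply a DataFrame).
df = pd.read_csv("customers.csv")

# Step 3: data cleaning - remove duplicates and rows with missing values.
df = df.drop_duplicates().dropna(subset=["age", "income", "churn"])

# Step 4: a simple transformation - bin age into coarse numeric categories.
df["age_bin"] = pd.cut(df["age"], bins=[0, 30, 50, 120]).cat.codes

# Step 5: analytics - estimate a model on the preprocessed, transformed data.
X, y = df[["age_bin", "income"]], df["churn"]
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=42)
model = LogisticRegression().fit(X_train, y_train)

# Interpretation, evaluation by business experts, and deployment (e.g., as a
# scoring engine) would follow this step.
print("Test accuracy:", model.score(X_test, y_test))
```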

1.4 Analytical Model Requirements

Analytics is a term that is often used interchangeably with data science, data mining, knowledge
discovery, and others. The distinction between all those is not clear cut. All of these terms
essentially refer to extracting useful business patterns or mathematical decision models from a
preprocessed data set.
A good analytical model should satisfy several requirements, depending on the application
area.
– Business relevance
– Statistical performance (significance and predictive power)
– Interpretability
– Justifiability
– Operational efficiency
– Economic cost
– Regulation and legislation
: Module – 1: Big Data and Analytics

A first critical success factor is business relevance. The analytical model should actually solve
the business problem for which it was developed. It makes no sense to have a working
analytical model that got sidetracked from the original problem statement. In order to achieve
business relevance, it is of key importance that the business problem to be solved is
appropriately defined, qualified, and agreed upon by all parties involved at the outset of the
analysis.

A second criterion is statistical performance. The model should have statistical significance
and predictive power. How this can be measured will depend upon the type of analytics
considered. For example, in a classification setting (churn, fraud), the model should have good
discrimination power. In a clustering setting, the clusters should be as homogenous as possible.

Interpretability refers to understanding the patterns that the analytical model captures. This
aspect has a certain degree of subjectivism, since interpretability may depend on the business
user’s knowledge. In many settings, however, it is considered to be a key requirement. For
example, in credit risk modeling or medical diagnosis, interpretable models are absolutely
needed to get good insight into the underlying data patterns. In other settings, such as response
modeling and fraud detection, having interpretable models may be less of an issue.

Justifiability refers to the degree to which a model corresponds to prior business knowledge
and intuition. For example, a model stating that a higher debt ratio results in more creditworthy
clients may be interpretable, but is not justifiable because it contradicts basic financial intuition.
Note that both interpretability and justifiability often need to be balanced against statistical
performance. Often one will observe that high performing analytical models are
incomprehensible and black box in nature.
A popular example of this is neural networks, which are universal approximators and are high
performing, but offer no insight into the underlying patterns in the data. On the contrary, linear
regression models are very transparent and comprehensible, but offer only limited modeling
power.

Analytical models should also be operationally efficient. This refers to the efforts needed to collect the data, preprocess it, evaluate the model, and feed its outputs to the business application (e.g., campaign management, capital calculation). Especially in a real-time online scoring environment (e.g., fraud detection), this may be a crucial characteristic. Operational efficiency also entails the efforts needed to monitor and backtest the model, and to re-estimate it when necessary.

Another key attention point is the economic cost needed to set up the analytical model. This
includes the costs to gather and preprocess the data, the costs to analyze the data, and the costs
to put the resulting analytical models into production. In addition, the software costs and human
and computing resources should be taken into account here. It is important to do a thorough
cost–benefit analysis at the start of the project.

Finally, analytical models should also comply with both local and international regulation and legislation. For example, in a credit risk setting, the Basel II and Basel III Capital Accords have been introduced to appropriately identify the types of data that can or cannot be used to build credit risk models.

In an insurance setting, the Solvency II Accord plays a similar role. Given the importance of analytics nowadays, more and more regulation is being introduced relating to the development and use of analytical models. In addition, in the context of privacy, many new regulatory developments are taking place at various levels. A popular example here concerns the use of cookies in a web analytics context.

1.5 Types of Data Sources

The more data available to start off the analysis, the better. Data can originate from a variety of different sources, as follows:

• Transactional data

• Unstructured data

• Qualitative/Expert-based data

• Data poolers

• Publicly available data

Transactional Data: Transactions are the first important source of data. Transactional data
consist of structured, low level, detailed information capturing the key characteristics of a
customer transaction (e.g., purchase, claim, cash transfer, credit card payment). This type of
data is usually stored in massive online transaction processing (OLTP) relational databases. It
can also be summarized over longer time horizons by aggregating it into averages,
absolute/relative trends, maximum/minimum values, and so on.

Unstructured data: Data embedded in text documents (e.g., emails, web pages, claim forms) or in multimedia content can also be interesting to analyse. However, these sources typically require extensive preprocessing before they can be successfully included in an analytical exercise.

Qualitative/Expert-based data: Another important source of data is qualitative, expert-based data. An expert is a person with a substantial amount of subject matter expertise within a particular setting (e.g., credit portfolio manager, brand manager). The expertise stems from both common sense and business experience, and it is important to elicit as much expertise as possible before the analytics is run. This will steer the modelling in the right direction and allow you to interpret the analytical results from the right perspective. A popular example of applying expert-based validation is checking the univariate signs of a regression model: for example, one would expect a priori that higher debt has an adverse impact on creditworthiness.

Data poolers: Nowadays, data poolers are becoming more and more important in the industry.
Popular examples are Dun & Bradstreet, Bureau Van Dijck, and Thomson Reuters. The core
business of these companies is to gather data in a particular setting (e.g., credit risk, marketing),
build models with it, and sell the output of these models (e.g., scores), possibly together with
the underlying raw data, to interested customers. A popular example of this in the United States
is the FICO score, which is a credit score ranging between 300 and 850 that is provided by the
three most important credit bureaus: Experian, Equifax, and TransUnion. Many financial
institutions use these FICO scores either as their final internal model, or as a benchmark against
an internally developed credit scorecard to better understand the weaknesses of the latter.

Publicly available data: Finally, plenty of publicly available data can be included in the
analytical exercise. A first important example is macroeconomic data about gross domestic
product (GDP), inflation, unemployment, and so on. By including this type of data in an
analytical model, it will become possible to see how the model varies with the state of the
economy. This is especially relevant in a credit risk setting, where typically all models need to
be thoroughly stress tested. In addition, social media data from Facebook, Twitter, and others
can be an important source of information. However, one needs to be careful here and make
sure that all data gathering respects both local and international privacy regulations.

1.6 Sampling

Sampling means taking a subset of historical customer data and using it to build an analytical model.
• The sample should be taken from an average business period to get a picture of the target population that is as accurate as possible.
• With the availability of high-performance computing facilities (e.g., grid computing and cloud computing), one could also directly analyze the full data set.
• A key requirement for a good sample is that it should be representative of the future customers on which the analytical model will be run.
• The timing aspect becomes important because the customers of today are more similar to the customers of tomorrow than to the customers of yesterday.
• Choosing the optimal time window for the sample involves a trade-off between lots of data (and hence a more robust model) and recent data (which may be more representative).

Example: credit scoring. Assume one wants to build an application scorecard to score mortgage applications (a mortgage is a legal agreement that allows you to borrow money from a bank or similar organization, especially in order to buy a house, or the amount of money itself). The future population then consists of the through-the-door (TTD) population, so one needs a subset of the historical TTD population to build the analytical model. The customers that were accepted under the old acceptance policy and those that were rejected are shown in Figure 1.3. When building a sample, one can only make use of those that were accepted, which clearly implies a bias.

Figure 1.3: The reject inference problem in credit scoring


In stratified sampling, a sample is taken according to predefined strata. Consider, for example, a churn prediction or fraud detection context in which data sets are typically very skewed. When stratifying according to the target churn indicator, the sample will contain exactly the same percentage of churners and non-churners as in the original data.
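A minimal sketch of stratified sampling, assuming a synthetic, skewed churn data set and using scikit-learn's stratify option:

```python
# Stratified sampling on a skewed churn data set; the data are synthetic.
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "income": rng.normal(1500, 300, size=1000),
    "churn": rng.binomial(1, 0.05, size=1000),   # skewed target: roughly 5% churners
})

# stratify=df["churn"] keeps the churner percentage the same in both samples.
train, test = train_test_split(df, test_size=0.3, stratify=df["churn"], random_state=1)
print("churn rate in train:", train["churn"].mean())
print("churn rate in test: ", test["churn"].mean())
```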

1.7 Types of Data Elements

It is important to appropriately consider the different types of data elements at the start of the
analysis. The following types of data elements can be considered:

• Continuous
• Categorical (nominal, ordinal, binary)


Continuous: data elements that are defined on an interval, which can be limited or unlimited. Examples include income, sales, and RFM (recency, frequency, monetary) values.
Categorical: categorical data elements are further differentiated as follows:
Nominal: data elements that can only take on a limited set of values with no meaningful ordering between them. Examples: marital status, profession, purpose of loan.
Ordinal: data elements that can only take on a limited set of values with a meaningful ordering between them. Examples: credit rating; age coded as young, middle aged, and old.
Binary: data elements that can only take on two values. Examples: gender, employment status.

Appropriately distinguishing between these different data elements is of key importance when importing the data into an analytics tool at the start of the analysis. For example, if marital status were incorrectly specified as a continuous data element, the software would calculate its mean, standard deviation, and so on, which obviously has no meaning.
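As an illustrative sketch (the customer table below is hypothetical), the distinction between continuous, nominal, ordinal, and binary elements can be made explicit when importing data into pandas:

```python
# Declaring data element types explicitly in pandas, on a hypothetical table.
import pandas as pd

df = pd.DataFrame({
    "income": [1000.0, 1200.0, 2000.0],                 # continuous
    "marital_status": ["single", "married", "single"],  # nominal
    "credit_rating": ["low", "high", "medium"],         # ordinal
    "employed": [1, 0, 1],                              # binary
})

# Nominal: unordered category - computing a mean would be meaningless.
df["marital_status"] = df["marital_status"].astype("category")

# Ordinal: ordered category with an explicit, meaningful ordering.
df["credit_rating"] = pd.Categorical(
    df["credit_rating"], categories=["low", "medium", "high"], ordered=True
)

print(df.dtypes)
print(df["income"].mean())   # a mean is meaningful only for the continuous element
```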

1.8 Visual Data Exploration and Exploratory Statistical Analysis

Visual data exploration is a way of getting to know your data in an informal way. It allows you to gain some initial insights into the data, which can then be usefully adopted throughout the modeling. Different plots and graphs can be useful here; a small code sketch follows the list below.
Examples:
Pie charts
Bar charts
Histograms and scatter plots
• A pie chart represents a variable’s distribution as a pie.
• Bar charts represent the frequency of each of the values (either
absolute or relative) as bars.
• A histogram provides an easy way to visualize the central tendency and to determine the variability or spread of the data. It also allows us to contrast the observed data with standard known distributions.
• A scatter plot allows us to visualize one variable against another to see whether there are any correlation patterns in the data.
• A word cloud is an image composed of words used in a particular text or subject, in which the size of each word indicates its frequency or importance; it is used to visualize data occurrence.

Figure 1.4: Data visualization through graphs
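A small sketch of two of these plots (a histogram and a scatter plot) with matplotlib, on hypothetical age and income data:

```python
# Minimal visual data exploration with matplotlib; the data are synthetic.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
age = rng.normal(40, 12, size=500)
income = 800 + 30 * age + rng.normal(0, 200, size=500)

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))

# Histogram: central tendency and spread of a single variable.
ax1.hist(age, bins=20)
ax1.set_title("Age distribution")

# Scatter plot: one variable against another to spot correlation patterns.
ax2.scatter(age, income, s=8)
ax2.set_title("Age vs. income")
ax2.set_xlabel("age")
ax2.set_ylabel("income")

plt.tight_layout()
plt.show()
```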


1.9 Missing Values

• Missing values can occur for various reasons.


• The information can be nonapplicable. For example, when modeling time of churn,
this information is only available for the churners and not for the non-churners
because it is not applicable there.
• The information can also be undisclosed. For example, a customer decided not to
disclose his or her income because of privacy.
• Missing data can also originate because of an error during merging.
Table 1.2: Dealing with missing values

• Some analytical techniques (e.g., decision trees) can directly deal with missing values.
• Other techniques need some additional preprocessing. The following are the most
popular schemes to deal with missing values:
– Replace (impute). This implies replacing the missing value with a known
value.
– Delete. This is the most straightforward option and consists of deleting
observations or variables with lots of missing values.
– Keep. Missing values can be meaningful (e.g., a customer did not disclose his
or her income because he or she is currently unemployed).

As a practical way of working, one can first statistically test whether the missing information is related to the target variable. If yes, then adopt the keep strategy and make a special category for it. If not, then, depending on the number of observations available, decide to either delete or impute.
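A short sketch of the three schemes on a hypothetical data set with missing income values, using pandas:

```python
# Replace / delete / keep strategies for missing values; the data are hypothetical.
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "age": [25, 40, 35, 52],
    "income": [1200.0, np.nan, 1800.0, np.nan],
})

# Replace (impute): fill missing income with the median of the known values.
imputed = df.assign(income=df["income"].fillna(df["income"].median()))

# Delete: drop observations that have a missing income.
deleted = df.dropna(subset=["income"])

# Keep: missingness itself may be informative, so keep it as an explicit flag.
kept = df.assign(income_missing=df["income"].isna().astype(int))

print(imputed, deleted, kept, sep="\n\n")
```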

1.10 Outlier Detection and Treatment

Outliers are extreme observations that are very dissimilar to the rest of the population. Two types of outliers can be considered:

• Valid observations (e.g., the salary of the boss is $1 million)

• Invalid observations (e.g., an age of 300 years)

Both are univariate outliers in the sense that they are outlying on one dimension. However, outliers can be hidden in unidimensional views of the data; multivariate outliers are observations that are outlying in multiple dimensions.
Two important steps in dealing with outliers are detection and treatment.

A first check for outliers is to calculate the minimum and maximum values of each data element. Next, various graphical and statistical procedures can be used to detect outliers.

Figure 1.5 presents an example of a distribution for age whereby the circled areas clearly represent outliers.

Figure 1.5: Outlier detection by graphical method


Another visual mechanism is the box plot. A box plot represents three key quartiles of the data:

• First quartile (25% of the observations have a lower value)

• Median (50% of the observations have a lower value)

• Third quartile (75% of the observations have a lower value)

Figure 1.6: Box plots for outlier detection


All three quartiles are represented as a box. The minimum and maximum values are then also added, unless they are too far away from the edges of the box; too far away means more than 1.5 × the interquartile range (IQR = Q3 − Q1). Figure 1.6 gives an example of a box plot in which three outliers can be seen.
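As a rough illustration on hypothetical age values, matplotlib's boxplot draws points that lie more than 1.5 × IQR beyond the box as individual outliers:

```python
# A minimal box-plot sketch; the age values are hypothetical and include extremes.
import matplotlib.pyplot as plt

ages = [22, 25, 27, 30, 31, 33, 35, 36, 38, 40, 42, 45, 90, 95, 300]

plt.boxplot(ages, vert=False)   # points beyond 1.5*IQR are drawn as outliers
plt.xlabel("age")
plt.title("Box plot for outlier detection")
plt.show()
```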

Another way to identify outliers is to calculate z-scores, measuring how many standard
deviations an observation lies away from the mean as follows:
z_i = (x_i − μ) / σ      (1.1)


Equation 1.1 gives the formula for the z-score, where x_i is the observation, μ represents the mean of the observations, and σ represents the standard deviation. Table 1.3 gives an example of a z-score calculation.
Table 1.3: z-scores for Outlier Detection

A practical rule of thumb then defines outliers as those observations for which the absolute value of the z-score |z| is bigger than 3. Note that the z-score relies on the normal distribution.
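A brief sketch of the |z| > 3 rule on synthetic income values (the data and the injected extreme value are hypothetical):

```python
# z-score based outlier detection on synthetic income data.
import numpy as np

rng = np.random.default_rng(0)
income = rng.normal(1500, 300, size=200)
income[10] = 25_000          # inject one clearly invalid observation

z = (income - income.mean()) / income.std()

# Rule of thumb: flag observations with |z| > 3 as outliers.
outliers = income[np.abs(z) > 3]
print("Outlier values:", outliers)
```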
Multivariate outliers can be detected by fitting regression lines and inspecting the observations with large errors (e.g., using a residual plot). An alternative method is clustering. Some analytical techniques, such as decision trees and neural networks, are fairly robust with respect to outliers. Various schemes exist to deal with outliers; the choice highly depends on whether the outlier represents a valid or an invalid observation.

A popular treatment scheme is truncation, also called capping or winsorizing. One hereby imposes a lower limit and an upper limit on a variable, and any values below/above these limits are brought back to the limits. The limits can be calculated using the z-scores or the IQR (which is more robust than the z-scores).

Calculating the upper and lower limits using z-scores:


Upper limit = μ + 3σ
Lower limit = μ - 3σ

Figure 1.7: Using the z-scores for Truncation


Another way of calculating the upper and lower limits uses the box plot IQR:
Upper limit = M + 3s
Lower limit = M − 3s
where M is the median and s = IQR/(2 × 0.6745) is a robust estimate of the standard deviation.
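A minimal truncation sketch using the z-score limits above, applied to the same kind of synthetic income data:

```python
# Truncation (winsorizing) with z-score limits; the income data are synthetic.
import numpy as np

rng = np.random.default_rng(0)
income = rng.normal(1500, 300, size=200)
income[10] = 25_000

mu, sigma = income.mean(), income.std()
lower, upper = mu - 3 * sigma, mu + 3 * sigma   # limits from the z-score rule

# Values beyond the limits are brought back to the limits.
income_capped = np.clip(income, lower, upper)
print(income.max(), "->", income_capped.max())
```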

1.11 Standardizing Data Labels


Standardizing data is a data preprocessing activity aimed at scaling variables to a similar range. Consider, for example, two variables: gender (coded as 0/1) and income (ranging between $0 and $1 million). A standardization procedure that could be adopted here is:


Min/max standardization
X_new = (X_old − min(X_old)) / (max(X_old) − min(X_old)) × (newmax − newmin) + newmin      (1.2)

where newmax and newmin are the newly imposed maximum and minimum values (e.g., 0 and 1).
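A tiny sketch of Equation 1.2 in numpy, rescaling a hypothetical income variable to the range [0, 1]:

```python
# Min/max standardization (Equation 1.2) on hypothetical income values.
import numpy as np

income = np.array([0.0, 250_000.0, 500_000.0, 1_000_000.0])
new_min, new_max = 0.0, 1.0

income_scaled = (income - income.min()) / (income.max() - income.min()) \
                * (new_max - new_min) + new_min
print(income_scaled)   # [0.   0.25 0.5  1.  ]
```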

1.12 Categorization

Categorization (also known as coarse classification, classing, grouping, or binning) can be done for many reasons. For categorical variables, it is needed to reduce the number of categories: with categorization, one creates categories of values such that fewer parameters have to be estimated and a more robust model is obtained. For continuous variables, categorization may also be very beneficial. Example: the age variable and its risk as depicted in Figure 1.8; clearly, there is a relation between risk and age.

Figure 1.8: Default Risk versus Age

Two basic methods are:

• Equal interval binning

• Equal frequency binning

Example: income values 1,000, 1,200, 1,300, 2,000, 1,800 and 1,400.

Equal interval binning: this creates two bins with the same range - Bin 1: [1,000, 1,500] and Bin 2: [1,500, 2,000].

Equal frequency binning: this creates two bins with the same number of observations - Bin 1: {1,000, 1,200, 1,300} and Bin 2: {1,400, 1,800, 2,000}.

However, both methods are quite basic and do not take into account a target variable (e.g., churn, fraud, credit risk).
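A short sketch of both binning methods on the income values from the example above, using pandas (pd.cut for equal interval, pd.qcut for equal frequency):

```python
# Equal interval vs. equal frequency binning on the example income values.
import pandas as pd

income = pd.Series([1000, 1200, 1300, 2000, 1800, 1400])

# Equal interval binning: two bins with the same value range.
equal_interval = pd.cut(income, bins=2)

# Equal frequency binning: two bins with the same number of observations.
equal_frequency = pd.qcut(income, q=2)

print(pd.DataFrame({"income": income,
                    "equal_interval": equal_interval,
                    "equal_frequency": equal_frequency}))
```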

Chi-squared analysis is a more sophisticated way to do coarse classification. Table 1.4 shows an example of coarse classifying a residential status variable.

Table 1.4: Coarse Classifying the Residential Status Variable
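Since Table 1.4 is not reproduced here, the following sketch only illustrates the general idea with hypothetical good/bad counts: candidate groupings of residential status can be compared by their chi-squared statistics using scipy.

```python
# Comparing candidate coarse classifications of residential status with a
# chi-squared test of independence; all counts below are hypothetical.
from scipy.stats import chi2_contingency

# Rows: candidate categories after grouping; columns: [goods, bads].
option_1 = [[600, 50],    # owner
            [300, 40],    # renter
            [100, 10]]    # other (with parents, ...)

option_2 = [[600, 50],    # owner
            [400, 50]]    # renter + other merged

for name, table in [("option 1", option_1), ("option 2", option_2)]:
    chi2, p, dof, _ = chi2_contingency(table)
    print(f"{name}: chi2 = {chi2:.2f}, dof = {dof}, p = {p:.4f}")

# The grouping with the larger chi-squared value (relative to its degrees of
# freedom) retains the stronger association with good/bad status.
```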

