Business Analytics Viva Questions
Interpretation of p-value:
• p < 0.05 → The variable is statistically significant and likely impacts the dependent variable.
• p > 0.05 → The variable is not statistically significant, meaning its effect may be due to chance.
Que 12) How to check the impact of independent variables on dependent variables?
Ans 12) To check the impact of independent variables on a dependent variable, various statistical techniques
are used in regression analysis. The regression coefficients (β) indicate the direction and strength of the
relationship, with larger absolute values suggesting a stronger effect. The t-test and p-value help determine
statistical significance, where a p-value below 0.05 indicates that the variable has a meaningful impact. The
R² and adjusted R² values show how much of the dependent variable’s variance is explained by the independent
variables, while the F-test assesses the overall model significance. To ensure accuracy, checking for
multicollinearity using the Variance Inflation Factor (VIF) is also essential.
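A minimal sketch of how this is typically checked in R, using the built-in mtcars dataset purely for illustration:
    model <- lm(mpg ~ wt + hp, data = mtcars)  # mtcars is a built-in example dataset, used here for illustration
    summary(model)                             # coefficients, t-values, p-values, R², adjusted R², F-test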
Que 13) What is adjusted R²?
Ans 13) Adjusted R² is a version of R² that adjusts for the number of predictors in a regression model. R² tells
us how well the model explains the variation in the dependent variable (higher is better). However, R² always
increases when more predictors are added, even if they do not actually improve the model. Adjusted R² corrects
for this by penalising models with unnecessary predictors.
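The adjustment can be written as Adjusted R² = 1 − (1 − R²) × (n − 1) / (n − k − 1), where n is the number of observations and k the number of predictors. A small illustrative sketch in R (again using the built-in mtcars data):
    model <- lm(mpg ~ wt + hp, data = mtcars)  # illustrative example
    summary(model)$r.squared                   # R²
    summary(model)$adj.r.squared               # adjusted R², slightly lower because it penalises extra predictors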
Que 14) What do you mean by coefficient in the regression table?
Ans 14) In a regression table, the coefficient is represented by the symbol β (beta) and indicates the effect
of an independent variable on the dependent variable. It shows how much the dependent variable changes
when the independent variable increases by one unit while keeping other variables constant. A positive
coefficient means a direct relationship, while a negative coefficient suggests an inverse relationship between
the variables.
Que 15) What is a confidence interval?
Ans 15) A confidence interval (CI) is a range of values that estimates the true population parameter with a
certain level of confidence. In regression analysis, it is used to indicate the range in which the true coefficient
of an independent variable is likely to fall. It is calculated using the estimated coefficient, its standard error,
and a critical value from the t-distribution or z-distribution.
A 95% confidence interval means that if the study were repeated many times, 95% of the intervals
would contain the true parameter value. A narrower confidence interval suggests higher precision, while a
wider interval indicates greater uncertainty. If a confidence interval includes zero, the variable may not be
statistically significant. Confidence intervals are crucial in statistical analysis, research, and predictive
modeling for understanding the reliability of estimates.
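In R, confidence intervals for regression coefficients can be obtained directly with confint(); a brief sketch using the built-in mtcars data for illustration:
    model <- lm(mpg ~ wt, data = mtcars)  # illustrative example
    confint(model, level = 0.95)          # 95% confidence interval for each coefficient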
Que 16) What is a prediction interval?
Ans 16) A prediction interval (PI) is a range that estimates where a future individual observation of the
dependent variable is likely to fall, given specific values of the independent variables. Unlike a confidence
interval, which estimates the mean of the dependent variable, a prediction interval accounts for both sampling
variability and individual variability, making it wider than a confidence interval. It is typically denoted as PI
and calculated using the predicted value, the standard error of the regression, and the variance of individual
observations. A 95% prediction interval means that approximately 95% of future observations are expected to fall within this
range.
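A short sketch of the distinction in R, where predict() can return either interval (mtcars and the chosen weight value are purely illustrative):
    model <- lm(mpg ~ wt, data = mtcars)
    new_obs <- data.frame(wt = 3.0)                             # hypothetical new observation
    predict(model, newdata = new_obs, interval = "confidence")  # interval for the mean response
    predict(model, newdata = new_obs, interval = "prediction")  # wider interval for a single new observation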
Que 17) What is text mining?
Ans 17) Text mining is the process of extracting meaningful information, patterns, and insights from
unstructured text data using techniques from natural language processing (NLP), machine learning, and
statistical analysis. It involves steps like text preprocessing (removing stopwords, stemming, and
tokenization), feature extraction (converting text into numerical data), and pattern recognition (such as
sentiment analysis, topic modeling, and named entity recognition). Text mining is widely used in business
analytics, social media monitoring, customer feedback analysis, and fraud detection to derive actionable
insights from large volumes of text data.
Que 18) Which commands do we use for textual analysis?
Ans 18) In R, textual analysis uses packages like tm. To begin, a text corpus is created using the Corpus()
function from the tm package, allowing text data to be processed. Common preprocessing steps include
converting text to lowercase via tm_map() with content_transformer(tolower), removing punctuation with
removePunctuation(), and eliminating stopwords with removeWords(), all applied through tm_map(). A Document-Term Matrix (DTM) can
then be generated using DocumentTermMatrix(), which helps analyze word frequency and patterns. For
sentiment analysis, the syuzhet package’s get_sentiment() function is used to extract emotional tone.
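A minimal end-to-end sketch of these commands, assuming the tm package is installed and using two made-up sentences as the corpus:
    library(tm)
    docs <- c("Great product, works really well!", "Terrible service and slow delivery.")  # invented examples
    corpus <- Corpus(VectorSource(docs))
    corpus <- tm_map(corpus, content_transformer(tolower))       # lowercase
    corpus <- tm_map(corpus, removePunctuation)                  # strip punctuation
    corpus <- tm_map(corpus, removeWords, stopwords("english"))  # drop common stopwords
    dtm <- DocumentTermMatrix(corpus)                            # word-frequency matrix
    inspect(dtm)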
Que 19) What are the pre-processing text requirements in textual analysis?
Ans 19) In textual analysis, preprocessing is a crucial step to clean and standardize text data for better accuracy
and meaningful insights. The process begins with lowercasing, which ensures uniformity by converting all
text to lowercase, preventing duplication of words due to case differences. Next, punctuation removal
eliminates unnecessary symbols, while stopword removal filters out common words like “the” and “is” that
do not add significant value to the analysis. Tokenization then breaks text into individual words or phrases,
making it easier to analyze. To further refine the text, stemming and lemmatization reduce words to their root
or base form, improving consistency (e.g., “running” to “run” or “better” to “good”). Additionally, removing
numbers helps when numerical data is not relevant, while whitespace removal ensures proper formatting. In
some cases, spelling correction is applied to fix typos and enhance text quality.
Que 20) What is sentiment analysis?
Ans 20) Sentiment analysis is the process of determining the emotional tone or opinion expressed in a piece
of text. It uses natural language processing (NLP), machine learning, and text analysis techniques to classify
text as positive, negative, or neutral. Sentiment analysis is widely used in social media monitoring, customer
feedback analysis, brand reputation management, and market research to understand public opinion and trends.
In R, sentiment analysis can be performed using the syuzhet, tidytext, or text packages. For example, the
get_sentiment() function in the syuzhet package can analyze text using different sentiment lexicons like Bing,
NRC, or AFINN. Advanced sentiment analysis also involves aspect-based sentiment analysis (ABSA), where
emotions are identified for specific topics, and deep learning models for more accurate predictions. By
analyzing emotions in text, sentiment analysis helps businesses and researchers make data-driven decisions
based on public opinion.
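A brief sketch, assuming the syuzhet package is installed and using two invented review sentences:
    library(syuzhet)
    reviews <- c("I love this product, it is excellent", "This was a complete waste of money")  # invented examples
    get_sentiment(reviews, method = "afinn")  # numeric scores: positive > 0, negative < 0
    get_nrc_sentiment(reviews)                # counts per NRC emotion category (joy, anger, trust, ...)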
Que 21) What are the libraries that we import for running textual analysis?
Ans 21) In R, several libraries are used for textual analysis, enabling tasks like text preprocessing, tokenization,
sentiment analysis, and visualization. The tm package is widely used for text mining, helping clean text and
create a Document-Term Matrix (DTM) for further analysis. The tidytext package allows text processing using
tidy data principles, making it easier to work with structured text data. For advanced text processing, quanteda
provides tools for tokenization, word frequency analysis, and text classification. Sentiment analysis is
commonly performed using the syuzhet package, which offers multiple sentiment lexicons like Bing, NRC,
and AFINN to determine the emotional tone of text. Additionally, wordcloud is useful for visualizing frequent
words in a dataset, making textual insights more interpretable.
Que 22) What is the difference between data analytics and data analysis?
Ans 22) Data analytics and data analysis are closely related but differ in scope and purpose. Data analysis
refers to the process of examining, cleaning, transforming, and interpreting data to extract meaningful insights.
It focuses on identifying patterns, trends, and relationships within data using statistical and exploratory
techniques.
On the other hand, data analytics is a broader discipline that incorporates predictive modeling, machine learning, automation, and
business intelligence tools to drive decision-making. Data analytics is more action-oriented, aiming to optimize
processes, forecast future trends, and provide data-driven recommendations.
Que 23) What is the classification of data analytics?
Ans 23) Data analytics is classified into four main types, each serving a different purpose in extracting insights
and making data-driven decisions:
1. Descriptive Analytics – This type focuses on summarizing past data to understand trends and
patterns. It answers the question, “What happened?” using techniques like data visualization, reporting, and
dashboards. Examples include sales reports, website traffic analysis, and financial statements.
2. Diagnostic Analytics – Going a step further, this type aims to determine the reasons behind past
outcomes by identifying patterns and relationships in data. It answers the question, “Why did it happen?” using
techniques like drill-down analysis, correlation analysis, and statistical modeling.
3. Predictive Analytics – This type focuses on forecasting future trends based on historical data.
It answers the question, “What is likely to happen?” using methods like machine learning, regression analysis,
and time series forecasting. Businesses use predictive analytics for customer behavior prediction, fraud
detection, and risk assessment.
4. Prescriptive Analytics – The most advanced type, prescriptive analytics suggests actions to
achieve desired outcomes. It answers the question, “What should be done?” using optimization algorithms,
artificial intelligence (AI), and decision science. Examples include recommendation engines, supply chain
optimization, and personalized marketing strategies.
Que 24) What are moving averages?
Ans 24) Moving averages are statistical techniques used to analyze trends in time-series data by smoothing
out short-term fluctuations. They help in identifying patterns and forecasting future movements in areas such
as stock prices, sales trends, and economic indicators. The two main types are Simple Moving Average (SMA)
and Exponential Moving Average (EMA). SMA calculates the average of a fixed number of past data points,
giving equal weight to all values, while EMA assigns greater weight to recent data points, making it more
responsive to changes.
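A small illustrative sketch in base R with an invented price series (the 3-period window is chosen arbitrarily); dedicated packages such as TTR provide ready-made SMA/EMA functions, but the calculation itself looks like this:
    prices <- c(10, 12, 11, 13, 15, 14, 16, 18, 17, 19)   # invented time series
    n <- 3
    sma <- stats::filter(prices, rep(1/n, n), sides = 1)   # simple moving average: equal weights over n periods
    alpha <- 2 / (n + 1)                                    # common EMA smoothing factor
    ema <- numeric(length(prices)); ema[1] <- prices[1]
    for (i in 2:length(prices)) ema[i] <- alpha * prices[i] + (1 - alpha) * ema[i - 1]  # recent values weighted more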
Que 25) What software tools are used for making interactive dashboards?
Ans 25) Several software tools are used for creating interactive dashboards, helping businesses visualize and
analyze data effectively. Tableau is a powerful data visualization tool known for its drag-and-drop
functionality and real-time data integration. Microsoft Power BI is another widely used tool that enables users
to build dynamic dashboards with AI-powered insights and seamless integration with Excel and other
Microsoft applications. Google Data Studio (Looker Studio) is a free tool that allows users to create interactive
reports, especially useful for Google Analytics, Ads, and Sheets integration.
Que 26) What is a command in R?
Ans 26) In R, a command refers to an instruction given to the R programming environment to perform a
specific task. Commands are executed in the R console or script to manipulate data, create visualizations,
perform statistical analysis, or build machine learning models. Commands in R typically involve functions,
which take inputs (arguments) and return outputs. For example, the command mean(c(1, 2, 3, 4, 5)) calculates
the mean (average) of the given numbers. Similarly, plot(x, y) generates a scatter plot of two variables.
Que 27) What is syntax in R?
Ans 27) In R, syntax refers to the set of rules that define how code must be written and structured for the R
interpreter to understand and execute it correctly. The syntax includes the way functions, variables, operators,
and commands are used in R.
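A tiny sketch illustrating typical R syntax elements (the values are arbitrary):
    x <- c(5, 10, 15)        # assignment with <- and a call to the c() function
    avg <- mean(x)           # functions take arguments in parentheses
    if (avg > 5) {           # control structures use braces
      print("Average is greater than 5")
    }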
Que 28) What are packages in R?
Ans 28) In R, a package is a collection of functions, datasets, and documentation that extends the capabilities
of base R, allowing users to perform specialized tasks such as data manipulation, visualization, machine
learning, and statistical analysis. Packages are stored in repositories like CRAN (Comprehensive R Archive
Network) and can be easily installed using the install.packages("package_name") command. Once installed,
they need to be loaded into the R session using library(package_name). Some popular packages include dplyr
for data manipulation, ggplot2 for visualization, caret for machine learning, and shiny for building interactive
dashboards.
Que 29) What is a library in R?
Ans 29) In R, a library is a collection of installed packages that provide additional functions and tools for data
analysis, visualization, statistical modeling, and machine learning. While the terms library and package are
often used interchangeably, a package refers to the individual software bundle, whereas a library is the location
where installed packages are stored. To use a package, it must first be installed using
install.packages("package_name") and then loaded into the R session using library(package_name). For
example, library(ggplot2) loads the ggplot2 package for data visualization.
Que 30) How do we install a package in R?
Ans 30) To install a package in R, the install.packages() function is used, followed by the package name in
quotation marks. For example, to install the ggplot2 package, the command install.packages("ggplot2") is
executed. If multiple packages need to be installed at once, they can be specified within a vector, such as
install.packages(c("dplyr", "tidyverse", "caret")). Once a package is installed, it must be loaded into the R
session using the library() function, such as library(ggplot2), to make its functions available.
Que 31) What are the different data structures in R?
Ans 31) In R, data can be stored and manipulated using different data structures, which define how information
is organized and accessed. The primary structures in R include vectors, lists, matrices, data frames, and arrays.
1. Vectors—The simplest data structure in R, a vector contains elements of the same data type
(numeric, character, logical, etc.). It is created using the c() function, such as x <- c(1, 2, 3, 4).
2. Lists—Unlike vectors, lists can store elements of different data types, including numbers,
strings, and even other lists. A list is created using list(), for example, my_list <- list(1, "text", TRUE).
3. Matrices—A matrix is a two-dimensional data structure where all elements must be of the same
type. It is created using the matrix() function, such as mat <- matrix(1:6, nrow=2, ncol=3), which generates a
2×3 matrix.
4. Data Frames—The most commonly used structure, a data frame is a table-like format where
each column can contain different data types. It is created using data.frame(), for example, df <-
data.frame(Name = c("A", "B"), Age = c(25, 30)).
5. Arrays—Similar to matrices but with more than two dimensions, arrays store data in multi-
dimensional space. An array is created using array(), such as arr <- array(1:12, dim = c(2,3,2)), which forms a
2×3×2 array.
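The same examples collected into one runnable sketch; str() (shown at the end) is a convenient way to inspect any of these structures:
    x   <- c(1, 2, 3, 4)                                    # vector
    lst <- list(1, "text", TRUE)                            # list with mixed types
    mat <- matrix(1:6, nrow = 2, ncol = 3)                  # 2 x 3 matrix
    df  <- data.frame(Name = c("A", "B"), Age = c(25, 30))  # data frame
    arr <- array(1:12, dim = c(2, 3, 2))                    # 2 x 3 x 2 array
    str(df)                                                 # inspect the structure of an object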
Que 32) What is the Variance Inflation Factor?
Ans 32) The Variance Inflation Factor (VIF) is a statistical measure used to detect multicollinearity in a
multiple regression model, which occurs when independent variables are highly correlated with each other.
Multicollinearity can distort the estimated coefficients and make the model unreliable. VIF quantifies how
much the variance of a regression coefficient is inflated due to correlation among predictors. A VIF value of
1 indicates no multicollinearity, while values between 1 and 5 suggest moderate correlation but are generally
acceptable. However, a VIF greater than 10 signals a high level of multicollinearity, requiring corrective
measures such as removing highly correlated variables, combining features, or using dimensionality reduction
techniques like Principal Component Analysis (PCA).
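In R, VIF values are commonly obtained with the vif() function from the car package (assumed installed); mtcars is used purely for illustration:
    library(car)
    model <- lm(mpg ~ wt + hp + disp, data = mtcars)  # illustrative multiple regression
    vif(model)   # values near 1 are fine; values above about 10 signal serious multicollinearity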
Que 33) What are the assumptions of the simple linear regression model?
Ans 33) The simple linear regression model in R, like any statistical model, is based on several key
assumptions that ensure the validity and reliability of the results. These assumptions include:
1. Linearity—There must be a linear relationship between the independent variable (X) and the
dependent variable (Y). This can be checked using scatter plots or correlation analysis.
2. Independence – The observations should be independent of each other, meaning that the value
of one observation does not influence another. This assumption is particularly important in time-series data,
where autocorrelation can be an issue.
3. Homoscedasticity (Constant Variance of Errors) – The residuals (errors) should have constant
variance across all levels of the independent variable. If the spread of residuals increases or decreases
systematically, it indicates heteroscedasticity, which can be detected using residual plots.
4. Normality of Errors – The residuals should be approximately normally distributed, which can be
checked with a histogram or a Q-Q plot of the residuals.
To validate these assumptions in R, diagnostic tools such as residual plots, histograms, Q-Q plots, and
statistical tests (e.g., the Durbin-Watson test for independence and the Breusch-Pagan test for
homoscedasticity) can be used.
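A short diagnostic sketch, assuming the lmtest package is installed for the formal tests (mtcars again serves only as an example):
    model <- lm(mpg ~ wt, data = mtcars)
    plot(model)      # residuals-vs-fitted, Q-Q, scale-location and leverage plots
    library(lmtest)  # assumed installed
    dwtest(model)    # Durbin-Watson test for independence (autocorrelation of residuals)
    bptest(model)    # Breusch-Pagan test for homoscedasticity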
Que 34) What are standard errors in regression analysis?
Ans 34) In regression analysis, standard errors measure the accuracy of estimated coefficients by indicating
the variability of the coefficient estimates across different samples. A smaller standard error suggests that the
estimate is more precise, while a larger standard error indicates greater uncertainty.
Standard errors are crucial for hypothesis testing and constructing confidence intervals for regression
coefficients. They help determine whether an independent variable significantly influences the dependent
variable by calculating the t-statistic (Coefficient / Standard Error) and the p-value. A high standard error may
suggest multicollinearity or insufficient data points. In R, standard errors can be found in the output of the
summary() function applied to a regression model, which provides insight into the reliability of the estimated
coefficients.
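A brief sketch of where the standard errors appear in the summary() output and how the t-statistic is formed (mtcars is illustrative):
    model <- lm(mpg ~ wt, data = mtcars)
    coefs <- summary(model)$coefficients          # columns: Estimate, Std. Error, t value, Pr(>|t|)
    coefs[, "Estimate"] / coefs[, "Std. Error"]   # reproduces the reported t-statistics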
Que 35) What are the different types of measurement scales?
Ans 35) There are four main types of measurement scales in research: Nominal, Ordinal, Interval, and Ratio.
Nominal scales classify data into categories without any order (e.g., gender, nationality). Ordinal scales rank
data but without equal intervals (e.g., satisfaction levels, education levels). Interval scales have equal
differences between values but no true zero (e.g., temperature in Celsius, IQ scores). Ratio scales have a true
zero, allowing all mathematical operations (e.g., height, weight, income). These scales determine the type of
statistical analysis that can be applied.
Que 36) How do you identify outliers in a dataset? Name one method that can be used to identify an outlier.
Ans 36) Outliers in a dataset can be identified using statistical methods, visualization techniques, or machine
learning algorithms. An outlier is a data point that significantly deviates from the rest of the observations,
potentially affecting the accuracy of statistical models.
One commonly used method to detect outliers is the Interquartile Range (IQR) Method, where the IQR = Q3
- Q1 (the range between the 75th percentile and the 25th percentile). Any data point that falls below Q1 - 1.5
× IQR or above Q3 + 1.5 × IQR is classified as an outlier. Another effective way to identify outliers is using
a boxplot, which visually represents the distribution of data and highlights extreme values beyond the
“whiskers.” In R, outliers can be detected using the boxplot() function, which makes it easier to spot deviations.
Que 37) What are the different types of data based on structures?
Ans 37) Data can be categorized into three main types based on its structure: structured, semi-structured, and
unstructured data. Each type differs in its organization, format, and how it is stored and processed.
1. Structured Data – This type of data is highly organized and stored in predefined formats such
as tables with rows and columns in relational databases (e.g., MySQL, PostgreSQL). Examples include sales
records, customer details, and financial transactions. Structured data is easy to query using SQL.
2. Semi-Structured Data – This data does not follow a strict tabular format but still contains tags
or markers to separate elements, providing some level of organization. Examples include JSON, XML, and
NoSQL databases (e.g., MongoDB, Firebase). Semi-structured data is commonly used in web applications,
APIs, and big data storage.
3. Unstructured Data – This type lacks a fixed format, making it more challenging to store and
analyze. Examples include text files, images, videos, social media posts, and emails. Since unstructured data
does not fit into traditional databases, specialized tools like Hadoop, Spark, and AI-driven analytics are used
for processing.
Que 38) What is a box plot?
Ans 38) A box plot, also known as a box-and-whisker plot, is a graphical representation used to visualize the
distribution, central tendency, and spread of a dataset. It summarizes key statistical measures, including the
minimum, first quartile (Q1), median (Q2), third quartile (Q3), and maximum, while also highlighting potential
outliers. The box represents the interquartile range (IQR), spanning from Q1 to Q3, with a line inside indicating
the median. The whiskers extend from Q1 to the minimum and from Q3 to the maximum, excluding outliers,
which are plotted separately as individual points beyond 1.5 × IQR. Box plots help detect skewness, variability,
and extreme values in data, making them useful in exploratory data analysis. In R, box plots can be generated
using the boxplot() function, which provides an efficient way to identify outliers and understand data
distribution.
Que 39) Why do we need data cleaning?
Ans 39) Data cleaning is an essential step in data analysis and machine learning as it ensures the accuracy,
consistency, and reliability of data. Raw data often contains errors, missing values, duplicates, inconsistencies,
and outliers, which can lead to misleading conclusions and incorrect predictions. By cleaning data, we improve
its quality and make it suitable for analysis.
Data cleaning helps in removing irrelevant or redundant information, correcting errors, handling missing
values, and standardizing formats. This process enhances the performance of machine learning models,
improves decision-making, and ensures compliance with data governance standards. Without proper data
cleaning, biased insights, incorrect statistical results, and faulty business decisions may occur. Overall, data
cleaning is crucial for maintaining the integrity and usability of data in analytics and predictive modeling.
Que 40) Why is data processing required before running the analysis?
Ans 40) Data processing is required before running analysis to ensure that the data is accurate, structured, and
ready for meaningful insights. Raw data is often messy, containing inconsistencies, missing values, duplicates,
and outliers, which can distort the results. By processing the data, we transform it into a clean, structured, and
usable format for statistical analysis or machine learning models.
Key steps in data processing include data cleaning (removing errors and inconsistencies), data transformation
(converting data into the required format), feature engineering (creating new meaningful variables), and
normalization (scaling data for consistency). Proper data processing enhances the accuracy, reliability, and
efficiency of analysis, leading to better decision-making and predictive performance. Without it, analysis may
produce incorrect or misleading results, affecting business strategies and research conclusions.