BA TH Exam
🔹 1. What is Data?
Definition:
Data is a collection of raw facts, figures, symbols, or observations that do not carry any meaning by
themselves. Once processed or analyzed, data can become information, which is meaningful and useful.
Example:
If you have a list of numbers like 95, 80, 76, 88 — they are just raw marks of students. This is data. But if
you calculate the average of these marks, it becomes information (like: "Average score is 84.75").
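As a small illustration of turning data into information, the average above can be computed in a few lines of Python. This is only a sketch; the variable names are chosen here for illustration.

```python
# Raw data: individual marks with no meaning on their own
marks = [95, 80, 76, 88]

# Processing step: summarising the data turns it into information
average = sum(marks) / len(marks)

print(f"Average score is {average}")  # prints: Average score is 84.75
```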
Types of Data:
1. Structured Data
o Data that is organized in rows and columns.
o Stored in databases, spreadsheets, etc.
o Example: Student records (Name, Roll No., Marks, etc.)
2. Unstructured Data
o Data that does not have a predefined format.
o Example: Images, videos, social media posts, emails, etc.
3. Semi-Structured Data
o Data that is partly structured.
o Example: XML files, JSON data.
🔹 2. What is Data Science?
Definition:
Data Science is an interdisciplinary field that uses mathematics, statistics, computer science, and domain
knowledge to extract insights and knowledge from data.
It involves collecting, cleaning, analyzing, and interpreting data to support decision-making or to make
predictions.
1. Data Collection
o Gathering data from various sources (databases, web, sensors, etc.)
2. Data Cleaning (Preprocessing)
o Removing errors, duplicates, and missing values to improve data quality.
3. Data Exploration and Analysis
o Using statistical tools to understand patterns, relationships, and trends.
4. Data Visualization
o Creating graphs, charts, dashboards to represent data visually.
5. Modeling and Machine Learning
o Applying algorithms to make predictions or automate tasks (e.g., predicting sales,
recommending movies).
6. Interpretation and Decision Making
o Making business decisions or strategies based on the analysis.
Data analysis and data analytics are related terms, but they are not exactly the same. Think of data
analysis as one part of the broader field called data analytics.
✅ Data Analysis:
Definition:
Data Analysis is the process of examining, cleaning, transforming, and modeling data to discover useful
information, patterns, or conclusions.
Goal: To understand past data and answer questions like “What happened?” or “Why did it happen?”
Example:
A company analyzing last month’s sales report to find which products sold the most.
✅ Data Analytics:
Definition:
Data Analytics is a broader field that includes data analysis, along with prediction, automation, decision-
making, and optimization using advanced tools and technologies.
Goal: To not only understand the past but also predict the future, find hidden patterns, and support data-
driven decisions.
Includes:
• Data analysis
• Data mining
• Machine learning
• Statistical modeling
• Business intelligence
Example:
An e-commerce site using analytics to recommend products, predict customer behavior, and optimize
delivery routes.
🔹 4. Classification of Analytics
Data analytics is used in almost every industry today. Here’s how different sectors benefit from it:
✅ 5. Healthcare
✅ 6. Manufacturing
✅ 7. E-commerce
1. Nominal Data
2. Ordinal Data
3. Scale Data (which includes Interval and Ratio data)
📌 Definition:
Nominal data represents categories that do not have any order or ranking. They are just names or labels
used to identify items.
🎯 Key Features:
• No numerical meaning
• No logical order or ranking
• You can count how many belong to each category
• You cannot add, subtract, or average them
🧠 Examples:
• Gender: Male, Female, Other
• Colors: Red, Blue, Green
• Marital status: Single, Married, Divorced
• Religion: Hindu, Muslim, Christian, Sikh
• Types of fruits: Apple, Mango, Banana
• Count frequency
• Mode (most common category)
• Use bar charts or pie charts
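The frequency count and mode for nominal data can be sketched as follows. The list of colours is hypothetical sample data, not taken from these notes.

```python
from collections import Counter

# Hypothetical survey answers: nominal categories with no order
colors = ["Red", "Blue", "Green", "Blue", "Red", "Blue"]

# Count how many observations fall in each category
frequency = Counter(colors)

# The mode is the most common category
mode, mode_count = frequency.most_common(1)[0]

print(dict(frequency))
print(mode, mode_count)
```

Note that only counting is meaningful here; adding or averaging the categories would make no sense.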
📌 Definition:
Ordinal data represents categories that have a meaningful order or ranking, but the difference between
the values is not measurable.
🎯 Key Features:
🧠 Examples:
Scale data is numerical and can be used for mathematical calculations. It is divided into two subtypes:
Interval data has numbers with equal spacing between values, but no true zero point.
🧠 Examples:
📌 Definition:
Ratio data has all the features of interval data, plus a true zero. This means you can say “twice as much” or
“half as much”.
🧠 Examples:
• Height, Weight
• Age
• Income
• Distance
• Time duration
✅ In simple terms:
Big Data is not just about the size of the data but also about how it is generated, stored, processed, and
analyzed to get useful insights.
• Facebook or Instagram generating billions of posts, likes, and comments every day.
• Online shopping websites like Amazon collecting data on user clicks, purchases, reviews.
• Smart cities using sensors and cameras to monitor traffic, weather, and pollution.
• Banks recording millions of transactions every hour.
✅ 1. Healthcare
✅ 4. Education
✅ 6. Telecommunications
🔚 Conclusion:
• Big Data is not just about size — it is about complex, fast, and diverse data that needs advanced
tools and technologies to manage.
• Its 5 Vs — Volume, Velocity, Variety, Veracity, and Value — help us understand its challenges and
potential.
• Big Data is transforming every industry by helping in smarter decision-making, cost saving, and
better customer experience.
• Problem: If the data is incorrect, outdated, incomplete, or duplicated, the analysis will be
misleading.
• Example: Missing customer information or wrong entries can lead to poor business decisions.
• Impact: Reduces trust in the analytics results.
• Problem: The massive amount of data being generated is difficult to store and manage.
• Example: Social media platforms generate millions of posts per day.
• Impact: Requires powerful storage systems and computing power.
• Problem: Data often comes from many sources (like websites, apps, sensors, etc.) in different
formats.
• Example: Combining sales data from online and offline stores.
• Impact: Makes it hard to unify and analyze data accurately.
• Problem: Skilled data scientists, analysts, and engineers are in high demand but short supply.
• Example: A company may have lots of data but no experts to analyze it.
• Impact: Data remains unused or poorly analyzed.
✅ 6. Choosing the Right Tools and Technologies
• Problem: So many tools are available (Python, R, Tableau, Hadoop, etc.) that it’s hard to select the
right one.
• Example: Using a complex tool for a simple task wastes time and resources.
• Impact: Increases project costs and slows down progress.
• Problem: Some situations need instant data analysis (like fraud detection or traffic monitoring).
• Example: Credit card companies need to detect fraud the moment it happens.
• Impact: Requires high-speed data processing systems.
✅ 9. Cost of Infrastructure
• Problem: Setting up and maintaining servers, cloud platforms, and analytical tools is expensive.
• Example: A startup may not be able to afford advanced data systems.
• Impact: Limited ability to handle or scale data analysis.
• Problem: Governments regularly update data protection laws (like GDPR in Europe).
• Example: Not following regulations can lead to legal trouble.
• Impact: Analytics strategies must constantly adapt to new rules.
Unit 5 (25 Marks)
🔷 1. Simple Linear Regression Model
📘 Meaning:
Simple Linear Regression is a statistical method used to study the relationship between two variables:
📈 Purpose:
It helps in predicting the value of Y when we know the value of X, assuming a linear (straight-line)
relationship between them.
🧮 Mathematical Equation:
$Y = a + bX + e$
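A minimal sketch of how the intercept a and slope b in this equation can be estimated by ordinary least squares. The hours and scores below are hypothetical illustration data, and `fit_line` is a helper name chosen here, not something from the notes.

```python
def fit_line(xs, ys):
    """Return (a, b) for the least-squares line Y = a + bX."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of cross-deviations and sum of squared deviations of X
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b = sxy / sxx              # slope
    a = mean_y - b * mean_x    # intercept
    return a, b

# Hypothetical data: hours studied (X) and exam score (Y)
hours = [1, 2, 3, 4, 5]
scores = [52, 58, 61, 70, 74]

a, b = fit_line(hours, scores)

# The error term e is what the line does not explain for each student
residuals = [y - (a + b * x) for x, y in zip(hours, scores)]

print(round(a, 2), round(b, 2))
```

The residuals of a least-squares line always sum to (approximately) zero, which is a quick sanity check on the fit.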
✅ Example:
Imagine we want to predict a student’s score based on hours studied. We collect data and find the following
regression line:
This means:
• A Confidence Interval tells us the range in which the average (mean) value of Y is expected to fall
for a given X value.
• It shows how precisely the average prediction is estimated.
• Usually given with a confidence level like 95%.
🔹 Example:
If a model says the average score for students who study 5 hours is between 70 and 75, we are 95%
confident that this range contains the true average.
• A Prediction Interval tells us the range in which an individual’s actual Y value is expected to fall
for a given X.
• It accounts for more uncertainty, so it is wider than a confidence interval.
🔹 Example:
For a student who studied 5 hours, their predicted score might fall between 65 and 80. This is a prediction
interval.
📌 Summary of Difference:
Feature | Confidence Interval | Prediction Interval
Tells about | Mean of Y | One new value of Y
Width | Narrower | Wider
Use | General trend | Individual forecast
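The difference in width can be seen in the standard textbook formulas for simple linear regression, given here as a reference sketch. The symbols are assumptions not defined in these notes: $\hat{y}_0 = a + bx_0$ is the predicted value at $x_0$, $s$ is the residual standard error, $n$ is the sample size, $S_{xx} = \sum (x_i - \bar{x})^2$, and $t_{\alpha/2,\,n-2}$ is the critical value of the t-distribution.

```latex
\text{Confidence interval for the mean of } Y \text{ at } x_0:\quad
\hat{y}_0 \pm t_{\alpha/2,\,n-2}\; s\,\sqrt{\frac{1}{n} + \frac{(x_0 - \bar{x})^2}{S_{xx}}}

\text{Prediction interval for a new } Y \text{ at } x_0:\quad
\hat{y}_0 \pm t_{\alpha/2,\,n-2}\; s\,\sqrt{1 + \frac{1}{n} + \frac{(x_0 - \bar{x})^2}{S_{xx}}}
```

The extra 1 under the square root accounts for the scatter of an individual observation around the line, which is why the prediction interval is always the wider of the two.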
When we use two or more independent variables (X₁, X₂, X₃…) to predict a single dependent variable
(Y), it is called Multiple Linear Regression.
📈 Equation:
$Y = a + b_1X_1 + b_2X_2 + b_3X_3 + \dots + e$
• a = Intercept
• b₁, b₂… = Slopes (impact of each independent variable)
• X₁, X₂… = Independent variables
• e = Error term
✅ Example:
Model:
This means:
✅ Example:
Model:
Interpretation:
• For each 1-unit increase in TV advertising, sales increase by ₹300 (if Online Ads stay the same).
• For each 1-unit increase in Online Ads, sales increase by ₹150 (if TV Ads stay the same).
🔷 5. Heteroscedasticity
📘 Meaning:
Heteroscedasticity occurs when the spread (variance) of the errors is not constant across all levels of the
independent variable(s).
When predicting people’s spending, low-income people may have small variation in spending, but high-
income people may have very large variation. So the errors are unequal.
🔍 Visual Sign:
• If we plot residuals (errors) and they form a cone or funnel shape, it indicates heteroscedasticity.
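A rough numerical version of the funnel check is to compare the spread of residuals at low and high values of X. The residuals below are hypothetical, and this split comparison is only an informal illustration, not a formal statistical test.

```python
from statistics import pvariance

# Hypothetical residuals, ordered from low-income to high-income people
residuals = [-1, 1, -2, 2, -1, 1, -8, 9, -12, 10, -15, 14]

half = len(residuals) // 2
low_var = pvariance(residuals[:half])    # spread among low-income cases
high_var = pvariance(residuals[half:])   # spread among high-income cases

# A ratio far above 1 suggests the error variance is not constant
ratio = high_var / low_var
print(low_var, high_var, ratio)
```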
⚠ Problem:
🔷 6. Multicollinearity
📘 Meaning:
Multicollinearity happens when two or more independent variables in a regression model are highly
correlated with each other.
✅ Example:
These two may be highly correlated, because larger houses usually have more rooms. This leads to
multicollinearity.
🔍 Why is it bad?
📊 How to Detect?
• Using Variance Inflation Factor (VIF). If VIF > 10, multicollinearity is likely a problem.
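For a model with exactly two independent variables, VIF reduces to 1 / (1 - r²), where r is the correlation between the two predictors. The sketch below uses that special case; the house data and the helper name `pearson_r` are hypothetical.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation between two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

# Hypothetical predictors: house size (sq ft) and number of rooms
size = [800, 1000, 1200, 1500, 1800, 2200]
rooms = [2, 3, 3, 4, 5, 6]

r = pearson_r(size, rooms)

# With two predictors, the R-squared of one regressed on the other is r**2
vif = 1 / (1 - r ** 2)
print(round(r, 3), round(vif, 1))
```

With more than two predictors, each VIF needs the R² from regressing that predictor on all the others, which is usually left to a statistics package.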
🟩 Final Summary:
Topic Explanation
Formal Text | Structured and grammatically correct | News articles, academic papers
Informal Text | Casual, often with slang or emojis | Tweets, WhatsApp chats
In simple terms, textual data analysis helps you make sense of huge amounts of written text by:
• Identifying patterns
• Understanding emotions or sentiments
• Finding common topics
• Summarizing information
• Detecting important words or phrases
Today, millions of pieces of text are generated every second on social media, blogs, emails, forums,
and elsewhere. Manually reading all of it is impossible.
Most textual data is unstructured (no clear format), unlike numbers in a spreadsheet. Textual analysis
converts unstructured text into structured insights.
🟡 c. Improves Decision-Making
Automated tools can analyze thousands of documents in seconds, which saves human labor and time.
• Social patterns
• Ideologies
• Cultural meanings in interviews, speeches, or news media
• Customer satisfaction
• Public opinion
• Trends or viral content
Example: Twitter sentiment about a political leader or a product launch.
🔵 c. Healthcare
Doctors or researchers analyze patient feedback, clinical notes, or online health forums to understand:
• Patient experiences
• Common symptoms
• Effectiveness of treatments
• Witness statements
• Case documents
• Social media posts of suspects
To find useful patterns or contradictions.
🔵 e. Academic Research
• Speeches
• Interviews
• Media articles
To study ideologies, cultural shifts, or gender/race narratives.
🔵 f. News and Media
🔵 g. Customer Service
🔵 h. Political Analysis
• Political ideology
• Public opinion during elections
• Misinformation or propaganda
🔺 a. Language Complexity
🔺 b. Multilingual Data
If tools are not trained properly, they may give biased results.
Too much data (big volume) and too many types of sources (blogs, articles, chats, reports) make it harder to
standardize analysis.
• Ethical issues arise if users are not aware that their data is being analyzed.
✅ Summary
Aspect | Explanation
What is it? | Understanding and analyzing written text to find meaning, patterns, or insights.
Why is it important? | Helps deal with large unstructured data, supports research and decision-making, saves time.
Where is it used? | Marketing, politics, healthcare, law, media, academic research, customer service.
What are the challenges? | Language complexity, sarcasm, slang, multilingual issues, data cleaning, privacy, need for technical skills.
Why R?
Before analysis, you need to get the data. You can use:
Cleaning Steps:
• Convert to lowercase
• Remove punctuation
• Remove numbers
• Remove stopwords (e.g., "is", "the", "and")
• Stemming or Lemmatization (reduce words to root form)
• Strip whitespace
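The cleaning steps are the same in any language. As a minimal sketch, here they are in Python using only the standard library. The stopword list is a tiny illustrative one, `clean_text` is a name chosen here, and stemming/lemmatization is left out because it normally relies on an external library.

```python
import re
import string

# Tiny illustrative stopword list; real projects use a much fuller one
STOPWORDS = {"is", "the", "and", "a", "of", "in", "to"}

def clean_text(text):
    text = text.lower()                                               # lowercase
    text = text.translate(str.maketrans("", "", string.punctuation))  # remove punctuation
    text = re.sub(r"\d+", "", text)                                   # remove numbers
    tokens = [word for word in text.split() if word not in STOPWORDS] # remove stopwords
    return " ".join(tokens)   # split/join also strips extra whitespace

print(clean_text("The battery is GREAT, and the camera is 10 times better!"))
# prints: battery great camera times better
```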
This step converts text into a structured format — a matrix of documents vs. terms.
• Rows = documents
• Columns = terms (words)
• Cells = frequency of that word in that document
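A document-term matrix of this shape can be built by hand, as in the Python sketch below. The three short documents are hypothetical and assumed to be already cleaned.

```python
from collections import Counter

# Hypothetical cleaned documents
docs = ["good camera good battery", "bad battery", "good phone"]

# Columns: every distinct term, in alphabetical order
vocabulary = sorted({word for doc in docs for word in doc.split()})

# Rows: one per document; cells: how often each term occurs in it
dtm = []
for doc in docs:
    counts = Counter(doc.split())
    dtm.append([counts[term] for term in vocabulary])

print(vocabulary)
for row in dtm:
    print(row)
```

Most cells in a real document-term matrix are zero, because each document uses only a small part of the vocabulary.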
🔹 5. Sentiment Analysis
You can determine the emotional tone of text (positive, negative, angry, sad, etc.) using predefined
lexicons.
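The lexicon idea can be sketched in a few lines of Python: each word carries a score, and the scores in a text are summed. The mini-lexicon below is hypothetical and far smaller than any real sentiment lexicon, and `sentiment` is a name chosen here.

```python
# Hypothetical mini-lexicon: +1 for positive words, -1 for negative words
LEXICON = {
    "good": 1, "great": 1, "love": 1,
    "bad": -1, "poor": -1, "terrible": -1,
}

def sentiment(text):
    # Words not in the lexicon contribute 0 to the score
    score = sum(LEXICON.get(word, 0) for word in text.lower().split())
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

print(sentiment("Poor service"))           # prints: negative
print(sentiment("Great camera love it"))   # prints: positive
```

A simple word count like this cannot handle negation ("not good") or sarcasm, which is one reason real tools are more elaborate.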
Text mining means extracting useful information from huge sets of unstructured text data.
It is like digging gold from tons of sand: we extract meaningful insights from seemingly meaningless text.
• Text is collected from sources like social media, news, blogs, emails, surveys, etc.
• Example: 10,000 Amazon product reviews.
b. Text Pre-processing:
c. Information Extraction:
d. Pattern Detection:
Categorization, or classification, means dividing text into specific predefined categories.
🧠 Example:
b. Common Algorithms:
c. Applications:
3. Sentiment Analysis
📌 What is it?
In sentiment analysis, we try to understand the emotion or attitude expressed in a text:
Positive, Negative, or Neutral.
🧠 Why is it useful?
c. Fine-Grained Sentiment:
d. Aspect-based Sentiment:
o Camera → Positive
o Battery → Negative
🧰 Tools Used:
• NLTK (Python)
• TextBlob
• VADER Sentiment
• IBM Watson
• Google Cloud Natural Language API
🔄 Summary Table:
Method | Purpose | Example
Text Mining | Extract patterns and insights | Finding trending topics from 1 million tweets
Categorization | Sort text into categories | Classify articles into Tech, Politics, Health, etc.
Sentiment Analysis | Detect emotion/mood in text | Review says "Poor service" → Negative sentiment
Unit 3 (10 Marks)
1. Introduction to R
R is a programming language designed specifically for statistical computing, data analysis, and
visualization. Unlike general-purpose languages like Java or C++, R is built with the core intention of
handling, analyzing, and visualizing data easily and efficiently.
• History: R was developed in the early 1990s by Ross Ihaka and Robert Gentleman. It is based on the
S programming language but with added features and open-source accessibility.
• Use Cases: It is used in fields such as data science, bioinformatics, financial modeling, academic
research, and any domain that involves large amounts of data.
• Nature: R is not just a language; it's a complete environment that includes built-in functions for data
manipulation, tools for reading/writing data, and support for graphics and statistical modeling.
2. Advantages of R
R stands out from other programming languages when it comes to data and statistical work. Here's why:
• Open Source and Free: Anyone can download and use R. It’s maintained by a strong community
and regularly updated.
• Specialized for Data: It was created specifically for statistics and data analysis, so tasks like data
cleaning, visualization, and analysis are very simple and intuitive.
• Large Number of Packages: Thousands of community-developed packages extend R’s
functionality in fields like machine learning, genetics, economics, and more.
• Powerful Visualization: You can create advanced and customizable plots (bar charts, histograms,
maps, heatmaps) using packages like ggplot2 and lattice.
• Cross-Platform Support: Runs on Windows, macOS, and Linux seamlessly.
• Easy Integration: R can work alongside other tools and languages like SQL, Python, Hadoop, and
even Excel.
3. Installation of R
To use R, two installations are typically required:
a) R Base
4. Packages in R
A package is a collection of R functions, documentation, and data created to solve a specific problem or
extend R’s capabilities.
• Think of packages as “apps” for R. If R is your phone, packages are additional apps you install to do
more things.
• For example, if you want to make beautiful graphs, you install the ggplot2 package. For advanced
data manipulation, use dplyr.
Important Points:
a) CSV Files:
b) Excel Files:
• Excel files may contain multiple sheets, formulas, and cell formatting.
• Reading Excel in R requires external packages like readxl or openxlsx.
Once the data is loaded into R, it is stored in an object (usually a data frame) which can then be
manipulated or analyzed.
Key Concepts:
Example Behavior:
• When you type a command and press Enter, R evaluates it immediately and shows the result.
To use a package:
R comes with some default packages, but most useful work requires installing external packages from
CRAN, GitHub, or Bioconductor.
8. Data Structures in R
R provides various ways to store and organize data. These structures define how data is arranged and
accessed.
a) Vectors:
b) Matrices:
c) Arrays:
d) Lists:
e) Factors:
• Used for categorical data (like gender: male/female).
• Internally stored as integers but displayed as labels.
• Useful for grouping and statistical analysis.
f) Data Frames:
Conditional Statements:
These are used to make decisions and execute different blocks of code accordingly.
10. Loops in R
Loops allow you to repeat a task multiple times. This is especially useful when working with large datasets
or repeating actions.
a) For Loop:
b) While Loop:
c) Repeat Loop:
11. Functions in R
Functions are blocks of code that perform a specific task. You define them once and use them many times,
saving effort and reducing mistakes.
Key Features:
Types:
These functions are very helpful in data analysis, especially when you're doing repetitive operations.
Unit 4 (10 Marks)
When we talk about importing a data file in R, we mean bringing external data into R so we can analyze
and visualize it. Think of R as a kitchen and your data file (CSV, Excel, etc.) as ingredients you bring in to
cook (analyze).
• Data is stored in data frames: like Excel sheets with rows and columns.
• Each column = a variable (like "Age", "Income")
• Each row = an observation (like "Person A", "Person B")
🔷 A. Histogram
🧠 What is it?
A histogram shows how often different ranges of values appear in a dataset. It’s used for continuous
numerical data.
🖼 Image:
📝 Example:
You have exam scores of 100 students. A histogram can tell you how many students scored between 50–60,
60–70, etc.
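The counting behind a histogram can be sketched in Python as below. The ten scores are hypothetical, and each score of exactly 60, 70, and so on is placed in the bin that starts at that value.

```python
from collections import Counter

# Hypothetical exam scores
scores = [52, 58, 61, 67, 69, 73, 75, 78, 84, 91]

bin_width = 10
# Map each score to the lower edge of its bin, e.g. 67 -> 60
bins = Counter((score // bin_width) * bin_width for score in scores)

# Text version of the histogram: one '#' per student in the range
for start in sorted(bins):
    print(f"{start}-{start + bin_width}: {'#' * bins[start]}")
```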
📌 Interpretation:
🔷 B. Bar Chart
🧠 What is it?
A bar chart shows the frequency of each category using bars. It is used for categorical data (like
gender, product type, city name).
🖼 Image:
📝 Example:
A survey asks people about their favorite fruit. Categories = Apple, Mango, Banana. The bar height shows
how many people chose each.
📌 Interpretation:
🔷 C. Box Plot
🧠 What is it?
A box plot shows summary statistics (min, max, median, and quartiles) and helps detect outliers.
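The numbers a box plot is drawn from can be computed directly, as in this Python sketch. The incomes are hypothetical, and the 1.5 × IQR rule for flagging outliers is a common convention assumed here, not something stated in these notes.

```python
from statistics import quantiles

# Hypothetical incomes (in thousands) for one city
incomes = [22, 25, 27, 30, 31, 33, 35, 38, 40, 95]

# Quartiles: Q1, median (Q2), and Q3
q1, median, q3 = quantiles(incomes, n=4, method="inclusive")
iqr = q3 - q1   # interquartile range: the height of the box

# Assumed convention: points beyond 1.5 * IQR from the box are outliers
low_fence = q1 - 1.5 * iqr
high_fence = q3 + 1.5 * iqr
outliers = [x for x in incomes if x < low_fence or x > high_fence]

print(min(incomes), q1, median, q3, max(incomes))
print(outliers)
```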
🖼 Image:
📝 Example:
You want to compare incomes of two cities. Box plots help you see:
📌 Interpretation:
🔷 D. Line Graph
🧠 What is it?
A line graph shows how something changes over time. Used for time-series data.
🖼 Image:
📝 Example:
Track daily temperature over a month. Line graph shows if it’s rising, falling, or fluctuating.
📌 Interpretation:
🔷 E. Scatter Plot
🧠 What is it?
A scatter plot shows the relationship between two numerical variables. Each dot = one observation.
🖼 Image:
📝 Example:
You want to see if study time affects exam scores. Put study hours on X-axis and scores on Y-axis.
📌 Interpretation:
🔷 Mean (Average)
🧠 What is it?
📌 When to use:
🔷 Median
🧠 What is it?
📌 When to use:
🔷 Mode
🧠 What is it?
📌 When to use:
• Mean = 36
• Median = 30
• Mode = 30
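The three measures can be checked with Python's `statistics` module. The dataset below is hypothetical, chosen only because it reproduces the values listed above; it is not necessarily the dataset used in these notes.

```python
from statistics import mean, median, mode

# Hypothetical values chosen to match the results above
values = [20, 30, 30, 40, 60]

print(mean(values))    # prints: 36
print(median(values))  # prints: 30
print(mode(values))    # prints: 30
```

The single large value (60) pulls the mean above the median, which is why the median is often preferred for skewed data.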
🟢 4. Measures of Dispersion
These show how spread out or clustered the data values are.
🔷 Range
🧠 What is it?
📌 Use:
🔷 Variance
🧠 What is it?
Average of the squared differences from the mean. Tells how far values are from the mean on average.
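A minimal Python sketch of this calculation, using hypothetical values. It uses the population variance (dividing by N); the sample variance, which divides by N - 1, is available as `statistics.variance`.

```python
from statistics import pstdev, pvariance

# Hypothetical values with mean 5
values = [2, 4, 4, 4, 5, 5, 7, 9]

mean_value = sum(values) / len(values)

# Variance by hand: average of the squared differences from the mean
manual_variance = sum((x - mean_value) ** 2 for x in values) / len(values)

print(manual_variance)     # same result as pvariance(values)
print(pvariance(values))
print(pstdev(values))      # standard deviation = square root of the variance
```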
🔷 Standard Deviation
🧠 What is it?
📌 Use:
🧠 What is it?
📌 Use:
🔷 Covariance
🧠 What is it?
Shows if two variables move in the same direction (positive) or opposite (negative).
📝 Example:
📌 Limitation:
🔷 Correlation
🧠 What is it?
📏 Range: -1 to +1
📝 Example:
• Study time vs grades = +0.85 → strong positive correlation
• TV time vs grades = -0.60 → moderate negative correlation
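Covariance and correlation can be computed by hand as in the sketch below. The study hours and grades are hypothetical, the function names are chosen here, and the sample versions (dividing by n - 1) are assumed.

```python
from math import sqrt

def covariance(xs, ys):
    """Sample covariance (divides by n - 1)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)

def correlation(xs, ys):
    """Pearson correlation: covariance divided by both standard deviations."""
    return covariance(xs, ys) / sqrt(covariance(xs, xs) * covariance(ys, ys))

# Hypothetical data
study_hours = [1, 2, 3, 4, 5]
grades = [50, 55, 65, 70, 80]

cov = covariance(study_hours, grades)
r = correlation(study_hours, grades)

# Rescaling grades changes the covariance but not the correlation
grades_x10 = [g * 10 for g in grades]
print(cov, covariance(study_hours, grades_x10))
print(round(r, 3), round(correlation(study_hours, grades_x10), 3))
```

Because covariance depends on the units of measurement, its size is hard to interpret; correlation rescales it to the fixed range -1 to +1.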
🔷 Coefficient of Determination (R²)
🧠 What is it?
Tells how much of the variation in one variable is explained by the other.
📏 Range: 0 to 1
📌 Use:
• In regression models, this shows how well the model fits the data.
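This measure is commonly written R². A minimal Python sketch of how it is computed for a simple linear regression follows; the data and the function name `r_squared` are hypothetical.

```python
def r_squared(xs, ys):
    """Share of the variation in ys explained by a least-squares line on xs."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b = sxy / sxx              # slope of the fitted line
    a = mean_y - b * mean_x    # intercept of the fitted line
    # Unexplained variation: squared errors around the fitted line
    ss_res = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
    # Total variation: squared deviations around the mean of ys
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1 - ss_res / ss_tot

# Hypothetical data: hours studied and exam score
hours = [1, 2, 3, 4, 5]
scores = [52, 58, 61, 70, 74]

r2 = r_squared(hours, scores)
print(round(r2, 2))  # prints: 0.98
```

A value of 0.98 would mean that about 98% of the variation in these scores is explained by hours studied.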