BA TH Exam

Unit 1 (15 Marks)

🔹 1. What is Data?

Definition:
Data is a collection of raw facts, figures, symbols, or observations that do not carry any meaning by
themselves. Once processed or analyzed, data can become information, which is meaningful and useful.

Example:
If you have a list of numbers like 95, 80, 76, 88 — they are just raw marks of students. This is data. But if
you calculate the average of these marks, it becomes information (like: "Average score is 84.75").

Types of Data:

1. Structured Data
o Data that is organized in rows and columns.
o Stored in databases, spreadsheets, etc.
o Example: Student records (Name, Roll No., Marks, etc.)
2. Unstructured Data
o Data that does not have a predefined format.
o Example: Images, videos, social media posts, emails, etc.
3. Semi-Structured Data
o Data that is partly structured.
o Example: XML files, JSON data.

Based on Nature, data can also be:

• Quantitative Data: Numerical (e.g., height, weight, salary)


• Qualitative Data: Descriptive (e.g., gender, color, feedback)

🔹 2. What is Data Science?

Definition:
Data Science is an interdisciplinary field that uses mathematics, statistics, computer science, and domain
knowledge to extract insights and knowledge from data.

It involves collecting, cleaning, analyzing, and interpreting data to support decision-making or to make
predictions.

Why Data Science?


In today’s digital world, we are generating a huge amount of data every second (from phones, websites,
sensors, etc.). This data is very valuable, and Data Science helps in understanding it and using it for solving
real-world problems.

Key Components of Data Science:

1. Data Collection
o Gathering data from various sources (databases, web, sensors, etc.)
2. Data Cleaning (Preprocessing)
o Removing errors, duplicates, and missing values to improve data quality.
3. Data Exploration and Analysis
o Using statistical tools to understand patterns, relationships, and trends.
4. Data Visualization
o Creating graphs, charts, dashboards to represent data visually.
5. Modeling and Machine Learning
o Applying algorithms to make predictions or automate tasks (e.g., predicting sales,
recommending movies).
6. Interpretation and Decision Making
o Making business decisions or strategies based on the analysis.

🔹 3. Data Analysis vs Data Analytics

These two terms are related but not exactly the same. Think of data analysis as a part of the broader field
called data analytics.

✅ Data Analysis:

Definition:
Data Analysis is the process of examining, cleaning, transforming, and modeling data to discover useful
information, patterns, or conclusions.

Goal: To understand past data and answer questions like “What happened?” or “Why did it happen?”

Example:
A company analyzing last month’s sales report to find which products sold the most.

✅ Data Analytics:

Definition:
Data Analytics is a broader field that includes data analysis, along with prediction, automation, decision-
making, and optimization using advanced tools and technologies.

Goal: To not only understand the past but also predict the future, find hidden patterns, and support data-
driven decisions.

Includes:

• Data analysis
• Data mining
• Machine learning
• Statistical modeling
• Business intelligence

Example:
An e-commerce site using analytics to recommend products, predict customer behavior, and optimize
delivery routes.
🔹 4. Classification of Analytics

Data analytics is generally classified into four main types:

1. Descriptive Analytics – "What happened?"

• Purpose: Summarizes past data into understandable patterns or reports.


• Tools: Charts, graphs, dashboards, Excel, Tableau.
• Example: Monthly sales reports, website traffic reports.

2. Diagnostic Analytics – "Why did it happen?"

• Purpose: Finds reasons behind past outcomes.


• Tools: Drill-down, data mining, correlation analysis.
• Example: Analyzing why sales dropped in a specific region.

3. Predictive Analytics – "What is likely to happen?"

• Purpose: Uses historical data to forecast future outcomes.


• Tools: Machine learning, statistical models.
• Example: Predicting customer churn or future sales.

4. Prescriptive Analytics – "What should we do?"

• Purpose: Suggests actions or strategies based on predictions.


• Tools: AI, optimization algorithms, simulation.
• Example: Suggesting the best marketing strategy to boost sales.

🔹 5. Applications of Analytics in Business

Data analytics is used in almost every industry today. Here’s how different sectors benefit from it:

✅ 1. Marketing & Sales

• Understanding customer behavior and preferences.


• Running targeted marketing campaigns.
• Forecasting future sales and demand.
• Recommendation systems (like Amazon, Netflix).

✅ 2. Finance & Banking

• Fraud detection using transaction data.


• Credit scoring and risk analysis.
• Portfolio and investment optimization.

✅ 3. Human Resources (HR)

• Analyzing employee performance and attrition.


• Predicting hiring needs.
• Improving employee engagement.

✅ 4. Supply Chain & Logistics


• Optimizing delivery routes.
• Managing inventory efficiently.
• Forecasting product demand.

✅ 5. Healthcare

• Diagnosing diseases from medical data.


• Predicting patient readmission.
• Improving treatment plans and hospital efficiency.

✅ 6. Manufacturing

• Predictive maintenance of machinery.


• Quality control using sensor data.
• Reducing production cost through optimization.

✅ 7. E-commerce

• Personalized product recommendations.


• Price optimization.
• Understanding shopping patterns.

🔹 6. What are the Types of Data?


In statistics, data is often classified based on levels of measurement. There are three main types used in
analytics:

1. Nominal Data
2. Ordinal Data
3. Scale Data (which includes Interval and Ratio data)

✅ 1. Nominal Data — Labels or Categories

📌 Definition:

Nominal data represents categories that do not have any order or ranking. They are just names or labels
used to identify items.

🎯 Key Features:

• No numerical meaning
• No logical order or ranking
• You can count how many belong to each category
• You cannot add, subtract, or average them

🧠 Examples:
• Gender: Male, Female, Other
• Colors: Red, Blue, Green
• Marital status: Single, Married, Divorced
• Religion: Hindu, Muslim, Christian, Sikh
• Types of fruits: Apple, Mango, Banana

✔ What can you do with it?

• Count frequency
• Mode (most common category)
• Use bar charts or pie charts

✅ 2. Ordinal Data — Ordered Categories

📌 Definition:

Ordinal data represents categories that have a meaningful order or ranking, but the difference between
the values is not measurable.

🎯 Key Features:

• Data has a rank or order


• You can say one is more/less than the other
• But you can’t measure how much more
• You still can’t do arithmetic operations like addition or average

🧠 Examples:

• Customer satisfaction: Very Satisfied, Satisfied, Neutral, Dissatisfied


• Education level: High School, Bachelor's, Master's, PhD
• Rankings: 1st, 2nd, 3rd place in a race
• Star ratings: 1 star to 5 stars

✔ What can you do with it?

• Find median or mode


• Use bar charts
• Compare ranks

✅ 3. Scale Data — Numeric Data (Interval + Ratio)

Also called: Quantitative or Continuous data

Scale data is numerical and can be used for mathematical calculations. It is divided into two subtypes:

➤ A. Interval Data — Numerical, but no true zero


📌 Definition:

Interval data has numbers with equal spacing between values, but no true zero point.

🧠 Examples:

• Temperature in Celsius or Fahrenheit (0°C does not mean 'no temperature')


• IQ scores
• Dates on a calendar

✔ What can you do with it?

• Add and subtract


• Calculate mean, standard deviation
• Create histograms, line graphs

➤ B. Ratio Data — Numerical with a true zero

📌 Definition:

Ratio data has all the features of interval data, plus a true zero. This means you can say “twice as much” or
“half as much”.

🧠 Examples:

• Height, Weight
• Age
• Income
• Distance
• Time duration

✔ What can you do with it?

• All statistical operations (add, subtract, multiply, divide)


• Mean, median, mode
• Histograms, scatter plots

🔁 Quick Comparison Table

Feature                  Nominal  Ordinal  Interval       Ratio
Is it categorical?       Yes      Yes      No             No
Has meaningful order?    No       Yes      Yes            Yes
Has equal spacing?       No       No       Yes            Yes
Has a true zero point?   No       No       No             Yes
Can do math operations?  No       No       Yes (limited)  Yes (fully)

🧠 Why is this important?


Knowing the type of data helps us choose the right statistical tools and visualization methods. For
example:

• You can't calculate the average of colors (nominal).


• You shouldn't treat ranks like exact numbers (ordinal).
• You can safely calculate the average of weights or heights (ratio).

✅ In simple terms:

• Nominal = Just names (e.g., Apple, Orange)


• Ordinal = Names with order (e.g., Poor < Average < Good)
• Interval = Numbers, no true zero (e.g., Temperature)
• Ratio = Numbers, true zero exists (e.g., Height, Income)

🔹 7. What is Big Data?


Definition:
Big Data refers to extremely large and complex datasets that are too big to be handled by traditional data
processing software (like Excel or normal databases). This data is generated from various sources like
social media, sensors, mobile apps, e-commerce sites, and more.

Big Data is not just about the size of the data but also about how it is generated, stored, processed, and
analyzed to get useful insights.

✅ Examples of Big Data:

• Facebook or Instagram generating billions of posts, likes, and comments every day.
• Online shopping websites like Amazon collecting data on user clicks, purchases, reviews.
• Smart cities using sensors and cameras to monitor traffic, weather, and pollution.
• Banks recording millions of transactions every hour.

🔹 Characteristics of Big Data (The 5 Vs)


These are the main properties that define Big Data:

1. Volume – How much data?

• Refers to the huge size of data.


• Data is generated in terabytes, petabytes, or even zettabytes.
• Example: YouTube users upload hundreds of hours of video every minute.

2. Velocity – How fast is data coming in?


• Refers to the speed at which data is created, collected, and processed.
• Real-time data processing is often needed.
• Example: Credit card fraud detection systems process data instantly.

3. Variety – What types of data?

• Refers to the different forms of data.


• Structured data (databases), semi-structured data (JSON, XML), unstructured data (images, videos,
texts).
• Example: Tweets, emails, video files, and GPS data—all in different formats.

4. Veracity – How accurate is the data?

• Refers to the quality or trustworthiness of the data.


• Big Data can include incomplete, inconsistent, or noisy data.
• Example: User reviews may have spelling errors or fake content.

5. Value – Is the data useful?

• Refers to the usefulness of the data in decision-making.


• The goal of Big Data is to extract meaningful insights, patterns, or predictions.
• Example: Retail companies analyzing customer behavior to increase sales.

🔹 Applications of Big Data (Across Different Sectors)


Big Data is used in almost every field today. Here are some of the most important applications:

✅ 1. Healthcare

• Predicting disease outbreaks (e.g., COVID-19 tracking)


• Analyzing patient records to improve diagnosis and treatment
• Personalized medicine based on genetic data
• Managing hospital resources and reducing wait times

✅ 2. Retail & E-commerce

• Personalized product recommendations (like Amazon, Flipkart)


• Analyzing customer purchase patterns
• Dynamic pricing based on demand and competition
• Inventory management and supply chain optimization

✅ 3. Banking & Finance

• Detecting fraudulent transactions in real-time


• Credit risk analysis before giving loans
• Algorithmic trading (automatic stock buying/selling)
• Customer behavior analysis to offer better services

✅ 4. Education

• Tracking student performance through online platforms


• Predicting dropouts or academic struggles
• Designing personalized learning content
• Improving institutional performance and decision-making

✅ 5. Transport & Logistics

• Route optimization using GPS and traffic data


• Predictive maintenance of vehicles
• Real-time tracking of shipments
• Fuel consumption analysis and cost reduction

✅ 6. Telecommunications

• Improving network performance using usage data


• Detecting service outages in real time
• Analyzing customer complaints and feedback
• Offering personalized mobile plans

✅ 7. Social Media & Entertainment

• Trend analysis (e.g., what topics are going viral)


• Targeted advertising based on user interests
• Recommending videos or content (like YouTube, Netflix)
• Monitoring public opinion on events and brands

✅ 8. Government & Smart Cities

• Traffic management using sensor and camera data


• Predicting crime hotspots using historical data
• Disaster management and early warning systems
• Monitoring pollution and environment in real time

🔚 Conclusion:
• Big Data is not just about size — it is about complex, fast, and diverse data that needs advanced
tools and technologies to manage.
• Its 5 Vs — Volume, Velocity, Variety, Veracity, and Value — help us understand its challenges and
potential.
• Big Data is transforming every industry by helping in smarter decision-making, cost saving, and
better customer experience.

🔹 8. Challenges in Data Analytics


✅ 1. Data Quality Issues

• Problem: If the data is incorrect, outdated, incomplete, or duplicated, the analysis will be
misleading.
• Example: Missing customer information or wrong entries can lead to poor business decisions.
• Impact: Reduces trust in the analytics results.

✅ 2. Handling Large Volumes of Data

• Problem: The massive amount of data being generated is difficult to store and manage.
• Example: Social media platforms generate millions of posts per day.
• Impact: Requires powerful storage systems and computing power.

✅ 3. Data Integration from Multiple Sources

• Problem: Data often comes from many sources (like websites, apps, sensors, etc.) in different
formats.
• Example: Combining sales data from online and offline stores.
• Impact: Makes it hard to unify and analyze data accurately.

✅ 4. Data Security and Privacy

• Problem: Protecting sensitive data from leaks, hacks, or misuse is critical.


• Example: User data being stolen from a company’s database.
• Impact: Legal penalties, loss of customer trust, and ethical issues.

✅ 5. Lack of Skilled Professionals

• Problem: Skilled data scientists, analysts, and engineers are in high demand but short supply.
• Example: A company may have lots of data but no experts to analyze it.
• Impact: Data remains unused or poorly analyzed.
✅ 6. Choosing the Right Tools and Technologies

• Problem: So many tools are available (Python, R, Tableau, Hadoop, etc.) that it’s hard to select the
right one.
• Example: Using a complex tool for a simple task wastes time and resources.
• Impact: Increases project costs and slows down progress.

✅ 7. Real-Time Data Processing

• Problem: Some situations need instant data analysis (like fraud detection or traffic monitoring).
• Example: Credit card companies need to detect fraud the moment it happens.
• Impact: Requires high-speed data processing systems.

✅ 8. Interpreting Results Correctly

• Problem: Even accurate data can be misunderstood or misinterpreted.


• Example: Confusing correlation with causation.
• Impact: Leads to wrong conclusions and poor business decisions.

✅ 9. Cost of Infrastructure

• Problem: Setting up and maintaining servers, cloud platforms, and analytical tools is expensive.
• Example: A startup may not afford advanced data systems.
• Impact: Limited ability to handle or scale data analysis.

✅ 10. Changing Data Regulations

• Problem: Governments regularly update data protection laws (like GDPR in Europe).
• Example: Not following regulations can lead to legal trouble.
• Impact: Analytics strategies must constantly adapt to new rules.
Unit 5 (25 Marks)
🔷 1. Simple Linear Regression Model
📘 Meaning:

Simple Linear Regression is a statistical method used to study the relationship between two variables:

• One dependent variable (Y) — the outcome we want to predict.


• One independent variable (X) — the input that helps in making the prediction.

📈 Purpose:

It helps in predicting the value of Y when we know the value of X, assuming a linear (straight-line)
relationship between them.

🧮 Mathematical Equation:
Y = a + bX + e

• Y: Predicted value (dependent variable)


• X: Independent variable
• a: Intercept (value of Y when X is 0)
• b: Slope (how much Y increases when X increases by 1)
• e: Error term (difference between actual and predicted value)

✅ Example:

Imagine we want to predict a student’s score based on hours studied. We collect data and find the following
regression line:

Score = 40 + 5 × (Hours)

This means:

• When no study is done (0 hours), the expected score is 40.


• For every 1 extra hour studied, the score increases by 5 marks.
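
A minimal sketch of fitting such a line in R (the language used in later units) with lm(); the hours and scores below are invented for illustration, not taken from the example above:

```r
# Fitting a simple linear regression with lm(); the data are made up.
study <- data.frame(
  hours = c(1, 2, 3, 4, 5, 6),
  score = c(45, 52, 54, 61, 66, 70)
)

model <- lm(score ~ hours, data = study)  # fits score = a + b * hours + e
summary(model)  # estimated intercept (a), slope (b), and fit statistics
coef(model)     # just the two coefficients

# Predicted score for a student who studies 4.5 hours
predict(model, newdata = data.frame(hours = 4.5))
```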

📝 Use in Real Life:

• Predicting sales based on advertising budget


• Estimating crop yield based on rainfall
• Forecasting temperature based on time of day
🔷 2. Confidence Interval vs Prediction Interval
✅ Confidence Interval (CI):

• A Confidence Interval tells us the range in which the average (mean) value of Y is expected to fall
for a given X value.
• It gives certainty about the average prediction.
• Usually given with a confidence level like 95%.

🔹 Example:
If a model says the average score for students who study 5 hours is between 70 and 75, we are 95%
confident that this range contains the true average.

✅ Prediction Interval (PI):

• A Prediction Interval tells us the range in which an individual’s actual Y value is expected to fall
for a given X.
• It accounts for more uncertainty, so it is wider than a confidence interval.

🔹 Example:
For a student who studied 5 hours, their predicted score might fall between 65 and 80. This is a prediction
interval.

📌 Summary of Difference:
Feature      Confidence Interval  Prediction Interval
Tells about  Mean of Y            One new value of Y
Width        Narrower             Wider
Use          General trend        Individual forecast
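
A short sketch of both intervals in R, reusing the hypothetical model fitted in the earlier sketch; predict() supports both interval types:

```r
# Same X value, two different intervals (reuses 'model' from above).
# The prediction interval comes out wider, as the table says.
new_x <- data.frame(hours = 5)

predict(model, newdata = new_x, interval = "confidence", level = 0.95)
predict(model, newdata = new_x, interval = "prediction", level = 0.95)
# Each call returns fit (point estimate), lwr and upr (interval bounds).
```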

🔷 3. Multiple Linear Regression (MLR)


📘 Meaning:

When we use two or more independent variables (X₁, X₂, X₃…) to predict a single dependent variable
(Y), it is called Multiple Linear Regression.

📈 Equation:
Y = a + b₁X₁ + b₂X₂ + b₃X₃ + ⋯ + e

• a = Intercept
• b₁, b₂… = Slopes (impact of each independent variable)
• X₁, X₂… = Independent variables
• e = Error term
✅ Example:

Predicting the price of a house based on:

• Size of the house (X₁)


• Number of bedrooms (X₂)
• Age of the house (X₃)

Model:

Price = 50,000 + 2000 × (Size) + 10,000 × (Bedrooms) − 500 × (Age)

This means:

• Price increases by ₹2000 for each extra square foot


• ₹10,000 more for each bedroom
• ₹500 less for every year older the house is
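
A hedged sketch of this kind of model in R; the house data below is invented purely to show the lm() call with several predictors:

```r
# Multiple linear regression with lm() on invented house data.
houses <- data.frame(
  price    = c(610000, 725000, 540000, 830000, 655000, 580000),
  size     = c(1200, 1500, 1000, 1800, 1350, 1100),  # square feet
  bedrooms = c(2, 3, 2, 4, 3, 2),
  age      = c(10, 5, 20, 2, 8, 15)                  # years
)

mlr <- lm(price ~ size + bedrooms + age, data = houses)
summary(mlr)  # one slope per predictor, each holding the others constant
```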

🔷 4. Interpretation of Regression Coefficients


Each regression coefficient in the equation tells us how the dependent variable (Y) changes when that
particular independent variable (X) increases by 1 unit, keeping all other variables constant.

✅ Example:

Model:

Sales = 10,000 + 300 × (TV Ads) + 150 × (Online Ads)

Interpretation:

• For each 1 unit increase in TV advertising, sales increase by ₹300 (if Online Ads stay the same).
• For each 1 unit increase in Online Ads, sales increase by ₹150 (if TV Ads remain same).

👉 This helps businesses understand which factor has more influence.

🔷 5. Heteroscedasticity
📘 Meaning:

Heteroscedasticity occurs when the spread (variance) of the errors is not constant across all levels of the
independent variable(s).

In simple words, the variation in prediction errors changes at different values of X.


✅ Example:

When predicting people’s spending, low-income people may have small variation in spending, but high-
income people may have very large variation. So the errors are unequal.

🔍 Visual Sign:

• If we plot residuals (errors) and they form a cone or funnel shape, it indicates heteroscedasticity.

⚠ Problem:

• Violates one assumption of linear regression


• Leads to incorrect standard errors
• Makes hypothesis tests (like t-tests) unreliable
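
A quick sketch of the usual visual check in R, assuming a fitted model such as the hypothetical mlr from the sketch above:

```r
# Residuals-vs-fitted plot for the 'mlr' model sketched earlier.
# A cone or funnel shape here is the classic sign of heteroscedasticity.
plot(fitted(mlr), resid(mlr),
     xlab = "Fitted values", ylab = "Residuals",
     main = "Residuals vs Fitted")
abline(h = 0, lty = 2)  # reference line at zero error
```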

🔷 6. Multicollinearity
📘 Meaning:

Multicollinearity happens when two or more independent variables in a regression model are highly
correlated with each other.

✅ Example:

In predicting house price:

• House size in sq. feet (X₁)


• Number of rooms (X₂)

These two may be highly correlated, because larger houses usually have more rooms. This leads to
multicollinearity.

🔍 Why is it bad?

• It becomes hard to know the individual effect of each variable


• Makes the model unstable
• Coefficients may become misleading or even change signs
• The model may look good overall (high R²), but individual predictors may appear insignificant

📊 How to Detect?

• Using Variance Inflation Factor (VIF). If VIF > 10, multicollinearity is likely a problem.
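
A minimal sketch using the vif() function from the car package (an assumption: the package must be installed separately), applied to the hypothetical mlr model from earlier:

```r
# VIF via the car package (install.packages("car") first).
library(car)
vif(mlr)  # one value per predictor; values above ~10 signal trouble
```
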
🟩 Final Summary:
Topic                       Explanation
Simple Linear Regression    Predicts Y from one X using a straight-line relationship
Confidence Interval         Predicts the range of average values of Y for given X
Prediction Interval         Predicts the range of a single Y for given X
Multiple Linear Regression  Predicts Y using more than one X
Regression Coefficients     Show how much Y changes with a 1 unit change in X
Heteroscedasticity          Unequal error variance; violates regression assumptions
Multicollinearity           High correlation among Xs; makes the model unstable


Textual data means any kind of information that is written in the form of words or characters (letters,
symbols, punctuation). It is also called unstructured data because it doesn't follow a fixed format like
numbers in an Excel sheet. It can express feelings, thoughts, opinions, stories, or information. Unlike
numbers, it requires interpretation and context to be understood fully.

🔹 Characteristics of Textual Data:

1. Unstructured – Doesn’t fit neatly into rows/columns like numbers do.


2. Rich in meaning – Words carry emotions, opinions, context.
3. Highly variable – People use different styles, languages, and spellings.
4. Context-dependent – Meaning of words can change based on situation or tone.

🔹 Types of Textual Data:


Type           Description                           Example
Formal Text    Structured and grammatically correct  News articles, academic papers
Informal Text  Casual, often with slang or emojis    Tweets, WhatsApp chats
Short Text     Few words or lines                    Comments, tweets
Long Text      Paragraphs or pages                   Reports, essays, blog posts

🔹 Examples of Textual Data:

• A WhatsApp chat or email


• A social media post (like a tweet or Instagram caption)
• A book, article, or news report
• Online product reviews on Amazon or Flipkart
• Interview transcripts or survey responses
• Web pages or blogs

✅ 1. What is Textual Data Analysis? (Basics)


Textual Data Analysis means studying and understanding written text using scientific or
computational methods.

Text can be anything like:

• Social media posts


• News articles
• Emails
• Reviews (e.g., product reviews on Amazon)
• Interview transcripts
• Reports or academic documents

In simple terms, textual data analysis helps you make sense of huge amounts of written text by:
• Identifying patterns
• Understanding emotions or sentiments
• Finding common topics
• Summarizing information
• Detecting important words or phrases

🔹 Two broad approaches:

1. Manual Analysis (Qualitative)


Done by humans—e.g., reading and interpreting themes in interviews or essays.
2. Automated Analysis (Quantitative/Computational)
Done by computers using algorithms, Natural Language Processing (NLP), etc.

✅ 2. Significance of Textual Data Analysis (Why It Matters)


🟡 a. Huge Growth of Textual Data

Today, we are generating millions of text-based data every second—on social media, blogs, emails,
forums, etc. Manually reading all of it is impossible.

🟡 b. Helps Extract Meaning from Unstructured Data

Most textual data is unstructured (no clear format), unlike numbers in a spreadsheet. Textual analysis
converts unstructured text into structured insights.

🟡 c. Improves Decision-Making

Organizations can understand:

• What customers are saying


• Public opinion on political issues
• What’s trending or changing in public behavior
This helps in better marketing, policymaking, and planning.

🟡 d. Saves Time and Effort

Automated tools can analyze thousands of documents in seconds, which saves human labor and time.

🟡 e. Supports Research and Academics

In social sciences, textual analysis helps researchers find:

• Social patterns
• Ideologies
• Cultural meanings in interviews, speeches, or news media

✅ 3. Applications of Textual Data Analysis


Textual Data Analysis is used in many fields. Here are some real-life applications with examples:
🔵 a. Social Media Monitoring

Companies analyze tweets, Facebook posts, etc., to understand:

• Customer satisfaction
• Public opinion
• Trends or viral content
Example: Twitter sentiment about a political leader or a product launch.

🔵 b. Marketing and Product Reviews

Companies analyze product reviews to know:

• What customers like/dislike


• Common problems
Example: Amazon analyzing customer feedback on a new phone model.

🔵 c. Healthcare

Doctors or researchers analyze patient feedback, clinical notes, or online health forums to understand:

• Patient experiences
• Common symptoms
• Effectiveness of treatments

🔵 d. Legal and Criminal Investigation

Lawyers or police use text analysis on:

• Witness statements
• Case documents
• Social media posts of suspects
To find useful patterns or contradictions.

🔵 e. Academic Research

Social science researchers analyze:

• Speeches
• Interviews
• Media articles
To study ideologies, cultural shifts, or gender/race narratives.
🔵 f. News and Media

News organizations use it to:

• Track trending stories


• Detect fake news
• Summarize long reports

🔵 g. Customer Service

Chatbot messages and support emails are analyzed to:

• Detect common complaints


• Improve response quality
• Train bots to reply better

🔵 h. Political Analysis

Textual analysis of political speeches, manifestos, or social media helps understand:

• Political ideology
• Public opinion during elections
• Misinformation or propaganda

✅ 4. Challenges in Textual Data Analysis


Even though it’s useful, textual analysis has several problems or limitations:

🔺 a. Language Complexity

Human language is complex, ambiguous, and context-based.

• Same word can have different meanings.


• Sarcasm and humor are difficult for machines to understand.
Example: “Great! Another delay!” – might sound positive, but it’s actually sarcastic.

🔺 b. Multilingual Data

Text is written in different languages or mixed languages (like Hinglish).

• Hard to analyze all languages with one tool.


• Translation may lose original meaning.
🔺 c. Slang, Emojis, and Informal Text

People use emojis, abbreviations, and slang in social media.


Example: “LOL”, “BRB”, “🔥💯” – these are not easy for machines to understand.

🔺 d. Data Cleaning is Difficult

Before analysis, the text must be cleaned:

• Remove unnecessary characters, ads, spam, or formatting.


• This cleaning is time-consuming and error-prone.

🔺 e. Bias and Misinterpretation

If tools are not trained properly, they may give biased results.

• May misclassify emotions.


• May give wrong conclusions if the training data is biased.

🔺 f. Volume and Variety

Too much data (big volume) and too many types of sources (blogs, articles, chats, reports) make it harder to
standardize analysis.

🔺 g. Requires Technical Skills

Automated text analysis needs knowledge of:

• Natural Language Processing (NLP)


• Programming (Python, R)
• Statistics
This makes it hard for non-technical users.

🔺 h. Privacy and Ethics

Analyzing people’s messages, reviews, or emails can raise privacy concerns.

• Ethical issues arise if users are not aware that their data is being analyzed.

✅ Summary

Aspect                    Explanation
What is it?               Understanding and analyzing written text to find meaning, patterns, or insights.
Why is it important?      Helps deal with large unstructured data, supports research and decision-making, saves time.
Where is it used?         Marketing, politics, healthcare, law, media, academic research, customer service.
What are the challenges?  Language complexity, sarcasm, slang, multilingual issues, data cleaning, privacy, need for technical skills.

✅ Introduction to Textual Analysis using R

🧠 Why Use R for Textual Analysis?


R is a powerful statistical programming language that comes with a rich ecosystem of packages for text
mining and analysis.

Why R?

• It’s free and open-source.


• It has specialized libraries like tm, quanteda, tidytext, textdata.
• It integrates well with data visualization tools like ggplot2.
• It supports preprocessing, machine learning, and NLP (Natural Language Processing) tasks.

🧰 Basic Steps in Textual Analysis using R


Let’s go through each step in detail:

🔹 1. Collecting the Text Data

Before analysis, you need to get the data. You can use:

• Text files (like .txt)


• CSV files with a text column
• Web scraping (using rvest)
• APIs (like Twitter or Reddit)
• Manual entry of small data

🔹 2. Preprocessing the Text

Raw text is messy. We need to clean it before analysis.

Cleaning Steps:

• Convert to lowercase
• Remove punctuation
• Remove numbers
• Remove stopwords (e.g., "is", "the", "and")
• Stemming or Lemmatization (reduce words to root form)
• Strip whitespace
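
A sketch of the cleaning steps above with the tm package (assumes install.packages("tm") and, for stemming, install.packages("SnowballC")); the two documents are made up:

```r
# Cleaning a tiny invented corpus with tm.
library(tm)

docs   <- c("I LOVED this phone!!", "Battery life is bad, really bad 123")
corpus <- VCorpus(VectorSource(docs))

corpus <- tm_map(corpus, content_transformer(tolower))       # lowercase
corpus <- tm_map(corpus, removePunctuation)                  # punctuation
corpus <- tm_map(corpus, removeNumbers)                      # numbers
corpus <- tm_map(corpus, removeWords, stopwords("english"))  # stopwords
corpus <- tm_map(corpus, stemDocument)                       # stemming
corpus <- tm_map(corpus, stripWhitespace)                    # extra spaces
```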

🔹 3. Creating a Document-Term Matrix (DTM)

This step converts text into a structured format — a matrix of documents vs. terms.

• Rows = documents
• Columns = terms (words)
• Cells = frequency of that word in that document
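
Continuing the sketch above, the cleaned corpus can be turned into a DTM and inspected:

```r
# Building and inspecting the DTM from the cleaned corpus above.
dtm <- DocumentTermMatrix(corpus)
inspect(dtm)                     # rows = documents, columns = terms
findFreqTerms(dtm, lowfreq = 1)  # terms appearing at least once
```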

🔹 4. Exploratory Data Analysis (EDA) of Text

You can now do interesting things like:

• Most frequent words


• Word clouds
• Bar plots of word counts

🔹 5. Sentiment Analysis

You can determine the emotional tone of text (positive, negative, angry, sad, etc.) using predefined
lexicons.
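
One possible sketch with the tidytext and dplyr packages (an assumption, since the text names no specific tool); the "bing" lexicon labels words as positive or negative, and the reviews are invented:

```r
# Lexicon-based sentiment scoring with tidytext + dplyr.
library(dplyr)
library(tidytext)

reviews <- data.frame(id   = 1:2,
                      text = c("great camera, amazing screen",
                               "terrible battery, disappointed"))

reviews %>%
  unnest_tokens(word, text) %>%                       # one word per row
  inner_join(get_sentiments("bing"), by = "word") %>% # tag each word
  count(id, sentiment)  # positive/negative word counts per review
```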

🔹 6. Visualization of Topics or Sentiments

You can visualize:

• Word clouds per topic


• Sentiment scores across time
• Trends of keywords

🧩 Real-life Applications of Textual Analysis in R


1. Customer reviews analysis on Amazon, Flipkart, etc.
2. Social media sentiment (e.g., Twitter opinion on a movie).
3. Academic paper classification into topics.
4. News clustering into categories.
5. Speech analysis (e.g., political leader’s speeches).

✅ Methods and Techniques of Textual Analysis


1. Text Mining (Also called Text Data Mining)
📌 What is it?

Text mining means extracting useful information from huge sets of unstructured text data.
It is like digging gold from tons of sand: meaningful insights are pulled out of otherwise meaningless text.

🧠 Key Steps of Text Mining:


a. Data Collection:

• Text is collected from sources like social media, news, blogs, emails, surveys, etc.
• Example: 10,000 Amazon product reviews.

b. Text Pre-processing:

• The text is cleaned to remove noise.


• Steps:
o Tokenization: Break sentences into words.
§ “I love coffee” → [“I”, “love”, “coffee”]
o Stopword Removal: Common words such as “is”, “the”, and “and” are removed.
o Stemming/Lemmatization: Words are converted to their root form.
§ “Running”, “Runs” → “Run”

c. Information Extraction:

• Patterns, keywords, phrases, etc. are then extracted from the cleaned data.


• Tools like TF-IDF (Term Frequency – Inverse Document Frequency) are used.

d. Pattern Detection:

• Algorithms help find trends or clusters.


• Example: “Most customers are complaining about battery life in mobile phones.”

2. Categorization (Also known as Text Classification)


📌 What is it?

Categorization (also called classification) means dividing text into specific predefined categories.

🧠 Example:

• A news article can be categorized as:


o Politics
o Sports
o Technology
o Entertainment

Or customer feedbacks can be:


• Complaint
• Praise
• Neutral Suggestion

🧪 How does it work?


a. Supervised Learning:

• First, the system is given labeled data.


• Example: 5000 emails, each labeled “Spam” or “Not Spam”.
• The algorithm learns from these examples and predicts the category of new texts.

b. Common Algorithms:

• Naive Bayes Classifier


• Support Vector Machines (SVM)
• Decision Trees
• Neural Networks

c. Applications:

• Email spam detection


• Categorizing tweets into “support” or “complaint”
• Sorting resumes into “Marketing”, “IT”, “Finance” profiles

3. Sentiment Analysis
📌 What is it?

In Sentiment Analysis, we try to understand what kind of emotion or attitude a text expresses:
Positive, Negative, or Neutral.

🧠 Why is it useful?

• Companies use it to understand customer opinions.


• Politicians use it to check public mood on social media.
• Brands track whether product reviews are happy or angry.

🧪 How does it work?


a. Lexicon-based Approach:

• Ready-made word lists (lexicons) are used.


• Example:
o “Great”, “Happy”, “Amazing” → Positive
o “Bad”, “Terrible”, “Disappointed” → Negative

b. Machine Learning-based Approach:

• Train a model on labeled data.


• Example:
o Text: “The delivery was late and the product was broken.”
o Label: Negative
• Model learns patterns and predicts for new sentences.

c. Fine-Grained Sentiment:

• Sometimes a plain Positive/Negative label is not enough.


• We may use 5-star scale:
o 5 Stars → Very Positive
o 1 Star → Very Negative

d. Aspect-based Sentiment:

• A single review can contain multiple opinions.


• Example:

“The camera is amazing but the battery life sucks.”

o Camera → Positive
o Battery → Negative

🧰 Tools Used:

• NLTK (Python)
• TextBlob
• Vader Sentiment
• IBM Watson
• Google Cloud Natural Language API

🔄 Summary Table:

Method              Purpose                        Example
Text Mining         Extract patterns and insights  Finding trending topics from 1 million tweets
Categorization      Sort text into categories      Classify articles into Tech, Politics, Health, etc.
Sentiment Analysis  Detect emotion/mood in text    Review says "Poor service" → Negative sentiment
Unit 3 (10 Marks)

1. Introduction to R
R is a programming language designed specifically for statistical computing, data analysis, and
visualization. Unlike general-purpose languages like Java or C++, R is built with the core intention of
handling, analyzing, and visualizing data easily and efficiently.

• History: R was developed in the early 1990s by Ross Ihaka and Robert Gentleman. It is based on the
S programming language but with added features and open-source accessibility.
• Use Cases: It is used in fields such as data science, bioinformatics, financial modeling, academic
research, and any domain that involves large amounts of data.
• Nature: R is not just a language; it's a complete environment that includes built-in functions for data
manipulation, tools for reading/writing data, and support for graphics and statistical modeling.

2. Advantages of R
R stands out from other programming languages when it comes to data and statistical work. Here's why:

• Open Source and Free: Anyone can download and use R. It’s maintained by a strong community
and regularly updated.
• Specialized for Data: It was created specifically for statistics and data analysis, so tasks like data
cleaning, visualization, and analysis are very simple and intuitive.
• Large Number of Packages: Thousands of community-developed packages extend R’s
functionality in fields like machine learning, genetics, economics, and more.
• Powerful Visualization: You can create advanced and customizable plots (bar charts, histograms,
maps, heatmaps) using packages like ggplot2 and lattice.
• Cross-Platform Support: Runs on Windows, macOS, and Linux seamlessly.
• Easy Integration: R can work alongside other tools and languages like SQL, Python, Hadoop, and
even Excel.

3. Installation of R
To use R, two installations are typically required:

a) R Base

• This is the actual R language and environment.


• You download it from CRAN (Comprehensive R Archive Network) which is the official repository
for R.
• It gives you a console to write and execute R commands.

b) RStudio (IDE for R)


• RStudio is a user-friendly interface that makes coding in R easier.
• It provides a script editor, console, workspace viewer, plot viewer, and package manager—all in one
window.
• It enhances productivity by providing features like auto-completion, error highlighting, and file
organization.

4. Packages in R
A package is a collection of R functions, documentation, and data created to solve a specific problem or
extend R’s capabilities.

• Think of packages as “apps” for R. If R is your phone, packages are additional apps you install to do
more things.
• For example, if you want to make beautiful graphs, you install the ggplot2 package. For advanced
data manipulation, use dplyr.

Important Points:

• Once installed, a package stays in your system.


• But you must load it every time you want to use it in a new session.
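
A minimal sketch of the two steps, using ggplot2 as the example package:

```r
install.packages("ggplot2")  # step 1: install once (downloads from CRAN)
library(ggplot2)             # step 2: load in every new session
```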

5. Importing Data from Spreadsheet Files


Data often comes from external sources, especially spreadsheet software like Microsoft Excel or CSV files
from online platforms.

a) CSV Files:

• Most commonly used for storing tabular data.


• These are plain text files with data separated by commas.
• In R, they are simple to read and quick to load.

b) Excel Files:

• Excel files may contain multiple sheets, formulas, and cell formatting.
• Reading Excel in R requires external packages like readxl or openxlsx.

Once the data is loaded into R, it is stored in an object (usually a data frame) which can then be
manipulated or analyzed.
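
A hedged sketch of both cases; the file names are placeholders for your own files:

```r
# Base R reads CSV directly; Excel needs an extra package.
marks <- read.csv("students.csv")  # returns a data frame

# install.packages("readxl")      # one-time install
library(readxl)
sales <- read_excel("sales.xlsx", sheet = 1)
```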

6. Commands and Syntax


R commands are written in a specific structure called syntax, which tells R what to do.

Key Concepts:

• R is case-sensitive: Variable and variable are different.


• Commands are structured using functions.
• Each function can take one or more arguments (pieces of data you give it to work on).
• Comments start with # and are ignored during execution. They are useful for writing explanations in
your code.

Example Behavior:

• When you type a command and press Enter, R evaluates it immediately and shows the result.
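
A small sketch of this behavior, reusing the marks from the Unit 1 example:

```r
x <- c(95, 80, 76, 88)  # assignment; c() combines values into a vector
mean(x)                 # evaluated immediately; prints 84.75
round(mean(x), 1)       # two arguments: the value and the digits to keep
# Typing X instead of x would be an error: R is case-sensitive
```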

7. Packages and Libraries


• Package: A folder that contains functions, data sets, and help files.
• Library: The place (directory) on your system where all the installed packages are stored.

To use a package:

1. Install it (only once).


2. Load it into memory (every session).

R comes with some default packages, but most useful work requires installing external packages from
CRAN, GitHub, or Bioconductor.

8. Data Structures in R
R provides various ways to store and organize data. These structures define how data is arranged and
accessed.

a) Vectors:

• Basic and most common structure.


• A collection of values that are all of the same type (e.g., all numbers or all characters).

b) Matrices:

• A two-dimensional structure like a table.


• All elements must be of the same type.
• Each value is arranged in rows and columns.

c) Arrays:

• Like matrices but with more than two dimensions.


• Used for more complex data, like 3D datasets.

d) Lists:

• A flexible structure where each element can be of a different type.


• You can have a list with numbers, characters, vectors, and even other lists inside.

e) Factors:
• Used for categorical data (like gender: male/female).
• Internally stored as integers but displayed as labels.
• Useful for grouping and statistical analysis.

f) Data Frames:

• One of the most powerful and commonly used data structures.


• Think of them like Excel sheets: each column can have a different type (text, number, etc.).
• Best suited for storing datasets.
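
A compact sketch creating one small instance of each structure (the names and marks are toy values):

```r
v  <- c(10, 20, 30)                        # vector: one type throughout
m  <- matrix(1:6, nrow = 2, ncol = 3)      # matrix: 2 rows x 3 columns
a  <- array(1:8, dim = c(2, 2, 2))         # array: three dimensions
l  <- list(name = "Asha", marks = v)       # list: mixed types allowed
f  <- factor(c("male", "female", "male"))  # factor: categorical labels
df <- data.frame(name = c("Asha", "Ravi"), marks = c(88, 76))
str(df)  # data frame: columns can differ in type, like an Excel sheet
```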

9. Conditionals and Control Flow


Control structures in R let you change the flow of your program based on conditions.

Conditional Statements:

• If: Checks if a condition is true.


• If-else: Chooses between two alternatives.
• Else-if ladder: Allows checking multiple conditions in order.

These are used to make decisions and execute different blocks of code accordingly.
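
A minimal sketch of an if / else-if / else ladder:

```r
marks <- 72
if (marks >= 80) {
  print("Distinction")
} else if (marks >= 50) {
  print("Pass")        # this branch runs: 72 is between 50 and 79
} else {
  print("Fail")
}
```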

10. Loops in R
Loops allow you to repeat a task multiple times. This is especially useful when working with large datasets
or repeating actions.

a) For Loop:

• Used to run a block of code for each element in a sequence.


• Example use: printing each value in a list or calculating the total of multiple items.

b) While Loop:

• Repeats a task as long as a condition is true.


• Useful when you don’t know beforehand how many times to repeat.

c) Repeat Loop:

• Keeps repeating until a certain condition is manually broken using break.


• Risky if no break condition is given—it may lead to an infinite loop.
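
A short sketch of all three loop types on toy values:

```r
for (m in c(95, 80, 76)) print(m)  # for: runs once per element

i <- 1
while (i <= 3) {  # while: repeats as long as the condition is TRUE
  print(i)
  i <- i + 1
}

j <- 1
repeat {          # repeat: must be stopped manually with break
  j <- j * 2
  if (j > 10) break
}
j  # 16
```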

11. Functions in R
Functions are blocks of code that perform a specific task. You define them once and use them many times,
saving effort and reducing mistakes.
Key Features:

• You can create your own functions.


• Functions can take input values (arguments) and return output.
• Promotes modular programming: break big problems into smaller reusable parts.
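
A minimal sketch: define once, reuse many times:

```r
grade_average <- function(marks) {  # 'marks' is the argument
  sum(marks) / length(marks)        # the last value evaluated is returned
}

grade_average(c(95, 80, 76, 88))  # reusable anywhere; returns 84.75
```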

12. The Apply Family


The apply family includes a set of functions that help apply the same operation over elements of a data
structure without writing loops. These are faster and more efficient than traditional loops.

Types:

• apply(): Used for rows or columns of matrices.


• lapply(): Works on lists and returns a list.
• sapply(): Like lapply() but tries to return a simple structure (like a vector or matrix).
• tapply(): Used on a vector split into groups based on another vector.
• mapply(): Applies a function to multiple arguments in parallel.

These functions are very helpful in data analysis, especially when you're doing repetitive operations.
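
A compact sketch of each member on toy data:

```r
m <- matrix(1:6, nrow = 2)
apply(m, 1, sum)                        # over rows (use 2 for columns)
lapply(list(a = 1:3, b = 4:6), mean)    # over a list; returns a list
sapply(list(a = 1:3, b = 4:6), mean)    # same, simplified to a vector
tapply(c(10, 20, 30, 40),
       c("x", "y", "x", "y"), mean)     # mean per group: x = 20, y = 30
mapply(function(a, b) a + b, 1:3, 4:6)  # parallel over two vectors: 5 7 9
```
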
Unit 4 (10 Marks)

🟢 1. Importing Data File


🔹 What Does It Mean?

When we talk about importing a data file in R, we mean bringing external data into R so we can analyze
and visualize it. Think of R as a kitchen and your data file (CSV, Excel, etc.) as ingredients you bring in to
cook (analyze).

🔹 Why Do We Import Data?

• Most real-world data is stored in external files, not typed in manually.


• These could be sales records, survey responses, sensor readings, stock prices, etc.
• Importing allows you to manipulate, summarize, and visualize that data in R.

🔹 Types of Data Files You Can Import:

File Type  Description
.csv       Comma-separated values, plain text format
.xlsx      Excel spreadsheet file
.txt       Plain text files
.json      JavaScript Object Notation files

🔹 What Happens After Import?

• Data is stored in data frames: like Excel sheets with rows and columns.
• Each column = a variable (like "Age", "Income")
• Each row = an observation (like "Person A", "Person B")
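
A minimal sketch (the file name is a placeholder) of importing a CSV and checking the resulting data frame:

```r
survey <- read.csv("survey.csv")  # placeholder file name
head(survey)  # first six rows (observations)
str(survey)   # one line per column (variable): name, type, sample values
dim(survey)   # number of rows and columns
```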

🟢 2. Data Visualization Using Charts


Data visualization helps convert numbers into pictures, so we can understand trends, patterns, and
relationships more easily.

🔷 A. Histogram

🧠 What is it?

A histogram shows how often different ranges of values appear in a dataset. It’s used for continuous
numerical data.

📝 Example:

You have exam scores of 100 students. A histogram can tell you how many students scored between 50–60,
60–70, etc.

📌 Interpretation:

• A bell-shaped histogram = Normal distribution.


• Right-skewed = more low scores; left-skewed = more high scores.
• Helps identify central tendency, spread, and outliers.
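
A sketch with invented scores and base R's hist():

```r
scores <- c(52, 58, 61, 64, 67, 68, 71, 73, 74, 78, 81, 85, 92)
hist(scores, breaks = seq(50, 100, by = 10),  # bins of width 10
     main = "Exam Scores", xlab = "Score", col = "lightblue")
```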

🔷 B. Bar Chart

🧠 What is it?

A bar chart shows frequency of categories using bars. Used for categorical data (like gender, product
type, city name).


📝 Example:

A survey asks people about their favorite fruit. Categories = Apple, Mango, Banana. The bar height shows
how many people chose each.

📌 Interpretation:

• Tall bar = high frequency


• Easy to compare categories
• Bars don’t touch each other (unlike histograms)
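
A sketch with invented survey counts and base R's barplot():

```r
fruit_counts <- c(Apple = 12, Mango = 18, Banana = 7)
barplot(fruit_counts, main = "Favourite Fruit",
        ylab = "Number of people", col = "orange")
```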

🔷 C. Box Plot (Box-and-Whisker Plot)

🧠 What is it?

A box plot shows summary statistics: min, max, median, and quartiles, and detects outliers.

📝 Example:

You want to compare incomes of two cities. Box plots help you see:

• Which city has a higher median income?


• Which city has more variability?
• Any extreme values?

📌 Interpretation:

• Line inside box = Median


• Box length = Interquartile range (middle 50%)
• Dots outside whiskers = Outliers
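
A sketch comparing two invented income samples with boxplot():

```r
city_a <- c(25, 28, 30, 32, 35, 38, 90)  # 90 shows up as an outlier dot
city_b <- c(20, 31, 34, 40, 45, 52, 60)
boxplot(city_a, city_b, names = c("City A", "City B"),
        ylab = "Income (thousands)")
```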

🔷 D. Line Graph

🧠 What is it?

A line graph shows how something changes over time. Used for time-series data.


📝 Example:

Track daily temperature over a month. Line graph shows if it’s rising, falling, or fluctuating.

📌 Interpretation:

• Upward slope = increase over time


• Downward slope = decrease over time
• Useful in finance, weather, health data, etc.
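
A sketch of a small invented time series with plot(type = "l"):

```r
day  <- 1:10
temp <- c(28, 29, 31, 30, 32, 33, 35, 34, 33, 31)
plot(day, temp, type = "l",  # "l" joins the points into a line
     xlab = "Day", ylab = "Temperature (°C)", main = "Daily Temperature")
```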

🔷 E. Scatter Plot

🧠 What is it?

A scatter plot shows relationship between two numerical variables. Each dot = one observation.


📝 Example:

You want to see if study time affects exam scores. Put study hours on X-axis and scores on Y-axis.
📌 Interpretation:

• Dots in rising pattern = Positive correlation


• Dots in falling pattern = Negative correlation
• No pattern = No relationship
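
A sketch with invented study data; abline() adds the fitted line:

```r
hours  <- c(1, 2, 3, 4, 5, 6, 7)
scores <- c(45, 50, 56, 60, 67, 70, 75)
plot(hours, scores, xlab = "Hours studied", ylab = "Exam score")
abline(lm(scores ~ hours))  # rising line = positive correlation
```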

🟢 3. Measures of Central Tendency


These are summary values that describe the center of the dataset.

🔷 Mean (Average)

🧠 What is it?

Add all values and divide by how many there are.

📌 When to use:

• When data is symmetric and no outliers.


• Sensitive to outliers. One very large value can distort the mean.

🔷 Median

🧠 What is it?

The middle value when data is sorted.

📌 When to use:

• For skewed data (e.g., income, property prices)


• Not affected by outliers

🔷 Mode

🧠 What is it?

The most frequent value.

📌 When to use:

• Best for categorical or discrete data.


• A dataset can have no mode, one mode, or multiple modes.
📝 Example Comparison:

Data: 10, 20, 30, 30, 90

• Mean = 36
• Median = 30
• Mode = 30

If the 90 was 900 instead:

• Mean becomes 198 → heavily affected


• Median and Mode stay at 30 → stable
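
The same example in R; base R has no built-in mode for data values (its mode() reports storage type), so a small helper is sketched:

```r
x <- c(10, 20, 30, 30, 90)
mean(x)    # 36
median(x)  # 30

stat_mode <- function(v) {  # helper: the most frequent value
  tab <- table(v)
  as.numeric(names(tab)[which.max(tab)])
}
stat_mode(x)  # 30
```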

🟢 4. Measures of Dispersion
These show how spread out or clustered the data values are.

🔷 Range

🧠 What is it?

Difference between largest and smallest value.

📌 Use:

• Quick overview of spread


• Doesn’t tell how values are spread in between

🔷 Variance

🧠 What is it?

Average of the squared differences from the mean. Tells how far values are from the mean on average.

🔷 Standard Deviation

🧠 What is it?

Square root of variance. Interpreted in the same units as the data.

📌 Use:

• A small SD → data points are close to the mean


• A large SD → data is more spread out
🔷 Interquartile Range (IQR)

🧠 What is it?

Difference between Q3 and Q1 — middle 50% of data.

📌 Use:

• Removes the influence of outliers


• Focuses on core values
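
A sketch computing all four measures on one vector:

```r
x <- c(10, 20, 30, 30, 90)
max(x) - min(x)  # range: 80
var(x)           # sample variance (squared units)
sd(x)            # standard deviation, same units as the data
IQR(x)           # interquartile range, Q3 - Q1
```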

🟢 5. Relationship Between Variables


Let’s now explore how two variables move together.

🔷 Covariance

🧠 What is it?

Shows if two variables move in the same direction (positive) or opposite (negative).

📝 Example:

• If income and spending increase together → positive covariance.


• If income increases but spending decreases → negative covariance.

📌 Limitation:

• Doesn’t show strength or scale. That’s where correlation comes in.

🔷 Correlation

🧠 What is it?

Measures direction and strength of relationship between two variables.

📏 Range: -1 to +1

• +1: Perfect positive relationship


• 0: No relationship
• -1: Perfect negative relationship

📝 Example:
• Study time vs grades = +0.85 → strong positive correlation
• TV time vs grades = -0.60 → moderate negative correlation

🔷 Coefficient of Determination (R²)

🧠 What is it?

Tells how much of the variation in one variable is explained by the other.

📏 Range: 0 to 1

• R² = 0.90 → 90% of variation explained


• R² = 0.20 → only 20% explained

📌 Use:

• In regression models, this shows how well the model fits the data.
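
A sketch of all three measures on invented data:

```r
study  <- c(1, 2, 3, 4, 5)
grades <- c(50, 55, 62, 66, 74)

cov(study, grades)       # positive: the two move together
r <- cor(study, grades)  # near +1: strong positive relationship
r^2                      # share of variation in grades explained by study
```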

✅ Final Summary Table


Topic                              Description                              Use Case
Mean                               Average value                            Balanced data
Median                             Middle value                             Skewed data
Mode                               Most frequent value                      Categorical data
Standard Deviation                 Average distance from mean               Spread of values
Histogram                          Distribution of numeric data             Shape of data
Bar Chart                          Frequency of categories                  Compare categories
Box Plot                           5-number summary + outliers              Variability & outliers
Line Graph                         Changes over time                        Time series
Scatter Plot                       Relationship between two variables       Correlation
Covariance                         Direction of relationship                Basic relationship check
Correlation                        Strength & direction of relationship     Predictive insights
Coefficient of Determination (R²)  How well one variable predicts another   Model performance
