AFDM UNIT 2 Notes
Data refers to raw facts and figures that are collected, processed, and analyzed to
extract meaningful insights. It can take many forms, such as numbers, text, images, audio,
or video, and it serves purposes such as decision-making, analysis, and reporting.
Characteristics of Data:
1. Raw and Unprocessed: Data in its raw form may not have any meaning until it is
processed or analyzed.
2. Variety: Data can be in different types, including numbers (quantitative data), text
(qualitative data), or other formats like audio, video, or images.
3. Can Be Structured or Unstructured:
• Structured Data: Organized in a tabular format (e.g., databases,
spreadsheets).
• Unstructured Data: Data that lacks a predefined structure (e.g., emails, social
media posts, images, and videos).
4. Collected from Various Sources: Data can come from many places, including
sensors, surveys, social media, websites, or business transactions.
Types of Data:
1. Structured Data
Definition: Structured data is highly organized and follows a specific format. It is typically
stored in a tabular format with rows and columns (like in databases or spreadsheets), making
it easy to search, query, and analyze. This type of data is easy to process and manage using
traditional data tools (e.g., SQL databases).
Characteristics:
• Highly Organized: Data is stored in predefined formats like tables.
• Easily Searchable: It’s straightforward to search and analyze with tools like SQL.
• Data Types: Numbers, dates, strings (text).
Examples:
• Relational Databases: A table in a database that holds customer information, like:
o Columns: Customer ID, Name, Address, Phone Number.
o Rows: Different customer records.
• Spreadsheets: An Excel sheet with sales data over multiple years, with columns for
dates, product names, quantities, and prices.
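A minimal sketch of the customer table described above, using Python's built-in sqlite3
module. The table name, column names, and sample rows are made up for illustration; the
point is only that a fixed schema lets a single SQL query search any column.

```python
import sqlite3

# In-memory database; the table and sample rows are made up for illustration.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE customers (
        customer_id INTEGER PRIMARY KEY,
        name        TEXT NOT NULL,
        address     TEXT,
        phone       TEXT
    )
    """
)

# Each tuple is one row (one customer record).
rows = [
    (1, "Asha Rao", "12 Lake Road, Pune", "555-0101"),
    (2, "Ben Carter", "4 Hill Street, Leeds", "555-0102"),
    (3, "Chen Wei", "88 Park Avenue, Pune", "555-0103"),
]
conn.executemany("INSERT INTO customers VALUES (?, ?, ?, ?)", rows)

# Because every row follows the same schema, a query can filter on any column.
pune_customers = conn.execute(
    "SELECT name, phone FROM customers WHERE address LIKE ? ORDER BY name",
    ("%Pune%",),
).fetchall()

print(pune_customers)
conn.close()
```

The same query would work unchanged on thousands of rows, which is what makes structured
data easy to search with traditional tools.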
2. Unstructured Data
Definition: Unstructured data is data that does not have a predefined structure or format. It is
often difficult to store and analyse with traditional data tools because it lacks the organization
of structured data. However, advancements in technology (like machine learning) are making
it easier to process unstructured data.
Characteristics:
• No predefined format: Does not follow a table-like format.
• Harder to process: Requires more advanced techniques (e.g., NLP, image
recognition) for analysis.
• Data Types: Text, audio, video, images, social media posts.
Examples:
• Text Files: Emails, documents, or reports where data is not organized in tables.
• Multimedia: Images, videos, and audio files whose content is not organized into
fields or records.
• Social Media Posts: Tweets, Facebook posts, or Instagram photos, which are not
organized in a table or database.
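As a rough illustration of why unstructured text needs extra processing, the sketch below
pulls hashtags and word counts out of a few made-up social media posts. The posts and the
helper names (extract_hashtags, tokenize) are assumptions for this example, and real NLP
pipelines go well beyond this kind of pattern matching.

```python
import re
from collections import Counter

# Made-up social media posts: free text with no columns or fields.
posts = [
    "Loving the new phone! #tech #gadgets",
    "Battery life on this phone is terrible... #tech",
    "Great camera, great screen. #photography #tech",
]

def extract_hashtags(text):
    """Return all hashtags in a post, lowercased and without the '#'."""
    return [tag.lower() for tag in re.findall(r"#(\w+)", text)]

def tokenize(text):
    """Split free text into lowercase word tokens, ignoring hashtags."""
    without_tags = re.sub(r"#\w+", " ", text)
    return re.findall(r"[a-z']+", without_tags.lower())

# Structure has to be imposed first; only then can the text be counted or analysed.
hashtag_counts = Counter(tag for post in posts for tag in extract_hashtags(post))
word_counts = Counter(word for post in posts for word in tokenize(post))

print(hashtag_counts.most_common(1))
print(word_counts["phone"], word_counts["great"])
```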
3. Semi-Structured Data
Definition: Semi-structured data lies somewhere between structured and unstructured data.
While it does not have a rigid structure like structured data, it still contains tags or markers to
separate elements and organize the data in a way that is somewhat identifiable and
analysable.
Characteristics:
• Partially organized: It does not follow a strict format like structured data but still
contains some organizational elements (such as tags or metadata).
• Flexible: It is more flexible than structured data but more manageable than
unstructured data.
• Common Formats: JSON, XML, log files, and CSV files with inconsistent rows.
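The sketch below shows one way semi-structured records might be mapped onto fixed columns.
The JSON records, the column list, and the flatten helper are made up for this example;
the keys play the role of the tags or markers mentioned above, and fields that a record
does not carry are filled with None.

```python
import json

# Made-up JSON records: keys act as tags, but not every record has the same fields.
raw = """
[
  {"id": 1, "name": "Asha", "contact": {"email": "asha@example.com"}},
  {"id": 2, "name": "Ben", "tags": ["vip"]},
  {"id": 3, "name": "Chen", "contact": {"email": "chen@example.com", "phone": "555-0103"}}
]
"""

records = json.loads(raw)
columns = ["id", "name", "email", "phone"]

def flatten(record):
    """Map one loosely structured record onto fixed columns; missing fields become None."""
    contact = record.get("contact", {})
    return {
        "id": record.get("id"),
        "name": record.get("name"),
        "email": contact.get("email"),
        "phone": contact.get("phone"),
    }

table = [flatten(r) for r in records]
for row in table:
    print([row[c] for c in columns])
```

Note that the "tags" field of the second record is simply dropped here; deciding which
fields to keep is part of the work of turning semi-structured data into a table.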
4. Big Data
Big Data refers to datasets so large and complex that traditional data processing tools
and systems cannot handle them effectively. The difficulty comes from the sheer volume,
velocity, and variety of the data.
Characteristics (commonly referred to as the 5 Vs of Big Data):
• Volume: The amount of data is massive (terabytes to petabytes).
• Velocity: Data is generated, and must be processed, at high speed.
• Variety: Big data can come in many forms – structured, semi-structured, and
unstructured.
• Veracity: The reliability and accuracy of the data, including any uncertainty in it.
• Value: The usefulness or insights derived from big data.
Examples:
• Social Media Data: Billions of social media posts, comments, images, and videos
generated every day.
• Sensor Data: Data from IoT devices, such as traffic sensors, weather stations, or
smart home devices.
• Financial Data: Transaction logs from banking systems, stock markets, and financial
exchanges.
• Healthcare Data: Patient records, genomic data, and medical images from hospitals
and health systems.
• Clickstream Data: Data captured from users interacting with websites or mobile
apps.
Technologies Used for Big Data:
• Hadoop: A framework that allows for distributed processing of large datasets across
clusters of computers.
• Apache Spark: A distributed data processing engine that does much of its work in
memory, which makes it fast for large-scale analytics.
• NoSQL Databases: Used for managing unstructured and semi-structured data at a large
scale (e.g., MongoDB, Cassandra).
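To give a feel for distributed processing, here is a single-machine sketch of the
map/shuffle/reduce pattern associated with Hadoop's MapReduce model, applied to a word
count. This is plain Python and not Hadoop's API; the input chunks and function names are
made up, and on a real cluster each chunk would be processed on a different node.

```python
from collections import defaultdict

# Made-up input, split into chunks as if each chunk lived on a different machine.
chunks = [
    ["big data is big", "data moves fast"],
    ["fast data needs fast tools"],
]

def map_phase(lines):
    """Emit a (word, 1) pair for every word in one chunk."""
    return [(word, 1) for line in lines for word in line.split()]

def shuffle(mapped_chunks):
    """Group the emitted values by key across all chunks."""
    grouped = defaultdict(list)
    for pairs in mapped_chunks:
        for word, count in pairs:
            grouped[word].append(count)
    return grouped

def reduce_phase(grouped):
    """Sum the grouped values to get one total per word."""
    return {word: sum(counts) for word, counts in grouped.items()}

# Here the map calls run one after another; a cluster would run them in parallel.
mapped = [map_phase(chunk) for chunk in chunks]
word_totals = reduce_phase(shuffle(mapped))
print(word_totals)
```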
b) Secondary Data Collection (data already collected by others, reused for analysis)
Secondary data collection gathers information from existing sources instead of
collecting it firsthand.
Sources:
• Government Reports (e.g., census data, economic reports)
• Research Papers & Articles (e.g., academic studies)
• Company Records (e.g., sales reports, customer databases)
• Social Media & Web Scraping (e.g., analysing Twitter trends)
Example: A marketing agency uses past industry reports to analyse consumer trends.
Organization of Data
Data organization refers to the systematic arrangement of data to ensure it is easily
accessible, manageable, and usable. Proper organization enhances efficiency, supports
decision-making, and allows for effective data retrieval and analysis.
Importance of High-Quality, Well-Organized Data:
1. Better Decision-Making
High-quality data allows businesses to make informed, data-driven decisions based on
accurate insights. Poor-quality data can lead to misleading trends and costly mistakes.
2. Enhanced Customer Insights
Clean and well-organized data enables companies to analyse customer behaviour,
personalize marketing, and improve customer satisfaction.
3. Improved Operational Efficiency
Reliable data helps streamline business processes, reduce inefficiencies, and optimize
resources, leading to cost savings.
4. Compliance & Risk Management
Many industries (e.g., finance, healthcare) are subject to strict data regulations
(e.g., GDPR, HIPAA). Good data quality supports compliance, reducing legal and
financial risks.
5. Competitive Advantage
Companies that maintain high data quality gain an edge over competitors by
leveraging accurate insights for market trends and innovation.
6. Effective AI & Machine Learning Models
High-quality data is essential for training AI and ML models. Poor data quality leads
to biased or inaccurate predictions.
Missing Data
Missing data refers to the absence of values in a dataset where information is
expected. This issue can arise due to various reasons, such as human error, system
failures, or incomplete surveys.
Missing data can lead to biased analysis, incorrect insights, and flawed business
decisions if not handled properly.
Example: An AI-based loan approval system may reject eligible applicants if their income
or credit history data is missing.
Example: A retailer forecasting demand may underestimate future sales if online purchase
data is incomplete.
Example: A supply chain management system may order too much or too little inventory
if sales data is missing or incomplete.
Example: A healthcare provider may face legal action if patient medical history records
are incomplete, leading to incorrect treatments.
Example: A telecom company using complete customer data can create more effective
retention campaigns, while a competitor with missing data struggles to identify at-risk
customers.
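The sketch below shows two common ways of handling missing values on a made-up set of
loan applications: dropping incomplete records (listwise deletion) and filling a gap with
the mean of the observed values (mean imputation). The data and helper names are
assumptions for illustration. Both approaches can bias results, so the right choice
depends on why the values are missing.

```python
from statistics import mean

# Made-up loan applications; None marks a missing value.
applicants = [
    {"id": 1, "income": 52000, "credit_score": 710},
    {"id": 2, "income": None, "credit_score": 680},
    {"id": 3, "income": 61000, "credit_score": None},
    {"id": 4, "income": 47000, "credit_score": 655},
]

def count_missing(rows, fields):
    """Count how many rows have no value for each field."""
    return {f: sum(1 for r in rows if r.get(f) is None) for f in fields}

def impute_mean(rows, field):
    """Return copies of the rows with missing `field` values replaced by the observed mean."""
    observed = [r[field] for r in rows if r.get(field) is not None]
    fill = mean(observed)
    return [{**r, field: fill if r.get(field) is None else r[field]} for r in rows]

missing = count_missing(applicants, ["income", "credit_score"])

# Option 1: listwise deletion keeps only fully complete records.
complete_rows = [r for r in applicants if None not in r.values()]

# Option 2: mean imputation keeps every record but fills the gap with an estimate.
imputed = impute_mean(applicants, "income")

print(missing)
print(len(complete_rows))
print(round(imputed[1]["income"], 2))
```

Deletion halves this small dataset, while imputation keeps all four applicants at the
cost of inventing a plausible income for one of them.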
Data Visualization:
Data visualization is the process of representing data in graphical or visual formats, such
as charts, graphs, maps, and dashboards. It helps businesses and individuals quickly
understand patterns, trends, and insights from large datasets, making data more accessible
and actionable.
Types of Data Visualization
1. Charts and Graphs
Used for summarizing data and identifying trends.
• Bar Chart – Compares categories (e.g., sales by product).
• Line Chart – Shows trends over time (e.g., monthly revenue).
• Pie Chart – Displays proportions (e.g., market share).
• Histogram – Represents frequency distributions (e.g., age group distribution).
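As a small illustration of the histogram idea, the sketch below bins a made-up list of
ages into ten-year groups and prints the frequency distribution as a text chart. The
ages, the bin width, and the age_group helper are assumptions for this example; a plotting
library would normally draw the bars, but the counting step is the same.

```python
from collections import Counter

# Made-up customer ages.
ages = [23, 35, 31, 47, 52, 29, 38, 41, 26, 33, 58, 36]

def age_group(age, width=10):
    """Label the bin an age falls into, e.g. 23 -> '20-29'."""
    low = (age // width) * width
    return f"{low}-{low + width - 1}"

# Frequency distribution: how many ages fall into each bin.
frequencies = Counter(age_group(a) for a in ages)

# Text rendering of the histogram: one '#' per observation.
for group in sorted(frequencies):
    print(f"{group:>5} | {'#' * frequencies[group]} ({frequencies[group]})")
```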
4. Infographics
Used for storytelling through visuals and text.
• Static Infographics – Combines images, charts, and icons (e.g., business reports).
• Interactive Infographics – Allows users to explore data dynamically (e.g., digital
dashboards).
2. Enhances Decision-Making
Clear visual insights lead to faster and more informed decisions.
Example: A stock market heatmap helps investors quickly spot rising and falling stocks.
✔ Convert the Model into an API: Use tools like Flask, FastAPI, or TensorFlow Serving.
✔ Deploy on Cloud or Edge Devices: Host models on AWS, Google Cloud, Azure, or
on-premises servers.
✔ Automate Data Pipelines: Ensure real-time data feeds into the model for continuous
predictions.
✔ Monitor Performance Post-Deployment: Track accuracy and response time in
production.
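The monitoring step above can be sketched as a thin wrapper that times each prediction
and records whether it was correct once the true outcome is known. The PredictionMonitor
class and toy_model function are hypothetical names for this example, and toy_model is a
stand-in rule rather than a trained model.

```python
import time

class PredictionMonitor:
    """Record latency and correctness for each prediction served."""

    def __init__(self):
        self.latencies = []
        self.outcomes = []

    def predict(self, model, features, actual=None):
        """Call the model, time the call, and log correctness when the true label is known."""
        start = time.perf_counter()
        prediction = model(features)
        self.latencies.append(time.perf_counter() - start)
        if actual is not None:
            self.outcomes.append(prediction == actual)
        return prediction

    def accuracy(self):
        """Share of labelled predictions that were correct, or None if none are labelled."""
        return sum(self.outcomes) / len(self.outcomes) if self.outcomes else None

    def mean_latency(self):
        """Average response time in seconds, or None if nothing has been served."""
        return sum(self.latencies) / len(self.latencies) if self.latencies else None

# Hypothetical stand-in for a trained model: approve when income is at least 50,000.
def toy_model(features):
    return "approve" if features["income"] >= 50000 else "reject"

monitor = PredictionMonitor()
monitor.predict(toy_model, {"income": 62000}, actual="approve")
monitor.predict(toy_model, {"income": 41000}, actual="approve")
monitor.predict(toy_model, {"income": 30000}, actual="reject")

print(monitor.accuracy())
print(monitor.mean_latency())
```

In production these figures would typically be sent to a dashboard or alerting system so
that a drop in accuracy or a rise in response time is noticed quickly.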