Unit-I: Data Management
Dr T V RAJINI KNATH
Professor
B.TECH. VI SEMESTER CM602PC: DATA ANALYTICS
UNIT-I
Data Management:
Design Data Architecture and manage the data for analysis; understand various
sources of data such as sensors, signals, and GPS; Data Management, Data
Quality (noise, outliers, missing values, duplicate data), and Data Processing &
Preprocessing.
TEXT BOOKS:
1. Sanjeev J. Wagh, Manisha S. Bhende, and Anuradha D. Thakare, Fundamentals of Data Science.
2. Peter Bruce and Andrew Bruce, Practical Statistics for Data Scientists, O'Reilly Media.
REFERENCE BOOKS:
1. Jiawei Han, Micheline Kamber, and Jian Pei, Data Mining: Concepts and Techniques, 3rd ed., The Morgan Kaufmann Series in Data Management Systems.
Data Management
• Data management refers to the processes and techniques used to handle data effectively throughout its lifecycle.
• A well-structured data architecture ensures the data is accessible, secure, and supports analytical needs.
• Data sources can vary widely, including databases, APIs, sensors, and external datasets.
• Effective data management is crucial for data-driven decision-making in organizations.
• Data quality issues, like missing or duplicate data, can lead to erroneous conclusions if not addressed.
• Processing data involves cleaning, transforming, and preparing it for analysis.
Data Architecture
• A data architecture outlines the frameworks for collecting, storing, and using data efficiently.
• It includes decisions on data storage technologies, formats, and integration methods.
• Scalability is an important factor, ensuring the architecture grows with increasing data volumes.
• Data security must be embedded in the architecture to protect sensitive information.
• Good architecture ensures data consistency across multiple sources and platforms.
• A well-designed data architecture supports real-time processing for dynamic data applications.
Sensor Data
• Sensor data is a common source in IoT applications, providing real-time information.
• Sensors capture data like temperature, motion, pressure, and environmental variables.
• Signals, such as audio or electrical signals, provide another source of raw data for analysis.
• GPS data provides spatial and location-based insights, widely used in navigation and logistics.
• Data from these sources often requires preprocessing to address variability and noise.
• The integration of multiple sources leads to comprehensive datasets for complex analyses.
Data Quality
• Data quality is crucial for reliable insights and includes measures like accuracy and completeness.
• Noise refers to irrelevant or random data that can distort results during analysis.
• Outliers are data points significantly different from the rest, indicating potential errors or unique cases.
• Missing values can arise from incomplete data collection or transmission errors.
• Duplicate data occurs when the same record exists multiple times in a dataset.
• Addressing these issues requires techniques like imputation, deduplication, and normalization (a quick audit sketch follows).
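A minimal pandas sketch of a data-quality audit, assuming a made-up DataFrame with hypothetical sensor_id and temperature columns; it counts missing values and duplicate rows and flags a simple statistical outlier.

    import numpy as np
    import pandas as pd

    # Hypothetical readings; in practice the data would come from a file or database
    df = pd.DataFrame({
        "sensor_id":   [1, 1, 2, 2, 3, 3, 4, 4, 5, 5],
        "temperature": [21.5, 21.5, 22.1, np.nan, 150.0, 21.9, 22.3, 21.7, 22.0, 21.8],
    })

    print(df.isna().sum())           # missing values per column
    print(df.duplicated().sum())     # number of exact duplicate rows

    # Flag readings more than 2 standard deviations from the mean (150.0 is flagged)
    z = (df["temperature"] - df["temperature"].mean()) / df["temperature"].std()
    print(df[np.abs(z) > 2])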
Data Preprocessing
• Data preprocessing transforms raw data into a usable format for analysis.
• Cleaning involves removing or correcting errors, like missing or inconsistent entries.
• Transformation ensures data is in the correct format, scale, or structure for analysis.
• Techniques like feature extraction help reduce dimensionality and focus on relevant variables.
• Integration combines data from multiple sources into a cohesive dataset.
• Preprocessing improves the efficiency and accuracy of machine learning models (a small end-to-end sketch follows).
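A minimal sketch of the clean-then-transform steps described above, using pandas and scikit-learn; the age and income columns are made up for illustration.

    import pandas as pd
    from sklearn.preprocessing import StandardScaler

    df = pd.DataFrame({"age": [25, 32, None, 41], "income": [30000, 52000, 48000, None]})

    # Cleaning: fill missing numeric entries with each column's median
    df = df.fillna(df.median(numeric_only=True))

    # Transformation: bring both columns onto a comparable scale
    scaled = StandardScaler().fit_transform(df[["age", "income"]])
    df[["age_scaled", "income_scaled"]] = scaled
    print(df)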
Data Storage Technologies
• Data storage technologies vary from relational databases to NoSQL systems.
• Relational databases use structured schemas, while NoSQL is suitable for unstructured data.
• Cloud storage offers scalable, on-demand solutions for growing data needs.
• Data stored in distributed systems ensures reliability through redundancy.
• Choosing the right technology depends on factors like data type, volume, and access frequency.
• Data storage decisions directly impact retrieval speeds and processing efficiency.
Data Integration
• Data integration ensures seamless access to data from diverse sources.
• Tools like ETL (Extract, Transform, Load) pipelines help manage this process (a toy ETL sketch follows).
• Integration challenges include handling heterogeneous formats and ensuring data consistency.
• APIs are commonly used for real-time integration of external data.
• Proper integration is vital for cross-functional data analysis within organizations.
• Automated integration tools improve scalability and reduce errors.
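A toy extract-transform-load sketch in Python, assuming a hypothetical sales.csv file with quantity and unit_price columns and a local SQLite target; production pipelines would typically run under an orchestrator such as Apache Airflow.

    import sqlite3
    import pandas as pd

    # Extract: read raw records from a source file (hypothetical path and columns)
    raw = pd.read_csv("sales.csv")

    # Transform: standardize column names and derive a new field
    raw.columns = [c.strip().lower() for c in raw.columns]
    raw["revenue"] = raw["quantity"] * raw["unit_price"]

    # Load: write the cleaned table into a target database
    with sqlite3.connect("warehouse.db") as conn:
        raw.to_sql("sales_clean", conn, if_exists="replace", index=False)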
Handling Outliers in Data
• Outliers are extreme values that deviate significantly from other data points.
• They can occur due to data entry errors or genuine variability.
• Outliers may distort statistical measures like the mean and variance.
• Methods to detect outliers include boxplots and z-scores (see the sketch below).
• They can be removed, capped, or treated based on context.
• Careful analysis is required to decide the best handling approach.
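A minimal sketch of z-score and IQR-based detection, followed by capping instead of deletion; the values are made up, and a z threshold of 2 is used because the sample is tiny.

    import numpy as np
    import pandas as pd

    s = pd.Series([10, 12, 11, 13, 12, 95, 11, 10])   # 95 is an injected outlier

    # z-score rule: flag points more than 2 standard deviations from the mean
    z = (s - s.mean()) / s.std()
    print(s[np.abs(z) > 2])

    # IQR rule (the basis of boxplot whiskers)
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    print(s[(s < low) | (s > high)])

    # Capping: clip extreme values to the IQR fences instead of dropping them
    print(s.clip(lower=low, upper=high))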
Types of Outliers
• Global outliers: points far outside the overall data distribution.
• Contextual outliers: unusual within a specific context or condition.
• Collective outliers: a group of data points behaving unusually.
• Identification relies on domain knowledge and statistical methods.
• Tools like the IQR rule, DBSCAN, and isolation forests are effective (see the sketch below).
• Correct handling ensures integrity in data-driven results.
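A minimal scikit-learn sketch using IsolationForest, one of the tools named above; the two-dimensional points are simulated and the contamination rate is a hypothetical assumption.

    import numpy as np
    from sklearn.ensemble import IsolationForest

    # Mostly a tight cluster around (0, 0), with two injected anomalies
    rng = np.random.default_rng(42)
    X = np.vstack([rng.normal(0, 1, size=(100, 2)), [[8, 8], [-9, 7]]])

    model = IsolationForest(contamination=0.05, random_state=0)
    labels = model.fit_predict(X)    # -1 marks predicted outliers, 1 marks inliers

    print(X[labels == -1])           # the flagged points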
Dealing with Missing Values
• Missing data occurs due to sensor malfunctions, human errors, or system failures.
• Types include missing completely at random (MCAR), missing at random (MAR), and missing not at random (MNAR).
• Ignoring missing values can lead to biased results.
• Imputation methods fill gaps using statistical or machine learning techniques.
• Common methods include mean, median, and k-nearest neighbors imputation (see the sketch below).
• Selecting an imputation method depends on the dataset's nature.
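A minimal scikit-learn sketch of median and k-nearest-neighbors imputation on a small made-up matrix.

    import numpy as np
    from sklearn.impute import SimpleImputer, KNNImputer

    X = np.array([[1.0, 2.0],
                  [3.0, np.nan],
                  [np.nan, 6.0],
                  [7.0, 8.0]])

    # Fill each missing entry with that column's median
    print(SimpleImputer(strategy="median").fit_transform(X))

    # Fill each missing entry from the two most similar rows
    print(KNNImputer(n_neighbors=2).fit_transform(X))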
Impact of Missing Values
• Missing values reduce dataset size and predictive power.
• They complicate data preprocessing and model building.
• Ignoring missing values can introduce hidden biases.
• Careful assessment of missing patterns is critical.
• Advanced techniques like multiple imputation offer robust solutions (a sketch follows).
• Missing data handling improves dataset reliability and consistency.
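As a rough stand-in for the multiple-imputation idea mentioned above, scikit-learn's IterativeImputer models each feature from the others; it is still marked experimental, hence the extra import. The matrix is made up for illustration.

    import numpy as np
    from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (exposes IterativeImputer)
    from sklearn.impute import IterativeImputer

    X = np.array([[1.0, 10.0],
                  [2.0, np.nan],
                  [3.0, 30.0],
                  [np.nan, 40.0]])

    imputed = IterativeImputer(random_state=0).fit_transform(X)
    print(imputed)   # missing cells replaced by values predicted from the other column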
Identifying Duplicate Data
• Duplicate data arises from repeated entries or integration errors.
• Duplicates inflate dataset size without adding useful information.
• Tools like SQL queries help detect duplicates (a pandas sketch follows).
• Machine learning models may overfit to redundant entries.
• Eliminating duplicates reduces noise in data analysis.
• Maintaining unique identifiers ensures data integrity.
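A minimal pandas sketch for spotting duplicates; the customer records and column names are made up for illustration.

    import pandas as pd

    df = pd.DataFrame({
        "customer_id": [101, 102, 101, 103],
        "email": ["a@x.com", "b@x.com", "a@x.com", "c@x.com"],
    })

    print(df.duplicated().sum())                                 # fully identical rows
    print(df[df.duplicated(subset="customer_id", keep=False)])   # all rows sharing a customer_id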
Resolving Duplicate Data
• Deduplication involves identifying and removing duplicate entries.
• Methods include rule-based checks and similarity matching.
• Tools like pandas, Excel, and OpenRefine aid in the process (see the sketch below).
• Automation can streamline deduplication in large datasets.
• A robust system prevents duplicate creation during data entry.
• Clean data enhances the efficiency of downstream processes.
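A minimal sketch of the rule-based approach: normalize text fields first so near-duplicates become exact duplicates, then drop them; column names and records are hypothetical.

    import pandas as pd

    df = pd.DataFrame({
        "name":  ["Alice Smith", "alice smith ", "Bob Jones"],
        "email": ["ALICE@X.COM", "alice@x.com", "bob@x.com"],
    })

    # Rule-based normalization: trim whitespace and lower-case text fields
    df["name"] = df["name"].str.strip().str.lower()
    df["email"] = df["email"].str.strip().str.lower()

    deduped = df.drop_duplicates()
    print(deduped)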
Overview of Data Processing
• Data processing converts raw data into usable information.
• It involves cleaning, transforming, and analyzing datasets.
• Steps include ingestion, validation, and enrichment.
• Automated tools reduce manual effort in processing.
• Effective processing ensures actionable insights.
• Modern approaches integrate AI for advanced data handling.
Key Steps in Data Processing
• Data Collection: aggregating data from multiple sources.
• Data Cleaning: removing errors, duplicates, and inconsistencies.
• Transformation: standardizing formats and creating new features.
• Integration: merging multiple datasets into a unified structure.
• Analysis: applying statistical or machine learning models.
• Each step builds on the previous to refine data quality.
Tools for Data Processing
• Popular programming languages: Python, R, SQL.
• Libraries like pandas, NumPy, and dplyr enable efficient processing.
• Cloud platforms (e.g., AWS, Google Cloud) support scalability.
• Workflow automation tools (e.g., Apache Airflow) simplify ETL processes.
• Visualization libraries like matplotlib and ggplot help present data.
• Tool selection depends on project requirements and data size.
Importance of Preprocessing
• Preprocessing ensures data is clean and ready for analysis.
• It includes scaling, normalization, and encoding.
• Standardizing data improves model performance.
• Feature selection reduces dimensionality and computation.
• It mitigates issues like overfitting and underfitting.
• Preprocessing is essential for accurate machine learning outcomes.
Normalization and Scaling
• Normalization rescales data to a 0-1 range.
• Standardization rescales data to a mean of 0 and a standard deviation of 1.
• Both methods improve algorithm performance.
• Algorithms like kNN and SVM benefit significantly from scaling.
• Libraries like scikit-learn offer easy-to-use functions for scaling (see the sketch below).
• Choosing the right method depends on the data and task.
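A minimal scikit-learn sketch contrasting the two methods on a made-up single-column matrix.

    import numpy as np
    from sklearn.preprocessing import MinMaxScaler, StandardScaler

    X = np.array([[10.0], [20.0], [30.0], [100.0]])

    print(MinMaxScaler().fit_transform(X))    # normalization: values mapped into [0, 1]
    print(StandardScaler().fit_transform(X))  # standardization: zero mean, unit standard deviation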
Feature Engineering Basics
• Feature engineering involves creating meaningful input variables.
• It enhances the predictive power of models.
• Methods include combining, splitting, and transforming features.
• Domain knowledge often drives effective feature creation.
• Examples include one-hot encoding and polynomial features (see the sketch below).
• A well-designed feature set simplifies model complexity.
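A minimal sketch of the two examples named above, using pandas and scikit-learn on made-up housing-style data.

    import pandas as pd
    from sklearn.preprocessing import PolynomialFeatures

    df = pd.DataFrame({"city": ["Delhi", "Mumbai", "Delhi"], "area_sqft": [800, 1200, 950]})

    # One-hot encoding: turn the categorical column into indicator columns
    encoded = pd.get_dummies(df, columns=["city"])
    print(encoded)

    # Polynomial features: derive area_sqft and area_sqft^2 as model inputs
    poly = PolynomialFeatures(degree=2, include_bias=False)
    print(poly.fit_transform(df[["area_sqft"]]))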
Data Cleaning Techniques
• Data cleaning removes inconsistencies and errors.
• Techniques include handling missing values and correcting typos.
• Detecting and removing outliers is part of cleaning.
• Duplication is resolved to maintain dataset accuracy.
• Automated cleaning tools reduce manual intervention.
• Clean data forms a foundation for reliable analytics.
Transformation Techniques
• Data transformation converts data into a structured format.
• Common techniques include normalization, encoding, and aggregation.
• Transformation ensures compatibility with machine learning models.
• Logarithmic transformations handle skewed data.
• Aggregation reduces data granularity for summarization (see the sketch below).
• Transformation enhances the usability of raw datasets.
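A minimal pandas/NumPy sketch of a log transform for skewed values and a group-wise aggregation; the region and sales columns are hypothetical.

    import numpy as np
    import pandas as pd

    df = pd.DataFrame({
        "region": ["North", "North", "South", "South"],
        "sales":  [120, 90_000, 300, 45_000],   # heavily skewed values
    })

    # Logarithmic transform compresses the skewed sales values
    df["log_sales"] = np.log1p(df["sales"])

    # Aggregation reduces granularity: one summary row per region
    summary = df.groupby("region")["sales"].agg(["mean", "max"])
    print(df)
    print(summary)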
Case Study: Sensor Data Processing
• Sensors generate continuous streams of raw data.
• Preprocessing includes noise reduction and trend extraction.
• Data is aggregated for meaningful insights.
• Feature engineering highlights key metrics like average or peak values (see the sketch below).
• Machine learning models predict failures based on trends.
• Clean, processed data improves IoT applications.
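A minimal pandas sketch of the case-study steps: smooth a simulated noisy temperature stream with a rolling mean, then aggregate it into per-minute average and peak values.

    import numpy as np
    import pandas as pd

    # Simulated noisy readings, one per second for five minutes
    idx = pd.Timestamp("2024-01-01") + pd.to_timedelta(np.arange(300), unit="s")
    rng = np.random.default_rng(0)
    temp = 25 + 0.01 * np.arange(300) + rng.normal(0, 0.5, 300)
    series = pd.Series(temp, index=idx)

    smoothed = series.rolling(window=10, min_periods=1).mean()   # noise reduction

    # Aggregation / feature extraction: average and peak value per minute
    features = smoothed.resample("1min").agg(["mean", "max"])
    print(features)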
Best Practices in Data Management
• Define clear data collection and storage policies.
• Regularly validate and clean datasets.
• Use secure methods to protect sensitive data.
• Automate repetitive tasks to improve efficiency.
• Employ data versioning for tracking changes.
• Encourage collaboration across teams for consistent practices.
Common Pitfalls in Data Architecture
• Overcomplicated architecture increases maintenance challenges.
• Ignoring scalability can lead to system inefficiencies.
• Poor integration leads to data silos and redundancy.
• Neglecting data quality affects analysis outcomes.
• Security lapses expose sensitive information.
• Regular reviews mitigate these issues over time.
Advanced Topics in Data Management
• Big Data: handling high volume, variety, and velocity of data.
• Cloud Solutions: scalable storage and processing infrastructure.
• Data Lakes: storing unstructured and structured data.
• Real-time Analytics: insights from live data streams.
• Edge Computing: processing data closer to its source.
• Emerging technologies continuously redefine best practices.
Ethical Considerations in Data Handling
• Data privacy is a critical concern in management.
• Regulations like GDPR and CCPA govern data use.
• Ethical AI frameworks ensure unbiased model predictions.
• Consent and transparency are essential in data collection.
• Protecting sensitive data builds trust with stakeholders.
• Ethics in data handling enhances organizational reputation.
Summary of Key Concepts
• Data management involves quality, architecture, and processing.
• Data sources include sensors, signals, GPS, and systems.
• Preprocessing improves accuracy and model performance.
• Tools and techniques simplify data cleaning and transformation.
• Addressing ethical concerns ensures responsible data use.
• Robust practices lead to reliable, actionable insights.
Big Data in Modern Applications
• Big Data refers to massive datasets with high volume, velocity, and variety.
• It is commonly used in industries like healthcare, retail, and finance.
• Tools like Hadoop and Spark enable distributed processing (a PySpark sketch follows).
• Cloud platforms provide scalable solutions for Big Data storage and analysis.
• Real-time Big Data applications include fraud detection and personalization.
• Handling Big Data requires advanced infrastructure and skilled personnel.
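A minimal PySpark sketch of a distributed aggregation, assuming PySpark is installed and a hypothetical transactions.csv file with customer_id and amount columns; on a cluster the same code scales to much larger data.

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("BigDataSketch").getOrCreate()

    # Hypothetical transactions file; Spark distributes the read and the aggregation
    df = spark.read.csv("transactions.csv", header=True, inferSchema=True)

    # Total amount per customer, computed in parallel across partitions
    totals = df.groupBy("customer_id").agg(F.sum("amount").alias("total_amount"))
    totals.show(5)

    spark.stop()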
Cloud Solutions for Data Management
• Cloud platforms offer scalable and cost-effective data storage solutions.
• Popular platforms include AWS, Microsoft Azure, and Google Cloud.
• They support tools for data ingestion, cleaning, and analysis.
• Security measures like encryption ensure data protection in the cloud.
• Pay-as-you-go models make them accessible for businesses of all sizes.
• Cloud solutions enable global accessibility and collaborative workflows.
Data Lakes and Warehouses
• Data lakes store raw, unstructured, and structured data.
• Data warehouses store processed, structured data for analysis.
• Data lakes are used for advanced analytics and machine learning.
• Warehouses support business intelligence and reporting.
• Tools like Snowflake and Databricks facilitate these storage solutions.
• Choosing between them depends on specific business needs.
Real-Time Analytics
• Real-time analytics processes data as it is generated.
• Use cases include stock trading, IoT monitoring, and customer interactions.
• Tools like Apache Kafka and Flink enable stream processing (a simple streaming sketch follows).
• It provides immediate insights for critical decision-making.
• Challenges include ensuring low latency and high availability.
• Real-time analytics is transforming industries like finance and healthcare.
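A minimal pure-Python sketch of the streaming idea, without Kafka or Flink: each event is processed as it arrives and a running statistic is updated incrementally, rather than waiting for a full batch. The simulated sensor feed is made up.

    import random

    def sensor_stream(n):
        """Simulate a live feed: yield one reading at a time."""
        for _ in range(n):
            yield 25 + random.gauss(0, 0.5)

    # Incremental (streaming) mean: updated per event, no batch storage needed
    count, mean = 0, 0.0
    for reading in sensor_stream(100):
        count += 1
        mean += (reading - mean) / count
        # In a real system, an alert or dashboard would be updated here

    print(f"running mean after {count} events: {mean:.2f}")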
Edge Computing in Data Management
• Edge computing processes data close to its source.
• It reduces latency by minimizing data transmission to centralized servers.
• Common use cases include IoT devices and autonomous vehicles.
• It enhances real-time decision-making in critical applications.
• Security and power management are key challenges.
• Edge computing complements cloud solutions for a hybrid approach.
Importance of Data Governance
• Data governance defines policies for data usage and management.
• It ensures data accuracy, consistency, and security.
• Governance frameworks comply with regulatory standards like GDPR.
• Key aspects include metadata management and access control.
• It fosters accountability and trust in data-driven environments.
• Organizations should regularly audit governance practices.
Challenges in Data Quality Management
• Inconsistent formats and incomplete data reduce usability.
• Noise and outliers distort analytical outcomes.
• Duplicate records lead to redundant analysis efforts.
• Missing values create gaps in critical datasets.
• Automation tools can help maintain consistent quality standards.
• Regular validation ensures datasets meet quality benchmarks.
Ethical Challenges in Data Science
• Biased datasets lead to unfair outcomes in AI models.
• Data misuse violates user privacy and regulatory laws.
• Informed consent is critical for ethical data collection.
• Transparency builds trust in data-driven systems.
• Ethical frameworks promote responsible use of data.
• Addressing these challenges ensures equitable applications of data science.
Future Directions in Data Management
• AI-driven data management automates complex processes.
• Blockchain enhances data security and transparency.
• Federated learning enables collaborative AI without data sharing.
• Real-time analytics will become the norm in dynamic industries.
• Advanced data integration tools simplify multi-source analysis.
• Innovations in this field will continue to drive data democratization.
Conclusion
• Data management is essential for meaningful insights and informed decisions.
• Understanding data sources, architecture, and quality ensures accuracy.
• Proper preprocessing enhances the effectiveness of analytical models.
• Tools and techniques streamline data handling across industries.
• Ethical and governance considerations are key to responsible data usage.
• A robust data management strategy unlocks the full potential of data science.