TYPES OF DATA: STRUCTURED, UNSTRUCTURED AND SEMI-STRUCTURED
• Structured Data
Structured data consists of clearly addressable elements, which makes it easy to analyze effectively. It is organized into a formatted repository that acts as a typical database. Structured data covers anything that can be stored in an SQL database as a table of rows and columns; the records carry relational keys and map easily onto pre-designed fields. It is the simplest form of data to manage and process during development. Relational data is one of the best-known examples of structured data.
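A small Python sketch of structured data is given below: a table with rows, columns and a relational key stored in SQLite. The table name and values are made up for illustration.

```python
# Structured data: rows and columns with a key, stored in an SQL table.
import sqlite3

conn = sqlite3.connect(":memory:")      # in-memory database for the demo
cur = conn.cursor()
cur.execute("""CREATE TABLE students (
                   roll_no INTEGER PRIMARY KEY,   -- relational key
                   name    TEXT,
                   marks   REAL)""")
cur.execute("INSERT INTO students VALUES (24, 'Rudra', 91.5)")
conn.commit()
print(cur.execute("SELECT * FROM students").fetchall())
```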
• Semi-Structured Data
Semi-structured data is information that is not stored in a relational database, yet it is not completely unorganized either: it is less organized than structured data but better organized than unstructured data. With some processing it can be stored in a relational database, although this can be quite difficult for certain kinds of semi-structured data. Overall, semi-structured formats make efficient use of the space available for the information they contain. XML data is a common example of semi-structured data.
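A small Python sketch of semi-structured data follows: a short XML snippet parsed with the standard library. The tags and values are made up for illustration.

```python
# Semi-structured data: XML has tags and attributes, but no fixed table schema.
import xml.etree.ElementTree as ET

xml_text = """
<student>
    <name>Rudra</name>
    <class>XII</class>
    <marks subject="AI">91</marks>
</student>
"""
root = ET.fromstring(xml_text)
print(root.find("name").text)       # -> Rudra
print(root.find("marks").attrib)    # -> {'subject': 'AI'}
```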
• Unstructured Data
Unstructured data does not exist in a predefined, organized form; in other words, it does not follow any predefined data model. As a result, unstructured data is not a good fit for mainstream relational databases, so alternative platforms are used to store and manage it. It is very common in IT systems, and organizations use it in many business intelligence and analytics applications. Examples of unstructured data include plain text, PDF files, media logs and Word documents.
DATA COLLECTION METHODS AND SOURCES
• Data Collection Methods:
• Data Storage:
Historically, data storage began with physical media such as paper documents,
punched cards, and magnetic tapes.
The advent of electronic storage introduced hard disk drives (HDDs) and solid-state
drives (SSDs), which remain fundamental components of contemporary data storage systems.
• Cloud Storage:
Cloud storage has emerged as a transformative paradigm, offering scalable, flexible,
and cost-effective solutions.
Services provided by major cloud providers, such as Amazon S3, Google Cloud
Storage, and Microsoft Azure Blob Storage, have become integral to modern data
storage architectures.
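As a hedged sketch (not the official documentation), the snippet below shows how a file might be uploaded to Amazon S3 with the boto3 library. The bucket and file names are placeholders, and AWS credentials would have to be configured for it to actually run.

```python
# Uploading a local file to S3 object storage (illustrative names only).
import boto3

s3 = boto3.client("s3")
s3.upload_file("report.csv", "my-example-bucket", "backups/report.csv")
print("uploaded report.csv to s3://my-example-bucket/backups/report.csv")
```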
• Object Storage:
Object storage, in contrast to traditional file and block storage, treats data as
objects with unique identifiers.
These systems are well suited to handling large datasets and achieving high availability.
DATA PREPROCESSING AND CLEANING
I. Data Cleaning:
• Data Standardization:
• Normalization:
Normalization adjusts the scale of numerical features to a standard range (e.g., between 0 and
1). This is crucial for algorithms sensitive to the magnitude of input variables, such as neural
networks.
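A minimal sketch of min-max normalization with NumPy is shown below; each feature is rescaled to the range 0 to 1. The sample values are made up for illustration.

```python
# Min-max normalization: rescale every column to lie between 0 and 1.
import numpy as np

X = np.array([[50.0, 2000.0],
              [20.0, 8000.0],
              [80.0, 5000.0]])

X_norm = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))
print(X_norm)
```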
• Encoding Categorical Variables:
Machine learning algorithms often require numerical input, necessitating the transformation of
categorical variables. Techniques like one-hot encoding convert categorical data into a format
suitable for analysis.
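A minimal sketch of one-hot encoding with pandas follows: a categorical column is expanded into indicator columns. The data are made up for illustration.

```python
# One-hot encoding: each category becomes its own indicator column.
import pandas as pd

df = pd.DataFrame({"colour": ["red", "green", "blue", "green"]})
encoded = pd.get_dummies(df, columns=["colour"])
print(encoded)
```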
• Feature Engineering:
Feature engineering involves creating new features or modifying existing ones to enhance the
performance of machine learning models. This may include aggregating, transforming, or
combining features to capture relevant information.
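A minimal sketch of feature engineering with pandas is given below: two existing columns are combined into a new, more informative feature. The column names and values are made up for illustration.

```python
# Feature engineering: derive a new feature from existing ones.
import pandas as pd

df = pd.DataFrame({"distance_km": [12.0, 4.5, 30.0],
                   "time_hours":  [0.4, 0.1, 1.2]})

df["speed_kmph"] = df["distance_km"] / df["time_hours"]   # new feature
print(df)
```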
• Data Reduction:
Data reduction techniques, such as dimensionality reduction and sampling, shrink the volume of a
dataset while preserving the information that matters for analysis.
Clean and preprocessed data contribute to the robustness and accuracy of machine learning
models. A well-prepared dataset ensures that models can learn patterns effectively.
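As a hedged sketch of one common data-reduction technique, the snippet below applies principal component analysis (PCA) from scikit-learn to project a dataset onto fewer dimensions; the sample matrix is randomly generated for illustration.

```python
# Dimensionality reduction with PCA: 5 features compressed to 2 components.
import numpy as np
from sklearn.decomposition import PCA

X = np.random.rand(100, 5)               # 100 samples, 5 features
X_reduced = PCA(n_components=2).fit_transform(X)
print(X_reduced.shape)                    # -> (100, 2)
```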
STATISTICAL METHODS FOR DATA ANALYSIS
I. Descriptive Statistics:
• Measures of Dispersion:
Measures of dispersion, such as the range, variance and standard deviation, describe how
spread out the values in a dataset are around its centre.
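A minimal sketch computing these measures with NumPy is shown below; the sample marks are made up for illustration.

```python
# Common measures of dispersion for a small set of marks.
import numpy as np

marks = np.array([45, 60, 72, 80, 95])
print("range:", marks.max() - marks.min())
print("variance:", marks.var())
print("standard deviation:", marks.std())
```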
II. Inferential Statistics:
• Hypothesis Testing:
Hypothesis testing is a fundamental concept in inferential statistics. It
involves formulating a hypothesis, collecting data, and using statistical
tests to determine if the observed results are statistically significant.
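A hedged sketch of one such statistical test, a two-sample t-test with SciPy, is given below; it checks whether two groups have significantly different means. The sample data are made up for illustration.

```python
# Two-sample t-test: are the means of group_a and group_b different?
from scipy import stats

group_a = [72, 75, 78, 80, 74]
group_b = [65, 70, 68, 66, 71]

t_stat, p_value = stats.ttest_ind(group_a, group_b)
print("t =", t_stat, "p =", p_value)
# A p-value below a chosen threshold (commonly 0.05) suggests the observed
# difference is statistically significant.
```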
BAYESIAN STATISTICS:
• Bayesian Inference:
Bayesian inference uses Bayes' theorem to update a prior belief about a quantity with observed
data, producing a posterior distribution. Bayesian methods are particularly useful when dealing
with limited data or when incorporating prior knowledge.
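A minimal sketch of Bayesian updating follows, using a Beta prior for a coin's bias and binomial observations; the prior parameters and flip counts are made up for illustration.

```python
# Bayesian inference: update a Beta prior after observing coin flips.
from scipy import stats

prior_a, prior_b = 2, 2          # prior belief: coin is roughly fair
heads, tails = 7, 3              # observed data: 7 heads in 10 flips

posterior = stats.beta(prior_a + heads, prior_b + tails)
print("posterior mean bias:", posterior.mean())
```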
• Bayesian Networks:
Bayesian networks model probabilistic relationships among a set of variables using directed
acyclic graphs. They are employed in various fields, including healthcare, finance, and artificial
intelligence.
DATA VISUALIZATION:
Visualizing data aids in understanding patterns and trends. Box plots display the distribution of a
dataset, scatter plots show relationships between two variables, and heatmaps represent data
in a matrix format, revealing correlations.
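A minimal sketch of the three plot types mentioned above, drawn with matplotlib, is given below; the data are randomly generated for illustration.

```python
# Box plot, scatter plot and heatmap side by side.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
data = rng.normal(size=(100, 3))

fig, axes = plt.subplots(1, 3, figsize=(12, 3))
axes[0].boxplot(data)                       # distribution of each column
axes[0].set_title("Box plot")
axes[1].scatter(data[:, 0], data[:, 1])     # relationship between two variables
axes[1].set_title("Scatter plot")
im = axes[2].imshow(np.corrcoef(data.T))    # correlation matrix as a heatmap
axes[2].set_title("Heatmap")
fig.colorbar(im, ax=axes[2])
plt.show()
```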
• Powerful Tools:
Statistical software tools, such as R, Python with libraries like NumPy and Pandas, and
commercial tools like SPSS and SAS, empower analysts to apply sophisticated statistical
methods efficiently.
UNIVERSAL CONVENT SENIOR SECONDARY SCHOOL
AI project file
Topic- DATA
Name- Rudra Bhatt
Class- XII
Section- ‘B’
Roll no- 24
Date of submission- 6/12/2023
Submitted to- Mr. Sunil Singh Chuphal
Submitted by- Rudra Bhatt