
DATA SCIENCE

1) Explain classification in Data Science.


This technique is used to obtain important and relevant information about data and metadata. This data
mining technique helps to classify data into different classes. Data mining frameworks themselves can be
classified by different criteria, as follows:
i. Classification of data mining frameworks as per the type of data source mined: this classification is
based on the type of data handled, for example multimedia, spatial data, text data, time-series data,
World Wide Web data, and so on.
ii. Classification of data mining frameworks as per the database involved: this classification is based on
the data model involved, for example object-oriented databases, transactional databases, relational
databases, and so on.
iii. Classification of data mining frameworks as per the kind of knowledge discovered: this classification
depends on the types of knowledge discovered or the data mining functionalities, for example
discrimination, classification, clustering, and characterization. Some frameworks are extensive and
offer several data mining functionalities together.
iv. Classification of data mining frameworks according to the data mining techniques used: this
classification is based on the data analysis approach utilized, such as neural networks, machine learning,
genetic algorithms, visualization, statistics, or data warehouse-oriented or database-oriented approaches.
The classification can also take into account the level of user interaction involved in the data mining
procedure, such as query-driven systems, autonomous systems, or interactive exploratory systems.
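As a small, concrete illustration of classification in the narrow sense used above (assigning records to classes based on their attributes), the sketch below trains a decision tree with scikit-learn. The library, the iris dataset, and the 75/25 split are illustrative assumptions, not part of the original answer.

# A minimal classification sketch, assuming scikit-learn is installed.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)                      # attributes and known class labels
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

clf = DecisionTreeClassifier().fit(X_train, y_train)   # learn rules that separate the classes
y_pred = clf.predict(X_test)                           # assign each unseen record to a class
print("accuracy:", accuracy_score(y_test, y_pred))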

2) Explain regression in Data Science.

Regression analysis is a data mining process used to identify and analyze the relationship between
variables and how that relationship is affected by other factors. It is used to estimate the likely value of a
specific variable and is primarily a form of planning and modelling; for example, we might use it to project
certain costs, depending on other factors such as availability, consumer demand, and competition. In
essence, it quantifies the relationship between two or more variables in a given data set.

Regression is defined as a statistical method that helps us to analyze and understand the relationship
between two or more variables of interest. The process adopted to perform regression analysis helps us
understand which factors are important, which factors can be ignored, and how they influence each other.

In regression, we normally have one dependent variable and one or more independent variables. We try to
“regress” the value of the dependent variable “Y” with the help of the independent variables; in other
words, we try to understand how the value of ‘Y’ changes with respect to changes in ‘X’ (a minimal fitting
sketch follows the terms below).

For regression analysis to be a successful method, we must understand the following terms:

• Dependent variable: the variable that we are trying to understand or forecast.
• Independent variables: the factors that influence the dependent (target) variable and provide us with
information regarding the relationship of the variables with the target variable.
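A minimal sketch of regressing a dependent variable Y on a single independent variable X with ordinary least squares; NumPy is assumed to be available, and the data points are invented for illustration.

import numpy as np

# Hypothetical observations: X = advertising spend, Y = sales (made-up numbers).
X = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
Y = np.array([2.1, 4.3, 6.2, 8.1, 9.9])

# Least-squares fit of Y = slope * X + intercept.
slope, intercept = np.polyfit(X, Y, deg=1)
print("fitted model: Y = %.2f * X + %.2f" % (slope, intercept))

# Forecast the dependent variable for a new value of the independent variable.
print("forecast at X = 6:", slope * 6 + intercept)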
3) Explain the essentials of algorithms and data structures.

The name "data structure" itself indicates organizing data in memory. There are many ways of organizing
data in memory; one data structure we have already seen is the array in the C language. An array is a
collection of memory elements in which data is stored sequentially, i.e., one after another; in other words,
an array stores its elements contiguously. This organization of data is achieved with the help of the array
data structure, and there are also other ways to organize data in memory. To structure data in memory,
many organizations have been proposed, and these are described as abstract data types (ADTs); an
abstract data type is a set of rules defining the data and the operations on it.
Types of Data Structures. There are two types of data structures:
o Primitive data structures
o Non-primitive data structures
Primitive Data Structures: the primitive data structures are the primitive data types. int, char, float, double,
and pointer are primitive data structures that can hold a single value.
Non-Primitive Data Structures: the non-primitive data structures are divided into two types:
o Linear data structures
o Non-linear data structures
Linear Data Structures: the arrangement of data in a sequential manner is known as a linear data structure.
The data structures used for this purpose are arrays, linked lists, stacks, and queues. In these data
structures, each element is connected to only one other element in a linear form (a minimal stack and
queue sketch follows the advantages below).
Advantages of Data Structures
The following are the advantages of data structures:
o Efficiency: if the choice of data structure for implementing a particular ADT is proper, it makes the
program very efficient in terms of time and space.
o Reusability: data structures provide reusability, meaning that multiple client programs can use the same
data structure.
o Abstraction: a data structure specified by an ADT also provides a level of abstraction. The client cannot
see the internal working of the data structure, so it does not have to worry about the implementation; the
client sees only the interface.
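As a sketch of two of the linear data structures mentioned above, the snippet below uses a Python list as a stack and collections.deque as a queue; this is one common idiom rather than the only possible implementation.

from collections import deque

# Stack: last-in, first-out (LIFO).
stack = []
stack.append(10)        # push
stack.append(20)
print(stack.pop())      # -> 20, the most recently pushed element

# Queue: first-in, first-out (FIFO).
queue = deque()
queue.append("a")       # enqueue at the rear
queue.append("b")
print(queue.popleft())  # -> "a", the earliest enqueued element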

Algorithm
An algorithm is a process or a set of rules required to perform calculations or some other problem-solving
operation, especially by a computer. Formally, an algorithm is a finite set of instructions carried out in a
specific order to perform a specific task. It is not the complete program or code; it is just the solution
(logic) of a problem, which can be represented informally using a flowchart or pseudocode.
Characteristics of an Algorithm
The following are the characteristics of an algorithm (a small worked example follows the list):
o Input: an algorithm takes zero or more input values.
o Output: an algorithm produces one or more outputs at the end.
o Unambiguity: an algorithm should be unambiguous, which means that its instructions should be clear
and simple.
o Finiteness: an algorithm should be finite; that is, it should contain a limited, countable number of
instructions and should terminate.
o Effectiveness: an algorithm should be effective, as each instruction contributes to the overall process.
o Language independence: an algorithm must be language-independent, so that its instructions can be
implemented in any language with the same output.
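The small example below, Euclid's algorithm for the greatest common divisor, illustrates these characteristics: it takes inputs, produces an output, every step is unambiguous, and it always terminates. Writing it in Python is an illustrative choice only, since the algorithm itself is language-independent.

def gcd(a, b):
    # Euclid's algorithm: repeatedly replace (a, b) with (b, a mod b).
    while b != 0:          # finiteness: b strictly decreases, so the loop terminates
        a, b = b, a % b    # an unambiguous, effective single step
    return a               # output

print(gcd(48, 36))         # -> 12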

4) Explain the concept of data visualization


Data Visualization is used to communicate information clearly and efficiently to users by the usage of
information graphics such as tables and charts. It helps users in analyzing a large amount of data in a simpler
way. It makes complex data more accessible, understandable, and usable.
Tables are used where users need to see the pattern of a specific parameter, while charts are used to show
patterns or relationships in the data for one or more parameters.
OR
Data visualization is the process of translating large data sets and metrics into charts, graphs and other
visuals. The resulting visual representation of data makes it easier to identify and share real-time trends,
outliers, and new insights about the information represented in the data.
Data visualization gives us a clear idea of what the information means by giving it visual context
through maps or graphs. This makes the data more natural for the human mind to comprehend and therefore
makes it easier to identify trends, patterns, and outliers within large data sets.
Data visualization is important for almost every career. It can be used by teachers to display student
test results, by computer scientists exploring advancements in artificial intelligence (AI) or by executives
looking to share information with stakeholders. It also plays an important role in big data projects.
Pros and Cons of Data Visualization
Here are some pros and cons of representing data visually (a minimal plotting sketch follows the list) –
Pros
• It can be accessed quickly by a wider audience.
• It conveys a lot of information in a small space.
• It makes a report more visually appealing.
Cons
• It can misrepresent information if an incorrect visual representation is chosen.
• It can be distracting if the visual is distorted or used excessively.
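A minimal plotting sketch using matplotlib (assumed to be installed); the monthly sales figures are invented purely for illustration.

import matplotlib.pyplot as plt

months = ["Jan", "Feb", "Mar", "Apr", "May"]
sales = [120, 135, 150, 110, 170]          # hypothetical values

plt.bar(months, sales)                     # a bar chart makes the month-to-month pattern visible
plt.title("Monthly sales (illustrative data)")
plt.xlabel("Month")
plt.ylabel("Units sold")
plt.show()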
5) What are Software Engineering trends and techniques? Explain in detail.
Software Engineering is an engineering branch concerned with the evolution of software products using
well-defined scientific principles, techniques, and procedures. The outcome of software engineering is an
effective and reliable software product.

Need of Software Engineering


The necessity of software engineering arises because of the high rate of change in user requirements and
in the environment in which the program operates.
o Huge programming: it is simpler to build a wall than a house or a building; similarly, as the size of
programs becomes extensive, engineering has to step in to give development a scientific process.
o Adaptability: if the software procedure were not based on scientific and engineering ideas, it would be
easier to re-create new software than to scale an existing one.
o Cost: the hardware industry has demonstrated its skills, and mass manufacturing has lowered the cost of
computers and electronic hardware; the cost of programming, however, remains high if a proper process is
not adopted.
o Dynamic nature: the continually growing and adapting nature of programming depends heavily on the
environment in which the client works. If the nature of the software keeps changing, new upgrades need
to be made to the existing version.
o Quality management: a better process of software development provides a better-quality software
product.
Importance of Software Engineering
• Reduces complexity: big software is always complicated and challenging to develop. Software
engineering provides a good way to reduce the complexity of any project: it divides big problems into
various small issues and then solves each small issue one by one, with these small problems solved
independently of each other.
• Minimizes software cost: software needs a lot of hard work, and software engineers are highly paid
experts. A lot of manpower is required to develop software with a large amount of code. In software
engineering, however, programmers plan everything and cut out whatever is not needed, so the cost of
software production becomes lower than for software built without a software engineering method.
• Decreases time: anything not built according to a plan wastes time. If you are making a large piece of
software, you may need many attempts to arrive at the definitive running code; this is a very
time-consuming procedure, and if it is not handled well it can take a lot of time. Building your software
according to a software engineering method saves a great deal of that time.
• Handles big projects: big projects are not done in a couple of days; they need patience, planning, and
management. Investing six or seven months of a company's effort requires a great deal of planning,
direction, testing, and maintenance; no one should spend four months on a task and still be at the first
stage, because the company has committed many resources to the plan and it should be completed. So to
handle a big project without problems, a company has to adopt a software engineering method.
• Reliable software: software should be dependable, meaning that once delivered it should work for at
least its stated lifetime or subscription period, and if any bugs appear the company is responsible for fixing
them. Because software engineering includes testing and maintenance, reliability is taken care of.
• Effectiveness: effectiveness comes from building things according to standards. Meeting software
standards is a major goal of companies, and software becomes more effective with the help of software
engineering.
Software techniques are methods or procedures for designing, developing, documenting, and maintaining
programs, or for managing these activities. There are generally two types of software techniques: those
used by personnel who work on programs and those used by managers to control the work.
The following are common software development techniques (methodologies):
• Agile development methodology
• DevOps deployment methodology
• Waterfall development method
• Rapid application development

6) What is a Database?
In computing, a database is an organized collection of data stored and accessed electronically. Small
databases can be stored on a file system, while large databases are hosted on computer clusters or cloud
storage. The design of databases spans formal techniques and practical considerations including data
modeling, efficient data representation and storage, query languages, security and privacy of sensitive data,
and distributed computing issues including supporting concurrent access and fault tolerance.
A database management system (DBMS) is the software that interacts with end users, applications,
and the database itself to capture and analyze the data. The DBMS software additionally encompasses the
core facilities provided to administer the database. The sum total of the database, the DBMS and the
associated applications can be referred to as a database system. Often the term "database" is also used
loosely to refer to any of the DBMS, the database system or an application associated with the database.
Computer scientists may classify database management systems according to the database models
that they support. Relational databases became dominant in the 1980s. These model data as rows and
columns in a series of tables, and the vast majority use SQL for writing and querying data. In the 2000s, non-
relational databases became popular, collectively referred to as NoSQL because they use different query
languages.
Database languages
Database languages are special-purpose languages, which allow one or more of the following tasks,
sometimes distinguished as sublanguages:
• Data control language (DCL) – controls access to data;
• Data definition language (DDL) – defines data types such as creating, altering, or dropping tables and the
relationships among them;
• Data manipulation language (DML) – performs tasks such as inserting, updating, or deleting data
occurrences;
• Data query language (DQL) – allows searching for information and computing derived information.
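The sketch below shows the DDL, DML, and DQL roles side by side, using Python's built-in sqlite3 module as the host; the table and column names are invented for illustration.

import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# DDL: define the schema (create a table).
cur.execute("CREATE TABLE employee (eid INTEGER PRIMARY KEY, name TEXT, salary REAL)")

# DML: insert and update data occurrences.
cur.execute("INSERT INTO employee (name, salary) VALUES (?, ?)", ("Asha", 50000.0))
cur.execute("UPDATE employee SET salary = salary * 1.1 WHERE name = ?", ("Asha",))

# DQL: search for information and compute derived information.
cur.execute("SELECT name, salary FROM employee WHERE salary > 40000")
print(cur.fetchall())

conn.close()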
Database languages are specific to a particular data model.
Notable examples include:
• SQL combines the roles of data definition, data manipulation, and query in a single language. It was one
of the first commercial languages for the relational model, although it departs in some respects from the
relational model as described by Codd (for example, the rows and columns of a table can be ordered). SQL
became a standard of the American National Standards Institute (ANSI) in 1986, and of the International
Organization for Standardization (ISO) in 1987. The standards have been regularly enhanced since and are
supported (with varying degrees of conformance) by all mainstream commercial relational DBMS.
• OQL is an object model language standard (from the Object Data Management Group). It has influenced
the design of some of the newer query languages like JDOQL and EJB QL.
• XQuery is a standard XML query language implemented by XML database systems such as MarkLogic and
eXist, by relational databases with XML capability such as Oracle and DB2, and also by in-memory XML
processors such as Saxon.
• SQL/XML combines XQuery with SQL.
A database language may also incorporate features like:
• DBMS-specific configuration and storage engine management
• Computations to modify query results, like counting, summing, averaging, sorting, grouping, and
cross-referencing
• Constraint enforcement (e.g. in an automotive database, only allowing one engine type per car)
• An application programming interface version of the query language, for programmer convenience.

7) What is a Data warehouse? Explain in detail.


Decision support systems (DSS) are subject-oriented, integrated, time-variant, and non-volatile. The term
data warehouse was first used by William Inmon in the early 1980s. He defined a data warehouse to be a
set of data that supports DSS and is "subject-oriented, integrated, time-variant and nonvolatile" [Inm95].
With data warehousing, corporate-wide data (current and historical) are merged into a single repository.
Traditional databases contain operational data that represent the day-to-day needs of a company.
Traditional business data processing (such as billing, inventory control, payroll, and manufacturing support)
comprises online transaction processing and batch reporting applications. A data warehouse, however,
contains informational data, which are used to support other functions such as planning and forecasting.
Although much of the content is similar between operational and informational data, much is different.
The data warehouse market supports such diverse industries as manufacturing, retail, telecommunications
and health care.
Think of a personnel database for a company that is continually modified as personnel are added and
deleted. A personnel database that contains information about the current set of employees is sufficient
for day-to-day operations. However, if management wishes to analyze trends with respect to employment
history, more data are needed. They may wish to determine if there is a problem with too many employees
quitting. To analyze this problem, they would need to know which employees have left, when they left, why
they left, and other information about their employment. For management to make these types of
high-level business analyses, more historical data (not just the current snapshot that is typically stored)
and data from other sources (perhaps employment applications and results of exit interviews) are required.
In addition, some of the data in the personnel database, such as address, may not be needed. A data
warehouse provides just this information. In a nutshell, a data warehouse is a data repository used to
support decision support systems.
The basic motivation for this shift to the strategic use of data is to increase business profitability.
Traditional data processing applications support day-to-day clerical and administrative decisions, while
data warehousing supports long-term strategic decisions. The basic components of a data warehousing
system include data migration, the warehouse, and access tools. The data are extracted from operational
systems, but must be reformatted, cleansed, integrated, and summarized before being placed in the
warehouse. Much of the operational data are not needed in the warehouse and are removed during this
conversion process. This migration process is similar to that needed for data mining applications, except
that data mining applications need not necessarily be performed on summarized or business-wide data.
• The data transformation process required to convert operational data to informational data involves
many functions, including:
• Unwanted data must be removed.
• Heterogeneous sources must be converted into one common schema. This problem is the same as that
found when accessing data from multiple heterogeneous sources: each operational database may contain
the same data under different attribute names. For example, one system may use "Employee ID" while
another uses "EID" for the same attribute. In addition, there may be multiple data types for the same
attribute.
• As the operational data are probably a snapshot of the data, multiple snapshots may need to be merged
to create the historical view.
• Data are summarized to provide a higher-level view. This summarization may be done at multiple
granularities and for different dimensions.
• New derived data (e.g., using age rather than birth date) may be added to better facilitate decision
support functions.
• Missing and erroneous data must be handled. This could entail replacing them with predicted or default
values or simply removing these entries.
• The portion of the transformation that deals with ensuring valid and consistent data is sometimes
referred to as data scrubbing or data staging.
There are many benefits to the use of a data warehouse. Because it provides an integration of data from
multiple sources, its use can provide more efficient access to the data. The data that are stored often
provide different levels of summarization. For example, sales data may be found at a low level (purchase
order), at a city level (total sales for a city), or at higher levels (county, state, country, world). The summary
can be provided for different types of granularity; the sales data could be summarized by both salesman
and department (a small aggregation sketch follows). These summarizations are produced by the
conversion process instead of being calculated when the data are accessed, which also speeds up the
processing of the data for decision support applications.
The data warehouse may appear to increase the complexity of database management because it is a
replica of the operational data. But keep in mind that much of the data in the warehouse are not simply a
replication but an extension to or aggregation of the data. In addition, because the data warehouse
contains historical data, data stored there will probably have a longer life span than the snapshot data
found in the operational databases. The fact that the data in the warehouse need not be kept consistent
with the current operational data also simplifies its maintenance. The benefits obtained from the
capabilities provided (e.g., DSS support) are usually deemed to outweigh any disadvantages. There are
several ways to improve the performance of data warehouse applications.
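As a small sketch of summarization at different granularities (here by city, and by salesman and department), the snippet below aggregates a few invented sales records with pandas, which is assumed to be available; a real warehouse would precompute such summaries during the conversion process.

import pandas as pd

# Hypothetical operational sales records.
sales = pd.DataFrame({
    "city":       ["Pune", "Pune", "Mumbai", "Mumbai"],
    "salesman":   ["A", "B", "A", "C"],
    "department": ["Toys", "Toys", "Books", "Books"],
    "amount":     [100.0, 250.0, 300.0, 150.0],
})

# Coarser granularity: total sales per city.
print(sales.groupby("city")["amount"].sum())

# Finer granularity: total sales per salesman and department (two dimensions).
print(sales.groupby(["salesman", "department"])["amount"].sum())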
8) Explain AI and ANN
In today's world, technology is growing very fast, and we come into contact with new technologies day by
day. One of the booming technologies of computer science is Artificial Intelligence, which is ready to
create a new revolution in the world by making intelligent machines. Artificial Intelligence is now all around
us. It is currently at work in a variety of subfields, ranging from general to specific, such as self-driving cars,
playing chess, proving theorems, playing music, painting, etc. Artificial Intelligence is composed of two
words, Artificial and Intelligence, where Artificial means "man-made" and Intelligence means "thinking
power"; hence AI means "a man-made thinking power."
Advantages of Artificial Intelligence
Following are some main advantages of Artificial Intelligence:
o High accuracy with fewer errors: AI machines or systems make fewer errors and achieve high accuracy,
as they take decisions based on prior experience or information.
o High speed: AI systems can be very fast in decision-making; because of this, an AI system can beat a
chess champion in the game of chess.
o High reliability: AI machines are highly reliable and can perform the same action multiple times with high
accuracy.
o Useful for risky areas: AI machines can be helpful in situations such as defusing a bomb or exploring the
ocean floor, where employing a human can be risky.
o Digital assistants: AI can be very useful in providing digital assistance to users; for example, AI technology
is currently used by various e-commerce websites to show products matching customer requirements.
o Useful as a public utility: AI can be very useful for public utilities, such as self-driving cars which can make
our journeys safer and hassle-free, facial recognition for security purposes, natural language processing to
communicate with humans in human language, etc.
Disadvantages of Artificial Intelligence
Every technology has some disadvantages, and the same goes for Artificial Intelligence. However
advantageous the technology is, it has some disadvantages which we need to keep in mind while creating
an AI system. Following are the disadvantages of AI:
o High cost: the hardware and software requirements of AI are very costly, as AI systems require a lot of
maintenance to meet current world requirements.
o Can't think out of the box: even though we are making smarter machines with AI, they still cannot work
outside what they were built for; a robot will only do the work for which it is trained or programmed.
o No feelings and emotions: AI machines can be outstanding performers, but they do not have feelings, so
they cannot form any kind of emotional attachment with humans, and may sometimes be harmful to users
if proper care is not taken.
o Increased dependency on machines: with the advance of technology, people are becoming more
dependent on devices and hence are losing some of their mental capabilities.
o No original creativity: humans are highly creative and can imagine new ideas; AI machines cannot match
this power of human intelligence and cannot be creative and imaginative.

ANN
The term "Artificial Neural Network" is derived from the biological neural networks that make up the
structure of the human brain. Just as the human brain has neurons interconnected with one another,
artificial neural networks also have neurons that are interconnected with one another in the various layers
of the network. These neurons are known as nodes.
An Artificial Neural Network is a model in the field of Artificial Intelligence that attempts to mimic the
network of neurons that makes up the human brain, so that computers have an option to understand
things and make decisions in a human-like manner. An artificial neural network is designed by programming
computers to behave simply like interconnected brain cells. There are around 1000 billion neurons in the
human brain, and each neuron has somewhere in the range of 1,000 to 100,000 connection points. In the
human brain, data is stored in a distributed manner, and we can extract more than one piece of this data
in parallel from our memory when necessary. We can say that the human brain is made up of incredibly
powerful parallel processors.
We can understand the artificial neural network with an example. Consider a digital logic gate that takes
inputs and gives an output, such as an "OR" gate, which takes two inputs: if one or both inputs are "On",
the output is "On"; if both inputs are "Off", the output is "Off". Here the output depends only on the
inputs. Our brain does not perform a task in such a fixed way: the relationship between outputs and inputs
keeps changing because the neurons in our brain are "learning" (a minimal sketch of such learning follows).
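A minimal sketch of a single artificial neuron (a perceptron) learning the OR function described above. NumPy, the learning rate, and the number of passes are illustrative assumptions; real ANNs use many such units arranged in layers.

import numpy as np

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])   # the two inputs of the OR gate
y = np.array([0, 1, 1, 1])                       # the desired outputs

w = np.zeros(2)   # connection weights
b = 0.0           # bias
lr = 0.1          # learning rate

for _ in range(20):                                 # a few passes over the data
    for xi, target in zip(X, y):
        out = 1 if xi.dot(w) + b > 0 else 0         # the neuron either "fires" or not
        err = target - out
        w += lr * err * xi                          # adjusting the weights is the "learning"
        b += lr * err

print([1 if xi.dot(w) + b > 0 else 0 for xi in X])  # -> [0, 1, 1, 1]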

9) Explain descriptive statistics in Data Science.


Descriptive statistics is the simplest form of statistical analysis, using numbers to describe the qualities of
a data set. It helps reduce large data sets into simpler and more compact forms for easy interpretation.
You can use descriptive statistics to summarize the data from a sample or to represent a whole sample in
a research population. Descriptive statistics uses data visualization tools such as tables, graphs and charts
to make analysis and interpretation easier. However, descriptive statistics is not suitable for drawing
conclusions; it can only represent the data, after which more sophisticated statistical analysis tools are
applied to draw inferences.
Descriptive statistics uses measures of central tendency, which describe a group with a single value: the
mean, median and mode give the central value of a given data set. For example, you can use descriptive
statistical analysis to find the average age of drivers with a ticket in a municipality. Descriptive statistics
can also describe the measure of spread; for example, you can find the age range of drivers with a DUI and
at-fault car accidents in a state. Techniques used to measure spread include the range, variance and
standard deviation (a minimal numeric sketch follows).
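A minimal numeric sketch of these measures using Python's built-in statistics module; the driver ages are invented purely for illustration.

import statistics as st

ages = [19, 22, 22, 25, 31, 40, 47]            # hypothetical ages of ticketed drivers

print("mean:", st.mean(ages))                  # measures of central tendency
print("median:", st.median(ages))
print("mode:", st.mode(ages))
print("range:", max(ages) - min(ages))         # measures of spread
print("standard deviation:", st.stdev(ages))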
A descriptive statistic (in the count noun sense) is a summary statistic that quantitatively describes or
summarizes features from a collection of information, while descriptive statistics (in the mass noun sense)
is the process of using and analysing those statistics. Descriptive statistics is distinguished from inferential
statistics (or inductive statistics) by its aim to summarize a sample, rather than use the data to learn about
the population that the sample is thought to represent. This generally means that descriptive statistics,
unlike inferential statistics, is not developed on the basis of probability theory and frequently consists of
non-parametric statistics. Even when a data analysis draws its main conclusions using inferential statistics,
descriptive statistics are generally also presented.
Descriptive statistics provide simple summaries about the sample and about the observations that have
been made. Such summaries may be either quantitative, i.e. summary statistics, or visual, i.e.
simple-to-understand graphs. These summaries may either form the basis of the initial description of the
data as part of a more extensive statistical analysis, or they may be sufficient in and of themselves for a
particular investigation.
For example, the shooting percentage in basketball is a descriptive statistic that summarizes the
performance of a player or a team. This number is the number of shots made divided by the number of shots
taken. For example, a player who shoots 33% is making approximately one shot in every three. The
percentage summarizes or describes multiple discrete events. Consider also the grade point average. This
single number describes the general performance of a student across the range of their course experiences.
The use of descriptive and summary statistics has an extensive history and, indeed, the simple tabulation
of populations and of economic data was the first way the topic of statistics appeared. More recently, a
collection of summarization techniques has been formulated under the heading of exploratory data analysis:
an example of such a technique is the box plot.
In the business world, descriptive statistics provides a useful summary of many types of data. For
example, investors and brokers may use a historical account of return behavior by performing empirical and
analytical analyses on their investments in order to make better investing decisions in the future.

10) Explain inferential statistics in data science.


Statistical inference is the process of using data analysis to infer properties of an underlying distribution
of probability. Inferential statistical analysis infers properties of a population, for example by testing
hypotheses and deriving estimates. It is assumed that the observed data set is sampled from a larger
population.
Statistical inference makes propositions about a population, using data drawn from the population
with some form of sampling. Given a hypothesis about a population, for which we wish to draw inferences,
statistical inference consists of (first) selecting a statistical model of the process that generates the data and
(second) deducing propositions from the model.
The majority of problems in statistical inference can be considered to be problems related to statistical
modeling. As Sir David Cox has said, "How [the] translation from subject-matter problem to statistical
model is done is often the most critical part of an analysis".
The conclusion of a statistical inference is a statistical proposition (a proposition being the meaning of a
declarative sentence). Some common forms of statistical proposition are the following (a small numeric
sketch follows the list):
• A point estimate (point estimation involves the use of sample data to calculate a single value), i.e. a
particular value that best approximates some parameter of interest;
• An interval estimate (interval estimation is the use of sample data to estimate an interval of plausible
values of a parameter of interest, in contrast to point estimation, which gives a single value), e.g. a
confidence interval (a confidence interval (CI) is a range of estimates for an unknown parameter, defined
as an interval with a lower bound and an upper bound), also called a set estimate, i.e. an interval
constructed using a dataset drawn from a population so that, under repeated sampling of such datasets,
such intervals would contain the true parameter value with the probability at the stated confidence level;
• A credible interval (a credible interval is an interval within which an unobserved parameter value falls
with a particular probability), i.e. a set of values containing, for example, 95% of posterior belief;
• Rejection of a hypothesis;
• Clustering or classification of data points into groups.
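A small numeric sketch of two of these propositions, a point estimate and an approximate 95% confidence interval for a population mean, computed from an invented sample with NumPy; the 1.96 multiplier assumes an approximately normal sampling distribution.

import numpy as np

sample = np.array([4.8, 5.1, 5.0, 4.7, 5.3, 4.9, 5.2, 5.0])   # hypothetical measurements

mean = sample.mean()                              # point estimate of the population mean
sem = sample.std(ddof=1) / np.sqrt(len(sample))   # standard error of the mean
low, high = mean - 1.96 * sem, mean + 1.96 * sem  # approximate 95% confidence interval

print("point estimate:", round(mean, 3))
print("95% confidence interval: (%.3f, %.3f)" % (low, high))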
11) Explain Data Analysis in detail.
Data analysis is a process of inspecting, cleansing (Data cleansing or data cleaning is the process of
detecting and correcting (or removing) corrupt or inaccurate records from a record set, table, or database
and refers to identifying incomplete, incorrect, inaccurate or irrelevant parts of the data and then replacing,
modifying, or deleting the dirty data), transforming (Data transformation is the process of converting data
from one format or structure into another format or structure) and modeling data (the process of creating
a data model) with the goal of discovering useful information, informing conclusions, and supporting decision-
making. Data analysis has multiple facets and approaches, encompassing diverse techniques under a variety
of names, and is used in different business, science, and social science domains. In today's business world,
data analysis plays a role in making decisions more scientific and helping businesses operate more effectively.
Data mining is a particular data analysis technique that focuses on statistical modeling and
knowledge discovery for predictive rather than purely descriptive purposes, while business intelligence
covers data analysis that relies heavily on aggregation, focusing mainly on business information. In statistical
applications, data analysis can be divided into descriptive statistics, exploratory data analysis (EDA), and
confirmatory data analysis (CDA). EDA focuses on discovering new features in the data while CDA focuses on
confirming or falsifying existing hypotheses. Predictive analytics focuses on the application of statistical
models for predictive forecasting or classification, while text analytics applies statistical, linguistic, and
structural techniques to extract and classify information from textual sources, a species of unstructured data.
All of the above are varieties of data analysis.

The process of data analysis


Analysis refers to dividing a whole into its separate components for individual examination. Data analysis
is a process for obtaining raw data and subsequently converting it into information useful for
decision-making by users. The following processes are used during data analysis; similar steps apply to
data mining as well.

Data requirements: data are necessary as inputs to the analysis and are specified based upon the
requirements of those directing the analysis (or of the customers who will use the finished product of the
analysis). The general type of entity upon which the data will be collected is referred to as an experimental
unit (e.g., a person or a population of people). Specific variables regarding a population (e.g., age and
income) may be specified and obtained. Data may be numerical or categorical (i.e., a text label for numbers).

Data collection: data are collected from a variety of sources. The requirements may be communicated by
analysts to custodians of the data, such as Information Technology personnel within an organization. The
data may also be collected from sensors in the environment, including traffic cameras, satellites, recording
devices, etc. They may also be obtained through interviews, downloads from online sources, or reading
documentation.

Data processing: data, when initially obtained, must be processed or organized for analysis. For instance,
this may involve placing data into rows and columns in a table format (known as structured data) for
further analysis, often through the use of spreadsheet or statistical software.

Data cleaning: once processed and organized, the data may be incomplete, contain duplicates, or contain
errors. The need for data cleaning arises from problems in the way the data are entered and stored. Data
cleaning is the process of preventing and correcting these errors. Common tasks include record matching,
identifying inaccuracies, assessing the overall quality of the existing data, deduplication, and column
segmentation.

Exploratory data analysis: once the datasets are cleaned, they can be analyzed. Analysts may apply a
variety of techniques, referred to as exploratory data analysis, to begin understanding the messages
contained within the obtained data. The process of data exploration may result in additional data cleaning
or additional requests for data (a minimal pandas sketch of these steps follows).
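The sketch below walks through processing, cleaning, and a first exploratory look at a tiny invented dataset with pandas (assumed to be available); the column names and values are hypothetical.

import pandas as pd

# Data processing: place raw records into rows and columns (structured data).
raw = pd.DataFrame({
    "age":    [34, None, 29, 29, 120],              # a missing value and an implausible one
    "income": [42000, 38000, 51000, 51000, 47000],
})

# Data cleaning: drop exact duplicates, fill missing ages, remove implausible entries.
clean = raw.drop_duplicates()
clean = clean.assign(age=clean["age"].fillna(clean["age"].median()))
clean = clean[clean["age"] < 100]

# Exploratory data analysis: simple summaries to begin understanding the data.
print(clean.describe())
print(clean.corr())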

12) What are hypothesis techniques? Explain in detail.


A hypothesis is a proposed explanation for a phenomenon (an observable fact or event). For a hypothesis
to be a scientific hypothesis, the scientific method requires that one can test it. Scientists generally base
scientific hypotheses on previous observations (observation being the active acquisition of information
from a primary source) that cannot satisfactorily be explained by the available scientific theories. Even
though the words "hypothesis" and "theory" are often used synonymously, a scientific hypothesis is not
the same as a scientific theory. A working hypothesis is a provisionally accepted hypothesis proposed for
further research, in a process beginning with an educated guess or thought.
Scientific hypothesis: people refer to a trial solution to a problem as a hypothesis, often called an
"educated guess" because it provides a suggested outcome based on the evidence. However, some
scientists reject the term "educated guess" as incorrect. Experimenters may test and reject several
hypotheses before solving the problem.
Working hypothesis: a working hypothesis is a hypothesis that is provisionally accepted as a basis for
further research in the hope that a tenable theory will be produced, even if the hypothesis ultimately fails.
Like all hypotheses, a working hypothesis is constructed as a statement of expectations, which can be
linked to the exploratory research purpose in empirical investigation.
Simple hypothesis: a simple hypothesis is one in which there exists a relationship between two variables;
one is called the independent variable or cause, and the other is the dependent variable or effect.
Complex hypothesis: a complex hypothesis is one in which a relationship exists among more than two
variables, i.e., there is more than one dependent or independent variable.
Empirical hypothesis: a working hypothesis applied to a field. During formulation it is only an assumption,
but when it is put to a test it becomes an empirical or working hypothesis.
Null hypothesis: the null hypothesis is contrary to the positive statement of a working hypothesis.
According to the null hypothesis there is no relationship between the dependent and independent
variables. It is denoted by "H0".
Alternative hypothesis: first, many hypotheses are selected; then, among them, the one that is most
workable and most efficient is chosen. That hypothesis is introduced later, due to changes in the originally
formulated hypothesis. It is denoted by "H1".
Logical hypothesis: the type in which the hypothesis is verified logically. J. S. Mill gave four canons for such
hypotheses, e.g. agreement, disagreement, difference and residue.
Statistical hypothesis: a hypothesis that can be verified statistically is called a statistical hypothesis. The
statement may be logical or illogical, but if statistics can verify it, it is a statistical hypothesis.
Characteristics of a Hypothesis (a minimal statistical test sketch follows the list)
▪ A hypothesis should state the expected pattern, relationship or difference between two or more
variables;
▪ A hypothesis should be testable;
▪ A hypothesis should offer a tentative explanation based on theories or previous research;
▪ A hypothesis should be concise and lucid.
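A minimal sketch of testing a null hypothesis (H0: the two groups have equal means) against the alternative (H1: the means differ) with a two-sample t-test from SciPy, which is assumed to be installed; the group values and the 0.05 significance level are illustrative choices.

from scipy import stats

group_a = [12.1, 11.8, 12.4, 12.0, 11.9]   # hypothetical measurements
group_b = [12.9, 13.1, 12.7, 13.0, 12.8]

t_stat, p_value = stats.ttest_ind(group_a, group_b)

print("t =", round(t_stat, 3), "p =", round(p_value, 4))
if p_value < 0.05:
    print("Reject H0: the difference is statistically significant.")
else:
    print("Fail to reject H0.")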

13) Explain computational techniques in data science.


The following computational methods and information and communication technologies are commonly used:

• Data science and Big Data


• Modelling and simulation

• Artificial intelligence (AI) and machine learning

• Internet of Things (IoT)

• Cloud and edge computing

• Virtual, augmented and mixed reality (VR/AR/MR)

• Combinations of the above with other deep technologies.

DATA SCIENCE AND BIG DATA

Data science involves the combined application of statistical and computational methods with domain
knowledge to understand, visualize and make predictions, and thus derive value from data in that domain.
It encompasses data analysis, Big Data and machine learning.
Big Data has an important role to play in achieving all of the sustainable development goals, not only those
for water. Real-time data from water and environmental sensors over timescales of seconds, minutes and
hours can generate big data, and appropriate methodologies and models will need to be developed to
derive insights and value from them.
Big Data is huge in volume and grows exponentially. It is distinguished from normal data by its four
principal attributes, known as the four V's: volume, velocity, variety, and veracity. Simple examples are the
data from all the sensors in an aircraft, or the number of messages on a social media platform.

Volume: data on the order of terabytes (10^12 bytes), petabytes (10^15 bytes) and larger require different
information processing systems from normal data.

Velocity: the rate at which data is created, stored, processed and analyzed.

Veracity: how accurate and reliable the data are, e.g. the amount of bias or uncertainty, such as noise in
the readings from IoT sensors.

Variety: structured data can be classified into fields such as names, addresses and water quality values.
Unstructured data, such as satellite images, paragraphs of written text, photos and graphic images, videos,
streaming instrument data, webpages, and document files, cannot be classified easily into fields.
Semi-structured data, a combination of structured and unstructured data, is also possible.

MACHINE LEARNING

Machine learning is used for predictive data analysis in an automated way. It is based on pattern recognition
and enables computers to learn without being programmed to perform specific tasks. Using machine-
learning algorithms, computers can learn from data and make predictions in the same way as humans learn
from experience. As machine-learning models are exposed to new data, they can adapt, and their accuracy
is improved as they learn from previous computations to produce more reliable, repeatable decisions and
results.

14) Explain Machine Learning.


In the real world, we are surrounded by humans who can learn everything from their experiences thanks to their
learning capability, and we have computers or machines which work on our instructions. But can a machine also
learn from experience or past data the way a human does? This is where Machine Learning comes in.
Machine Learning is a subset of artificial intelligence that is mainly concerned with the development of
algorithms which allow a computer to learn from data and past experience on its own.
Machine learning enables a machine to automatically learn from data, improve performance from experience,
and predict things without being explicitly programmed. With the help of sample historical data, known as
training data, machine learning algorithms build a mathematical model that helps in making predictions or
decisions without being explicitly programmed. Machine learning brings computer science and statistics together
to create predictive models. Machine learning constructs or uses algorithms that learn from historical data; the
more information we provide, the better the performance. A machine has the ability to learn if it can improve its
performance by gaining more data.
How does Machine Learning work? A Machine Learning system learns from historical data, builds prediction
models, and whenever it receives new data, predicts the output for it. The accuracy of the predicted output
depends upon the amount of data, as a large amount of data helps to build a better model which predicts the
output more accurately. Suppose we have a complex problem where we need to perform some predictions:
instead of writing code for it, we just feed the data to generic algorithms, and with the help of these algorithms
the machine builds the logic from the data and predicts the output. Machine learning has changed our way of
thinking about such problems. The working of a machine learning algorithm can be summarized as: training data
-> learning algorithm -> model -> prediction for new data (a minimal sketch follows).
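A minimal sketch of this train-then-predict workflow with scikit-learn (assumed installed): historical data builds a model, and new data is fed to the model for a prediction. The dataset and the choice of linear regression are illustrative assumptions.

from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Historical ("training") data and held-out "new" data.
X, y = load_diabetes(return_X_y=True)
X_train, X_new, y_train, y_new = train_test_split(X, y, test_size=0.2, random_state=42)

model = LinearRegression().fit(X_train, y_train)   # the system builds a prediction model
predictions = model.predict(X_new)                 # new data arrives, the model predicts its output

print("first prediction:", round(predictions[0], 1), "actual value:", y_new[0])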

Features of Machine Learning:


o Machine learning uses data to detect various patterns in a given dataset.
o It can learn from past data and improve automatically.
o It is a data-driven technology.
o Machine learning is similar to data mining, as it also deals with huge amounts of data.

Need for Machine Learning


The need for machine learning is increasing day by day. The reason is that machine learning is capable of doing
tasks that are too complex for a person to carry out directly. As humans, we have limitations: we cannot access
and process huge amounts of data manually, so we need computer systems, and machine learning makes this
easy for us.
We can train machine learning algorithms by providing them with huge amounts of data and letting them explore
the data, construct the models, and predict the required output automatically. The performance of a machine
learning algorithm depends on the amount of data, and it can be assessed via the cost function. With the help of
machine learning, we can save both time and money.
The importance of machine learning can be easily understood from its use cases. Currently, machine learning is
used in self-driving cars, cyber fraud detection, face recognition, friend suggestions on Facebook, etc. Various top
companies such as Netflix and Amazon have built machine learning models that use vast amounts of data to
analyze user interests and recommend products accordingly.
15) Explain Big Data in detail.
Data which are very large in size are called Big Data. Normally we work on data of size MB (Word
documents, Excel sheets) or at most GB (movies, code), but data of petabyte size, i.e. 10^15 bytes, is called
Big Data. It is stated that almost 90% of today's data has been generated in the past 3 years.
Sources of Big Data
These data come from many sources:
o Social networking sites: Facebook, Google, LinkedIn and similar sites generate huge amounts of data on
a day-to-day basis, as they have billions of users worldwide.
o E-commerce sites: sites like Amazon, Flipkart and Alibaba generate huge amounts of logs from which
users' buying trends can be traced.
o Weather stations: all the weather stations and satellites produce very large volumes of data, which are
stored and processed to forecast the weather.
o Telecom companies: telecom giants like Airtel and Vodafone study user trends and publish their plans
accordingly, and for this they store the data of their millions of users.
o Share market: stock exchanges across the world generate huge amounts of data through their daily
transactions.

3 V's of Big Data

1. Velocity: the data is increasing at a very fast rate. It is estimated that the volume of data will double
every 2 years.

2. Variety: nowadays data are not stored only in rows and columns; data is structured as well as
unstructured. Log files and CCTV footage are unstructured data, while data which can be saved in tables,
like a bank's transaction data, are structured.

3. Volume: the amount of data we deal with is very large, on the order of petabytes.
16) Explain Parallel Computing and Algorithms.
Parallel computing is the use of multiple processing elements simultaneously to solve a problem. Problems
are broken down into instructions and solved concurrently, with each resource that has been applied to
the work operating at the same time.
Advantages of Parallel Computing over Serial Computing are as follows:
1. It saves time and money, as many resources working together reduce the time and cut potential costs.
2. It can be impractical to solve larger problems with serial computing.
3. It can take advantage of non-local resources when the local resources are finite.
4. Serial computing "wastes" potential computing power, whereas parallel computing makes better use of
the hardware.
Types of Parallelism:
1. Bit-level parallelism – the form of parallel computing based on increasing the processor word size. It
reduces the number of instructions that the system must execute in order to perform a task on large-sized
data. Example: consider a scenario where an 8-bit processor must compute the sum of two 16-bit integers.
It must first sum the 8 lower-order bits and then the 8 higher-order bits, requiring two instructions to
perform the operation, whereas a 16-bit processor can perform the operation with a single instruction.
2. Instruction-level parallelism – a processor can issue only a limited number of instructions in each clock
cycle. These instructions can be re-ordered and grouped and then executed concurrently without affecting
the result of the program; this is called instruction-level parallelism.
3. Task parallelism – task parallelism decomposes a task into subtasks and then allocates each subtask for
execution; the processors execute the subtasks concurrently (a small multiprocessing sketch follows).
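A minimal task-parallelism sketch using Python's multiprocessing module: a list of independent sub-tasks is distributed across a pool of worker processes that execute concurrently. The workload function and pool size are invented for illustration.

from multiprocessing import Pool

def subtask(n):
    # An independent sub-task: here, just a small computation.
    return sum(i * i for i in range(n))

if __name__ == "__main__":
    inputs = [100000, 200000, 300000, 400000]
    with Pool(processes=4) as pool:           # four workers execute sub-tasks at the same time
        results = pool.map(subtask, inputs)   # decompose, distribute, collect
    print(results)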
Applications of Parallel Computing:
• Databases and Data mining.
• Real-time simulation of systems.
• Science and Engineering.
• Advanced graphics, augmented reality, and virtual reality.
Limitations of Parallel Computing:
• It introduces issues such as communication and synchronization between multiple sub-tasks and
processes, which are difficult to achieve.
• The algorithms must be structured in such a way that they can be handled by a parallel mechanism.
• The algorithms or programs must have low coupling and high cohesion, but it is difficult to create such
programs.
• Only technically skilled and expert programmers can code parallelism-based programs well.
17) What are the different techniques to manage Big Data?

Data which are very large in size are called Big Data. Normally we work on data of size MB (Word
documents, Excel sheets) or at most GB (movies, code), but data of petabyte size, i.e. 10^15 bytes, is called
Big Data. It is stated that almost 90% of today's data has been generated in the past 3 years.

By combining a set of techniques that analyse and integrate data from multiple sources and solutions, the
insights are more efficient and potentially more accurate than if they were developed from a single source of data.

1. Data mining

A common tool used within big data analytics, data mining extracts patterns from large data sets by
combining methods from statistics and machine learning, within database management. An example would
be when customer data is mined to determine which segments are most likely to react to an offer.

2. Machine learning

Well known within the field of artificial intelligence, machine learning is also used for data analysis. Emerging
from computer science, it works with computer algorithms to produce assumptions based on data. It provides
predictions that would be impossible for human analysts.

3. Natural language processing (NLP).

This technique works to collect, organise, and interpret data within surveys and experiments.
Other data analysis techniques include spatial analysis, predictive modelling, association rule learning, network
analysis and many, many more. The technologies that process, manage, and analyse this data belong to an
entirely different and expansive field, which similarly evolves and develops over time. Techniques and
technologies aside, any form or size of data is valuable: managed accurately and effectively, it can reveal a host
of business, product, and market insights. What does the future of data analysis look like? It is hard to say, given
the tremendous pace at which analytics and technology progress, but undoubtedly data innovation is changing
the face of business and society in its entirety.
18) Explain research methodology basics and importance in data science

Research methodology is the specific procedures or techniques used to identify, select, process, and
analyze information about a topic. In a research paper, the methodology section allows the reader to
critically evaluate a study's overall validity and reliability.
The 4 types of research methodology
Data may be grouped into four main types based on methods for collection: observational, experimental,
simulation, and derived. Most frequently used methods include:
• Observation / Participant Observation.
• Surveys.
• Interviews.
• Focus Groups.
• Experiments.
• Secondary Data Analysis / Archival Study.
• Mixed Methods (combination of some of the above)
It is necessary not just to identify the research problem but also to determine the best method to solve that problem, considering:
• the suitability of the method for the decision problem, and
• the expected accuracy of the outcome of each method for the problem.
Characteristics of Good Research
▪ Good research is systematic
▪ Good research is logical
▪ Good research is empirical
▪ Good research is replicable
19) Explain various Applications of Data Science .
Data Science is the deep study of a large quantity of data, which involves extracting meaningful insights from raw, structured, and unstructured data. Extracting meaningful information from large amounts of data requires processing, which can be done using statistical techniques and algorithms, scientific methods, different technologies, etc. It uses various tools and techniques to extract meaningful information from raw data. Data Science is also known as the Future of Artificial Intelligence.
Applications of Data Science
1. In Search Engines
The most useful application of Data Science is in search engines. When we want to search for something on the internet, we mostly use search engines like Google, Yahoo, Bing, DuckDuckGo, etc., and Data Science is used to make these searches faster and more relevant.
For example, when we search for something such as "Data Structure and Algorithm courses", one of the first links returned is for GeeksforGeeks courses. This happens because the GeeksforGeeks website is visited most often for information on Data Structures and related computer subjects. This analysis is done using Data Science, which surfaces the most visited web links at the top.
2. In Transport
Data Science has also entered the transport field, for example with driverless cars. With the help of driverless cars, it becomes easier to reduce the number of accidents.
For example, in driverless cars the training data is fed into the algorithm and, with the help of Data Science techniques, the data is analyzed to determine things like the speed limit on highways, busy streets and narrow roads, and how to handle different situations while driving.
3. In Finance
Data Science plays a key role in the financial industry, which constantly faces issues of fraud and risk of losses. Financial firms therefore need to automate risk-of-loss analysis in order to make strategic decisions, and they use Data Science analytics tools to predict the future. This allows companies to predict customer lifetime value and stock market moves.
For example, in the stock market Data Science is used to examine past behaviour through historical data, with the goal of anticipating future outcomes. Data is analyzed in such a way that it becomes possible to predict future stock prices over a set timetable.
4. In E-Commerce
E-commerce websites like Amazon, Flipkart, etc. use Data Science to create a better user experience with personalized recommendations.
For example, when we search for something on an e-commerce website, we get suggestions similar to our past choices, along with recommendations based on the most bought, most rated and most searched products. This is all done with the help of Data Science.
5. Image Recognition
Currently, Data Science is also used in image recognition. For example, when we upload a picture with a friend on Facebook, Facebook suggests tagging the people in the picture. This is done with the help of machine learning and Data Science: when an image is recognized, data analysis is performed over one's Facebook friends, and if a face in the picture matches someone's profile, Facebook suggests auto-tagging that person.
20) Describe importance of data science in future
The evolution of Data Science over the years has taken form in many phases. It all started with
Statistics. Simple statistical models have been employed to collect, analyse and manage data since the early 1800s.
These principles underwent various modulations over time until the rise of the digital age. Once computers
were introduced as mainstream public devices, there was a shift in the industry to the Digital Age. A flood of
data and digital information was created. This resulted in the statistical practices and models getting
computerized giving rise to digital analytics. Then came the rise of the internet that exponentially grew the
data available giving rise to what we know as Big Data. This explosion of information available to the masses
gave rise to the need for expertise to process, manage, analyse and visualize this data for the purpose of
decision making through the use of various models. This gave birth to the term Data Science.
Splitting of Data Science:
Presently, the term data science is perceived quite vaguely. There are various designations and
descriptions that are associated with data science like Data Analyst, Data Engineer, Data Visualization, Data
Architect, Machine Learning and Business Intelligence to name a few. However, as we move into the future,
we would begin to better interpret and understand the contribution of each of their roles independently.
This would greatly broaden the domain and we would begin to have professionals gaining expertise in these
domain-specific roles giving a clearer picture of the workflow associated with each role.
Data Explosion:
Today enormous amounts of data are being produced on a daily basis. Every organization is
dependent on the data being created for its processes. Whether it is medicine, entertainment, sports,
manufacturing, agriculture or transport, it is all dependent on data. There would be a continuous increase in
the demand for expertise to extract valuable insights from the data as it continues to increase by leaps and bounds.
Rise of Automation:
With an increase in the complexity of operations, there is always a drive to simplify processes. Into
the future, it is evident that most machine learning frameworks would contain libraries of models that are
pre-structured and pre-trained. This would bring about a paradigm shift in the working of a Data Scientist.
Creation of models for analysis wouldn’t remain as their native responsibility but rather it would shift to the
true analytics of the data extracted from these models. Soft skills like Data Visualization would come to the
forefront of a Data Scientist's skill set.
Scarcity or Abundance of Data Scientists:
Today thousands of individuals learn Data Science related skills through college degrees or the
numerous resources that can be found online and this could result in newer aspirants getting a feeling of
saturation in this domain. However, it is essential to realize that data science is not a domain that can just be learned; it needs to be inculcated. No doubt the skills being learned are of immense importance, but these
are just the tools that help to work with the data. The mindset and the applicative sense of using these tools
to accomplish various analytical tasks is what makes a true data scientist. Thus it should be remembered that
there could always be an abundance of individuals who have learned Data Science, but there would always
be a scarcity of Data Scientists.
The future of Data Science is not definitive; however, what is certain is that it will continue to evolve into new phases depending on the need of the hour. Data Scientists will exist as long as there exists data.
21) Describe the programming paradigm / Explain different types of programming
languages.
programming paradigm
o The term programming paradigm refers to a style of programming. It does not refer to a specific language,
but rather it refers to the way you program. There are lots of programming languages that are well-known
but all of them need to follow some strategy when they are implemented. And that strategy is a paradigm.
Four different programming paradigms –
Procedural, Object-Oriented, Functional and Logical.
Advantages (of the procedural / imperative approach):
1. Very simple to implement
2. It contains loops, variables, etc.
Disadvantages:
1. Complex problems are hard to solve
2. Less efficient and less productive
3. Parallel programming is not possible
o Procedural programming paradigm – This paradigm emphasizes procedures in terms of the underlying machine model. There is no real difference between the procedural and the imperative approach. It has the ability to reuse code, which was a boon at the time it was in wide use because of that reusability.
o Object-oriented programming – The program is written as a collection of classes and objects which are meant for communication. The smallest and most basic entity is the object, and all computation is performed on objects only. More emphasis is placed on data rather than on procedure. It can handle almost all kinds of real-life problems encountered today.
o Logic programming paradigm – This can be termed an abstract model of computation. It is suited to solving logical problems such as puzzles and series. In logic programming we have a knowledge base which is known in advance; given a question together with the knowledge base, the machine produces a result. In normal programming languages such a concept of a knowledge base is not available, but in artificial intelligence and machine learning we have models such as the Perceptron that use a similar mechanism. In logic programming the main emphasis is on the knowledge base and the problem, and the execution of the program is very much like the proof of a mathematical statement, e.g. in Prolog.
o Functional programming paradigm – The functional programming paradigm has its roots in mathematics and is language independent. The key principle of this paradigm is the execution of a series of mathematical functions. The central model of abstraction is the function, which is meant for some specific computation, rather than the data structure. Data are loosely coupled to functions. Functions hide their implementation, and a function can be replaced with its value without changing the meaning of the program. Languages like Perl and JavaScript make use of this paradigm.
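Because these notes use Python elsewhere, here is a hedged sketch that computes the same result (the sum of squares of a list) in a procedural, an object-oriented and a functional style, purely to make the difference between the paradigms tangible; the example itself is invented.

    from functools import reduce

    numbers = [1, 2, 3, 4]

    # Procedural / imperative style: explicit loop and mutable state.
    total = 0
    for n in numbers:
        total += n * n

    # Object-oriented style: data and behaviour bundled into a class.
    class SquareSummer:
        def __init__(self, values):
            self.values = values
        def total(self):
            return sum(v * v for v in self.values)

    # Functional style: a composition of functions, with no mutation.
    functional_total = reduce(lambda acc, n: acc + n * n, numbers, 0)

    print(total, SquareSummer(numbers).total(), functional_total)   # all three agree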
22) Explain different Analysis techniques in data science
Data analysis
Data analysis is a process of inspecting, cleansing, transforming, and modelling data with the goal of discovering useful information, informing conclusions, and supporting decision-making.
A simple example of data analysis is how, whenever we take a decision in our day-to-day life, we think about what happened last time or what will happen if we choose that particular option. This is nothing but analyzing our past or future and making decisions based on it.
Predictive analysis
Predictive analysis uses powerful statistical algorithms and machine learning tools to predict future
events and behavior based on new and historical data trends. It relies on a wide range of probabilistic
techniques such as data mining, big data, predictive modeling, artificial intelligence and simulations to guess
what is likely to occur in the future. Predictive analysis is a branch of business intelligence as many
organizations with operations in marketing, sales, insurance and financial services rely on data to make long-
term plans.
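A minimal sketch of predictive analysis follows, assuming scikit-learn and invented monthly sales figures: a model is fitted on historical values and then used to estimate the next period.

    import numpy as np
    from sklearn.linear_model import LinearRegression

    # Invented historical monthly sales: month index -> sales
    months = np.array([[1], [2], [3], [4], [5], [6]])
    sales  = np.array([100, 110, 125, 135, 150, 160])

    model = LinearRegression().fit(months, sales)   # learn the historical trend
    print(model.predict([[7]]))                     # forecast for month 7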
Prescriptive analysis
Prescriptive analysis helps organizations use data to guide their decision-making process. Companies
can use tools such as graph analysis, algorithms, machine learning and simulation for this type of analysis.
Prescriptive analysis helps businesses make the best choice from several alternative courses of action.
Exploratory data analysis
Exploratory data analysis is a technique data scientists use to identify patterns and trends in a data
set. They can also use it to determine relationships among samples in a population, validate assumptions,
test hypotheses and find missing data points. Companies can use exploratory data analysis to make insights
based on data and validate data for errors.
Descriptive analytics is the process of using current and historical data to identify trends and relationships. It is sometimes called the simplest form of data analysis because it describes trends and relationships but doesn't dig deeper.
Descriptive analytics is relatively accessible and likely something your organization uses daily. Basic statistical software such as Microsoft Excel, or data visualization tools such as Google Charts and Tableau, can help parse data, identify trends and relationships between variables, and visually display information.
Descriptive analytics is especially useful for communicating change over time and uses trends as a springboard for
further analysis to drive decision-making.
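As a hedged illustration of descriptive analytics with the kind of basic tooling mentioned above, the snippet below summarises an invented week of revenue figures with pandas instead of Excel: it only describes what happened, without digging deeper.

    import pandas as pd

    # Invented daily revenue figures for one week
    revenue = pd.Series([120, 135, 128, 150, 160, 155, 170],
                        index=pd.date_range("2024-01-01", periods=7))

    print(revenue.describe())       # count, mean, std, min, quartiles, max
    print(revenue.pct_change())     # day-over-day change, i.e. the trend over time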
23) What is Data mining ? Explain in detail
Data mining is the process of finding anomalies, patterns and correlations within large data sets to
predict outcomes. Using a broad range of techniques, you can use this information to increase revenues, cut
costs, improve customer relationships, reduce risks and more.
Data mining is the act of automatically searching for large stores of information to find trends and
patterns that go beyond simple analysis procedures. Data mining utilizes complex mathematical algorithms
for data segments and evaluates the probability of future events. Data Mining is also called Knowledge
Discovery of Data (KDD).
Data Mining is a process used by organizations to extract specific data from huge databases to solve
business problems. It primarily turns raw data into useful information.
Data Mining is similar to Data Science carried out by a person, in a specific situation, on a particular
data set, with an objective. This process includes various types of services such as text mining, web mining,
audio and video mining, pictorial data mining, and social media mining. It is done through software that is
simple or highly specific. By outsourcing data mining, all the work can be done faster with low operation
costs. Specialized firms can also use new technologies to collect data that is impossible to locate manually.
The biggest challenge is to analyze the data to extract important information that can be used to solve a
problem or for company development. There are many powerful instruments and techniques available to
mine data and find better insight from it.
Data mining allows you to:
• Sift through all the chaotic and repetitive noise in your data.
• Understand what is relevant and then make good use of that information to assess likely outcomes.
• Accelerate the pace of making informed decisions.
Advantages of Data Mining
o The Data Mining technique enables organizations to obtain knowledge-based data.
o Data mining enables organizations to make lucrative modifications in operation and production.
o Compared with other statistical data applications, data mining is cost-efficient.
o Data Mining helps the decision-making process of an organization.
Disadvantages of Data Mining
o There is a probability that the organizations may sell useful data of customers to other organizations for
money. As per the report, American Express has sold credit card purchases of their customers to other
organizations.
o Many data mining analytics software packages are difficult to operate and need advanced training to work with.
24) Differentiate between Data mining and Data Science
Sr. No. | Data Science | Data Mining
01 | Data Science is an area. | Data Mining is a technique.
02 | It is about the collection, processing, analyzing and utilizing of data in various operations. It is more conceptual. | It is about extracting the vital and valuable information from the data.
03 | It is a field of study, just like Computer Science, Applied Statistics or Applied Mathematics. | It is a technique which is a part of the Knowledge Discovery in Databases (KDD) process.
04 | The goal is to build data-dominant products for a venture. | The goal is to make data more vital and usable, i.e. by extracting only the important information.
05 | It deals with all types of data, i.e. structured, unstructured or semi-structured. | It mainly deals with structured forms of data.
06 | It is a superset of Data Mining, as data science consists of data scraping, cleaning, visualization, statistics and many more techniques. | It is a subset of Data Science, as mining activities form one stage of the Data Science pipeline.
07 | It is mainly used for scientific purposes. | It is mainly used for business purposes.
08 | It broadly focuses on the science of the data. | It is more involved with the processes.
25) Explain Evaluation in Data Science .
• Interpretation/evaluation: How the data mining results are presented to the users is extremely important because the usefulness of the results depends on it. Various visualization and GUI strategies are used at this last step. Transformation techniques are used to make the data easier to mine and more useful, and to provide more meaningful results. The actual distribution of the data may itself be modified; some attribute values may be combined to provide new values, thus reducing the complexity of the data.
For example, current date and birth date could be replaced by age. One attribute could be substituted
for another.
As with all steps in the KDD process, however, care must be used in performing transformation. If
used incorrectly, the transformation could actually change the data such that the results of the data mining
step are inaccurate. Visualization refers to the visual presentation of data. The old expression "a picture is
worth a thousand words" certainly is true when examining the structure of data. For example, a line graph
that shows the distribution of a data variable is easier to understand and perhaps more informative than the
formula for the corresponding distribution. The use of visualization techniques allows users to summarize,
extract, and grasp more complex results than more mathematical or text type descriptions of the results.
26) Explain Predictive Analytics and Segmentation using Clustering .
The term predictive analytics refers to the use of statistics and modeling techniques to make
predictions about future outcomes and performance. Predictive analytics looks at current and historical
data patterns to determine if those patterns are likely to emerge again. This allows businesses and
investors to adjust where they use their resources to take advantage of possible future events. Predictive
analysis can also be used to improve operational efficiencies and reduce risk.
Data Segmentation is the process of taking the data you hold and dividing it up and grouping similar
data together based on the chosen parameters so that you can use it more efficiently within marketing and
operations.
Clustering
Clustering is similar to classification except that the groups are not predefined, but rather defined by
the data alone. Clustering is alternatively referred to as unsupervised learning or segmentation. It can be
thought of as partitioning or segmenting the data into groups that might or might not be disjointed. The
clustering is usually accomplished by determining the similarity among the data on predefined attributes.
The most similar data are grouped into clusters. Since the clusters are not predefined, a domain expert is
often required to interpret the meaning of the created clusters
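A minimal sketch of segmentation by clustering follows, assuming scikit-learn's KMeans and two invented customer attributes (annual spend and visits per year); note that the groups are defined by the data alone rather than being predefined.

    import numpy as np
    from sklearn.cluster import KMeans

    # Invented customers described by [annual spend, visits per year]
    customers = np.array([
        [200,  2], [220,  3], [250,  2],      # low-spend, infrequent visitors
        [900, 20], [950, 25], [1000, 22],     # high-spend, frequent visitors
    ])

    kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(customers)
    print(kmeans.labels_)            # the cluster (segment) each customer fell into
    print(kmeans.cluster_centers_)   # the profile of each discovered segment

A domain expert would still be needed to interpret what the two discovered segments mean for the business.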
27) Explain Exploratory Data Analysis .
Exploratory data analysis is a technique data scientists use to identify patterns and trends in a data
set. They can also use it to determine relationships among samples in a population, validate assumptions,
test hypotheses and find missing data points. Companies can use exploratory data analysis to make insights
based on data and validate data for errors.
Once the datasets are cleaned, they can then be analyzed. Analysts may apply a variety of techniques,
referred to as exploratory data analysis, to begin understanding the messages contained within the obtained
data. The process of data exploration may result in additional data cleaning or additional requests for data.
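A hedged sketch of exploratory data analysis with pandas on an invented dataset is shown below; quick summaries, missing-value checks and a correlation look are typical first steps before any modelling.

    import pandas as pd

    # Invented dataset with a deliberately missing value
    df = pd.DataFrame({
        "age":    [23, 35, 45, 29, None],
        "income": [28000, 52000, 61000, 40000, 39000],
    })

    print(df.describe())     # distribution of each numeric column
    print(df.isna().sum())   # find missing data points
    print(df.corr())         # relationships between variables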
28) Describe Data Scientist's roles and responsibilities .
A data scientist is a professional who works with an enormous amount of data to come up with
compelling business insights through the deployment of various tools, techniques, methodologies,
algorithms, etc.
Skill required: To become a data scientist, one should have technical language skills such as R, SAS, SQL,
Python, Hive, Pig, Apache spark, MATLAB. Data scientists must have an understanding of Statistics,
Mathematics, visualization, and communication skills.
The main components of Data Science are given below:
1. Statistics: Statistics is one of the most important components of data science. Statistics is a way to collect
and analyze numerical data in large amounts and to find meaningful insights from it.
2. Domain Expertise: In data science, domain expertise binds data science together. Domain expertise means
specialized knowledge or skills of a particular area. In data science, there are various areas for which we need
domain experts.
3. Data engineering: Data engineering is a part of data science, which involves acquiring, storing, retrieving,
and transforming the data. Data engineering also includes metadata (data about data) to the data.
4. Visualization: Data visualization is meant by representing data in a visual context so that people can easily
understand the significance of data. Data visualization makes it easy to access the huge amount of data in
visuals.
5. Advanced computing: Advanced computing does the heavy lifting of data science. It involves
designing, writing, debugging, and maintaining the source code of computer programs.
6. Mathematics: Mathematics is the critical part of data science. Mathematics involves the study of quantity,
structure, space, and changes. For a data scientist, knowledge of good mathematics is essential.
7. Machine learning: Machine learning is the backbone of data science. Machine learning is all about providing training to a machine so that it can act like a human brain. In data science, we use various machine learning algorithms to solve problems.
Following are some tools required for data science:
o Data Analysis tools: R, Python, Statistics, SAS, Jupyter, R Studio, MATLAB, Excel, RapidMiner.
o Data Warehousing: ETL, SQL, Hadoop, Informatica/Talend, AWS Redshift
o Data Visualization tools: R, Jupyter, Tableau, Cognos.
o Machine learning tools: Spark, Mahout, Azure ML studio.
29) Explain in detail Data Science life cycle .
The main phases of data science life cycle are given below:
1. Discovery: The first phase is discovery, which involves asking the right questions. When you start any data
science project, you need to determine what are the basic requirements, priorities, and project budget. In
this phase, we need to determine all the requirements of the project such as the number of people,
technology, time, data, an end goal, and then we can frame the business problem on first hypothesis level.
2. Data preparation: Data preparation is also known as Data Munging. In this phase, we need to perform the
following tasks:
o Data cleaning
o Data Reduction
o Data integration
o Data transformation
After performing all the above tasks, we can easily use this data for our further processes.
3. Model Planning: In this phase, we need to determine the various methods and techniques to establish the
relation between input variables. We will apply exploratory data analysis (EDA) using various statistical formulae and visualization tools to understand the relations between variables and to see what the data can tell us. Common tools used for model planning are:
o SQL Analysis Services
o R
o SAS
o Python
4. Model-building: In this phase, the process of model building starts. We will create datasets for training
and testing purpose. We will apply different techniques such as association, classification, and clustering, to
build the model. Following are some common Model building tools:
o SAS Enterprise Miner
o WEKA
o SPSS Modeler
o MATLAB
5. Operationalize: In this phase, we will deliver the final reports of the project, along with briefings, code,
and technical documents. This phase provides you a clear overview of complete project performance and
other components on a small scale before the full deployment.
6. Communicate results: In this phase, we will check if we reach the goal, which we have set on the initial
phase. We will communicate the findings and final result with the business team.
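To ground the model-building and evaluation phases, here is a minimal sketch, assuming scikit-learn and its built-in Iris dataset as a stand-in for the prepared project data: the data is split into training and testing sets, a classification technique is applied, and the result is checked before the project is operationalised.

    from sklearn.datasets import load_iris
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.metrics import accuracy_score

    # Toy dataset standing in for the prepared project data
    X, y = load_iris(return_X_y=True)

    # Model-building phase: separate training and testing datasets
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, random_state=42)

    model = DecisionTreeClassifier().fit(X_train, y_train)   # classification technique
    print(accuracy_score(y_test, model.predict(X_test)))     # evidence for the final report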
30) Explain Hadoop integration with R Programming .
Hadoop is an open-source framework that was introduced by the ASF (Apache Software Foundation). Hadoop is the most crucial framework for coping with Big Data. Hadoop has been written in Java, and it is not based on OLAP (Online Analytical Processing). The best part of this big data framework is that it is scalable and can be deployed for any type of data in its various varieties: structured, unstructured and semi-structured.
Integration of Hadoop and R
As we know, data is precious and matters most to an organization, and it is no exaggeration to say that data is its most valuable asset. In order to deal with this huge volume of structured and unstructured data we need an effective tool that can carry out the data analysis, and we obtain this tool by merging the features of the R language with the Hadoop framework for big data analysis; this merging results in increased scalability. Hence we need to integrate both, and only then can we find better insights and results from data. Below we go through the ideas that help to integrate the two.
R is an open-source programming language that is extensively used for statistical and graphical analysis. R supports a large variety of statistical and mathematical libraries (for linear and nonlinear modelling, classical statistical tests, time-series analysis, data classification, data clustering, etc.) and graphical techniques for processing data efficiently.
One major quality of R is that it produces well-designed, high-quality plots with great ease, including mathematical symbols and formulae where needed. If you are in need of strong data-analytics and visualization features, then combining the R language with Hadoop in your task is a good choice for reducing the complexity. R is a highly extensible object-oriented programming language and it has strong graphical capabilities.
The main motive behind R and Hadoop integration:
There is no doubt that R is among the most picked programming languages for statistical computing, graphical analysis of data, data analytics, and data visualization. On the other hand, Hadoop is a powerful big data framework that is capable of dealing with large amounts of data. In all the processing and analysis of data, Hadoop's distributed file system (HDFS) plays a vital role, and the map-reduce processing approach is applied during data processing (provided by the rmr package of RHadoop), which makes the data analysis process more efficient and easier.
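The rmr package mentioned above lives on the R side; purely to illustrate the map-reduce idea itself, and to stay consistent with the Python used in the other sketches in these notes, here is a hedged, self-contained simulation of the three map-reduce phases on an invented word-count task (this is the concept only, not the actual Hadoop or RHadoop API).

    from collections import defaultdict

    documents = ["big data needs big tools", "r and hadoop handle big data"]

    # Map phase: each document independently emits (word, 1) pairs.
    mapped = [(word, 1) for doc in documents for word in doc.split()]

    # Shuffle phase: group all values that belong to the same key.
    grouped = defaultdict(list)
    for word, count in mapped:
        grouped[word].append(count)

    # Reduce phase: aggregate each key's values (here, sum the counts).
    word_counts = {word: sum(counts) for word, counts in grouped.items()}
    print(word_counts)

In a real Hadoop cluster the map and reduce steps run distributed over HDFS blocks; the rmr package lets an R user express the same two functions in R.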