Data Science Full
Regression analysis is a data mining process used to identify and analyze the relationship between
variables in the presence of other factors. It is used to predict the value of a specific (dependent)
variable. Regression is primarily a tool for planning and modelling. For example, we might use it to project
certain costs, depending on other factors such as availability, consumer demand, and competition. Primarily,
it gives an estimated relationship between two or more variables in the given data set.
Regression is defined as a statistical method that helps us analyze and understand the relationship
between two or more variables of interest. The process adopted to perform regression analysis helps us
understand which factors are important, which factors can be ignored, and how they influence each
other.
In regression, we normally have one dependent variable and one or more independent variables. Here we
try to “regress” the value of the dependent variable “Y” with the help of the independent variables. In other
words, we are trying to understand how the value of ‘Y’ changes w.r.t. a change in ‘X’.
For regression analysis to be a successful method, we should understand the following terms (a small worked sketch follows the list):
• Dependent Variable: This is the variable that we are trying to understand or forecast.
• Independent Variable: These are factors that influence the analysis or target variable and
provide us with information regarding the relationship of the variables with the target
variable.
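As a hedged illustration only (not part of the definitions above), the following minimal Python sketch fits a simple linear regression of a dependent variable Y on an independent variable X using NumPy; the small arrays x and y are made-up example data.

import numpy as np

# Made-up example data: x = independent variable, y = dependent variable
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 4.3, 6.2, 8.1, 9.9])

# Fit a straight line y = slope * x + intercept by least squares
slope, intercept = np.polyfit(x, y, 1)

# Predict the dependent variable Y for a new value of the independent variable X
x_new = 6.0
y_pred = slope * x_new + intercept
print(f"y = {slope:.2f} * x + {intercept:.2f}; prediction at x = {x_new}: {y_pred:.2f}")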
3) Explain the essentials of algorithms and data structures
The name data structure itself indicates that it is about organizing data in memory. There are many ways of
organizing data in memory; we have already seen one such data structure, the array in the C
language. An array is a collection of memory elements in which data is stored sequentially, i.e., one after
another; in other words, an array stores its elements in a contiguous manner. There are also other ways to organize data
in memory. Let's see the different types of data structures. Many approaches have been proposed to structure data in
memory, and these are described by abstract data types. An abstract data type is a set of rules (the
permitted values and operations), independent of any particular implementation.
Types of Data Structures There are two types of data structures:
o Primitive data structure
o Non-primitive data structure
Primitive Data structure The primitive data structures are primitive data types. The int, char, float, double,
and pointer are the primitive data structures that can hold a single value.
Non-Primitive Data structure The non-primitive data structure is divided into two types:
o Linear data structure
o Non-linear data structure
Linear Data Structure The arrangement of data in a sequential manner is known as a linear data
structure. The data structures used for this purpose are Arrays, Linked Lists, Stacks, and Queues. In
these data structures, each element is connected to only one other element in a linear form (see the stack sketch below).
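As a small hedged illustration of a linear data structure, here is a minimal Python sketch of a stack built on a list; the class and method names are chosen for the example and are not from any particular library.

class Stack:
    """A simple LIFO stack - a linear data structure."""

    def __init__(self):
        self._items = []          # underlying sequential storage

    def push(self, item):
        self._items.append(item)  # insert at the top

    def pop(self):
        return self._items.pop()  # remove and return the top element

    def peek(self):
        return self._items[-1]    # look at the top without removing it

    def is_empty(self):
        return len(self._items) == 0

s = Stack()
s.push(10)
s.push(20)
print(s.pop())   # 20 - the last element pushed comes out first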
Advantages of Data structures
The following are the advantages of a data structure:
o Efficiency: If the choice of a data structure for implementing a particular ADT is proper, it makes the
program very efficient in terms of time and space.
o Reusability: Data structures are reusable, meaning that multiple client programs can use the same data
structure.
o Abstraction: The data structure specified by an ADT also provides the level of abstraction. The client cannot
see the internal working of the data structure, so it does not have to worry about the implementation part.
The client can only see the interface.
Algorithm
An algorithm is a process or a set of rules required to perform calculations or other problem-solving
operations, especially by a computer. Formally, an algorithm is a finite set of
instructions carried out in a specific order to perform a specific task. It is not the complete
program or code; it is just the solution (logic) of a problem, which can be represented as an informal
description using a flowchart or as pseudocode.
Characteristics of an Algorithm
The following are the characteristics of an algorithm:
o Input: An algorithm has some input values; we can pass zero or more inputs to an algorithm.
o Output: An algorithm produces one or more outputs at the end.
o Unambiguity: An algorithm should be unambiguous which means that the instructions in an algorithm
should be clear and simple.
o Finiteness: An algorithm should have finiteness. Here, finiteness means that the algorithm should contain
a limited, countable number of instructions.
o Effectiveness: An algorithm should be effective, as each instruction in an algorithm contributes to the overall
process.
o Language independent: An algorithm must be language-independent so that its instructions can be
implemented in any language with the same output (see the worked example below).
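To make these characteristics concrete, here is a small illustrative algorithm (finding the largest value in a list) written in Python; it is only an example sketch, not taken from the text above.

def find_maximum(values):
    """Input: a non-empty list of numbers. Output: the largest number."""
    largest = values[0]            # unambiguous, simple instructions
    for v in values[1:]:           # a finite, countable number of steps
        if v > largest:
            largest = v
    return largest                 # exactly one output

print(find_maximum([3, 7, 2, 9, 4]))   # 9

The same logic could be written in any programming language and would give the same output, which illustrates language independence.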
6) What is a Database?
In computing, a database is an organized collection of data stored and accessed electronically. Small
databases can be stored on a file system, while large databases are hosted on computer clusters or cloud
storage. The design of databases spans formal techniques and practical considerations including data
modeling, efficient data representation and storage, query languages, security and privacy of sensitive data,
and distributed computing issues including supporting concurrent access and fault tolerance.
A database management system (DBMS) is the software that interacts with end users, applications,
and the database itself to capture and analyze the data. The DBMS software additionally encompasses the
core facilities provided to administer the database. The sum total of the database, the DBMS and the
associated applications can be referred to as a database system. Often the term "database" is also used
loosely to refer to any of the DBMS, the database system or an application associated with the database.
Computer scientists may classify database management systems according to the database models
that they support. Relational databases became dominant in the 1980s. These model data as rows and
columns in a series of tables, and the vast majority use SQL for writing and querying data. In the 2000s, non-
relational databases became popular, collectively referred to as NoSQL because they use different query
languages.
Database languages
Database languages are special-purpose languages, which allow one or more of the following tasks,
sometimes distinguished as sublanguages:
• Data control language (DCL) – controls access to data;
• Data definition language (DDL) – defines data types such as creating, altering, or dropping tables and the
relationships among them;
• Data manipulation language (DML) – performs tasks such as inserting, updating, or deleting data
occurrences;
• Data query language (DQL) – allows searching for information and computing derived information.
Database languages are specific to a particular data model.
Notable examples include:
• SQL combines the roles of data definition, data manipulation, and query in a single language. It was one
of the first commercial languages for the relational model, although it departs in some respects from the
relational model as described by Codd (for example, the rows and columns of a table can be ordered). SQL
became a standard of the American National Standards Institute (ANSI) in 1986, and of the International
Organization for Standardization (ISO) in 1987. The standards have been regularly enhanced since and are
supported (with varying degrees of conformance) by all mainstream commercial relational DBMS.
• OQL is an object model language standard (from the Object Data Management Group). It has influenced
the design of some of the newer query languages like JDOQL and EJB QL.
• XQuery is a standard XML query language implemented by XML database systems such as MarkLogic and
eXist, by relational databases with XML capability such as Oracle and DB2, and also by in-memory XML
processors such as Saxon.
• SQL/XML combines XQuery with SQL.
A database language may also incorporate features like:
• DBMS-specific configuration and storage engine management
• Computations to modify query results, like counting, summing, averaging, sorting, grouping, and
cross-referencing
• Constraint enforcement (e.g. in an automotive database, only allowing one engine type per car)
• Application programming interface (API) versions of the query language, for programmer convenience (a small illustrative sketch follows).
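As a small, hedged illustration of DDL, DML, and DQL statements, the following Python sketch uses the standard sqlite3 module with an in-memory database; the table and column names are invented for the example.

import sqlite3

conn = sqlite3.connect(":memory:")      # throwaway in-memory database
cur = conn.cursor()

# DDL - define the structure (tables and their columns)
cur.execute("CREATE TABLE student (id INTEGER PRIMARY KEY, name TEXT, marks REAL)")

# DML - insert and update data occurrences
cur.execute("INSERT INTO student (name, marks) VALUES (?, ?)", ("Asha", 81.5))
cur.execute("INSERT INTO student (name, marks) VALUES (?, ?)", ("Ravi", 67.0))
cur.execute("UPDATE student SET marks = 70.0 WHERE name = ?", ("Ravi",))

# DQL - search for information and compute derived information
cur.execute("SELECT COUNT(*), AVG(marks) FROM student")
print(cur.fetchone())                   # (2, 75.75)

conn.close()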
ANN
The term "Artificial Neural Network" is derived from Biological neural networks that develop the
structure of a human brain. Similar to the human brain that has neurons interconnected to one another,
artificial neural networks also have neurons that are interconnected to one another in various layers of the
networks. These neurons are known as nodes.
An Artificial Neural Network is an attempt, in the field of Artificial Intelligence, to mimic the network
of neurons that makes up the human brain, so that computers have a way to understand things and make
decisions in a human-like manner. The artificial neural network is designed by programming computers to
behave simply like interconnected brain cells. There are on the order of 100 billion neurons in the human brain,
and each neuron is connected to somewhere between 1,000 and 100,000 others. In the human brain,
data is stored in a distributed manner, and we can extract more than one piece of this data
from our memory in parallel when necessary. We can say that the human brain is made up of incredibly
amazing parallel processors.
We can understand the artificial neural network with an example. Consider a digital logic gate that takes
inputs and gives an output, such as an "OR" gate, which takes two inputs: if one or both inputs are
"On", the output is "On"; if both inputs are "Off", the output is "Off". Here the output
depends only on the inputs. Our brain does not perform the same task: the relationship between outputs and inputs keeps
changing because the neurons in our brain are "learning" (a minimal perceptron sketch of this idea follows).
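To connect the OR-gate example with the idea of neurons that "learn", here is a minimal illustrative sketch of a single artificial neuron (a perceptron) trained on the OR truth table; the learning rate and epoch count are arbitrary choices made for this example.

# Minimal perceptron learning the OR function - illustrative only
inputs  = [(0, 0), (0, 1), (1, 0), (1, 1)]
targets = [0, 1, 1, 1]                      # OR truth table

w1, w2, bias = 0.0, 0.0, 0.0
learning_rate = 0.1

for epoch in range(20):                     # a few passes over the data
    for (x1, x2), target in zip(inputs, targets):
        output = 1 if (w1 * x1 + w2 * x2 + bias) > 0 else 0
        error = target - output
        # adjust the "connections" in proportion to the error
        w1   += learning_rate * error * x1
        w2   += learning_rate * error * x2
        bias += learning_rate * error

for x1, x2 in inputs:
    print(x1, x2, "->", 1 if (w1 * x1 + w2 * x2 + bias) > 0 else 0)

Unlike a fixed logic gate, the weights of this neuron change while it is exposed to the data, which is the "learning" described above.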
Data requirements: Data is needed as input to the analysis, and the inputs are specified based upon the
requirements of those directing the analysis (or the customers who will use the finished product of the analysis).
The general type of entity upon which the data will be collected is referred to as an experimental unit (e.g.,
a person or population of people). Specific variables regarding a population (e.g., age and income) may be
specified and obtained. Data may be numerical or categorical (i.e., a text label for numbers).
Data collection: Data is collected from a variety of sources. The requirements may be communicated by
analysts to custodians of the data, such as Information Technology personnel within an organization. The
data may also be collected from sensors in the environment, including traffic cameras, satellites, recording
devices, etc. It may also be obtained through interviews, downloads from online sources, or by reading
documentation.
Data Processing: Data, when initially obtained, must be processed or organized for analysis. For instance,
this may involve placing data into rows and columns in a table format (known as structured data) for further
analysis, often through the use of spreadsheet or statistical software.
Data Cleaning: Once processed and organized, the data may be incomplete, contain duplicates, or contain
errors. The need for data cleaning arises from problems in the way that the data are entered and stored.
Data cleaning is the process of preventing and correcting these errors. Common tasks include record
matching, identifying inaccurate data, assessing the overall quality of existing data, removing duplicates, and column
segmentation (a small pandas sketch follows).
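The following minimal pandas sketch shows typical cleaning steps; the DataFrame contents are made up for the example.

import pandas as pd

# Made-up raw records with a duplicate row and a missing value
raw = pd.DataFrame({
    "name": ["Asha", "Ravi", "Ravi", "Meera"],
    "age":  [29, 34, 34, None],
    "city": ["Pune ", "Delhi", "Delhi", "Mumbai"],
})

clean = raw.drop_duplicates().dropna(subset=["age"]).copy()  # remove duplicates and rows with missing age
clean["city"] = clean["city"].str.strip()                    # normalise text fields
print(clean)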
Exploratory data analysis: Once the datasets are cleaned, they can then be analyzed. Analysts may
apply a variety of techniques, referred to as exploratory data analysis, to begin understanding the messages
contained within the obtained data. The process of data exploration may result in additional data cleaning
or additional requests for data (a brief exploratory sketch follows).
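A brief hedged sketch of exploratory steps with pandas, using a small made-up dataset:

import pandas as pd

# Made-up cleaned dataset for exploration
df = pd.DataFrame({
    "age":    [29, 34, 41, 23, 37],
    "income": [42000, 58000, 61000, 30000, 52000],
})

print(df.describe())            # summary statistics for each column
print(df.corr())                # pairwise correlations to spot relationships
print(df["age"].value_counts()) # frequency of each distinct value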
Data science involves the combined application of statistical and computational methods with domain
knowledge to understand, visualize and make predictions, and thus derive value from data in that domain.
It encompasses data analysis, Big Data and machine learning.
Big Data has an important role to play in achieving all the sustainable development goals, not only the one
for water. Real-time data from water and environmental sensors over timescales of seconds, minutes and
hours can generate big data, and appropriate methodologies and models will need to be developed to derive
insights and value from them.
Big Data is huge in volume and grows exponentially. It is distinguished from normal data by its four principal
attributes, known as the four V’s- volume, velocity, variety, and veracity. Simple examples could be data
from all the sensors in the aircraft, or the number of messages on a social media platform.
Volume – Data on the order of terabytes (10^12 bytes), petabytes (10^15 bytes) and larger requires different information
processing systems than normal data.
Velocity – The rate at which data is created, stored, processed and analyzed.
Veracity – Measures how accurate and reliable the data is, such as the amount of bias or uncertainty, e.g. noise
in the readings from IoT sensors.
Variety – Structured data can be classified into fields such as names, addresses, and water quality values.
Unstructured data, such as satellite images, paragraphs of written text, photos and graphic images, videos,
streaming instrument data, webpages, and document files, cannot be classified easily into fields. Semi-structured
data, which is a combination of structured and unstructured data, is also possible.
MACHINE LEARNING
Machine learning is used for predictive data analysis in an automated way. It is based on pattern recognition
and enables computers to learn without being programmed to perform specific tasks. Using machine-
learning algorithms, computers can learn from data and make predictions in the same way as humans learn
from experience. As machine-learning models are exposed to new data, they can adapt, and their accuracy
is improved as they learn from previous computations to produce more reliable, repeatable decisions and
results.
2. Variety: Nowadays, data is not stored only in rows and columns; data is both structured and unstructured.
Log files and CCTV footage are unstructured data, while data that can be saved in tables, like a bank's
transaction data, is structured data.
3. Volume: The amount of data we deal with is very large, on the order of petabytes.
4. Machine learning is a growing technology which enables computers to learn automatically from past data.
Machine learning uses various algorithms for building mathematical models and making predictions using
historical data or information. Currently, it is being used for tasks such as image recognition, speech
recognition, email filtering, Facebook auto-tagging, recommender systems, and many more.
5. Machine learning covers a wide range of techniques such as supervised, unsupervised, and reinforcement
learning, including regression and classification models, clustering methods, hidden Markov models, and various
sequential models (a small clustering sketch follows).
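As one concrete, hedged example of an unsupervised technique mentioned above (clustering), the following Python sketch groups a few made-up two-dimensional points with scikit-learn's KMeans; the data and parameters are chosen only for illustration.

from sklearn.cluster import KMeans
import numpy as np

# Made-up points forming two loose groups
X = np.array([[1.0, 1.1], [1.2, 0.9], [0.8, 1.0],
              [5.0, 5.2], [5.1, 4.9], [4.8, 5.0]])

model = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(model.labels_)            # cluster assignment learned from the data alone
print(model.cluster_centers_)   # centre of each discovered group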
16) Explain Parallel Computing and Algorithms.
It is the use of multiple processing elements simultaneously to solve a problem. Problems are
broken down into instructions and solved concurrently, as each resource applied to the work
operates at the same time.
Advantages of Parallel Computing over Serial Computing are as follows:
1. It saves time and money as many resources working together will reduce the time and cut potential costs.
2. It can be impractical to solve larger problems on Serial Computing.
3. It can take advantage of non-local resources when the local resources are finite.
4. Serial Computing ‘wastes’ potential computing power, so Parallel Computing makes better use of
the hardware.
Types of Parallelism:
1. Bit-level parallelism – It is the form of parallel computing based on increasing the processor's word
size. It reduces the number of instructions that the system must execute in order to perform a task on large-
sized data. Example: Consider a scenario where an 8-bit processor must compute the sum of two 16-bit
integers. It must first sum the 8 lower-order bits and then add the 8 higher-order bits, thus requiring two
instructions to perform the operation, whereas a 16-bit processor can perform the operation with just one
instruction.
2. Instruction-level parallelism – Without it, a processor can issue at most one instruction per clock cycle.
With instruction-level parallelism, instructions can be re-ordered and grouped so that they are later executed
concurrently without affecting the result of the program.
3. Task Parallelism – Task parallelism employs the decomposition of a task into subtasks and then allocates
each of the subtasks for execution. The processors perform the execution of the sub-tasks concurrently (a minimal sketch follows).
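A minimal sketch of task parallelism in Python using the standard multiprocessing module; the work function here is an invented placeholder, not a prescribed workload.

from multiprocessing import Pool

def subtask(n):
    """Placeholder sub-task: some independent piece of work."""
    return sum(i * i for i in range(n))

if __name__ == "__main__":
    jobs = [100_000, 200_000, 300_000, 400_000]
    with Pool(processes=4) as pool:
        # each sub-task is executed concurrently on a separate process
        results = pool.map(subtask, jobs)
    print(results)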
Applications of Parallel Computing:
• Databases and Data mining.
• Real-time simulation of systems.
• Science and Engineering.
• Advanced graphics, augmented reality, and virtual reality.
Limitations of Parallel Computing:
• It involves overheads such as communication and synchronization between multiple sub-tasks and processes, which
are difficult to manage.
• The algorithms must be designed in such a way that they can be handled in a parallel mechanism.
• The algorithms or programs must have low coupling and high cohesion, but it is difficult to create such
programs.
• Only technically skilled and expert programmers can code parallelism-based programs well.
17) What are the different techniques to manage Big Data?
Data which is very large in size is called Big Data. Normally we work on data of size MB (Word documents,
Excel sheets) or at most GB (movies, code), but data in petabytes, i.e. 10^15 bytes in size, is called Big Data. It is
stated that almost 90% of today's data has been generated in the past 3 years.
By combining a set of techniques that analyse and integrate data from multiple sources and solutions, the insights are more
efficient and potentially more accurate than if developed through a single source of data.
1. Data mining
A common tool used within big data analytics, data mining extracts patterns from large data sets by
combining methods from statistics and machine learning, within database management. An example would
be when customer data is mined to determine which segments are most likely to react to an offer.
2. Machine learning
Well known within the field of artificial intelligence, machine learning is also used for data analysis. Emerging
from computer science, it works with computer algorithms to produce assumptions based on data. It
provides predictions that would be impossible for human analysts.
3. Statistical analysis
This technique works to collect, organise, and interpret data within surveys and experiments. Other data
analysis techniques include spatial analysis, predictive modelling, association rule learning, network analysis
and many more. The technologies that process, manage, and analyse this data form an entirely
different and expansive field that similarly evolves and develops over time. Techniques and technologies
aside, any form or size of data is valuable; managed accurately and effectively, it can reveal a host of business,
product, and market insights. What does the future of data analysis look like? It is hard to say given the
tremendous pace at which analytics and technology progress, but undoubtedly data innovation is changing the face of
business and society in its entirety.
18) Explain research methodology basics and importance in data science
Research methodology is the specific procedures or techniques used to identify, select, process, and
analyze information about a topic. In a research paper, the methodology section allows the reader to
critically evaluate a study's overall validity and reliability.
The four types of research methodology
Data may be grouped into four main types based on the method of collection: observational, experimental,
simulation, and derived. The most frequently used methods include:
• Observation / Participant Observation.
• Surveys.
• Interviews.
• Focus Groups.
• Experiments.
• Secondary Data Analysis / Archival Study.
• Mixed Methods (combination of some of the above)
It is necessary not just to identify the problem for research but also to determine the best method to
solve that problem, considering:
• A suitable method for the decision problem.
• The order of accuracy of the outcome of a method for the problem.
Characteristics of Good Research
▪ Good research is systematic
▪ Good research is logical
▪ Good research is empirical
▪ Good research is replicable
19) Explain various Applications of Data Science.
Data Science is the deep study of large quantities of data, which involves extracting meaningful insights
from raw, structured, and unstructured data. Extracting meaningful insights from large amounts of data
requires processing it, and this processing can be done using statistical techniques and algorithms, scientific
techniques, different technologies, etc. Data Science uses various tools and techniques to extract meaningful information from
raw data. It is also known as the Future of Artificial Intelligence.
Applications of Data Science
1. In Search Engines
The most useful application of Data Science is in search engines. As we know, when we want to search
for something on the internet, we mostly use search engines like Google, Yahoo, Bing, etc., and Data
Science is used to make these searches faster.
For example, when we search for something such as “Data Structure and algorithm courses”, we get
the link to GeeksforGeeks Courses at the top of the results. This happens because the
GeeksforGeeks website is visited most often for information regarding Data Structure courses and
computer-related subjects. This analysis is done using Data Science, which identifies the topmost
visited web links.
2. In Transport
Data Science has also entered the transport field, for example with driverless cars. With the help of driverless
cars, it is possible to reduce the number of accidents.
For example, in driverless cars the training data is fed into the algorithm, and with the help of Data
Science techniques the data is analyzed, such as what the speed limit is on highways, busy streets, and narrow roads,
and how to handle different situations while driving.
3. In Finance
Data Science plays a key role in the financial industry, which always has issues of fraud
and risk of losses. Financial companies need to automate risk-of-loss analysis in order to make
strategic decisions, and they use Data Science analytics tools to
predict the future. This allows companies to predict customer lifetime value and stock market moves.
For example, in the stock market Data Science is used to examine past behavior from past data in order
to estimate future outcomes. Data is analyzed in such a way that it becomes possible to predict future stock
prices over a set timeframe.
4. In E-Commerce
E-commerce websites like Amazon, Flipkart, etc. use Data Science to provide a better user experience
with personalized recommendations.
For example, when we search for something on an e-commerce website, we get suggestions similar
to our past choices, and we also get recommendations based on the most bought, most rated, and most
searched products. This is all done with the help of Data Science.
5. Image Recognition
Currently, Data Science is also used in image recognition. For example, when we upload a photo
with a friend on Facebook, Facebook suggests tagging the people who are in the picture. This is done with the
help of machine learning and Data Science: when an image is recognized, data analysis is done on one's
Facebook friends, and if a face present in the picture matches someone's profile, Facebook suggests
auto-tagging.
20) Describe importance of data science in future
The evolution of Data Science over the years has taken form in many phases. It all started with
Statistics. Simple statistics models were employed to collect, analyse and manage data since the early 1800s.
These principles underwent various modulations over time until the rise of the digital age. Once computers
were introduced as mainstream public devices, there was a shift in the industry to the Digital Age. A flood of
data and digital information was created. This resulted in the statistical practices and models getting
computerized giving rise to digital analytics. Then came the rise of the internet that exponentially grew the
data available giving rise to what we know as Big Data. This explosion of information available to the masses
gave rise to the need for expertise to process, manage, analyse and visualize this data for the purpose of
decision making through the use of various models. This gave birth to the term Data Science.
Splitting of Data Science:
Presently, the term data science is perceived quite vaguely. There are various designations and
descriptions that are associated with data science like Data Analyst, Data Engineer, Data Visualization, Data
Architect, Machine Learning and Business Intelligence to name a few. However, as we move into the future,
we would begin to better interpret and understand the contribution of each of their roles independently.
This would greatly broaden the domain and we would begin to have professionals gaining expertise in these
domain-specific roles giving a clearer picture of the workflow associated with each role.
Data Explosion:
Today enormous amounts of data are being produced on a daily basis. Every organization is
dependent on the data being created for its processes. Whether it is medicine, entertainment, sports,
manufacturing, agriculture or transport, it is all dependent on data. There would be a continuous increase in
the demand for expertise to extract valuable insights from the data as it continues to increase by leaps and
bounds.
Rise of Automation:
With an increase in the complexity of operations, there is always a drive to simplify processes. In
the future, it is evident that most machine learning frameworks will contain libraries of models that are
pre-structured and pre-trained. This will bring about a paradigm shift in the work of a Data Scientist.
Creating models for analysis would no longer remain their native responsibility; rather, the focus would shift to the
true analytics of the data extracted from these models. Soft skills like data visualization would come to the
forefront of a Data Scientist's skill set.
Scarcity or Abundance of Data Scientists:
Today thousands of individuals learn Data Science related skills through college degrees or the
numerous resources that can be found online and this could result in newer aspirants getting a feeling of
saturation in this domain. However it is essential to realize that data science is not a domain that can just be
learned, it needs to be inculcated. No doubt the skills being learned are of immense importance, but these
are just the tools that help to work with the data. The mindset and the applicative sense of using these tools
to accomplish various analytical tasks is what makes a true data scientist. Thus it should be remembered that
there could always be an abundance of individuals who have learned Data Science, but there would always
be a scarcity of Data Scientists.
The future of Data Science is not definitive; however, what is certain is that it will continue to evolve into
new phases depending on the need of the hour. Data Scientists will exist as long as data exists.
21) Describe the programming paradigm / Explain different types of programming languages.
Programming paradigm
o The term programming paradigm refers to a style of programming. It does not refer to a specific language,
but rather it refers to the way you program. There are lots of programming languages that are well-known
but all of them need to follow some strategy when they are implemented. And that strategy is a paradigm.
There are four major programming paradigms –
Procedural, Object-Oriented, Functional and Logical.
o Procedural programming – The program is written as a sequence of instructions and procedures that operate on data, with emphasis on the steps to be carried out.
Advantages:
1. Very simple to implement
2. It contains loops, variables, etc.
Disadvantages:
1. Complex problems cannot be solved easily
2. Less efficient and less productive
3. Parallel programming is not possible
o Object-oriented programming – The program is written as a collection of classes and objects which
are meant for communication. The smallest and most basic entity is the object, and all computation is
performed on objects. More emphasis is placed on data than on procedure. It can handle almost all
kinds of real-life problems encountered today.
o Logic programming paradigm – It can be termed an abstract model of computation. It can solve
logical problems such as puzzles and series. In logic programming we have a knowledge base, known
in advance; given a question together with the knowledge base, the machine produces a
result. In normal programming languages such a concept of a knowledge base is not available, but in
artificial intelligence and machine learning we have models, such as the Perceptron
model, that use a similar mechanism. In logic programming the main emphasis is on the
knowledge base and the problem. The execution of the program is very much like the proof of a
mathematical statement, e.g., in Prolog.
o Functional programming paradigm – The functional programming paradigm has its roots in
mathematics and is language-independent. The key principle of this paradigm is the execution of a
series of mathematical functions. The central model of abstraction is the function, which is
meant for some specific computation, rather than the data structure. Data are loosely coupled to
functions, and the functions hide their implementation. A function can be replaced with its value without
changing the meaning of the program. Languages such as Perl and JavaScript support this
paradigm (a short sketch follows).
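A small hedged sketch of the functional style in Python: pure functions, no shared mutable state, and computation expressed as the application and composition of functions. The function names are chosen only for the example.

from functools import reduce

# Pure functions: the result depends only on the inputs
square = lambda x: x * x
is_even = lambda x: x % 2 == 0

numbers = [1, 2, 3, 4, 5, 6]

# Compose behaviour by applying functions rather than mutating variables
even_squares = list(map(square, filter(is_even, numbers)))   # [4, 16, 36]
total = reduce(lambda acc, x: acc + x, even_squares, 0)      # 56
print(even_squares, total)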
22) Explain different Analysis techniques in data science
Data analysis
Data analysis is a process of inspecting, cleansing, transforming, and modelling data with the goal of
discovering useful information, informing conclusions, and supporting decision-making.
A simple example of data analysis: whenever we make a decision in our day-to-day life, we think
about what happened last time or what will happen if we choose that particular option. This is
nothing but analyzing our past or future and making decisions based on it.
Predictive analysis
Predictive analysis uses powerful statistical algorithms and machine learning tools to predict future
events and behavior based on new and historical data trends. It relies on a wide range of probabilistic
techniques such as data mining, big data, predictive modeling, artificial intelligence and simulations to guess
what is likely to occur in the future. Predictive analysis is a branch of business intelligence as many
organizations with operations in marketing, sales, insurance and financial services rely on data to make long-
term plans.
Prescriptive analysis
Prescriptive analysis helps organizations use data to guide their decision-making process. Companies
can use tools such as graph analysis, algorithms, machine learning and simulation for this type of analysis.
Prescriptive analysis helps businesses make the best choice from several alternative courses of action.
Exploratory data analysis
Exploratory data analysis is a technique data scientists use to identify patterns and trends in a data
set. They can also use it to determine relationships among samples in a population, validate assumptions,
test hypotheses and find missing data points. Companies can use exploratory data analysis to make insights
based on data and validate data for errors.
Descriptive analytics :is the process of using current and historical data to identify trends and relationships. It’s
sometimes called the simplest form of data analysis because it describes trends and relationships but doesn’t dig
deeper.
Descriptive analytics is relatively accessible and likely something your organization uses daily. Basic statistical
software, such as Microsoft Excel or data visualization tools, such as Google Charts and Tableau, can help parse
data, identify trends and relationships between variables, and visually display information.
Descriptive analytics is especially useful for communicating change over time and uses trends as a springboard for
further analysis to drive decision-making.
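As a small hedged illustration of descriptive analytics, the following pandas sketch summarises a made-up monthly sales table and its change over time:

import pandas as pd

# Made-up monthly sales figures
sales = pd.DataFrame({
    "month":   ["Jan", "Feb", "Mar", "Apr", "May", "Jun"],
    "revenue": [120, 135, 150, 148, 170, 190],
})

print(sales["revenue"].describe())                 # summary of the period
print(sales["revenue"].pct_change().round(3))      # month-over-month change
print("total:", sales["revenue"].sum())            # simple aggregate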
23) What is Data Mining? Explain in detail
Data mining is the process of finding anomalies, patterns, and correlations within large data sets to
predict outcomes. Using a broad range of techniques, you can use this information to increase revenues, cut
costs, improve customer relationships, reduce risks, and more.
Data mining is the act of automatically searching large stores of information for trends and
patterns that go beyond simple analysis procedures. Data mining utilizes complex mathematical algorithms
to segment data and evaluate the probability of future events. Data Mining is also called Knowledge
Discovery in Databases (KDD).
Data Mining is a process used by organizations to extract specific data from huge databases to solve
business problems. It primarily turns raw data into useful information.
Data Mining is similar to Data Science carried out by a person, in a specific situation, on a particular
data set, with an objective. This process includes various types of services such as text mining, web mining,
audio and video mining, pictorial data mining, and social media mining. It is done through software that is
simple or highly specific. By outsourcing data mining, all the work can be done faster with low operation
costs. Specialized firms can also use new technologies to collect data that is impossible to locate manually.
The biggest challenge is to analyze the data to extract important information that can be used to solve a
problem or for company development. There are many powerful instruments and techniques available to
mine data and find better insight from it.
Data mining allows you to:
• Sift through all the chaotic and repetitive noise in your data.
• Understand what is relevant and then make good use of that information to assess likely outcomes.
• Accelerate the pace of making informed decisions (a toy pattern-mining sketch follows).
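As an illustrative (not authoritative) sketch of mining a simple pattern, the following Python code counts which pairs of items are bought together most often in a few made-up transactions, a toy version of market-basket analysis:

from collections import Counter
from itertools import combinations

# Made-up transactions (each is one customer's basket)
transactions = [
    {"bread", "milk", "butter"},
    {"bread", "milk"},
    {"milk", "butter"},
    {"bread", "butter", "jam"},
    {"bread", "milk", "jam"},
]

pair_counts = Counter()
for basket in transactions:
    for pair in combinations(sorted(basket), 2):
        pair_counts[pair] += 1

# The most frequent co-occurring pairs are the "patterns" mined from the data
print(pair_counts.most_common(3))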
Advantages of Data Mining
o The Data Mining technique enables organizations to obtain knowledge-based data.
o Data mining enables organizations to make lucrative modifications in operation and production.
o Compared with other statistical data applications, data mining is cost-efficient.
o Data Mining helps the decision-making process of an organization.
Disadvantages of Data Mining
o There is a probability that organizations may sell useful customer data to other organizations for
money. For example, it has been reported that American Express sold its customers' credit card purchase data to other
organizations.
o Many data mining analytics software packages are difficult to operate and need advanced training to work with.
24) Differentiate between Data mining and Data Science
Sr. No. | Data Science | Data Mining
01 | Data Science is an area. | Data Mining is a technique.
02 | It is about the collection, processing, analysis and utilization of data in various operations. It is more conceptual. | It is about extracting the vital and valuable information from the data.
03 | It is a field of study, just like Computer Science, Applied Statistics or Applied Mathematics. | It is a technique which is a part of the Knowledge Discovery in Databases (KDD) process.
04 | The goal is to build data-dominant products for a venture. | The goal is to make data more vital and usable, i.e. by extracting only the important information.
05 | It deals with all types of data, i.e. structured, unstructured or semi-structured. | It mainly deals with the structured forms of data.
06 | It is a superset of Data Mining, as Data Science consists of data scraping, cleaning, visualization, statistics and many more techniques. | It is a subset of Data Science, as mining activities are part of the Data Science pipeline.
07 | It is mainly used for scientific purposes. | It is mainly used for business purposes.