DAFD Unit-2
Phase 6: Operationalize –
The team communicates the benefits of the project more broadly and sets up a pilot
project to deploy the work in a controlled way before broadening it to the full
enterprise of users.
This approach enables the team to learn about the performance and related
constraints of the model in a production environment on a small scale, and to
make adjustments before full deployment.
The team delivers final reports, briefings, and code.
Free or open source tools – Octave, WEKA, SQL, MADlib.
Obtaining Data Files:
Data collection is the process of acquiring, collecting, extracting, and storing a
voluminous amount of data, which may be in structured or unstructured form,
such as text, video, audio, XML files, records, or other image files, and which is
used in later stages of data analysis.
In big data analysis, data collection is the initial step, carried out before
starting to analyze the patterns or useful information in the data. The data to be
analyzed must be collected from different valid sources.
The data collected at this stage is known as raw data and is not directly useful;
after the impure data is cleaned and used for further analysis it becomes
information, and the insight obtained from that information is known as
“knowledge”. Knowledge can take many forms, such as business knowledge about the
sales of enterprise products, disease treatment, etc. The main goal of data
collection is to collect information-rich data.
Data collection starts with asking some questions, such as what type of data is to
be collected and what the source of collection is. Most collected data falls into
two types: “qualitative data”, which is non-numerical data such as words and
sentences and mostly focuses on the behavior and actions of a group, and
“quantitative data”, which is in numerical form and can be measured and analyzed
using different scientific tools and sampling methods.
The collected data is further divided into two main types:
1. Primary data
2. Secondary data
1. Primary data:
Data that is raw, original, and extracted directly from official sources is known
as primary data. This type of data is collected directly through techniques such
as questionnaires, interviews, and surveys. The data collected must match the
demand and requirements of the target audience on which the analysis is
performed; otherwise it becomes a burden during data processing.
1. Interview method:
In this method, data is collected by interviewing the target audience. The person
who asks the questions is called the interviewer and the person who answers is
known as the interviewee. Some basic business or product-related questions are
asked and recorded in the form of notes, audio, or video, and this data is stored
for processing. Interviews can be both structured and unstructured, such as
personal interviews or formal interviews conducted by telephone, face to face,
email, etc.
2. Survey method:
The survey method is a research process in which a list of relevant questions is
asked and the answers are recorded in the form of text, audio, or video. A survey
can be conducted in both online and offline modes, for example through website
forms and email, and the responses are then stored for data analysis. Examples are
online surveys or surveys through social media polls.
3. Observation method:
In the observation method, data is collected by observing the behavior and actions
of the target audience in their natural setting, without direct interaction, and
the observations are recorded as notes, audio, or video for later analysis.
4. Experimental method:
The experimental method is the process of collecting data by performing
experiments, research, and investigation. The most frequently used experimental
designs are CRD (completely randomized design), RBD (randomized block design),
LSD (Latin square design), and FD (factorial design).
2. Secondary data:
Secondary data is data that has already been collected and is reused for some
valid purpose. This type of data is derived from previously recorded primary data
and comes from two types of sources: internal sources and external sources.
Internal source:
This type of data can easily be found within the organization, for example market
records, sales records, transactions, customer data, accounting resources, etc.
The cost and time required to obtain data from internal sources are low.
External source:
Data that cannot be found within the organization and has to be obtained through
external third-party resources is external source data. The cost and time required
are higher because these sources contain a huge amount of data. Examples of
external sources are government publications, news publications, the Registrar
General of India, the Planning Commission, the International Labour Bureau,
syndicate services, and other non-governmental publications.
Data analytics in the context of this write-up could be seen as the act
of sourcing, processing, analysing, interpreting and visualizing data
with the primary objective of extracting actionable insights from the
results of the analysis.
Collect data
Collecting data is the process of assembling all the data you need for ML. Data collection can be
tedious because data resides in many data sources, including on laptops, in data warehouses, in the
cloud, inside applications, and on devices. Finding ways to connect to different data sources can
be challenging. Data volumes are also increasing exponentially, so there is a lot of data to search
through. Additionally, data has vastly different formats and types depending on the source. For
example, video data and tabular data are not easy to use together.
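As a rough illustration of assembling data that lives in more than one place, the short Python/pandas sketch below reads the same kind of records from a CSV export and from a small SQLite database and combines them into one table. The file contents, table name, and columns are made up for illustration.

# A minimal sketch of combining two hypothetical data sources.
import io
import sqlite3
import pandas as pd

# Source 1: tabular data exported as CSV (inlined here so the sketch runs).
csv_text = "order_id,amount\n1,120.5\n2,87.0\n"
sales_csv = pd.read_csv(io.StringIO(csv_text))

# Source 2: the same kind of records stored in a relational database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (order_id INTEGER, amount REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?)", [(3, 45.2), (4, 210.0)])
sales_db = pd.read_sql_query("SELECT * FROM sales", conn)
conn.close()

# Combine both sources into one frame for cleaning and analysis later on.
sales = pd.concat([sales_csv, sales_db], ignore_index=True)
print(sales)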
Clean data
Cleaning data corrects errors and fills in missing data as a step to ensure data quality. After you
have clean data, you will need to transform it into a consistent, readable format. This process can
include changing field formats like dates and currency, modifying naming conventions, and
correcting values and units of measure so they are consistent.
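A minimal Python/pandas sketch of these cleaning steps, using a hypothetical sales table with inconsistent dates, currency strings, naming conventions, and a missing value:

import pandas as pd

# Hypothetical raw records with inconsistent formats and a missing value.
df = pd.DataFrame({
    "order_date": ["2023-01-05", "2023-02-05", None],
    "amount": ["$1,200.50", "950", "$87.25"],
    "country": ["USA", "U.S.A.", "usa"],
})

# Convert date strings into a proper datetime column; bad values become NaT.
df["order_date"] = pd.to_datetime(df["order_date"], errors="coerce")

# Strip currency symbols and thousands separators, then convert to numbers.
df["amount"] = df["amount"].str.replace(r"[$,]", "", regex=True).astype(float)

# Standardize the naming convention so equivalent values match.
df["country"] = df["country"].str.upper().str.replace(".", "", regex=False)

# Fill in the missing date instead of dropping the record (one simple choice).
df["order_date"] = df["order_date"].fillna(pd.Timestamp("2023-01-01"))

print(df)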
Label data
Data labeling is the process of identifying raw data (images, text files, videos, and so on) and
adding one or more meaningful and informative labels to provide context so an ML model can
learn from it. For example, labels might indicate if a photo contains a bird or car, which words
were mentioned in an audio recording, or if an X-ray discovered an irregularity. Data labeling is
required for various use cases, including computer vision, natural language processing, and speech
recognition.
After data is cleaned and labeled, ML teams often explore the data to make sure it is correct and
ready for ML. Visualizations like histograms, scatter plots, box and whisker plots, line plots, and
bar charts are all useful tools to confirm data is correct. Visualizations also
help data science teams complete exploratory data analysis. This process uses
visualizations to discover
patterns, spot anomalies, test a hypothesis, or check assumptions. Exploratory data analysis does
not require formal modeling; instead, data science teams can use visualizations to decipher the
data.
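For example, a short matplotlib sketch of three of these exploratory plots on synthetic numeric data (the numbers are generated purely for illustration):

import matplotlib.pyplot as plt
import numpy as np

# Synthetic numeric data standing in for a real column under exploration.
rng = np.random.default_rng(42)
values = rng.normal(loc=50, scale=10, size=500)
other = values * 0.8 + rng.normal(scale=5, size=500)

fig, axes = plt.subplots(1, 3, figsize=(12, 3))

# Histogram: check the overall distribution and spot skew or gaps.
axes[0].hist(values, bins=30)
axes[0].set_title("Histogram")

# Scatter plot: look for a relationship between two variables.
axes[1].scatter(values, other, s=5)
axes[1].set_title("Scatter plot")

# Box plot: highlight the median, spread, and potential outliers.
axes[2].boxplot(values)
axes[2].set_title("Box plot")

plt.tight_layout()
plt.show()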
Data organization is the process of putting data into groups and categories to
make it easier to use so that it can be accessed, processed, and analyzed
more quickly.
You’ll need to organize your data in the most logical and orderly way
possible, similar to how we collect critical papers in file folders, so you and
anybody who has access to it can quickly find what they’re searching for.
In a world where data sets are among the most valuable assets owned by businesses
across many different sectors, companies use this method to make better use of
their data assets.
Executives and other professionals may put a lot of effort into organizing
data as part of a larger plan to streamline business processes, get better
business intelligence, and improve a business model in general.
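A small pandas sketch of this idea, grouping hypothetical order records by category and bucketing a numeric column into labelled bands (the columns and values are made up for illustration):

import pandas as pd

# Hypothetical order records.
orders = pd.DataFrame({
    "region": ["North", "South", "North", "East", "South"],
    "product": ["A", "A", "B", "B", "A"],
    "revenue": [120, 80, 200, 150, 60],
})

# Group related records together so they can be summarized and accessed quickly.
by_region = orders.groupby("region")["revenue"].sum()
print(by_region)

# Categorize a continuous value into labelled buckets.
orders["revenue_band"] = pd.cut(orders["revenue"], bins=[0, 100, 150, 250],
                                labels=["low", "medium", "high"])
print(orders)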
Data sampling is an effective approach for data analysis that comes with
various benefits and also a few challenges.
Time savings: Sampling can be particularly useful with data sets that are
too large to efficiently analyze in full -- for example, in big data
analytics applications or surveys. Identifying and analyzing a
representative sample is more efficient and less time-consuming than
surveying the entirety of the data or population.
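A brief Python sketch of the idea, drawing a 1% simple random sample from a synthetic population and comparing summary values (the data is generated only for illustration):

import numpy as np
import pandas as pd

# A synthetic "population" of 100,000 transaction amounts.
rng = np.random.default_rng(0)
population = pd.DataFrame({"amount": rng.exponential(scale=100, size=100_000)})

# A 1% simple random sample; random_state makes the draw reproducible.
sample = population.sample(frac=0.01, random_state=0)

print("population mean:", round(population["amount"].mean(), 2))
print("sample mean:    ", round(sample["amount"].mean(), 2))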
Measure of Central Tendency
A measure of central tendency represents the whole data set by a single value and
gives us the location of its central point. There are three main measures of
central tendency:
Mean
Mode
Median
Mean
It is the sum of the observations divided by the total number of observations; in
other words, it is the average.
Mode
It is the value that has the highest frequency in the given data set. The data set
may have no mode if the frequency of all data points is the same. Also, we can
have more than one mode if we encounter two or more data points having the
same frequency.
Median
It is the middle value of the data set. It splits the data into two halves. If the
number of elements in the data set is odd then the center element is the median
and if it is even then the median would be the average of two central elements.
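All three measures can be computed with Python's standard statistics module; the small data set below is made up for illustration:

from statistics import mean, median, mode

data = [4, 8, 8, 5, 3, 10, 8, 5]

print("Mean:  ", mean(data))    # 6.375: sum of observations / number of observations
print("Median:", median(data))  # 6.5: average of the two central values (even count)
print("Mode:  ", mode(data))    # 8: the most frequent value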
Measure of Variability
Unit - 3
Given a large set of numerical data, Benford’s Law asserts that the first digit of these
numbers is more likely to be small. If the data follows Benford’s Law, then
approximately 30% of the time the first digit would be a 1, whilst 9 would only be the
first digit around 5% of the time. If the distribution of the first digit was uniform, then
they would all occur equally often (around 11% of the time). It also proposes a
distribution of the second digit, third digit, combinations of digits, and so
on. According to Benford’s Law, the probability that the first digit in a dataset is d is
given by P(d) = log10(1 + 1/d).
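Evaluating this formula for each leading digit reproduces the percentages mentioned above; a short Python check:

import math

# P(d) = log10(1 + 1/d) for the leading digits 1..9.
for d in range(1, 10):
    print(f"digit {d}: {math.log10(1 + 1 / d):.1%}")
# digit 1 comes out near 30.1%, digit 9 near 4.6%.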
Why is it useful?
There are plenty of data sets that have proven to have followed Benford’s Law,
including stock prices, population numbers, and electricity bills. Due to the large
availability of data known to follow Benford’s Law, checking a data set to see if it
follows Benford’s Law can be a good indicator as to whether the data has been
manipulated. While this is not definitive proof that the data is erroneous or fraudulent,
it can provide a good indication of problematic trends in your data.
In the context of fraud, Benford’s Law can be used to detect anomalies and
irregularities in financial data, for example within large datasets such as
invoices, sales records, expense reports, and other financial statements. If the
data has been
fabricated, then the person tampering with it would probably have done so
“randomly”. This means the first digits would be uniformly distributed and thus, not
follow Benford’s Law.
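As a rough sketch of what such a check might look like, the Python snippet below compares the observed leading-digit shares of a small, made-up list of invoice amounts with the shares Benford’s Law predicts; a real audit would use far more data and a formal test such as chi-squared.

import math
from collections import Counter

# Made-up invoice amounts used only to illustrate the check.
amounts = [1234.50, 187.20, 96.10, 1509.00, 23.75, 412.00, 1820.99,
           134.60, 2750.00, 19.99, 310.45, 1675.30]

# Extract the first significant digit of each amount.
first_digits = [int(str(abs(a)).lstrip("0.")[0]) for a in amounts]
observed = Counter(first_digits)
n = len(first_digits)

for d in range(1, 10):
    expected = math.log10(1 + 1 / d)
    print(f"digit {d}: observed {observed[d] / n:.2f}, expected {expected:.2f}")
# Large, systematic gaps between observed and expected shares are a signal to
# investigate further, not proof of fraud.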
Below are some real-world examples where Benford’s Law has been applied:
Detecting election fraud – Benford’s Law was used as evidence of fraud in the 2009
Iranian elections and was also used for auditing data from the 2009 German federal
elections. Benford’s Law has also been used in multiple US presidential elections.
Analysis of price digits – When the euro was introduced, all the different exchange
rates meant that, while the “real” price of goods stayed the same, the “nominal” price
(the monetary value) of goods was distorted. Research carried out across Europe
showed that the first digits of nominal prices followed Benford’s Law. However,
deviation from this occurred for the second and third digits. Here, trends more
commonly associated with psychological pricing could be observed.