
Unit - 2

Data Analytics Lifecycle:


The data analytics lifecycle is designed for Big Data problems and data science
projects. The cycle is iterative, reflecting how real projects move back and
forth between phases. To address the distinct requirements of performing
analysis on Big Data, a step-by-step methodology is needed to organize the
activities and tasks involved in acquiring, processing, analyzing, and
repurposing data.

Phase 1: Discovery –

 The data science team learns about and investigates the problem.
 The team develops context and understanding.
 It identifies which data sources are needed and available for the project.
 The team formulates initial hypotheses that can later be tested with the data.
Phase 2: Data Preparation –

 Steps to explore, preprocess, and condition data prior to modeling and
analysis.
 It requires the presence of an analytic sandbox; the team executes extract,
load, and transform (ELT) steps to get data into the sandbox.
 Data preparation tasks are likely to be performed multiple times and not in a
predefined order.
 Several tools commonly used for this phase are Hadoop, Alpine Miner,
OpenRefine, etc.
Phase 3: Model Planning –

 The team explores the data to learn about the relationships between variables
and subsequently selects the key variables and the most suitable models.
 The team decides on the methods, techniques, and workflow it will follow in
the model building phase.
 Several tools commonly used for this phase are MATLAB and STATISTICA.
Phase 4: Model Building –

 The team develops data sets for training, testing, and production purposes.
 The team builds and executes models based on the work done in the model
planning phase.
 The team also considers whether its existing tools will suffice for running
the models or whether it needs a more robust environment for executing them.
 Free or open-source tools – R and PL/R, Octave, WEKA.
 Commercial tools – MATLAB, STATISTICA.
Phase 5: Communicate Results –

 After executing the model, the team needs to compare the outcomes of the
modeling to the criteria established for success and failure.
 The team considers how best to articulate the findings and outcomes to the
various team members and stakeholders, taking into account caveats and
assumptions.
 The team should identify the key findings, quantify the business value, and
develop a narrative to summarize and convey the findings to stakeholders.
Phase 6: Operationalize –

 The team communicates the benefits of the project more broadly and sets up a
pilot project to deploy the work in a controlled way before broadening it to
the full enterprise of users.
 This approach enables the team to learn about the performance and related
constraints of the model in a production environment on a small scale and to
make adjustments before full deployment.
 The team delivers final reports, briefings, and code.
 Free or open-source tools – Octave, WEKA, SQL, MADlib.
Obtaining Data Files:

Data collection is the process of acquiring, collecting, extracting, and storing
voluminous amounts of data, which may be in structured or unstructured form such
as text, video, audio, XML files, records, or image files, for use in later
stages of data analysis.
In big data analysis, data collection is the initial step, carried out before
any patterns or useful information can be extracted from the data. The data to
be analyzed must be collected from valid sources.
The data collected is known as raw data; it is not immediately useful, but once
it is cleaned and used for further analysis it becomes information, and the
insight derived from that information is known as “knowledge”. Knowledge can
take many forms, such as business knowledge, knowledge of an enterprise's
product sales, or knowledge about disease treatment. The main goal of data
collection is to collect information-rich data.

Data collection starts with asking questions such as what type of data is to be
collected and what the source of the collection is. Most collected data falls
into two types: “qualitative data”, which is non-numerical data such as words
and sentences and mostly focuses on the behavior and actions of a group, and
“quantitative data”, which is numerical and can be analyzed using scientific
tools and sampling techniques.

The actual data is then further divided mainly into two types known as:

1. Primary data
2. Secondary data

1. Primary data:

Data that is raw, original, and extracted directly from official sources is
known as primary data. This type of data is collected directly through
techniques such as questionnaires, interviews, and surveys. The data collected
must match the demands and requirements of the target audience on which the
analysis is performed; otherwise it becomes a burden during data processing.

Few methods of collecting primary data:

1. Interview method:

Data is collected by interviewing the target audience; the person asking the
questions is called the interviewer and the person answering them is the
interviewee. Basic business or product-related questions are asked and recorded
as notes, audio, or video, and this data is stored for processing. Interviews
can be structured or unstructured, for example personal interviews or formal
interviews conducted by telephone, face to face, or email.

2. Survey method:

The survey method is a research process in which a list of relevant questions
is asked and the answers are recorded in the form of text, audio, or video.
Surveys can be conducted both online and offline, for example through website
forms and email, and the responses are then stored for data analysis. Examples
are online surveys or surveys conducted through social media polls.

3. Observation method:

The observation method is a data collection method in which the researcher
keenly observes the behavior and practices of the target audience using a data
collection tool and stores the observed data in the form of text, audio, video,
or another raw format. In this method, data is gathered by watching the
participants rather than by posing questions to them. For example, a researcher
may observe a group of customers and their behavior towards certain products.
The data obtained is then sent for processing.

4. Experimental method:
The experimental method is the process of collecting data by performing
experiments, research, and investigation. The most frequently used experimental
designs are CRD, RBD, LSD, and FD.

 CRD – Completely Randomized Design is a simple experimental design used in
data analytics which is based on randomization and replication. It is mostly
used for comparing treatments in an experiment.
 RBD – Randomized Block Design is an experimental design in which the
experimental units are divided into small groups called blocks. Treatments are
assigned at random within each block, and the results are analyzed using a
technique known as analysis of variance (ANOVA). RBD originated in the
agricultural sector.
 LSD – Latin Square Design is an experimental design similar to CRD and RBD
but arranged in rows and columns. It is an N x N arrangement with an equal
number of rows and columns, in which each symbol occurs exactly once in every
row and every column (see the sketch after this list). This allows differences
to be detected with fewer errors in the experiment. A Sudoku puzzle is an
example of a Latin square design.
 FD – Factorial Design is an experimental design in which each experiment
involves two or more factors, each with several possible values (levels), and
trials are run across the combinations of these factor levels.
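To make the Latin square property concrete, here is a minimal sketch in Python;
the size N = 4 and the letter symbols are illustrative choices, not from the
text. It builds an N x N square by cyclically shifting each row, so every
symbol appears exactly once in every row and every column:

def latin_square(n):
    # Symbols A, B, C, ... ; row i is the symbol list rotated left by i places.
    symbols = [chr(ord('A') + i) for i in range(n)]
    return [[symbols[(i + j) % n] for j in range(n)] for i in range(n)]

if __name__ == "__main__":
    for row in latin_square(4):
        print(" ".join(row))
    # A B C D
    # B C D A
    # C D A B
    # D A B C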

2. Secondary data:

Secondary data is data that has already been collected and is being reused for
another valid purpose. This type of data is derived from previously recorded
primary data, and it has two types of sources: internal and external.

Internal source:

This type of data can easily be found within the organization, for example
market records, sales records, transactions, customer data, and accounting
resources. The cost and time required to obtain data from internal sources are
low.

External source:
Data that cannot be found within the organization and must be obtained through
external third-party resources is external source data. The cost and time
required are greater because these sources contain huge amounts of data.
Examples of external sources are government publications, news publications,
the Registrar General of India, the Planning Commission, the International
Labour Bureau, syndicate services, and other non-governmental publications.

What is Data Analytics?

Data analytics in the context of this write-up could be seen as the act
of sourcing, processing, analysing, interpreting and visualizing data
with the primary objective of extracting actionable insights from the
results of the analysis.

The International Auditing and Assurance Standards Board (IAASB)


defines data analytics as the science and art of discovering and
analysing patterns, deviations and inconsistencies, and extracting
other useful information in the data underlying or related to the
subject matter of an audit through analysis, modelling and
visualisation for the purpose of planning and performing the audit.
{1}

There are various data analytic tools available to auditors. In fact,


audit firms are now integrating data analytic capability into their
audit workflow and this allows them to run all their analytics within
a single system. Common data analytic tools include Excel's data analysis
tools, Python, SQL queries, IDEA analytical software, R, Tableau, etc.

Impact of Data Analytics on Audit

Data analytics has positively impacted audit in several ways, some of which are:

1) Fraud and Error Detection: Data analytic tools can help discover unusual
patterns in large data sets which may sometimes be suggestive of fraud or
error.

2) Analytic tools usually contain features that help discover invalid, missing
or erroneous data and can sometimes help confirm the completeness of a
population to be tested. Likewise, through the application of computer-assisted
audit techniques (“CAATs”), audit analytic tools assist in verifying the
completeness and accuracy of system-generated reports.

3) CAATs enable auditors to easily handle highly automated ledger balances,
thereby allowing an increased focus on high-risk areas that require deep
professional judgement.

4) Data analytics enables effective managerial decision-making through the
superior business intelligence provided by data analytic tools such as
Microsoft Power BI.

5) It increases efficiency, thereby reducing the time spent on mundane tasks
during an audit.
6) The traditional auditing methodology is based on sampling: selecting a
subset of data out of a large population. With data analytics, however, it is
possible for auditors to test the entire population and provide audit evidence
in a more detailed, granular form.

7) Advances in machine learning and artificial intelligence have also led to
warehouse drones that can be used to carry out inventory counts. An inventory
count involves recording the stock held by a company at a particular point in
time. Inventory counts are usually done manually and can be very strenuous and
time-consuming.

What is a File Format


File formats are designed to store specific types of information, and the file
format tells the computer how to display or process the file's content. Common
file formats include CSV, XLSX, ZIP, TXT, etc.
If you see your future in data science, you must understand the different types
of file format. Data science is all about data and its processing, and if you
do not understand the file formats you are working with, handling the data
becomes unnecessarily complicated. It is therefore important to be aware of the
different file formats.
Different types of file formats:
CSV: CSV stands for comma-separated values. As the name suggests, a CSV file
uses commas to separate values. Each line in a CSV file is a data record, and
each record consists of one or more fields separated by commas.
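As a minimal sketch of how such a file is processed, the snippet below reads a
CSV with Python's built-in csv module; the file name "sales.csv" and its column
names are hypothetical:

import csv

# Each line is one data record; the first line is treated as the header.
with open("sales.csv", newline="", encoding="utf-8") as f:
    reader = csv.DictReader(f)
    for record in reader:
        # Each field was separated by commas and is now keyed by column name.
        print(record["product"], record["amount"])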
How do you prepare your data?
Data preparation follows a series of steps that starts with collecting the right data, followed by
cleaning, labeling, and then validation and visualization.

Collect data

Collecting data is the process of assembling all the data you need for ML. Data collection can be
tedious because data resides in many data sources, including on laptops, in data warehouses, in the
cloud, inside applications, and on devices. Finding ways to connect to different data sources can
be challenging. Data volumes are also increasing exponentially, so there is a lot of data to search
through. Additionally, data has vastly different formats and types depending on the source. For
example, video data and tabular data are not easy to use together.

Clean data

Cleaning data corrects errors and fills in missing data as a step to ensure data quality. After you
have clean data, you will need to transform it into a consistent, readable format. This process can
include changing field formats like dates and currency, modifying naming conventions, and
correcting values and units of measure so they are consistent.
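As a hedged sketch of what these cleaning steps can look like in practice with
pandas (the file name, column names, and unit conversion below are assumptions,
not from the text):

import pandas as pd

df = pd.read_csv("orders_raw.csv")

# Modify naming conventions: consistent, lowercase column names.
df.columns = [c.strip().lower().replace(" ", "_") for c in df.columns]

# Change field formats: parse dates and numbers, coercing bad values to NaT/NaN.
df["order_date"] = pd.to_datetime(df["order_date"], errors="coerce")
df["amount"] = pd.to_numeric(df["amount"], errors="coerce")

# Fill in missing data, then drop rows that remain unusable.
df["amount"] = df["amount"].fillna(df["amount"].median())
df = df.dropna(subset=["order_date"])

# Correct units of measure so they are consistent (here, grams to kilograms).
df["weight_kg"] = df["weight_g"] / 1000.0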

Label data

Data labeling is the process of identifying raw data (images, text files, videos, and so on) and
adding one or more meaningful and informative labels to provide context so an ML model can
learn from it. For example, labels might indicate if a photo contains a bird or car, which words
were mentioned in an audio recording, or if an X-ray discovered an irregularity. Data labeling is
required for various use cases, including computer vision, natural language processing, and speech
recognition.
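A minimal, purely illustrative sketch of what labeled data can look like: each
raw item (here an image path) is paired with one meaningful label that a
computer vision model could learn from. The file names and labels are made up:

import csv

labels = [
    {"file": "images/0001.jpg", "label": "bird"},
    {"file": "images/0002.jpg", "label": "car"},
]

# Write a simple label manifest that pairs each raw file with its label.
with open("labels.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["file", "label"])
    writer.writeheader()
    writer.writerows(labels)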

Validate and visualize

After data is cleaned and labeled, ML teams often explore the data to make sure it is correct and
ready for ML. Visualizations like histograms, scatter plots, box and whisker plots, line plots, and
bar charts are all useful tools to confirm data is correct. Additionally, visualizations also help data
science teams complete exploratory data analysis. This process uses visualizations to discover
patterns, spot anomalies, test a hypothesis, or check assumptions. Exploratory data analysis does
not require formal modeling; instead, data science teams can use visualizations to decipher the
data.
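As a sketch of this kind of exploratory check using pandas and matplotlib (the
data file and column names are assumptions): a histogram to inspect a
distribution and a scatter plot to spot anomalies.

import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("orders_clean.csv")

fig, axes = plt.subplots(1, 2, figsize=(10, 4))

# Histogram: is the distribution of amounts plausible, or dominated by outliers?
df["amount"].plot.hist(bins=30, ax=axes[0], title="Order amounts")

# Scatter plot: do weight and amount move together, or are there anomalies?
df.plot.scatter(x="weight_kg", y="amount", ax=axes[1], title="Weight vs amount")

plt.tight_layout()
plt.show()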

How to do data organization?

Data organization is the process of putting data into groups and categories to
make it easier to use so that it can be accessed, processed, and analyzed
more quickly.

You’ll need to organize your data in the most logical and orderly way
possible, similar to how we collect critical papers in file folders, so you and
anybody who has access to it can quickly find what they’re searching for.

It enables us to organize the information so that it is simple to read and use.

Working with raw data, or performing any analytics on it, is challenging, so to
portray the data properly we must first arrange it.

In a world where data sets are some of the most valuable assets possessed by
businesses across many different sectors, companies use this method to use
their data assets better.

Executives and other professionals may put a lot of effort into organizing
data as part of a larger plan to streamline business processes, get better
business intelligence, and improve a business model in general.

What is data sampling?

Data sampling is a statistical analysis technique used to select, manipulate


and analyze a representative subset of data points to identify patterns and
trends in the larger data set being examined. It enables data scientists,
predictive modelers and other data analysts to work with a small,
manageable amount of data about a statistical population to build and run
analytical models more quickly, while still producing accurate findings.

Why is data sampling important?

Data sampling is a widely used statistical approach that can be applied in


various use cases, including opinion polls, web analytics, and political polls. For
example, a researcher doesn't need to speak with every American to
discover the most common method of commuting to work in the U.S.
Instead, they can choose 1,000 participants as a representative sample in the
hopes that this number will be sufficient to produce accurate results.

Therefore, data sampling enables data scientists and researchers to


extrapolate knowledge about a broad population from a smaller sample of
data. By taking a data sample, predictions about the larger population can be
made with a certain level of confidence without having to collect and
analyze data from each member of the population.
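A minimal sketch of simple random sampling in Python, echoing the commuting
example above; the synthetic population and the commute categories are made-up
assumptions:

import random

random.seed(42)

# Stand-in for the full population of commuters (categories are illustrative).
population = [random.choice(["car", "bus", "train", "bike", "walk"])
              for _ in range(1_000_000)]

# Draw a representative subset of 1,000 participants and extrapolate from it.
sample = random.sample(population, k=1000)
share_car = sample.count("car") / len(sample)
print(f"Estimated share commuting by car: {share_car:.1%}")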

Advantages and challenges of data sampling

Data sampling is an effective approach for data analysis that comes with
various benefits and also a few challenges.

Benefits of data sampling

 Time savings. Sampling can be particularly useful with data sets that are
too large to efficiently analyze in full -- for example, in big data
analytics applications or surveys. Identifying and analyzing a
representative sample is more efficient and less time-consuming than
surveying the entirety of the data or population.

 Cost savings. Data sampling is often more cost-effective than collecting


data from the entire population.
 Accuracy. Correct sampling techniques can produce reliable findings.
Researchers can accurately interpret information about the total
population by selecting a representative sample.

 Flexibility. Data sampling provides researchers with the flexibility to


choose from a variety of sampling methods and sample sizes to best
address their research questions and make use of their resources.

 Bias elimination. Sampling can help to eliminate bias in data analysis,


as a well-designed sample can limit the influence of outliers, errors and
other kinds of bias that may impair the analysis of the entire population.

What are Descriptive Statistics?


In descriptive statistics, we describe our data with the help of various
representative methods such as charts, graphs, tables, Excel files, etc. We
summarize the data and present it in a meaningful way so that it can be easily
understood. Most of the time it is performed on small data sets, and this
analysis helps us anticipate future trends based on the current findings.
Measures that are commonly used to describe a data set are measures of central
tendency and measures of variability or dispersion.

Types of Descriptive Statistics


 Measures of Central Tendency
 Measure of Variability
 Measures of Frequency Distribution
Measures of Central Tendency

It represents the whole set of data by a single value. It gives us the location of
the central points. There are three main measures of central tendency:

 Mean
 Mode
 Median

Mean
It is the sum of the observations divided by the total number of observations;
in other words, it is the average (sum divided by count).

Mode
It is the value that has the highest frequency in the given data set. The data set
may have no mode if the frequency of all data points is the same. Also, we can
have more than one mode if we encounter two or more data points having the
same frequency.

Median
It is the middle value of the data set. It splits the data into two halves. If the
number of elements in the data set is odd then the center element is the median
and if it is even then the median would be the average of two central elements.
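A minimal sketch computing all three measures with Python's built-in statistics
module; the data set is purely illustrative:

import statistics

data = [2, 3, 3, 5, 7, 8, 9]

print("Mean:  ", statistics.mean(data))    # sum / count = 37 / 7 ≈ 5.29
print("Median:", statistics.median(data))  # middle value of the sorted data = 5
print("Mode:  ", statistics.mode(data))    # most frequent value = 3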

Measure of Variability

Measures of variability are also termed measures of dispersion, as they help us
gain insight into the dispersion or spread of the observations at hand.
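As a short sketch for the same illustrative data set, a few common dispersion
measures (range, variance, and standard deviation; the population versions from
the statistics module are used here):

import statistics

data = [2, 3, 3, 5, 7, 8, 9]

print("Range:   ", max(data) - min(data))       # spread between extremes = 7
print("Variance:", statistics.pvariance(data))  # population variance
print("Std dev: ", statistics.pstdev(data))     # square root of the variance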

Unit - 3

What is Benford’s Law?


Benford’s law is named after a physicist called Frank Benford and was first
discovered in the 1880s by an astronomer named Simon Newcomb. Newcomb was
looking through logarithm tables (used before pocket calculators were invented to
find the value of the logarithms of numbers), when he spotted that the pages which
started with earlier digits, like 1, were significantly more worn than other pages.

Given a large set of numerical data, Benford’s Law asserts that the first digit of these
numbers is more likely to be small. If the data follows Benford’s Law, then
approximately 30% of the time the first digit would be a 1, whilst 9 would only be the
first digit around 5% of the time. If the distribution of the first digit was uniform, then
they would all occur equally often (around 11% of the time). It also proposes a
distribution of the second digit, third digit, combinations of digits, and so
on. According to Benford’s Law, the probability that the first digit in a dataset is d is
given by P(d) = log10(1 + 1/d).
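The figures quoted above follow directly from this formula; here is a minimal
sketch that prints the full expected first-digit distribution:

import math

# P(d) = log10(1 + 1/d) for first digits 1..9.
for d in range(1, 10):
    p = math.log10(1 + 1 / d)
    print(f"P(first digit = {d}) = {p:.3f}")
# Prints roughly 0.301 for digit 1, decreasing down to about 0.046 for digit 9.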

Why is it useful?
There are plenty of data sets that have been shown to follow Benford's Law,
including stock prices, population numbers, and electricity bills. Because so
much data is known to follow Benford's Law, checking whether a data set
conforms to it can be a good indicator of whether the data has been
manipulated. While this is not definitive proof that the data is erroneous or
fraudulent, it can provide a good indication of problematic trends in your
data.

In the context of fraud, Benford's law can be used to detect anomalies and
irregularities in financial data, for example within large datasets such as
invoices, sales records, expense reports, and other financial statements. If
the data has been fabricated, the person tampering with it would probably have
done so “randomly”. This means the first digits would be roughly uniformly
distributed and thus would not follow Benford's Law.
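A hedged sketch of such a check: count the observed leading digits in a list of
amounts and compare them with the expected Benford proportions. The input
values below are hypothetical, and a real audit test would also apply a formal
statistical test (for example chi-square) before drawing any conclusion:

import math

def first_digit(x):
    # Strip sign, leading zeros, and the decimal point to find the leading digit.
    s = str(abs(x)).lstrip("0.")
    return int(s[0]) if s and s[0].isdigit() else None

def benford_check(values):
    digits = [d for d in (first_digit(v) for v in values) if d]
    n = len(digits)
    for d in range(1, 10):
        observed = digits.count(d) / n
        expected = math.log10(1 + 1 / d)
        print(f"digit {d}: observed {observed:.3f}  expected {expected:.3f}")

benford_check([1243.50, 98.20, 17.00, 1560.75, 2300.00, 41.99, 130.00, 1804.10])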

Below are some real-world examples where Benford’s Law has been applied:

Detecting fraud in financial accounts – Benford’s Law can be useful in its


application to many different types of fraud, including money laundering and large
financial accounts. Many years after Greece joined the eurozone, the economic data
they provided to the E.U. was shown to be probably fraudulent using this method.

Detecting election fraud – Benford’s Law was used as evidence of fraud in the 2009
Iranian elections and was also used for auditing data from the 2009 German federal
elections. Benford’s Law has also been used in multiple US presidential elections.

Analysis of price digits – When the euro was introduced, all the different exchange
rates meant that, while the “real” price of goods stayed the same, the “nominal” price
(the monetary value) of goods was distorted. Research carried out across Europe
showed that the first digits of nominal prices followed Benford’s Law. However,
deviation from this occurred for the second and third digits. Here, trends more
commonly associated with psychological pricing could be observed.
