
Title: Data Science

Abstract
Data science plays a vital role in computer science and engineering research, involving the collection, transformation, processing, description, and modelling of data. In this article, the fundamental theory of data science, machine learning, and deep learning is discussed, along with their scope and opportunities, to help researchers gain clarity on data science and its importance. Data science encompasses a set of principles, problem definitions, algorithms, and processes for extracting non-obvious and useful patterns from large data sets. Many of the elements of data science have been developed in related fields such as machine learning and data mining.

Introduction

Data Science is the area of study which involves extracting insights from vast amounts of data using various scientific methods, algorithms, and processes. It helps you to discover hidden patterns in raw data. The term Data Science has emerged from the evolution of mathematical statistics, data analysis, and big data.
Data Science is an interdisciplinary field that allows you to extract knowledge from structured or unstructured data.
Data science enables you to translate a business problem into a research project and then translate it back into a
practical solution.

Data Science is a buzzword in the technology world right now, and for good reason: it represents a major step forward in how computers can learn. Demand for data scientists is high, a surge driven by evolving technology and the generation of huge amounts of data, also known as Big Data.

With the help of data science technologies, banking companies are now able to analyse past expenditures, customer profiles, and other variables to estimate the probability of default and assess risk. They are also leveraging data science to prevent financial fraud.

This data needs to be analysed to enhance decision making, but companies encounter several Big Data challenges, including data quality, storage, a shortage of data science professionals, data validation, and accumulating data from different sources.
Techniques/Methodology

 Regression
Regression analysis is a statistical method for determining which factors have an effect on an outcome of interest. It provides answers to the following questions: Which factors are most important? Which of them can we ignore? What is the relationship between those variables? And, perhaps most importantly, how confident are we in each of these relationships? (A short code sketch follows this list.)

 Classification
The process of identifying a function that divides a dataset into classes based on different parameters is known as classification. A computer program is trained on a training dataset and then uses that training to categorise new data into the different classes. The classification algorithm's goal is to discover a mapping function that converts an input into a discrete output, that is, a class label. (A classification sketch also follows this list.)

 Linear regression
Linear regression is one of the predictive modelling methods. It models the relationship between a dependent variable and one or more independent variables, and so assists in the discovery of associations between variables. (The regression sketch after this list uses a linear model.)

 Jackknife regression
The jackknife method, also known as the "leave one out" procedure, is a cross-validation technique invented by Quenouille to measure an estimator's bias. Jackknife estimation of a parameter is an iterative method: the parameter is first calculated from the entire sample; then, one by one, each observation is removed from the sample, and the parameter of interest is recomputed on the reduced sample. (A jackknife sketch follows this list.)

 Anomaly detection
Anomaly detection requires a deeper understanding of the data's original behaviour over time, together with a comparison of new behaviour to see whether it fits. Compared with outlier detection, it amounts to finding the odd one out: the data points that do not fit in with the rest of the data. (A detection sketch follows this list.)
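
To make the regression ideas above concrete, the sketch below fits an ordinary least-squares model with scikit-learn on synthetic data. The feature values and coefficients are invented for illustration; the paper itself does not prescribe a specific library or dataset.

import numpy as np
from sklearn.linear_model import LinearRegression

# Minimal linear-regression sketch (synthetic data, illustrative only).
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 2))                  # two hypothetical predictors
y = 3.0 * X[:, 0] - 1.5 * X[:, 1] + rng.normal(scale=0.5, size=200)

model = LinearRegression().fit(X, y)
print("coefficients:", model.coef_)            # which factors matter, and how much
print("intercept:", model.intercept_)
print("R^2:", model.score(X, y))               # rough measure of confidence in the fit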
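
Similarly, a minimal classification sketch, assuming scikit-learn and its bundled Iris dataset (an assumption made purely for illustration), trains on one split of the data and assigns discrete class labels to the held-out split:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Classification sketch: train on one split, predict discrete classes on another.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

clf = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_train, y_train)
y_pred = clf.predict(X_test)                   # mapping inputs to discrete class labels
print("accuracy:", accuracy_score(y_test, y_pred))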
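
The jackknife ("leave one out") procedure described above can be sketched directly in NumPy. Here the estimator is the sample mean, chosen only as an illustration; any statistic computed from the sample could be substituted.

import numpy as np

def jackknife_bias(sample, estimator):
    """Jackknife bias estimate: recompute the statistic with each observation left out."""
    n = len(sample)
    full_estimate = estimator(sample)                       # estimate on the whole sample
    leave_one_out = np.array([
        estimator(np.delete(sample, i)) for i in range(n)   # drop each observation in turn
    ])
    bias = (n - 1) * (leave_one_out.mean() - full_estimate) # standard jackknife bias formula
    return full_estimate, bias

data = np.random.default_rng(0).exponential(scale=2.0, size=100)
estimate, bias = jackknife_bias(data, np.mean)
print(f"estimate={estimate:.4f}, jackknife bias={bias:.4f}")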
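
Finally, a hedged anomaly-detection sketch using scikit-learn's IsolationForest (one of many possible detectors; the paper does not name a specific one) flags the points that do not fit the rest of the data:

import numpy as np
from sklearn.ensemble import IsolationForest

# Anomaly-detection sketch on synthetic data: most points are "normal", a few are not.
rng = np.random.default_rng(1)
normal = rng.normal(loc=0.0, scale=1.0, size=(300, 2))     # bulk of the data
outliers = rng.uniform(low=6.0, high=9.0, size=(5, 2))     # a few points that do not fit
X = np.vstack([normal, outliers])

detector = IsolationForest(contamination=0.02, random_state=1).fit(X)
labels = detector.predict(X)                               # -1 marks anomalies, 1 marks inliers
print("flagged anomalies:", np.sum(labels == -1))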

Review of literature process


A categorical review is the main element of the research process. It should be included in the literature survey after a sufficient number of research papers have been reviewed, and it begins after completing the five-stage review process discussed in the previous section.
The research papers covered in these categories span 2018-2022.

Review of Literature
Summary paper 1:
A. Scientific Understanding of Learning, Especially Deep Learning Algorithms.
Salma Sassi, Mirjana Ivanovic, et al. (2022) found that, as much as we admire the astonishing successes of deep learning, we still lack a scientific understanding of why deep learning works so well, though we are making headway.
We do not understand the mathematical properties of deep learning algorithms or of the models they produce. We do
not know how to explain why a deep learning model produces one result and not another. We do not understand how
robust or fragile models are to perturbations to input data distributions. We do not understand how to verify that deep
learning will perform the intended task well on new input data. We do not know how to characterize or measure the
uncertainty of a model’s results. We do not know deep learning’s fundamental computational limits; at what point
does more data and more compute not help? Deep learning is an example of where experimentation in a field is far
ahead of any kind of complete theoretical understanding. And, it is not the only example in learning: random forests
and high-dimensional sparse statistics enjoy widespread applicability on large-scale data, where gaps remain between
their performance in practice and what theory can explain.

Summary paper 2:
B. Causal Reasoning
Alejandro Rodríguez-González et al. (2012) found that machine learning is a powerful tool for finding patterns and examining associations and correlations, particularly in large data sets. While the adoption of machine learning has
opened many fruitful areas of research in economics, social science, public health, and medicine, these fields require
methods that move beyond correlational analyses and can tackle causal questions. A rich and growing area of current
study is revisiting causal inference in the presence of large amounts of data. Economists are devising new methods
that incorporate the wealth of data now available into their mainstay causal reasoning techniques, for example, the use
of instrumental variables; these new methods make causal inference estimation more efficient and flexible. Data
scientists are beginning to explore multiple causal inference, not just to overcome some of the strong assumptions of
univariate causal inference, but because most real-world observations are due to multiple factors that interact with
each other. Inspired by natural experiments used in economics and the social sciences, as more government agency
and commercial data becomes publicly available, data scientists are using synthetic control for novel applications in
public health, retail, and sports.

Summary paper 3:
C. Precious Data
Longbing Cao, Qiang Yang, Philip S. Yu, et al. (2020) found that data can be precious for one of three reasons: the
data set is expensive to collect; the data set contains a rare event; or the data set is artisanal—small, task-specific,
and/or targets a limited audience. A good example of expensive data comes from large, one-off, expensive scientific
instruments, for example, the Large Synoptic Survey Telescope, the Large Hadron Collider, and the Ice Cube
Neutrino Detector at the South Pole. A good example of rare event data is data from sensors on physical
infrastructure, such as bridges and tunnels; sensors produce a lot of raw data, but the disastrous event they are used to
predict is rare. Rare data can also be expensive to collect. A good example of artisanal data is the tens of millions of
court judgments that China has released online to the public, or the two-plus-million U.S. government
declassified documents collected by Columbia's History Lab (Connelly et al., 2019). For each of these different kinds
of precious data, we need new data science methods and algorithms, taking into consideration the domain and the
intended uses and users of the data.

Summary paper 4:
D. Multiple, Heterogeneous Data Sources
Diana Inkpen, Mathieu Roche, and Maguelonne Teisseire, et al. (2019) note that we collect lots of data from different
data sources to improve our models and to increase knowledge. For example, to predict the effectiveness of a specific cancer
treatment for a human, we might build a model based on 2-D cell lines from mice, more expensive 3-D cell lines from
mice, and the costly DNA sequence of the cancer cells extracted from the human. As another example, multiscale,
spatiotemporal climate models simulate the interactions among multiple physical systems, each represented by
disparate data sources drawn from sensing: the ocean, the atmosphere, the land, the biosphere, and humans. Many of
these data sources might be precious data. State-of-the-art data science methods cannot as yet handle combining
multiple, heterogeneous sources of data to build a single, accurate model. Bounding the uncertainty of a data model is
exacerbated when built from multiple, possibly unrelated data sources. More pragmatically, standardization of data
types and data formats could reduce undesired or unnecessary heterogeneity. Focused research in combining multiple
sources of data will provide extraordinary impact.
Summary paper 5:
E. Trustworthy AI
Chavan, V. and Penev, L., et al. (2018) [4] have seen rapid deployment of systems using artificial intelligence and
machine learning in critical domains such as autonomous vehicles, criminal justice, health care, hiring, housing,
human resource management, law enforcement, and public safety, where decisions taken by AI agents directly impact
human lives. Consequently, there is an increasing concern if these decisions can be trusted to be correct, fair, ethical
(see Challenge no. 10), interpretable, private (see Challenge no. 9), reliable, robust, safe, and secure, especially under
adversarial attacks. Many of these properties borrow from a long history of research on Trustworthy Computing
(National Research Council, 1999), but AI raises the ante (Wing, 2020): reasoning about a machine learning model
seems to be inseparable from reasoning about the available data used to build it and the unseen data on which it is to
be used; and these models are inherently probabilistic. One approach to building trust is through providing
explanations of the outcomes of a machine learned model (Adadi & Berrada, 2018; Chen et al., 2018; Murdoch et al.,
2019; Turek, 2016). If we can interpret the outcome in a meaningful way, then the end user can better trust the model.
Another approach is through formal methods, where one strives to prove once and for all a model satisfies a certain
property. New trust properties yield new tradeoffs for machine learned models, for example, privacy versus accuracy;
robustness versus efficiency; fairness versus robustness. There are multiple technical audiences for trustworthy
models: model developers, model users (human and machine), and model customers; as well as more general
audiences: consumers, policymakers, regulators, the media, and the public.

Common Findings
Data scientists today largely draw on extensions of the "analyst" roles of years past, trained in traditional disciplines. As
data science becomes an integral part of many industries and enriches research and development, there will be an
increased demand for more holistic and more nuanced data science roles.
Data science programs that strive to meet the needs of their students will likely evolve to emphasize certain skills
and capabilities.

Gaps
Preparation of Data for Smart Enterprise AI:
Finding and cleaning up the proper data is a data scientist's first priority. Nearly 80% of a data scientist's day is spent on
cleaning, organizing, mining, and gathering data, according to a CrowdFlower poll. In this stage, the data is double-
checked before undergoing additional analysis and processing. Most data scientists (76%) agree that this is one of
the most tedious elements of their work. As part of the data wrangling process, data scientists must efficiently sort
through terabytes of data stored in a wide variety of formats and codes on a wide variety of platforms, all while
keeping track of changes to that data to avoid data duplication. Adopting AI-based tools that help data scientists
maintain their edge and increase their efficacy is the best way to deal with this issue. Augmented learning is another
flexible workplace AI technology that aids in data preparation and sheds light on the topic at hand.
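
As an illustration of the wrangling step described above, a minimal pandas sketch (the column names and records are hypothetical, not from the paper) deduplicates records, harmonizes formats, and fills gaps before further analysis:

import pandas as pd

# Data-preparation sketch: deduplicate, unify types/formats, and handle missing values.
raw = pd.DataFrame({
    "customer_id": [1, 2, 2, 3, 4],                       # note the duplicated record for id 2
    "signup_date": ["2021-01-05", "2021-02-10", "2021-02-10", "2021-03-15", None],
    "spend": ["100.5", "250", "250", None, "75.25"],      # numbers arriving as strings
})

clean = (
    raw.drop_duplicates()                                  # avoid data duplication
       .assign(
           signup_date=lambda d: pd.to_datetime(d["signup_date"]),   # unify the date format
           spend=lambda d: pd.to_numeric(d["spend"]).fillna(0.0),    # coerce to numeric, fill gaps
       )
)
print(clean.dtypes)
print(clean)
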
Data Security:
Due to the need to scale quickly, businesses have turned to cloud management for the safekeeping of their sensitive
information. Cyberattacks and online spoofing have left sensitive data stored in the cloud exposed to the outside
world. Strict measures have been enacted to protect data in the central repository against hackers, and data scientists
now face additional challenges as they attempt to work within the new restrictions brought about by these rules.
Organizations must use cutting-edge encryption methods and machine learning security solutions to counteract the
security threat. To maximize productivity, it is essential that systems comply with all applicable safety regulations
and are designed to avoid lengthy audits.
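
As one concrete, deliberately minimal example of the encryption methods mentioned above, the sketch below uses the cryptography package's Fernet symmetric encryption to protect a record before it is pushed to cloud storage. This is an illustrative assumption, not a prescription from the paper; key management in practice belongs in a secrets manager.

from cryptography.fernet import Fernet

# Symmetric-encryption sketch (pip install cryptography).
key = Fernet.generate_key()          # in practice, keep this in a secrets manager, never in code
cipher = Fernet(key)

record = b'{"customer_id": 42, "balance": 1053.20}'   # hypothetical sensitive payload
token = cipher.encrypt(record)                         # ciphertext safe to store in the cloud
print(token[:32], b"...")

restored = cipher.decrypt(token)                       # only holders of the key can read it back
assert restored == record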

Problem Statement

Business processes represent analytical objects with continuously growing complexity. The information to analyse
comes from different sources and in different formats, and it requires analysis as quickly as possible. In this
project phase, a key factor is the collaboration between data scientists and business users who, at the end of the day,
have the widest business knowledge and are therefore the ones who will set the path to success. In our experience,
this collaboration is greatly facilitated by data visualization tools.

Data visualization tools like Qlik or Tableau typically have capabilities to directly access several kinds of structured
and unstructured data sources, so they can be applied on top of raw data and are extremely effective in identifying
trends, anomalies, and outliers in the analysed data, with a productivity level not comparable to a classical tabular approach.
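
Although the paper names Qlik and Tableau, the same trend-and-outlier inspection can be sketched with open-source tools. The sketch below (hypothetical data and thresholds) plots a daily series, overlays a rolling mean, and highlights points that deviate strongly from it:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Quick trend/outlier inspection with pandas and matplotlib (illustrative data).
rng = np.random.default_rng(7)
dates = pd.date_range("2022-01-01", periods=120, freq="D")
values = np.linspace(100, 160, 120) + rng.normal(scale=4, size=120)
values[[30, 75]] += 40                                    # inject two anomalies for illustration

series = pd.Series(values, index=dates)
rolling = series.rolling(window=14, center=True).mean()
outliers = series[(series - rolling).abs() > 20]          # crude deviation threshold

ax = series.plot(label="daily value")
rolling.plot(ax=ax, label="14-day rolling mean")
ax.scatter(outliers.index, outliers.values, color="red", label="possible outliers")
ax.legend()
plt.show()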

As we said before, we must keep in mind that a Data Science project is definitely a business project, so it must always
be oriented towards achieving results focused on the business and must have a global vision aligned with the business strategy.
Traditional BI projects were typically set on long-term objectives, so the client often did not see results until
total completion; in many cases this produced deviations, both in cost and in scope. Machine Learning
projects must set short-term objectives and must be managed with an agile approach: the loop between business questions,
hypotheses, and data evidence must be continuous; new findings must be used to drive and improve subsequent
project waves; and results, even partial ones, need to be shared with business people to keep their commitment
high. In Techedge's experience, we have found Notebooks (Jupyter is the most commonly known, but
many others are available) to be an effective tool for explaining to business users what technical people are doing, what the data
are telling us, and which results were obtained by applying models and algorithms - essentially creating a sort of
"common ground" where we can mix technical information and business concepts in order to maintain vital project alignment.

Objective
The objective of the data scientist is to explore, sort, and analyze megadata (large volumes of data) from various sources in order to take
advantage of them and reach conclusions that optimize business processes or support decision making.

The goal of data science is to construct the means for extracting business-focused insights from data. This requires an
understanding of how value and information flow in a business, and the ability to use that understanding to identify
business opportunities.

Justification
 Modern technology
 Safe and secure
 Mathematical and statistical knowledge
 Well-versed in data visualization, data analytics, data cleaning, and big data
 Good communication skills
 Excellent organizational skills

Work Flow

Work Plan
Activities scheduled across November to April:
1. Data collection / review of literature / prior-art search
2. Algorithm selection, data collection, and resource availability for the proposal
3. Analysis and design of independent resources for the proposal
4. Design of improved algorithms and procedures
5. Preparation of hardware prototype / development of product / solutions
6. Comparison, estimation, and feasibility of hardware/software testing against standards; finalizing appropriate designs of single-resource and hybrid systems for adoption, proposing a structured approach for designing such systems, and setting milestones
7. Report submission and final presentation

Cost Benefit Analysis


Total cost of ownership is a purchasing tool and philosophy aimed at understanding the true cost of buying a
particular good or service from a particular supplier. Data science helps organizations know how and when their
products sell best, so that products are always delivered to the right place at the right time. Organizations make faster
and better decisions, improving efficiency and earning higher profits.

Milestone
The main duration of this project will be approximately 5-7 months. After testing, we can launch in real-time
operations and gather user experience; the installation process would take an additional 1-2 months, so the overall time
for implementation in a real-time environment is nearly one year.

Simulation / Implementation / Targeted Learning


 We use technologies such as Anaconda, Google Colab, Heroku, and TensorFlow (a brief TensorFlow sketch follows this list).
 Provides essential data analysis tools for answering complex big-data questions based on real-world data.
 Contains machine learning estimators that provide inference within data science.
 Offers applications that demonstrate:
   1. the translation of a real-world application into a statistical estimation problem, and
   2. the targeted statistical learning methodology to answer scientific questions of interest based on real data.
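
As a brief, hedged sketch of the tooling mentioned above, the snippet below trains a minimal TensorFlow/Keras estimator on synthetic data; it runs in an Anaconda environment or Google Colab. The architecture, data, and target are illustrative assumptions, not the project's actual model.

import numpy as np
import tensorflow as tf

# Minimal TensorFlow/Keras sketch: a small binary classifier on synthetic data.
rng = np.random.default_rng(3)
X = rng.normal(size=(500, 4)).astype("float32")
y = (X[:, 0] + X[:, 1] > 0).astype("int32")               # synthetic binary target

model = tf.keras.Sequential([
    tf.keras.layers.Dense(16, activation="relu"),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.fit(X, y, epochs=5, batch_size=32, verbose=0)        # quick illustrative training run

loss, acc = model.evaluate(X, y, verbose=0)
print(f"training accuracy: {acc:.2f}")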

Conclusion
This paper focuses mainly on the fundamental theory of data science in a detailed manner, helping researchers
gain clarity on data collection, data storage, data processing, data description, data modelling, and machine learning
and deep learning algorithms.

Team Name
Team Name – Akatsuki

 Vaibhav Pal (Team Leader)


References
[1] "Data Science and Analytics", Department of Industrial and Systems Engineering, Lehigh University,
Technical Report 15T-009.

[2] Pasquier, N., Bastide, Y., Taouil, R., Lakhal, L.: Discovering frequent closed itemsets for association rules.
In: Proceedings of ICDT, pp. 398–416. Springer (1999).

[3] Aden, C., Kleppin, L., Schmidt, G. and Schröder, W. 2009. Consolidation, visualisation and analysis of forest
condition relevant data in the WebGIS WaldIS. In: Strobl, J., Blaschke, Th. and Griesebner, G. (eds.),
Angewandte Geoinformatik 2009, 506–515.

[4] Chavan, V. and Penev, L. 2011. The data paper: a mechanism to incentivize data publishing in biodiversity
science. BMC Bioinformatics, 12(Suppl. 15): S2. DOI: https://doi.org/10.1186/1471-2105-12-S15-S2

[5] Tseng, V.S., Wu, C.W., Fournier-Viger, P., Philip, S.Y.: Efficient algorithms for mining the concise and
lossless representation of high utility itemsets. IEEE Trans. Knowl. Data Eng. 27(3), 726–739 (2015).

[6] Dong, X., Gabrilovich, E., Heitz, G., Horn, W., Lao, N., Murphy, K., Strohmann, T., Sun, S., Zhang, W.:
Knowledge vault: a web-scale approach to probabilistic knowledge fusion. In: Proceedings of the 20th ACM
SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 601–610. ACM (2014).

[7] Gramener: Gramener—a data science company. https://gramener.com/ (2018). Accessed Dec 2018.

[8] Polyzou, A., Karypis, G.: Grade prediction with models specific to students and courses. Int. J. Data Sci.
Anal. 2(3–4), 159–171 (2016).

[9] Lam Ky Nhan, Phuong Hoang Yen: "The Effects of Using Infographics-based Learning on EFL Learners'
Grammar Retention". International Journal of Science and Management Studies (IJSMS) V4.I4 (2021): 255–265.

[10] Nadia Ulfa Agustin Hilman, Maya Ariyanti, Astri Ghina: "Social Media Marketing Effect towards Purchase
Decision at the Embroidery MSMEs in Tasikmalaya". International Journal of Science and Management Studies
(IJSMS) V4.I4 (2021): 202–209.
