
INTERNSHIP REPORT

A report submitted in partial fulfilment of the requirements for the award


of the degree of

BACHELOR OF TECHNOLOGY
IN
COMPUTER SCIENCE AND ENGINEERING
By
S.NARASIMHA

Regd. No.:189Y1A05D4

Under Supervision of

BRAIN O VISION SOLUTIONS PVT. LTD, India.

(Duration: 15th April, 2021 to 15th May, 2021)

DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING


KSRM COLLEGE OF ENGINEERING
(An Autonomous Institution)
Approved by AICTE, permanently affiliated to JNTU, Anantapuram.
KADAPA, ANDHRA PRADESH
2020 – 2021

VISION AND MISSION

Vision

To offer advanced subjects in a flexible curriculum which will help our graduates become
competent, responding successfully to career opportunities and meeting the ongoing needs of
the industry.

To progress as a centre of excellence adapting itself to the rapid developments in the field of
computer science by fostering a high-impact research and teaching environment.

Mission

 To impart high-quality professional training at the postgraduate and undergraduate levels


with strong emphasis on basic principles of Computer Science and Engineering.
 To provide our students a state-of-the-art academic environment and make unceasing
attempts to instill the values that will prepare them for continuous learning.
 To empower the youth in the surrounding rural areas with the basics of computer
education, making them self-sufficient individuals.
 To create a teaching-learning environment that emphasizes depth, originality and
critical thinking, fostering leading-edge research in the ever-changing field of
computer science.

DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
KSRM COLLEGE OF ENGINEERING
(An Autonomous Institution)
KADAPA

CERTIFICATE

This is to certify that the “Internship Report” submitted by S.NARASIMHA


(Regd. No.: 189Y1A05D4) is work done by him and submitted during 2020 – 2021
academic year, in partial fulfilment of the requirements for the award of the degree
of BACHELOR OF TECHNOLOGY in COMPUTER SCIENCE AND
ENGINEERING, at Brain O Vision Solutions Pvt. Ltd, India.

Internship Coordinator Head of the Department

Sri S. Khaja Khizar, M.Tech Dr. M. Sreenivasulu, M.Tech, Ph.D


Assistant Professor, CSE. Professor & HOD, CSE.
CERTIFICATION

ACKNOWLEDGEMENT
First, I would like to thank the Director of Brain O Vision
Solutions Pvt. Ltd, India for giving me the opportunity to do an
internship within the organization.

I would also like to thank all the people who worked along with me
at Brain O Vision Solutions Pvt. Ltd, India; with their patience and
openness they created an enjoyable working environment.

It is indeed with a great sense of pleasure and immense sense of


gratitude that I acknowledge the help of these individuals.

I am highly indebted to Director Prof. A. Mohan and Principal


Dr. V.S.S. Murthy for the facilities provided to accomplish this internship.

I would like to thank my Head of the Department,


Dr. M. Sreenivasulu, for his constructive criticism throughout my
internship.

I would like to thank Sri S. Khaja Khizar, Internship


Coordinator, Department of CSE, for his support and advice in securing and
completing the internship in the above-said organization.

I am extremely grateful to my department staff members and


friends who helped me in the successful completion of this internship.

S.NARASIMHA
(189Y1A05D4)

ABSTRACT

As each and every sector of the market grows, data builds up


day by day, and we need to keep a record of the data that can be helpful for
analytics and evaluation. Today data is measured not in gigabytes or
terabytes but in petabytes and zettabytes, and such data cannot be handled
with day-to-day software such as Excel or MATLAB. Therefore, in this
report we will be dealing with large data sets using the high-level
programming language Python.

The main goal of this project is to aggregate and analyze the data
collected from the different data sources available on the internet. The
project mainly focuses on the usage of the Python programming language
in the field of HR workforce analysis. The language is used not only for
analyzing the data but also for predicting upcoming scenarios.

The purpose of using this specific language is its versatility, its vast
libraries (Pandas, NumPy, Matplotlib, etc.), and its ease of
learning. We will be analyzing large data sets in this project, which
cannot be analyzed as easily in other tools as they can in Python. Python
is not limited to data analytics; it is also used in many other
fields such as artificial intelligence, machine learning, and many more.

Programmes and opportunities:

BRAINOVISION is a leading IT outsourcing organization with happy and satisfied clients across
the globe. They have successfully completed 1000+ projects. Internships are designed around
practical knowledge and are incorporated into live projects. They provide great industrial
exposure to the interns by making them work on projects, assessing them daily through
assignments, and serving them with industrial knowledge. Internships here are designed with
expert training and mentorship from industrial experts. They act as a bridge
between you and the corporate culture.

Organization Information:
BRAIN O VISION SOLUTIONS PVT. LTD is a private company incorporated on 21 July 2014. It is
classified as a non-government company and is registered at the Registrar of Companies, Hyderabad.
Clients ranging from new startups to large enterprises have accomplished their business objectives
with its highly effective software, web, and mobile application development and design
services.
Various B2B platforms have given them top ratings, as they strive for 100% customer
satisfaction. They have four development centres and a large team of certified developers
across different technologies. Their partnerships with technology leaders have enabled them to
fulfil the complete requirements of all kinds of clients.
They adhere to CMMI Level 3 guidelines and have strong experience in managing simple to
complex projects. Their experienced professionals are trained to handle tight deadlines.
Organizations from every corner of the world prefer them because they provide a guaranteed
response within 24 hours. Pricing plans (retainer, fixed, hourly and man-month) of
BRAINOVISION are designed to suit the varied requirements of customers. Customer
support is the essence of their business. Addressing the questions and concerns of customers is a
high priority for them. Customers consistently appreciate their technical expertise, development
process, professionalism, strict adherence to global quality standards, and
proactive conduct; their success stories motivate them to
deliver better.

Benefits to the company/Institution through report:
During the internship, interns are given an opportunity to apply their acquired knowledge to real
work experiences, so that they can experiment with their innovative ideas, which in turn may
help the company to improve its image. In addition, it reduces the drain on company
resources from recruiting and hiring by cultivating a pool of star interns to fill positions
as they open up.

Learning Objectives/Internship Objectives

 Application of concepts and theories from your major and liberal arts education, and/or


development of new knowledge and understanding.
 Skill development related to your major or an occupation; and/or general skills such
as oral and written communication, critical thinking, organization, problem solving,
decision making, leadership, interpersonal relationships, technical, etc.
 An objective for this position should emphasize the skills you already possess in the
area and your interest in learning more.
 Personal development such as self-awareness, self-confidence, sensitivity and
appreciation for diversity, clarification of work and personal values, and career and post-
graduate development.
 An internship is a great way to build your resume and develop skills
that can be emphasized in your resume for future jobs. When you are
applying for a training internship, make sure to highlight any special skills
or talents that can make you stand apart from the rest of the applicants so that
you have an improved chance of landing the position.

INDEX

S.No. Contents

1. Introduction

2. History

3. Basics of python

4. Data Science structure

5. Advantages & Disadvantages

6. Project: HR workforce analysis & prediction of MNC

7. System Requirements

 Software Requirements

 Hardware Requirements

8. Project implementation

9. Source code

10. Output

11. Conclusion

12. Bibliography & References

1. INTRODUCTION TO DATA SCIENCE

DATA SCIENCE

Data science is the field of data analytics and data visualization in which raw or
unstructured data is cleaned and made ready for analysis. Data scientists use this data
to get the information required for future purposes.[1] "Data science uses many processes
and methods on big data; the data may be structured or unstructured." Data frames available
on the internet are the raw data we get. They may be in either an unstructured or a
semi-structured format. This data is further filtered and cleaned, and then the required tasks
are performed for the analysis with the use of a high-level programming language. The data is
then analyzed and presented for our better understanding and evaluation.

One must be clear that data science is not about making complicated models or awesome
visualizations, nor is it about writing code; it is about using the data to create an impact
for your company, and for this impact we need tools like complicated data models and data
visualization.

STAGES OF DATASCIENCE

There are many tools used to handle the big data available to us. "Data scientists use
programming tools such as Python, R, SAS, Java, Perl, and C/C++ to extract knowledge from
prepared data."
Data scientists use many algorithms and mathematical models on the data.
Following are the stages, and their cycle, performed on the unstructured data (a minimal
pandas sketch of these stages follows below):

• Identifying the problem

• Identifying available data sources

• Identifying whether additional data sources are needed

• Statistical analysis

• Implementation and development

• Communicating results

• Maintenance

Fig : 7 steps that together constitute this life-cycle model of Data science
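A minimal sketch of these stages in Python with pandas, assuming a hypothetical CSV file
hr_data.csv (the file name and column handling are illustrative only):

import pandas as pd

# Identify and load the available data source (hypothetical file)
df = pd.read_csv("hr_data.csv")

# Statistical analysis of the prepared data
print(df.describe())        # summary statistics of the numeric columns
print(df.isna().sum())      # how many values are missing per column

# Implementation / communicate results (here, simply saving a cleaned copy)
df.dropna().to_csv("hr_data_clean.csv", index=False)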

Data science finds its application in many fields. With the assistance of data science it is
easy to serve a search query on a search engine in very little time. The role of the data scientist
is to have a deep understanding of the data as well as a good command of a programming
language; he or she should also know how to work with the raw data extracted from the data source.
Many programming languages and tools are used to analyze and evaluate the data, such as Python,
Java, MATLAB, Scala, Julia, R, SQL and TensorFlow. Among these, Python is the most user-
friendly and most widely used programming language in the field of data science.

This life cycle is applied in each and every field; in this project we will be considering all
seven stages of data science to analyze the data. The process starts from data collection,
moves on to data preparation and data modeling, and finishes with data evaluation. For instance,
as we have a huge amount of data, we can create an energy model for a particular country by
collecting its previous energy data, and we can also predict its future requirements with the same data.

Fig : Role of Data Science and Big Data Analytics in the Renewable Energy Sector

2.HISTORY

In 1962, John Tukey described a field he called "data analysis," which resembles modern
data science. In 1985, in a lecture given to the Chinese Academy of Sciences in Beijing, C.F.
Jeff Wu used the term "data science" for the first time as an alternative name for
statistics. Later, attendees at a 1992 statistics symposium at the University of Montpellier II
acknowledged the emergence of a new discipline focused on data of various origins and
forms, combining established concepts and principles of statistics and data analysis with
computing.

The term "data science" has been traced back to 1974, when Peter Naur proposed it as an
alternative name for computer science. In 1996, the International Federation of Classification
Societies became the first conference to specifically feature data science as a topic. However,
the definition was still in flux. After the 1985 lecture at the Chinese Academy of Sciences in
Beijing, in 1997 C.F. Jeff Wu again suggested that statistics should be renamed data science.
He reasoned that a new name would help statistics shed inaccurate stereotypes, such as being
synonymous with accounting, or limited to describing data. In 1998, Hayashi Chikio argued
for data science as a new, interdisciplinary concept, with three aspects: data design,
collection, and analysis.

During the 1990s, popular terms for the process of finding patterns in datasets (which were
increasingly large) included “knowledge discovery” and “data mining”.

3.BASICS OF PYTHON

WHY ONLY PYTHON?

Python is a general-purpose programming language, so it can be used for many things.


Python is used for web development, AI, machine learning, operating systems, mobile
application development, and video games.

A successor to the ABC programming language, Python is a high-level, dynamically typed
language developed by Guido van Rossum in the late 1980s and first released in 1991. In the
intervening years, Python has become a favourite of the tech industry, and it has been used in
a wide variety of domains.

 The language developed into its modern form in the early 2000s with the introduction
of Python 2.0. But its basic operational principles remain the same. Python code uses
the ‘object-oriented’ paradigm, which makes it ideal for writing both large-scale
projects and smaller programs.

 Python is a relatively easy programming language to learn and follows an organized


structure. This, combined with its versatility and simple syntax, makes it a fantastic
programming language for all sorts of projects.

 Python works on different platforms (Windows, Mac, Linux, Raspberry Pi, etc).
 Python has a simple syntax similar to the English language.
 Python has syntax that allows developers to write programs with fewer lines than
some other programming languages.
 Python runs on an interpreter system, meaning that code can be executed as soon as it
is written. This means that prototyping can be very quick.
 Python can be treated in a procedural way, an object-oriented way or a functional
way.
 Python is the most popular language used in the field of data science and is easier to
pick up than most other programming languages; a small illustration of its different
programming styles follows.
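To illustrate the point about procedural, object-oriented and functional styles, here is a small,
purely illustrative example showing the same computation written three ways:

# Procedural style
total = 0
for n in [1, 2, 3, 4]:
    total += n
print(total)  # 10

# Object-oriented style
class Accumulator:
    def __init__(self):
        self.total = 0

    def add(self, n):
        self.total += n

acc = Accumulator()
for n in [1, 2, 3, 4]:
    acc.add(n)
print(acc.total)  # 10

# Functional style
from functools import reduce
print(reduce(lambda a, b: a + b, [1, 2, 3, 4]))  # 10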

FIG : PYTHON Data Structure Tree

MODULES AND PACKAGES

 MODULE

Modules are Python files with the extension .py. The name of
the module is the name of the file. A Python module can have a
set of functions, classes or variables defined and implemented in it.

A module contains Python code; this code can define classes,
functions and variables. The reason for using modules is that they
organize your Python code by grouping related code together so that
it is easier to use.

 PACKAGE

A package consists of a collection of modules together with a file
named __init__.py. Each directory on the Python path that contains
a file named __init__.py is treated as a package by Python.
Packages are used for organizing modules by means of dotted names.

For example:

We have a package named simplepackage which consists of two
modules, a and b. We will import the modules from the package as
follows.
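A minimal sketch of the layout and the import statements, assuming the package directory is
named simplepackage:

# simplepackage/          <- the package directory
#     __init__.py          <- marks the directory as a package
#     a.py                  <- module a
#     b.py                  <- module b

# Importing module a from the package using a dotted name:
from simplepackage import a

# Importing module b with its full dotted path:
import simplepackage.b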

Libraries in Python

The Python library ecosystem is vast. There are built-in functions and modules in the
standard library which are written in the C language. These provide access to system
functionality, such as file input/output, that would otherwise not be accessible to Python
programmers. These modules and libraries provide solutions to many problems in programming.

Following are some Python libraries; a short usage sketch follows the list.


 Matplotlib
 Pandas
 Tensorflow
 Numpy
 Keras
 Pytorch
 Scipy
 LightGBM
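As a brief illustration of how a few of these libraries work together, here is a minimal sketch
with made-up numbers (it assumes NumPy, Pandas and Matplotlib are installed):

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# NumPy: fast numerical arrays
salaries = np.array([30000, 45000, 52000, 61000, 75000])

# Pandas: labelled, tabular data built on top of NumPy
df = pd.DataFrame({"experience_years": [1, 3, 5, 7, 10], "salary": salaries})
print(df.describe())

# Matplotlib: visualisation
plt.plot(df["experience_years"], df["salary"], marker="o")
plt.xlabel("Experience (years)")
plt.ylabel("Salary")
plt.title("Salary vs experience (sample data)")
plt.show()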

List of Machine Learning Algorithms

Here is a list of commonly used machine learning algorithms. These algorithms can be
applied to almost any data problem; a brief scikit-learn sketch follows the list.

 Linear Regression
 Logistic Regression
 Decision Tree
 SVM
 Naive Bayes
 Random Forest
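A minimal sketch of applying two of these algorithms with scikit-learn (the data here is a
small toy dataset bundled with scikit-learn, not the HR data used later in this report):

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for model in (DecisionTreeClassifier(random_state=0), RandomForestClassifier(random_state=0)):
    model.fit(X_train, y_train)                                 # train on the training split
    print(type(model).__name__, model.score(X_test, y_test))    # accuracy on held-out data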

4. DATA SCIENCE STRUCTURE

What is Data Science ?

So what exactly is data science? How does it relate to AI, Machine Learning
and Deep Learning?

 Data Science
Data science is an area where data is examined for patterns and characteristics. This includes a
combination of methods and techniques from mathematics, statistics and computer science.
Visualization techniques are frequently used in order to make the data understandable. The
focus is on the understanding and usage of the data with the aim of obtaining insights which
can further contribute to the organization.

 Deep Learning
Deep Learning is part of Machine Learning that uses artificial neural networks. This type of
model is inspired by the structure and function of the human brain.

 Artificial Intelligence
Artificial Intelligence is a discipline which enables computers to mimic human behaviour and
intelligence.

 Machine Learning
Machine Learning is part of Artificial Intelligence which focuses on ‘learning’. Algorithms
and statistical models are being designed for autonomous learning of tasks from data, without
giving explicit instructions upfront.

Types of Machine Learning and Deep Learning

There are different types of Machine Learning approaches for different types of problems.
The three most known are:

 Supervised Learning
Supervised Learning is the most widely used type of Machine Learning in data science.
For example, when inspecting damage on concrete material, photos of previous examples are used
for which it is known whether they show damage or not. Each of these photos has been given a
label, 'damaged' or 'not damaged', which helps in further classification.
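A minimal sketch of supervised classification in this spirit, using two invented numeric features
in place of real photos (the feature names, values and labels below are purely illustrative and
assume scikit-learn is installed):

from sklearn.linear_model import LogisticRegression

# Each row: [crack_length_mm, surface_roughness]; label 1 = 'damaged', 0 = 'not damaged'
X = [[12.0, 0.8], [0.5, 0.1], [9.5, 0.7], [1.0, 0.2], [15.2, 0.9], [0.3, 0.1]]
y = [1, 0, 1, 0, 1, 0]

clf = LogisticRegression().fit(X, y)   # learn from the labelled examples
print(clf.predict([[11.0, 0.75]]))     # classify a new, unseen example (expected: damaged)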

 Unsupervised Learning

In Unsupervised Learning labels are not used; instead, the model itself tries to discover relations in
the data. This is mainly used for grouping (clustering) the examples in the data. For
example, creating different customer groups where customers with similar characteristics are
placed in the same group. Beforehand it is unknown which customer groups exist and which
characteristics they share, but the algorithm can distinguish the different groups when enough
data is available.
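A minimal clustering sketch for the customer-grouping example (the numbers are made up):

from sklearn.cluster import KMeans

# Each row: [age, yearly_spend]; note that no labels are provided
customers = [[22, 500], [25, 650], [47, 3000], [52, 3200], [35, 1500], [33, 1400]]

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(customers)
print(kmeans.labels_)   # the cluster index assigned to each customer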

 Reinforcement Learning

Finally, a Reinforcement Learning model learns on the basis of trial and error. By
rewarding good choices and punishing bad ones, the model learns to recognize patterns. This
technique is mainly used in understanding (computer) games (such as Go) and in robotics (a
robot which learns to walk through falling and standing up again). This type of Machine
Learning usually falls outside of data science, because the purpose of ‘learning’ a task is the
goal and not understanding and using the underlying data.

Life cycle of a Data Science project

The life cycle of a data science project consists of two phases,

which together contain 7 steps:

FOCUS ON BEST MODEL: FROM


BUSINESS CASE TO PROOF-OF-CONCEPT

In this phase the focus is on the development of the best model for the specific
business case. This is the reason why the definition of a good business case is
essential. The data science team then works towards a working prototype (proof-of-concept).

The First phase consists of 4 steps:

1. A good business case

2. Obtain the correct data

3. Clean and explore the data

4. Development and evaluation of models

FOCUS ON CONTINUITY: FROM PROOF-OF-


CONCEPT TO SCALABLE END PRODUCT

In the second phase the focus is on continuity, and the working prototype will be
developed into an operational end product.

The Second phase consists of 3 steps:


5. From proof-of-concept to implementation

6. Manage model in operation

7. From model management to business case

Together, these 7 steps (in two phases) form the life cycle
of a data science project

Step 1: A good Business case

The ultimate value of a data science project depends on a clear business case. What
do you want to achieve with the project? What is the added value for the organisation, and
how will the information from an algorithm eventually be used? Here are some
guidelines for defining and testing the business case.

Involve end users


A business case is generally strong when it comes from practice and from end-users / domain
experts, as they are usually the people who have to use and rely on the information from the
models. It is therefore important that they especially understand the added value of the data
science solution.

Does AI give the best solution for the problem?


After defining the business case, it is wise to assess whether AI, in the form of self-learning
algorithms, is the best solution for the problem. AI is very suitable for finding patterns in data
which is too large and complex for people to oversee. AI has proven itself valuable in a
number of areas, such as image and speech recognition, where it helps to automate tasks.
However, in some cases AI is less useful:

 AI models learn from (large amounts of) data. If little relevant data is available, or if
necessary contextual information is missing from the data, it is better not to use an AI
solution.

 AI is not good in unpredictable situations which require creativity and intuition to solve.

 This also applies if transparency of the algorithm is of great importance, because for AI
models (in particular Deep Learning) it is often difficult to understand why they show
certain behaviour or give certain results.

 There may simply be more effective and/or cheaper solutions than data science solutions
for a specific business case, such as traditional software.

Fig: The manager, an executive expert, and the data science team

Step 2: Obtain the correct data

Developing an AI algorithm often requires large amounts of high-quality data, since a


flexible model will extract its intelligence from the information contained in the data. How
do you obtain this good data and information?

Make sure that data is available

Before a data science team can start building a model, the correct data must be available.
Data science teams depend on data engineers and database administrators, since these
employees have the appropriate permissions to access the required data. Executive
experts/end-users do not always know how the data is stored in the databases. Finding the
required data is an interaction between the executive experts, the data science team and the
data engineers and/or database administrators. It is also possible that the required data is not
available within the organisation. In that situation one can choose to collect data from
external sources, if available.

Data Dump for the first phase

 In the first phase of the data science lifecycle, a one-off collected


data dump is usually sufficient.

When multiple data science teams compete for the same business case, all teams should have
the same data set. This is to ensure that all teams have an equal chance of finding the best
model. The performance of the various models can be compared.

It is wise to separate an extra ‘test dataset’, which has not been shared with the data science
teams beforehand. Based on the predictions of each model on this test data set, it can then be
compared which model performs best. A good example of this is how the platform Kaggle
operates, which organises public competitions for data science teams on behalf of companies.
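A minimal sketch of such a comparison on a held-out test set (using a toy dataset bundled with
scikit-learn and accuracy as the comparison metric; both are illustrative choices):

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier

X, y = load_breast_cancer(return_X_y=True)

# The extra test set is kept aside and not shared with the competing teams
X_shared, X_test, y_shared, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

team_a = LogisticRegression(max_iter=5000).fit(X_shared, y_shared)
team_b = RandomForestClassifier(random_state=0).fit(X_shared, y_shared)

# Each model is judged only on the unseen test set
print("Team A:", team_a.score(X_test, y_test))
print("Team B:", team_b.score(X_test, y_test))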

Automatically retrieving data in the second phase

 In the second phase of the data science life cycle, once the model becomes
operational, a one-off data dump is no longer sufficient.

The data must then automatically reach the model for computing. In practice, this is often
rather complicated, because there are so-called data silos: closed databases which are difficult to
integrate into an application. This is because many systems are not designed to
communicate easily with each other, and internal IT security measures make communication
more difficult. That is why it is recommended to think about the second phase already in the first
phase.

Start setting up infrastructure in time

Investing in a good infrastructure for the storage and exchange of data is essential for a data
science project to become a success. A data engineer can facilitate this process, in which a robust
infrastructure is set up. Start with this process early, and keep security, access rights and the
protection of personal data in mind.

Step 3: Clean and Explore data

 When the dataset is available, the data science team can start developing the solution.
An important step is to first clean and then explore the obtained data. The data must
meet a number of requirements to be suitable for use in developing AI models: the
data must be representative and of good quality.
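A minimal pandas sketch of cleaning and exploring such a dataset (the file name and column
names are hypothetical):

import pandas as pd

df = pd.read_csv("employees.csv")     # hypothetical raw HR dataset

print(df.info())                      # column types and missing values
print(df.describe(include="all"))     # quick exploration of every column

df = df.drop_duplicates()                                     # remove duplicated rows
df["salary"] = df["salary"].fillna(df["salary"].median())     # fill missing salaries
df = df.dropna(subset=["department"])                         # drop rows missing a required field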

REPRESENTATIVE DATA

It is important that the data which is used for developing models is as good a representation
of reality as possible. A self-learning algorithm gets smarter by learning from
examples in the given data. Requirements for the data are the diversity and completeness of
the data points.
Furthermore, for many business cases the data must be up to date, because old data might not
be relevant anymore to the current situation. Also be aware that no unintended bias is included
in the data provided.

EXAMPLE

When developing a model for classifying images, one must take diversity in the images into
account. An example for this is searching for damage in concrete through images. Imagine all
photos with damage were taken on a cloudy day and all photos without damage on a sunny day.
It might be that the data science model will base its choice on the background colours. A new
image of concrete damage on a sunny day can therefore be classified incorrectly as
undamaged.

Quality of the data


Besides the variety and completeness of the data, the quality is also of great importance. A
good data structure supports this: the data should be as complete, consistent and obvious as
possible. It is also essential to prevent human (input) errors as much as possible. Mandatory fields,
checks and categories instead of open boxes of text can provide a solution, when entering the
data.

Confidence
Representativeness and the quality of the data have a positive impact on the confidence in the
project and the end solution. This confidence also contributes to the acceptance and usage by
end users/executive experts.

Short feedback cycle


When exploring the data, it is crucial that the data is correctly interpreted. This happens by
asking for feedback from the implementing experts. A short feedback cycle between the data
science team and the implementing experts is required for this. This can be done, for
instance, every few weeks by giving a presentation of the findings to the implementing experts
and/or the manager.

Step 4: Development and Evaluation of models

 At this stage, the data scientist has a lot of freedom to explore. The goal is to create a
lot of business value as soon as possible.
 As a result, no data science solution is exactly the same and data science has a strong
experimental character. What do you have to think of here and what does
experimenting involve?
 On one hand this involves the type of algorithm and its parameters, on the other
hand, the variables constructed from the data (the features). Depending on the problem
and the available data, different categories of AI models can be used. During the
development and evaluation of the models, there are a couple of points which you
need to pay attention to.

Labels of the dataset
The most commonly used models are in the category of Supervised Learning, where the model
learns from a set of data with associated known annotations or labels. Consider, for example, the
aforementioned recognition of concrete damage. The model is trained on a set of photos of
concrete structures for which it is known whether they are damaged. Once trained, the model
can be used to classify new photos.

Evaluation of the model


Labels are important for developing the model as well as for assessing it. It is
possible that the actual value will automatically appear in the data once it becomes
known. In that case, a direct comparison between the actual value and the model
prediction can be made. If the actual value does not automatically appear in the data, as in the
example of the concrete damage classification, a feedback loop can be built into the end
solution. The user is then asked to provide feedback about the correctness of the
classification. This information is important for monitoring the quality of the data science model.

Optimization of the algorithm


An algorithm must be optimized and tuned to the problem and the related data. By adjusting hyper-
parameters, "the model's rules", the algorithm is adapted to the application. This optimization step
often takes the form of a grid search, in which a large set of different values is tried and the best
combination is chosen.
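A minimal sketch of such a grid search with scikit-learn (the parameter grid values are
illustrative):

from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)

# Try a small grid of hyperparameter values and keep the best combination
param_grid = {"n_estimators": [50, 100, 200], "max_depth": [3, 5, None]}
search = GridSearchCV(RandomForestClassifier(random_state=0), param_grid, cv=5)
search.fit(X, y)

print(search.best_params_)   # the hyperparameter values that scored best
print(search.best_score_)    # the corresponding cross-validated score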

Choosing evaluation method


In order to determine the best model, an evaluation method must be defined which reflects the
purpose of the data science model. For example, in the medical field it is important that extreme
errors of the data science model do not occur, while in other fields extreme errors may have
been caused by extreme measuring points in the data, which people may choose to give less
weight to. For classification models this includes the balance between inclusivity, or recall
(finding all concrete damage), and precision (finding only concrete damage). The assessment
method must correspond to the business case and to how the algorithm will be used in practice.
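A small sketch of computing precision and recall for the concrete-damage balance described
above (the labels are invented for illustration):

from sklearn.metrics import precision_score, recall_score

y_true = [1, 0, 1, 1, 0, 1, 0, 0]   # 1 = damage actually present
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]   # the model's predictions

print("precision:", precision_score(y_true, y_pred))   # of predicted damage, how much is real
print("recall:", recall_score(y_true, y_pred))          # of real damage, how much was found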

Monitoring the Models

Monitoring the quality of data science models is very important. This is due to the model's
dependency on the data.
Data can change over time, as a result of which a data science model may also show lower
performance over time. As soon as the outside world changes, the data changes with it. This
is how the seasons can have influence on photos and consequently on the results of the
model. It can be useful to add an explanation to the prediction of a data science model, for the
end user/ executive expert to understand what the prediction is based on. This provides more
insight and transparency into how the model reasons.

Explainability of the predictions


Making a data science model mainly consists of iterations: running and evaluating. Also here
the feedback from users is important. Do the predictions of the model make sense? Are there
essential variables missing which could have an impact? Do the identified relationships also
have a causal relationship? The transparency of the algorithm plays an important role in these
relationships. Some types of algorithms are very opaque, so that given results cannot be
traced back to the input data. This is amongst others, the case for neural networks. More
linear methods or traditional statistical approaches are often easier to interpret. The demand
for transparent algorithms can be seen in the recent developments regarding Explainable AI.
When choosing the model, take the practical requirements of transparency into account.

Step 5: From successful proof-of-concept to implementation

 In step 5, the model and the organization will be prepared in order to use the model
in practice. We have put all important elements of this step together for you.

Production-worthy code
Data science and software development are two surprisingly different worlds. Where data
scientists focus on devising, developing and conducting experiments with many iterations,
software developers are focused on building stable, robust and scalable solutions. These are
sometimes difficult to combine.

In the first phase the focus is on quickly realizing the best possible prototype. It is about the
performance of the model and the creation of great business value as quickly as possible.
That is why there is often little time and attention for the quality of the programming code. In
the second phase the quality of the code is also important. This concerns the handling of possible
future errors, clear documentation and efficiency of the implementation (speed of code).

That is why in practice the second phase usually starts with restructuring the code. The final
model is structured and detached from all experiments that have taken place during the
previous phase.

Integration into Existing processes


 The restructured prototype must be integrated into the existing processes in the
organisation. This mainly consists of two steps.

Automatization of data flows

 The retrieval and writing data back to a database must be done automatically by
means of queries. This can be a complex process which can be facilitated by a data
engineer.

Setting up an infrastructure hosting and managing the model

 The model must be available to all end users/ executive experts and it should scale
with its use.

“When a model is to be used intensively,scalability also becomes relevant.”

Management of the production environment


When the model produces error messages, they must be corrected. It is important to clearly
define who is responsible for the models which run in the production environment. This can
be the IT department or the data science team itself. In any case, it will benefit if the code is
structured and error messages are described in a clear log.

Dealing with multiple stakeholders: ownership of data and code


It often happens that an organisation cooperates with an external party for the development of
data science models, or for the delivery of data. Data handling and data ownership are serious
topics and many organisations attach great importance to protecting their data and controlling
who has access to it. In addition, this may also be required by legislation, such as the GDPR.
The same applies to the model code of a data scientist.

It is possible that discussions arise within collaboration between the parties: who owns the
data and/or who owns the algorithm. A neutral model-hosting platform can offer a solution in
order to be able to work with multiple parties. This platform can be reached by the data
scientist with his model code and by the data supplier with his data. Proper rights can be
derived from this neutral platform. When working with a neutral platform, one can test the
data scientist’s model without having access to its source code. For example, one could make
a data test set and compare several model variants.

Acceptance
Enthusiasm from the users for the new solution does not always come naturally. This is often
related to the idea that an algorithm poses a future threat to the employee’s role in an
organisation. Yet there is still a sharp contrast between where the power of people lies and
that of an algorithm. Where computers and algorithms are good at monitoring the continuous
incoming flow of data, people are much better at understanding the context in which the
outcome of an algorithm must be placed. This encourages, in the first instance, the pursuit of
solutions that focus as much as possible on human-algorithm collaboration. The algorithm
can filter and enrich the total flow of data with discovered associations or properties, and the
expert/technician can spend time interpreting the results and making the ultimate decisions.

Liability
How the model is used in practice partly determines who is responsible for the decisions
which are made. When the data science model is used as a filter, one will have to think about
what happens if the model makes a mistake in filtering. What are the effects of this, and who
is held responsible? And what about an autonomous data science solution, such as a self-
driving car: who can be held accountable if the car causes an accident?
These are often tough ethical questions, but it is important to consider these issues prior to
integrating a model into a business process.

Step 6: Managing models in operation


A model which runs and is used in a production environment must be checked
frequently. It is necessary to agree on who is responsible for this, and who executes these
frequent checks. The data science team, or the end user /executive expert can assess
whether the model continues to work well and remains operational.

It may be necessary to regularly 'retrain' a Machine Learning model with new data. This can be a
manual task, but it can also be built into the end solution. In the latter case, monitoring the
performance of the model is essential.
Which model performance and code belong to which set of training data has to be stored in a
structured way, so that changes in the performance of a model can be traced back to the data.
This is called 'data lineage'. More and more tooling is coming onto the market for this.
An example is Data Version Control (DVC).

Step 7: From model management to business case

By frequently checking the performance of a data science model, any degrading model output can
be discovered in good time.

 This occurs due to the strong dependence between the code of the algorithm, the data
used to train the algorithm and the continuous flow of new data, influenced by various
external factors. It may happen that environmental factors change, and as a result
certain assumptions are no longer correct, or that new variables are measured which
were previously unavailable.

 The development of the algorithm therefore continues in the background. This creates
newer model versions for the same business case with software updates as a result. In
practice several model versions run in parallel for a while, so that the differences
become transparent. Each model has its own dependencies that must be taken into
account.

 When multiple models and projects coexist in the same production environment, it is
of paramount importance that all projects remain transparent and controllable.
Standardization is required to prevent fragmentation. E.g. a fixed location (all in the
same environment) and the same (code and data) structure.

 Ultimately, data science projects form a closed loop and improvements are always
possible. The ever-changing code, variables and data make data science products
technically complex on the one hand, but on the other hand very flexible and
particularly powerful when properly applied.

5. ADVANTAGES & DISADVANTAGES OF
DATA SCIENCE

Data Science is the study of data. It is about extracting, analyzing, visualizing, managing and
storing data to create insights. These insights help the companies to make powerful data-
driven decisions. Data Science requires the usage of both unstructured and structured data.

It is a multidisciplinary field that has its roots in statistics, mathematics and computer science. It is
one of the most highly sought-after jobs due to the abundance of data science positions and a
lucrative pay scale. So, this was a brief introduction to data science; now let's explore the pros
and cons of data science.

6.PROJECT : HR WORKFORCE ANALYSIS &
PREDICTION OF MNC
The project "HR WORKFORCE ANALYSIS AND PREDICTION OF MNC" has been built
in order to provide an in-depth analysis of the employees. Workforce planning is the process an
organization uses to analyse its workforce and determine the steps it must take to prepare for
future staffing needs.

Any business plan deals with resource requirements, and, just as financial requirements need
to be addressed, the business plan needs to ensure that the appropriate workforce mix is
available to accomplish plan goals and objectives. In workforce planning, an organization
conducts a systematic assessment of workforce content and composition issues and
determines what actions must be taken to respond to future needs. The actions to be taken
may depend on external factors (e.g., skill availability) as well as internal factors (e.g., age of
the workforce). These factors may determine whether future skill needs will be met by
recruiting, by training or by outsourcing the work.

Whether handled separately or as part of the business plan, workforce planning involves
working through four issues:

 The composition and content of the workforce that will be required to strategically
position the organization to deal with its possible futures and business objectives.

 The gaps that exist between the future "model" organization(s) and the existing
organization, including any special skills required by possible futures.

 The recruiting and training plans for permanent and contingent staff that must be put
in place to deal with those gaps.

 The determination of the outside sources that will be able to meet the skill needs for
functions or processes that are to be outsourced.

While many see workforce planning as purely a staffing tool for anticipating employment
needs, it can also be a critical tool for staff training and development and succession
planning. To be successful, organizations should conduct a regular and thorough workforce
planning assessment so that staffing needs can be measured, training and development goals
can be established, and contingent workforce options can be used to create an optimally
staffed and trained workforce able to respond to the needs.

7.SYSTEM REQUIREMENTS

Software Requirements:

 Python – 3.6 version

 PyCharm community

 Browser (such as Chrome, Mozilla Firefox, Brave)

Hardware Requirements:

 Processor: Intel(R) Core(TM) i5-9300H CPU @ 2.40GHz

 RAM of 8.00GB

 System type: 64-bit processor

 Operating system: Windows 10

8.PROJECT IMPLEMENTATION

After installing and configuring all the requirements, open PyCharm and import pandas and
Django. Type the code, then turn on the WAMP server, open phpMyAdmin in the browser and
create your own database. Finally, after installing all the dependencies (for example, pip install
Django and pip install mysqlclient), run the project with the python manage.py runserver
command. If everything is fine we will get a local server address; copy that address and open it
in a browser so that the project is displayed.
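As an illustration of the database step, a Django settings.py fragment pointing the project at a
MySQL database created through phpMyAdmin might look like the sketch below (the database
name, user and password are placeholders, not the values actually used in this project):

# settings.py (fragment) -- connect Django to the MySQL database created in phpMyAdmin
DATABASES = {
    "default": {
        "ENGINE": "django.db.backends.mysql",
        "NAME": "hr_workforce_db",   # placeholder database name
        "USER": "root",              # placeholder user
        "PASSWORD": "",              # placeholder password
        "HOST": "127.0.0.1",
        "PORT": "3306",
    }
}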

9.SOURCE CODE

models.py

from django.db import models


# Stores the details entered by a user at registration time
class Userregisters_Model(models.Model):
    userid = models.CharField(max_length=10)
    name = models.CharField(max_length=20)
    password = models.CharField(max_length=10)
    email = models.EmailField(max_length=50)
    phoneno = models.CharField(max_length=10)
    address = models.CharField(max_length=100)
    gender = models.CharField(max_length=100)


# Stores the answers used for the workforce/career prediction
class careerModel(models.Model):
    os = models.IntegerField()
    alg = models.IntegerField()
    pro = models.IntegerField()
    se = models.IntegerField()
    cn = models.IntegerField()
    es = models.IntegerField()
    ca = models.IntegerField()
    math = models.IntegerField()
    coms = models.IntegerField()
    work = models.IntegerField()
    lq = models.IntegerField()
    hack = models.IntegerField()
    cods = models.IntegerField()
    sp = models.IntegerField()
    longtime = models.CharField(max_length=100)
    self_lear = models.CharField(max_length=100)
    extra = models.CharField(max_length=100)
    cert = models.CharField(max_length=100)
    worhshop = models.CharField(max_length=100)
    tal = models.CharField(max_length=100)
    rwskill = models.CharField(max_length=100)
    memory = models.CharField(max_length=100)
    intsub = models.CharField(max_length=100)
    intcareer = models.CharField(max_length=100)
    joborhigh = models.CharField(max_length=100)
    company = models.CharField(max_length=100)
    seni = models.CharField(max_length=100)
    games = models.CharField(max_length=100)
    book = models.CharField(max_length=100)
    salary = models.CharField(max_length=100)
    real = models.CharField(max_length=100)
    gentle = models.CharField(max_length=100)
    manaortech = models.CharField(max_length=100)
    salaorwork = models.CharField(max_length=100)
    hardorsmart = models.CharField(max_length=100)
    team = models.CharField(max_length=100)
    Introvert = models.CharField(max_length=100)
    job = models.CharField(max_length=100)


# Stores the accuracy of each trained classifier
class processingModel(models.Model):
    dtree = models.FloatField()
    svm = models.FloatField()
    ann = models.FloatField()

views.py

import pandas as pd
import numpy as np
from django.db.models import Q
from django.shortcuts import render, redirect, get_object_or_404
from sklearn import decomposition

from .models import Userregisters_Model


# Login view: checks the submitted credentials and also handles registration posts
def index(request):
    if request.method == "POST":
        userid = request.POST.get('userid')
        password = request.POST.get('password')
        try:
            # Assumed lookup of the registered user; this line is not shown in the original listing
            enter = Userregisters_Model.objects.get(userid=userid, password=password)
            request.session["name"] = enter.id
            return redirect('myaccounts')
        except Exception:
            pass
    if request.method == "POST":
        userid = request.POST.get('userid')
        name = request.POST.get('name')
        password = request.POST.get('password')
        email = request.POST.get('email')
        phoneno = request.POST.get('phoneno')
        address = request.POST.get('address')
        gender = request.POST.get('gender')
        Userregisters_Model.objects.create(userid=userid, name=name, password=password,
                                           email=email, phoneno=phoneno, address=address,
                                           gender=gender)
    return render(request, 'index.html')


# Registration view: stores a new user record
def register(request):
    if request.method == "POST":
        userid = request.POST.get('userid')
        name = request.POST.get('name')
        password = request.POST.get('password')
        email = request.POST.get('email')
        phoneno = request.POST.get('phoneno')
        address = request.POST.get('address')
        gender = request.POST.get('gender')
        Userregisters_Model.objects.create(userid=userid, name=name, password=password,
                                           email=email, phoneno=phoneno, address=address,
                                           gender=gender)
    return render(request, 'register.html')


# Account view: shows and updates the profile of the logged-in user
def myaccounts(request):
    name = request.session['name']
    obj = Userregisters_Model.objects.get(id=name)
    if request.method == "POST":
        userid = request.POST.get('userid')
        name = request.POST.get('name')
        password = request.POST.get('password')   # not shown in the original listing but used below
        email = request.POST.get('email')
        phoneno = request.POST.get('phoneno')
        address = request.POST.get('address')
        gender = request.POST.get('gender')
        obj = get_object_or_404(Userregisters_Model, id=name)
        obj.userid = userid
        obj.name = name
        obj.email = email
        obj.password = password
        obj.phoneno = phoneno
        obj.gender = gender
        obj.address = address
        obj.save(update_fields=["userid", "name", "email", "password",
                                "phoneno", "address", "gender"])
    return render(request, 'myaccounts.html', {'form': obj})

10.SCREENSHOTS

11.CONCLUSION

Data science education is well into its formative stages of development; it is evolving into a
self-supporting discipline and producing professionals with distinct and complementary skills
relative to professionals in the computer, information, and statistical sciences. However,
regardless of its potential eventual disciplinary status, the evidence points to robust growth of
data science education that will indelibly shape the undergraduate students of the future. In
fact, fueled by growing student interest and industry demand, data science education will
likely become a staple of the undergraduate experience. There will be an increase in the
number of students majoring, minoring, earning certificates, or just taking courses in data
science as the value of data skills becomes even more widely recognized. The adoption of a
general education requirement in data science for all undergraduates will endow future
generations of students with the basic understanding of data science that they need to become
responsible citizens. Continuing education programs such as data science boot camps, career
accelerators, summer schools, and incubators will provide another stream of talent. This
constitutes the emerging watershed of data science education that feeds multiple streams of
generalists and specialists in society; citizens are empowered by their basic skills to examine,
interpret, and draw value from data.

Today, the nation is in the formative phase of data science education, where educational
organizations are pioneering their own programs, each with different approaches to depth,
breadth, and curricular emphasis.

12.BIBLIOGRAPHY AND REFERENCES

BIBLIOGRAPHY

The following books were referred to during the data analysis and execution:
 Statistical analysis of questionnaires: a unified approach based on R and Stata by
Francesco Bartolucci. Boca Raton: CRC Press, 2016.
 Data visualisation: a handbook for data driven design by Andy Kirk. Los Angeles:
Sage, 2016.
 Data design: visualising quantities, locations, connections by Per Mollerup. London:
Bloomsbury, 2015.

REFERENCES

 https://www.jetbrains.com/pycharm/download/#section=windows

 https://www.python.org/downloads/

 https://www.quora.com/

