2 Marks Foundations of Data Science
UNIT I - INTRODUCTION
Data Science: Benefits and uses – facets of data - Data Science Process: Overview – Defining research
goals – Retrieving data – Data preparation - Exploratory Data analysis – build the model– presenting
findings and building applications - Data Mining - Data Warehousing – Basic Statistical descriptions of
Data
PART A
1 What is Data Science?
• Data Science is the area of study which involves extracting insights from vast amounts of data
using various scientific methods, algorithms, and processes.
• It helps you to discover hidden patterns from the raw data.
• Data Science is an interdisciplinary field that allows you to extract knowledge from structured or
unstructured data. Data science enables you to translate a business problem into a research project
and then translate it back into a practical solution.
2 Why is Data Science needed?
• It helps you to recommend the right product to the right customer to enhance your business
• It allows intelligent capabilities to be built into machines
• It enables you to take better and faster decisions
• Data Science can help you to detect fraud using advanced machine learning algorithms
• It helps you to prevent any significant financial losses
3 What are the components of data science?
• Domain expertise
• Data engineering
• Statistics
• Visualization
• Advanced computing
4 List out the data science jobs.
Most prominent Data Scientist job titles are:
• Data Scientist
• Data Engineer
• Data Analyst
• Statistician
• Data Architect
• Data Admin
• Business Analyst
• Data/Analytics Manager
5 List out the tools for Data Science.
Data Analysis – Python, R, Spark and SAS
Data Warehousing – Hadoop, SQL
Data Visualization - R, Tableau
Machine Learning – Spark, Azure ML studio
6 List out some applications of Data Science.
• Internet Search Results (Google)
• Recommendation Engine (Spotify)
• Intelligent Digital Assistants (Google Assistant)
• Autonomous Driving Vehicle (Waymo, Tesla)
• Spam Filter (Gmail)
• Abusive Content and Hate Speech Filter (Facebook)
• Robotics (Boston Dynamics)
• Automatic Piracy Detection (YouTube)
7 What are the skills required to become a data scientist?
Domain expertise, mathematics and statistics, programming, data wrangling, data visualization, and communication skills.
11 What is an outlier?
An outlier is an observation that seems to be distant from other observations or, more specifically, one
observation that follows a different logic or generative process than the other observations. The easiest way
to find outliers is to use a plot or a table with the minimum and maximum values.
12 What are the two operations used to combine information from different datasets?
• The first operation is joining: enriching an observation from one table with information
from another table.
• The second operation is appending or stacking: adding the observations of one table to those of
another table.
13 What do you mean by Exploratory data analysis?
▶ Exploratory Data Analysis (EDA) is an approach to analyse the data using visual techniques.
▶ Information becomes much easier to grasp when shown in a picture, therefore we mainly use
graphical techniques to gain an understanding of data and the interactions between variables.
▶ The visualization techniques used in this phase range from simple line graphs or histograms to
more complex diagrams such as Sankey and network graphs.
16 What is data mining?
• Data mining provides tools to discover knowledge from data; it turns a large collection of data into knowledge.
17 What is a data warehouse?
• A data warehouse is a repository of information collected from multiple sources stored under a
unified schema and usually residing at a single site.
• Data warehouses are constructed via a process of data cleaning, data integration, data
transformation, data loading, and periodic data refreshing.
18 What is a boxplot and what do we use it for?
A boxplot is a standardized way of displaying the distribution of data based on the five-number summary: the minimum, first quartile, median, third quartile, and maximum. It makes it easy to spot the spread of the data and possible outliers.
19 What is open data?
• Although data is considered a valuable asset by certain companies, more and more governments and organizations share their data for free with the world.
• This data can be of excellent quality, depending on the institution that creates and manages it.
• The information they share covers a broad range of topics in a certain region and its demographics.
20 What is the need for basic statistical descriptions of data?
Basic statistical descriptions can be used to identify properties of the data.
It highlights which data values should be treated as noise or outliers.
PART B
1 Describe the benefits and uses of data science.
2 Explain the facets of data.
3 Describe the overview of the data science process
4 Explain the steps involved in the knowledge discovery process
5 Briefly describe the steps involved in Data Preparation.
6 What are the technologies used in data mining?
7 Explain in detail about the data warehouse.
8 Explain the data exploration in detail.
9 What are the different sources of a data warehouse?
10 Explain the Data Mining architecture.
11 Briefly discuss internal and external data.
12 Explain the basic statistical descriptions of data used in measuring central tendency.
13 Explain in detail how to build a model, with an example.
UNIT II - DESCRIBING DATA
Types of Data - Types of Variables - Describing Data with Tables and Graphs - Describing Data with Averages -
Describing Variability - Normal Distributions and Standard (z) Scores
PART A
1 What is qualitative data?
Qualitative data is defined as the data that approximates and characterizes. Qualitative data can be
observed and recorded. This data type is non-numerical in nature. This type of data is collected through
methods of observations, one-to-one interviews, conducting focus groups, and similar methods.
UNIT III - DESCRIBING RELATIONSHIPS
PART A
1 What is Correlation?
Correlation is a statistical measure that describes the extent to which two variables are related, i.e., the degree to which a change in one variable is associated with a change in the other.
5 Define regression.
Regression is a statistical method used in finance, investing, and other disciplines that attempts to determine
the strength and character of the relationship between one dependent variable (usually denoted by Y) and a
series of other variables (known as independent variables).
6 List out some Real-world examples of linear regression models
• Forecasting sales: Organizations often use linear regression models to forecast future sales.
• Cash forecasting: Many businesses use linear regression to forecast how much cash they'll have on
hand in the future.
7 What is the use of regression line?
• A regression line indicates a linear relationship between the dependent variables on the y-axis
and the independent variables on the x-axis
• The regression line is plotted closest to the data points in a regression graph. This statistical tool
helps analyse the behaviour of a dependent variable y when there is a change in the independent
variable x—by substituting different values of x in the regression equation.
8 What is the computational formula for the correlation coefficient?
• There are several correlation coefficient formulas; one of the most commonly used is Pearson’s correlation coefficient.
• The computational formula for Pearson’s r is:
r = (nΣxy − ΣxΣy) / sqrt[(nΣx² − (Σx)²)(nΣy² − (Σy)²)]
where n is the number of paired observations.
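A minimal sketch computing Pearson’s r from the computational formula and cross-checking it against NumPy (the data values are illustrative):

```python
import numpy as np

# Hypothetical paired observations
x = np.array([1, 2, 3, 4, 5])
y = np.array([2, 4, 5, 4, 5])

# Computational formula:
# r = (n*Σxy − Σx*Σy) / sqrt((n*Σx² − (Σx)²) * (n*Σy² − (Σy)²))
n = len(x)
num = n * np.sum(x * y) - np.sum(x) * np.sum(y)
den = np.sqrt((n * np.sum(x**2) - np.sum(x)**2) *
              (n * np.sum(y**2) - np.sum(y)**2))
r = num / den

# Cross-check against NumPy's built-in correlation matrix
assert np.isclose(r, np.corrcoef(x, y)[0, 1])
```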
2 What are the array manipulation operations in NumPy?
• Changing the shape of a given array
• Joining and splitting of arrays
• Combining multiple arrays into one, and splitting one array into many
3 What is the syntax for Numpy slicing?
The Numpy slicing syntax follows that of the standard Python list. To access a slice of an array x:
x[start:stop:step]
If any of these are unspecified, they default to the values start=0, stop=size of dimension, step=1. We can
access sub-arrays in one dimension and in multiple dimensions.
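A minimal sketch of this syntax in one and multiple dimensions (values are illustrative):

```python
import numpy as np

x = np.arange(10)          # [0 1 2 3 4 5 6 7 8 9]
first_five = x[:5]         # start defaults to 0 -> [0 1 2 3 4]
every_other = x[::2]       # step of 2 -> [0 2 4 6 8]
reversed_x = x[::-1]       # negative step reverses the array

# Multi-dimensional slicing: rows and columns sliced independently
m = np.arange(12).reshape(3, 4)
top_left = m[:2, :3]       # first two rows, first three columns
```

Note that NumPy slices are views into the original array, not copies.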
4 What will be the output for the below code:
import numpy as np
x2 = np.array([[12, 5, 2, 4], [ 7, 6, 8, 8], [ 1, 6, 7, 7]])
print(x2[0, :])
Output:
[12 5 2 4]
5 What do you mean by ufuncs?
Ufuncs are the universal functions. The Vectorized operations in Numpy are implemented via ufuncs whose
main purpose is to quickly execute repeated operations on values in Numpy arrays. NumPy's universal
functions can be used to vectorize operations and thereby remove slow Python loops.
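A small sketch contrasting a slow Python loop with the equivalent vectorized ufunc expression (values are illustrative):

```python
import numpy as np

values = np.array([1.0, 2.0, 4.0, 8.0])

# A pure-Python loop computing reciprocals element by element...
recip_loop = np.empty(len(values))
for i in range(len(values)):
    recip_loop[i] = 1.0 / values[i]

# ...replaced by a single ufunc expression (np.divide under the hood)
recip_ufunc = 1.0 / values

assert np.allclose(recip_loop, recip_ufunc)
```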
6 What is the purpose of the axis keyword?
• The axis keyword specifies the dimension of the array that will be collapsed, rather than the
dimension that will be returned.
• So specifying axis=0 means that the first axis will be collapsed. For two-dimensional arrays, this
means that values within each column will be aggregated.
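A minimal sketch of the axis keyword on a two-dimensional array (values are illustrative):

```python
import numpy as np

m = np.array([[1, 2, 3],
              [4, 5, 6]])

col_sums = m.sum(axis=0)   # collapse the rows: aggregate within each column
row_sums = m.sum(axis=1)   # collapse the columns: aggregate within each row
```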
7 What are the rules for broadcasting?
Broadcasting in Numpy follows a strict set of rules to determine the interaction between the two arrays:
● Rule 1: If the two arrays differ in their number of dimensions, the shape of the one with
fewer dimensions is padded with ones on its leading (left) side.
● Rule 2: If the shape of the two arrays does not match in any dimension, the array with
shape equal to 1 in that dimension is stretched to match the other shape.
● Rule 3: If in any dimension the sizes disagree and neither is equal to 1, an error is raised.
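The three rules can be traced on a small sketch (the shapes are illustrative):

```python
import numpy as np

a = np.ones((3, 1))        # shape (3, 1)
b = np.arange(3)           # shape (3,)

# Rule 1: b is padded on the left to shape (1, 3)
# Rule 2: a stretches to (3, 3) along axis 1; b stretches along axis 0
result = a + b             # final shape (3, 3)

# Rule 3: incompatible shapes raise an error
ok = False
try:
    np.ones((3, 2)) + np.arange(3)
except ValueError:
    ok = True              # shapes (3, 2) and (3,) cannot be broadcast
```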
8 What is fancy indexing?
• Fancy indexing is a style of array indexing that is like simple indexing, but we pass arrays of indices in place of single scalars.
• This allows us to very quickly access and modify complicated subsets of an array's values.
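A minimal sketch of fancy indexing (values are illustrative):

```python
import numpy as np

x = np.array([51, 92, 14, 71, 60])

# Pass an array of indices instead of single scalars
ind = [0, 2, 4]
subset = x[ind]            # -> array([51, 14, 60])

# Fancy indexing also works for assignment
x[ind] = 0                 # zero out those positions in one step
```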
9 What is the difference between np.sort and np.argsort?
• np.sort is used to return a sorted version of the array without modifying the input.
• np.argsort is used to return the indices of the sorted elements.
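A small sketch contrasting the two functions (values are illustrative):

```python
import numpy as np

x = np.array([2, 1, 4, 3, 5])

sorted_x = np.sort(x)      # sorted copy; x itself is unchanged
order = np.argsort(x)      # indices that would sort x

# Applying the index array via fancy indexing reproduces the sorted array
assert np.array_equal(x[order], sorted_x)
```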
10 What is the output of the given code?
data = np.zeros(4, dtype={'names':('name', 'age', 'weight'),'formats':('U10', 'i4', 'f8')})
print(data.dtype)
Output:
[('name', '<U10'), ('age', '<i4'), ('weight', '<f8')]
11 What is the difference between Numpy array and pandas series?
• While the Numpy Array has an implicitly defined integer index used to access the values, the Pandas
Series has an explicitly defined index associated with the values.
• This explicit index definition gives the Series object additional capabilities. For example, the index
need not be an integer but can consist of values of any desired type. For example we can use strings as
an index.
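A minimal sketch of the two index styles (values are illustrative):

```python
import numpy as np
import pandas as pd

arr = np.array([0.25, 0.5, 0.75, 1.0])       # implicit integer index: 0..3
ser = pd.Series([0.25, 0.5, 0.75, 1.0],
                index=['a', 'b', 'c', 'd'])  # explicit string index

print(arr[1])       # accessed by position -> 0.5
print(ser['b'])     # accessed by label    -> 0.5
```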
12 How the series object can be modified?
Series objects can be modified with a dictionary-like syntax. Just as we can extend a dictionary by
assigning to a new key, we can extend a Series by assigning to a new index value.
13 What is python none object?
The first sentinel value used by Pandas is None, a Python singleton object that is often used for missing
data in Python code. Because it is a Python object, None cannot be used in any arbitrary Numpy/Pandas
array, but only in arrays with data type 'object' i.e. arrays of Python objects.
14 What is the use of multi-indexing?
• Multi-indexing is used to represent two-dimensional data within a one-dimensional Series.
• We can also use it to represent data of three or more dimensions in a Series or Data Frame. Each
extra level in a multi-index represents an extra dimension of data.
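A minimal sketch using hypothetical (state, year) population data:

```python
import pandas as pd

# Two-dimensional data stored in a one-dimensional multi-indexed Series
index = pd.MultiIndex.from_tuples(
    [('California', 2000), ('California', 2010),
     ('New York', 2000), ('New York', 2010)])
pop = pd.Series([33871648, 37253956, 18976457, 19378102], index=index)

# Partial indexing on the outer level returns a sub-Series indexed by year
california = pop['California']

# unstack() moves the inner index level into DataFrame columns,
# turning the multi-indexed Series into a two-dimensional table
df = pop.unstack()
```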
15 What is the pd.merge() function?
The pd.merge() function implements a number of types of joins: one-to-one, many-to-one and many-to-many joins. All three types of joins are accessed via an identical call to the pd.merge() interface; the type of join performed depends on the form of the input data.
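A minimal one-to-one join sketch with hypothetical employee tables:

```python
import pandas as pd

# Two tables sharing the 'employee' key column
df1 = pd.DataFrame({'employee': ['Bob', 'Jake', 'Lisa'],
                    'group': ['Accounting', 'Engineering', 'HR']})
df2 = pd.DataFrame({'employee': ['Lisa', 'Bob', 'Jake'],
                    'hire_date': [2004, 2008, 2012]})

# merge discovers the common 'employee' column and joins on it
df3 = pd.merge(df1, df2)
```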
16 What is the describe() method?
The describe() method computes several common aggregates for each column and returns the result. Rows with missing values are typically dropped first (e.g. with dropna()) before calling it.
17 What is split, apply and combine?
• The split step involves breaking up and grouping a data frame depending on the value of the specified
key.
• The apply step involves computing some function, usually an aggregate, transformation, or filtering,
within the individual groups.
• The combine step merges the results of these operations into an output array.
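The three steps can be sketched with a groupby sum (the data is illustrative):

```python
import pandas as pd

df = pd.DataFrame({'key':  ['A', 'B', 'C', 'A', 'B', 'C'],
                   'data': [0, 5, 10, 5, 10, 15]})

# Split on 'key', apply a sum within each group, combine into one Series
result = df.groupby('key')['data'].sum()
```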
18 What is the use of the get() and slice() operations?
• The get() and slice() operations enable vectorized element access from each array.
• For example, we can get a slice of the first three characters of each array using str.slice(0, 3).
• The get() and slice() methods also let us access elements of arrays returned by split().
• For example, to extract the last name of each entry, we can combine split() and get().
19 What do you mean by datetime and dateutil?
The datetime type is used to manually build a date. Using the dateutil module, we can parse dates from a
variety of string formats. With datetime object, we can print the day of the week.
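A minimal sketch (the date used is illustrative):

```python
from datetime import datetime
from dateutil import parser

# Manually build a date with the datetime type
date = datetime(2021, 7, 4)

# Parse the same date from a string with dateutil
parsed = parser.parse("4th of July, 2021")
assert parsed == date

# Print the day of the week
print(date.strftime('%A'))   # -> 'Sunday'
```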
20 What is the advantage of using numexpr library?
● The Numexpr library gives the ability to compute compound expressions element by element
without the need to allocate full intermediate arrays.
● Numexpr evaluates the expression in a way that does not use full-sized temporary arrays and can
be much more efficient than Numpy, especially for large arrays.
● The Pandas eval() and query() tools are conceptually similar and depend on the Numexpr package.
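A minimal sketch of pd.eval() on random data (the shapes are illustrative):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
df1, df2, df3 = (pd.DataFrame(rng.random((100, 3))) for _ in range(3))

# Standard NumPy-style evaluation allocates full temporary arrays
direct = df1 + df2 + df3

# pd.eval computes the same compound expression via Numexpr,
# element by element, without full-sized intermediates
lazy = pd.eval('df1 + df2 + df3')

assert np.allclose(direct, lazy)
```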
PART B
1 Explain all the array manipulation functions with examples in Numpy.
2 Write short notes on Computation on Arrays.
3 Explain Aggregation Functions and Fancy Indexing with examples in Numpy.
4 Explain selection sort and other sorting methods used in Numpy with examples.
5 What are the data manipulation techniques in Pandas?
6 Explain in detail the steps involved in constructing a Pandas data frame.
7 What are the steps involved in handling missing data in Pandas?
8 Explain in detail about the aggregate, filter, transform and apply operations of the GroupBy object
9 Write short notes on dates and times in pandas with examples.
10 Explain in detail about the pivot table.
UNIT V - DATA VISUALIZATION
Importing Matplotlib – Line plots – Scatter plots – visualizing errors – density and contour plots – Histograms –
legends – colors – subplots – text and annotation – customization – three dimensional plotting - Geographic Data
with Basemap - Visualization with Seaborn
PART A
1 What is Matplotlib?
• Matplotlib is a python library used to create 2D graphs and plots by using python scripts.
• It has a module named pyplot which makes things easy for plotting by providing features to control line
styles, font properties, formatting axes, etc.
• It supports a very wide variety of graphs and plots namely - histogram, bar charts, power spectra, error
charts etc.
2 What is the line plot?
• A Line plot can be defined as a graph that displays data as points or check marks above a number line,
showing the frequency of each value.
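In Matplotlib, a basic line plot can be sketched as follows (assuming the non-interactive Agg backend and a hypothetical output filename):

```python
import matplotlib
matplotlib.use('Agg')        # render without a display
import matplotlib.pyplot as plt
import numpy as np

x = np.linspace(0, 10, 100)
fig, ax = plt.subplots()
ax.plot(x, np.sin(x), label='sin(x)')   # a simple line plot
ax.set_xlabel('x')
ax.set_ylabel('sin(x)')
ax.legend()
fig.savefig('line_plot.png')            # hypothetical output filename
```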
3 Define Scatter plots.
• Scatter plots are the graphs that present the relationship between two variables in a data-set.
• It represents data points on a two-dimensional plane or on a Cartesian system.
• The independent variable or attribute is plotted on the X-axis, while the dependent variable is plotted
on the Y-axis.
• These plots are often called scatter graphs or scatter diagrams.
4 Define Error bars.
• Error bars are a graphical enhancement that visualizes the variability of the plotted data on a Cartesian graph.
• Error bars can be applied to graphs to provide an additional layer of detail on the presented data.
5 How do you visualize error bars?
• Error bars are used to display either the standard deviation, standard error, confidence intervals or the
minimum and maximum values in a ranged dataset.
• To visualise this information, error bars work by drawing cap-tipped lines that extend from the centre
of the plotted data point.
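A minimal sketch with Matplotlib's errorbar function (the data and the constant error dy are illustrative):

```python
import matplotlib
matplotlib.use('Agg')        # render without a display
import matplotlib.pyplot as plt
import numpy as np

x = np.linspace(0, 10, 20)
dy = 0.5                     # assumed constant standard error
rng = np.random.default_rng(0)
y = np.sin(x) + dy * rng.standard_normal(20)

fig, ax = plt.subplots()
# errorbar draws cap-tipped lines of length 2*dy centred on each point
ax.errorbar(x, y, yerr=dy, fmt='.k', capsize=3)
fig.savefig('errorbar.png')  # hypothetical output filename
```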
6 What is density plot?
• Density Plot is a type of data visualization tool.
• It is a variation of the histogram that uses ‘kernel smoothing’ while plotting the values. It is a
continuous and smooth version of a histogram inferred from the data.
7 What are Contour plots?
• Contour plots (sometimes called Level Plots) are a way to show a three-dimensional surface on a two-
dimensional plane.
• It graphs two predictor variables, X and Y, on the axes and a response variable Z as contours. These
contours are sometimes called z-slices or iso-response values.
8 Define histogram
• A histogram is the graphical representation of data where data is grouped into continuous number
ranges and each range corresponds to a vertical bar.
• The horizontal axis displays the number range.
• The vertical axis (frequency) represents the amount of data that is present in each range.
9 What are legends in data visualization?
• A legend is used to identify data in visualizations by its color, size, or other distinguishing features.
• Legends identify the meaning of various elements in a data visualization and can be used as an
alternative to labeling data directly.
10 Why is color important in data visualization?
• Color is important in data visualization because it allows you to highlight certain pieces of information
and promote information recall.
• Using different colors can separate and define different data points within visualization so that viewers
can easily distinguish significant differences or similarities in values.
PART B