Technical Question Bank
Filters are used to segregate data based on dimensions and to reduce the number
of records present in a dataset for faster processing.
There are six types of filters in Tableau:
● Extract Filters- They filter the data that is extracted from the data source. These
filters are used only if the user extracts the data from the data source. They also
help reduce the number of queries sent to the data source.
● Data Source Filters- They are applied directly on the data source and restrict the
data before it reaches the workbook, for both live and extract connections.
● Context Filters- A context filter creates a temporary dataset based on the original
data and the presets chosen, and the other filters are then applied relative to that
context. It helps in applying a relevant, actionable context to the entire data
analysis in Tableau.
● Dimension Filters- Filters that are applied on dimensions are called dimension
filters. With the help of these filters we can select or deselect values, or we can
perform wildcard or condition-based selection, where simple conditions or
complex formulas can be used to filter out data.
● Measure Filters- Filters that are applied on measures (quantitative data) are
called measure filters. A measure filter offers the Range of Values, At Least, At
Most and Special options.
● Table Calculation Filters- These filters are applied after the view has been
computed, so they hide rows from the displayed results without filtering the
underlying data.
Tree Map
● It is used to show large amounts of hierarchically structured data.
● The levels in the hierarchy of the tree map are visualised as rectangles
containing other rectangles, where each rectangle represents a category in a column.
● A bigger rectangle represents a high frequency category in a column, while a
smaller rectangle represents a low frequency category.
Heat Map
● It is a graphical representation of data where values are depicted by colour.
● Heat maps make it easy to visualise complex data and understand it at a
glance.
● It uses colour to communicate relationships between data values, which is
much harder to understand if presented numerically in a spreadsheet.
Joining Blending
It has LEFT JOIN, RIGHT JOIN, INNER It has only LEFT JOIN.
JOIN and FULL OUTER JOIN.
It is used when the data set is from the It is used when the data set is from different
same source. sources.
The Rank function in Tableau accepts two arguments- aggregated measure and
ranking order. The ranking order can be ascending or descending. The ranking order
is optional and by default assigned as descending. For example- If the values are
3,5,6,7,7,9 then their corresponding ranks would be 1,2,3,4,4,6 in ascending order.
The Dense_rank function works in a similar manner as the Rank function except it
won’t skip the next rank when assigning the same rank to identical values. For
example- If the values are 3,5,6,7,7,9 then their corresponding dense ranks would
be 1,2,3,4,4,5 in ascending order.
The Tableau RANK_UNIQUE function assigns unique ranks to identical values. For
example, if we have 3,5,6,7,7,9 then the function will return the ranks 1,2,3,4,5,6
in ascending order.
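As a rough sketch of the three behaviours (using pandas rather than Tableau itself; the sample values are the ones from the examples above, and the pandas methods 'min', 'dense' and 'first' are simply the closest analogues):
import pandas as pd

values = pd.Series([3, 5, 6, 7, 7, 9])

# 'min' behaves like RANK: ties share a rank and the next rank is skipped
print(values.rank(method="min").tolist())    # [1.0, 2.0, 3.0, 4.0, 4.0, 6.0]

# 'dense' behaves like RANK_DENSE: ties share a rank, no gaps afterwards
print(values.rank(method="dense").tolist())  # [1.0, 2.0, 3.0, 4.0, 4.0, 5.0]

# 'first' behaves like RANK_UNIQUE: ties get distinct ranks by order of appearance
print(values.rank(method="first").tolist())  # [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]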
Level of Detail (LOD) expressions are used to run complex queries involving many
dimensions at the data source level, instead of bringing all the data into the Tableau interface.
In Tableau, measures represent quantitative, numeric data (such as sales or profit) that
can be aggregated and analysed against dimensions, while dimensions represent
qualitative values that define particular categories. Examples of dimensions are
geographical data, product details, countries etc.
14. What are the different types of connections that you can make with your
dataset?
The different connections in Tableau are:
● File Systems such as .csv, Excel, etc.
● Relational Systems such as Oracle, SQL Server, DB2, etc.
● Cloud Systems such as Windows Azure, Google BigQuery, etc.
● Other Sources using ODBC.
Sets vs Groups
Sets:
● They are dynamic, i.e. set membership updates automatically as the underlying data changes.
● You can group data across multiple dimensions.
● They are used to form subsets of data based on the conditions chosen.
● You can choose “IN/OUT” or “Show Members in Set”.
Groups:
● They are static, i.e. group membership does not update as the data changes.
● You can group data only within one dimension.
● They put dimension members together and create a hierarchy of multiple dimension levels.
● There is no such option; the only options available are group/ungroup.
SQL for Data Science
A JOIN clause is used to combine rows from two or more tables, based on a related
column between them.
There are six types of JOIN clauses:
● INNER JOIN- It returns rows that have matching values in both tables.
● LEFT JOIN- It returns all the rows from the left table with corresponding rows
from the right table. If there are no matching rows, NULL is returned as a value
from the second table.
● RIGHT JOIN- It returns all the rows from the right table with corresponding rows
from the left table. If there are no matching rows, NULL is returned as a value
from the first table, which is also called the left table.
● FULL OUTER JOIN- It returns all the rows from both the tables. If there are no
matching rows in the tables, NULL is returned.
● CROSS JOIN- It returns all the possible combinations of rows from both
tables.
● SELF JOIN- It will join the table with itself. Example- Finding employees
who are managers in the employee table.
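A small sketch of some of these joins using Python's built-in sqlite3 module (the employees/departments tables and their values are invented for illustration; RIGHT and FULL OUTER JOIN are omitted because they need a newer SQLite version):
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE employees (id INTEGER, name TEXT, dept_id INTEGER);
    CREATE TABLE departments (id INTEGER, dept_name TEXT);
    INSERT INTO employees VALUES (1, 'Asha', 10), (2, 'Ravi', 20), (3, 'Meera', NULL);
    INSERT INTO departments VALUES (10, 'Sales'), (30, 'HR');
""")

# INNER JOIN: only rows with a matching dept_id in both tables
print(conn.execute("""
    SELECT e.name, d.dept_name
    FROM employees e INNER JOIN departments d ON e.dept_id = d.id
""").fetchall())   # [('Asha', 'Sales')]

# LEFT JOIN: every employee, None (NULL) where no department matches
print(conn.execute("""
    SELECT e.name, d.dept_name
    FROM employees e LEFT JOIN departments d ON e.dept_id = d.id
""").fetchall())   # [('Asha', 'Sales'), ('Ravi', None), ('Meera', None)]

# CROSS JOIN: every combination of employee and department (3 x 2 = 6 rows)
print(len(conn.execute("SELECT * FROM employees CROSS JOIN departments").fetchall()))  # 6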
A table is a set of data that is organised in a model with columns and rows. In a table,
columns are placed vertically while rows are placed horizontally. A table has a
specified number of columns called fields, but it can have any number of rows, which
are called records.
The Rank() function ranks within partitions with gaps and gives the same ranking for
tied values. Example- If the values are 3,5,6,7,7,9 then their corresponding ranks would
be 1,2,3,4,4,6 in ascending order.
The Dense_rank() function works in a similar manner as the rank function, except it
won’t skip the next rank when assigning the same rank to identical values. Example- if
the values are 3,5,6,7,7,9 then their corresponding dense ranks would be 1,2,3,4,4,5 in
ascending order.
The Row_number() function provides a unique number for each row within the partition,
assigning different numbers even to tied values.
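A minimal sketch of the three functions using Python's sqlite3 module (window functions need SQLite 3.25 or newer; the scores table is made up for illustration):
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE scores (val INTEGER);
    INSERT INTO scores VALUES (3), (5), (6), (7), (7), (9);
""")

rows = conn.execute("""
    SELECT val,
           RANK()       OVER (ORDER BY val) AS rnk,       -- gaps after ties
           DENSE_RANK() OVER (ORDER BY val) AS dense_rnk, -- no gaps after ties
           ROW_NUMBER() OVER (ORDER BY val) AS row_num    -- unique number per row
    FROM scores
""").fetchall()
print(rows)
# [(3, 1, 1, 1), (5, 2, 2, 2), (6, 3, 3, 3), (7, 4, 4, 4), (7, 4, 4, 5), (9, 6, 5, 6)]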
Constraints are the rules enforced on the data columns of a table. These are used to
limit the type of data that can go into a table. This ensures the accuracy and reliability
of the data in the database. Constraints could be either on a column level or on a table
level. The column level constraints are applied only to one column, whereas the table
level constraints are applied to the whole table. Commonly used constraints are: NOT
NULL constraint, UNIQUE constraint, DEFAULT constraint, PRIMARY KEY constraint,
FOREIGN KEY constraint.
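A minimal sketch of some of these constraints using sqlite3 (the departments/employees tables are invented for illustration):
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FOREIGN KEY only when enabled

conn.executescript("""
    CREATE TABLE departments (
        id   INTEGER PRIMARY KEY,            -- PRIMARY KEY constraint (column level)
        name TEXT NOT NULL UNIQUE            -- NOT NULL and UNIQUE constraints
    );
    CREATE TABLE employees (
        id      INTEGER PRIMARY KEY,
        name    TEXT NOT NULL,
        salary  REAL DEFAULT 0.0,            -- DEFAULT constraint
        dept_id INTEGER,
        FOREIGN KEY (dept_id) REFERENCES departments(id)  -- table-level FOREIGN KEY
    );
""")

conn.execute("INSERT INTO departments (id, name) VALUES (1, 'Sales')")
try:
    conn.execute("INSERT INTO departments (id, name) VALUES (2, NULL)")  # violates NOT NULL
except sqlite3.IntegrityError as err:
    print("Rejected:", err)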
A view is a virtual table which consists of a subset of the data contained in one or more
tables. Since a view is virtual, it takes little or no space to store; only its defining query
is stored. A view can combine data from one or more tables, depending on the
relationship between the view and the tables.
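A short sketch of a view in sqlite3 (the orders table and east_orders view are invented for illustration):
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (id INTEGER, region TEXT, amount REAL);
    INSERT INTO orders VALUES (1, 'East', 120.0), (2, 'West', 80.0), (3, 'East', 45.5);

    -- the view stores only its defining query, not a copy of the data
    CREATE VIEW east_orders AS
        SELECT id, amount FROM orders WHERE region = 'East';
""")
print(conn.execute("SELECT * FROM east_orders").fetchall())  # [(1, 120.0), (3, 45.5)]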
● Super Key- A super key is a set of one or more attributes that, taken together,
uniquely identify tuples in a table. It may also contain extra attributes that are
not needed for unique identification.
● Candidate Key- A Candidate key is a subset of Super key and is devoid of any
unnecessary attributes that are not important for uniquely identifying tuples. The
value for the Candidate key is unique and non-NULL for all tuples. And every
table has to have at least one Candidate key. But there can be more than one
Candidate key too.
● Primary Key- Primary key is the Candidate key selected by the database
administrator to uniquely identify tuples in a table. There can be only one
Primary key for a table.
● Alternate Key- There can be only one Primary key for a table. Therefore, all
the remaining Candidate keys are known as Alternate or Secondary keys.
● Foreign Key- A foreign key is an attribute that is a Primary key in its parent
table but is included as an attribute in another (child) table to link the two.
● Composite Key- A Composite key is a Candidate key or Primary key that
consists of more than one attribute.
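A short sketch showing a primary key, an alternate (unique) key, a composite key and a foreign key in sqlite3 (the students/enrolments tables are invented for illustration):
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- student_id is the primary key (chosen from the candidate keys);
    -- email could serve as an alternate key, so it is declared UNIQUE
    CREATE TABLE students (
        student_id INTEGER PRIMARY KEY,
        email      TEXT UNIQUE NOT NULL
    );

    -- composite primary key: (student_id, course_id) together identify a row;
    -- student_id is also a foreign key referencing the parent table
    CREATE TABLE enrolments (
        student_id INTEGER,
        course_id  INTEGER,
        grade      TEXT,
        PRIMARY KEY (student_id, course_id),
        FOREIGN KEY (student_id) REFERENCES students(student_id)
    );
""")
print("schema created")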
7. What is Indexing?
Indexes are special lookup tables that the database search engine can use to speed
up data retrieval. In simple words, an index is a pointer to data in a table. An index in a
database is very similar to an index at the back of a book.
For example, if you want to find all the pages in a book that discuss a certain topic, you
first refer to the index, which lists the topics alphabetically, and then turn to the specific
page number(s) listed for that topic.
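A short sketch using sqlite3 (the books table and idx_books_topic index are invented for illustration):
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE books (id INTEGER, topic TEXT, page INTEGER)")
conn.executemany("INSERT INTO books VALUES (?, ?, ?)",
                 [(i, f"topic_{i % 50}", i) for i in range(10_000)])

# the index acts as a sorted lookup structure on the topic column
conn.execute("CREATE INDEX idx_books_topic ON books (topic)")

# EXPLAIN QUERY PLAN shows that SQLite now searches the index instead of scanning the table
plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM books WHERE topic = 'topic_7'"
).fetchall()
print(plan)  # the plan mentions 'USING INDEX idx_books_topic'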
Drop- It is a Data Definition Language (DDL) command. It is used to drop the whole
table. With the help of the “DROP” command we can drop (delete) the whole structure
in one go, i.e. it removes the named elements of the schema. By using this command,
the whole table ceases to exist.
Truncate- It is also a Data Definition Language (DDL) command. It is used to delete all
the rows of a relation (table) in one go. With the help of the “TRUNCATE” command, we
can’t delete a single row, since a WHERE clause cannot be used with it. All the rows of
the table are removed, but the table structure itself remains. It is comparatively faster
than the DELETE command because it removes all the rows at once.
An aggregate function in SQL returns one value after calculating multiple values of a
column. We often use aggregate functions with the GROUP BY and HAVING clauses of
the SELECT statement.
Types of Aggregate functions are: COUNT(), SUM(), AVG(), MIN(), MAX().
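A short sketch combining aggregate functions with GROUP BY and HAVING in sqlite3 (the sales table is invented for illustration):
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sales (region TEXT, amount REAL);
    INSERT INTO sales VALUES ('East', 100), ('East', 250), ('West', 80), ('West', 40), ('North', 300);
""")

# aggregate functions per region, keeping only regions with totals above 150
rows = conn.execute("""
    SELECT region, COUNT(*) AS orders, SUM(amount) AS total, AVG(amount) AS average
    FROM sales
    GROUP BY region
    HAVING SUM(amount) > 150
    ORDER BY total DESC
""").fetchall()
print(rows)  # [('East', 2, 350.0, 175.0), ('North', 1, 300.0, 300.0)]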
A subquery in MySQL is a query, which is nested into another SQL query and
embedded with SELECT, INSERT, UPDATE or DELETE statement along with the
various operators. The different types of subquery are-
● Single Value (scalar)- It returns exactly one column and exactly one row. It can be
used with comparison operators such as =, <, >, <=, >=.
● Non-Correlated- It does not depend on the outer query and can be run
independently of the outer query.
● Correlated- It references columns of the outer query, so it is evaluated once for
each row processed by the outer query.
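A short sketch of a scalar non-correlated subquery and a correlated subquery in sqlite3 (the employees table is invented for illustration):
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE employees (name TEXT, dept TEXT, salary REAL);
    INSERT INTO employees VALUES
        ('Asha', 'Sales', 700), ('Ravi', 'Sales', 650), ('Meera', 'HR', 300), ('John', 'HR', 200);
""")

# single-value (scalar), non-correlated subquery: the company-wide average is computed once
print(conn.execute("""
    SELECT name FROM employees
    WHERE salary > (SELECT AVG(salary) FROM employees)
    ORDER BY name
""").fetchall())  # [('Asha',), ('Ravi',)]

# correlated subquery: the inner query re-runs for each outer row (per-department average)
print(conn.execute("""
    SELECT name FROM employees e
    WHERE salary > (SELECT AVG(salary) FROM employees WHERE dept = e.dept)
    ORDER BY name
""").fetchall())  # [('Asha',), ('Meera',)]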
Lead Function- It is used to access data from subsequent rows along with data from
the current row.
Lag Function- It is used to access data from previous rows along with data from the
current row.
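A short sketch of LEAD and LAG using sqlite3 window functions (SQLite 3.25 or newer; the daily_sales table is invented for illustration):
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE daily_sales (day INTEGER, amount REAL);
    INSERT INTO daily_sales VALUES (1, 100), (2, 120), (3, 90), (4, 150);
""")

rows = conn.execute("""
    SELECT day,
           amount,
           LAG(amount)  OVER (ORDER BY day) AS previous_day,  -- value from the row before
           LEAD(amount) OVER (ORDER BY day) AS next_day       -- value from the row after
    FROM daily_sales
""").fetchall()
print(rows)
# [(1, 100.0, None, 120.0), (2, 120.0, 100.0, 90.0), (3, 90.0, 120.0, 150.0), (4, 150.0, 90.0, None)]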
The difference between UNION and UNION ALL is that UNION combines the result sets
of the queries and removes duplicate rows, while UNION ALL returns all the rows from
both queries, including the duplicates (repeated values).
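A short sketch of the difference in sqlite3 (the t1/t2 tables are invented for illustration):
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE t1 (city TEXT);
    CREATE TABLE t2 (city TEXT);
    INSERT INTO t1 VALUES ('Delhi'), ('Mumbai');
    INSERT INTO t2 VALUES ('Mumbai'), ('Chennai');
""")

# UNION removes duplicate rows from the combined result (3 distinct cities)
print(conn.execute("SELECT city FROM t1 UNION SELECT city FROM t2 ORDER BY city").fetchall())
# [('Chennai',), ('Delhi',), ('Mumbai',)]

# UNION ALL keeps every row, so 'Mumbai' appears twice (4 rows)
print(conn.execute("SELECT city FROM t1 UNION ALL SELECT city FROM t2 ORDER BY city").fetchall())
# [('Chennai',), ('Delhi',), ('Mumbai',), ('Mumbai',)]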
14. What is Normalization?
Normalization is the process of organising the data in a database into related tables so
as to minimise redundancy and avoid insertion, update and deletion anomalies.
Python for Data Science
The Pickle module accepts a Python object, converts it into a serialized byte stream, and
dumps it into a file by using the dump process. This entire process is called pickling. On
the other hand, unpickling is the process of retrieving the original Python objects from the
stored serialized representation.
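A minimal sketch of pickling and unpickling (the data dictionary and the file name data.pkl are arbitrary):
import pickle

data = {"name": "Asha", "scores": [88, 92, 79]}  # any picklable Python object

# pickling: dump the object into a file as a byte stream
with open("data.pkl", "wb") as f:
    pickle.dump(data, f)

# unpickling: load the original object back from the file
with open("data.pkl", "rb") as f:
    restored = pickle.load(f)

print(restored == data)  # True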
Python is an interpreted language. A Python program runs directly from the source
code: the interpreter first converts the source code written by the programmer into an
intermediate form called bytecode, which is then executed by the Python virtual machine.
The difference between a list and a tuple is that a list is mutable while a tuple is
immutable.
For example, because tuples are immutable (and therefore hashable), they can be used as dictionary keys:
coordinates = {(0, 0): 100, (1, 1): 200, (1, 0): 150, (0, 1): 125}
print(coordinates)
OUTPUT: {(0, 0): 100, (1, 1): 200, (1, 0): 150, (0, 1): 125}
In the above example, coordinate tuples such as (0, 0) are used as the keys of the
dictionary.
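A quick sketch of the mutability difference itself (the values are arbitrary):
my_list = [1, 2, 3]
my_list[0] = 10          # fine: lists are mutable
print(my_list)           # [10, 2, 3]

my_tuple = (1, 2, 3)
try:
    my_tuple[0] = 10     # tuples are immutable
except TypeError as err:
    print(err)           # 'tuple' object does not support item assignment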
PEP stands for Python Enhancement Proposal, and PEP 8 is the proposal that provides
guidelines on how to write Python code. It is a set of rules that specifies how to
format Python code for maximum readability. It was originally written by Guido van
Rossum and Barry Warsaw in 2001, with Nick Coghlan later added as a co-author.
The zip() function in Python returns a zip object, which maps the same index of multiple
containers. It takes one or more iterables, aggregates their elements position by
position, and returns an iterator of tuples.
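For example (the names and scores are arbitrary):
names = ["Asha", "Ravi", "Meera"]
scores = [88, 92, 79]

paired = zip(names, scores)          # returns a zip object (an iterator of tuples)
print(list(paired))                  # [('Asha', 88), ('Ravi', 92), ('Meera', 79)]

# zip stops at the shortest iterable
print(list(zip("AB", [1, 2, 3])))    # [('A', 1), ('B', 2)]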
11. What are the different file processing modes that are supported by Python?
Python provides several modes to open files; the four most commonly cited are
read-only (r), write-only (w), read-write (r+) and append mode (a), as sketched in the
example after this list.
● Read-only mode (r) - It is used to open a file for reading only. It is the
default mode.
● Write-only mode (w) - It is used to open a file for writing only. It overwrites
the file if it already exists, so any existing data is lost. If the file does not
exist, it creates a new file for writing.
● Read-write mode (r+) - It is used to open a file for both reading and writing. It
can also be referred to as updating mode.
● Append mode (a) - It is used to open a file for writing, with the file pointer at
the end of the file if the file exists. If the file does not exist, it creates a
new file for writing.
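A short sketch of these modes (the file name notes.txt and its contents are arbitrary):
# write mode: creates the file (or overwrites it if it already exists)
with open("notes.txt", "w") as f:
    f.write("first line\n")

# append mode: the file pointer starts at the end, existing content is preserved
with open("notes.txt", "a") as f:
    f.write("second line\n")

# read mode (the default): read the content back
with open("notes.txt", "r") as f:
    print(f.read())

# read-write mode 'r+': read and modify an existing file without truncating it
with open("notes.txt", "r+") as f:
    print(f.readline(), end="")   # reads the first line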
In Python, iterators are used to iterate over a group of elements in a container, such as
a list. Iterators work on collections of items, which can be a list, a tuple, or a dictionary.
A Python iterator implements the __iter__() and __next__() methods to step through the
stored elements. In Python, we generally use loops to iterate over the collections (i.e. a
list or a tuple). In simple words, iterators are objects which can be traversed through or
iterated upon.
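For example (the list is arbitrary):
numbers = [10, 20, 30]

it = iter(numbers)        # calls numbers.__iter__() and returns an iterator object
print(next(it))           # 10  - next() calls it.__next__()
print(next(it))           # 20
print(next(it))           # 30
# a further next(it) would raise StopIteration, which is how for-loops know when to stop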
The Python docstring is a string literal that occurs as the first statement in a module,
function, class, or method definition and it provides a convenient way to associate the
documentation. String literals that occur immediately after a simple assignment at the
top are called ‘attribute docstrings’. String literals that occur immediately after another
docstring are called ‘additional docstrings’. Python conventionally uses triple quotes to
create docstrings, even if the string fits in one line. The docstring phrase ends with a
period (.). It can span multiple lines and may contain spaces and other special
characters.
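A minimal sketch (the area function is invented for illustration):
def area(radius):
    """Return the area of a circle with the given radius."""
    return 3.14159 * radius ** 2

# the docstring is stored on the function object and is shown by help()
print(area.__doc__)   # Return the area of a circle with the given radius.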
The enumerate() function is used to iterate through the sequence and retrieve the
index position and its corresponding value at the same time.
For example -
list_1 = ["A","B","C"]
s_1 = "Javatpoint" # creating enumerate objects
object_1 = enumerate(list_1)
object_2 = enumerate(s_1)
print ("Return type:",type(object_1))
print (list(enumerate(list_1)))
print (list(enumerate(s_1)))
Output: Return type: [(0, 'A'), (1, 'B'), (2, 'C')] [(0, 'J'), (1, 'a'), (2, 'v'), (3, 'a'), (4, 't'), (5,
'p'), (6, 'o'), (7, 'i'), (8, 'n'), (9, 't')]
Exploratory Data Analysis and Machine Learning
Decision Tree vs Random Forest
Decision Tree:
● A single decision tree is prone to overfitting.
● The whole training set is used to train one tree.
Random Forest:
● It reduces overfitting by aggregating the predictions of multiple trees.
● N decision trees are trained, each one on a subset (bootstrap sample) of the
original training set.
According to the central limit theorem, if the sample size is sufficiently large and the
population has finite variance, the distribution of the sample mean is approximately
normal and its mean equals (approximately) the mean of the population.
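A rough simulation sketch using NumPy (the exponential population, the sample size of 50 and the number of samples are arbitrary choices):
import numpy as np

rng = np.random.default_rng(0)
population = rng.exponential(scale=2.0, size=100_000)   # a clearly non-normal population

# draw many samples and record each sample mean
sample_means = [rng.choice(population, size=50).mean() for _ in range(2_000)]

print(round(population.mean(), 3))       # population mean (about 2.0)
print(round(np.mean(sample_means), 3))   # the mean of the sample means is close to it
print(round(np.std(sample_means), 3))    # the spread shrinks roughly like sigma / sqrt(n)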
5. What is the difference between bagging and boosting?
Bagging:
● It mainly reduces variance, though the resulting model can still carry a relatively high bias.
Boosting:
● It mainly reduces bias, though the resulting model can have a relatively high variance.
Advantages
● It works well when there is a clear margin of separation between classes.
● It is more effective in high-dimensional spaces.
● It is effective in cases where the number of dimensions is greater than
the number of samples.
● It is relatively more memory efficient.
Disadvantages
● SVM algorithm is not suitable for large data sets.
● It does not perform very well when the data set has more noise, i.e. when target
classes are overlapping.
● In cases where the number of features for each data point exceeds the
number of training data samples, the SVM will underperform.
● The support vector classifier works by putting data points above and below
the classifying hyperplane, and there is no probabilistic explanation for the
classification.
Correlation vs Regression
Correlation:
● It measures the strength or degree of the relationship between two variables.
Regression:
● It measures how one variable affects another variable; it is about fitting a model.
One-tailed test - A one-tailed test is a statistical test in which the critical area of a
distribution is one-sided so that it is either greater than or less than a certain value,
but not both. If the sample being tested falls into the one-sided critical area,
the alternative hypothesis will be accepted instead of the null hypothesis.
Two-tailed test - A two-tailed test is a statistical test in which the critical area of a
distribution is two-sided, and it tests whether a sample is greater than or less than a
certain range of values. If the sample being tested falls into either of the critical areas,
the alternative hypothesis is accepted instead of the null hypothesis.
The p-value is the probability of observing a value of the test statistic that is as extreme
as, or more extreme than, what was observed in the sample, assuming that the null
hypothesis is true. If the p-value is less than the chosen significance level (commonly
0.05), the result is considered statistically significant and provides strong evidence
against the null hypothesis.
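A minimal sketch of a two-tailed one-sample t-test using SciPy (the simulated sample and the hypothesised mean of 5.0 are arbitrary):
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
sample = rng.normal(loc=5.3, scale=1.0, size=40)   # simulated sample data

# null hypothesis: the population mean is 5.0 (two-tailed test)
t_stat, p_value = stats.ttest_1samp(sample, popmean=5.0)
print(round(p_value, 4))   # if p_value < 0.05 we reject the null hypothesis at the 5% level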
13. What are the things that should be kept in mind while choosing the value of K
in the KNN algorithm?
If K is small, then results might not be reliable because the noise will have a higher
influence on the result. If K is large, then we have to do a lot of processing, which may
adversely impact the performance of the algorithm.
So, the following things must be considered while choosing the value of K (a tiny
sketch follows this list):
● A common rule of thumb is to take K as the square root of n (the number of data
points in the training dataset).
● K should be chosen as an odd value so that there are no ties in the vote. If the
square root is even, add or subtract 1 to make it odd: subtract 1 if the value is on
the higher end, and add 1 if it is on the lower end.
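A tiny sketch of this rule of thumb (n = 400 is an arbitrary example):
import math

n = 400                  # number of points in the training set
k = int(math.sqrt(n))    # rule-of-thumb starting value: sqrt(400) = 20
if k % 2 == 0:
    k -= 1               # make it odd to avoid ties in the vote
print(k)                 # 19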
14. Why is the odd value of K preferred over the even values in the KNN
algorithm?
The odd value of K is preferred over even values in order to ensure that there are no
ties in the voting. If the square root of a number of data points is even, then we add or
subtract 1 to it to make it odd.
A Type 1 error, or false positive, occurs when we reject the null hypothesis when it is
actually true. For example, a jury decides a person is guilty even though the person is
innocent.
A Type 2 error, or false negative, occurs when we fail to reject the null hypothesis when
it is actually false. For example, a test for a disease may report a negative result when
the patient is, in fact, infected.