Datascience Assignment 1696326298
1) Business Problem: A retail store wants to analyze the sales data of a particular
product category to understand the typical sales performance and make strategic
decisions.
Data:
Let's consider the weekly sales data (in units) for the past month for a specific product
category:
Week 1: 50 units
Week 2: 60 units
Week 3: 55 units
Week 4: 70 units
Question:
1. Mean: What is the average weekly sales of the product category?
2. Median: What is the typical or central sales value for the product category?
3. Mode: Are there any recurring or most frequently occurring sales figures for the
product category?
By answering these questions using the mean, median, and mode, the retail store can
gain insights into the sales performance of the product category, identify any patterns or
outliers, and make informed decisions regarding stock management, marketing
strategies, and product placement.
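As a quick sketch of these calculations (using Python's built-in statistics module and the four weekly values above; multimode is available from Python 3.8):
# Sketch: central tendency for the weekly sales data above
import statistics

weekly_sales = [50, 60, 55, 70]          # units sold in weeks 1-4

print("Mean:", statistics.mean(weekly_sales))       # 58.75
print("Median:", statistics.median(weekly_sales))   # 57.5
print("Mode:", statistics.multimode(weekly_sales))  # every value occurs once here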
2) Business Problem: A restaurant wants to analyze the waiting times of its customers to
assess and improve its customer service process.
Data:
Let's consider the waiting times (in minutes) for the past 20 customers:
15, 10, 20, 25, 15, 10, 30, 20, 15, 10,
10, 25, 15, 20, 20, 15, 10, 10, 20, 25
Questions:
1. Mean: What is the average waiting time for customers at the restaurant?
2. Median: What is the typical or central waiting time for customers?
3. Mode: Are there any recurring or most frequently occurring waiting times?
By answering these questions using the mean, median, and mode, the restaurant can
gain insights into the average waiting time, identify any common or peak waiting periods,
and make informed decisions to optimize the customer service process, such as
adjusting staffing levels, streamlining operations, or implementing strategies to reduce
waiting times.
3) Business Problem: A car rental company wants to analyze the rental durations of
its customers to understand the typical rental period and optimize its pricing and
fleet management strategies.
Data:
Let's consider the rental durations (in days) for a sample of 50 customers:
3, 2, 5, 4, 7, 2, 3, 3, 1, 6,
4, 2, 3, 5, 2, 4, 2, 1, 3, 5,
6, 3, 2, 1, 4, 2, 4, 5, 3, 2,
7, 2, 3, 4, 5, 1, 6, 2, 4, 3,
5, 3, 2, 4, 2, 6, 3, 2, 4, 5
Question:
1. Mean: What is the average rental duration for customers at the car rental company?
2. Median: What is the typical or central rental duration for customers?
3. Mode: Are there any recurring or most frequently occurring rental durations for
customers?
By answering these questions using the mean, median, and mode, the car rental
company can gain insights into the average rental duration, understand the most
common rental periods, and make informed decisions regarding pricing, fleet size, and
availability. Additionally, this analysis can help the company optimize resource allocation,
plan for peak demand periods, and enhance customer satisfaction by aligning service
offerings with customers' typical rental needs.
Questions on measures of dispersion
Data:
Let's consider the number of units produced per hour by the machine for a sample of 10
working days:
Question:
1. Range: What is the range of the production output for the machine?
2. Variance: What is the variance of the production output for the machine?
3. Standard Deviation: What is the standard deviation of the production output for the
machine?
Data:
Let's consider the daily sales (in dollars) for the past 30 days:
$500, $700, $400, $600, $550, $750, $650, $500, $600, $550,
$800, $450, $700, $550, $600, $400, $650, $500, $750, $550,
$700, $600, $500, $800, $550, $650, $400, $600, $750, $550
Questions:
1. Range: What is the range of the daily sales?
2. Variance: What is the variance of the daily sales?
3. Standard Deviation: What is the standard deviation of the daily sales?
By answering these questions using different measures of dispersion, the retail store can
gain insights into the variability in daily sales, assess the consistency of demand, and
make informed decisions regarding inventory stocking levels, sales forecasting, and
pricing strategies.
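A minimal sketch of these dispersion measures, assuming NumPy and the 30 daily sales values above (sample variance and standard deviation with ddof=1 are an assumption; the source does not say whether population or sample measures are wanted):
# Sketch: measures of dispersion for the 30 daily sales figures above
import numpy as np

daily_sales = np.array([
    500, 700, 400, 600, 550, 750, 650, 500, 600, 550,
    800, 450, 700, 550, 600, 400, 650, 500, 750, 550,
    700, 600, 500, 800, 550, 650, 400, 600, 750, 550,
])

print("Range:", daily_sales.max() - daily_sales.min())  # 800 - 400 = 400
print("Variance:", daily_sales.var(ddof=1))             # sample variance
print("Std deviation:", daily_sales.std(ddof=1))        # sample standard deviation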
Data:
Let's consider the delivery times (in days) for a sample of 50 shipments:
3, 5, 2, 4, 6, 2, 3, 4, 2, 5,
7, 2, 3, 4, 2, 4, 2, 3, 5, 6,
3, 2, 1, 4, 2, 4, 5, 3, 2, 7,
2, 3, 4, 5, 1, 6, 2, 4, 3, 5,
3, 2, 4, 2, 6, 3, 2, 4, 5, 3
Questions:
1. Range: What is the range of the delivery times?
Data:
Let's consider the monthly revenue figures for the product over the past 12 months:
$120, $150, $110, $135, $125, $140, $130, $155, $115, $145, $135, $130
Questions:
1. Measure of Central Tendency: What is the average monthly revenue for the product?
2. Measure of Dispersion: What is the range of monthly revenue for the product?
By answering these questions, the company can gain insights into the average revenue
generated by the product and understand the range or variability in its monthly revenue,
which can help with financial planning, forecasting, and evaluating the product's
performance.
Data:
Let's consider the satisfaction ratings from 50 customers:
8, 7, 9, 6, 7, 8, 9, 8, 7, 6,
8, 9, 7, 8, 7, 6, 8, 9, 6, 7,
8, 9, 7, 6, 7, 8, 9, 8, 7, 6,
9, 8, 7, 6, 8, 9, 7, 8, 7, 6,
9, 8, 7, 6, 7, 8, 9, 8, 7, 6
Questions:
1. Measure of Central Tendency: What is the average satisfaction rating?
2. Measure of Dispersion: What is the standard deviation of the satisfaction ratings?
By answering these questions, the company can gain insights into the average
satisfaction rating of customers and understand the spread or variability in their ratings.
This information can help identify areas for improvement, evaluate customer perception,
and make informed decisions to enhance the service quality.
6) Problem: A company wants to analyze the customer wait times at its call center to
assess the efficiency of its customer service operations.
Data:
Let's consider the wait times (in minutes) for a sample of 100 randomly selected
customer calls:
10, 15, 12, 18, 20, 25, 8, 14, 16, 22,
9, 17, 11, 13, 19, 23, 21, 16, 24, 27,
13, 10, 18, 16, 12, 14, 19, 21, 11, 17,
15, 20, 26, 13, 12, 14, 22, 19, 16, 11,
25, 18, 16, 13, 21, 20, 15, 12, 19, 17,
14, 16, 23, 18, 15, 11, 19, 22, 17, 12,
16, 14, 18, 20, 25, 13, 11, 22, 19, 17,
15, 16, 13, 14, 18, 20, 19, 21, 17, 12,
15, 13, 16, 14, 22, 21, 19, 18, 16, 11,
17, 14, 12, 20, 23, 19, 15, 16, 13, 18
Questions:
1. Measure of Central Tendency: What is the average wait time for customers at the call
center?
2. Measure of Dispersion: What is the range of wait times for customers at the call
center?
3. Measure of Dispersion: What is the standard deviation of the wait times for customers
at the call center?
By answering these questions, the company can gain insights into the average wait time
experienced by customers, assess the variability or spread in the wait times, and make
informed decisions regarding staffing levels, call center efficiency, and customer
satisfaction.
Data:
Let's consider the fuel efficiency (in miles per gallon, mpg) for a sample of 50 vehicles:
Model A: 30, 32, 33, 28, 31, 30, 29, 30, 32, 31,
Model B: 25, 27, 26, 23, 28, 24, 26, 25, 27, 28,
Model C: 22, 23, 20, 25, 21, 24, 23, 22, 25, 24,
Model D: 18, 17, 19, 20, 21, 18, 19, 17, 20, 19,
Model E: 35, 36, 34, 35, 33, 34, 32, 33, 36, 34
Questions:
1. Measure of Central Tendency: What is the average fuel efficiency for each vehicle
model?
2. Measure of Dispersion: What is the range of fuel efficiency for each vehicle model?
3. Measure of Dispersion: What is the variance of the fuel efficiency for each vehicle
model?
Data:
Let's consider the ages of 100 employees:
28, 32, 35, 40, 42, 28, 33, 38, 30, 41,
37, 31, 34, 29, 36, 43, 39, 27, 35, 31,
39, 45, 29, 33, 37, 40, 36, 29, 31, 38,
35, 44, 32, 39, 36, 30, 33, 28, 41, 35,
31, 37, 42, 29, 34, 40, 31, 33, 38, 36,
39, 27, 35, 30, 43, 29, 32, 36, 31, 40,
38, 44, 37, 33, 35, 41, 30, 31, 39, 28,
45, 29, 33, 38, 34, 32, 35, 31, 40, 36,
39, 27, 35, 30, 43, 29, 32, 36, 31, 40,
38, 44, 37, 33, 35, 41, 30, 31, 39, 28
Questions:
1. Frequency Distribution: Create a frequency distribution table for the ages of the
employees.
2. Mode: What is the mode (most common age) among the employees?
3. Median: What is the median age of the employees?
4. Range: What is the range of ages among the employees?
By answering these questions using frequency distribution and other measures, the
company can gain insights into the age distribution of its workforce, identify the most
common age group, understand the central tendency, and assess the spread of ages.
This information can be useful for workforce planning, diversity initiatives, and
understanding the generational dynamics within the organization.
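One possible sketch of these calculations with pandas (an assumed library here), using the 100 ages listed above:
# Sketch: frequency distribution and summary measures for the employee ages above
import pandas as pd

ages = pd.Series([
    28, 32, 35, 40, 42, 28, 33, 38, 30, 41,
    37, 31, 34, 29, 36, 43, 39, 27, 35, 31,
    39, 45, 29, 33, 37, 40, 36, 29, 31, 38,
    35, 44, 32, 39, 36, 30, 33, 28, 41, 35,
    31, 37, 42, 29, 34, 40, 31, 33, 38, 36,
    39, 27, 35, 30, 43, 29, 32, 36, 31, 40,
    38, 44, 37, 33, 35, 41, 30, 31, 39, 28,
    45, 29, 33, 38, 34, 32, 35, 31, 40, 36,
    39, 27, 35, 30, 43, 29, 32, 36, 31, 40,
    38, 44, 37, 33, 35, 41, 30, 31, 39, 28,
])

print(ages.value_counts().sort_index())   # frequency distribution table
print("Mode:", ages.mode().tolist())      # most common age(s)
print("Median:", ages.median())
print("Range:", ages.max() - ages.min())  # 45 - 27 = 18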
Data:
Let's consider the purchase amounts (in dollars) for a sample of 50 customers:
56, 40, 28, 73, 52, 61, 35, 40, 47, 65,
52, 44, 38, 60, 56, 40, 36, 49, 68, 57,
52, 63, 41, 48, 55, 42, 39, 58, 62, 49,
59, 45, 47, 51, 65, 41, 48, 55, 42, 39,
58, 62, 49, 59, 45, 47, 51, 65, 43, 58
Questions:
1. Frequency Distribution: Create a frequency distribution table for the purchase
amounts.
2. Mode: What is the mode (most common purchase amount) among the customers?
3. Median: What is the median purchase amount among the customers?
4. Interquartile Range: What is the interquartile range of the purchase amounts?
By answering these questions using frequency distribution and other measures, the retail
store can gain insights into the spending habits of its customers, identify the most
common purchase amount
10) Problem: A manufacturing company wants to analyze the defect rates of its
production line to identify the frequency of different types of defects.
Data:
Let's consider the types of defects and their corresponding frequencies observed in a
sample of 200 products:
Defect Type: A, B, C, D, E, F, G
Frequency: 30, 40, 20, 10, 45, 25, 30
Questions:
1. Bar Chart: Create a bar chart to visualize the frequency of different defect types.
2. Most Common Defect: Which defect type has the highest frequency?
3. Histogram: Create a histogram to represent the defect frequencies.
By answering these questions using a bar chart and histogram, the manufacturing
company can visually understand the distribution of defect types, identify the most
common defect, and prioritize quality control efforts to address the prevalent issues.
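A hedged sketch of the bar chart, assuming matplotlib as the plotting library; a histogram of the raw defect observations could be built the same way with plt.hist once the individual observations are available:
# Sketch: bar chart of defect frequencies with matplotlib
import matplotlib.pyplot as plt

defect_types = ["A", "B", "C", "D", "E", "F", "G"]
frequencies = [30, 40, 20, 10, 45, 25, 30]   # sums to 200 products

plt.bar(defect_types, frequencies)
plt.xlabel("Defect type")
plt.ylabel("Frequency")
plt.title("Defect frequencies in a sample of 200 products")
plt.show()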
11) Problem: A survey was conducted to gather feedback from customers about their
satisfaction levels with a specific service on a scale of 1 to 5.
Data:
Let's consider the satisfaction ratings from 100 customers:
Ratings: 4, 5, 3, 4, 4, 3, 2, 5, 4, 3,
5, 4, 2, 3, 4, 5, 3, 4, 5, 3,
4, 3, 2, 4, 5, 3, 4, 5, 4, 3,
3, 4, 5, 2, 3, 4, 4, 3, 5, 4,
3, 4, 5, 4, 2, 3, 4, 5, 3, 4,
5, 4, 3, 4, 5, 3, 4, 5, 4, 3,
3, 4, 5, 2, 3, 4, 4, 3, 5, 4,
3, 4, 5, 4, 2, 3, 4, 5, 3, 4,
5, 4, 3, 4, 5, 3, 4, 5, 4, 3,
3, 4, 5, 2, 3, 4, 4, 3, 5, 4
Questions:
1. Histogram: Create a histogram to visualize the distribution of satisfaction ratings.
2. Mode: Which satisfaction rating has the highest frequency?
3. Bar Chart: Create a bar chart to display the frequency of each satisfaction rating.
By answering these questions using a histogram and bar chart, the organization can
gain insights into the distribution of satisfaction ratings, identify the most common
satisfaction level, and assess overall customer satisfaction.
12) Problem: A company wants to analyze the monthly sales figures of its products to
understand the sales distribution across different price ranges.
Data:
Let's consider the monthly sales figures (in thousands of dollars) for a sample of 50
products:
Sales: 35, 28, 32, 45, 38, 29, 42, 30, 36, 41,
47, 31, 39, 43, 37, 30, 34, 39, 28, 33,
36, 40, 42, 29, 31, 45, 38, 33, 41, 35,
37,
Questions:
1. Histogram: Create a histogram to visualize the sales distribution across different price
ranges.
2. Measure of Central Tendency: What is the average monthly sales figure?
3. Bar Chart: Create a bar chart to display the frequency of sales in different price
ranges.
By answering these questions using a histogram and bar chart, the company can
understand the distribution of sales figures, determine the average sales performance,
and identify the price ranges where sales are concentrated or lacking.
13) Problem: A study was conducted to analyze the response times of a website for
different user locations.
Data:
Let's consider the response times (in milliseconds) for a sample of 200 user requests:
Response Times: 125, 148, 137, 120, 135, 132, 145, 122, 130, 141,
118, 125, 132, 136, 128, 123, 132, 138, 126, 129,
136, 127, 130, 122, 125, 133, 140, 126, 133, 135,
130, 134, 141, 119, 125, 131, 136, 128, 124, 132,
136, 127, 130, 122, 125, 133, 140, 126, 133, 135,
130, 134, 141, 119, 125, 131, 136, 128, 124, 132,
136, 127, 130, 122, 125, 133, 140, 126, 133, 135,
130, 134, 141, 119, 125, 131, 136, 128, 124, 132,
136, 127, 130, 122, 125, 133, 140, 126, 133, 135,
130, 134, 141, 119, 125, 131, 136, 128, 124, 132
Questions:
1. Histogram: Create a histogram to visualize the distribution of response times.
2. Measure of Central Tendency: What is the median response time?
3. Bar Chart: Create a bar chart to display the frequency of response times within
different ranges.
By answering these questions using a histogram and bar chart, the study can gain
insights into the distribution of response times, understand the typical response time
experienced by users, and assess the performance of the website.
14) Problem: A company wants to analyze the sales performance of its products
across different regions.
Data:
Let's consider the sales figures (in thousands of dollars) for a sample of 50 products in
three regions:
Region 1: 45, 35, 40, 38, 42, 37, 39, 43, 44, 41,
Region 2: 32, 28, 30, 34, 33, 35, 31, 29, 36, 37,
Region 3: 40, 39, 42, 41, 38, 43, 45, 44, 41, 37
Questions:
1. Bar Chart: Create a bar chart to compare the sales figures across the three regions.
2. Measure of Central Tendency: What is the average sales figure for each region?
3. Measure of Dispersion: What is the standard deviation of the sales figures for each region?
By answering these questions using a bar chart and measures of central tendency and
dispersion, the company can compare the sales performance across different regions,
identify the average sales figures, and understand the variability in sales within each
region. This information can be used for regional sales analysis, resource allocation, and
decision-making processes.
Data:
Let's consider the monthly returns (%) for the portfolio over a one-year period:
Returns: -2.5, 1.3, -0.8, -1.9, 2.1, 0.5, -1.2, 1.8, -0.5, 2.3,
-0.7, 1.2, -1.5, -0.3, 2.6, 1.1, -1.7, 0.9, -1.4, 0.3,
1.9, -1.1, -0.4, 2.2, -0.9, 1.6, -0.6, -1.3, 2.4, 0.7,
-1.8, 1.5, -0.2, -2.1, 2.8, 0.8, -1.6, 1.4, -0.1, 2.5,
-1.0, 1.7, -0.9, -2.0, 2.7, 0.6, -1.4, 1.1, -0.3, 2.0
Questions:
1. Skewness: Calculate the skewness of the monthly returns.
2. Kurtosis: Calculate the kurtosis of the monthly returns.
3. Interpretation: Based on the skewness and kurtosis values, what can be said about
the distribution of returns?
By answering these questions using measures of skewness and kurtosis, the company
can understand the shape and symmetry of the return distribution, assess the level of
risk and potential outliers, and make informed decisions regarding portfolio management
and risk mitigation strategies.
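A minimal sketch using scipy.stats (an assumed library) and the 50 return values above; kurtosis here uses scipy's default Fisher definition, i.e. excess kurtosis, where 0 corresponds to a normal distribution:
# Sketch: skewness and kurtosis of the monthly returns above
from scipy.stats import skew, kurtosis

returns = [
    -2.5, 1.3, -0.8, -1.9, 2.1, 0.5, -1.2, 1.8, -0.5, 2.3,
    -0.7, 1.2, -1.5, -0.3, 2.6, 1.1, -1.7, 0.9, -1.4, 0.3,
    1.9, -1.1, -0.4, 2.2, -0.9, 1.6, -0.6, -1.3, 2.4, 0.7,
    -1.8, 1.5, -0.2, -2.1, 2.8, 0.8, -1.6, 1.4, -0.1, 2.5,
    -1.0, 1.7, -0.9, -2.0, 2.7, 0.6, -1.4, 1.1, -0.3, 2.0,
]

print("Skewness:", skew(returns))             # > 0: longer right tail, < 0: longer left tail
print("Excess kurtosis:", kurtosis(returns))  # 0 corresponds to a normal distribution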
Data:
Let's consider the monthly incomes (in thousands of dollars) of a sample of 100
individuals:
Incomes: 2.5, 4.8, 3.2, 2.1, 4.5, 2.9, 2.3, 3.1, 4.2, 3.9,
2.8, 4.1, 2.6, 2.4, 4.7, 3.3, 2.7, 3.0, 4.3, 3.7,
2.2, 3.6, 4.0, 2.7, 3.8, 3.5, 3.2, 4.4, 2.0, 3.4,
3.1, 2.9, 4.6, 3.3, 2.5, 4.9, 2.8, 3.0, 4.2, 3.9,
2.8, 4.1, 2.6, 2.4, 4.7, 3.3, 2.7, 3.0, 4.3, 3.7,
2.2, 3.6, 4.0, 2.7, 3.8, 3.5, 3.2, 4.4, 2.0, 3.4,
3.1, 2.9, 4.6, 3.3, 2.5, 4.9, 2.8, 3.0, 4.2, 3.9,
2.8, 4.1, 2.6, 2.4, 4.7, 3.3, 2.7, 3.0, 4.3, 3.7,
2.2, 3.6, 4.0, 2.7, 3.8, 3.5, 3.2, 4.4, 2.0, 3.4,
3.1, 2.9, 4.6, 3.3, 2.5, 4.9
Questions:
1. Skewness: Calculate the skewness of the income distribution.
2. Kurtosis: Calculate the kurtosis of the income distribution.
3. Interpretation: Based on the skewness and kurtosis values, what can be inferred
about the income inequality?
By answering these questions using measures of skewness and kurtosis, the research
study can assess the level of income inequality, determine the shape of the income
distribution, and make informed policy recommendations to address income disparities.
Data:
Let's consider the satisfaction ratings from 200 customers:
Ratings: 4, 5, 3, 4, 4, 3, 2, 5, 4, 3,
5, 4, 2, 3, 4, 5, 3, 4, 5, 3,
4, 3, 2, 4, 5, 3, 4, 5, 4, 3,
3, 4, 5, 2, 3, 4, 4, 3, 5, 4,
3, 4, 5, 4, 2, 3, 4, 5, 3, 4,
5, 4, 3, 4, 5, 3, 4, 5, 4, 3,
3, 4, 5, 2, 3, 4, 4, 3, 5, 4,
3, 4, 5, 4, 2, 3, 4, 5, 3, 4,
5, 4, 3, 4, 5, 3, 4, 5, 4, 3,
3, 4, 5, 2, 3, 4, 4, 3, 5, 4
Questions:
1. Skewness: Calculate the skewness of the satisfaction ratings.
2. Kurtosis: Calculate the kurtosis of the satisfaction ratings.
3. Interpretation: Based on the skewness and kurtosis values, what can be inferred
about the satisfaction ratings distribution?
By answering these questions using measures of skewness and kurtosis, the survey can
assess the skewness and peakedness of the satisfaction ratings, determine if the ratings
are skewed towards positive or negative evaluations, and understand the distribution
characteristics of customer satisfaction.
Data:
Let's consider the house prices (in thousands of dollars) for a sample of 100 houses:
House Prices: 280, 350, 310, 270, 390, 320, 290, 340, 310, 380,
270, 350, 300, 330, 370, 310, 280, 320, 350, 290,
270, 350, 300, 330, 370, 310, 280, 320, 350, 290,
270, 350, 300, 330, 370, 310, 280, 320, 350, 290,
270, 350, 300, 330, 370, 310, 280, 320, 350, 290,
270, 350, 300, 330, 370, 310, 280, 320, 350, 290,
270, 350, 300, 330, 370, 310, 280, 320, 350, 290,
270, 350, 300, 330, 370, 310, 280, 320, 350, 290,
270, 350, 300, 330, 370, 310, 280, 320, 350, 290,
270, 350, 300, 330, 370, 310, 280, 320, 350, 290
Questions:
1. Skewness: Calculate the skewness of the house price distribution.
2. Kurtosis: Calculate the kurtosis of the house price distribution.
3. Interpretation: Based on the skewness and kurtosis values, what can be inferred
about the distribution of house prices?
By answering these questions using measures of skewness and kurtosis, the study can
assess the symmetry and peakedness of the house price distribution, identify any
outliers or extreme values, and gain insights into the market trends and pricing
dynamics.
Data:
Let's consider the waiting times (in minutes) for a sample of 100 customers:
Waiting Times: 12, 18, 15, 22, 20, 14, 16, 21, 19, 17,
22, 19, 13, 16, 21, 22, 17, 19, 22, 18,
14, 20, 19, 17, 22, 18, 15, 21, 20, 16,
12, 18, 15, 22, 20, 14, 16, 21, 19, 17,
22, 19, 13, 16, 21, 22, 17, 19, 22, 18,
14, 20, 19, 17, 22, 18, 15, 21, 20, 16,
12, 18, 15, 22, 20, 14, 16, 21, 19, 17,
22, 19, 13, 16, 21, 22, 17, 19, 22, 18,
14, 20, 19, 17, 22, 18, 15, 21, 20, 16,
12, 18, 15, 22, 20, 14, 16, 21, 19, 17
Questions:
1. Skewness: Calculate the skewness of the waiting time distribution.
2. Kurtosis: Calculate the kurtosis of the waiting time distribution.
By answering these questions using measures of skewness and kurtosis, the company
can assess the symmetry and tail behavior of the waiting time distribution, identify any
patterns or anomalies in customer waiting times, and make improvements to streamline
the service process and enhance customer satisfaction.
Data:
Let's consider the monthly salaries (in thousands of dollars) of a sample of 200
employees:
Salaries: 40, 45, 50, 55, 60, 62, 65, 68, 70, 72,
75, 78, 80, 82, 85, 88, 90, 92, 95, 100,
105, 110, 115, 120, 125, 130, 135, 140, 145, 150,
155, 160, 165, 170, 175, 180, 185, 190, 195, 200,
205, 210, 215, 220, 225, 230, 235, 240, 245, 250,
255, 260, 265, 270, 275, 280, 285, 290, 295, 300,
305, 310, 315, 320, 325, 330, 335, 340, 345, 350,
355, 360, 365, 370, 375, 380, 385, 390, 395, 400,
405, 410, 415, 420, 425, 430, 435, 440, 445, 450,
455, 460, 465, 470, 475, 480, 485, 490, 495, 500
Questions:
1. Quartiles: Calculate the first quartile (Q1), median (Q2), and third quartile (Q3) of the
salary distribution.
2. Percentiles: Calculate the 10th percentile, 25th percentile, 75th percentile, and 90th
percentile of the salary distribution.
3. Interpretation: Based on the quartiles and percentiles, what can be inferred about the
income distribution of the employees?
By answering these questions using quartiles and percentiles, the company can
understand the income levels at different points in the distribution, identify the median
salary and the spread of salaries, and make informed decisions related to compensation,
employee benefits, and salary structures.
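A short sketch with NumPy, using the 100 salary values listed above (np.percentile's default linear interpolation between data points is assumed):
# Sketch: quartiles and percentiles of the salary values listed above (in thousands)
import numpy as np

salaries = np.array([
    40, 45, 50, 55, 60, 62, 65, 68, 70, 72,
    75, 78, 80, 82, 85, 88, 90, 92, 95, 100,
    105, 110, 115, 120, 125, 130, 135, 140, 145, 150,
    155, 160, 165, 170, 175, 180, 185, 190, 195, 200,
    205, 210, 215, 220, 225, 230, 235, 240, 245, 250,
    255, 260, 265, 270, 275, 280, 285, 290, 295, 300,
    305, 310, 315, 320, 325, 330, 335, 340, 345, 350,
    355, 360, 365, 370, 375, 380, 385, 390, 395, 400,
    405, 410, 415, 420, 425, 430, 435, 440, 445, 450,
    455, 460, 465, 470, 475, 480, 485, 490, 495, 500,
])

q1, median, q3 = np.percentile(salaries, [25, 50, 75])
print("Q1:", q1, " Median:", median, " Q3:", q3)
print("10th, 25th, 75th, 90th percentiles:", np.percentile(salaries, [10, 25, 75, 90]))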
Data:
Let's consider the weights (in kilograms) of a sample of 100 individuals:
Weights: 55, 60, 62, 65, 68, 70, 72, 75, 78, 80,
82, 85, 88, 90, 92, 95, 100, 105, 110, 115,
120, 125, 130, 135, 140, 145, 150, 155, 160, 165,
170, 175, 180, 185, 190, 195, 200, 205, 210, 215,
220, 225, 230, 235, 240, 245, 250, 255, 260, 265,
270, 275, 280, 285, 290, 295, 300, 305, 310, 315,
320, 325, 330, 335, 340, 345, 350, 355, 360, 365,
370, 375,
Questions:
1. Quartiles: Calculate the first quartile (Q1), median (Q2), and third quartile (Q3) of the
weight distribution.
2. Percentiles: Calculate the 15th percentile, 50th percentile, and 85th percentile of the
weight distribution.
3. Interpretation: Based on the quartiles and percentiles, what can be inferred about the
weight distribution of the individuals?
By answering these questions using quartiles and percentiles, the research study can
understand the weight distribution and identify the weight ranges at different percentiles,
such as underweight, normal weight, overweight, and obese categories. This information
can be used for evaluating health risks, designing appropriate interventions, and
providing personalized recommendations for weight management.
Data:
Let's consider the purchase amounts (in dollars) of a sample of 150 customers:
Purchase Amounts: 20, 25, 30, 35, 40, 45, 50, 55, 60, 65,
70, 75, 80, 85, 90, 95, 100, 105, 110, 115,
120, 125, 130, 135, 140, 145, 150, 155, 160, 165,
170, 175, 180, 185, 190, 195, 200, 205, 210, 215,
220, 225, 230, 235, 240, 245, 250, 255, 260, 265,
270, 275, 280, 285, 290, 295, 300, 305, 310, 315,
320, 325, 330, 335, 340, 345, 350, 355, 360, 365,
370, 375, 380, 385, 390, 395, 400, 405, 410, 415,
420, 425, 430, 435, 440, 445, 450, 455, 460, 465,
470, 475, 480, 485, 490, 495, 500, 505, 510, 515,
520, 525, 530, 535, 540, 545, 550, 555, 560, 565
Questions:
1. Quartiles: Calculate the first quartile (Q1), median (Q2), and third quartile (Q3) of the
purchase amount distribution.
2. Percentiles: Calculate the 20th percentile, 40th percentile, and 80th percentile of the
purchase amount distribution.
3. Interpretation: Based on the quartiles and percentiles, what can be inferred about the
spending patterns of the customers?
By answering these questions using quartiles and percentiles, the retail store can
understand the distribution of purchase amounts, identify the spending ranges at
different percentiles, analyze customer segments based on their spending behavior, and
tailor marketing strategies to target specific customer groups.
Data:
Let's consider the commute times (in minutes) of a sample of 250 employees:
Commute Times: 15, 20, 25, 30, 35, 40, 45, 50, 55, 60,
65, 70, 75, 80, 85, 90, 95, 100, 105, 110,
115, 120, 125, 130, 135, 140, 145, 150, 155, 160,
165, 170, 175, 180, 185, 190, 195, 200, 205, 210,
215, 220, 225, 230, 235, 240, 245, 250, 255, 260,
265, 270, 275, 280, 285, 290, 295, 300, 305, 310,
315, 320, 325, 330, 335, 340, 345, 350, 355, 360,
365, 370, 375, 380, 385, 390, 395, 400, 405, 410,
415, 420, 425, 430, 435, 440, 445, 450, 455, 460,
465, 470, 475, 480, 485, 490, 495, 500, 505, 510,
515, 520, 525, 530, 535, 540, 545, 550, 555, 560,
565, 570, 575, 580, 585, 590, 595, 600, 605, 610
Questions:
1. Quartiles: Calculate the first quartile (Q1), median (Q2), and third quartile (Q3) of the
commute time distribution.
2. Percentiles: Calculate the 30th percentile, 50th percentile, and 70th percentile of the
commute time distribution.
3. Interpretation: Based on the quartiles and percentiles, what can be inferred about the
average commute time of the employees?
By answering these questions using quartiles and percentiles, the study can determine
the typical commute times, understand the spread of commute times, identify any
outliers or extreme values, and provide insights for transportation planning, scheduling,
and employee well-being initiatives.
Data:
Let's consider the defect rates (in percentage) for a sample of 300 products:
Defect Rates: 0.5, 1.0, 0.2, 0.7, 0.3, 0.9, 1.2, 0.6, 0.4, 1.1,
0.8, 0.5, 0.3, 0.6, 1.0, 0.4, 0.5, 0.7, 0.9, 1.3,
0.8, 0.6, 0.4, 0.7, 0.9, 0.5, 0.2, 1.0, 0.8, 0.3,
0.6, 0.4, 0.7, 0.9, 1.2, 0.8, 0.3, 0.6, 0.5, 0.4,
0.7, 0.9, 1.1, 0.3, 1.4, 0.9, 0.6, 0.2, 1.5, 1.0,
0.6, 0.4, 0.7, 1.0, 0.8, 0.3, 0.5, 0.8, 0.6, 0.3, 0.9,
0.4, 0.7, 0.9, 1.0, 0.8, 0.3, 0.5, 0.6, 0.4, 0.7,
0.9, 1.1, 0.8, 0.3, 0.5, 0.6, 0.4, 0.7, 0.9, 1.0,
0.8, 0.3, 0.5, 0.6, 0.4, 0.7, 0.9, 1.1, 0.8, 0.3,
0.5, 0.6, 0.4, 0.7, 0.9, 1.0, 0.8, 0.3, 0.5, 0.6,
0.4, 0.7, 0.9, 1.1, 0.8, 0.3, 0.5, 0.6, 0.4, 0.7,
0.9, 1.0, 0.8, 0.3, 0.5, 0.6, 0.4, 0.7, 0.9, 1.1
Questions:
1. Quartiles: Calculate the first quartile (Q1), median (Q2), and third quartile (Q3) of the
defect rate distribution.
2. Percentiles: Calculate the 25th percentile, 50th percentile, and 75th percentile of the
defect rate distribution.
3. Interpretation: Based on the quartiles and percentiles, what can be inferred about the
quality of the products?
Data:
Let's consider the monthly advertising expenditure (in thousands of dollars) and
corresponding sales revenue (in thousands of dollars) for a sample of 12 months:
Advertising Expenditure: 10, 12, 15, 18, 20, 22, 25, 28, 30, 32, 35, 38
Sales Revenue: 50, 55, 60, 65, 70, 75, 80, 85, 90, 95, 100, 105
Question:
Calculate the correlation coefficient between advertising expenditure and sales revenue.
Interpret the value of the correlation coefficient and explain the nature of the relationship
between advertising expenditure and sales revenue.
By analyzing the correlation coefficient, the marketing department can determine the
strength and direction of the relationship between advertising expenditure and sales
revenue. This information can help them make informed decisions about allocating their
advertising budget and optimizing their marketing strategies.
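A minimal sketch of the Pearson correlation with NumPy, using the two series above:
# Sketch: correlation between advertising expenditure and sales revenue
import numpy as np

advertising = [10, 12, 15, 18, 20, 22, 25, 28, 30, 32, 35, 38]
sales = [50, 55, 60, 65, 70, 75, 80, 85, 90, 95, 100, 105]

r = np.corrcoef(advertising, sales)[0, 1]
print("Correlation coefficient r:", round(r, 3))  # close to +1: strong positive linear relationship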
Data:
Let's consider the daily closing prices (in dollars) of Company A and Company B for a
sample of 20 trading days:
Company A: 45, 47, 48, 50, 52, 53, 55, 56, 58, 60, 62, 64, 65, 67, 69, 70, 72, 74, 76, 77
Company B: 52, 54, 55, 57, 59, 60, 61, 62, 64, 66, 67, 69, 71, 73, 74, 76, 78, 80, 82, 83
Question:
Calculate the covariance between the stock prices of Company A and Company B.
Interpret the value of the covariance and explain the nature of the relationship between
the two stocks.
By analyzing the covariance, the investment analyst can determine whether the stock
prices of Company A and Company B move together (positive covariance) or in opposite
directions (negative covariance). This information can assist in identifying potential
investment opportunities and understanding the diversification benefits of combining
these stocks in a portfolio.
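A minimal sketch with NumPy, using the two price series above (np.cov returns the sample covariance matrix; the off-diagonal entry is the covariance between the two series):
# Sketch: covariance between the closing prices of Company A and Company B
import numpy as np

company_a = [45, 47, 48, 50, 52, 53, 55, 56, 58, 60, 62, 64, 65, 67, 69, 70, 72, 74, 76, 77]
company_b = [52, 54, 55, 57, 59, 60, 61, 62, 64, 66, 67, 69, 71, 73, 74, 76, 78, 80, 82, 83]

cov_matrix = np.cov(company_a, company_b)  # 2x2 sample covariance matrix
print("Covariance:", cov_matrix[0, 1])     # positive: the two prices tend to move together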
Data:
Let's consider the number of hours spent studying and the corresponding exam scores
for a sample of 30 students:
Hours Spent Studying: 10, 12, 15, 18, 20, 22, 25, 28, 30, 32, 35, 38, 40, 42, 45, 48, 50,
52, 55, 58, 60, 62, 65, 68, 70, 72, 75, 78, 80, 82
Exam Scores: 60, 65, 70, 75, 80, 82, 85, 88, 90, 92, 93, 95, 96, 97, 98, 99, 100, 102,
105, 106, 107, 108, 110, 112, 114, 115, 116, 118, 120, 122
Question:
Calculate the correlation coefficient between the hours spent studying and the exam
scores. Interpret the value of the correlation coefficient and explain the nature of the
relationship between studying hours and exam scores.
By analyzing the correlation coefficient, the researcher can determine the strength and
direction of the relationship between studying hours and exam scores. This information
can provide insights into the effectiveness of studying and help students and educators
make informed decisions about study habits and academic performance.
Questions on discrete and continuous random variables
1. Problem: A fair six-sided die is rolled 100 times. What is the probability of rolling
exactly five 3's?
Data: Number of rolls (n) = 100, Probability of rolling a 3 (p) = 1/6, Number of 3's (x) = 5
2. Problem: In a deck of 52 playing cards, five cards are randomly drawn without
replacement. What is the probability of getting two hearts?
Data: Number of hearts in the deck (N) = 13, Number of cards drawn (n) = 5
4. Problem: A bag contains 30 red balls, 20 blue balls, and 10 green balls. Three balls
are drawn without replacement. What is the probability that all three balls are blue?
Data: Number of blue balls in the bag (N) = 20, Number of balls drawn (n) = 3
5. Problem: In a football match, a player scores a goal with a 0.3 probability per shot. If
the player takes 10 shots, what is the probability of scoring exactly three goals?
Data: Number of shots (n) = 10, Probability of scoring per shot (p) = 0.3
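A hedged sketch of how the discrete problems above could be evaluated with scipy.stats (an assumed library): the binomial distribution models a fixed number of independent trials, and the hypergeometric distribution models drawing without replacement.
# Sketch: discrete probabilities for the problems above, using scipy.stats
from scipy.stats import binom, hypergeom

# 1. Exactly five 3's in 100 rolls of a fair die (n=100, p=1/6, k=5)
print(binom.pmf(5, 100, 1/6))

# 2. Exactly two hearts when drawing 5 cards from 52 without replacement
#    hypergeom.pmf(k, M, n, N): M=52 cards, n=13 hearts, N=5 draws
print(hypergeom.pmf(2, 52, 13, 5))

# 4. All three drawn balls are blue (30 red + 20 blue + 10 green = 60 balls, 3 drawn)
print(hypergeom.pmf(3, 60, 20, 3))

# 5. Exactly 3 goals in 10 shots with p = 0.3 per shot
print(binom.pmf(3, 10, 0.3))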
1. Problem: The heights of students in a class are normally distributed with a mean of
165 cm and a standard deviation of 10 cm. What is the probability that a randomly
selected student is taller than 180 cm?
Data: Mean height (μ) = 165 cm, Standard deviation (σ) = 10 cm, Height threshold (x)
= 180 cm
2. Problem: The waiting times at a coffee shop are exponentially distributed with a mean
of 5 minutes. What is the probability that a customer waits less than 3 minutes?
Data: Mean waiting time (μ) = 5 minutes, Waiting time threshold (x) = 3 minutes
3. Problem: The lifetimes of a certain brand of light bulbs are normally distributed with a
mean of 1000 hours and a standard deviation of 100 hours. What is the probability that
a randomly selected light bulb lasts between 900 and 1100 hours?
Data: Mean lifetime (μ) = 1000 hours, Standard deviation (σ) = 100 hours, Lifetime
range (lower limit x1, upper limit x2)
4. Problem: The weights of apples in a basket follow a uniform distribution between 100
grams and 200 grams. What is the probability that a randomly selected apple weighs
between 150 and 170 grams?
Data: Weight range (lower limit x1, upper limit x2)
5. Problem: The time taken to complete a task is exponentially distributed with a mean
of 20 minutes. What is the probability that the task is completed in less than 15
minutes?
Data: Mean time (μ) = 20 minutes, Time threshold (x) = 15 minutes
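A hedged sketch of the continuous cases above with scipy.stats (an assumed library); for the exponential distribution, scipy's scale parameter equals the mean, and for the uniform distribution loc is the lower limit and scale is the width of the interval.
# Sketch: continuous probabilities for the problems above, using scipy.stats
from scipy.stats import norm, expon, uniform

# 1. P(height > 180) for a normal distribution with mean 165 and sd 10
print(norm.sf(180, loc=165, scale=10))

# 2. P(wait < 3 minutes) for an exponential distribution with mean 5
print(expon.cdf(3, scale=5))

# 3. P(900 < lifetime < 1100) for a normal distribution with mean 1000 and sd 100
print(norm.cdf(1100, 1000, 100) - norm.cdf(900, 1000, 100))

# 4. P(150 < weight < 170) for Uniform(100, 200): loc=100, scale=200-100
print(uniform.cdf(170, loc=100, scale=100) - uniform.cdf(150, loc=100, scale=100))

# 5. P(task < 15 minutes) for an exponential distribution with mean 20
print(expon.cdf(15, scale=20))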
Discrete Distribution:
1. Problem: A company sells smartphones, and the number of defects per batch follows
a Poisson distribution with a mean of 2 defects. What is the probability of having exactly
3 defects in a randomly selected batch?
Data: Mean number of defects (λ) = 2, Number of defects (x) = 3
2. Problem: In a game, a player has a 0.3 probability of winning each round. If the
player plays 10 rounds, what is the probability of winning exactly 3 rounds?
Data: Probability of winning (p) = 0.3, Number of rounds (n) = 10, Number of wins (x) = 3
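A short sketch for the Poisson case in problem 1 above (problem 2 is the same binomial calculation, binom.pmf(3, 10, 0.3), shown in the earlier sketch):
# Sketch: Poisson probability for problem 1 above
from scipy.stats import poisson

print(poisson.pmf(3, mu=2))  # P(exactly 3 defects) when the mean defect count is 2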
Continuous Distribution:
1. Problem: The weights of apples in a basket follow a normal distribution with a mean
of 150 grams and a standard deviation of 10 grams. What is the probability that a
randomly selected apple weighs between 140 and 160 grams?
Data: Mean weight (μ) = 150 grams, Standard deviation (σ) = 10 grams, Weight range
(lower limit x1, upper limit x2)
2. Problem: The lifetimes of a certain brand of light bulbs are exponentially distributed
with a mean of 1000 hours. What is the probability that a randomly selected light bulb
lasts more than 900 hours?
Data: Mean lifetime (μ) = 1000 hours, Lifetime threshold (x) = 900 hours
Explanation: In this problem, we use a sample to estimate the population mean height.
By calculating a confidence interval, we provide a range of plausible values for the
population mean. The 95% confidence level indicates that we are 95% confident that
the true population mean height falls within the calculated interval.
Explanation: In this problem, we aim to estimate the population proportion based on the
sample proportion. By constructing a confidence interval, we provide a range of
plausible values for the population proportion. The 90% confidence level indicates that
we are 90% confident that the true population proportion falls within the calculated
interval.
Explanation: In this problem, we are interested in comparing the means of two groups
(new method vs. traditional method). The null hypothesis (H0) states that there is no
significant difference between the means, while the alternative hypothesis (Ha)
suggests that there is a significant difference.
4. Problem: A manufacturing company claims that the average weight of its product is
500 grams. To test this claim, a random sample of 25 products is selected, and their
weights are measured. The sample mean weight is found to be 510 grams with a
sample standard deviation of 20 grams. Perform a hypothesis test to determine if there
is evidence to support the company's claim.
Data: Sample size (n) = 25, Sample mean (x̄) = 510 grams, Sample standard
deviation (s) = 20 grams, Population mean (μ) = 500 grams
Explanation: In this problem, we are conducting a hypothesis test to assess whether the
sample mean weight provides evidence to support the company's claim about the
population mean weight. The null hypothesis (H0) assumes that the population mean
weight is equal to the claimed value, while the alternative hypothesis (Ha) suggests
otherwise.
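One way problem 4 could be worked from the summary statistics is sketched below; a two-sided one-sample t-test and a 5% significance level are assumed here, since the source does not state them.
# Sketch: one-sample t-test for problem 4 from the summary statistics given
import math
from scipy.stats import t

n, x_bar, s, mu0 = 25, 510, 20, 500

t_stat = (x_bar - mu0) / (s / math.sqrt(n))  # (510 - 500) / (20 / 5) = 2.5
p_value = 2 * t.sf(abs(t_stat), df=n - 1)    # two-sided p-value, 24 degrees of freedom

print("t =", t_stat, " p =", round(p_value, 4))
# With an assumed alpha of 0.05, p < 0.05, so the sample mean differs
# significantly from the claimed 500 grams.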
whereas this phrase is only used in 1% of non-spam emails. A new email has just arrived,
which does mention “free money”. What is the probability that it is spam?
13. A crime is committed by one of two suspects, A and B. Initially, there is equal evidence
against both of them. In further investigation at the crime scene, it is found that the guilty
party had a blood type found in 10% of the population. Suspect A does match this blood
type, whereas the blood type of Suspect B is unknown. (a) Given this new information, what
is the probability that A is the guilty party? (b) Given this new information, what is the
probability that B’s blood type matches that found at the crime scene?
14. You are going to play 2 games of chess with an opponent whom you have never played
against before (for the sake of this problem). Your opponent is equally likely to be a
beginner, intermediate, or a master. Depending on
(a) What is your probability of winning the first game?
(b) Congratulations: you won the first game! Given this information, what is the probability
that you will also win the second game?
(c) Explain the distinction between assuming that the outcomes of the games are
independent and assuming that they are conditionally independent given the opponent’s
skill level. Which of these assumptions seems more reasonable, and why?
15. A chicken lays n eggs. Each egg independently does or doesn’t hatch, with probability p
of hatching. For each egg that hatches, the chick does or doesn’t survive (independently of
the other eggs), with probability s of survival. Let N ∼ Bin(n, p) be the number of eggs which
hatch, X be the number of chicks which survive, and Y be the number of chicks which hatch
but don't survive (so X + Y = N). Find the marginal PMF of X, and the joint PMF of X and Y.
Are they independent?
Module 2 and 3 (Excel)
1) Use the AVERAGE function to calculate the average of all three categories of
weight (for this question use the Excel file named average 1).
2) In the Excel file named Average 3, the table contains precipitation measurements
taken in the Rochester, NY area last year; 3 days were sampled in each of the first
three months of 2018. Complete all the questions given in the file.
3) In the Excel file named Count 1, the table shows survey responses; the
respondents could use any value for their answers. Answer all the questions using
the COUNT and COUNTA functions.
4) In the Excel file named COUNT 2, the table represents a bank statement of the
ExcelMaster company. Column E shows the total dollar amount of each of the
accounts. Answer all the questions using the COUNT and COUNTA functions.
5) In the Excel file named COUNT 3, solve all the questions using the COUNT,
COUNTA and COUNTBLANK formulas.
6) In the Excel file named HLOOKUP, solve all the questions using HLOOKUP only.
7) In the Excel file named IF 1, Table A contains names and their respective grades for the
Excel 101 course. Complete column C using only the IF formula.
8) In the Excel file named IF 2, the table is an extract from an accounting system that
contains four journal entries. Check whether each cell in column A matches the
corresponding cell in column B; if they match, return "match", otherwise return "no match".
9) In the Excel file named IF 3, the table contains the names and ages of high school
students; use the IF formula to complete columns D and E. If a student's age is 16 or
above, he/she is eligible for a driver's license; check whether each student is eligible and
answer in column D. If a student is younger than 18 years old, he/she is a minor; check
whether each student is a minor, returning "Minor" for minors and "Adult" otherwise, in
column E.
10) In the Excel file named IF 4, an A+ student gets a 100% scholarship and a non-A+
student gets a 50% scholarship. The table contains the names of students from the 2024
class. Use the IF function to calculate the scholarship amount each of them will get.
11) In the Excel file named Math 1, use the guidelines given in the file to calculate the
required statements.
12) In the Excel file named MAX MIN 1, use the MAX, MIN and AVERAGE formulas to
answer all the questions given in the file.
13) In the file named MAX MIN 2, the table contains the scores of 4 students in a driving
theory test. If a student fails at least one test, she or he needs to retake the course. Use
IF and MAX/MIN to check whether each student passed the test.
14) In the file named MAX MIN 3, if at least one student got 99 points or more on a test,
the test is considered easy. Use MAX and IF to create logic that checks whether the test
was "Easy" or not.
15) In the file named Nested IF 1, the school decided to use the following grade system:
2. Display only the hire date and employee name for each employee.
3. Display the ename concatenated with the job ID, separated by a comma and space, and
name the column Employee and Title
4. Display the hire date, name and department number for all clerks.
5. Create a query to display all the data from the EMP table. Separate each column by a
comma. Name the column THE OUTPUT
6. Display the names and salaries of all employees with a salary greater than 2000.
7. Display the names and dates of employees with the column headers "Name" and "Start
Date"
8. Display the names and hire dates of all employees in the order they were hired.
9. Display the names and salaries of all employees in reverse salary order.
10. Display "ename" and "deptno" of all employees who earned a commission, and display
the salary in reverse order.
11. Display the last name and job title of all employees who do not have a manager
12. Display the last name, job, and salary for all employees whose job is sales representative
or stock clerk and whose salary is not equal to $2,500, $3,500, or $5,000
Database Name: HR
1) Display the maximum, minimum and average salary and commission earned.
2) Display the department number, total salary payout and total commission payout for
each department.
4) Display the department number and total salary of employees in each department.
5) Display the names of employees who don't earn a commission. Order the result set
without using the column name.
8) Display the name and department id of employees who have the same first name as
another employee in the same department.
9) Display the sum of salaries of the employees working under each Manager.
10) Select the manager's name, the count of employees working under them, and the
department ID of the manager.
11) Select the employee name, department id, and salary. Group the result by the
manager name, where the employee's last name has 'a' as its second letter.
12) Display the average of the sum of the salaries, grouping the result by department id.
Order the result by department id.
13) Select the maximum salary of each department along with the department id
14) Display the commission; if not null, display 10% of the salary, and if null, display a default value of 1.
Database Name: HR
1. Write a query that displays the employee's last name using only the 2nd to 5th characters
of the string, with the first letter capitalized and all other letters lowercase. Give each column
an appropriate label.
2. Write a query that displays the employee's first name and last name with a "-" in
between, e.g., first name: Ram; last name: Kumar gives Ram-Kumar. Also display the
month in which the employee joined.
3. Write a query to display the employee's last name and, if half of the salary is greater than
ten thousand, increase the salary by 10%, else by 11.5%, along with a bonus amount of
1500 each. Give each column an appropriate label.
4. Display the employee ID by appending two zeros after the 2nd digit and 'E' at the end,
along with the department id, salary and the manager name all in upper case; if the manager
name contains 'z', replace it with '$'.
5. Write a query that displays the employee's last names with the first letter capitalized and
all other letters lowercase, and the length of the names, for all employees whose name
starts with J, A, or M. Give each column an appropriate label. Sort the results by the
employees' last names
6. Create a query to display the last name and salary for all employees. Format the salary to
be 15 characters long, left-padded with $. Label the column SALARY
9. From the LOCATIONS table, extract the word between the first and second space from
the STREET_ADDRESS column.
10. Extract first letter from First Name column and append it with the Last Name. Also add
"@systechusa.com" at the end. Name the column as e-mail address. All characters should
be in lower case. Display this along with their First Name.
11. Display the names and job titles of all employees with the same job as Trenna.
12. Display the names and department name of all employees working in the same city as
Trenna.
13. Display the name of the employee whose salary is the lowest.
14. Display the names of all employees except the lowest paid.
Database Name: HR
1. Write a query to display the last name, department number, department name for all
employees.
2. Create a unique list of all jobs that are in department 4. Include the location of the
department in the output.
3. Write a query to display the employee last name,department name,location id and city of
all employees who earn commission.
4. Display the employee last name and department name of all employees who have an 'a'
in their last name
5. Write a query to display the last name,job,department number and department name for
all employees who work in ATLANTA.
6. Display the employee last name and employee number along with their manager's last
name and manager number.
7. Display the employee last name and employee number along with their manager's last
name and manager number (including the employees who have no manager).
8. Create a query that displays employees last name,department number,and all the
employees who work in the same department as a given employee.
10. Display the names and hire dates of all employees who were hired before their
managers, along with their manager names and hire dates. Label the columns as employee
name, emp_hire_date, manager name, man_hire_date.
Database Name: AdventureWorks
1. Write a query to display employee numbers and employee name (first name, last name)
of all the sales employees who received an amount of 2000 in bonus.
2. Fetch address details of employees belonging to the state CA. If address is null, provide
default value N/A.
3. Write a query that displays all the products along with the Sales OrderID even if an order
has never been placed for that product.
4. Find the subcategories that have at least two different prices less than $15.
5. A. Write a query to display employees and their manager details. Fetch employee id,
employee first name, and manager id, manager name.
B. Display the employee id and employee name of employees who do not have a manager.
6. A. Display the names of all products of a particular subcategory 15 and the names of their
vendors.
B. Find the products that have more than one vendor.
8. Find sales prices of product 718 that are less than the list price recommended for that
product.
9. Display product number, description and sales of each product in the year 2001.
10. Build on the logic of the above question to extract sales for each category by year. Fetch
Product Name, Sales_2001, Sales_2002, Sales_2003.
2. Create a query to display the employee numbers and last names of all employees who
earn more than the average salary. Sort the results in ascending order of salary.
3. Write a query that displays the employee numbers and last names of all employees who
work in a department with any employee whose last name contains a 'u'.
4. Display the last name, department number, and job ID of all employees whose
department location is ATLANTA.
5. Display the last name and salary of every employee who reports to FILLMORE.
6. Display the department number, last name, and job ID for every employee in the
OPERATIONS department.
7. Modify the above query to display the employee numbers, last names, and salaries of all
employees who earn more than the average salary and who work in a department with any
employee with a 'u' in their name.
8. Display the names of all employees whose job title is the same as anyone in the sales
dept.
10. Write a query to display the top three earners in the EMPLOYEES table. Display their last
names and salaries.
11. Display the names of all employees with their salary and commission earned. Employees
with a null commission should have 0 in the commission column.
12. Display the Managers (name) with top three salaries along with their salaries and
department information.
Date Function
1) Find the date difference between the hire date and resignation_date for all the
employees. Display it in number of days, months and years (e.g., 1 year 3 months 5 days).
Emp_ID Hire Date Resignation_Date
1 1/1/2000 7/10/2013
2 4/12/2003 3/8/2017
3 22/9/2012 21/6/2015
4 13/4/2015 NULL
5 03/06/2016 NULL
6 08/08/2017 NULL
7 13/11/2016 NULL
3) Calculate the experience of each employee till date in years and months (example: 1 year
and 3 months).
6) Fetch the financial year's 15th week's dates (Format: Mon DD YYYY)
7) Find the date that corresponds to the last Saturday of January 2015 using a WITH
clause.
1) Create a pivot table of average prices for transactions. See and follow the steps in file
“power_pivot_question_01”.
2) From the MAM database, import 6 tables and use them to show the quantity sold by
town. See and follow the steps in file “power_pivot_question_02”.
3) Import tables from the Make-a-Mammal database, then hide tables and columns to
create a clean data model. See and follow the steps in file
“power_pivot_question_03”.
4) Import a table then amend it and import others to create data model. See and follow
the steps in file “power_pivot_question_04”.
5) Import tables into PowerPivot, hide tables and columns and create a pivot table and
slicer. See and follow the steps in file “power_pivot_question_05”.
6) Create a pivot table with slicer based on ten different tables. See and follow the steps
in file “power_pivot_question_06”.
7) Create two pivot tables, and two timelines which control both of the pivot tables. See
and follow the steps in file “power_pivot_question_07”.
8) Use slicers to control 2 pivot tables, and Quick Explore to drill down. See and follow
the steps in file “power_pivot_question_08”.
9) Use a timeline to restrict a pivot table to 4 specific quarters. See and follow the steps
in file “power_pivot_question_09”.
10) Create a linked Excel workbook in PowerPivot and use it in relationships. See and
follow the steps in file “power_pivot_question_10”.
11) Import data from Access, Word and Excel, and link an Excel table, to create a
PowerPivot data model. See and follow the steps in file “power_pivot_question_11”.
12) Link to Excel, Access and the clipboard (via Word) to import and link 4 tables. See and
follow the steps in file “power_pivot_question_12”.
13) Link to two Excel workbooks and one SQL Server table in PowerPivot. See and follow
the steps in file “power_pivot_question_13”.
14) Create an aggregator column to sum transaction values by weekday in a pivot table.
See and follow the steps in file “power_pivot_question_14”.
15) Create two new calculated columns in a table, using RELATED and CONCATENATE. See
and follow the steps in file “power_pivot_question_15”.
16) Total sales by weekday, using two simple calculated columns. See and follow the steps
in file “power_pivot_question_16”.
17) Calculate age bands for different dates using the SWITCH function. See and follow the
steps in file “power_pivot_question_17”.
18) Divide shopping centres into the circles of hell, using the IF and the SWITCH functions.
See and follow the steps in file “power_pivot_question_18”.
19) Divide years into bands using SWITCH and calculated columns. See and follow the steps
in file “power_pivot_question_19”.
20) Summarise sales by status of animal, using calculated columns. See and follow the
steps in file “power_pivot_question_20”.
21) Calculate total and average transaction values using measures. See and follow the
steps in file “power_pivot_question_21”.
22) Calculate ratio of area to units for shopping centres using AVERAGEX. See and follow
the steps in file “power_pivot_question_22”.
23) Divide sales by 4 legs and other for a measure, using the FILTER function. See and
follow the steps in file “power_pivot_question_23”.
24) Use ALL and CALCULATE to get proportions of totals for regions. See and follow the
steps in file “power_pivot_question_24”.
25) Use the CALCULATE function to pick out only transactions whose price is a given
amount. See and follow the steps in file “power_pivot_question_25”.
26) Use the CALCULATE function to show percentages of row and column totals in a pivot
table. See and follow the steps in file “power_pivot_question_26”.
27) Create a ratio of sales between two different habitats, using the CALCULATE and SUMX
functions. See and follow the steps in file “power_pivot_question_27”.
28) Create measure using CALCULATE and ALL to get ratios against total. See and follow
the steps in file “power_pivot_question_28”.
29) Exclude a month from totals using the VALUES function to retain context. See and
follow the steps in file “power_pivot_question_29”.
30) Use CALCULATE to work out the ratio of total sales to sales for a specific type of animal.
See and follow the steps in file “power_pivot_question_30”.
31) Use FILTER and ALL to show sales as a proportion of one region's total. See and follow
the steps in file “power_pivot_question_31”.
32) Use the CALCULATE function to show total sales for Northern powerhouse shopping
centres. See and follow the steps in file “power_pivot_question_32”.
33) Use the RANKX function to order total sales over species. See and follow the steps in
file “power_pivot_question_33”.
34) Filter calculated sums to compare two stores' sales figures. See and follow the steps in
file “power_pivot_question_34”.
35) Use AVERAGEX to find average ratios, then CALCULATE to avoid divide-by-zero errors.
See and follow the steps in file “power_pivot_question_35”.
36) Create two measures, showing average price for the South and all other regions. See
and follow the steps in file “power_pivot_question_36”.
37) Exclude a single animal from a pivot table, using CALCULATE combined with the
VALUES function. See and follow the steps in file “power_pivot_question_37”.
38) Sorting total sales into ascending order, using the RANKX function. See and follow the
steps in file “power_pivot_question_38”.
39) Group purchases into shopping centre size bands, using the EARLIER and FILTER
functions. See and follow the steps in file “power_pivot_question_39”.
40) Group sales into size bands using the EARLIER function. See and follow the steps in file
“power_pivot_question_40”.
41) Create a basic report to show a simple table of Abba songs. See and follow the steps
in file “power_pivot_question_41”.
42) Create a matrix and return some appropriate images above. See and follow the steps
in file “power_pivot_question_42”.
43) Create a report listing Game of Thrones episodes, importing two tables. See and follow
the steps in file “power_pivot_question_43”.
44) Load a webpage of the best films, and use this to create a table. See and follow the
steps in file “power_pivot_question_44”.
45) Load FTSE data, and create a report with a table, shape and image. See and follow the
steps in file “power_pivot_question_45”.
46) Use a matrix to compare the number of websites by country and type. See and follow
the steps in file “power_pivot_question_46”.
47) Compare Oscars won by genre and certificate for films using a matrix. See and follow
the steps in file “power_pivot_question_47”.
48) Count the number of world events for each country and year. See and follow the steps
in file “power_pivot_question_48”.
49) Load example tables from a SQL Server database, and use them to create a matrix. See
and follow the steps in file “power_pivot_question_49”.
50) Create relationships between tables using two methods. See and follow the steps in
file “power_pivot_question_50”.
51) Load an Excel workbook of Disney princesses, and create a table from this. See and
follow the steps in file “power_pivot_question_51”.
52) Importing, tidying up and filtering skyscraper data to create a column chart. See and
follow the steps in file “power_pivot_question_52”.
53) Use Query Editor to import and tidy up a list of the richest people. See and follow the
steps in file “power_pivot_question_53”.
54) Use Query Editor to load and tidy up a list of FTSE share prices. See and follow the
steps in file “power_pivot_question_54”.
55) Use the query editor to transform a rubbish data file into something useful. See and
follow the steps in file “power_pivot_question_55”.
56) Use Query Editor to cleanse a list of imported top websites. See and follow the steps
in file “power_pivot_question_56”.
57) Use Query Editor to rename and split columns in a Game of Thrones worksheet. See
and follow the steps in file “power_pivot_question_57”.
58) Load some pivoted forecast data, unpivot it and much more! See and follow the steps
in file “power_pivot_question_58”.
59) Use Query Editor to remove, transform and add columns to a tall buildings list. See and
follow the steps in file “power_pivot_question_59”.
60) Apply a filter and a slicer by continent to a list of most-visited websites. See and follow
the steps in file “power_pivot_question_60”.
61) Apply a page filter to a list of films, then create a slicer by category. See and follow the
steps in file “power_pivot_question_61”.
62) Allow a user to choose pizzas by calorie count and type using slicers. See and follow
the steps in file “power_pivot_question_62”.
63) Create a date, numeric, dropdown and horizontal slicer on a report. See and follow the
steps in file “power_pivot_question_63”.
64) Create a slicer and chart to choose which whale sightings dataset you want to see. See
and follow the steps in file “power_pivot_question_64”.
65) Import skyscraper data, creating a new column and showing this in a chart controlled
by a slicer. See and follow the steps in file “power_pivot_question_65”.
66) Use hidden synced slicers to filter all pages with a single slicer. See and follow the steps
in file “power_pivot_question_66”.
67) Create date and normal slicers on one page to affect visuals on other pages. See and
follow the steps in file “power_pivot_question_67”.
68) Create linked slicers to show a chart of crime statistics. See and follow the steps in file
“power_pivot_question_68”.
69) Enable drill-through for a report to show a breakdown of tests taken. See and follow
the steps in file “power_pivot_question_69”.
70) Compare Pizza Express pizza calories using pie and doughnut charts. See and follow
the steps in file “power_pivot_question_70”.
71) Compare the number of Abba songs released by year using a column chart. See and
follow the steps in file “power_pivot_question_71”.
72) Create a donut chart of population data, and morph this into a tree chart. See and
follow the steps in file “power_pivot_question_72”.
73) Analyse 2018 crime figures for the Manchester area using various visuals. See and
follow the steps in file “power_pivot_question_73”.
74) Compare the heights of skyscrapers by country and city, and create a KPI. See and
follow the steps in file “power_pivot_question_74”.
75) Create a column chart of record sales, and drill-down to a pie chart. See and follow the
steps in file “power_pivot_question_75”.
76) Use grouping in charts to show viewing figures by genre for BBC1. See and follow the
steps in file “power_pivot_question_76”.
77) Compare skills test results using a waterfall chart with breakdown. See and follow the
steps in file “power_pivot_question_77”.
78) Create a bubble chart comparing two sets of numbers, and play it over time to show
changes. See and follow the steps in file “power_pivot_question_78”.
79) Show a chart comparing films when you click on each genre in a tree map. See and
follow the steps in file “power_pivot_question_79”.
80) Compare sales of goods across the UK for large shopping centres. See and follow the
steps in file “power_pivot_question_80”.
81) Create a map showing passenger numbers for UK stations, with drill-down. See and
follow the steps in file “power_pivot_question_81”.
82) Use ArcGIS to generate a heat map showing train passengers in the UK. See and follow
the steps in file “power_pivot_question_82”.
83) Create a map comparing house price sales for expensive houses across the UK. See
and follow the steps in file “power_pivot_question_83”.
84) Create a map to show sales by town for selected regions. See and follow the steps in
file “power_pivot_question_84”.
85) Analyse Brexit voting patterns for the countries of the UK, using Electoral Commission
data. See and follow the steps in file “power_pivot_question_85”.
Module 7 (Numpy)
Module 8 (Pandas and Matplotlib)
Step 10. For the most-ordered item, how many items were
ordered?
Step 13.b. Create a lambda function and change the type of item price
Step 14. How much was the revenue for the period in the
dataset?
Step 15. How many orders were made in the period?
Step 9. How many times did someone order more than one
Canned Soda?
Ex2 - Filtering and Sorting Data
This time we are going to pull data directly from the internet.
Step 7. View only the columns Team, Yellow Cards and Red
Cards and assign them to a dataframe called discipline
Special thanks to: https://github.com/chrisalbon for sharing the dataset and materials.
Step 10. Select every row after the fourth row and all columns
Step 11. Select every row up to the 4th row and all columns
Step 17. Select the third cell in the row named Arizona
Step 18. Select the third cell down in the column named
deaths
Ex - GroupBy
Introduction:
GroupBy can be summarized as Split-Apply-Combine.
Special thanks to: https://github.com/justmarkham for sharing the dataset and materials.
Step 8. Print the mean, min and max values for spirit
consumption.
This time output a DataFrame
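As an illustration of the split-apply-combine pattern applied to Step 8, here is a minimal sketch; the column names continent and spirit_servings are assumptions about the drinks dataset, and the values are made up.

```python
import pandas as pd

# Made-up stand-in for the drinks data (column names are assumptions).
drinks = pd.DataFrame({
    "continent": ["EU", "EU", "AS", "AS", "AF"],
    "spirit_servings": [120, 200, 50, 80, 20],
})

# Split by continent, apply the aggregations, combine into one DataFrame.
summary = drinks.groupby("continent")["spirit_servings"].agg(["mean", "min", "max"])
print(summary)
```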
Occupation
Introduction:
Special thanks to: https://github.com/justmarkham for sharing the dataset and materials.
Have you noticed that the type of Year is int64? However, pandas has a dedicated type for working
with time series. Let's see it now.
Special thanks to Chris Albon for sharing the dataset and materials. All the credit for this
exercise belongs to him.
raw_data_2 = {
'subject_id': ['4', '5', '6', '7', '8'],
'first_name': ['Billy', 'Brian', 'Bran', 'Bryce', 'Betty'],
'last_name': ['Bonder', 'Black', 'Balwner', 'Brice', 'Btisan']}
raw_data_3 = {
'subject_id': ['1', '2', '3', '4', '5', '7', '8', '9', '10', '11'],
'test_id': [51, 15, 15, 61, 16, 14, 15, 1, 61, 16]}
Step 8. Merge only the data that has the same 'subject_id' on
both data1 and data2
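A minimal sketch for Step 8. It reuses raw_data_2 as given above and a hypothetical raw_data_1 with the same columns (the real raw_data_1 is not shown in this document):

```python
import pandas as pd

# Hypothetical stand-in for the raw_data_1 dict that is not shown above.
raw_data_1 = {
    'subject_id': ['1', '2', '3', '4', '5'],
    'first_name': ['Alex', 'Amy', 'Allen', 'Alice', 'Ayoung'],
    'last_name': ['Anderson', 'Ackerman', 'Ali', 'Aoni', 'Atiches']}
# raw_data_2 as given above.
raw_data_2 = {
    'subject_id': ['4', '5', '6', '7', '8'],
    'first_name': ['Billy', 'Brian', 'Bran', 'Bryce', 'Betty'],
    'last_name': ['Bonder', 'Black', 'Balwner', 'Brice', 'Btisan']}

data1 = pd.DataFrame(raw_data_1)
data2 = pd.DataFrame(raw_data_2)

# An inner merge keeps only the subject_ids present in both frames (here '4' and '5').
merged = pd.merge(data1, data2, on='subject_id', how='inner')
print(merged)
```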
Step 13. Get a summary with the mean, min, max, std and
quartiles.
Wind Statistics
Introduction:
The data have been modified to contain some missing values, identified by NaN.
Using pandas should make this exercise easier, in particular for the bonus question.
You should be able to perform all of these operations without using a for loop or other
looping construct.
A sample of the raw data:

Yr Mo Dy   RPT   VAL   ROS   KIL   SHA   BIR   DUB   CLA   MUL   CLO   BEL   MAL
61  1  1 15.04 14.96 13.17  9.29   NaN  9.87 13.67 10.25 10.83 12.58 18.50 15.04
61  1  2 14.71   NaN 10.83  6.50 12.62  7.67 11.50 10.04  9.79  9.67 17.54 13.83
61  1  3 18.50 16.88 12.33 10.13 11.17  6.17 11.25   NaN  8.50  7.67 12.75 12.71
The first three columns are year, month and day. The remaining 12 columns are average
windspeeds in knots at 12 locations in Ireland on that day.
Step 5. Set the right dates as the index. Pay attention to the
data type; it should be datetime64[ns].
Step 15. Calculate the min, max and mean windspeeds and
standard deviations of the windspeeds across all locations for
each week (assume that the first week starts on January 2
1961) for the first 52 weeks.
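A rough sketch for Steps 5 and 15. It assumes the wind data is already in a DataFrame named data with columns Yr, Mo, Dy plus the 12 location columns (as in the sample above), and that the two-digit years are offsets from 1900 (61 -> 1961):

```python
import pandas as pd

# Build a proper datetime index from the Yr/Mo/Dy columns (names from the sample above).
data["date"] = pd.to_datetime(dict(year=data["Yr"] + 1900,
                                   month=data["Mo"], day=data["Dy"]))
data = data.set_index("date").drop(columns=["Yr", "Mo", "Dy"])
print(data.index.dtype)   # datetime64[ns]

# Weekly min/max/mean/std across all locations, first 52 weeks from 1961-01-02.
weekly = (data.loc["1961-01-02":]
              .resample("W")
              .agg(["min", "max", "mean", "std"])
              .head(52))
```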
Visualizing Chipotle's Data
This time we are going to pull data directly from the internet. Special thanks to:
https://github.com/justmarkham for sharing the dataset and materials.
(But feel free to jump right ahead into Section 8 if you want; it doesn't require that you
finish this section.)
Step 7.1 Look at the first line of code in Step 6 and try to figure out whether it
leads to any kind of problem.
Step 7.1.1 Display the first few rows of that DataFrame.
Step 7.1.2 Think about what that piece of code does and display the dtype of UnitPrice
Step 7.1.3 Pull data from online_rt for CustomerIDs 12346.0 and 12347.0.
But "top 3 countries" with respect to what? Two answers suggest themselves: Total sales
volume (i.e. total quantity sold) or total sales (i.e. revenue). This exercise goes for sales
volume, so let's stick to that.
Step 7.2.1 Find out the top 3 countries in terms of sales volume.
Step 7.2.2
Now that we have the top 3 countries, we can focus on the rest of the problem:
"Quantity per UnitPrice by CustomerID".
We need to unpack that.
"by CustomerID" part is easy. That means we're going to be plotting one dot per
CustomerID on our plot. In other words, we're going to be grouping by CustomerID.
We will use this later to figure out an average price per customer.
Step 7.3.2 Group by CustomerID and Country and find out the average price
( AvgPrice ) each customer spends per unit.
But we shouldn't despair! There are two things to realize: 1) The data seem to be skewed
towards the axes (e.g. we don't have any values where Quantity = 50000 and AvgPrice = 5).
So that might suggest a trend. 2) We have more data! We've only been looking at the data
from 3 different countries and they are plotted on different graphs.
So: we should plot the data regardless of Country and hopefully see a less scattered
graph.
Step 7.4.1 Plot the data for each CustomerID on a single graph
Nevertheless, the drop in quantity is so drastic that it makes me wonder how our revenue
changes with respect to item price. It would not be that surprising if it didn't change that
much. But it would be interesting to know whether most of our revenue comes from
expensive or inexpensive items, and what that relationship looks like.
8.3 Plot.
Step 2. Create the DataFrame that should look like the one
below.
   first_name last_name  age  female  preTestScore  postTestScore
0       Jason    Miller   42       0             4             25
1       Molly  Jacobson   52       1            24             94
2        Tina       Ali   36       1            31             57
3        Jake    Milner   24       0             2             62
4         Amy     Cooze   73       1             3             70
This time the size should be 4.5 times the postTestScore and
the color determined by sex
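A minimal matplotlib sketch for this step, assuming the column names shown in the table above (first_name, last_name, age, female, preTestScore, postTestScore):

```python
import matplotlib.pyplot as plt
import pandas as pd

# DataFrame from Step 2 above (column names assumed).
df = pd.DataFrame({
    "first_name": ["Jason", "Molly", "Tina", "Jake", "Amy"],
    "last_name": ["Miller", "Jacobson", "Ali", "Milner", "Cooze"],
    "age": [42, 52, 36, 24, 73],
    "female": [0, 1, 1, 0, 1],
    "preTestScore": [4, 24, 31, 2, 3],
    "postTestScore": [25, 94, 57, 62, 70],
})

# Point size scales with postTestScore; colour encodes sex via the female column.
plt.scatter(df["preTestScore"], df["postTestScore"],
            s=4.5 * df["postTestScore"], c=df["female"])
plt.xlabel("preTestScore")
plt.ylabel("postTestScore")
plt.show()
```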
Step 9. Create a scatter plot with the day as the y-axis and tip
as the x-axis, differentiating the dots by sex
Step 10. Create a box plot presenting the total_bill per day,
differentiating by time (Dinner or Lunch)
Step 11. Create two histograms of the tip value, one for
Dinner and one for Lunch. They must be side by side.
Step 12. Create two scatter plots, one for Male and
another for Female, presenting the relationship between total_bill
and tip, differentiating smokers from non-smokers
Step 6. Create a scatter plot of the Fare paid against Age,
differentiating the point color by gender
Step 5. Add another column called place, and insert what you
have in mind.
Step 8. Oops... it seems the index starts from the most recent date.
Make the first entry the oldest date.
Step 10. What is the difference in days between the first day
and the oldest?
Step 12. Plot the 'Adj Close' value. Set the size of the figure to
13.5 x 9 inches
Step 2. Create your time range (start and end variables). The
start date should be 01/01/2015 and the end should be today
(whatever your today is).
Step 3. Get an API key for one of the APIs that are supported
by Pandas Datareader, preferably for AlphaVantage.
If you do not have an API key for any of the supported APIs, it is easiest to get one for
AlphaVantage. (Note that the API key is shown directly after the signup. You do not receive it
via e-mail.)
(For a full list of the APIs that are supported by Pandas Datareader, see here. As the APIs are
provided by third parties, this list may change.)
Step 4. Use Pandas Datareader to read the daily time series for
the Apple stock (ticker symbol AAPL) between 01/01/2015 and
today, assign it to df_apple and print it.
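A hedged sketch for Steps 2-4, assuming pandas-datareader's AlphaVantage daily source ("av-daily"); YOUR_API_KEY is a placeholder for the key obtained in Step 3:

```python
import datetime
import pandas_datareader.data as web

# Step 2: the time range.
start = datetime.datetime(2015, 1, 1)
end = datetime.datetime.today()

# Step 4: daily AAPL series from AlphaVantage (API key assumed from Step 3).
df_apple = web.DataReader("AAPL", "av-daily", start=start, end=end,
                          api_key="YOUR_API_KEY")
print(df_apple)
```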
Step 6. Repeat the two previous steps for a few other stocks,
always creating a new dataframe: Tesla, IBM and Microsoft.
(Ticker symbols TSLA, IBM and MSFT.)
Step 9. You will notice that it filled the DataFrame with NaN for
months that don't have any data. Let's drop these rows.
Step 10. Good, now we have the monthly data. Now change
the frequency to yearly.
Step 6. Set the values of the first 3 rows of alcohol to NaN
Setting Up
Install Tensorflow (for example via pip install tensorflow) in your console. If you want GPU support (and have an appropriate GPU), you will need to follow extra
steps (see the website). For now, there should be no need since you will usually use your own machine
only for development and small tests.
If you want to do everything in Colab (see below), you don't need to install Tensorflow yourself.
Google Colab
Google Colab (https://colab.research.google.com) is a platform to facilitate teaching of machine
learning/deep learning. There are tutorials available on-site. Essentially, it is a Jupyter notebook
environment with GPU-supported Tensorflow available.
If you want to, you can develop your assignments within this environment. See below for some notes.
Notebooks support Markdown, so you can also write some text about what your code does, your observations
etc. This is a really good idea!
Running code on Colab should be fairly straightforward; there are tutorials available in case you are not
familiar with notebooks. There are just some caveats:
You can check which TF version you are running via tf.__version__ . Make sure this is 2.x!
You will need to get external code (like datasets.py , see below) in there somehow. One option
would be to simply copy and paste the code into the notebook so that you have it locally available.
Another would be to run a cell with from google.colab import files; files.upload() and
choose the corresponding file; this will load it "into the runtime" to allow you to e.g. import from it.
Unfortunately you will need to redo this every time the runtime is restarted.
Later you will need to make data available as well. Since the above method results in temporary files,
the best option seems to be to upload them to Google Drive and use from google.colab import
drive; drive.mount('/content/drive') . You might need to "authenticate" which can be a bit
fiddly. After you succeed, you have your drive files available like a "normal" file system. If you find
better ways to do this (or the above point), please share them with the class!
In newer Colab versions, there is actually a button in the "Files" tab to mount the drive -- should be a bit
simpler than importing drive . The following should work now:
1. Find the folder in your Google Drive where the notebook is stored, by default this should be Colab
Notebooks .
2. Put your data, code (like datasets.py linked in one of the tutorials further below) etc. into the
same folder (feel free to employ a more sophisticated file structure, but this folder should be your
"root").
3. Mount the drive via the button, it should be mounted into /content .
4. Your working directory should be /content ; verify this via os.getcwd() (see also the sketch after this list).
5. Use os.chdir to change your working directory to where the notebook is (and the other files as
well, see step 2), e.g. /content/drive/My Drive/Colab Notebooks .
6. You should now be able to do stuff like from datasets import MNISTDataset in your
notebook (see MNIST tutorial further below).
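A minimal sketch of steps 3 to 6 above. It only runs inside Colab, and the path assumes the default Colab Notebooks folder:

```python
import os
from google.colab import drive  # only available inside Colab

drive.mount('/content/drive')   # or use the button in the "Files" tab instead

# Adjust the path if your notebook/data folder is named differently.
os.chdir('/content/drive/My Drive/Colab Notebooks')
print(os.getcwd())

# Works if datasets.py (linked in the tutorials below) sits in this folder.
from datasets import MNISTDataset
```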
Tensorflow Basics
NOTE: The Tensorflow docs went through significant changes recently. In particular, most introductory
articles were changed from using low-level interfaces to high-level ones (particularly Keras). We believe it's
better to start with low-level interfaces that force you to program every step of building/training a model
yourself. This way, you actually need to understand what is happening in the code. High-level interfaces do
a lot of "magic" under the hood. We will proceed to these interfaces after you learn the basics.
Get started with Tensorflow. There are many tutorials on diverse topics on the website, as well as an API
documentation (https://www.tensorflow.org/api_docs/python/tf). The following should suffice for now:
Play around with the example code snippets. Change them around and see if you can predict what's going
to happen. Make sure you understand what you're dealing with!
Next, you should explore this model: Experiment with different hidden layer sizes, activation functions or
weight initializations. See if you can make any observations on how changing these parameters affects the
model's performance. Going to extremes can be very instructive here. Make some plots!
Also, reflect on the Tensorflow interface: If you followed the tutorials you were asked to, you have been
using a very low-level approach to defining models as well as their training and evaluation. Which of these
parts do you think should be wrapped in higher-level interfaces? Do you feel like you are forced to provide
any redundant information when defining your model? Any features you are missing so far?
Bonus
There are numerous ways to explore your model some more. For one, you could add more hidden layers
and see how this affects the model. You could also try your hand at some basic visualization and model
inspection: For example, visualize some of the images your model classifies incorrectly. Can you find out
why your model has trouble with these?
You may also have noticed that MNIST isn't a particularly interesting dataset -- even very simple models
can reach very high accuracy and there isn't much "going on" in the images. Luckily, Zalando Research has
developed Fashion MNIST (https://github.com/zalandoresearch/fashion-mnist). This is a more interesting
dataset with the exact same structure as MNIST, meaning you can use it without changing anything about
your code. You can get it by simply using tf.keras.datasets.fashion_mnist instead of regular MNIST.
You can attempt pretty much all of the above suggestions for this dataset as well!
If you work on Colab, make sure to save your notebooks with outputs! Under Edit -> Notebook settings,
make sure the box with "omit code cell output..." is *not* ticked.
You can form groups to do the assignments (up to three people). However, each group member needs to
upload the solution separately (because the Moodle group feature is a little broken). At the very top of the
notebook, include the names of all group members!
Datasets
It should go without saying that loading numpy arrays and taking slices of these as batches (as
we did in the last assignment) isn't a great way of providing data to the training algorithm. For
example, what if we are working with a dataset that doesn't fit into memory?
The recommended way of handling datasets is via the tf.data module. Now is a good time to
take some first steps with this module. Read the Programmer's Guide section on this. You can
ignore the parts on high-level APIs as well as anything regarding TFRecords and tf.Example
(we will get to these later) as well as specialized topics involving time series etc. If this is still too
much text for you, here is a super short version that just covers building a dataset from numpy
arrays (ignore the part where they use Keras ;)). For now, the main thing is that you understand
how to do just that.
Then, try to adjust your MLP code so that it uses tf.data to provide minibatches instead of
the class in datasets.py . Keep in mind that you should map the data into the [0,1] range
(convert to float!) and convert the labels to int32 (check the old MNISTDataset class for
possible preprocessing)!
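A minimal sketch of such a pipeline, built directly from the Keras MNIST arrays; shapes, buffer size and batch size are arbitrary choices:

```python
import numpy as np
import tensorflow as tf

(x_train, y_train), _ = tf.keras.datasets.mnist.load_data()
x_train = x_train.reshape(-1, 784).astype(np.float32) / 255.0  # map into [0, 1]
y_train = y_train.astype(np.int32)                             # labels as int32

train_ds = (
    tf.data.Dataset.from_tensor_slices((x_train, y_train))
    .shuffle(buffer_size=10000)
    .batch(128)
    .repeat()
)

for batch_x, batch_y in train_ds.take(1):
    print(batch_x.shape, batch_y.dtype)  # (128, 784) int32
```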
Here you can find a little notebook that displays some basic tf.data stuff (also for MNIST).
Note that the Tensorflow guide often uses the three operations shuffle , batch and
repeat . Think about how the results differ when you change the order of these operations
(there are six orderings in total). You can experiment with a simple Dataset.range dataset.
What do you think is the most sensible order?
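A small probe you could use for this experiment; it only contrasts two of the six orderings and makes no claim about which one is best:

```python
import tensorflow as tf

ds = tf.data.Dataset.range(6)

a = ds.shuffle(6).batch(3).repeat(2)   # shuffles elements, then forms batches
b = ds.batch(3).shuffle(2).repeat(2)   # forms batches first, then shuffles whole batches

print([x.numpy().tolist() for x in a])
print([x.numpy().tolist() for x in b])
```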
During training, run summary ops for anything you are interested in, e.g.
Usually scalars for loss and other metrics (e.g. accuracy).
Distributions/histograms of layer activations or weights.
Images that show what the data looks like.
Run TensorBoard on the log directory.
Later, we will also see how to use TensorBoard to visualize the computation graph of a model.
Finally, check out the github readme for more information on how to use the TensorBoard app
itself (first part of the "Usage" section is outdated -- this is not how you create a file writer
anymore).
Note: You don't need to hand in any of the above -- just make sure you get TensorBoard to
work.
These scripts/models are relatively simple -- you should be able to run them on your local
machine as long as it's not too ancient. Of course you will need to have the necessary
libraries installed.
Please don't mess with the parameters of the network or learning algorithm before
experiencing the original. You can of course use any oddities you notice as clues as to what
might be going wrong.
Sometimes it can be useful to have a look at the inputs your model actually receives.
tf.summary.image helps here. Note that you need to reshape the inputs from vectors to
28x28-sized images and add an extra axis for the color channel (despite there being only
one channel). Check out tf.reshape and tf.expand_dims .
Otherwise, it should be helpful to visualize histograms/distributions of layer activations and
see if anything jumps out. Note that histogram summaries will crash your program in case
of nan values appearing. In this case, see if you can do without the histograms and use
other means to find out what is going wrong.
You should also look at the gradients of the network; if these are "unusual" (i.e. extremely
small or large), something is probably wrong. An overall impression of a gradient's size can
be gained via tf.norm(g) ; feel free to add scalar summaries of these values to
TensorBoard. You can pass a name to the variables when defining them and use this to
give descriptive names to your summaries.
Some things to watch out for in the code: Are the activation functions sensible? What about
the weight initializations? Do the inputs/data look "normal"?
Note: The final two scripts (4 and 5) may actually work somewhat, but performance should
still be significantly below what could be achieved by a "correct" implementation.
What to Hand In
An MLP training script using tf.data .
A description of what the six shuffle/batch/repeat orderings do (on a conceptual
level) and which one you think is the most sensible for training neural networks.
For each "failed" script above, a description of the problem as well as how to fix it (there
may be multiple ways). You can just write some text here (markdown cells!), but feel free to
reinforce your ideas with some code snippets.
Anything else you feel like doing. :)
Note that you can only upload a single notebook on Moodle, so use the Markdown features
(text cells instead of code) to answer the text-based questions.
Bonus
Like last week, play around with the parameters of your networks. Use Tensorboard to get more
information about how some of your choices affect behavior. For example, you could compare
the long-term behavior of saturating functions such as tanh with relu, how the gradients vary for
different architectures etc.
If you want to get deeper into the data processing side of things, check the Performance Guide.
Peer Quizzes: Follow the registration instructions provided on the theory exercise channel on
Mattermost. Contribute to the PeerQuiz platform, and answer and rate as many of the questions
posted by peers as you want.
Keras
The low-level TF functions we used so far are nice for having full control over everything that is
happening, but they are cumbersome to use when we just need "everyday" neural network
functionality. For such cases, Tensorflow has integrated Keras to provide abstractions over many
common workflows. Keras has tons of stuff in it; we will only look at some of it for this
assignment and get to know more over the course of the semester. In particular:
tf.keras.layers provides wrappers for many common layers such as dense (fully
connected) or convolutional layers. This takes care of creating and storing weights, applying
activation functions, regularizers etc.
tf.keras.Model in turn wraps layers into a cohesive whole that allows us to handle
whole networks at once.
tf.optimizers make training procedures such as gradient descent simple.
tf.losses and tf.metrics allow for easier tracking of model performance.
Unfortunately, none of the TF tutorials are quite what we would like here, so you'll have to mix-
and-match a little bit:
This tutorial covers most of what we need, i.e. defining a model and using it in a custom
training loop, along with optimizers, metrics etc. You can skip the part about GANs. Overall,
the loop works much the same as before, except:
You now have all model weights conveniently in one place.
You can use the built-in loss functions, which are somewhat less verbose than
tf.nn.sparse_softmax_cross_entropy_with_logits .
You can use Optimizer instances instead of manually subtracting gradients from
variables.
You can use metrics to keep track of model performance.
There are several ways to build Keras models, the simplest one being Sequential . For
additional examples, you can look at the top of this tutorial, or this one, or maybe this one...
In each, look for the part model = tf.keras.Sequential... . You just put in a list of
layers that will be applied in sequence. Check the API to get an impression of what layers
there are and which arguments they take.
Later, we will see how to wrap entire model definitions, training loops and evaluations in a
handful of lines of code. For now, you might want to rewrite your MLP code with these Keras
functions and make sure it still works as before.
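For orientation, a minimal sketch of such a rewrite: a Sequential MLP plus a "custom" training step that uses Keras losses, optimizers and metrics. All layer sizes and hyperparameters are arbitrary choices, not prescribed by the assignment:

```python
import tensorflow as tf

# Model: layer sizes are arbitrary choices.
model = tf.keras.Sequential([
    tf.keras.layers.Dense(256, activation="relu", input_shape=(784,)),
    tf.keras.layers.Dense(128, activation="relu"),
    tf.keras.layers.Dense(10),                     # logits, no activation
])

loss_fn = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
optimizer = tf.keras.optimizers.SGD(learning_rate=0.1)
accuracy = tf.keras.metrics.SparseCategoricalAccuracy()

def train_step(images, labels):
    # One "custom" training step: forward pass, loss, gradients, update, metric.
    with tf.GradientTape() as tape:
        logits = model(images, training=True)
        loss = loss_fn(labels, logits)
    grads = tape.gradient(loss, model.trainable_variables)
    optimizer.apply_gradients(zip(grads, model.trainable_variables))
    accuracy.update_state(labels, logits)
    return loss
```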
You should have seen that (with Keras) modifying layer sizes, changing activation functions etc.
is simple: You can generally change parts of the model without affecting the rest of the program
(training loop etc). In fact, you can change the full pipeline from input to model output without
having to change anything else (restrictions apply).
Replace your MNIST MLP by a CNN. The tutorials linked above might give you some ideas for
architectures. Generally:
Your data needs to be in the format width x height x channels . So for MNIST, make
sure your images have shape (28, 28, 1) , not (784,) !
Apply a bunch of Conv2D and possibly MaxPool2D layers.
Flatten .
Apply any number of Dense layers and the final classification (logits) layer.
Use Keras!
A reference CNN implementation without Keras can be found here!
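A minimal Keras CNN sketch along these lines; the filter counts and sizes are just examples, not a reference architecture:

```python
import tensorflow as tf

cnn = tf.keras.Sequential([
    tf.keras.layers.Conv2D(32, 3, activation="relu", input_shape=(28, 28, 1)),
    tf.keras.layers.MaxPool2D(),
    tf.keras.layers.Conv2D(64, 3, activation="relu"),
    tf.keras.layers.MaxPool2D(),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(128, activation="relu"),
    tf.keras.layers.Dense(10),   # final classification (logits) layer
])
cnn.summary()
```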
Note: Depending on your machine, training a CNN may take much longer than the MLPs
we've seen so far. Here, using Colab's GPU support could be useful (Edit -> Notebook
settings -> Hardware Accelerator). Also, processing the full test set in one go for
evaluation might be too much for your RAM. In that case, you could break up the test set
into smaller chunks and average the results (easy using keras metrics) -- or just make the
model smaller.
You should consider using a better optimization algorithm than the basic SGD . One option is to
use adaptive algorithms, the most popular of which is called Adam. Check out
tf.optimizers.Adam . This will usually lead to much faster learning without manual tuning of
the learning rate or other parameters. We will discuss advanced optimization strategies later in
the class, but the basic idea behind Adam is that it automatically chooses/adapts a per-
parameter learning rate as well as incorporating momentum. Using Adam, your CNN should
beat your MLP after only a few hundred steps of training. The general consensus is that a well-
tuned gradient descent with momentum and learning rate decay will outperform adaptive
methods, but you will need to invest some time into finding a good parameter setting -- we will
look into these topics later.
If your CNN is set up well, you should reach extremely high accuracy results. This is arguably
where MNIST stops being interesting. If you haven't done so, consider working with Fashion-
MNIST instead (see Assignment 1). This should present more of a challenge and make
improvements due to hyperparameter tuning more obvious/meaningful. You could even try
CIFAR10 or CIFAR100 as in one of the tutorials linked above. They have 32x32 3-channel color
images with much more variation. These datasets are also available in tf.keras.datasets .
Note: For some reason, the CIFAR labels are organized somewhat differently -- shaped (n, 1)
instead of just (n,) . You should do something like labels = labels.reshape((-1,)) or
this will mess up the loss function.
What to Hand In
A CNN (built with Keras) trained on MNIST (or not, see below). Also use Keras losses,
optimizers and metrics, but do still use a "custom" training loop (with GradientTape ).
You are highly encouraged to move past MNIST at this point. E.g. switching to CIFAR takes
minimal effort since it can also be downloaded through Keras. You can still use MNIST as a
"sanity check" that your model is working, but you can skip it for the submission.
Document any experiments you try. For example:
Really do play with the model parameters. As a silly example, you could try increasing
your filter sizes up to the input image size -- think about what kind of network you are
ending up with if you do this! On the other extreme, what about 1x1 filters?
You can do the same thing for pooling. Or replace pooling with strided
convolutions. Or...
If you're bored, just try to achieve as high of a (test set) performance as you can on
CIFAR. This dataset is still commonly used as a benchmark today. Can you get ~97%
(test set)?
You could try to "look into" your trained models. E.g. the convolutional layers output
"feature maps" that can also be interpreted as images (and thus plotted one image per
filter). You could use this to try to figure out what features/patterns the filters are
recognizing by seeing for what inputs they are most active.
Graph-based Execution
So far, we have been using so-called "eager execution" exclusively: Commands are run as they
are defined, i.e. writing y = tf.matmul(X, w) actually executes the matrix multiplication.
In Tensorflow 1.x, things used to be different: Lines like the above would only define the
computation graph but not do any actual computation. This would be done later in dedicated
"sessions" that execute the graph. Later, eager execution was added as an alternative way of
writing programs and is now the default, mainly because it is much more intuitive/allows for a
more natural workflow when designing/testing models.
Graph execution has one big advantage: It is very efficient because entire models (or even
training loops) can be executed in low-level C/CUDA code without ever going "back up" to
Python (which is slow). As such, TF 2.0 still retains the possibility to run stuff in graph mode if
you so wish -- let's have a look!
As expected, there is a tutorial on the TF website as well as this one, which goes into extreme
depth on all the subtleties. The basic gist is:
You can annotate a Python function with @tf.function to "activate" graph execution for
this function.
The first time this function is called, it will be traced and converted to a graph.
Any other time this function is called, the Python function will not be run; instead the traced
graph is executed.
The above is not entirely true -- functions may be retraced under certain (important)
conditions, e.g. for every new "input signature". This is treated in detail in the article linked
above.
Beware of using Python statements like print , these will not be traced so the statement
will only be called during the tracing run itself. If you want to print things like tensor values,
use tf.print instead. Basically, traced TF functions only do "tensor stuff", not general
"Python stuff".
Go back to some of your previous models and sprinkle some tf.function annotations in
there. You might need to refactor slightly -- you need to actually wrap things into a function!
The most straightforward target for decoration is a "training step" function that takes a
batch of inputs and labels, runs the model, computes the loss and the gradients and applies
them.
In theory, you could wrap a whole training loop (including iteration over a dataset) with a
tf.function . If you can get this to work on one of your previous models and actually get
a speedup, you get a cookie. :)
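Before decorating a full training step, here is a tiny self-contained example of the tracing behaviour and the print vs. tf.print difference mentioned above:

```python
import tensorflow as tf

@tf.function
def squared_error(y_true, y_pred):
    print("python print: runs only while tracing")   # executed once per new signature
    tf.print("tf.print: runs on every call")          # part of the traced graph
    return tf.reduce_mean(tf.square(y_true - y_pred))

x = tf.random.normal([8])
y = tf.random.normal([8])
squared_error(x, y)   # first call: function is traced, both messages appear
squared_error(x, y)   # second call: only the tf.print message appears
```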
DenseNet
Previously, we saw how to build neural networks in a purely sequential manner -- each layer
receives one input and produces one output that serves as input to the next layer. There are
many architectures that do not follow this simple scheme. You might ask yourself how this can
be done in Keras. One answer is via the so-called functional API. There is an in-depth guide here.
Reading just the intro should be enough for a basic grasp on how to use it, but of course you
can read more if you wish.
Next, use the functional API to implement a DenseNet. You do not need to follow the exact same
architecture, in fact you will probably want to make it smaller for efficiency reasons. Just make
sure you have one or more "dense blocks" with multiple layers (say, three or more) each. You
can also leave out batch normalization (this will be treated later in the class) as well as
"bottleneck layers" (1x1 convolutions) if you want.
Bonus: Can you implement DenseNet with the Sequential API? You might want to look at how to
implement custom layers (shorter version here)...
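For the main (functional API) task, a rough sketch of one small "dense block"; the sizes, depths and surrounding architecture are placeholders rather than the original DenseNet:

```python
import tensorflow as tf

def dense_block(x, num_layers=3, growth_rate=12):
    # Each layer sees the concatenation of all previous outputs in the block.
    for _ in range(num_layers):
        y = tf.keras.layers.Conv2D(growth_rate, 3, padding="same",
                                   activation="relu")(x)
        x = tf.keras.layers.Concatenate()([x, y])
    return x

inputs = tf.keras.Input(shape=(32, 32, 3))          # e.g. CIFAR-sized images
x = tf.keras.layers.Conv2D(32, 3, padding="same", activation="relu")(inputs)
x = dense_block(x)
x = tf.keras.layers.AveragePooling2D()(x)
x = dense_block(x)
x = tf.keras.layers.GlobalAveragePooling2D()(x)
outputs = tf.keras.layers.Dense(10)(x)              # logits

densenet = tf.keras.Model(inputs, outputs)
densenet.summary()
```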
What to Hand In
DenseNet. Thoroughly experiment with (hyper)parameters. Try to achieve the best
performance you can on CIFAR10/100.
For your model(s), compare performance with and without tf.function . You can also do
this for non-DenseNet models. How does the impact depend on the size of the models?
The next two parts are just here for completeness/reference, to show other ways of working with
Keras and some additional TensorBoard functionalities. Check them out if you want -- we will
also (try to) properly present them in the exercise later.
The gist is covered in the beginner quickstart: Build the model, compile with an optimizer, a
loss and optional metrics and then run fit on a dataset. That's it!
They also have the same thing with a bit more detail.
The above covers the bare essentials, but you could also look at how to build a CNN for
CIFAR10.
There are also some interesting overview articles in the "guide" section but this should suffice
for now. Once again, go back to your previous models and re-make them with these high-level
training loops! Also, from now on, feel free to run your models like this if you want (and can get
it to work for your specific case).
To look at computation graphs, you need to trace computations explicitly. See the last part of
this guide for how to trace tf.function -annotated computations. Note: It seems like you
have to perform the trace the first time the function is called (e.g. on the first training step).
In part 1, we are mainly concerned with implementing RNNs at the low level so that we
understand how they work in detail. The models themselves will be rather rudimentary. We will
also see the kinds of problems that arise when working with sequence data, specifically text.
Next week, we will build better models and deal with some of these issues.
The notebook associated with the practical exercise can be found here.
The Data
We will be using the IMDB movie review dataset. This dataset comes with Keras and consists of
50,000 movie reviews with binary labels (positive or negative), divided into training and testing
sets of 25,000 sequences each.
A first look
The data can be loaded the same way as MNIST or CIFAR --
tf.keras.datasets.imdb.load_data() . If you print the sequences, however, you will see
that they are numbers, not text. Recall that deep learning is essentially a pile of linear algebra. As
such, neural networks cannot take text as input, which is why it needs to be converted to
numbers. This has already been done for us -- each word has been replaced by a number, and
thus a movie review is a sequence of numbers (punctuation has been removed).
Representing words
Our sequences are numbers, so they can be put into a neural network. But does this make
sense? Recall the kind of transformations a layer implements: a linear map followed by an
(optional) non-linearity. But that would mean, for example, that the word represented by index
10 would be "10 times as much" as the word represented by index 1. And if we simply swapped
the mapping (which we can do, as it is completely arbitrary), the roles would be reversed!
Clearly, this does not make sense.
A simple fix is to use one-hot vectors: Replace a word index by a vector with as many entries as
there are words in the vocabulary, where all entries are 0 except the one corresponding to the
word's index, which is set to 1.
Thus, each word gets its own "feature dimension" and can be transformed separately. With this
transformation, our data points are now sequences of one-hot vectors, with shape
(sequence_length, vocabulary_size) .
In the notebook, this is done in a rather crude way: All sequences are padded to the length of
the longest sequence in the dataset.
Food for thought #1: Why is this wasteful? Can you think of a smarter padding scheme that is
more efficient? Consider the fact that RNNs can work on arbitrary sequence lengths, and that
training minibatches are pretty much independent of each other.
1. Some sequences are very long. This increases our computation time as well as massively
hampering gradient flow. It is highly recommended that you limit the sequence length (200
could be a good start). You have two choices:
A. Truncate sequences by cutting off all words beyond a limit. Both load_data and
pad_sequences have arguments to do this. We recommend the latter as you can
choose between "pre" or "post" truncation.
B. Remove all sequences that are longer than a limit from the dataset. Radical!
2. Our vocabulary is large, more than 85,000 words. Many of these are rare words which only
appear a few times. There are two reasons why this is problematic:
A. The one-hot vectors are huge, slowing down the program and eating memory.
B. It's difficult for the network to learn useful features for the rare words.
load_data has an argument to keep only the n most common words and replace less
frequent ones by a special "unknown word" token (index 2 by default). As a start, try
keeping only the 20,000 most common words or so.
Food for thought #2: Between truncating long sequences and removing them, which option do
you think is better? Why?
Food for thought #3: Can you think of a way to avoid the one-hot vectors completely? Even if
you cannot implement it, a conceptual idea is fine.
With these issues taken care of, we should be ready to build an RNN!
For this assignment, you are asked not to use the RNNCell classes nor any related Keras
functionality. Instead, you should study the basic RNN equations and "just" translate these into
code. You can still use Keras optimizers, losses etc. You can also use Dense layers instead of
low-level ops, but make sure you know what you are doing. You might want to proceed as
follows:
On a high level, nothing about the training loop changes! The RNN gets an input and
computes an output. The loss is computed based on the difference between outputs and
targets, and gradients are computed and applied to the RNN weights, with the loss being
backpropagated through time.
The differences come in how the RNN computes its output. The basic recurrency can be
seen in equation 10.5 of the deep learning book, with more details in equations 10.8-10.11.
The important idea is that, at each time step, the RNN essentially works like an MLP with a
single hidden layer, but two inputs (last state and current input). In total, you need to "just" (there is a rough sketch after this list):
Loop over the input, at each time step taking the respective slice. Your per-step input
should be batch x features just like with an MLP!
At each time step, compute the new state based on the previous state as well as the
current input.
Compute the per-step output based on the new state.
What about comparing outputs to targets? Our targets are simple binary labels. On the
other hand, we have one output per time step. The usual approach is to discard all outputs
except the one for the very last step. Thus, this is a "many-to-one" RNN (compare figure
10.5 in the book).
For the output and loss, you actually have two options:
1. You could have an output layer with 2 units, and use sparse categorical cross-entropy
as before (i.e. softmax activation). Here, whichever output is higher "wins".
2. You can have a single output unit and use binary cross-entropy (i.e. sigmoid activation).
Here, the output is usually thresholded at 0.5.
Food for thought #4: How can it be that we can choose how many outputs we have, i.e. how
can both be correct? Are there differences between both choices as well as (dis)advantages
relative to each other?
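The rough sketch referred to above: a low-level implementation of the per-step recurrence, using a zero initial state and a single output unit (option 2). All sizes are illustrative, and Dense layers could replace the explicit weight matrices:

```python
import tensorflow as tf

hidden_size = 64
vocab_size = 20000

W_xh = tf.Variable(tf.random.normal([vocab_size, hidden_size], stddev=0.01))
W_hh = tf.Variable(tf.random.normal([hidden_size, hidden_size], stddev=0.01))
b_h = tf.Variable(tf.zeros([hidden_size]))
W_hy = tf.Variable(tf.random.normal([hidden_size, 1], stddev=0.01))
b_y = tf.Variable(tf.zeros([1]))

def rnn_forward(inputs):
    # inputs: (batch, time, vocab_size) one-hot sequences.
    state = tf.zeros([tf.shape(inputs)[0], hidden_size])     # zero initial state
    for t in range(inputs.shape[1]):
        x_t = inputs[:, t, :]                                 # (batch, features) slice
        state = tf.tanh(x_t @ W_xh + state @ W_hh + b_h)      # new state from old state + input
    return state @ W_hy + b_y                                 # single logit from the last state

# Tiny smoke test: batch of 2 random "one-hot" sequences with 5 time steps.
dummy = tf.one_hot(tf.random.uniform([2, 5], maxval=vocab_size, dtype=tf.int32), vocab_size)
print(rnn_forward(dummy).shape)   # (2, 1)
```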
Open Problems
Initial state
To compute the state at the first time step, you would need a "previous state", but there is none.
To fix this, you can define an "initial state" for the network. A common solution is to simply use a
tensor filled with zeros. You could also add a trainable variable and learn an initial state instead!
Food for thought #5: All sequences start with the same special "beginning of sequence" token
(coded by index 1). Given this fact, is there a point in learning an initial state? Why (not)?
Food for thought #6: pad_sequences allows for pre or post padding. Try both to see the
difference. Which option do you think is better? Recall that we use the final time step output
from our model.
Food for thought #7: Can you think of a way to prevent the RNN from computing new states
on padded time steps? One idea might be to "pass through" the previous state in case the
current time step is padding. Note that, within a batch, some sequences might be padded for a
given time step while others are not.
Slow learning
Be aware that it might take several thousand steps for the loss to start moving at all, so don't
stop training too early if nothing is happening. Experiment with weight initializations and
learning rates. For fast learning, the goal is usually to set them as large as possible without the
model "exploding".
A major issue with our "last output summarizes the sequence" approach is that the information
from the end has to backpropagate all the way to the early time steps, which leads to extreme
vanishing gradient issues. You could try to use the RNN output more effectively. Here are some
ideas:
Instead of only using the final output, average (or sum?) the logits (pre-sigmoid) of all time
steps and use this as the output instead.
Instead of the logits, average the states at all time steps and compute the output based on
this average state. Is this different from the above option?
Compute logits and sigmoids for each output, and average the per-step probabilities.
Food for thought #8: What could be the advantage of using methods like the above? What are
disadvantages? Can you think of other methods to incorporate the full output sequence instead
of just the final step?
What to hand in
A low-level RNN implementation for sentiment classification. If you can get it to move away
from 50% accuracy on the training set, that's a success. Be wary of overfitting, however, as
this doesn't mean that the model is generalizing! If the test (or validation) loss isn't moving,
try using a smaller network. Also note that you may sometimes get a higher test accuracy,
while the test loss is also increasing (how can this be?)!
Consider the various questions posed throughout the assignment and try to answer them!
You can use text cells to leave short answers in your notebook.
The notebook associated with the practical exercise can be found here.
Thus, if we can find a way to delay the padding until after we have formed batches, we can gain some
efficiency. Unfortunately, because the sequences have different lengths, we cannot even create a
tf.data.Dataset.from_tensor_slices to apply batching to!
Luckily, there are other ways to create datasets. We will be using from_generator , which
allows for creating datasets from elements returned by arbitrary Python generators. Even better,
there is also a padded_batch transformation function which batches inputs and pads them to
the longest length in the batch (what would happen if we tried the regular batch method?).
See the notebook for a usage example!
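A minimal sketch of this approach using the IMDB data loaded via Keras; the vocabulary size follows the earlier suggestion and the batch size is arbitrary:

```python
import tensorflow as tf

(x_train, y_train), _ = tf.keras.datasets.imdb.load_data(num_words=20000)

def gen():
    # Yields one variable-length integer sequence and its label at a time.
    for seq, label in zip(x_train, y_train):
        yield seq, label

ds = tf.data.Dataset.from_generator(
    gen,
    output_signature=(
        tf.TensorSpec(shape=(None,), dtype=tf.int32),
        tf.TensorSpec(shape=(), dtype=tf.int32),
    ),
)

# padded_batch pads each batch only to the longest sequence within that batch.
ds = ds.shuffle(25000).padded_batch(32, padded_shapes=([None], []))
```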
Note: Tensorflow also has RaggedTensor . These are special tensors allowing different shapes
per element. You can find a guide here. You could directly create a dataset
from_tensor_slices by supplying a ragged tensor, which is arguably easier than using a
generator. Unfortunately, ragged tensors are not supported by padded_batch . Sad!
However, many tensorflow operations support ragged tensors, so padding can become
unnecessary in many places! You can check the guide for an example with a Keras model. You
can try this approach if you want, but for the rest of the assignment we will continue with the
padded batches (the ragged version will likely be very slow).
Level 2: Bucketing
There is still a problem with the above approach. In our dataset, there are many short sequences
and few very long ones. Unfortunately, it is very likely that all (or most) batches contain at least
one rather long sequence. That means that all the other (short) sequences have to be padded to
the long one! Thus, in the worst case, our per-batch padding might only gain us very little. It
would be great if there was a way to sort the data such that only sequences of a similar length
are grouped in a batch... Maybe there is something in the notebook?
Note: If you truncated sequences to a relatively small value, like 200, bucketing may provide
little benefit. The reason is that there will be so many sequences at the exact length 200 that the
majority of batches will belong to this bucket. However, if you decide to allow a larger value, say
length 500, bucketing should become more and more effective (noticeable via shorter time
spent per batch).
Embeddings
Previously, we represented words by one-hot vectors. This is wasteful in terms of memory, and
also the matrix products in our deep models will be very inefficient. It turns out, multiplying a
matrix with a one-hot vector simply extracts the corresponding column from the matrix.
Keras offers an Embedding layer for an efficient implementation. Use this instead of the one-
hot operation! Note that the layer adds additional parameters, however it can actually result in
fewer parameters overall if you choose a small enough embedding size (recall the lecture
discussion on using linear hidden layers).
RNNs in Keras
Keras offers various RNN layers. These layers take an entire 3d batch of inputs ( batch x time
x features ) and return either a complete output sequence, or only the final output time step.
There are two ways to use RNNs:
1. The more general is to define a cell which implements the per-step computations, i.e. how
to compute a new state given a previous state and current input. There are pre-built cells
for simple RNNs, GRUs and LSTMs ( LSTMCell etc.). The cells are then put into the RNN
layer which wraps them in a loop over time.
2. There are also complete classes like LSTM which already wrap the corresponding cell.
While the first approach gives more flexibility (we could define our own cells), it is highly
recommended that you stick with the second approach, as this provides highly optimized
implementations for common usage scenarios. Check the docs for the conditions under which
these implementations can be used!
Once you have an RNN layer, you can use it just like other layers, e.g. in a sequential model.
Maybe you have an embedding layer, followed by an LSTM, followed by a dense layer that
produces the final output. Now, you can easily create stacked RNNs (just put multiple RNN
layers one after the other), use Bidirectional RNNs, etc. Also try LSTMs vs GRUs!
Masking
One method to prevent new states being computed on padded time steps is by using a mask. A
mask is a binary tensor with shape (batch x time) with 1s representing "real" time steps and
0s representing padding. Given such a mask, the state computation can be "corrected" like this:
Where the mask is 1, the new state will be used. Where it is 0, the old state will be propagated
instead!
Masking with Keras is almost too simple: Pass the argument mask_zero=True to your
embedding layer (the constructor, not the call)! You can read more about masking here. The
short version is that tensors can carry a mask as an attribute, and Keras layers can be configured
to use and/or modify these masks in some way. Here, the embedding layer "knows" to create a
mask such that 0 inputs (remember that index 0 encodes padding) are masked as False , and
the RNN layers are implemented to perform something like the formula above.
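Putting the pieces together, a minimal sketch of an embedding + LSTM classifier with masking enabled; the vocabulary and layer sizes are arbitrary:

```python
import tensorflow as tf

vocab_size = 20000

model = tf.keras.Sequential([
    # mask_zero=True makes the layer emit a mask for padded (index 0) positions.
    tf.keras.layers.Embedding(vocab_size, 64, mask_zero=True),
    tf.keras.layers.LSTM(64),        # respects the mask; returns the last real output
    tf.keras.layers.Dense(1),        # single logit for binary sentiment
])
```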
Add masking to your model! The result should be much faster learning (in terms of steps
needed to reach a particular performance, not time), in particular with post padding (the only
kind of padding supported by padded_batch ). The effect will be more dramatic the longer
your sequences are.
What to hand in
Implement the various improvements outlined in this assignment. Experiment with adding them
one by one and judge the impact (on accuracy, training time, convenience...) of each. You can
also carry out "ablation" studies where you take the full model with all improvements, and
remove them one at a time to judge their impact.
You can also try using larger or smaller vocabulary sizes and maximum sequence lengths and
investigate the impact of these parameters!
Please do not just use the exemplary English-Spanish data; picking a different language pair
reduces the temptation of simply copying the tutorial.
You can find data sets here. We recommend picking a language pair where you understand both
languages (so if you do speak Spanish... feel free to use it ;)). This makes it easier (and more fun)
for you to evaluate the results. However, keep in mind that some language pairs have a very
large amount of examples, whereas some only have very few, which will impact the learning
process and the quality of the trained models.
You may run into issues with the code in two places:
1. The downloading of the data inside the notebook might not work (it crashes with a 403
Forbidden error). In that case, you can simply download & extract the data on your local
machine and upload the .txt file to your drive, and then mount it and load the file as you've
done before.
2. The load_data function might crash. It expects each line to result in pairs of sentences,
but there seems to be a third element which talks about attribution of the example (at least
if you download a different dataset from the link above). If this happens, you can use
line.split('\t')[:-1] to exclude this in the function.
Tasks:
Follow the tutorial and train the model on your chosen language pair.
You might need to adapt the preprocessing depending on the language.
Implement other attention mechanisms and train models with them (there are Keras layers
for both):
Bahdanau attention ( AdditiveAttention )
Luong's multiplicative attention ( Attention )
Compare the attention weight plots for some examples between the attention mechanisms.
We recommend adding vmax=1.0 when creating the plot in ax.matshow(attention,
cmap='viridis') in the plot_attention function, so the colors correspond to the same
attention values in different plots.
Do you see qualitative differences in the attention weights between different attention
mechanisms?
Do you think that the model attends to the correct tokens in the input language (if you
understand both languages)?
Here are a few questions for you to check how well you understood the tutorial.
Please answer them (briefly) in your solution!
Which parts of the sentence are used as a token? Each character, each word, or are some
words split up?
Do the same tokens in different languages have the same ID?
e.g. Would the same token index map to the German word die and to the English word
die ?
Is the decoder attending to all previous positions, including the previous decoder
predictions?
Does the encoder output change in different decoding steps?
Does the context vector change in different decoding steps?
The decoder uses teacher forcing. Does this mean the time steps can be computed in
parallel?
Why is a mask applied to the loss function?
When translating the same sentence multiple times, do you get the same result? Why (not)?
If not, what changes need to be made to get the same result each time?
Hand in all of your code, i.e. the working tutorial code along with all changes/additions you
made. Include outputs which document some of your experiments. Also remember to answer
the questions above! Of course you can also write about other observations you made.
Assignment 8: Word2Vec
In this week, we will look at "the" classic model for learning word embeddings. This will be another tutorial-
based assignment. Find the link here (https://www.tensorflow.org/tutorials/text/word2vec).
The goals of this assignment include:
Getting to know an example of self-supervised learning, where we have data without labels and
construct a task directly from the data (often some kind of prediction task) in order to learn deep
representations,
Understanding how softmax with a very large number of classes is problematic, and getting to know
possible workarounds,
Exploring the idea of word embeddings.
Given the sentence "I like to cuddle dogs", how many skipgrams are created with a window size of 2?
In general, how does the number of skipgrams relate to the size of the dataset (in terms of input-target
pairs)?
Why is it not a good idea to compute the full softmax for classification?
The way the dataset is created, for a given (target, context) pair, are the negative samples
(remember, these are randomly sampled) the same each time this training example is seen, or are they
different?
For the given example dataset (Shakespeare), would the code create (target, context) pairs for
sentences that span multiple lines? For example, the last word of one line and the first word of the next
line?
Does the code generate skipgrams for padding characters (index 0)?
The skipgrams function uses a "sampling table". In the code, this is shown to be a simple list of
probabilities, and it is created without any reference to the actual text data. How/why does this work?
I.e. how does the program "know" which words to sample with which probability?
Compute the similarity between the resulting vector and all word vectors. Which one gives the
highest similarity? It "should" be queen . Note that it might actually be king , in which case
queen should at least be second-highest. To compute the similarity, you should use cosine
similarity.
You can try this for other pairs, such as Paris - France + Germany = Berlin etc.
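A numpy sketch of the analogy check; embeddings and word2idx are hypothetical names (here filled with dummy values) standing in for your trained word vectors and vocabulary lookup:

```python
import numpy as np

# Dummy stand-ins for the trained embedding matrix and vocabulary lookup.
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(10000, 128))        # (vocab_size, embedding_dim)
word2idx = {"king": 10, "man": 11, "woman": 12}   # hypothetical indices

def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

query = (embeddings[word2idx["king"]]
         - embeddings[word2idx["man"]]
         + embeddings[word2idx["woman"]])

sims = np.array([cosine_similarity(query, v) for v in embeddings])
top5 = sims.argsort()[::-1][:5]   # indices of the five most similar words
print(top5)
```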
Use a larger vocabulary and/or larger text corpora to train the models. See how embedding quality and
training effort changes. You can also implement a version using the "naive" full softmax, and see how
the negative sampling increases in efficiency compared to the full version as the vocabulary becomes
larger!
In CBOW, each training example consists of multiple context words and a single target word. There is
no equivalent to the skipgrams preprocessing function, but you can simply iterate over the full text
data in small windows (there is tf.data.Dataset.window which may be helpful here) and for each
window use the center word as the target and the rest as context.
The context embedding is computed by embedding all context words separately, and then averaging
their embeddings.
The rest stays pretty much the same. You will still need to generate negative examples through sampling,
since the full softmax is just as inefficient as with the Skipgram model.
OPTIONAL ASSIGNMENT
General Pipeline
No matter the exact kind of model, we usually do something like this: train a model on a self-supervised task using unlabeled data, then reuse the learned features (the "encoder") for a downstream task, here classification.
Your task
For a dataset of your choice, implement the above pipeline. Try at least three different kinds of
self-supervised models; for each, train the model and then use the features for a classification
task.
Also train a model directly on classification (no pre-training) and compare the performance to
the self-supervised models. Also compare the different self-supervision methods with each
other.
To make these comparisons fair, your models should have the same number of parameters. E.g.
you might want to use the same "encoder" architecture for each task, and add a small
classification head on top; then, the network that you train directly on classification should have
the same architecture as the encoder and the classification head combined.
The remainder of this text discusses some issues to keep in mind when building autoencoders or
similar models.
Autoencoders in Tensorflow
Building autoencoders in Tensorflow is pretty simple. You need to define an encoding based on
the input, a decoding based on the encoding, and a loss function that measures the distance
between decoding and input. An obvious choice may be simply the mean squared error (but see
below). To start off, you could try simple MLPs. Note that you are in no way obligated to make the
decoder the "reverse" of the encoder architecture; e.g. you could use a 10-layer MLP as an
encoder and a single layer as a decoder if you wanted. As a start, you should opt for an
"undercomplete" architecture where the encoding is smaller than the data.
Note: The activation function of the last decoder layer is very important, as it needs to be able
to map the input data range. Having data in the range [0, 1] allows you to use a sigmoid output
activation, for example. Experiment with different activations such as sigmoid, relu or linear (i.e.
no) activation and see how it affects the model. Your loss function should also "fit" the output
function, e.g. a sigmoid output layer goes well with a binary (!) cross-entropy loss.
Note that you can use the Keras model APIs to build the encoder and decoder as different
models, which makes it easy to later use the encoder separately. You can also have sub-
models/layers participate in different models at the same time, e.g. an encoder model can be
part of an autoencoder model together with a decoder , and of a classification model
together with a classifier_head .
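A minimal sketch of that idea: encoder and decoder as separate Sequential models combined into one autoencoder. The sizes and the flattened 784-dimensional input (e.g. MNIST scaled to [0, 1]) are assumptions:

```python
import tensorflow as tf

# Undercomplete MLP autoencoder: the 32-dimensional encoding is smaller than the data.
encoder = tf.keras.Sequential([
    tf.keras.layers.Dense(128, activation="relu", input_shape=(784,)),
    tf.keras.layers.Dense(32, activation="relu"),      # the encoding
])
decoder = tf.keras.Sequential([
    tf.keras.layers.Dense(128, activation="relu", input_shape=(32,)),
    tf.keras.layers.Dense(784, activation="sigmoid"),  # matches data in [0, 1]
])

inputs = tf.keras.Input(shape=(784,))
autoencoder = tf.keras.Model(inputs, decoder(encoder(inputs)))
autoencoder.compile(optimizer="adam", loss="binary_crossentropy")

# The same `encoder` instance can later be reused inside a classification model.
```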
Convolutional Autoencoders
Next, you should switch to a convolutional encoder/decoder to make use of the fact that we are
working with image data. The encoding should simply be one or more convolutional layers, with
any filter size and number of filters (you can optionally apply fully-connected layers at the end).
As an "inverse" of a Conv2D , Conv2DTranspose is commonly used. However, you could also
use UpSampling2D along with regular convolutions. Again, there is no requirement to make
the parameters of encoder and decoder "fit", e.g. you don't need to use the same filter sizes.
However, you need to take care when choosing padding/strides such that the output has the
same dimensions as the input. This can be a problem with MNIST (why?). It also means that the
last convolutional (transpose) layer should have as many filters as the input space (e.g. one filter
for MNIST or three for CIFAR).
Other models
Even other self-supervised models are often similar to autoencoders. For example, in a denoising
autoencoder, the input is a noisy version of the target (so input and target are not the same
anymore!), and the loss is computed between the output and this "clean" target. The
architecture can remain the same, however.
Similarly, if the input has parts of the image removed and the task is to reconstruct those parts,
the target is once again the full image, but an autoencoding architecture would in principle be
appropriate once again.
Allowing the encoder to be "fine-tuned" allows it to learn features that are suited for
classification, in case the self-supervised features are not optimal.
However, this might cause the encoder to overfit on the training set. Training only the
classification head would keep the encoder features more general.
The third option is a compromise between both.
Experiment! You can easily "freeze" layers or whole models by setting their trainable
argument to False .
As before, compare to a model that is trained directly on the classification task, but only on the
labeled subset. If everything works as expected, your self-supervised model should significantly
outperform the directly trained one (on the test set)! This is because the direct training massively
overfits on the small dataset, whereas the self-supervised model was able to learn features on all
available data. You will most likely want to freeze the encoder model, i.e. not fine-tune it -- if you
did, the self-supervised model would overfit, as well.