Prep - SIA Assignment #1 - Jupyter Notebook
The company has carried out a small study among a random sample of approximately 1000 students who took the College Admissions Exam. They have data on
students’ scores on both math and verbal sections of the College Admissions Exam, their high school GPA, their freshman GPA, and their gender.
You will analyze the data set and prepare a report by completing the tasks and answering the questions that follow.
Task 1.
For this assignment you will select a random sample of 100 students from the 1000 students in the original data set and analyze the data for those 100 students. To
select your random sample and save your data set on your computer follow these instructions:
1. Go to the 4th line of the code: df.to_csv(r'Path where you want to store the exported CSV file\File Name.csv')
2. Change Path where you want to store the exported CSV file to where you want to store your data.
3. Change File Name to your first name.
4. Run the code.
Use this data set to complete your assignment. Also include this data set in your assignment submission!
In [7]:
# Task 1: draw a random sample of 100 rows from the original data set of 1000 rows and save it.
# This is the data set that you will use for your assignment.
import pandas
original_data = pandas.read_csv("https://raw.githubusercontent.com/ZUCourses/SIA-Public/main/Data%20Sets/CAEGPA.csv")
df=original_data.sample(n=100)
df.to_csv("Desktop/assignment1_yourname.csv")
#df.to_csv("Downloads/mydata_")
#df.to_csv(r'Path where you want to store the exported CSV file\File Name.csv')
print(df)
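If you want the same 100 rows to be drawn every time the cell is rerun, one option is to pass a seed to `sample`. The sketch below uses a small made-up data frame in place of the course file. The `student_id` column and the seed value 42 are assumptions for illustration, not part of the assignment; pick your own seed. It also shows that a CSV written with the default settings gains an extra index column when it is read back.

```python
import io

import pandas as pd

# Hypothetical stand-in for the 1000-row course data set.
original_data = pd.DataFrame({"student_id": range(1000)})

# random_state fixes the seed, so rerunning the cell selects the same rows.
sample_a = original_data.sample(n=100, random_state=42)
sample_b = original_data.sample(n=100, random_state=42)

print(len(sample_a))              # number of sampled rows
print(sample_a.equals(sample_b))  # True when both draws match

# to_csv writes the row index by default; reading the file back turns that
# index into an extra column. An in-memory buffer stands in for a file here.
buffer = io.StringIO()
sample_a.to_csv(buffer)
buffer.seek(0)
reloaded = pd.read_csv(buffer)
print(list(reloaded.columns))
```

The extra column is only the original row number, so it should not be treated as one of the study variables in later tasks.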
Task 2.
Start your report with a brief introduction where you introduce the study, tell us something about the sample you are analyzing, and introduce the variables. Be brief but
clear here so that your readers will be familiar with what you will be reporting. After this brief introduction, begin a report on your analyses.
Quantitative Variables:
Categorical Variables:
Start a Report here of your analysis. After you do the analysis, complete the report here.
Task 3.
Create a histogram and generate descriptive statistics for each of the quantitative variables in the data set and describe their distributions in terms of shape, center,
spread, and presence of outliers.
In [ ]:
#Sample code:
import pandas
import matplotlib.pyplot as plt
df = pandas.read_csv("Desktop/assignment1_yourname.csv") #enter the path and name of your csv file
#plot the histogram
plt.hist(df['CAE_v'], bins=15) #change 15 to the number of bins you want
plt.title("CAE_v")
#produce descriptive statistics
print ("Descriptive Statistics for CAE_v")
df["CAE_v"].describe()
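When describing shape, spread, and outliers, a few extra numbers can support what you see in the histogram. The following is a minimal sketch, assuming the common 1.5 × IQR rule for flagging outliers (the assignment does not prescribe a rule). The function name `summarize_distribution` and the scores are made up for illustration.

```python
import pandas as pd


def summarize_distribution(values: pd.Series) -> dict:
    """Return center, spread, skewness, and a 1.5*IQR outlier count."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    # Points beyond these fences are flagged as potential outliers.
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    outliers = values[(values < lower_fence) | (values > upper_fence)]
    return {
        "mean": values.mean(),
        "median": values.median(),
        "std": values.std(),
        "iqr": iqr,
        "skew": values.skew(),  # > 0 suggests a longer right tail
        "n_outliers": len(outliers),
    }


# Made-up scores with one unusually high value.
scores = pd.Series([48, 50, 51, 52, 53, 54, 55, 56, 57, 95])
summary = summarize_distribution(scores)
print(summary)
```

A mean noticeably above the median, together with a positive skew value, points to a right-skewed distribution; the histogram should still be the primary evidence.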
In [ ]:
# Task 3 - part 1
#Write your code for Task 3 here
# Write Code to Create a Histogram for
#CAE_v: Score on the verbal section of the College Admissions Exam here
# CODE:
#Task 3 - part 2
#Describe the above distribution in terms of shape, center, spread, and presence of outliers.
Answer:
In [ ]:
#Task 3 - part 3
# Write Code to Create a Histogram for
#CAE_m: Score on the math section of the College Admissions Exam here
#CODE:
#Task 3 - part 4
#Describe the above distribution in terms of shape, center, spread, and presence of outliers.
Answer:
In [ ]:
#Task 3 - part 5
# Write Code to Create a Histogram for
#CAE_sum: Total of the scores on the math and verbal section of the College Admissions Exam here.
#CODE:
#Task 3 - part 6
#Describe the above distribution in terms of shape, center, spread, and presence of outliers.
Answer:
In [ ]:
# Task 3 - part 7
# Write Code to Create a Histogram for
#hs_gpa: High school grade point average here.
#CODE:
#Task 3 - part 8
#Describe the above distribution in terms of shape, center, spread, and presence of outliers.
Answer:
In [ ]:
# Task 3 - part 9
# Write Code to Create a Histogram for
# fy_gpa: College freshman grade point average here.
#CODE:
#Task 3 - part 10
#Describe the above distribution in terms of shape, center, spread, and presence of outliers.
Answer:
Task 4.
a. Generate a grouped box plot to compare the distribution of high-school GPA between male and female students. Describe your observations referring to the five-
number-summaries of both genders.
b. Generate a grouped box plot to compare the distribution of college freshman GPA between male and female students. Describe your observations referring to the
five-number-summaries of both genders.
c. Generate a grouped box plot to compare the distribution of CAE_v between male and female students. Describe your observations referring to the five-number-
summaries of both genders.
d. Generate a grouped box plot to compare the distribution of CAE_m between male and female students. Describe your observations referring to the five-number-
summaries of both genders.
e. Discuss any patterns you observe between male and female students’ achievement when you consider their performances in high school, on the College Admissions
Exam, and in their freshman year.
In [ ]:
#Sample code:
import pandas
import matplotlib.pyplot as plt
from numpy import percentile
df = pandas.read_csv("Desktop/assignment1_yourname.csv") #enter the path and name of your csv file
male=df[df["sex"]==1]
female=df[df["sex"]==2]
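One possible way to build a grouped box plot and the matching five-number summaries is sketched below. The data are made up; the only things taken from the sample code above are the column name `sex` and its coding (1 = male, 2 = female). The helper name `five_number_summary` is hypothetical.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Made-up data; sex coded 1 = male, 2 = female as in the sample code.
demo = pd.DataFrame({
    "sex":    [1, 1, 1, 1, 1, 2, 2, 2, 2, 2],
    "hs_gpa": [2.0, 2.5, 3.0, 3.5, 4.0, 2.4, 2.8, 3.2, 3.6, 4.0],
})
male = demo[demo["sex"] == 1]
female = demo[demo["sex"] == 2]


def five_number_summary(values):
    """Minimum, Q1, median, Q3, maximum."""
    return np.percentile(values, [0, 25, 50, 75, 100])


male_summary = five_number_summary(male["hs_gpa"])
female_summary = five_number_summary(female["hs_gpa"])
print("Male:  ", male_summary)
print("Female:", female_summary)

# One box per group, drawn side by side on the same axes.
fig, ax = plt.subplots()
ax.boxplot([male["hs_gpa"].to_numpy(), female["hs_gpa"].to_numpy()])
ax.set_xticks([1, 2])
ax.set_xticklabels(["Male", "Female"])
ax.set_ylabel("hs_gpa")
ax.set_title("High school GPA by sex")
```

In a notebook the figure is displayed when the cell finishes. Drawing both boxes on shared axes makes the medians and quartiles directly comparable.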
Task 4 a
In [ ]:
#Task 4a_Part1
#Task 4a. part 1: Write code to generate a grouped box plot to compare the distribution
#of high-school GPA between male and female students.
#hs_gpa: High school grade point average.
#CODE:
#Task 4a_Part2:
#Describe your observations referring to the five-number-summaries of both genders.
#to compare high-school GPA between male and female students.
Answer:
Task 4 b
In [ ]:
# Task 4b_Part1:
#4b Part1: Generate a grouped box plot to compare the distribution of
#college freshman GPA between male and female students.
#fy_gpa: College freshman grade point average.
#Task 4b_Part2: Describe your observations referring to the five-number-summaries of both genders.
#to compare college freshman GPA between male and female students.
Answer:
Task 4 c
In [ ]:
# Task 4c_Part1
#Write your code for Task 4, c.
# Generate a grouped box plot to compare the distribution of
# CAE_v between male and female students
#CAE_v: Score on the verbal section of the College Admissions Exam
#CODE:
#Task 4c_Part2
#Describe your observations referring to the five-number-summaries of both genders,
#to compare CAE_v: Score on the verbal section of the College Admissions Exam between male and female students
Answer:
Task 4 d
In [ ]:
#Task 4d_Part1
# Write your code for Task 4d.
# Generate a grouped box plot to compare the distribution of CAE_m between male and female students.
#CAE_m: Score on the math section of the College Admissions Exam
#CODE:
#Task 4d_Part2: Describe your observations referring to the five-number-summaries of both genders.
#to compare the distribution of CAE_m between male and female students.
#CAE_m: Score on the math section of the College Admissions Exam
Answer:
Task 4 e
e. Discuss any patterns you observe between male and female students’ achievement when you consider their performances in
high school, on the College Admissions Exam, and in their freshman year.
Answer:
Task 5
Task 5. a. Create separate scatterplots to examine the relationship between CAE_v (dependent variable) and high school GPA and college freshman GPA (independent
variables). Describe the scatterplots in terms of the form, strength, and direction of the relationships.
Further examine if the relationships between the dependent variable and each independent variable vary by gender (you will need to create scatterplots separately for
each gender to answer this question.)
b. Create separate scatterplots to examine the relationship between CAE_m (dependent variable) and high school GPA and college freshman GPA (independent
variables). Describe the scatterplots in terms of the form, strength, and direction of the relationships.
Further examine if the relationships between the dependent variable and each independent variable vary by gender (you will need to create scatterplots separately for
each gender to answer this question.)
c. (Optional) Create separate scatterplots to examine the relationship between CAE_sum (dependent variable) and high school GPA and college freshman GPA
(independent variables). Describe the scatterplots in terms of the form, strength, and direction of the relationships.
Further examine if the relationships between the dependent variable and each independent variable vary by gender (you will need to create scatterplots separately for
each gender to answer this question.)
In [ ]:
#Sample code:
import pandas
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt
import statsmodels.api as statsmodels
df = pandas.read_csv(r"Desktop/assignment1_yourname.csv")
#enter the path and name of your csv file
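A side-by-side layout is one way to compare the relationship across genders. The sketch below is illustrative only: the data are simulated, the coding of `sex` (1 = male, 2 = female) follows the Task 4 sample code, and adding Pearson's r to each panel title is a suggestion rather than a requirement of the assignment.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Simulated data with a positive linear relationship plus noise.
rng = np.random.default_rng(0)
n = 60
demo = pd.DataFrame({
    "sex": np.tile([1, 2], n // 2),           # 1 = male, 2 = female
    "hs_gpa": rng.uniform(2.0, 4.0, size=n),
})
demo["CAE_v"] = 30 + 8 * demo["hs_gpa"] + rng.normal(0, 3, size=n)

# One panel per gender, sharing the y-axis so the panels are comparable.
fig, axes = plt.subplots(1, 2, figsize=(10, 4), sharey=True)
correlations = {}
for ax, (code, label) in zip(axes, [(1, "Male"), (2, "Female")]):
    group = demo[demo["sex"] == code]
    r = group["hs_gpa"].corr(group["CAE_v"])  # Pearson correlation
    correlations[label] = r
    ax.scatter(group["hs_gpa"], group["CAE_v"])  # independent on x, dependent on y
    ax.set_xlabel("hs_gpa")
    ax.set_title(f"{label}: r = {r:.2f}")
axes[0].set_ylabel("CAE_v")
```

The correlation coefficient summarizes direction and strength only for roughly linear patterns, so the form of the relationship still has to be judged from the plot itself.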
Task 5a
In [ ]:
#5a. Part 1
#Write your code for Task 5 a.
#a. part 1 : Write code to create scatterplots to examine the relationship between
#CAE_v (dependent variable) and high school GPA
#CODE:
#5a. part 2
#Describe the scatterplots in terms of the form, strength, and direction of the relationships.
#Scatterplot between CAE_v (dependent variable) and high school GPA(independent variables)
#CAE_v: Score on the verbal section of the College Admissions Exam
Answer:
In [ ]:
#5a. Part 3
#Write your code for Task 5 a.
#a. part 3 : Write code to create scatterplots to examine the relationship between
#CAE_v (dependent variable) and college freshman GPA (independent variables)
#CODE:
#5a. part 4
#Describe the scatterplots in terms of the form, strength, and direction of the relationships.
#CAE_v (dependent variable) and college freshman GPA (independent variables)
#CAE_v: Score on the verbal section of the College Admissions Exam
Answer:
In [ ]:
#5a. part 5
#Further examine if the relationships between the dependent variable and each independent variable vary by gender
#(you will need to create scatterplots separately for each gender to answer this question.)
#5a part 5: Write code to
# create a separate scatter plot to examine the relationship between
# CAE_v (dependent variable) and high school GPA (independent variable) for male students
#CODE:
In [ ]:
#Further examine if the relationships between the dependent variable and each independent variable vary by gender
#(you will need to create scatterplots separately for each gender to answer this question.)
#5a part 6: Write code to
# create a separate scatter plot to examine the relationship between
# CAE_v (dependent variable) and high school GPA (independent variable) for female students
#CODE:
#Describe
#5a part 7: From the above two scatter plots, examine and describe if the relationship between
#CAE_v (dependent variable) and high school GPA (independent variable) varies by gender (male/female).
Answer:
In [ ]:
#Further examine if the relationships between the dependent variable and each independent variable vary by gender
#(you will need to create scatterplots separately for each gender to answer this question.)
#5a part 8: Write code to
# create a separate scatter plot to examine the relationship between
# CAE_v (dependent variable) and college freshman GPA (independent variable) for male students
#CODE:
In [ ]:
#Further examine if the relationships between the dependent variable and each independent variable vary by gender
#(you will need to create scatterplots separately for each gender to answer this question.)
#5a part 9: Write code to
# create a separate scatter plot to examine the relationship between
# CAE_v (dependent variable) and college freshman GPA (independent variable) for female students
#CODE:
#Describe
#5a part 10: From the above two scatter plots, examine and describe if the relationship between
#CAE_v (dependent variable) and college freshman GPA (independent variable) varies by gender.
Answer:
Task 5b
In [ ]:
In [ ]:
In [ ]:
In [ ]:
In [ ]:
#5b. Part 8
#Further examine if the relationships between the
#dependent variable and each independent variable vary by gender
#5b. Part 8: Write code to create a scatter plot
#between CAE_m male (dependent variable) vs. college freshman GPA male (independent variable)
In [ ]:
#5b. Part 9
#Further examine if the relationships between the
#dependent variable and each independent variable vary by gender
#5b. Part 9: Write code to create a scatter plot
#between CAE_m female (dependent variable) vs. college freshman GPA female (independent variable)
#5b. Part 10: From the scatter plots above, examine and describe if the relationship
#between CAE_m (dependent variable) and college freshman GPA varies by gender.
Task 6
Task 6. a. Fit a simple linear regression model that predicts “CAE_v” using high-school GPA and freshman college GPA separately. Generate and use the residual plot,
the standard error, and the R^2 to assess the fit of each linear model. If the model is a good fit, interpret the slope and the intercept.
Additionally, if you found that the relationship between CAE_v and the independent variables varied by gender in Task 5, then run each regression model for each gender
separately and interpret your findings accordingly.
b. Fit a simple linear regression model that predicts “CAE_m” using high-school GPA and freshman college GPA separately. Generate and use the residual plot, the
standard error, and the R^2 to assess the fit of each linear model. If the model is a good fit, interpret the slope and the intercept.
Additionally, if you found that the relationship between CAE_m and the independent variables varied by gender in Task 5, then run each regression model for each
gender separately and interpret your findings accordingly.
c. (Optional) Fit a simple linear regression model that predicts “CAE_sum” using high-school GPA (hs_gpa) and freshman college GPA (fy_gpa) as independent variables
separately. Generate and use the residual plot, the standard error, and the R^2 to assess the fit of each linear model. If the model is a good fit, interpret the slope and the
intercept.
Task 6a
In [ ]:
#Sample code:
import pandas
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt
import statsmodels.api as statsmodels
import seaborn as sns
df = pandas.read_csv(r"Desktop/assignment1_yourname.csv")
def regression_equation(column_x, column_y):
    # fit the regression line using "statsmodels" library:
    X = df[column_x]
    X = statsmodels.add_constant(X) #adds the intercept term to the model
    Y = df[column_y]
    regressionmodel = statsmodels.OLS(Y, X).fit() #OLS stands for "ordinary least squares"
    print('R2: ', round(regressionmodel.rsquared, 3))
    SE = np.sqrt(regressionmodel.mse_resid) #standard error of the regression
    print('SE=', round(SE, 3))
    # extract regression parameters from model, rounded to 3 decimal places, and print the regression equation:
    slope = round(regressionmodel.params.iloc[1], 3)
    intercept = round(regressionmodel.params.iloc[0], 3)
    print("Regression equation: "+column_y+" = ", slope, "* "+column_x+" + ", intercept)
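The sample function reports R^2 and the standard error but does not draw a residual plot. The sketch below shows one way to produce one. It deliberately swaps in `np.polyfit` for the line fit so that it runs without the course data or statsmodels. For a one-predictor model this gives the same least-squares line as OLS with a constant. With a fitted statsmodels model, the `resid` and `fittedvalues` attributes of the results object hold the corresponding quantities. The data are simulated stand-ins, not the assignment data.

```python
import matplotlib.pyplot as plt
import numpy as np

# Simulated stand-ins for an independent (x) and dependent (y) variable.
rng = np.random.default_rng(1)
x = rng.uniform(2.0, 4.0, size=80)
y = 30 + 8 * x + rng.normal(0, 3, size=80)

# Least-squares line; for deg=1 np.polyfit returns [slope, intercept].
slope, intercept = np.polyfit(x, y, deg=1)
fitted = intercept + slope * x
residuals = y - fitted

# Standard error of the regression uses n - 2 degrees of freedom.
se = np.sqrt(np.sum(residuals**2) / (len(x) - 2))
r_squared = 1 - np.sum(residuals**2) / np.sum((y - y.mean())**2)
print("SE =", round(se, 3), " R2 =", round(r_squared, 3))

# Residuals against fitted values, with a reference line at zero.
fig, ax = plt.subplots()
ax.scatter(fitted, residuals)
ax.axhline(0, color="black", linewidth=1)
ax.set_xlabel("Fitted values")
ax.set_ylabel("Residuals")
ax.set_title("Residual plot")
```

A residual plot with no visible curve or funnel shape, scattered evenly around zero, supports the use of a linear model.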
In [ ]:
In [ ]:
#Task6a_Part1
#Fit a simple linear regression model that
#predicts “CAE_v” using high-school GPA
#CAE_v(dependent variable) and high-school GPA (independent)
#Generate and use the residual plot, the standard error, and the R^2
#to assess the fit of each linear model.
#CODE:
#Task6a_Part2
#Describe if the model is a good fit,
#Assess the fit of each linear model using the residual plot, the standard error and the R^2 value
#CAE_v(dependent variable) and high-school GPA (independent)
#Interpret the slope and the intercept of the above Regression line, if the model is a good fit.
Answer:
In [ ]:
#Task6a_Part3
#Fit a simple linear regression model that
#predicts “CAE_v” using freshman college GPA
#CAE_v(dependent variable) and freshman college GPA (independent)
#Generate and use the residual plot, the standard error, and the R^2
#to assess the fit of each linear model.
#CODE:
#Task6a_Part4
#Describe if the model is a good fit,
#Assess the fit of each linear model using the residual plot, the standard error and the R^2 value
#CAE_v(dependent variable) and freshman college GPA (independent)
#interpret the slope and the intercept of the above Regression line.
Answer:
In [ ]:
#Task6a_Part5
#Check if the relationship between CAE_v and high school GPA in Task 5a, part 7
#varied by gender.
#Task6a_Part6
#Write new code
#to fit a simple linear regression model that
#predicts “CAE_v” male using high-school GPA male
#CAE_v male(dependent variable) and high-school GPA male(independent)
#Generate and use the residual plot, the standard error, and the R^2
#to assess the fit of each linear model.
#Task6a_Part7
#Write new code
#to fit a simple linear regression model that
#predicts “CAE_v” female using high-school GPA female
#CAE_v female(dependent variable) and high-school GPA female(independent)
#Generate and use the residual plot, the standard error, and the R^2
#to assess the fit of each linear model.
#Task6a_Part8
#Interpret your findings for the regression lines in Task6a_Part6 and Task6a_Part7:
#Describe if the model is a good fit,
#Assess the fit of each linear model using the residual plot, the standard error and the R^2 value
#interpret the slope and the intercept of the above Regression line.
#Task6a_Part9
#Write new code
#to fit a simple linear regression model that
#predicts “CAE_v” male using freshman college GPA male
#CAE_v male(dependent variable) and freshman college GPA male(independent)
#Generate and use the residual plot, the standard error, and the R^2
#to assess the fit of each linear model.
#Task6a_Part10
#Write new code
#to fit a simple linear regression model that
#predicts “CAE_v” female using freshman college GPA female
#CAE_v female(dependent variable) and freshman college GPA female(independent)
#Generate and use the residual plot, the standard error, and the R^2
#to assess the fit of each linear model.
#Interpret your findings for the regression lines in Task6a_Part9 and Task6a_Part10:
#Describe if the model is a good fit,
#Assess the fit of each linear model using the residual plot, the standard error and the R^2 value
#interpret the slope and the intercept of the above Regression line.
Task6b.
Fit a simple linear regression model that predicts “CAE_m” using high-school GPA and freshman college GPA separately. Generate and use the residual plot, the
standard error, and the R^2 to assess the fit of each linear model. If the model is a good fit, interpret the slope and the intercept. Additionally, if you found that the
relationship between CAE_m and the independent variables varied by gender in Task 5, then run each regression model for each gender separately and interpret your
findings accordingly.
In [ ]:
#Task6b_Part1
#Fit a simple linear regression model that
#predicts “CAE_m” using high-school GPA
#CAE_m(dependent variable) and high-school GPA (independent)
#Generate and use the residual plot, the standard error, and the R^2
#to assess the fit of each linear model.
#CODE:
#Task6b_Part2
#Describe if the model is a good fit,
#Assess the fit of each linear model using the residual plot, the standard error and the R^2 value
#CAE_m(dependent variable) and high-school GPA (independent)
#Interpret the slope and the intercept of the above Regression line, if the model is a good fit.
Answer:
In [ ]:
#Task6b_Part3
#Fit a simple linear regression model that
#predicts “CAE_m” using freshman college GPA
#CAE_m(dependent variable) and freshman college GPA (independent)
#Generate and use the residual plot, the standard error, and the R^2
#to assess the fit of each linear model.
#CODE:
#Task6b_Part4
#Describe if the model is a good fit,
#Assess the fit of each linear model using the residual plot, the standard error and the R^2 value
#CAE_m(dependent variable) and freshman college GPA (independent)
#interpret the slope and the intercept of the above Regression line.
Answer:
#Task6b_Part5
#Check if the relationship between CAE_m and high school GPA in Task 5b, part 7
#varied by gender.
#Task6b_Part6
#Write new code
#to fit a simple linear regression model that
#predicts “CAE_m” male using high-school GPA male
#CAE_m male(dependent variable) and high-school GPA male(independent)
#Generate and use the residual plot, the standard error, and the R^2
#to assess the fit of each linear model.
#Task6b_Part7
#Write new code
#to fit a simple linear regression model that
#predicts “CAE_m” female using high-school GPA female
#CAE_m female(dependent variable) and high-school GPA female(independent)
#Generate and use the residual plot, the standard error, and the R^2
#to assess the fit of each linear model.
#Task6b_Part8
#Interpret your findings for the regression lines in Task6b_Part6 and Task6b_Part7:
#Describe if the model is a good fit,
#Assess the fit of each linear model using the residual plot, the standard error and the R^2 value
#interpret the slope and the intercept of the above Regression line.
#Task6b_Part9
#Write new code
#to fit a simple linear regression model that
#predicts “CAE_m” male using freshman college GPA male
#CAE_m male(dependent variable) and freshman college GPA male(independent)
#Generate and use the residual plot, the standard error, and the R^2
#to assess the fit of each linear model.
#Task6b_Part10
#Write new code
#to fit a simple linear regression model that
#predicts “CAE_m” female using freshman college GPA female
#CAE_m female(dependent variable) and freshman college GPA female(independent)
#Generate and use the residual plot, the standard error, and the R^2
#to assess the fit of each linear model.
#Interpret your findings for the regression lines in Task6b_Part9 and Task6b_Part10:
#Describe if the model is a good fit,
#Assess the fit of each linear model using the residual plot, the standard error and the R^2 value