
Jana Sir - Final


Assignment

The data analysis was conducted on a data set of 5716 records obtained from R. This is
credit card data downloaded from Kaggle and used in the credit card issuance process.
The variables considered are age, gender (male, female and binary), education level
(High School, Bachelors, Masters and PhD), marital status (Divorced, Married, Single,
Widowed), income, credit score, asset value, loan amount, loan purpose (Auto, Business,
Home, Personal), employment status (Employed, Unemployed, Self-employed), years at
current job, payment history (Excellent, Fair, Good and Poor), debt-to-income ratio and
number of dependents. Marital status is converted to a 0-4 scale, payment history to a
5-point scale, and risk rating is recorded as high, medium and low.

I created bar diagrams of the population by age, income, credit score, asset value and
loan amount.

I then tried to identify the relationship between each variable and risk using the
regression tool of the Excel Data Analysis function.

1. Income vs Risk

Regression Statistics
Multiple R          0.014211354
R Square            0.000201963
Adjusted R Square   2.69892E-05
Standard Error      0.668623044
Observations        5716

ANOVA
            df     SS         MS         F          Significance F
Regression  1      0.516014   0.516014   1.154247   0.282707
Residual    5714   2554.482   0.447057
Total       5715   2554.998

           Coefficients  Standard Error  t Stat    P-value   Lower 95%  Upper 95%  Lower 95.0%  Upper 95.0%
Intercept  1.476594362   0.02306         64.0317   0         1.431387   1.521801   1.431387     1.521801424
Income     3.25982E-07   3.03E-07        1.074359  0.282707  -2.7E-07   9.21E-07   -2.7E-07     9.208E-07
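
The figures in a regression summary like the one above are tied together by simple arithmetic, which allows a quick consistency check. The Python sketch below recomputes R Square, the mean squares, F, the Standard Error and the slope's t Stat from the sums of squares and degrees of freedom reported for Income. The function name `check_regression_summary` is made up for illustration, and the shortcut |t| = sqrt(F) is used only because the model has a single predictor.

```python
import math


def check_regression_summary(ss_regression, ss_residual, df_regression, df_residual):
    """Recompute summary statistics from an ANOVA table (illustrative helper)."""
    ss_total = ss_regression + ss_residual
    ms_regression = ss_regression / df_regression
    ms_residual = ss_residual / df_residual
    f_stat = ms_regression / ms_residual
    return {
        "r_square": ss_regression / ss_total,
        "ms_regression": ms_regression,
        "ms_residual": ms_residual,
        "f": f_stat,
        # The regression standard error is the square root of the residual MS.
        "standard_error": math.sqrt(ms_residual),
        # With exactly one predictor, the slope's |t Stat| equals sqrt(F).
        "abs_t_slope": math.sqrt(f_stat) if df_regression == 1 else None,
    }


# SS and df values copied from the Income ANOVA table.
income = check_regression_summary(0.516014, 2554.482, 1, 5714)
for name, value in income.items():
    print(f"{name}: {value:.6g}")
```

The recomputed values agree with the reported R Square, F, Standard Error and t Stat up to the rounding of the inputs.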

2. Asset Vs Risk

SUMMARY OUTPUT

Regression Statistics
Multiple R          0.005137
R Square            2.64E-05
Adjusted R Square   -0.00015
Standard Error      0.668682
Observations        5716

ANOVA
            df     SS         MS         F         Significance F
Regression  1      0.06741    0.06741    0.15076   0.697824
Residual    5714   2554.931   0.447135
Total       5715   2554.998

              Coefficients  Standard Error  t Stat    P-value   Lower 95%  Upper 95%  Lower 95.0%  Upper 95.0%
Intercept     1.492646      0.019687        75.8197   0         1.454052   1.53124    1.454052     1.53124
Assets_Value  4.26E-08      1.1E-07         0.388279  0.697824  -1.7E-07   2.58E-07   -1.7E-07     2.58E-07

3. Loan amount vs Risk

SUMMARY OUTPUT

Regression Statistics
Multiple R          0.006852
R Square            4.69E-05
Adjusted R Square   -0.00013
Standard Error      12971.59
Observations        5716

ANOVA
            df     SS         MS         F          Significance F
Regression  1      45140003   45140003   0.268272   0.604513
Residual    5714   9.61E+11   1.68E+08
Total       5715   9.61E+11

              Coefficients  Standard Error  t Stat    P-value   Lower 95%  Upper 95%  Lower 95.0%  Upper 95.0%
Intercept     27753.79      381.8989        72.67312  0         27005.12   28502.45   27005.12     28502.45
Assets_Value  -0.0011       0.002129        -0.51795  0.604513  -0.00528   0.003071   -0.00528     0.003071

5. Credit Score vs Risk

SUMMARY OUTPUT

Regression Statistics
Multiple R          0.00805
R Square            6.48E-05
Adjusted R Square   -0.00011
Standard Error      12971.47
Observations        5716

ANOVA
            df     SS         MS         F          Significance F
Regression  1      62306055   62306055   0.370298   0.542866
Residual    5714   9.61E+11   1.68E+08
Total       5715   9.61E+11

              Coefficients  Standard Error  t Stat    P-value   Lower 95%  Upper 95%  Lower 95.0%  Upper 95.0%
Intercept     28860.89      2116.712        13.63478  1.11E-41  24711.34   33010.45   24711.34     33010.45
Credit_Score  -1.83632      3.017683        -0.60852  0.542866  -7.75213   4.079478   -7.75213     4.079478

6. Years of experience Vs Risk

SUMMARY OUTPUT

Regression Statistics
Multiple R          0.005774
R Square            3.33E-05
Adjusted R Square   -0.00014
Standard Error      12971.68
Observations        5716

ANOVA
            df     SS         MS         F          Significance F
Regression  1      32051637   32051637   0.190484   0.662531
Residual    5714   9.61E+11   1.68E+08
Total       5715   9.61E+11

                      Coefficients  Standard Error  t Stat    P-value   Lower 95%  Upper 95%  Lower 95.0%  Upper 95.0%
Intercept             27699.04      327.9252        84.46754  0         27056.18   28341.89   27056.18     28341.89
Years_at_Current_Job  -12.9312      29.62847        -0.43644  0.662531  -71.0142   45.15186   -71.0142     45.15186

7. Debt-to-Income Ratio vs Risk

SUMMARY OUTPUT

Regression Statistics
Multiple R          0.009035
R Square            8.16E-05
Adjusted R Square   -9.3E-05
Standard Error      0.143401
Observations        5716

ANOVA
            df     SS         MS         F          Significance F
Regression  1      0.009592   0.009592   0.466455   0.49465
Residual    5714   117.5021   0.020564
Total       5715   117.5117

                      Coefficients  Standard Error  t Stat    P-value  Lower 95%  Upper 95%  Lower 95.0%  Upper 95.0%
Intercept             0.349431      0.003625        96.38964  0        0.342324   0.356538   0.342324     0.356538
Years_at_Current_Job  0.000224      0.000328        0.682975  0.49465  -0.00042   0.000866   -0.00042     0.000866

Hypothesis

Ho: The mean asset value is the same across all education levels of customers,
specifically Bachelors Degree, High School, Masters and PhD.

Ha: The mean asset value differs for at least one pair of education levels of customers,
specifically Bachelors Degree, High School, Masters and PhD.

Inference:
p value > 0.05
Therefore, we fail to reject the null hypothesis.
There is no significant evidence that asset value differs across the education levels of
customers: Bachelors Degree, High School, Masters and PhD.
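
For reference, the F statistic behind a one-way ANOVA of this kind compares the variation between group means with the variation inside the groups. The sketch below computes it in plain Python. The asset values are invented for illustration and are not taken from the Kaggle data, and `one_way_anova_f` is a hypothetical helper name. Turning F into a p-value additionally needs an F-distribution routine, which is left out here.

```python
def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a list of numeric samples."""
    all_values = [value for group in groups for value in group]
    grand_mean = sum(all_values) / len(all_values)
    group_means = [sum(group) / len(group) for group in groups]

    # Variation of the group means around the grand mean.
    ss_between = sum(
        len(group) * (mean - grand_mean) ** 2
        for group, mean in zip(groups, group_means)
    )
    # Variation of each observation around its own group mean.
    ss_within = sum(
        (value - mean) ** 2
        for group, mean in zip(groups, group_means)
        for value in group
    )
    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


# Hypothetical asset values, one list per education level.
asset_values = {
    "High School": [120_000, 95_000, 143_000, 110_000],
    "Bachelors": [130_000, 101_000, 125_000, 118_000],
    "Masters": [115_000, 140_000, 99_000, 127_000],
    "PhD": [122_000, 108_000, 135_000, 112_000],
}
f_stat, df_between, df_within = one_way_anova_f(list(asset_values.values()))
print(f"F = {f_stat:.4f} on ({df_between}, {df_within}) degrees of freedom")
```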

Anova Output Visualisation

Chi-Square Test:
Ho: There is no significant association between Education Level and Employment Status.
Ha: There is a significant association between Education Level and Employment Status.

Inference:
p value > 0.05
Therefore, we fail to reject the null hypothesis.
There is no significant association between Education Level and Employment Status.
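
As a rough illustration of what the chi-square test of independence computes, the sketch below derives the expected count of every cell from the row and column totals and sums the squared deviations. The counts are invented for illustration and are not taken from the Kaggle data, and `chi_square_independence` is a hypothetical helper name. The p-value lookup against the chi-square distribution is not included.

```python
def chi_square_independence(table):
    """Return (chi2, dof) for a contingency table given as a list of rows."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)

    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count if the row and column variables were independent.
            expected = row_totals[i] * col_totals[j] / total
            chi2 += (observed - expected) ** 2 / expected

    dof = (len(table) - 1) * (len(table[0]) - 1)
    return chi2, dof


# Hypothetical counts: rows are education levels (High School, Bachelors,
# Masters, PhD), columns are employment statuses (Employed, Unemployed,
# Self-employed).
counts = [
    [310, 95, 120],
    [620, 180, 240],
    [450, 140, 170],
    [200, 60, 85],
]
chi2, dof = chi_square_independence(counts)
print(f"chi-square = {chi2:.3f}, degrees of freedom = {dof}")
```

With four education levels and three employment statuses, the test has (4 - 1) x (3 - 1) = 6 degrees of freedom.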

Project 2
I downloaded cyber security attack data from Kaggle and tried to identify the
relationship between various factors in the table, namely the relationship of Anomaly
Score to protocol (TCP and other than TCP), packet length, packet type (data-0,
control-1), traffic type (HTTP-1, other than HTTP-0) and IoC (detected-0, not detected-1).

The summary of the data from R is as follows:

Descriptive statistics are as follows:

SUMMARY OUTPUT

Regression Statistics
Multiple R          0.009842968
R Square            9.6884E-05
Adjusted R Square   -2.81226E-05
Standard Error      28.85400397
Observations        40000

ANOVA
            df      SS            MS           F          Significance F
Regression  5       3226.274047   645.254809   0.775031   0.5675333
Residual    39994   33297146.48   832.553545
Total       39999   33300372.75

                           Coefficients  Standard Error  t Stat       P-value     Lower 95%    Upper 95%   Lower 95.0%  Upper 95.0%
Intercept                  49.94164525   0.396566809     125.93501    0           49.1643651   50.7189254  49.1643651   50.7189254
Protocol                   0.17204673    0.306414369     0.56148388   0.57447086  -0.42853256  0.77262602  -0.42853256  0.77262602
Packet Length              -0.00025129   0.000346782     -0.72463518  0.46868011  -0.00093099  0.00042841  -0.00093099  0.00042841
Packet Type                0.347354315   0.2885732       1.20369568   0.22871435  -0.21825586  0.91296449  -0.21825586  0.91296449
Traffic Type               0.386772847   0.305893623     1.26440311   0.20609275  -0.21278577  0.98633146  -0.21278577  0.98633146
IoC detected/not detected  0.012394188   0.28855515      0.04295258   0.96573955  -0.55318061  0.57796899  -0.55318061  0.57796899
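
The t Stat, P-value and 95% bounds in the coefficient table all follow from the coefficient and its standard error. The sketch below reproduces them for Protocol. It assumes a normal approximation in place of the Student's t distribution that Excel uses, which is a close stand-in here because there are 39994 residual degrees of freedom. The helper name `p_value_and_interval` is hypothetical.

```python
from statistics import NormalDist


def p_value_and_interval(coefficient, standard_error, confidence=0.95):
    """Two-sided p-value and confidence interval (normal approximation)."""
    standard_normal = NormalDist()
    t_stat = coefficient / standard_error
    # Probability of a statistic at least this extreme in either tail.
    p_value = 2 * (1 - standard_normal.cdf(abs(t_stat)))
    # Critical value, about 1.96 for a 95% interval.
    critical = standard_normal.inv_cdf(0.5 + confidence / 2)
    margin = critical * standard_error
    return t_stat, p_value, coefficient - margin, coefficient + margin


# Coefficient and Standard Error of Protocol, copied from the table.
t_stat, p_value, lower, upper = p_value_and_interval(0.17204673, 0.306414369)
print(f"t Stat = {t_stat:.5f}, P-value = {p_value:.5f}")
print(f"95% interval = ({lower:.5f}, {upper:.5f})")
```

All five predictor intervals in the table contain zero, which is consistent with every P-value being above 0.05.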

1. Relationship between protocol and Anomaly Score

SUMMARY OUTPUT

Regression Statistics
Multiple R          0.002799008
R Square            7.83444E-06
Adjusted R Square   -1.71666E-05
Standard Error      28.85384591
Observations        40000

ANOVA
            df      SS            MS           F            Significance F
Regression  1       260.8899111   260.889911   0.31336455   0.575626
Residual    39998   33300111.86   832.544424
Total       39999   33300372.75

           Coefficients  Standard Error  t Stat      P-value   Lower 95%    Upper 95%   Lower 95.0%  Upper 95.0%
Intercept  50.0565639    0.176490152     283.62242   0         49.7106391   50.4024887  49.7106391   50.4024887
Protocol   0.171517019   0.306395432     0.55978974  0.575626  -0.42902515  0.77205919  -0.42902515  0.77205919

2. Relationship between packet length and Anomaly Score

SUMMARY OUTPUT

Regression Statistics
Multiple R          0.003598602
R Square            1.29499E-05
Adjusted R Square   -1.2051E-05
Standard Error      28.85377211
Observations        40000

ANOVA
            df      SS            MS           F           Significance F
Regression  1       431.2377362   431.237736   0.5179783   0.47170956
Residual    39998   33299941.51   832.540165
Total       39999   33300372.75

               Coefficients  Standard Error  t Stat       P-value     Lower 95%    Upper 95%   Lower 95.0%  Upper 95.0%
Intercept      50.30850129   0.306993534     163.874791   0           49.7067868   50.9102158  49.7067868   50.9102158
Packet Length  -0.000249571  0.000346768     -0.71970709  0.47170956  -0.00092924  0.0004301   -0.00092924  0.0004301

3. Relationship between packet type and Anomaly Score

SUMMARY OUTPUT

Regression Statistics
Multiple R          0.005984414
R Square            3.58132E-05
Adjusted R Square   1.08129E-05
Standard Error      28.85344226
Observations        40000

ANOVA
            df      SS            MS           F            Significance F
Regression  1       1192.593229   1192.59323   1.43250806   0.23136265
Residual    39998   33299180.16   832.52113
Total       39999   33300372.75

             Coefficients  Standard Error  t Stat      P-value     Lower 95%   Upper 95%   Lower 95.0%  Upper 95.0%
Intercept    49.93874513   0.205244344     243.313624  0           49.5364614  50.3410288  49.5364614   50.3410288
Packet Type  0.345363681   0.288554683     1.19687429  0.23136265  -0.2202102  0.91093757  -0.2202102   0.91093757

4. Relationship between traffic type and Anomaly Score

SUMMARY OUTPUT

Regression Statistics
Multiple R          0.006314619
R Square            3.98744E-05
Adjusted R Square   1.48742E-05
Standard Error      28.85338366
Observations        40000

ANOVA
            df      SS            MS           F            Significance F
Regression  1       1327.832881   1327.83288   1.59496045   0.20662622
Residual    39998   33299044.92   832.517749
Total       39999   33300372.75

              Coefficients  Standard Error  t Stat      P-value     Lower 95%    Upper 95%   Lower 95.0%  Upper 95.0%
Intercept     49.98444707   0.17677858      282.751718  0           49.6379569   50.3309372  49.6379569   50.3309372
Traffic Type  0.386305922   0.305883751     1.26291743  0.20662622  -0.21323334  0.98584518  -0.21323334  0.98584518

5. Relationship between IoC detection and Anomaly Score

SUMMARY OUTPUT

Regression Statistics
Multiple R          0.000204821
R Square            4.19516E-08
Adjusted R Square   -2.49593E-05
Standard Error      28.85395833
Observations        40000

ANOVA
            df      SS            MS           F            Significance F
Regression  1       1.397005487   1.39700549   0.00167798   0.96732545
Residual    39998   33300371.35   832.550911
Total       39999   33300372.75

                           Coefficients  Standard Error  t Stat      P-value     Lower 95%    Upper 95%   Lower 95.0%  Upper 95.0%
Intercept                  50.1075635    0.204028296     245.591246  0           49.7076633   50.5074637  49.7076633   50.5074637
IoC detected/not detected  0.0118195     0.288539583     0.04096318  0.96732545  -0.55372479  0.57736379  -0.55372479  0.57736379

Other plots attempted

Linear Relationship between Age and Credit Score

Inference: There is no linear relationship between Age and Credit Score

Linear Relationship between Age and Income


Inference: There is no linear relationship between Age and Income

Linear Relationship between Income and Credit Score

Inference: There is no linear relationship between Income and Credit Score

Linear Relationship between Income and Asset


Inference: There is no linear relationship between Income and Asset value

Linear Relationship between Income and Loan amount

Inference: There is no linear relationship between Income and Loan Amount

Linear Relationship between Debt-to-Income Ratio and Loan amount


Inference: There is no linear relationship between Debt-to-Income Ratio and Loan Amount
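
A linear relationship of this kind is usually quantified with the Pearson correlation coefficient r, which for a single-predictor regression equals Multiple R up to sign. The sketch below computes r in plain Python. The age and credit score values are invented for illustration, and `pearson_r` is a hypothetical helper name. A value of r near zero indicates no linear relationship.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient of two equally long numeric lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of cross-products and sums of squares around the means.
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical sample, not taken from the data set.
age = [23, 35, 41, 52, 29, 60, 47, 33]
credit_score = [690, 710, 655, 700, 720, 675, 705, 660]
print(f"r = {pearson_r(age, credit_score):.4f}")
```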
