HANDS-ON ANALYSIS 1
Churn Data Set
Using the churn data set, develop EDA which shows that the remaining numeric variables in the data set (apart from those covered in the text above) indicate no obvious association with the target variable.
Figure 1 AccountLength
Figure 2 DayCalls
Figure 3 EveningCalls
Figure 4 NightCalls
Figure 5 InternationalCalls
Figure 6 VoicemailMessages
From Figures 1 to 6, the normal curves on the histograms indicate no obvious relationship between the continuous variables and the target variable, churn.
Adult Data Set
Which variables are categorical and which are continuous?
The categorical variables are workclass, education, maritalstatus, occupation, relationship, race, sex, nativecountry, and income. The continuous variables are age, demogweight, educationnum, capitalgain, capitalloss, and hoursperweek.
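As a rough illustration of how this split could be checked programmatically, the sketch below treats a column as continuous when every value parses as a number and as categorical otherwise. The sample rows and the helper names (`is_number`, `split_by_type`) are hypothetical and are not taken from the data set or from SPSS.

```python
# Hypothetical sample rows; the values are illustrative only.
rows = [
    {"age": "39", "workclass": "State-gov", "hoursperweek": "40", "sex": "Male"},
    {"age": "50", "workclass": "Private", "hoursperweek": "13", "sex": "Female"},
]


def is_number(text):
    """Return True when the text can be parsed as a float."""
    try:
        float(text)
        return True
    except ValueError:
        return False


def split_by_type(records):
    """Split column names into (categorical, continuous) lists."""
    categorical, continuous = [], []
    for name in records[0]:
        if all(is_number(r[name]) for r in records):
            continuous.append(name)
        else:
            categorical.append(name)
    return categorical, continuous


categorical, continuous = split_by_type(rows)
print("categorical:", categorical)
print("continuous:", continuous)
```

A purely type-based rule is only a first pass: a numerically coded field such as educationnum parses as a number even though it encodes ordered categories, so the result still needs a manual check.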
Using software, construct a table of the first 10 records of the data set, in order to get a feel for the data. 
Table 1 First Ten Entries Statistics
| Statistic | age | demogweight | education-num | capital-gain | capital-loss | hours-per-week |
| N (Valid) | 10 | 10 | 10 | 10 | 10 | 10 |
| N (Missing) | 0 | 0 | 0 | 0 | 0 | 0 |
| Mean | 41.90 | 180924.40 | 11.00 | 2143.60 | .00 | 36.40 |
| Median | 40.50 | 184914.50 | 13.00 | .00 | .00 | 40.00 |
| Std. Deviation | 8.825 | 94190.885 | 3.232 | 4520.900 | .000 | 12.020 |
| Minimum | 28 | 45781 | 5 | 0 | 0 | 13 |
| Maximum | 53 | 338409 | 14 | 14084 | 0 | 50 |
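Outside SPSS, the first ten records themselves could be listed with a few lines of code. The following is a minimal sketch using only the Python standard library; the in-memory file and its column names are hypothetical stand-ins for the real data file.

```python
import csv
import io
from itertools import islice

# Stand-in for the real data file: 12 hypothetical rows built in memory.
header = "age,workclass,hoursperweek,income"
body = "\n".join(f"{20 + i},Private,40,<=50K." for i in range(12))
data_file = io.StringIO(header + "\n" + body)

# Read only the first 10 records instead of loading the whole file.
first_ten = list(islice(csv.DictReader(data_file), 10))

for record in first_ten:
    print(record)
```

With a real file, `data_file` would be replaced by an `open(...)` call on the data set's path.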
Investigate whether we have any correlated variables.
Figure 7 ScatterPlot
From Figure 7, it is evident that none of the variables are correlated with one another.
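A scatter plot can be backed up with a numeric check. The sketch below computes the Pearson correlation coefficient for one pair of variables; the `age` and `hours` values are hypothetical, and `pearson_r` is an illustrative helper rather than anything produced by SPSS.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    # Assumes neither variable is constant, so the denominator is non-zero.
    return cov / sqrt(var_x * var_y)


# Hypothetical values for two numeric variables.
age = [25, 32, 47, 51, 62]
hours = [40, 45, 38, 50, 20]

r = pearson_r(age, hours)
print(f"r(age, hours) = {r:.3f}")
```

Values of r near 0 are consistent with no linear relationship, while values near +1 or -1 flag a pair worth a closer look.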
For each of the categorical variables, construct a bar chart of the variable, with an overlay of the target variable. Normalize if necessary. Discuss the relationship, if any, each of these variables has with the target variables. Which variables would you expect to make a significant appearance in any data mining classification model we work with?
Figure 8 Workclass
Figure 9 Education
Figure 10 Maritalstatus
Figure 11 Occupation
Figure 12 Relationship
Figure 13 Race
Figure 14 Sex
Figure 15 Nativecountry
From Figures 8 to 15 we can deduce that those in the Private workclass category have the largest share in both income groups. Those with an HS-grad education earn the least. Those who have never been married earn the least, while those who are Married-civ-spouse earn the most. The Prof-specialty and Exec-managerial occupations earn the most. Husbands account for most of the >50K group. The White race has the largest share in both income groups. Also, men have the largest proportions in both income groups.
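Normalizing a bar chart amounts to replacing raw counts with the share of each category that falls in the >50K. group. As a small worked sketch, the code below does this for the sex variable, using the counts reported in the Sex crosstab of this document; the function name `high_income_share` is hypothetical.

```python
# Counts copied from the Sex crosstab: (<=50K., >50K.) per category.
counts = {
    "Female": (7395, 896),
    "Male": (11621, 5088),
}


def high_income_share(table):
    """Fraction of each category that falls in the >50K. group."""
    return {cat: high / (low + high) for cat, (low, high) in table.items()}


share = high_income_share(counts)
for cat, value in share.items():
    print(f"{cat}: {value:.1%} earn >50K.")
```

Comparing these within-category shares, rather than raw counts, removes the effect of one category simply having more records than another.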
For each pair of categorical variables, construct a crosstabulation. Discuss your salient results.
Table 2 Workclass Crosstab
| workclass | <=50K. Count | <=50K. % within income | >50K. Count | >50K. % within income | Total Count | Total % within income |
| ? | 1251 | 6.6% | 148 | 2.5% | 1399 | 5.6% |
| Federal-gov | 467 | 2.5% | 283 | 4.7% | 750 | 3.0% |
| Local-gov | 1156 | 6.1% | 468 | 7.8% | 1624 | 6.5% |
| Never-worked | 5 | 0.0% | 0 | 0.0% | 5 | 0.0% |
| Private | 13624 | 71.6% | 3761 | 62.9% | 17385 | 69.5% |
| Self-emp-inc | 379 | 2.0% | 478 | 8.0% | 857 | 3.4% |
| Self-emp-not-inc | 1411 | 7.4% | 567 | 9.5% | 1978 | 7.9% |
| State-gov | 714 | 3.8% | 279 | 4.7% | 993 | 4.0% |
| Without-pay | 9 | 0.0% | 0 | 0.0% | 9 | 0.0% |
| Total | 19016 | 100.0% | 5984 | 100.0% | 25000 | 100.0% |
Table 3 Education Crosstab
| education | <=50K. Count | <=50K. % within income | >50K. Count | >50K. % within income | Total Count | Total % within income |
| 10th | 666 | 3.5% | 55 | 0.9% | 721 | 2.9% |
| 11th | 858 | 4.5% | 51 | 0.9% | 909 | 3.6% |
| 12th | 299 | 1.6% | 24 | 0.4% | 323 | 1.3% |
| 1st-4th | 115 | 0.6% | 5 | 0.1% | 120 | 0.5% |
| 5th-6th | 233 | 1.2% | 11 | 0.2% | 244 | 1.0% |
| 7th-8th | 460 | 2.4% | 31 | 0.5% | 491 | 2.0% |
| 9th | 374 | 2.0% | 20 | 0.3% | 394 | 1.6% |
| Assoc-acdm | 599 | 3.1% | 202 | 3.4% | 801 | 3.2% |
| Assoc-voc | 786 | 4.1% | 273 | 4.6% | 1059 | 4.2% |
| Bachelors | 2428 | 12.8% | 1712 | 28.6% | 4140 | 16.6% |
| Doctorate | 84 | 0.4% | 231 | 3.9% | 315 | 1.3% |
| HS-grad | 6826 | 35.9% | 1294 | 21.6% | 8120 | 32.5% |
| Masters | 569 | 3.0% | 731 | 12.2% | 1300 | 5.2% |
| Preschool | 36 | 0.2% | 0 | 0.0% | 36 | 0.1% |
| Prof-school | 114 | 0.6% | 316 | 5.3% | 430 | 1.7% |
| Some-college | 4569 | 24.0% | 1028 | 17.2% | 5597 | 22.4% |
| Total | 19016 | 100.0% | 5984 | 100.0% | 25000 | 100.0% |
Table 4 Marital-Status Crosstab
| marital-status | <=50K. Count | <=50K. % within income | >50K. Count | >50K. % within income | Total Count | Total % within income |
| Divorced | 3085 | 16.2% | 350 | 5.8% | 3435 | 13.7% |
| Married-AF-spouse | 9 | 0.0% | 7 | 0.1% | 16 | 0.1% |
| Married-civ-spouse | 6336 | 33.3% | 5105 | 85.3% | 11441 | 45.8% |
| Married-spouse-absent | 301 | 1.6% | 27 | 0.5% | 328 | 1.3% |
| Never-married | 7840 | 41.2% | 385 | 6.4% | 8225 | 32.9% |
| Separated | 736 | 3.9% | 50 | 0.8% | 786 | 3.1% |
| Widowed | 709 | 3.7% | 60 | 1.0% | 769 | 3.1% |
| Total | 19016 | 100.0% | 5984 | 100.0% | 25000 | 100.0% |
Table 5 Occupation Crosstab
| occupation | <=50K. Count | <=50K. % within income | >50K. Count | >50K. % within income | Total Count | Total % within income |
| ? | 1256 | 6.6% | 148 | 2.5% | 1404 | 5.6% |
| Adm-clerical | 2582 | 13.6% | 393 | 6.6% | 2975 | 11.9% |
| Armed-Forces | 7 | 0.0% | 0 | 0.0% | 7 | 0.0% |
| Craft-repair | 2419 | 12.7% | 703 | 11.7% | 3122 | 12.5% |
| Exec-managerial | 1596 | 8.4% | 1488 | 24.9% | 3084 | 12.3% |
| Farming-fishing | 677 | 3.6% | 90 | 1.5% | 767 | 3.1% |
| Handlers-cleaners | 936 | 4.9% | 64 | 1.1% | 1000 | 4.0% |
| Machine-op-inspct | 1348 | 7.1% | 188 | 3.1% | 1536 | 6.1% |
| Other-service | 2443 | 12.8% | 112 | 1.9% | 2555 | 10.2% |
| Priv-house-serv | 121 | 0.6% | 0 | 0.0% | 121 | 0.5% |
| Prof-specialty | 1758 | 9.2% | 1422 | 23.8% | 3180 | 12.7% |
| Protective-serv | 343 | 1.8% | 160 | 2.7% | 503 | 2.0% |
| Sales | 2064 | 10.9% | 751 | 12.6% | 2815 | 11.3% |
| Tech-support | 488 | 2.6% | 215 | 3.6% | 703 | 2.8% |
| Transport-moving | 978 | 5.1% | 250 | 4.2% | 1228 | 4.9% |
| Total | 19016 | 100.0% | 5984 | 100.0% | 25000 | 100.0% |
Table 6 Relationship Crosstab
| relationship | <=50K. Count | <=50K. % within income | >50K. Count | >50K. % within income | Total Count | Total % within income |
| Husband | 5554 | 29.2% | 4510 | 75.4% | 10064 | 40.3% |
| Not-in-family | 5783 | 30.4% | 660 | 11.0% | 6443 | 25.8% |
| Other-relative | 701 | 3.7% | 28 | 0.5% | 729 | 2.9% |
| Own-child | 3856 | 20.3% | 55 | 0.9% | 3911 | 15.6% |
| Unmarried | 2481 | 13.0% | 159 | 2.7% | 2640 | 10.6% |
| Wife | 641 | 3.4% | 572 | 9.6% | 1213 | 4.9% |
| Total | 19016 | 100.0% | 5984 | 100.0% | 25000 | 100.0% |
Table 7 Race Crosstab
| race | <=50K. Count | <=50K. % within income | >50K. Count | >50K. % within income | Total Count | Total % within income |
| Amer-Indian-Eskimo | 214 | 1.1% | 27 | 0.5% | 241 | 1.0% |
| Asian-Pac-Islander | 575 | 3.0% | 200 | 3.3% | 775 | 3.1% |
| Black | 2081 | 10.9% | 298 | 5.0% | 2379 | 9.5% |
| Other | 199 | 1.0% | 15 | 0.3% | 214 | 0.9% |
| White | 15947 | 83.9% | 5444 | 91.0% | 21391 | 85.6% |
| Total | 19016 | 100.0% | 5984 | 100.0% | 25000 | 100.0% |
Table 8 Sex Crosstab
| sex | <=50K. Count | <=50K. % within income | >50K. Count | >50K. % within income | Total Count | Total % within income |
| Female | 7395 | 38.9% | 896 | 15.0% | 8291 | 33.2% |
| Male | 11621 | 61.1% | 5088 | 85.0% | 16709 | 66.8% |
| Total | 19016 | 100.0% | 5984 | 100.0% | 25000 | 100.0% |
Table 9 Native-country Crosstab
| native-country | <=50K. Count | <=50K. % within income | >50K. Count | >50K. % within income | Total Count | Total % within income |
| ? | 337 | 1.8% | 108 | 1.8% | 445 | 1.8% |
| Cambodia | 12 | 0.1% | 4 | 0.1% | 16 | 0.1% |
| Canada | 64 | 0.3% | 35 | 0.6% | 99 | 0.4% |
| China | 45 | 0.2% | 15 | 0.3% | 60 | 0.2% |
| Columbia | 43 | 0.2% | 2 | 0.0% | 45 | 0.2% |
| Cuba | 57 | 0.3% | 15 | 0.3% | 72 | 0.3% |
| Dominican-Republic | 52 | 0.3% | 2 | 0.0% | 54 | 0.2% |
| Ecuador | 17 | 0.1% | 2 | 0.0% | 19 | 0.1% |
| El-Salvador | 65 | 0.3% | 7 | 0.1% | 72 | 0.3% |
| England | 46 | 0.2% | 26 | 0.4% | 72 | 0.3% |
| France | 12 | 0.1% | 9 | 0.2% | 21 | 0.1% |
| Germany | 70 | 0.4% | 32 | 0.5% | 102 | 0.4% |
| Greece | 16 | 0.1% | 7 | 0.1% | 23 | 0.1% |
| Guatemala | 49 | 0.3% | 1 | 0.0% | 50 | 0.2% |
| Haiti | 35 | 0.2% | 3 | 0.1% | 38 | 0.2% |
| Holand-Netherlands | 1 | 0.0% | 0 | 0.0% | 1 | 0.0% |
| Honduras | 7 | 0.0% | 1 | 0.0% | 8 | 0.0% |
| Hong | 7 | 0.0% | 4 | 0.1% | 11 | 0.0% |
| Hungary | 9 | 0.0% | 1 | 0.0% | 10 | 0.0% |
| India | 42 | 0.2% | 25 | 0.4% | 67 | 0.3% |
| Iran | 22 | 0.1% | 13 | 0.2% | 35 | 0.1% |
| Ireland | 14 | 0.1% | 5 | 0.1% | 19 | 0.1% |
| Italy | 33 | 0.2% | 22 | 0.4% | 55 | 0.2% |
| Jamaica | 50 | 0.3% | 8 | 0.1% | 58 | 0.2% |
| Japan | 28 | 0.1% | 20 | 0.3% | 48 | 0.2% |
| Laos | 7 | 0.0% | 1 | 0.0% | 8 | 0.0% |
| Mexico | 463 | 2.4% | 25 | 0.4% | 488 | 2.0% |
| Nicaragua | 23 | 0.1% | 2 | 0.0% | 25 | 0.1% |
| Outlying-US(Guam-USVI-etc) | 8 | 0.0% | 0 | 0.0% | 8 | 0.0% |
| Peru | 22 | 0.1% | 1 | 0.0% | 23 | 0.1% |
| Philippines | 110 | 0.6% | 41 | 0.7% | 151 | 0.6% |
| Poland | 40 | 0.2% | 9 | 0.2% | 49 | 0.2% |
| Portugal | 25 | 0.1% | 3 | 0.1% | 28 | 0.1% |
| Puerto-Rico | 85 | 0.4% | 11 | 0.2% | 96 | 0.4% |
| Scotland | 8 | 0.0% | 1 | 0.0% | 9 | 0.0% |
| South | 49 | 0.3% | 15 | 0.3% | 64 | 0.3% |
| Taiwan | 24 | 0.1% | 18 | 0.3% | 42 | 0.2% |
| Thailand | 12 | 0.1% | 3 | 0.1% | 15 | 0.1% |
| Trinadad&Tobago | 11 | 0.1% | 1 | 0.0% | 12 | 0.0% |
| United-States | 16941 | 89.1% | 5480 | 91.6% | 22421 | 89.7% |
| Vietnam | 48 | 0.3% | 2 | 0.0% | 50 | 0.2% |
| Yugoslavia | 7 | 0.0% | 4 | 0.1% | 11 | 0.0% |
| Total | 19016 | 100.0% | 5984 | 100.0% | 25000 | 100.0% |
From Tables 2 to 9, it is evident that the salient results are similar to those displayed in the bar charts from question 26.
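The "% within income" figures in these crosstabulations are column percentages: each cell count divided by the total for its income class. The sketch below shows one way to build such a table from raw records with the Python standard library; the records are hypothetical and the variable names are illustrative.

```python
from collections import Counter

# Hypothetical (sex, income) records for illustration only.
records = [
    ("Male", ">50K."), ("Male", ">50K."), ("Male", "<=50K."),
    ("Female", "<=50K."), ("Female", "<=50K."), ("Female", "<=50K."),
    ("Female", ">50K."),
]

# Cell counts of the crosstab, keyed by (category, income class).
cell_counts = Counter(records)

# Column totals: number of records in each income class.
income_totals = Counter(income for _, income in records)

# "% within income": each cell as a percentage of its income column.
pct_within_income = {
    (category, income): 100.0 * n / income_totals[income]
    for (category, income), n in cell_counts.items()
}

for key in sorted(pct_within_income):
    print(key, f"{pct_within_income[key]:.1f}%")
```

Within each income class the percentages add up to 100, which matches the Total row of the tables above.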
(If your software supports this.) Construct a web graph of the categorical variables. Fine tune the graph so that interesting results emerge. Discuss your findings.
IBM SPSS Statistics 23 does not support the construction of web graphs.
Report on whether anomalous fields exist in this data set, based on your EDA, which fields these are, and what we should do about it.
Table 10 workclass
| workclass | Frequency | Percent | Valid Percent | Cumulative Percent |
| ? | 1399 | 5.6 | 5.6 | 5.6 |
| Federal-gov | 750 | 3.0 | 3.0 | 8.6 |
| Local-gov | 1624 | 6.5 | 6.5 | 15.1 |
| Never-worked | 5 | .0 | .0 | 15.1 |
| Private | 17385 | 69.5 | 69.5 | 84.7 |
| Self-emp-inc | 857 | 3.4 | 3.4 | 88.1 |
| Self-emp-not-inc | 1978 | 7.9 | 7.9 | 96.0 |
| State-gov | 993 | 4.0 | 4.0 | 100.0 |
| Without-pay | 9 | .0 | .0 | 100.0 |
| Total | 25000 | 100.0 | 100.0 | |
Table 11 race
| race | Frequency | Percent | Valid Percent | Cumulative Percent |
| Amer-Indian-Eskimo | 241 | 1.0 | 1.0 | 1.0 |
| Asian-Pac-Islander | 775 | 3.1 | 3.1 | 4.1 |
| Black | 2379 | 9.5 | 9.5 | 13.6 |
| Other | 214 | .9 | .9 | 14.4 |
| White | 21391 | 85.6 | 85.6 | 100.0 |
| Total | 25000 | 100.0 | 100.0 | |
Table 12 sex
| sex | Frequency | Percent | Valid Percent | Cumulative Percent |
| Female | 8291 | 33.2 | 33.2 | 33.2 |
| Male | 16709 | 66.8 | 66.8 | 100.0 |
| Total | 25000 | 100.0 | 100.0 | |
Table 13 native-country
| native-country | Frequency | Percent | Valid Percent | Cumulative Percent |
| ? | 445 | 1.8 | 1.8 | 1.8 |
| Cambodia | 16 | .1 | .1 | 1.8 |
| Canada | 99 | .4 | .4 | 2.2 |
| China | 60 | .2 | .2 | 2.5 |
| Columbia | 45 | .2 | .2 | 2.7 |
| Cuba | 72 | .3 | .3 | 2.9 |
| Dominican-Republic | 54 | .2 | .2 | 3.2 |
| Ecuador | 19 | .1 | .1 | 3.2 |
| El-Salvador | 72 | .3 | .3 | 3.5 |
| England | 72 | .3 | .3 | 3.8 |
| France | 21 | .1 | .1 | 3.9 |
| Germany | 102 | .4 | .4 | 4.3 |
| Greece | 23 | .1 | .1 | 4.4 |
| Guatemala | 50 | .2 | .2 | 4.6 |
| Haiti | 38 | .2 | .2 | 4.8 |
| Holand-Netherlands | 1 | .0 | .0 | 4.8 |
| Honduras | 8 | .0 | .0 | 4.8 |
| Hong | 11 | .0 | .0 | 4.8 |
| Hungary | 10 | .0 | .0 | 4.9 |
| India | 67 | .3 | .3 | 5.1 |
| Iran | 35 | .1 | .1 | 5.3 |
| Ireland | 19 | .1 | .1 | 5.4 |
| Italy | 55 | .2 | .2 | 5.6 |
| Jamaica | 58 | .2 | .2 | 5.8 |
| Japan | 48 | .2 | .2 | 6.0 |
| Laos | 8 | .0 | .0 | 6.0 |
| Mexico | 488 | 2.0 | 2.0 | 8.0 |
| Nicaragua | 25 | .1 | .1 | 8.1 |
| Outlying-US(Guam-USVI-etc) | 8 | .0 | .0 | 8.1 |
| Peru | 23 | .1 | .1 | 8.2 |
| Philippines | 151 | .6 | .6 | 8.8 |
| Poland | 49 | .2 | .2 | 9.0 |
| Portugal | 28 | .1 | .1 | 9.1 |
| Puerto-Rico | 96 | .4 | .4 | 9.5 |
| Scotland | 9 | .0 | .0 | 9.5 |
| South | 64 | .3 | .3 | 9.8 |
| Taiwan | 42 | .2 | .2 | 10.0 |
| Thailand | 15 | .1 | .1 | 10.0 |
| Trinadad&Tobago | 12 | .0 | .0 | 10.1 |
| United-States | 22421 | 89.7 | 89.7 | 99.8 |
| Vietnam | 50 | .2 | .2 | 100.0 |
| Yugoslavia | 11 | .0 | .0 | 100.0 |
| Total | 25000 | 100.0 | 100.0 | |
Table 14 income
| income | Frequency | Percent | Valid Percent | Cumulative Percent |
| <=50K. | 19016 | 76.1 | 76.1 | 76.1 |
| >50K. | 5984 | 23.9 | 23.9 | 100.0 |
| Total | 25000 | 100.0 | 100.0 | |
Figure 16
Figure 17
Figure 18
The anomalous fields are capital-gain and capital-loss. Neither field shows an evident relationship with income, and there is no pattern in how capital-gain and capital-loss affect the target variable.
Report the mean, median, minimum, maximum, and standard deviation for each of the numerical variables.
Table 15 Statistics
| Statistic | age | demogweight | education-num | capital-gain | capital-loss | hours-per-week |
| N (Valid) | 25000 | 25000 | 25000 | 25000 | 25000 | 25000 |
| N (Missing) | 0 | 0 | 0 | 0 | 0 | 0 |
| Mean | 38.61 | 189741.84 | 10.08 | 1088.58 | 86.50 | 40.41 |
| Median | 37.00 | 178353.00 | 10.00 | .00 | .00 | 40.00 |
| Std. Deviation | 13.688 | 105294.740 | 2.557 | 7486.621 | 401.254 | 12.299 |
| Minimum | 17 | 12285 | 1 | 0 | 0 | 1 |
| Maximum | 90 | 1484705 | 16 | 99999 | 4356 | 99 |
The mean, median, standard deviation, minimum, and maximum for age, demogweight, education-num, capital-gain, capital-loss, and hours-per-week are shown in Table 15.
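For readers without SPSS, the same five statistics can be computed with the Python standard library. This is a minimal sketch on hypothetical values; `summarize` is an illustrative helper, and it uses the sample standard deviation (divisor n - 1), which is assumed here to match the "Std. Deviation" row.

```python
import statistics


def summarize(values):
    """Return the five summary statistics reported for each numeric variable."""
    return {
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "std": statistics.stdev(values),  # sample standard deviation (n - 1)
        "min": min(values),
        "max": max(values),
    }


# Hypothetical values standing in for one numeric column.
hours = [2, 4, 4, 4, 5, 5, 7, 9]

summary = summarize(hours)
for name, value in summary.items():
    print(f"{name}: {value}")
```

A mean far above the median, as Table 15 shows for capital-gain and capital-loss, is a quick sign of a heavily right-skewed variable.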
Construct a histogram of each numerical variables, with an overlay of the target variable income. Normalize if necessary. Discuss the relationship, if any, each of these variables has with the target variables.
Figure 19
Figure 20
Figure 21
Figure 22
Figure 23
Figure 24
From Figures 19 to 24, we can deduce that, for both the <=50K and >50K income groups, the distributions peak at around 40 years of age, at a demogweight of around 200,000, and at an education-num of around 10.
For each pair of numerical variables, construct a scatter plot of the variables. Discuss your salient results.
Figure 25 ScatterPlot
There does not seem to be any relationship among age, demogweight, capital-gain, capital-loss, and hours-per-week.
Based on your EDA so far, identify interesting sub-groups of records within the data set that would be worth further investigation.
From the EDA, the Private workclass, White race, Male sex, and United-States native-country subgroups are worth further investigation.
Apply binning to one of the numerical variables. Do it in such a way as to maximize the effect of the classes thus created (following the suggestions in the text). Now do it in such a way as to minimize the effect of the classes, so that the difference between the classes is diminished. Comment.
Figure 26 No Binning
Figure 27 Binning – Maximize Effect
Figure 28 Binning – Minimize Effect
From Figures 26 to 28, binning demogweight to maximize the effect helps reveal the gradual decline as we move from left to right, whereas binning to minimize the effect does not reveal this gradual decline.
Refer to the previous exercise. Apply the other two binning methods (equal width, and equal number of records) to this same variable. Compare the results and discuss the differences. Which method do you prefer?
After comparing Figures 29 and 30, I prefer the equal-width binning method since it produces fewer errors, although the two histograms appear almost identical.
Figure 29 Binning – Equal Number of Records
Figure 30 Binning – Equal Width
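The difference between the two binning methods can be sketched in a few lines. In the code below, `equal_width_bins` splits the value range into k intervals of equal length, while `equal_frequency_bins` assigns roughly the same number of records to each bin. Both helpers and the `weights` values are hypothetical illustrations, not the SPSS procedure used for the figures.

```python
from collections import Counter


def equal_width_bins(values, k):
    """Assign each value to one of k intervals of equal length."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k  # assumes hi > lo
    # The maximum value is placed in the last bin rather than a (k+1)-th one.
    return [min(int((v - lo) / width), k - 1) for v in values]


def equal_frequency_bins(values, k):
    """Assign each value to one of k bins holding about the same number of records."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    bins = [0] * len(values)
    for rank, i in enumerate(order):
        bins[i] = rank * k // len(values)
    return bins


# Hypothetical right-skewed values, loosely in the spirit of demogweight.
weights = [12, 15, 18, 20, 22, 25, 30, 60, 90, 150, 400, 1480]

width_counts = Counter(equal_width_bins(weights, 3))
freq_counts = Counter(equal_frequency_bins(weights, 3))
print("equal width:    ", dict(width_counts))
print("equal frequency:", dict(freq_counts))
```

On skewed data, equal-width bins tend to put most records in the first bin and leave others nearly empty, whereas equal-frequency bins stay balanced but can separate nearly identical values into different bins.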
Summarize your salient EDA findings from the above exercises. Just as if you were writing a report.
The EDA analyzed the variables and identified the anomalous variable income, whereby <=50K represents two possibilities. Histograms were also drawn from the data, and the distribution of the categorical variables and the interrelationships between the variables were explored. It was evident that the variables capital-gain and capital-loss had no substantial relationship with the target variable.