Paper 216-2009
ABSTRACT
Percent tallies associated with midpoint-labeled intervals define the basic histogram generated in PROC UNIVARIATE. However, most statistics textbooks display histograms with frequencies and endpoints rather than percents and midpoints. Frequencies are more descriptive, and endpoints are better suited for continuous data. With recent updates to SAS software it is now easy to generate a textbook histogram by using PROC UNIVARIATE. Enhancements to the textbook histogram such as a normal curve overlay and bar height labels are also easily managed in PROC UNIVARIATE.
Unfortunately, the UNIVARIATE ENDPOINTS= option, new for SAS 9.1.3, is restricted in form to <m TO n BY increment>. This means that plotting an n-bar histogram or a histogram with unequal intervals is only possible when the graph is developed from scratch in PROC GPLOT. A macro that works with any release of SAS software is provided that automates the production of GPLOT-generated histograms.
With complete instructions provided for both UNIVARIATE and GPLOT derived histograms, you should come away from this presentation knowing how to create a textbook histogram in SAS.
For the histogram, then, the width of the bar becomes an added dimension for conveying information. This means
that bar widths in a histogram do not have to be equal.
THE SAS-STYLE HISTOGRAM IS DERIVED FROM THE GCHART PROCEDURE
Prior to Version 8, the only way to quickly generate a histogram in SAS was to remove the DISCRETE option from an invocation of PROC GCHART while setting SPACE (between bars) to zero. Just like a regular bar chart, measurement classes were labeled at midpoints along the horizontal axis. Now, as Figure 1 demonstrates, the output is almost identical when defaults are applied to a PROC UNIVARIATE generated histogram.
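The two approaches can be sketched as follows. This is a minimal illustration rather than the code behind Figure 1: the data set name MEETINGS is an assumption, while HOURS is the variable plotted in the figure.

```sas
/* GCHART-style histogram: DISCRETE is omitted and the space between bars is zero. */
/* MEETINGS is an assumed data set name.                                           */
proc gchart data=meetings;
  vbar hours / space=0;
run;
quit;

/* UNIVARIATE histogram with every default left in place: */
/* percents are plotted at midpoint-labeled intervals.    */
proc univariate data=meetings noprint;
  histogram hours;
run;
```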
Figure 1. GCHART and UNIVARIATE generate similar histograms. For GCHART, FREQUENCY is the default setting
whereas PERCENTS are plotted in UNIVARIATE. Midpoints are also marked by ticks in the UNIVARIATE histogram.
[Figure 1 panels: GCHART histogram (vertical axis: Number of Meetings) and UNIVARIATE histogram (vertical axis: Percent); horizontal axis in both panels: Hours, with midpoints 0 to 4 by 0.8.]
SAS Global Forum 2009 Reporting and Information Visualization
The MEETINGS data set graphed in Figure 1 comes from The How-To Book for SAS/GRAPH Software by Thomas Miron [2, p.88] (copyright 1995, SAS Institute Inc., Cary, NC, USA. All Rights Reserved; reproduced with permission of SAS Institute Inc., Cary, NC). Since the variation in meeting lengths is not infinite, HOURS in Figure 1 would be better described as a discrete, continuous variable. Lots of ties exist in the small, 32-observation data set. As Figure 2 demonstrates, the needle plot is the graph of choice for discrete continuous data whereas the histogram should be reserved for data that are truly continuous.
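A needle plot of this kind can be sketched by tallying the tied values first and then drawing one vertical line per distinct value. The data set names MEETINGS and HOURSFREQ are assumptions, and the SYMBOL settings are illustrative rather than those used for Figure 2.

```sas
/* Count how many meetings share each distinct length. */
/* MEETINGS and HOURSFREQ are assumed data set names.  */
proc freq data=meetings noprint;
  tables hours / out=hoursFreq(keep=hours count);
run;

/* INTERPOL=NEEDLE draws a vertical line from each point toward the horizontal axis. */
symbol1 interpol=needle value=none width=3 color=black;

proc gplot data=hoursFreq;
  plot count*hours;
run;
quit;
```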
Figure 2. The MEETINGS data set is graphed as a needle plot. Discrete continuous data do not need to be summarized. On the
other hand, information about 300+ baseball players can be collapsed into six bar areas of a histogram, because it is possible for
a player to score any number of runs during a given season.
[Figure 2 panels: needle plot of Number of Meetings by Hours (0.25 to 4.0); histogram of Number of Players by Runs (0 to 150 by 25).]
Even though the MEETINGS data should not be plotted as a histogram, the small data set highlights structural issues associated with plot construction. Therefore, MEETINGS histograms appear alongside their BASEBALL counterparts throughout the paper.
[Histogram residue: horizontal axis Hours with ticks 0, 0.8, 1.6, 2.4, 3.2, 4; a question mark sits to the left of 0.]
• ftext= htext= While GCHART-generated histograms can reference AXIS statements, PROC UNIVARIATE must rely on a limited set of options in the GOPTIONS statement to format values and labels along the horizontal axis.
The internal algorithm developed by Terrell and Scott that SAS uses for default midpoint assignments is also "primarily applicable to continuous data that are approximately normally distributed" [8, p.225]. Thus the MEETINGS data set presents difficulties when defaults are used in histogram construction. Meeting lengths are not normally distributed, and the negative time implied by the question mark in Figure 2 doesn't exist.
CONSTRUCTING A TEXTBOOK HISTOGRAM:
The results from a poll of eleven statistics references can be found in Tables 1 and 2 in the appendix. From Table 1,
nine out of the eleven references contain at least one histogram with endpoints rather than midpoints. Two midpoint
histograms from the surveyed texts present problems that are illustrated in Figure 4 below.
The first graph in Figure 4 shows how a histogram was used for quality control [12, 229-231]. The diameters of 500
rods were grouped at 0.001 intervals. The lower specification limit (LSL) was set to 1.000. If a rod diameter was less
than 1.000, the rod had to be discarded. Rods with diameters greater than 1.000 could be retooled. A zero at 0.999
raised questions about the results, and the mystery was cleared up when inspectors revealed that they passed rods
that were slightly below the lower specification limit. However, given that rod diameters could be anywhere from
0.996 to 1.008 cm inclusive in diameter, what about rods between 0.9995 and 0.9999 cm? Shouldn't they have been
rejected too? If endpoints had been used instead of midpoints, the results would have been unambiguous.
The second graph of theoretical proportions is not so convoluted [18, 162]. Negative proportions don't exist. Again zero appears as an (unlabeled) midpoint, so half of the corresponding bar is out of bounds.
Figure 4. Two histograms from the eleven statistical references contain ambiguities caused by using midpoints rather than endpoints to define class intervals. The histograms were re-created by visual inspection. Therefore, bar heights are estimated.
[Figure 4 panels: "Distribution of Inside Diameters for 500 Steel Rods" from 'Statistics: A Guide to the Unknown', p.230 (Frequency by Diameter(CM), midpoints 0.996 to 1.008, LSL marked); "Distribution of a Theoretical Proportion" from 'Statistics: Concepts and Controversies', p.162 (Relative Frequency by Value of P-Hat, midpoints -0.04 to 0.44).]
By default, SAS uses a BEST format with varied precision to label class intervals in a histogram. If uniform precision is desired, a specific format should be defined. For example, to display 0.0, 0.8, 1.6, 2.4, 3.2, 4.0, and 4.8 in Figure 5, insert "format hours 3.1;" into the code.
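A minimal sketch of an endpoint histogram with uniformly formatted axis values is shown below. It assumes a data set named MEETINGS; the options themselves are the ones named in this paper, but the listing is an illustration and not the code that produced Figure 5.

```sas
/* Endpoint histogram with one decimal place on every axis value. */
/* MEETINGS is an assumed data set name.                          */
proc univariate data=meetings noprint;
  histogram hours / endpoints=0 to 4.8 by 0.8   /* <m TO n BY increment> form */
                    cfill=ltgray
                    noframe;
  format hours 3.1;                             /* 0.0, 0.8, ..., 4.8 instead of BEST. */
  label hours='Hours';
run;
```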
Figure 5. An endpoint histogram is created in PROC UNIVARIATE with the ENDPOINTS= option.
[Figure 5 graph: "Histogram for Meeting Lengths"; horizontal axis: Hours, with endpoints 0, 0.8, 1.6, 2.4, 3.2, 4, 4.8.]
filename histo "&outpath.\Fig5a.cgm";
goptions htext=4pct htitle=5pct
         gsfname=histo;
...
         noframe;
Label Hours='Hours';
run;
Figure 6. Percents are changed to frequencies by setting the VSCALE= option to COUNT.
[Figure 6 graph: "Histogram for Meeting Lengths"; horizontal axis: Hours, with endpoints 0, 0.8, 1.6, 2.4, 3.2, 4, 4.8.]
filename histo "&outpath.\Fig6a.cgm";
goptions htext=4pct htitle=5pct
         gsfname=histo;
...
• vscale=count Other choices available are PERCENT, the default, and PROPORTION (or relative frequency).
• vaxis= VALUE LIST In SAS 9.1.3, a NAME associated with an axis statement can also be used. There is no corresponding HAXIS= option for PROC UNIVARIATE.
• vaxisLabel= labels the vertical axis.
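The three options can be combined as in the sketch below. The data set name MEETINGS is assumed, and the VAXIS= value list is chosen only to match the 0-to-8 scale reported for Figure 6.

```sas
/* Frequencies instead of percents on the vertical axis. */
/* MEETINGS is an assumed data set name.                 */
proc univariate data=meetings noprint;
  histogram hours / endpoints=0 to 4.8 by 0.8
                    vscale=count                      /* default is PERCENT */
                    vaxis=0 2 4 6 8                   /* explicit value list */
                    vaxislabel="Number of Meetings"
                    cfill=ltgray
                    noframe;
  label hours='Hours';
run;
```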
Now that frequencies can be plotted along the vertical axis of a histogram with PROC UNIVARIATE, it would be desirable to attach counts to the individual bars. While this task can be easily completed in PROC GCHART with the OUTSIDE= or INSIDE= options, ANNOTATE must be used in PROC UNIVARIATE. A labeled histogram along with relevant SAS code is displayed in Figure 7.
Figure 7. Bars are labeled with output from an ANNOTATE data set that is linked to the HISTOGRAM statement in PROC
UNIVARIATE with the ANNOTATE= option.
Panel 2:
• tot_N The total number of observed values for hours is stored in a macro variable, so that percents in _OBSPCT_ can be converted to frequencies for display.
• noplot ... outhistogram= histoMtgsDS suppresses the histogram, since only the output data set is desired in this step.
• endpoints=0.0 to 4.80 by 0.8 is needed for calculating _MINPT_ and _OBSPCT_ in the ANNOTATE data set. Adding vscale=count vaxis=0 2 4 6 8 10 from panel 3 to panel 2 will not change the contents of the output data set. In other words, _OBSPCT_ is fixed. A corresponding automatic variable such as _OBSFREQ_ does not exist.
• data annoMtgsMP ... set histoDS The output data set from the first invocation of PROC UNIVARIATE is used as input to ANNOTATE.
• x = _minPT_ + 0.4; Since frequency labels are centered over the bar midpoint, one-half of the interval width (0.8), or 0.4, is added to _MINPT_.
• y = &tot_n * _obsPct_ * 0.01; _OBSPCT_ is converted to a frequency.
• %label(x,y,chN,CX0386BE,0,0,3.5,HWCGM001,2) For a description of the %Label annotate macro see [7, p. 685].
Panel 3:
• annotate=annoMtgsMP is the link to the ANNOTATE data set created in panel 2.
• endpoints=0 to 4.8 by 0.8 must contain the same range as the ENDPOINTS= option in panel 2.
• vaxis=0 2 4 6 8 10 The vertical axis maximum is increased to 10 to accommodate the labels.
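The panel notes above can be strung together as the sketch below. It is a reconstruction under stated assumptions, not the listing from Figure 7: the input data set name MEETINGS, the PROC SQL step that fills TOT_N, the %SYSTEM(2,2,3) coordinate choice, and the way CHN is built are all assumed, and a single data set name (HISTOMTGSDS) is used for the OUTHISTOGRAM= output.

```sas
/* Store the number of nonmissing HOURS values in a macro variable. */
/* MEETINGS is an assumed data set name.                            */
proc sql noprint;
  select count(hours) into :tot_N from meetings;
quit;

/* Panel 2: write the bin summary to a data set without drawing a graph. */
proc univariate data=meetings noprint;
  histogram hours / noplot
                    endpoints=0 to 4.8 by 0.8
                    outhistogram=histoMtgsDS;
run;

%annomac;                      /* make the annotate macros available */

data annoMtgsMP;
  %dclanno;
  set histoMtgsDS;
  %system(2, 2, 3);            /* assumed: x and y in data units, text size in percent */
  x = _minpt_ + 0.4;           /* left endpoint plus half of the 0.8 interval width */
  y = &tot_N * _obspct_ * 0.01;           /* percent converted to a frequency */
  chN = left(put(round(y), 3.));          /* label text; construction is assumed */
  %label(x, y, chN, CX0386BE, 0, 0, 3.5, HWCGM001, 2);
run;

/* Panel 3: redraw the histogram and attach the labels. */
proc univariate data=meetings noprint;
  histogram hours / annotate=annoMtgsMP
                    endpoints=0 to 4.8 by 0.8
                    cfill=ltgray noframe
                    vscale=count vaxis=0 2 4 6 8 10
                    vaxislabel="Number of Meetings";
  label hours='Hours';
run;
```

HWCGM001 is the hardware font used with the CGM driver elsewhere in this paper; a different device would likely need a different font name.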
Figure 8. Normal curves can be easily added to a UNIVARIATE-generated histogram. The histograms become more informative
when sample sizes and probabilities are also listed. Despite the larger sample size, the curve in panel 3 does not come from a
normal distribution whereas the curve in the panel 4 histogram is destined to be normal. The idea for the panel 4 histogram origi-
nates in Example 8 of the Version 8 Procedures Guide [8, p.1444-1446].
[Figure 8 panels: (1) histogram of Number of Meetings by Hours, inset N 32, prob(Norm) 0.0228; (2) the code listed below; (3) histogram of Number of Players by Hits, inset N 322, prob(Norm) 0.0004; (4) histogram of Count by Random Normal Deviate, inset N 500, prob(Norm) 0.8278.]
histogram hours /
   normal(noprint color=CX0386BE w=3)
   annotate= annoMtgsMP
   cfill=ltgray
   endpoints=0 to 4.8 by 0.8 noframe
   vscale=count vaxis=0 2 4 6 8 10
   vaxisLabel="Number of Meetings";
inset N probN='prob(Norm)'(6.4) /
   cfill=white height=3 position=s;
Label Hours='Hours';
run;
Panel 2:
• normal ... normal Two "normal" keywords are required. The first is for prob(Norm) and the second generates a curve.
• normal(noprint color=CX0386BE w=3) Secondary options associated with the second "normal" keyword assign a color and line thickness to the normal curve. NOPRINT "suppresses tables summarizing the curve" [9, 214].
• inset N probN='prob(Norm)'(6.4) The first part of the inset statement defines and formats the statistics that are displayed. In this instance N and PROBN are requested. PROBN is also assigned a label and format.
• cfill=white height=3 position=s Statement options control the appearance and position of the inset. The background color is set to white with CFILL, text HEIGHT is set to 3 (percent), and POSITION is set to s(outh).
The range is reconstructed internally as &XMIN to &XMAX by &CONFIGINFO. The UNIVARIATE and macro generated histograms appear side by side in Figure 9.
Figure 9. UNIVARIATE and macro generated histograms are almost identical when HISTOCONFIG is set to '2'.
[Figure 9 panels: "Meeting Lengths Histogram via UNIVARIATE" and "Meeting Lengths Histogram Via Macro"; vertical axis: Number of Meetings; horizontal axis: Hours, with endpoints 0 to 4.8 by 0.8; bars are labeled with their frequencies.]
Since PLOTHISTO makes use of the axis statement, both axis labels can be emboldened to set them apart from their corresponding axis values.
Example 2: Generating n-bar Histograms with HISTOCONFIG set to '1':
%PlotHisto(inds=work.hitsAndRuns, cgmFile=%str(&outpath.\Fig10a.cgm),
xvar=hits, xmin=0, xmax=250, xdataOffset=1,
HistoConfig=1, ConfigInfo=6, yorigin=12, pctSize=3,
XaxisLbl=Hits, YaxisLbl=%str(Frequency), xValFmt=3., yby=25,
ListFreqsYvN=Y,
title1=%str(move=(+10pct,+0pct) "Baseball Data from 1986-1987"));
%PlotHisto(inds=work.hitsAndRuns, cgmFile=%str(&outpath.\Fig10b.cgm),
xvar=hits, xmin=0, xmax=250, xdataOffset=1,
HistoConfig=1, ConfigInfo=9, yorigin=12, pctSize=3,
XaxisLbl=Hits, YaxisLbl=%str(Frequency), xValFmt=3., yby=25,
ListFreqsYvN=Y,
title1=%str(move=(+10pct,+0pct) "Baseball Data from 1986-1987"));
The only changes required for generating the two histograms in Figure 10 are the CGMFILE name and the CONFIGINFO value (6 versus 9) in the source code above. Use HISTOCONFIG='1' when the number of bars in a histogram is more important than the intervals that define the class boundaries.
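The arithmetic behind an n-bar histogram can be sketched in a short DATA step. This is an illustration under the assumption that the range from XMIN to XMAX is divided into CONFIGINFO equal-width bars; it is not taken from the PLOTHISTO source, and the data set and variable names are hypothetical.

```sas
/* Equal-width endpoints for an n-bar histogram (hypothetical names throughout). */
data nbarEndpoints;
  xmin  = 0;                      /* xmin= in the macro call        */
  xmax  = 250;                    /* xmax= in the macro call        */
  nbars = 6;                      /* ConfigInfo= when HistoConfig=1 */
  width = (xmax - xmin) / nbars;  /* common bar width               */
  do i = 0 to nbars;
    endpoint = xmin + i * width;
    output;
  end;
  format endpoint 3.;             /* same rounding as xValFmt=3.    */
  keep i endpoint;
run;

proc print data=nbarEndpoints noobs;
run;
```

With nbars=6 the rounded endpoints are 0, 42, 83, 125, 167, 208, and 250; changing nbars to 9 gives 0, 28, 56, 83, 111, 139, 167, 194, 222, and 250. Both lists agree with the axis values reported for Figure 10.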
Figure 10. With HISTOCONFIG set to '1' for an n-bar histogram, CONFIGINFO=6 or 9 produces two histograms with 6 and 9
bars each.
[Figure 10 panels: two histograms titled "Baseball Data from 1986-1987" (Frequency by Hits); the 6-bar histogram has endpoints 0, 42, 83, 125, 167, 208, 250 and the 9-bar histogram has endpoints 0, 28, 56, 83, 111, 139, 167, 194, 222, 250.]
The only way to convert the midpoint histogram from Figure 1 to the endpoint histogram in Figure 11 is to issue a
macro call with HISTOCONFIG set to 3 for uneven intervals.
Figure 11. With HISTOCONFIG set to '3' for an uneven interval histogram, endpoints are fully defined in XMIN, XMAX and
CONFIGINFO when PLOTHISTO is invoked.
[Figure 11 panels: midpoint histogram (Percent by Hours, midpoints 0 to 4 by 0.8) and endpoint histogram (Number of Meetings by Hours, endpoints 0.0, 0.4, 1.2, 2.0, 2.8, 3.6, 4.0).]
When xmin=0, xmax=4, HistoConfig=3, and ConfigInfo=%str(0.4 1.2 2.0 2.8 3.6) Xfmt is defined as:
Proc format;
value xfmt
0 -< 0.4 = "0" 0.4 -< 1.2 = "0.4" 1.2 -< 2.0 = "1.2"
2.0 -< 2.8 = "2.0" 2.8 -<3.6 = "2.8" 3.6 - 4 = "3.6" ;
run;
Task #2: Generate then hide a Conventional Axis with nested macro: MKUNDERLYINGSCALE
When HISTOCONFIG is set to 1 (n-bar) or 3 (uneven scale), the macro MKUNDERLYINGSCALE is invoked. This macro makes use of XMIN, XMAX, and XDATAOFFSET sent to PLOTHISTO.
axis2 label=none w=1 value=none major=none minor=none
origin=(,&yorigin.)
%if &histoConfig eq 2 %then
order=(&xmin to &xmax by &ConfigInfo) offset=(&xdataOffset.pct,);
%else
offset=(0pct,)
order=(%MkUnderlyingScale(calcXMin=&xMin, calcXMax=&XMax, OffSet=&xdataOffset));
;
--------------------------------------------------------------------------------------------------------------------------------------------
• label=none major=none minor=none value=none erases the axis completely, leaving only a single horizontal line. Even though the axis is erased, the ORDER and ORIGIN options remain in effect. Otherwise the algorithm wouldn't work.
• origin=(,&yorigin) YORIGIN must match the value for YORIGIN in the macro %unevenIntervalAxis where the displayed X-axis is redrawn via ANNOTATE.
• %MkUnderlyingScale is a macro function that returns the value list for the ORDER= option in range format. For the MEETINGS run, for example, it returns -0.16 to 4.16 by 0.04.
• &calcXMin, &calcXMax In the case of the MEETINGS data these macro variables resolve to 0.0 and 4.0.
• OffSet=&xdataOffset extends the axis range by +/- 3 (+1) units, or 0.16 hours, where a unit is defined as 0.04 in the %getIncr macro contained within MKUNDERLYINGSCALE. The increase in range is needed to accommodate a text size of 3.75 percent.
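The range quoted for the MEETINGS run can be checked with the arithmetic below. The sketch assumes that XDATAOFFSET is 3 for this run and that one extra unit is added internally, as the "+/- 3 (+1)" wording suggests; the variable names are hypothetical and the code is not taken from MKUNDERLYINGSCALE.

```sas
/* Worked arithmetic for the hidden axis range (hypothetical variable names). */
data _null_;
  xmin = 0;                         /* &calcXMin for MEETINGS                 */
  xmax = 4;                         /* &calcXMax for MEETINGS                 */
  incr = 0.04;                      /* unit reported for %getIncr             */
  xdataOffset = 3;                  /* assumed value of XDATAOFFSET           */
  pad  = (xdataOffset + 1) * incr;  /* 4 units of 0.04, i.e. 0.16 hours       */
  low  = xmin - pad;
  high = xmax + pad;
  put 'order range: ' low 'to ' high 'by ' incr;   /* written to the SAS log */
run;
```

The log line reports a range running from -0.16 to 4.16 in steps of 0.04, which agrees with the example given for %MkUnderlyingScale.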
Task #3: Generate a Display-Axis with the UNEVENINTERVALAXIS macro
XFMT is used to create a control-out data set that serves as input to the XAXISTICKS data set. Relevant code from
PLOTHISTO:
proc format library=WORK
cntlout=XaxisTicks(keep=start end);
select xfmt;
run;
data XaxisTicks(keep=xtick);
set XaxisTicks;
xtick=input(left(start),best.); output;
xtick=input(left(end),best.); output;
run;
%UnevenIntervalAxis(inDS=xAxisTicks, xvar=xtick, pctSize=&pctSize, xlabel=&XaxisLbl,
yOrigin=&yOrigin, xvalfmt=&xvalfmt.)
A partial listing of the UnevenIntervalAxis macro that uses annotate macros to create tick marks, associated axis val-
ues, and the axis label appears below. Again, full code listings can be found in the zip file.
%macro UnevenIntervalAxis(inDS=, xvar=, pctSize=, xlabel=, yOrigin=, XvalFmt=);
< Local macro variables are assigned and a select distinct in PROC SQL yields DISTINCTXTICK from INDS >
data annoAxisX;
%dclanno;
length text $30;
set distinctXtick end=last;
%system(2,3,3);
%move(xtick, &yOrigin);
%draw(xtick, &yOrigin - &tickLength., black, 1, 0.04);
%label(xtick, &yOrigin - &LabelYpos., DisplayX, black, 0, 0, &pctSize, Hwcgm001,5);
if last then do;
%system(1,3,3);
%label(50, &yOrigin - &axisLabelPos., "&xLabel", black, 0, 0, &pctSize, Hwcgm002,5);
end;
run;
--------------------------------------------------------------------------------------------------------------------------------------------
• %system(2,3,3) is an annotate macro that translates positional parameters to the XSYS, YSYS, and HSYS coordinate systems. For axis ticks and value labels, an XSYS value of '2' uses absolute values from the data area, whereas a value of '3' for YSYS and HSYS translates assigned numbers to percentages of the graphics output area. With YSYS set to '3', TICKLENGTH and LABELYPOS can be accurately subtracted from YORIGIN, which is also defined as a percent.
• %system(1,3,3) XSYS is changed here from '2' to '1' (percent of data area) so that the axis label is centered on the horizontal axis when the corresponding X coordinate is set to 50.
The adjusted axis in Figure 12 highlights the completion of Tasks #2 and #3 above.
Figure 12. The display axis from %UnevenIntervalAxis overlays a grayed-out (usually invisible) axis where the ORDER
option is filled in with an invocation of the %MkUnderlyingScale macro function.
[Figure 12 graph: histogram of Number of Meetings with labeled bars; the display axis overlays the grayed-out underlying axis, whose tick values run from -0.16 to 4.16 in steps of 0.04.]
Task #4: Create a Plot Data Set by Binning the Input Data with XFMT.
SAS code plus input and output data are listed in this section to demonstrate how the binning uses XFMT to create a
data set amenable to plotting.
XX and YY become the plot variables. Zeros are interspersed with actual values for YY when PLTDS is created so
that the symbol statement works as expected when INTERPOLATE= is set to STEPRJ.
From the sorted version of the input data below, it can be seen that there are a lot of tied meeting lengths. Ties
should not be confused with binning. The distinction is addressed in Figure 13.
XFMT for the MEETINGS data set is displayed again to show how the binning pictured in Figure 13 works:
Proc format;
value xfmt
0 -< 0.4 = "0" 0.4 -< 1.2 = "0.4" 1.2 -< 2.0 = "1.2"
2.0 -< 2.8 = "2.0" 2.8 -<3.6 = "2.8" 3.6 - 4 = "3.6" ;
run;
The minimum (0) and the maximum (4) are inclusive (>= and <=), whereas each intermediate endpoint is excluded (<) from the interval it closes and included in the interval it opens. With this setup, all intermediate points within a member class are set to the value of the left-most endpoint.
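The binning step can be sketched as follows. The format is the XFMT definition shown above; the data set names MEETINGS, BINNED, and BINCOUNTS and the variable BINSTART are assumptions, and the steps illustrate the idea rather than reproduce the PLOTHISTO source.

```sas
proc format;
  value xfmt
    0   -< 0.4 = "0"     0.4 -< 1.2 = "0.4"   1.2 -< 2.0 = "1.2"
    2.0 -< 2.8 = "2.0"   2.8 -< 3.6 = "2.8"   3.6 -  4   = "3.6";
run;

/* MEETINGS, BINNED, BINCOUNTS, and BINSTART are assumed names. */
data binned;
  set meetings;
  /* PUT maps HOURS to the label of its class;              */
  /* INPUT turns that label back into a numeric endpoint.   */
  binStart = input(put(hours, xfmt.), 8.);
run;

/* One row per occupied class with its frequency in COUNT. */
proc freq data=binned noprint;
  tables binStart / out=binCounts(keep=binStart count);
run;
```

Because every value from 0 through 4 matches exactly one range in XFMT, each meeting is counted in one and only one class. Classes with no observations do not appear in the PROC FREQ output and would have to be added separately before plotting.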
Figure 13. XFMT provides the foundation for binning in the PLOTHISTO macro. Using a format means that the generated histogram conforms to the requirement that "each measurement falls into one and only one measurement class" [17,p37].
[Figure 13 panels, each titled "Count of Meetings by their Length" (Number of Meetings by Hours): (1) Raw Data with Duplicates; (2) Binned Data positioned at Left-Most Endpoints; (3) Histogram with Binned Data; (4) Histogram with Raw and Binned Data.]
The input data are represented by a needle plot in the first panel of Figure 13. A second needle plot shows the calcu-
lated frequencies at the designated intervals, whereas the histogram in panel #3 is generated from the plot data set
containing the interleaved zeros. The fourth panel is a composite of the first three graphs.
Task #5: Color the bars and Generate a Plot
The AREAS= option in the PLOT statement of PROC GPLOT is not satisfactory for coloring the bars of a histogram. Bar outlines are overwritten! Multiple calls to GREPLAY are convoluted, so the best solution involves the creation of a second ANNOTATE data set from the plot data set:
data annoBarFill;
%dclanno;
%system(2,2,3);
set pltds;
xx1=lag(xx); yy1=lag(yy); xx2=xx; yy2=yy;
if xx1 ne . AND yy1 eq 0;
%bar(xx1,yy1,xx2,yy2,graycc,0,solid);
run;
Since annotate macros such as %bar contain an implicit "when=before", the bars are colored before they are outlined.
Now the histogram is ready to be plotted with PROC GPLOT:
proc gplot data=pltds %if &ListFreqsYvN eq N %then anno=annoAxisX; %else anno=annoText; ;
plot yy*xx /vaxis=axis1
haxis=axis2
noframe
anno=annoBarFill;
run;
--------------------------------------------------------------------------------------------------------------------------------------------
• &ListFreqsYvN Not described in this summary is the option for labeling the histogram bars. The
method is very similar to the one used for generating UNIVARIATE histograms.
• anno=annoAxisX %else anno=annoText ANNOTEXT augments ANNOAXISX with a code ex-
tension for the midpoint frequency labels. Both data sets incorporate a call to
UNEVENINTERVALAXIS. The call to MKUNDERLYINGSCALE is embedded in the axis2 state-
ment shown earlier.
• anno=annoBarFill Both the GPLOT and PLOT statements can support the ANNO= option.
Thus the bars can be colored by a separate annotate data set.
Figure 14. An upper triangular matrix is derived from the rectangular matrix that originally appeared in SAS® System for Statistical Graphics by Michael Friendly.
[Figure 14 graph: upper triangular scatter plot matrix; the variable labels that survive are hits and years.]
In Figure 15, marginal histograms provide a summary that partially offsets the degree of overlay in the scatter plot.
Figure 15. Marginal histogram totals must be the same, since points are plotted only when information is available for both
RUNS and HITS. Bar heights are comparable between histograms. GREPLAY was used to create the graph.
[Figure 15 graph: scatter plot of Runs by Hits with labeled marginal histograms.]
Figure 16. The bar chart confirms that the baseball data are not normally distributed. High frequencies are not clustered in the
middle. Spaces between the bars are visible when the graph is enlarged. The maximum frequency is 12.
[Figure 16 panels: histogram and bar chart of Number of Players by Hits; inset: N 322, prob(Norm) 0.0004.]
In Figure 17, the bar chart from a normal distribution is more balanced with higher frequencies moving towards the
center of the plot.
Figure 17. The bar chart in the second panel confirms that random normal deviates are, in fact, normally distributed. Now the
maximum frequency in the bar chart is 24.
[Figure 17 panels: histogram and bar chart of Count by Random Normal Deviate; inset: N 500, prob(Norm) 0.8278.]
While the PLOTHISTO macro that uses PROC GPLOT is more versatile, it is also more complex. However, by reviewing the description of how PLOTHISTO works on pages 9-13 in the paper along with a full listing of the source code in the zip file associated with paper #NP03 in the NESUG 2008 proceedings, it will be possible to generate histograms with ease from PROC GPLOT.
The PLOTHISTO macro has also been recently extended to produce both EMF and CGM output. Normal and
Gamma curves can now be superimposed over GPLOT generated histograms, and subgroup histograms, identical in
concept to subgroup bar charts, are also included in the updated macro. In addition, a separate macro has been writ-
ten to produce a histogram from summary data. Output from the new and recently updated macros will be shown in
the presentation, and a follow-up paper that shows how the new features work is being considered for presentation at
the upcoming NESUG conference.
COPYRIGHT STATEMENT
The paper, Using SAS® Software to Generate Textbook-Style Histograms, along with all associated files in the
NESUG 2008 proceedings is protected by copyright law. This means if you would like to use part or all of the original
ideas or text from these documents in a publication where no monetary profit is to be gained, you are welcome to do
so. All you need to do is to cite the paper in your reference section. For ALL uses that result in corporate or individual
profit, written permission must be obtained from the author. Conditions for usage have been modified from
http://www.whatiscopyright.org.
REFERENCES
[1] Nguyen, Chauthi. Histogram of Numeric Data Distribution from the UNIVARIATE Procedure. Proceedings of the 20th Annual Northeast SAS Users Group Conference. Baltimore, MD, 2007, paper #NP12.
[2] Miron, Thomas. The How-To Book for SAS/GRAPH Software. Cary, NC: SAS Institute Inc., 1995.
[3] Watts, Perry. Generate a Customized Axis Scale with Uneven Intervals in SAS® Automatically. Proceedings of
the SAS® Global Forum 2009 Conference. Washington, DC, 2009, paper #192-2009.
[4] Watts, Perry. Multiple-Plot Displays: Simplified with Macros. Cary, NC: SAS Institute Inc., 2002.
WEB CITATIONS:
[5] http://en.wikipedia.org/wiki/Histogram. Histogram: From Wikipedia, the free encyclopedia. The histogram is
defined and compared to a bar chart.
[6] http://lib.stat.cmu.edu/datasets/baseball.data. From StatLib --- DataSets Archive. This was the 1988 ASA Graph-
ics Section Poster Session dataset, organized by Lorraine Denby.
SAS INSTITUTE REFERENCES:
[7] SAS Institute Inc. SAS/GRAPH® 9.1 Reference, Volumes 1, 2, and 3, Cary NC: SAS Institute Inc., 2004.
[8] SAS Institute Inc. SAS® Procedures Guide, Version 8, Cary NC: SAS Institute Inc., 1999.
[9] SAS Institute Inc. Base SAS® 9.1.3 Procedures Guide, Volume 3: CORR, FREQ, and UNIVARIATE Procedures,
Cary NC: SAS Institute Inc., 2004.
Statistics Textbooks
[10] Blalock, Hubert M. Social Statistics. New York, NY: McGraw-Hill Book Company, Inc., 1960.
[11] Croxton, Frederick E. and Dudley J. Cowden. Applied General Statistics: Second Edition. New York, NY: Pren-
tice-Hall, Inc., 1955.
[12] Deming, W. Edwards. Making Things Right. Statistics: A Guide to the Unknown. Ed. Judith M. Tanur, et al. San
Francisco, CA: Holden-Day, Inc., 1972. 229-236.
[13] Efron, Bradley. Bootstrap Methods: Another Look at the Jackknife. Breakthroughs in Statistics Volume II: Methodology and Distribution. Ed. Samuel Kotz and Norman L. Johnson. New York, NY: Springer-Verlag New York, Inc., 1992. 569-593.
[14] Freudenthal, Hans. Probability and Statistics. New York, NY: Elsevier Publishing Company, 1965.
[15] Jaeger, Richard M. Statistics A Spectator Sport: Second Edition. Newbury Park, CA: SAGE Publications, Inc.,
1990.
[16] Kvanli, Alan H., C. Stephen Guynes, Robert J. Pavur. Introduction to Business Statistics: A Computer Integrated
Approach. St. Paul, MN: West Publishing Company, 1986.
[17] McClave, James T. and P. George Benson. Statistics for Business and Economics: Third Edition. San Francisco,
CA: Dellen Publishing Company, 1985.
[18] Moore, David S. Statistics Concepts and Controversies: Second Edition. San Francisco, CA: W. H. Freeman and
Company, 1979.
[19] Mosteller, Frederick and David L. Wallace. Deciding Authorship. Statistics: A Guide to the Unknown. Ed. Judith
M. Tanur, et al. San Francisco, CA: Holden-Day, Inc., 1972. 164-175.
[20] Yule, G. Udny and M. G. Kendall. An Introduction to the Theory of Statistics. New York, NY: Hafner Publishing
Company, 1950.
TRADEMARK CITATION
SAS and all other SAS Institute Inc. product or service names are registered trademarks or trademarks of SAS
Institute Inc. in the USA and other countries. ® indicates USA registration.
Other brand and product names are registered trademarks or trademarks of their respective companies.
CONTACT INFORMATION
The author welcomes feedback via email at perryWatts@comcast.net. A text version of the source code is available
upon request.
Table 2. Vertical Axis Types: Frequencies vs. Percents. (Relative Frequency X 100 = Percent)
Textbook ID (Reference #, Title)                                           Type                            Page     Comments
[10] Social Statistics.                                                    Both                            40-42    Percents are listed next to frequencies on the vertical axis (an effective technique)
[11] Applied General Statistics: Second Edition.                           Frequency                       74
[12] Making Things Right.                                                  Frequency                       230
[13] Bootstrap Methods: Another Look at the Jackknife                      Frequency                       589
[14] Probability and Statistics                                            Frequency                       20-21
[15] Statistics A Spectator Sport: Second Edition                          Frequency                       17
[16] Introduction to Business Statistics: A Computer Integrated Approach.  Frequency                       14
                                                                           Relative Frequency              15
[17] Statistics for Business and Economics: Third Edition                  Relative Frequency              38
                                                                           Frequency                       38
                                                                           Cumulative Relative Frequency   43
[18] Statistics Concepts and Controversies: Second Edition.                Frequency                       159
                                                                           Relative Frequency              162
[19] Deciding Authorship.                                                  Proportion                      171-172
[20] An Introduction to the Theory of Statistics.                          Frequency                       79
                                                                           Frequency                       90