R Programming
R-PROGRAMMING INTRODUCTION
CHAKRABORTY
OUTLINE
Introduction:
• Historical development
• S, Splus
• Capability
• Statistical Analysis
• References
Calculator
Data Type
Resources
Simulation and Statistical Tables
• Probability distributions
Programming
• Grouping, loops and conditional execution
• Function
• Reading and writing data from files
Modeling
• Regression
• ANOVA
Data Analysis on
• Association
• Lottery
• Geyser
Smoothing
R, S AND S-PLUS
S: an interactive environment for data analysis
developed at Bell Laboratories since 1976
1988 - S2: RA Becker, JM Chambers, A Wilks
1992 - S3: JM Chambers, TJ Hastie
1998 - S4: JM Chambers
http://cm.bell-labs.com/cm/ms/departments/sia/S/history.html
R: initially written by Ross Ihaka and Robert
Gentleman at the Department of Statistics of the University of Auckland, New
Zealand, during the 1990s.
Since 1997: international “R-core” team of ca. 15
INTRODUCTION
R is “GNU S”: a language and environment for data
manipulation, calculation and graphical display.
R is similar to the award-winning S system, which was developed at
Bell Laboratories by John Chambers et al.
Among other things, R provides:
• a suite of operators for calculations on arrays, in particular
matrices,
• a large, coherent, integrated collection of intermediate tools for
interactive data analysis,
• graphical facilities for data analysis and display, either directly at
the computer or on hardcopy,
• a well-developed programming language which includes
conditionals, loops, user-defined recursive functions and input and
output facilities.
[1] 5
> sqrt(2)
[1] 1.414214
> seq(0, 5, length=6)
[1] 0 1 2 3 4 5
[Figure: plot of values against Index; x-axis from 0 to 100, y-axis from -1.0 to 1.0]
OBJECT ORIENTATION
Parlance:
• class: the “abstract” definition of a kind of object
• object: a concrete instance of a class
• method: another word for ‘function’
• slot: a component of an object
Advantages:
Encapsulation (can use the objects and methods someone else has
written without having to care about the internals)
Generic functions (e.g. plot, print)
Inheritance (hierarchical organization of complexity)
Caveat:
Overcomplicated, baroque program architecture…
VARIABLES
> a = 49
> sqrt(a)                        # numeric
[1] 7
> a = "The dog ate my homework"
> sub("dog","cat",a)             # character string
[1] "The cat ate my homework"
> a = (1+1==3)
> a                              # logical
[1] FALSE
VECTORS, MATRICES AND ARRAYS
• vector: an ordered collection of data of the same type
> a = c(1,2,3)
> a*2
[1] 2 4 6
Example:
> a
localisation tumorsize progress
XX348 proximal 6.3 FALSE
XX234 distal 8.0 TRUE
XX987 proximal 10.0 FALSE
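A table like the one printed above is a data frame: each column is a vector of one type, and different columns may have different types. As a minimal sketch, it could be built as follows. The values are copied from the printout; `stringsAsFactors = FALSE` is an assumption made here so that `localisation` stays a character column.

```r
# Build the example data frame; the row names are the sample identifiers.
a <- data.frame(
  localisation = c("proximal", "distal", "proximal"),
  tumorsize    = c(6.3, 8.0, 10.0),
  progress     = c(FALSE, TRUE, FALSE),
  row.names    = c("XX348", "XX234", "XX987"),
  stringsAsFactors = FALSE   # assumption: keep text as character, not factor
)

str(a)    # shows one line per column with its type
dim(a)    # number of rows and number of columns
```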
FACTORS
A character string can contain arbitrary text. Sometimes it is useful to use a limited
vocabulary, with a small number of allowed words. A factor is a variable that can only
take such a limited number of values, which are called levels.
> a
[1] Kolon(Rektum) Magen Magen
[4] Magen Magen Retroperitoneal
[7] Magen Magen(retrogastral) Magen
Levels: Kolon(Rektum) Magen Magen(retrogastral) Retroperitoneal
> class(a)
[1] "factor"
> as.character(a)
[1] "Kolon(Rektum)" "Magen" "Magen"
[4] "Magen" "Magen" "Retroperitoneal"
[7] "Magen" "Magen(retrogastral)" "Magen"
> as.integer(a)
[1] 1 2 2 2 2 4 2 3 2
> as.integer(as.character(a))
[1] NA NA NA NA NA NA NA NA NA
Warning message: NAs introduced by coercion
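The conversions above matter most when the labels of a factor look like numbers. The following sketch uses a small made-up factor `f` (hypothetical data, not from the slides) to show that `as.integer` returns the internal level codes, while going through `as.character` first recovers the values written in the labels.

```r
# Hypothetical factor whose labels look like numbers.
f <- factor(c("10", "20", "10", "5"))

levels(f)                     # levels are sorted as text: "10" "20" "5"
as.integer(f)                 # level codes, not the values: 1 2 1 3
as.numeric(as.character(f))   # the values in the labels: 10 20 10 5
as.numeric(levels(f))[f]      # same result, converting each level only once
```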
SUBSETTING
Individual elements of a vector, matrix, array or data frame are
accessed with “[ ]” by specifying their index, or their name
> a
localisation tumorsize progress
XX348 proximal 6.3 0
XX234 distal 8.0 1
XX987 proximal 10.0 0
> a[3, 2]
[1] 10
> a["XX987", "tumorsize"]
[1] 10
> a["XX987",]
localisation tumorsize progress
XX987 proximal 10 0
> a
localisation tumorsize progress
XX348 proximal 6.3 0
XX234 distal 8.0 1
XX987 proximal 10.0 0
> a[c(1,3),]                         # subset rows by a vector of indices
localisation tumorsize progress
XX348 proximal 6.3 0
XX987 proximal 10.0 0
> a[c(T,F,T),]                       # subset rows by a logical vector
localisation tumorsize progress
XX348 proximal 6.3 0
XX987 proximal 10.0 0
> a$localisation                     # subset a column
[1] "proximal" "distal" "proximal"
> a$localisation=="proximal"         # comparison resulting in a logical vector
[1] TRUE FALSE TRUE
> a[ a$localisation=="proximal", ]   # subset the selected rows
localisation tumorsize progress
XX348 proximal 6.3 0
XX987 proximal 10.0 0
RESOURCES
A package specification allows the production of
loadable modules for specific purposes, and several
contributed packages are made available through the
CRAN sites.
CRAN and R homepage:
http://www.r-project.org/
It is R’s central homepage, giving information on the R
project and everything related to it.
http://cran.r-project.org/
It acts as the download area, carrying the software itself,
extension packages and PDF manuals.
GETTING HELP
> ?t.test
or
> help(t.test)
• HTML search engine
• Search for topics with regular expressions: “help.search”
PROBABILITY DISTRIBUTIONS
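For each distribution, R provides four functions, named by putting a prefix in front of the R name of the distribution:
• ‘p’ for the CDF: cumulative distribution function P(X ≤ x)
• ‘d’ for the density: probability density function
• ‘q’ for the quantile: quantile function (given q, the smallest x such that P(X ≤ x) ≥ q)
• ‘r’ to simulate from the distribution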
Distribution        R name    additional arguments
beta                beta      shape1, shape2, ncp
binomial            binom     size, prob
Cauchy              cauchy    location, scale
chi-squared         chisq     df, ncp
exponential         exp       rate
F                   f         df1, df2, ncp
gamma               gamma     shape, scale
geometric           geom      prob
hypergeometric      hyper     m, n, k
log-normal          lnorm     meanlog, sdlog
logistic            logis     location, scale
negative binomial   nbinom    size, prob
normal              norm      mean, sd
Poisson             pois      lambda
Student’s t         t         df, ncp
uniform             unif      min, max
Weibull             weibull   shape, scale
Wilcoxon            wilcox    m, n
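The prefix convention can be tried out on two of the distributions listed above. This is only an illustrative sketch: the evaluation points, the sample size and the seed are arbitrary choices made for this example.

```r
# Normal distribution with the default mean = 0 and sd = 1.
dnorm(0)            # density at 0, equal to 1/sqrt(2*pi)
pnorm(1.96)         # P(X <= 1.96), roughly 0.975
qnorm(0.975)        # quantile function, the inverse of pnorm
set.seed(1)         # assumption: fixed seed, only to make the draw reproducible
x <- rnorm(1000)    # 1000 simulated values

# Binomial distribution with size = 10 and prob = 0.5.
dbinom(5, size = 10, prob = 0.5)     # P(X = 5)
pbinom(5, size = 10, prob = 0.5)     # P(X <= 5)
qbinom(0.5, size = 10, prob = 0.5)   # smallest x with P(X <= x) >= 0.5
```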
GROUPING, LOOPS AND
CONDITIONAL EXECUTION
Grouped expressions
R is an expression language in the sense that its only
command type is a function or expression which returns a
result.
Commands may be grouped together in braces, {expr_1; ...; expr_m},
in which case the value of the group is the result of
the last expression in the group evaluated.
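A short sketch of a grouped expression used as a value; the variable names `u`, `v` and `y` are made up for this illustration.

```r
# The value of a braced group is the value of its last expression.
y <- {
  u <- 3
  v <- 4
  sqrt(u^2 + v^2)   # last expression: this becomes the value of the group
}
y
```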
Control statements
if statements
The language has available a conditional construction of the
form
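if (expr_1) expr_2 else expr_3
where expr_1 must evaluate to a single logical value and the result
of the entire expression is then evident.
There is also a vectorized version of the if/else construct, the ifelse
function. This has the form ifelse(condition, a, b) and returns a vector
of the same length as condition, with elements a[i] where condition[i]
is true and b[i] otherwise.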
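The difference between the two constructs can be sketched as follows; the vector `x` and the labels are made up for this illustration.

```r
x <- c(-2, 0, 3)

# if/else needs a single logical value, so it is applied to one element.
if (x[1] < 0) sign1 <- "negative" else sign1 <- "non-negative"

# ifelse() works elementwise over the whole vector.
signs <- ifelse(x < 0, "negative", "non-negative")
signs
```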
REPETITIVE EXECUTION
for loops, repeat and while
for (name in expr_1) expr_2
where name is the loop variable. expr_1 is a vector
expression (often a sequence like 1:20), and expr_2
is often a grouped expression with its sub-expressions
written in terms of the dummy name.
expr_2 is repeatedly evaluated as name ranges
through the values in the vector result of expr_1.
Other looping facilities include the
repeat expr statement and the
while (condition) expr statement.
The break statement can be used to terminate any
loop, possibly abnormally. This is the only way to
terminate repeat loops.
The next statement can be used to discontinue one
particular cycle and skip to the “next”.
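As a small sketch of repeat, break and next working together, the loop below adds up odd numbers until the total exceeds 20. The task and the threshold are arbitrary choices for this example.

```r
# Sum the odd numbers 1, 3, 5, ... until the total exceeds 20.
total <- 0
i <- 0
repeat {
  i <- i + 1
  if (i %% 2 == 0) next    # skip even numbers and start the next cycle
  total <- total + i
  if (total > 20) break    # break is the only way out of a repeat loop
}
c(i = i, total = total)
```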
BRANCHING
if (logical expression) {
statements
} else {
alternative statements
}
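# for loop: print the squares of 1 to 10
for (i in 1:10) {
  print(i * i)
}
# while loop: the step grows with i, so fewer values are printed
i = 1
while (i <= 10) {
  print(i * i)
  i = i + sqrt(i)
}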
LAPPLY, SAPPLY, APPLY
• When the same or similar tasks need to be performed multiple
times for all elements of a list or for all columns of an array.
• May be easier and faster than “for” loops
• lapply(li, function )
• To each element of the list li, the function function is applied.
• The result is a list whose elements are the individual function
results.
> li = list("klaus","martin","georg")
> lapply(li, toupper)
[[1]]
[1] "KLAUS"
[[2]]
[1] "MARTIN"
[[3]]
[1] "GEORG"
sapply(li, fct)
Like lapply, but tries to simplify the result by converting it into a
vector or array of appropriate size.
> li = list("klaus","martin","georg")
> sapply(li, toupper)
[1] "KLAUS" "MARTIN" "GEORG"
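The slides name apply but do not show it. A minimal sketch, assuming a small made-up matrix `m`: apply takes the array, the margin to work over (1 for rows, 2 for columns) and the function.

```r
m <- matrix(1:6, nrow = 2)     # 2 x 3 matrix, filled column by column

col_sums <- apply(m, 2, sum)   # MARGIN = 2: apply sum to each column
row_max  <- apply(m, 1, max)   # MARGIN = 1: apply max to each row

# sapply() returns a plain vector when every result has length 1.
lens <- sapply(list("klaus", "martin", "georg"), nchar)
```

For simple column or row totals, colSums(m) and rowSums(m) give the same numbers and are usually faster.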
FUNCTIONS AND OPERATORS
• Functions do things with data
• “Input”: function arguments (0,1,2,…)
• “Output”: function result (exactly one)
Example:
add = function(a,b) {
  result = a+b
  return(result)
}
Operators:
Short-cut writing for frequently used functions of one or two
arguments.
Examples: + - * / ! & | %%
Exceptions to the rule:
• Functions may also use data that sits around in other places, not
just in their argument list: “scoping rules”
• Functions may also do other things than returning a result. E.g.,
plot something on the screen: “side effects”
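The points about arguments, scoping rules, side effects and operators can be sketched with a few small functions. All names here (`scale_by`, `add_offset`, `offset`, `report`) are hypothetical and chosen only for this illustration.

```r
# A function with a default argument; the last expression is the result.
scale_by <- function(x, factor = 2) {
  x * factor
}

# Scoping: 'offset' is not an argument, so R looks it up in the
# environment where the function was defined.
offset <- 10
add_offset <- function(x) x + offset

# Side effect: message() writes to the console; x is returned invisibly.
report <- function(x) {
  message("mean = ", mean(x))
  invisible(x)
}

# Operators are ordinary functions with a short-cut syntax.
`+`(1, 2)   # same as 1 + 2
7 %% 3      # remainder of 7 divided by 3
```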
READING AND WRITING DATA FROM FILES
• Every R object can be stored into and restored from a file with
the commands “save” and “load”.
• This uses the XDR (external data representation) standard of
Sun Microsystems and others, and is portable between
MS-Windows, Unix, Mac.
> x = read.delim("filename.txt")
also: read.table, read.csv
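Both routes can be sketched as round trips through temporary files. The data frame `df` and the file names produced by tempfile() are made up for this example.

```r
df <- data.frame(id = 1:3, score = c(2.5, 3.0, 4.5))

# Binary round trip with save() and load().
rda <- tempfile(fileext = ".RData")   # hypothetical temporary file
save(df, file = rda)
rm(df)
load(rda)                             # restores the object under its old name, df

# Text round trip with write.table() and read.delim() (tab-separated).
txt <- tempfile(fileext = ".txt")
write.table(df, file = txt, sep = "\t", row.names = FALSE, quote = FALSE)
x <- read.delim(txt)
```

Note that load() recreates the saved objects by name, whereas read.delim() returns a value that has to be assigned.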
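What does it mean for a lottery to be run fairly?
Is the distribution of winning numbers close to a discrete uniform distribution?
How can we check whether the distribution of winning numbers is close to a discrete uniform distribution?
length(lottery.number) # 254
breaks <- 100*(0:10); breaks[1] <- -1
hist(lottery.number, breaks = breaks)
abline(h = 254/10) # expected count per bin
The histogram looks fairly flat (goodness-of-fit test).
Unless we can predict the future, any number we pick has only a 1-in-1000 chance of winning.
What is the expected payoff of this lottery?
With each ticket sold at 50 cents, if we buy this lottery repeatedly we would want a win to pay at least $500, because the probability of winning is 1/1000.
boxplot(lottery.payoff, main = "NJ Pick-it Lottery (5/22/75-3/16/76)", sub = "Payoff")
lottery.label <- "NJ Pick-it Lottery (5/22/75-3/16/76)"
hist(lottery.payoff, main = lottery.label)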
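The uniformity question raised above can also be checked formally with a chi-squared goodness-of-fit test. The sketch below is an assumption-laden illustration: the lottery data file is not reproduced here, so `sim.number` is simulated as a stand-in for `lottery.number`, and the seed is arbitrary.

```r
set.seed(42)                                       # assumption: fixed seed, for reproducibility only
sim.number <- sample(0:999, 254, replace = TRUE)   # simulated stand-in for lottery.number

# Count the numbers falling into the 10 bins 0-99, 100-199, ..., 900-999.
counts <- table(factor(sim.number %/% 100, levels = 0:9))

# With no 'p' argument, chisq.test() tests against equal probabilities.
gof <- chisq.test(counts)
gof$expected[1]   # 254/10 expected per bin
gof$p.value       # a large p-value gives no evidence against uniformity

# Break-even payoff for a 50-cent ticket with win probability 1/1000.
0.50 / (1 / 1000)
```

With the real data, `sim.number` would simply be replaced by `lottery.number`.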
DATA ANALYSIS
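Did the payoff exceed $500 on many occasions?
How should we bet? Do the payoffs contain outliers?
min(lottery.payoff) # lowest payoff: 83
lottery.number[lottery.payoff == min(lottery.payoff)] # 123
# <, >, <=, >=, ==, != : comparison operators
max(lottery.payoff) # highest payoff: 869.5
lottery.number[lottery.payoff == max(lottery.payoff)] # 499
Do the winning numbers with high payoffs share any characteristic?
Characteristics of winning numbers with high payoffs
Characteristic: most winning numbers with high payoffs contain repeated digits.
This lottery offers a special kind of bet called a "combination bet": the number bet on must consist of three different digits, and the bet wins whenever the winning number contains the same digits as the number bet on.
# a is assumed to hold the winning numbers (row 1) and payoffs (row 2) of the draws that paid at least $500
a <- rbind(lottery.number[lottery.payoff >= 500], lottery.payoff[lottery.payoff >= 500])
plot(a[1,], a[2,], xlab="lottery.number", ylab="lottery.payoff", main="Payoff >=500")
boxplot(split(lottery.payoff, lottery.number%/%100), sub="Leading Digit of Winning Numbers", ylab="Payoff")
Box plots of the payoff grouped by the leading digit of the winning number.
When the leading digit of the winning number is zero, the payoffs tend to be higher. One explanation is that fewer people bet on such numbers.
Comparing payoffs across different time periods.
qqplot(lottery.payoff, lottery3.payoff); abline(0,1)
Use box plots to compare the payoff distributions across time periods.
boxplot(lottery.payoff, lottery2.payoff, lottery3.payoff)
Over time the payoffs gradually settle down and rarely exceed $500.
rbind(lottery2.number[lottery2.payoff >= 500], lottery2.payoff[lottery2.payoff >= 500])
rbind(lottery3.number[lottery3.payoff >= 500], lottery3.payoff[lottery3.payoff >= 500])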
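Whether a winning number contains a repeated digit can be checked mechanically. The helper below is a hypothetical sketch (the name `has_repeated_digit` is not from the slides); it pads each number to three digits so that, for example, 7 is treated as 007.

```r
# TRUE when the three-digit form of a number (000 to 999) repeats a digit.
has_repeated_digit <- function(number) {
  digits <- strsplit(sprintf("%03d", as.integer(number)), "")
  sapply(digits, function(d) anyDuplicated(d) > 0)
}

has_repeated_digit(c(499, 123, 7))   # 499 and 007 repeat a digit, 123 does not
mean(has_repeated_digit(0:999))      # share of all 1000 numbers with a repeated digit
```

Applied to the data, mean(has_repeated_digit(lottery.number[lottery.payoff >= 500])) would give the share of high-payoff draws with a repeated digit, to be compared with the share among all possible numbers.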
NEW JERSEY PICK-IT LOTTERY (drawn daily)
• Three data sets (collected at different times):
• lottery (254 winning numbers from May 22, 1975 to March 16, 1976)
• number: winning numbers range from 000 to 999; this lottery began on May 22, 1975.
• payoff: the payoff for the winning number; all winners share equally half of the total amount bet that day.
• lottery2 (winning numbers and payoffs from November 10, 1976 to September 6, 1977).
• lottery3 (winning numbers and payoffs from December 1, 1980 to September 22, 1981).
• lottery.number <- scan("c:/lotterynumber.txt")
• lottery.payoff <- scan("c:/lotterypayoff.txt")
Looking only at this long string of winning numbers, it is hard to see any pattern.
• lottery2 <- scan("c:/lottery2.txt")
• lottery2 <- matrix(lottery2, byrow=FALSE, ncol=2)
• lottery2.payoff <- lottery2[,2]; lottery2.number <- lottery2[,1]
• lottery3 <- matrix(scan("c:/lottery3.txt"), byrow=FALSE, ncol=2)
• lottery3.payoff <- lottery3[,2]; lottery3.number <- lottery3[,1]
OLD FAITHFUL GEYSER IN YELLOWSTONE
NATIONAL PARK
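Purpose of the study:
Help visitors plan their trips
Understand how the geyser forms, in order to protect the environment
Data:
Collected from August 1, 1985 to August 15, 1985
waiting: time interval between the starts of successive eruptions,
denote it by w_t
duration: the duration of the subsequent eruption, denote it by
d_t.
Some are recorded as L(ong), S(hort) and M(edium) during the
night
Sequence of observations: w_1, d_1, w_2, d_2, ...
Predict w_{t+1} from d_t (regression analysis)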
In R, use help(faithful) to get more information on this
data set.
Load the data set by data(faithful).
geyser<- matrix(scan("c:/geyser.txt"),byrow=F,ncol=2)
geyser.waiting<- geyser[,1]; geyser.duration<- geyser[,2]
hist(geyser.waiting)
KERNEL DENSITY ESTIMATION
The function `density' computes kernel density estimates
with the given kernel and bandwidth.
density(x, bw = "nrd0", adjust = 1, kernel = c("gaussian",
"epanechnikov", "rectangular", "triangular", "biweight", "cosine",
"optcosine"), window = kernel, width, give.Rkern = FALSE, n = 512, from,
to, cut = 3, na.rm = FALSE)
n: the number of equally spaced points at which the density is to be
estimated.
hist(geyser.waiting,freq=FALSE)
lines(density(geyser.waiting))
plot(density(geyser.waiting))
lines(density(geyser.waiting,bw=10))
lines(density(geyser.waiting, bw=1, kernel="e"))
# Show the kernels in the R parametrization
# (density() is a generic function, so the kernel list is taken from density.default)
(kernels <- eval(formals(density.default)$kernel))
plot(density(0, bw = 1), xlab = "", main = "R's density() kernels with bw = 1")
for(i in 2:length(kernels)) lines(density(0, bw = 1, kernel = kernels[i]), col = i)
legend(1.5, .4, legend = kernels, col = seq(kernels), lty = 1, cex = .8, y.intersp = 1)
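What density computes can be written out by hand for the Gaussian kernel: with bandwidth h and observations x_1, ..., x_n, the estimate at a point t is the average of dnorm((t - x_i)/h), divided by h. The sketch below compares this formula with density(). It uses the built-in `faithful$waiting` as a stand-in for `geyser.waiting`, and the bandwidth h = 4 is an arbitrary assumption.

```r
x <- faithful$waiting   # built-in data set, stand-in for geyser.waiting
h <- 4                  # assumption: arbitrary bandwidth for illustration

# Gaussian kernel estimate at each point of t.
kde <- function(t) sapply(t, function(ti) mean(dnorm((ti - x) / h)) / h)

d <- density(x, bw = h, kernel = "gaussian")
max_diff <- max(abs(kde(d$x) - d$y))           # small: density() uses a fast binned approximation
area     <- sum(kde(d$x)) * diff(d$x[1:2])     # Riemann sum over the grid, close to 1
```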
THE EFFECT OF CHOICE OF KERNELS
The average amount of annual precipitation (rainfall)
in inches for each of 70 United States (and Puerto
Rico) cities.
data(precip)
bw <- bw.SJ(precip) ## sensible automatic choice
plot(density(precip, bw = bw, n = 2^13), main = "same sd bandwidths, 7 different kernels")
for(i in 2:length(kernels)) lines(density(precip, bw = bw, kernel = kernels[i], n = 2^13), col = i)
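REGRESSION ANALYSIS
• duration <- geyser.duration[1:298]
• waiting <- geyser.waiting[2:299]
• plot(duration, waiting, xlab="eruption duration", ylab="waiting")
• plot(density(duration), xlab="eruption duration", ylab="density")
plot(density(geyser.waiting), xlab="waiting", ylab="density")
• # predict d_t from w_t
• plot(geyser.waiting, geyser.duration, xlab="waiting", ylab="duration")
• A possible physical model
• Below the geyser's vent is a long, narrow tube filled with water that is heated by the surrounding rock.
Because the tube holds a large column of water, the water lower in the tube is under pressure and so has a higher boiling point; the deeper the water, the higher its boiling point.
• When the water near the top of the tube is heated to its boiling point by the surrounding rock, it turns to steam; the pressure on the water below then drops, its boiling point falls accordingly, and the water below turns to steam even faster, so the eruption begins.
• For further discussion of this physical model, see Rinehart (1969; J. Geophys. Res., 566-573)
• Based on this theory, we would expect that the longer the current eruption lasts, the longer the wait until the geyser erupts again.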
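The expectation stated above can be checked with a simple linear regression. As a hedged sketch, the built-in `faithful` data set is used here in place of the `geyser.txt` file; its column `eruptions` is the eruption duration and `waiting` is the time until the next eruption, both in minutes.

```r
# Regress the waiting time to the next eruption on the eruption duration.
fit <- lm(waiting ~ eruptions, data = faithful)

coef(fit)                     # a positive slope means longer eruptions precede longer waits
r2 <- summary(fit)$r.squared  # share of the variance in waiting explained by duration

# Predicted waits after a 2-minute and a 4-minute eruption.
p <- predict(fit, newdata = data.frame(eruptions = c(2, 4)))
```

plot(waiting ~ eruptions, data = faithful) followed by abline(fit) draws the scatterplot with the fitted line.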
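REGRESSION ANALYSIS
Analysis 1: predict w_{t+1} from d_t
plot(duration, waiting, xlab="eruption duration", ylab="waiting")
Analysis 2: we expect long and short durations to alternate; examine the relationship between time and d_t (time series analysis)
Analysis A
ts.plot(geyser.duration, xlab="time", ylab="eruption duration")
This time series oscillates strongly, swinging between two levels.
Analysis B: d_{t+1} versus d_t
lag.plot(geyser.duration, 1)
Question 1: a short eruption is followed by a long one, but a long eruption is mostly followed by a short one.
Refine the physical model; try a more complex second-order Markov chain.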
EXPLORE ASSOCIATION
data(stackloss)
It is a data frame with 21 observations on 4 variables.
[,1] `Air Flow'    Flow of cooling air
[,2] `Water Temp'  Cooling Water Inlet Temperature
[,3] `Acid Conc.'  Concentration of acid [per 1000, minus 500]
[,4] `stack.loss'  Stack loss
The data sets `stack.x', a matrix with the first three
(independent) variables of the data frame, and
`stack.loss', the numeric vector giving the fourth
(dependent) variable, are provided as well.
A scatterplot shows the association between two quantitative variables; a scatterplot matrix shows all pairs of variables:
plot(stackloss$Air.Flow, stackloss$Water.Temp)   # scatterplot
plot(stackloss)                                  # scatterplot matrix
summary(lm.stack <- lm(stack.loss ~ stack.x))
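The same regression can be written with a formula on the data frame, which keeps the variable names in the output. This is a sketch of an alternative spelling, not a different model; the pairwise correlations are added as a numeric companion to the scatterplot matrix.

```r
# '.' in the formula means "all other columns of the data frame".
fit <- lm(stack.loss ~ ., data = stackloss)
coef(fit)

# Same coefficients as the matrix form used on the slide.
fit_matrix <- lm(stack.loss ~ stack.x)

# Pairwise correlations between the four variables.
round(cor(stackloss), 2)
```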
EXPLORE ASSOCIATION
A boxplot is suitable for showing a quantitative and a qualitative variable.
The variable test is not quantitative but categorical.
Such variables are also called factors.
LEAST SQUARES ESTIMATION
Geometric representation of the estimation of $\beta$.
The data vector Y is projected orthogonally onto the model
space spanned by X.
The fit is represented by the projection $\hat{y} = X\hat{\beta}$, with the difference
between the fit and the data represented by the residual
vector e.
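The projection can be verified numerically. The sketch below assumes the stackloss data as an example and computes the least squares estimate from the normal equations, $X^T X \hat{\beta} = X^T y$; the residual vector should then be orthogonal to every column of X.

```r
X <- cbind(1, as.matrix(stackloss[, 1:3]))   # design matrix with an intercept column
y <- stackloss$stack.loss

beta_hat <- solve(crossprod(X), crossprod(X, y))   # solves (X'X) beta = X'y
y_hat    <- X %*% beta_hat                         # projection of y onto the space spanned by X
e        <- y - y_hat                              # residual vector

orth <- max(abs(crossprod(X, e)))   # numerically zero: e is orthogonal to the columns of X
```

In practice lm() is preferred, since it uses a QR decomposition rather than forming X'X explicitly.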
HYPOTHESIS TESTS TO COMPARE MODELS
Given several predictors for a response, we might
wonder whether all are needed.
Consider a large model, $\Omega$, and a smaller model, $\omega$, which
consists of a subset of the predictors that are in $\Omega$.
By the principle of Occam’s Razor (also known as the law of
parsimony), we’d prefer to use $\omega$ if the data will support it.
So we’ll take $\omega$ to represent the null hypothesis and $\Omega$ to
represent the alternative.
A geometric view of the problem may be seen in the following
figure.
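In R, two nested linear models can be compared with an F test through anova(). The sketch below assumes the stackloss data and an arbitrary choice of small model (Air.Flow only); it also recomputes the F statistic from the residual sums of squares to show where the number comes from.

```r
small <- lm(stack.loss ~ Air.Flow, data = stackloss)   # smaller model: a subset of the predictors
large <- lm(stack.loss ~ ., data = stackloss)          # large model: all three predictors

# F test of the null hypothesis that the small model is adequate.
cmp <- anova(small, large)
cmp

# The same F statistic from the residual sums of squares (RSS).
rss_small <- deviance(small); df_small <- df.residual(small)
rss_large <- deviance(large); df_large <- df.residual(large)
F_stat <- ((rss_small - rss_large) / (df_small - df_large)) / (rss_large / df_large)
```

A small p-value in the anova table indicates that the extra predictors in the large model improve the fit by more than chance alone would explain.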