302 SM and Da (Unit 3 4 5)
What Is R?
According to R-Project.org, “R is a language and environment
for statistical computing and graphics.” It’s an open-source
programming language often used as a data analysis and statistical
software tool.
R was developed in 1993 by Ross Ihaka and Robert Gentleman and
includes linear regression, machine learning algorithms, statistical
inference, time series, and more.
R is a cross-platform programming language compatible with the
Windows, UNIX, and Linux platforms.
The environment features of R are discussed below:
Advantages of R programming
3. Click on "install R for the first time" link to download the R executable
(.exe) file.
4. Run the R executable file to start installation, and allow the app to make
changes to your device.
R has now been successfully installed on your Windows OS. Open the R GUI
to start writing R code.
Installing RStudio Desktop
To install RStudio Desktop on your computer, do the following:
Variables can store data of different types, and different types can do
different things.
In R, variables do not need to be declared with any particular type, and can
even change type after they have been set.
We can use the class() function to check the data type of a variable:
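A minimal sketch of how class() reports the type as a variable is reassigned.
The variable names and values here are illustrative only.

```r
# The same variable can hold values of different types over time
x <- 10L            # integer
class_int <- class(x)

x <- 10.5           # numeric (double)
class_num <- class(x)

x <- "ten"          # character
class_chr <- class(x)

x <- TRUE           # logical
class_lgl <- class(x)

print(c(class_int, class_num, class_chr, class_lgl))
```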
Creating Variables in R
Variables are containers for storing data values.
name:
In R, a variable name may contain letters, digits, the period (.), and the
underscore (_). Other symbols such as *, -, &, %, and # are not allowed.
A variable name must start with a letter or a period. It cannot start with a
digit or an underscore, and a leading period cannot be followed by a digit.
To run the whole program, use the Edit > Run all option.
Ex. print("hello")
10) To convert another data type to numeric, use the function as.numeric().
Ex. W <- as.numeric(25L)
W
25
11) To convert another data type to integer, use the function as.integer().
The fractional part is dropped.
Ex. W <- as.integer(25.75)
W
25
Operations in R programming
Arithmetic operations (+, -, *, /, %%, %/%, ^)
Ex.
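A short sketch of the seven arithmetic operators listed above, applied to two
illustrative values (a and b are example names, not from the notes).

```r
a <- 17
b <- 5

sum_ab    <- a + b     # addition: 22
diff_ab   <- a - b     # subtraction: 12
prod_ab   <- a * b     # multiplication: 85
quot_ab   <- a / b     # division: 3.4
rem_ab    <- a %% b    # remainder (modulus): 2
intdiv_ab <- a %/% b   # integer division: 3
pow_ab    <- a ^ 2     # exponentiation: 289

print(c(sum_ab, diff_ab, prod_ab, quot_ab, rem_ab, intdiv_ab, pow_ab))
```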
Relational operation
< Less than
> Greater than
== Equal to
<= Less than or equal to
>= Greater than or equal to
!= Not equal to
Logical operator
& And
| Or
! Not
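The relational and logical operators can be combined as in the sketch below,
which uses & for AND, | for OR and ! for NOT. The values of x, y and v are
illustrative only.

```r
x <- 7
y <- 10

x < y          # TRUE
x == y         # FALSE
x != y         # TRUE

in_range <- (x > 5) & (x < 10)   # TRUE: both conditions hold
either   <- (x > 8) | (y > 8)    # TRUE: the second condition holds
negated  <- !(x > 5)             # FALSE

# Operators are vectorised: each element is compared in turn
v <- c(2, 6, 9)
v_check <- v >= 6                # FALSE TRUE TRUE
print(v_check)
```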
Assignment operator
Conditional statement in R
1) if statement
# Example 1: Basic if statement
x <- 10
if (x > 5) {
  print("x is greater than 5")
}
2) if else statement
# Example 2: if-else statement
x <- 3
if (x > 5) {
  print("x is greater than 5")
} else {
  print("x is not greater than 5")
}
3) if else if statement
# Example 3: if-else if-else statement
if (x > 10) {
  print("x is greater than 10")
} else if (x > 5) {
  print("x is greater than 5 but not greater than 10")
} else {
  print("x is 5 or less")
}
4) nested if statement
# Example 4: Nested if statements
x <- 12
if (x > 5) {
  if (x < 10) {
    print("x is between 5 and 10")
  } else {
    print("x is greater than or equal to 10")
  }
} else {
  print("x is 5 or less")
}
Looping statement
1) for Loop
A for loop is used when you know exactly how many times you want to
execute a block of code. It iterates over a sequence of values, such as a
sequence of numbers or elements in a vector.
Statement
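A minimal sketch of the for loop, with the general form shown as comments. The
vector marks and the variable total are illustrative names.

```r
# General form:
# for (variable in sequence) {
#   code block to be executed
# }

# Loop over a sequence of numbers
for (i in 1:3) {
  print(i)
}

# Loop over the elements of a vector and accumulate a total
marks <- c(55, 65, 70)
total <- 0
for (m in marks) {
  total <- total + m
}
print(total)
```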
2) while Loop
A while loop is used when you want to execute a block of code repeatedly as
long as a condition is TRUE.
while (condition) {
  # Code block to be executed
}
Example
# Example: while loop
count <- 1
while (count <= 5) {
  print(count)
  count <- count + 1
}
Output
[1] 1
[1] 2
[1] 3
[1] 4
[1] 5
In both for and while loops, you can use break to exit the loop prematurely
and next to skip the current iteration and proceed to the next one.
Example:
Output
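The sketch below shows both keywords in one loop: next skips the even numbers
and break stops the loop once the counter passes 7. The variable name
collected is illustrative.

```r
collected <- c()
for (i in 1:10) {
  if (i %% 2 == 0) {
    next        # skip even numbers
  }
  if (i > 7) {
    break       # leave the loop once i exceeds 7
  }
  collected <- c(collected, i)
}
print(collected)
```

Only the odd numbers up to 7 are kept, because 9 triggers break before it is
stored.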
R functions
In R, a function is a block of code that performs a specific task and can be
reused throughout your script or session. Functions in R are defined using
the `function()` keyword. Here's a basic overview of how functions work in
R:
Syntax:
1. Function Name:
This is the name you give to your function, which you use to call it later in
your script.
2. Arguments (Parameters):
These are placeholders for values that you pass into the function. They are
enclosed in parentheses `()` after the function name.
Arguments are optional. If your function doesn't require any inputs, you can
leave them empty.
3. Function Body:
This is where you write the code that you want the function to execute. It can
contain any valid R expressions, assignments, control structures (like `if`,
`else`, `for`, `while`), and other function calls.
4. Return Value:
The `return()` statement specifies the value that the function should return to
the caller. It's optional; if omitted, the function returns the value of the last
expression evaluated.
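The four parts described above can be sketched as follows. The function names
add_numbers and square are illustrative examples, not part of R.

```r
# General form:
# function_name <- function(arg1, arg2, ...) {
#   function body
#   return(value)
# }

add_numbers <- function(a, b) {
  result <- a + b
  return(result)
}

# Without return(), the value of the last expression is returned
square <- function(x) {
  x^2
}

sum_result <- add_numbers(3, 5)
sq_result  <- square(4)
print(sum_result)
print(sq_result)
```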
Examples:
# Output: 8
Notes:
Basic Usage:
readline(prompt = "Enter your name: "): This line prompts the user to
enter their name with the specified message ("Enter your name: ").
The user's input is captured in the variable user_input.
cat("Hello, ", user_input, "! Nice to meet you.\n"): This line prints a
greeting message using the value entered by the user.
Function Example:
greet_user() is a function that uses readline() to prompt for the user's
name.
Inside the function, name <- readline(prompt = "Enter your name: ")
captures the user's input.
cat("Hello, ", name, "! Welcome.\n") then prints a personalized greeting.
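A minimal sketch of the greet_user() idea described above. The helper
make_greeting() is a hypothetical addition that keeps the message separate
from the prompt, so it can be checked without typing anything. Note that
readline() only waits for input in an interactive session.

```r
# Build the greeting separately so it can be checked without typing input
make_greeting <- function(name) {
  paste0("Hello, ", name, "! Welcome.")
}

greet_user <- function() {
  name <- readline(prompt = "Enter your name: ")
  cat(make_greeting(name), "\n")
}

# Only prompt when R is running interactively
if (interactive()) {
  greet_user()
}

example_greeting <- make_greeting("Asha")
print(example_greeting)
```

paste0() is used here because cat() with several arguments separates them with
a space by default, which would leave a gap before the exclamation mark.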
Built in function
R has a wealth of built-in functions that cover a wide range of tasks, from
basic arithmetic to complex statistical analyses. Here are some common
categories of built-in functions in R:
1. Mathematical Functions
2. Statistical Functions
mean(x): Mean of x.
median(x): Median of x.
sd(x): Standard deviation of x.
var(x): Variance of x.
summary(x): Summary statistics of x.
5. Control Flow
These are just a few examples of the extensive set of built-in functions in R.
For a comprehensive list and detailed documentation, you can use R’s help
system with ?function_name or help(function_name).
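As a quick illustration, the statistical functions listed above can be tried on
a small vector. The values in x are made up for the example.

```r
x <- c(4, 8, 15, 16, 23, 42)

mean_x   <- mean(x)     # 18
median_x <- median(x)   # 15.5
sd_x     <- sd(x)       # sample standard deviation
var_x    <- var(x)      # sample variance, equal to sd(x)^2

print(c(mean_x, median_x, sd_x, var_x))
summary(x)              # minimum, quartiles, mean and maximum
```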
Data structure in R
A data structure is a way of organising and storing data in memory.
1. Vectors
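Example: 1
x <- 1:10
Using the seq() function:
sq <- seq(1, 3, length.out = 5)
seq(from = 3.5, to = 1.5, by = -0.5)   # by must be negative for a decreasing sequence
seq(from = -2.7, to = 1.5, length.out = 5)   # length.out is the number of values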
2. Lists
Example
my_list <- list(c("rer","ter","ttt"), c(55,65,70), c("b.com","b.b.a.","b.ca."))
my_list
# Give a name to each element of my_list
names(my_list) <- c("name","roll no.","department")
my_list
4. Arrays
v1 <- c(12:17)
v2 <- c(15,17,19,20)
row_name <- c("r1","r2","r3")
col_name <- c("c1","c2","c3")
mat_name <- c("mat1","mat2")
# Give names to the rows, columns and matrices
my <- array(c(v1,v2),dim=c(3,3,2),dimnames=list(row_name,col_name,mat_name))
my
print(my[3,2,2]) # index one element: row 3, column 2 of the second matrix
5. Data Frames
Definition: A data frame is a two-dimensional table-like structure where
each column can be of a different type. It is similar to a spreadsheet or
SQL table.
Creation: Use the data.frame() function.
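A minimal sketch of creating and inspecting a data frame. The name my_df
matches the examples in these notes, but the columns and values are
hypothetical.

```r
my_df <- data.frame(
  Name  = c("aa", "bb", "cc"),
  Age   = c(21, 22, 23),
  Marks = c(88, 76, 91)
)

print(my_df)
str(my_df)      # structure: column names, types and first values
dim(my_df)      # 3 rows, 3 columns

first_col   <- my_df[1]       # first column, still a data frame
one_value   <- my_df[1, 2]    # row 1, column 2
name_column <- my_df$Name     # column extracted as a vector
```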
We can display the structure of a data frame with the str() function.
We can index a data frame with square brackets [] and the $ sign.
Ex:
print(my_df[1])
print(my_df[1,2])
print(my_df[1,])
print(my_df[,3])
my_df$Name
my_df$Name<-c("aa","bb","cc")
my_df
We can add a new column or row to a data frame with the cbind() and rbind()
functions.
Ex:
cbind(my_df,assign=c(78,76,78))
rbind(my_df,c("AA",40,88,77))
We can also combine two data frames with the cbind() and rbind()
functions.
Ex: v<-rbind(my_df1,my_df2)
We can check the dimensions of a data frame with the dim() function.
Ex: dim(my_df) # gives the number of rows and columns in the data frame
We can extract part of a data frame with the subset() function.
Ex: subset(my_df,Name!="aa")
1. CSV Files
CSV (Comma Separated Values) files are one of the most straightforward
formats to import into R.
# Assuming your CSV file is named 'data.csv' and is located in your working directory
data <- read.csv("data.csv")
aa<-read.csv("people.csv")
View(aa)
fix(aa)                  # edit the data frame in a spreadsheet-like editor
str(aa)                  # show the structure of the data frame
summary(aa)              # give a statistical summary of each column
names(aa)                # give all the variable names
nrow(aa)                 # give the number of rows
ncol(aa)                 # give the number of columns
length(aa)               # give the length of the data frame (its number of columns)
dim(aa)                  # show the dimensions of the data frame
colnames(aa)             # return the column names
head(aa)                 # show the first six rows
tail(aa)                 # show the last six rows
bb<-aa[c(1:2,3,7)]       # select columns 1, 2, 3 and 7
View(bb)
cc<-aa[c(1:3),c(1:3)]    # select the first three rows and first three columns
View(cc)
names(aa)
vv<-aa$User.Id           # extract one column; add [i] to pick individual values
vv
ff<-subset(aa,Sex=="Male")
View(ff)
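The commands above assume a file people.csv that is not included in these
notes. As a self-contained sketch, the example below first writes a small,
made-up data frame to a temporary CSV file and then reads it back. The column
names mirror the ones used above, but the values are hypothetical.

```r
# Write a small data frame to a temporary CSV file so the example is self-contained
people <- data.frame(
  User.Id = 1:4,
  Sex     = c("Male", "Female", "Male", "Female"),
  Age     = c(34, 28, 45, 39)
)
csv_path <- tempfile(fileext = ".csv")
write.csv(people, csv_path, row.names = FALSE)

# Read it back and inspect it
aa <- read.csv(csv_path)
str(aa)
n_rows <- nrow(aa)
n_cols <- ncol(aa)
head(aa, 2)                       # first two rows

males <- subset(aa, Sex == "Male")
print(males)
```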
2. Excel Files
For Excel files, you typically need to use the readxl package, which provides
functions to read Excel files into R.
First, make sure to install the readxl package if you haven't already:
install.packages("readxl")
Then, you can use the read_excel() function to import data from Excel:
library(readxl)
Subsetting Data
To subset data by specific rows and columns, you can use square brackets [ ].
Example: subset_data <- data[1:10, c("column1",
"column2")]
This extracts the first 10 rows and columns "column1" and "column2" from
data.
You can subset data based on logical conditions using square brackets [ ] with
logical expressions.
Example: subset_data <- data[data$column1 > 50, ]
This selects rows from data where values in column1 are greater than 50.
Filtering Data
1. Using the subset() Function:
The subset() function in R allows for more complex filtering based on logical
conditions.
Example: subset_data <- subset(data, column1 > 50 &
column2 == "value")
This creates subset_data by filtering data where column1 values are
greater than 50 and column2 equals "value".
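Both the bracket-based subsetting and the subset() filter described above can
be tried on a small data frame. The data below is made up for illustration and
reuses the placeholder names column1 and column2.

```r
data <- data.frame(
  column1 = c(20, 55, 70, 45, 90),
  column2 = c("value", "value", "other", "value", "value")
)

# Rows 1 to 3, selected columns
first_rows <- data[1:3, c("column1", "column2")]

# Rows where column1 is greater than 50
over_50 <- data[data$column1 > 50, ]

# Combined condition with subset(); column names are used without data$
over_50_value <- subset(data, column1 > 50 & column2 == "value")
print(over_50_value)
```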
The dplyr package provides a more intuitive way to filter data using functions
like filter() and select().
Example:
library(dplyr)
Notes:
Adding Variables
1. Using $ Operator:
This creates a new column named new_column in data with values 1, 2, 3, 4, and
5.
2. Using cbind() :
Combine the existing data frame with the new column using cbind().
If you are working with the dplyr package, you can use mutate() to add a new
column based on existing data.
library(dplyr)
data <- data %>%
mutate(new_column = c(1, 2, 3, 4, 5))
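The two base R approaches named above, the $ operator and cbind(), can be
sketched as follows. The data frame and the column name doubled are
illustrative.

```r
data <- data.frame(id = 1:5, score = c(10, 20, 30, 40, 50))

# 1. $ operator: assign a vector of matching length to a new name
data$new_column <- c(1, 2, 3, 4, 5)

# 2. cbind(): bind a new column onto the existing data frame
data <- cbind(data, doubled = data$score * 2)

print(data)
```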
Removing Variables
To remove (delete) variables from a data frame in R:
library(dplyr)
data <- select(data, -column_to_remove)
This removes column_to_remove from data.
Renaming Variables
To rename variables (columns) in a data frame in R:
If you have a data frame and you want to rename its columns, you can use the
names() function.
library(dplyr)
data <- data %>%
rename(new_name = old_name)
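Removing and renaming columns can also be done in base R without dplyr. The
sketch below reuses the placeholder names from the text on a made-up data
frame.

```r
data <- data.frame(
  old_name         = 1:3,
  keep_me          = c("a", "b", "c"),
  column_to_remove = c(TRUE, FALSE, TRUE)
)

# Remove a column by assigning NULL to it
data$column_to_remove <- NULL

# Rename a column by matching on its current name
names(data)[names(data) == "old_name"] <- "new_name"

print(names(data))
```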
Notes:
Be cautious with data manipulation: Always make sure you understand the
structure of your data frame and how changes will affect your analysis.
Using packages: Functions from base R and the dplyr package provide convenient
ways to add, remove, and rename variables in data frames, depending on your
preference and workflow.
Data cleaning and transformation
Data cleaning and transformation are crucial steps in preparing data for analysis in R. They
involve handling missing values, correcting data types, transforming variables, and more.
Here’s a structured approach to perform data cleaning and transformation in R:
Data Cleaning
Identify missing values (NA, NaN, empty strings, etc.) in your data using functions
like is.na() or complete.cases().
Remove incomplete rows with na.omit(), or replace missing values using functions
like replace_na() (from tidyr) or impute() (from the imputeMissings package).
2. Removing Duplicates:
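A minimal sketch of removing duplicate rows with the base R functions
duplicated() and unique(). The data frame is made up for illustration.

```r
data <- data.frame(
  id    = c(1, 2, 2, 3, 3),
  value = c("a", "b", "b", "c", "d")
)

dup_flags   <- duplicated(data)        # TRUE for rows that repeat an earlier row
data_unique <- data[!dup_flags, ]      # same rows as unique(data)

print(dup_flags)
print(data_unique)
```

Only the third row is dropped: the two rows with id 3 differ in value, so
neither counts as a duplicate.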
3. Handling Outliers:
Identify and handle outliers using statistical methods like z-score, IQR
(interquartile range), or domain-specific knowledge.
# Remove outliers
data_clean <- data[!data %in% outliers]
# Output results
print(outliers) # Print outliers
print(data_clean) # Print cleaned data
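The snippet above relies on data and outliers already being defined. A
self-contained sketch using the IQR rule is shown below. The sample values and
the conventional 1.5 multiplier are assumptions made for the example.

```r
data <- c(12, 14, 15, 15, 16, 17, 18, 19, 20, 95)

q1  <- quantile(data, 0.25)
q3  <- quantile(data, 0.75)
iqr <- q3 - q1                      # same value as IQR(data)

# Values outside these fences are treated as outliers
lower <- q1 - 1.5 * iqr
upper <- q3 + 1.5 * iqr

outliers   <- data[data < lower | data > upper]
data_clean <- data[!data %in% outliers]

print(outliers)
print(data_clean)
```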
Data Transformation
2. Reshaping Data:
Reshape data using functions like melt() and dcast() from the reshape2
package or pivot_longer() and pivot_wider() from the tidyr
package.
To reshape data from a wide format to a long format (and vice versa), you typically use
functions from the tidyr or reshape2 packages.
Example:
# Load tidyr
library(tidyr)
print(data_long)
Explanation:
Example:
# Load reshape2
library(reshape2)
print(data_long)
Explanation:
To convert data from a long format to a wide format, you can use tidyr or reshape2
functions.
Example:
# Load tidyr
library(tidyr)
print(data_wide)
Explanation:
pivot_wider is used to transform rows into columns.
names_from specifies the column that will become the new column names.
values_from specifies the column that contains the values.
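The same wide-to-long and long-to-wide round trip can also be sketched with
base R's reshape() function, which is used here instead of tidyr or reshape2
so that no extra package is needed. The data frame and column names are made
up for illustration.

```r
wide <- data.frame(
  id      = 1:3,
  score_1 = c(10, 20, 30),
  score_2 = c(15, 25, 35)
)

# Wide to long: one row per id and test
data_long <- reshape(
  wide,
  direction = "long",
  varying   = c("score_1", "score_2"),
  v.names   = "score",
  timevar   = "test",
  times     = 1:2,
  idvar     = "id"
)
print(data_long)

# Long to wide: one column per test again
data_wide <- reshape(
  data_long,
  direction = "wide",
  idvar     = "id",
  timevar   = "test",
  v.names   = "score"
)
print(data_wide)
```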
Example:
# Load reshape2
library(reshape2)
print(data_wide)
Explanation:
Notes:
By following these guidelines and using appropriate R functions and packages, you can
effectively clean and transform your data to prepare it for further analysis or modeling tasks.
Use is.na() to detect missing values (NA) in your data frame. Note that
is.null() tests whether an object is NULL and does not detect NA entries.
The any(is.na(data)) function checks if there are any missing values in the
entire data frame data. The colSums(is.na(data)) function calculates the
number of missing values (NA) in each column of the data frame data.
The na.omit(data) function removes rows from data that contain any missing
values (NA). The complete.cases(data$column1, data$column2) function returns
TRUE for rows with no missing value in column1 or column2, so
data[complete.cases(data$column1, data$column2), ] keeps only those rows.
Replace missing values with estimated values such as mean, median, mode, or
predictive models.
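The detection, removal and imputation steps above can be sketched on a small
data frame. The data is made up, and mean imputation is only one of the
options mentioned.

```r
data <- data.frame(
  column1 = c(10, NA, 30, 40),
  column2 = c("a", "b", NA, "d")
)

has_na     <- any(is.na(data))          # TRUE
na_per_col <- colSums(is.na(data))      # one NA in each column

rows_complete <- na.omit(data)          # drops rows 2 and 3
rows_filtered <- data[complete.cases(data$column1, data$column2), ]

# Mean imputation for a numeric column
imputed <- data
imputed$column1[is.na(imputed$column1)] <- mean(imputed$column1, na.rm = TRUE)
print(imputed)
```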
Notes:
Data Context: Understand the context and reasons behind missing values to choose
appropriate handling techniques.
Documentation: Document your approach to handling missing values to ensure
transparency and reproducibility.
Validation: Validate the impact of missing data handling on analysis results to ensure
robustness.
By applying these techniques in R, you can effectively identify and manage missing values in
your datasets, ensuring that your data is ready for further analysis or modeling tasks.
In R, data type conversion and handling variables are essential tasks when preparing data for
analysis or modeling. Here's how you can perform data type conversion, recode variables, and
manage them effectively:
Ensure that the character data can be correctly converted to numeric (e.g., no non-
numeric characters).
Be cautious when converting factors to numeric: as.numeric() returns the internal
integer codes of the factor, not the values shown as its levels. Convert with
as.numeric(as.character(f)) to recover the values.
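The two cautions above can be seen in a short sketch. The factor f and the
character vector chr are illustrative.

```r
f <- factor(c("10", "20", "10", "30"))

codes  <- as.numeric(f)                  # internal codes: 1 2 1 3
values <- as.numeric(as.character(f))    # intended values: 10 20 10 30

# Character to numeric: non-numeric entries become NA (with a warning)
chr <- c("1.5", "2", "abc")
num <- suppressWarnings(as.numeric(chr)) # 1.5 2.0 NA

print(codes)
print(values)
print(num)
```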
Recoding Variables
Use arithmetic operations or functions to create new variables from existing ones.
2. Renaming Variables:
# Rename a variable
names(data)[which(names(data) == "old_name")] <- "new_name"
Ensure the new names are meaningful and consistent with your data.
3. Removing Variables:
Use subset(), select() from dplyr, or indexing to remove variables from a
data frame.
Notes:
Data Integrity: Ensure data type conversions maintain data integrity and correctness.
Documentation: Document your data type conversions and variable management
steps for clarity and reproducibility.
Validation: Validate data type conversions and variable operations to avoid
unintended consequences in your analysis or modeling process.
By following these guidelines and using appropriate functions in R, you can effectively
handle data type conversions, create new variables, rename variables, and manage variables
in your datasets to prepare them for further analysis or modeling tasks.
Unit: 5
Working With Data In R
1. Reordering and reshaping data frames
2. Merging and joining data frames.
3. Calculating summary statistics (mean, median, mode, standard deviation).
4. Generating frequency tables and cross-tabulations.
5. Commands for measures of central tendency and dispersion.
6. Concepts of normal distribution.
7. Commands to explore and view data distributions graphically (Bell curve).
1. Reordering Rows
To reorder rows based on a particular column, you can use the order() function.
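A minimal sketch of reordering rows with order(). The data frame df and its
ID and Value columns are hypothetical.

```r
df <- data.frame(ID = c(3, 1, 2), Value = c("c", "a", "b"))

df_sorted_asc  <- df[order(df$ID), ]                     # ascending by ID
df_sorted_desc <- df[order(df$ID, decreasing = TRUE), ]  # descending by ID

print(df_sorted_asc)
print(df_sorted_desc)
```

order() returns the row positions in sorted order, so using it as the row
index rearranges the whole data frame.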
2. Reordering Columns
# Reorder columns
df_reordered <- df[, c("Value", "ID")]
print(df_reordered)
To reshape data frames from wide to long format and vice versa, the reshape2 package (or
the data.table package, which provides its own melt() and dcast() functions) is useful.
Using reshape2:
library(reshape2)
Using data.table:
library(data.table)
Using Base R
1. merge()
The merge() function is versatile and allows you to join data frames by common columns or
row names.
# Left join
left_joined_df <- merge(df1, df2, by = "ID", all.x = TRUE)
print(left_joined_df)
# Right join
right_joined_df <- merge(df1, df2, by = "ID", all.y = TRUE)
print(right_joined_df)
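The joins above assume two data frames, df1 and df2, that share an ID column.
A self-contained sketch with made-up data, which also shows the inner and full
joins, might look like this.

```r
df1 <- data.frame(ID = c(1, 2, 3), Name = c("aa", "bb", "cc"))
df2 <- data.frame(ID = c(2, 3, 4), Marks = c(76, 91, 64))

inner_joined_df <- merge(df1, df2, by = "ID")                # IDs 2, 3
left_joined_df  <- merge(df1, df2, by = "ID", all.x = TRUE)  # IDs 1, 2, 3
right_joined_df <- merge(df1, df2, by = "ID", all.y = TRUE)  # IDs 2, 3, 4
full_joined_df  <- merge(df1, df2, by = "ID", all = TRUE)    # IDs 1, 2, 3, 4

print(full_joined_df)
```

Rows without a match on the other side are filled with NA, for example Marks
for ID 1 in the left join.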
1. Mean
# Sample vector
data <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
# Calculate mean
mean_value <- mean(data)
print(mean_value)
2. Median
The median is the middle value when the numbers are sorted.
# Calculate median
median_value <- median(data)
print(median_value)
3. Mode
R does not have a built-in function for mode, but you can create one.
# Calculate mode
mode_value <- get_mode(data)
print(mode_value)
Note: If the data is multimodal (more than one mode), this function will only return one
mode. You can modify it to return all modes if needed.
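One possible definition of get_mode(), together with a variant that returns
all modes, is sketched below. Both are illustrative implementations rather
than built-in functions, and get_all_modes() is a hypothetical name.

```r
# Return the first most frequent value
get_mode <- function(x) {
  ux <- unique(x)
  counts <- tabulate(match(x, ux))   # how often each unique value occurs
  ux[which.max(counts)]
}

# Return every value that shares the highest frequency
get_all_modes <- function(x) {
  ux <- unique(x)
  counts <- tabulate(match(x, ux))
  ux[counts == max(counts)]
}

sample_values <- c(2, 3, 3, 5, 5, 7)
mode_value <- get_mode(sample_values)       # 3 (first of the tied values)
all_modes  <- get_all_modes(sample_values)  # 3 and 5
print(mode_value)
print(all_modes)
```

When every value occurs exactly once, as in a vector such as 1 to 10,
get_mode() simply returns the first element.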
4. Standard Deviation
The standard deviation measures the amount of variation or dispersion in the data.
You can also calculate these statistics for columns in a data frame.
print(mean_A)
print(median_A)
print(mode_A)
print(std_dev_A)
print(mean_B)
print(median_B)
print(mode_B)
print(std_dev_B)
The dplyr package can simplify calculations, especially for data frames.
library(dplyr)
print(summary_stats)
Additional Considerations
Handling Missing Values: Use the na.rm = TRUE parameter to exclude NA values in
calculations.
Customizing Mode Calculation: For datasets with multiple modes, you might want
to return all modes.
These methods will help you compute and analyze summary statistics efficiently in R.
Frequency Tables
Using Base R
Use the table() function to get the frequency of each unique value in a vector.
# Sample data
data <- c("apple", "banana", "apple", "orange", "banana", "banana",
"apple")
# Frequency table
freq_table <- table(data)
print(freq_table)
For more than one variable, table() can handle this directly.
# Cross-tabulation
cross_tab <- table(df$Fruit, df$Color)
print(cross_tab)
Advanced Cross-Tabulations
The xtabs() function creates contingency tables from a formula and data frame.
# Cross-tabulation
cross_tab_xtabs <- xtabs(~ Fruit + Color, data = df)
print(cross_tab_xtabs)
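The cross-tabulation commands above assume a data frame df with Fruit and
Color columns. A self-contained sketch with made-up data is shown below.

```r
df <- data.frame(
  Fruit = c("apple", "banana", "apple", "orange", "banana", "apple"),
  Color = c("red", "yellow", "green", "orange", "yellow", "red")
)

freq_fruit      <- table(df$Fruit)                    # counts per fruit
cross_tab       <- table(df$Fruit, df$Color)          # fruit by colour
cross_tab_xtabs <- xtabs(~ Fruit + Color, data = df)  # same counts from a formula

print(freq_fruit)
print(cross_tab_xtabs)
prop.table(freq_fruit)                                # relative frequencies
```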
library(gmodels)
Summary
Frequency Table for Single Variable: Use table() in base R or count() in dplyr.
Cross-Tabulation: Use table() or xtabs() in base R, or count() in dplyr. For
detailed cross-tabulations, consider CrossTable() from the gmodels package.
These methods allow you to summarize and explore categorical data effectively.
1. Mean
# Sample vector
data <- c(10, 20, 30, 40, 50)
# Calculate mean
mean_value <- mean(data)
print(mean_value)
2. Median
# Calculate median
median_value <- median(data)
print(median_value)
3. Mode
R does not have a built-in mode function, but you can define one.
# Calculate mode
mode_value <- get_mode(data)
print(mode_value)
Measures of Dispersion
1. Standard Deviation
The standard deviation measures the amount of variation in the data. Use sd().
# Calculate standard deviation
std_dev <- sd(data)
print(std_dev)
2. Variance
The variance measures the spread of the data points. Use var().
# Calculate variance
variance <- var(data)
print(variance)
3. Range
The range is the difference between the maximum and minimum values. The range()
function returns the minimum and maximum, and diff() gives the difference between them.
# Calculate range
data_range <- range(data)
range_value <- diff(data_range)
print(range_value)
You can also compute these measures for data frames using the dplyr package.
install.packages("dplyr")
library(dplyr)
print(summary_stats)
Summary of Commands
Mean: mean()
Median: median()
Mode: Custom function get_mode()
Standard Deviation: sd()
Variance: var()
Range: range() with diff() or max() - min()
These functions will help you compute and analyze the central tendency and dispersion of
your data effectively.
You can generate random data that follows a normal distribution using the rnorm() function.
# Generate 1000 random numbers from a normal distribution with mean 0 and standard deviation 1
data <- rnorm(1000, mean = 0, sd = 1)
# Histogram
hist(data, breaks = 30, main = "Histogram of Normally Distributed Data",
xlab = "Value", col = "lightblue", border = "black")
# Density plot
plot(density(data), main = "Density Plot of Normally Distributed Data",
xlab = "Value", ylab = "Density")
3. Quantile Function
You can find the quantile for a given probability using qnorm().
For more advanced visualization, you can use the ggplot2 package.
library(ggplot2)
# Create a data frame
df <- data.frame(value = data)
rnorm(n, mean, sd): Generate n random numbers from a normal distribution with
specified mean and standard deviation.
dnorm(x, mean, sd): Compute the density of the normal distribution at x.
pnorm(q, mean, sd): Compute the cumulative probability up to quantile q.
qnorm(p, mean, sd): Compute the quantile for a given probability p.
hist(): Create histograms for visualizing the distribution.
density(): Create density plots.
ggplot2: For advanced and customizable visualizations.
These commands and functions will help you understand and work with normal distributions
effectively in R.
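The four distribution functions summarised above can be tried together for the
standard normal distribution. The seed value is an arbitrary choice that makes
the random draw reproducible.

```r
set.seed(42)                                    # make the random draw reproducible
sample_data <- rnorm(1000, mean = 0, sd = 1)

density_at_0 <- dnorm(0, mean = 0, sd = 1)      # about 0.3989
prob_below   <- pnorm(1.96, mean = 0, sd = 1)   # about 0.975
quantile_975 <- qnorm(0.975, mean = 0, sd = 1)  # about 1.96

# pnorm() and qnorm() are inverses of each other
round_trip <- pnorm(qnorm(0.3))                 # 0.3

print(c(density_at_0, prob_below, quantile_975, round_trip))
```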
A histogram with a superimposed density plot is a common way to visualize the distribution
of data.
# Sample data
data <- rnorm(1000, mean = 0, sd = 1)
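A minimal base R sketch of a histogram with a superimposed density curve. The
seed, colours and titles are illustrative choices. The dashed red line adds
the theoretical bell curve for comparison.

```r
set.seed(1)
data <- rnorm(1000, mean = 0, sd = 1)

# freq = FALSE puts the histogram on the density scale so the curves line up
h <- hist(data, breaks = 30, freq = FALSE,
          main = "Histogram with Density Curve",
          xlab = "Value", col = "lightblue", border = "black")

# Kernel density estimate of the sample
lines(density(data), col = "blue", lwd = 2)

# Theoretical normal curve using the sample mean and standard deviation
curve(dnorm(x, mean = mean(data), sd = sd(data)),
      add = TRUE, col = "red", lwd = 2, lty = 2)
```

If the data is close to normal, the blue and red curves should lie almost on
top of each other.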
2. Q-Q Plot
A Q-Q (quantile-quantile) plot compares the quantiles of your data against the quantiles of a
theoretical normal distribution.
# Q-Q plot
qqnorm(data, main = "Q-Q Plot")
qqline(data, col = "red", lwd = 2)
3. Density Plot
# Density plot
plot(density(data), main = "Density Plot", xlab = "Value", ylab =
"Density", col = "blue", lwd = 2)
The ggplot2 package allows for more customizable and advanced visualizations.
install.packages("ggplot2")
library(ggplot2)
A normal probability plot is another way to visualize if the data follows a normal distribution.
These methods will help you graphically explore and understand the distribution of your data,
including how closely it follows a normal (bell curve) distribution.