
Modelling basics



Models are a helpful way to understand your data and to determine if there is a statistically significant relationship between your two (or more!) chosen variables. Before you start to create your model, predict what relationship your two variables may have. In this example, we inspect whether leaf area has a statistically significant relationship with stem area. What do you hypothesise? A simple linear model can be set up as follows:

# load your data
graph1 <- read.csv("graph1.csv")

# check that your data is formatted as expected
head(graph1)

# test if the relationship between your two variables is significant
area.model <- lm(leaf_area ~ stem_area, data = graph1)

# output of model
summary(area.model)


Once you create your model, R outputs a lot of information, which can be very hard to pick through! Here, we break down each part for you.

Call
This outputs the model you are testing. Make sure it’s correct!

Residuals
This provides you with a summary of the residuals’ distribution. In general, the median should be close to 0 and the 1st and 3rd quartiles (1Q/3Q) should be similar in magnitude. If this is not the case, you should double-check that you’ve met your model’s assumptions.

Coefficients
Here you can see how well the data fit the regression equation. The estimate in the (Intercept) row is the intercept, and the estimate in the stem_area row is the slope; if the slope differs significantly from zero, there is a relationship between the response and explanatory variables. If the slope is positive, there is a positive relationship, and vice versa. A positive relationship would mean, for example, that as stem area increases, so does leaf area. The t-statistics and p-values indicate whether the relationship is significant.

Summary
The R-squared value indicates how much of the variation is explained by the model, or in our example, how much of the variation in leaf area is explained by stem area. The adjusted R-squared accounts for the number of explanatory variables relative to the sample size, so it is a fairer measure when you compare models. Here, our model explains 33.5% of the variation. The p-value shows the overall significance of the model; however, it is important to look at each constituent part to assess significance. Generally, a model is seen as significant if the p-value is less than 0.05. It is important to note that this threshold is arbitrary; a p-value of 0.051 doesn’t necessarily mean your model is invalid!
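Each part of the output described above can also be pulled out by name, which is handy when you want to report a value rather than read it off the console. The following is a minimal sketch, not part of the original tutorial: it uses simulated data as a stand-in for graph1.csv (the object name sim_data and all its values are invented for illustration) and relies only on base R.

```r
# Simulated stand-in for graph1.csv (values are invented for illustration)
set.seed(42)
sim_data <- data.frame(stem_area = runif(50, min = 1, max = 10))
sim_data$leaf_area <- 2 + 1.5 * sim_data$stem_area + rnorm(50, sd = 3)

area.model <- lm(leaf_area ~ stem_area, data = sim_data)
model.summary <- summary(area.model)

# Call: the model that was fitted
model.summary$call

# Residuals: minimum, quartiles, median and maximum
quantile(residuals(area.model))

# Coefficients: estimate, standard error, t-statistic and p-value per row
coef.table <- coef(model.summary)
slope.estimate <- coef.table["stem_area", "Estimate"]
slope.p.value <- coef.table["stem_area", "Pr(>|t|)"]

# Summary: R-squared, adjusted R-squared and the overall model p-value
r.squared <- model.summary$r.squared
adj.r.squared <- model.summary$adj.r.squared
f.stat <- model.summary$fstatistic
overall.p.value <- unname(pf(f.stat["value"], f.stat["numdf"],
                             f.stat["dendf"], lower.tail = FALSE))
```

With a single explanatory variable, the overall p-value and the p-value of the slope are the same number; they only differ once the model has more than one explanatory variable.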

Using code
When designing your model, it’s important to check that you’ve met any assumptions your model has before you proceed with your analysis. However, it can be hard to understand what the assumptions mean, as well as to remember how to check them. Look below for an example of how to check if you’ve met the assumptions for a linear model. 
+
+            
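1. The residuals (the difference between the observed and predicted values of the dependent variable) are normally distributed. The null hypothesis of the Shapiro-Wilk test is that the residuals come from a normal distribution, so a p-value over 0.05 means there is no evidence that they deviate from normality, and you can treat this assumption as met.

model.resid <- resid(area.model)
shapiro.test(model.resid)

2. The data are homoscedastic, meaning they have equal variances. The null hypothesis here is that the variance is the same across all groups, so a p-value over 0.05 means there is no evidence against equal variances and you can treat this assumption as met. Note that Bartlett’s test compares variances between groups, so it is best suited to a categorical explanatory variable; with a continuous variable such as stem_area, each distinct value is treated as its own group.

bartlett.test(leaf_area ~ stem_area, data = graph1)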
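Bartlett’s test compares variances between groups, so it is most natural when the explanatory variable is categorical. For a continuous explanatory variable, one commonly used alternative is a Breusch-Pagan-style test, which asks whether the squared residuals change with the explanatory variable. The sketch below is not part of the original tutorial: it writes out the studentised (Koenker) form of the test by hand in base R, on simulated data (sim_data is a made-up stand-in for graph1). As with the tests above, a p-value over 0.05 gives no evidence against equal variances.

```r
# Simulated stand-in for graph1.csv (values are invented for illustration)
set.seed(1)
sim_data <- data.frame(stem_area = runif(60, min = 1, max = 10))
sim_data$leaf_area <- 2 + 1.5 * sim_data$stem_area + rnorm(60, sd = 2)

area.model <- lm(leaf_area ~ stem_area, data = sim_data)

# Auxiliary regression: do the squared residuals depend on stem_area?
sim_data$sq.resid <- resid(area.model)^2
aux.model <- lm(sq.resid ~ stem_area, data = sim_data)

# Test statistic: n * R-squared of the auxiliary regression, compared
# with a chi-squared distribution (df = 1, one explanatory variable)
n <- nrow(sim_data)
bp.statistic <- n * summary(aux.model)$r.squared
bp.p.value <- pchisq(bp.statistic, df = 1, lower.tail = FALSE)

bp.statistic
bp.p.value
```

If the p-value is small, the spread of the residuals changes with stem area, which is worth confirming against the diagnostic plots before deciding what to do.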
 

Using plots
Another way to check if your data meet your model’s assumptions is to use the command plot(your_model). This brings up four plots: (1) residuals versus fitted, (2) normal Q-Q, (3) scale-location and (4) residuals versus leverage.

The residuals versus fitted plot helps you assess whether the variance is constant, that is, whether your data are homoscedastic. It also helps assess whether or not there is a linear relationship between your variables. R labels the biggest outliers with their row numbers or names.

The normal Q-Q plot assesses if your residuals are normally distributed. If the points are close to the dashed line, they are likely normally distributed. Here, the tails drift slightly from the line of normal distribution. This is common in small datasets and is nothing to be concerned about, especially if your Shapiro-Wilk test gives no evidence against normality!

The scale-location plot aims to identify heteroscedasticity, which is what we don’t want! This plot is a bit easier to read than the first one: if the red line is not horizontal, then the residuals are not homoscedastic. However, how close to horizontal it has to be can be debated; a roughly horizontal line is okay!

The residuals versus leverage plot shows the leverage, or how much each data point influences the fit of the model (think R-squared value). Points that are isolated and farther from zero have a larger leverage. You can see on the plot that Cook’s distance is also shown; this is how much the model fit would change if the isolated point were deleted. You want to avoid having isolated residuals with a Cook’s distance of over 0.5. This plot has a few points that fit that description, which may need to be removed, or perhaps the data should be transformed!
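The four plots are easier to compare side by side, and the Cook’s distance guideline can be checked directly rather than by eye. The following is a minimal sketch on simulated data (sim_data is a made-up stand-in for graph1, and the values are invented for illustration); par(), plot() and cooks.distance() are all base R.

```r
# Simulated stand-in for graph1.csv (values are invented for illustration)
set.seed(7)
sim_data <- data.frame(stem_area = runif(40, min = 1, max = 10))
sim_data$leaf_area <- 2 + 1.5 * sim_data$stem_area + rnorm(40, sd = 2)

area.model <- lm(leaf_area ~ stem_area, data = sim_data)

# Show all four diagnostic plots in one 2 x 2 window,
# then restore the previous plotting layout
old.par <- par(mfrow = c(2, 2))
plot(area.model)
par(old.par)

# Cook's distance for every observation
cooks <- cooks.distance(area.model)

# Row names of any observations above the 0.5 guideline
influential <- names(cooks)[cooks > 0.5]
influential
```

An empty result means no observation exceeds the guideline. If some do, inspect those rows before deciding whether to remove them or transform the data.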

What if your data don’t fit your model’s assumptions?
There are many ways to approach the problem of missed assumptions. First, consider if you’ve designed the correct model. Think about the ecological reasoning behind your decisions and make sure it is the most logical. Review some other design options to ensure you’ve created the best model. Second, it’s important to consider how far off your model is from meeting its assumptions. A p-value that misses the threshold by 0.01 or so is unlikely to matter! If you are confident in your model, stick with it. If your data clearly fail to meet your model’s assumptions, it may be time to transform your data. This can be done through a log or square root transformation of one or both of your variables. For example:

area.model <- lm(log(leaf_area) ~ stem_area, data = graph1)

This takes a bit of trial and error to see which combinations work; re-check the assumptions after each transformation until they are met.

In all cases, it’s important that you can back up your decisions on altering your model with logical reasoning, so that your conclusions are well informed.
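The trial and error described above can be made a little more systematic by fitting each candidate transformation and re-running the same assumption check on every fit. The sketch below is one possible way to do this, not the tutorial’s own method: it uses simulated, deliberately skewed data (sim_data is a made-up stand-in for graph1) and compares the Shapiro-Wilk p-value of the residuals for three candidate models.

```r
# Simulated stand-in for graph1.csv with multiplicative (skewed) noise;
# values are invented for illustration
set.seed(3)
sim_data <- data.frame(stem_area = runif(60, min = 1, max = 10))
sim_data$leaf_area <- exp(0.5 + 0.2 * sim_data$stem_area +
                            rnorm(60, sd = 0.4))

# Candidate models: untransformed, log and square-root response
candidates <- list(
  raw  = leaf_area ~ stem_area,
  log  = log(leaf_area) ~ stem_area,
  sqrt = sqrt(leaf_area) ~ stem_area
)

# Fit each candidate and record the Shapiro-Wilk p-value of its residuals
shapiro.p <- sapply(candidates, function(model.formula) {
  fit <- lm(model.formula, data = sim_data)
  shapiro.test(resid(fit))$p.value
})

shapiro.p
```

A higher p-value is not, on its own, a reason to choose a transformation; the choice should still make ecological sense, and the diagnostic plots should be checked for the model you keep.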

