Sheet 14
Statistics in Bioinformatics
Exercise Sheet 14
5. What are the essential properties of Markov chains, and why are they important in
bioinformatics (typical questions)?
7. What do the terms recurrent, transient, irreducible, periodic, and aperiodic mean?
10. What is a Markov process? What is the relationship between the transition matrix and
the rate matrix?
17. Name two important evolutionary DNA models and their properties.
18. Explain how time estimation can be done with evolutionary DNA models.
23. When comparing a sequence to a database, how are the P value and the E value defined?
27. What is kernel-based regression estimation and how is it different from the K-nearest
neighbor approach?
29. Formulate the minimization problems in local linear and local polynomial regression.
32. What is Huber’s error model for measured intensity of gene expression data and what is
the resulting transformation?
33. What does robust regression mean and what is minimized with LTS?
37. Name three important distance measures in cluster analysis. How do they differ?
38. Explain the most popular algorithms for clustering microarray data.
39. What methods can be used to determine the number of clusters? How is the average
silhouette width defined?
41. How can you reasonably select genes in a cluster analysis for microarray samples?
44. Briefly describe linear and quadratic discriminant analysis. How does regularized
discriminant analysis (RDA) work?
45. Explain PAM (prediction analysis for microarrays). What is the significance of the
regularization parameter ∆?
47. What is the procedure for SVMs in the case of overlapping classes?
50. Given a random forest, how do construction and prediction work?
51. How can the amount of regularization needed in a classification procedure be determined?
How does one proceed with PAM?
55. How are AUC (area under the curve) and pAUC defined and what is a possible application
in microarray experiments?
58. What is the difference between the FWER (family-wise error rate) and the FDR (false
discovery rate)? Name a procedure for controlling each of these error rates.
59. What is the underlying statistical model in the LIMMA methodology? Specify the
distribution assumptions.
60. List two approaches to dividing a group of patients into two subgroups based on the
distribution of expression levels of a gene.
61. Which bimodality measures from the lecture can be assigned to each of these two
approaches?
62. What are the two main approaches for enrichment analysis? Explain the two methods.
64. What is meant by the Global Test? How can the test statistic be interpreted?
65. What are the steps of the STEM algorithm and how can you use it to analyze gene groups?
66. What is the most popular model for fitting dose-response curves? Specify the model and
the procedure for estimating its parameters.
67. What are the incidence matrix and the ancestor matrix of a DAG?
70. Name all classes of disease progression models from the lecture.
71. Explain oncogenetic trees. How is the probability calculated that a certain combination of
events occurs?
This sheet will not be corrected or graded, thus no submission is required. Note that there will
be no sample solution for this exercise sheet.