Lecture 01
Big Data Handling
What are ML techniques?
Machine Learning
Deep learning:
Deep learning is a subset of ML in which
artificial neural networks (ANNs), loosely
modeled on the human brain, are used to
perform more complex reasoning tasks
without human intervention.
Machine Learning Algorithms
Supervised (inductive) learning
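– Given: training data with desired outputs (labels)
– Two main tasks: regression (real-valued output) and classification (categorical output)
• The input x can be multi-dimensional
– Each dimension corresponds to an attribute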
Evaluation metrics for classification problems
• Analysis:
• Precision
• Recall
• Accuracy
• F1-score
• etc.
True Positive (TP)
The predicted value matches the actual value.
The actual value was positive and the model
predicted a positive value.
True Negative (TN)
The predicted value matches the actual value.
The actual value was negative and the model
predicted a negative value.
False Positive (FP) – Type I error
The predicted value does not match the actual value. The
actual value was negative but the model
predicted a positive value.
False Negative (FN) – Type II error
The predicted value does not match the actual value. The
actual value was positive but the model
predicted a negative value.
Evaluation metrics
Accuracy= (TP+TN)/(TP+TN+FP+FN)
Precision P = TP/(TP+FP)
Recall/Sensitivity R= TP/(TP+FN)
Specificity S = TN/(TN+FP)
F-measure F=2*(P*R)/(P+R)
[harmonic mean of precision and recall]
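The formulas above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function name `classification_metrics`, the convention of returning 0.0 when a denominator is zero, and the example counts are all assumptions made for this sketch.

```python
def classification_metrics(tp, tn, fp, fn):
    """Compute the metrics from confusion-matrix counts.

    Returning 0.0 for an empty denominator is a convention assumed here.
    """
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0          # sensitivity
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    # F-measure: harmonic mean of precision and recall
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "specificity": specificity,
        "f1": f1,
    }


# Illustrative counts only (not taken from the lecture)
metrics = classification_metrics(tp=40, tn=45, fp=5, fn=10)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

With these counts, accuracy is 85/100 and recall is 40/50; precision (40/45) is higher than recall because the model makes fewer false positives than false negatives.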
Evaluation metrics
Accuracy = (TP+TN)/(TP+TN+FP+FN) = 77.51 %
Numericals
1. Suppose a regression model follows y = mx + c. Given the dataset (X, y) with
X = [1,2,3,4,5] and y = [2,4,6,8,10], find the parameters m (slope) and c
(intercept) for this model.
2. If a binary classification model has 90 true positives, 30 false positives,
20 false negatives, and 160 true negatives, calculate the accuracy,
precision, recall, and F1-score.
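One way to check answers to these two exercises is sketched below. The exercise does not name a fitting method, so ordinary least squares is assumed here; the helper name `fit_line` is hypothetical. Because the given points lie exactly on a line, any reasonable fitting method should give the same m and c.

```python
def fit_line(xs, ys):
    """Closed-form least-squares fit of y = m*x + c (assumed method)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    m = sxy / sxx
    c = mean_y - m * mean_x
    return m, c


# Exercise 1: slope and intercept
m, c = fit_line([1, 2, 3, 4, 5], [2, 4, 6, 8, 10])
print(m, c)  # 2.0 0.0

# Exercise 2: metrics from the given confusion-matrix counts
tp, fp, fn, tn = 90, 30, 20, 160
accuracy = (tp + tn) / (tp + tn + fp + fn)            # 250/300
precision = tp / (tp + fp)                            # 90/120
recall = tp / (tp + fn)                               # 90/110
f1 = 2 * precision * recall / (precision + recall)    # 180/230
print(round(accuracy, 4), round(precision, 4), round(recall, 4), round(f1, 4))
```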
Machine Learning Algorithms
Unsupervised learning
– Given: training data (without desired outputs)
– Two main tasks: clustering and dimensionality reduction
• Clustering: given x1, x2, ..., xn (without labels), output the hidden structure behind the x’s
• Dimensionality reduction: given x1, x2, ..., xn (without labels), reduce the number of variables while retaining the essential information
Numericals
Given the following points: (1, 1), (1, 4), (4, 1), and (4, 4), if we want to
cluster them into 2 clusters, what are the initial centroids and the final
centroids after one iteration of K-means clustering?
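A single K-means iteration (assign each point to its nearest centroid, then recompute each centroid as the mean of its cluster) can be sketched as follows. The exercise does not fix the initial centroids, so this sketch assumes the first two points, (1, 1) and (1, 4); the helper name `kmeans_step` is hypothetical. A different initial choice can give a different result.

```python
import math


def kmeans_step(points, centroids):
    """One K-means iteration: assignment step, then update step."""
    clusters = [[] for _ in centroids]
    for p in points:
        dists = [math.dist(p, c) for c in centroids]
        # Ties go to the lower-indexed centroid (a choice made for this sketch)
        clusters[dists.index(min(dists))].append(p)

    new_centroids = []
    for cluster, old in zip(clusters, centroids):
        if cluster:
            new_centroids.append(
                tuple(sum(coord) / len(cluster) for coord in zip(*cluster))
            )
        else:
            new_centroids.append(old)  # keep an empty cluster's centroid unchanged
    return new_centroids, clusters


points = [(1, 1), (1, 4), (4, 1), (4, 4)]
initial_centroids = [(1, 1), (1, 4)]  # assumed initialization

new_centroids, clusters = kmeans_step(points, initial_centroids)
print(clusters)       # [[(1, 1), (4, 1)], [(1, 4), (4, 4)]]
print(new_centroids)  # [(2.5, 1.0), (2.5, 4.0)]
```

Under this initialization the points split into a bottom row and a top row, and a second iteration would leave the centroids unchanged.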
Machine Learning Algorithms
Reinforcement Learning
The agent learns through trial and error which actions yield
higher rewards.
Major components: the agent (learner/decision-maker), the environment, and
the actions.
Given a sequence of states and actions with (delayed) rewards, output a policy.
A policy is a mapping from states to actions that tells you what to do in a given state.
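To make the idea of learning a policy from delayed rewards concrete, here is a minimal sketch using tabular Q-learning. Q-learning is one possible algorithm and is not named in the slides; the 5-state chain environment, the function names, and the parameter values are all hypothetical choices for illustration.

```python
import random

N_STATES = 5                         # states 0..4; state 4 is terminal (toy environment)
ACTIONS = {"left": -1, "right": +1}


def step(state, action):
    """Environment: move along the chain; reward 1 only on reaching the last state."""
    next_state = min(max(state + ACTIONS[action], 0), N_STATES - 1)
    reward = 1.0 if next_state == N_STATES - 1 else 0.0
    return next_state, reward


def q_learning(episodes=500, alpha=0.5, gamma=0.9, seed=0):
    """Tabular Q-learning with purely random exploration (trial and error)."""
    rng = random.Random(seed)
    q = {s: {a: 0.0 for a in ACTIONS} for s in range(N_STATES)}
    for _ in range(episodes):
        state = 0
        while state != N_STATES - 1:
            action = rng.choice(list(ACTIONS))
            next_state, reward = step(state, action)
            # Move the estimate toward reward + discounted best future value
            target = reward + gamma * max(q[next_state].values())
            q[state][action] += alpha * (target - q[state][action])
            state = next_state
    return q


q = q_learning()
# The policy maps each non-terminal state to its highest-valued action
policy = {s: max(q[s], key=q[s].get) for s in range(N_STATES - 1)}
print(policy)  # {0: 'right', 1: 'right', 2: 'right', 3: 'right'}
```

The reward arrives only at the end of the chain, yet the learned policy chooses "right" in every state because the discounted reward propagates backward through the value estimates.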
Machine Learning Steps
• Data Collection
• Data Cleaning and Preprocessing
• Exploratory Data Analysis (EDA)
• Feature Engineering
• Model Selection
• Model Training
• Model Evaluation
• Hyperparameter Tuning
• Model Deployment
• Monitoring and Maintenance
• Documentation and Reporting
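A few of the steps above (collection, cleaning, model selection, training, evaluation) can be sketched end to end on a toy problem. Everything here is an assumption made for illustration: the synthetic dataset, the every-4th-row hold-out split, and the one-feature threshold classifier. Steps such as EDA, hyperparameter tuning, deployment, and monitoring are omitted.

```python
# 1. Data collection: a tiny synthetic dataset of (feature, label) pairs;
#    None marks a missing feature value.
raw = [(i / 10, int(i >= 10)) for i in range(20)] + [(None, 1)]

# 2. Data cleaning: drop rows with a missing feature.
data = [(x, y) for x, y in raw if x is not None]

# 3. Hold out every 4th row for evaluation; train on the rest.
test = data[::4]
train = [row for i, row in enumerate(data) if i % 4 != 0]


# 4. Model selection: a one-feature threshold classifier.
def predict(threshold, x):
    return int(x >= threshold)


def accuracy(threshold, rows):
    return sum(predict(threshold, x) == y for x, y in rows) / len(rows)


# 5. Model training: choose the threshold with the best training accuracy.
candidates = [x for x, _ in train]
best_threshold = max(candidates, key=lambda t: accuracy(t, train))

# 6. Model evaluation: measure accuracy on the held-out rows.
test_accuracy = accuracy(best_threshold, test)
print(best_threshold, test_accuracy)  # 1.0 1.0
```

The perfect score reflects the deliberately separable toy data; real datasets rarely behave this way, which is why evaluation on held-out data matters.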
Applications
Supervised ML, Unsupervised ML, Reinforcement Learning
Species Identification: ML models are used to identify species from images, sounds,
and other data, aiding in biodiversity studies.
Ref.: Auslander, N., Gussow, A. B., & Koonin, E. V. (2021). Incorporating machine learning into established bioinformatics frameworks. International Journal of Molecular Sciences, 22(6), 2903.
Machine Learning in Biological Systems
• Machine learning is increasingly being utilized to
understand and model biological systems due to its ability
to handle and analyze large, complex datasets.
• Genomics:
• Identifying patterns in DNA sequences,
• Predicting gene expression,
• Understanding genetic variations associated with diseases.
• Proteomics:
• Analyzing protein structures and functions,
• Predicting protein interactions,
• Identifying biomarkers for diseases.
• Drug Discovery:
• Predicting the efficacy and toxicity of new compounds,
• Identifying potential drug targets,
• Optimizing drug design.
• Medical Imaging:
• Enhancing the analysis of medical images (e.g., MRI, CT scans) for disease
diagnosis and treatment planning.
• Systems Biology:
• Modeling complex biological networks to understand cellular processes, metabolic
pathways, and disease mechanisms.
• Personalized Medicine:
• Tailoring medical treatments to individual patients based on their genetic and
phenotypic information.
• Epidemiology:
• Predicting the spread of infectious diseases,
• Understanding risk factors,
• Planning public health interventions.
• Neuroscience:
• Analyzing neural activity data to understand brain function, cognitive processes,
and neurological disorders.
• Machine learning models can uncover hidden patterns and relationships within
biological data, leading to new insights and advancements in biological research
and healthcare.
Challenges and Future Directions
• Data Integration: Combining data from different sources and types (e.g., genomic, proteomic, clinical) remains a challenge.
• Interpretability: Understanding how ML models make decisions is crucial for their acceptance in critical fields like medicine.
• Scalability: Handling large-scale biological data efficiently requires scalable ML methods.
• Ethical Considerations: Issues such as data privacy and the ethical implications of ML-driven decisions need careful consideration.
• Overfitting
• Underfitting