AIML-HC Mod 03
EVALUATING LEARNING
FOR INTELLIGENCE
INTRODUCTION
• The most laborious tasks within a machine learning project are identifying the appropriate model and
engineering features, which make a substantial difference to the output of the model.
• In fact, the features chosen can often have more impact on the quality of a model compared to the
model choice itself.
• Therefore, it is important to evaluate the learning algorithm to determine how well the resulting model can predict the output of an unseen sample.
• This is usually done using various metrics, which are discussed below.
MODEL DEVELOPMENT AND WORKFLOW
• To successfully deploy a machine learning model, there are several stages of development and
evaluation that take place, as illustrated in the figure below.
• The first stage is the prototype phase.
• During this phase, a prototype is created through testing various models on historical data to determine
the best model.
• Hyperparameter tuning, as discussed later in this chapter, is a requirement of model training.
• Once the best prototype model is chosen, the model is tested and validated.
• Validating a model requires splitting datasets into training, testing, and validation sets as discussed in
Chapter 3.
• Note that there is no such thing as a random dataset; instead, the randomness applies to the splitting of the dataset, as the sketch below illustrates.
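• A minimal sketch of such a split, assuming scikit-learn (the Python library referenced later for the evaluation metrics); the breast cancer dataset and the 70/15/15 proportions are illustrative choices, not a prescription:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)

# The dataset itself is fixed; the randomness applies to how it is split.
X_train, X_temp, y_train, y_temp = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y)
# Divide the held-out 30% equally into validation and test sets.
X_val, X_test, y_val, y_test = train_test_split(
    X_temp, y_temp, test_size=0.5, random_state=42, stratify=y_temp)
```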
• Be aware of biases that may appear in the data.
• Once the model has been successfully validated, it is deployed to production.
• The model is then usually evaluated by one (or several) performance metrics.
• There are two ways of evaluating a machine learning model: offline evaluation and online (or live)
evaluation.
WHY ARE THERE TWO APPROACHES TO
EVALUATING A MODEL?
• A deployed machine learning model consumes data from two sources: historical data (or the data that
is used as the experience to be learned from) and live data.
• Many machine learning models assume stationary distribution data—that the data distribution is
constant over time.
• However, this is atypical of real life, as distributions of data often change over time—known as a
distribution shift.
• For instance, consider a system that predicts the side effects of medications for patients based on their health profile. Medication side effects may change based on the population.
• Relevant factors include ethnicity, disease profile, territory, medication popularity, and the introduction of new medications.
• The distribution of relevant side effects based on patient data can vary quickly over time, and hence it is essential to detect a shift in distribution and evolve the model accordingly.
• The method in which this is typically assessed is through the performance of the model based on live
data, evaluated through the validation metric used in the testing and validation of the model on
historical data.
• Model performance on live data that is similar to, or within a permissible threshold of, the offline performance indicates a model that continues to fit the data.
• Degradation of model performance indicates that the model does not fit the data and requires
retraining.
• Offline evaluation measures the model based on metrics learned and evaluated from the historical dataset, whose distribution is assumed to be stationary.
• Metrics such as accuracy and precision-recall are typically used within the offline training stage.
• Offline evaluation techniques include the hold-back method and n-fold cross-validation.
• Online evaluation refers to the evaluation of metrics once the model is deployed.
• The key takeaway is that these metrics may differ from the metrics used to evaluate performance during offline training and validation.
• For instance, a model that is learning on new pharmacological treatments may seek to be as precise as
possible in training and validation; but when placed online, it may need to consider business goals such
as budget or treatment value when deployed.
• In the digital age, deployments can support multivariate testing to understand which models perform best.
• Feedback loops are key to ensuring systems are performing as intended and help to understand the model in
the context of use better.
• This can be performed by a human agent or automated through a contextually intelligent agent or users of the
model.
• It is important that the evaluation of a machine learning model is based on a statistically independent dataset
and not on the dataset it is trained on.
• This is because evaluation on the training dataset gives an optimistic estimate of the model’s true performance, as the model has already adapted to that dataset.
• By evaluating the model with previously unseen data, there is a better estimate of the generalization error.
• New data can be hard to find; hence it is important to be able to derive new, unseen data from the current dataset.
• Methods such as n-fold cross-validation discussed in Chapter 3 are useful techniques for this purpose. Often the
data used is more important than the algorithm choice; and the better the features used, the greater the
performance of the model.
• The evaluation metrics discussed can be found in the metrics package for R and scikit-learn for Python.
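• As a hedged sketch of n-fold cross-validation with scikit-learn, using an illustrative logistic regression on the bundled breast cancer dataset (the model choice and fold count are assumptions for demonstration):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

# Each of the 5 folds is evaluated on data the model was not trained on.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scores = cross_val_score(model, X, y, cv=5, scoring="accuracy")
print(scores.mean(), scores.std())
```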
EVALUATION METRICS
• Accuracy is a general metric that does not consider the division between classes.
• Therefore, it does not consider misclassification or the associated penalty with misclassification. For
instance, a medical misdiagnosis that is a false positive (e.g., a patient is diagnosed with breast
cancer when they do not have it) has substantially different consequences compared to a false
negative, whereby a patient is told that they do not have breast cancer when in fact they do. A
confusion matrix breaks down the correct and incorrect classifications made by the model and
attributes them to the appropriate label.
• True positive: Where the actual class is yes, and the value of the predicted class is also yes.
• False positive: Actual class is no, and predicted class is yes
• True negative: The value of the actual class is no, and the value of the predicted class is no
• False negative: When the actual class value is yes, but predicted class is no
• Take an example whereby a model predicts whether a patient has breast cancer or not based on 50
example inputs from the test dataset with an equal distribution between positive and negative labeled
examples.
• The confusion matrix would be as in Table 5-1.
• From the confusion matrix, it is determined that the positive class has greater accuracy than the
negative class.
• The accuracy of the positive classification is 20/25 = 80%.
• The negative class has an accuracy of 10/25 = 40%. Both metrics differ from the overall accuracy of the
model, which would be determined as (20 + 10)/50 = 60%.
• It is apparent how a confusion matrix adds more detail to the overall accuracy of a machine learning
model.
• As a result, accuracy can be rewritten as the following:
• Accuracy = (correctly predicted observations)/(total observations) = (TP + TN)/(TP + TN + FP + FN)
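• A small sketch reproducing this worked example with scikit-learn; the label arrays are constructed to match the counts above (20 true positives, 5 false negatives, 10 true negatives, 15 false positives):

```python
import numpy as np
from sklearn.metrics import confusion_matrix, accuracy_score

# 25 positive and 25 negative cases, arranged to match the worked example.
y_true = np.array([1] * 25 + [0] * 25)
y_pred = np.array([1] * 20 + [0] * 5 + [0] * 10 + [1] * 15)

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(tp, fp, tn, fn)                  # 20 15 10 5
print(accuracy_score(y_true, y_pred))  # 0.6 = (TP + TN) / total
```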
PER-CLASS ACCURACY
• Per-class accuracy is an extension of accuracy that takes into account the accuracy of each class. As a
result, the preceding example has a per-class accuracy of (80% + 40%)/2 = 60%.
• Per-class accuracy is useful in imbalanced problems where there are a larger number of examples within one particular class compared to another.
• The class with greater examples dominates the calculation, and therefore accuracy alone may not
suffice for the nature of your model; thus it is useful to evaluate per-class accuracy also.
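• As a sketch, scikit-learn's balanced_accuracy_score computes exactly this average of per-class accuracy, shown here on the same illustrative arrays as above:

```python
import numpy as np
from sklearn.metrics import balanced_accuracy_score

y_true = np.array([1] * 25 + [0] * 25)
y_pred = np.array([1] * 20 + [0] * 5 + [0] * 10 + [1] * 15)

# Average of per-class accuracy: (0.8 + 0.4) / 2 = 0.6
print(balanced_accuracy_score(y_true, y_pred))
```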
LOGARITHMIC LOSS
• Logarithmic loss (or log-loss for short) is used for problems where a continuous probability is predicted
rather than a class label.
• Log-loss provides a probabilistic measure of the confidence of the accuracy and considers the entropy
between the distribution of true labels and predictions.
• For a binary classification problem, the logarithmic loss is calculated as:
• Log-loss = −(1/N) Σᵢ [yᵢ log(pᵢ) + (1 − yᵢ) log(1 − pᵢ)]
• where pᵢ is the probability of the ith data point belonging to the positive class and yᵢ is the true label (either 0 or 1).
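• A minimal sketch of computing log-loss with scikit-learn; the labels and predicted probabilities are invented purely for illustration:

```python
import numpy as np
from sklearn.metrics import log_loss

y_true = np.array([1, 1, 0, 0, 1])
# Predicted probabilities of the positive class, not hard labels.
y_prob = np.array([0.9, 0.7, 0.2, 0.4, 0.6])

# Cross-entropy between the true labels and the predicted probabilities.
print(log_loss(y_true, y_prob))
```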
AREA UNDER THE CURVE (AUC)
• The AUC measures the area under a curve that plots the rate of true positives against the rate of false positives.
• The AUC enables the visualization of the sensitivity and specificity of the classifier.
• It highlights how many correct positive classifications can be gained as more false positives are tolerated.
• The curve is known as the receiver operating characteristic curve, or ROC as shown in Figure 5-2.
• A high AUC or greater space underneath the curve is good, and a smaller area under the curve (or less
space under the curve) is undesirable.
• In Figure 5-2, test A has better AUC as compared to test B, as the AUC for test A is larger than for test B.
• The ROC visualizes the trade-off between specificity and sensitivity of the model.
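• A sketch of computing the ROC curve and its AUC with scikit-learn; the logistic regression and breast cancer dataset are illustrative stand-ins for any classifier that outputs probabilities:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y)

model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model.fit(X_train, y_train)
y_prob = model.predict_proba(X_test)[:, 1]        # probability of the positive class

fpr, tpr, thresholds = roc_curve(y_test, y_prob)  # true positive rate vs. false positive rate
print(roc_auc_score(y_test, y_prob))              # area under that ROC curve
```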
PRECISION, RECALL, SPECIFICITY, AND F-MEASURE
• Precision and recall are two metrics used together to evaluate model performance.
• Precision evaluates how many of the items predicted to be relevant (positive) are truly relevant.
• Recall evaluates how many of the truly relevant items the model correctly predicts as relevant.
• Precision: (correctly predicted Positive)/(total predicted Positive) = TP/(TP + FP)
• Recall: (correctly predicted Positive)/(total actual Positive) = TP/(TP + FN)
• Specificity refers to how well the model performs at identifying negative cases and is calculated as in Figure 5-3.
• Specificity: (correctly predicted Negative)/(total Negative observation) = TN/(TN + FP)
• F-measure goes beyond the arithmetic mean and calculates the harmonic mean of precision and recall:
• F1 = 2 × (Precision × Recall)/(Precision + Recall)
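• A sketch of these four metrics in scikit-learn, reusing the illustrative arrays from the confusion matrix example; specificity has no dedicated function, so it is derived from the confusion matrix:

```python
import numpy as np
from sklearn.metrics import confusion_matrix, precision_score, recall_score, f1_score

y_true = np.array([1] * 25 + [0] * 25)
y_pred = np.array([1] * 20 + [0] * 5 + [0] * 10 + [1] * 15)

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(precision_score(y_true, y_pred))  # TP / (TP + FP) = 20 / 35
print(recall_score(y_true, y_pred))     # TP / (TP + FN) = 20 / 25
print(tn / (tn + fp))                   # specificity = TN / (TN + FP) = 10 / 25
print(f1_score(y_true, y_pred))         # harmonic mean of precision and recall
```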
RMSE
• RMSE calculates the square root of the average of the squared differences between predicted and actual values: RMSE = √((1/n) Σᵢ (yᵢ − ŷᵢ)²).
• This can also be understood as being proportional to the Euclidean distance between the vector of true values and the vector of predicted values.
• A criticism of RMSE is that it is sensitive to outliers.
• Percentiles (or quantiles) of error are more robust as a result of being less sensitive to outliers.
• Real-world data is likely to contain outliers, and thus it is often useful to look at the median absolute percentage error rather than the mean absolute percentage error (MAPE).
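• A minimal sketch contrasting RMSE with the median absolute percentage error; the values, including the single outlier, are invented purely for illustration:

```python
import numpy as np
from sklearn.metrics import mean_squared_error

y_true = np.array([100.0, 120.0, 80.0, 95.0, 300.0])  # includes one outlier
y_pred = np.array([102.0, 118.0, 85.0, 90.0, 150.0])

rmse = np.sqrt(mean_squared_error(y_true, y_pred))               # pulled upward by the outlier
median_ape = np.median(np.abs((y_true - y_pred) / y_true)) * 100  # more robust to the outlier

print(rmse, median_ape)
```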
MODEL PARAMETERS AND HYPERPARAMETERS
• Hyperparameters and parameters are often used interchangeably, yet there is a difference between the two. Machine learning models can be understood as mathematical models that represent the relationship between aspects of data.
• Model parameters are properties of the training dataset that are learned and adjusted during training
by the machine learning model.
• Model parameters differ for each model, dataset properties, and the task at hand.
• For instance, in the case of an NLP predictor that outputs the sophistication of a corpus of text,
parameters such as word frequency, sentence length, and noun or verb distribution per sentence would
be considered model parameters.
• Model hyperparameters are parameters to the model building process that are not learned during
training.
• Hyperparameters can make a substantial difference to the performance of a machine learning model.
• Hyperparameters define the model architecture and affect the capacity of the model, influencing model
flexibility.
• Hyperparameters can also be provided to loss optimization algorithms during the training process.
• Optimal setting of hyperparameters can have a significant effect on predictions and help prevent a
model from overfitting.
• Optimal hyperparameters often differ between datasets and models.
• In the case of a neural network, for example, hyperparameters would include the number and size of
hidden layers, weighting, learning rate, and so forth.
• Decision tree hyperparameters would include the desired depth and number of leaves in the tree.
• Hyperparameters for a support vector machine would include a misclassification penalty term.
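• A sketch of how such hyperparameters are fixed before training in scikit-learn; the specific values shown are arbitrary illustrations, not recommendations:

```python
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Hyperparameters are chosen before fitting; the model's parameters are learned from data.
tree = DecisionTreeClassifier(max_depth=4, max_leaf_nodes=16)   # tree depth and leaf count
svm = SVC(C=1.0, kernel="rbf")                                  # C penalizes misclassification
net = MLPClassifier(hidden_layer_sizes=(64, 32),                # number and size of hidden layers
                    learning_rate_init=0.001)                   # learning rate
```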
TUNING HYPERPARAMETERS
• Hyperparameter tuning or optimization is the task of selecting a set of optimal hyperparameters for a
machine learning model.
• Optimized hyperparameter values maximize a model’s predictive accuracy.
• Hyperparameters are optimized by repeatedly training a model, assessing the aggregate accuracy, and adjusting the hyperparameters appropriately.
• Through trialing a variety of hyperparameter values, the best hyperparameters for the problem are
determined, which improves overall model accuracy.
HYPERPARAMETER TUNING ALGORITHMS
• The grid search is a simple, effective, yet resource-intensive hyperparameter optimization technique that evaluates a grid of hyperparameter values.
• The method evaluates each point on the grid and determines the best-performing value.
• For example, if the hyperparameter were the number of leaves in a decision tree, which could be
anywhere from n = 2 to 100, grid search would evaluate each value of n (i.e., points on the grid) to
determine the most effective hyperparameter.
• It is often a case of guessing where to start with hyperparameters, including minimum and maximum
values. The approach is typical of trial and error, whereby if the optimal value lies toward either
maximum or minimum, the grid would be expanded in the appropriate direction in an attempt to
further optimize the model’s hyperparameters.
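• A sketch of this leaf-count example using scikit-learn's GridSearchCV; the decision tree, dataset, and 5-fold cross-validation are illustrative assumptions:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# Evaluate every point on the grid (n = 2 to 100 leaves) with cross-validation.
param_grid = {"max_leaf_nodes": list(range(2, 101))}
search = GridSearchCV(DecisionTreeClassifier(random_state=0), param_grid, cv=5)
search.fit(X, y)

print(search.best_params_, search.best_score_)
```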
RANDOM SEARCH
• Random search is a variant of grid search that evaluates a random sample of grid points.
• Computationally, this is far less expensive than a standard grid search.
• Although at first glance it would appear that this is not as useful in finding optimal hyperparameters,
Bergstra et al. demonstrated that in a surprising number of instances, a random search performed
roughly as well as grid search.[65]
• The simplicity and better-than-expected performance of a random search means that it is often chosen
over grid search.
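• A comparable sketch with scikit-learn's RandomizedSearchCV, sampling 20 random points from the same illustrative grid rather than evaluating all of them:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import RandomizedSearchCV
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# Evaluate only a random sample of the grid points.
param_distributions = {"max_leaf_nodes": list(range(2, 101))}
search = RandomizedSearchCV(DecisionTreeClassifier(random_state=0),
                            param_distributions, n_iter=20, cv=5, random_state=0)
search.fit(X, y)

print(search.best_params_, search.best_score_)
```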
• Both grid search and random search are parallelizable.
• More intelligent hyperparameter tuning algorithms are available that are computationally expensive as
the result of evaluating which samples to try next.
• These algorithms often have hyperparameters of their own.
• Bayesian optimization, random forest smart tuning, and derivative-free optimization are three
examples of such algorithms.
MULTIVARIATE TESTING
• Multivariate testing is an extremely useful method of determining which model is best for the particular
problem at hand.
• Multivariate testing is a form of statistical hypothesis testing and determines the difference between a null hypothesis and an alternative hypothesis.
• The null hypothesis is defined as the new model not affecting the average value of the performance
metric; whereas the alternate hypothesis is that the new model does change the average value of the
performance metric.
• Multivariate testing compares similar models to understand which is performing best or compares a
new model against an older, legacy model.
• The respective performance metrics are compared, and a decision is made on which model to proceed with.
• The process of testing is as follows:
1. Split the population into randomized control and experimentation groups.
2. Record the behavior of the populations on the proposed hypotheses.
3. Compute the performance metrics and associated p-values.
4. Decide on which model to proceed with.
• Although the process seems relatively simple, there are a few key aspects for consideration.
WHICH METRIC SHOULD I USE FOR EVALUATION?
• Choosing the appropriate metric to evaluate your model depends on the use case.
• Consider the impact of false positives, false negatives, and the consequences of such predictions. Furthermore, if a model is attempting to predict an event that only happens 0.001% of the time, a model that never predicts the event can still report an accuracy of 99.999%, which says little about its usefulness. Build the model to cater to the appropriate metrics.
• One approach is to repeat the experiment, thus performing repeat evaluations.
• Although not fail-safe, this reduces the chance of illusory results.
• If there is a genuine difference between the null and alternative hypotheses, repeated evaluation will confirm it.
CORRELATION DOES NOT EQUAL CAUSATION
• The phrase correlation does not equal causation is used to stress that a correlation between two
variables does not suggest that one causes the other.
• Correlation refers to the size and direction of a relationship between two or more variables.
• Causation, also known as cause and effect, emphasizes that the occurrence of one event is related to
the presence of another event.
• It may be tempting to assume that one variable causes the other; however, in models with several
features, there may be hidden factors that cause both variables to move in tandem.
• For instance, smoking tobacco is a cause that increases the risk of developing a variety of cancers.
• Smoking may also be correlated with alcoholism, but it does not cause alcoholism.
WHAT AMOUNT OF CHANGE COUNTS AS
REAL CHANGE?
• Many multivariate tests use the t-test to analyze the statistical difference between means.
• The t value evaluates the size of the difference relative to the variation in your sample data.
• However, the t-test makes assumptions that are not necessarily satisfied by all metrics. For instance,
the t-test assumes both sets have a normal, or Gaussian, distribution.
• If the distribution does not appear to be Gaussian, select a nonparametric test that does not make
assumptions about a Gaussian distribution, such as the Wilcoxon–Mann–Whitney test.
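• A hedged sketch using SciPy's hypothesis tests on synthetic per-day accuracy figures for a control (legacy) and an experimental (new) model; the means, spreads, and sample sizes are invented purely for illustration, and Welch's t-test (mentioned later under data variance) is included for comparison:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Hypothetical daily accuracy of the control and experimental models.
control = rng.normal(loc=0.82, scale=0.03, size=30)
experiment = rng.normal(loc=0.85, scale=0.03, size=30)

t_stat, p_value = stats.ttest_ind(control, experiment)                    # assumes Gaussian, equal variance
u_stat, p_nonparam = stats.mannwhitneyu(control, experiment)              # no Gaussian assumption
w_stat, p_welch = stats.ttest_ind(control, experiment, equal_var=False)   # Welch's t-test

print(p_value, p_nonparam, p_welch)
```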
DETERMINING THE APPROPRIATE P VALUE
• Statistically speaking, the p value is a calculation used in hypothesis testing that represents the strength of the
evidence.
• The p value measures the statistical significance, or probability, that a difference would arise by chance given
there was no real difference between two populations.
• It provides the evidence against the null hypothesis and is a useful metric for stakeholders to draw conclusions
from.
• A p value lies between 0 and 1, and is interpreted as follows:[66]
• a p value of ≤ 0.05 indicates strong evidence against the null hypothesis, thus rejecting the null hypothesis
• a p value of > 0.05 indicates weak evidence against the null hypothesis, hence maintaining the null
hypothesis
• a p value near 0.05 is considered marginal and could swing either way
• The smaller the p value, the smaller the probability that the results are down to chance.
HOW MANY OBSERVATIONS ARE REQUIRED?
• The quantity of observations required is determined by the statistical power demanded by the project.
Ideally, this should be determined at the beginning of the project.
HOW LONG TO RUN A MULTIVARIATE TEST?
• The duration of time required for your multivariate testing is ideally the amount of time required to
capture enough observations to meet the defined statistical power.
• It is often useful to run tests over time to capture a representative, variable sample.
• When determining the duration of your testing phase, consider the novelty effect, which describes how
user reactions in the short term are not representative of the long-term reactions.
• For instance, whenever Facebook updates their news feed layout or design, there is an uproar. However, this soon subsides once the novelty effect has worn off.
• Therefore, it is useful to run your experiment for long enough to overcome this bias.
• Running multivariate tests for long periods of time is typically not a problem in model optimization.
DATA VARIANCE
• The control and experimentation sets could be biased as the result of not being split at random.
• This may result in biases in the sample data.
• If this is the case, other tests can be used, such as Welch’s t-test, which does not assume equal variance.
SPOTTING DISTRIBUTION DRIFT
• It is key to measure ongoing performance of your machine learning model once deployed.
• Data drift and ongoing system development require the model to be re-confirmed against the baseline.
• Typically, this involves monitoring the offline performance, or validation metric, against data from the
live, deployed model.
• If there is a sizeable change in the validation metric, this highlights the need to revise the model
through training on new data.
• This can be done manually or automated to ensure consistent reporting and confidence in the model.
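• A minimal sketch of such monitoring; the baseline AUC, the tolerance threshold, and the live values are all hypothetical choices for illustration:

```python
# Compare the live validation metric against the offline baseline recorded at deployment.
BASELINE_AUC = 0.95        # validation metric from offline testing (hypothetical)
DRIFT_THRESHOLD = 0.05     # permissible degradation before retraining is triggered (hypothetical)

def check_for_drift(live_auc: float) -> bool:
    """Return True if the live metric has degraded beyond the permitted threshold."""
    return (BASELINE_AUC - live_auc) > DRIFT_THRESHOLD

for live_auc in [0.94, 0.91, 0.88]:
    if check_for_drift(live_auc):
        print(f"AUC {live_auc:.2f}: drift detected, flag model for retraining")
    else:
        print(f"AUC {live_auc:.2f}: within tolerance")
```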
KEEP A NOTE OF MODEL CHANGES
• Keep a log of all changes to your machine learning model with notes on changes.
• Not only does this serve as a change log for stakeholders, it provides a physical record of how the
system has changed over time.
• The use of versioning software within a development environment (test/staging to live deployment) will enable software changes to be noted automatically.
• Versioning software provides a form of technical governance and can be used to deploy software with
extensive rollback and backup facilities.
ETHICS OF AIML IN HEALTHCARE: PRINCIPLES AND
PRACTICES
• 1. Core Ethical Principles in Healthcare AIML
• 1.1 Patient-Centric Care
• Primacy of patient welfare
• Protection of patient autonomy
• Informed consent in AI-assisted decisions
• Balance between automation and human touch
• 1.2 Medical Ethics Integration
• Hippocratic Oath principles in AI systems
• Non-maleficence ("First, do no harm")
• Beneficence (promoting patient well-being)
• Justice in healthcare delivery
• Respect for patient autonomy
• 2. Specific Ethical Challenges
• 2.1 Data Privacy and Security
• Protected Health Information (PHI) handling
• HIPAA compliance in AI systems
• Cross-border data sharing
• Data retention and deletion policies
• Security measures against breaches
• 2.2 Algorithmic Bias and Fairness
• Representative training data
• Demographic bias identification
• Health disparities mitigation
• Equal access to AI-enhanced care
• Cultural competency in AI systems
• 2.3 Transparency and Explainability
• Understanding AI diagnostic recommendations
• Clear communication of AI limitations
• Right to explanation for patients
• Documentation of AI decision processes
• Auditability of AI systems
• 3. Clinical Implementation Considerations
• 3.1 Clinical Validation
• Rigorous testing protocols
• Real-world performance monitoring
• Comparison with standard care
• Population-specific validation
• Continuous performance evaluation
• 3.2 Integration with Clinical Workflow
• Healthcare provider training
• Human oversight mechanisms
• Emergency override procedures
• Integration with existing systems
• Documentation requirements
• 3.3 Quality Assurance
• Regular system audits
• Performance metrics tracking
• Error reporting mechanisms
• Update and maintenance protocols
• Safety monitoring systems
• 4. Stakeholder Responsibilities
• 4.1 Healthcare Providers
• Understanding AI capabilities and limitations
• Maintaining clinical judgment
• Proper communication with patients
• Documentation of AI use
• Continuing education on AI systems
• 4.2 Healthcare Organizations
• Ethical guidelines development
• Staff training programs
• Risk management protocols
• Quality assurance systems
• Patient education initiatives
• 4.3 AI Developers
• Clinical collaboration
• Ethical design principles
• Transparent development
• Regular updates and maintenance
• Response to feedback
• 5. Regulatory and Legal Considerations
• 5.1 Compliance Requirements
• FDA regulations
• HIPAA compliance
• International standards
• State-specific requirements
• Industry best practices
• 5.2 Liability and Responsibility
• Error attribution
• Malpractice considerations
• Documentation requirements
• Insurance implications
• Risk management
• 6. Specific Use Case Ethics
• 6.1 Diagnostic Systems
• Accuracy requirements
• False positive/negative management
• Integration with clinical judgment
• Patient communication
• Result verification protocols
• 6.2 Treatment Planning
• Personalization vs. standardization
• Cost-effectiveness considerations
• Patient preference integration
• Alternative options presentation
• Outcome monitoring
• 6.3 Predictive Analytics
• Risk communication
• Preventive interventions
• Patient autonomy
• Resource allocation
• Follow-up protocols
• 7. Future Considerations
• 7.1 Emerging Technologies
• Integration of new AI capabilities
• Adaptation of ethical frameworks
• Evolution of standards
• Novel use cases
• Technological limitations
• 7.2 Policy Development
• Regulatory updates
• International harmonization
• Industry standards
• Professional guidelines
• Public policy recommendations
• 8. Best Practices Recommendations
• 8.1 Implementation Guidelines
• Phased deployment approach
• Stakeholder engagement
• Training requirements
• Monitoring systems
• Review processes
• 8.2 Ethical Safeguards
• Ethics committee oversight
• Regular audits
• Patient feedback mechanisms
• Incident reporting
• Continuous improvement
• 9. Conclusion
• The ethical implementation of AIML in healthcare requires careful balance between innovation and safety, with
constant attention to patient welfare, privacy, fairness, and transparency. Success depends on collaborative
effort between healthcare providers, organizations, developers, and regulators.