Demonstration 1: How to use classical machine learning in practice for classification task¶
Short simplified demonstration with simulated data how to use machine learning methods for personalized radiation dose prediction.
Demonstration idea taken from Huettenbrink et al. Personalized Prediction of Patient Radiation Exposure for Therapy of Urolithiasis: An Application and Comparison of Six Machine Learning Algorithms J Pers Med. 2023 Apr 7;13(4):643. doi: 10.3390/jpm13040643
In the original paper 9 important features were identified and selected for model training (prediction of low DAP and high DAP values in Urolithiasis operation:
- age (p = 0.0002)
- gender (p = 0.011)
- weight (p < 0.0001)
- stone size (p < 0.000001)
- surgeon experience (p = 0.039)
- number of stones (p = 0.0007)
- stone density (p = 0.023)
- use of flexible endoscope (p < 0.0001)
- preoperative stone position (p < 0.00001)
For the sake of simplicity, in this demonstration we simulate only three features based on graph below:
- Stone height [mm]
- Stone width [mm]
- Stone depth [mm]
(The study did not give insights to the distrinbutions for other features...)
Classification task: Given the 3 features ($x_1, x_2, x_3$) of kidney stone dimension, how to classify the radiation exposure to low DAP ($\le$ 2.8 [Gy·cm2]) or high DAP ($>$ 2.8 [Gy·cm2]) categories ($y$)?
$ \hat{y} = f(x_1,x_2, x_3) $ where $f()$ is the machine learning model (here we demonstrate support vector machine and random forest classifier models).
18.11.2024 Satu Inkinen
import numpy as np
import matplotlib.pyplot as plt
from sklearn import svm
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay
# In the paper to categorize low DAP; high DAP threshold = 2.8 [Gy·cm2]
DAP_THRESHOLD = 2.8 # [Gy·cm2]
# Mean and standard deviation values
means = {
'height_low': 6.5,
'height_high': 7.5,
'width_low': 4.9,
'width_high': 5.9,
'depth_low': 5.2,
'depth_high': 6.4
}
std_devs = {
'height_low': 0.3,
'height_high': 0.5,
'width_low': 0.2,
'width_high': 0.5,
'depth_low': 0.2,
'depth_high': 0.4
}
# Data for plotting
categories = ['height', 'width', 'depth']
positions_low = [1, 2, 3]
positions_high = [4, 5, 6]
colors = ['red', 'green', 'blue']
# Create a plot with error bars
for i, category in enumerate(categories):
plt.errorbar(positions_low[i], means[f'{category}_low'], yerr=std_devs[f'{category}_low'],
fmt=f'-o',markerfacecolor=colors[i],markeredgecolor=colors[i], ecolor=colors[i], capsize=5, capthick=2, label=category)
plt.errorbar(positions_high[i], means[f'{category}_high'], yerr=std_devs[f'{category}_high'],
fmt=f'-o',markerfacecolor=colors[i],markeredgecolor=colors[i], ecolor=colors[i], capsize=5, capthick=2)
# Set the x-axis label
plt.xlabel(f"DAP low < {DAP_THRESHOLD} [Gy·cm2] DAP high > {DAP_THRESHOLD} [Gy·cm2]")
# Hide the x-axis tick labels
ax = plt.gca()
ax.xaxis.set_ticklabels([])
# Add a legend
plt.legend(loc='lower right')
# Add horizontal background lines
for y in np.arange(4, 9, 0.5):
plt.axhline(y=y, color='gray', linestyle='--', linewidth=0.5)
# Set the y-axis label
plt.ylabel('Stone width, height, depth (mm)')
# Display the plot
plt.show()
Simulate data using gaussian distribution¶
Generate samples¶
# Constants
N = 1000 #Number of realizations for both categories
# Scale standard deviations
n = 2
scaled_std_devs = {key: value * n for key, value in std_devs.items()}
# Generate samples
def generate_samples(means, std_devs, N):
return np.array([np.random.normal(means[key], std_devs[key], N) for key in means])
samples_low = generate_samples(
{key: means[key] for key in means if 'low' in key},
{key: scaled_std_devs[key] for key in scaled_std_devs if 'low' in key},
N
)
samples_high = generate_samples(
{key: means[key] for key in means if 'high' in key},
{key: scaled_std_devs[key] for key in scaled_std_devs if 'high' in key},
N
)
# Labels:
y_low = np.zeros(N) # low DAP studies
y_high = np.ones(N) # high DAP studies
# Combine samples and labels
X = np.hstack((samples_low, samples_high)).T
Y = np.hstack((y_low, y_high))
# Add jitter to the data points
jitter = 0.3
X += np.random.normal(0, jitter, X.shape)
#Simulate data:
# Constants
N = 1000 #Number of realizations for both categories
# Scale standard deviations
n = 2
scaled_std_devs = {key: value * n for key, value in std_devs.items()}
# Generate samples
def generate_samples(means, std_devs, N):
return np.array([np.random.normal(means[key], std_devs[key], N) for key in means])
samples_low = generate_samples(
{key: means[key] for key in means if 'low' in key},
{key: scaled_std_devs[key] for key in scaled_std_devs if 'low' in key},
N
)
samples_high = generate_samples(
{key: means[key] for key in means if 'high' in key},
{key: scaled_std_devs[key] for key in scaled_std_devs if 'high' in key},
N
)
# Labels:
y_low = np.zeros(N) # low DAP
y_high = np.ones(N) # high DAP studies
# Combine samples and labels
X = np.hstack((samples_low, samples_high)).T
Y = np.hstack((y_low, y_high))
# Add jitter to the data points
jitter = 0.2
X += np.random.normal(0, jitter, X.shape)
# Plot the results
# Plot the results with labels for each color
plt.scatter(X[Y == 0, 0], X[Y == 0, 1], c='blue', edgecolors='k', alpha=0.5, label='Low DAP')
plt.scatter(X[Y == 1, 0], X[Y == 1, 1], c='red', edgecolors='k', alpha=0.5, label='High DAP')
plt.xlabel('Stone height (mm)')
plt.ylabel('Stone width (mm)')
plt.legend(loc='upper right')
plt.title('Scatter Plot of Height vs Width with Class Labels')
plt.show()
print(f"Observarion/Feature matrix X size: {X.shape}")
print(f"Label vector Y size: {Y.shape}")
print(f"Print out single observation: {X[100,:]}")
Observarion/Feature matrix X size: (2000, 3) Label vector Y size: (2000,) Print out single observation: [7.10102841 4.88075317 5.37946738]
Splitting data to train, validation and test sets and apply standard scaler¶
Data splitting
This is crucial for understanding the model’s performance and avoiding overfitting.
Standard scaler
This process ensures that each feature contributes equally to the mode
Mean Removal: It centers the data by subtracting the mean value of each feature. Scaling: It scales the data to have a standard deviation of 1.
Code:
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.3, random_state=42)
# Standardize the features:
scaler = StandardScaler().fit(X_train)
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.3, random_state=42)
# Standardize the features:
scaler = StandardScaler().fit(X_train)
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)
# Plot the results with labels for each color
plt.scatter(X_train[y_train == 0, 0], X_train[y_train == 0, 1], c='blue', edgecolors='k', alpha=0.5, label='Low DAP')
plt.scatter(X_train[y_train == 1, 0], X_train[y_train == 1, 1], c='red', edgecolors='k', alpha=0.5, label='High DAP')
plt.xlabel('Stone height (mm)')
plt.ylabel('Stone width (mm)')
plt.legend(loc='upper right')
plt.title('Scatter Plot of Height vs Width with Class Labels')
plt.show()
Select Support vector machine (SVM) model model, train it and evaluate¶
SVM: is a type of machine learning model used for classification tasks. Imagine you have a bunch of points on a graph, each belonging to one of two groups. An SVM tries to find the best line (or boundary) that separates these two groups as clearly as possible. It does this by maximizing the margin between the closest points of each group to the line. This helps the model make accurate predictions on new data.
link to StatQuest video for more info on the model
- Cross validation metric will be accuracy (cv_score)
Code:
from sklearn import svm
from sklearn.model_selection import cross_val_score
# Create and train the SVM model:
model = svm.SVC(kernel='rbf', gamma='scale') #create model object
cv_scores = cross_val_score(model, X_train, y_train, cv=5) #create cross validation scoring
# Print cross-validation scores and mean score
print("Cross-validation scores:", cv_scores)
print("Mean cross-validation score:", cv_scores.mean())
# Train the model on the entire training set
model.fit(X_train, y_train)
# Make predictions
y_pred = model.predict(X_test)
# Create and train the SVM model
model = svm.SVC(kernel='rbf', gamma='scale', verbose=True)
cv_scores = cross_val_score(model, X_train, y_train, cv=5)
# Print cross-validation scores and mean score
#print("Cross-validation scores:", cv_scores)
#print("Mean cross-validation score:", cv_scores.mean())
# Train the model on the entire training set
model.fit(X_train, y_train)
# Make predictions
y_pred = model.predict(X_test)
# Compute confusion matrix
cm = confusion_matrix(y_test, y_pred)
# Calculate Positive Predictive Value (PPV) and Negative Predictive Value (NPV)
tn, fp, fn, tp = cm.ravel()
ppv = tp / (tp + fp)
npv = tn / (tn + fn)
# Plot the results
# Plot the results with labels for each color
plt.scatter(X_test[y_pred == 0, 0], X_test[y_pred == 0, 1], c='blue', edgecolors='k', alpha=0.5, label='Low DAP')
plt.scatter(X_test[y_pred == 1, 0], X_test[y_pred == 1, 1], c='red', edgecolors='k', alpha=0.5, label='High DAP')
# Highlight misclassified points
misclassified = y_test != y_pred
plt.scatter(X_test[misclassified, 0], X_test[misclassified, 1], facecolors='none', edgecolors='yellow', s=100, label='Misclassified')
plt.xlabel('Stone height (mm)')
plt.ylabel('Stone width (mm)')
plt.legend(loc='upper right')
plt.title('Scatter Plot of Height vs Width with Class Labels')
plt.show()
# Plot the confusion matrix:
disp = ConfusionMatrixDisplay(confusion_matrix=cm)
disp.plot(cmap=plt.cm.Blues)
plt.title('Confusion Matrix')
plt.show()
#Print cross-validation scores and mean score
print("")
print("Cross-validation scores:", cv_scores)
print(f"Mean cross-validation score: {cv_scores.mean():.4f}")
print("")
print(f"Positive Predictive Value (PPV): {ppv:.4f}")
print(f"Negative Predictive Value (NPV): {npv:.4f}")
[LibSVM][LibSVM][LibSVM][LibSVM][LibSVM][LibSVM]
Cross-validation scores: [0.93928571 0.93928571 0.92142857 0.92857143 0.96428571] Mean cross-validation score: 0.9386 Positive Predictive Value (PPV): 0.9527 Negative Predictive Value (NPV): 0.8892
Let's use different machine learning model: RandomForestClassifier¶
RandomForestClassifier: Imagine you have a big forest with many trees. Each tree in this forest is a decision tree, which is a simple model that makes decisions based on certain rules. For example, a decision tree might decide if you should bring an umbrella based on whether it’s cloudy and the weather forecast.
Now, a RandomForestClassifier is like having a whole forest of these decision trees. Here’s how it works:
- Multiple Trees: Instead of relying on just one decision tree, the RandomForestClassifier uses many trees to make a decision.
- Randomness: Each tree is built using a random subset of the data and a random subset of the features (like temperature, humidity, etc.). This randomness helps make each tree a bit different from the others.
- Voting: When it’s time to make a prediction, each tree in the forest gives its own prediction. The final decision is made based on the majority vote from all the trees. So, if most trees say “yes, bring an umbrella,” then the RandomForestClassifier will also say “yes
link to StatQuest video for more info on the model
Code:
from sklearn.ensemble import RandomForestClassifier
# Create and train the Random Forest model with cross-validation
model = RandomForestClassifier(n_estimators=100, random_state=42)
cv_scores = cross_val_score(model, X_train, y_train, cv=5)
# Train the model on the entire training set
model.fit(X_train, y_train)
# Make predictions
y_pred = model.predict(X_test)
Learning point : syntax is similar in sklearn library and it is easy to experiment the performance between different ML models.
from sklearn.ensemble import RandomForestClassifier
# Create and train the Random Forest model with cross-validation
model = RandomForestClassifier(n_estimators=100, random_state=42)
cv_scores = cross_val_score(model, X_train, y_train, cv=5)
# Train the model on the entire training set
model.fit(X_train, y_train)
# Make predictions
y_pred = model.predict(X_test)
# Compute confusion matrix
cm = confusion_matrix(y_test, y_pred)
# Calculate Positive Predictive Value (PPV) and Negative Predictive Value (NPV)
tn, fp, fn, tp = cm.ravel()
ppv = tp / (tp + fp)
npv = tn / (tn + fn)
# Plot the results
# Plot the results with labels for each color
plt.scatter(X_test[y_pred == 0, 0], X_test[y_pred == 0, 1], c='blue', edgecolors='k', alpha=0.5, label='Low DAP')
plt.scatter(X_test[y_pred == 1, 0], X_test[y_pred == 1, 1], c='red', edgecolors='k', alpha=0.5, label='High DAP')
# Highlight misclassified points
misclassified = y_test != y_pred
plt.scatter(X_test[misclassified, 0], X_test[misclassified, 1], facecolors='none', edgecolors='yellow', s=100, label='Misclassified')
plt.xlabel('Stone height (mm)')
plt.ylabel('Stone width (mm)')
plt.legend(loc='upper right')
plt.title('Scatter Plot of Height vs Width with Class Labels')
plt.show()
# Plot the confusion matrix:
disp = ConfusionMatrixDisplay(confusion_matrix=cm)
disp.plot(cmap=plt.cm.Blues)
plt.title('Confusion Matrix')
plt.show()
#Print cross-validation scores and mean score
print("")
print("Cross-validation scores:", cv_scores)
print(f"Mean cross-validation score: {cv_scores.mean():.4f}")
print("")
print(f"Positive Predictive Value (PPV): {ppv:.4f}")
print(f"Negative Predictive Value (NPV): {npv:.4f}")
Cross-validation scores: [0.92857143 0.93928571 0.91428571 0.91071429 0.94642857] Mean cross-validation score: 0.9279 Positive Predictive Value (PPV): 0.9258 Negative Predictive Value (NPV): 0.8864
Conclusions¶
- Application of classical machine learning models in feature based learning is fairly easy (Scikit learn library in Python)
- it requires only few lines of code:
# Create and train the SVM model: model = svm.SVC(kernel='rbf', gamma='scale') #create model object # Train the model on the entire training set model.fit(X_train, y_train) # Make predictions y_pred = model.predict(X_test)
- The hard part is to extract/obtain good quality training data (X, Y)!
- The model selection and hyperparameter tuning can be automated as well (autoML):
- TPOT (Tree-based Pipeline Optimization Tool): is an open-source Automated Machine Learning library in Python that uses genetic programming to optimize machine learning pipelines
- Auto-sklearn: An extension of scikit-learn that automates the process of model selection and hyperparameter tuning
