At Karuna Labs we have created virtual reality software to treat chronic pain through cortical reorganization (also known as neuroplasticity).

This entails slowly training the patient's brain and body to recover the full extent of their movements through physiotherapeutic exercises and visual inputs that reprogram the patient's representations of their pain.

For example, suppose a patient has suffered a shoulder injury. In their sessions they put on the VR headset and perform a daily task, such as picking up a can from a shelf. In subsequent sessions we slowly increase the difficulty of the exercise (in this case, the height of the shelf) such that the patient does not notice the increase. In doing so the patient progressively, but imperceptibly, regains their mobility.

The machine learning problem: how can we prepare a specialized training regime for a new patient using our database of patients who have successfully completed the sessions? The database contains movement data captured through our sensors along with the patients' medical history.

That is, how do we decide the difficulty level of the exercises based on a patient's health and current range of mobility?

We implement various supervised learning algorithms for regression in Scikit-learn.

**General steps in creating a regression model**

- Data Exploration
- Data Preparation – data cleaning, handling missing values, data transformations, normalization.
- Handling Outliers
- Dimensionality Reduction – PCA, feature selection
- Choosing appropriate models based on given data and desired outcomes.
- Training models
- Parameter Tuning
- Measuring Performance on test/unseen data
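The steps above can be sketched end-to-end with a scikit-learn `Pipeline`. This is a minimal illustration on synthetic stand-in data (the array shapes and the `Ridge` model here are assumptions for the sketch, not our actual pipeline):

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 12))                        # stand-in for patient features
y = 2.0 * X[:, 0] + rng.normal(scale=0.5, size=100)   # stand-in target

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# normalize -> reduce dimensionality -> fit a model, all in one object
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("pca", PCA(n_components=5)),
    ("model", Ridge(alpha=0.5)),
])
pipe.fit(X_train, y_train)
mse = mean_squared_error(y_test, pipe.predict(X_test))
```

Bundling the steps into a pipeline keeps the preprocessing fitted on training data only, which avoids leaking test-set statistics into the model.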

In this tutorial we will primarily focus on training various regression models in Scikit-learn and measuring their performance.

**1. Data Preparation**

**1.1. Read input data**

```python
# import libraries
import pandas as pd
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# read in the data
data = pd.read_excel("synthetic_final.xlsx")

# preview of the data
data.head(10)
```

```python
# all column names
list(data.columns.values)
```

```
['patient_id', 'promis29_score', 'age', 'sex', 'heart_health', 'diabetes', 'asthma', 'back_pain', 'arthritis', 'high_blood_pressure', 'height_inches', 'weight_pounds', 'difficulty_1', 'pain_1', 'rom_1', 'success_rate_1', 'difficulty_2', 'pain_2', 'rom_2', 'success_rate_2', 'difficulty_3', 'pain_3', 'rom_3', 'success_rate_3', 'difficulty_4', 'pain_4', 'rom_4', 'success_rate_4', 'difficulty_5', 'pain_5', 'rom_5', 'success_rate_5', 'difficulty_6', 'pain_6', 'rom_6', 'success_rate_6', 'difficulty_7', 'pain_7', 'rom_7', 'success_rate_7', 'difficulty_8', 'pain_8', 'rom_8', 'success_rate_8']
```

**1.2. Data Dictionary**

- `patient_id` – masked patient ID
- `promis29_score` – a patient-reported score based on a questionnaire about their health
- `age` – age of the patient at the time of the first session
- `sex` – sex of the patient
- `heart_health` – 0 = healthy heart, 1 = unhealthy heart
- `diabetes` – 0 = does not suffer from diabetes, 1 = suffers from diabetes
- `asthma` – 0 = does not suffer from asthma, 1 = suffers from asthma
- `back_pain` – 0 = no back pain, 1 = has back pain
- `arthritis` – 0 = does not suffer from arthritis, 1 = suffers from arthritis
- `high_blood_pressure` – 0 = normal BP, 1 = high BP
- `height_inches` – height in inches
- `weight_pounds` – weight in pounds
- `difficulty_*` – difficulty setting for the particular session
- `pain_*` – pain reported by the patient during the session
- `rom_*` – range of movement of the patient
- `success_rate_*` – success rate in completing the exercise

```python
data.describe()
```

**1.3. Splitting data into train and test sets**

```python
# split data into training and test sets
train = data.iloc[:80, :]
test = data.iloc[80:, :]

# We are predicting the difficulty level for the 8th session.
# Features: drop the patient ID, all success rates, and the session-8 columns.
drop_cols = ['patient_id',
             'success_rate_1', 'success_rate_2', 'success_rate_3', 'success_rate_4',
             'success_rate_5', 'success_rate_6', 'success_rate_7', 'success_rate_8',
             'difficulty_8', 'pain_8', 'rom_8']
train8X = train.drop(drop_cols, axis=1)
test8X = test.drop(drop_cols, axis=1)

# Target: the difficulty setting of the 8th session.
train8Y = train['difficulty_8']
test8Y = test['difficulty_8']
```
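The same split pattern can be checked on a small synthetic frame. Everything here (the random values, the 100-row size) is a stand-in for illustration; only the column-name pattern mirrors our dataset:

```python
import numpy as np
import pandas as pd

# tiny stand-in frame mimicking the session column pattern
rng = np.random.default_rng(1)
cols = ["patient_id"] + [f"{m}_{s}" for s in range(1, 9)
                         for m in ("difficulty", "pain", "rom", "success_rate")]
df = pd.DataFrame(rng.random((100, len(cols))), columns=cols)

train, test = df.iloc[:80], df.iloc[80:]

# features: everything except the ID, the success rates, and the session-8 columns
drop_cols = ["patient_id"] + [c for c in df.columns if c.startswith("success_rate")] \
            + ["difficulty_8", "pain_8", "rom_8"]
train8X, test8X = train.drop(columns=drop_cols), test.drop(columns=drop_cols)
train8Y, test8Y = train["difficulty_8"], test["difficulty_8"]
```

Selecting the target column directly (rather than dropping everything else) yields a 1-D `Series`, which is the shape scikit-learn estimators expect for `y`.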

**2. Regression Models**

The regression models aim to predict the difficulty setting for the eighth session, i.e. a value between 0 and 100.
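Before walking through the individual models, it helps to have a single harness that scores several candidates the same way. A sketch using 5-fold cross-validation on synthetic stand-in data (the data and the model shortlist here are illustrative assumptions):

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn import svm, linear_model
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(2)
X = rng.normal(size=(100, 10))
y = X[:, 0] - X[:, 1] + rng.normal(scale=0.3, size=100)

models = {
    "SVR": svm.SVR(),
    "Linear": linear_model.LinearRegression(),
    "Lasso": linear_model.Lasso(alpha=0.1),
    "GBR": GradientBoostingRegressor(random_state=0),
}
# cross_val_score returns negated MSE, so flip the sign back
scores = {name: -cross_val_score(m, X, y, cv=5,
                                 scoring="neg_mean_squared_error").mean()
          for name, m in models.items()}
```

Cross-validation gives a more stable ranking than a single 80/20 split, which matters with a dataset this small.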

**2.1. Support Vector Machines**

```python
# import library
from sklearn import svm

# train the model on the training data
model_svm = svm.SVR()
model_svm.fit(train8X, train8Y)

# make predictions on the test data
predicted_values = model_svm.predict(test8X)

# calculate the error between the predicted values and the ground truth
print("Mean squared error: %.5f" % mean_squared_error(test8Y, predicted_values))
print("Variance score: %.2f" % r2_score(test8Y, predicted_values))
```

```
Mean squared error: 1.16618
Variance score: -0.02
```
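One caveat with SVR: its default RBF kernel is sensitive to feature scale, so features like `weight_pounds` can dominate binary flags like `diabetes` unless the data is standardized. A minimal sketch on synthetic stand-in data with deliberately mixed scales:

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

rng = np.random.default_rng(3)
# features on very different scales (columns 1 and 2 are inflated)
X = rng.normal(size=(100, 8)) * np.array([1, 10, 100, 1, 1, 1, 1, 1])
y = X[:, 0] + rng.normal(scale=0.2, size=100)

# scaling inside the pipeline keeps train/test preprocessing consistent
scaled_svr = make_pipeline(StandardScaler(), SVR())
scaled_svr.fit(X[:80], y[:80])
pred = scaled_svr.predict(X[80:])
```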

**2.2. Linear Regression**

```python
from sklearn import linear_model

reg_model = linear_model.LinearRegression()
reg_model.fit(train8X, train8Y)
reg_model.coef_

predicted_reg = reg_model.predict(test8X)
print("Mean squared error: %.5f" % mean_squared_error(test8Y, predicted_reg))
print("Variance score: %.2f" % r2_score(test8Y, predicted_reg))
```

```
Mean squared error: 1.96415
Variance score: -0.71
```

**2.3. Ridge Regression**

```python
ridge_reg = linear_model.Ridge(alpha=0.5)
ridge_reg.fit(train8X, train8Y)

predicted_ridge = ridge_reg.predict(test8X)
print("Mean squared error: %.5f" % mean_squared_error(test8Y, predicted_ridge))
print("Variance score: %.2f" % r2_score(test8Y, predicted_ridge))
```

```
Mean squared error: 1.78342
Variance score: -0.55
```
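Rather than fixing `alpha = 0.5` by hand, `RidgeCV` can select the regularization strength by cross-validation. A sketch on synthetic stand-in data (the alpha grid here is an arbitrary example):

```python
import numpy as np
from sklearn.linear_model import RidgeCV

rng = np.random.default_rng(4)
X = rng.normal(size=(100, 6))
y = X @ np.array([1.0, -2.0, 0.5, 0.0, 0.0, 0.0]) + rng.normal(scale=0.1, size=100)

# search a grid of regularization strengths by cross-validation
ridge_cv = RidgeCV(alphas=[0.01, 0.1, 0.5, 1.0, 10.0]).fit(X, y)
best_alpha = ridge_cv.alpha_
```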

**2.4. Lasso Regression**

```python
lasso_reg = linear_model.Lasso(alpha=0.1)
lasso_reg.fit(train8X, train8Y)

predicted_lasso = lasso_reg.predict(test8X)
print("Mean squared error: %.5f" % mean_squared_error(test8Y, predicted_lasso))
print("Variance score: %.2f" % r2_score(test8Y, predicted_lasso))
```

```
Mean squared error: 1.09308
Variance score: 0.05
```
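A useful side effect of Lasso's L1 penalty is that it drives uninformative coefficients exactly to zero, giving a crude form of feature selection. A sketch on synthetic stand-in data where only the first feature carries signal:

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(5)
X = rng.normal(size=(200, 10))
y = 3.0 * X[:, 0] + rng.normal(scale=0.1, size=200)  # only feature 0 matters

lasso = Lasso(alpha=0.1).fit(X, y)
n_used = int(np.sum(lasso.coef_ != 0))  # how many features Lasso kept
```

Inspecting `lasso_reg.coef_` the same way on our data would show which patient features the model actually relies on.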

**2.5. Bayesian Ridge Regression**

```python
bayesian_ridge_reg = linear_model.BayesianRidge()
bayesian_ridge_reg.fit(train8X, train8Y)

predicted_bayes = bayesian_ridge_reg.predict(test8X)
print("Mean squared error: %.5f" % mean_squared_error(test8Y, predicted_bayes))
print("Variance score: %.2f" % r2_score(test8Y, predicted_bayes))
```

```
Mean squared error: 1.25829
Variance score: -0.10
```

**2.6. Regression Tree**

```python
# regression tree
from sklearn import tree

clf = tree.DecisionTreeRegressor()
clf = clf.fit(train8X, train8Y)

clf_pred = clf.predict(test8X)
print("Mean squared error: %.5f" % mean_squared_error(test8Y, clf_pred))
print("Variance score: %.2f" % r2_score(test8Y, clf_pred))
```

```
Mean squared error: 2.00000
Variance score: -0.74
```

**2.7. Gradient Boosting Regression**

```python
from sklearn.ensemble import GradientBoostingRegressor

# note: older scikit-learn versions spell this loss as 'ls'
est = GradientBoostingRegressor(n_estimators=10, learning_rate=0.1, max_depth=1,
                                random_state=0, loss='squared_error').fit(train8X, train8Y)
print("Mean squared error: %.5f" % mean_squared_error(test8Y, est.predict(test8X)))
print("Variance score: %.2f" % r2_score(test8Y, est.predict(test8X)))
```

```
Mean squared error: 0.78961
Variance score: 0.31
```
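The `n_estimators=10, max_depth=1` settings above were picked by hand; gradient boosting usually benefits from a small grid search over these knobs. A sketch with `GridSearchCV` on synthetic stand-in data (the grid values are arbitrary examples):

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(6)
X = rng.normal(size=(120, 5))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=120)

# try a few ensemble sizes and tree depths, scored by cross-validated MSE
grid = GridSearchCV(
    GradientBoostingRegressor(random_state=0),
    param_grid={"n_estimators": [10, 50], "max_depth": [1, 2]},
    scoring="neg_mean_squared_error", cv=3,
)
grid.fit(X, y)
best = grid.best_params_
```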

**2.8. Multi Layer Perceptron**

```python
# multi-layer perceptron
from sklearn.neural_network import MLPRegressor

mlp_reg = MLPRegressor(solver='lbfgs', alpha=1e-5, hidden_layer_sizes=(15,), random_state=1)
mlp_reg.fit(train8X, train8Y)

print("Mean squared error: %.5f" % mean_squared_error(test8Y, mlp_reg.predict(test8X)))
print("Variance score: %.2f" % r2_score(test8Y, mlp_reg.predict(test8X)))
```

```
Mean squared error: 5.40565
Variance score: -3.71
```
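Part of the MLP's poor score may come from unscaled inputs: like SVR, neural networks train far better on standardized features. A sketch wrapping the same estimator in a scaling pipeline, on synthetic stand-in data:

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(7)
X = rng.normal(size=(100, 8))
y = X[:, 0] + rng.normal(scale=0.1, size=100)

# standardize features before they reach the network
mlp = make_pipeline(
    StandardScaler(),
    MLPRegressor(solver="lbfgs", alpha=1e-5, hidden_layer_sizes=(15,),
                 max_iter=2000, random_state=1),
).fit(X[:80], y[:80])
pred = mlp.predict(X[80:])
```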

**3. Conclusion**

We see that the model with the least mean squared error is the ensemble method, Gradient Boosting Regression. Among the linear regression models, Lasso regression performs the best.