“Exercise is Medicine” — Can we minimize our risk of developing certain diseases?
This project was carried out as part of the TechLabs “Digital Shaper Program” in Dortmund (summer term 2022).
In a nutshell:
Working from home tends to reduce people’s physical activity, which can adversely affect health. In this project, we wanted to provide information about the chance of developing certain diseases, based on a model that incorporates critical parameters. The project is based on the NHANES pre-pandemic dataset, which is available here.
One of the biggest changes brought by the Covid-19 pandemic was that people were told to stay at home. Employees and companies in particular had to adapt quickly, and as a result many companies offered the possibility to work from home. Since then, working from home has increased significantly and remains the preferred way of working for many employees.
This new way of working may lead to less physical activity, because commuting and physically demanding working conditions are less common than before. This raises the question: Does less physical activity have negative effects on health?
Therefore, we used the physical activity data as input and examined possible relations to different output parameters. We investigated the mental health status, high blood pressure, waist-to-hip ratio, BMI, alcohol consumption, educational level, income, smoking, age, gender, race, and physical activity level of our sample.
Our final goal was to provide concise information about the chance of developing certain diseases based on a model which incorporates critical parameters.
The preparatory phase of this project started with a comprehensive search for a dataset that is freely available online and suitable for answering our research question. As some of us had already worked with the dataset of the National Health and Nutrition Examination Survey (NHANES), an American public health project, during the track, we took a closer look and finally decided to work with it.
NHANES has been gathering questionnaire, laboratory, and examination data from the American population for more than 50 years, with a focus on health and nutrition. In line with the rationale of our project, we included only adults aged over 18 years.
Over the following weeks, we screened the NHANES data for outcomes of interest and prepared the selected data sets. This involved getting an idea of the shape of each data set and the content of each variable, renaming variables for better understanding, recoding the answers, handling missing values, filtering relevant items, indexing, visualizing proportions, calculating new variables such as sum scores or ratios, and creating new categorical variables. We then merged the selected data sets into one file.
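The preparation steps listed above can be sketched with pandas. This is a minimal illustration, not our actual notebook: the values are invented, the column names follow NHANES naming but should be checked against the codebook, and the cut-offs for the ratio categories are arbitrary assumptions.

```python
import numpy as np
import pandas as pd

# Two tiny made-up tables standing in for separate NHANES files.
# SEQN is the respondent identifier; all values are invented.
body = pd.DataFrame({
    "SEQN": [1, 2, 3, 4],
    "BMXWAIST": [80.0, 95.0, np.nan, 102.0],  # waist circumference (cm)
    "BMXHIP": [100.0, 100.0, 98.0, 104.0],    # hip circumference (cm)
})
demo = pd.DataFrame({
    "SEQN": [1, 2, 3, 4],
    "RIDAGEYR": [25, 17, 40, 63],  # age in years
    "RIAGENDR": [1, 2, 2, 1],      # coded answer: 1 = male, 2 = female
})

# Rename cryptic variable names for better understanding.
body = body.rename(columns={"BMXWAIST": "waist_cm", "BMXHIP": "hip_cm"})
demo = demo.rename(columns={"RIDAGEYR": "age", "RIAGENDR": "gender"})

# Recode numeric answer codes into readable labels.
demo["gender"] = demo["gender"].map({1: "male", 2: "female"})

# Filter relevant participants: adults only (exact cut-off is an assumption).
demo = demo[demo["age"] > 18]

# Handle missing values: here, rows without a waist measurement are dropped.
body = body.dropna(subset=["waist_cm"]).copy()

# Calculate a new variable (a ratio) and derive a categorical variable from it.
# The bin edges are illustrative only, not clinical thresholds.
body["waist_hip_ratio"] = body["waist_cm"] / body["hip_cm"]
body["whr_category"] = pd.cut(
    body["waist_hip_ratio"],
    bins=[0, 0.85, 0.95, np.inf],
    labels=["low", "moderate", "high"],
)

# Merge the prepared tables into one file-like table.
merged = body.merge(demo, on="SEQN", how="inner")
print(merged)
```

An inner merge keeps only participants present in both tables, which is one simple way to end up with complete rows; other join types would keep more participants at the cost of more missing values.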
Once this was done, we first performed correlation analyses (Pearson and Spearman) between parameters indicating a health risk (for example, “waist-to-hip ratio”, “total mental health score”, “liver stiffness”) and further participant characteristics (for example, “BMI”, “education level”, “physical activity level”). Associations were visualized in heatmaps, scatterplots, and bar charts.
We performed regression analyses to identify parameters that predict certain health risk factors.
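As a rough sketch of such a correlation analysis, the snippet below computes Pearson and Spearman correlation matrices with pandas. The data are randomly generated and the column names are hypothetical placeholders, so the numbers say nothing about NHANES.

```python
import numpy as np
import pandas as pd

# Invented data: the column names are placeholders, not NHANES variables.
rng = np.random.default_rng(0)
n = 200
bmi = rng.normal(27, 4, size=n)
df = pd.DataFrame({
    "activity_met_per_week": rng.gamma(shape=2.0, scale=500.0, size=n),
    "bmi": bmi,
    "waist_hip_ratio": 0.9 + 0.005 * (bmi - 27) + rng.normal(0, 0.05, size=n),
})

# Pearson measures linear association; Spearman works on ranks and is
# therefore less sensitive to skewed distributions and outliers.
pearson = df.corr(method="pearson")
spearman = df.corr(method="spearman")

print(pearson.round(2))
print(spearman.round(2))

# With seaborn installed, a matrix like this can be drawn as a heatmap, e.g.
# sns.heatmap(pearson, annot=True, vmin=-1, vmax=1)
```

Comparing the two matrices side by side is a quick check on whether a weak Pearson correlation is due to a non-linear but monotonic relationship.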
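To illustrate what a regression with explained variance looks like in code, here is a minimal sketch using ordinary least squares in NumPy. The choice of a linear model, the predictors, and the data are all assumptions made for illustration; the text above does not state which regression type was used.

```python
import numpy as np

# Invented, standardized predictors and outcome (not NHANES data).
rng = np.random.default_rng(1)
n = 300
activity = rng.normal(0, 1, size=n)
bmi = rng.normal(0, 1, size=n)
outcome = 0.1 * activity + 0.3 * bmi + rng.normal(0, 1, size=n)

# Design matrix with an intercept column.
X = np.column_stack([np.ones(n), activity, bmi])

# Ordinary least squares fit.
coef, *_ = np.linalg.lstsq(X, outcome, rcond=None)
predicted = X @ coef

# Explained variance: R^2 = 1 - SS_res / SS_tot.
ss_res = np.sum((outcome - predicted) ** 2)
ss_tot = np.sum((outcome - outcome.mean()) ** 2)
r_squared = 1 - ss_res / ss_tot

print("coefficients (intercept, activity, bmi):", coef.round(3))
print("R^2:", round(r_squared, 3))
```

A model can have clearly non-zero coefficients and still a small R², which simply means that most of the variation in the outcome is not captured by the chosen predictors.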
We further developed supervised machine learning models with PyTorch to predict the effect of a person’s exercise on their health. As pre-processing, we normalized the input data to bring the feature distributions closer to a Gaussian distribution, and we replaced the few missing values with the median of the respective feature. We split the data into training and testing sets, which we used to train and subsequently evaluate our models. For evaluation, we plotted the predicted values and calculated accuracies as well as a confusion matrix for the classification neural network.
First, we investigated the effect of a person’s total physical exercise score on their waist-to-hip ratio, which is indicative of cardiovascular diseases. We built the neural network from linear and ReLU layers, with the L2 loss as the loss function and stochastic gradient descent as the optimizer.
We also considered the effect of vigorous physical exercise during leisure time on people’s health. As another health indicator, we selected blood pressure, which is a contributing factor for many conditions, such as heart attacks and strokes. We coded blood pressure as a categorical variable ranging from normal to different degrees of high blood pressure and used it as our target variable. This target variable is imbalanced, with more than 50% of all observations in the class for normal blood pressure. We therefore tried several methods to balance our data: oversampling the training data (for example with SMOTE) and rescaling the weights given to the classes in the cross-entropy loss function. We finally used the latter. We then built a multiclass classification neural network with the cross-entropy loss as the loss function, stochastic gradient descent as the optimizer, and linear as well as ReLU layers. Finally, we applied the softmax function to our predicted scores.
Another health indicator we investigated is liver fat, which is a factor in diseases such as hepatic cirrhosis. We tried to predict the effect of vigorous physical exercise on this condition.
Within our project, we used dictionaries and several Python libraries that we were introduced to throughout the learning track, including NumPy, pandas, seaborn, matplotlib, and researchpy.
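The blood pressure classifier described above can be sketched roughly as follows. This is not our original code: the data are random, the layer sizes, learning rate, and number of epochs are arbitrary, z-score standardization stands in for the normalization step, and inverse-frequency class weights are one common way (assumed here) to rescale the cross-entropy loss.

```python
import torch
from torch import nn

torch.manual_seed(0)

# Invented data: 300 people, 4 features, 3 blood pressure classes,
# with the "normal" class (0) deliberately over-represented.
n_samples, n_features, n_classes = 300, 4, 3
X = torch.randn(n_samples, n_features)
y = torch.multinomial(torch.tensor([0.6, 0.3, 0.1]), n_samples, replacement=True)

# Simulate a few missing values, then replace them with the feature median.
X[::25, 0] = float("nan")
for j in range(n_features):
    missing = torch.isnan(X[:, j])
    X[missing, j] = X[~missing, j].median()

# Standardize each feature (assumed stand-in for the normalization step).
X = (X - X.mean(dim=0)) / X.std(dim=0)

# Split into training and testing sets (80/20).
perm = torch.randperm(n_samples)
train_idx, test_idx = perm[:240], perm[240:]
X_train, y_train = X[train_idx], y[train_idx]
X_test, y_test = X[test_idx], y[test_idx]

# Inverse-frequency class weights: rare classes get a larger weight.
counts = torch.bincount(y_train, minlength=n_classes).float().clamp(min=1)
class_weights = counts.sum() / (n_classes * counts)

# Small network made of linear and ReLU layers.
model = nn.Sequential(
    nn.Linear(n_features, 16),
    nn.ReLU(),
    nn.Linear(16, n_classes),
)
loss_fn = nn.CrossEntropyLoss(weight=class_weights)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

# Full-batch training loop.
for epoch in range(100):
    optimizer.zero_grad()
    loss = loss_fn(model(X_train), y_train)
    loss.backward()
    optimizer.step()

# Evaluation: softmax turns the scores into class probabilities.
with torch.no_grad():
    probs = torch.softmax(model(X_test), dim=1)
pred = probs.argmax(dim=1)
accuracy = (pred == y_test).float().mean().item()

# Confusion matrix: rows are true classes, columns are predicted classes.
confusion = torch.zeros(n_classes, n_classes, dtype=torch.long)
for t, p in zip(y_test, pred):
    confusion[t, p] += 1

print("accuracy:", round(accuracy, 3))
print(confusion)
```

Note that `nn.CrossEntropyLoss` expects raw scores, so the softmax is applied only at prediction time. Because the features here are pure noise, the accuracy is expected to be poor; the sketch only shows the mechanics.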
Even though the visualization of associations between health risk parameters and lifestyle variables (e.g., amount of physical activity and alcohol consumption) with bar charts looked promising, the correlation analyses revealed no significant results.
As an example, Figure 1 displays the association between the total amount of physical activity in MET per week and the occurrence of high blood pressure. However, as shown in Figure 2, the correlation turned out to be very small.
The regression model predicting the occurrence of high blood pressure was significant, but the explained variance (i.e., R²) turned out to be quite small.
In addition, our supervised machine learning models did not yield satisfactory results: accuracy was low, and so were the per-class F1 scores (the harmonic mean of precision and recall).
These results might be explained by the fact that the dataset was imbalanced and contained large outliers, especially in the variable “Total Activity Score”. We also observed an uncommon distribution (i.e., left-skewed).
Going forward, we plan to include further independent variables, re-check outliers, and revisit our approach to the supervised machine learning models (e.g., by choosing different layers).
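For readers unfamiliar with these metrics, the following sketch derives accuracy and per-class precision, recall, and F1 from a confusion matrix. The matrix is invented for illustration and does not reproduce our results.

```python
import numpy as np

# Invented confusion matrix: rows are true classes, columns are predictions.
confusion = np.array([
    [50, 8, 2],
    [10, 15, 5],
    [3, 4, 3],
])

true_positives = np.diag(confusion)
precision = true_positives / confusion.sum(axis=0)  # per predicted class
recall = true_positives / confusion.sum(axis=1)     # per true class

# F1 is the harmonic mean of precision and recall.
f1 = 2 * precision * recall / (precision + recall)
accuracy = true_positives.sum() / confusion.sum()

print("precision:", precision.round(3))
print("recall:   ", recall.round(3))
print("F1:       ", f1.round(3))
print("accuracy: ", round(accuracy, 3))
```

In this made-up example the majority class scores well while the smallest class has a low F1, which is the typical pattern on imbalanced data and why accuracy alone can be misleading.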
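Skewness and outliers of a variable can be checked quickly with pandas. The sketch below uses invented data that were deliberately generated to be left-skewed, and it flags outliers with the 1.5 × IQR rule, which is an assumed convention rather than the method we used.

```python
import numpy as np
import pandas as pd

# Invented, deliberately left-skewed scores (long tail towards low values).
rng = np.random.default_rng(2)
score = pd.Series(
    100 - rng.gamma(shape=2.0, scale=10.0, size=500),
    name="total_activity_score",
)

# Sample skewness: negative means left-skewed, positive means right-skewed.
skewness = score.skew()

# Flag outliers with the 1.5 * IQR rule.
q1, q3 = score.quantile([0.25, 0.75])
iqr = q3 - q1
is_outlier = (score < q1 - 1.5 * iqr) | (score > q3 + 1.5 * iqr)

print("skewness:", round(skewness, 2))
print("number of flagged outliers:", int(is_outlier.sum()))
print(score[is_outlier].describe())
```

Inspecting the flagged values before removing or transforming them helps to distinguish data entry errors from genuinely extreme but valid observations.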
GitHub repository (or similar):
- Deep Learning: Lisa Bräuker [firstname.lastname@example.org]
- Data Science: Ahmed Karaoglan [email@example.com]
- Data Science: Marit Lea Schlagheck [firstname.lastname@example.org]
- Data Science: Parham
- Web Development: Luise Weickhmann
- Data Science: Tobias Küper
The project is based on the NHANES 2017-March 2020 pre-pandemic dataset:
Centers for Disease Control and Prevention (CDC). National Center for Health Statistics (NCHS). National Health and Nutrition Examination Survey Data. Hyattsville, MD: U.S. Department of Health and Human Services, Centers for Disease Control and Prevention, [2017–2020] [https://wwwn.cdc.gov/nchs/nhanes/continuousnhanes/default.aspx?cycle=2017-2020].