Cookies on this website
We use cookies to ensure that we give you the best experience on our website. If you click 'Continue' we'll assume that you are happy to receive all cookies and you won't see this message again. Click 'Find out more' for information on how to change your cookie settings.

<jats:title>Abstract</jats:title> <jats:p><jats:bold>Background</jats:bold> Collecting data from people with herpes simplex virus is challenging because of poor data quality, low user engagement, and concerns around stigma and anonymity. This project aimed to improve data collection for a real-world HSV registry by identifying predictors of HSV infection and selecting a limited number of relevant questions to ask new registry users in order to determine the HSV infection risk group. <jats:bold>Methods</jats:bold>. The US National Health and Nutrition Examination Survey (NHANES, 2015-16) database has confirmed HSV1 and HSV2 status of American participants (14-49 years) as well as a wealth of demographic and health-related data. Two datasets – for HSV1 and HSV2 – were formed using this database, and an anonymous lifestyle-data based questionnaire with a Random Forest algorithm was devised using Python. The algorithm was optimised to reduce the number of questions and to identify risk groups for HSV. Data was split into subsets to train and test the model. <jats:bold>Results </jats:bold>The model selected a reduced number of questions from the NHANES questionnaire that predicted HSV infection risk with high accuracy scores of 0.91 and 0.96 and high recall scores of 0.88 and 0.98 for HSV1 and HSV2 datasets, respectively. The number of questions was reduced from 150 to an average of 40, depending on age and gender, that together provides high predictability of the infection <jats:bold>Conclusions</jats:bold> This machine-learning algorithm for risk identification of people infected with HSV can be used in a real-world evidence registry to collect relevant lifestyle data. A current limitation is the absence of real user data and integration with electronic medical records that would enable model learning and improvement. Future work will explore model adjustments, anonymisation options, explicit permissions and standardised data schema that meet GDPR, HIPAA and third-party interface connectivity requirements.</jats:p>

Original publication




Journal article


Research Square

Publication Date