Clustering clinical trials based on eligibility criteria, and evaluating the resulting clusters.
Data: all clinical trials downloaded into one text file.
Models tried: Doc2Vec, TF-IDF.
Evaluation metrics: needed.
Attached are the text documents and the results I have obtained from these models.
I need one model as a baseline, and then either improve on the baseline with different hyperparameters or implement a new model that performs better.
The final output is a comparison of the baseline model against the other model, showing the differences.
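Since evaluation metrics are still needed, one simple option, sketched below under assumptions, is to score each model's similar/not-similar calls against the human-annotated trial pairs described in the attached presentation. The pair format, the gensim Doc2Vec `model.dv.similarity` lookup, and the 0.5 cosine threshold are illustrative assumptions, not part of the original work.

```python
# Hypothetical agreement metric: fraction of human-annotated trial pairs on which
# a trained gensim Doc2Vec model agrees with the human similar/not-similar label.
# The 0.5 cosine threshold and the pair format are assumptions for illustration.
def agreement(model, annotated_pairs, threshold=0.5):
    """annotated_pairs: iterable of (tag_a, tag_b, human_says_similar: bool)."""
    correct = 0
    total = 0
    for tag_a, tag_b, human_similar in annotated_pairs:
        model_similar = model.dv.similarity(tag_a, tag_b) >= threshold
        correct += int(model_similar == human_similar)
        total += 1
    return correct / total
```

Applying the same function to the baseline and to the improved model would give the side-by-side comparison the brief asks for.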
Learning Representations for Clinical Trial Eligibility Criteria to Discover Similar Trials
IUPUI SOIC, Department of BioHealth Informatics

SECTION 1: AIM & OBJECTIVES

Aim
To identify similar studies in the clinical trial database using the inclusion and exclusion criteria (eligibility criteria).

Objectives
- Learn vector representations of clinical trial data from the free-text eligibility criteria.
- Cluster the clinical trials that share similar eligibility features.
- Retrieve clinical trial data with similar eligibility features.

SECTION 2: INTRODUCTION

Clinical research is essential to the advancement of medical science and is a priority for academic health centers, research funding agencies, and industries working to develop and deploy new treatments (Weng, 2019). Randomized controlled trials (RCTs) provide high-quality evidence for clinical practice, and these studies recruit patients based on eligibility criteria. Information retrieval engineering is important here because information overload remains a significant barrier for patients and researchers searching for clinical trials online.

Clinical Trials
ClinicalTrials.gov is a database that stores information on clinical studies from all over the world. It currently holds 337,371 research studies from 211 countries and reports the status of each study, for example recruiting or not yet recruiting.

SECTION 3: METHODOLOGY

Workflow
Download the documents (259,430 clinical trials) -> extract the eligibility criteria -> data preprocessing -> methods -> evaluation -> results -> conclusion.

Requirements
- Disk space: about 8 GB for the documents
- RAM: 4 to 8 GB
- Python (latest version)
- NLTK library
- Gensim library

Data Preprocessing
Starting from the trials downloaded from ClinicalTrials.gov, extract the eligibility criteria text, detect sentence boundaries with a sentence tokenizer, remove punctuation, special characters, and stop words, and lemmatize the tokens to obtain processed and cleaned data.

Methods: Word Embeddings
In recent years, word embeddings have gained a great deal of popularity. Word embeddings transform text into a machine-readable form, that is, numbers. These vector-space representations of words help learning algorithms achieve better performance on NLP tasks by grouping similar words, and they allow algorithms to train on billions of words at a time. Common word embedding algorithms include Word2Vec, FastText, and GloVe; these algorithms scan the corpus with a fixed-size window.

Word embeddings can be composed into more abstract structures such as phrases, sentences, paragraphs, or documents (Jon, 2017). Since the goal of this project is to cluster the clinical trial documents that share the same eligibility features, and every clinical trial in our data carries a tag (the name of the document), document similarity with Doc2Vec can identify the most similar documents sharing similar eligibility features.

Methods: Doc2Vec (Paragraph Vector)
Doc2Vec detects relationships among words and captures the semantics of the text. It comes in two variants, the paragraph vector distributed bag of words (PV-DBOW) model and the paragraph vector distributed memory (PV-DM) model, which differ in their hyperparameters.
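Before turning to the two Doc2Vec variants in detail, here is a minimal sketch of the preprocessing and document-tagging steps described above, assuming NLTK (with the punkt, stopwords, and wordnet resources downloaded) and gensim. The input layout, a dict mapping NCT IDs to eligibility-criteria text, is an assumption for illustration, not the project's exact file format.

```python
import string
from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from gensim.models.doc2vec import TaggedDocument

stop_words = set(stopwords.words("english"))
lemmatizer = WordNetLemmatizer()

def preprocess(text):
    """Sentence-split, lowercase, strip punctuation and special characters,
    drop stop words, and lemmatize the eligibility criteria text."""
    tokens = []
    for sentence in sent_tokenize(text):
        for word in word_tokenize(sentence.lower()):
            word = word.strip(string.punctuation)
            if word.isalpha() and word not in stop_words:
                tokens.append(lemmatizer.lemmatize(word))
    return tokens

def build_corpus(trials):
    """trials: dict {nct_id: eligibility_criteria_text} (assumed layout).
    Each trial becomes a TaggedDocument whose tag is its NCT ID, so Doc2Vec
    can later return the most similar trials by tag."""
    return [TaggedDocument(words=preprocess(text), tags=[nct_id])
            for nct_id, text in trials.items()]
```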
The distributed bag of words (PV-DBOW) model trains the paragraph vector to predict the other words in the paragraph. The distributed memory (PV-DM) model approaches the problem differently: the context is a fixed-size window and the target is the word that comes next, so some sequential information is preserved. The two models can be used individually, and concatenating the embeddings from both can give the best results.

PV-DM: while the word vectors W are trained, the document vector D is trained as well, so at the end of training it holds a numeric representation of the document. Model parameter: dm = 1.
PV-DBOW: trains the document vector D to predict the words W. Model parameter: dm = 0.

Methods: Word2Vec
The Doc2Vec model is built on Word2Vec as its foundation. Word2Vec representations come from two models. CBOW (continuous bag of words, with no document vector) predicts one word from its surrounding context words. Skip-gram predicts the surrounding words from a single word; it is slower than CBOW but considered more accurate for infrequent words.

[Figure: continuous bag of words model architecture]

SECTION 4: RESULTS

PV-DM: model built on 37,139,849 words.

Checking document similarity: most similar documents for tag "nct00820898":
- nct00459290: 0.65
- nct00939809: 0.63
- nct00095979: 0.62
- nct0003014: 0.61
- nct00276796: 0.60

[Figures: most similar documents by vectors; k-means clustering with k = 4; elbow method; k-means with 10 clusters]

PV-DBOW: model built on 32,342,534 words.

Checking document similarity: most similar documents for tag "nct00820898":
- nct00939809: 0.70
- nct00459290: 0.66
- nct00095979: 0.59
- nct00037914: 0.64
- nct00276796: 0.61

[Figures: most similar documents by vectors; k-means clustering with k = 4; elbow method; k-means with 10 clusters]

Evaluation: Doc2Vec vs. Human Annotators
The most common approach for evaluating word embeddings is to use a dataset of word pairs that are given similarity scores by humans (Jon, 2017). Here, 100 random pairs of clinical trials were annotated: random documents were picked and clustered for annotation, and the annotators rated the similarity of each pair on a scale of 0 to 5 (above 3 means similar, below 3 means not similar).

Comparing human annotators vs. dm = 1 vs. dm = 0 (counts out of 100 pairs):

Model             | Similar | Not similar
Human annotators  |   30    |     70
PV-DM (dm = 1)    |   21    |     79
PV-DBOW (dm = 0)  |   18    |     82

Word2Vec
After training the data with Word2Vec, the most similar words to the bigram "congenital_adrenal" can be retrieved; similarly, articles can also be retrieved based on a bigram.

TF-IDF
TF-IDF results: number of clusters = 5.

SECTION 5: DISCUSSION AND CONCLUSION

I have presented methods for clustering clinical trial data based on the eligibility criteria. Word2Vec and TF-IDF are trained on the text alone, without document tags, whereas the Doc2Vec models cluster the trials by their document tags. Of the 100 manually annotated pairs of clinical trials, the distributed memory model wrongly predicted 9 of the similar pairs as not similar, while the distributed bag of words model predicted 12 incorrectly; the distributed memory model therefore outperformed the distributed bag of words model. These clusters can help researchers retrieve trials based on eligibility criteria.
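For reference, a compact sketch of the modeling, similarity-lookup, and clustering steps reported above, assuming gensim 4.x, scikit-learn, the `build_corpus` output from the earlier preprocessing sketch, and a `raw_texts` list of eligibility-criteria strings. All hyperparameters other than dm (1 for PV-DM, 0 for PV-DBOW) are illustrative assumptions, not the settings behind the reported numbers.

```python
from gensim.models.doc2vec import Doc2Vec
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

# corpus: list of TaggedDocument from build_corpus() above (assumed).
# The two Doc2Vec variants compared in the results: PV-DM (dm=1) and PV-DBOW (dm=0).
pv_dm = Doc2Vec(corpus, dm=1, vector_size=100, min_count=2, epochs=20)
pv_dbow = Doc2Vec(corpus, dm=0, vector_size=100, min_count=2, epochs=20)

# Most similar trials for one tag, as in the "nct00820898" example.
print(pv_dm.dv.most_similar("nct00820898", topn=5))

# K-means over the learned document vectors; sweeping k and inspecting inertia
# gives the elbow curve (k = 4 and k = 10 appear in the results).
doc_vectors = pv_dm.dv[pv_dm.dv.index_to_key]
inertias = {k: KMeans(n_clusters=k, random_state=0).fit(doc_vectors).inertia_
            for k in range(2, 11)}

# TF-IDF baseline: vectorize the raw eligibility text and cluster with k = 5.
# raw_texts is assumed to hold one eligibility-criteria string per trial.
tfidf = TfidfVectorizer(stop_words="english")
tfidf_clusters = KMeans(n_clusters=5, random_state=0).fit_predict(
    tfidf.fit_transform(raw_texts))
```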
Challenges and Limitations
The clinical trial text was cumbersome to deal with, and generating clusters for human annotation was time consuming. More pairs of clusters need to be evaluated to assess accuracy.

Applications
Modeling based on clinical terms can help retrieve patient information from data sources, which usually carry patient ID tags. In the present scenario, Doc2Vec clustering could group articles published on the topic "corona" and help retrieve the relevant information.

References
Dai et al. (2015). Document embedding with paragraph vectors. arXiv.org.
Hao, T., Rusanov, A., Boland, M. R., & Weng, C. (2014). Clustering clinical trials with similar eligibility criteria features. Journal of Biomedical Informatics, 52, 112-120. https://doi.org/10.1016/j.jbi.2014.01.009
Bhattacharya, S., & Cantor, M. N. (2013). Analysis of eligibility criteria representation in industry-standard clinical trial protocols. Journal of Biomedical Informatics, 46(5), 805-813.
http://clinicaltrials.gov/

Acknowledgment
Thank you, Professor Jiaping Zheng, Assistant Professor, Department of BioHealth Informatics.

Q & A