The text data presents the description of 125 clinical trials coming from 4 different bacterial infection conditions, cirrhosis, sepsis, covid and meningococcal. The dataset has 2 columns, one is label and another is the text. This dataset is a subset of collected data in Data Gathering part. The text data are transformed into matrix of token counts.
Three trees are created with different parameter values. The first tree has criterion of entropy and the best
split. The second tree is set on criterion of Gini
and best
splitter.
And the third one has the criteria gini
, random splitter
and max depth of 5. Since the third tree has the limit on max depth, it has smaller size than other 2.
The first two trees presents all pure child nodes. What is more, the first tree and the second tree share a common root node vaccine
. The word liver
appears in high level in all these three trees. It clearly distinguish covid
from other three labels.
Each tree has its own confusion matrix. Generally, all three trees has good prediction on label cirrhosis
and label covid
, while relatively poor on sepsis
and least on meningococcal
.
The record dataset shows the record of 400 clinical trials, which has 200 trials labelled as Completed
and another 200 trials labelled as Terminated
. This dataset is a sample taken from the gathered and cleaned data earlier in this project, and has eight attributes.
It has mixed quantitative and qualitative data.
According to the ranks of the importance of attributes from the code, the key feature is the enrolment count of the trial, and the second important one is age groups of participants. Other attributes weigh way lower importance.
The first tree includes all 8 attributes in training, which gives the most accurate prediction among these three trees, though it do not have high accuray. The second tree is built based on the column of Design Intervention Model and the primary purpose of designing, which have low importance weight.
Thus, this tree has the lowest accuracy.
The third tree is built on the attributes enrolment count and age groups, which are the 2 most important factors. It has a slightly lower accuracy than the first tree.
Generally, the decision tree built for text data is a good model for predicting the clinical trials of cirrhosis
, covid
and sepsis
based on their description.
Word liver
clearly distinguish covid from other three labels, since covid is more about a localized respiratory condition, while other three conditions are systemic conditions, involving with whole body blood circulation which closely related to liver.
What is more, cirrhosis is a type of liver infection.
From the created Wordcloud, it is also interesting to observe that the vaccine
is not a frequent word in description of covid-19
trials, instead, the words like treatment
or words related to virus occurs more frequently, which suggests that many clinical trials on covid-19 are focusing on anti-viral treatment. Sepsis is a type of blood infection.
The WordCloud for sepsis shows the frequent words antibiotic
, blood
, infections
and etc. , which suggests that antibiotics are one of the major focus on this field. What is more, lots of clinical trials on meningococcal are investigating the vaccine.
The decision tree built for record data are capable of predicting whether the clinical trial will complete or terminate based on the attributes of the trial, though its accuracy is median.