The dataset used in this part is a subset of the data gathered from the ClinicalTrials.gov API. It combines qualitative and quantitative attributes and has 244 rows.
The data frame has a label column that gives the overall status of each clinical trial: 122 rows are labeled completed and 122 rows are labeled terminated.
The NB model on the record data has an accuracy of 0.77 when predicting whether a clinical trial is completed or terminated based on the conditions of the trial. Figure 1 shows relatively higher accuracy for terminated trials, with 23 of 27 predicted correctly, while the accuracy for completed trials is 2/3.
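The per-class figures and the overall accuracy can all be read off one confusion matrix. The sketch below shows the arithmetic, assuming rows are true labels and columns are predicted labels. Only the 23 of 27 count for terminated trials comes from the text; the 14 of 21 count for completed trials is a hypothetical value chosen to match the reported 2/3 ratio, since the raw count is not given.

```python
def per_class_accuracy(matrix, labels):
    """Fraction of each true class predicted correctly.

    matrix[i][j] is the number of rows whose true label is labels[i]
    and whose predicted label is labels[j].
    """
    result = {}
    for i, label in enumerate(labels):
        row_total = sum(matrix[i])
        result[label] = matrix[i][i] / row_total if row_total else 0.0
    return result


def overall_accuracy(matrix):
    """Correct predictions (the diagonal) divided by all predictions."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total


labels = ["completed", "terminated"]
confusion = [
    [14, 7],   # true completed: ASSUMED counts, chosen to give 2/3
    [4, 23],   # true terminated: 23 of 27 correct, as reported
]

class_accuracy = per_class_accuracy(confusion, labels)
accuracy = overall_accuracy(confusion)

for label in labels:
    print(f"{label}: {class_accuracy[label]:.3f}")
print(f"overall: {accuracy:.2f}")
```

Under these assumed counts the overall accuracy rounds to the reported 0.77, which suggests the per-class and overall figures are at least mutually consistent.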
Figure 2 ranks the importance of the attributes in predicting the label. EnrollmentCount plays the most important role, DesignPrimaryPurpose ranks second along with HealthyVolunteers, and AdverseEffectsorDeath follows. This indicates that, given the enrollment size, the primary purpose of the design, whether the trial accepts healthy volunteers, and the rate of adverse effects or death, it is possible to predict whether a trial will be completed or terminated, which can help investigators make informed decisions about funding or taking part in a clinical trial.
The boxplot shows that the total number of participants enrolled in completed trials is slightly higher than in terminated trials.
Trials whose primary purpose is health services research or screening have the highest proportion of completion. Trials designed primarily for supportive care have the lowest proportion, and trials with a treatment purpose have the second lowest.
Trials that accept healthy volunteers have a completion percentage more than twice as high as trials that do not.
The bar chart shows that trials with medium and high adverse effect rates are more likely to be completed than trials with no or very high adverse effect rates.
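The group comparisons above all come down to a completion proportion per category. A minimal sketch of that calculation follows, assuming each trial is a dictionary with a status column named OverallStatus (a hypothetical name; the source only calls it the label column). The eight rows are invented for illustration and are not taken from the dataset.

```python
from collections import defaultdict


def completion_rate_by_group(rows, group_key, status_key="OverallStatus"):
    """Return {group value: share of trials in that group that completed}."""
    totals = defaultdict(int)
    completed = defaultdict(int)
    for row in rows:
        group = row[group_key]
        totals[group] += 1
        if row[status_key] == "Completed":
            completed[group] += 1
    return {group: completed[group] / totals[group] for group in totals}


# Hypothetical rows, for illustration only.
rows = [
    {"HealthyVolunteers": "Yes", "OverallStatus": "Completed"},
    {"HealthyVolunteers": "Yes", "OverallStatus": "Completed"},
    {"HealthyVolunteers": "Yes", "OverallStatus": "Completed"},
    {"HealthyVolunteers": "Yes", "OverallStatus": "Terminated"},
    {"HealthyVolunteers": "No", "OverallStatus": "Completed"},
    {"HealthyVolunteers": "No", "OverallStatus": "Terminated"},
    {"HealthyVolunteers": "No", "OverallStatus": "Terminated"},
    {"HealthyVolunteers": "No", "OverallStatus": "Terminated"},
]

rates = completion_rate_by_group(rows, "HealthyVolunteers")
for group, rate in rates.items():
    print(f"HealthyVolunteers={group}: {rate:.0%} completed")
```

The same function can be pointed at DesignPrimaryPurpose or a binned adverse effect rate by changing group_key.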
The text dataset used in this part is the same one used in the decision tree text data part: brief summaries of clinical trials, labeled with four types of bacterial infections. The accuracy of the naive Bayes model here is 0.84. The confusion matrix shows that most summaries are predicted correctly for each label. All cirrhosis and meningococcal summaries are predicted correctly, and 84.6% of covid summaries are assigned to the correct group. The prediction for sepsis is weaker, with an accuracy of 63.7%.
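A multinomial naive Bayes classifier over word counts is one common way to classify short texts such as these summaries. The sketch below is a minimal from-scratch version, assuming whitespace tokenization and Laplace smoothing with alpha = 1. The source does not say which NB variant or preprocessing was used, and the example summaries are invented for illustration.

```python
import math
from collections import Counter, defaultdict


def train_multinomial_nb(docs, labels, alpha=1.0):
    """Count words per class and store log priors (alpha = Laplace smoothing)."""
    word_counts = defaultdict(Counter)
    vocab = set()
    for text, label in zip(docs, labels):
        tokens = text.lower().split()
        word_counts[label].update(tokens)
        vocab.update(tokens)
    class_doc_counts = Counter(labels)
    return {
        "alpha": alpha,
        "vocab": vocab,
        "word_counts": word_counts,
        "class_totals": {c: sum(word_counts[c].values()) for c in class_doc_counts},
        "log_prior": {c: math.log(n / len(docs)) for c, n in class_doc_counts.items()},
    }


def predict(model, text):
    """Return the class with the highest log posterior for the given text."""
    alpha = model["alpha"]
    vocab_size = len(model["vocab"])
    best_label, best_score = None, -math.inf
    for label, log_prior in model["log_prior"].items():
        denominator = model["class_totals"][label] + alpha * vocab_size
        score = log_prior
        for token in text.lower().split():
            if token not in model["vocab"]:
                continue  # words never seen in training are ignored
            count = model["word_counts"][label][token]
            score += math.log((count + alpha) / denominator)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical training summaries, two per label.
docs = [
    "sepsis patients with septic shock in intensive care",
    "antibiotic timing in sepsis and septic shock",
    "covid vaccine trial in adults",
    "covid infection treated with antiviral",
    "liver cirrhosis and ascites",
    "cirrhosis patients with portal hypertension",
    "meningococcal vaccine in adolescents",
    "meningococcal conjugate vaccine booster",
]
labels = ["sepsis", "sepsis", "covid", "covid",
          "cirrhosis", "cirrhosis", "meningococcal", "meningococcal"]

model = train_multinomial_nb(docs, labels)
print(predict(model, "septic shock in the intensive care unit"))
print(predict(model, "cirrhosis with ascites"))
```

Labels with distinctive vocabulary are easy to separate in this setup, while a label whose summaries share many words with other labels is harder, which may be part of why sepsis scores lower here.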
According to the attribute importance ranking produced by the code, the key feature is the enrollment count of the trial, and the second most important is the age groups of participants. The other attributes carry much lower importance.
The first tree uses all 8 attributes in training and gives the most accurate prediction of the three trees, though its accuracy is still not high. The second tree is built on the Design Intervention Model column and the primary design purpose, which have low importance weights. Thus, this tree has the lowest accuracy.
The third tree is built on enrollment count and age groups, the 2 most important attributes. Its accuracy is slightly lower than that of the first tree.
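One way to see why the choice of attributes affects tree accuracy is to compare how well a single split on each attribute separates the labels. The sketch below scores the best threshold split by weighted Gini impurity, where lower means purer child nodes. Gini is an assumption here, since the source does not state the split criterion, and both feature columns (including the name noisy_feature) are hypothetical values chosen to make the contrast clear.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels (0.0 means a pure node)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def best_threshold(values, labels):
    """Return (threshold, weighted Gini) of the best split 'value <= threshold'.

    Returns (None, parent impurity) if no split improves on the parent node.
    """
    best = (None, gini(labels))
    distinct = sorted(set(values))
    for low, high in zip(distinct, distinct[1:]):
        threshold = (low + high) / 2  # midpoint between neighbouring values
        left = [y for x, y in zip(values, labels) if x <= threshold]
        right = [y for x, y in zip(values, labels) if x > threshold]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        if score < best[1]:
            best = (threshold, score)
    return best


# Hypothetical data: T = terminated, C = completed.
status = ["T", "T", "T", "T", "C", "C", "C", "C"]
enrollment_count = [12, 20, 25, 40, 90, 150, 200, 320]  # informative by design
noisy_feature = [3, 9, 4, 8, 5, 7, 2, 6]                # weakly related by design

strong_threshold, strong_score = best_threshold(enrollment_count, status)
weak_threshold, weak_score = best_threshold(noisy_feature, status)

print("enrollment_count:", strong_threshold, round(strong_score, 3))
print("noisy_feature:", weak_threshold, round(weak_score, 3))
```

A tree restricted to attributes like the second column has little to split on, which is consistent with the second tree having the lowest accuracy.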