Leukemia Candidate Genetic Markers-An Evaluation

Background: The cause of leukemia, the most common type of childhood cancer, remains unknown. Genetic studies have linked more than a thousand genes to the disease. Methods: A total of 1,093 leukemia candidate genes, identified from leukemia-gene relation data extracted from the ResNet 11 Mammalian database and supported by 6,524 references, were evaluated. Four network metrics were used to evaluate the potential relevance of each gene to leukemia. Gene-set enrichment, sub-network enrichment, and network-connectivity analyses were conducted on gene attributes. An expression dataset of 71 leukemia patients and 76 healthy controls was employed for validation. Results: A total of 952 of the 1,093 genes were enriched in 100 pathways (p < 3.3e-20), demonstrating strong gene-gene interaction. Network metrics analysis identified 5 genes (TP53, CTNNB1, AKT1, TNF, and RARA) as the top leukemia candidates, as measured by both functional diversity and replication frequency. Validation using expression data showed that the 1,093 genes as a whole, and the top genes identified by the proposed metrics, were effective in distinguishing leukemia patients from controls (maximum classification ratio = 95.3 % with permutation p-value = 0.0054). Conclusion: The genetic causes of leukemia are linked to a genetic network composed of a large number of genes. This network, together with the network metrics provided in this study, could provide a basis for further molecular studies in the field.


INTRODUCTION
Leukemia is a group of cancers, usually originating in bone marrow, which result in great numbers of abnormal white blood cells. It is the most common type of childhood cancer, even though approximately 90 % of all leukemia cases are in adults [1]. The precise causes of leukemia remain unknown. Inherited and environmental factors are both thought to be involved [2].
More than a thousand genes related to leukemia have been reported, many of which, such as FLT3, WT1, TET2, and KRAS, have been suggested as potential biomarkers for the disease [3][4][5]. Several genes, such as IL2 and CSF3, have been studied in clinical trials [6,7]. Many articles have reported genetic changes and quantitative gene changes in leukemia [8,9]. Both increased and decreased gene expression levels/activities have been observed [10-12]. Many genes have been reported to influence the pathogenic development of leukemia via unknown mechanisms [13].
We found no study reporting a systematic evaluation of the quality and strength of these reported genes as a functional network/group in the underlying biological process of leukemia. Instead of focusing on specific genes, this study attempts to provide a comprehensive view of the genetic map, and uses gene set enrichment analysis (GSEA) and sub-network enrichment analysis (SNEA) to study the underlying functional profiles of the genes identified [14]. The hypothesis is that leukemia genes are functionally linked to each other and co-regulate leukemia's pathogenic development via multiple pathways.

MATERIALS AND METHODS
The study workflow was as follows: 1) acquisition of a leukemia-gene relation dataset and identification of leukemia candidate genes; 2) enrichment analysis of the identified genes to study their pathogenic significance to leukemia; 3) network metrics analysis to identify genes having specific significance; 4) network connectivity analysis (NCA) to test functional associations between the reported genes; and, 5) validation using an independent gene expression data set.

Leukemia-Gene relation Data Acquisition
Leukemia-gene relation data were extracted from the Pathway Studio ResNet® Mammalian database, updated as of May 2016. The genes identified were used as the candidate network nodes/genes.

Literature metrics analysis
Two scores were proposed for each gene-disease relationship as literature metrics.
The gene reference score (RScore) is the number of references underlying a gene-disease relationship, as defined by Eq. (1).
$$\mathrm{RScore} = \text{number of references underlying a relationship} \quad (1)$$
The gene age score (AScore) is the publication age of the earliest reference supporting a gene-disease relationship, as defined by Eq. (2),
$$\mathrm{AScore} = \max_{1 \le i \le n} \mathrm{ArticlePubAge}_i \quad (2)$$
where n is the total number of references supporting a gene-disease relation, and
$$\mathrm{ArticlePubAge}_i = \text{current date} - \text{publication date of the } i\text{th reference} \quad (3)$$
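As an illustration of the two literature metrics, the following minimal sketch computes an RScore and an AScore from a list of publication years. It assumes that publication age is measured in whole calendar years relative to a chosen reference year; the function names and the example years are hypothetical and are not taken from the study data.

```python
def rscore(pub_years):
    """RScore: number of references supporting a gene-disease relationship."""
    return len(pub_years)


def ascore(pub_years, current_year):
    """AScore: age (in years) of the earliest supporting reference.

    Assumption: age = current_year - publication year, in whole years.
    """
    if not pub_years:
        raise ValueError("at least one supporting reference is required")
    return max(current_year - year for year in pub_years)


# Hypothetical publication years for one gene-disease relationship.
example_years = [1998, 2005, 2012, 2015]

print(rscore(example_years))        # 4
print(ascore(example_years, 2016))  # 18
```

With this convention, a gene first reported in the reference year itself has an AScore of 0.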

Enrichment metric analysis
Given a disease associated with a set of genetic pathways ℛ, the gene-wise enrichment score (EScore) for the kth gene, within a gene set of size n, is defined in Eq. (4) as
$$\mathrm{EScore}_k = \frac{\sum_{i=1}^{m} \left(-\log_{10} \mathrm{pValue}_i\right)}{\max_{1 \le i \le n} \left(-\log_{10} \mathrm{pValue}_i\right)} \quad (4)$$
where pValue_i is the enrichment p-value of the ith pathway with the gene set, and m is the number of pathways in ℛ that include the kth gene. The PScore for the kth gene is defined as
$$\mathrm{PScore}_k = m = \text{number of pathways in } \mathcal{R} \text{ that include the } k\text{th gene} \quad (5)$$
The PScore indicates how many disease-related pathways are associated with a gene. The EScore reflects the significance of the pathways involved.
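The sketch below shows one possible reading of the PScore and EScore definitions. It assumes that the normalizing term is the largest -log10 p-value among all enriched pathways; the pathway names, p-values, and gene memberships are hypothetical toy values chosen only to make the arithmetic easy to follow.

```python
import math


def gene_scores(pathways, gene):
    """Return (PScore, EScore) for one gene.

    `pathways` maps a pathway name to (enrichment p-value, set of member genes).
    Assumption: the EScore is normalized by the largest -log10(p) over all
    pathways in `pathways`.
    """
    neg_log = {name: -math.log10(p) for name, (p, _) in pathways.items()}
    hits = [name for name, (_, members) in pathways.items() if gene in members]
    pscore = len(hits)  # number of enriched pathways containing the gene
    escore = sum(neg_log[name] for name in hits) / max(neg_log.values())
    return pscore, escore


# Hypothetical toy pathways (not taken from the study).
toy_pathways = {
    "pathway_A": (1e-40, {"TP53", "AKT1"}),
    "pathway_B": (1e-20, {"TP53", "CTNNB1"}),
    "pathway_C": (1e-10, {"TNF"}),
}

p, e = gene_scores(toy_pathways, "TP53")
print(p, round(e, 3))  # 2 1.5
```

A gene that appears in several highly significant pathways receives both a high PScore and a high EScore, whereas a gene in a single weakly enriched pathway scores low on both.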

Enrichment analysis
GSEA and SNEA [15] were performed on 1) the entire gene list (1,093 genes) and 2) two subgroups with the highest metric scores, to better understand the underlying functional profiles and pathogenic significance of the genes. An NCA was conducted on the two subgroups.
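The statistical test used by Pathway Studio for enrichment is not described in this section. As a stand-in, the sketch below uses a hypergeometric over-representation test, a common way to ask whether a gene list overlaps a pathway more than expected by chance. The function name and all counts in the example are hypothetical.

```python
from math import comb


def hypergeom_pvalue(universe, pathway_size, gene_set_size, overlap):
    """P(X >= overlap) for a hypergeometric draw.

    universe      -- total number of genes considered
    pathway_size  -- genes in the universe that belong to the pathway
    gene_set_size -- size of the candidate gene list
    overlap       -- candidate genes that fall inside the pathway
    """
    upper = min(pathway_size, gene_set_size)
    total = comb(universe, gene_set_size)
    tail = sum(
        comb(pathway_size, k) * comb(universe - pathway_size, gene_set_size - k)
        for k in range(overlap, upper + 1)
    )
    return tail / total


# Hypothetical example: 20 genes in total, a pathway of 5 genes,
# and a candidate list of 4 genes that all fall inside the pathway.
p_value = hypergeom_pvalue(universe=20, pathway_size=5, gene_set_size=4, overlap=4)
print(p_value < 0.01)  # True
```

Smaller p-values indicate that the overlap is unlikely under random selection, which is the sense in which a pathway is called "enriched" here.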

Validation using gene expression data
The hypothesis is that significant leukemia candidate gene sets should help distinguish leukemia patients from healthy controls. To evaluate the effectiveness of the selected genes and the proposed metrics, a Euclidean distance-based multivariate classification [16] was performed on an expression dataset, followed by leave-one-out (LOO) cross-validation, using the overall gene set and the subsets selected by the different scores as tentative markers. A permutation test of 5,000 runs was then conducted to test the hypothesis that a randomly selected gene set of the same size could lead to equal or better classification accuracy.
The expression data came from 147 subjects, comprising samples from 71 chronic lymphocytic leukemia (CLL) tumors and 76 samples of sorted CD19-positive B cells from healthy donors (NCBI GEO: GSE50006). Of the genes in this dataset, 1,031 overlapped with the candidate leukemia gene pool identified from the leukemia-gene dataset.
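The validation procedure can be sketched as follows. The exact classifier of reference [16] is not specified here, so this example assumes a nearest-centroid rule with Euclidean distance; the toy expression matrix, the labels, and the function names are hypothetical, and the number of permutation runs is reduced from the 5,000 used in the study to keep the example fast.

```python
import math
import random


def loo_classification_ratio(samples, labels, gene_idx):
    """Leave-one-out accuracy of a nearest-centroid (Euclidean) classifier
    that uses only the gene columns listed in gene_idx."""
    correct = 0
    for held in range(len(samples)):
        centroids = {}
        for cls in set(labels):
            rows = [samples[i] for i in range(len(samples))
                    if i != held and labels[i] == cls]
            centroids[cls] = [sum(r[g] for r in rows) / len(rows) for g in gene_idx]
        point = [samples[held][g] for g in gene_idx]
        predicted = min(centroids, key=lambda c: math.dist(point, centroids[c]))
        correct += predicted == labels[held]
    return correct / len(samples)


def permutation_pvalue(samples, labels, gene_idx, runs=200, seed=0):
    """Share of random gene sets of equal size that match or beat gene_idx."""
    rng = random.Random(seed)
    observed = loo_classification_ratio(samples, labels, gene_idx)
    n_genes = len(samples[0])
    hits = 0
    for _ in range(runs):
        random_idx = rng.sample(range(n_genes), len(gene_idx))
        if loo_classification_ratio(samples, labels, random_idx) >= observed:
            hits += 1
    return observed, hits / runs


# Hypothetical toy data: 6 subjects x 4 genes; only gene 0 separates the groups.
toy_samples = [
    [5.0, 1.0, 2.0, 3.0],
    [5.2, 2.0, 1.0, 3.1],
    [4.9, 1.5, 2.5, 2.9],
    [1.0, 1.2, 2.1, 3.0],
    [1.1, 1.9, 1.2, 3.2],
    [0.9, 1.4, 2.4, 2.8],
]
toy_labels = ["CLL", "CLL", "CLL", "control", "control", "control"]

ratio, p_perm = permutation_pvalue(toy_samples, toy_labels, gene_idx=[0])
print(ratio)  # 1.0
```

The permutation p-value is the fraction of random gene sets that classify at least as well as the selected set, so a small value indicates that the selected genes carry more discriminating signal than genes chosen at random.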

RESULTS

Identification of candidate genes
There were 1,093 leukemia candidate genes identified from the leukemia-gene relation data set. They are supported by 6,524 articles (Supplementary Material 1). Of these genes, 994 (90.94 %) presented a regulation relationship to the disease; 133 (12.17 %) a genetic change; 61 (5.58 %) a quantitative change; 52 (4.76 %) cell expression; 20 (1.83 %) a biomarker relationship; 17 (1.56 %) a clinical trial relationship; and 5 (0.46 %) a state change. A total of 148 (13.54 %) genes have been reported to have multiple relationships with the disease. There were 945 (86.46 %) genes that presented 1 type of relationship to the disease, 113 (10.34 %) that presented 2 types, 31 (2.84 %) 3 types, 3 (0.27 %) 4 types, and 1 (0.09 %) 5 types. For a detailed definition and description of these relation types, refer to the 'Relations: Definitions and Annotations' section at http://pathwaystudio.gousinfo.com/ResNetDatabase.html. Genes marked with 'm*' and 'r*' are genes identified in mice and rats, respectively.

Fig. 1 Gene Relation Type Distribution of the 1,093 Genes
The publication date distribution for these 6,524 articles appears in Fig. 2 (a). Novel genes were reported in every year. The articles have an average publication age of 6.0 years, indicating that most were published recently. Publication date distributions for most of the articles underlying the 1,093 genes were similar (Fig. 2 (b)).

Marker ranking
Of these 1,093 genes, 31 were reported in the period January through April 2016 (Table 1).

Enrichment analysis on top 31 genes with highest scores
The GSEA and SNEA results of the top 31 genes listed in Table 1 were compared. The top 10 pathways/sub-networks for the AScore group and the RScore group are presented in Table 4 and Table 5.
Complete results appear in Supplementary Material 2 and 3.
Using a p-value threshold of p < 1E-4, the 31 genes with top AScores were enriched within 10 pathways/groups, whereas the 31 genes of the RScore group were enriched within 153.
The top 10 pathways enriched with the 31 genes from the AScore and RScore groups appear in Table 4. A complete listing of these pathways/gene sets appears in Supplementary Material 2.
The top 10 disease-related sub-networks enriched with a p-value < 5E-254 appear in Table 3. Complete results appear in Supplementary Material 3. This suggests that the newly reported genes are functionally distinct from, and less significant than, those most frequently reported.
Of the 10 pathways/gene sets enriched by the RScore group (Table 4), 4 also appear among the top 20 pathways/groups enriched with 857/1,093 genes (Table 2). The AScore group had none.
The SNEA results consist of an enrichment analysis against disease sub-networks. Table 5 presents the top 10 disease-related sub-networks enriched by the top 31 genes from the AScore group and the RScore group, respectively. Complete results appear in Supplementary Material 3.

Connectivity analysis
An NCA was performed on the top 31 genes with the highest RScores and AScores (from Table 1) to generate gene-gene interaction networks. Results showed that, for the RScore group, there were 441 connections among the 31 genes, with significant literature support (Fig. 3 (a)). In contrast, genes within the AScore group demonstrated only 15 relations among 19/31 genes, with 12 genes showing no direct relations with other genes in the group (Fig. 3 (b); highlighted in green). This observation was consistent with the GSEA and SNEA results, and suggests that genes with the lowest AScore were not as functionally close to each other as those in the RScore group.

EScore analysis
Using GSEA, two biological metrics, EScore and PScore, were generated for each gene. The PScore represents how many leukemia-associated pathways involve the gene. The EScore reflects the significance of those pathways.
A correlation analysis using averaged metric values of all 1,093 genes at a group level was conducted to compare the EScore and PScore with the two literature metrics (Fig. 4 (a)). A group size of 36 genes was used. The 1,093 genes were sorted by RScore, and the values of each metric were then averaged using a moving window of length 36.
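The group-level comparison can be sketched as below. The text does not state which correlation coefficient was used or how far the window advances at each step, so this example assumes a Pearson coefficient and a step of one gene; the short score lists and the window of 3 are hypothetical stand-ins for the 1,093 genes and the window of 36.

```python
import math


def moving_average(values, window):
    """Mean of every consecutive run of `window` values (step of one)."""
    return [sum(values[i:i + window]) / window
            for i in range(len(values) - window + 1)]


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


# Hypothetical per-gene scores, already sorted by RScore in descending order.
rscores = [50, 40, 30, 20, 10, 5]
pscores = [30, 28, 20, 12, 6, 2]

window = 3  # the study used 36
group_r = moving_average(rscores, window)
group_p = moving_average(pscores, window)
print(len(group_r))  # 4
correlation = pearson(group_r, group_p)
```

Averaging within a window smooths gene-level noise, which is why group-wise correlations can be much stronger than correlations computed gene by gene.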
Results showed that the average scores correlate strongly, especially for the top-ranked groups (Fig. 4 (a) and Table 6). Group-wise PScore and EScore were extremely strongly correlated (ρ = 0.99). In addition to the group-wise correlation analysis, a cross-analysis of the top 31 genes selected using different scores was performed and is presented in a Venn diagram (Fig. 4 (b)).

Validation using expression data
Significant leukemia candidate gene sets were hypothesized to help distinguish leukemia patients from healthy controls. If the selected gene set (1,093 genes) and the top genes selected by the proposed metric scores are significant to leukemia pathogenesis, then they should lead to significantly higher classification accuracies than randomly selected gene sets. To test the hypothesis that the 1,093-gene pool and the 4 proposed metrics are effective, classification and leave-one-out (LOO) cross-validation were conducted on a gene expression dataset (NCBI GEO: GSE50006). This was followed by a 5,000-run permutation test.
The 1,093 genes were ranked by the different metric scores. The top k (k = 1, 2, …) genes were then used as input variables for classification and LOO cross-validation. LOO results using different numbers of genes appear in Fig. 5, with the maximum classification ratios (maxCRs) marked at the corresponding number of genes (see also Table 7). The top genes selected by the different scores can lead to the highest classification accuracies, and adding more variables/genes with lower scores does not necessarily help, which demonstrates the effectiveness of the proposed metrics (Fig. 5). The four groups (RScore, AScore, PScore, and EScore) obtained maximum CRs of 94.6 %, 95.3 %, 92.1 %, and 92.1 %, respectively, with a relatively small number of genes. The permutation p-values of all these groups were below the 0.05 threshold. The top 33 genes by AScore led to the highest CR (95.3 %), with a permutation p-value of 0.0054. Employing all 1,031 matched genes of the 1,093 resulted in a CR of 92.1 %, with a permutation p-value of 0.037. This suggests that the majority of the 1,093 genes were effective for leukemia prediction. The results of the LOO cross-validation and permutation approaches for the different gene sets appear in Table 7.

DISCUSSION
This study proposed 4 network metrics to evaluate the 1,093 candidate genes within a genetic network for leukemia. It employed an independent gene expression data set to validate their effectiveness. GSEA, SNEA, and NCA were also used to study the pathogenic significance of these candidate genes in the disease.
The 1,093 genes identified were not equal in terms of publication frequency (RScore), novelty (AScore), or functional diversity (EScore). Using the proposed quality metric scores, the genes may be ranked according to different needs/significance and the top ones selected for further analysis (see Supplementary Material 1). Some frequently replicated genes (with a high RScore), such as TP53, CTNNB1, AKT1, TNF, and RARA, also demonstrate a high EScore and PScore (see Fig. 4 (b)). These genes have an average support of 58.80 ± 11.17 references, and were connected to multiple significantly enriched pathways (34.00 ± 2.55). The results suggest that these genes are biologically significant in the disease. There were 23 genes observed in both the PScore group and the EScore group (Fig. 4 (b)) which were not in the RScore group. Although they were older (AScore: 12.48 ± 6.71 years) and were not frequently replicated (10.04 ± 8.94 references), the results suggest that they merit further study.
The results demonstrate that most genes identified in this study were included in previously implicated leukemia pathways. These included 6 cell apoptosis pathways, 9 cell growth and proliferation pathways, 11 transcription factor pathways, 7 protein phosphorylation related pathways, 3 immune system pathways, 8 protein kinase related pathways, and 2 neuronal system pathways [21][22][23][24][25][26][27]. We hypothesize that the majority of these literature-reported genes, especially those identified from significantly enriched pathways, should be functionally linked to leukemia. Although there may be false positives from the separate studies in the publications, it is less likely that a large group of genes was falsely perturbed [14].
When members of a gene set exhibit strong cross-correlation, GSEA boosts the signal-to-noise ratio, making it possible to detect modest changes in individual genes [14]. The NCA showed that many of the frequently reported genes related to leukemia are functionally associated with one another (Fig. 3). This is supported by hundreds of scientific reports. It should be noted that 952/1,093 genes were included in the top 100 pathways enriched (p-value < 3.3e-020), and that 857/1,093 were included in the top 20 pathways, which appear in Table 2 (p-value < 3.7e-041).
If "functionally related" is defined as coexistence within the same genetic pathway, then 87.1 % of the 1,093 genes (952/1,093) are functionally related. The results indicate that these functionally linked genes are more likely to be true discoveries than noise (false positives), because a group of functionally related genes is less likely to have been falsely identified than a single gene.
A sub-network enrichment analysis (SNEA) was performed, which provides high confidence levels when interpreting experimentally derived genetic data against a background of previously published results (Pathway Studio Web Help). SNEA results revealed that many of the 1,093 genes (> 90 %) have also been identified as causal genes for other health disorders, such as breast cancer, hepatocellular carcinoma, and lung cancer, all of which have strong associations with leukemia [28][29][30].
LOO cross-validation and permutation testing on a gene expression data set (NCBI GEO: GSE50006) identified several significant gene combinations, selected by the different scores, which generated the highest CRs. Permutation results showed that the top genes as determined by these four scores, as well as the 1,031/1,093 genes, were effective in predicting leukemia (p-value < 0.05). This indicates the effectiveness of the proposed metric scores. The top 33 genes selected by AScore reached the highest CR, 95.3 %, with a permutation p-value of 0.0054. This suggests that the genes identified in the earliest stage of leukemia genetic studies play a significant role in leukemia prediction.
This study has several limitations that should be considered in future work. The 1,093 genes were identified from leukemia-gene relation data extracted from the Pathway Studio ResNet database. Although these data are supported by 6,524 articles, some leukemia-gene relationships may not have been identified. The 4 proposed metrics were effective in selecting the top genes for leukemia prediction. Further network analysis with more experimental data may extract additional useful features for identifying biologically significant genes.

CONCLUSION
Leukemia is a complex, genetically caused disease, with the genetic causes linked to a large gene network. Integrating network gene-disease relation data and experimental data with GSEA, SNEA, and NCA may provide an effective approach to identifying potential target genes. This study provides an overview map of the current field of genetic research on leukemia, which could be used as a basis for future biological/genetic studies.

Fig. 2 Histogram of publications reporting gene-disease relationships between leukemia and the 1,093 Genes. (a) Number of articles published by year; (b) Gene-wise publication date distribution of the supporting references, with the mean marked as a red star.

Fig. 3 Connectivity Networks built by 31 Genes from Different Groups. (a) 31 genes from RScore group; (b) 31 genes from AScore group. The networks were generated using Pathway Studio. Unrelated genes appear in green.

Table 1 also lists the top 31 genes with the highest RScores (in descending order). Full results appear in Supplementary Material 1.

Table 2. Molecular Function Pathways/Groups Enriched by 1,093 Genes Reported
Note: The Jaccard similarity is a statistic used to compare the similarity and diversity of two sample sets. It is defined by J(A, B) = |A ∩ B| / |A ∪ B|, where A and B are the two sample sets.
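As a small worked example of the Jaccard similarity mentioned in the note, the sketch below compares two gene sets. The gene sets are hypothetical and serve only to illustrate the arithmetic (size of the intersection divided by size of the union).

```python
def jaccard(set_a, set_b):
    """Jaccard similarity: |A intersect B| / |A union B| (0.0 if both are empty)."""
    a, b = set(set_a), set(set_b)
    union = a | b
    if not union:
        return 0.0
    return len(a & b) / len(union)


# Hypothetical gene sets: 2 shared genes out of 5 distinct genes.
candidate_genes = {"TP53", "AKT1", "TNF"}
pathway_genes = {"TP53", "TNF", "KRAS", "FLT3"}

print(jaccard(candidate_genes, pathway_genes))  # 0.4
```

A value of 0 means the two sets share no genes, and a value of 1 means they are identical.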

Table 3. Sub-networks Enriched by the 1,093 Genes
Gene Set Seed | Total # of Neighbors | Overlap | p-value | Jaccard Similarity
Many of these reported leukemia-related genes are associated with other cancers that were linked to leukemia, with a large overlap (Jaccard similarity > 0.18).