In silico analysis of GATA4 variants demonstrates main contribution to congenital heart disease

Introduction: Congenital heart disease (CHD) is the most common congenital abnormality and the main cause of infant mortality worldwide. Some of the mutations that occur in the GATA4 gene region may result in different types of CHD. Here, we report our in silico analysis of gene variants to determine the effects of the GATA4 gene on the development of CHD. Methods: Online 1000 Genomes Project, ExAC, gnomAD, GO-ESP, TOPMed, Iranome, GME, ClinVar, and HGMD databases were drawn upon to collect information on all the reported GATA4 variations. The functional importance of the genetic variants was assessed by using SIFT, MutationTaster, CADD, PolyPhen-2, PROVEAN, and GERP prediction tools. Thereafter, network analysis of the GATA4 protein via STRING, normal/mutant protein structure prediction via HOPE and I-TASSER, and phylogenetic assessment of the GATA4 sequence alignment via ClustalW were performed. Results: The most frequent variant was c.874T>C (45.58%), which was reported in Germany. Ventricular septal defect was the most frequent type of CHD. Out of all the reported variants of GATA4, 38 variants were pathogenic. A high level of pathogenicity was shown for p.Gly221Arg (CADD score=31), which was further analyzed. Conclusion: The GATA4 gene plays a significant role in CHD; we, therefore, suggest that it be accorded priority in CHD genetic screening.


Introduction
Congenital heart disease (CHD) is the most common congenital malformation and a significant cause of childhood mortality, with an estimated prevalence of 1% of infants born each year. 1,2 Cardiovascular abnormalities are reported in approximately 29% of infants who die. CHD can be caused by variants in different genes whose roles have evolved. The number of genes and variants thereof involved in the pathogenesis of CHD has increased, and an accurate determination of the molecular mechanisms of CHD remains particularly challenging due to genetic heterogeneity and incomplete penetrance. 3 The differential diagnosis of CHD is also extremely complex in that it is a multifactorial disease encompassing both genetic predisposition and environmental components. 4 Thus, it is vitally important to identify disease-causing genetic variants. 5 Some CHD-associated genes encode transcription factors such as GATA4, NKX2-5, and TBX5, and a number of variants identified in these genes have been associated with impaired cardiac structure and function. 1 GATA-binding factor 4 (GATA4) (OMIM: 600576) is a member of the 6-member GATA family of transcription factors: GATA1, GATA2, GATA3, GATA4, GATA5, and GATA6. Amongst GATA-binding proteins, GATA1-3 are expressed in hematopoietic stem cells as significant regulators, whereas GATA4-6 are expressed in different mesoderm- and endoderm-derived tissues such as the heart, the lung, the gonad, the gut, and the liver. 6 Variants in the GATA4, GATA5, and GATA6 genes have been found in patients with various types of CHD. [7][8][9] GATA proteins comprise 2 conserved zinc finger domains (ZNI and ZNII), which serve various functions, including DNA attachment, GATA4 preservation, and interactions with other proteins and the target DNA sequence. The GATA4 gene consists of 7 exons located on chromosome 8p23.1-p22. The gene encodes one of the earliest-expressed transcription factors with 442 amino acids and is imperative for normal cardiogenesis. 
GATA4 is significantly expressed in embryonic development, with the expression continuing in the adult myocardium. [10][11][12] A rise has been reported in the number of patients with CHD who reach adulthood. 13 This transcription factor

Frequency
The frequencies of the selected variants were determined using the aforementioned databases. Furthermore, the number of participants and the number of individuals carrying variations in the studied populations were reported.
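The frequency calculation described above can be sketched as follows. This is a minimal illustration that assumes the frequency is the percentage of studied participants who carry a variant; the function name and the counts are hypothetical and are not taken from the study data.

```python
def variant_frequency(carriers: int, participants: int) -> float:
    """Return the percentage of participants who carry a variant."""
    if participants <= 0:
        raise ValueError("participants must be positive")
    if not 0 <= carriers <= participants:
        raise ValueError("carriers must be between 0 and participants")
    return 100.0 * carriers / participants


# Hypothetical counts, chosen only to illustrate the arithmetic.
print(round(variant_frequency(23, 100), 2))
print(round(variant_frequency(1, 300), 2))
```

Reporting the raw counts alongside the percentage, as done in this study, lets readers judge how much weight a frequency from a small cohort deserves.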

Computational Methods
Given its increasing importance and use to determine the possible effects of genetic variants, computational analysis was employed in the present study. The variants of the GATA4 gene and their correlations with the molecular pathogenesis of CHD were further explored by predicting the pathogenicity/tolerance of the variants through the following bioinformatics tools: SIFT (Sorting Intolerant from Tolerant; https://sift.bii.a-star.edu.sg/www/SIFT_seq_submit2.html), 32 PolyPhen-2 (Polymorphism Phenotyping, version 2; http://genetics.bwh.harvard.edu/pph2/), 33 PROVEAN (Protein Variation Effect Analyzer, version 1.1.3; http://provean.jcvi.org/seq_submit.php), 34 CADD (Combined Annotation-Dependent Depletion; https://cadd.gs.washington.edu/), 35 MutationTaster (http://www.mutationtaster.org/), 36 and GERP (Genomic Evolutionary Rate Profiling; http://mendel.stanford.edu/SidowLab/downloads/gerp/). 37 All these bioinformatics tools are capable of distinguishing pathogenic from nonpathogenic alterations. Protein sequences in the FASTA format (NM_002052.5), the positions and substitutions of amino acids, and the positions of chromosomes were used as input data. A SIFT score of 0.05 or less is regarded as deleterious, and a SIFT score of greater than 0.05 is considered to signify a tolerated variant. 32 PolyPhen-2 results are shown with qualitative levels as benign, possibly damaging, and probably damaging. PolyPhen-2 prediction outputs have a numerical score range of 0 to 1. The cutoff score considered for PolyPhen-2 is 0.5, and variants with scores equal to or greater than 0.5 are predicted to be deleterious. 33,38 The cutoff score for PROVEAN is −2.5, and variants with scores equal to or less than −2.5 are assigned as deleterious. 34 Also calculated in the current investigation was the CADD score. All genomic features used to calculate the CADD score via a machine-learning model are summarized into a Phred score with a cutoff point of 20. Disease-causing variants display a high Phred score (>20), whereas a low score (<20) signifies less pathogenicity. 35,39 MutationTaster, which was applied for all the detected variants in the present study, considers an alteration to be a polymorphism if it is reported as a single-nucleotide polymorphism (SNP) in the HapMap data and the 1000 Genomes Project. Conversely, any alteration that could result in a premature termination codon and ultimately lead to nonsense-mediated mRNA decay is considered a disease-causing variant. GERP is an evolutionary measurement tool whose results are based on multi-species sequence alignment by comparison with neutral expectation. GERP scores quantify the reduction in the number of substitutions relative to that expectation. Positive scores indicate a substitution deficit, while negative scores show that a site is probably evolving neutrally. 40
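The score cutoffs described above can be expressed as a short sketch. It assumes the conventions of the tools themselves: SIFT at or below 0.05, PolyPhen-2 at or above 0.5, PROVEAN at or below −2.5, and a CADD Phred score above 20 are treated as deleterious. The function names and example scores are hypothetical. The second function relies on the general definition of a Phred-scaled score, under which a value of 20 corresponds to the top 1% of ranked variants and 30 to the top 0.1%.

```python
def classify(sift=None, polyphen2=None, provean=None, cadd_phred=None):
    """Map raw tool scores to deleterious (True) or tolerated (False) calls."""
    calls = {}
    if sift is not None:
        calls["SIFT"] = sift <= 0.05          # low SIFT score = deleterious
    if polyphen2 is not None:
        calls["PolyPhen-2"] = polyphen2 >= 0.5
    if provean is not None:
        calls["PROVEAN"] = provean <= -2.5    # more negative = deleterious
    if cadd_phred is not None:
        calls["CADD"] = cadd_phred > 20       # a score of exactly 20 is not flagged
    return calls


def cadd_top_fraction(phred):
    """Fraction of ranked variants scoring at least as high as this Phred value."""
    return 10 ** (-phred / 10)


# Hypothetical scores for a single missense variant (not taken from Table 3).
print(classify(sift=0.01, polyphen2=0.98, provean=-6.2, cadd_phred=31))
print(cadd_top_fraction(31))
```

Keeping each tool's call separate, rather than collapsing them into one label, makes disagreements between predictors visible before any consensus rule is applied.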

GATA4 Network Analysis
The primary purpose of the STRING (Search Tool for the Retrieval of Interacting Genes/Proteins) database is to identify functional associations between proteins. This web-based tool reports the interactions of proteins involved in a particular biological function. 41 STRING (version 11.0; https://string-db.org/) was used to identify the known and predicted interactions between the GATA4 protein and other related proteins in a cell. 42
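The analysis here was run through the STRING web interface, but a comparable query can be composed programmatically. The sketch below only builds a request URL for the public STRING REST API and does not send it. The choice of the `interaction_partners` method, the limit of 10 partners, and the helper name are assumptions made for illustration, and the live API may serve a later STRING release than version 11.0.

```python
from urllib.parse import urlencode

STRING_API = "https://string-db.org/api"


def string_partners_url(gene, species=9606, limit=10, fmt="tsv"):
    """Build a STRING query URL for the interaction partners of one protein.

    species=9606 is the NCBI taxonomy identifier for Homo sapiens.
    """
    params = urlencode({"identifiers": gene, "species": species, "limit": limit})
    return f"{STRING_API}/{fmt}/interaction_partners?{params}"


print(string_partners_url("GATA4"))
```

The resulting URL can be opened in a browser or fetched with any HTTP client; the response lists each partner with its combined interaction score.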

Prediction of Normal and Mutant Protein Structures
Structural and functional differences between wild-type and mutated GATA4 were predicted by using HOPE (Have [y]Our Protein Explained; https://www3.cmbi.umcn.nl/hope/input/) 43 and I-TASSER (Iterative Threading ASSEmbly Refinement; https://zhanglab.ccmb.med.umich.edu/I-TASSER/). [44][45][46] The objective was to analyze a pathogenic variant with a high CADD score. HOPE shows the 3D structural and functional effects of a point mutation in human proteins. The input for this tool is the amino acid sequence of the GATA4 protein and the specific amino acid alteration of the variant. 43 The I-TASSER server predicts secondary structures and 3D models through various alignment methods. The accuracy of the formed models is evaluated based on a confidence score (C-score). Predicted models with a C-score of greater than −1.5 are considered to possess a correct topology. I-TASSER also estimates the template modeling score (TM-score) and the root mean square deviation (RMSD). The TM-score ranges between 0 and 1, with higher values specifying better structural models. 47

Phylogenetic Analysis
GATA4 protein sequences from 5 different organisms, namely Homo sapiens (humans), Canis lupus familiaris (dogs), Rattus norvegicus (rats), Gallus gallus domesticus (chickens), and Xenopus laevis (African clawed frogs), were retrieved from UniProt (the Universal Protein Resource; https://www.uniprot.org/). Afterward, all the GATA4 protein sequences were aligned via the multiple sequence alignment program ClustalW (version 1.83; https://www.genome.jp/tools-bin/clustalw). Thereafter, a phylogenetic tree was built by using ClustalW via the neighbor-joining method. As a result of the multiple sequence alignment, the tree showed scores that represented a sequence distance measure. These values determine the length of the branches, with the length showing the distance between the sequences.
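The notion of a sequence distance that underlies the branch lengths can be illustrated with the simplest such measure, the p-distance, which is the fraction of aligned positions at which 2 sequences differ. The sketch below is an illustration only: it is not necessarily the exact measure that ClustalW applies, and the aligned fragments are made up rather than taken from GATA4.

```python
from itertools import combinations


def p_distance(a, b, gap="-"):
    """Fraction of ungapped aligned columns at which two sequences differ."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")
    # Ignore any column in which either sequence has a gap.
    pairs = [(x, y) for x, y in zip(a, b) if x != gap and y != gap]
    if not pairs:
        raise ValueError("no ungapped columns to compare")
    return sum(x != y for x, y in pairs) / len(pairs)


# Made-up aligned fragments (hypothetical, not real GATA4 sequences).
alignment = {
    "species_A": "ACDEFGHIKL",
    "species_B": "ACDEFGHIKL",
    "species_C": "ACDEFGYIKM",
    "species_D": "AWD-FSYIKM",
}

for s, t in combinations(alignment, 2):
    print(s, t, round(p_distance(alignment[s], alignment[t]), 3))
```

A neighbor-joining tree is then built from the full matrix of such pairwise distances, so that sequences with small distances are joined by short branches.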

Literature Analysis
Using online databases and publications, we succeeded in finding 110 reported variations in the GATA4 gene. We also determined the frequency of the gene variants from online resources. The data are depicted in Table 1. The distributions of the reported variants in the different regions of the GATA4 gene are presented in Figure 1.

Frequency of the Variants
A wide range of GATA4 variants has been reported in different countries such as Japan, Australia, the United States, Brazil, Egypt, India, Germany, Lebanon, France, Iran, Italy, and especially China. Precise data on the reported variations and the phenotype condition of the individuals studied in different countries are depicted in Table 2. Genetic alterations in c.1129A > G were reported in 3 countries: China (0.33%), Germany (23%), and Australia (19.04%), with the highest frequency in Germany. Additionally, c.874T > C (45.58%), which was reported in Germany, represented the highest frequency among all the reported variations.

Bioinformatics
The results of the identification and analysis of the variations via online prediction tools are shown in Table 3. In this study, c.1075G > A had the highest GERP score (5.83), which represents 4.83 fewer substitutions than was expected. No negative GERP scores were reported for these variations.

Differences Between the Wild-Type GATA4 Protein and the Mutant Model
In this study, the effects of the predicted disease-causing p.Gly221Arg variant in GATA4, with a CADD Phred score of 31, were further analyzed. The p.Gly221Arg variant, which has a high level of pathogenicity, is a heterozygous missense variant in the conserved N-terminal zinc finger of GATA4. 74 HOPE results showed the alteration of glycine to arginine at position 221 (G221R, CADD Phred = 31). The size, charge, and hydrophobicity value of the 2 residues, as well as the differences between them, are presented in Figure 3A. The mutant residue was larger and carried a positive charge, while the wild-type residue was neutral. Furthermore, arginine was less hydrophobic than glycine. These differences in amino acid features could affect the zinc finger site of the protein and its function. Accordingly, this change in the GATA4 sequence might alter the conformation of the protein and exert negative influences on the structure of the protein at this specific residue (Figure 3B). I-TASSER produced 3D structures of GATA4 in 5 models with different C-scores. A model with a C-score of −0.5, an estimated TM-score of 0.65, and an estimated RMSD of 8.2 Å was selected. The predicted solvent accessibility of the mutant protein was similar to that of the wild-type one, with a score of 3 (Figure 3C).
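The residue-level differences between glycine and arginine can be checked against standard amino acid properties. The sketch below is not HOPE's own method; it assumes the Kyte-Doolittle hydropathy scale (on which more negative values are more hydrophilic), the side-chain charge near physiological pH, and the mass of the free amino acid in daltons. The table and function names are hypothetical.

```python
# Standard reference values: Kyte-Doolittle hydropathy, side-chain charge
# near pH 7, and free amino acid molecular mass in daltons.
RESIDUES = {
    "Gly": {"hydropathy": -0.4, "charge": 0, "mass_da": 75.07},
    "Arg": {"hydropathy": -4.5, "charge": 1, "mass_da": 174.20},
}


def compare(wild, mutant):
    """Return property changes on going from the wild-type to the mutant residue."""
    w, m = RESIDUES[wild], RESIDUES[mutant]
    return {
        "mass_change_da": round(m["mass_da"] - w["mass_da"], 2),
        "charge_change": m["charge"] - w["charge"],
        "hydropathy_change": round(m["hydropathy"] - w["hydropathy"], 1),
    }


print(compare("Gly", "Arg"))
```

On these values, the substitution adds mass and a positive charge and lowers the hydropathy score, meaning that arginine is the larger, charged, and more hydrophilic residue of the pair.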

GATA4 Protein Sequence Alignment and the Phylogenetic Tree
According to the phylogenetic tree generated by ClustalW, the human GATA4 protein had the closest homology with that of Canis lupus familiaris (dogs). Further, the most distant orthologue was that of Xenopus laevis (African clawed frogs) (Figure 4A). The results of the multiple sequence alignment of the species are illustrated in Figure 4B.

Discussion
CHD is the most frequent congenital abnormality and the major cause of infant mortality the world over. GATA4, a transcription factor with 2 zinc finger domains, has been reported to play an essential role in embryogenesis and cardiac development. 90 The GATA4 gene is reported to modulate heart hypertrophy in adults. 95 The number of studies seeking to explicate the correlation between GATA4 variants and CHD occurrence is on the rise. Indeed, recent studies have identified several novel variants in the GATA4 gene with potential roles in CHD development. 17 CHD is very heterogeneous, and the etiology of the majority of cases remains largely unknown. Both genetic and environmental factors contribute to CHD. 96 Therefore, the elucidation of the pathogenesis and differential diagnosis of the disease requires the identification of not only the disease-causing or susceptibility genes but also new genetic variants associated with the different types of CHD. Research has linked several genes to CHD, with NKX2-5, TBX5, and GATA4 comprising the most studied transcription factor genes. 15 These genes interact during embryonic development, and they are involved in the regulation of cardiogenesis and embryonic heart development. 97 Protein-protein interactions between transcription factors play a vital role in biological systems. The results concerning GATA4 protein interactions, generated by STRING, showed that 11 proteins (GATA4, NKX2-5, MEF2C, ZFPM2, TBX5, BMP4, SRF, BMP2, HAND2, NPPA, and HEY2) were grouped in a network. GATA4 and NKX2-5 transcription factors are critical to cardiomyocyte hypertrophy; thus, single-point variants could create an imbalance in the interaction between these proteins. 12 GATA4 has been shown to interact with HAND2 to modulate the transcription of the downstream gene by binding to the conserved GATA-binding sites of the HAND2 promoter. 98 NKX2-5, as a central regulator of many aspects of heart development, interacts with SRF and GATA4 to promote the expression of the cardiac sarcomeric protein gene. 99 Mutations in the ZFPM2 gene, which encodes the FOG2 protein (a transcription regulator of the GATA family members), disrupt the interaction with GATA4 or the nucleosome remodeling and deacetylation (NuRD) complex and, thus, lead to CHD. [100][101][102][103] A loss-of-function mutation in the MEF2C gene, which encodes a transcription factor required for normal cardiovascular development, is associated with increased vulnerability to CHD in humans. 104 MEF2C, TBX5, and GATA4 can induce cardiomyocyte differentiation and directly reprogram endogenous cardiac fibroblasts into functional cardiomyocytes. 105 Remarkably, BMP2 and BMP4 are vital for cardiogenesis in that they induce the expression of NKX2-5 and GATA4 transcription factors. These 2 genes play a significant role during the initial induction of cardiogenesis. Nevertheless, no association between BMP2 and BMP4 genetic variations (rs1049007, rs235768, and rs17563) and the risk of CHD was reported by Li FF et al. 106 Variations in the NPPA gene, which encodes the ANP precursor, are correlated with hypertension, stroke, coronary artery disease, and heart failure. 107 The HEY2 transcription factor plays an important role in mammalian heart development.
Three non-synonymous variations, namely c.286A > G (p.Thr96Ala), c.293A > C (p.Asp98Ala), and c.299T > C (p.Leu100Ser), were reported to affect the second helix of HEY2 in the diseased cardiac tissues of 2 cases with atrioventricular septal defect, suggesting its possible function in the regulation of ventricular septation in humans. 108 Somatic mutations were identified in NKX2-5 and its molecular partners, TBX5 and GATA4, as well as the transcription factor HEY2, in formalin-fixed tissues taken from a collection of hearts with atrial septal defect, 109 ventricular septal defect, and atrioventricular canal defect. 90,108,[110][111][112] The GATA4 missense variation (p.G221R), on which we focused in the present study, was identified in three 46,XY DSD patients from a family of French origin. The in vitro assays in that investigation demonstrated the failure of the p.G221R mutant protein to bind to FOG2, which is required for gonad formation. Furthermore, the mutant protein failed to transactivate the anti-Müllerian hormone promoter. 74 Some variants of GATA4 investigated in the present study have been previously analyzed for genotype-phenotype correlations. These investigations evaluated families manifesting those variations associated with different CHD types.
Lourenço D et al 74 reported the G221R variant in 5 members of a family with cardiac anomalies including atrial septal defect, tetralogy of Fallot, and congenital cyanotic heart disease.
In a study conducted by Garg V et al, 84 the c.886G > A (G296S) variation of GATA4 was identified in 13 affected members with atrial septal defect in a 5-generation family. The authors also reported the E359del variation of GATA4 in 5 members of another family.

Figure 3. A) Structural differences between the wild-type (glycine) and mutant (arginine) amino acids for the G221R alteration, as shown by HOPE; the black part shows the unique part of each amino acid (the side chain). B) An image generated by HOPE shows that the G221R variation affects the structure of the GATA4 protein. The green color shows the wild-type residue (glycine), and the red color represents the mutant residue (arginine). C) I-TASSER shows the secondary and 3D structure, as well as the predicted solvent accessibility, of the normal (left) and G221R mutant (right) GATA4 protein.

A genetic investigation conducted by D'Amato et al 88 reported the R319W variation in 3 members of a family: the proband and the proband's sister, both diagnosed with atrial septal defect, and the proband's father, who was considered not affected.
Rajagopal et al 68 studied 107 probands with cardiac abnormalities and identified the c.886G > T (G296C) variant in a proband with atrial septal defect and pulmonary stenosis. They also reported the substitution in the proband's father with persistent left superior vena cava to the coronary sinus. The G296S variation resulted in a reduction in GATA4 DNA-binding activity and disrupted binding to the transcription factor TBX5. Also in their study, the c.1207C > A (L403M) variant was identified in a proband with a hypoplastic right ventricle and sinus venosus atrial septal defect. Their results also demonstrated the c.487C > T (P163S) and c.1037C > T (A346V) variants in probands with endocardial cushion defect. Additionally, a missense variation, c.931C > T (R311W), in GATA4 was identified in a pedigree spanning 3 generations with 7 members diagnosed with CHD. All the affected members presented different cardiac phenotypes, including tetralogy of Fallot, ventricular septal defect, atrial septal defect, and patent ductus arteriosus, indicating that the same genetic alteration could lead to different subtypes of CHD. 87 In the present study, we filtered the literature and online databases for the pathogenic variants of the GATA4 gene. Our search yielded 210 variants; nonetheless, we excluded 100 of these variants due to a dearth of information and continued the study with 110 variations. After analyzing the frequency distributions of all the variants, we employed computational tools with different algorithms to predict the pathogenicity of the variants. As is shown in Table 3, our in silico analysis using MutationTaster, PolyPhen, PROVEAN, and SIFT revealed 38 pathogenic genetic variations. Our findings may broaden the spectrum of the known GATA4 genetic variations associated with different types of CHD.

Conclusions
Several gene deficiencies could contribute to the pathogenesis of CHD. In this study, we drew upon different in silico predictive tools for the analysis of the variants of the GATA4 gene. The most frequent variant was c.874T > C (45.58%), and the most frequent type of CHD was ventricular septal defect. Out of all the reported variants of GATA4, 38 variants were pathogenic. The p.Gly221Arg variant (CADD score = 31) showed a high level of pathogenicity. All the identified pathogenic variations in GATA4 could assist in the rapid identification and better understanding of the mechanisms underlying CHD.