Browsing by Subject "genomi"

  • Tuominiemi, Antti (2020)
    Sequencing methods used to study the genomes of organisms have become cheaper, resulting in a significant increase in the amount of genomic data available. Knowing the nucleic acid sequence of the DNA alone tells little about an organism; the genome must first be annotated, which means locating the genes and defining their products. The programs used for annotation make mistakes, and their results must be evaluated in various ways. The vast amount of genomic data encourages fast production of new annotations, which can increase human-made errors. Some annotation programs use gene databases, so the number of wrongly annotated genes those databases contain may increase in the future if the quality control of annotations is not improved. This study examines the correlation between selected quality metrics and the quality of annotations. The quality metrics fall into two basic types: the first is based on the basic structures of genes, and the second on comparing the protein product of a gene against a protein database. The study assumes that comparison to a reference is a reliable way to assess the quality of annotations. The comparison is made at the genome, exon and nucleotide levels, and a single value describing the comparison is calculated at each level. For each gene aligned with a reference gene, sensitivity and specificity are calculated and combined into an F-score at the nucleotide level. Four different versions of the wild strawberry (Fragaria vesca) genome and their six annotations were used as data. They were downloaded from the Genome Database for Rosaceae, a genome database specializing in plants of the rose family. The correlation coefficients calculated between the quality metrics and the F-scores were in several cases small but statistically reliable, because the p-values were very small. Correlation coefficients were higher when quality metrics based on protein homology were examined.
The correlation coefficient calculated between the mean of the structure-based quality metrics and the F-score was lower when the studied annotation had a high F-score. The results detailed in this paper support the view that the selected structure-based quality metrics are not suitable for evaluating high-quality annotations. They might, however, be usable for automated detection of poor-quality annotations. Quality metrics based on protein homology appeared to be promising subjects for further research.
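
The nucleotide-level comparison mentioned in the abstract can be illustrated with a minimal sketch. It is not the thesis's implementation. It assumes the convention common in gene-prediction evaluation, where sensitivity is the fraction of reference nucleotides that were predicted and "specificity" is the fraction of predicted nucleotides that are in the reference (what is elsewhere called precision). The function names, the inclusive `(start, end)` exon representation, and the coordinates are all hypothetical.

```python
def exon_positions(exons):
    """Expand inclusive (start, end) exon intervals into a set of positions."""
    positions = set()
    for start, end in exons:
        positions.update(range(start, end + 1))
    return positions


def nucleotide_f_score(predicted_exons, reference_exons):
    """Return (sensitivity, specificity, f_score) for one aligned gene pair.

    Assumption: "specificity" follows the gene-prediction convention
    TP / (TP + FP), and the F-score is the harmonic mean of the two values.
    """
    predicted = exon_positions(predicted_exons)
    reference = exon_positions(reference_exons)
    true_positives = len(predicted & reference)
    if true_positives == 0:
        # No shared nucleotides: all three values are defined as zero here.
        return 0.0, 0.0, 0.0
    sensitivity = true_positives / len(reference)
    specificity = true_positives / len(predicted)
    f_score = 2 * sensitivity * specificity / (sensitivity + specificity)
    return sensitivity, specificity, f_score


# Hypothetical gene: the second predicted exon is shifted and too long.
reference_gene = [(1, 100), (201, 300)]
predicted_gene = [(1, 100), (211, 320)]
sn, sp, f = nucleotide_f_score(predicted_gene, reference_gene)
print(f"sensitivity={sn:.3f} specificity={sp:.3f} F={f:.3f}")
```

In this made-up example, 190 of the 200 reference nucleotides are recovered among 210 predicted ones, so the prediction scores well even though one exon boundary is wrong.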