Cotton Science ›› 2021, Vol. 33 ›› Issue (6): 504-512. doi: 10.11963/cs20200085

• Research Brief •

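Optimization of key parameters for next-generation sequencing bioinformatics analysis of polyploid homologous segments

Wu Jiannan1, Chen Ken1, Wang Huan2, Pang Boshi1, Zhou Yuxun1, Xiao Junhua1, Li Kai1,*

  1. College of Chemical Engineering & Biological Engineering, Donghua University, Shanghai 201620
  2. Shanghai Vocational College of Agriculture and Forestry, Shanghai 201699
  • Received: 2020-11-02 Online: 2021-11-15 Published: 2022-04-14
  • Corresponding author: Li Kai E-mail: 969747260@qq.com; likai@dhu.edu.cn
  • About the author: Wu Jiannan (born 1993), female, master's student, 969747260@qq.com
  • Funding:
    Natural Science Foundation of Shanghai (19ZR1436500)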

Optimization of key parameters for next-generation sequencing bioinformatics analysis of polyploid homologous segments

Wu Jiannan1, Chen Ken1, Wang Huan2, Pang Boshi1, Zhou Yuxun1, Xiao Junhua1, Li Kai1,*

  1. College of Chemical Engineering & Biological Engineering, Donghua University, Shanghai 201620, China
  2. Shanghai Vocational College of Agriculture and Forestry, Shanghai 201699, China
  • Received: 2020-11-02 Online: 2021-11-15 Published: 2022-04-14
  • Contact: Li Kai E-mail: 969747260@qq.com; likai@dhu.edu.cn

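Abstract:

[Objective] Genotyping single nucleotide polymorphism (SNP) markers in homologous segments of polyploid plants is highly challenging. Using a polymerase chain reaction (PCR) targeted-amplicon dataset of homologous segments from tetraploid upland cotton as an example, this study examined the influence of homologous segments and optimized the bioinformatics analysis scheme. [Method] First, three segments containing potential variants were amplified and sequenced from the DNA of 136 upland cotton samples. Second, read alignment and SNP calling were performed with different parameters. Finally, the results of the different schemes were compared and the schemes were optimized. [Result] Routine analysis identified the three SNP sites in segment 1 and the one SNP site in segment 2 as wild type or homozygous mutant in the samples, whereas segment 3 was identified as heterozygous mutant in almost all samples. Blast analysis showed that segment 3, located on chromosome A12, shares 96.28% similarity with its homologous sequence on chromosome D12. When only the homologous sequence of segment 3 was used as the reference sequence, the genotype calls at the potential SNP sites did not change. When segment 3 and its homologous sequence were used together as reference sequences, 48% of reads aligned to segment 3 and 52% to its homologous sequence. The large number of reads originating from the homolog of segment 3 was therefore the main cause of the genotyping errors, and the genotype at the potential SNP sites in segment 3 was determined to be TT. In addition, comparison of the GATK results revealed two new SNPs in segment 3 and excluded three false-positive SNPs caused by variation in the homoeologous sequence. [Conclusion] This study verified that the presence of homologous sequences seriously affects SNP identification and bioinformatics analysis in polyploids. Optimizing key parameters, in particular using the polyploid homologous sequences together as reference sequences, can improve the accuracy of SNP genotyping.

Keywords: cotton; homologous segments; single nucleotide polymorphism; targeted next-generation sequencing; genotyping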
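The effect of the reference choice can be illustrated with a toy sketch. It is not the authors' pipeline: real aligners perform gapped alignment with quality-aware scoring, whereas this example simply assigns each read to the reference with the fewest mismatches. The 20-bp sequences and the names `segment3_A12` and `homolog_D12` are hypothetical placeholders, not cotton sequence.

```python
from collections import Counter
from typing import Dict


def hamming(a: str, b: str) -> int:
    """Count mismatching positions between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x != y for x, y in zip(a, b))


def assign_read(read: str, references: Dict[str, str]) -> str:
    """Return the reference with the fewest mismatches, or 'ambiguous' on a tie."""
    scores = {name: hamming(read, seq) for name, seq in references.items()}
    best = min(scores.values())
    winners = [name for name, score in scores.items() if score == best]
    return winners[0] if len(winners) == 1 else "ambiguous"


# Hypothetical toy references that differ at two positions (index 8 and 16).
REFERENCES = {
    "segment3_A12": "ACGTTGCATGCCATGGATCA",
    "homolog_D12": "ACGTTGCACGCCATGGTTCA",
}

reads = [
    "ACGTTGCATGCCATGGATCA",  # matches the target exactly
    "ACGTTGCACGCCATGGTTCA",  # matches the homolog exactly
    "TCGTTGCATGCCATGGATCA",  # target read with one sequencing error
    "ACGTTGCATGCCATGGTTCA",  # equally close to both references
]

# With both references, reads are partitioned by origin.
both = Counter(assign_read(r, REFERENCES) for r in reads)

# With the target alone, every read is forced onto it, homolog reads included.
target_only = {"segment3_A12": REFERENCES["segment3_A12"]}
single = Counter(assign_read(r, target_only) for r in reads)

print(both)
print(single)
```

When only one reference is supplied, reads from the homolog have nowhere else to go, so their differences from the target show up as apparent variants at the target locus.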

Abstract:

[Objective] Single nucleotide polymorphism (SNP) detection in polyploid plants is complicated by the presence of homologous segments. Here, a polymerase chain reaction (PCR) amplicon dataset of homologous segments from tetraploid upland cotton is used as an example to examine the influence of homologous segments and to optimize the bioinformatics pipeline. [Method] First, three segments with potential variants were amplified and sequenced from 136 upland cotton (Gossypium hirsutum L.) samples. Then, mapping and SNP detection were conducted with different parameters. Finally, the results of the different schemes were compared and the schemes were optimized. [Result] With the routine method, the variants in segment 1 and segment 2 were called as homozygous genotypes, whereas segment 3 was called heterozygous in almost all samples. Blast analysis showed 96.28% similarity between segment 3 (located on chromosome A12) and its homologous sequence (located on chromosome D12). When only the homologous sequence was used as the reference for mapping, the genotyping results did not change. However, correct genotypes were called when both segment 3 and its homologous segment were used as reference sequences. In that case, 48% of reads mapped to target segment 3 and 52% to its homologous segment. The genotyping error in segment 3 is therefore attributable to reads from the homologous segment, and the genotype of the potential SNPs in segment 3 should be homozygous TT. In addition, comparison of the GATK results identified two new SNPs in segment 3 and excluded three false-positive SNPs caused by variation in the homoeologous sequence. [Conclusion] This study confirms that the existence of homologous sequences seriously affects SNP genotyping and bioinformatics analysis in polyploid plants. Optimizing key parameters, especially using the polyploid homologous sequences together as reference sequences, can improve the accuracy of SNP detection in polyploid plants.

Key words: cotton; homologous segments; single nucleotide polymorphism; targeted next-generation sequencing; genotyping
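A minimal sketch of why pooled reads can yield a spurious heterozygous call, assuming a simple allele-fraction genotyper. The function `call_genotype`, its `min_fraction` threshold, and the base `C` assumed for the homolog are illustrative choices, not details from the study; only the 48%/52% read split and the TT genotype come from the abstract, and GATK's actual model is likelihood-based rather than threshold-based.

```python
from typing import Dict, Optional


def call_genotype(allele_counts: Dict[str, int], min_fraction: float = 0.2) -> Optional[str]:
    """Call a diploid genotype from allele read counts at one site.

    An allele is considered supported when its share of reads is at least
    min_fraction (a hypothetical threshold chosen for illustration).
    """
    total = sum(allele_counts.values())
    if total == 0:
        return None
    supported = sorted(
        allele for allele, count in allele_counts.items()
        if count / total >= min_fraction
    )
    if len(supported) == 1:
        return supported[0] * 2      # homozygous, e.g. "TT"
    if len(supported) == 2:
        return "".join(supported)    # heterozygous, e.g. "CT"
    return None                      # no clean diploid call


# Pooled reads: 48% from segment 3 (T) and 52% from its homolog
# (assumed here to carry C at the aligned position).
pooled = {"T": 48, "C": 52}

# Reads remaining at segment 3 once homolog reads map to their own reference.
segment3_only = {"T": 48}

print(call_genotype(pooled))
print(call_genotype(segment3_only))
```

Under these assumptions the pooled counts look heterozygous, while the reads that genuinely belong to segment 3 support a homozygous call, which matches the pattern the abstract reports.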