Gene Model Translation / Correspondence

  • Background and methods

    The set of allelic genes found in multiple individuals in a species or closely related species may be called a "pangene set," with the gene models that correspond by homology and position being called a pangene. The pangene set calculated for Glycine accessions at SoyBase can be used to find corresponding genes across assemblies and annotations.

    If you have one or several (fewer than 100) genes to look up, use the Pangene Lookup tool below. This page accepts a list of genes (separated by spaces or line returns).

    If you have hundreds or thousands of genes to look up, you can download a correspondence table, either for the reference lines only or for all pangene accessions.

    Using the reference lines as an example, the data is organized with genes as rows and annotation versions as columns.

    Pangene ID             Wm82.gnm1.ann1 / Wm82.a1.v1  Wm82.gnm2.ann1 / Wm82.a2.v1  Wm82.gnm4.ann1 / Wm82.a4.v1  Wm82.gnm6.ann1 / Wm82.a6.v1  more
    Glycine.pan5.pan46446  Glyma01g00210                Glyma.01G000100              Glyma.01G000100              Glyma.01G000100              ...
    Glycine.pan5.pan46447  Glyma01g00291                Glyma.01G000300              Glyma.01G000322              Glyma.01G000322              ...
    Glycine.pan5.pan43005  Glyma01g00300                Glyma.01G000400              Glyma.01G000400              Glyma.01G000400              ...
    Glycine.pan5.pan34709  Glyma01g00321                Glyma.01G000600              Glyma.01G000600              Glyma.01G000600              ...
    Glycine.pan5.pan74052  NONE                         NONE                         NONE                         Glyma.01G000750              ...
    Glycine.pan5.pan99999  ...                          ...                          ...                          ...                          ...

    To work with either of these files, uncompress it, then open it in Excel or a similar spreadsheet program.
    Alternatively, if you are somewhat familiar with a Unix terminal, you can extract data in many ways. A few examples:

                head -n 1 Glycine.pan5.MKRS.table_ref_lines.tsv | tr '\t' '\n'   # to list the column headers, one per line
                cut -f1,2,8,10 Glycine.pan5.MKRS.table_ref_lines.tsv | head   # to see four selected columns (the first 10 lines, including the header)
                grep -F -w -f YOUR_LIST_OF_GENES.txt Glycine.pan5.MKRS.table_ref_lines.tsv  # to search a list of gene IDs (one per line) against the file; -F treats IDs literally, -w avoids partial-ID matches
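
    A further example, offered as a minimal sketch rather than a supported tool: the function below prints a two-column translation table between two annotation versions, selecting the columns by header name instead of by number. The function name translate_columns is hypothetical, and the header names in the usage comment are assumptions; check the actual header text in your downloaded file first. The sketch also assumes that NONE marks a missing gene, as in the example table above.

```shell
# Hypothetical helper (not part of SoyBase or Pandagma):
# print "source_gene<TAB>target_gene" for two columns selected by header name.
translate_columns() {
  # $1 = tab-delimited table, $2 = source column header, $3 = target column header
  awk -F'\t' -v src="$2" -v dst="$3" '
    NR == 1 {
      # Locate the two requested columns in the header row.
      for (i = 1; i <= NF; i++) {
        if ($i == src) s = i
        if ($i == dst) d = i
      }
      if (!s || !d) { print "column not found" > "/dev/stderr"; exit 1 }
      next
    }
    # Assumption: NONE means the annotation has no gene in this pangene.
    $s != "NONE" && $d != "NONE" { print $s "\t" $d }
  ' "$1"
}

# Usage (header names are assumptions; confirm them in your file):
# translate_columns Glycine.pan5.MKRS.table_ref_lines.tsv Wm82.gnm1.ann1 Wm82.gnm4.ann1
```

    Selecting by header name keeps the command valid if the column order differs between the reference-lines table and the all-accessions table.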
              

    The method for generating the pangene correspondences is described briefly here:

    The Pandagma software package (Cannon, Lee, Weeks, Berendzen) was used to identify pangene and gene family sets. The main steps in Pandagma's pangene process are:

    • Make pairwise homology comparisons between each annotation set;
    • Filter by provided percent identity and coverage parameters;
    • Identify synteny blocks among all annotation sets;
    • Cluster genes in synteny blocks;
    • Add back remaining genes based on homology, constraining by chromosome (e.g., chr1 genes to chr1 clusters);
    • Add "extra" annotation sets (those with more fragmentary assemblies or questionable annotation quality) to clusters identified above.
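
    To illustrate the filtering step listed above, here is a minimal sketch that is not Pandagma's actual code. It assumes a hypothetical tab-delimited homology file in the common 12-column BLAST-style tabular layout (percent identity in column 3, query start and end in columns 7 and 8) with the query length appended as a 13th column. The function name filter_hits and both thresholds are illustrative only.

```shell
# Illustrative only: keep hits that meet minimum percent identity and
# minimum query coverage. Input layout is an assumption (see text above).
filter_hits() {
  # $1 = hits file, $2 = minimum percent identity, $3 = minimum query coverage (percent)
  awk -F'\t' -v min_id="$2" -v min_cov="$3" '
    {
      # Query coverage = aligned query span / query length, as a percentage.
      cov = 100 * ($8 - $7 + 1) / $13
      if ($3 >= min_id && cov >= min_cov) print
    }
  ' "$1"
}

# Usage (file name and thresholds are hypothetical):
# filter_hits hits.tsv 95 50
```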

    The Pandagma package is available at https://github.com/legumeinfo/pandagma, including the configuration used to calculate the pangene data above.

    The pangene collection for Glycine, including data in several formats and descriptions of the files, is in the "Glycine/GENUS/pangenes" section of the Data Store.

    If you need to translate gene IDs among arbitrary accessions programmatically or at scale, the gene_translate.pl utility in the Pandagma package may be helpful.