The set of allelic genes found in multiple individuals in a species or closely related species may be called a "pangene set," with each group of gene models that correspond by homology and position called a pangene. The pangene set calculated for Glycine accessions at SoyBase can be used to find corresponding genes across assemblies and annotations.
There are several good options for identifying corresponding genes in different accessions or annotations. If you have ...
    head -1 Glycine.pan5.MKRS.table_ref_lines.tsv | tr '\t' '\n'   # to see the list of headers
    cut -f1,2,8,10 Glycine.pan5.MKRS.table_ref_lines.tsv | head   # to see four selected columns (the first 10 entries)
    grep -f YOUR_LIST_OF_GENES.txt Glycine.pan5.MKRS.table_ref_lines.tsv   # to search a provided list of gene IDs against the file
    head -1 Glycine.pan5.MKRS.table.tsv | tr '\t' '\n'   # to see the list of headers
    cut -f1,2,8,10 Glycine.pan5.MKRS.table.tsv | head   # to see four selected columns (the first 10 entries)
    grep -f YOUR_LIST_OF_GENES.txt Glycine.pan5.MKRS.table.tsv   # to search a provided list of gene IDs against the file
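One caveat with `grep -f`: each line of the list is treated as a regular expression and may match as a substring, so an ID can also hit longer or similar-looking IDs. The sketch below illustrates the difference that the standard `-F` (fixed strings) and `-w` (whole words) options make. It uses a small made-up table rather than the real pangene files; the file names `demo.tsv` and `genes.txt` and the rows in them are hypothetical.

```shell
# Build a tiny tab-separated demo table (hypothetical rows) and a one-ID query list.
printf 'panA\tGlyma.01G000100\n'  >  demo.tsv
printf 'panB\tGlyma.01G0001005\n' >> demo.tsv
printf 'panC\tGlymaX01G000100\n'  >> demo.tsv
printf 'Glyma.01G000100\n' > genes.txt

# Plain -f: patterns are regular expressions ("." matches any character)
# and may match inside longer strings, so all three rows are counted.
loose=$(grep -c -f genes.txt demo.tsv)

# -F treats patterns as literal strings; -w requires whole-word matches.
# Only the panA row is counted.
strict=$(grep -c -wFf genes.txt demo.tsv)

echo "loose=$loose strict=$strict"
```

Because `.` is not a word character, `-w` would still match an ID that is followed by a dotted suffix (for example a transcript number), which may or may not be what you want.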
Sample data from the correspondence table for the reference lines:
Pangene ID | Wm82.gnm1.ann1 / Wm82.a1.v1 | Wm82.gnm2.ann1 / Wm82.a2.v1 | Wm82.gnm4.ann1 / Wm82.a4.v1 | Wm82.gnm6.ann1 / Wm82.a6.v1 | more |
---|---|---|---|---|---|
Glycine.pan5.pan46446 | Glyma01g00210 | Glyma.01G000100 | Glyma.01G000100 | Glyma.01G000100 | ... |
Glycine.pan5.pan46447 | Glyma01g00291 | Glyma.01G000300 | Glyma.01G000322 | Glyma.01G000322 | ... |
Glycine.pan5.pan43005 | Glyma01g00300 | Glyma.01G000400 | Glyma.01G000400 | Glyma.01G000400 | ... |
Glycine.pan5.pan34709 | Glyma01g00321 | Glyma.01G000600 | Glyma.01G000600 | Glyma.01G000600 | ... |
Glycine.pan5.pan74052 | NONE | NONE | NONE | Glyma.01G000750 | ... |
Glycine.pan5.pan99999 | ... | ... | ... | ... | ... |
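To translate a list of IDs from one annotation to another, a column-aware lookup can be more precise than `grep`. The following is a minimal sketch, not the SoyBase-recommended procedure: it rebuilds three of the sample rows above as a small tab-separated file and uses `awk` to map Wm82.gnm1.ann1 IDs to Wm82.gnm6.ann1 IDs. The file names `sample.tsv` and `query.txt`, the header labels, and the column positions (`from=2`, `to=5`) are assumptions for this demo; check the header list of the real table to find the columns you need.

```shell
# Demo table built from the sample rows above (tab-separated).
# The column order here is an assumption for illustration.
{
  printf 'pangene\tWm82.gnm1.ann1\tWm82.gnm2.ann1\tWm82.gnm4.ann1\tWm82.gnm6.ann1\n'
  printf 'Glycine.pan5.pan46446\tGlyma01g00210\tGlyma.01G000100\tGlyma.01G000100\tGlyma.01G000100\n'
  printf 'Glycine.pan5.pan46447\tGlyma01g00291\tGlyma.01G000300\tGlyma.01G000322\tGlyma.01G000322\n'
  printf 'Glycine.pan5.pan74052\tNONE\tNONE\tNONE\tGlyma.01G000750\n'
} > sample.tsv

# Query list: one gene ID per line, from the source annotation.
printf 'Glyma01g00291\nGlyma01g00210\n' > query.txt

# First pass (NR == FNR) loads the query IDs into an array.
# Second pass skips the header and prints source ID, target ID, and
# pangene ID for rows whose source column exactly equals a query ID.
translated=$(awk -F'\t' -v OFS='\t' -v from=2 -v to=5 '
  NR == FNR { want[$1] = 1; next }
  FNR > 1 && ($from in want) { print $from, $to, $1 }
' query.txt sample.tsv)

echo "$translated"
```

Results come out in table order rather than query order, and a target value of NONE means the pangene has no member in that annotation.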
The gene correspondences in the lookup tables above were calculated using the Pandagma package for identifying pangenes from a given collection of annotations. The method is described briefly here:
The Pandagma software package (Cannon, Lee, Berendzen) was used to identify pangene and gene family sets. The main steps in Pandagma's pangene process are:
The Pandagma package is available at https://github.com/legumeinfo/pandagma, including the configuration used to calculate the pangene data above.
The pangene collection for Glycine, including data in several formats and descriptions of the files, is in the "Glycine/GENUS/pangenes" section of the Data Store.
If you have extensive programmatic work and need to translate among arbitrary accessions, the gene_translate.pl utility in pandagma may be helpful.