Molecular Locus Nomenclature
A major problem with many current naming conventions is that the names combine locus and allele information, are tied to a particular technology, or contain formatting (bold, italics, superscripts, subscripts) or other characters that are incompatible with electronic databases.
For example:
- E(CCA)M(AGC)114 or ACAAAC110 : AFLP locus, name gives discriminatory bases and band size
- OH03_700: RAPD locus with band size given
- U379_500T: RAPD locus with band size; reason for T not known
This is obviously not ideal, as the same locus in a different variety might well have a different band size.
There are at least two kinds of names needed:
- physical clones
- genetic loci and their alleles
There are of course subclasses of these (e.g. sequences of clones), but these cover the major groups.
The Soybean Genetics Committee recommends the following guidelines which will result in names that are compatible with electronic databases. The ones for physical clones are already in use in several labs and seem to be acceptable.
An EST example:
Gm-c1049-2803
- Gm -- Glycine max - use the appropriate genus & species initials; 2 chars only
- c -- cDNA; r = rerack of a previous library
- 1049 -- library number; start at 1001; numbers coordinated by SoyBase - thus c1049 is the library name
- 2803 -- clone number
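The EST naming pattern above can be sketched as a small parser. This is only an illustration: the names `EstCloneName` and `parse_est_name` are hypothetical, and the regular expression assumes that the species code is one upper-case plus one lower-case letter and that the parts are always separated by dashes, as in the example.

```python
import re
from typing import NamedTuple


class EstCloneName(NamedTuple):
    species: str   # two-character genus/species initials, e.g. "Gm"
    kind: str      # "c" = cDNA, "r" = rerack of a previous library
    library: int   # library number, starting at 1001
    clone: int     # clone number within the library


# Assumed layout: <species>-<kind><library>-<clone>
_EST_RE = re.compile(r"^([A-Z][a-z])-([cr])(\d+)-(\d+)$")


def parse_est_name(name: str) -> EstCloneName:
    """Split an EST clone name into its parts, or raise ValueError."""
    match = _EST_RE.match(name)
    if match is None:
        raise ValueError(f"not a recognised EST clone name: {name!r}")
    species, kind, library, clone = match.groups()
    if int(library) < 1001:
        # The guidelines state that library numbers start at 1001.
        raise ValueError(f"library number below 1001 in {name!r}")
    return EstCloneName(species, kind, int(library), int(clone))
```

For the example above, `parse_est_name("Gm-c1049-2803")` yields species `Gm`, kind `c`, library `1049`, and clone `2803`.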
A BAC example:
Gm_UMb001_090_G08
- Gm -- Glycine max - use the appropriate genus & species initials; 2 chars only
- UM -- University of Minnesota; IS = Iowa State Univ; other 2 char abbreviations to be assigned as needed
- b -- BAC
- 001 -- library number; use leading zeroes to pad to 3 chars - thus UMb001 is the library name
- 090 -- plate number
- G08 -- row and column of clone
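As a rough sketch, the BAC name can be decomposed in the same way. The names `BacCloneName` and `parse_bac_name` are hypothetical, and the pattern makes assumptions the guidelines do not state explicitly: the row is a single upper-case letter, and the plate and column are written as digits. Zero-padded fields are kept as strings so that the padding survives.

```python
import re
from typing import NamedTuple


class BacCloneName(NamedTuple):
    species: str      # e.g. "Gm"
    institution: str  # two-character abbreviation, e.g. "UM" or "IS"
    kind: str         # single-character clone type, "b" = BAC
    library: str      # zero-padded 3-character library number, e.g. "001"
    plate: str        # plate number, e.g. "090"
    row: str          # plate row, e.g. "G" (assumed to be one letter)
    column: str       # plate column, e.g. "08"

    @property
    def library_name(self) -> str:
        """Institution + type + library number, e.g. 'UMb001'."""
        return f"{self.institution}{self.kind}{self.library}"


# Assumed layout: <species>_<institution><kind><library>_<plate>_<row><column>
_BAC_RE = re.compile(r"^([A-Z][a-z])_([A-Z]{2})([a-z])(\d{3})_(\d+)_([A-Z])(\d+)$")


def parse_bac_name(name: str) -> BacCloneName:
    """Split a BAC clone name into its parts, or raise ValueError."""
    match = _BAC_RE.match(name)
    if match is None:
        raise ValueError(f"not a recognised BAC clone name: {name!r}")
    return BacCloneName(*match.groups())
```

With the example above, `parse_bac_name("Gm_UMb001_090_G08").library_name` gives `UMb001`.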
Subclones of BACs are named by appending characters that identify the subclone to the BAC name:
- Gm_UMb001_090_G08_s_row_column where s indicates a subclone
- F or R appended as needed
Other single character abbreviations might be a = AFLP, f = RFLP, etc.
Individual Sequences
- Append a U or R followed by an integer for the walking step to the BAC or subclone name.
- For example Gm_ISb002_091_F11U3 for the 3rd step from the U sequencing primer.
- For consistency T3/T7, etc. should be referred to as U/R with the actual sequencing primer given in the database record.
- A clone's completed sequence would be Gm_ISb002_091_F11C where the C indicates that this is a completed sequence.
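The sequence suffix rules above could be implemented along these lines. The function names are hypothetical, and the sketch assumes that walking steps are numbered from 1 and that the clone name itself does not end in a U/R-plus-digits or C pattern.

```python
import re
from typing import Optional, Tuple

# Trailing "U<step>" or "R<step>" marks a walking step; trailing "C" marks
# a completed sequence.
_SUFFIX_RE = re.compile(
    r"^(?P<clone>.+?)(?:(?P<primer>[UR])(?P<step>\d+)|(?P<done>C))$"
)


def walking_step_name(clone: str, primer: str, step: int) -> str:
    """Name the sequence from the given walking step off the U or R primer."""
    if primer not in ("U", "R"):
        raise ValueError("primer must be 'U' or 'R'")
    if step < 1:
        raise ValueError("walking step must be a positive integer")
    return f"{clone}{primer}{step}"


def completed_name(clone: str) -> str:
    """Name the completed sequence of a clone."""
    return f"{clone}C"


def split_sequence_name(name: str) -> Optional[Tuple[str, str, Optional[int]]]:
    """Return (clone, marker, step), or None if no sequence suffix is present.

    marker is 'U' or 'R' for a walking step, or 'C' for a completed sequence
    (in which case step is None).
    """
    match = _SUFFIX_RE.match(name)
    if match is None:
        return None
    if match.group("done"):
        return (match.group("clone"), "C", None)
    return (match.group("clone"), match.group("primer"), int(match.group("step")))
```

For instance, `walking_step_name("Gm_ISb002_091_F11", "U", 3)` builds the name `Gm_ISb002_091_F11U3` used in the example above.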
For clones that are already in use and which can't be easily fitted into this scheme a simplified version of the above may be used.
- cDNA Gm-cn-arbitrary text: subclones and sequences would be named as above.
- RFLP Gm-pn-current rflp name: all current G. max probes would be considered 'library' 1. An example would be Gm-p1-A632.
- AFLP Gm-an-arbitrary text: subclones and sequences would be named as above.
- RAPD Gm-r-primer name from supplier: subclones and sequences would be named as above.
Genetic Loci
This is obviously an area badly in need of some consistency. In general it is inappropriate to include such details as band sizes, coordinates, restriction sites, cloning vehicle history, primer IDs, dates, etc. in locus names, as these quickly become irrelevant. Additionally, reconciling new and old names as the technique, laboratory, and species change, or when one locus overlaps, includes, or is identical to another, is extremely difficult. We recommend as a general rule that the locus name should not contain an overabundance of information, but just enough to point to a database entry where all of the rest of the information can be found. This is the model used by the public sequence databases. A similar approach has worked well for soybean genes and we think it can work for other kinds of genetic loci also. Many of the loci we are using will eventually be used in other species, and it would be inconvenient to have to either rename loci for each species or carry along a lot of only-historically-interesting baggage.
The Committee recommends that:
- Locus names should be as short as possible. Maps are becoming quite full and long names just exacerbate the display problems. One possibility for naming loci in the future might be to use a 3-letter + 3-numeral system starting at AAA001 and running to ZZZ999. A lab could reserve a block of names when desired, with the reservation system implemented as part of SoyBase.
- Locus names should contain only plain text letters and numbers, although upper and lower case can be used when needed. Any text formatting (bold, underlined characters, italics, super/subscripts, non-Roman characters, etc.) should absolutely not be used. The dash character should be used to separate parts of locus names, not the space character. The underline character should be reserved for indicating duplicated (= paralogous) loci or for phenotypically related QTL.
- Locus names should not contain references to a specific technology. The locus is a Mendelizing entity and may well be revealed in the future with a completely different technique.
- Locus names should not contain any map information. Map names can change and location on a map definitely changes depending on the particular markers used, population and map calculation procedure.
- Locus names should never contain allele information. Obviously the allele can be different in different varieties and the locus name should not be tied to the first one observed.
- Whenever possible, allele names should also be technology and genotype neutral. A simple approach would be to name alleles with a letter suffix appended to the locus name. For example, A745_2-ac is the third allele identified for locus A745_2 (which is a paralog of A745_1). Note that the allele suffix is two letters, starting with aa, ab, ac, and running to zy, zz. This provides for up to 676 alleles at each genetic locus.
- When a new locus is completely contained in a previous one, the subloci should be named using a "dot number" suffix. For example, if an RFLP clone is sequenced in two varieties and SNPs identified, the alleles would be named A745_2.01, A745_2.02, etc. Note the 2 character allele designator.
- Molecularly defined loci developed from soybean sequences, probes, etc. would be named as above. Loci derived from other species would have a two letter code prepended to the name originally used in that species. For example, the locus dfg3 in pea would be named Ps_dfg3 on the soybean map. Similarly, we would encourage the use of Gm_ when soybean-derived loci are placed on other maps. The rules for duplicate loci described above would be applied when needed.
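The two-letter allele suffix and the possible AAA001 to ZZZ999 locus scheme mentioned in the recommendations are both simple counting systems, and can be sketched as follows. The function names are hypothetical. The sketch assumes alleles are counted from 1 (so the third allele is "ac"), and that block names run through 001 to 999 before the letters advance, an ordering the recommendations do not specify.

```python
from string import ascii_lowercase, ascii_uppercase

MAX_ALLELES = 26 * 26            # aa .. zz = 676 alleles per locus
MAX_BLOCK_NAMES = 26 ** 3 * 999  # AAA001 .. ZZZ999


def allele_suffix(n: int) -> str:
    """Two-letter suffix for the n-th allele (1 -> 'aa', 3 -> 'ac', 676 -> 'zz')."""
    if not 1 <= n <= MAX_ALLELES:
        raise ValueError(f"allele number must be between 1 and {MAX_ALLELES}")
    first, second = divmod(n - 1, 26)
    return ascii_lowercase[first] + ascii_lowercase[second]


def allele_name(locus: str, n: int) -> str:
    """Append the allele suffix to a locus name, separated by a dash."""
    return f"{locus}-{allele_suffix(n)}"


def block_locus_name(index: int) -> str:
    """Name for a zero-based position in the 3-letter + 3-numeral scheme.

    Assumed ordering: the numeral runs 001..999 before the letters advance,
    so index 0 -> 'AAA001', 998 -> 'AAA999', 999 -> 'AAB001'.
    """
    if not 0 <= index < MAX_BLOCK_NAMES:
        raise ValueError("index outside the AAA001..ZZZ999 range")
    letters, number = divmod(index, 999)
    a, rest = divmod(letters, 26 * 26)
    b, c = divmod(rest, 26)
    return f"{ascii_uppercase[a]}{ascii_uppercase[b]}{ascii_uppercase[c]}{number + 1:03d}"
```

Here `allele_name("A745_2", 3)` reproduces the example `A745_2-ac`, and a reservation system could hand a lab a contiguous range of `index` values to be turned into names with `block_locus_name`.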