Elizabeth M. Gillet1,2
1 Institut für Forstgenetik und Forstpflanzenzüchtung,
Universität Göttingen,
2 Institut für Forstgenetik und
Forstpflanzenzüchtung,
Bundesforschungsanstalt für Forst- und Holzwirtschaft,
Email: egillet@gwdg.de
Minimum sample sizes for detecting all types present in a deme
TABLE 1: Minimum sample size for detecting all alleles present at frequencies not less than a threshold frequency in a deme
TABLE 2: Minimum sample size for detecting all alleles present at frequencies not less than a threshold frequency in a deme showing Hardy-Weinberg-Proportions
TABLE 3: Minimum sample size for detecting all haplotypes among the haploid gametophytes produced by a single tree that is heterozygous at a given number of loci
TABLE 4: Minimum sample size for detecting all multilocus genotypes in a progeny from self-fertilization of an individual
Minimum sample sizes for qualified estimation of frequency distributions in demes
TABLE 5: Minimum sample sizes for qualified estimation of the frequency distribution of types within a single deme
TABLE 6: Minimum sample sizes for qualified joint estimation of the frequency distributions of types within several demes for estimation of multiple-deme parameters
TABLE 1: Minimum sample size for detecting all alleles present at frequencies not less than a given frequency in a deme
How many different alleles are present among the diploid individuals of a deme at a single gene locus? A deme is defined here as any set of individuals, such as a family, stand, subpopulation, or species. The only way to be sure that every allele is detected is to study every individual in the deme, hoping of course that any recessive alleles are present in homozygous form. Since it is seldom feasible to investigate the entire deme, the solution is to choose a sampling strategy that yields a high probability of detecting all alleles. For the case in which the true genotype frequencies in the deme are known, Gregorius (1980) derived an exact formula for the detection probability for a sample of given size. By varying the sample size in this formula, the smallest size can be determined for which the probability of detecting all alleles reaches a desired value.
Usually, however, the true genotype frequencies are not known. Often, not even the number of different alleles in the deme is known. In such cases, the problem can be reformulated as: What is the minimum sample size for detecting all alleles that are not too rare? More precisely: for a desired threshold allele frequency alpha and detection probability epsilon, what is the minimum sample size such that, with probability greater than or equal to epsilon, the sample includes all alleles that have frequencies not less than alpha in the deme?
The detection probability depends on how the alleles are associated to make up the diploid genotypes. For a given number of alleles at given frequencies, the detection probability assumes its minimum under complete homozygosity (Gregorius 1980). Reasoning intuitively, at most one new allele can be found per homozygous individual, as opposed to two in a heterozygous individual. The consequence is that the minimum sample size for detecting all alleles is greater under complete homozygosity than for all other forms of allelic association. A formula for the detection probability under complete homozygosity for alleles of known frequency and given sample size was derived by Gregorius (1980, Corollary 1).
If information is available neither about the number of alleles (rare or not) nor about their frequencies, calculations must proceed from the "worst case" single-locus genotype frequency distribution for which the smallest allele frequency is not less than alpha. This is the case of complete homozygosity where the number of alleles equals n=[1/alpha] (i.e., n equals the largest natural number less than or equal to 1/alpha), of which n-1 alleles have the frequency alpha and the nth allele has frequency 1- (n-1) alpha (Gregorius 1980, Corollary 3). A formula for the detection probability for this "worst-case" situation is also given by Gregorius (1980, Corollary 3) and applied to calculate minimum sample sizes for different values of alpha and epsilon (his Table 1). This table is reproduced below.
Sampling strategy: For random sampling with replacement (or without
replacement, if the deme is very large) among the diploid individuals of
a deme, the following table lists the minimum sample size that ensures
with probability epsilon that all alleles at a locus will be
detected that are present at relative frequencies not less than
alpha. Calculations proceed from the "worst case" of
no prior knowledge about the number of alleles nor their frequencies
and under the assumption of complete homozygosity.
[Table 1: minimum sample sizes by threshold frequency alpha and detection probability epsilon (including epsilon = 0.999); table body not recoverable.]
Table 1: In order to detect all alleles of a locus that are present
in a deme at frequencies not less than alpha, the minimum sample
size is given such that the probability of detection is always greater than
epsilon, regardless of the actual number of alleles and the allele
and genotype frequencies in the deme. (Reproduced from Gregorius 1980,
Table 1)
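The worst-case calculation described above can be sketched in a few lines of Python. This is an illustration, not a reproduction of Gregorius's formula: it applies the standard inclusion-exclusion argument to the worst-case distribution (n - 1 alleles at frequency alpha, one allele at 1 - (n - 1) alpha, complete homozygosity), which may differ in form from his Corollary 3. The function names are hypothetical, and exact rational arithmetic is used so that the alternating sum does not suffer from round-off.

```python
from fractions import Fraction
from math import comb, floor


def worst_case_detection_prob(alpha: Fraction, size: int) -> Fraction:
    """Probability that `size` completely homozygous individuals contain all
    alleles of the worst-case distribution for threshold frequency alpha."""
    n = floor(1 / alpha)              # number of alleles, n = [1/alpha]
    rest = 1 - (n - 1) * alpha        # frequency of the n-th allele
    total = Fraction(0)
    for k in range(n):                # k = number of alpha-alleles assumed missing
        rest_present = (1 - k * alpha) ** size
        rest_missing = (1 - k * alpha - rest) ** size
        total += (-1) ** k * comb(n - 1, k) * (rest_present - rest_missing)
    return total


def min_sample_size(alpha, epsilon) -> int:
    """Smallest sample size whose worst-case detection probability is >= epsilon."""
    # Going through str() turns inputs such as 0.05 into the exact fraction 1/20.
    alpha, epsilon = Fraction(str(alpha)), Fraction(str(epsilon))
    if not (0 < alpha <= 1 and 0 < epsilon < 1):
        raise ValueError("need 0 < alpha <= 1 and 0 < epsilon < 1")
    low = floor(1 / alpha)            # n alleles need at least n individuals
    if worst_case_detection_prob(alpha, low) >= epsilon:
        return low
    high = 2 * low
    while worst_case_detection_prob(alpha, high) < epsilon:
        low, high = high, 2 * high
    # The detection probability never decreases with sample size, so bisect.
    while high - low > 1:
        mid = (low + high) // 2
        if worst_case_detection_prob(alpha, mid) >= epsilon:
            high = mid
        else:
            low = mid
    return high
```

For example, `min_sample_size("0.5", "0.999")` treats two alleles of frequency 0.5 each and returns 11, since 1 - 2 (1/2)^11 is the first value of this form that reaches 0.999.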
Detecting haplotypes, cytotypes, and phenotypes
Since the probability of detecting all alleles when they are associated
in complete homozygosity equals the probability of detecting all alleles
in haploid individuals, a useful consequence is that the above table also
lists the minimum sample sizes for detecting all alleles in haploid
individuals. This carries over to the detection of all uniparentally
inherited marker variants, such as chloroplast or mitochondrial DNA
markers. In fact, the minimum sample sizes in the above table
apply to any trait for which it holds that each individual exhibits
exactly one trait state (e.g. phenotype, cytotype, haplotype).
Reference
Gregorius H-R (1980) The probability of losing an allele when diploid genotypes are sampled. Biometrics 36: 632-652.
TABLE 2: Minimum sample size for detecting all alleles present at frequencies not less than a given frequency in a deme showing Hardy-Weinberg-Proportions
Consider again the situation of Table 1, but suppose that the mode of gene action at the locus is codominance and that the genotypes are known to show Hardy-Weinberg-Proportions (HWP) within the deme. This means that if p_i is the relative frequency of the allele A_i in the deme (p_i > 0 for all i and Sum_i p_i = 1), and if P_ij is the relative frequency of individuals of genotype A_i A_j in the deme (Sum_{i .le. j} P_ij = 1, where ".le." stands for "less than or equal to"), then the relative genotype frequencies equal
P_ii = p_i^2 for all i and P_ij = 2 p_i p_j for all i, j with i < j (HWP)
Fulfillment of HWP signifies that the alleles are randomly associated in the genotypes. This information about the organization of the alleles can be used to refine the sampling strategy set forth in Table 1.
Again, a sampling strategy is sought that yields a high probability of detecting all alleles that are not too rare. The precise formulation remains the same: For desired threshold allele frequency alpha and detection probability epsilon, the probability should be greater than or equal to epsilon that all alleles will be included in the sample that have frequencies not less than alpha in the deme.
For known allele frequencies in the deme, Gregorius (1980, text following Corollary 1) showed that the minimum sample size when the genotype frequencies fulfill HWP is equal to one-half the minimum sample size for complete homozygosity (see Table 1). If information is available neither about the number of alleles nor about their frequencies, calculations must proceed from the "worst-case" single-locus genotype frequency distribution showing HWP for which the smallest allele frequency is not less than alpha. Again, this occurs when the number of alleles equals n = [1/alpha], of which n-1 alleles have frequency alpha and the nth allele has frequency 1 - (n-1) alpha. The result is that for HWP, the minimum sample size for detection of all alleles is one-half that for complete homozygosity. The table below thus results by halving the minimum sample sizes of Table 1.
Sampling strategy: For random sampling with replacement (or without
replacement, if the deme is very large) among the diploid individuals of
a deme that shows HWP at a locus that shows codominance of gene action,
the table lists the minimum sample size that ensures with probability
epsilon that all alleles at the locus will be detected that are
present at relative frequencies not less than alpha.
Calculations proceed from the "worst case" of no prior knowledge about
the number of alleles nor their frequencies and under the assumption
that the genotype frequencies show HWP.
[Table 2: minimum sample sizes by threshold frequency alpha and detection probability epsilon (including epsilon = 0.999); table body not recoverable.]
Table 2: Assume that the genotype frequencies in a deme show
HWP at a locus that shows codominance of gene action. In order
to detect all alleles that are present at this locus at frequencies
not less than alpha, the minimum sample size is given such
that the probability of detection is greater than epsilon,
regardless of the actual number of alleles and their
frequencies. (After Gregorius 1980, by halving the
values in his Table 1)
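The halving rule can be checked numerically with a small sketch. Under HWP with codominance, the two alleles of a random individual are independent draws from the allele frequencies, so an individual carries none of the alleles of a given set with probability (1 - p)^2 rather than (1 - p), where p is the total frequency of the set. The code below is an illustration under that reasoning, with hypothetical function names; it enumerates all subsets of alleles and is therefore only suitable for a handful of alleles. Rounding an odd homozygous sample size up to the next whole individual is an assumption of this sketch.

```python
from fractions import Fraction
from itertools import combinations


def detection_prob(allele_freqs, size, genes_per_individual):
    """Inclusion-exclusion probability that every allele appears in the sample.

    genes_per_individual = 1 models complete homozygosity (or haploids);
    genes_per_individual = 2 models codominance under HWP.
    """
    total = 0
    for k in range(len(allele_freqs) + 1):
        for missing in combinations(allele_freqs, k):
            # Probability that one individual carries none of the alleles in `missing`.
            none_in_individual = (1 - sum(missing)) ** genes_per_individual
            total += (-1) ** k * none_in_individual ** size
    return total


def min_individuals(allele_freqs, epsilon, genes_per_individual):
    """Smallest number of individuals with detection probability >= epsilon."""
    size = 1
    while detection_prob(allele_freqs, size, genes_per_individual) < epsilon:
        size += 1
    return size


freqs = [Fraction(1, 2), Fraction(1, 2)]
homozygous = min_individuals(freqs, Fraction(999, 1000), 1)
hwp = min_individuals(freqs, Fraction(999, 1000), 2)
```

With two alleles of frequency 1/2 each and epsilon = 0.999, `homozygous` evaluates to 11 and `hwp` to 6.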
Reference
Gregorius H-R (1980) The probability of losing an allele when diploid genotypes are sampled. Biometrics 36: 632-652.
TABLE 3: Minimum sample size for detecting all haplotypes among the haploid gametophytes produced by a single tree that is heterozygous at a given number of loci
If a tree is heterozygous at m gene loci, then it can produce haploid gametophytes (ovules whose haplotypes are inferable e.g. from the primary endosperm of conifer seeds, or pollen) that have any of 2^m different multilocus haplotypes. Assuming regular segregation and absence of linkage between the loci, each of the 2^m haplotypes has an equal chance of being formed, and thus the haplotypes should be uniformly distributed among the gametophytes. How should sampling be performed so that all haplotypes will be detected in a single sample? A sampling strategy is needed that yields a chosen high probability epsilon of including all haplotypes in the sample.
Sampling strategy: For random sampling with replacement (or without
replacement, if the base set of gametophytes is very large) among the
gametophytes, the table lists the minimum sample size that ensures
with probability epsilon that all haplotypes will be
detected. These sample sizes were calculated using
a formula of Gregorius (1980) based on the respective number of uniformly
distributed haplotypes.
[Table 3: minimum sample sizes by number m of heterozygous loci and detection probability epsilon; only one column survives, with the values 11, 29, 68, 150, 327, 703 in successive rows.]
Table 3: If a tree is heterozygous at m unlinked loci,
each showing codominance of gene action and regular segregation,
its gametophytes should have 2^m different
haplotypes, each with relative frequency 1/2^m.
The table lists the minimum sample size such that the probability of
detection of all 2^m haplotypes is greater than
epsilon. (Reproduced from Gillet 1998)
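For uniformly distributed haplotypes the detection probability has a simple closed form, so the sample sizes can be recomputed with a short sketch. The code uses the standard inclusion-exclusion formula for k equally frequent types, which is assumed here to be equivalent to the formula of Gregorius (1980) for this case; the function names are hypothetical.

```python
from math import comb, fsum


def prob_all_types(num_types: int, size: int) -> float:
    """Probability that a random sample of `size` gametophytes contains all of
    `num_types` equally frequent haplotypes (inclusion-exclusion)."""
    return fsum(
        (-1) ** k * comb(num_types, k) * (1 - k / num_types) ** size
        for k in range(num_types + 1)
    )


def min_gametophytes(m: int, epsilon: float) -> int:
    """Smallest sample size that detects all 2^m haplotypes with probability
    at least epsilon."""
    num_types = 2 ** m
    size = num_types  # fewer gametophytes than haplotypes can never suffice
    while prob_all_types(num_types, size) < epsilon:
        size += 1
    return size


sizes = [min_gametophytes(m, 0.999) for m in range(1, 7)]
```

For epsilon = 0.999 and m = 1, ..., 6 this yields 11, 29, 68, 150, 327 and 703, which agrees with the values that survive in Table 3.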
References
Gillet EM (1998) HAPLOGEN - User's Manual : Qualitative inheritance analysis of zymograms and DNA electropherograms in haploid gametophytes. URL http://www.uni-forst.gwdg.de/forst/fg/index.htm
Gregorius H-R (1980) The probability of losing an allele when diploid genotypes are sampled. Biometrics 36: 632-652.
TABLE 4: Minimum sample size for detecting all multilocus genotypes in a progeny from self-fertilization of an individual
If a tree that is heterozygous at m diploid gene loci showing codominance of gene action can reproduce by self-fertilization, then it can produce offspring with any of 3^m different multilocus genotypes over the m loci. Assuming regular segregation at each locus, and absence of linkage between loci, the relative frequencies of the different genotypes will differ. However, the "rarest" types - the complete homozygotes - should have frequencies of approximately (1/4)^m. How should sampling be performed so that all multilocus genotypes will be detected in a single sample? Once again, a sampling strategy is needed that yields a chosen high probability epsilon of including all genotypes in the sample.
Sampling strategy: For random sampling with replacement (or without
replacement, if the base set of offspring is very large) among the
offspring, the table lists the minimum sample size that ensures
with probability epsilon that all genotypes will be
detected. These sample sizes were calculated using
a formula of Gregorius (1980) based on the number of genotypes and the
frequency of the "rarest" genotype.
[Table 4: minimum sample sizes by number m of heterozygous loci and detection probability epsilon (including epsilon = 0.90 and 0.95); table body not recoverable.]
Table 4: If a tree is heterozygous at m loci, then the
number of different genotypes among the offspring equals
3^m, and the relative frequency of the rarest
genotype equals (1/4)^m. The minimum sample
size is listed such that the probability of detecting all genotypes is
greater than epsilon. (Reproduced from Gillet 1998)
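The genotype frequency distribution behind these numbers can be written down explicitly, as the following sketch shows. Under regular segregation and no linkage, a genotype that is heterozygous at h of the m loci has frequency (1/2)^h (1/4)^(m-h), and there are C(m,h) 2^(m-h) such genotypes. The second function is not the formula used for Table 4: it is a conservative estimate based on the elementary bound that the probability of missing at least one genotype is at most the sum of the individual miss probabilities, so it can only overestimate the required size. All names are hypothetical.

```python
from fractions import Fraction
from math import comb


def selfing_genotype_classes(m: int):
    """Return (number of genotypes, frequency of each) for h = 0..m
    heterozygous loci in the selfed progeny of an m-fold heterozygote."""
    return [
        (comb(m, h) * 2 ** (m - h), Fraction(2 ** h, 4 ** m))
        for h in range(m + 1)
    ]


def conservative_sample_size(m: int, epsilon: float) -> int:
    """Smallest N with 1 - sum_i (1 - p_i)^N >= epsilon, a lower bound on the
    detection probability, hence an upper bound on the minimum sample size."""
    classes = selfing_genotype_classes(m)
    size = 3 ** m  # fewer offspring than genotypes can never suffice
    while 1 - sum(n * float(1 - p) ** size for n, p in classes) < epsilon:
        size += 1
    return size
```

For m = 1 the three genotypes have frequencies 1/4, 1/2 and 1/4, and `conservative_sample_size(1, 0.95)` returns 13.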
References
Gillet EM (1998) DIPLOGEN - User's Manual : Qualitative inheritance analysis of zymograms and DNA electropherograms in diploid individuals. URL http://www.uni-forst.gwdg.de/forst/fg/index.htm
Gregorius H-R (1980) The probability of losing an allele when diploid genotypes are sampled. Biometrics 36: 632-652.
TABLE 5: Minimum sample sizes for qualified estimation of the frequency distribution of types within a single deme
How can the relative frequencies of the types (genotypes, phenotypes) present in a deme be estimated? A sampling strategy is needed such that the probability will be large that the sample frequencies will not deviate by more than a given amount from the true frequencies. The following considerations are based on the system analytic approach developed by Gregorius (1998), which is explained below.
A method of investigation - or sampling strategy - includes specification of sample size. If a hypothesis about the true distribution P of the n types in the deme is given, then the method is defined to be qualified for P, if the sample size is large enough such that
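Prob_P (d_0(S,P) .ge. lambda) .le. epsilon
holds, where ".ge." stands for "greater than or equal to", ".le." stands for "less than or equal to", S denotes the frequency distribution of the types observed in the sample, and d_0 denotes the genetic distance of Gregorius (1974a, b).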
If no information is available about the true distribution P, then it is necessary to choose a method of investigation that is qualified for all possible distributions of n types. It can be shown that if the method of investigation is qualified for the uniform distribution U = (u_i), i = 1,...,n, where u_i = 1/n, then it is qualified for all distributions of n types. U can thus be termed the "worst-case" distribution of n types.
Sampling strategy: For sampling with replacement within a single
deme containing a given number n of types, the following table gives
the minimum sample size that ensures qualification (Gregorius 1998) on given
levels for the "worst-case" frequency structure U. For up
to 4 types, sample sizes were calculated using exact probabilities;
for larger numbers of types, sample sizes were estimated by
Monte-Carlo methods.
[Table 5: minimum sample sizes by number n of types and critical d_0-discrepancy lambda (including lambda = 0.10); table body not recoverable.]
Table 5: For a deme that contains individuals of n different
types, the minimum sample size for qualified estimation of the frequency
distribution at level of significance epsilon=0.05 is given for
five different critical d_0-discrepancies
lambda.
An asterisk (*) signifies that the exact critical probability was
calculated. Otherwise, minimum sample sizes were
estimated using Monte-Carlo methods.
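A Monte-Carlo check of qualification for the uniform distribution U can be sketched as follows. The sketch assumes that d_0 is the distance d_0(S,U) = (1/2) Sum_i |s_i - u_i| between the sample frequencies s_i and u_i = 1/n; the function names, the number of trials and the fixed random seed are illustrative choices, not those used for Table 5.

```python
import random
from collections import Counter


def d0(sample_freqs, true_freqs):
    """Half the sum of absolute frequency differences (assumed form of d_0)."""
    return 0.5 * sum(abs(s - p) for s, p in zip(sample_freqs, true_freqs))


def estimate_exceedance(num_types, size, lam, trials=2000, seed=0):
    """Monte-Carlo estimate of Prob_U(d_0(S,U) >= lam) for samples of `size`
    individuals drawn with replacement from `num_types` equally frequent types."""
    rng = random.Random(seed)
    uniform = [1 / num_types] * num_types
    hits = 0
    for _ in range(trials):
        counts = Counter(rng.randrange(num_types) for _ in range(size))
        freqs = [counts[i] / size for i in range(num_types)]
        if d0(freqs, uniform) >= lam:
            hits += 1
    return hits / trials


def is_qualified(num_types, size, lam, epsilon=0.05, **kwargs):
    """True if the estimated exceedance probability does not exceed epsilon."""
    return estimate_exceedance(num_types, size, lam, **kwargs) <= epsilon
```

A minimum sample size would then be approximated by increasing `size` until `is_qualified` returns True; because the estimate is itself random, results near the threshold depend on the number of trials.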
References
Gregorius H-R (1998) The system analytical approach to the study of hypotheses. URL http://www.uni-forst.gwdg.de/forst/fg/index.htm
Gregorius H-R (1974a) On the concept of genetic distance between populations based on gene frequencies. Proceedings, Joint IUFRO Meeting S.02-04, pp. 17-22.
Gregorius H-R (1974b) Genetischer Abstand zwischen Populationen. I. Zur Konzeption der genetischen Abstandsmessung. Silvae Genetica 23: 22-27.
TABLE 6: Minimum sample sizes for qualified joint estimation of the frequency distributions of types within several demes for estimation of multiple-deme parameters
How can the relative frequencies of the types (e.g. genotypes, phenotypes) present in each of several demes be estimated jointly? Joint frequency estimates are necessary for the estimation of multiple-deme parameters. A sampling strategy is needed such that the probability will be large that the joint deviation of the sample frequencies from the true frequencies in the demes will not exceed a chosen amount. The following considerations are based on the system analytic approach developed by Gregorius (1998) (also see explanations for Table 5).
The m demes are assumed to be given, with the relative size of deme j equal to c_j, where c_j > 0 and Sum_j c_j = 1. Denote by p_i(j) the relative frequency of type i in deme j, where Sum_i p_i(j) = 1 for each j. These deme frequencies are translated into a weighted relative joint frequency distribution by multiplying each frequency by c_j, i.e., p_ij = c_j p_i(j). Then, Sum_{i,j} p_ij = 1 holds.
Assume that the m demes are of equal relative sizes, i.e., c_j = 1/m for all j. If a hypothesis about the true distribution P of the n types in each of the m demes is given, then a method of investigation involving independent random samples from each deme is defined to be qualified for P, if the sample sizes for all demes are large enough such that
Prob_P (d_0(S,P) .ge. lambda) .le. epsilon
holds, where ".ge." stands for "greater than or equal to", ".le." stands for "less than or equal to", S denotes the weighted joint frequency distribution observed in the samples, and d_0 denotes the genetic distance of Gregorius (1974a, b).
If no information is available about the true distribution P, then it is necessary to choose a method of investigation that is qualified for all possible distributions of n types. It can be shown that if the method of investigation is qualified for the uniform distribution U = (u_ij), i = 1,...,n, j = 1,...,m, where u_ij = 1/(mn), then it is qualified for all distributions of n types in m demes. U can thus be termed the "worst-case" distribution of n types in m demes.
Sampling strategy: For independent random sampling with replacement
within a given number m of demes of equal relative sizes that each
contain individuals possessing any one of a given number n of types,
the following table gives the minimum sample size that ensures qualification
(Gregorius 1998) on given levels for the "worst-case" frequency structure
U. The sample sizes were estimated using Monte-Carlo methods.
[Table 6: minimum sample size per deme by number of demes of equal relative sizes, number of types present in the demes, and critical d_0-discrepancy lambda (0.05 and 0.10); table body not recoverable.]
Table 6: For a given number of demes of equal relative sizes that
contain individuals of n different types, the minimum sample size
for each deme for qualified joint estimation of the demes' frequency
distributions at level of significance epsilon=0.05 is given
for two different values of the critical
d_0-discrepancy lambda.
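The joint case can be simulated in the same way as the single-deme case. The sketch below assumes that d_0 is half the sum of absolute differences between the weighted joint frequencies, and that the sample frequencies of deme j are weighted by the same c_j = 1/m as the true frequencies; with equal deme sizes the joint d_0 is then the mean of the per-deme d_0 values. Function names, trial count and seed are illustrative and not taken from the source.

```python
import random


def joint_d0(sample_freqs, true_freqs):
    """Assumed d_0 between weighted joint distributions of demes with equal
    relative sizes c_j = 1/m; each argument is a list of per-deme lists."""
    num_demes = len(true_freqs)
    return 0.5 * sum(
        abs(s - p) / num_demes
        for deme_s, deme_p in zip(sample_freqs, true_freqs)
        for s, p in zip(deme_s, deme_p)
    )


def estimate_joint_exceedance(num_demes, num_types, size, lam,
                              trials=2000, seed=0):
    """Monte-Carlo estimate of Prob_U(d_0(S,U) >= lam) when `size` individuals
    are sampled independently with replacement from each deme."""
    rng = random.Random(seed)
    uniform = [[1 / num_types] * num_types for _ in range(num_demes)]
    hits = 0
    for _ in range(trials):
        sample = []
        for _ in range(num_demes):
            counts = [0] * num_types
            for _ in range(size):
                counts[rng.randrange(num_types)] += 1
            sample.append([c / size for c in counts])
        if joint_d0(sample, uniform) >= lam:
            hits += 1
    return hits / trials
```

As in the single-deme case, a per-deme sample size would be accepted once the estimated probability falls to epsilon or below.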
References
Gregorius H-R (1998) The system analytical approach to the study of hypotheses. URL http://www.uni-forst.gwdg.de/forst/fg/index.htm
Gregorius H-R (1974a) On the concept of genetic distance between populations based on gene frequencies. Proceedings, Joint IUFRO Meeting S.02-04, pp. 17-22.
Gregorius H-R (1974b) Genetischer Abstand zwischen Populationen. I. Zur Konzeption der genetischen Abstandsmessung. Silvae Genetica 23: 22-27.
© Institut für Forstgenetik und Forstpflanzenzüchtung, Universität Göttingen, 1999