Title:

Kind Code:

A1

Abstract:

Identifying the genetic determinants for disease and disease predisposition remains one of the outstanding goals of the human genome project. When large patient populations are available, genetic approaches using single nucleotide polymorphism markers have the potential to identify relevant genes directly. While individual genotyping is the most powerful method for establishing association, determining allele frequencies in DNA pooled on the basis of phenotypic value can also reveal association at a much-reduced cost. Here we analyze pooling methods to establish association between a genetic polymorphism and a quantitative phenotype. Exact results are provided for the statistical power for a number of pooling designs where the phenotype is described by a variance components model and the fraction of the population pooled is optimized to minimize the population requirements. For low to moderate sibling phenotypic correlation, unrelated populations are more powerful than sib pair populations with an equal number of individuals; for sibling phenotypic correlations above 75%, however, designs selecting the sib pairs with the greatest phenotype difference become more powerful. For sibling phenotype correlations below 75%, pooling extreme unrelated individuals is the most powerful design for sib pair populations. The optimal pooling fractions for each design are constant over a wide range of parameters. These results for quantitative phenotypes differ from those reported for qualitative phenotypes, for which unrelated populations are more powerful than sib pairs and concordant designs are more powerful than discordant, and have immediate relevance to ongoing association studies and anticipated whole-genome scans.

Inventors:

Bader, Joel S. (Stamford, CT, US)

Bansal, Aruna (Branford, CT, US)

Sham, Pak (London, GB)

Application Number:

10/131447

Publication Date:

03/06/2003

Filing Date:

04/22/2002

Assignee:

BADER JOEL S.

BANSAL ARUNA

SHAM PAK

Primary Class:

Other Classes:

435/6.12, 702/20

International Classes:

Related US Applications:

Primary Examiner:

LIN, JERRY

Attorney, Agent or Firm:

MINTZ, LEVIN, COHN, FERRIS, (Boston, MA, US)

Claims:

1. A method for detecting an association in a population of individuals between a genetic locus and a quantitative phenotype, wherein two or more alleles occur at the locus, and wherein the phenotype is expressed using a numerical phenotypic value whose range falls within a first numerical limit and a second numerical limit, the method comprising the steps of a) obtaining the phenotypic value for each individual in the population; b) selecting a first subpopulation of individuals having phenotypic values that are higher than a predetermined lower limit and pooling DNA from the individuals in the first subpopulation to provide an upper pool; c) selecting a second subpopulation of individuals having phenotypic values that are lower than a predetermined upper limit and pooling DNA from the individuals in the second subpopulation to provide a lower pool; d) for one or more genetic loci, measuring the difference in frequency of occurrence of a specified allele between the upper pool and the lower pool; and e) determining that an association exists if the allele frequency difference between the pools is larger than a predetermined value.

2. The method described in claim 1 wherein the lower limit and the upper limit are chosen such that, for a specified false-positive rate, the frequency of occurrence of false-negative errors is minimized.

3. The method described in claim 1 wherein the population comprises unrelated individuals.

4. The method described in claim 1 wherein the population comprises related individuals.

5. The method described in claim 3 wherein the predetermined lower limit is set so that the upper pool includes the highest 35% of the population and the predetermined upper limit is set so that the lower pool includes the lowest 35% of the population.

6. The method described in claim 3 wherein the predetermined lower limit is set so that the upper pool includes the highest 30% of the population and the predetermined upper limit is set so that the lower pool includes the lowest 30% of the population.

7. The method described in claim 3 wherein the predetermined lower limit is set so that the upper pool includes the highest 27% of the population and the predetermined upper limit is set so that the lower pool includes the lowest 27% of the population.

8. The method described in claim 2 wherein the individuals in the population are sibling pairs and each pair is ranked according to the phenotypic values of the siblings in each pair, and either (i) both members of the sibling pair are selected for the upper pool; (ii) both members of the sibling pair are selected for the lower pool; or (iii) neither member of the sibling pair is selected.

9. The method described in claim 8 wherein each sibling pair is ranked according to a mean value of the phenotypic values of the siblings in each pair, and wherein both members of the sibling pair are in the same pool.

10. The method described in claim 8 wherein the phenotypic values of the siblings in each pair are both above a predetermined lower limit or both below a predetermined upper limit.

11. The method described in claim 8 wherein the predetermined lower limit is set so that the upper pool includes the pairs with the highest 10% of the mean values in the population and the predetermined upper limit is set so that the lower pool includes the lowest 10% of the mean values in the population.

12. The method described in claim 8 wherein the predetermined lower limit is set so that the upper pool includes the pairs with the highest 15% of the mean values in the population and the predetermined upper limit is set so that the lower pool includes the lowest 15% of the mean values in the population.

13. The method described in claim 8 wherein the predetermined lower limit is set so that the upper pool includes the pairs with the highest 20% of the mean values in the population and the predetermined upper limit is set so that the lower pool includes the lowest 20% of the mean values in the population.

14. The method described in claim 8 wherein the predetermined lower limit is set so that the upper pool includes the pairs with the highest 25% of the mean values in the population and the predetermined upper limit is set so that the lower pool includes the lowest 25% of the mean values in the population.

15. The method described in claim 8 wherein the predetermined lower limit is set so that the upper pool includes the pairs with the highest 27% of the mean values in the population and the predetermined upper limit is set so that the lower pool includes the lowest 27% of the mean values in the population.

16. The method described in claim 2 wherein all individuals in the population are members of sibling pairs, and either (i) one member of a sibling pair is selected for the upper pool and the second member of the sibling pair is selected for the lower pool; or (ii) neither member of a sibling pair is selected.

17. The method described in claim 16 wherein the sibling pairs are ranked by the absolute magnitude of the difference in phenotypic value for the siblings within each pair, the percent of pairs with the greatest difference are identified, and the siblings in each pair are distributed such that the sibling with the high phenotypic value is selected for the upper pool and the sibling with the low phenotypic value is selected for the lower pool.

18. The method described in claim 17 wherein the phenotypic value of one member of the sibling pair is above a predetermined lower limit and the phenotypic value of the second member of the sibling pair is below a predetermined upper limit.

19. The method described in claim 17 wherein the percent of pairs is 80% and the distribution provides 10% of the population in each pool.

20. The method described in claim 17 wherein the percent of pairs is 70% and the distribution provides 15% of the population in each pool.

21. The method described in claim 17 wherein the percent of pairs is 60% and the distribution provides 20% of the population in each pool.

22. The method described in claim 17 wherein the percent of pairs is 50% and the distribution provides 25% of the population in each pool.

23. The method described in claim 17 wherein the percent of pairs is 54% and the distribution provides 27% of the population in each pool.

24. The method described in claim 2 wherein the individuals in the population are sibling pairs and the results obtained by performing the methods described in claims

25. The method described in claim 3 wherein the population of unrelated individuals are provided by a process comprising the steps of: a) providing a population of sibling pairs; and b) selecting only one member of a sibling pair to be included in the population of unrelated individuals.

26. The method described in claim 25 further comprising the steps of: a) calculating the overall mean of the phenotypic values in the population; b) subtracting the mean from each phenotypic value; c) ranking each sibling pair according to the result of the calculation conducted according to (pair-mean)

27. The method described in claim 25, further comprising the steps of: a) calculating the overall mean of the phenotypic values in the population; and b) selecting that member of each sibling pair having a phenotypic value such that the absolute value of the difference between the individual's phenotypic value and the overall mean is greater than the difference for the other individual in the pair, thereby providing a population of unrelated individuals.

28. The method described in claim 25 further comprising the steps of: a) rank ordering the members of the population of sibling pairs to generate a list wherein the rank order of each member of a sibling pair is obtained as the smaller of: i) the distance from the first member on the list and ii) the distance from the last member on the list; and b) selecting that member of each sibling pair having a lower ranking; thereby providing a population of unrelated individuals.

29. The method described in claim 25 further comprising the steps of: a) rank ordering the members of the population of sibling pairs to generate a list wherein the rank order of each member of a sibling pair is obtained as the distance from the phenotype mean; and b) selecting that member of each sibling pair having a lower ranking; thereby providing a population of unrelated individuals.

30. The method described in claim 1 wherein the population includes individuals who may be classified into classes.

31. The method described in claim 30 wherein the classes are based on an age group, gender, race or ethnic origin.

32. The method described in claim 31 wherein all the members of a class are included in the pools.

33. The method described in claim 1 for determining the genetic basis of disease predisposition.

34. The method described in claim 33, wherein the genetic locus which is analyzed for determining the genetic basis of disease predisposition contains a single nucleotide polymorphism.

Description:

[0001] This application claims priority to U.S. Ser. No. 09/932,480, filed Aug. 17, 2001; U.S. Ser. No. 60/226,465 filed Aug. 18, 2000 [Cura 396], and to U.S. Ser. No. 60/230,580 filed Sep. 5, 2000 [Cura 396A], both of which are incorporated herein by reference in their entireties.

[0002] The complex diseases that present the greatest challenge to modern medicine, including cancer, cardiovascular disease, and metabolic disorders, arise through the interplay of numerous genetic and environmental factors. One of the primary goals of the human genome project is to assist in the risk-assessment, prevention, detection, and treatment of these complex disorders by identifying the genetic components. Disentangling the genetic and environmental factors requires carefully designed studies. One approach is to study highly homogeneous populations (Nillson and Rose 1999; Rabinow, 1999; Frank 2000). A recognized drawback of this approach, however, is that disease-associated markers or causative alleles found in an isolated population might not be relevant for a larger population. An attractive alternative is to use well-matched case-control studies of a more diverse population. A second alternative is to study siblings, inherently matched for environmental effects.

[0003] Even with a well-matched sample set, the genetic factors contributing to an aberrant phenotype may be difficult to determine. Traditional linkage analysis methods identify physical regions of DNA whose inheritance pattern correlates with the inheritance of a particular trait (Liu 1997; Sham 1997; Ott 1999). These regions may contain millions of nucleotides and tens to hundreds of genes, and identifying the causative mutation or a tightly linked marker is still a challenge. A more recent approach is to use a sufficiently dense marker set to identify causative changes directly. Single nucleotide polymorphisms, or SNPs, can provide such a marker set (Cargill et al. 1999). These are typically bi-allelic markers with linkage disequilibrium extending an estimated 10,000 to 100,000 nucleotides in heterogeneous human populations (Kruglyak 1999; Collins et al. 2000). Tens to hundreds of thousands of these closely spaced markers are required for a complete scan of the 3 billion nucleotides in the human genome. Because each SNP constitutes a separate test, the significance threshold must be adjusted for multiple hypotheses (p-value ˜ 10^{−8}).

[0004] The most powerful tests of association require that each individual be genotyped for every marker (Fulker et al. 1995, Kruglyak and Lander 1995, Abecasis et al. 2000, Cardon 2000) and remain far too costly for all but testing candidate genes. An alternative that circumvents the need for individual genotypes, related to previous DNA pooling methods for determination of linkage between a molecular marker and a quantitative trait locus (Darvasi and Soller 1994), is to determine allele frequencies for sub-populations pooled on the basis of a qualitative phenotype. Populations of unrelated individuals, separated into affected and unaffected pools, have greater power than related populations. If a population consists of sib-pairs, concordant pairs versus unrelated controls have greater power than discordant pairs separated into affected and unaffected pools (Risch and Teng 1998). Nevertheless, discordant designs might provide a better control for confounding factors such as age, ethnicity, or environmental effects.

[0005] The phenotypes relevant for complex disease are often quantitative, however, and converting a quantitative score to a qualitative classification represents a loss of information that can reduce the power of an association study. The location of the dividing line for affected versus unaffected classification, for example, can affect the power to detect association. Furthermore, pooling designs based on a comparison of numerical scores are not even possible with a qualitative classification scheme. These distinctions can be especially relevant when populations contain related individuals and qualitative tests have a disadvantage (Risch and Teng 1998).

[0006] There remains a need for procedures that provide phenotype associations with diseases or pathologies based on phenotypes that may be ranked on a quantitative scale. In such a scheme there is a strong need to identify procedures for optimally obtaining samples, or pooling, from a subpopulation that provide the highest assurance of displaying associations that are present. In addition there is a need to distinguish among various pooling strategies that may arise in cases with different allele frequencies and different allele correlations. There is a further need to devise a test criterion for establishing the significance of associations between phenotypes and diseases or pathologies that may arise. The present invention addresses these and related deficiencies that currently exist.

[0007] The present invention is based, in part, on the discovery of methods to detect an association in a population of individuals between a genetic locus and a quantitative phenotype, where two or more alleles occur at a given genetic locus, and the phenotype is expressed using a numerical phenotypic value whose range falls within a first numerical limit and a second numerical limit. These limits are used to provide for subpopulations that consist of upper and lower pools.

[0008] In some embodiments, the population of individuals includes individuals who may be classified into classes. In certain aspects of the invention, these classes are based on age, gender, race, or ethnic origin. In other aspects, some or all members of a class are included in the pools.

[0009] In various embodiments, these numerical limits are chosen so that the upper pool includes the highest 10%, 15%, 20%, 25%, 27%, 30%, or 35% of the population. In other embodiments, the numerical limits are chosen such that the lower pool includes the lowest 10%, 15%, 20%, 25%, 27%, 30%, or 35% of the population.

[0010] In one embodiment of the invention, the numerical limits are chosen to minimize false-negative errors.

[0011] In the present invention, the population of individuals can include unrelated individuals or related individuals. In one aspect, these related individuals are sibling pairs (sib pairs). In a further aspect, each member of the sib pair is selected for the upper pool. In a still further aspect, each member of the sib pair is selected for the lower pool. In still yet another aspect, neither member of the sib pair is selected. In another aspect, one member of the sib pair is selected for the upper pool and the other member of the sib pair is selected for the lower pool.

[0012] In one embodiment of the invention, sib pairs are ranked by the absolute magnitude of the difference in phenotypic value for the siblings within each pair. In one aspect, the percent of pairs with the greatest difference are identified, and the siblings in each pair are distributed such that the sibling with the high phenotypic value is selected for the upper pool and the sibling with the low phenotypic value is selected for the lower pool. In an aspect of this embodiment, the phenotypic value of one member of the sibling pair is above a predetermined lower limit and the phenotypic value of the second member of the sibling pair is below a predetermined upper limit. In various other aspects, the percentage of pairs with the greatest difference is 80%, 70%, 60%, 54% or 50%, and the distribution provides 10%, 15%, 20%, 25%, or 27% of the population in each pool.

[0013] In an embodiment of the invention, Mahalanobis ranks are generated among sib pairs. In one aspect, these ranks are used to construct pools composed of the member of the sib pair with the more extreme Mahalanobis rank. In another aspect, the Mahalanobis ranks are used to generate a list in which the order of each member of a sib pair in this list is determined by the smaller of the distance of a member from the first member on the list and the distance of a member from the last member on the list.

[0014] Unless otherwise defined, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs. Although methods and materials similar or equivalent to those described herein can be used in the practice or testing of the present invention, suitable methods and materials are described below. All publications, patent applications, patents, and other references mentioned herein are incorporated by reference in their entirety. In the case of conflict, the present specification, including definitions, will control. In addition, the materials, methods, and examples are illustrative only and not intended to be limiting.

[0015] Other features and advantages of the invention will be apparent from the following detailed description and claims.


[0033] 1. Definitions

[0034] Glossary of Mathematical Symbols

X | quantitative phenotypic value of an individual |

X_{i} | quantitative phenotypic value of sib i, with i = 1 or 2 for sib-pairs |

X_{±} | (X_{1} ± X_{2})/2 |

r | phenotypic correlation between sibs |

A_{i} | allele inherited at a particular locus. For a bi-allelic marker, i = 1 or 2 |

G | genotype at the locus, either A_{1}A_{1}, A_{1}A_{2}, or A_{2}A_{2} for a bi-allelic marker |

G_{i} | genotype for sib i, with i = 1 or 2 for sib-pairs |

P(G) | genotype probability |

P(G_{1}, G_{2}) | joint sib-pair genotype probability |

f(X_{1}, X_{2}) | joint sib-pair phenotype probability distribution |

f[X_{1}, X_{2} | G_{1}, G_{2}] | joint sib-pair phenotype probability distribution conditioned on genotypes |

p | frequency of allele A_{1} |

q | frequency of the remaining alleles, with q = 1 − p |

p_{i} | frequency of allele A_{1} in sib i for an autosomal marker |

p_{±} | (p_{1} ± p_{2})/2 |

a | half the difference in the shift in the mean phenotypic value of individuals with genotype A_{1}A_{1} compared to A_{2}A_{2} |

d | difference in the mean phenotypic value between individuals with genotype A_{1}A_{2} compared to the mid-point of the means for A_{1}A_{1} and A_{2}A_{2} |

μ | mean phenotypic shift due to the locus, equal to a(p − q) + 2pqd |

σ_{A}^{2} | additive variance of phenotype X due to the genotype G |

σ_{D}^{2} | dominance variance due to the genotype G |

σ_{R}^{2} | residual phenotypic variance, with σ_{A}^{2} + σ_{D}^{2} + σ_{R}^{2} = 1 |

N | the total number of individuals whose DNA is available for pooling |

n | number of individuals selected for a single pool |

ρ | pooling fraction defined as n/N |

p_{U}, p_{L} | frequency of allele A_{1} in the upper and lower pools |

T | test statistic, which is expected to be close to zero when the genotype G does not affect the phenotypic value and is expected to be non-zero when individuals with genotypes A_{1}A_{1}, A_{1}A_{2}, and A_{2}A_{2} have different phenotypic values. As formulated here, T has a normal distribution with unit variance. Under the null hypothesis that σ_{A} is zero, the mean of T is zero. Under the alternative hypothesis that σ_{A} is non-zero, the mean of T is also non-zero. |

σ_{0}^{2} | variance of n^{1/2}(p_{U} − p_{L}) under the null hypothesis |

σ_{1}^{2} | variance of n^{1/2}(p_{U} − p_{L}) under the alternative hypothesis |

Φ(z) | cumulative standard normal probability, the area under a standard normal distribution up to normal deviate z |

z_{α} | normal deviate corresponding to an upper tail area of α, defined as Φ(z_{α}) = 1 − α |

α | type I error rate (false-positive rate). For a one-sided test, the null hypothesis is rejected when T > z_{α}; α is typically termed a p-value. A typical threshold for significance is p-value smaller than 0.05 or 0.01. If M independent tests are conducted, a conservative correction that yields a final p-value of α is to use a p-value of α/M for each of the M tests. |

β | type II error rate (false-negative rate). The power of a test is 1 − β. |

H(x) | Heaviside step function |

[0035] As used herein, when two individuals are “related to each other”, they are genetically related in a direct parent-child relationship or a sibling relationship. In a sibling relationship, the two individuals of the sibling pair have the same biological father and the same biological mother. As used herein, the term “sib” is used to designate the word “sibling”, and the sibling relationship is defined above. The term “sib pair” is used to designate a set of two siblings.

[0036] The members of a sib pair may be dizygotic, indicating that they originate from different fertilized ova. A sib pair includes dizygotic twins.

[0037] The focus of the present invention is to examine the statistical power of pooling designs for quantitative phenotypes. A variance components model provides the distribution of phenotypic values for an unselected population of unrelated individuals or sib pairs. The phenotype is partitioned into contributions from a specific causative allele and from residual shared and non-shared familial and genetic factors. The genotype-dependent phenotype distribution for sib pairs under Hardy-Weinberg equilibrium is used as the basis for analyzing the statistical power of various pooling strategies. The test statistic in each case is the allele frequency difference between two pools, appropriately standardized to a normal distribution. Numerically exact results are provided for a range of parameters including the fraction of population pooled, the allele frequency, and the dominant or recessive character of the allele. Furthermore, upon consideration of the relative powers of pooling designs, pooling designs are suggested for particular phenotype characteristics.

[0038] 2. Model 1

[0039] 2.1 Biometrical Genetic Model

[0040] A quantitative phenotype X, standardized to zero mean and unit variance, is hypothesized to be affected by the genotype G at a biallelic locus with alleles A_{1 }_{2}_{1 }_{2}_{1}_{2 }_{2 }_{1}_{1}_{1}_{2}_{2}_{2 }_{1}_{1}_{1}_{2}_{2}_{2 }_{1 }_{G}_{1}_{1}_{1}_{2}_{2}_{2}_{1}_{2}_{1 }_{2}_{1}_{2}_{1}_{2}

[0041] Using the notation defined above, the effect μ_{G }_{1}_{1}_{1}_{2}_{2}_{2 }_{1}_{2}_{1}_{2 }

[0042] The phenotypic variance contributed by the genotype G can be partitioned into an additive component σ_{A}^{2} and a dominance component σ_{D}^{2},

σ_{A}^{2} = 2pq[a − (p − q)d]^{2},

σ_{D}^{2} = 4p^{2}q^{2}d^{2}.
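The variance partition can be checked numerically. The sketch below is illustrative only: it assumes the standard biometrical coding in which the genotype means are +a for A_{1}A_{1}, d for A_{1}A_{2}, and −a for A_{2}A_{2}, Hardy-Weinberg genotype probabilities, and the textbook expressions σ_{A}^{2} = 2pq[a − (p − q)d]^{2} and σ_{D}^{2} = (2pqd)^{2}. The function name `genotype_model` and the parameter values are hypothetical.

```python
def genotype_model(p, a, d):
    """Return (mu, var_additive, var_dominance, var_direct) for a bi-allelic locus.

    Assumed coding (not taken verbatim from the source): genotype means are
    +a for A1A1, d for A1A2 and -a for A2A2, with Hardy-Weinberg proportions.
    """
    q = 1.0 - p
    probs = {"A1A1": p * p, "A1A2": 2.0 * p * q, "A2A2": q * q}
    effects = {"A1A1": a, "A1A2": d, "A2A2": -a}

    # Mean phenotypic shift due to the locus, as in the glossary.
    mu = a * (p - q) + 2.0 * p * q * d

    # Textbook additive and dominance variance components.
    var_additive = 2.0 * p * q * (a - (p - q) * d) ** 2
    var_dominance = (2.0 * p * q * d) ** 2

    # Direct variance of the genotype means, used as a cross-check.
    var_direct = sum(probs[g] * (effects[g] - mu) ** 2 for g in probs)
    return mu, var_additive, var_dominance, var_direct


if __name__ == "__main__":
    mu, va, vd, total = genotype_model(p=0.3, a=0.2, d=0.1)
    print(f"mu={mu:.4f} sigma_A^2={va:.6f} sigma_D^2={vd:.6f} total={total:.6f}")
```

The two components add up to the directly computed variance of the genotype means, which is the sense in which the genetic variance is "partitioned".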

[0043] In a population of unrelated individuals, the distribution f[X] of trait values is a mixture of 3 univariate normals, one for each genotype:

f[X] = Σ_{G} P(G) f[X | G],

f[X | G] = (2πσ_{R}^{2})^{−1/2} exp[−(X − μ_{G})^{2}/2σ_{R}^{2}],

σ_{R}^{2} = 1 − σ_{A}^{2} − σ_{D}^{2}.
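A short numerical sketch of the three-component mixture follows. It assumes Hardy-Weinberg weights, component means equal to the genotype effects (+a, d, −a) minus the overall shift μ so that the phenotype has zero mean, and a common residual variance σ_{R}^{2} = 1 − σ_{A}^{2} − σ_{D}^{2}. The names `mixture_density` and `mixture_moments` and the integration grid are illustrative choices, not part of the source.

```python
import math


def mixture_density(x, p, a, d):
    """Density f[X] as a Hardy-Weinberg mixture of three normals (assumed form)."""
    q = 1.0 - p
    mu = a * (p - q) + 2.0 * p * q * d
    var_a = 2.0 * p * q * (a - (p - q) * d) ** 2
    var_d = (2.0 * p * q * d) ** 2
    var_r = 1.0 - var_a - var_d  # residual variance for a unit-variance phenotype

    # (weight, centred mean) for A1A1, A1A2, A2A2.
    components = [(p * p, a - mu), (2.0 * p * q, d - mu), (q * q, -a - mu)]
    norm = 1.0 / math.sqrt(2.0 * math.pi * var_r)
    return sum(w * norm * math.exp(-((x - m) ** 2) / (2.0 * var_r))
               for w, m in components)


def mixture_moments(p, a, d, lo=-10.0, hi=10.0, steps=20000):
    """Midpoint-rule estimates of the total probability, mean and variance."""
    h = (hi - lo) / steps
    xs = [lo + (k + 0.5) * h for k in range(steps)]
    fs = [mixture_density(x, p, a, d) for x in xs]
    total = h * sum(fs)
    mean = h * sum(x * f for x, f in zip(xs, fs))
    var = h * sum((x - mean) ** 2 * f for x, f in zip(xs, fs))
    return total, mean, var


if __name__ == "__main__":
    print(mixture_moments(p=0.3, a=0.2, d=0.1))
```

Under these assumptions the mixture integrates to one and reproduces the standardization to zero mean and unit variance.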

[0044] Similarly, in a population of sib pairs, the bivariate distribution of trait values f [X_{1}_{2}

[0045] The mean of X_{j }_{G}_{1 }_{1 }_{2 }_{R}^{2}_{A}^{2}_{D}^{2}_{1 }_{2 }_{R }

_{R}_{A}^{2}_{D}^{2}

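[0046] It is convenient to re-express the phenotypes of sib pairs in terms of X_{+} and X_{−}, defined as X_{±} = (X_{1} ± X_{2})/2. Conditioned on the genotypes, the joint distribution factors as f[X_{+}, X_{−} | G_{1}, G_{2}] = f[X_{+} | G_{1}, G_{2}] f[X_{−} | G_{1}, G_{2}], with normal distributions for X_{+} and X_{−}:

f[X_{±} | G_{1}, G_{2}] = (2πσ_{±}^{2})^{−1/2} exp[−(X_{±} − μ_{±})^{2}/2σ_{±}^{2}],

μ_{±}(G_{1}, G_{2}) = (μ_{G_{1}} ± μ_{G_{2}})/2,

σ_{±}^{2} = σ_{R}^{2}(1 ± r_{R})/2.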

[0047] Allele frequencies p_{±}_{±}_{1}_{2}_{G}_{1}_{G}_{2}

[0048]

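[0049] We consider tests in which an upper and lower pool, each containing n individuals, are selected according to higher and lower phenotypic values from a larger population of N individuals. The frequencies p_{U} and p_{L} of allele A_{1} in the upper and lower pools are measured, and their difference is standardized to form the test statistic

T = n^{1/2}(p_{U} − p_{L})/σ_{0}.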

[0050] The variance p_{U}_{L }_{U}_{L}_{0}^{2}_{0 }_{0 }_{C }_{D }_{C}_{D }_{U}_{L}_{C}_{D}^{2}_{G}

_{G}_{1}^{2}_{1}_{2}_{1}^{2}_{1}_{2}

[0051] The contribution of the pooled-together sib-pairs is

_{C}^{2}_{G}_{1}_{G}_{2}_{C}^{2}_{G}_{G}_{1}_{G}_{2}_{C}^{2}_{1}_{2}

[0052] because the covariance between genotypes in a sib-pair is half the individual variance, reflecting that sibs share half their genetic material. Similarly, the contribution of the pooled-apart sib-pairs is

_{D}^{2}_{G}_{1}_{G}_{2}_{D}^{2}_{G}_{G}_{1}_{G}_{2}
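The statement that the covariance between genotypes in a sib-pair is half the individual variance can be verified by exact enumeration. The sketch below assumes random mating, Hardy-Weinberg parental genotypes, and Mendelian transmission, and represents each sib by its A_{1} allele frequency (0, 1/2, or 1). The function names are hypothetical.

```python
from fractions import Fraction
from itertools import product


def sib_pair_genotype_probs(p):
    """Joint distribution of A1 allele counts (0, 1, 2) for two full sibs."""
    p = Fraction(p)
    q = 1 - p
    hwe = {2: p * p, 1: 2 * p * q, 0: q * q}  # parental genotype probabilities
    # Probability that a parent carrying c copies of A1 transmits A1.
    transmit = {0: Fraction(0), 1: Fraction(1, 2), 2: Fraction(1)}

    joint = {}
    for father, mother in product(hwe, repeat=2):
        mating = hwe[father] * hwe[mother]
        tf, tm = transmit[father], transmit[mother]
        child = {2: tf * tm,
                 1: tf * (1 - tm) + (1 - tf) * tm,
                 0: (1 - tf) * (1 - tm)}
        # Sibs are independent draws given the parental genotypes.
        for c1, c2 in product(child, repeat=2):
            joint[(c1, c2)] = joint.get((c1, c2), 0) + mating * child[c1] * child[c2]
    return joint


def allele_frequency_moments(p):
    """Mean, variance and sib-sib covariance of the per-individual A1 frequency."""
    joint = sib_pair_genotype_probs(p)
    mean = sum(pr * Fraction(c1, 2) for (c1, _), pr in joint.items())
    var = sum(pr * (Fraction(c1, 2) - mean) ** 2 for (c1, _), pr in joint.items())
    cov = sum(pr * (Fraction(c1, 2) - mean) * (Fraction(c2, 2) - mean)
              for (c1, c2), pr in joint.items())
    return mean, var, cov


if __name__ == "__main__":
    mean, var, cov = allele_frequency_moments(Fraction(3, 10))
    print(mean, var, cov)
```

Using exact fractions avoids rounding, so the ratio of covariance to variance can be compared against 1/2 directly.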

[0053] The result for σ_{0}^{2 }

_{0}^{2}_{C}_{D}_{1}_{2}

[0054] with important limiting cases of p_{1}_{2}_{1}_{2 }_{1}_{2}_{1 }_{U}_{L}_{1 }_{0 }

[0055] 2.3 Pooling Design

[0056] A pooling design is a set of rules to determine which sibs are selected for the upper and lower pools. For an unrelated population, these rules take the form of a pair of indicator functions I_{u}_{L}_{u}_{L }

[0057] The rules for sib-pairs may be formulated in terms of four indicator functions which depend on both sibling phenotypic values X_{1 }_{2}_{sj}_{1}_{2}_{sj}_{+}_{−}_{Uj }_{Lj }

[0058] A summary of pooling designs in terms of the indicator functions is provided in Table II. The indicator functions are specified by upper and lower phenotype thresholds X_{U }_{L }

[0059] The values of X_{U }_{L }

[0060] Three types of designs are considered: unrelated pooling designs, in which none of the 2n pooled individuals are related (although the individuals may be drawn from a larger population of related individuals); sib-together pooling designs, in which each pool consists of n/2 sib pairs; and sib-apart pooling designs, in which n sib pairs are split between the upper and lower pools.

[0061] Unrelated Pooling Designs

[0062] Two types of unrelated pools are shown. The first, unrelated-random, pools the n individuals with the highest and lowest phenotypic values from a population of N unrelated individuals. The term random arises because the N unrelated individuals may be obtained by selecting one sib at random from an initial population of N sib pairs.
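As a rough illustration of the unrelated-random design, the sketch below ranks N unrelated individuals by phenotypic value, takes the n lowest and n highest as the two pools, and computes the allele frequency difference from individual A_{1} allele counts. In a real pooling experiment the frequencies would be measured on the pooled DNA; the individual counts, the function names, and the sample data here are hypothetical.

```python
def tail_pools(phenotypes, n):
    """Return (upper, lower) index lists for the n highest and n lowest values."""
    total = len(phenotypes)
    if n < 1 or 2 * n > total:
        raise ValueError("need 1 <= n and 2n <= N so the pools do not overlap")
    order = sorted(range(total), key=lambda i: phenotypes[i])
    lower = order[:n]           # n smallest phenotypic values
    upper = order[total - n:]   # n largest phenotypic values
    return upper, lower


def pool_frequency(allele_counts, members):
    """Frequency of allele A1 in a pool, from per-individual counts of 0, 1 or 2."""
    return sum(allele_counts[i] for i in members) / (2.0 * len(members))


if __name__ == "__main__":
    phenotypes = [0.5, -1.2, 2.0, 0.1, -0.3, 1.1]
    allele_counts = [1, 0, 2, 1, 0, 2]  # made-up genotypes for illustration
    upper, lower = tail_pools(phenotypes, n=2)
    diff = pool_frequency(allele_counts, upper) - pool_frequency(allele_counts, lower)
    print(upper, lower, diff)
```

The pooling fraction in the text corresponds to ρ = n/N, here 2/6.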

[0063] The second unrelated design, unrelated-extreme, first reduces a population of N/2 sib pairs to N/2 unrelated individuals by selecting the individual with the more extreme phenotypic value from each sib pair. Tails with n individuals are then selected for pooling from this unrelated sub-population. The more extreme sib is defined as having a greater distance |X_{j}
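The reduction step of the unrelated-extreme design can be sketched as follows, assuming (as in claim 27) that the more extreme sib is the one whose phenotypic value lies farther from the overall mean. The function name and the example pairs are hypothetical; the resulting list would then be passed to an ordinary tail selection.

```python
def more_extreme_sibs(pairs):
    """From each sib pair keep the value farther from the overall phenotype mean."""
    values = [x for pair in pairs for x in pair]
    overall_mean = sum(values) / len(values)
    # Ties go to the first sib listed, which is an arbitrary choice.
    return [max(pair, key=lambda x: abs(x - overall_mean)) for pair in pairs]


if __name__ == "__main__":
    pairs = [(1.0, -0.5), (0.2, 0.1), (-2.0, 0.3), (0.4, 1.5)]
    print(more_extreme_sibs(pairs))
```

Because only one sib per pair is retained, the reduced population of N/2 individuals contains no related individuals.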

[0064] Sib-Together Pooling Designs

[0065] Two sib-together designs are analyzed, each starting with a population of N individuals in N/2 sib pairs. The first, termed concordant, is analogous to concordant pooling based on a qualitative, affected/unaffected classification. If both sibs have phenotypic values above an upper threshold X_{U}_{L}_{+}_{U }_{L }

[0066] Sib-Apart Pooling Designs

[0067] Two sib-apart designs are also analyzed, each starting with N/2 sib pairs. The first is termed discordant, again analogous to qualitative discordant pooling. If one sib in a pair has a phenotypic value above an upper threshold X_{U }_{L}_{U }_{L }

[0068] The second sib-apart design, termed pair-difference, selects the n sib pairs with the greatest magnitude of difference |X_{1}_{2}
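A minimal sketch of the pair-difference selection follows, using the rule stated in the claims: rank the sib pairs by the magnitude of their phenotype difference, keep the n pairs with the greatest difference, and send the higher-valued sib to the upper pool and the lower-valued sib to the lower pool. The function name and the data are illustrative.

```python
def pair_difference_pools(pairs, n):
    """Split the n most discordant sib pairs between the upper and lower pools."""
    if n < 1 or n > len(pairs):
        raise ValueError("n must be between 1 and the number of sib pairs")
    ranked = sorted(pairs, key=lambda pr: abs(pr[0] - pr[1]), reverse=True)
    selected = ranked[:n]
    upper = [max(pr) for pr in selected]  # sib with the higher phenotypic value
    lower = [min(pr) for pr in selected]  # sib with the lower phenotypic value
    return upper, lower


if __name__ == "__main__":
    pairs = [(1.0, -0.5), (0.2, 0.1), (-2.0, 0.3), (0.4, 1.5)]
    print(pair_difference_pools(pairs, n=2))
```

Each pool receives n individuals, one from every selected pair, so the two pools are matched pair by pair.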

[0069] The depiction of pooling designs in _{1 }_{2}_{1}_{2 }_{U }_{L }_{1}_{2 }_{U}_{L}

[0070] The middle panels depict the two sib-together designs. On the left is the concordant design: to be selected for pooling, both sibs must be above or below a threshold. The upper threshold X_{U }_{U}_{U}_{+}_{U }_{L}_{−}_{+}_{+}_{U }_{+}_{L }_{U }_{L }_{1}_{2 }

[0071] The bottom panels depict the discordant design on the left and the pair-difference design on the right. The discordant design selects sib-pairs from rectangular regions in the upper left and lower right; the pooling boundaries in the pair-difference design are lines of constant X_{−}_{+}

[0072] Despite the close analogy, there is an important difference between the concordant and discordant designs described here for quantitative traits and the designs described elsewhere for qualitative traits (Risch and Teng, 1998). In this formulation for quantitative traits, the upper and lower thresholds define tails of a population distribution and a sizeable population fraction falls between the tails. In a typical formulation for qualitative traits, and especially for qualitative traits without an obvious quantitative basis, a single threshold divides the population into two classes: a smaller affected class and a larger unaffected class holding most of the population. In the terminology used here, such designs have X_{U}_{L}

[0073] 2.4 Distribution of p_{U} − p_{L}

[0074] The fraction ρ_{S }

[0075] where, as before, S=U or L labels the upper or lower pool and

[0076] The initial factor of (½) arises because the phenotype and genotype distributions are normalized to 1 per sib-pair rather than 2. In practice, the upper and lower thresholds X_{U }_{L }

[0077] For feasible values of ρ, the expected allele frequency in pool S is

[0078] where p_{G}_{j }^{th }^{−1}_{S}_{j}_{1}_{2}^{−1}_{i}_{i}_{i }_{i}_{i }_{1 }_{i}_{i }_{i}_{l}_{i}_{i}_{i }^{−1}_{i}_{i}_{i}^{2}_{i}_{i}_{i}^{2}

_{U}_{L}_{1}^{2}

[0079] where σ_{1}^{2 }

[0080] For the unrelated-extreme design, p_{U }_{L }

[0081] For the unrelated-random design, the index j is irrelevant, yielding simpler expressions:

_{S}^{−1}_{G}_{S}_{G}

_{1}^{2}^{−1}_{G {ρ}_{U}_{G}^{2}_{U }^{2}_{L}_{G}^{2}_{U}^{2}

[0082] For the sib-together designs, I_{S1}_{S2 }

[0083] The corresponding frequencies θ_{i }^{−1}_{S1}_{1}_{2}

[0084] For the sib-apart designs, I_{U1}_{L2 }_{L1}_{U2}

[0085] Due to the symmetry between the two siblings, ρ^{−1}_{G}_{1}_{G}_{2}_{U1}^{−1}_{G}_{1}_{G}_{2}_{L1}_{U }_{L}_{U}_{L}

[0086] When the null hypothesis is valid, each of these expressions for σ_{1 }_{0}_{1 }_{0 }_{0 }

[0087] 2.5 Power

[0088] The statistical power 1−β to reject the null hypothesis for a single one-tailed test with p-value α, where α is equivalent to the false-positive rate or Type I error rate and β is equivalent to the false-negative rate or Type II error rate, is

_{α}_{0}_{U}_{L}_{1}

[0089] where Φ(z) is the cumulative standard normal distribution, 1−Φ(z_{α}

^{−1}_{α}_{0}_{1−β}_{1}_{U}_{L}^{2}

[0090] where ρ=n/N is the fraction of the total population selected for each pool. In either case, replacing σ_{1 }_{0 }

[0091] 2.6 Computational Methods

[0092] Exact results for the distribution of the test statistic T under the null hypothesis and under the alternative hypothesis, subject only to the approximation that T is normal, were obtained by numerical computations converged to better than 1 part in 10^{6 }_{U }_{L }_{U }_{L }_{1}^{2 }

[0093] The numerical results, and the underlying theory, are robust when n, the number of individuals per pool, is large and 2(p_{U}_{L}

[0094] The properties and characteristics of the methods of the present invention are set forth in the Examples. It is shown, for example, that the optimal design for unrelated individuals is to pool the top and bottom 27% of the population. This design using N unrelated individuals has greater power than designs using N/2 sib pairs when the phenotypic correlation between sibs is low to moderate, below 75%, but has less power than sib pair designs when the correlation is above 75%.

[0095] Of the designs explored for a population of sib pairs, the unrelated-extreme design is the best for low to moderate sibling phenotype correlation. In this design, the more extreme sib is selected from each pair, then the top and bottom 36% of this subset are pooled. When the correlation is high, above 75%, the best design found for sib pairs is to first select the 27% of pairs with the greatest phenotype difference, then split each pair by phenotypic value to form an upper and lower pool. The pair-difference design might also be applied at low to moderate sibling correlation to reduce the rate of spurious association due to population stratification. The optimal pooling fractions for these designs were determined by minimizing the population requirements. The minima were generally quite flat, and pooling fractions close to the optimal fractions give near-optimal results.

[0096] Compared with the results obtained by others for pooling based on qualitative traits, the results derived using the methods of the present invention for quantitative traits are thought to be surprising. For earlier pooling strategies based on qualitative traits, designs using unrelated individuals were found to be more powerful than designs using sib pairs; when populations were restricted to sib pairs, concordant designs were found to have greater power than discordant designs (Risch and Teng 1998). In contrast, for quantitative phenotypes, the methods of the present invention indicate that unrelated individuals become less powerful than sib pairs when sibling correlation is high, and that sib-apart designs become more powerful than sib-together designs when the sibling correlation is above 50%. This result is significant because highly heritable traits that are likely to be the first targets of large-scale genotyping studies often exhibit sibling correlations of 50% or higher. Quantitative phenotypic values also permit the use of the unrelated-extreme design, which does not have an obvious analog for qualitative phenotypes that categorize individuals as affected/unaffected.

[0097] The sib-together and sib-apart pooling designs of the present invention, which draw individuals from extreme-high and extreme-low phenotypes, are anticipated to be more powerful than alternative designs that compare one extreme to the remainder of the population, as in a qualitative affected/unaffected classification. The affected/unaffected classification establishes a single threshold for a quantitative phenotype, and the allele frequency in the large unaffected class is close to the population mean. In contrast, the quantitative designs of the present invention employ two thresholds, and the allele frequencies in the upper and lower pools are approximately equidistant from the population mean. The allele frequency difference between pools is consequently half as large for the qualitative design as for the quantitative design of the present invention, and the population requirements are four times as large, or twice as large if the overall allele frequency is assumed to be known exactly. These conclusions are similar to those reached in the context of linkage analysis for quantitative trait localization using extremely concordant and extremely discordant sib pairs (Risch and Zhang 1995, Risch and Zhang 1996, Zhang and Risch 1996, Gu et al. 1996).

[0098] As with most genotyping designs, the pooling strategies described here are primarily sensitive to the additive variance from an allele. Since the additive variance for an allele is approximately equal to the fraction of heterozygotes times the square of half the phenotype shift between the two homozygotes, rare alleles with larger phenotype shifts may be detected with the same power as common alleles with smaller shifts. When the allele frequency becomes smaller than the additive variance of the allele, however, the frequency shift must become very large to compensate and the phenotype begins to resemble a monogenic trait.

[0099] The results provided here also imply the precision required for allele frequency determinations for pooled DNA. Approximately 3000 individuals are required for a genome-wide screen with an optimal pool size n of 600 to 800 individuals. The frequency difference corresponding to significance at α=5×10^{−8 }_{α}_{1 }_{α}_{1}_{1}^{1/2}

[0100] 3. Examples for Model 1

[0101] Overview to the Examples

[0102] In this section, total population sizes are presented for a wide range of parameters and as functions of the pooling fraction ρ. The first parameters explored are the sib-pair phenotype correlation r and the allele frequency p_{1}_{A}^{2}_{D}^{2 }_{G}

[0103] The reference value for sibling phenotype correlation was based on reported values for genetic heritabilities and shared environmental factors. Estimates of the genetic heritability for complex traits include 20% for cancer (Verkasalo et al. 1999), 20% to 40% for Type 2 diabetes mellitus (NIDDM) (Watanabe et al. 1999), 50% for pulmonary function (Wilk et al. 2000), 10% to 50% for systolic and diastolic blood pressure (Iselius et al. 1983, Perusse et al. 1989), and 70% to 90% for cholesterol level (Austin et al. 1987). Shared environmental factors are estimated to contribute 7% of the overall phenotype variance for cancer (Verkasalo et al. 1999), 20% to 40% for blood pressure (Iselius et al. 1983, Perusse et al. 1989), and 15% for serum lipid levels (Heller et al. 1993). The sibling phenotype correlation, equal to half the genetic heritability plus the shared environmental contribution, varies over a wide range for these traits. A phenotype correlation of 40%, in the middle of the range, was selected to serve as the reference.
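The relation stated above, sibling phenotype correlation equal to half the genetic heritability plus the shared environmental contribution, can be illustrated with a short sketch. The function name and the pairing of low with low and high with high estimates are assumptions made for illustration; only the input percentages come from the estimates quoted above.

```python
def sibling_correlation(heritability, shared_environment):
    """Sibling phenotype correlation: half the genetic heritability
    plus the shared environmental contribution (both given as fractions)."""
    return 0.5 * heritability + shared_environment


# Cancer: heritability 20%, shared environment 7%.
cancer = sibling_correlation(0.20, 0.07)

# Blood pressure: heritability 10% to 50%, shared environment 20% to 40%.
# Pairing the low ends and the high ends is an illustrative assumption.
bp_low = sibling_correlation(0.10, 0.20)
bp_high = sibling_correlation(0.50, 0.40)

print(f"cancer: {cancer:.2f}, blood pressure: {bp_low:.2f} to {bp_high:.2f}")
```

Under these assumptions the correlations span roughly 0.17 to 0.65, so the 40% reference value falls inside the range.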

[0104] Reported minor-allele frequencies for SNPs found in multiple populations range from 5% to 25%, with lower frequencies for variations which cause non-conservative amino acid changes and higher frequencies for conservative substitutions and changes in non-coding regions (Cargill et al. 1999, Goddard et al. 2000). A reference value of 10% was selected for p_{1}

[0105] The genetic variance arising from a typical SNP was modeled by assuming that the genetic heritability arises from multiple loci, each of which makes an independent contribution with a characteristic size equal to the genetic heritability divided by the total number of contributing loci. Assuming that approximately 20 polymorphic sites contribute to a genetic heritability of 40% yields a reference value of 0.02 for σ_{A}^{2}_{D}^{2}

[0106] In practice, the false-positive rate α is matched to the number of individual tests that are to be conducted in an association study. For a genome scan of 10^{6 }^{4 }^{−8 }^{−5 }^{−6 }_{α}_{1−β}

[0107] Figures depicting the results use a consistent scheme. The unrelated designs are represented as solid lines, thin for unrelated-random and thick for unrelated-extreme; the sib-together designs are represented as equal-spaced dashed lines, thin for concordant and thick for pair-mean; and the sib-apart designs are represented as unequally-spaced dashed lines, thin for discordant and thick for pair-difference.

[0108] The minimum population size N required to detect association as a function of the sibling phenotype correlation r and the pooled fraction ρ is shown in ^{−8}_{1}_{A}^{2}

[0109] In

[0110] The regions near the minima of N for each design are quite flat, indicating that pooling fractions within 0.1 of the minimum may give near-optimal results. The exact values of these minima are depicted in

[0111] The results of changing the allele frequency p_{1 }^{−8}_{A}^{2}_{1 }_{1}_{1}

[0112] At moderate frequencies of the minor allele, p_{1}_{A}^{2 }_{G }_{1}

[0113] At smaller allele frequencies, p1<1%, the increasingly rare allele has an corresponding large effect μ_{G }_{1 }_{A}^{2}_{D}^{2 }_{1 }

[0114] The population size N required to detect association is shown as Panel A in ^{8}_{1}_{U }_{L}^{−2 }_{U }_{L }_{A}

[0115] The corresponding optimal pooling fractions are shown in _{A}^{2}

[0116] The series of panels in _{D}^{2}_{A}^{2}_{D}^{2}^{−8}_{1}_{A}^{2}

[0117] For pure recessive traits, d/a=−1 in Panel A (82% dominance variance for p_{1}

[0118] These results again signal that pooling methods for quantitative phenotypes are more sensitive to changing additive variance than to changing dominance variance. The dominance variance is only significant in regions where the additive variance vanishes, d/a=1/(p_{1}_{2}

[0119] These effects are shown in greater detail in _{1}

[0120] When the widths of the distribution of the test statistic under the null and alternative hypothesis are approximately equal, the equation for the population necessary to detect association has the form N∝(z_{α}_{1−β}^{2}_{α}^{1/2 }_{1}_{A}^{2}^{−8 }_{α}^{−5 }_{α}_{α}

[0121] The effects of varying the false-negative rate β are similar to the effects of varying α because the population requirements depend predominantly on the difference z_{α}_{1−β}^{−5 }

[0122] 4. Model 2

[0123] 4.1 Variance Components Model

[0124] A standard variance components model is used to describe the joint phenotype-genotype probability distribution. A quantitative phenotype X, standardized to mean 0 and variance 1, is hypothesized to be affected by the genotype G at a biallelic locus with minor allele A_{1 }_{2 }_{2 }^{2}^{2 }_{1}_{1}_{1}_{2}_{2}_{2 }_{1 }_{G}_{1}_{1}_{1}_{2}_{2}_{2}_{p}^{2}

[0125] The frequency of a genotype combination for a sib pair is denoted P(G_{1}_{2}_{1}_{2}

[0126] The effects μ(G) of genotype G are to displace the phenotypic mean by a, d, and −a for genotypes A_{1}_{1}_{1}_{2}_{2}_{2 }_{1}

[0127] The phenotypic variance contributed by the genotype G can be partitioned into an additive component σ_{A}^{2 }_{D}^{2}

$$\sigma_A^2 = 2p(1-p)\,[a-(2p-1)d]^2, \qquad \sigma_D^2 = 4p^2(1-p)^2 d^2$$
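The partition into additive and dominance components can be checked numerically. The sketch below assumes the standard convention in which p is the frequency of allele A_1, genotypes A_1A_1, A_1A_2, and A_2A_2 displace the mean by a, d, and −a, and genotype frequencies follow Hardy-Weinberg proportions. The function names are hypothetical.

```python
def variance_components(p, a, d):
    """Return (additive, dominance) variance for a biallelic locus.

    Assumes Hardy-Weinberg proportions and the standard average-effect
    formula alpha = a - (p - q) * d, with q = 1 - p.
    """
    q = 1.0 - p
    alpha = a - (p - q) * d          # average effect of allele substitution
    additive = 2.0 * p * q * alpha ** 2
    dominance = (2.0 * p * q * d) ** 2
    return additive, dominance


def total_genetic_variance(p, a, d):
    """Direct variance of the genotype effects, used as a cross-check."""
    q = 1.0 - p
    freqs = (p * p, 2.0 * p * q, q * q)
    effects = (a, d, -a)
    mean = sum(f * m for f, m in zip(freqs, effects))
    return sum(f * (m - mean) ** 2 for f, m in zip(freqs, effects))


# Additive inheritance (d = 0) at allele frequency 0.5 with displacement 0.20.
add_var, dom_var = variance_components(0.5, 0.20, 0.0)

# Recessive minor allele (d/a = -1) at allele frequency 0.1.
rec_add, rec_dom = variance_components(0.1, 1.0, -1.0)
dominance_share = rec_dom / (rec_add + rec_dom)
```

In this sketch the two components sum to the directly computed genetic variance, and a recessive allele at 10% frequency carries most of its variance in the dominance component.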

[0128] As will be seen below, this partitioning is important because association tests are sensitive primarily to σ_{A}^{2}_{D}^{2}_{A}^{2 }_{D}^{2 }_{R}^{2}_{A}^{2}_{D}^{2}

[0129] The probability density of phenotypic values for sib pairs is denoted f(X_{1}_{2}

[0130] The mean of X_{1 }_{i}_{1 }_{2 }_{R}^{2 }_{R}

_{R}_{A}^{2}_{D}^{2}

[0131] when effects from genotype G are included.

[0132] Although X_{1 }_{2 }_{1 }_{2}_{1 }_{2 }_{+}_{−}

_{±}_{1}_{2}

[0133] The probability distribution in these orthogonal coordinates, f(X_{+}_{−}_{1}_{2}_{+}_{1}_{2}_{−}_{1}_{2}

_{±}_{1}_{2}_{±}^{2}^{−1/2}_{±}_{±}_{1}_{2}^{2}_{±}^{2}

_{±}_{1}_{2}_{1}_{2}

_{±}^{2}_{R}^{2}_{R}

[0134] It is also convenient to define pair-mean and pair-difference allele frequencies p±(G_{1}_{2}_{1}_{2}_{G}_{1}_{G}_{2}

[0135] The variance of the pair-mean and pair-difference variables may be expressed more generally for sib-ships of size s, with genotypic correlation r between any two sibs within a sib-ship, as

$$\mathrm{Var}(X_\pm) = \sigma_R^2\,T_\pm$$

$$\mathrm{Var}(p_\pm) = \sigma_p^2\,R_\pm$$

[0136] where

_{±}_{R}

_{±}

[0137] The family size s is 2 for sib-pairs, and the genotypic correlation r is 0.5 for full sibs.

[0138] In addition to X_{1}_{2 }_{+}_{−}

_{+}_{+}

_{−}_{+}

[0139] The probability distribution in Mahalanobis coordinates is

_{1}_{2}^{−1}_{2}_{−}^{2}

_{±}_{±}_{±}

[0140] This distribution satisfies

[0141] In the absence of a contribution from the QTL,f(b,φ|G_{1}_{2}^{−1 }^{2}_{1}_{2 }^{1/2}

[0142] 4.2 Test Statistic and Pool Design

[0143] The tests of association described here depend on detecting differences in allele frequency in DNA pooled from individuals chosen from a large DNA repository. The allele frequency in the upper pool, with individuals selected to have higher phenotypic values, is denoted p_{U}, and the allele frequency in the lower pool is denoted p_{L}. The test statistic is the difference Δp = p_{U}−p_{L}.

[0144] The overall repository size is denoted N, composed entirely of either N unrelated individuals or N/2 sib pairs. The upper and lower pools each hold n samples, and the pooling fraction ρ is defined as n/N.

[0145] For an unrelated population, only one design is described: selecting the n individuals whose phenotypic values are at the upper and lower tails of the distribution, thus defining upper and lower thresholds X_{U }_{L}

[0146] A corresponding design for sib pairs is termed unrelated-random. In this design, one sib is chosen at random from each sib-ship to generate a population of N/2 unrelated individuals. Individuals at the upper and lower tails of this unrelated subset are then selected for pooling. The unrelated-random design for N/2 sib pairs with pooling fraction ρ is essentially equivalent to the unrelated-population design for N/2 individuals with pooling fraction 2ρ.

[0147] A second design selecting only unrelated individuals is termed the Mahalanobis design. The pair-mean X_{+}_{−}

_{2}_{+}^{2}_{+}^{2}

[0148] The n sib-ships with the largest magnitude b and a positive pair-mean X_{+}

[0149] Two remaining designs select both members of a sib pair for pooling. The pair-mean design selects each sib-ship as a family unit based on the phenotypic mean of the pair. The n/2 pairs at the extreme upper and lower tails of the distribution of phenotypic means for sib-ships, comprising n individuals each, are selected for the upper and lower pools, respectively. The upper and lower thresholds are again termed X_{U} and X_{L}.

[0150] The pair-difference design selects individuals based on the difference of phenotypic values within each sib-ship, or equivalently on the magnitude of within-family phenotypic variance. The n sib-pairs with the greatest within-family variance are identified. Within each pair, the individual with the higher phenotypic value is selected for the upper pool, and the individual with the lower phenotypic value is selected for the lower pool. The threshold for the magnitude of the difference |X_{1}_{2}_{T}

[0151] Since the X_{+}_{−}

[0152] 4.3 Test Power

[0153] Under the null hypothesis H_{0}_{U }_{L }_{1}_{1}_{1}_{0 }_{1}_{U }_{L}

[0154] Both p_{U }_{L }_{1 }_{1 }_{0 }_{0}^{2}_{1 }_{1}^{2}_{0}^{2 }_{1}^{2 }

$$N = \frac{(z_\alpha\sigma_0 - z_{1-\beta}\sigma_1)^2}{[E_1(\Delta p)]^2}$$

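[0155] The terms z_{α} and z_{1−β} are defined as

$$\Phi(z_\alpha) = 1-\alpha, \qquad \Phi(z_{1-\beta}) = \beta$$

[0156] where Φ(z) is the cumulative probability function for the standard normal distribution,

$$\Phi(z) = (2\pi)^{-1/2}\int_{-\infty}^{z} e^{-x^2/2}\,dx$$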

[0157] The significance level α is for a one-sided test, which is appropriate for association tests for disease-susceptibility markers. If markers for protective polymorphisms are also sought, the significance for a two-sided test is more appropriate.
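The difference between one-sided and two-sided significance levels can be made concrete with a short sketch. It assumes the upper-tail convention 1−Φ(z_x)=x for both z_α and z_{1−β}, and it assumes that the variance of Δp is σ²/N under each hypothesis, so that N = (z_α σ_0 − z_{1−β} σ_1)²/[E_1(Δp)]². Function names are hypothetical.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()


def upper_tail_quantile(tail_prob):
    """z such that 1 - Phi(z) = tail_prob (uses symmetry for small tails)."""
    return -STD_NORMAL.inv_cdf(tail_prob)


def z_alpha(alpha, two_sided=False):
    """Critical value for false-positive rate alpha."""
    return upper_tail_quantile(alpha / 2.0 if two_sided else alpha)


def z_one_minus_beta(beta):
    """Upper-tail quantile at probability 1 - beta; negative when power > 50%."""
    return upper_tail_quantile(1.0 - beta)


def repository_size(expected_dp, sigma0, sigma1, alpha, beta, two_sided=False):
    """N = (z_alpha*sigma0 - z_{1-beta}*sigma1)^2 / E1(dp)^2 (assumed scaling)."""
    numerator = (z_alpha(alpha, two_sided) * sigma0
                 - z_one_minus_beta(beta) * sigma1)
    return (numerator / expected_dp) ** 2


# Illustrative inputs only: the effect size and widths below are not from the text.
n_one_sided = repository_size(0.05, 1.0, 1.0, alpha=5e-8, beta=0.2)
n_two_sided = repository_size(0.05, 1.0, 1.0, alpha=5e-8, beta=0.2, two_sided=True)
```

Switching to a two-sided test raises the critical value and therefore the required repository size for the same nominal α.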

[0158] The method used here to optimize test designs is to specify the error rates α and β, then calculate the selection criteria that minimize the total repository size N required to achieve these error rates for specific genetic models. The method is outlined below, along with a summary of analytical approximations for the repository sizes required for different population structures and pooling designs. Comparisons of the analytical approximations with essentially exact numerical calculations are found in the Results section, and mathematical details are provided in the Appendix.

[0159] To optimize N, a trial value of the fraction ρ is chosen. Next, the threshold phenotypic values that select n=ρN individuals for each pool are derived from the distribution of phenotypic values. Depending on the pooling design, these threshold values may refer to phenotypes for unrelated individuals, the Mahalanobis measure b, the pair-mean measure X_{+}_{−}_{U}_{L}_{1}_{1}_{0}^{2}_{1}^{2}_{0 }_{1}_{α}_{0}^{1/2 }

^{1/2}^{1/2}_{1}_{α}_{0}_{1}

[0160] Since the terms E_{1}_{0}^{2}^{0}^{2 }

$$N = \frac{(z_\alpha\sigma_0 - z_{1-\beta}\sigma_1)^2}{[E_1(\Delta p)]^2}$$

[0161] Optimization proceeds by a search for the value of ρ giving smallest N.
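A minimal sketch of such a search is given here. As an objective it uses the geometric factor ρ/(2y_ρ²) that appears in the analytical approximation for unrelated individuals, where y_ρ is taken to be the standard normal density at the threshold cutting off an upper tail of probability ρ. The grid search and all function names are assumptions for illustration and do not reproduce the numerical procedure used for the exact results.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()


def geometric_factor(rho):
    """rho / (2 * y_rho^2), with y_rho the normal density at the tail threshold."""
    y = STD_NORMAL.pdf(STD_NORMAL.inv_cdf(1.0 - rho))
    return rho / (2.0 * y * y)


def minimize_over_rho(objective, lo=0.01, hi=0.49, steps=481):
    """Grid search for the pooling fraction giving the smallest objective."""
    grid = [lo + i * (hi - lo) / (steps - 1) for i in range(steps)]
    return min(grid, key=objective)


best_rho = minimize_over_rho(geometric_factor)
best_value = geometric_factor(best_rho)
```

Any objective that returns N for a trial ρ can be passed in place of `geometric_factor`. Because the objective is flat near its minimum, a coarse grid is usually adequate.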

[0162] For complex traits, the total variance σ_{A}^{2}_{D}^{2 }_{R}^{2 }_{R }_{1}_{1}^{2 }_{R}_{1}^{2 }_{0}^{2 }_{R}^{2}_{A}^{2}_{A}^{2}

[0163] In deriving the optimal test designs and estimating the test power, we assume implicitly that there is no measurement error in either the allele frequency p or the allele frequency difference Δp. For the allele frequency p, we show in the Results that either using the mean value (p_{U}_{L}

[0164] Unrelated Design

[0165] When a repository contains N unrelated individuals, the analytical approximation for the required repository size, derived in the Appendix, is

$$N_{\text{unrelated}} = \frac{\rho}{2y_\rho^2}\,(z_\alpha - z_{1-\beta})^2\,\frac{\sigma_R^2}{\sigma_A^2}$$

[0166] This function has a minimum at ρ=0.27, where ρ/2y_{ρ}^{2} is approximately 1.24.

[0167] If the population consists of sib pairs rather than unrelated individuals, an unrelated sub-population of N/2 individuals may be constructed by selecting one sib at random from each pair. A direct extension of the above result for unrelated populations yields

$$N_{\text{random-sib}} = \frac{2\rho}{y_{2\rho}^2}\,(z_\alpha - z_{1-\beta})^2\,\frac{\sigma_R^2}{\sigma_A^2}$$

[0168] for the sib-pair population. The repository size required for sib pairs is twice as large as for unrelated individuals, with a pooling fraction half as large.
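The factor-of-two relation can be checked with a small numerical sketch. It assumes the analytical approximation N = (ρ/2y_ρ²)(z_α − z_{1−β})² σ_R²/σ_A² for unrelated individuals and treats the random-sib design as the same design applied to N/2 individuals with pooling fraction 2ρ. The parameter values (α=5×10^{−8}, 80% power, σ_A²/σ_R²=0.02) are illustrative choices, and the function names are hypothetical.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()


def n_unrelated(rho, alpha, power, variance_ratio):
    """Approximate repository size for N unrelated individuals.

    variance_ratio is sigma_A^2 / sigma_R^2.
    """
    y = STD_NORMAL.pdf(STD_NORMAL.inv_cdf(1.0 - rho))
    z_a = -STD_NORMAL.inv_cdf(alpha)          # upper-tail quantile for alpha
    z_b = STD_NORMAL.inv_cdf(1.0 - power)     # z_{1-beta}, negative for power > 50%
    return rho / (2.0 * y * y) * (z_a - z_b) ** 2 / variance_ratio


def n_random_sib(rho, alpha, power, variance_ratio):
    """One sib chosen at random per pair: N/2 unrelated individuals, fraction 2*rho."""
    return 2.0 * n_unrelated(2.0 * rho, alpha, power, variance_ratio)


n_u = n_unrelated(0.27, 5e-8, 0.80, 0.02)
n_s = n_random_sib(0.135, 5e-8, 0.80, 0.02)
pool_size = 0.27 * n_u
```

With these inputs the unrelated design needs a repository in the low thousands, each pool holds several hundred individuals, and the random-sib repository is twice as large at half the pooling fraction.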

[0169] Mahalanobis Design

[0170] The analytical approximation for the number of individuals required for the Mahalanobis design, derived in the Appendix, is

_{Mabal}^{−1}_{ρ}_{ρ)/ρ(}^{1/2}^{−2}_{+}_{+}^{1/2}_{−}_{+}^{1/2}^{−2}_{α}_{1−β}^{2}_{R}^{2}_{A}^{2}

[0171] The initial geometrical factor depends only on the pooling fraction. It is minimized at ρ=0.188 with a value of 2.90, yielding

_{Mahal}_{+}_{+}^{1/2}_{−}_{−}^{1/2}^{−2}_{α}_{1−β}^{2}_{R}^{2}_{A}^{2 }

[0172] for this pooling design.

[0173] Pair-Mean Design

[0174] The analytical approximation for the repository size required by the pair-mean design is

$$N_{\text{pair-mean}} = \frac{s\rho}{2y_\rho^2}\,\frac{T_+}{R_+}\,(z_\alpha - z_{1-\beta})^2\,\frac{\sigma_R^2}{\sigma_A^2}$$

[0175] where s=2 for sib pairs. As with the unrelated design, the factor ρ/y_{ρ}^{2} is minimized at ρ=0.27, yielding

$$N_{\text{pair-mean}} = 2.47\,\frac{T_+}{R_+}\,(z_\alpha - z_{1-\beta})^2\,\frac{\sigma_R^2}{\sigma_A^2}$$

[0176] for the required repository size.

[0177] Pair-Difference Design

[0178] An analytical approximation for the repository size required by the pair-difference design is

$$N_{\text{pair-diff}} = \frac{s\rho}{2y_\rho^2}\,\frac{T_-}{R_-}\,(z_\alpha - z_{1-\beta})^2\,\frac{\sigma_R^2}{\sigma_A^2}$$

[0179] The factor ρ/y_{ρ}^{2} is again minimized at ρ=0.27, and

$$N_{\text{pair-diff}} = 2.47\,\frac{T_-}{R_-}\,(z_\alpha - z_{1-\beta})^2\,\frac{\sigma_R^2}{\sigma_A^2}$$

[0180] is the required repository size.

[0181] Combined pair-mean and pair-difference design Because the sib-mean variables X_{+}_{+}_{−}_{−}_{±}_{A}_{R}_{±}_{A}_{R}

_{±}_{±}^{1/2}_{±}_{ρ}_{ρ}_{±}

_{±}_{ρ}^{2}_{±}_{±}

[0182] from the expressions provided in the Appendix for Var(Δp_{±}_{±}_{±}

[0183] The combined minimum-variance estimator Q having expectation σ_{A}_{R}

_{ρ}_{ρ}_{+}_{+}_{+}_{+}^{−1}_{+}^{−1/2}_{+}_{−}^{−1/2}_{−}

_{ρ}^{2}_{+}_{+}_{−}_{−}_{−1}

[0184] An analytical approximation for the repository size required using the combined estimator is

_{comb}_{ρ}^{2}_{+}_{+}_{−}_{−}^{−1}_{α}_{1−β}^{2}_{R}^{2}_{A}^{2}

[0185] At the optimal pooling fraction of ρ=0.27, the factor (sρ/2y_{ρ}^{2}_{0 }_{1}

[0186] 4.4 Regression Tests

[0187] Regression tests requiring individual genotyping provide a benchmark for the efficiency of tests on pooled DNA. A regression test assesses the significance of the regression coefficient m in the model

_{i}_{i}_{i }

[0188] where i labels an observation, X_{i }_{i }_{i }_{1 }_{i }_{+}_{+}

[0189] The expectation of the regression coefficient m is 0 under H_{0 }

_{A}_{p}

[0190] under H_{1}_{R}^{2 }

_{i}_{i}_{R}^{2}_{p}^{2}

[0191] where s=1 for unrelated individuals or 2 for sib-pairs, and T/R=1 for unrelated individuals and T_{±}_{±}

[0192] The expectation and variance of the test statistic are related to the false-positive rate and power through the equation

^{−1 }_{α}_{1−β}^{2}_{R}^{2}_{A}^{2}

[0193] Substitution into this equation yields the repository size requirement for the regression test,

_{reg}_{α}_{1−β}^{2}_{R}^{2}_{A}^{2}

[0194] The combined estimator formed from the pair-mean and pair-difference estimators has a repository size requirement of

_{regr}_{+}_{+}_{−}_{−}^{−1}_{α}_{1−β}^{2}_{R}^{2}_{A}^{2}

[0195]

[0196] Results for required repository sizes were obtained numerically using computations converged to 1×10^{−6 }^{ii }

[0197] Brent's root-finding algorithm was used to determine the threshold values X_{U }_{L }
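The threshold calculation can be sketched as follows. This example substitutes plain bisection for Brent's algorithm to stay within the standard library, and it assumes Hardy-Weinberg genotype frequencies, normally distributed residuals with standard deviation σ_R, and genotype effects centered so that the phenotype mean is zero. All function names are hypothetical.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()


def genotype_mixture(p, a, d):
    """List of (frequency, centered mean effect, allele frequency) per genotype."""
    q = 1.0 - p
    raw = [(p * p, a, 1.0), (2.0 * p * q, d, 0.5), (q * q, -a, 0.0)]
    mean = sum(w * mu for w, mu, _ in raw)
    return [(w, mu - mean, pg) for w, mu, pg in raw]


def upper_tail(x, mixture, sigma_r):
    """Population fraction with phenotype above x."""
    return sum(w * (1.0 - STD_NORMAL.cdf((x - mu) / sigma_r))
               for w, mu, _ in mixture)


def upper_threshold(rho, mixture, sigma_r, tol=1e-12):
    """Bisection for X_U such that upper_tail(X_U) = rho (tail decreases in x)."""
    span = 10.0 * sigma_r + max(abs(mu) for _, mu, _ in mixture)
    lo, hi = -span, span
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if upper_tail(mid, mixture, sigma_r) > rho:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def upper_pool_frequency(x, mixture, sigma_r):
    """Expected allele frequency among individuals above threshold x."""
    tails = [(w * (1.0 - STD_NORMAL.cdf((x - mu) / sigma_r)), pg)
             for w, mu, pg in mixture]
    return sum(t * pg for t, pg in tails) / sum(t for t, _ in tails)


mixture = genotype_mixture(p=0.1, a=0.3, d=0.0)
x_u = upper_threshold(0.27, mixture, sigma_r=1.0)
p_u = upper_pool_frequency(x_u, mixture, sigma_r=1.0)
```

By symmetry, the lower threshold can be found the same way from the lower tail. With a positive displacement a, the expected allele frequency in the upper pool exceeds the population frequency.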

[0198] To assess the error made by assuming a normal distribution for Δp, we also performed tests in which Δp was calculated exactly according to a multinomial distribution. Results for the required repository size based on the normal distribution were then compared to the repository size based on a multinomial distribution. The two results for N differed by no more than 5% when the number of copies of the minor allele summed over both pools is greater than 60. They differ by approximately 10% when the number of alleles is 10, with the normal distribution underestimating the exact repository size. These differences are not visible on the scale of the figures.

[0199] Appendix 4A: Mathematical Details

[0200] 4A.1 Unrelated Design

[0201] The unrelated design considers a population of N unrelated individuals. Upper and lower thresholds X_{U }_{L }

_{G}_{U}_{R}

_{G}_{L}_{R}

[0202] which may be inverted numerically to find X_{U }_{L }_{U}_{L}

_{U}^{−1}_{U}_{R}

_{L}^{−1}_{L}_{R}

[0203] The expected allele frequencies under H_{1 }

_{1}_{U}_{G}_{U}_{G }

_{1}_{L}_{G}_{L}_{G}

_{1}_{U}_{L}

[0204] The variance of the test statistic can be obtained from the moments of a multinomial distribution [ ] (^{iii }

_{0}^{2}_{G}_{G}^{2}^{2}_{p}^{2 }

_{1}^{2}_{G}_{U}_{L}_{G}^{2}_{U}^{2}_{L}^{2}

[0205] Thus, when ρ is specified, the terms in the expression for the repository size N, (z_{α}_{0}_{1−β}_{1}^{2}_{1}^{2}

[0206] An approximate analytical expression for N may be obtained when σ_{R}^{2 }

[0207] where y=(2π)^{−1/2}^{2}_{R }

_{U}_{L}_{R}^{−1}

[0208] the expected difference in allele frequency is

_{β}_{G}_{G}_{R}_{ρ}_{ρ}_{A}_{R}

[0209] where y_{p}^{−1/2}^{−1}^{2}_{R}_{0}^{2 }_{1}^{2 }_{p}^{2}

$$N_{\text{unrelated}} = \frac{\rho}{2y_\rho^2}\,(z_\alpha - z_{1-\beta})^2\,\frac{\sigma_R^2}{\sigma_A^2}$$

[0210] The minimum occurs at ρ=0.27 and y_{ρ}

[0211] 4A.2 Mahalanobis Design

[0212] For the Mahalanobis design, thresholds b_{U }_{L }

[0213] The factor of (½) arises because only one individual is selected from each sib pair. If the radial coordinate b is larger than the threshold value, the phase angle φ determines which sib is selected for which pool: the sibling with genotype G_{1 }_{2 }_{U}_{L}

[0214] where symmetry between siblings has allowed the change in integration limits for φ to consider only the regions where sibling 1 is selected. Once ρ is specified, the thresholds for b may be obtained numerically, and E_{1}_{U }_{L}

[0215] An analytic approximation for the repository size requirement may be obtained by noting that

_{1}_{2}^{−1}_{+}_{−}^{2}

[0216] to lowest order in the gene effect μ(G). The normalization condition leads to the equation

_{ρ}_{2}

[0217] with b_{U}_{L}_{ρ}

_{U,L}_{G′}_{+}_{−}_{ρ}_{ρ}^{1/2}

[0218] where the upper pool has the + sign and the lower pool the − sign. The expected allele frequencies in the upper and lower pools are

_{U,L}_{ρ}_{ρ}^{1/2}_{+}_{+}^{1/2}_{+}_{−}^{1/2}_{p}_{A}_{R}

[0219] where the upper pool has the positive deviation from p and the lower pool the negative deviation. These results are derived using the identities

[0220] where r is the genotypic correlation (0.5 for full-sibs). Since θ_{U}_{L}_{1}^{2 }_{0}^{2}_{p}^{2 }

_{Mahalanobis}^{−1}_{ρ}_{ρ}_{1/2}_{2}_{+}_{+}^{1/2}_{−}_{−}^{1/2}^{−2}_{α}_{1−β}^{2}_{R}^{2}_{A}^{2}

[0221] The minimum occurs at ρ=0.188.

[0222] 4A.3 Pair-Mean Design

[0223] The fraction ρ of the total population selected according to pair-mean pooling is defined in terms of the upper threshold X_{U }_{L }

[0224] The genotype distribution describing the individuals selected for each pool follows a multinomial distribution based on sib-pair genotypes rather than individual genotypes, such that

[0225] with

_{U}_{1}_{2}^{−1}_{U}_{R}_{1}_{2}

_{L}_{1}_{2}^{−1}_{L}_{R}_{1}_{2}

[0226] The expected allele frequencies under H_{1 }

_{1}_{U}_{L}

[0227] and p_{+}_{1}_{2}_{0 }_{1 }

[0228] The factor s=2 accounts for the family structure, as n/s rather than n measurements of p_{+}_{+}_{p}^{2}_{p}^{2 }_{p}^{2}_{1}_{1}^{2}

[0229] An analytical approximation follows the same derivation used for the unrelated design, except that individual genotypes are replaced by sib-pair genotypes, and individual phenotypes, phenotype offsets, and allele frequencies are replaced by their pair-mean analogs. The upper and lower pooling thresholds are

_{U}_{L}_{+}^{−1}

[0230] and the allele frequency difference between pools is

[0231] where y_{ρ}^{−1}_{1}^{2 }

$$N_{\text{pair-mean}} = \frac{s\rho}{2y_\rho^2}\,\frac{T_+}{R_+}\,(z_\alpha - z_{1-\beta})^2\,\frac{\sigma_R^2}{\sigma_A^2}$$

[0232] 4A.4 Pair-Difference Design

[0233] Under the pair-difference design, a sib pair is selected if the pair-difference X_{−}_{T}

[0234] In the first term, sibling 1 has the higher phenotype and is selected for the upper pool, and sibling 2 is selected for the lower pool. In the second term, the roles of the siblings are reversed. Multinomial distributions are defined as θ_{U}_{1}_{2}_{L}_{1}_{2}

[0235] This normalization implies that

_{U}_{1}_{2}^{−1}_{1}_{2}_{1}_{2}_{T}_{−}

_{L}_{1}_{2}^{−1}_{1}_{2}_{1}_{2}_{T}_{−}

[0236] Due to symmetry, θ_{U}_{1}_{2}_{L}_{2}_{1}

[0237] by symmetry, each term contributes E(Δp)/2. To calculate the variance of Δp, it is important to note that the normalization of θ_{U }_{L }_{U }_{L}_{U }_{L }_{2}

[0238] The value of σ_{0}^{2 }_{p}^{2}

[0239] The repository size required to detect association may be determined exactly by numeric calculation of the threshold value X_{T }^{0}^{2}_{1}^{2}

[0240] An analytic expression accurate when σ_{R}^{2 }

_{T}_{−}^{−1}

[0241] and the allele frequency difference is

[0242] where y_{p }^{−1}_{1}^{2 }_{0}^{2 }

$$N_{\text{pair-diff}} = \frac{s\rho}{2y_\rho^2}\,\frac{T_-}{R_-}\,(z_\alpha - z_{1-\beta})^2\,\frac{\sigma_R^2}{\sigma_A^2}$$

[0243] When the effect of a QTL is small and the residual variance σ_{R}^{2 }

[0244] The repository size requirements of pooled DNA methods are shown in

[0245] The repository size required for the Mahalanobis design is shown relative to that required for the combined regression test. This ratio depends on the residual phenotypic correlation t_{R}_{R}_{R}

[0246] In _{R}

[0247] In _{R}

[0248] For designs using only 2 pools, a population of unrelated individuals is more powerful than a population of sib pairs except for large values of the sibling phenotypic correlation, t_{R}

[0249] The slope of the pair-difference repository size requirement is 3× larger than the slope of the pair-mean population requirement. Thus, relative to the pair-mean design, the pair-difference design decreases in power rapidly as t_{R }_{R}

[0250] The combined 4 pool test using pair-mean and pair-difference pools is uniformly the most powerful sib-pair design for all values of t_{R}_{R }^{1/2}^{1/2}_{R}

[0251] According to the analytic theory, the necessary size of the study population for pooling tests is inversely proportional to the additive variance contributed by the QTL relative to the residual phenotypic variance, σ_{A}^{2}_{R}^{2}^{−8 }

[0252] A single representative value for the sibling phenotypic correlation t_{R }^{viii }^{x }_{R}_{R}_{R }

[0253] In _{A}^{2}_{R}^{2 }_{A}^{2}_{R}^{2}_{A}^{2}_{R}^{2}

[0254] The allele frequency difference at the significance threshold, z_{α}_{0}^{1/2}_{A}^{2}_{R}^{2}_{A}^{2}^{2}

[0255] The sensitivity of the results to both the allele frequency p and the inheritance mode are shown in _{A}^{2}_{R}^{2 }_{A}^{2 }_{A}^{2 }

[0256] The repository size is rather insensitive to allele frequency for p>0.01 for dominant and additive inheritance, and for p>0.2 for recessive inheritance, for all but the Mahalanobis design, indicating that the analytic theory is valid in these regions. The repository size required to detect association increases rapidly as the allele frequency decreases below these limits. The Mahalanobis design is more sensitive to the allele frequency than the other designs, losing power rapidly as the allele frequency falls below 0.1 for dominant and additive inheritance and 0.2 for recessive inheritance.

[0257] The allele frequency at which the analytic theory loses accuracy may be estimated by noting that the perturbation parameters used to derive the theory are the terms μ(G)/σ_{R}_{A}^{2}_{A}^{2}_{A}^{2/3}

[0258] In _{1}_{1 }_{1}_{2 }_{1}

[0259] In the bottom panel,

[0260] We have also investigated the sensitivity of the exact numerical results to specified rates of type I and type II error. In the analytical approximations, this behavior is described entirely by the term (z_{α}_{1−β}^{2}_{α}_{1−β}^{2 }_{1=β}^{−8 }_{α}^{−5 }_{α}_{α}

[0261] A marker may show spurious association to a phenotype in the presence of a stratified population. We consider a simple model for stratification in which a population contains at least one sub-population having a mean marker frequency and a mean phenotypic value that both deviate from their respective means in the total population. In individual genotyping, within-family tests such as the transmission disequilibrium test are known to be robust to this type of stratification. Between-family tests, however, may identify spurious associations or miss true associations due to stratification effects.

[0262] Tests of pooled DNA in which family members are balanced between pools, such as the pair-difference design, are analogous to within-family tests. The value of σ_{A}_{R }_{A}_{R }^{2 }

^{2}_{+}^{2}_{ρ}^{2}_{+}_{+}_{−}^{2}_{ρ}^{2}_{−}_{−}

[0263] with one degree of freedom. This stratification estimator may also be expressed as

^{2}_{+}_{−}^{2}_{ρ}^{2}_{+}_{+}_{−}_{−}

[0264] A significant finding for this test, for example at the 0.05 level, indicates that stratification is present and that tests other than the pair-difference test may yield spurious results.

[0265] The preceding analysis has assumed that allele frequency measurement errors are negligible. Allele frequencies measured by most technologies, including PCR amplification [^{xi }^{xll }^{xlii }^{XIV }^{XV }^{XVI }_{U }_{L}_{U}_{L}_{1}^{2}

[0266] The measurement error in Δp, however, has a more deleterious effect on the test power. Again assuming a measurement error of 0.01 for each pool, the measurement error for Δp is larger by a factor of √2, approximately 0.014. This error can eventually become larger than the sampling error σ_{0}^{2}_{0.005}_{0.95}_{0.0005}_{0.95}

[0267] The allele frequency measurement error also sets a lower limit for the effect size that can be detected with a pooled test. For example, using the analytical approximation for Δp for pair-mean pools derived in the Appendix,

_{1}_{ρ}_{+}_{+}^{1/2}_{p}_{A}_{R}_{R}^{−1/2}

[0268] where the optimized pooling fraction ρ=0.27 is used and the residual variance σ_{R}^{2}_{R }

[0269] For additive inheritance and allele frequency of 0.5, the threshold phenotypic displacement a is 0.11 and the corresponding additive variance is 0.0063. If the minor allele frequency is 0.1, the threshold displacement a is 0.31 and the corresponding additive variance is 0.017.

[0270] In the presence of population stratification, the pair-mean pools may give spurious results and pair-difference pools are preferred. Using the expectation for Δp derived in the Appendix for pair-difference pools, we require that

_{1}_{ρ}_{−}_{−}^{1/2}_{p}_{A}_{R}_{R}^{−1/2}

[0271] where ρ=0.27 and σ_{R}^{2}_{R}

[0272] For additive inheritance and an allele frequency of 0.5, the critical displacement is 0.20 and the additive variance is 0.02. For a rare minor allele, p=0.1, and additive inheritance, the critical displacement is 0.54, corresponding to an additive variance of 0.05.

[0273] 5. Model 3

[0274] In this model, techniques similar to those described in Models 1 and 2 are applied to provide optimized selection criteria for association studies of pooled DNA using the allele frequency difference between pools as a test statistic. It is assumed that samples are drawn from a pre-existing population-level DNA repository collected from individuals unselected for any particular phenotype, and that each individual has been measured for a particular phenotype of interest; the goal is to select pools to maximize the power of the test.

[0275] Assuming no experimental error in allele frequency measurements on pooled DNA, we determine the selection thresholds that maximize the power to detect association as a function of the frequency, phenotypic displacement, and inheritance mode of a functional polymorphism. The genetic parameters are also described in terms of a genotype relative risk model. Power calculations are then used to derive the repository size required to detect association at specified false-positive and false-negative rates. These calculations are performed at three decreasing levels of accuracy: exact numerical calculations using the true multinomial distribution of the test statistic; numerical calculations based on an approximate normal distribution of the test statistic; and analytical approximations accurate for complex traits where the polymorphism has a small effect on the phenotype.

[0276] Results are depicted in terms of the repository sizes required for three types of experimental designs for detecting association with a quantitative phenotype: first, a pooled DNA test using a conventional affected/unaffected classification; second, a pooled DNA test of extreme individuals using optimized selection thresholds; third, individual genotyping of the entire population. We conclude with a discussion of the reduction in power of pooled DNA tests due to experimental measurement error and with suggestions for effective use of pooled DNA tests in practice.

[0277] 5.1 Computational Methods

[0278] The calculation of optimized selection thresholds begins with a model for the genotype-dependent distribution of phenotypic values. A quantitative phenotype, denoted X, is standardized to have unit variance and zero mean. The phenotype is hypothesized to be affected by alleles A_{1 }_{2}_{1}_{1}_{1}_{2}_{2}_{2 }_{G }_{1}_{1}_{1}_{2}_{2}_{2 }

[0279] The inheritance mode of the QTL is represented by the displacement d of the heterozygote, for example purely recessive (d=−a), additive (d=0), or dominant (d=+a) inheritance. The inheritance mode partitions the phenotypic variance due to the QTL into the additive variance σ_{A}^{2} and the dominance variance σ_{D}^{2},

σ_{A}^{2} = 2p(1−p)[a − (2p−1)d]^{2},  σ_{D}^{2} = 4p^{2}(1−p)^{2}d^{2}.
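The partitioning can be illustrated numerically. The following is a minimal sketch, assuming Hardy-Weinberg proportions, that p denotes the frequency of allele A_{1}, and that the genotype means are +a, d, and −a up to a common shift; the function name `variance_components` is hypothetical and uses the standard single-locus expressions for the additive and dominance variance.

```python
def variance_components(p, a, d):
    """Return (additive, dominance) variance for a biallelic QTL.

    Assumes Hardy-Weinberg proportions, allele A1 at frequency p, and
    genotype means +a (A1A1), d (A1A2), -a (A2A2) up to a common shift.
    """
    q = 1.0 - p
    alpha = a + d * (q - p)              # average effect of allele substitution
    var_additive = 2.0 * p * q * alpha ** 2
    var_dominance = (2.0 * p * q * d) ** 2
    return var_additive, var_dominance


if __name__ == "__main__":
    # Recessive, additive, and dominant inheritance at the same displacement a.
    for label, d in [("recessive", -0.2), ("additive", 0.0), ("dominant", 0.2)]:
        va, vd = variance_components(p=0.5, a=0.2, d=d)
        print(f"{label:9s} additive={va:.4f} dominance={vd:.4f}")
```

For additive inheritance with p=0.5 and a=0.20 the sketch returns an additive variance of 0.02, and the additive component remains nonzero for purely dominant or recessive inheritance.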

[0280] This partitioning is important because, as will be seen below, pooled tests are sensitive primarily to the additive component of variance. Note that the additive component may be large even when the inheritance is purely dominant or recessive. The contributions to the phenotype from remaining genetic and environmental factors are assumed to follow a normal distribution with residual variance σ_{R}^{2}

[0281] σ_{A}^{2}_{A}^{2}_{D}^{2}

[0282] The genotype-dependent phenotype distributions for each genotype are

P(X|G) = (2πσ_{R}^{2})^{−1/2} exp[−(X − μ_{G})^{2}/2σ_{R}^{2}],

[0283] normal distributions centered at μ_{G} with standard deviation σ_{R}. The overall phenotype distribution is the mixture

P(X) = Σ_{G} P(G) P(X|G).

[0284] For a complex trait in which the QTL makes a small contribution, the three underlying distributions may be unresolved in the observed P(X).
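The overall phenotype distribution can be evaluated directly as a three-component normal mixture. The sketch below is illustrative only: it assumes Hardy-Weinberg genotype frequencies, shifts the genotype means so that the overall mean is zero, and takes the residual standard deviation `sigma_r` as an input; the function name is hypothetical.

```python
from statistics import NormalDist


def phenotype_density(x, p, a, d, sigma_r):
    """Mixture density P(X) = sum_G P(G) P(X|G) under Hardy-Weinberg proportions."""
    q = 1.0 - p
    genotype_probs = (p * p, 2.0 * p * q, q * q)      # A1A1, A1A2, A2A2
    raw_means = (a, d, -a)
    # Shift the genotype means so the overall phenotype mean is zero.
    shift = sum(w * m for w, m in zip(genotype_probs, raw_means))
    return sum(
        w * NormalDist(mu=m - shift, sigma=sigma_r).pdf(x)
        for w, m in zip(genotype_probs, raw_means)
    )


if __name__ == "__main__":
    # With a small displacement the three components merge into a single peak.
    for x in (-1.0, -0.5, 0.0, 0.5, 1.0):
        print(x, round(phenotype_density(x, p=0.5, a=0.1, d=0.0, sigma_r=0.9975), 4))
```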

[0285] This variance components model may be connected to an equivalent affected/unaffected genotype relative risk model by specifying a threshold phenotypic value X_{T }_{T}_{T}_{2}_{2}

[0286] In the tests of pooled DNA considered here, a sample repository of total size N serves as the source of DNA to be selected for one of two pools; not every individual need be selected. The test statistic is the difference in the frequency that a particular allele, here always assumed to be A_{1}_{U }_{L }_{U }_{L}_{L }_{U }_{U }

_{G}_{U}_{G}_{R}

[0287] which is solved numerically to determine X_{U}_{U}_{U}_{U}_{G}_{R}

_{G}_{L}^{−1}_{G}_{L}_{G}_{R}

[0288] using the lower threshold X_{L}

[0289] A pooling design based on an affected/unaffected classification is similar: affected individuals are selected for the upper pool; an equivalent number of suitably matched unaffected individuals are selected for the lower pool. The selection thresholds X_{U }_{L }_{T}_{U}_{U}_{2}_{2}_{2}_{2}

[0290] The repository size N required to detect association between genotype G and either the quantitative phenotype X or the affected/unaffected classification depends on the desired type I error rate α and type II error rate β, the chosen test statistic, and the experimental design, as well as on the underlying genetic model. For a one-sided test of a single marker, α=1−Φ(z_{α}_{−β}^{−8 }_{α}_{1−β}^{5 }_{0 }_{G }_{1 }_{G}

[0291] An exact calculation of the repository size required to attain desired error rates for a specified genetic model proceeds as follows. First, a value of the pooling fraction ρ or the disease prevalence r is selected. A trial repository size N is specified, with the number of individuals n selected per pool set to the integer part of ρN or rN. Next, the probability P_{0}_{1}_{1}_{1}_{2}_{2}_{2}

_{0}^{2}^{i}^{2}^{j}^{2 }^{k}

[0292] The frequency of allele A_{1 }_{0}_{0}

[0293] Significance at level α is attained by increasing Δp until this sum is less than or equal to α. If not even the maximum value Δp=1 is sufficient for significance at level α, then a larger value of N is selected for the current value of ρ and the calculation begins anew. Otherwise, multinomial probabilities for pool compositions are calculated under H_{1 }

_{U}^{U}_{1}_{1}^{i}_{U}_{1}_{2}^{j}_{U}_{2}_{2}^{k}

[0294] for the upper pool and a similar term P_{L}_{L }_{U}_{U}_{L}

[0295] For the affected/unaffected design, this procedure is followed for each value of r. For the tail pool design, the smallest feasible value for N is calculated as a function of ρ, and the entire design is optimized by searching for the pooling fraction ρ with the smallest feasible N.
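The significance step of the exact calculation can be sketched as follows. This is a simplified illustration rather than the full procedure: it assumes Hardy-Weinberg proportions under the null hypothesis, in which case the multinomial distribution of genotypes in a pool of n individuals implies a binomial distribution for the count of allele A_{1} among the 2n alleles. The function name `critical_delta_p` is hypothetical.

```python
from math import comb


def critical_delta_p(n, p, alpha):
    """Smallest allele-frequency difference that is significant at level alpha.

    Returns None when even delta_p = 1 is not significant, i.e. the pools
    are too small and a larger repository is needed.
    """
    m = 2 * n                                         # alleles per pool
    pmf = [comb(m, k) * p ** k * (1.0 - p) ** (m - k) for k in range(m + 1)]
    # Null distribution of the difference in A1 counts, upper minus lower pool.
    diff = {c: 0.0 for c in range(-m, m + 1)}
    for k_upper, w_upper in enumerate(pmf):
        for k_lower, w_lower in enumerate(pmf):
            diff[k_upper - k_lower] += w_upper * w_lower
    tail, critical = 0.0, None
    for c in range(m, -m - 1, -1):                    # accumulate the upper tail
        tail += diff[c]
        if tail > alpha:
            break
        critical = c
    return None if critical is None else critical / m


if __name__ == "__main__":
    print(critical_delta_p(n=1, p=0.5, alpha=0.05))   # pools too small
    print(critical_delta_p(n=10, p=0.5, alpha=0.05))
```

A return value of `None` corresponds to the case in which a larger value of N must be selected before the calculation is repeated.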

[0296] When each pool contains a large number of individuals and many copies of each allele, the distribution of allele frequencies for the pool approaches a normal distribution. The difference in allele frequencies between pools, which continues to serve as the test statistic, approaches a normal distribution as well. The pool sizes required to achieve specified error rates are obtained accurately in this case by approximating the multinomial distributions of allele frequencies as normal distributions. Under H_{0}_{0}^{2}

[0297] Under H_{1}

_{U}_{L}_{G}_{U}_{L}_{G}

[0298] where the genotype-dependent allele frequency p_{G }_{1}_{1}_{1}_{2}_{2}_{2}_{1}^{2}_{1}^{2}

_{1}^{2}_{G}_{U}_{L}_{G}^{2}_{U}^{3}_{L}^{2}

[0299] The repository size N required for type I error α and power 1−β is

N = [z_{α}σ_{0} − z_{1−β}σ_{1}]^{2}/(Δp)^{2}.

[0300] For tail pools, ρ is then varied to find the smallest N. The normal approximation underestimates the repository size requirement relative to the exact results from the multinomial distribution. When the sum of the alleles in both pools is at least 60, the difference in repository sizes is no greater than 5%. We chose 60 alleles in both pools as the criterion for switching from the multinomial to the normal calculation. Standard algorithms were employed to perform the root search for X_{U} and X_{L}.
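The normal-approximation calculation for tail pools can be sketched as below. Several points are assumptions made for illustration: Hardy-Weinberg genotype frequencies, a bisection root search for the thresholds, binomial sampling of 2ρN alleles per pool for both variances, and the convention that z_{α} and z_{1−β} are upper-tail normal deviates, so that N = [z_{α}σ_{0} − z_{1−β}σ_{1}]^{2}/(Δp)^{2} with Var(Δp) = σ^{2}/N. All function names are hypothetical.

```python
from statistics import NormalDist

STD = NormalDist()


def tail_pool_repository_size(p, a, d, rho, alpha=5e-8, beta=0.2):
    """Normal-approximation repository size N for tail pools of fraction rho."""
    q = 1.0 - p
    probs = (p * p, 2.0 * p * q, q * q)          # Hardy-Weinberg genotype frequencies
    doses = (1.0, 0.5, 0.0)                      # frequency of A1 within each genotype
    raw = (a, d, -a)
    mean = sum(w * m for w, m in zip(probs, raw))
    means = [m - mean for m in raw]              # phenotype standardized to zero mean
    var_qtl = sum(w * m * m for w, m in zip(probs, means))
    sigma_r = (1.0 - var_qtl) ** 0.5             # unit total variance

    def fraction_above(x):
        return sum(w * (1.0 - STD.cdf((x - m) / sigma_r)) for w, m in zip(probs, means))

    def solve(target):
        lo, hi = -10.0, 10.0                     # fraction_above decreases with x
        for _ in range(200):
            mid = 0.5 * (lo + hi)
            if fraction_above(mid) > target:
                lo = mid
            else:
                hi = mid
        return 0.5 * (lo + hi)

    x_upper, x_lower = solve(rho), solve(1.0 - rho)
    p_upper = sum(w * g * (1.0 - STD.cdf((x_upper - m) / sigma_r))
                  for w, g, m in zip(probs, doses, means)) / rho
    p_lower = sum(w * g * STD.cdf((x_lower - m) / sigma_r)
                  for w, g, m in zip(probs, doses, means)) / rho
    delta_p = p_upper - p_lower

    # Var(delta_p) = sigma^2 / N with 2*rho*N alleles per pool (binomial assumption).
    sigma0 = (p * q / rho) ** 0.5
    sigma1 = ((p_upper * (1 - p_upper) + p_lower * (1 - p_lower)) / (2.0 * rho)) ** 0.5
    z_alpha = -STD.inv_cdf(alpha)                # upper-tail deviate for alpha
    z_power = STD.inv_cdf(beta)                  # z_{1-beta}: upper-tail probability 1-beta
    return ((z_alpha * sigma0 - z_power * sigma1) / delta_p) ** 2


def best_tail_design(p, a, d, **kwargs):
    """Grid search over the pooling fraction; returns (N, rho) with the smallest N."""
    grid = [i / 100.0 for i in range(5, 51)]
    return min((tail_pool_repository_size(p, a, d, rho, **kwargs), rho) for rho in grid)


if __name__ == "__main__":
    n_required, rho_best = best_tail_design(p=0.5, a=0.1, d=0.0)
    print(f"rho = {rho_best:.2f}, N = {n_required:.0f}")
```

A coarse grid is used here for clarity; any one-dimensional minimizer could replace it.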

[0301] In the regime of typical complex traits, the effect of any single QTL is small, the residual variance σ_{R}^{2 }_{G}

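Φ(z+δ) = Φ(z) + δ(dΦ/dz) + (δ^{2}/2)(d^{2}Φ/dz^{2}),

[0302] truncated at second order. The first derivative is

dΦ/dz = (2π)^{−1/2} exp(−z^{2}/2) = y,

[0303] where y is the height of the normal distribution at normal deviate z, and the second derivative is

d^{2}Φ/dz^{2} = (d/dz)(2π)^{−1/2} exp(−z^{2}/2) = −zy.

[0304] Summing these terms,

Φ(z+δ) = Φ(z) + yδ − (1/2)zyδ^{2}.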
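The accuracy of a second-order expansion of the cumulative normal distribution can be checked numerically. The sketch below assumes the expansion is taken about the deviate z in a small shift δ, using the standard derivatives Φ′(z) = y and Φ″(z) = −zy; the function name is hypothetical.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()


def cdf_second_order(z, delta):
    """Second-order Taylor approximation to Phi(z + delta) about z."""
    y = STD_NORMAL.pdf(z)                 # height of the normal density at z
    return STD_NORMAL.cdf(z) + y * delta - 0.5 * z * y * delta ** 2


if __name__ == "__main__":
    z = 0.61
    for delta in (0.05, 0.1, 0.2):
        exact = STD_NORMAL.cdf(z + delta)
        approx = cdf_second_order(z, delta)
        print(f"delta={delta:.2f} exact={exact:.6f} approx={approx:.6f}")
```

The residual error grows as δ^{3}, so the approximation is most reliable when the shift δ is small compared with 1.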

[0305] Substituting this approximation into the expressions for θ(G) using δ=μ_{G}_{R }^{−1}

_{U}_{R}_{G}_{G}_{G}+(}_{R}^{2}_{G}_{G}_{G}^{2}

_{L}_{R}_{G}_{G}_{G}+(}_{R}^{2}_{G}_{G}_{G}^{2}

[0306] The corresponding expressions for the affected/unaffected pools, with z=Φ^{−1}

_{U}_{R}_{G}_{G}_{G}+[}_{R}^{2}_{G}_{G}_{G}^{2}

[0307] The required sums are

_{G}_{G}_{G}_{A}^{½}

_{G}_{G}_{G}^{2}_{R}^{2}^{2}^{2}_{D}^{2}_{A}^{2}

[0308] The approximate value σ_{A}^{2}

[0309] The results for Δp are

^{1/2}_{0}_{A}_{R}

_{−1}_{A}^{3/2}_{0}_{R}_{0}_{A}^{1/2}_{R}

[0310] To the same order of approximation, σ_{1}^{2} = σ_{0}^{2}, so that

N = [z_{α} − z_{1−β}]^{2} σ_{0}^{2}/(Δp)^{2}.

[0311] The preceding three equations lead directly to our main results, Eqs. 1 and 2.

[0312] The perturbation theory above is valid when the expansion parameters μ_{G}_{R }_{A}^{2}_{R}_{1}_{1 }_{R}_{R }

[0313] If individual genotypes are measured for the N individuals in the population, the regression coefficient b_{1 }

_{1}_{G}

[0314] is a suitable test statistic. The residual contribution ε to the phenotype has mean zero and is uncorrelated with p_{G}_{0}_{1 }

_{1}_{0}^{−1}_{G}

[0315] Under H_{1}_{1 }

_{1}_{1}_{G}_{A}^{1/2 }

_{1}_{1}^{−1}_{G}_{R}^{2}

[0316] The repository size required for a one-sided test of b_{1 }

_{α}_{1}_{0}^{1/2}_{1−β}_{1}_{1}^{1/2}^{2}_{1}_{1}^{2}

[0317] which is presented in simplified form as Eq. 3.

[0318] Two experimental designs are considered using DNA pooled from individuals selected from a pre-existing repository of N samples: affected/unaffected pools, with DNA pooled from n affected and n unaffected individuals; and tail pools, with DNA pooled from the n most extreme individuals at each tail of the phenotype distribution.

[0319] For the affected/unaffected design, the expected number of affected individuals is n=rN, and an additional n suitably matched controls are selected from the remainder of the population.

[0320] An analytical approximation for the repository size is

_{aff/unaff}_{α}_{1−β}^{2}_{R}^{2}_{A}^{2}^{2}_{r}^{2}^{−}_{A}^{3/2}_{R}^{1/2}^{1/2}^{2}

[0321] where y_{r }^{−1}

[0322] The tail pools are parameterized by the fraction ρ = n/N of population N selected for each pool. An analytical approximation for the repository size is

N_{tail} = [z_{α} − z_{1−β}]^{2} (σ_{R}^{2}/σ_{A}^{2}) ρ/(2y_{ρ}^{2}),

[0323] where y_{ρ}^{−1}_{ρ}^{2 }^{tail}

[0324] The repository size required to achieve the same error rates using individual genotyping is

N_{indiv} = [z_{α} − z_{1−β}]^{2} σ_{R}^{2}/σ_{A}^{2},

[0325] based on a regression model of phenotypic value on allele dose (see Materials and Methods for derivation).
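The analytical approximations for tail pools and individual genotyping can be compared numerically. The sketch below assumes the leading-order forms N_{indiv} = [z_{α} − z_{1−β}]^{2} σ_{R}^{2}/σ_{A}^{2} and N_{tail} = N_{indiv} ρ/(2y_{ρ}^{2}), with y_{ρ} taken as the normal density at the deviate whose upper-tail area is ρ and with z_{α} and z_{1−β} as upper-tail deviates. The function names and the grid search are illustrative choices.

```python
from statistics import NormalDist

STD = NormalDist()


def z_factor(alpha, beta):
    """(z_alpha - z_{1-beta})^2 with upper-tail normal deviates."""
    return (-STD.inv_cdf(alpha) - STD.inv_cdf(beta)) ** 2


def n_individual(alpha, beta, var_additive, var_residual):
    """Leading-order repository size for individual genotyping."""
    return z_factor(alpha, beta) * var_residual / var_additive


def n_tail(alpha, beta, var_additive, var_residual, rho):
    """Leading-order repository size for tail pools of fraction rho."""
    y = STD.pdf(STD.inv_cdf(1.0 - rho))   # normal density at the selection threshold
    return n_individual(alpha, beta, var_additive, var_residual) * rho / (2.0 * y * y)


def optimal_pooling_fraction(steps=500):
    """Grid search on (0, 0.5] for the fraction that minimizes n_tail."""
    grid = [0.5 * i / steps for i in range(1, steps + 1)]
    # The minimizing fraction does not depend on the error rates or variances.
    return min(grid, key=lambda rho: n_tail(0.05, 0.2, 0.01, 0.99, rho))


if __name__ == "__main__":
    rho = optimal_pooling_fraction()
    ratio = n_tail(5e-8, 0.2, 0.01, 0.99, rho) / n_individual(5e-8, 0.2, 0.01, 0.99)
    print(f"optimal rho = {rho:.3f}, N_tail / N_indiv = {ratio:.3f}")
```

Under these assumptions the optimal fraction is close to the value ρ=0.27 used above, and the ratio N_{tail}/N_{indiv} at the optimum is independent of the error rates and of the variance components.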

[0326] Results of the analytical approximations are shown in

[0327] The effect of varying the inheritance mode is shown in ^{−8}_{1 }_{2 }

[0328] The effect of varying the additive variance directly, or equivalently the genotype relative risk for an allele of known frequency, is shown in ^{−8 }

[0329] This analysis assumes that allele frequency measurement error is negligible. Allele frequencies measured by most technologies, including PCR amplification, kinetic PCR, denaturing high performance liquid chromatography, single-strand conformation polymorphism, pyrophosphate sequencing, and mass spectrometry, are typically reported with standard errors in the range of 0.01 to 0.02. Assuming a measurement error of 0.01 for each pool, the measurement error in the frequency difference is larger by a factor of √2, yielding a final error of 0.014. Based on this measurement error, the allele frequency difference of 0.04 in the example above corresponds to a z-score of 2.86 and a type I error rate of 0.002.
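The arithmetic in this paragraph can be reproduced with a short calculation. The sketch assumes independent measurement errors in the two pools and a one-sided normal test; the function names and the optional rounding argument are illustrative. Rounding the combined error to 0.014 gives a z-score of about 2.86, while the unrounded error of 0.01√2 gives about 2.83; both correspond to an upper-tail probability of approximately 0.002.

```python
from math import sqrt
from statistics import NormalDist


def difference_error(per_pool_error):
    """Standard error of the difference of two independent pool measurements."""
    return sqrt(2.0) * per_pool_error


def z_and_upper_tail(delta_p, per_pool_error, round_to=None):
    """Return the z-score of delta_p and its one-sided upper-tail probability."""
    error = difference_error(per_pool_error)
    if round_to is not None:
        error = round(error, round_to)    # optional rounding of the combined error
    z = delta_p / error
    return z, 1.0 - NormalDist().cdf(z)


if __name__ == "__main__":
    print(z_and_upper_tail(0.04, 0.01, round_to=3))   # combined error rounded to 0.014
    print(z_and_upper_tail(0.04, 0.01))               # unrounded combined error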

[0330] While this error rate is much larger than the error rate of 5×10^{−8}

[0331] This experimental limitation sets a threshold for the effect size that may be identified in a pooled DNA pre-screen. The relationship between the expected value of Δp and the parameters of the genetic model for a SNP with purely additive inheritance is

_{α}_{α}_{1−β}

[0332] where the initial factor of 2.44 arises from the optimized pooled tail design, z_{α}_{1−β}_{α}_{1−β}

[0333] Abecasis, G R, Cardon, L R, Cookson, W O C (2000) A general test of association for quantitative traits in nuclear families. Am J Hum Genet 66: 279-292.

[0334] Alderborn A, Kristofferson A, Hammerling U: Determination of single-nucleotide polymorphisms by real-time pyrophosphate DNA sequencing. Genome Res 2000; 10; 1249-1258.

[0335] Austin M A, King M C, Bawol R D, Hulley S B, Friedman G D (1987) Risk factors for coronary heart disease in adult female twins. Genetic heritability and shared environmental influences. Am J Epidemiol 125: 308-18.

[0336] Barcellos L F, Klitz W, Field L L, Tobias R, Bowcock A M, Wilson R, Nelson M P et al. (1997) Association mapping of disease loci, by use of a pooled DNA genomic screen. Am J Hum Genet 61:734-747.

[0337] Beyer W H (ed) (1984) CRC Standard Mathematical Tables, 27^{th }

[0338] Buetow K H, Edmonson M, MacDonald R, Clifford R, Yip P, Kelley J, Little D P, Strausberg R, Koester H, Cantor C R, Braun A: High-throughput development and characterization of a genomewide collection of gene-based single nucleotide polymorphism markers by chip-based matrix-assisted laser desorption/ionization time-of-flight mass spectrometry. Proc Nat Acad Sci USA 2001; 98; 581-584.

[0339] Cargill M, Altshuler D, Ireland J, Sklar P, Ardlie K, Patil N, Shaw N et al. (1999) Characterization of single-nucleotide polymorphisms in coding regions of human genes. Nat Genet 22:231-238.

[0340] Cardon L R (2000) A Sib-Pair Regression Model of Linkage Disequilibrium for Quantitative Traits. Hum Hered. 50:350-358.

[0341] Chandler D. Introduction to Modern Statistical Mechanics. New York: Oxford Univ. Press; 1987

[0342] Collins A, Lonjou C, Morton N E (2000) Genetic epidemiology of single-nucleotide polymorphisms. Proc Natl Acad Sci USA 96:15173-15177.

[0343] Darvasi A, Soller M (1994) Selective DNA pooling for determination of linkage between a molecular marker and a quantitative trait locus. Genetics 138: 1365-1373.

[0344] Falconer, D S, MacKay, T F C (1996) Introduction to quantitative genetics. Addison-Wesley, Boston.

[0345] Fulker D W, Cherny S S, Cardon L R (1995) Multipoint interval mapping of quantitative trait loci, using sib pairs. Am J Hum Genet 56:1224-1233.

[0346] Frank, L (2000) Storm brews over gene bank of Estonian population. Science 286: 1262.

[0347] Germer S, Holland M J, Higuchi R. High-throughput SNP allele-frequency determination in pooled DNA samples by kinetic PCR. Gen Res 2000; 10; 258-266.

[0348] Goddard K A, Hopkins P J, Hall J M, Witte J S (2000) Linkage disequilibrium and allele-frequency distributions for 114 single-nucleotide polymorphisms in five populations. Am J Hum Genet 66:216-34.

[0349] Gu C, Todorov A, Rao D C (1996) Combining extremely concordant sibpairs with extremely discordant sibpairs provides a cost effective way to linkage analysis of quantitative trait loci. Genet Epidemiol 13:513-533

[0350] Heller D A, de Faire U, Pedersen N L, Dahlen G, McClearn G E (1993) Genetic and environmental influences on serum lipid levels in twins. N Engl J Med 328: 1150-6.

[0351] Hoogendoorn B, Norton N, Kirov G, Williams N, Hamshere M L, Spurlock G, Austin J, Stephens M K, Buckland P R, Owen M J, O'Donovan M C: Cheap, accurate and rapid allele frequency estimation of single nucleotide polymorphisms by primer extension and DHPLC in DNA pools. Hum Gen 2000; 107; 488-493.

[0352] Iselius L, Morton N E, Rao D C (1983) Family resemblance for blood pressure. Hum Hered 33: 277-286.

[0353] Kruglyak, L (1999) Prospects for whole-genome linkage disequilibrium mapping of common disease genes. Nature Genetics 22: 139-144.

[0354] Kruglyak L, Lander E S (1995) Complete multipoint sib-pair analysis of qualitative and quantitative traits. Am J Hum Genet 57:439-454.

[0355] Liu, B -H (1997) Statistical Genomics. CRC Press, Boca Raton.

[0356] Mathews J, Walker R L (1970) Mathematical methods of physics, second edition. Benjamin/Cummings, London.

[0357] Neale, M C and Cardon, L R (1992). Methodology for Genetic Studies of Twins and Families, NATO ASI Series D, Behavioural and Social Sciences, Vol. 67. Kluwer Academic, Dordrecht.

[0358] Nilsson A, Rose J (1999) Sweden takes steps to protect tissue banks. Science 286: 894.

[0359] Ott J (1999) Analysis of human genetic linkage. Johns Hopkins Univ Pr, Baltimore.

[0360] Perusse L, Rice T, Bouchard C, Vogler G P, Rao D C (1989) Cardiovascular risk factors in a French-Canadian population: resolution of genetic and familial environmental effects on blood pressure by using extensive information on environmental correlates. Am J Hum Genet 45: 240-251.

[0361] Press, W H, Teukolsky, S A, Vetterling, W T, and Flannery, B P (1997) Numerical Recipes in C, The Art of Scientific Computing, Second Edition. Cambridge University Press, Cambridge, UK.

[0362] Rabinow, P (1999) French DNA: Trouble in Purgatory. University of Chicago Press, Chicago.

[0363] Risch N J (2000) Searching for genetic determinants in the new millennium. Nature 405: 847-856.

[0364] Risch N J, Merikangas K (1996) The future of genetic studies of complex human diseases. Science 273:1516-1517.

[0365] Risch N J, Teng J (1998) The relative power of family-based and case-control designs for linkage disequilibrium studies of complex human diseases I. DNA pooling. Genome Res 8:1273-1288.

[0366] Risch N J, Zhang H (1996) Mapping quantitative trait loci with extreme discordant sib pairs: sampling considerations. Am J Hum Genet 58:836-843.

[0367] Sasaki T, Tahira T, Suzuki A, Higasa K, Kukita Y, Baba S, Hayashi K: Precise estimation of allele frequencies of single-nucleotide polymorphisms by a quantitative SSCP analysis of pooled DNA. Am J Hum Gen 2001; 68; 214-218.

[0368] Sham, P (1997) Statistics in Human Genetics. Arnold.

[0369] Shaw S H, Carrasquillo M M, Kashuk C, Puffenberger E G, Chakravarti A: Allele frequency distributions in pooled DNA samples: applications to mapping complex disease genes. Gen Res 1998; 8; 111-123.

[0370] Snedecor G W, Cochran W G. Statistical Methods. 8^{th }

[0371] Verkasalo P K, Kaprio J, Koskenvuo M, Pukkala E (1999) Genetic predisposition, environment and cancer incidence: a nationwide twin study in Finland, 1976-1995. Int J Cancer 83: 743-749.

[0372] Watanabe R M, Valle T, Hauser E R, Ghosh S, Eriksson J, Kohtamaki K, Ehnholm C et al. (1999) Familiality of quantitative metabolic traits in Finnish families with non-insulin-dependent diabetes mellitus. Finland-United States Investigation of NIDDM Genetics (FUSION) Study investigators. Hum Hered 49: 159-168.

[0373] Wilk J B, Djousse L, Arnett D K, Rich S S, Province M A, Hunt S C, Crapo R O et al. (2000) Evidence for major genes influencing pulmonary function in the NHLBI Family Heart Study. Genet Epidemiol 19: 81-94.

[0374] Zhang H, Risch N (1995) Extreme discordant sib pairs for mapping quantitative trait loci in humans. Science 268:1584-1589.

[0375] Zhang H, Risch N (1996) Mapping quantitative-trait loci in humans by use of extreme concordant sib pairs: selected sampling by parental phenotypes. Am J Hum Genet 59:951-957.

[0376]

TABLE I
Sib-pair genotype probabilities

Sib genotype G_{1} | Sib genotype G_{2} | P(G_{1}, G_{2})
A_{1}A_{1} | A_{1}A_{1} | p_{1}^{4} + p_{1}^{3}p_{2} + (1/4)p_{1}^{2}p_{2}^{2}
A_{1}A_{1} | A_{1}A_{2} | p_{1}^{3}p_{2} + (1/2)p_{1}^{2}p_{2}^{2}
A_{1}A_{1} | A_{2}A_{2} | (1/4)p_{1}^{2}p_{2}^{2}
A_{1}A_{2} | A_{1}A_{1} | p_{1}^{3}p_{2} + (1/2)p_{1}^{2}p_{2}^{2}
A_{1}A_{2} | A_{1}A_{2} | p_{1}^{3}p_{2} + 3p_{1}^{2}p_{2}^{2} + p_{1}p_{2}^{3}
A_{1}A_{2} | A_{2}A_{2} | (1/2)p_{1}^{2}p_{2}^{2} + p_{1}p_{2}^{3}
A_{2}A_{2} | A_{1}A_{1} | (1/4)p_{1}^{2}p_{2}^{2}
A_{2}A_{2} | A_{1}A_{2} | (1/2)p_{1}^{2}p_{2}^{2} + p_{1}p_{2}^{3}
A_{2}A_{2} | A_{2}A_{2} | (1/4)p_{1}^{2}p_{2}^{2} + p_{1}p_{2}^{3} + p_{2}^{4}
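Sib-pair genotype probabilities of this kind can be generated from identity-by-descent (IBD) sharing. The sketch below is an illustration under standard assumptions: full sibs share 0, 1, or 2 alleles IBD with probabilities 1/4, 1/2, and 1/4, alleles not shared IBD are drawn independently with frequency p for A_{1}, and genotypes are coded by the number of A_{1} alleles. The function names are hypothetical.

```python
from itertools import product


def hardy_weinberg(n_a1, p):
    """Probability of carrying n_a1 copies of A1 under Hardy-Weinberg proportions."""
    q = 1.0 - p
    return {2: p * p, 1: 2.0 * p * q, 0: q * q}[n_a1]


def sib_pair_probability(n1, n2, p):
    """Joint probability that sib 1 carries n1 and sib 2 carries n2 copies of A1."""
    allele = {1: p, 0: 1.0 - p}          # key 1 = allele A1, key 0 = allele A2
    ibd0 = hardy_weinberg(n1, p) * hardy_weinberg(n2, p)
    # One shared allele s plus one independent allele per sib (u for sib 1, v for sib 2).
    ibd1 = sum(
        allele[s] * allele[u] * allele[v]
        for s, u, v in product((0, 1), repeat=3)
        if s + u == n1 and s + v == n2
    )
    ibd2 = hardy_weinberg(n1, p) if n1 == n2 else 0.0
    return 0.25 * ibd0 + 0.5 * ibd1 + 0.25 * ibd2


if __name__ == "__main__":
    p = 0.3
    for n1, n2 in product((2, 1, 0), repeat=2):
        print(n1, n2, round(sib_pair_probability(n1, n2, p), 6))
```

Summing the nine entries gives 1, which provides a simple consistency check on any tabulated closed forms.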

[0377]

TABLE II
Pooling designs (indicator functions for selecting sib 1 or sib 2 into the upper pool U or the lower pool L)

Design family | Design | I_{U1} | I_{U2} | I_{L1} | I_{L2}
Unrelated | Unrelated-Random | H(X_{1} − X_{U}) | — | H(X_{L} − X_{1}) | —
Unrelated | Unrelated-Extreme | H(X_{1} − X_{U}) H(|X_{1}| − |X_{2}|) | H(X_{2} − X_{U}) H(|X_{2}| − |X_{1}|) | H(X_{L} − X_{1}) H(|X_{1}| − |X_{2}|) | H(X_{L} − X_{2}) H(|X_{2}| − |X_{1}|)
Sib-Together | Concordant | H(X_{1} − X_{U}) H(X_{2} − X_{U}) | H(X_{1} − X_{U}) H(X_{2} − X_{U}) | H(X_{L} − X_{1}) H(X_{L} − X_{2}) | H(X_{L} − X_{1}) H(X_{L} − X_{2})
Sib-Together | Pair-mean | H(X_{+} − X_{U}) | H(X_{+} − X_{U}) | H(X_{L} − X_{+}) | H(X_{L} − X_{+})
Sib-Apart | Discordant | H(X_{1} − X_{U}) H(X_{L} − X_{2}) | H(X_{L} − X_{1}) H(X_{2} − X_{U}) | H(X_{L} − X_{1}) H(X_{2} − X_{U}) | H(X_{1} − X_{U}) H(X_{L} − X_{2})
Sib-Apart | Pair-difference | H(|X_{−}| − X_{U}) H(X_{1} − X_{2}) | H(|X_{−}| − X_{U}) H(X_{2} − X_{1}) | H(|X_{−}| − X_{U}) H(X_{2} − X_{1}) | H(|X_{−}| − X_{U}) H(X_{1} − X_{2})

[0378]

TABLE III
Sib-pair genotype probabilities

Sib genotype G_{1} | Sib genotype G_{2} | P(G_{1}, G_{2})
A_{1}A_{1} | A_{1}A_{1} | p^{4} + p^{3}(1−p) + (1/4)p^{2}(1−p)^{2}
A_{1}A_{1} | A_{1}A_{2} | p^{3}(1−p) + (1/2)p^{2}(1−p)^{2}
A_{1}A_{1} | A_{2}A_{2} | (1/4)p^{2}(1−p)^{2}
A_{1}A_{2} | A_{1}A_{1} | p^{3}(1−p) + (1/2)p^{2}(1−p)^{2}
A_{1}A_{2} | A_{1}A_{2} | p^{3}(1−p) + 3p^{2}(1−p)^{2} + p(1−p)^{3}
A_{1}A_{2} | A_{2}A_{2} | (1/2)p^{2}(1−p)^{2} + p(1−p)^{3}
A_{2}A_{2} | A_{1}A_{1} | (1/4)p^{2}(1−p)^{2}
A_{2}A_{2} | A_{1}A_{2} | (1/2)p^{2}(1−p)^{2} + p(1−p)^{3}
A_{2}A_{2} | A_{2}A_{2} | (1/4)p^{2}(1−p)^{2} + p(1−p)^{3} + (1−p)^{4}

[0379] While the invention has been described in conjunction with the detailed description thereof, the foregoing description is intended to illustrate and not limit the scope of the invention, which is defined by the scope of the appended claims. Other aspects, advantages, and modifications are within the scope of the following claims.