95 results
Search Results
2. On recovering a population covariance matrix in the presence of selection bias
- Author
-
Zhihong Cai and Manabu Kuroki
- Subjects
Statistics and Probability, Selection bias, Covariance function, Covariance matrix, Applied Mathematics, General Mathematics, Population, Agricultural and Biological Sciences (miscellaneous), Estimation of covariance matrices, Causal inference, Statistics, Econometrics, Statistics, Probability and Uncertainty, General Agricultural and Biological Sciences, Fisher information, Mathematics, Statistical hypothesis testing - Abstract
This paper considers the problem of using observational data in the presence of selection bias to identify causal effects in the framework of linear structural equation models. We propose a criterion for testing whether or not observed statistical dependencies among variables are generated by conditioning on a common response variable. When the answer is affirmative, we further provide formulations for recovering the covariance matrix of the whole population from that of the selected population. The results of this paper provide guidance for reliable causal inference, based on the recovered covariance matrix obtained from the statistical information with selection bias.
- Published
- 2006
3. Local influence in principal components analysis
- Author
-
Lei Shi
- Subjects
Statistics and Probability, Covariance matrix, Applied Mathematics, General Mathematics, Population, Function (mathematics), Agricultural and Biological Sciences (miscellaneous), Principal component analysis, Statistics, Applied mathematics, Symmetric matrix, Statistics, Probability and Uncertainty, General Agricultural and Biological Sciences, Cook's distance, Statistic, Eigenvalues and eigenvectors, Mathematics - Abstract
SUMMARY Based on the definitions of a generalised influence function and a generalised Cook statistic, the local influence of small perturbations on the eigenvalues and eigenvectors of a covariance matrix is studied for population and sample versions. The results based on the correlation matrix are also derived and some related topics are discussed. Finally, an example is used for illustration. Although this method has been applied to many models (Beckman, Nachtsheim & Cook, 1987; Lawrance, 1988; Thomas & Cook, 1990), assessing the local influence for multivariate data is still an unexplored area. In this paper, a method based on a generalised influence function and generalised Cook statistic is introduced to assess the local influence of small perturbations on a statistic of interest, and we prove that this method is equivalent to Cook's work based on normal curvature in the likelihood framework. Since local influence on some statistics in multivariate analysis is not easy to study using the likelihood displacement of Cook (1986), this method is expected to have wider applications. In this paper, we employ the generalised Cook statistic to assess local influence in principal components analysis. In § 2, we introduce the definitions of the generalised influence function and generalised Cook statistic. In § 3, the perturbation theory of eigenvalues and eigenvectors of a real symmetric matrix is studied. In § 4, the local influence of small perturbations of observations on eigenvalues and the eigenvectors of sample covariance
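The first-order perturbation result this abstract relies on is easy to check numerically. A minimal sketch (the matrices and variable names are illustrative, not from the paper): for a symmetric Σ with eigenpair (λ_i, v_i), the derivative of λ_i in a symmetric direction Δ is v_iᵀ Δ v_i.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(4, 4))
sigma = A @ A.T                     # symmetric positive definite stand-in covariance
eigvals, eigvecs = np.linalg.eigh(sigma)

D = rng.normal(size=(4, 4))
delta = (D + D.T) / 2               # symmetric perturbation direction
eps = 1e-6

# First-order theory: d(lambda_i)/d(eps) = v_i' delta v_i for distinct eigenvalues.
predicted = np.array([eigvecs[:, i] @ delta @ eigvecs[:, i] for i in range(4)])
actual = (np.linalg.eigvalsh(sigma + eps * delta) - eigvals) / eps
print(np.max(np.abs(predicted - actual)))  # tiny: the two derivatives agree
```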
- Published
- 1997
4. Semiparametric inference for the dominance index under the density ratio model
- Author
-
Jiahua Chen, Weiwei Zhuang, and B Y Hu
- Subjects
Statistics and Probability, Applied Mathematics, General Mathematics, Population, Estimator, Stochastic dominance, Asymptotic distribution, Agricultural and Biological Sciences (miscellaneous), Confidence interval, Empirical likelihood, Econometrics, Statistics, Probability and Uncertainty, General Agricultural and Biological Sciences, Statistical hypothesis testing, Quantile, Mathematics - Abstract
SUMMARY An important and often discussed research problem in statistics is how to compare several populations; examples arise in medical science, engineering, finance and other fields. Often population means or medians are compared. However, one population may have a higher mean income, for example, because of a small number of super-rich individuals; the mean therefore may not reflect the wealth of the general population. Instead, an index of the degree of stochastic dominance of one population over another would better reflect their relative wealth. Currently, we can estimate such an index under restrictive conditions, but there is no generic estimator with a known asymptotic distribution. In this paper, we suggest linking the populations via the density ratio model. Under this model, we develop an empirical likelihood estimator and establish its asymptotic normality. In addition, we improve the estimation efficiency by examining the similarities between the populations. Furthermore, we provide a valid bootstrap method for hypothesis testing and the construction of confidence intervals. Simulation experiments show that the proposed estimator substantially improves the estimation efficiency and power of the test, and leads to confidence intervals with satisfactory coverage probabilities. It is also robust with respect to mild model misspecification. Two examples are given to demonstrate the usefulness of both the method and the concept.
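For intuition, a dominance index of one population over another can be estimated by the plain empirical plug-in of P(X > Y), i.e. the Mann-Whitney statistic. This is only a naive sketch of the concept on simulated data of our own choosing; it is not the paper's density-ratio empirical likelihood estimator.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(0.5, 1.0, size=500)   # sample from the "dominant" population
y = rng.normal(0.0, 1.0, size=400)

# Empirical dominance index: P(X > Y) + 0.5 * P(X = Y) (Mann-Whitney form).
diff = x[:, None] - y[None, :]
theta_hat = float(np.mean(diff > 0) + 0.5 * np.mean(diff == 0))
print(round(theta_hat, 2))  # population value here is Phi(0.5 / sqrt(2)), about 0.64
```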
- Published
- 2019
5. Shape-constrained partial identification of a population mean under unknown probabilities of sample selection
- Author
-
José R. Zubizarreta, Stefan Wager, and Luke Miratrix
- Subjects
Statistics and Probability, General Mathematics, Population, Survey sampling, Ratio estimator, Mathematics - Statistics Theory, Sample (statistics), Statistics Theory (math.ST), Statistics, Selection (genetic algorithm), Mathematics, Applied Mathematics, Agricultural and Biological Sciences (miscellaneous), Outcome (probability), Identification (information), Distribution (mathematics), Statistics, Probability and Uncertainty, General Agricultural and Biological Sciences - Abstract
A prevailing challenge in the biomedical and social sciences is to estimate a population mean from a sample obtained with unknown selection probabilities. Using a well-known ratio estimator, Aronow and Lee (2013) proposed a method for partial identification of the mean by allowing the unknown selection probabilities to vary arbitrarily between two fixed extreme values. In this paper, we show how to leverage auxiliary shape constraints on the population outcome distribution, such as symmetry or log-concavity, to obtain tighter bounds on the population mean. We use this method to estimate the performance of Aymara students, an ethnic minority in the north of Chile, in a national educational standardized test. We implement this method in the new statistical software package scbounds for R.
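The extremal-weights idea behind the Aronow-Lee bounds can be sketched directly: with inverse selection probabilities assumed to lie in an interval [lo, hi], the largest (smallest) Hajek-weighted mean is attained by giving the upper weight to a prefix of the sorted outcomes. All data and bound values below are illustrative, and the paper's shape-constrained tightening is not implemented.

```python
import numpy as np

rng = np.random.default_rng(2)
y = np.sort(rng.normal(size=200))[::-1]   # sampled outcomes, sorted descending
lo, hi = 1.0, 3.0                         # assumed range of inverse selection probabilities

def hajek(w, y):
    return float(np.sum(w * y) / np.sum(w))

# The extremal Hajek mean is attained by giving weight `hi` to a prefix of the
# sorted outcomes and `lo` to the rest; scan every prefix length.
idx = np.arange(len(y))
upper = max(hajek(np.where(idx < k, hi, lo), y) for k in range(len(y) + 1))
lower = min(hajek(np.where(idx >= len(y) - k, hi, lo), y) for k in range(len(y) + 1))

print(lower < upper, lower <= float(np.mean(y)) <= upper)  # True True: bounds bracket the naive mean
```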
- Published
- 2017
6. A Bayesian analysis of multiple-recapture sampling for a closed population
- Author
-
B. J. Castledine
- Subjects
Statistics and Probability, Applied Mathematics, General Mathematics, Population size, Bayesian probability, Population, Posterior probability, Sampling (statistics), Strong prior, Sample (statistics), Agricultural and Biological Sciences (miscellaneous), Sample size determination, Statistics, Econometrics, Statistics, Probability and Uncertainty, General Agricultural and Biological Sciences, Mathematics - Abstract
SUMMARY This paper considers from a Bayesian viewpoint inferences about the size of a closed animal population from data obtained by a multiple-recapture sampling scheme. The method developed enables prior information about the population size and the catch probabilities to be utilized to produce, in certain cases, considerable improvements on ordinary maximum likelihood methods. Several ways of expressing such prior information are explored and a practical example of their use is given. The main result of the paper is an approximation to the posterior distribution of the population size that exhibits the contributions made by the likelihood and the prior ideas. The multiple-recapture sampling scheme involves taking samples from a population of animals, at each stage counting the number of marked animals in the sample, marking the previously unmarked animals and returning the sample to the population. The literature is reviewed by Cormack (1968) and the papers most relevant to our problem are those of Chapman (1952) and Darroch (1958). In these papers, maximum likelihood estimates for the population parameters and variances for these estimates are given. Our approach differs from previous ones by introducing prior information about the population size, and about the propensities of the animals to be captured, called 'catch probabilities' by some previous authors. The incorporation of these propensities is a feature not shared by many previous approaches. Darroch (1958) found that their introduction into his model does not change the maximum likelihood estimates of the population size. Our approach will show that strong prior knowledge about the catch probabilities can greatly affect inference about the population size, and it is in this respect that the greatest difference from previous approaches lies.
Although it is recognized that models of an open population incorporating death and immigration such as those of Jolly (1965) and Seber (1965) are more practically realistic, we feel that our method, which deals only with a closed population, is worth investigating as a step towards providing a Bayesian treatment of the open population problem.
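A minimal Bayesian sketch in the same spirit, reduced to two capture occasions with a flat prior on N; the data, grid, and prior are our own illustrative choices, not the paper's multiple-recapture model.

```python
import numpy as np
from math import comb

# Hypothetical two-occasion capture-recapture data.
n1, n2, m = 60, 50, 12             # marked on occasion 1, caught on occasion 2, recaptures

Ns = np.arange(n1 + n2 - m, 1001)  # N is at least the number of distinct animals seen
# Hypergeometric likelihood of m recaptures given N, with a flat prior on N.
lik = np.array([comb(n1, m) * comb(N - n1, n2 - m) / comb(N, n2) for N in Ns])
post = lik / lik.sum()

post_mode = int(Ns[np.argmax(post)])  # near the classical Petersen estimate n1 * n2 / m
post_mean = float(np.sum(Ns * post))
print(post_mode, round(post_mean))
```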
- Published
- 1981
7. A nonparametric measure of intraclass correlation
- Author
-
P. Rothery
- Subjects
Statistics and Probability, Intraclass correlation, Applied Mathematics, General Mathematics, Interclass correlation, Population, Nonparametric statistics, Correlation ratio, Agricultural and Biological Sciences (miscellaneous), Joint probability distribution, Statistics, Econometrics, Statistics, Probability and Uncertainty, General Agricultural and Biological Sciences, Statistic, Mathematics, Rank correlation - Abstract
A nonparametric measure of intraclass correlation is proposed, based on the probability of certain types of concordance among the observations, which can be estimated from ranked data. One estimator whose variance can be estimated in an unbiased way is shown to be asymptotically normally distributed. Properties of the measure and its estimator are studied for a normal model; in particular the method is shown to provide a relatively powerful test of the null hypothesis of zero intraclass correlation in a normal population. The intraclass correlation coefficient is often used as a measure of the degree to which individuals from the same family resemble one another in some variable such as height or weight. From a random sample of families and a random selection of individuals from each family, both the between and within family components of variance can be estimated and an estimate of the intraclass correlation obtained. When the observations are ranks this approach is not feasible since direct estimates of the components of variance are not available. An example occurs in a study of aggressiveness in male red grouse (Moss, Watson & Parr, 1974). One measure of aggressiveness of an individual bird is a dominance rank, the position of that bird in the dominance hierarchy of a given group of birds. Comparison of the degree of familial resemblance in different populations is complicated by the fact that the data are ranks and that dominance hierarchies can be variable in size with different family sizes. The development of standard test statistics, such as the Kruskal-Wallis statistic (Kruskal, 1952), as measures of intraclass correlation is unsatisfactory because of the complicated dependence on the structure of the sample; see § 6. In this paper a measure of intraclass correlation is proposed which overcomes this problem. It is independent of the structure of the sample and it can be estimated in an unbiased way from ranked data.
The interclass rank correlation coefficients, Kendall's τ and Spearman's ρ (Kendall, 1955), can be interpreted in terms of probabilities of concordances among the observations from a bivariate distribution. In response to Hill (1974), Kerridge (1975) has used this probabilistic interpretation of τ in an analysis of Football League results. The measure presented in the present paper is based on concordances among observations from a population in which individuals occur in families. Its properties for a general class of populations are discussed and some particular results for a one-way classification model with normally distributed effects are presented. The power of the associated test is assessed for a normal model for a range of sample structures.
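As a point of comparison, the parametric (one-way ANOVA) intraclass correlation estimate that such rank-based measures are benchmarked against can be computed in a few lines. The normal model and all parameter values below are illustrative assumptions, not the paper's data.

```python
import numpy as np

rng = np.random.default_rng(3)
k, n, rho = 200, 5, 0.4                             # families, family size, target ICC
a = rng.normal(0, np.sqrt(rho), size=(k, 1))        # shared family effect
e = rng.normal(0, np.sqrt(1 - rho), size=(k, n))    # individual effect
x = a + e

# One-way ANOVA estimator of the intraclass correlation.
msb = n * np.var(x.mean(axis=1), ddof=1)            # between-family mean square
msw = np.mean(np.var(x, axis=1, ddof=1))            # within-family mean square
icc = float((msb - msw) / (msb + (n - 1) * msw))
print(round(icc, 2))  # close to the target 0.4
```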
- Published
- 1979
8. Optimal allocation in sequential tests comparing the means of two Gaussian populations
- Author
-
Thomas A. Louis
- Subjects
Statistics and Probability, Applied Mathematics, General Mathematics, Gaussian, Population, Expected value, Agricultural and Biological Sciences (miscellaneous), Asymptotically optimal algorithm, Discrete time and continuous time, Sequential probability ratio test, Statistics, Statistics, Probability and Uncertainty, Invariant (mathematics), General Agricultural and Biological Sciences, Mathematics, Statistical hypothesis testing - Abstract
SUMMARY The invariant sequential probability ratio test used in testing for a difference between the means of two Gaussian populations is set up. The error probabilities for this test are effectively constant over a rich class of data-dependent allocation rules. The additional risk, average sample number plus (γ − 1) times the expected number of observations to the inferior population, for γ > 1, is introduced and the optimal allocation rule is found for the continuous-time analogue to this problem. Analytical results show this rule to be asymptotically optimal in discrete time, and simulations indicate its near optimal performance for the finite case. The problem of two-population hypothesis testing with data-dependent allocation of observations has been treated by several authors. Also, the applications of this decision model, especially to clinical testing, have been well documented. Recent results show that when the test is sequential and the termination rule is of the sequential probability ratio test type, the probability of correct hypothesis selection is constant, ignoring overshoot, for a rich class of data-dependent allocation rules; see Flehinger, Louis, Robbins & Singer (1972) and an as yet unpublished paper of mine. This constancy permits one to search the class for a rule which performs well with respect to some additional cost structure, one usually based on the number of observations taken on the superior and inferior populations. Flehinger & Louis (1972) and Robbins & Siegmund (1974) give simulations showing that a substantial reduction in the expected number of observations on the inferior population is possible using data-dependent allocation rules, as opposed to equal assignment, for the case of comparing two Gaussian populations with known variances. The simulation results of Flehinger & Louis (1971) show the same reduction for the exponential distribution.
In the present paper the risk function formed from the average sample number plus (γ − 1) times the expected number of observations allocated to the inferior population, for γ > 1, is introduced into the Gaussian testing model. Here γ is the relative cost of taking an observation from the inferior as opposed to the superior population, and varying γ allows one to balance the two components of risk. In § 2 first the Gaussian allocation and testing problem is set up and previous results are summarized. Using the above risk function, in Appendix A the optimal allocation and its risk are obtained for the continuous-time idealization, that of comparing the drifts of two Brownian motions. Back in § 2 this optimal rule is related to the discrete testing situation.
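A stripped-down sketch of a sequential probability ratio test for a Gaussian mean difference, using plain equal allocation rather than the paper's invariant test and optimal data-dependent rule; the thresholds, δ, and variances are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(10)
delta, var = 0.5, 2.0        # |mean difference| under either hypothesis; Var(X - Y)
A = np.log(19)               # SPRT boundary giving roughly 5% error probabilities
true_diff = 0.5              # X is actually the better population

llr, n_obs = 0.0, 0
while abs(llr) < A:
    d = rng.normal(true_diff, np.sqrt(var))  # one paired difference (one obs per population)
    llr += 2 * delta * d / var               # log-LR increment: H+ (+delta) vs H- (-delta)
    n_obs += 2

print(llr > A, n_obs)  # True if the test decides for X, and the total sample number
```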
- Published
- 1975
9. Randomness and local regularity of points in a plane
- Author
-
P. Rothery and D. Brown
- Subjects
Statistics and Probability, Discrete mathematics, Plane (geometry), Applied Mathematics, General Mathematics, Population, Boundary (topology), Nearest neighbour distribution, Agricultural and Biological Sciences (miscellaneous), Sampling distribution, Poisson point process, Applied mathematics, Statistics, Probability and Uncertainty, General Agricultural and Biological Sciences, Realization (probability), Randomness, Mathematics - Abstract
SUMMARY Some nearest neighbour test procedures, assuming a null hypothesis of a Poisson process in an infinite plane, are shown to be inapplicable when a complete map of individuals is available. Two statistics, the squared coefficient of variation of squared nearest neighbour distances, and the ratio of the geometric mean to the arithmetic mean of the squared distances, are particularly appropriate to testing for local regularity in this situation. Two methods of carrying out the test, the first based on computer simulation and the second an approximation not requiring simulation, are presented. Additionally, indices of local regularity are suggested. Existing tests of randomness, based on distance methods, of individuals in a plane are primarily intended for use with large populations. The null hypothesis is that their observed distribution is a realization of a two-dimensional Poisson point process, infinite in extent. For convenience we shall refer to this starting point, and the results which can be derived from it, as infinite plane theory. Tests have been based on distances to nearest neighbours of randomly chosen individuals (Skellam, 1952), distances to first and second nearest neighbours of randomly chosen points (Holgate, 1965), and a combination of individual to nearest individual and point to nearest individual distances (Hopkins, 1954). More recently Besag & Gleaves (1973) have proposed tests based on the T-square method of sampling. In some studies, for example of territorial behaviour, a complete map of individual locations in a relatively small population within a given boundary is available. This situation is somewhat different from that outlined above: the null hypothesis of interest then is that individual positions are a realization of a process locating them independently and at random within the given boundary.
This paper presents a method for testing this null hypothesis which is designed to detect regularity of spacing of individuals on a small scale, irrespective of the global pattern. Pielou (1974, p. 155) discussed a number of ecological mechanisms which might cause such local regularity and also gave a method, based on infinite plane theory, to test for it. The method presented in this paper utilizes individual to individual nearest neighbour distances, and is based on computer simulation to overcome the problems created by the presence of a boundary and by the lack of statistical independence of the nearest neighbour distances. Two statistics are suggested as being particularly appropriate to the detection of local regularity. Their moments under infinite plane theory are derived and methods of approximation to their sampling distribution discussed. An approximate method is presented which would provide a satisfactory test of randomness in itself for some
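The two statistics and the simulation-based test can be sketched as follows. The point pattern, window, and simulation count are illustrative, and no boundary correction is attempted, so this is only a crude version of the approach the abstract describes.

```python
import numpy as np

rng = np.random.default_rng(4)

def nn_stats(pts):
    # Squared nearest-neighbour distances, then the two statistics from the
    # abstract: squared coefficient of variation, and geometric/arithmetic mean ratio.
    d2 = np.sum((pts[:, None, :] - pts[None, :, :]) ** 2, axis=-1)
    np.fill_diagonal(d2, np.inf)
    w = d2.min(axis=1)
    return np.var(w) / np.mean(w) ** 2, np.exp(np.mean(np.log(w))) / np.mean(w)

# Monte Carlo reference distribution under complete spatial randomness in a unit square.
obs = rng.uniform(size=(50, 2))
cv2_obs, ratio_obs = nn_stats(obs)
sims = np.array([nn_stats(rng.uniform(size=(50, 2)))[0] for _ in range(200)])
p_value = float(np.mean(sims <= cv2_obs))  # small CV^2 suggests local regularity
print(round(p_value, 2))
```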
- Published
- 1978
10. Simple and highly efficient estimators for a type I censored normal sample
- Author
-
Tore Persson and Holger Rootzén
- Subjects
Statistics and Probability, Applied Mathematics, General Mathematics, Population, Negative binomial distribution, Estimator, Sample (statistics), Agricultural and Biological Sciences (miscellaneous), Standard deviation, Binomial distribution, Distribution (mathematics), Statistics, Limit (mathematics), Statistics, Probability and Uncertainty, General Agricultural and Biological Sciences, Mathematics - Abstract
SUMMARY In a type I left censored normal sample the information consists of the observations x_1, ..., x_k which fell above the 'observation limit' c and of the number, n − k, of those observations which fell below c. This paper considers the estimation of the parameters of a normal population given a sample which has been censored at the known point c. When the maximum likelihood method is used to produce estimates of μ and σ one has to resort to numerical solution of the resulting equations. In this paper simple estimators μ* and σ* are presented. They are shown to be almost as good as the maximum likelihood estimators both for small and large samples. In the 'fixed k' case the censored sample is generated by a sequential procedure: independent observations are made, one by one, until a predetermined number k of observations above c is obtained. The distribution of n − k will then be negative binomial. The 'fixed n' situation occurs when in a random sample of fixed size n all observations, if any, below c are deleted. In this case the number of remaining observations, k, will have a binomial distribution. However, it may then happen that the censored sample is void, and then no reasonable estimates of the population parameters can be produced. When we investigate small-sample properties of estimators we prefer to avoid that complication by excluding all such samples from our considerations: if all n observations fall below c a new sample of size n is taken, and so on, until a sample with at least one observation above c is obtained; furthermore we assume that, whenever this happens, we completely ignore how many times a useless sample was obtained. This situation arises naturally in practice: the client will not consult the statistician unless k exceeds zero, and the statistician will never know how many times the client did not consult him. When n is large this change in the meaning of 'fixed n' is of little importance.
In particular, it does not affect any asymptotic results. Throughout the paper we assume that the population from which the original observations are taken is normal with unknown mean μ and standard deviation σ, while the 'observation
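For comparison with the simple estimators the paper proposes, the maximum likelihood fit for a type I left-censored normal sample can be sketched with a crude grid search; the data, censoring point, and grid below are illustrative assumptions.

```python
import numpy as np
from math import erf, log, sqrt

rng = np.random.default_rng(5)
mu_true, sigma_true, c = 1.0, 2.0, 0.0         # hypothetical truth and observation limit
full = rng.normal(mu_true, sigma_true, size=500)
obs = full[full >= c]                          # values above the limit are observed
n_below = int(np.sum(full < c))                # below the limit, only the count survives

def loglik(mu, sigma):
    # Density terms for observed values, plus mass below c for each censored one.
    ll = -len(obs) * log(sigma) - float(np.sum((obs - mu) ** 2)) / (2 * sigma ** 2)
    p_below = 0.5 * (1 + erf((c - mu) / (sigma * sqrt(2))))
    return ll + n_below * log(max(p_below, 1e-300))

grid = [(mu, s, loglik(mu, s))
        for mu in np.linspace(-1, 3, 81) for s in np.linspace(0.5, 4.0, 71)]
mu_hat, sigma_hat, _ = max(grid, key=lambda t: t[2])
print(round(float(mu_hat), 2), round(float(sigma_hat), 2))  # near the true (1.0, 2.0)
```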
- Published
- 1977
11. ON A COMPREHENSIVE TEST FOR THE HOMOGENEITY OF VARIANCES AND COVARIANCES IN MULTIVARIATE PROBLEMS
- Author
-
D. J. Bishop
- Subjects
Statistics and Probability, Studentized range, Applied Mathematics, General Mathematics, Population, Univariate, Bartlett's test, Covariance, Agricultural and Biological Sciences (miscellaneous), Sampling distribution, Statistics, F-test of equality of variances, Multiple correlation, Statistics, Probability and Uncertainty, General Agricultural and Biological Sciences, Mathematics - Abstract
Now that satisfactory and probably final solutions have been obtained for a wide variety of statistical problems concerned with a single normally distributed variable, more and more attention has recently been given to the solution of multivariate problems. The multiple correlation methods of the old large sample theory have been replaced in many instances by others for which "studentized" test criteria are available, often having sampling distributions that are already familiar in univariate problems. In a recent paper on "The statistical utilization of multiple measurements", R. A. Fisher (1938a) has shown the connexion between certain of these methods: the D²-statistic work of Mahalanobis, the discriminant function methods of the Galton Laboratory and the generalized "Student's" ratio of Hotelling. A similar very general problem was dealt with some time ago by S. S. Wilks (1932), while mention may also be made of two papers by D. G. Lawley (1938a, b) and a paper by P. L. Hsu (1938). The purpose of the methods put forward is to obtain information regarding the mean values of a number, say q, of correlated variables in one or more, say k, populations from which random samples have been drawn. If we denote by x_s a value of the sth variable (s = 1, 2, ..., q), then in all this work it has been assumed not only that x_s is normally distributed, but that it has the same variance σ_s² in every population sampled. Further, it is assumed that if x_u is a second variable, the correlation coefficient ρ_su between x_s and x_u is the same in all populations. The estimates of variance and covariance required in order to "studentize" the function of the sample means are therefore obtained by pooling together the sums of squares and sums of products from all samples.
While it is true that even if σ_s and ρ_su are not the same in all populations the error involved may not be very large, it is however important to have available some means of testing the basic hypothesis which assumes homogeneity throughout the populations. Such a test has been derived by S. S. Wilks (1932) by an extension of Neyman & Pearson's likelihood ratio method of approach. Hitherto the somewhat lengthy computations required to obtain the moments of the sampling distribution of the test criterion have probably discouraged its use. The objects of the present paper are as follows: (a) In the simple but commonly met case, where the k samples are of the same
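The modern descendant of this homogeneity test is Box's M statistic, sketched below on simulated homogeneous data. The group sizes and dimensions are illustrative, and the moment corrections this paper develops are not applied.

```python
import numpy as np

rng = np.random.default_rng(9)
k, n, q = 3, 50, 2                     # groups, per-group sample size, variables
data = [rng.normal(size=(n, q)) for _ in range(k)]

# Box's M: compares each group's covariance determinant with the pooled one.
covs = [np.cov(x, rowvar=False) for x in data]
pooled = sum((n - 1) * S for S in covs) / (k * (n - 1))
M = float(k * (n - 1) * np.log(np.linalg.det(pooled))
          - (n - 1) * sum(np.log(np.linalg.det(S)) for S in covs))
print(round(M, 2))  # modest values are consistent with homogeneity
```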
- Published
- 1939
12. Sequential estimation of the size of a population
- Author
-
P. R. Freeman
- Subjects
Statistics and Probability, Sequential estimation, Applied Mathematics, General Mathematics, Maximum likelihood, Population, Agricultural and Biological Sciences (miscellaneous), Quadratic equation, Sample size determination, Statistics, Statistics, Probability and Uncertainty, General Agricultural and Biological Sciences, Mathematics - Abstract
When estimating the size of a finite population, it is possible to consider, as an alternative to the capture-recapture method, a sequential scheme. Suppose an urn contains an unknown number, N, of balls, initially all white. A single ball is drawn at random and if it is white it is painted black and returned to the urn, while if it is black, indicating that it has already been sampled at least once before, it is returned unchanged. Thus initially nearly all drawings will be of white balls, but after a long time mostly black balls will appear. Somewhere in between we wish to stop sampling and produce an estimate of N. This scheme was first proposed by Goodman (1949, 1953). The most recent treatment of this problem is by Samuel (1968, 1969), who considered asymptotic distributions of sample sizes and maximum likelihood estimates for each of four stopping rules. The crucial question of the choice between these rules remained unanswered, however, since the balance between cost of sampling and risk due to inaccurate estimation was not considered. It is precisely this which is the aim of the present paper. A bibliography of previous, non-Bayesian, work is given by Samuel (1968). Since preparing the first draft of this paper, the author has seen an unpublished report by D. G. Hoel and W. E. Lever in which this problem is formulated in a Bayesian framework, and a small numerical solution given for a quadratic loss function, and with an upper limit on the value of N.
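The urn scheme is easy to simulate, and a simple moment estimate of N (not the paper's Bayesian treatment) falls out of E[d] = N(1 − (1 − 1/N)^T), where d is the number of balls painted after T draws. All numbers below are illustrative.

```python
import numpy as np

rng = np.random.default_rng(6)
N_true, T = 500, 400
draws = rng.integers(0, N_true, size=T)   # index of the ball drawn at each step
d = len(np.unique(draws))                 # balls painted black after T draws

# Moment estimate: solve E[d] = N * (1 - (1 - 1/N)^T) for N by scanning.
candidates = np.arange(d, 5001)
expected = candidates * (1 - (1 - 1 / candidates) ** T)
N_hat = int(candidates[np.argmin(np.abs(expected - d))])
print(N_hat)  # in the vicinity of N_true = 500
```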
- Published
- 1972
13. ON THE UTILIZATION OF MARKED SPECIMENS IN ESTIMATING POPULATIONS OF FLYING INSECTS
- Author
-
C. C. Craig
- Subjects
Statistics and Probability, Estimation, Applied Mathematics, General Mathematics, Population size, Population, Boundary (topology), Agricultural and Biological Sciences (miscellaneous), Moment (mathematics), Sample size determination, Statistics, Butterfly, Statistics, Probability and Uncertainty, General Agricultural and Biological Sciences, Colias eurytheme, Mathematics - Abstract
Professor William Hovanitz called my attention to the following problem: An observer catches butterflies, marks them, and immediately releases them. It is assumed that a butterfly, no matter how many times it has been caught before, has the same susceptibility to capture as any other butterfly in the population, which is supposed stable while the captures are being made. Records are kept of f_x, the frequency of cases in which the same butterfly is caught x times, x = 1, 2, ..., until a total of s captures of r different butterflies have been made. (Σ f_x = r; Σ x f_x = s.) The number, f_0, of butterflies which escape is not observed; the problem is to estimate from the values of f_x the total population n of butterflies on the area, assumed well defined. The estimation of biological populations by means of capture-recapture data is by no means a new problem, though papers dealing with it from the mathematical-statistical point of view are largely quite recent. (In particular, see the papers by Leslie & Chitty (1951), Bailey (1951), Moran (1951, 1952), and the bibliographies quoted by them.) However, the experimental conditions and the mathematical models for the present study appear to differ in essential ways from those previously considered. The important point of departure is that each butterfly on being netted is immediately marked (with a spot of nail polish) and released. The butterflies (Colias eurytheme) were caught in one of two isolated alfalfa fields, which they inhabit, in southern California. Each catch was made during the same day at times when the butterflies were freely flying. Thus, it seemed reasonable to assume that the population was stable during a catch. The experimenter, Prof. Hovanitz, endeavoured to give each butterfly an equal chance of capture, walking in straight lines across the field and deviating in direction before reaching a boundary only when he noticed that a butterfly just caught tended to fly down his path.
One check of the suitability of a mathematical model is to test the agreement of the experimental results with respect to the number of butterflies caught once, twice, etc., with those predicted from the model. I will return to this point at the end of the paper. Two mathematical models seem appropriate to serve as a basis for discussion of this estimation problem. It is of some interest to see that both lead to approximately the same estimates with little difference in their precision for large samples. It may be of more interest that for both models in which the population size is regarded as a parameter, though maximum likelihood estimates exist and agree substantially with moment estimates in all sixteen of the actual field experiments for which I have data, nevertheless with increasing sample size meaningful solutions of the likelihood equation do not exist.
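One natural model here treats captures per butterfly as Poisson, so that n can be estimated from a zero-truncated Poisson fit to (r, s). The sketch below is a moment-equation version on simulated data; the parameter values are our own, not from the field experiments.

```python
import numpy as np

rng = np.random.default_rng(7)
n_true, lam_true = 300, 1.2
counts = rng.poisson(lam_true, size=n_true)   # times each butterfly is caught
r = int(np.sum(counts > 0))                   # distinct butterflies ever seen
s = int(counts.sum())                         # total number of captures

# Zero-truncated Poisson moment equation: s / r = lam / (1 - exp(-lam)).
lams = np.linspace(0.01, 5, 5000)
lam_hat = lams[np.argmin(np.abs(lams / (1 - np.exp(-lams)) - s / r))]
n_hat = float(r / (1 - np.exp(-lam_hat)))
print(round(n_hat))  # estimate of the total population, near n_true = 300
```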
- Published
- 1953
14. THE EXACT VALUE OF THE MOMENTS OF THE DISTRIBUTION OF χ2 USED AS A TEST OF GOODNESS OF FIT, WHEN EXPECTATIONS ARE SMALL
- Author
-
J. B. S. Haldane
- Subjects
Statistics and Probability, Distribution (number theory), Applied Mathematics, General Mathematics, Sample (material), Population, Degrees of freedom (statistics), Agricultural and Biological Sciences (miscellaneous), Combinatorics, Minimum distance estimation, Goodness of fit, Cramér–von Mises criterion, Statistics, Multinomial theorem, Statistics, Probability and Uncertainty, General Agricultural and Biological Sciences, Mathematics - Abstract
IN genetical practice we are constantly presented with large numbers of small samples from populations consisting of several well-defined classes. For example in the mouse we can readily obtain hundreds of litters containing anything from one up to about twelve members. Their totals may agree satisfactorily with expectation on a Mendelian basis, for example ¾ coloured, ¼ white, or 9/16 grey, 3/16 black, 4/16 white. But we desire to know whether the individual litters can be regarded as random samples from such a population. In addition the problem of homogeneity may arise. That is to say the population as a whole may not conform to any particular expectation. But we may desire to know whether the litters can be regarded as random samples of the population given by the totals. It has long been known that when the numbers expected in any observation are small, the distribution of χ² departs from that given by Pearson (1900). The mean appears sometimes, but not always, to be equal to the number of degrees of freedom. But the variance is no longer exactly equal to twice that number. Exact expressions for it in certain cases have been given by Pearson (1932) and Cochran (1936). These are based on an ingenious application of the theory of multiple contingency by Pearson. It will be shown in this paper that the first few moments can often be calculated by entirely elementary methods involving nothing more advanced than the multinomial theorem. In an accompanying paper (Grüneberg and Haldane, 1937) they will be applied to actual data on mice. We first study the distribution of χ² in an n-fold table with n − 1 degrees of freedom, then in an (m × n)-fold table with m(n − 1) degrees of freedom. For genetical work we are particularly interested in the (n × 2)-fold table with n degrees of freedom.
As a limiting case of the 2-fold table with 1 degree of freedom we derive the moments of the variance of samples from a Poisson series, and thence the distribution of χ² in an n-fold table with n degrees of freedom. The important case of the (m × n)-fold table with (m − 1)(n − 1) degrees of freedom remains to be investigated. Consider a sample of S individuals falling into n classes. Let the expected and observed numbers in these classes be
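The phenomenon Haldane computes exactly is easy to see by simulation. In the sketch below (a Monte Carlo check, not the paper's elementary exact method; the function name and replication count are mine), a 2-fold table with equiprobable classes and only four observations per sample has mean X² equal to 1, the number of degrees of freedom, but variance 1.5 rather than 2:

```python
import random

def chi2_moments(probs, size, reps=20000, seed=1):
    """Monte Carlo mean and variance of Pearson's X^2 for multinomial
    samples of the given size with class probabilities probs."""
    rng = random.Random(seed)
    k = len(probs)
    cum = [sum(probs[:i + 1]) for i in range(k)]
    cum[-1] = 1.0                       # guard against floating-point shortfall
    expected = [p * size for p in probs]
    vals = []
    for _ in range(reps):
        counts = [0] * k
        for _ in range(size):           # draw one multinomial sample
            u = rng.random()
            for j, c in enumerate(cum):
                if u <= c:
                    counts[j] += 1
                    break
        vals.append(sum((o - e) ** 2 / e for o, e in zip(counts, expected)))
    mean = sum(vals) / reps
    var = sum((v - mean) ** 2 for v in vals) / reps
    return mean, var
```

Here X² reduces to (O − 2)² for O binomial(4, ½), whose mean is the binomial variance 1 and whose own variance is the fourth central moment minus 1, i.e. 2.5 − 1 = 1.5.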
- Published
- 1937
15. ON THE USE OF STUDENT'S t-TEST IN AN ASYMMETRICAL POPULATION
- Author
-
S. G. Ghurye
- Subjects
Statistics and Probability ,education.field_of_study ,Applied Mathematics ,General Mathematics ,media_common.quotation_subject ,Population ,Agricultural and Biological Sciences (miscellaneous) ,Normal distribution ,Standard normal deviate ,Sample size determination ,Statistics ,Statistics, Probability and Uncertainty ,General Agricultural and Biological Sciences ,education ,Cumulant ,Normality ,Student's t-test ,media_common ,Mathematics ,Statistical hypothesis testing - Abstract
On account of the unique property of samples from a normal population that the ratio (x̄ − μ)√(n + 1)/s (where μ is the population mean, x̄ = Σxᵢ/(n + 1) and ns² = Σ(xᵢ − x̄)²) is the ratio of a normal deviate to a stochastically independent estimate of its variance, Student's t-test is a suitable test of significance for the mean of a normal population. However, in a variety of cases, it is necessary to test for the mean of a population which does not follow the Gaussian law. Efforts have, therefore, been made to see how far Student's distribution may be used for the purpose in non-normal populations. Due, mainly, to the analytical difficulties of the problem, no extensive theoretical discussion has yet been given. Thus, Pearson & Adyanthaya (1929), Rietz (1939) and Nair (1941) have given experimental treatments, while the theoretical discussions of some others (Rider, 1929; Perlo, 1933; Laderman, 1939) have dealt only with trivially small sample sizes. The papers by Bartlett (1935) and Geary (1936, 1947) give results true for any sample size, though they are based on certain assumptions and approximations. The present paper deals with the population considered by Geary in his 1936 paper, subject to the same approximations. The second contribution by Geary (in which is derived the t-distribution in samples from a population which departs more from normality than that considered in the 1936 paper) came to my notice too late to be made use of in the present work; but it is proposed to consider it later on. Geary (1936) has obtained the distribution of the ratio (x̄ − μ)√(n + 1)/s in the case of an asymmetrical population, whose fourth and higher cumulants are zero, by neglecting squares and higher powers of the third cumulant. We know from this how far the probability of an error of the first kind (i.e.
the probability of rejecting the null hypothesis when it is true) in such a population differs from that for a normal distribution, provided we may neglect the square of the standardized third cumulant γ₁. Here again, on account of analytical difficulties, it is not possible, except for very small sample sizes, to consider the effect of terms containing higher powers of γ₁. However, we can assume the result derived by Geary to be correct for very small values of γ₁, as also for large sample sizes; but in such cases the deviation from the values of the normal theory is practically negligible. Even then, it is of interest to know whether, in using the usual tables of the t-test (based on the normal distribution), we are committing the greater error in the probability of an error of the first kind or in that of an error of the second kind. In the present paper are derived the values of the probability of an error of the second kind (and hence, of the power of the test) when the usual t-tables are used to define the critical region. It may be mentioned here that this problem is only a special case of a general investigation, on which the writer is engaged, into the effect, on statistical tests, of differences between the actual and the assumed distribution laws of the universe sampled. The solution of these problems is hampered by analytical difficulties in the derivation of the probability laws (and particularly of power functions), and the present case is one of the few in which a mathematical, though only approximate, solution has been found possible.
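The two error probabilities compared above can be illustrated by direct simulation. The sketch below is illustrative only: the sample size, the one-sided 5% critical value 1.833 on 9 degrees of freedom (a standard t-table value), and the centred unit-exponential parent (skewness 2) are my choices, not Ghurye's analytical setting:

```python
import math
import random

T_CRIT = 1.833  # one-sided 5% point of Student's t on 9 d.f. (standard tables)

def t_test_errors(shift, n=10, reps=20000, seed=2):
    """Rejection rate of the nominal one-sided 5% t-test on samples of n
    from a centred unit exponential shifted by the true mean `shift`.
    shift = 0 estimates the type I error; shift > 0 estimates the power."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(reps):
        x = [rng.expovariate(1.0) - 1.0 + shift for _ in range(n)]
        m = sum(x) / n
        s2 = sum((xi - m) ** 2 for xi in x) / (n - 1)
        if m / math.sqrt(s2 / n) > T_CRIT:
            rejections += 1
    return rejections / reps
```

For a right-skewed parent the sample mean and standard deviation are positively correlated, so the t statistic is skewed to the left and the upper-tail rejection rate tends to fall below the nominal 5%; the same simulation with a positive shift then shows how the power is affected.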
- Published
- 1949
16. On the distribution of range of samples from nonnormal populations
- Author
-
C. Singh
- Subjects
Statistics and Probability ,education.field_of_study ,Applied Mathematics ,General Mathematics ,Population ,Pearson distribution ,Sample (statistics) ,Edgeworth series ,Nonparametric skew ,Agricultural and Biological Sciences (miscellaneous) ,symbols.namesake ,Skewness ,Statistics ,Econometrics ,symbols ,Range (statistics) ,Kurtosis ,Statistics, Probability and Uncertainty ,General Agricultural and Biological Sciences ,education ,Mathematics - Abstract
SUMMARY The probability integral for the distribution of range of a sample from a population whose distribution can be represented by the first few terms of an Edgeworth series has been obtained in this paper. The numerical values of the corrective functions arising due to nonnormality are tabulated. The new theoretical results are compared with the earlier results, where available. The distribution of range of samples from nonnormal populations was first studied empirically by Pearson & Adyanthaya (1928) and later, among others, by Pearson (1950), Cox (1954) and David (1954). These studies have been limited mainly to the mean range and to the probability integral in some simple nonnormal cases, and from these likely effects of nonnormality on the distribution of range have been conjectured. Singh (1967) obtained some theoretical results regarding the expectation and the variance of range of samples from a population whose distribution can be represented by the first few terms of an Edgeworth series. These results provided some additional information regarding the effects of parental excess and skewness on the mean and variance of the range. In the present paper the probability integral for the distribution of range of samples from the same type of population has been obtained and evaluated for small samples to examine the effects of parental excess and skewness.
- Published
- 1970
17. EFFECT OF NON-NORMALITY ON THE POWER FUNCTION OF t-TEST
- Author
-
A. B. L. Srivastava
- Subjects
Statistics and Probability ,education.field_of_study ,Applied Mathematics ,General Mathematics ,Population ,Edgeworth series ,Agricultural and Biological Sciences (miscellaneous) ,Sample size determination ,Skewness ,Joint probability distribution ,Statistics ,Kurtosis ,Statistics, Probability and Uncertainty ,General Agricultural and Biological Sciences ,education ,Cumulant ,Mathematics ,Type I and type II errors - Abstract
Student's t-statistic provides a suitable test of significance for the mean when the sample comes from a normal population. The power of the normal theory test has been studied by Neyman (1935), Neyman & Tokarska (1936) and Johnson & Welch (1939). As in many cases the samples appear to belong to populations other than the Gaussian, it is necessary to see how far the normal theory test can be assumed to be valid in controlling the Type I and Type II errors of inference on non-normal samples. The effect of non-normality on the Type I error of Student's t-test was studied experimentally by Pearson & Adyanthaya (1929) and theoretically by Bartlett (1935), Geary (1936, 1947) and Gayen (1949). Assuming the parent population to be specified by the first two terms of the Edgeworth series, Geary (1936) obtained the approximate distribution of t for any sample size, and later this work was extended by Gayen (1949) by including in the distribution the effects of parental kurtosis λ₄ = β₂ − 3 and of λ₃² = β₁. Apart from the pioneer empirical study by Pearson & Adyanthaya (1929, pp. 276-80), the effect of non-normality on the Type II error (and hence on the power) of the t-test was first studied by Ghurye (1949). He has, however, considered only the effect of skewness of the population, for he started with the joint distribution of the mean and the variance for populations specified by the first two terms of the Edgeworth series (Geary, 1936). In this paper, it has been possible to study the effects of kurtosis and skewness of the parent population which may be assumed to cover a larger range of nonnormality. Gayen's (1949) formulae for the joint distribution of the sample mean and variance for the first four terms of the Edgeworth population have been utilized for the derivation of the corrective terms of the power function.
The method followed by Ghurye for the evaluation of the corrective term of the power function due to λ₃ appears to be satisfactory for derivation of the effects due to higher odd-order cumulants. But for those of the even-order cumulants his method does not appear to be useful, as Ghurye himself encountered some 'analytic difficulties'. In this paper, by a different approach it has been possible to evaluate the integrals involved in the power function due to λ₄ and λ₃². The non-normal population considered here is supposed to be characterized by non-zero values of the standardized third and fourth cumulants. Since the effects of the higher-order terms depending on λ₅, λ₆, λ₃λ₄, ... are assumed to be negligible, the population covered is only moderately non-normal. Too high values of λ₃ and λ₄ can also not be permitted as they will make f(x) negative at one or both tails, and will give rise to subsidiary modes. To ensure a positive definite, unimodal frequency function, λ₄ should lie roughly between 0 and 2·4 and λ₃² below 0·2 (Barton & Dennis, 1952). Also it is found possible in this paper to calculate in the non-normal case the critical region
- Published
- 1958
18. A multivariate analogue of the one-sided test
- Author
-
Akio Kudo
- Subjects
Statistics and Probability ,Multivariate statistics ,education.field_of_study ,Covariance matrix ,Applied Mathematics ,General Mathematics ,Population ,Univariate ,Multivariate normal distribution ,Agricultural and Biological Sciences (miscellaneous) ,Test (assessment) ,Multivariate analysis of variance ,One sided ,Statistics ,Econometrics ,Statistics, Probability and Uncertainty ,General Agricultural and Biological Sciences ,education ,Mathematics - Abstract
In this paper we consider the following problem. Given a multivariate normal population with known variance matrix, what test is appropriate to determine whether the means are slipped to the right? In the case when the population is univariate normal, this problem can be solved by the ordinary one-sided test using either the normal or the t-distribution functions. It is the purpose of this paper to develop what may be termed a multivariate analogue of the one-sided test of significance.
- Published
- 1963
19. THE MEAN AND COEFFICIENT OF VARIATION OF RANGE IN SMALL SAMPLES FROM NON-NORMAL POPULATIONS
- Author
-
David Cox
- Subjects
Statistics and Probability ,education.field_of_study ,Applied Mathematics ,General Mathematics ,Coefficient of variation ,Population ,Sampling (statistics) ,Agricultural and Biological Sciences (miscellaneous) ,Upper and lower bounds ,Standard deviation ,Statistics ,Kurtosis ,Range (statistics) ,Statistical dispersion ,Statistics, Probability and Uncertainty ,General Agricultural and Biological Sciences ,education ,Mathematics - Abstract
Since Tippett (1925) tabulated the mean range of random samples from a normal population, the range has been extensively used for the rapid estimation of dispersion. Moreover, much work has been done recently on quick significance tests in which the root-mean-square estimate of dispersion in the t-test, analysis of variance, etc., is replaced by an estimate derived from the range. All these uses of the range rest on the assumption of normality, and so it is of interest to examine the distribution of range from non-normal populations. This was first done by E. S. Pearson & Adyanthaya (1929), and their work, and later work by Shone (1949), has been summarized, discussed and extended by Pearson (1950). The conclusion was that in small samples the ratio of mean range to population standard deviation is not much affected by the form of the population, but that the coefficient of variation of range depends fairly critically on the population. In large samples the distribution of range is, of course, determined by the tails of the population and so is very sensitive to non-normality. The main object of the present paper is to predict the mean and coefficient of variation of range in small random samples of n (n < 5) from a population of given skewness and kurtosis and then to show how these results can be used to assess the effect of non-normality on the common applications of the range. Such a prediction can only be approximate because there is no functional relation between the distribution of range and population skewness and kurtosis. There are four ways of proceeding: (i) by evaluating numerically the single and double integrals for the mean and mean square of range for a representative selection of non-normal populations; (ii) by sampling experiments; (iii) by a derivation of upper and lower limits for the mean and coefficient of variation of range of populations with given properties.
This was first done by Plackett (1947), who obtained an upper bound to the ratio of mean range to population standard deviation in samples of given size from an arbitrary population. An important extension of Plackett's work (Hartley & David, 1954) appeared as the present paper was being completed.
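Pearson's conclusion quoted above is easy to check crudely by sampling experiments, the second of the ways of proceeding listed. The sketch below (my own parameterization, with a unit-exponential parent standing in for a markedly skew population; the normal-theory constant d₂ ≈ 2.326 for samples of five is standard) estimates both the mean-range ratio and the coefficient of variation of the range:

```python
import math
import random

def range_stats(sampler, sigma, n=5, reps=40000, seed=3):
    """Monte Carlo (mean range)/sigma and coefficient of variation of the
    range for samples of n drawn by sampler(rng) from a population of
    known standard deviation sigma."""
    rng = random.Random(seed)
    ranges = []
    for _ in range(reps):
        x = [sampler(rng) for _ in range(n)]
        ranges.append(max(x) - min(x))
    m = sum(ranges) / reps
    v = sum((w - m) ** 2 for w in ranges) / reps
    return m / sigma, math.sqrt(v) / m

# normal parent versus a unit-exponential parent, both with sigma = 1
normal_ratio, normal_cv = range_stats(lambda r: r.gauss(0.0, 1.0), 1.0)
expo_ratio, expo_cv = range_stats(lambda r: r.expovariate(1.0), 1.0)
```

For n = 5 the mean-range ratio moves only from about 2.33 (normal) to about 2.08 (exponential), while the coefficient of variation of the range jumps from roughly 0.37 to roughly 0.57, exactly the pattern described in the abstract.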
- Published
- 1954
20. Some extensions of Somerville's procedure for ranking means of normal populations
- Author
-
William R. Fairweather
- Subjects
Statistics and Probability ,education.field_of_study ,Applied Mathematics ,General Mathematics ,Population ,Sampling (statistics) ,Multivariate normal distribution ,Variance (accounting) ,Agricultural and Biological Sciences (miscellaneous) ,Sample size determination ,Statistics ,Statistics, Probability and Uncertainty ,Special case ,General Agricultural and Biological Sciences ,education ,Expected loss ,Selection (genetic algorithm) ,Mathematics - Abstract
SUMMARY Somerville (1954) proposed a one-stage and a two-stage procedure for the selection of the population with the largest mean from a set of normal populations with unknown means and a common, known variance. The two-stage procedure eliminates one population after the first stage. For these procedures he assumed that a certain loss was incurred when an incorrect selection was made and also that a cost was due to sampling. The sample sizes which would minimize the maximum expected loss, the maximum being taken over all possible configurations of the true population means, were derived. For the special case of three populations he showed that the two-stage procedure, using appropriate allocations of observations between stages, has a smaller maximum expected loss than does the one-stage procedure. This paper generalizes Somerville's formulation of the two-stage procedure to an arbitrary finite number of populations and presents results for the special case of four populations, where two two-stage procedures are considered. For the problem of selecting the population with the largest mean from a set of normal populations with unknown means and a common, known variance, Somerville (1954) proposed a one-stage and a two-stage procedure, which eliminates one population after the first stage. He assumed that a certain loss was incurred when an incorrect selection was made and also assumed a cost due to sampling. For these procedures, he derived the sample sizes which minimize the maximum expected loss, the maximum being taken over all possible configurations of the true population means. He showed, for the special case of three populations that the two-stage procedure, using appropriate allocations of observations between stages, has a smaller maximum expected loss than does the one-stage procedure. Other approaches to this problem have been made by Bechhofer, Sobel and Gupta (see Bechhofer, 1954, and Gupta, 1965).
In this paper the formulation of Somerville's two-stage procedure is extended to an arbitrary, finite number of populations. For the case of four populations, the expected losses are analysed in detail and numerical results obtained. Two possible two-stage procedures are considered: in the first procedure, one population is discarded and in the second, two populations are discarded after the first stage. It is found that while both procedures, using appropriate allocations, have smaller maximum expected losses than does the corresponding one-stage procedure, which one of the two-stage procedures is the better depends on the allocation. These numerical results required the evaluation of certain multivariate normal integrals with arbitrary correlation matrices and arbitrary limits of integration. This was accomplished for 4- and 5-variate integrals using a method suggested
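A two-stage procedure of the kind generalized here is easy to mock up. The sketch below is illustrative only (unit variance, my own allocations and mean configurations, and the probability of correct selection rather than Fairweather's expected-loss analysis): stage 1 samples every population and discards the apparently worst, stage 2 samples the survivors and selects the largest combined mean.

```python
import random

def two_stage_select(means, n1, n2, reps=10000, seed=4):
    """Monte Carlo probability that a two-stage procedure picks the
    population with the largest true mean (unit-variance normals):
    n1 observations each in stage 1, drop the smallest sample mean,
    n2 further observations each in stage 2, pick the largest total."""
    rng = random.Random(seed)
    k = len(means)
    best = max(range(k), key=lambda i: means[i])
    correct = 0
    for _ in range(reps):
        sums = [sum(rng.gauss(mu, 1.0) for _ in range(n1)) for mu in means]
        drop = min(range(k), key=lambda i: sums[i])
        totals = {}
        for i in range(k):
            if i != drop:   # survivors get n2 further observations
                totals[i] = sums[i] + sum(rng.gauss(means[i], 1.0)
                                          for _ in range(n2))
        if max(totals, key=totals.get) == best:
            correct += 1
    return correct / reps
```

As expected, widely separated means are selected correctly almost always, while a small separation (the least favourable configurations over which Somerville's maximum is taken) drives the probability of correct selection down sharply.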
- Published
- 1968
21. The ultimate size of carrier-borne epidemics
- Author
-
F. Downton
- Subjects
Statistics and Probability ,education.field_of_study ,Exponential distribution ,Stochastic modelling ,Applied Mathematics ,General Mathematics ,Population ,Agricultural and Biological Sciences (miscellaneous) ,Expression (mathematics) ,Deterministic approximation ,Elementary function ,Applied mathematics ,Probability distribution ,Statistics, Probability and Uncertainty ,General Agricultural and Biological Sciences ,Epidemic model ,education ,Demography ,Mathematics - Abstract
SUMMARY An epidemic model is discussed in which there is an initial introduction of a carrier or carriers into a population and further carriers may be created from the susceptibles. The probability distribution for the number of survivors is derived and approximations and numerical illustrations are given. In a recent paper Weiss (1965) has considered an epidemic in a closed population which is spread not by the infected individuals but by carriers. He supposed that one or more carriers were introduced into the population and that these carriers infected susceptible individuals until detected, the time before detection having an exponential distribution. At no time during the epidemic were any additional carriers introduced or created and the epidemic terminated as soon as all the carriers were detected or all susceptibles had been infected. Weiss's model has the mathematical advantage that it is completely soluble in terms of elementary functions; see Dietz (1966) and Downton (1967b). It is, however, unrealistic in its assumption that no new carriers can be created. The present paper discusses a model in which it is supposed that after the initial introduction of a carrier or carriers, no new carriers are introduced from outside the population, but that new carriers may be created from the susceptibles in that population. The model is appropriate for the situation where a proportion π of those infected contract the disease in such a mild form that their symptoms are unnoticeable even though they are capable of passing on the disease. Such subclinically infected persons would then act as carriers until detected. For such a case we would expect π to be small; π = 0 corresponds to the Weiss model. The main aim of this paper is to obtain approximate expressions for the probability distribution and moments of the number of susceptibles surviving the epidemic in that case.
The ultimate behaviour of the epidemic does not, however, depend upon π in a simple way, so that while valid approximations have been obtained for small π, even if this results in a large epidemic, it turns out that these approximations may be used in certain cases even when π is not small. In particular, they may be employed even if π = 1, when the mathematics becomes identical with that of the Kermack-McKendrick (1927) model for the general stochastic epidemic, provided attention is confined to small subcritical epidemics. The stochastic model describing the process is defined in §2, where an explicit expression for the probability distribution of the number of survivors is given, together with an indication of the ways in which it may be obtained. In §3 the deterministic analogue of the stochastic model is discussed, providing the deterministic approximation to the mean number of survivors. This approximation is valid for small epidemics in large populations. Returning to the stochastic model, §4 shows that the survivor probabilities may be expressed in terms of
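Because only the order of events matters for the ultimate size, the model's embedded jump chain is simple to simulate. In the sketch below (my own notation: rho is the ratio of the per-susceptible infection rate to the carrier detection rate, and each new infective independently becomes a fresh carrier with probability pi), pi = 0 recovers Weiss's model:

```python
import random

def carrier_epidemic(x0, b0, rho=0.1, pi=0.05, reps=20000, seed=5):
    """Mean number of susceptibles surviving a carrier-borne epidemic
    started with x0 susceptibles and b0 carriers.  With x susceptibles
    and b > 0 carriers, the next event is an infection with probability
    rho*x/(rho*x + 1), otherwise a carrier is detected and removed."""
    rng = random.Random(seed)
    total = 0
    for _ in range(reps):
        x, b = x0, b0
        while b > 0 and x > 0:
            if rng.random() < rho * x / (rho * x + 1.0):
                x -= 1                     # a susceptible is infected
                if rng.random() < pi:      # ... and may turn carrier
                    b += 1
            else:
                b -= 1                     # a carrier is detected
        total += x
    return total / reps
```

Comparing pi = 0 with a positive pi shows directly how the creation of subclinical carriers depresses the mean number of survivors.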
- Published
- 1968
22. A generalized Kruskal-Wallis test for comparing K samples subject to unequal patterns of censorship
- Author
-
Norman E. Breslow
- Subjects
Statistics and Probability ,education.field_of_study ,Wilcoxon signed-rank test ,Kruskal–Wallis one-way analysis of variance ,Applied Mathematics ,General Mathematics ,Population ,Agricultural and Biological Sciences (miscellaneous) ,Censoring (clinical trials) ,Statistics ,Probability distribution ,Statistics, Probability and Uncertainty ,General Agricultural and Biological Sciences ,education ,Null hypothesis ,Statistic ,Parametric statistics ,Mathematics - Abstract
SUMMARY A generalization of the Kruskal-Wallis test, which extends Gehan's generalization of Wilcoxon's test, is proposed for testing the equality of K continuous distribution functions when observations are subject to arbitrary right censorship. The distribution of the censoring variables is allowed to differ for different populations. An alternative statistic is proposed for use when the censoring distributions may be assumed equal. These statistics have asymptotic chi-squared distributions under their respective null hypotheses, whether the censoring variables are regarded as random or as fixed numbers. Asymptotic power and efficiency calculations are made and numerical examples provided. A generalization of Wilcoxon's statistic for comparing two populations has been proposed by Gehan (1965a) for use when the observations are subject to arbitrary right censorship. Mantel (1967), as well as Gehan (1965b), has considered a further generalization to the case of arbitrarily restricted observation, or left and right censorship. Both of these authors base their calculations on the permutation distribution of the statistic, conditional on the observed censoring pattern for the combined sample. However, this model is inapplicable when there are differences in the distribution of the censoring variables for the two populations. For instance, in medical follow-up studies, where Gehan's procedure has so far found its widest application, this would happen if the two populations had been under study for different lengths of time. This paper extends Gehan's procedure for right censored observations to the comparison of K populations. The probability distributions of the relevant statistics are here considered in a large sample framework under two models: Model I, corresponding to random or unconditional censorship; and Model II, which considers the observed censoring times as fixed numbers. 
Since the distributions of the censoring variables are allowed to vary with the population, Gehan's procedure is also extended to the case of unequal censorship. For Model I these distributions are theoretical distributions; for Model II they are empirical. Besides providing chi-squared statistics for use in testing the hypothesis of equality of the K populations against general alternatives, the paper shows how single degrees of freedom may be partitioned for use in discriminating specific alternative hypotheses. Several investigators (Efron, 1967) have pointed out that Gehan's test is not the most efficient against certain parametric alternatives and have proposed modifications to increase its power. Asymptotic power and efficiency calculations made below demonstrate that their criticisms would apply equally well to the test proposed here. Hopefully some of the modifications they suggest can likewise eventually be generalized to the case of K
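Gehan's scoring, on which the K-sample statistic proposed here builds, can be written in a few lines. The sketch below is simplified (ties are scored zero, and only the per-group score sums are computed, not the full chi-squared quadratic form with its estimated covariance matrix under Model I or Model II):

```python
def gehan_scores(times, events):
    """Gehan scores for right-censored data: times[i] is the observed
    time and events[i] is 1 for a death, 0 for a censored observation.
    U[i] = #{j: i known to exceed j} - #{j: j known to exceed i}; a pair
    is comparable only when the smaller time is an observed death."""
    n = len(times)
    u = [0] * n
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            if times[i] > times[j] and events[j] == 1:
                u[i] += 1
            if times[j] > times[i] and events[i] == 1:
                u[i] -= 1
    return u

def group_score_sums(times, events, groups):
    """Per-group sums of Gehan scores; under equality of the K survival
    distributions each sum has expectation zero."""
    u = gehan_scores(times, events)
    sums = {}
    for ui, g in zip(u, groups):
        sums[g] = sums.get(g, 0) + ui
    return sums
```

The scores always sum to zero over the pooled sample, so a group whose score sum is strongly positive has systematically longer observed survival than the rest, and the K-sample test assesses whether the vector of group sums is too far from zero.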
- Published
- 1970
23. SIGNIFICANCE TESTS FOR DISCRIMINANT FUNCTIONS AND LINEAR FUNCTIONAL RELATIONSHIPS
- Author
-
E. J. Williams
- Subjects
Statistics and Probability ,education.field_of_study ,Applied Mathematics ,General Mathematics ,Population ,Regression analysis ,Linear discriminant analysis ,Agricultural and Biological Sciences (miscellaneous) ,Discriminant function analysis ,Discriminant ,Optimal discriminant analysis ,Statistics ,Linear regression ,Statistics, Probability and Uncertainty ,Kernel Fisher discriminant analysis ,General Agricultural and Biological Sciences ,education ,Mathematics - Abstract
In previous papers (Bartlett, 1951; Williams, 1952a) certain exact tests for the adequacy of a hypothetical discriminant function were derived. Later papers (Williams, 1952b, 1952c, 1953) showed how these tests could be applied in a number of situations of practical usefulness. The first object of the present paper is to extend the work in the above-mentioned papers and to show how the results obtained may be interpreted in terms of multiple linear regression. The calculations may indeed be carried out in the manner of a covariance analysis. The second object is to develop, along the same lines, exact tests for an assumed linear relationship among variables; this is a problem which has been discussed in various contexts by Koopmans (1937), Tintner (1945, 1946, 1950), Geary (1948, 1949), Bartlett (1948), Anderson (1951) and others. Since the question of determining underlying relationships has been given considerable attention in the literature from a number of different points of view, the opportunity is taken also to discuss and to attempt to unify the different approaches made: the use of information provided by instrumental variates, by grouping of the data, and by higher moments. The reason for discussing discriminant functions and functional relationships in the same paper is because the two problems are really different aspects of the same problem. This has been well demonstrated by Geary (1948). If a single discriminant function is assumed adequate to describe differences among a number of p-variate populations, this assumption is equivalent to assuming that there exist p − 1 linear relations among the means for the p variates; the means then lie on a line. In general, postulating that the differences among the populations are described by r discriminant functions is equivalent to postulating p − r linear relationships (provided always that the number of populations considered is not less than p).
The quantity r, the number of dimensions in which the population means lie, may be called the rank of the populations, and p − r the degeneracy. Thus the test for a single linear relationship is equivalent to the test for the adequacy of p − 1 discriminant functions. In deriving significance tests for either a discriminant function or a linear relationship, the same principle is applied, though the function being tested has a different role in the two cases, and thus enters differently into the tests. In the simple bivariate case, the test for a linear relationship is exactly the same as the test for the discriminant function which is orthogonal to it. The problems of this paper have been framed above in terms of an analysis of variance model, for testing the significance of differences between populations. This has been done in order to link them with those discussed in the earlier work (Williams, 1952a, b). A more general specification is in terms of a regression model, wherein the interrelationships between a set of p variates and another set of q variates are investigated. In such a model a discriminant function is better described as a canonical variate. Throughout the remainder of
- Published
- 1955
24. A system of models for the life cycle of a biological organism
- Author
-
J. R. Ashford and K. L. Q. Read
- Subjects
Statistics and Probability ,education.field_of_study ,Stochastic modelling ,Process (engineering) ,Applied Mathematics ,General Mathematics ,Population ,Poison control ,Agricultural and Biological Sciences (miscellaneous) ,Type (biology) ,Evolutionary biology ,Probability distribution ,Statistics, Probability and Uncertainty ,General Agricultural and Biological Sciences ,Set (psychology) ,education ,Organism ,Mathematics - Abstract
SUMMARY This paper is concerned with stochastic models for the representation of the development of a biological organism through recognizable distinct stages. The models are based upon the assumption that the periods spent in each stage of development, excluding the possibility of death, are independent random variables with a characteristic form of probability distribution. Within each stage the organism is liable to be taken by predators or to die from other causes, and incidents of this type are assumed to occur as events in a Poisson process. The development of the theory for various forms of distribution of the period spent in a given stage is considered, including the negative exponential and second and third order special Erlangian distributions. An example is given of the application of the proposed models to the analysis of sampling data from a study of the life cycle of the grasshopper, Corthippus parallelus. The main features of the life cycle of a biological organism exhibit a similar pattern over a wide variety of different types and species. The birth of the organism occurs at a clearly defined point in time and the organism then passes through a period of growth and development until it reaches maturity. In certain types of organism this process is characterized by transition through a number of distinct and easily recognizable states in turn. An insect, for example, passes through a succession of larval instars. In other types of organism the process of development is less well defined, although it is usually possible to divide the life cycle into discrete states by reference to the presence or absence or to the size of characteristic features of the organism. At every moment of its life the organism is liable to suffer death, either as a result of the action of predators, of accidents or for other reasons. If the organism does reach maturity, its life will eventually be terminated, either by natural or other causes. 
In order to carry out quantitative studies of a population of a given type of organism it is often helpful to set up a mathematical model to represent the process of birth, development and death. Since there will generally be variations from one organism to another within the same population, such a model must preferably embody a stochastic or random element, as the assessment of individual variability will form an essential part of the description of the life cycle. The main features which must be taken into account are the distributions of the times of birth, of the periods spent in each stage of development and of mortality in the various stages. The process of growth, as revealed by the size of the organism at any given stage in its development, may also be of interest. This paper is concerned with a model which was originally developed to describe the life cycle of the grasshopper, Chorthippus parallelus. The model is, however, of more general application, not only to other biological organisms, but also to studies of the structure of human populations. For example, the 'population' may correspond to a large organization, 'birth' may correspond to the recruit
- Published
- 1968
25. The analysis of several 2 × 2 contingency tables
- Author
-
Marvin Zelen
- Subjects
Statistics and Probability ,Contingency table ,education.field_of_study ,Applied Mathematics ,General Mathematics ,Population ,Asymptotic distribution ,Conditional probability distribution ,Agricultural and Biological Sciences (miscellaneous) ,Cochran's Q test ,Statistics ,Statistics, Probability and Uncertainty ,General Agricultural and Biological Sciences ,Constant (mathematics) ,education ,Random variable ,Mathematics ,Statistical hypothesis testing - Abstract
SUMMARY Consider data arranged into k 2 x 2 contingency tables. The principal result is the derivation of a statistical test for making an inference on whether each of the k contingency tables has the same relative risk. The test is based on a conditional reference set and can be regarded as an extension of the Fisher-Irwin treatment of a single 2 x 2 contingency table. Both exact and asymptotic procedures are presented. The analysis of k 2 x 2 contingency tables is required in several contexts. The two principal ones are (i) the comparison of binary response random variables, i.e. random variables taking on the values zero or one, for two treatments, over a spectrum of different conditions or populations; and (ii) the comparison of the degree of association between two binary random variables over k different populations. Cochran (1954) has investigated this problem with respect to testing if the success probability for each of two treatments is the same for every contingency table. Cochran's recommendation is that the equality of the two success probabilities should be tested using the total number, summed over all tables, of successes for one of the treatments. Cochran considers the asymptotic distribution of the total number of successes, for one of the treatments, conditional on all marginals being fixed in every table. He recommends this technique whenever the difference between the two populations on a logistic or probit scale is nearly constant for each contingency table. The constant logistic difference is equivalent to the relative risk being equal for each table. Mantel & Haenszel (1959), in an important paper discussing retrospective studies, have also proposed an asymptotic method for analysing several 2 x 2 contingency tables. Their work on this problem was evidently done independently of Cochran, for their method is exactly the same as Cochran's except for a modification dealing with the correction factor associated with a finite population. 
Birch (1964) and Cox (1966) clarified the problem by showing that, under the assumption of constant logistic differences for each table, i.e. the same relative risk, the conditional distribution of the total number of successes, for one of the treatments, leads to a uniformly most powerful unbiased test. Birch and Cox also derived the exact probability distribution of this conditional random variable under the given model. In this paper, we investigate the more general situation where the difference between the logits in each table is not necessarily constant. Procedures are derived for making an inference with regard to the hypothesis of constant logistic differences. Both the exact and asymptotic distributions are derived for the null and nonnull cases. This problem has been discussed by several investigators. A constant logistic difference corresponds to no interaction between the treatments and the k populations. For the case k = 2, Bartlett (1935) derived both an exact and an asymptotic procedure. Norton (1945)
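The Cochran pooled procedure that this abstract contrasts with the homogeneity test can be sketched numerically: sum the successes for one treatment across tables and compare the sum with its conditional (hypergeometric) mean and variance under fixed margins. This is a minimal illustration of Cochran's recommendation, not Zelen's exact conditional test; the function name and toy tables are invented for the example.

```python
def cmh_statistic(tables):
    """Cochran/Mantel-Haenszel pooled chi-squared statistic for k 2x2 tables.

    Each table is ((a, b), (c, d)): rows are treatments, columns are
    success/failure.  Conditions on all margins and sums the success
    count `a` for the first treatment over tables.
    """
    a_sum = e_sum = v_sum = 0.0
    for (a, b), (c, d) in tables:
        n = a + b + c + d
        row1, col1 = a + b, a + c
        e = row1 * col1 / n                                      # conditional mean of a
        v = row1 * (c + d) * col1 * (b + d) / (n**2 * (n - 1))   # hypergeometric variance
        a_sum += a
        e_sum += e
        v_sum += v
    return (a_sum - e_sum) ** 2 / v_sum

tables = [((10, 10), (10, 10)), ((20, 20), (20, 20))]
print(cmh_statistic(tables))   # identical balanced tables: statistic is 0.0
```

Under the null hypothesis of no treatment effect in every table the statistic is approximately chi-squared with one degree of freedom.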
- Published
- 1971
26. ON A CLASS OF SKEW DISTRIBUTION FUNCTIONS
- Author
-
Herbert A. Simon
- Subjects
Statistics and Probability ,education.field_of_study ,Zipf's law ,Skew normal distribution ,Applied Mathematics ,General Mathematics ,Population ,Class (philosophy) ,Simon model ,Agricultural and Biological Sciences (miscellaneous) ,Rank-size distribution ,Yule–Simon distribution ,Statistical physics ,Statistics, Probability and Uncertainty ,General Agricultural and Biological Sciences ,education ,Generalized normal distribution ,Mathematics - Abstract
It is the purpose of this paper to analyse a class of distribution functions that appears in a wide range of empirical data, particularly data describing sociological, biological and economic phenomena. Its appearance is so frequent, and the phenomena in which it appears so diverse, that one is led to the conjecture that if these phenomena have any property in common it can only be a similarity in the structure of the underlying probability mechanisms. The empirical distributions to which we shall refer specifically are: (A) distributions of words in prose samples by their frequency of occurrence, (B) distributions of scientists by number of papers published, (C) distributions of cities by population, (D) distributions of incomes by size, and (E) distributions of biological genera by number of species. No one supposes that there is any connexion between horse-kicks suffered by soldiers in the German army and blood cells on a microscope slide other than that the same urn scheme provides a satisfactory abstract model of both phenomena. It is in the same direction that we shall look for an explanation of the observed close similarities among the five classes of distributions listed above. The observed distributions have the following characteristics in common: (a) They are J-shaped, or at least highly skewed, with very long upper tails. The tails can generally be approximated closely by a function of the form
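The urn scheme Simon proposes for these skewed distributions can be sketched as a simulation: each new word token is either a brand-new word (with probability alpha) or a repetition of an earlier token chosen uniformly, so existing words are reused in proportion to their current frequency. A minimal sketch with invented parameter values:

```python
import random

def simon_process(n_tokens, alpha, seed=0):
    """Simulate Simon's urn scheme.  Returns a dict mapping word id to
    frequency after n_tokens tokens have been generated."""
    rng = random.Random(seed)
    tokens = [0]            # the first token introduces word 0
    counts = {0: 1}
    for _ in range(n_tokens - 1):
        if rng.random() < alpha:
            w = len(counts)          # brand-new word
        else:
            w = rng.choice(tokens)   # preferential reuse, weight = current frequency
        tokens.append(w)
        counts[w] = counts.get(w, 0) + 1
    return counts

counts = simon_process(20000, alpha=0.1)
freqs = sorted(counts.values(), reverse=True)
print(freqs[:5])   # a few very frequent words dominate the long upper tail
```

The resulting frequency distribution is highly skewed, with most words occurring once or twice and a handful occurring hundreds of times, in line with characteristic (a) above.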
- Published
- 1955
27. Birth, death and migration processes
- Author
-
E. Renshaw
- Subjects
Statistics and Probability ,education.field_of_study ,Markov chain ,Applied Mathematics ,General Mathematics ,Population ,Linear model ,Agricultural and Biological Sciences (miscellaneous) ,Birth–death process ,Quantitative Biology::Cell Behavior ,Discrete time and continuous time ,Integer ,Statistics ,Quantitative Biology::Populations and Evolution ,Applied mathematics ,Statistics, Probability and Uncertainty ,General Agricultural and Biological Sciences ,education ,Finite set ,Branching process ,Mathematics - Abstract
SUMMARY The effect of migration between a finite number of colonies each of which undergoes a simple birth and death process is studied. The first two moments are obtained for the general process and deterministic solutions are developed for several special models including the finite linear model proposed by Bailey (1968). In a recent paper Bailey (1968) constructs a model for spatially distributed populations by considering a population to be composed of an infinite number of colonies situated at the integer points of a single co-ordinate axis represented by -∞ < i < ∞. Each colony is assumed to be subject to a simple birth and death process with birth and death rates λ and μ respectively, and with migration rates to each of the two neighbouring colonies. He derives the mean number of individuals in each colony at time t together with the variances and covariances, and briefly considers the corresponding models where the population is extended to two and three dimensions. Clearly by assuming the existence of an infinite number of colonies he avoids the problem of 'edge effects' at the boundaries when the number of colonies is finite. We shall later examine this finite analogue of Bailey's model and show how his results for the infinite model follow as a special case. His paper has recently been extended by several authors. Adke (1969) generalizes Bailey's process to include time-dependent birth and death rates whilst Usher & Williamson (1970) use discrete time intervals to analyse the model when the number of colonies is finite. They consider the population to be split into migrants and nonmigrants, each group having different birth and death rates. Davis (1970) presents some results for a general Markov branching-diffusion process and then applies them to Bailey's model. 
Crump (1970) studies a general age-dependent branching process in which the population is distributed in N colonies with migration between them and he obtains asymptotic expressions for the first two moments in several special cases. The stochastic equations for the general parameter case of the simple birth-death-migration process are considered by Puri (1968).
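A hedged sketch of the kind of process these papers study: a continuous-time birth-death process on a finite ring of colonies with nearest-neighbour migration, simulated by the Gillespie algorithm. The papers derive moments analytically; simulation is used here only to illustrate the dynamics, and all parameter values are illustrative.

```python
import random

def bdm_simulate(n0, lam, mu, nu, t_end, seed=0):
    """Gillespie simulation of a linear birth-death process on a ring of
    colonies, with per-individual birth rate lam, death rate mu and
    migration rate nu to each of the two neighbouring colonies.
    n0 is the list of initial colony sizes."""
    rng = random.Random(seed)
    n = list(n0)
    k = len(n)
    t = 0.0
    while True:
        total = sum(n)
        if total == 0:
            return n                       # extinction
        t += rng.expovariate(total * (lam + mu + 2 * nu))
        if t > t_end:
            return n
        # pick an individual uniformly (colony weighted by size), then an event
        i = rng.choices(range(k), weights=n)[0]
        u = rng.random() * (lam + mu + 2 * nu)
        if u < lam:
            n[i] += 1                      # birth
        elif u < lam + mu:
            n[i] -= 1                      # death
        else:                              # migrate to the right or left neighbour
            j = (i + 1) % k if u < lam + mu + nu else (i - 1) % k
            n[i] -= 1
            n[j] += 1

colonies = bdm_simulate([50, 0, 0], lam=0.0, mu=0.0, nu=1.0, t_end=5.0)
print(colonies, sum(colonies))   # pure migration spreads individuals but conserves the total
```

With birth and death rates set to zero, migration merely redistributes the 50 individuals over the colonies, which is a convenient sanity check on the event logic.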
- Published
- 1972
28. The multi-sample single recapture census
- Author
-
G. A. F. Seber
- Subjects
Statistics and Probability ,education.field_of_study ,Applied Mathematics ,General Mathematics ,media_common.quotation_subject ,Sample (material) ,Immigration ,Population ,Census ,Agricultural and Biological Sciences (miscellaneous) ,Commercial fishing ,Statistics ,Actual practice ,Statistics, Probability and Uncertainty ,General Agricultural and Biological Sciences ,education ,Set (psychology) ,media_common ,Mathematics - Abstract
The aim of this paper is to set up a capture-recapture method for estimating the population parameters for a population in which there is both 'immigration' (including birth) and 'death' (including emigration). The method used may be readily described as the multisample single recapture census as opposed to the multiple-recapture census which has been used in many previous papers on capture-recapture analysis (see Darroch, 1958, 1959) and which as yet has not been solved satisfactorily for the most general population where both immigration and death are taking place. Although the multiple-recapture method provides some very elegant estimates of the population parameters for populations in which there is either just death or immigration, it has two main disadvantages. First, as the captured individuals are returned to the population, the method has no use commercially. Secondly, careful attention must be given to the method of capture, for, if there is any slight deviation from the underlying assumption that marked and unmarked individuals have the same probability of being caught, which is the basis of all capture-recapture models, then this deviation will be multiplied by repeated recaptures. For example, trap shyness (Leslie, 1952, p. 385) would affect the results in the same way as immigration. Thus the main advantage of the model considered below is that the individuals are only recaptured once and are then removed from the population as, for example, in commercial fishing and trapping. The technique used for this single recapture census is as follows: The experimenter, using differentiated marking, releases batches of marked individuals of sizes a1, a2, ... into the population he is investigating, and after each batch ai is released, a commercial catch of size bi is made and the individuals are killed, thus giving the sequence a1 added, b1 removed, a2 added, b2 removed, etc. 
The numbers of marked individuals from the different ai and unmarked individuals are noted for each catch bi and passed on to the experimenter. Ideally, the marked individuals which are to be released should either be caught before the whole experiment and stored, or perhaps taken from a similar population not connected with the one under investigation. In actual practice, however, the experimenter could take the samples ai from the population during the experiment, because, in general, the ai, although large, will be much smaller than the bi and therefore the recaptures in the successive a1, a2, ... will be negligible. Also the overall reduction in the number of un
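Seber's multi-sample estimators are derived in the paper itself; in the degenerate case of a single release batch followed by a single kill-catch, the design reduces to a Petersen-type estimate. A minimal sketch using Chapman's bias-adjusted form (the function name and numbers below are illustrative, not taken from the paper):

```python
def chapman_estimate(marked, catch, recaptured):
    """Chapman's nearly unbiased version of the Petersen estimate of
    population size: `marked` individuals are released, then a catch of
    size `catch` is taken in which `recaptured` of them are found marked."""
    return (marked + 1) * (catch + 1) / (recaptured + 1) - 1

# 100 marked released; a kill-catch of 200 contains 20 marked individuals
print(round(chapman_estimate(100, 200, 20), 2))   # → 965.71
```

The +1 terms keep the estimator finite even when no marked individuals are recaptured, which the plain Petersen ratio does not.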
- Published
- 1962
29. A two-sample procedure for selecting the population with the largest mean from several normal populations with unknown variances
- Author
-
John B. Ofosu
- Subjects
Statistics and Probability ,education.field_of_study ,Applied Mathematics ,General Mathematics ,Population ,Sobel operator ,Variance (accounting) ,Agricultural and Biological Sciences (miscellaneous) ,Sample size determination ,Statistics ,Econometrics ,Two sample ,Statistics, Probability and Uncertainty ,General Agricultural and Biological Sciences ,education ,Mathematics - Abstract
SUMMARY This paper gives a two-sample procedure for selecting the population with the largest mean from k normal populations with unknown variances. The method is based on a two-sample procedure proposed by Stein (1945). Tables necessary for the application of the procedure are given for selected values of k. Comparisons of the minimum values of the expected sample sizes using the proposed procedure are made with the corresponding single-sample sizes for known variances (Bechhofer, 1954). Comparisons are also made of the expected total sample sizes for the single-sample procedure, the two-sample procedure given in this paper and the two-sample procedure proposed by Bechhofer, Dunnett & Sobel (1954) which assumes that the populations have known variance ratios. It is shown that the expected total sample sizes are not much increased by ignorance of the variance ratios.
- Published
- 1973
30. OUP accepted manuscript
- Author
-
Yukun Liu, Pengfei Li, and Jing Qin
- Subjects
Statistics and Probability ,Mean squared error ,General Mathematics ,Population ,01 natural sciences ,010104 statistics & probability ,03 medical and health sciences ,0302 clinical medicine ,Statistics ,Null distribution ,Statistics::Methodology ,030212 general & internal medicine ,Point estimation ,0101 mathematics ,education ,Mathematics ,Abundance estimation ,education.field_of_study ,Applied Mathematics ,Estimator ,Agricultural and Biological Sciences (miscellaneous) ,Statistics::Computation ,Semiparametric model ,Empirical likelihood ,Statistics, Probability and Uncertainty ,General Agricultural and Biological Sciences - Abstract
SUMMARY Capture-recapture experiments are widely used to collect data needed for estimating the abundance of a closed population. To account for heterogeneity in the capture probabilities, Huggins (1989) and Alho (1990) proposed a semiparametric model in which the capture probabilities are modelled parametrically and the distribution of individual characteristics is left unspecified. A conditional likelihood method was then proposed to obtain point estimates and Wald-type confidence intervals for the abundance. Empirical studies show that the small-sample distribution of the maximum conditional likelihood estimator is strongly skewed to the right, which may produce Wald-type confidence intervals with lower limits that are less than the number of captured individuals or even are negative. In this paper, we propose a full empirical likelihood approach based on Huggins and Alho's model. We show that the null distribution of the empirical likelihood ratio for the abundance is asymptotically chi-squared with one degree of freedom, and that the maximum empirical likelihood estimator achieves semiparametric efficiency. Simulation studies show that the empirical likelihood-based method is superior to the conditional likelihood-based method: its confidence interval has much better coverage, and the maximum empirical likelihood estimator has a smaller mean square error. We analyse three datasets to illustrate the advantages of our empirical likelihood approach.
- Published
- 2017
31. Asymptotically design-unbiased predictors in survey sampling
- Author
-
S. M. Tam
- Subjects
Statistics and Probability ,education.field_of_study ,Applied Mathematics ,General Mathematics ,Population ,Equal probability ,Survey sampling ,Linear prediction ,Sample (statistics) ,Type (model theory) ,Agricultural and Biological Sciences (miscellaneous) ,Statistics ,Sampling design ,Econometrics ,Statistics, Probability and Uncertainty ,General Agricultural and Biological Sciences ,education ,Selection (genetic algorithm) ,Mathematics - Abstract
SUMMARY A probability sampling plan is suggested in this paper for drawing the sample so that the linear least-squares predictor of the total of a finite population is made asymptotically design-unbiased. Conditions are given for the sampling plan to reduce to the more familiar sample designs discussed in the literature. The linear least-squares predictor of the total of a finite population (Royall, 1970, 1976) derived under a superpopulation model with explanatory variables omitted is still best under the correct model provided that it is unbiased under the correct model (Tam, 1986). The present paper specifies probability sampling schemes which render the linear least-squares predictor asymptotically design-unbiased. Such probability schemes enable some general form of balancing in the sample for the explanatory variables omitted from the assumed superpopulation model, and provide protection of the predictor against this type of model failure. Conditions are also presented in this paper for the specified sample design to become the equal probability selection sample design or the optimal sample design considered by Godambe (1955, 1982).
- Published
- 1988
32. Estimation of order-restricted means from correlated data
- Author
-
David B. Dunson, Shyamal D. Peddada, and Xiaofeng Tan
- Subjects
Statistics and Probability ,education.field_of_study ,Multivariate analysis ,Biometrics ,Covariance matrix ,Iterative method ,Multivariate random variable ,Applied Mathematics ,General Mathematics ,Population ,Multivariate normal distribution ,Agricultural and Biological Sciences (miscellaneous) ,Simple (abstract algebra) ,Statistics ,Statistics, Probability and Uncertainty ,General Agricultural and Biological Sciences ,education ,Algorithm ,Mathematics - Abstract
SUMMARY In many applications, researchers are interested in estimating the mean of a multivariate normal random vector whose components are subject to order restrictions. Various authors have demonstrated that the likelihood-based methodology may perform poorly under certain conditions for such problems. The problem is much harder when the underlying covariance matrix is nondiagonal. In this paper a simple iterative algorithm is introduced that can be used for estimating the mean of a multivariate normal population when the components are subject to any order restriction. The proposed methodology is illustrated through an application to human reproductive hormone data.
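In the diagonal-covariance special case, estimating a nondecreasing mean reduces to isotonic regression, for which the classical pool-adjacent-violators algorithm applies. A minimal sketch of that building block (not the paper's iterative algorithm for nondiagonal covariance matrices):

```python
def pava(y, w=None):
    """Pool-adjacent-violators algorithm: weighted least-squares fit of a
    nondecreasing sequence to y, the diagonal-covariance special case of
    order-restricted mean estimation."""
    w = [1.0] * len(y) if w is None else list(w)
    means, weights, sizes = [], [], []
    for yi, wi in zip(y, w):
        means.append(yi); weights.append(wi); sizes.append(1)
        # merge adjacent blocks while monotonicity is violated
        while len(means) > 1 and means[-2] > means[-1]:
            m2, w2, s2 = means.pop(), weights.pop(), sizes.pop()
            means[-1] = (weights[-1] * means[-1] + w2 * m2) / (weights[-1] + w2)
            weights[-1] += w2
            sizes[-1] += s2
    out = []
    for m, s in zip(means, sizes):
        out.extend([m] * s)
    return out

print(pava([1.0, 3.0, 2.0, 4.0]))   # → [1.0, 2.5, 2.5, 4.0]
```

The violating pair (3, 2) is pooled to its average 2.5, yielding the closest nondecreasing fit in the least-squares sense.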
- Published
- 2005
33. Conditional Akaike information for mixed-effects models
- Author
-
Suzette Blanchard and Florin Vaida
- Subjects
Statistics and Probability ,Mixed model ,education.field_of_study ,Restricted maximum likelihood ,Applied Mathematics ,General Mathematics ,Model selection ,Population ,Linear model ,Conditional probability distribution ,Agricultural and Biological Sciences (miscellaneous) ,Bayesian information criterion ,Statistics ,Econometrics ,Statistics, Probability and Uncertainty ,Akaike information criterion ,General Agricultural and Biological Sciences ,education ,Mathematics - Abstract
SUMMARY This paper focuses on the Akaike information criterion, AIC, for linear mixed-effects models in the analysis of clustered data. We make the distinction between questions regarding the population and questions regarding the particular clusters in the data. We show that the AIC in current use is not appropriate for the focus on clusters, and we propose instead the conditional Akaike information and its corresponding criterion, the conditional AIC, CAIC. The penalty term in CAIC is related to the effective degrees of freedom p for a linear mixed model proposed by Hodges & Sargent (2001); p reflects an intermediate level of complexity between a fixed-effects model with no cluster effect and a corresponding model with fixed cluster effects. The CAIC is defined for both maximum likelihood and residual maximum likelihood estimation. A pharmacokinetics data application is used to illuminate the distinction between the two inference settings, and to illustrate the use of the conditional AIC in model selection.
- Published
- 2005
34. Efficient balanced sampling: The cube method
- Author
-
Jean-Claude Deville and Yves Tillé
- Subjects
Statistics and Probability ,education.field_of_study ,Applied Mathematics ,General Mathematics ,Balanced sampling ,Population ,Estimator ,Cube (algebra) ,Agricultural and Biological Sciences (miscellaneous) ,Horvitz–Thompson estimator ,Set (abstract data type) ,Sampling design ,Statistics ,Probability distribution ,Statistics, Probability and Uncertainty ,General Agricultural and Biological Sciences ,education ,Mathematics - Abstract
A balanced sampling design is defined by the property that the Horvitz-Thompson estimators of the population totals of a set of auxiliary variables equal the known totals of these variables. Therefore the variances of estimators of totals of all the variables of interest are reduced, depending on the correlations of these variables with the controlled variables. In this paper, we develop a general method, called the cube method, for selecting approximately balanced samples with equal or unequal inclusion probabilities and any number of auxiliary variables.
- Published
- 2004
35. Estimation of treatment effects in randomised trials with non-compliance and a dichotomous outcome using structural mean models
- Author
-
Andrea Rotnitzky and James M. Robins
- Subjects
Statistics and Probability ,education.field_of_study ,Applied Mathematics ,General Mathematics ,Population ,Agricultural and Biological Sciences (miscellaneous) ,Outcome (probability) ,law.invention ,Clinical trial ,Randomized controlled trial ,law ,Statistics ,Covariate ,Observational study ,Statistics, Probability and Uncertainty ,General Agricultural and Biological Sciences ,education ,Additive model ,Categorical variable ,Mathematics - Abstract
In this paper we consider the estimation of the effect of received treatment in randomised clinical trials with non-compliance and a dichotomous outcome using structural mean models. We allow the assigned and received treatments to be continuous, categorical or ordinal, and allow for the possibility that the assigned treatment has a direct effect on the outcome through pathways other than the received treatment. We also consider the application of our results to observational studies. The parameters of a structural mean model measure, on an appropriate scale, how the effect of the received treatment on the treated population varies across levels of pre-treatment covariates. Thus, these models are useful for assessing whether or not the effect of received treatment is modified by baseline covariates. Robins (1989, 1994) showed that, when the randomisation probabilities are known, both additive and multiplicative structural mean models that respectively impose a linear
- Published
- 2004
36. Stochastic multitype epidemics in a community of households: Estimation of threshold parameter R* and secure vaccination coverage
- Author
-
Frank Ball, Tom Britton, and Owen D. Lyne
- Subjects
Statistics and Probability ,Estimation ,education.field_of_study ,Linear programming ,Stochastic modelling ,Threshold limit value ,Applied Mathematics ,General Mathematics ,Population ,Agricultural and Biological Sciences (miscellaneous) ,Upper and lower bounds ,Vaccination ,Standard error ,Statistics ,Statistics, Probability and Uncertainty ,General Agricultural and Biological Sciences ,education ,Mathematics - Abstract
SUMMARY This paper is concerned with estimation of the threshold parameter R* for a stochastic model for the spread of a susceptible → infective → removed epidemic among a closed, finite population that contains several types of individual and is partitioned into households. It turns out that R* cannot be estimated consistently from final outcome data, so a Perron-Frobenius argument is used to obtain sharp lower and upper bounds for R*, which can be estimated consistently. Determining the allocation of vaccines that reduces the upper bound for R* to its threshold value of one, thus preventing the occurrence of a major outbreak, with minimum vaccine coverage is shown to be a linear programming problem. The estimates of R*, before and after vaccination, and of the secure vaccination coverage, i.e. the proportion of individuals that have to be vaccinated to reduce the upper bound for R* to 1 assuming an optimal vaccination scheme, are equipped with standard errors, thus yielding conservative confidence bounds for these key epidemiological parameters. The methodology is illustrated by application to data on influenza outbreaks in Tecumseh, Michigan.
- Published
- 2004
37. A Poisson model for the coverage problem with a genomic application
- Author
-
Chang Xuan Mao and Bruce G. Lindsay
- Subjects
Statistics and Probability ,Discrete mathematics ,education.field_of_study ,Applied Mathematics ,General Mathematics ,Population ,Estimator ,Sample (statistics) ,Mixture model ,Poisson distribution ,Agricultural and Biological Sciences (miscellaneous) ,Moment (mathematics) ,symbols.namesake ,Minimum distance estimation ,Statistics ,symbols ,Statistics, Probability and Uncertainty ,General Agricultural and Biological Sciences ,education ,Turing ,computer ,Mathematics ,computer.programming_language - Abstract
SUMMARY Suppose a population has infinitely many individuals and is partitioned into an unknown number N of disjoint classes. The sample coverage of a random sample from the population is the total proportion of the classes observed in the sample. This paper uses a nonparametric Poisson mixture model to give new understanding and results for inference on the sample coverage. The Poisson mixture model provides a simplified framework for inferring any general abundance-K coverage, the sum of the proportions of those classes that contribute exactly k individuals in the sample for some k in K, with K being a set of nonnegative integers. A new moment-based derivation of the well-known Turing estimators is presented. As an application, a gene-categorisation problem in genomic research is addressed. Since Turing's approach is a moment-based method, maximum likelihood estimation and minimum distance estimation are indicated as alternatives for the coverage problem. Finally, it will be shown that any Turing estimator is asymptotically fully efficient.
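The simplest member of this family is Turing's estimator of the ordinary sample coverage, 1 - f1/n, where f1 counts the classes seen exactly once in a sample of size n. A minimal sketch:

```python
from collections import Counter

def turing_coverage(sample):
    """Turing's estimator of sample coverage: 1 - f1/n, where f1 is the
    number of classes observed exactly once and n the sample size."""
    n = len(sample)
    f1 = sum(1 for c in Counter(sample).values() if c == 1)
    return 1.0 - f1 / n

# 'c' is the only singleton among six observations: coverage 1 - 1/6
print(round(turing_coverage(list("aaabbc")), 4))   # → 0.8333
```

Intuitively, the proportion of singletons estimates the probability that the next draw belongs to an unseen class, so its complement estimates the coverage.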
- Published
- 2002
38. Empty confidence sets for epidemics, branching processes and Brownian motion
- Author
-
Tom Britton, Philip D. O'Neill, and Frank Ball
- Subjects
Statistics and Probability ,education.field_of_study ,Applied Mathematics ,General Mathematics ,Population ,Hitting time ,Inference ,Agricultural and Biological Sciences (miscellaneous) ,Branching (linguistics) ,Distribution (mathematics) ,Statistics ,Quantitative Biology::Populations and Evolution ,Statistical physics ,Statistics, Probability and Uncertainty ,General Agricultural and Biological Sciences ,education ,Basic reproduction number ,Brownian motion ,Branching process ,Mathematics - Abstract
SUMMARY This paper treats some examples where likelihood-based inference for certain model parameters may produce empty confidence sets. The first example concerns epidemics, and the parameter of interest is the basic reproduction number R0, which is to be estimated from the final size of an epidemic in a finite population. The second example treats estimation of the mean of the offspring distribution in a branching process, based on observing the total progeny, i.e. the total number of individuals ever born in the branching process. The final example considers estimation of the linear drift in a Brownian motion, based on observing the first hitting time of some horizontal barrier.
- Published
- 2002
39. Bayesian capture-recapture methods for error detection and estimation of population size: Heterogeneity and dependence
- Author
-
Sanjib Basu and Nader Ebrahimi
- Subjects
Statistics and Probability ,education.field_of_study ,Applied Mathematics ,General Mathematics ,Bayesian probability ,Population ,Estimator ,Bayesian inference ,Agricultural and Biological Sciences (miscellaneous) ,Dirichlet distribution ,Statistics::Computation ,Mark and recapture ,symbols.namesake ,Statistics ,symbols ,Econometrics ,Nuisance parameter ,Statistics, Probability and Uncertainty ,General Agricultural and Biological Sciences ,education ,Gibbs sampling ,Mathematics - Abstract
SUMMARY This paper considers estimation of the unknown size N of a population based on multiple capture-recapture samples. We extend the Bayesian multiple recapture model to accommodate possible heterogeneity and dependence among the samples and possible heterogeneity within the samples. In the dependent model, we show that posterior inference for N is independent of almost all the nuisance parameters. We develop a flexible Bayesian model for heterogeneity within samples and demonstrate how Gibbs sampling can be used to calculate the Bayesian estimator for N and other quantities of interest. The performance of the proposed estimators is evaluated by simulation under both correct and incorrect model specifications, and we illustrate our methods in two examples about software review and estimation of a cottontail rabbit population.
- Published
- 2001
40. Consistency of the bootstrap procedure in individual bioequivalence
- Author
-
Jürgen Kübler, Jun Shao, and Iris Pigeot
- Subjects
Statistics and Probability ,education.field_of_study ,Restricted maximum likelihood ,Applied Mathematics ,General Mathematics ,Population ,Estimator ,Bioequivalence ,Agricultural and Biological Sciences (miscellaneous) ,Moment (mathematics) ,Bias of an estimator ,Consistency (statistics) ,Statistics ,Econometrics ,Statistics, Probability and Uncertainty ,General Agricultural and Biological Sciences ,education ,Quantile ,Mathematics - Abstract
Recently, new concepts have been proposed for assessing bioequivalence of two drug formulations, namely the so-called population and individual bioequivalence. Using moment-based and probability-based measures for evaluating the proposed bioequivalence concepts, criteria have been formulated to decide whether two formulations should be regarded as bioequivalent or not. This decision has of course to be based on an adequate statistical method, where the Food and Drug Administration (FDA) guidance (1997) recommends the use of a bootstrap percentile interval. In this paper, we discuss theoretical properties such as consistency and accuracy of the recommended bootstrap intervals. We focus our investigations on the concept of individual bioequivalence, and here especially on the scaled versions of the moment-based as well as the probability-based measures as recommended by the FDA. As estimates for the former, we consider those obtained from a corresponding analysis of variance and restricted maximum likelihood estimators under mixed effect models, whereas an unbiased estimator of the latter can be derived from the corresponding relative frequencies.
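The bootstrap percentile interval recommended by the FDA guidance can be sketched generically as follows; the data and statistic are invented for illustration and are not bioequivalence measures:

```python
import random
import statistics

def percentile_ci(data, stat, level=0.90, n_boot=2000, seed=1):
    """Bootstrap percentile interval for an arbitrary statistic:
    resample with replacement, evaluate the statistic on each resample,
    and take the corresponding empirical quantiles."""
    rng = random.Random(seed)
    reps = sorted(stat([rng.choice(data) for _ in data]) for _ in range(n_boot))
    lo = reps[int(n_boot * (1 - level) / 2)]
    hi = reps[int(n_boot * (1 + level) / 2) - 1]
    return lo, hi

data = [4.1, 4.8, 5.0, 5.2, 5.4, 5.9, 6.3, 6.6]
lo, hi = percentile_ci(data, statistics.mean)
print(round(lo, 2), round(hi, 2))
```

The consistency question the paper addresses is precisely whether intervals of this form attain their nominal level for the scaled bioequivalence measures.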
- Published
- 2000
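The bootstrap percentile interval recommended by the FDA guidance can be sketched generically. The data and statistic below are hypothetical, and this is not the paper's scaled bioequivalence criterion, just the type of interval whose consistency the paper studies:

```python
import random

def bootstrap_percentile_ci(data, stat, level=0.90, n_boot=2000, seed=1):
    """Percentile bootstrap interval for a generic statistic -- the kind
    of interval the FDA guidance recommends; a sketch, not the paper's
    specific bioequivalence measure."""
    rng = random.Random(seed)
    n = len(data)
    reps = sorted(stat([data[rng.randrange(n)] for _ in range(n)])
                  for _ in range(n_boot))
    alpha = (1 - level) / 2
    return reps[int(alpha * n_boot)], reps[int((1 - alpha) * n_boot) - 1]

data = [2.1, 2.5, 1.9, 2.8, 2.2, 2.4, 2.0, 2.6]   # hypothetical responses
mean = lambda xs: sum(xs) / len(xs)
lo, hi = bootstrap_percentile_ci(data, mean)
```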
41. Adaptive cluster sampling with networks selected without replacement
- Author
-
M. Mohammad Salehi and George A. F. Seber
- Subjects
Statistics and Probability ,education.field_of_study ,Basis (linear algebra) ,Applied Mathematics ,General Mathematics ,Population ,Initial sample ,Estimator ,Agricultural and Biological Sciences (miscellaneous) ,Cluster design ,Statistics ,Cluster sampling ,Statistics, Probability and Uncertainty ,General Agricultural and Biological Sciences ,education ,Unit (ring theory) ,Mathematics - Abstract
SUMMARY In the adaptive cluster design introduced by Thompson (1990), a finite population of units under investigation is partitioned into networks on the basis of a specified condition for adding neighbourhoods to a sampled unit. An initial sample of units is taken and a network may be sampled more than once. In this paper, we introduce a modification of the design in which networks are sampled only once. Two unbiased estimators are considered and the Rao-Blackwell theorem is used to improve them in terms of efficiency. The various estimators are compared using two examples.
- Published
- 1997
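The networks in Thompson's design are the connected groups of units that satisfy the condition for adaptive addition of neighbourhoods. A sketch of that partition on a hypothetical grid (toy data and a made-up helper; the paper's estimators are not shown):

```python
# Partition a grid population into networks, the building block of the
# adaptive cluster design: units satisfying the condition (here y > 0)
# that are edge-neighbours belong to one network, while every unit
# failing the condition forms a singleton network.  Toy data only.

def networks(grid, condition=lambda y: y > 0):
    rows, cols = len(grid), len(grid[0])
    label = {}
    current = 0
    for r in range(rows):
        for c in range(cols):
            if (r, c) in label:
                continue
            current += 1
            if not condition(grid[r][c]):
                label[(r, c)] = current          # singleton network
                continue
            stack = [(r, c)]                     # flood-fill the network
            while stack:
                i, j = stack.pop()
                if (i, j) in label:
                    continue
                label[(i, j)] = current
                for di, dj in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                    ni, nj = i + di, j + dj
                    if (0 <= ni < rows and 0 <= nj < cols
                            and (ni, nj) not in label
                            and condition(grid[ni][nj])):
                        stack.append((ni, nj))
    return label

grid = [[0, 3, 2],
        [0, 0, 1],
        [5, 0, 0]]
lab = networks(grid)
```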
42. Fitting regression models to case-control data by maximum likelihood
- Author
-
Chris J. Wild and Alastair Scott
- Subjects
Statistics and Probability ,education.field_of_study ,Covariance matrix ,Restricted maximum likelihood ,Applied Mathematics ,General Mathematics ,Population ,Regression analysis ,Maximum likelihood sequence estimation ,Logistic regression ,Agricultural and Biological Sciences (miscellaneous) ,Statistics ,Expectation–maximization algorithm ,Econometrics ,Statistics, Probability and Uncertainty ,General Agricultural and Biological Sciences ,education ,Likelihood function ,Mathematics - Abstract
SUMMARY We consider fitting categorical regression models to data obtained by either stratified or nonstratified case-control, or response-selective, sampling from a finite population with known population totals in each response category. With certain models, such as the logistic with appropriate constant terms, a method variously known as conditional maximum likelihood (Breslow & Cain, 1988) or pseudo-conditional likelihood (Wild, 1991), which involves the prospective fitting of a pseudo-model, yields maximum likelihood estimates from case-control data. We extend these results by showing that the maximum likelihood estimates for any model can be found by iterating this process with a simple updating of offset parameters. Attention is also paid to estimation of the asymptotic covariance matrix. One benefit of the results of this paper is the ability to obtain maximum likelihood estimates of the parameters of logistic models for stratified case-control studies, cf. Breslow & Cain (1988) and Scott & Wild (1991), using an ordinary logistic regression program, even when the stratum constants are modelled.
- Published
- 1997
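A minimal sketch of the pseudo-model idea described above: an ordinary prospective logistic fit to case-control data recovers the slope, and with known population totals the intercept follows from an offset correction. The simulation and the hand-rolled Newton-Raphson fit below are illustrative; this is the classical one-step logistic result, not the paper's general iterative algorithm:

```python
import numpy as np

# Offset correction for case-control logistic regression: fit the usual
# prospective model, then subtract log((n1/N1)/(n0/N0)) from the
# intercept, where N1, N0 are known population totals.  Simulated data.

rng = np.random.default_rng(0)
beta0, beta1 = -3.0, 1.5                     # true parameters

N = 200_000                                  # finite "population"
x = rng.normal(size=N)
p = 1 / (1 + np.exp(-(beta0 + beta1 * x)))
y = rng.random(N) < p
N1, N0 = int(y.sum()), int(N - y.sum())

n1 = n0 = 2000                               # case-control sample
cases = rng.choice(np.flatnonzero(y), n1, replace=False)
controls = rng.choice(np.flatnonzero(~y), n0, replace=False)
idx = np.concatenate([cases, controls])
xs, ys = x[idx], y[idx].astype(float)

# Newton-Raphson for the prospective logistic pseudo-model
X = np.column_stack([np.ones_like(xs), xs])
b = np.zeros(2)
for _ in range(25):
    mu = 1 / (1 + np.exp(-X @ b))
    W = mu * (1 - mu)
    b += np.linalg.solve(X.T @ (X * W[:, None]), X.T @ (ys - mu))

offset = np.log((n1 / N1) / (n0 / N0))
intercept_hat = b[0] - offset                # recovers beta0
slope_hat = b[1]                             # recovers beta1
```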
43. An analysis of categorical repeated measurements in the presence of an external factor
- Author
-
P. I. McCLOUD and J. N. Darroch
- Subjects
Statistics and Probability ,Contingency table ,education.field_of_study ,Applied Mathematics ,General Mathematics ,Population ,Extension (predicate logic) ,Agricultural and Biological Sciences (miscellaneous) ,Symmetry (physics) ,Flow (mathematics) ,Joint probability distribution ,Statistics ,Statistics, Probability and Uncertainty ,General Agricultural and Biological Sciences ,education ,Categorical variable ,Mathematics - Abstract
SUMMARY This paper provides a log-linear method for the analysis of contingency tables where the population of individuals is divided into distinct subpopulations and where repeated measurements are made on each individual. The proposed model for the full joint probabilities is quasi-symmetry within each subpopulation; the model is derived from one of two within-individual models. The method by which the hypotheses that flow from the model are tested is an extension of the test of marginal symmetry, achieved by testing complete symmetry conditional on quasi-symmetry.
- Published
- 1995
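The symmetry testing that this abstract extends rests on simple statistics for square tables of repeated measurements. As a minimal sketch, Bowker's chi-squared test of complete symmetry is shown here with toy counts (it is a building block for, not the paper's, quasi-symmetry method):

```python
# Bowker's chi-squared statistic for complete symmetry of a square
# contingency table: under symmetry, n_ij and n_ji should be close for
# every off-diagonal pair.  Toy counts, not the paper's data.

def bowker_statistic(table):
    """table[i][j] = count of (first measurement i, second measurement j).
    Returns (chi-squared statistic, degrees of freedom)."""
    k = len(table)
    stat, df = 0.0, 0
    for i in range(k):
        for j in range(i + 1, k):
            nij, nji = table[i][j], table[j][i]
            if nij + nji > 0:
                stat += (nij - nji) ** 2 / (nij + nji)
                df += 1
    return stat, df

table = [[20,  5,  3],
         [10, 30,  4],
         [ 2,  6, 25]]
stat, df = bowker_statistic(table)
```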
44. Analysis of binary data from a multicentre clinical trial
- Author
-
Trivellore E. Raghunathan and Yoichi
- Subjects
Statistics and Probability ,education.field_of_study ,Applied Mathematics ,General Mathematics ,Population ,Fixed effects model ,Random effects model ,Agricultural and Biological Sciences (miscellaneous) ,Confidence interval ,law.invention ,Binomial distribution ,Randomized controlled trial ,law ,Sample size determination ,Statistics ,Econometrics ,Point estimation ,Statistics, Probability and Uncertainty ,General Agricultural and Biological Sciences ,education ,Mathematics - Abstract
SUMMARY We develop several methods for estimating the treatment effect difference, defined as the overall log-odds ratio of favourable response, in a multicentre clinical trial comparing two treatments with binary response. A simulation study compares the bias and mean squared error of the point estimates and the exact coverage probabilities of confidence intervals obtained under the assumed distributions. Multicentre randomized clinical trials are frequently used to test the efficacy of new medical regimens. These trials are attractive due to the ease of getting the desired sample size in a short period of time and the broad coverage of the patient population and medical practitioners. The analysis of data from such trials requires combining information from the centres in a way that properly accounts for the variation due to the centres and the differential efficacy of treatments between the centres. In this paper we investigate several approaches to pooling the data across centres and for handling the effects associated with the centres. Chakravarthi & Grizzle (1975) and Beitler & Landis (1985) argue for treating these effects as random, and Fleiss (1986) argues against, because the centres are seldom chosen at random from a well-defined population of clinics and therefore the effects cannot be associated with any population. We prefer the random effects model approach for the following reasons. The goal of a trial is to extend the results of the study to the general population, duly accounting for variation in the patient population and the medical skills of practitioners. Hence, if we believe that the variability between centres as estimated by the data is typical of the variability in the population of centres, then the inference based on the random effects model is more appropriate than the one based on the fixed effects model. Furthermore, the variation among the centres may be more than that predicted from samples from a binomial distribution.
Treating the centre effects as random introduces this intra-class correlation into our model assumptions. Let the outcome variable from each centre be a pair of binomial random variables XAj
- Published
- 1993
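The extra-binomial variation mentioned above has a simple closed form: if the per-centre success probability varies with mean p and variance v, a centre's count variance is n·p·(1−p) + n·(n−1)·v, exceeding the binomial variance whenever v > 0. A numerical sketch with hypothetical values, not the paper's model fit:

```python
# Overdispersion induced by random centre effects: if Y|P ~ Binomial(n, P)
# and P has mean p and variance v across centres, then
#   Var(Y) = n*p*(1-p) + n*(n-1)*v  >  n*p*(1-p)  when v > 0.
# Hypothetical values below, not the paper's data.

def overdispersed_variance(n, p, v):
    return n * p * (1 - p) + n * (n - 1) * v

n, p, v = 50, 0.3, 0.01          # 50 patients per centre, made-up v
var_binom = n * p * (1 - p)      # variance if all centres were identical
var_mixed = overdispersed_variance(n, p, v)
```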
45. Empirical likelihood estimation for finite populations and the effective usage of auxiliary information
- Author
-
Jiahua Chen and Jing Qin
- Subjects
Statistics and Probability ,education.field_of_study ,Estimation theory ,Applied Mathematics ,General Mathematics ,Population ,Inference ,Sample (statistics) ,Variance (accounting) ,Agricultural and Biological Sciences (miscellaneous) ,Empirical likelihood ,Statistics ,Econometrics ,Statistics::Methodology ,Statistics, Probability and Uncertainty ,General Agricultural and Biological Sciences ,education ,Jackknife resampling ,Mathematics ,Quantile - Abstract
SUMMARY In finite population inference problems, auxiliary population information is often available. We show in this paper that the empirical likelihood method can be naturally applied to such problems to make effective use of the auxiliary information. We prove that the resulting estimates have smaller asymptotic variances than the usual estimates which do not use auxiliary information. A Bahadur-type representation for empirical likelihood sample quantiles is given. Simulation results show that the empirical likelihood estimates perform well among a number of competitors and are model robust.
- Published
- 1993
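The core mechanism here — reweighting the sample so it respects a known auxiliary population mean — can be sketched directly: maximise the product of weights subject to the constraints, which gives weights of the form 1/(n(1 + λ(x_i − μ))) with λ found by root-finding. A minimal sketch with made-up numbers, not the paper's survey estimators:

```python
# Empirical-likelihood-type weights that force a sample to match a known
# population mean mu of an auxiliary variable x.  The solution of
# max prod(w_i) s.t. sum(w_i)=1, sum(w_i*x_i)=mu is
# w_i = 1/(n*(1 + lam*(x_i - mu))), with lam solving
# sum (x_i - mu)/(1 + lam*(x_i - mu)) = 0.  Toy data.

def el_weights(x, mu, tol=1e-12):
    n = len(x)
    d = [xi - mu for xi in x]
    # bracket lam so that every weight stays positive
    lo = -1 / max(d) + 1e-9 if max(d) > 0 else -1e6
    hi = -1 / min(d) - 1e-9 if min(d) < 0 else 1e6

    def g(lam):                 # profile equation; decreasing in lam
        return sum(di / (1 + lam * di) for di in d)

    while hi - lo > tol:        # bisection
        mid = (lo + hi) / 2
        if g(mid) > 0:
            lo = mid
        else:
            hi = mid
    lam = (lo + hi) / 2
    return [1 / (n * (1 + lam * di)) for di in d]

x = [1.0, 2.0, 3.0, 4.0, 5.0]
mu = 2.5                        # known population mean, below sample mean
w = el_weights(x, mu)
```

Since μ is below the sample mean here, the solver downweights the large observations.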
46. A kernel method for estimating finite population distribution functions using auxiliary information
- Author
-
Anthony Y. C. Kuk
- Subjects
Statistics and Probability ,education.field_of_study ,Distribution (number theory) ,Applied Mathematics ,General Mathematics ,Population ,Estimator ,Conditional probability distribution ,Agricultural and Biological Sciences (miscellaneous) ,Variable (computer science) ,Kernel method ,Robustness (computer science) ,Kernel (statistics) ,Statistics ,Applied mathematics ,Statistics, Probability and Uncertainty ,General Agricultural and Biological Sciences ,education ,Mathematics - Abstract
SUMMARY This paper considers the use of auxiliary information to improve the estimation of the distribution function of a finite population. A method is proposed which combines the known distribution of the auxiliary variable with a kernel estimate of the conditional distribution of the survey variable given the value of the auxiliary variable. The resulting estimator compares favourably with existing estimators in terms of efficiency, conditional behaviour and robustness. The proposed method also has the advantage of producing estimates which are bona fide distribution functions. This is not the case for many calibration type estimators suggested in the literature. Finally, the proposed method is applicable under any probability sampling scheme.
- Published
- 1993
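The construction described above can be sketched in a few lines: smooth the indicators 1{y_i ≤ t} over x by a kernel to estimate the conditional distribution, then average over the known auxiliary values of every population unit. Because it averages indicators, the result is automatically a bona fide distribution function. Toy data, and a Gaussian kernel chosen for illustration rather than taken from the paper:

```python
import math

# Kernel estimate of a finite-population distribution function using a
# known auxiliary variable: F_hat(t) = mean over population units of
# G_hat(t | x_u), where G_hat is a Nadaraya-Watson smooth of 1{y_i <= t}
# over the sampled (x_i, y_i).  Toy data; bandwidth h is arbitrary.

def kernel_cdf_estimate(t, sample_x, sample_y, population_x, h=1.0):
    def G(t, x):                                  # smoothed conditional cdf
        w = [math.exp(-((x - xi) / h) ** 2 / 2) for xi in sample_x]
        num = sum(wi for wi, yi in zip(w, sample_y) if yi <= t)
        return num / sum(w)
    return sum(G(t, x) for x in population_x) / len(population_x)

sample_x = [1.0, 2.0, 3.0, 4.0]                   # sampled auxiliary values
sample_y = [1.1, 2.3, 2.9, 4.2]                   # sampled survey values
population_x = [0.5, 1.0, 1.5, 2.0, 2.5,          # known for ALL units
                3.0, 3.5, 4.0, 4.5, 5.0]

F2 = kernel_cdf_estimate(2.5, sample_x, sample_y, population_x)
F4 = kernel_cdf_estimate(4.5, sample_x, sample_y, population_x)
```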
47. Some tests for common principal component subspaces in several groups
- Author
-
James R. Schott
- Subjects
Statistics and Probability ,Pure mathematics ,education.field_of_study ,Covariance matrix ,Group (mathematics) ,Applied Mathematics ,General Mathematics ,Population ,Covariance ,Agricultural and Biological Sciences (miscellaneous) ,Linear subspace ,Statistics ,Principal component analysis ,Statistics, Probability and Uncertainty ,General Agricultural and Biological Sciences ,education ,Subspace topology ,Statistical hypothesis testing ,Mathematics - Abstract
SUMMARY An approximate test, based on sample eigenprojections, is obtained for testing the hypothesis that the subspaces spanned by the first m principal components of several different covariance matrices are identical. This procedure can fairly easily be extended to an analogous test of hypothesis concerning correlation matrices. Extensions to a robust principal components analysis are also considered. In recent years principal components analysis has been generalized from the typical single population setting to that of several populations. This paper focuses on an extension that may be useful in those situations in which one needs a variance preserving simultaneous reduction of dimensionality in several groups. This sort of reduction will be possible when the subspace spanned by the most important principal component vectors of a particular group is the same as that of any other group. Formally, suppose the same p variables are being measured on objects in g different groups with the covariance
- Published
- 1991
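The geometric quantity behind this test can be sketched with eigenprojections: the subspace spanned by the first m principal components of a covariance matrix is captured by P = V_m V_m', and two groups share that subspace exactly when their projections coincide. The distance below is illustrative, not the paper's test statistic:

```python
import numpy as np

# Compare the leading principal-component subspaces of two covariance
# matrices via their eigenprojections P = V_m V_m^T.  Zero distance means
# the first-m subspaces coincide even if the eigenvalues differ.

def leading_projection(S, m):
    vals, vecs = np.linalg.eigh(S)        # eigenvalues in ascending order
    Vm = vecs[:, -m:]                     # top-m eigenvectors
    return Vm @ Vm.T

def subspace_distance(S1, S2, m):
    return np.linalg.norm(leading_projection(S1, m)
                          - leading_projection(S2, m))

S1 = np.diag([5.0, 3.0, 1.0])
S2 = np.diag([4.0, 2.0, 0.5])             # different variances, same axes
d_same = subspace_distance(S1, S2, 2)     # leading subspaces identical

R = np.array([[0.0, 1.0, 0.0],            # permute axes: subspace changes
              [0.0, 0.0, 1.0],
              [1.0, 0.0, 0.0]])
d_diff = subspace_distance(S1, R @ S1 @ R.T, 2)
```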
48. Poststratification using regression estimators when information on strata means and sizes is missing
- Author
-
Abba M. Krieger and Danny Pfeffermann
- Subjects
Statistics and Probability ,education.field_of_study ,Applied Mathematics ,General Mathematics ,Population ,Frame (networking) ,Estimator ,Regression estimator ,Cross-sectional regression ,Agricultural and Biological Sciences (miscellaneous) ,Regression ,Variable (computer science) ,Statistics ,Econometrics ,Statistics, Probability and Uncertainty ,General Agricultural and Biological Sciences ,education ,Mathematics - Abstract
SUMMARY Poststratification is often used to increase the precision of survey estimators. When applied to regression estimators, it assumes that the regression relationships in the various strata are different. This gives rise to the use of the separate regression estimator. In this paper, we consider situations where the strata affiliation of the population units is not specified in the frame so that the separate regression estimator is not available, the strata means of the regressor variable and the strata sizes being unknown. We propose a new regression type estimator which accounts for the different regression relationships in the various strata, but no longer depends on the unknown strata means and sizes. We show that this estimator is more precise than estimators that ignore the differences in the regression relationships. The theoretical comparisons are illustrated by simulation results.
- Published
- 1991
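For context, the ordinary (combined) regression estimator that the separate, per-stratum version refines is ȳ_reg = ȳ + b(X̄ − x̄), where X̄ is the known population mean of the regressor. A sketch with toy numbers; this is the textbook estimator, not the paper's new one (which avoids the unknown strata means and sizes):

```python
# Ordinary regression estimator of a population mean using a regressor
# with known population mean X_bar.  Toy numbers, not the paper's method.

def regression_estimator(y, x, X_bar):
    n = len(y)
    y_bar = sum(y) / n
    x_bar = sum(x) / n
    b = (sum(xi * yi for xi, yi in zip(x, y)) - n * x_bar * y_bar) / \
        (sum(xi * xi for xi in x) - n * x_bar * x_bar)   # LS slope
    return y_bar + b * (X_bar - x_bar)

x = [2.0, 4.0, 6.0, 8.0]                  # regressor in the sample
y = [3.1, 5.2, 6.8, 9.1]                  # survey variable in the sample
est = regression_estimator(y, x, X_bar=6.0)   # known population mean of x
```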
49. A further note on sampling to locate rare defectives with strong prior evidence
- Author
-
Andrew P. Grieve
- Subjects
Statistics and Probability ,education.field_of_study ,Applied Mathematics ,General Mathematics ,Population ,Bayesian probability ,Sampling (statistics) ,Inference ,Strong prior ,Agricultural and Biological Sciences (miscellaneous) ,Wright ,Sample size determination ,Statistics ,Prior probability ,Econometrics ,Quantitative Biology::Populations and Evolution ,Statistics, Probability and Uncertainty ,General Agricultural and Biological Sciences ,education ,Mathematics - Abstract
In a recent paper on inference about the number of defectives in a finite population, Wright (1992) provides a Bayesian view on the sample size required to give high assurance that the probability of unobserved defectives is low. We reconsider Wright's approach and derive alternative formulae appropriate when there is considerable, but not infinite, prior information.
- Published
- 1994
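The Bayesian calculation underlying this line of work can be sketched directly: put a prior on the number of defectives D in the lot, use the hypergeometric probability of drawing a clean sample of size n, and compute the posterior probability that the lot is defect-free. The prior below is an arbitrary illustration, not Wright's or Grieve's specific choice:

```python
from math import comb

# Posterior probability that a lot of N items has no defectives, given a
# sample of n with none found.  Likelihood of a clean sample when D
# defectives are present: C(N-D, n) / C(N, n).  Illustrative prior only.

def posterior_no_defectives(N, n, prior):
    """prior: dict D -> prior probability; returns P(D = 0 | 0 found)."""
    like = {D: comb(N - D, n) / comb(N, n) if N - D >= n else 0.0
            for D in prior}
    norm = sum(prior[D] * like[D] for D in prior)
    return prior[0] * like[0] / norm

N = 100
prior = {0: 0.5, 1: 0.3, 2: 0.2}   # strong prior belief in few defectives
p_clean_small = posterior_no_defectives(N, n=10, prior=prior)
p_clean_large = posterior_no_defectives(N, n=50, prior=prior)
```

A larger clean sample raises the assurance, but with strong prior information much of the assurance comes from the prior itself, which is the point the note reconsiders.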
50. Determination of sample size for testing the relation between an incident and a set of random variables in a sample survey
- Author
-
Mark C. K. Yang
- Subjects
Statistics and Probability ,Linear function (calculus) ,education.field_of_study ,Covariance matrix ,Sample (material) ,Applied Mathematics ,General Mathematics ,Population ,Agricultural and Biological Sciences (miscellaneous) ,Dependence relation ,Sample size determination ,Statistics ,Fraction (mathematics) ,Statistics, Probability and Uncertainty ,education ,General Agricultural and Biological Sciences ,Random variable ,Mathematics - Abstract
SUMMARY It is often necessary to determine if, and to what extent, an incident A is related to a set of environmental random variables X = (X1, ..., Xk)'. In this paper we assume that in the general population, X is normally distributed with mean μ and covariance matrix Σ. A sample of X will be taken from the incident and nonincident groups, and, assuming that the incident rate p is approximately a linear function of X1, ..., Xk, we can determine the sample sizes for detecting the dependence between A and X without any prior knowledge of the unknowns μ and Σ. Let p1 be the average incident rate in the population and θ and p0 be two given numbers with p0
- Published
- 1978