| Title: | Maximum Likelihood and Markov Chain Monte Carlo (MCMC) Methods for Pedigree Reconstruction and Analysis |
| Version: | 2.59 |
| Depends: | coda, genetics, gtools, kinship2 |
| Imports: | methods, stats |
| Date: | 2026-03-13 |
| Maintainer: | Jarrod Hadfield <j.hadfield@ed.ac.uk> |
| Description: | The primary aim of 'MasterBayes' is to use Markov chain Monte Carlo (MCMC) techniques to integrate over uncertainty in pedigree configurations estimated from molecular markers and phenotypic data (Hadfield et al. (2006) <doi:10.1111/j.1365-294X.2006.03050.x>). Emphasis is put on the marginal distribution of parameters that relate the phenotypic data to the pedigree. All simulation is done in compiled 'C++' for efficiency. |
| License: | GPL-2 | GPL-3 [expanded from: GPL (≥ 2)] |
| NeedsCompilation: | yes |
| Packaged: | 2026-03-13 10:29:06 UTC; jhadfiel |
| Author: | Jarrod Hadfield [aut, cre], Andrew D. Martin [ctb, cph] (Original author of modified C++ Scythe code), Kevin M. Quinn [ctb, cph] (Original author of modified C++ Scythe code), Daniel Pemstein [ctb, cph] (Original author of modified C++ Scythe code) |
| Repository: | CRAN |
| Date/Publication: | 2026-03-19 13:20:19 UTC |
GdataPed Object
Description
An object containing genotype data and the categories over which error rates may vary.
Usage
GdataPed(G, id = NULL, categories = NULL, perlocus=FALSE, marker.type="MSW")
Arguments
G |
a list of |
id |
a vector of individual identifiers associated with each genotype, individuals can have more than one observed genotype. If |
categories |
an optional vector indicating subsets of genotypes that have different error rates. If |
perlocus |
if |
marker.type |
|
Value
a GdataPed object containing genotype information
Author(s)
Jarrod Hadfield j.hadfield@ed.ac.uk
References
Marshall, T. C. et al (1998) Molecular Ecology 7 5 639-655 Kalinowski S.T. et al (2007) Molecular Ecology 16 5 1099-1106 Hadfield J. D. et al (2009) in prep
See Also
Examples
data(WarblerG)
GdP<-GdataPed(WarblerG)
Markov chain Monte Carlo (MCMC) Methods for Pedigree Reconstruction and Analysis
Description
Markov chain Monte Carlo methods for estimating the joint posterior distribution of a pedigree and the parameters that predict its structure using genetic and non-genetic data. These parameters can be associated with covariates of fecundity such as a sexually selected trait or age, or can be associated with spatial or heritable traits that relate parents to specific offspring. Population size, allele frequencies, allelic dropout rates, and stochastic genotyping error rates can also be simultaneously estimated.
Usage
MCMCped(PdP=PdataPed(), GdP=GdataPed(), sP=startPed(), tP=tunePed(),
pP=priorPed(), mm.tol=999, nitt = 13000, thin = 10, burnin =
3000, write_postG = FALSE, write_postA=FALSE, write_postP =
"MARGINAL", checkP = FALSE, jointP = TRUE, DSapprox=FALSE, verbose=TRUE)
Arguments
PdP |
optional |
GdP |
optional |
sP |
optional |
tP |
optional |
pP |
optional |
mm.tol |
maximum number of mismatches tollerated |
nitt |
number of MCMC iterations |
thin |
thinning interval of the Markov chain |
burnin |
the number of initial iterations to be discarded |
write_postG |
if |
write_postA |
if |
write_postP |
if |
checkP |
if |
jointP |
if |
DSapprox |
if |
verbose |
if |
Value
beta |
an |
USdam |
an |
USsire |
an |
E1 |
an |
E2 |
an |
G |
list of marginal distributions of true genotypes at each locus |
A |
list of |
P |
either samples from the posterior distribution of the pedigree, or the marginal distribution of parents |
Author(s)
Jarrod Hadfield j.hadfield@ed.ac.uk
References
Hadfield J.D. et al (2006) Molecular Ecology 15 3715-31
See Also
Examples
data(WarblerP)
data(WarblerG)
GdP<-GdataPed(WarblerG)
var1<-expression(varPed(c("lat", "long"), gender="Male",
relational="OFFSPRING"))
# paternity is to be modelled as a function of distance
# between offspring and male territories
res1<-expression(varPed("offspring", restrict=0))
# indivdiuals from the offspring generation are excluded as parents
res2<-expression(varPed("terr", gender="Female", relational="OFFSPRING",
restrict="=="))
# mothers not from the offspring territory are excluded
PdP<-PdataPed(formula=list(var1,res1,res2), data=WarblerP, USsire=FALSE)
tP<-tunePed(beta=30)
model1<-MCMCped(PdP=PdP, GdP=GdP, tP=tP, nitt=300, thin=1, burnin=0)
plot(model1$beta)
Maximum Likelihood Estimation of Beta
Description
Finds Maximum Likelihood estimate (MLE) for beta given a pedigree, via a call to optim. Beta is the paramater vector of a multinomial log-linear model.
Usage
MLE.beta(X.list, ped, beta=NULL, nUSdam=NULL, nUSsire=NULL, shrink=NULL)
Arguments
X.list |
list of design matrices for each offspring derived using |
ped |
pedigree with id, dam and sire in ech column |
beta |
optional starting vector for beta |
nUSdam |
optional number of unsampled females. Only required if unsampled females have known phenotype. |
nUSsire |
optional number of unsampled males. Only required if unsampled males have known phenotype. |
shrink |
optional scalar for the variance defining the ridge-regression likelihood penalisation. |
Value
beta |
vector of MLE's for beta |
C |
large sample variance-covariance matrix of beta MLE's |
Author(s)
Jarrod Hadfield j.hadfield@ed.ac.uk
References
Hadfield J.D. et al (2006) Molecular Ecology 15 3715-31 Smouse P.E. et al (1999) Journal of Evolutionary Biology 12 1069-1077
See Also
Examples
data(WarblerP)
data(WarblerG)
GdP<-GdataPed(WarblerG)
res1<-expression(varPed("offspring", restrict=0))
var1<-expression(varPed(c("lat", "long"), gender="Male",
relational="OFFSPRING"))
res2<-expression(varPed("terr", gender="Female", relational="OFFSPRING",
restrict="=="))
PdP<-PdataPed(formula=list(var1,res1,res2), data=WarblerP, USsire=FALSE)
X.list<-getXlist(PdP=PdP, GdP=GdP, E2=0.005)
ped<-MLE.ped(X.list)$P
beta<-MLE.beta(X.list, ped)
beta
Maximum Likelihood Estimation of the Pedigree
Description
Finds the Maximum Likelihood estimate (MLE) of the pedigree using the genetic data only. An approximation is used for genotyping error.
Usage
MLE.ped(X.list, ped=NULL, USdam=FALSE, nUSdam=NULL, USsire=FALSE,
nUSsire=NULL, threshold=0, checkP)
Arguments
X.list |
list of design matrices for each offspring derived using |
ped |
optional pedigree with id, dam and sire in ech column |
USdam |
logical or character; if |
nUSdam |
numeric vector for number of unsampled females |
USsire |
logical or character; if |
nUSsire |
numeric vector for number of unsampled males |
threshold |
threshold probability under which ML parents are replaced by NA |
checkP |
if |
Details
ML estimation of the pedigree is based on the Mendelian transition probabilities in the presence of genotyping error as outlined in Kalinwoski (2006). The probability that the ML parents are the true parents is simply the Mendelian transition probability for those parents divided by the sum of the transition probabilities for the remaining potential parents, both sampled and unsampled. If ped exists and the dam column contains known dam assignemnts and the sire column contains only NA's, then the ML sires will be returned conditional on the dam assignements being true. ML dam estimation with known sires can be performed in the same way. Individuals whose parents cannot be assigned with the required level of certainty (threshold), or whose parents belong to the base or unsampled population, have NA in the dam and sire columns. If each indiviual's potential parents are such that an illegal pedigree could be sampled then checkP=TRUE can be used to ensure legality. This is recommended if the pedigree is to be passed as a starting pedigree to MCMCped. It should be noted that under these circumstances it is possible that multiple pedigrees max exist with the same likelihood and this may not be obvious from the MLE.ped output since assignments are made conditional on earlier assignement being true. As an example, if there are two indiviuals both of which could potentially be each others parents then assigning both to be each others parent is illegal (since each indiviual would be its own grandparent). In simple situations, the parent-offspring and offspring-parent assignements have equal probability, but when checkP=TRUE the first indiviual would have zero probability of being the second individual's parent if the second individual was already assigned as the first individual's parent.
Value
P |
pedigree with id in the first column, and dam and sire in the second and third columns |
prob |
probability of the most likely parental combination |
Author(s)
Jarrod Hadfield j.hadfield@ed.ac.uk
References
Hadfield J.D. et al (2006) Molecular Ecology 15 3715-31 Marshall J.D. et al (1998) Molecular Ecology 7 639-655 Kalinowski S.T. et al, Molecular Ecology in press
See Also
Examples
data(WarblerP)
data(WarblerG)
GdP<-GdataPed(WarblerG)
res1<-expression(varPed("offspring", restrict=0))
res2<-expression(varPed("terr", gender="Female", relational="OFFSPRING",
restrict="=="))
PdP<-PdataPed(formula=list(res1,res2), data=WarblerP, USsire=TRUE)
X.list<-getXlist(PdP=PdP, GdP=GdP, E2=0.005)
ped<-MLE.ped(X.list, USsire=TRUE, nUSsire=10, threshold=0.75)
Maximum Likelihood Estimation of the Unsampled Population Size
Description
Finds the Maximum Likelihood estimate (MLE) for the number of unsampled males and/or females following Nielsen et al. (2001). The size of the unsampled population can vary over time and space, and genotyping error is accomodated using the CERVUS model of genotyping error (Kalinwoski et al. 2006).
Usage
MLE.popsize(X.list, USdam=FALSE, USsire=FALSE, nUS=NULL,
ped=NULL, shrink=NULL)
Arguments
X.list |
list of design matrices for each offspring derived using |
USdam |
logical or character; if |
USsire |
logical or character; if |
nUS |
optional starting vector for the size of the unsampled population. Parmeters for the unsampled female population come before the male population. |
ped |
optional pedigree with id, dam and sire in ech column |
shrink |
optional scalar for the variance defining the ridge-regression likelihood penalisation. |
Value
nUS |
vector of MLE's for the size of the unsampled population. Lower bound is 1e-5 for numerical stability. |
C |
large sample variance-covariance matrix of |
Note
Nielsen's original model does not account for genotyping error, and estimation of the unsampled population size is VERY sensitive to the level of genotyping error. This function implements a commonly used approxiamtion for genotyping error that ignores pedigree information. For many problems this approximation seems valid, but appears to break down when estimating the size of the unsampled population size. Bayesian estimation of the unsampled population size (see MCMCped) that uses an exact solution for genotyping error is more robust.
Author(s)
Jarrod Hadfield j.hadfield@ed.ac.uk
References
Nielsen. R. et.al Genetics (2001) 157 4 1673-1682
See Also
Examples
data(WarblerP)
data(WarblerG)
GdP<-GdataPed(WarblerG)
res1<-expression(varPed("offspring", restrict=0))
PdP<-PdataPed(formula=list(res1), data=WarblerP, USsire=TRUE, USdam=TRUE)
X.list<-getXlist(PdP=PdP, GdP=GdP, E2=0.02)
nUS<-MLE.popsize(X.list, USsire=TRUE, USdam=TRUE)
nUS
Maximum Likelihood (ML) and Markov chain Monte Carlo (MCMC) methods for Pedigree Reconstruction, Analysis and Simulation.
Description
The primary aim of MasterBayes is to use MCMC techniques to integrate over uncertainty in pedigree configurations estimated from molecular markers and phenotypic data. Emphasis is put on the marginal distribution of parameters that relate the phenotypic data to the pedigree. All simulation is done in compiled C++ using the Scythe Statistical Library. More detailed information can be found in vignette("Tutorial", "MasterBayes").
Details
The motivation behind the package is to approximate the following probability distribution using Markov chain Monte Carlo techniques:
p(\bf{\beta} | \bf{G}, \bf{y})
where \bf{\beta} is the vector of parameters of primary interest, \bf{G} are the genetic data and \bf{y} are phenotypic data. Generally, it is not possible to simulate from the posterior distribution of \bf{\beta} when the problem is in this form and so I augment the parameter space with the pedigree, \bf{P}:
\int_{\bf{P}} p(\bf{\beta}, \bf{P} | \bf{G}, \bf{y})d\bf{P}
This simplifies the problem because the likelihood can be expressed more simply:
L(\bf{G}, \bf{y} | \bf{\beta}, \bf{P}) = L(\bf{G} | \bf{P})L(\bf{y} | \bf{P}, \bf{\beta})
This simplification rests on the assumption that the genetic and non-genetic data are independent after conditioning on the pedigree. This will generally be true when markers are not linked to QTL's. The first likelihood, L(\bf{G} | \bf{P}), is easily calculated for arbitrary pedigrees using the Elston-Stewart algorithm (Elston, 1971), and is based around the Mendelian transition probability. The second likelihood is obtained by fitting the multinomial log-linear model:
L(\bf{y} | \bf{\beta}, \bf{P}) \propto p(\bf{P} | \bf{\beta}, \bf{y})p(\bf{P}).
Assuming that the set of possible pedigrees have equal prior probability, and that offspring are independently distributed after conditioning on the predictor variables:
L(\bf{y} | \bf{\beta}, \bf{P}) \propto \prod^{n_{o}}_{i=1} \frac{e^{\bf{X}^{i}_{p_{i}}\bf{\beta}}}{\sum^{n_{p}}_{j=1}e^{\bf{X}^{i}_{j}\bf{\beta}}}.
where \bf{X}^{i}_{j} denotes the j^{th} row of offspring i's design matrix formed from the phenotypic data \bf{y}. Each row of the design matrix corresponds to a parental combination. n_{o} and n_{p} denote the number of offspring and the number of potential parental combinations, respectively. p_{i} denotes the actual parents of individual i (Smouse, 1999).
This likelihood is evaluated over the probability distribution of the pedigree, \bf P:
p(\bf{P} | \bf{G}, \bf{y}, \bf{\beta}).
Most other techniques approximate this distribution as p(\bf{P} | \bf{G}), and even then tend to use the mode rather than the complete distribution, leading to inferential problems (See the information boxes in Hadfield et al. 2006).
Unfortunately, genotype data are rarely observed with out error and the parents of some offspring may not be sampled. If the genetic markers are co-dominant then genotyping errors can be handled following either the model of Wang (2004) or CERVUS (Marshall 1998). When the markers are dominant I model the probabilities of a dominant allele being miss-scored as a recessive and vice versa. Denoting the parameters associated with these two forms of genotyping error as \varepsilon_{1} and \varepsilon_{2}, and the vector of parental allele frequencies as \omega, two solutions are implemented.
An exact solution:
\int_{\bf{P}} \int_{\bf{G}} \int_{\varepsilon_{1}} \int_{\varepsilon_{2}} \int_{\omega} p(\bf{\beta}, \bf{P}, \bf{G}, \varepsilon_{1}, \varepsilon_{2}. \omega | \bf{G}^{(obs)}, \bf{y})d\bf{P}d\bf{G}d\varepsilon_{1}d\varepsilon_{2}d\omega
where the posterior probability distribution of the error rates, the allele frequencies and the true unobserved genotypes, \bf{G}, are estimated and integrated out. The conditional distribution of the true genotypes in the exact form is given by:
p(\bf{G}^{obs} | \bf{G}, \varepsilon_{1}, \varepsilon_{2})p(\bf{G} | \bf{P}, \omega)
The second solution is an approximation to the above equation, and uses point estimates for \omega, \varepsilon_{1} and \varepsilon_{2}. The conditional distribution of \bf{G} is derived ignoring the information present in \bf{P}:
p(\bf{G}^{obs} | \bf{G}, \varepsilon_{1}, \varepsilon_{2})p(\bf{G} | \omega)
The approximation can be derived analytically, whereas the exact solution requires the Markov chain to be augmented with the true genotypes of all individuals. This becomes very computer intensive but the approximation breaks down for dominant markers, or models in which the number of unsampled males and/or females is to be estimated. Unsampled parents are dealt with, and their number estimated using an approximation originally due to Nielsen (2001). An exact solution to the problem has been proposed by Emery et.al. (2001) but becomes impractical as the number of unsampled parents gets large. Nielsen's approximation is based around the Mendelian transition probability when a parental genotype is unknown. This probability is derived using estimates of the allele frequencies at that locus and the assumption of Hardy-Weinberg equilibrium.
I deal with the fact that unsampled individuals have missing phenotype data by approximating the distribution of the sum of linear predictors across unsampled parents. This approximation relies on the assumption that the unsampled individuals come from the same statistical population as sampled individuals, and that population sizes are large enough so that the distribution for the sum tends to a normal distribution under the central limit theorem.
Taking n and N as the number of sampled individuals, and the total number of individuals in the population respectively:
p(\sum^{N-n}\bf{\hat{p}}^{(miss)} | \bf{\hat{p}}^{(obs)}) \approx N(\frac{N-n}{n}\sum^{n}\bf{\hat{p}}^{(obs)}, \frac{N(N-n)}{n}S_{obs}^{2})
where \bf{\hat{p}} are vectors of linear predictors for the unsampled ^{(miss)} and sampled ^{(obs)} individuals, respectively (Gelman et al., 2004). S_{obs}^{2} is the sample variance of the observed linear predictors.
Author(s)
Jarrod Hadfield j.hadfield@ed.ac.uk
References
Elston, R. C. and Stewart, J. Human Heredity (1971) 21 523-542 Emery, A. M. et.al Molecular Ecology (2001) 10 1265-1278 Gelman, A. et.al Bayesian Data Analysis Edition II (2004) Chapman and Hall Hadfield J.D. et al (2006) Molecular Ecology 15 3715-31 Marshall, T. C. et al (1998) Molecular Ecology 7 5 639-655 Nielsen. R. et.al Genetics (2001) 157 4 1673-1682 Smouse P.E. et al (1999) Journal of Evolutionary Biology 12 1069-1077 Wang J.L. Genetics (2004) 166 4 1963-1979
See Also
PdataPed Object
Description
PdataPed creates an object of class PdataPed, which typically contains the phenotype data to be passed to MCMCped and the formula that defines the model to be fitted.
is.PdataPed returns TRUE if x is of class PdataPed
Usage
PdataPed(formula, data=NULL, id=data$id, sex=data$sex,
offspring=data$offspring, timevar=data$timevar,
USdam=FALSE, USsire=FALSE)
Arguments
formula |
list of model predictors of the form |
data |
data frame containing the predictor variables |
id |
vector of individual identifiers. If not specified, |
sex |
vector of individual sexes (either 'Male' or 'Female' or |
offspring |
binary vector indicating whether records belong to offspring (1) or not (0) |
timevar |
an optional vector indicating cohorts for multigenerational pedigree reconstruction |
USdam |
logical or character; if |
USsire |
logical or character; if |
Details
If the number of unsampled individuals varies over subpopulations, and the parentage of an offspring is not restricted to ceratin subpopulations then the parameters will not be idenifiable. This can be resolved by using an informative prior (see priorPed) for the number of unsampled individuals in each sub-population, or using the restrict argument in varPed.
Value
list containing the arguments passed
Author(s)
Jarrod Hadfield j.hadfield@ed.ac.uk
See Also
Examples
id<-1:20
sex<-sample(c("Male", "Female"),20, replace=TRUE)
offspring<-c(rep(0,18),1,1)
lat<-rnorm(20)
long<-rnorm(20)
mating_type<-gl(2,10, label=c("+", "-"))
test.data<-data.frame(id, offspring, lat, long, mating_type, sex)
res1<-expression(varPed("offspring", restrict=0))
var1<-expression(varPed(c("lat", "long"), gender="Male",
relational="OFFSPRING"))
var2<-expression(varPed(c("mating_type"), gender="Female",
relational="MATE"))
var3<-expression(varPed("mating_type", gender="Male"))
PdP<-PdataPed(formula=list(res1, var1, var2, var3), data=test.data)
Seychelles Warbler Genotypes
Description
Genetype data collected by David Richardson from Cousin Island in 1999.
Format
a data frame with 307 rows and 29 columns. The first column are the unique idenitifiers for each bird, and the following columns are genotype data. Adjacent columns beolng to the same locus.
Source
Richardson D.S.
References
Richardson et.al. (2001) Molecular Ecology 10 2263-2273 Hadfield J.D. et al (2006) Molecular Ecology 15 3715-31
Seychelles Warbler Phenotypes
Description
Phenotypic data collected by David Richardson from Cousin Island in 1999. The data are almost a complete sample of those birds that existed in the population at that time.
Format
a table with 307 rows and 7 columns. The columns, from left to right are: 1) a unique identifier for each bird; 2) a binary variable inbdicating whether the record belongs to an offspring; 3) the sex of each bird; 4) the territory on which the bird was recorded; 5 and 6) the latitude and longitude of that territory; 7) the behavioural status of each bird (Dominant or Subordinate)
Source
Richardson D.S.
References
Richardson et.al. (2001) Molecular Ecology 10 2263-2273 Hadfield J.D. et al (2006) Molecular Ecology 15 3715-31
Autocorrelation Function for Parenatge Assignment
Description
Function for assessing mixing of the Markov chain with respect to parentage assignment.
Usage
autocorrP(postP)
Arguments
postP |
JOINT posterior distribution of parentage |
Details
For each offspring the proportion of transitions is calculated at lags 1, 2, 5, 10, 50 and 100 (i.e. the proportion of times that the parentage assignment at time t is different from the parentage assignment at time t+lag). The difference between these proportions and the proportion at lag 1 is then calculated, and the mean over offspring given. When the parentage assignments in successive Markoc chain Monte Carlo (MCMC) iterations are independent these autocorrelation metrics should be randomly distributed about zero and should not decrease with increasing lag.
Value
matrix
Author(s)
Jarrod Hadfield j.hadfield@ed.ac.uk
See Also
Examples
data(WarblerP)
data(WarblerG)
GdP<-GdataPed(WarblerG)
var1<-expression(varPed(c("lat", "long"), gender="Male",
relational="OFFSPRING"))
# paternity is to be modelled as a function of distance
# between offspring and male territories
res1<-expression(varPed("offspring", restrict=0))
# individuals from the offspring generation are excluded as parents
res2<-expression(varPed("terr", gender="Female", relational="OFFSPRING",
restrict="=="))
# mothers not from the offspring territory are excluded
PdP<-PdataPed(formula=list(var1,res1,res2), data=WarblerP, USsire=FALSE)
tP<-tunePed(beta=30)
model1<-MCMCped(PdP=PdP, GdP=GdP, tP=tP, nitt=3000, thin=1, burnin=0, write_postP="JOINT")
autocorrP(model1$P)
Log-Likelihood of Beta
Description
Log-likelihood of beta given a pedigree and phenotypic data. Beta is the parameter vector for the multinomial log-linear model. Intended to be used within the function MLE.beta
Usage
beta.loglik(X, dam_pos=NULL, sire_pos=NULL, par_pos=NULL, beta=NULL,
beta_map=NULL, merge=NULL, mergeN=NULL, nUS=c(0,0), shrink=NULL)
Arguments
X |
list of design matrices for each offspring. Each element should either have dam (D) and/or sire (S) matrices, or a composite Dam/Sire (DS) matrix. See |
dam_pos |
position of each offspring's mother in the dam design matrix |
sire_pos |
position of each offspring's mother in the sire design matrix |
par_pos |
position of each offspring's parents in the composite dam/sire matrix |
beta |
parameter vector |
beta_map |
vector that maps |
merge |
optional vector that indicates columns of for which the parameter is transformed using the argument |
mergeN |
optional list of matrices for each offspring the columns of which refer to merged variables and the rows to the number of individuals that fall into each category defined by |
nUS |
vector of the number of unsampled females and males, respectively. Only required if unsampled individuals have known phenotype. |
shrink |
optional scalar for the variance defining the ridge-regression likelihood penalisation. |
Value
log-likelihood of beta given the pedigree and X.
Note
Intended to be used within MLE.beta
Author(s)
Jarrod Hadfield j.hadfield@ed.ac.uk
References
Hadfield J.D. et al (2006) Molecular Ecology 15 3715-31 Smouse P.E. et al (1999) Journal of Evolutionary Biology 12 1069-1077
See Also
MLE.beta, MCMCped, varPed, getXlist
Examples
data(WarblerP)
data(WarblerG)
GdP<-GdataPed(WarblerG)
res1<-expression(varPed("offspring", relational=FALSE, restrict=0))
var1<-expression(varPed(c("lat", "long"), gender="Male",
relational="OFFSPRING"))
res2<-expression(varPed("terr", gender="Female", relational="OFFSPRING",
restrict="=="))
PdP<-PdataPed(formula=list(var1,res1,res2), data=WarblerP)
# probability of paternity is modelled as a function of distance
X.list<-getXlist(PdP=PdP, GdP=GdP)
ped<-MLE.ped(X.list)$P
# get ML pedigree from genetic data alone
X<-lapply(X.list$X, function(x){list(S=x$XSs)})
# Extract Design matrices for Sires
sire_pos<-match(ped[,3][as.numeric(names(X))], X.list$id)
sire_pos<-mapply(function(x,y){match(x, y$sire.id)}, sire_pos, X.list$X)
# row number of each design matrix corresponding to the ML sire.
beta<-seq(-0.065,-0.0325, length=100)
beta_Loglik<-1:100
for(i in 1:100){
beta_Loglik[i]<-beta.loglik(X, sire_pos=sire_pos, beta=beta[i],
beta_map=X.list$beta_map)
}
plot(beta_Loglik~beta, type="l", main="Profile Log-likelihood for beta")
Obtains a consensus genotype from duplicate samples
Description
A function for obtaining a consensus genotype from duplicate samples. The amount of missing data is minimised, and preference is given to samples with lower genotyping error
Usage
consensusG(GdP, cat.levels=NULL, gmax=FALSE, het=FALSE)
Arguments
GdP |
a |
cat.levels |
order of genotyping error rate categories, with most reliable category first |
gmax |
logical; if a most represented genotype exists should it be saved |
het |
logical; should heterozygotes be saved over homozygotes - overrides |
Value
GdP |
a |
Author(s)
Jarrod Hadfield j.hadfield@ed.ac.uk
See Also
Allele Frequencies
Description
extracts allele frequencies from genotype data
Usage
extractA(G, marker.type="MSW")
Arguments
G |
data frame or list of |
marker.type |
|
Value
list of allele frequnecies at each loci
Author(s)
Jarrod Hadfield j.hadfield@ed.ac.uk
See Also
genotype.list, genotype
Examples
data(WarblerG)
A<-extractA(WarblerG)
A[[1]]
Mendelian Transition Probabilities
Description
This function is primarily intended for use within getXlist, and fills in the design matrices of the model with the genetic likelihoods.
Usage
fillX.G(X.list, A, G, E1=0.005, E2=0.005, marker.type="MSW")
Arguments
X.list |
list of design matrices for each offspring derived using |
A |
list of allele frequencies |
G |
list of genotype objects; rows must correspond to individuals in the vector |
E1 |
if Wang's (2004) model of genotyping error for co-dominant markers is used this is the probability of an allele dropping out. If CERVUS's (Kalinowski, 2006; Marshall, 1998) model of genotyping error for co-dominant markers is used this parameter is not used. If Hadfield's (2009) model of genotyping error for dominant markers is used this is the probability of a dominant allele being scored as a recessive allele. |
E2 |
if Wang's (2004) or CERVUS's (Kalinowski, 2006; Marshall, 1998) model of genotyping error for co-dominant markers are used this is the probability of an allele being miss-scored. In the CERVUS model errors are not independent for the two alleles within a genotype and so if a genotyping error has occurred at one allele then a genotyping error occurs at the other allele with probability one. Accordingly, |
marker.type |
|
Value
list of design matrices of the form X.list containing genetic likelihoods for each offspring.
Note
If a GdataPed object is passed to getXlist then the genetic likelihoods will be calculated by default.
Author(s)
Jarrod Hadfield j.hadfield@ed.ac.uk
References
Marshall, T. C. et al (1998) Molecular Ecology 7 5 639-655 Kalinowski S.T. et al (2007) Molecular Ecology 16 5 1099-1106 Hadfield J. D. et al (2009) in prep
See Also
Examples
data(WarblerG)
A<-extractA(WarblerG)
ped<-matrix(NA, 5,3)
ped[,1]<-1:5
ped[,2]<-c(rep(NA, 4), 1)
ped[,3]<-c(rep(NA, 4), 2)
genotypes<-simgenotypes(A, ped=ped)
sex<-c("Female", "Male", "Female", "Male","Female")
offspring<-c(0,0,0,0,1)
data<-data.frame(id=ped[,1], sex, offspring)
res1<-expression(varPed(x="offspring", restrict=0))
PdP<-PdataPed(formula=list(res1), data=data)
GdP<-GdataPed(G=genotypes$Gobs, id=genotypes$id)
X.list<-getXlist(PdP)
# creates design matrices for offspring (in this case indivdiual "5")
X.list.G<-fillX.G(X.list, A=A, G=genotypes$Gobs, E2=0.005)
# genetic likelihoods are arranged sires within dams
X.list.G$X$"5"$dam.id
X.list.G$X$"5"$sire.id
# so for this example we have parental combinations
# ("1","2"), ("1","4"), ("3","2"), ("2","4"):
X.list.G$X$"5"$G
# The true parents have the highest likelihood in this case
Genotype Objects for all Loci
Description
Creates a list of genotype objects from a matrix or data.frame of multilocus genotypes.
Usage
genotype.list(G, marker.type="MSW")
Arguments
G |
matrix or data.frame of multilocus genotypes with individuals down the rows and loci across columns. Adjacent columns are taken to be the same locus |
marker.type |
|
Value
list of genotype objects for all loci
Author(s)
Jarrod Hadfield j.hadfield@ed.ac.uk
References
Marshall, T. C. et al (1998) Molecular Ecology 7 5 639-655 Kalinowski S.T. et al (2007) Molecular Ecology 16 5 1099-1106 Hadfield J. D. et al (2009) in prep
See Also
genotype
Examples
data(WarblerG)
G<-genotype.list(WarblerG[,-1])
summary(G[[1]])
genotypeD Object
Description
Extends the genotype class for dominant marker data
Usage
genotypeD(a1, locus=NULL)
Arguments
a1 |
vector of scored genotypes (0 or 1) for dominant markers |
locus |
object of class locus, gene, or marker, holding information about the source of this genotype. |
Value
the genotypeD class extends the genotype class.
Author(s)
Jarrod Hadfield j.hadfield@ed.ac.uk
See Also
genotype, summary.genotypeD
Examples
l1<-rbinom(100,1,0.5)
l1<-genotypeD(l1)
Design Matrices for the Multinomial Log-Linear Model
Description
Forms design matrices for each offspring, and stores other relevant information.
Usage
getXlist(PdP, GdP=NULL, A=NULL, E1=0.005, E2=0.005, mm.tol=999)
Arguments
PdP |
|
GdP |
optional |
A |
optional list of allele frequencies. If not specified and |
E1 |
if Wang's (2004) model of genotyping error for co-dominant markers is used this is the probability of an allele dropping out. If CERVUS's (Kalinowski, 2006; Marshall, 1998) model of genotyping error for co-dominant markers is used this parameter is not used. If Hadfield's (2009) model of genotyping error for dominant markers is used this is the probability of a dominant allele being scored as a recessive allele. |
E2 |
if Wang's (2004) or CERVUS's (Kalinowski, 2006; Marshall, 1998) model of genotyping error for co-dominant markers are used this is the probability of an allele being miss-scored. In the CERVUS model errors are not independent for the two alleles within a genotype and so if a genotyping error has occurred at one allele then a genotyping error occurs at the other allele with probability one. Accordingly, |
mm.tol |
maximum number of genotype mismatches tolerated for potential parents |
Details
This is the main R routine for setting up design matrices for the various models that may be defined in the formula argument of PdataPed. If a GdataPed object is passed to getXlist design matrices of genetic likelihoods are calculated (see fillX.G), and the number of mismatches between offspring and parental genotypes are stored (see mismatches). mm.tol specifies the maximum number of mismatches that are tolerated between an offspring and a parent. Parents that exceed this number of mismatches are excluded, and the design matrices for non-excluded parents are reordered by the number of mismatches. This increases the efficiency of sampling from the multinomial distribution of parents, because high probability parents appear first.
Value
id |
vector of unique identifiers taken from |
beta_map |
index relating the vector of unique parameters to the columns of the design matrices |
X |
list of design matrices and other information. |
Note
Each element of X refers to an offspring (names(X)) and contains vectors for the set of potential parents (restdam.id and restsire.id) of each offspring. Also included are the set of individuals that may have been parents but have been excluded for certain reasons (dam.id and sire.id). Exclusion may have been based on the number of genotype mismatches, or it may have been on biological grounds (See the keep argument of varPed). Parental id's are stored as integers which correspond to the actual id's stored in id. Parental id's greater than the length of id refer to unsampled parents.
Six types of design matrix are used (XDus, XDs, XSus, XSs, XDSus, XDSs). XD.. are the design matrices for dams, and XS.. are the design matrices for sires. The rows of each design matrix are associated with individuals in dam.id and sire.id, respectively. When interactions between dam and sire variables are modelled, or a varPed variable is created using the argument relational="MATE", the design matrices vary over parental combinations. XDS.. are the design matrices for parental combinations with sire's varying the fastest. Each of these three types of design matrix have two subclasses: s and us. s are design matrices which are fully observed, either because unsampled parents do not exist or because unsampled parents have known phenotypes (see argument USvar in varPed). us are for design matrices where the phenotypes of unsampled parents are unknown. The matrices XDus and Xsus have a row of NA's which correspond to the unsampled parent category. The design matrix XDSus will typically have many rows of NA's because each sampled parent may be paired to an unsampled individual.
When the argument gender=NULL is passed to varPed the respective columns in the dam and sire design matrices are associated with a single parameter. Because of this the number of parameters to be estimated may be less than the total number of columns in the 6 design matrices. beta_map relates a parameter vector to the columns of the design matrices. The columns of the design matrices are numbered in the order they are introduced in the preceding paragraph (i.e XDus through to XDSs). The parameter vector is ordered identically except parameters associated with genderless variables are omitted for males. par_order is similar to beta_map but relates the order of the parameters specified in the formula argument to PdataPed to the respective columns of the design matrices.
If the argument relational="OFFSPRING" is specified in varPed, or the set of potential parents varies over offspring, the design matrices will vary across offspring. For this reason I create a design matrix for each offspring irrespective of whether the matrices vary or not. The design matrices for the genetic likelihoods will always vary over offspring.
Author(s)
Jarrod Hadfield j.hadfield@ed.ac.uk
References
Hadfield J.D. et al (2006) Molecular Ecology 15 3715-31 Kalinowski S.T. et al (2006) Molecular Ecology in press Hadfield J. D. et al (2007) in prep
See Also
Examples
id<-1:20
sex<-sample(c("Male", "Female"),20, replace=TRUE)
offspring<-c(rep(0,18),1,1)
lat<-rnorm(20)
long<-rnorm(20)
mating_type<-gl(2,10, label=c("+", "-"))
test.data<-data.frame(id, offspring, lat, long, mating_type, sex)
res1<-expression(varPed("offspring", restrict=0))
var1<-expression(varPed(c("lat", "long"), gender="Male",
relational="OFFSPRING"))
var2<-expression(varPed(c("mating_type"), gender="Female",
relational="MATE"))
var3<-expression(varPed("mating_type", gender="Male"))
PdP<-PdataPed(formula=list(res1, var1, var2, var3), data=test.data)
X.list<-getXlist(PdP)
X.list$X$"19"$XSs
# For the first offspring we have the design matrix for sires
# The first column represents the distance between each male
# and each offspring. The second column indicates the male's
# mating type. Note that contrasts are set up with the first
# male so the indicator variables may be negative.
matrix(X.list$X$"19"$XDSs, ncol=length(X.list$X$"19"$dam.id),
nrow=length(X.list$X$"19"$sire.id))
# incidence matrix indicating whether Females (columns) and Males (rows)
# are the same mating type. Again this is a contrast with the first
# parental combination (which is +/+) so 0 actually represents parents
# with the same mating type.
Inserts Founders into a Pedigree
Description
Inserts Founders into a Pedigree
Usage
insertPed(ped, founders=NULL)
Arguments
ped |
pedigree with id, dam and sire in ech column |
founders |
optional vector of founder id's. If not specified, then parents without their own pedigree row are inserted |
Value
a pedigree pedigree with id, dam and sire in each column
Author(s)
Jarrod Hadfield j.hadfield@ed.ac.uk
See Also
Examples
pedigree<-matrix(NA, 7,3)
pedigree[,1]<-2:8
pedigree[,2][4:7]<-c(1,1,2,2)
pedigree[,3][4:7]<-c(3,3,4,4)
pedigree2<-insertPed(pedigree)
Legal Genotype Configurations
Description
A function for checking whether a set of genotypes have a positive probability given the pedigree. If not, a legal configuration is found using heuristic methods. Missing genotypes are also replaced with compatible genotypes.
Usage
legalG(G, A, ped, time_born=NULL, marker.type="MSW")
Arguments
G |
list of |
A |
list of allele frequencies |
ped |
pedigree with id in the first column, dam in the second, and sire in the third. The genotypes must be in the same order as the id column |
time_born |
an optional vector for ordering a pedigree more efficiently (see |
marker.type |
|
Value
G |
a list of |
legal |
logical; TRUE if the the genotype configuration passed to |
Author(s)
Jarrod Hadfield j.hadfield@ed.ac.uk
References
Marshall, T. C. et al (1998) Molecular Ecology 7 5 639-655 Kalinowski S.T. et al (2007) Molecular Ecology 16 5 1099-1106 Hadfield J. D. et al (2009) in prep
See Also
Examples
data(WarblerG)
A<-extractA(WarblerG[,16:17])
pedigree<-matrix(NA, 8,3)
pedigree[,1]<-1:8
pedigree[,2][5:8]<-c(1,1,2,2)
pedigree[,3][5:8]<-c(3,3,4,4)
G<-simgenotypes(A, E1=0, E2=0.3, ped=pedigree, no_dup=1)
newG<-legalG(G=G$Gobs,A=A,ped=pedigree)
newG$valid
# The input genotypes had a zero probability given the pedigree
# (because of genotype error) but the output genotypes have
# positive probability
legalG(newG$G,A,pedigree)$valid
Parent-Offspring Genotype Mismatches
Description
Calculates the number of mismatches between parental and offspring genotypes, assuming the genotypes of spouses are unknown. Primarily intended to be used inside the function getXlist where potential parents can be excluded based on the number of mismatches. Dominant markers do not produce mismatches.
Usage
mismatches(X.list, G, mm.tol=999)
Arguments
X.list |
list of design matrices for each offspring derived using |
G |
list of genotype objects, the rows of which must refer to the id vector |
mm.tol |
maximum number of mismatches that are tolerated before exclusion |
Value
list of design matrices of the form X.list, but containing the number of mismatches between parents and offspring. Potential parents that exceed the number of mismatches specified by mm.tol are removed from the vectors of potential parents: restdam.id and restsire.id.
Note
If a GdataPed object is passed to getXlist then the number of mismatches will be calculated by default.
Author(s)
Jarrod Hadfield j.hadfield@ed.ac.uk
See Also
Examples
data(WarblerG)
A<-extractA(WarblerG)
ped<-matrix(NA, 5,3)
ped[,1]<-1:5
ped[,2]<-c(rep(NA, 4), 1)
ped[,3]<-c(rep(NA, 4), 2)
genotypes<-simgenotypes(A, ped=ped)
sex<-c("Female", "Male", "Female", "Male","Female")
offspring<-c(0,0,0,0,1)
data<-data.frame(id=ped[,1], sex, offspring)
res1<-expression(varPed(x="offspring", restrict=0))
PdP<-PdataPed(formula=list(res1), data=data)
X.list<-getXlist(PdP)
# creates design matrices for offspring (in this case indivdiual "5")
X.list.MM<-mismatches(X.list, G=genotypes$Gobs, mm.tol=0)
# genetic likelihoods are arranged sires within dams
X.list.MM$X$"5"$mmD
# number of mismatches between offspring "5" and dams "1" and "3"
X.list.MM$X$"5"$mmS
# number of mismatches between offspring "5" and sires "4" and "5"
X.list.MM$X$"5"$restdam.id
X.list.MM$X$"5"$dam.id
# dams with mismatches are excluded mismatch (mm.tol=0)
X.list.MM$X$"5"$restsire.id
X.list.MM$X$"5"$sire.id
# sires with mismatches are excluded mismatch (mm.tol=0)
Posterior Mode of Genotypes
Description
Finds the mode of the posterior marginal distribution of genotypes
Usage
modeG(postG, threshold=0)
Arguments
postG |
posterior distribution of genotypes from an |
threshold |
threshold probability under which maximum likelihood genotypes are replaced by NA |
Value
G |
list of |
id |
id vector |
Author(s)
Jarrod Hadfield j.hadfield@ed.ac.uk
References
Hadfield J.D. et al, Molecular Ecology
See Also
MCMCped, genotype
Examples
data(WarblerP)
data(WarblerG)
GdP<-GdataPed(WarblerG)
var1<-expression(varPed(c("lat", "long"), gender="Male",
relational="OFFSPRING"))
# paternity is to be modelled as a function of distance
# between offspring and male territories
res1<-expression(varPed("offspring", restrict=0))
# indivdiuals from the offspring generation are excluded as parents
res2<-expression(varPed("terr", gender="Female", relational="OFFSPRING",
restrict="=="))
# mothers not from the offspring territory are excluded
PdP<-PdataPed(formula=list(var1,res1,res2), data=WarblerP, USsire=FALSE)
tP<-tunePed(beta=30)
model1<-MCMCped(PdP=PdP, GdP=GdP, tP=tP, nitt=3000, thin=2, burnin=1000, write_postG=TRUE)
G<-modeG(model1$G)$G
summary(G[[1]])
Posterior Mode of Parents
Description
Finds the mode of the posterior marginal distribution of parents
Usage
modeP(postP, threshold=0, marginal=FALSE, USasNA=TRUE)
Arguments
postP |
posterior distribution of parentage |
threshold |
threshold probability under which maximum likelihood parents are replaced by NA |
marginal |
logical; should the marginal mode be calculated from the joint distribution? |
USasNA |
logical; should unsampled parents be replaced by NA? |
Details
Individuals that do not have a parent assignment with a posterior probability exceeding the threshold, or whose parents belong to the base or unsampled population (if USasNA=TRUE), have NA as their parents. Please bear in mind that the mode of the marginal distribution (returned by MCMCped if write_postP="MARGINAL") may be different from the mode of the joint distribution (write_postP="JOINT"). For example the male that has the highest marginal probability (marginal with respect to potential mothers) may not be the male that is in the parental category (i.e. dam/sire combination) with the highest probability. If write_postP="JOINT" was sepcified, then the mode of the marginal distribution can be obtained by specifying marginal=TRUE. The modes are marginal with respect to other offspring and with multigenerational pedigrees may not coincide with the mode of the distribution of pedigrees.
Value
P |
pedigree with id in the first column, and dam and sire in the second and third columns |
prob |
marginal posterior probability of the most likely parental combination (joint) or the most likely mother (marginal) |
prob.male |
marginal posterior probability of the most likely father (marginal) |
Author(s)
Jarrod Hadfield j.hadfield@ed.ac.uk
See Also
Examples
data(WarblerP)
data(WarblerG)
GdP<-GdataPed(WarblerG)
var1<-expression(varPed(c("lat", "long"), gender="Male",
relational="OFFSPRING"))
# paternity is to be modelled as a function of distance
# between offspring and male territories
res1<-expression(varPed("offspring", restrict=0))
# indivdiuals from the offspring generation are excluded as parents
res2<-expression(varPed("terr", gender="Female", relational="OFFSPRING",
restrict="=="))
# mothers not from the offspring territory are excluded
PdP<-PdataPed(formula=list(var1,res1,res2), data=WarblerP, USsire=FALSE)
tP<-tunePed(beta=30)
model1<-MCMCped(PdP=PdP, GdP=GdP, tP=tP, nitt=3000, thin=2, burnin=1000)
ped<-modeP(model1$P, threshol=0.9)
ped
Orders a Pedigree
Description
Orders a pedigree so parents come before offspring
Usage
orderPed(ped, time_born=NULL)
Arguments
ped |
pedigree with id, dam and sire in ech column |
time_born |
an optional vector of birth dates by which the pedigree can be ordered) |
Value
an ordered pedigree pedigree with id, dam and sire in each column
Note
This function has changed name from order.ped in earler versions <2.42. order.ped did not always (rarely) ordered the pedigree correctly. This new function uses the kindepth function from the kinship2 package
Author(s)
Jarrod Hadfield j.hadfield@ed.ac.uk
See Also
Examples
pedigree<-matrix(NA, 8,3)
pedigree[,1]<-1:8
pedigree[,2][5:8]<-c(1,1,2,2)
pedigree[,3][5:8]<-c(3,3,4,4)
pedigree<-pedigree[sample(1:8),]
pedigree2<-orderPed(pedigree)
Log-Likelihood of Unsampled Population Size
Description
Log-likelihood of the number of unsampled individuals given the genotypes of offspring and potential parents
Usage
popsize.loglik(X, USdam=FALSE, USsire=FALSE, nUS=NULL, ped=NULL, USsiredam=FALSE,
shrink=NULL)
Arguments
X |
list for each offspring with elements |
USdam |
logical or character; if |
USsire |
logical or character; if |
nUS |
vector for the size of the unsampled populations. Parmeters for the unsampled female populations come before the male populations. |
ped |
optional pedigree with id, dam and sire in ech column |
USsiredam |
logical; if |
shrink |
optional scalar for the variance defining the ridge-regression likelihood penalisation. |
Value
log-likelihood of the number of unsampled individuals given the genotype data.
Note
Intended to be used within MLE.popsize
Author(s)
Jarrod Hadfield j.hadfield@ed.ac.uk
References
Nielsen. R. et.al Genetics (2001) 157 4 1673-1682
See Also
Examples
data(WarblerG)
A<-extractA(WarblerG)
sex<-c(rep("Male", 50), rep("Female", 100))
offspring<-c(rep(0,100), rep(1, 50))
terr<-as.factor(rep(1:50, 3))
id<-1:150
res1<-expression(varPed(x="offspring", restrict=0))
res2<-expression(varPed(x="terr", gender="Female", relational="OFFSPRING",
restrict="=="))
test.data<-data.frame(id, sex, offspring, terr)
PdP<-PdataPed(formula=list(res1, res2), data=test.data)
simped<-simpedigree(PdP)
G<-simgenotypes(A, E1=0, E2=0, ped=simped$ped, no_dup=1)
# remove 25 males at random, leaving 25
rm.males<-sample(1:50, 25, replace=FALSE)
data.rm<-test.data[-rm.males,]
GdPrm<-GdataPed(G=lapply(G$Gobs, function(x){x[-rm.males]}),
id=G$id[-rm.males])
# delete genotype and phenotype records
PdPrm<-PdataPed(formula=list(res1, res2), data=data.rm, USsire=TRUE)
X.listrm<-getXlist(PdP=PdPrm, GdP=GdPrm, A=A, E2=0)
X<-lapply(X.listrm$X, function(x){list(N=c(25,0,1,0),
G=c(sum(x$G[1:25]), 0, x$G[26], 0))})
# each offspring has 1 mother and 25 sampled fathers so the 4 classes are:
# a) 1*25 categories with both parents sampled, 0*25 categries with only
# sires sampled b) 1*1 categories with only dams sired and 0*0 categories
# with both sexes unsampled.
nUS<-seq(10,40, length=100)
nUS_Loglik<-1:100
for(i in 1:100){
nUS_Loglik[i]<-popsize.loglik(X, USsire=TRUE, nUS=nUS[i])
}
plot(nUS_Loglik~nUS, type="l", main="Profile Log-likelihood
for number of unsampled males")
Returns pairs of individuals that fall into specific relatedness categories
Description
Computes posterior probabilities of pairs of indiviuals falling into specific relatedness categories (parent-offsping, sibs, full-sibs, half-sibs). Returns those pairs that have a posterior probability greater than some threshold.
Usage
post.pairs(postP, threshold=0, rel="PO")
Arguments
postP |
joint posterior distribution of parentage |
threshold |
threshold probability over which related pairs are returned |
rel |
relatedness category. Currently |
Value
P |
pairs of indiviuals that fall into the |
prob |
posterior probability |
Author(s)
Jarrod Hadfield j.hadfield@ed.ac.uk
See Also
Examples
data(WarblerP)
data(WarblerG)
GdP<-GdataPed(WarblerG)
var1<-expression(varPed(c("lat", "long"), gender="Male",
relational="OFFSPRING"))
# paternity is to be modelled as a function of distance
# between offspring and male territories
res1<-expression(varPed("offspring", restrict=0))
# indivdiuals from the offspring generation are excluded as parents
res2<-expression(varPed("terr", gender="Female", relational="OFFSPRING",
restrict="=="))
# mothers not from the offspring territory are excluded
PdP<-PdataPed(formula=list(var1,res1,res2), data=WarblerP, USsire=FALSE)
tP<-tunePed(beta=30)
model1<-MCMCped(PdP=PdP, GdP=GdP, tP=tP, nitt=3000, thin=2, burnin=1000, write_postP="JOINT")
fsib<-post.pairs(model1$P, threshol=0.9, rel="FS")
fsib$P
priorPed Object
Description
An object containing the prior specifications for a model fitted using MCMCped. If prior distributions are not specified then improper priors are used, and a proper posterior distribution cannot be gauranteed.
Usage
priorPed(E1=999, E2=999, beta=list(mu=999, sigma=999),
USdam=list(mu=999, sigma=999),
USsire=list(mu=999, sigma=999))
Arguments
E1 |
matrix of parameters for the beta distribution specifying the prior distribution. If Wang's (2004) model of genotyping error for co-dominant markers is used this is the probability of an allele dropping out. If CERVUS's (Kalinowski, 2006; Marshall, 1998) model of genotyping error for co-dominant markers is used this parameter is not used. If Hadfield's (2009) model of genotyping error for dominant markers is used this is the probability of a dominant allele being scored as a recessive allele. Rows correspond to error rate categories, columns to the beta shape parameters. The order of rows in E1 are the order in which the error rate categories appear in the |
E2 |
matrix of parameters for the beta distribution specifying the prior distribution. If Wang's (2004) or CERVUS's (Kalinowski, 2006; Marshall, 1998) model of genotyping error for co-dominant markers are used this is the probability of an allele being miss-scored. In the CERVUS model errors are not independent for the two alleles within a genotype and so if a genotyping error has occurred at one allele then a genotyping error occurs at the other allele with probability one. Accordingly, |
beta |
list containing a vector for the mean, and a matrix for the variance-covariances of a multivariate normal distribution, that specifies the prior distribution for the population level parameters. The order of |
USdam |
list containing vectors of means and standard deviations for log normal distributions that specify the prior distribution for the number of unsampled females. The order of |
USsire |
list containing vectors of means and standard deviations for log normal distributions that specify the prior distribution for the number of unsampled males. The order of |
Value
list containing the arguments passed
Author(s)
Jarrod Hadfield j.hadfield@ed.ac.uk
See Also
Examples
# When each individual has only been genotyped once, and no pedigree
# information exists, there is virtually no information available
# to estimate error rates. The tiny amount of information comes
# (dangerously) from the assumption of Hardy-Weinburg equilibrium.
# The posterior distribution is similar to the prior:
data(WarblerG)
A<-extractA(WarblerG)
ped<-matrix(NA, 100,3)
ped[,1]<-1:100
G<-simgenotypes(A, E1=0.01, E2=0.01, ped=ped, no_dup=1)
GdP<-GdataPed(G=G$Gobs, id=G$id)
pP<-priorPed(E1=matrix(c(40,1600), nrow=1), E2=matrix(c(40,1600), nrow=1))
model1<-MCMCped(GdP=GdP, pP=pP)
#The posterior distribution recovers the prior distribution
summary(model1$E1)
quantile(rbeta(1000, 40, 1600), prob=c(0.025, 0.25, 0.5, 0.75, 0.975))
Reorders Design Matrices
Description
Reorders design matrices so excluded parents appear last, and high probability parents appear first, thus increasing computational efficiency.
Usage
reordXlist(X.list, marker.type="MSW")
Arguments
X.list |
list of design matrices for each offspring derived using |
marker.type |
|
Details
The design matrices are reordered by the number of mismatches between a parent and offspring for codominant markers, and by the probability of the offspring genotype conditional on parent genotype for dominant markers.
Value
X.list for which parents are reordered
Note
If a GdataPed object is passed to getXlist then the design matrices will be reordered by default.
Author(s)
Jarrod Hadfield j.hadfield@ed.ac.uk
See Also
Examples
data(WarblerG)
A<-extractA(WarblerG)
ped<-matrix(NA, 5,3)
ped[,1]<-1:5
ped[,2]<-c(rep(NA, 4), 3)
ped[,3]<-c(rep(NA, 4), 4)
genotypes<-simgenotypes(A, ped=ped)
sex<-c("Female", "Male", "Female", "Male","Female")
offspring<-c(0,0,0,0,1)
data<-data.frame(id=ped[,1], sex, offspring)
var1<-expression(varPed(x="offspring", restrict=0))
PdP<-PdataPed(formula=list(var1), data=data)
X.list<-getXlist(PdP)
# creates design matrices for offspring (in this case indivdiual "5")
X.list<-mismatches(X.list, G=genotypes$Gobs)
X.list<-fillX.G(X.list, A=A, G=genotypes$Gobs)
X.list.reord<-reordXlist(X.list)
# The design matrices for the genetic likelihoods are reordered
# by the number of mismatches. The true parental combination
# now appears first rather than last.
X.list$X$"5"$G
X.list.reord$X$"5"$G
Genotype and Genotyping Error Simulation
Description
Simulates genotypes given a pedigree and allele frequencies. Option exists to simulate observed genotypes given Wangs's (2004) or CERVUS's model (Marshall 1998) of genotyping error for codominat markers or an asymmetric allele based model for dominant markers (Hadfield, 2009).
Usage
simgenotypes(A, E1 = 0, E2 = 0, ped, no_dup = 1, prop.missing=0, marker.type="MSW")
Arguments
A |
list of allele frequencies at each locus |
E1 |
if Wang's (2004) model of genotyping error for co-dominant markers is used this is the probability of an allele dropping out. If CERVUS's (Kalinowski, 2006; Marshall, 1998) model of genotyping error for co-dominant markers is used this parameter is not used. If Hadfield's (2009) model of genotyping error for dominant markers is used this is the probability of a dominant allele being scored as a recessive allele. |
E2 |
if Wang's (2004) or CERVUS's (Kalinowski, 2006; Marshall, 1998) model of genotyping error for co-dominant markers are used this is the probability of an allele being miss-scored. In the CERVUS model errors are not independent for the two alleles within a genotype and so if a genotyping error has occurred at one allele then a genotyping error occurs at the other allele with probability one. Accordingly, |
ped |
pedigree in 3 columns: id, dam, sire. Base individuals have NA as parents. All parents must be in id. |
no_dup |
integer: number of times genotypes are to be observed |
prop.missing |
proportion of observed genotypes that are missing |
marker.type |
|
Value
G |
list of genotype objects; true genotypes for each locus |
Gid |
vector of id names indexing |
Gobs |
list of genotype objects; observed genotypes for each locus |
id |
vector of |
Author(s)
Jarrod Hadfield j.hadfield@ed.ac.uk
References
Marshall, T. C. et al (1998) Molecular Ecology 7 5 639-655 Kalinowski S.T. et al (2007) Molecular Ecology 16 5 1099-1106 Hadfield J. D. et al (2009) in prep
See Also
genotype
Examples
pedigree<-cbind(1:10, rep(NA,10), rep(NA, 10))
gen_data<-simgenotypes(A=list(loc_1=c(0.5, 0.2, 0.1, 0.075, 0.025)),
E1=0.1, E2=0.1, ped=pedigree, no_dup=1)
summary(gen_data$G[[1]])
summary(gen_data$Gobs[[1]])
Simulates a Pedigree given a Log-Linear Model
Description
Given a PdataPed object simulates a pedigree according to the linear model defined by formula and user specified parameter values for the given model.
Usage
simpedigree(PdP, beta=NULL, nUS=NULL)
Arguments
PdP |
a |
beta |
parameter vector for the model defined by the |
nUS |
vector for the size of the unsampled population(s) defined in the |
Value
ped |
pedigree in 3 columns: id, dam, sire. Base individuals have NA as parents |
USsire.data |
binary vector indicating unsampled sire records (1) |
USsire.formula |
variable of the form |
USdam.data |
binary vector indicating unsampled dam records (1) |
USdam.formula |
variable of the form |
Author(s)
Jarrod Hadfield j.hadfield@ed.ac.uk
References
Hadfield J.D. et al (2006) Molecular Ecology 15 3715-31
See Also
MCMCped
startPed Object
Description
An object containing the starting parameterisation of a model, and logical variables indicating whether parameters should be estimated or fixed at the starting parameterisation. By default the starting parameterisation is obtained through a mixture of Maximum Likelihood and heuristic techniques.
Usage
startPed(G=NULL, id=NULL, estG=TRUE, A=NULL, estA=TRUE, E1=NULL,
estE1=TRUE, E2=NULL, estE2=TRUE, ped=NULL, estP=TRUE,
beta=NULL, estbeta=TRUE, USdam=NULL, estUSdam=TRUE,
USsire=NULL, estUSsire=TRUE, shrink=NULL)
Arguments
G |
list of genotype objects |
id |
vector of indivual id's for |
estG |
logical; should genotypes be estimated? |
A |
list of allele frequencies |
estA |
logical; should base-population allele frequencies be estimated? |
E1 |
if Wang's (2004) model of genotyping error for co-dominant markers is used this is a vector of probabilities of an allele dropping out. If CERVUS's (Kalinowski, 2006; Marshall, 1998) model of genotyping error for co-dominant markers is used this parameter is not used. If Hadfield's (2009) model of genotyping error for dominant markers is used this is a vector of probabilities of a dominant allele being scored as a recessive allele. Default=0.005. |
estE1 |
logical; should E1 estimated? |
E2 |
if Wang's (2004) or CERVUS's (Kalinowski, 2006; Marshall, 1998) model of genotyping error for co-dominant markers are used this is a vector of probabilities of an allele being miss-scored. In the CERVUS model errors are not independent for the two alleles within a genotype and so if a genotyping error has occurred at one allele then a genotyping error occurs at the other allele with probability one. Accordingly, |
estE2 |
logical; should E2 be estimated? |
ped |
pedigree in 3 columns: id, dam, sire. Base individuals have NA as parents. |
estP |
logical; should the pedigree be estimated? |
beta |
vector of population-level parameters |
estbeta |
logical; should the population-level parameters be estimated? |
USdam |
vector of unsampled female population sizes |
estUSdam |
logical; should the female population sizes be estimated? |
USsire |
vector of unsampled male population sizes |
estUSsire |
logical or character; if |
shrink |
optional scalar for the variance defining the ridge-regression likelihood penalisation used to obatain starting values for beta and/or unsampled population sizes. |
Details
If estG=FALSE an approximation is used for genotyping error. In this case error rates and allele frequencies are not estimated but fixed at the starting parameterisation. If individuals have been typed more than once, then the approximation only uses the genotype that first appears in the GdP$G object passed to MCMCped. If A is not specified estimates are taken directly from GdP$G using extractA. If E1 and E2 are not specified they are set to 0.005. Note that if the approximation for genotyping error is used with codominant markers, Wang's (2005) model is not used, and the CEVUS model (Marshall 1998) is adopted. In this case E2 is the per-allele error rate and E2(2-E2) is the per-genotype error rate used by CERVUS. If dam and sire are not specified the most likely set of parents given the genetic data are used (see MLE.ped). The starting value of beta, if not given, is the Maximum Likelihood estimate (MLEMLE of beta given the starting pedigree (see MLE.beta). The starting values of USdam and USsire, if not given, are the MLE based on the genotype data (see MLE.popsize).
Value
list containing the arguments passed
Author(s)
Jarrod Hadfield j.hadfield@ed.ac.uk
See Also
Examples
# In this example we simulate a pedigree and then fix the
# pedigree and estimate the population level paarmeters
data(WarblerP)
var1<-expression(varPed(c("lat", "long"), gender="Male",
relational="OFFSPRING"))
# paternity is to be modelled as a function of distance
# between offspring and male territories
res1<-expression(varPed("offspring", restrict=0))
# indivdiuals from the offspring generation are excluded as parents
res2<-expression(varPed("terr", gender="Female", relational="OFFSPRING",
restrict="=="))
# mothers not from the offspring territory are excluded
PdP<-PdataPed(formula=list(var1,res1,res2), data=WarblerP, USsire=FALSE)
simped<-simpedigree(PdP, beta=-0.25)
# simulate a pedigree where paternity drops with distance (beta=-0.25)
sP<-startPed(ped=simped$ped, estP=FALSE)
model1<-MCMCped(PdP=PdP, sP=sP, nitt=3000, thin=2, burnin=1000)
plot(model1$beta)
# The true underlying value is -0.25
genotypeD Object
Description
creates and object containing allele and genotype frequency for genotypeD objects
Usage
## S3 method for class 'genotypeD'
summary(object, ...)
Arguments
object |
genotypeD object |
... |
other arguments to be passed |
Value
locus |
locus information field (if present) |
allele.names |
vector of allele names: 0 and 1 |
allele.freq |
estimated allele frequencies with finite sample size correction (Lynch and Milligan 1994) |
genotype.freq |
frequencies of observed genotypes (phenotypes) |
Author(s)
Jarrod Hadfield j.hadfield@ed.ac.uk
References
Lynch M. and Milligan B.G. (1994) Molecular Ecology 3 91-99
See Also
genotype, summary.genotypeD
Examples
l1<-rbinom(100,1,0.5)
l1<-genotypeD(l1)
summary(l1)
tunePed Object
Description
An object containing scaling constants for the tuning parameters used in the Metropolis-Hastings updates. The tuning parameters should be set so that the Metropolis-Hastings acceptance rates lie between 0.2 and 0.5. Initial tuning parameters for beta and the unsampled population size are obtained from the large sample variance-covariances of the Maximum Likelihood estimates.
Usage
tunePed(E1 = NULL, E2 = NULL, beta = NULL, USdam = NULL,
USsire = NULL)
Arguments
E1 |
vector of scaling parameters for E1 |
E2 |
vector of scaling parameters for E2 |
beta |
vector which is multiplied by |
USdam |
vector which is multiplied by |
USsire |
vector which is multiplied by |
Details
The proposal distribution for all parameters is the multivariate normal, the variances of which are the large sample variance covariances of the Maximum Likelihood estimates multiplied by the scaling constants. For all parameters except beta, the covariance matrix for the proposal distribution has all off-diagonal elements set to zero. These parameters must be positive and so the proposal distribution is reflected at zero. A diagonal covariance matrices ensures that the proposal distribution remains symetric. For beta the covariances are not constrained at zero, and so the matrices are multiplied by the scaling constants in a way that preserves the correlational structure. The tuning parameters for the error rates are the scaling constants multiplied by 3e-5.
Value
list containing the arguments passed
Author(s)
Jarrod Hadfield j.hadfield@ed.ac.uk
See Also
Examples
data(WarblerG)
A<-extractA(WarblerG)
ped<-matrix(NA, 100,3)
ped[,1]<-1:100
G<-simgenotypes(A, ped=ped, E1=0.1, E2=0.001, no_dup=2)
GdP<-GdataPed(G=G$Gobs, id=G$id)
model1<-MCMCped(GdP=GdP, nitt=1500, thin=1, burnin=500)
# The proposal distribution is to conservative for E1
# and the update is accepted about 70% of the time
plot(model1$E1)
autocorr(model1$E1)
# Succesive samples from the posterior distribution are
# strongly autocorrelated. Should of course run the chain
# for longer with a larger thinning interval, but a greater
# tuning parameter helps (now 3e-4, rather than 3e-5):
model2<-MCMCped(GdP=GdP, tP=tunePed(E1=10), nitt=1500,
thin=1, burnin=500)
plot(model2$E1)
autocorr(model2$E1)
Transforms Variables for a Multinomial Log-Linear Model
Description
Creates offspring specific design matrices the columns of which refer to the explanatory variables of the liner model.
Usage
varPed(x, gender=NULL, lag=c(0,0), relational=FALSE,
lag_relational=c(0,0), restrict=NULL, keep=FALSE,
USvar=NULL, merge=FALSE, NAvar=NULL)
Arguments
x |
predictor variable; numeric or factor |
gender |
the gender of the parent to which |
lag |
numeric vector of length 2. The time interval over which |
relational |
a character string. If "OFFSPRING", the Euclidean distance between |
lag_relational |
numeric vector of length 2. If |
restrict |
character string designating parents with a zero prior probability of parentage. Only parents for which |
keep |
logical; if |
USvar |
if |
merge |
logical; if |
NAvar |
numeric; replacement for missing values in the predictors. |
Details
The design matrix for each offspring represents the state of each parental (dam/sire) combination for each explanatory variable. The number of rows in the design matrix (the number of parental combinations) is free to vary across offspring, but the number of explanatory variables remain the same. As with standard generalised linear modelling the columns of the design matrices take on numerical values or indicator values for continuous and categorical variables, respectively. When relational=FALSE, elements of the design matrices referring to specific parental combinations will not vary across offspring (unless longitudinal data are being used) and the associated vector of parameters will relate the explanatory variables to overall fecundity. For these variables the model is essentially the multinomial analogue of the more familiar Poisson model often used to analyse such data. However, the counts of the multinomial are not known with certainty because uncertainty exists around the maternity and/or paternity of each offspring.
Additional variables can be fitted that relate specific parental combinations to specific offspring, or specific dams to specific sires. Elements of the design matrices referring to specific parental combinations are then free to vary across offspring. The most obvious variable of this type is the Mendelian transition probability obtained from the genetic data themselves. However, by specifying relational="OFFSPRING", relational="OFFSPRINGV", relational="MATE" or relational="MATEV", non-genetic variables are free to vary across offspring. When x is numeric the Euclidean distances between parents and offspring, or between mates enter into the design matrix, when relational="OFFSPRING" or relational="MATE" respectively. When relational="OFFSPRINGV" or relational="MATEV" are specified a signed vector is calculated rather than a distance. When x is a factor then an indicator variable is set up indicating whether parent and offspring, or mate, factor levels match. Often, each offspring will have a variable number of candidate parents as some parents may be excluded a priori. When x is a factor and both relational="OFFSPRING" and restrict="==", only those potential parents that have factor levels matching the offspring factor level are retained. When relational=FALSE, restrict can take on factor levels which exclude parents that have non-matching factor levels.
If a time variable (timevar) is not passed to PdataPed the data are assumed to be cross-sectional and each individual only represented once. If a time variable (timevar) is passed to PdataPed then lag and lag_relational can be set so that time specific covariates are used. lag designates time units relative to the offspring record when relational=FALSE; for example, if lag=c(0,0) the value of x is taken for that parent during the same time period as the offspring record. If relational="OFFSPRING" or relational="MATE" then lag determines the time units relative to the record of the offspring or mate to which the focal individual is being compared. This record can be specified by using lag_relational, which is always relative to the offspring record. Negative lags refer to previous time intervals (e.g. lag=c(-1,-1) takes x from the previous time step), and if the elements of lag or lag_relational differ then the average value of x during this period is taken (e.g lag=c(-1,0) averages x in the record matching and preceding the offspring record). This is not applicable when x is a factor unless restrict takes one of the logical values (e.g."==") in which case parents are retained when the logical value is TRUE at least once in the specified interval.
Below are models that can be fitted using varPed, where x is a univariate continuous variable:
varPed(x, gender="Female")
p^{(o)}_{i,j} \propto \textrm{exp}(\beta_{1}x_{i}...)
varPed(x, gender="Male")
p^{(o)}_{i,j} \propto \textrm{exp}(\beta_{1}x_{j}...)
varPed(x)
p^{(o)}_{i,j} \propto \textrm{exp}(\beta_{1}(x_{i}+x_{j})...)
varPed(x, gender="Female", relational="OFFSPRING")
p^{(o)}_{i,j} \propto \textrm{exp}(\beta_{1}(|x_{i}-x_{o}|)...)
varPed(x, gender="Female", relational="OFFSPRINGV")
p^{(o)}_{i,j} \propto \textrm{exp}(\beta_{1}(x_{i}-x_{o})...)
varPed(x, gender="Female", relational="MATE")
p^{(o)}_{i,j} \propto \textrm{exp}(\beta_{1}(|x_{i}-x_{j}|)...)
varPed(x, gender="Female", relational="MATEV")
p^{(o)}_{i,j} \propto \textrm{exp}(\beta_{1}(x_{i}-x_{j})...)
varPed(x, gender="Female", lag=c(-1,-1))
p^{(o)}_{i,j} \propto \textrm{exp}(\beta_{1}x_{i,t-1}...)
varPed(x, gender="Female", lag=c(-1,-1), relational="OFFSPRING")
p^{(o)}_{i,j} \propto \textrm{exp}(\beta_{1}(|x_{i,t-1}-x_{o,t}|)...)
varPed(x, gender="Female", lag=c(-2,-2), relational="MATE",
lag_relational=c(-1,-1))
p^{(o)}_{i,j} \propto \textrm{exp}(\beta_{1}(|x_{i,t-2}-x_{j,t-1}|)...)
varPed(x, gender="Male", lag=c(-2,-2), relational="OFFSPRING",
lag_relational=c(-1,-1))
p^{(o)}_{i,j} \propto \textrm{exp}(\beta_{1}(|x_{j,t-2}-x_{o,t-1}|)...)
Where p^{(o)}_{i,j} is the probability that dam i and sire j are the parents of an offspring o. x and \beta are the variable of interest and the associated parameter, and t is the time period to which the offspring record belongs.
For a categorical variable with two levels (A and B) the model specified by varPed(x, gender="Female") takes on the form
p^{(o)}_{i,j} \propto \textrm{exp}(\beta_{1}\delta_{i}...)
where \delta_{i} is an indicator variable taking the value 1 if x_{i} is equal to the first level of x and zero otherwise. \beta_{1} is then the log odds ratio of the two levels of x with respect to maternity. If merge=TRUE is specified then \beta_{1} may vary across offspring, and \beta_{m} is estimated. \beta_{m} is related to \beta_{1}:
\beta_{m} = \textrm{logit}\left[\frac{\theta N_{A}}{\theta N_{A} + (1-\theta)N_{B}}\right]
where \theta is the inverse logit transformation of \beta_{1}, and N_{A} and N_{B} are the number of potential mothers that have level A and B for x. If N_{A} and N_B are invariant over offspring the models are functionally equivalent.
The denominator of the multinomial likelihood is the summed linear predictors of all possible parents (after setting up a contrast with the baseline parents). Designating the first set of parents as baseline, the contrast for each set of parents is simply:
\eta^{(o)}_{i,j} = \textrm{log}\left[\frac{p^{(o)}_{i,j}}{p^{(o)}_{1,1}}\right]
and the likelihood of \beta is
Pr(x| \bm{\beta}) = \prod^{n_{o}}_{o}\left[\frac{\textrm{exp}(\eta^{(o)}_{d,s})}{\sum^{n^{(o)}_{i}}_{i=1}\sum^{n^{(o)}_{j}}_{j=1}\textrm{exp}(\eta^{(o)}_{i,j})}\right]
where n_{o}, n^{(o)}_{i} and n^{(o)}_{j} are the number of offspring, the number of potential mothers for offspring o, and the number of potential fathers for offspring o, respectively. d and s are the actual parents of offspring o. The set of possible parents in the denominator of the multinomial likelihood are those that are not excluded using the argument restrict. However, if the argument keep=TRUE is used then the denominator of the likelihood will include excluded parents despite the fact that d\neq i and s\neq j.
In version 2.31-2.42 DSapprox=TRUE can be passed to MCMCped which approximates the likelihood of \beta when a variable specifies the distance between mates (i.e relational="MATE"). This approximation reduces the computational burden by fixing i=d or j=s in the denominator of the multinomial likelihood. The parent defined as the "MATE" is fixed, so that a varPed expression with gender="Male" has the approximated likelihood:
Pr(x | \bm{\beta}) \approx \prod^{n_{o}}_{o}\left[\frac{\textrm{exp}(\eta^{(o)}_{d,s})}{\sum^{n^{(o)}_{j}}_{j=1}\textrm{exp}(\eta^{(o)}_{d,j})}\right]
For certain types of problem this approximation does not work well. In version 2.43 and after, another approximation is used which seems to work better:
Pr(x | \bm{\beta}) \approx \prod^{n_{o}}_{o}\left[\frac{\textrm{exp}(\eta^{(o)}_{d,s})}{\sum^{n^{(o)}_{i}}_{i=1}\textrm{exp}(\eta^{(o)}_{i,s})+\sum^{n^{(o)}_{j}}_{j=1}\textrm{exp}(\eta^{(o)}_{d,j})-\textrm{exp}(\eta^{(o)}_{d,s})}\right]
Value
list containing the design matrix for variable x, the identity of retained parents and the gender of the parents
Note
Versions >=2.1 accept different arguments for restrict than earlier versions. When relational="OFFSPRING", earlier versions accepted restrict=TRUE and restrict=FALSE, but these have now been replaced with restrict="==" and restrict="!=", respectively. In addition, restrict now also accepts ">", ">=", "<" and "<=" with parental values on the LHS and offspring values on the RHS.
Also, versions >=2.1 also accept "OFFSPRINGV" and "MATEV" for relational in addition to "OFFSPRING" and "MATE". "V" specifies that the signed vector should be used rather than the Euclidean distance.
Author(s)
Jarrod Hadfield j.hadfield@ed.ac.uk
References
Hadfield J.D. et al (2006) Molecular Ecology 15 3715-31