callHaplotypeSpecificCN.Rd
Call haplotype specific copy number in single cell datasets
callHaplotypeSpecificCN(
  CNbins,
  haplotypes,
  eps = 1e-12,
  maskedbins = NULL,
  loherror = 0.02,
  maxCN = NULL,
  selftransitionprob = 0.95,
  progressbar = TRUE,
  ncores = 1,
  phasebyarm = FALSE,
  minfrachaplotypes = 0.7,
  likelihood = "auto",
  minbins = 0,
  minbinschr = 0,
  phased_haplotypes = NULL,
  clustering_method = "copy",
  maxloherror = 0.035,
  mincells = 7,
  overwritemincells = NULL,
  cluster_per_chr = TRUE,
  viterbiver = "cpp",
  filterhaplotypes = 0.1,
  firstpassfiltering = TRUE,
  smoothsingletons = TRUE,
  fillmissing = TRUE,
  global_phasing_for_balanced = FALSE,
  chr_cell_list = NULL,
  chrs_for_global_phasing = NULL,
  female = TRUE
)
CNbins
Single-cell copy number data frame with the following columns: cell_id, chr, start, end, state, copy.
haplotypes
Single-cell haplotype data frame with the following columns: cell_id, chr, start, end, hap_label, allele1, allele0, totalcounts.
eps
Default = 1e-12.
maskedbins
Data frame with columns chr, start and end. These bins are masked from the inference, and their copy number states are assigned based on the states of neighbouring bins.
loherror
LOH error rate for the initial assignment; this is inferred directly from the data in the second pass. Default = 0.02.
maxCN
Maximum copy number for which to infer allele-specific states. Default = NULL, which uses the maximum state from CNbins.
selftransitionprob
Probability of staying in the same state in the HMM. Default = 0.95; set to 0.0 for an IID model.
progressbar
Boolean, whether to display a progress bar. Default = TRUE; the bar is only shown if ncores == 1.
ncores
Number of cores to use. Default = 1.
phasebyarm
Whether to phase by chromosome arm. Default = FALSE.
minfrachaplotypes
Minimum proportion of haplotypes to retain when clustering and phasing. Default = 0.7.
likelihood
Likelihood model for the HMM: binomial, betabinomial, or auto, in which case the algorithm chooses the likelihood that best fits the data. Default = auto.
minbins
Minimum number of bins containing both haplotype counts and copy number data for a cell to be included. Default = 0.
minbinschr
Minimum number of bins per chromosome containing both haplotype counts and copy number data for a cell to be included. Default = 0.
phased_haplotypes
Use this to define the haplotype phasing manually, for example if the default heuristics used by signals do not return a good fit.
clustering_method
Method used to cluster cells for haplotype phasing. Default is copy (uses the copy column); the other option is breakpoints (uses breakpoints for clustering).
maxloherror
Maximum value for the LOH error rate. Default = 0.035.
mincells
Minimum cluster size used for phasing. Default = 7.
overwritemincells
Force the number of cells to use for clustering/phasing rather than using the output of the clustering.
cluster_per_chr
Whether to cluster per chromosome to rephase alleles. Default = TRUE.
filterhaplotypes
Filter out haplotypes present in less than this fraction of cells. Default = 0.1.
firstpassfiltering
Filter out cells with a large discrepancy after the first-pass state assignment.
smoothsingletons
Remove singleton bins by smoothing over them based on the states of adjacent bins.
fillmissing
For bins with missing counts, fill in values based on neighbouring bins; this ensures that the returned object is the same size as the input CNbins.
chr_cell_list
Cells to use for phasing each chromosome. This should be a named list with a vector of cell_ids for each chromosome, e.g. list("1" = c("cell_id1", "cell_id2")). Default is NULL. If provided, it overrides the internal phasing.
chrs_for_global_phasing
Which chromosomes to phase using all cells for diploid regions. Default is NULL, which uses all chromosomes.
female
Default is TRUE. If set to FALSE and the patient is "XY", X chromosome states are set to A|0, where A is the HMMcopy state.
viterbiver
Version of the Viterbi algorithm to use (cpp or R).
global_phasing_for_balanced
When using cluster_per_chr, use all cells for phasing diploid regions within the cluster.
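As a rough illustration of the input shapes described above, the following sketch builds toy versions of CNbins, haplotypes, maskedbins and chr_cell_list in base R. All values are invented, and computing totalcounts as allele1 + allele0 is an assumption made here for the example. The data are far too small for real inference and only show the expected columns and list structure.

```r
# Toy inputs; every value below is invented purely to show the expected shape.
CNbins <- data.frame(
  cell_id = rep(c("cell_1", "cell_2"), each = 2),
  chr = "1",
  start = rep(c(1, 500001), times = 2),
  end = rep(c(500000, 1000000), times = 2),
  state = c(2, 2, 3, 2),
  copy = c(2.1, 1.9, 3.2, 2.0),
  stringsAsFactors = FALSE
)

haplotypes <- data.frame(
  cell_id = CNbins$cell_id,
  chr = CNbins$chr,
  start = CNbins$start,
  end = CNbins$end,
  hap_label = c(1, 2, 1, 2),
  allele1 = c(3, 0, 5, 2),
  allele0 = c(2, 4, 1, 2),
  stringsAsFactors = FALSE
)
# Assumption: totalcounts is the sum of the two allele counts.
haplotypes$totalcounts <- haplotypes$allele1 + haplotypes$allele0

# Bins to mask: a data frame with chr, start and end.
maskedbins <- data.frame(chr = "1", start = 500001, end = 1000000,
                         stringsAsFactors = FALSE)

# Cells to use for phasing, as a named list keyed by chromosome.
chr_cell_list <- list("1" = c("cell_1", "cell_2"))

# Check that the documented columns are present before calling the function.
required_cn <- c("cell_id", "chr", "start", "end", "state", "copy")
required_hap <- c("cell_id", "chr", "start", "end", "hap_label",
                  "allele1", "allele0", "totalcounts")
stopifnot(length(setdiff(required_cn, names(CNbins))) == 0)
stopifnot(length(setdiff(required_hap, names(haplotypes))) == 0)

# With the signals package installed and realistic data, the call could look
# like this (not run here):
# results <- callHaplotypeSpecificCN(CNbins, haplotypes,
#                                    maskedbins = maskedbins,
#                                    chr_cell_list = chr_cell_list)
```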
Haplotype specific copy number object
The haplotype specific copy number object includes the following additional columns:
A
A allele copy number
B
B allele copy number
state_AS_phased
A|B
state_min
Minor allele copy number
LOH
LOH if the bin is LOH, NO otherwise
state_phase
Discretized haplotype specific states
phase
Whether the A allele or B allele is dominant
alleleA
Counts for the A allele
alleleB
Counts for the B allele
totalcounts
Total number of counts
BAF
B-allele frequency (alleleB / totalcounts)
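The relationships among several of the columns listed above can be sketched in base R on a toy data frame. This is an illustration rather than the package's own code. It assumes that a bin is labelled LOH when its minor allele copy number is 0 and that totalcounts is alleleA + alleleB. The expected_BAF column is a hypothetical helper, not a documented output.

```r
# Toy bins for a single cell; all numbers are invented for illustration.
bins <- data.frame(
  A = c(1, 2, 2, 1),          # A allele copy number
  B = c(1, 0, 1, 0),          # B allele copy number
  alleleA = c(12, 30, 21, 9), # counts for the A allele
  alleleB = c(11, 1, 10, 0)   # counts for the B allele
)

# Assumption: total counts are the sum of the two allele counts.
bins$totalcounts <- bins$alleleA + bins$alleleB

# state_AS_phased is written as A|B.
bins$state_AS_phased <- paste0(bins$A, "|", bins$B)

# Minor allele copy number.
bins$state_min <- pmin(bins$A, bins$B)

# Assumption: LOH means the minor allele copy number is 0.
bins$LOH <- ifelse(bins$state_min == 0, "LOH", "NO")

# B-allele frequency as documented: alleleB / totalcounts.
bins$BAF <- bins$alleleB / bins$totalcounts

# Hypothetical helper column: the BAF implied by the copy number state.
bins$expected_BAF <- bins$B / (bins$A + bins$B)

print(bins)
```

Comparing BAF with expected_BAF per state is one simple way to check whether observed allele counts agree with the assigned states.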
sim_data <- simulate_data_cohort(
  clone_num = c(20, 20),
  clonal_events = list(
    list("1" = c(2, 0), "5" = c(3, 1)),
    list("2" = c(6, 3), "3" = c(1, 0))
  ),
  loherror = 0.02,
  coverage = 100
)
#> Joining with `by = join_by(chr, start, end, hap_label)`
results <- callHaplotypeSpecificCN(sim_data$CNbins, sim_data$haplotypes)
#> Filtering out haplotypes present < 10% of cells...
#> Fraction of haplotypes retained after filtering = 1
#> Finding overlapping cell IDs between CN data and haplotype data...
#> Total number of cells in both CN and haplotypes: 40
#> Number of cells in CN data: 40
#> Number of cells in haplotype data: 40
#> Joining bins and haplotypes...
#> Phase haplotypes...
#> Phasing based on distribution across all cells
#> Join phased haplotypes...
#> Reorder haplotypes based on phase...
#> Total number of cells after removing cells with < 0 bins: 40
#> Warning: `progress_estimated()` was deprecated in dplyr 1.0.0.
#> ℹ The deprecated feature was likely used in the signals package.
#> Please report the issue to the authors.
#> Removing 0 cells for phasing
#> Fitting beta-binomial model to state: 1|1...
#> VGLM linear loop 1 : loglikelihood = -377038.2394
#> VGLM linear loop 2 : loglikelihood = -351943.5317
#> VGLM linear loop 3 : loglikelihood = -340567.9024
#> VGLM linear loop 4 : loglikelihood = -337276.3768
#> VGLM linear loop 5 : loglikelihood = -336604.0917
#> VGLM linear loop 6 : loglikelihood = -336487.3401
#> VGLM linear loop 7 : loglikelihood = -336467.5816
#> VGLM linear loop 8 : loglikelihood = -336465.6372
#> Warning: 27927 diagonal elements of the working weights variable 'wz' have been replaced by 1.819e-12
#> VGLM linear loop 9 : loglikelihood = -336465.6369
#> Warning: 27104 diagonal elements of the working weights variable 'wz' have been replaced by 1.819e-12
#> VGLM linear loop 10 : loglikelihood = -336465.6363
#> Warning: 41027 diagonal elements of the working weights variable 'wz' have been replaced by 1.819e-12
#> VGLM linear loop 11 : loglikelihood = -336465.6366
#> Taking a modified step.
#> VGLM linear loop 11 : loglikelihood = -336465.6353
#> Warning: 97402 diagonal elements of the working weights variable 'wz' have been replaced by 1.819e-12
#> VGLM linear loop 12 : loglikelihood = -336465.6364
#> Taking a modified step..
#> VGLM linear loop 12 : loglikelihood = -336465.6353
#> Warning: some quantities such as z, residuals, SEs may be inaccurate due to convergence at a half-step
#> Inferred mean: 0.496, Expected mean: 0.5, Inferred overdispersion (rho): 0
#> Tarones Z-score: -4.112, using binomial model for inference.
#> Using 7 cells for clustering...
#> Clustering chromosome 1
#> Creating CN matrix...
#> Calculating UMAP dimensionality reduction...
#> Clustering cells using hdbscan...
#> Identified 2 clusters
#> Distribution of clusters:
#> Cluster A: 20
#> Cluster B: 20
#> Clustering chromosome 10
#> Creating CN matrix...
#> Calculating UMAP dimensionality reduction...
#> Clustering cells using hdbscan...
#> Identified 1 clusters
#> Distribution of clusters:
#> Cluster 0: 40
#> Clustering chromosome 11
#> Creating CN matrix...
#> Calculating UMAP dimensionality reduction...
#> Clustering cells using hdbscan...
#> Identified 1 clusters
#> Distribution of clusters:
#> Cluster 0: 40
#> Clustering chromosome 12
#> Creating CN matrix...
#> Calculating UMAP dimensionality reduction...
#> Clustering cells using hdbscan...
#> Identified 1 clusters
#> Distribution of clusters:
#> Cluster 0: 40
#> Clustering chromosome 13
#> Creating CN matrix...
#> Calculating UMAP dimensionality reduction...
#> Clustering cells using hdbscan...
#> Identified 1 clusters
#> Distribution of clusters:
#> Cluster 0: 40
#> Clustering chromosome 14
#> Creating CN matrix...
#> Calculating UMAP dimensionality reduction...
#> Clustering cells using hdbscan...
#> Identified 1 clusters
#> Distribution of clusters:
#> Cluster 0: 40
#> Clustering chromosome 15
#> Creating CN matrix...
#> Calculating UMAP dimensionality reduction...
#> Clustering cells using hdbscan...
#> Identified 1 clusters
#> Distribution of clusters:
#> Cluster 0: 40
#> Clustering chromosome 16
#> Creating CN matrix...
#> Calculating UMAP dimensionality reduction...
#> Clustering cells using hdbscan...
#> Identified 1 clusters
#> Distribution of clusters:
#> Cluster 0: 40
#> Clustering chromosome 17
#> Creating CN matrix...
#> Calculating UMAP dimensionality reduction...
#> Clustering cells using hdbscan...
#> Identified 1 clusters
#> Distribution of clusters:
#> Cluster 0: 40
#> Clustering chromosome 18
#> Creating CN matrix...
#> Calculating UMAP dimensionality reduction...
#> Clustering cells using hdbscan...
#> Identified 1 clusters
#> Distribution of clusters:
#> Cluster 0: 40
#> Clustering chromosome 19
#> Creating CN matrix...
#> Calculating UMAP dimensionality reduction...
#> Clustering cells using hdbscan...
#> Identified 1 clusters
#> Distribution of clusters:
#> Cluster 0: 40
#> Clustering chromosome 2
#> Creating CN matrix...
#> Calculating UMAP dimensionality reduction...
#> Clustering cells using hdbscan...
#> Identified 1 clusters
#> Distribution of clusters:
#> Cluster 0: 40
#> Clustering chromosome 20
#> Creating CN matrix...
#> Calculating UMAP dimensionality reduction...
#> Clustering cells using hdbscan...
#> Identified 1 clusters
#> Distribution of clusters:
#> Cluster 0: 40
#> Clustering chromosome 21
#> Creating CN matrix...
#> Calculating UMAP dimensionality reduction...
#> Clustering cells using hdbscan...
#> Identified 1 clusters
#> Distribution of clusters:
#> Cluster 0: 40
#> Clustering chromosome 22
#> Creating CN matrix...
#> Calculating UMAP dimensionality reduction...
#> Clustering cells using hdbscan...
#> Identified 1 clusters
#> Distribution of clusters:
#> Cluster 0: 40
#> Clustering chromosome 3
#> Creating CN matrix...
#> Calculating UMAP dimensionality reduction...
#> Clustering cells using hdbscan...
#> Identified 2 clusters
#> Distribution of clusters:
#> Cluster A: 20
#> Cluster B: 20
#> Clustering chromosome 4
#> Creating CN matrix...
#> Calculating UMAP dimensionality reduction...
#> Clustering cells using hdbscan...
#> Identified 1 clusters
#> Distribution of clusters:
#> Cluster 0: 40
#> Clustering chromosome 5
#> Creating CN matrix...
#> Calculating UMAP dimensionality reduction...
#> Clustering cells using hdbscan...
#> Identified 2 clusters
#> Distribution of clusters:
#> Cluster A: 20
#> Cluster B: 20
#> Clustering chromosome 6
#> Creating CN matrix...
#> Calculating UMAP dimensionality reduction...
#> Clustering cells using hdbscan...
#> Identified 1 clusters
#> Distribution of clusters:
#> Cluster 0: 40
#> Clustering chromosome 7
#> Creating CN matrix...
#> Calculating UMAP dimensionality reduction...
#> Clustering cells using hdbscan...
#> Identified 1 clusters
#> Distribution of clusters:
#> Cluster 0: 40
#> Clustering chromosome 8
#> Creating CN matrix...
#> Calculating UMAP dimensionality reduction...
#> Clustering cells using hdbscan...
#> Identified 1 clusters
#> Distribution of clusters:
#> Cluster 0: 40
#> Clustering chromosome 9
#> Creating CN matrix...
#> Calculating UMAP dimensionality reduction...
#> Clustering cells using hdbscan...
#> Identified 1 clusters
#> Distribution of clusters:
#> Cluster 0: 40
#> Clustering chromosome X
#> Creating CN matrix...
#> Calculating UMAP dimensionality reduction...
#> Clustering cells using hdbscan...
#> Identified 1 clusters
#> Distribution of clusters:
#> Cluster 0: 40
#> Finding overlapping cell IDs between CN data and haplotype data...
#> Total number of cells in both CN and haplotypes: 40
#> Number of cells in CN data: 40
#> Number of cells in haplotype data: 40
#> Joining bins and haplotypes...
#> Phase haplotypes...
#> Join phased haplotypes...
#> Reorder haplotypes based on phase...
#> Total number of cells after removing cells with < 0 bins: 40
#> Average distance from median to expected BAF = 0.0054
#> Average distance from median to expected BAF = 0.0054
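The log above reports a Tarone's Z-score before choosing the binomial model. As background, the sketch below computes the standard Tarone (1979) Z statistic for overdispersion in binomial counts, in base R on simulated data. Whether signals computes the statistic in exactly this way, and which threshold it uses to choose between models, are not stated on this page. The function name tarone_z and the simulation settings are hypothetical.

```r
# Tarone's Z statistic for overdispersion relative to a binomial model.
# x: successes per observation, n: trials per observation.
tarone_z <- function(x, n) {
  p_hat <- sum(x) / sum(n)                              # pooled proportion
  s <- sum((x - n * p_hat)^2) / (p_hat * (1 - p_hat))   # scaled squared deviations
  (s - sum(n)) / sqrt(2 * sum(n * (n - 1)))
}

set.seed(1)
n <- rep(50L, 2000)

# Counts drawn from a plain binomial: Z should be close to 0.
x_binom <- rbinom(length(n), n, 0.5)

# Overdispersed counts: each observation gets its own success probability
# drawn from a Beta distribution (a beta-binomial mixture).
p_i <- rbeta(length(n), 5, 5)
x_bb <- rbinom(length(n), n, p_i)

z_binom <- tarone_z(x_binom, n)
z_bb <- tarone_z(x_bb, n)

cat("Z (binomial data):      ", round(z_binom, 3), "\n")
cat("Z (beta-binomial data): ", round(z_bb, 3), "\n")
```

Under the binomial null the statistic is approximately standard normal, so large positive values suggest overdispersion, while values near or below zero are consistent with a binomial model.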