callHaplotypeSpecificCN.Rd
Call haplotype specific copy number in single cell datasets
callHaplotypeSpecificCN(
CNbins,
haplotypes,
eps = 0.000000000001,
maskedbins = NULL,
loherror = 0.02,
maxCN = NULL,
selftransitionprob = 0.95,
progressbar = TRUE,
ncores = 1,
phasebyarm = FALSE,
minfrachaplotypes = 0.7,
likelihood = "auto",
minbins = 100,
minbinschr = 10,
phased_haplotypes = NULL,
clustering_method = "copy",
maxloherror = 0.035,
mincells = 7,
overwritemincells = NULL,
cluster_per_chr = TRUE,
viterbiver = "cpp",
filterhaplotypes = 0.1,
firstpassfiltering = TRUE,
smoothsingletons = TRUE,
fillmissing = TRUE,
global_phasing_for_balanced = FALSE,
chrs_for_global_phasing = NULL
)
single cell copy number dataframe with the following columns: cell_id
, chr
, start
, end
, state
, copy
single cell haplotypes dataframe with the following columns: cell_id
, chr
, start
, end
, hap_label
, allele1
, allele0
, totalcounts
default 1e-12
data.frame with columns chr, start and end. These bins will be masked from the inference and copy number states assigned to these bins based on the states of neighbouring bins.
LOH error rate for initial assignment, this is inferred directly from the data in the second pass, default = 0.02
maximum copy number to infer allele specific states, default=NULL which will use the maximum state from CNbins
probability to stay in the same state in the HMM, default = 0.95, set to 0.0 for an IID model
Boolean to display progressbar or not, default = TRUE, will only show if ncores == 1
Number of cores to use, default = 1
Phasing by chromosome arm, default = FALSE
Minimum proportion of haplotypes to retain when clustering + phasing, default = 0.7
Likelihood model for HMM, default is binomial
, other option is betabinomial
or use auto
and the algorithm will choose the likelihood that best fits the data. Default auto
Minimum number of bins containing both haplotype counts and copy number data for a cell to be included
Minimum number of bins containing both haplotype counts and copy number data per chromosome for a cell to be included
Use this if you want to manually define the haplotypes phasing if for example the default heuristics used by signals does not return a good fit.
Method to use to cluster cells for haplotype phasing, default is copy
(using copy column), other option is breakpoints
(using breakpoint for clustering)
Maximum value for LOH error rate
Minimum cluster size used for phasing, default = 7
Force the number of cells to use for clustering/phasing rather than use the output of the clustering
Whether to cluster per chromosome to rephase alleles or not
filter out haplotypes present in less than X fraction, default is 0.1
Filter out cells with large discrepancy after first pass state assignment
Remove singleton bins by smoothing over based on states in adjacent bins
For bins with missing counts fill in values based on neighbouring bins
Which chromosomes to phase using all cells for diploid regions, default is NULL which uses all chromosomes
Version of viterbi algorithm to use (cpp or R)
When using cluster_per_chr, use all cells for phasing diploid regions within the cluster
Haplotype specific copy number object
The haplotype specific copy number object include the following additional columns
A
A allele copy number
B
B allele copy number
state_AS_phased
A|B
state_min
Minor allele copy number
LOH
=LOH if bin is LOH, NO otherwise
state_phase
Discretized haplotype specific states
phase
Whether the A allele or B allele is dominant
alleleA
Counts for the A allele
alleleB
Counts for the B allele
totalcounts
Total number of counts
BAF
B-allele frequency (alleleB / totalcounts)
sim_data <- simulate_data_cohort(
clone_num = c(20, 20),
clonal_events = list(
list("1" = c(2, 0), "5" = c(3, 1)),
list("2" = c(6, 3), "3" = c(1, 0))
),
loherror = 0.02,
coverage = 100
)
#> Joining with `by = join_by(chr, start, end, hap_label)`
results <- callHaplotypeSpecificCN(sim_data$CNbins, sim_data$haplotypes)
#> Filtering out haplotypes present < 10% of cells...
#> Fraction of haplotypes retained after filtering = 1
#> Finding overlapping cell IDs between CN data and haplotype data...
#> Total number of cells in both CN and haplotypes: 40
#> Number of cells in CN data: 40
#> Number of cells in haplotype data: 40
#> Joining bins and haplotypes...
#> Phase haplotypes...
#> Phasing based on distribution across all cells
#> Join phased haplotypes...
#> Reorder haplotypes based on phase...
#> Total number of cells after removing cells with < 100 bins: 40
#> Warning: `progress_estimated()` was deprecated in dplyr 1.0.0.
#> ℹ The deprecated feature was likely used in the signals package.
#> Please report the issue to the authors.
#> Removing 0 cells for phasing
#> Fitting beta-binomial model to state: 1|1...
#> VGLM linear loop 1 : loglikelihood = -377138.1672
#> VGLM linear loop 2 : loglikelihood = -352140.8312
#> VGLM linear loop 3 : loglikelihood = -340853.0613
#> VGLM linear loop 4 : loglikelihood = -337608.333
#> VGLM linear loop 5 : loglikelihood = -336954.2848
#> VGLM linear loop 6 : loglikelihood = -336843.9099
#> VGLM linear loop 7 : loglikelihood = -336825.736
#> VGLM linear loop 8 : loglikelihood = -336823.0039
#> Warning: 58702 diagonal elements of the working weights variable 'wz' have been replaced by 1.819e-12
#> VGLM linear loop 9 : loglikelihood = -336823.0035
#> Warning: 80612 diagonal elements of the working weights variable 'wz' have been replaced by 1.819e-12
#> VGLM linear loop 10 : loglikelihood = -336823.0027
#> Warning: 72330 diagonal elements of the working weights variable 'wz' have been replaced by 1.819e-12
#> VGLM linear loop 11 : loglikelihood = -336823.0021
#> Warning: 49108 diagonal elements of the working weights variable 'wz' have been replaced by 1.819e-12
#> VGLM linear loop 12 : loglikelihood = -336823.0019
#> Warning: 73309 diagonal elements of the working weights variable 'wz' have been replaced by 1.819e-12
#> VGLM linear loop 13 : loglikelihood = -336823.0012
#> Warning: 63757 diagonal elements of the working weights variable 'wz' have been replaced by 1.819e-12
#> VGLM linear loop 14 : loglikelihood = -336823.0008
#> Warning: 63786 diagonal elements of the working weights variable 'wz' have been replaced by 1.819e-12
#> VGLM linear loop 15 : loglikelihood = -336823.0005
#> Warning: 57874 diagonal elements of the working weights variable 'wz' have been replaced by 1.819e-12
#> VGLM linear loop 16 : loglikelihood = -336823.0002
#> Warning: 57642 diagonal elements of the working weights variable 'wz' have been replaced by 1.819e-12
#> VGLM linear loop 17 : loglikelihood = -336823
#> Warning: 44063 diagonal elements of the working weights variable 'wz' have been replaced by 1.819e-12
#> VGLM linear loop 18 : loglikelihood = -336822.9998
#> Warning: 91761 diagonal elements of the working weights variable 'wz' have been replaced by 1.819e-12
#> VGLM linear loop 19 : loglikelihood = -336822.9959
#> Warning: 28117 diagonal elements of the working weights variable 'wz' have been replaced by 1.819e-12
#> VGLM linear loop 20 : loglikelihood = -336822.9958
#> Warning: 66192 diagonal elements of the working weights variable 'wz' have been replaced by 1.819e-12
#> VGLM linear loop 21 : loglikelihood = -336822.9951
#> Warning: 67255 diagonal elements of the working weights variable 'wz' have been replaced by 1.819e-12
#> VGLM linear loop 22 : loglikelihood = -336822.9947
#> Warning: 10721 diagonal elements of the working weights variable 'wz' have been replaced by 1.819e-12
#> VGLM linear loop 23 : loglikelihood = -336822.9947
#> Inferred mean: 0.496, Expected mean: 0.5, Inferred overdispersion (rho): 0
#> Tarones Z-score: -2.587, using binomial model for inference.
#> Using 7 cells for clustering...
#> Clustering chromosome 1
#> Creating CN matrix...
#> Calculating UMAP dimensionality reduction...
#> An error occurred in umap calculation: Initial data contains NA values: is n_components too high?
#> Rerunning UMAP after adding small jitter to data points...
#> Clustering cells using hdbscan...
#> Identified 2 clusters
#> Distribution of clusters:
#> Cluster A: 20
#> Cluster B: 20
#> Clustering chromosome 10
#> Creating CN matrix...
#> Calculating UMAP dimensionality reduction...
#> Clustering cells using hdbscan...
#> Identified 1 clusters
#> Distribution of clusters:
#> Cluster 0: 40
#> Clustering chromosome 11
#> Creating CN matrix...
#> Calculating UMAP dimensionality reduction...
#> Clustering cells using hdbscan...
#> Identified 1 clusters
#> Distribution of clusters:
#> Cluster 0: 40
#> Clustering chromosome 12
#> Creating CN matrix...
#> Calculating UMAP dimensionality reduction...
#> Clustering cells using hdbscan...
#> Identified 1 clusters
#> Distribution of clusters:
#> Cluster 0: 40
#> Clustering chromosome 13
#> Creating CN matrix...
#> Calculating UMAP dimensionality reduction...
#> Clustering cells using hdbscan...
#> Identified 1 clusters
#> Distribution of clusters:
#> Cluster 0: 40
#> Clustering chromosome 14
#> Creating CN matrix...
#> Calculating UMAP dimensionality reduction...
#> Clustering cells using hdbscan...
#> Identified 1 clusters
#> Distribution of clusters:
#> Cluster 0: 40
#> Clustering chromosome 15
#> Creating CN matrix...
#> Calculating UMAP dimensionality reduction...
#> Clustering cells using hdbscan...
#> Identified 1 clusters
#> Distribution of clusters:
#> Cluster 0: 40
#> Clustering chromosome 16
#> Creating CN matrix...
#> Calculating UMAP dimensionality reduction...
#> Clustering cells using hdbscan...
#> Identified 1 clusters
#> Distribution of clusters:
#> Cluster 0: 40
#> Clustering chromosome 17
#> Creating CN matrix...
#> Calculating UMAP dimensionality reduction...
#> Clustering cells using hdbscan...
#> Identified 1 clusters
#> Distribution of clusters:
#> Cluster 0: 40
#> Clustering chromosome 18
#> Creating CN matrix...
#> Calculating UMAP dimensionality reduction...
#> Clustering cells using hdbscan...
#> Identified 1 clusters
#> Distribution of clusters:
#> Cluster 0: 40
#> Clustering chromosome 19
#> Creating CN matrix...
#> Calculating UMAP dimensionality reduction...
#> Clustering cells using hdbscan...
#> Identified 1 clusters
#> Distribution of clusters:
#> Cluster 0: 40
#> Clustering chromosome 2
#> Creating CN matrix...
#> Calculating UMAP dimensionality reduction...
#> Clustering cells using hdbscan...
#> Identified 1 clusters
#> Distribution of clusters:
#> Cluster 0: 40
#> Clustering chromosome 20
#> Creating CN matrix...
#> Calculating UMAP dimensionality reduction...
#> Clustering cells using hdbscan...
#> Identified 1 clusters
#> Distribution of clusters:
#> Cluster 0: 40
#> Clustering chromosome 21
#> Creating CN matrix...
#> Calculating UMAP dimensionality reduction...
#> Clustering cells using hdbscan...
#> Identified 1 clusters
#> Distribution of clusters:
#> Cluster 0: 40
#> Clustering chromosome 22
#> Creating CN matrix...
#> Calculating UMAP dimensionality reduction...
#> Clustering cells using hdbscan...
#> Identified 1 clusters
#> Distribution of clusters:
#> Cluster 0: 40
#> Clustering chromosome 3
#> Creating CN matrix...
#> Calculating UMAP dimensionality reduction...
#> An error occurred in umap calculation: Initial data contains NA values: is n_components too high?
#> Rerunning UMAP after adding small jitter to data points...
#> Clustering cells using hdbscan...
#> Identified 2 clusters
#> Distribution of clusters:
#> Cluster A: 20
#> Cluster B: 20
#> Clustering chromosome 4
#> Creating CN matrix...
#> Calculating UMAP dimensionality reduction...
#> Clustering cells using hdbscan...
#> Identified 1 clusters
#> Distribution of clusters:
#> Cluster 0: 40
#> Clustering chromosome 5
#> Creating CN matrix...
#> Calculating UMAP dimensionality reduction...
#> An error occurred in umap calculation: Initial data contains NA values: is n_components too high?
#> Rerunning UMAP after adding small jitter to data points...
#> Clustering cells using hdbscan...
#> Identified 2 clusters
#> Distribution of clusters:
#> Cluster A: 20
#> Cluster B: 20
#> Clustering chromosome 6
#> Creating CN matrix...
#> Calculating UMAP dimensionality reduction...
#> Clustering cells using hdbscan...
#> Identified 1 clusters
#> Distribution of clusters:
#> Cluster 0: 40
#> Clustering chromosome 7
#> Creating CN matrix...
#> Calculating UMAP dimensionality reduction...
#> Clustering cells using hdbscan...
#> Identified 1 clusters
#> Distribution of clusters:
#> Cluster 0: 40
#> Clustering chromosome 8
#> Creating CN matrix...
#> Calculating UMAP dimensionality reduction...
#> Clustering cells using hdbscan...
#> Identified 1 clusters
#> Distribution of clusters:
#> Cluster 0: 40
#> Clustering chromosome 9
#> Creating CN matrix...
#> Calculating UMAP dimensionality reduction...
#> Clustering cells using hdbscan...
#> Identified 1 clusters
#> Distribution of clusters:
#> Cluster 0: 40
#> Clustering chromosome X
#> Creating CN matrix...
#> Calculating UMAP dimensionality reduction...
#> Clustering cells using hdbscan...
#> Identified 1 clusters
#> Distribution of clusters:
#> Cluster 0: 40
#> Clustering chromosome Y
#> Creating CN matrix...
#> Calculating UMAP dimensionality reduction...
#> Clustering cells using hdbscan...
#> Identified 1 clusters
#> Distribution of clusters:
#> Cluster 0: 40
#> Finding overlapping cell IDs between CN data and haplotype data...
#> Total number of cells in both CN and haplotypes: 40
#> Number of cells in CN data: 40
#> Number of cells in haplotype data: 40
#> Joining bins and haplotypes...
#> Phase haplotypes...
#> Join phased haplotypes...
#> Reorder haplotypes based on phase...
#> Total number of cells after removing cells with < 100 bins: 40
#> Average distance from median to expected BAF = 0.0054
#> Average distance from median to expected BAF = 0.0054