Call haplotype specific copy number in single cell datasets

callHaplotypeSpecificCN(
  CNbins,
  haplotypes,
  eps = 1e-12,
  maskedbins = NULL,
  loherror = 0.02,
  maxCN = NULL,
  selftransitionprob = 0.95,
  progressbar = TRUE,
  ncores = 1,
  phasebyarm = FALSE,
  minfrachaplotypes = 0.7,
  likelihood = "auto",
  minbins = 0,
  minbinschr = 0,
  phased_haplotypes = NULL,
  clustering_method = "copy",
  maxloherror = 0.035,
  mincells = 7,
  overwritemincells = NULL,
  cluster_per_chr = TRUE,
  viterbiver = "cpp",
  filterhaplotypes = 0.1,
  firstpassfiltering = TRUE,
  smoothsingletons = TRUE,
  fillmissing = TRUE,
  global_phasing_for_balanced = FALSE,
  chr_cell_list = NULL,
  chrs_for_global_phasing = NULL,
  female = TRUE
)

Arguments

CNbins: single cell copy number dataframe with the following columns: cell_id, chr, start, end, state, copy
haplotypes: single cell haplotypes dataframe with the following columns: cell_id, chr, start, end, hap_label, allele1, allele0, totalcounts
eps: default 1e-12
maskedbins: data.frame with columns chr, start and end. These bins will be masked from the inference and copy number states assigned to these bins based on the states of neighbouring bins.
loherror: LOH error rate for initial assignment, this is inferred directly from the data in the second pass, default = 0.02
maxCN: maximum copy number to infer allele specific states, default=NULL which will use the maximum state from CNbins
selftransitionprob: probability to stay in the same state in the HMM, default = 0.95, set to 0.0 for an IID model
progressbar: Boolean to display progressbar or not, default = TRUE, will only show if ncores == 1
ncores: Number of cores to use, default = 1
phasebyarm: Phasing by chromosome arm, default = FALSE
minfrachaplotypes: Minimum proportion of haplotypes to retain when clustering + phasing, default = 0.7
likelihood: Likelihood model for HMM, default is binomial, other option is betabinomial or use auto and the algorithm will choose the likelihood that best fits the data. Default auto
minbins: Minimum number of bins containing both haplotype counts and copy number data for a cell to be included
minbinschr: Minimum number of bins containing both haplotype counts and copy number data per chromosome for a cell to be included
phased_haplotypes: Use this if you want to manually define the haplotypes phasing if for example the default heuristics used by signals does not return a good fit.
clustering_method: Method to use to cluster cells for haplotype phasing, default is copy (using copy column), other option is breakpoints (using breakpoint for clustering)
maxloherror: Maximum value for LOH error rate
mincells: Minimum cluster size used for phasing, default = 7
overwritemincells: Force the number of cells to use for clustering/phasing rather than use the output of the clustering
cluster_per_chr: Whether to cluster per chromosome to rephase alleles or not
filterhaplotypes: filter out haplotypes present in less than X fraction, default is 0.1
firstpassfiltering: Filter out cells with large discrepancy after first pass state assignment
smoothsingletons: Remove singleton bins by smoothing over based on states in adjacent bins
fillmissing: For bins with missing counts fill in values based on neighbouring bins, this ensures that the returned object is the same size as input CNbins
chr_cell_list: Cells to use for phasing for each chromosome, this should be a named list with a vector of cell_ids for each chromosome eg list("1" = c("cell_id1", "cell_id2)) etc. Default is null. If provided overrides internal phasing.
chrs_for_global_phasing: Which chromosomes to phase using all cells for diploid regions, default is NULL which uses all chromosomes
female: Default is TRUE, if set to FALSE and patient is "XY", X chromosome states are set to A|0 where A=Hmmcopy state
viterbver: Version of viterbi algorithm to use (cpp or R)
global_phasing_for_diploid: When using cluster_per_chr, use all cells for phasing diploid regions within the cluster

Value

Haplotype specific copy number object

Details

The haplotype specific copy number object include the following additional columns

A A allele copy number
B B allele copy number
state_AS_phased A|B
state_min Minor allele copy number
LOH =LOH if bin is LOH, NO otherwise
state_phase Discretized haplotype specific states
phase Whether the A allele or B allele is dominant
alleleA Counts for the A allele
alleleB Counts for the B allele
totalcounts Total number of counts
BAF B-allele frequency (alleleB / totalcounts)

Examples

sim_data <- simulate_data_cohort(
  clone_num = c(20, 20),
  clonal_events = list(
    list("1" = c(2, 0), "5" = c(3, 1)),
    list("2" = c(6, 3), "3" = c(1, 0))
  ),
  loherror = 0.02,
  coverage = 100
)
#> Joining with `by = join_by(chr, start, end, hap_label)`

results <- callHaplotypeSpecificCN(sim_data$CNbins, sim_data$haplotypes)
#> Filtering out haplotypes present < 10% of cells...
#> Fraction of haplotypes retained after filtering = 1
#> Finding overlapping cell IDs between CN data and haplotype data...
#> Total number of cells in both CN and haplotypes: 40
#> Number of cells in CN data: 40
#> Number of cells in haplotype data: 40
#> Joining bins and haplotypes...
#> Phase haplotypes...
#> Phasing based on distribution across all cells
#> Join phased haplotypes...
#> Reorder haplotypes based on phase...
#> Total number of cells after removing cells with < 0 bins: 40
#> Warning: `progress_estimated()` was deprecated in dplyr 1.0.0.
#> ℹ The deprecated feature was likely used in the signals package.
#>   Please report the issue to the authors.
#> Removing 0 cells for phasing
#> Fitting beta-binomial model to state: 1|1...
#> VGLM    linear loop  1 :  loglikelihood = -377038.2394
#> VGLM    linear loop  2 :  loglikelihood = -351943.5317
#> VGLM    linear loop  3 :  loglikelihood = -340567.9024
#> VGLM    linear loop  4 :  loglikelihood = -337276.3768
#> VGLM    linear loop  5 :  loglikelihood = -336604.0917
#> VGLM    linear loop  6 :  loglikelihood = -336487.3401
#> VGLM    linear loop  7 :  loglikelihood = -336467.5816
#> VGLM    linear loop  8 :  loglikelihood = -336465.6372
#> Warning: 27927 diagonal elements of the working weights variable 'wz' have been replaced by 1.819e-12
#> VGLM    linear loop  9 :  loglikelihood = -336465.6369
#> Warning: 27104 diagonal elements of the working weights variable 'wz' have been replaced by 1.819e-12
#> VGLM    linear loop  10 :  loglikelihood = -336465.6363
#> Warning: 41027 diagonal elements of the working weights variable 'wz' have been replaced by 1.819e-12
#> VGLM    linear loop  11 :  loglikelihood = -336465.6366
#> Taking a modified step.
#> VGLM    linear loop  11 :  loglikelihood = -336465.6353
#> Warning: 97402 diagonal elements of the working weights variable 'wz' have been replaced by 1.819e-12
#> VGLM    linear loop  12 :  loglikelihood = -336465.6364
#> Taking a modified step..
#> VGLM    linear loop  12 :  loglikelihood = -336465.6353
#> Warning: some quantities such as z, residuals, SEs may be inaccurate due to convergence at a half-step
#> Inferred mean: 0.496, Expected mean: 0.5, Inferred overdispersion (rho): 0
#> Tarones Z-score: -4.112, using binomial model for inference.
#> Using 7 cells for clustering...
#> Clustering chromosome 1
#> Creating CN matrix...
#> Calculating UMAP dimensionality reduction...
#> Clustering cells using hdbscan...
#> Identified 2 clusters
#> Distribution of clusters:
#>   Cluster A: 20
#>   Cluster B: 20
#> Clustering chromosome 10
#> Creating CN matrix...
#> Calculating UMAP dimensionality reduction...
#> Clustering cells using hdbscan...
#> Identified 1 clusters
#> Distribution of clusters:
#>   Cluster 0: 40
#> Clustering chromosome 11
#> Creating CN matrix...
#> Calculating UMAP dimensionality reduction...
#> Clustering cells using hdbscan...
#> Identified 1 clusters
#> Distribution of clusters:
#>   Cluster 0: 40
#> Clustering chromosome 12
#> Creating CN matrix...
#> Calculating UMAP dimensionality reduction...
#> Clustering cells using hdbscan...
#> Identified 1 clusters
#> Distribution of clusters:
#>   Cluster 0: 40
#> Clustering chromosome 13
#> Creating CN matrix...
#> Calculating UMAP dimensionality reduction...
#> Clustering cells using hdbscan...
#> Identified 1 clusters
#> Distribution of clusters:
#>   Cluster 0: 40
#> Clustering chromosome 14
#> Creating CN matrix...
#> Calculating UMAP dimensionality reduction...
#> Clustering cells using hdbscan...
#> Identified 1 clusters
#> Distribution of clusters:
#>   Cluster 0: 40
#> Clustering chromosome 15
#> Creating CN matrix...
#> Calculating UMAP dimensionality reduction...
#> Clustering cells using hdbscan...
#> Identified 1 clusters
#> Distribution of clusters:
#>   Cluster 0: 40
#> Clustering chromosome 16
#> Creating CN matrix...
#> Calculating UMAP dimensionality reduction...
#> Clustering cells using hdbscan...
#> Identified 1 clusters
#> Distribution of clusters:
#>   Cluster 0: 40
#> Clustering chromosome 17
#> Creating CN matrix...
#> Calculating UMAP dimensionality reduction...
#> Clustering cells using hdbscan...
#> Identified 1 clusters
#> Distribution of clusters:
#>   Cluster 0: 40
#> Clustering chromosome 18
#> Creating CN matrix...
#> Calculating UMAP dimensionality reduction...
#> Clustering cells using hdbscan...
#> Identified 1 clusters
#> Distribution of clusters:
#>   Cluster 0: 40
#> Clustering chromosome 19
#> Creating CN matrix...
#> Calculating UMAP dimensionality reduction...
#> Clustering cells using hdbscan...
#> Identified 1 clusters
#> Distribution of clusters:
#>   Cluster 0: 40
#> Clustering chromosome 2
#> Creating CN matrix...
#> Calculating UMAP dimensionality reduction...
#> Clustering cells using hdbscan...
#> Identified 1 clusters
#> Distribution of clusters:
#>   Cluster 0: 40
#> Clustering chromosome 20
#> Creating CN matrix...
#> Calculating UMAP dimensionality reduction...
#> Clustering cells using hdbscan...
#> Identified 1 clusters
#> Distribution of clusters:
#>   Cluster 0: 40
#> Clustering chromosome 21
#> Creating CN matrix...
#> Calculating UMAP dimensionality reduction...
#> Clustering cells using hdbscan...
#> Identified 1 clusters
#> Distribution of clusters:
#>   Cluster 0: 40
#> Clustering chromosome 22
#> Creating CN matrix...
#> Calculating UMAP dimensionality reduction...
#> Clustering cells using hdbscan...
#> Identified 1 clusters
#> Distribution of clusters:
#>   Cluster 0: 40
#> Clustering chromosome 3
#> Creating CN matrix...
#> Calculating UMAP dimensionality reduction...
#> Clustering cells using hdbscan...
#> Identified 2 clusters
#> Distribution of clusters:
#>   Cluster A: 20
#>   Cluster B: 20
#> Clustering chromosome 4
#> Creating CN matrix...
#> Calculating UMAP dimensionality reduction...
#> Clustering cells using hdbscan...
#> Identified 1 clusters
#> Distribution of clusters:
#>   Cluster 0: 40
#> Clustering chromosome 5
#> Creating CN matrix...
#> Calculating UMAP dimensionality reduction...
#> Clustering cells using hdbscan...
#> Identified 2 clusters
#> Distribution of clusters:
#>   Cluster A: 20
#>   Cluster B: 20
#> Clustering chromosome 6
#> Creating CN matrix...
#> Calculating UMAP dimensionality reduction...
#> Clustering cells using hdbscan...
#> Identified 1 clusters
#> Distribution of clusters:
#>   Cluster 0: 40
#> Clustering chromosome 7
#> Creating CN matrix...
#> Calculating UMAP dimensionality reduction...
#> Clustering cells using hdbscan...
#> Identified 1 clusters
#> Distribution of clusters:
#>   Cluster 0: 40
#> Clustering chromosome 8
#> Creating CN matrix...
#> Calculating UMAP dimensionality reduction...
#> Clustering cells using hdbscan...
#> Identified 1 clusters
#> Distribution of clusters:
#>   Cluster 0: 40
#> Clustering chromosome 9
#> Creating CN matrix...
#> Calculating UMAP dimensionality reduction...
#> Clustering cells using hdbscan...
#> Identified 1 clusters
#> Distribution of clusters:
#>   Cluster 0: 40
#> Clustering chromosome X
#> Creating CN matrix...
#> Calculating UMAP dimensionality reduction...
#> Clustering cells using hdbscan...
#> Identified 1 clusters
#> Distribution of clusters:
#>   Cluster 0: 40
#> Finding overlapping cell IDs between CN data and haplotype data...
#> Total number of cells in both CN and haplotypes: 40
#> Number of cells in CN data: 40
#> Number of cells in haplotype data: 40
#> Joining bins and haplotypes...
#> Phase haplotypes...
#> Join phased haplotypes...
#> Reorder haplotypes based on phase...
#> Total number of cells after removing cells with < 0 bins: 40
#> Average distance from median to expected BAF = 0.0054
#> Average distance from median to expected BAF = 0.0054