Call haplotype specific copy number in single cell datasets

callHaplotypeSpecificCN(
  CNbins,
  haplotypes,
  eps = 0.000000000001,
  maskedbins = NULL,
  loherror = 0.02,
  maxCN = NULL,
  selftransitionprob = 0.95,
  progressbar = TRUE,
  ncores = 1,
  phasebyarm = FALSE,
  minfrachaplotypes = 0.7,
  likelihood = "auto",
  minbins = 100,
  minbinschr = 10,
  phased_haplotypes = NULL,
  clustering_method = "copy",
  maxloherror = 0.035,
  mincells = 7,
  overwritemincells = NULL,
  cluster_per_chr = TRUE,
  viterbiver = "cpp",
  filterhaplotypes = 0.1,
  firstpassfiltering = TRUE,
  smoothsingletons = TRUE,
  fillmissing = TRUE,
  global_phasing_for_balanced = FALSE,
  chrs_for_global_phasing = NULL
)

Arguments

CNbins

single cell copy number dataframe with the following columns: cell_id, chr, start, end, state, copy

haplotypes

single cell haplotypes dataframe with the following columns: cell_id, chr, start, end, hap_label, allele1, allele0, totalcounts

eps

default 1e-12

maskedbins

data.frame with columns chr, start and end. These bins will be masked from the inference and copy number states assigned to these bins based on the states of neighbouring bins.

loherror

LOH error rate for initial assignment, this is inferred directly from the data in the second pass, default = 0.02

maxCN

maximum copy number to infer allele specific states, default=NULL which will use the maximum state from CNbins

selftransitionprob

probability to stay in the same state in the HMM, default = 0.95, set to 0.0 for an IID model

progressbar

Boolean to display progressbar or not, default = TRUE, will only show if ncores == 1

ncores

Number of cores to use, default = 1

phasebyarm

Phasing by chromosome arm, default = FALSE

minfrachaplotypes

Minimum proportion of haplotypes to retain when clustering + phasing, default = 0.7

likelihood

Likelihood model for HMM, default is binomial, other option is betabinomial or use auto and the algorithm will choose the likelihood that best fits the data. Default auto

minbins

Minimum number of bins containing both haplotype counts and copy number data for a cell to be included

minbinschr

Minimum number of bins containing both haplotype counts and copy number data per chromosome for a cell to be included

phased_haplotypes

Use this if you want to manually define the haplotypes phasing if for example the default heuristics used by signals does not return a good fit.

clustering_method

Method to use to cluster cells for haplotype phasing, default is copy (using copy column), other option is breakpoints (using breakpoint for clustering)

maxloherror

Maximum value for LOH error rate

mincells

Minimum cluster size used for phasing, default = 7

overwritemincells

Force the number of cells to use for clustering/phasing rather than use the output of the clustering

cluster_per_chr

Whether to cluster per chromosome to rephase alleles or not

filterhaplotypes

filter out haplotypes present in less than X fraction, default is 0.1

firstpassfiltering

Filter out cells with large discrepancy after first pass state assignment

smoothsingletons

Remove singleton bins by smoothing over based on states in adjacent bins

fillmissing

For bins with missing counts fill in values based on neighbouring bins

chrs_for_global_phasing

Which chromosomes to phase using all cells for diploid regions, default is NULL which uses all chromosomes

viterbver

Version of viterbi algorithm to use (cpp or R)

global_phasing_for_diploid

When using cluster_per_chr, use all cells for phasing diploid regions within the cluster

Value

Haplotype specific copy number object

Details

The haplotype specific copy number object include the following additional columns

  • A A allele copy number

  • B B allele copy number

  • state_AS_phased A|B

  • state_min Minor allele copy number

  • LOH =LOH if bin is LOH, NO otherwise

  • state_phase Discretized haplotype specific states

  • phase Whether the A allele or B allele is dominant

  • alleleA Counts for the A allele

  • alleleB Counts for the B allele

  • totalcounts Total number of counts

  • BAF B-allele frequency (alleleB / totalcounts)

Examples

sim_data <- simulate_data_cohort(
  clone_num = c(20, 20),
  clonal_events = list(
    list("1" = c(2, 0), "5" = c(3, 1)),
    list("2" = c(6, 3), "3" = c(1, 0))
  ),
  loherror = 0.02,
  coverage = 100
)
#> Joining with `by = join_by(chr, start, end, hap_label)`

results <- callHaplotypeSpecificCN(sim_data$CNbins, sim_data$haplotypes)
#> Filtering out haplotypes present < 10% of cells...
#> Fraction of haplotypes retained after filtering = 1
#> Finding overlapping cell IDs between CN data and haplotype data...
#> Total number of cells in both CN and haplotypes: 40
#> Number of cells in CN data: 40
#> Number of cells in haplotype data: 40
#> Joining bins and haplotypes...
#> Phase haplotypes...
#> Phasing based on distribution across all cells
#> Join phased haplotypes...
#> Reorder haplotypes based on phase...
#> Total number of cells after removing cells with < 100 bins: 40
#> Warning: `progress_estimated()` was deprecated in dplyr 1.0.0.
#>  The deprecated feature was likely used in the signals package.
#>   Please report the issue to the authors.
#> Removing 0 cells for phasing
#> Fitting beta-binomial model to state: 1|1...
#> VGLM    linear loop  1 :  loglikelihood = -377138.1672
#> VGLM    linear loop  2 :  loglikelihood = -352140.8312
#> VGLM    linear loop  3 :  loglikelihood = -340853.0613
#> VGLM    linear loop  4 :  loglikelihood = -337608.333
#> VGLM    linear loop  5 :  loglikelihood = -336954.2848
#> VGLM    linear loop  6 :  loglikelihood = -336843.9099
#> VGLM    linear loop  7 :  loglikelihood = -336825.736
#> VGLM    linear loop  8 :  loglikelihood = -336823.0039
#> Warning: 58702 diagonal elements of the working weights variable 'wz' have been replaced by 1.819e-12
#> VGLM    linear loop  9 :  loglikelihood = -336823.0035
#> Warning: 80612 diagonal elements of the working weights variable 'wz' have been replaced by 1.819e-12
#> VGLM    linear loop  10 :  loglikelihood = -336823.0027
#> Warning: 72330 diagonal elements of the working weights variable 'wz' have been replaced by 1.819e-12
#> VGLM    linear loop  11 :  loglikelihood = -336823.0021
#> Warning: 49108 diagonal elements of the working weights variable 'wz' have been replaced by 1.819e-12
#> VGLM    linear loop  12 :  loglikelihood = -336823.0019
#> Warning: 73309 diagonal elements of the working weights variable 'wz' have been replaced by 1.819e-12
#> VGLM    linear loop  13 :  loglikelihood = -336823.0012
#> Warning: 63757 diagonal elements of the working weights variable 'wz' have been replaced by 1.819e-12
#> VGLM    linear loop  14 :  loglikelihood = -336823.0008
#> Warning: 63786 diagonal elements of the working weights variable 'wz' have been replaced by 1.819e-12
#> VGLM    linear loop  15 :  loglikelihood = -336823.0005
#> Warning: 57874 diagonal elements of the working weights variable 'wz' have been replaced by 1.819e-12
#> VGLM    linear loop  16 :  loglikelihood = -336823.0002
#> Warning: 57642 diagonal elements of the working weights variable 'wz' have been replaced by 1.819e-12
#> VGLM    linear loop  17 :  loglikelihood = -336823
#> Warning: 44063 diagonal elements of the working weights variable 'wz' have been replaced by 1.819e-12
#> VGLM    linear loop  18 :  loglikelihood = -336822.9998
#> Warning: 91761 diagonal elements of the working weights variable 'wz' have been replaced by 1.819e-12
#> VGLM    linear loop  19 :  loglikelihood = -336822.9959
#> Warning: 28117 diagonal elements of the working weights variable 'wz' have been replaced by 1.819e-12
#> VGLM    linear loop  20 :  loglikelihood = -336822.9958
#> Warning: 66192 diagonal elements of the working weights variable 'wz' have been replaced by 1.819e-12
#> VGLM    linear loop  21 :  loglikelihood = -336822.9951
#> Warning: 67255 diagonal elements of the working weights variable 'wz' have been replaced by 1.819e-12
#> VGLM    linear loop  22 :  loglikelihood = -336822.9947
#> Warning: 10721 diagonal elements of the working weights variable 'wz' have been replaced by 1.819e-12
#> VGLM    linear loop  23 :  loglikelihood = -336822.9947
#> Inferred mean: 0.496, Expected mean: 0.5, Inferred overdispersion (rho): 0
#> Tarones Z-score: -2.587, using binomial model for inference.
#> Using 7 cells for clustering...
#> Clustering chromosome 1
#> Creating CN matrix...
#> Calculating UMAP dimensionality reduction...
#> An error occurred in umap calculation: Initial data contains NA values: is n_components too high?
#> Rerunning UMAP after adding small jitter to data points...
#> Clustering cells using hdbscan...
#> Identified 2 clusters
#> Distribution of clusters:
#>   Cluster A: 20
#>   Cluster B: 20
#> Clustering chromosome 10
#> Creating CN matrix...
#> Calculating UMAP dimensionality reduction...
#> Clustering cells using hdbscan...
#> Identified 1 clusters
#> Distribution of clusters:
#>   Cluster 0: 40
#> Clustering chromosome 11
#> Creating CN matrix...
#> Calculating UMAP dimensionality reduction...
#> Clustering cells using hdbscan...
#> Identified 1 clusters
#> Distribution of clusters:
#>   Cluster 0: 40
#> Clustering chromosome 12
#> Creating CN matrix...
#> Calculating UMAP dimensionality reduction...
#> Clustering cells using hdbscan...
#> Identified 1 clusters
#> Distribution of clusters:
#>   Cluster 0: 40
#> Clustering chromosome 13
#> Creating CN matrix...
#> Calculating UMAP dimensionality reduction...
#> Clustering cells using hdbscan...
#> Identified 1 clusters
#> Distribution of clusters:
#>   Cluster 0: 40
#> Clustering chromosome 14
#> Creating CN matrix...
#> Calculating UMAP dimensionality reduction...
#> Clustering cells using hdbscan...
#> Identified 1 clusters
#> Distribution of clusters:
#>   Cluster 0: 40
#> Clustering chromosome 15
#> Creating CN matrix...
#> Calculating UMAP dimensionality reduction...
#> Clustering cells using hdbscan...
#> Identified 1 clusters
#> Distribution of clusters:
#>   Cluster 0: 40
#> Clustering chromosome 16
#> Creating CN matrix...
#> Calculating UMAP dimensionality reduction...
#> Clustering cells using hdbscan...
#> Identified 1 clusters
#> Distribution of clusters:
#>   Cluster 0: 40
#> Clustering chromosome 17
#> Creating CN matrix...
#> Calculating UMAP dimensionality reduction...
#> Clustering cells using hdbscan...
#> Identified 1 clusters
#> Distribution of clusters:
#>   Cluster 0: 40
#> Clustering chromosome 18
#> Creating CN matrix...
#> Calculating UMAP dimensionality reduction...
#> Clustering cells using hdbscan...
#> Identified 1 clusters
#> Distribution of clusters:
#>   Cluster 0: 40
#> Clustering chromosome 19
#> Creating CN matrix...
#> Calculating UMAP dimensionality reduction...
#> Clustering cells using hdbscan...
#> Identified 1 clusters
#> Distribution of clusters:
#>   Cluster 0: 40
#> Clustering chromosome 2
#> Creating CN matrix...
#> Calculating UMAP dimensionality reduction...
#> Clustering cells using hdbscan...
#> Identified 1 clusters
#> Distribution of clusters:
#>   Cluster 0: 40
#> Clustering chromosome 20
#> Creating CN matrix...
#> Calculating UMAP dimensionality reduction...
#> Clustering cells using hdbscan...
#> Identified 1 clusters
#> Distribution of clusters:
#>   Cluster 0: 40
#> Clustering chromosome 21
#> Creating CN matrix...
#> Calculating UMAP dimensionality reduction...
#> Clustering cells using hdbscan...
#> Identified 1 clusters
#> Distribution of clusters:
#>   Cluster 0: 40
#> Clustering chromosome 22
#> Creating CN matrix...
#> Calculating UMAP dimensionality reduction...
#> Clustering cells using hdbscan...
#> Identified 1 clusters
#> Distribution of clusters:
#>   Cluster 0: 40
#> Clustering chromosome 3
#> Creating CN matrix...
#> Calculating UMAP dimensionality reduction...
#> An error occurred in umap calculation: Initial data contains NA values: is n_components too high?
#> Rerunning UMAP after adding small jitter to data points...
#> Clustering cells using hdbscan...
#> Identified 2 clusters
#> Distribution of clusters:
#>   Cluster A: 20
#>   Cluster B: 20
#> Clustering chromosome 4
#> Creating CN matrix...
#> Calculating UMAP dimensionality reduction...
#> Clustering cells using hdbscan...
#> Identified 1 clusters
#> Distribution of clusters:
#>   Cluster 0: 40
#> Clustering chromosome 5
#> Creating CN matrix...
#> Calculating UMAP dimensionality reduction...
#> An error occurred in umap calculation: Initial data contains NA values: is n_components too high?
#> Rerunning UMAP after adding small jitter to data points...
#> Clustering cells using hdbscan...
#> Identified 2 clusters
#> Distribution of clusters:
#>   Cluster A: 20
#>   Cluster B: 20
#> Clustering chromosome 6
#> Creating CN matrix...
#> Calculating UMAP dimensionality reduction...
#> Clustering cells using hdbscan...
#> Identified 1 clusters
#> Distribution of clusters:
#>   Cluster 0: 40
#> Clustering chromosome 7
#> Creating CN matrix...
#> Calculating UMAP dimensionality reduction...
#> Clustering cells using hdbscan...
#> Identified 1 clusters
#> Distribution of clusters:
#>   Cluster 0: 40
#> Clustering chromosome 8
#> Creating CN matrix...
#> Calculating UMAP dimensionality reduction...
#> Clustering cells using hdbscan...
#> Identified 1 clusters
#> Distribution of clusters:
#>   Cluster 0: 40
#> Clustering chromosome 9
#> Creating CN matrix...
#> Calculating UMAP dimensionality reduction...
#> Clustering cells using hdbscan...
#> Identified 1 clusters
#> Distribution of clusters:
#>   Cluster 0: 40
#> Clustering chromosome X
#> Creating CN matrix...
#> Calculating UMAP dimensionality reduction...
#> Clustering cells using hdbscan...
#> Identified 1 clusters
#> Distribution of clusters:
#>   Cluster 0: 40
#> Clustering chromosome Y
#> Creating CN matrix...
#> Calculating UMAP dimensionality reduction...
#> Clustering cells using hdbscan...
#> Identified 1 clusters
#> Distribution of clusters:
#>   Cluster 0: 40
#> Finding overlapping cell IDs between CN data and haplotype data...
#> Total number of cells in both CN and haplotypes: 40
#> Number of cells in CN data: 40
#> Number of cells in haplotype data: 40
#> Joining bins and haplotypes...
#> Phase haplotypes...
#> Join phased haplotypes...
#> Reorder haplotypes based on phase...
#> Total number of cells after removing cells with < 100 bins: 40
#> Average distance from median to expected BAF = 0.0054
#> Average distance from median to expected BAF = 0.0054