Beta-Binomial model for signals • signals

library(signals)

Background

This vignette demonstrates how to use a beta-binomial model instead of a binomial model for the HMM emission model.

Inference

data("haplotypes")
data("CNbins")
haplotypes <- format_haplotypes_dlp(haplotypes, CNbins)
hscn <- callHaplotypeSpecificCN(CNbins, haplotypes, likelihood = "binomial")
#> Error in get(paste0(generic, ".", class), envir = get_method_env()) : 
#>   object 'type_sum.accel' not found

print(hscn)
#> Haplotype specific copy number object 
#>  
#> Number of cells: 250
#> Bin size: 0.5 Mb 
#> Inferred LOH error rate: 0.018 
#> Emission model for HMM: binomial 
#> Average distance from median to expected BAF = 0.0025 
#> Average ploidy = 4 
#> Average number of segments = 79

hscn_bb <- callHaplotypeSpecificCN(CNbins, haplotypes, likelihood = "betabinomial")
#> VGLM    linear loop  1 :  loglikelihood = -313153.5548
#> VGLM    linear loop  2 :  loglikelihood = -309219.4605
#> VGLM    linear loop  3 :  loglikelihood = -308675.2147
#> VGLM    linear loop  4 :  loglikelihood = -308650.0018
#> VGLM    linear loop  5 :  loglikelihood = -308649.6334
#> VGLM    linear loop  6 :  loglikelihood = -308649.6301
#> VGLM    linear loop  7 :  loglikelihood = -308649.6301

print(hscn_bb)
#> Haplotype specific copy number object 
#>  
#> Number of cells: 250
#> Bin size: 0.5 Mb 
#> Inferred LOH error rate: 0.018 
#> Emission model for HMM: betabinomial 
#>   Inferred over dispersion: 0.0097
#>   Tarones Z score: 182.22
#> Average distance from median to expected BAF = 0.0026 
#> Average ploidy = 4 
#> Average number of segments = 79

We see that the model inferred a modest degree of overdispersion in the data, but that this was siginificantly different from 0 dispersion (Tarones Z-score is high).

Below I’ll plot the heatmaps for the two output, which in this case are very similar.

Binomial model

plotHeatmap(hscn, plotcol = "state_BAF", plottree = FALSE, spacer_cols = 15)

Beta-Binomial model

plotHeatmap(hscn_bb, plotcol = "state_BAF", plottree = FALSE, spacer_cols = 15)

QC plot

To interrogate the difference between the binomial and beta-binomial models we can plot the BAF and overlay the two models.

plotBBfit(hscn_bb)

We can check how similar the results are by comparing the two dataframes. We find for this data that the results are very similar.

print(dim(hscn$data))
#> [1] 1090750      20
print(dim(hscn_bb$data))
#> [1] 1090750      20
all.equal(orderdf(hscn$data), orderdf(hscn_bb$data))
#>  [1] "Component \"alleleA\": 'is.NA' value mismatch: 14542 in current 14544 in target"    
#>  [2] "Component \"alleleB\": 'is.NA' value mismatch: 14542 in current 14544 in target"    
#>  [3] "Component \"totalcounts\": 'is.NA' value mismatch: 14542 in current 14544 in target"
#>  [4] "Component \"BAF\": 'is.NA' value mismatch: 14542 in current 14544 in target"        
#>  [5] "Component \"state_min\": Mean relative difference: 0.6081683"                       
#>  [6] "Component \"A\": Mean relative difference: 0.6313584"                               
#>  [7] "Component \"B\": Mean relative difference: 0.641918"                                
#>  [8] "Component \"state_AS_phased\": 33293 string mismatches"                             
#>  [9] "Component \"state_AS\": 2828 string mismatches"                                     
#> [10] "Component \"LOH\": 1821 string mismatches"                                          
#> [11] "Component \"phase\": 30305 string mismatches"                                       
#> [12] "Component \"state_phase\": 31273 string mismatches"                                 
#> [13] "Component \"state_BAF\": Mean relative difference: 0.8458283"

Another thing we can check is the total number of segments identified by the HMM. Using the beta-binomial model should reduce the influence of any noisy regions and thus we might expect fewer segments. We do observe fewer segments in the beta-binomial model.

segs <- create_segments(hscn$data, field = "state_AS_phased")

segs_bb <- create_segments(hscn_bb$data, field = "state_AS_phased")

print(dim(segs))
#> [1] 38686     5
print(dim(segs_bb))
#> [1] 35105     5

Summary

With this dataset, the difference between the nbinomial model and beta-binomial model is minimal but produces a statistically superior fit, in lower quality datasets with for example lower depth of coverage the beta-binomial model may may provide more noticeable differences.