Perform UMAP dimensionality reduction and HDBSCAN clustering on copy number data

This function takes copy number data, performs UMAP dimensionality reduction, and then applies HDBSCAN clustering to identify cell populations. It can handle both standard copy number data and haplotype-specific copy number (HSCN) data.

umap_clustering(
  CNbins,
  n_neighbors = 10,
  min_dist = 0.1,
  minPts = 30,
  seed = NULL,
  field = "copy",
  umapmetric = "correlation",
  hscn = FALSE,
  pca = NULL
)

Arguments

CNbins: A data frame containing copy number data. Must include columns for 'cell_id' and the specified `field`.
n_neighbors: Integer. The number of neighbors to consider in UMAP. Default is 10.
min_dist: Numeric. The minimum distance between points in UMAP. Default is 0.1.
minPts: Integer. The minimum number of points to form a cluster in HDBSCAN. Default is 30.
seed: Integer or NULL. Random seed for reproducibility. Default is NULL.
field: Character. The column name in `CNbins` to use for copy number values. Default is "copy".
umapmetric: Character. The distance metric to use in UMAP. Default is "correlation".
hscn: Logical. Whether to use haplotype-specific copy number data. Default is FALSE.
pca: Integer or NULL. Number of principal components to use in UMAP. If NULL (the default), PCA is not used.
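As a rough illustration of the input that `CNbins` describes, the sketch below builds a small long-format data frame and reshapes it into a cells-by-bins matrix using base R. Only `cell_id` and the `field` column are documented above; the `chr`, `start` and `end` columns and the reshaping approach are assumptions made for this example, not the function's internals.

```r
# Toy long-format input: one row per (cell, bin).
# chr/start/end are assumed bin identifiers, used here for illustration only.
CNbins <- data.frame(
  cell_id = rep(c("cell_A", "cell_B", "cell_C"), each = 4),
  chr     = "1",
  start   = rep(c(1L, 500001L, 1000001L, 1500001L), times = 3),
  end     = rep(c(500000L, 1000000L, 1500000L, 2000000L), times = 3),
  copy    = c(2.0, 2.1, 3.0, 2.9,
              1.9, 2.0, 3.1, 3.0,
              2.0, 2.0, 1.1, 0.9)
)

field  <- "copy"
bin_id <- paste(CNbins$chr, CNbins$start, CNbins$end, sep = "_")

# Cells as rows, bins as columns; mean() is a no-op with one value per cell/bin
cn_matrix <- tapply(CNbins[[field]], list(CNbins$cell_id, bin_id), mean)

# tapply sorts columns alphabetically, so restore the original bin order
cn_matrix <- cn_matrix[, unique(bin_id), drop = FALSE]
```

Each row of `cn_matrix` is then one cell's copy number profile, which is the shape a dimensionality reduction step would typically expect.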

Value

A list containing:

clustering: A data frame with UMAP coordinates and cluster assignments for each cell.
hdbscanresults: The results of the HDBSCAN clustering.
umapresults: The results of the UMAP dimensionality reduction.
tree: A phylogenetic tree object representing the hierarchical structure of the clusters.

Details

The function performs the following steps:

1. Creates a copy number matrix from the input data.
2. Applies UMAP dimensionality reduction.
3. Performs HDBSCAN clustering on the UMAP results.
4. Generates a phylogenetic tree from the clustering results.
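Steps 2 and 3 can be sketched on simulated data as follows. This is a minimal illustration, not the function's implementation: the `uwot` and `dbscan` packages are assumed stand-ins for the UMAP and HDBSCAN steps (the text above does not name the packages used), the simulated matrix replaces step 1, the output column names are hypothetical, and the tree in step 4 is omitted. The block is skipped if either package is not installed.

```r
# Sketch only: uwot and dbscan are assumed stand-ins, not confirmed internals.
if (requireNamespace("uwot", quietly = TRUE) &&
    requireNamespace("dbscan", quietly = TRUE)) {

  set.seed(1)

  # Step 1 stand-in: a cells x bins matrix with two simulated populations
  n_bins <- 50
  pop1 <- matrix(rnorm(40 * n_bins, mean = 2, sd = 0.1), nrow = 40)
  pop2 <- matrix(rnorm(40 * n_bins, mean = 2, sd = 0.1), nrow = 40)
  pop2[, 1:20] <- pop2[, 1:20] + 1   # simulated gain on the first 20 bins
  cn_matrix <- rbind(pop1, pop2)
  rownames(cn_matrix) <- paste0("cell_", seq_len(nrow(cn_matrix)))

  # Step 2: UMAP embedding with the documented default parameters
  embedding <- uwot::umap(cn_matrix, n_neighbors = 10, min_dist = 0.1,
                          metric = "correlation")

  # Step 3: HDBSCAN on the embedding; cluster 0 denotes noise points
  hdb <- dbscan::hdbscan(embedding, minPts = 10)

  # Column names below are hypothetical, chosen for this example
  clustering <- data.frame(cell_id = rownames(cn_matrix),
                           umap1   = embedding[, 1],
                           umap2   = embedding[, 2],
                           cluster = hdb$cluster)
  print(table(clustering$cluster))
}
```

Clustering on the two-dimensional embedding rather than on the full matrix keeps HDBSCAN fast, at the cost of depending on how well UMAP preserves the structure of the data.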

If `hscn` is TRUE, the function expects columns 'copy' and 'BAF' in `CNbins`, and creates separate matrices for A and B alleles.
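A minimal sketch of what separate A and B allele matrices could look like is given below. It assumes the common convention that the B-allele copy number is `copy * BAF` and the A-allele copy number is the remainder; neither this convention nor the side-by-side combination of the two matrices is confirmed by the text above. The `bin_id` column and the `to_matrix` helper are hypothetical.

```r
# Toy HSCN input; bin_id is a hypothetical stand-in for genomic coordinates
CNbins <- data.frame(
  cell_id = rep(c("cell_A", "cell_B"), each = 3),
  bin_id  = rep(c("bin1", "bin2", "bin3"), times = 2),
  copy    = c(2, 3, 4, 2, 2, 1),
  BAF     = c(0.5, 1/3, 0.25, 0.5, 0, 1)
)

# Assumed convention: B = copy * BAF, A = copy - B
CNbins$B <- CNbins$copy * CNbins$BAF
CNbins$A <- CNbins$copy - CNbins$B

# Hypothetical helper: long format to a cells x bins matrix
to_matrix <- function(df, value) {
  tapply(df[[value]], list(df$cell_id, df$bin_id), mean)
}
A_matrix <- to_matrix(CNbins, "A")
B_matrix <- to_matrix(CNbins, "B")

# One possible way to use both alleles: place the matrices side by side
colnames(A_matrix) <- paste0(colnames(A_matrix), "_A")
colnames(B_matrix) <- paste0(colnames(B_matrix), "_B")
hscn_matrix <- cbind(A_matrix, B_matrix)
```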

The function automatically adjusts `n_neighbors` if there are too few cells. If UMAP fails, it attempts to rerun with small jitter added to the data points. The function will reduce `minPts` if only one cluster is initially found.
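The three fallbacks can be illustrated with small base R helpers. The helper names, the cap of `n_cells - 1`, the noise scale, and the halving rule for `minPts` are all assumptions chosen for this sketch; the exact rules the function applies are not stated above.

```r
# Hypothetical helpers illustrating the three fallbacks described above.

# 1. Cap n_neighbors so it never exceeds the number of other cells.
cap_n_neighbors <- function(n_neighbors, n_cells) {
  min(n_neighbors, n_cells - 1L)
}

# 2. Try an embedding function; on error, retry once on jittered data.
embed_with_retry <- function(x, embed_fn, seed = NULL) {
  tryCatch(
    embed_fn(x),
    error = function(e) {
      if (!is.null(seed)) set.seed(seed)
      noise <- matrix(rnorm(length(x), sd = 1e-4), nrow = nrow(x))
      embed_fn(x + noise)
    }
  )
}

# 3. Lower minPts until more than one cluster appears or a lower limit is hit.
cluster_with_shrinking_minPts <- function(x, cluster_fn, minPts,
                                          min_allowed = 2L) {
  repeat {
    labels <- cluster_fn(x, minPts)
    # Label 0 is treated as noise, following the HDBSCAN convention
    n_clusters <- length(setdiff(unique(labels), 0L))
    if (n_clusters > 1L || minPts <= min_allowed) break
    minPts <- max(min_allowed, minPts %/% 2L)  # halving is an assumed rule
  }
  list(labels = labels, minPts = minPts)
}
```

Here `embed_fn` and `cluster_fn` are placeholders for whatever UMAP and HDBSCAN implementations are in use, so the retry logic can be exercised without either dependency.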