This function takes copy number data, performs UMAP dimensionality reduction, and then applies HDBSCAN clustering to identify cell populations. It can handle both standard copy number data and haplotype-specific copy number (HSCN) data.

umap_clustering(
  CNbins,
  n_neighbors = 10,
  min_dist = 0.1,
  minPts = 30,
  seed = NULL,
  field = "copy",
  umapmetric = "correlation",
  hscn = FALSE,
  pca = NULL
)

Arguments

CNbins

A data frame containing copy number data. Must include columns for 'cell_id' and the specified `field`.

n_neighbors

Integer. The number of neighbors to consider in UMAP. Default is 10.

min_dist

Numeric. The minimum distance between points in UMAP. Default is 0.1.

minPts

Integer. The minimum number of points to form a cluster in HDBSCAN. Default is 30.

seed

Integer or NULL. Random seed for reproducibility. Default is NULL.

field

Character. The column name in `CNbins` to use for copy number values. Default is "copy".

umapmetric

Character. The distance metric to use in UMAP. Default is "correlation".

hscn

Logical. Whether to use haplotype-specific copy number data. Default is FALSE.

pca

Integer or NULL. Number of principal components to use in UMAP. If NULL (the default), PCA is not used.

Value

A list containing:

clustering

A data frame with UMAP coordinates and cluster assignments for each cell.

hdbscanresults

The results of the HDBSCAN clustering.

umapresults

The results of the UMAP dimensionality reduction.

tree

A phylogenetic tree object representing the hierarchical structure of the clusters.

Details

The function performs the following steps:

1. Creates a copy number matrix from the input data.
2. Applies UMAP dimensionality reduction.
3. Performs HDBSCAN clustering on the UMAP results.
4. Generates a phylogenetic tree from the clustering results.
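The first step can be pictured as a reshape from long format (one row per cell and bin) to a wide cell-by-bin matrix. The sketch below uses base R only and is not the function's internal code. The `chr`, `start`, and `end` columns used to label bins are assumptions; this page only names `cell_id` and `field`.

```r
# Toy long-format input: one row per (cell, bin).
# chr/start/end are assumed bin coordinates, not confirmed by this page.
CNbins <- data.frame(
  cell_id = rep(c("cell_A", "cell_B", "cell_C"), each = 4),
  chr     = rep(c("1", "1", "2", "2"), times = 3),
  start   = rep(c(1L, 500001L, 1L, 500001L), times = 3),
  end     = rep(c(500000L, 1000000L, 500000L, 1000000L), times = 3),
  copy    = c(2.1, 1.9, 3.0, 3.2,
              2.0, 2.0, 1.1, 0.9,
              4.1, 3.8, 2.0, 2.2),
  stringsAsFactors = FALSE
)

field <- "copy"

# Label each bin; the factor levels keep bins in their input order
bin_id <- sprintf("%s_%d_%d", CNbins$chr, CNbins$start, CNbins$end)
bin_id <- factor(bin_id, levels = unique(bin_id))

# Cells in rows, bins in columns; missing (cell, bin) pairs become NA
cn_matrix <- tapply(CNbins[[field]], list(CNbins$cell_id, bin_id), mean)

dim(cn_matrix)
```

Each row of `cn_matrix` is then one cell's copy number profile, which is the shape a dimensionality reduction step expects.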

If `hscn` is TRUE, the function expects columns 'copy' and 'BAF' in `CNbins`, and creates separate matrices for A and B alleles.
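One way to read "separate matrices for A and B alleles" is sketched below. It assumes that the B allele takes the `BAF` share of the total copy number and the A allele takes the remainder, and that the two matrices are then placed side by side as features. Both points are assumptions made for illustration; this page does not state the exact formula or how the matrices are combined.

```r
# Toy cell-by-bin matrices of total copy number and B-allele frequency
copy_mat <- matrix(c(2, 3, 4, 2,
                     2, 2, 1, 1),
                   nrow = 2, byrow = TRUE,
                   dimnames = list(c("cell_A", "cell_B"), paste0("bin", 1:4)))
baf_mat <- matrix(c(0.5, 1/3, 0.25, 0.5,
                    0.5, 0.5, 0,    0),
                  nrow = 2, byrow = TRUE,
                  dimnames = dimnames(copy_mat))

# Assumed split: B = copy * BAF, A = copy - B
B_mat <- copy_mat * baf_mat
A_mat <- copy_mat - B_mat

# Assumed combination: bind the allele matrices column-wise
colnames(A_mat) <- paste0(colnames(copy_mat), "_A")
colnames(B_mat) <- paste0(colnames(copy_mat), "_B")
features <- cbind(A_mat, B_mat)

dim(features)
```

Under this reading, two cells with the same total copy number but different allelic balance get different feature rows, which is what allows haplotype-specific events to separate clusters.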

The function automatically adjusts `n_neighbors` if there are too few cells. If UMAP fails, it attempts to rerun with small jitter added to the data points. The function will reduce `minPts` if only one cluster is initially found.
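The three fallbacks described above can be sketched with small base R helpers. All helper names (`adjust_n_neighbors`, `embed_with_jitter_retry`, `cluster_with_minpts_backoff`), the capping rule, the jitter size, and the halving schedule are hypothetical; this page does not give the exact rules the function uses. The embedding and clustering steps are passed in as functions, so stand-ins are used here in place of real UMAP and HDBSCAN calls.

```r
# Hypothetical rule: n_neighbors may not exceed the number of other cells
adjust_n_neighbors <- function(n_neighbors, n_cells) {
  min(n_neighbors, n_cells - 1)
}

# Hypothetical helper: try embed_fn(x); on error, retry once on jittered data
embed_with_jitter_retry <- function(x, embed_fn, jitter_sd = 1e-6) {
  tryCatch(
    embed_fn(x),
    error = function(e) {
      noise <- matrix(rnorm(length(x), sd = jitter_sd), nrow = nrow(x))
      embed_fn(x + noise)
    }
  )
}

# Hypothetical helper: lower minPts until more than one cluster is found.
# cluster_fn(embedding, minPts) returns one integer label per row; label 0 is
# treated as noise, following the convention of dbscan::hdbscan.
cluster_with_minpts_backoff <- function(embedding, cluster_fn, minPts,
                                        shrink = 0.5, floor_minPts = 2) {
  repeat {
    labels <- cluster_fn(embedding, minPts)
    n_clusters <- length(setdiff(unique(labels), 0))
    if (n_clusters > 1 || minPts <= floor_minPts) {
      return(list(labels = labels, minPts = minPts))
    }
    minPts <- max(floor_minPts, floor(minPts * shrink))
  }
}

# Stand-ins for demonstration only
set.seed(1)
x <- matrix(rnorm(40), nrow = 10)

calls <- 0
flaky_embed <- function(x) {
  calls <<- calls + 1
  if (calls == 1) stop("simulated UMAP failure")  # first attempt fails
  x[, 1:2]
}
toy_cluster <- function(embedding, minPts) {
  if (minPts > 10) rep(1L, nrow(embedding))        # one cluster only
  else rep(c(1L, 2L), length.out = nrow(embedding))
}

k <- adjust_n_neighbors(10, n_cells = nrow(x))
embedding <- embed_with_jitter_retry(x, flaky_embed)
res <- cluster_with_minpts_backoff(embedding, toy_cluster, minPts = 30)
```

In this toy run the first embedding attempt fails and the jittered retry succeeds, and `minPts` is lowered from 30 until the stand-in clusterer reports two clusters.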