Perform PCA + kNN graph + Leiden clustering on copy number data

This function performs dimensionality reduction via PCA, constructs a k-nearest neighbor graph, and applies the Leiden community detection algorithm to identify cell populations. It can handle both standard copy number data and haplotype-specific copy number (HSCN) data.

leiden_clustering(
  CNbins,
  field = "copy",
  n_pcs = 50,
  k = 15,
  resolution = 0.7,
  z_clip = 10,
  seed = NULL,
  hscn = FALSE,
  objective_function = "modularity",
  tree_type = "centroid"
)

Arguments

CNbins: A data frame containing copy number data. Must include columns for 'cell_id' and the specified `field`.
field: Character. The column name in `CNbins` to use for copy number values. Default is "copy".
n_pcs: Integer. The number of principal components to compute. Default is 50.
k: Integer. The number of nearest neighbors for graph construction. Default is 15.
resolution: Numeric. Resolution parameter for Leiden algorithm (higher = more clusters). Default is 0.7.
z_clip: Numeric. Maximum absolute z-score for clipping scaled data. Default is 10.
seed: Integer or NULL. Random seed for reproducibility. Default is NULL.
hscn: Logical. Whether to use haplotype-specific copy number data. Default is FALSE.
objective_function: Character. Leiden objective function: "modularity" or "CPM". Default is "modularity".
tree_type: Character. Type of phylogenetic tree to generate: "centroid" (flat clusters) or "cell" (hierarchical within clusters). Default is "centroid".

Value

A list containing:

clustering: A data frame with cell_id and clone_id (cluster assignments).
leiden_results: The igraph communities object from Leiden clustering.
pca_results: The prcomp object from PCA.
tree: A phylogenetic tree object (cluster-level or cell-level based on tree_type).

Details

Inspired by community detection approaches developed by Sohrab Salehi. A reference to the Salehi et al. paper on community detection in single-cell genomics is still to be added.

The function performs the following steps:

1. Creates a copy number matrix from the input data.
2. Applies z-score standardization with clipping to handle outliers.
3. Performs PCA dimensionality reduction.
4. Constructs a symmetric k-nearest neighbor graph in PCA space.
5. Applies Leiden community detection algorithm.
6. Generates a phylogenetic tree (either cluster centroids or full cell hierarchy).
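The first four steps listed above can be sketched in base R. This is a minimal illustration under stated assumptions, not the function's actual implementation: the toy data, the `bin` column, the use of `tapply`, and the brute-force neighbor search are all hypothetical choices made for the example. The Leiden step itself is not shown.

```r
# Hypothetical toy input: 6 cells x 4 bins in long format (not real data).
# Cells 1-3 share one copy number profile, cells 4-6 another.
set.seed(1)
cn <- expand.grid(bin = paste0("bin", 1:4),
                  cell_id = paste0("cell", 1:6),
                  stringsAsFactors = FALSE)
cn$copy <- c(rep(c(2, 2, 3, 1), 3), rep(c(4, 3, 2, 5), 3)) +
  rnorm(nrow(cn), sd = 0.05)

# Step 1: cells x bins copy number matrix.
cn_mat <- tapply(cn$copy, list(cn$cell_id, cn$bin), mean)

# Step 2: z-score each bin, then clip extreme values to +/- z_clip.
z_clip <- 10
z <- scale(cn_mat)
z[is.na(z)] <- 0                      # constant bins would give NaN
z <- pmin(pmax(z, -z_clip), z_clip)

# Step 3: PCA on the standardized matrix, keeping the leading components.
n_pcs <- min(2, ncol(z), nrow(z) - 1)
pca <- prcomp(z, center = FALSE, scale. = FALSE)
pcs <- pca$x[, seq_len(n_pcs), drop = FALSE]

# Step 4: symmetric kNN adjacency matrix in PCA space (brute force).
k <- 2
d <- as.matrix(dist(pcs))
diag(d) <- Inf                        # a cell is not its own neighbor
adj <- matrix(0, nrow(d), ncol(d), dimnames = dimnames(d))
for (i in seq_len(nrow(d))) {
  adj[i, order(d[i, ])[seq_len(k)]] <- 1
}
adj <- pmax(adj, t(adj))              # keep an edge if either cell lists the other
```

The resulting `adj` matrix is the kind of object a graph library could turn into an undirected graph for community detection. Symmetrizing with the union of neighbor lists is one common choice; a mutual-neighbor rule (the elementwise minimum) would give a sparser graph.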

If `hscn` is TRUE, the function expects columns 'copy' and 'BAF' in `CNbins`, and creates separate matrices for A and B alleles.
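As an illustration of how allele-specific matrices might be derived from 'copy' and 'BAF', the sketch below assumes that BAF is the B-allele fraction of the total copy number, so that B is approximately `round(copy * BAF)` and A is the remainder. That relationship, the `bin` column, the toy values, and the column-binding of the two matrices are assumptions made for the example; the function's internal handling may differ.

```r
# Hypothetical toy HSCN input: 2 cells x 3 bins (not real data).
hscn_df <- data.frame(
  cell_id = rep(c("cell1", "cell2"), each = 3),
  bin     = rep(paste0("bin", 1:3), times = 2),
  copy    = c(2, 3, 4, 2, 1, 4),
  BAF     = c(0.5, 1 / 3, 0.25, 0, 1, 0.5)
)

# Assumption: BAF = B / (A + B), so allele copies follow from total copy.
hscn_df$B <- round(hscn_df$copy * hscn_df$BAF)
hscn_df$A <- hscn_df$copy - hscn_df$B

# Separate cells x bins matrices for the A and B alleles.
mat_A <- tapply(hscn_df$A, list(hscn_df$cell_id, hscn_df$bin), mean)
mat_B <- tapply(hscn_df$B, list(hscn_df$cell_id, hscn_df$bin), mean)

# One possible way to feed both alleles into PCA: bind them side by side.
combined <- cbind(mat_A, mat_B)
colnames(combined) <- c(paste0(colnames(mat_A), "_A"),
                        paste0(colnames(mat_B), "_B"))
```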

The function automatically adjusts `k` if there are too few cells.

Unlike HDBSCAN (used in umap_clustering), Leiden produces flat cluster assignments, so tree generation uses hierarchical clustering on cluster centroids to create a backbone, then grafts cell subtrees onto it. Both tree types preserve clone blocks in the tree structure:

- `tree_type = "centroid"`: Cells within each cluster form a flat star/polytomy (no within-cluster hierarchy).
- `tree_type = "cell"`: Cells within each cluster are hierarchically organized via hclust.
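The sketch below illustrates the `k` adjustment and the two levels of hierarchy in base R. It is a hedged example, not the function's code: capping `k` at one less than the number of cells, the average linkage method, and the toy PCA coordinates are assumptions. Grafting the cell subtrees onto the backbone to form a single phylogenetic tree is not shown.

```r
# Hypothetical PCA coordinates for 12 cells in 3 well-separated clusters.
set.seed(42)
pcs <- rbind(matrix(rnorm(8, mean = 0,  sd = 0.1), ncol = 2),
             matrix(rnorm(8, mean = 5,  sd = 0.1), ncol = 2),
             matrix(rnorm(8, mean = 10, sd = 0.1), ncol = 2))
rownames(pcs) <- paste0("cell", 1:12)
clone_id <- rep(c("A", "B", "C"), each = 4)   # flat cluster labels

# Assumed form of the k adjustment: a cell has at most n - 1 neighbors.
k <- 15
k <- min(k, nrow(pcs) - 1)

# Backbone: hierarchical clustering on cluster centroids.
# rowsum() and table() both order groups alphabetically, so rows line up.
centroids <- rowsum(pcs, clone_id) / as.vector(table(clone_id))
backbone <- hclust(dist(centroids), method = "average")

# tree_type = "cell" analogue: one hclust per cluster, cells as leaves.
# For tree_type = "centroid", cells would instead attach to their
# cluster's backbone tip as a polytomy, with no within-cluster hclust.
cell_subtrees <- lapply(split(rownames(pcs), clone_id), function(ids) {
  hclust(dist(pcs[ids, , drop = FALSE]), method = "average")
})
```

Because each subtree is built only from cells of one cluster, every clone occupies a contiguous block of leaves once the subtrees are attached to the backbone.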