This project implements Bayesian nonparametric clustering with MCMC samplers. It provides a framework for clustering analysis with the Dirichlet Process (DP), the Normalized Generalized Gamma Process (NGGP), and their weighted spatial variants (DPW, NGGPW), combining R for data analysis with a C++ backend.
Documentation: Doxygen API Reference
Research Focus: Bayesian nonparametric models with spatial dependencies
Performance: Optimized C++ backend with R interface
Key Features
Advanced Clustering Models
- Dirichlet Process (DP): Classic nonparametric clustering with automatic cluster discovery
- Normalized Generalized Gamma Process (NGGP): Enhanced flexibility and control over cluster structures
- Weighted Variants (DPW, NGGPW): Spatial dependency integration for geographic/network data
- Automatic Model Selection: Data-driven hyperparameter estimation
State-of-the-Art MCMC Samplers
- Neal's Algorithm 3: Efficient collapsed Gibbs sampling for standard applications
- Split-Merge: Advanced joint updates for improved mixing and faster convergence
- SAMS (Sequentially Allocated Merge-Split): Optimized proposal generation for large-scale problems
- Hybrid Approaches: Flexible combination of sampling strategies
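As a rough illustration of the collapsed Gibbs update behind Neal's Algorithm 3, the sketch below computes the allocation probabilities for a single held-out observation under a Dirichlet Process with total mass a. The helper name crp_allocation_probs and its argument layout are assumptions made for this example; they are not taken from neal.hpp, and the per-cluster log-likelihoods are supplied by the caller rather than computed from distances.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical helper (not part of the project): allocation probabilities for
// one held-out observation under a DP with total mass `a`.
// cluster_sizes[k] is the size of existing cluster k with the observation removed.
// log_lik has cluster_sizes.size() + 1 entries; the last entry is the
// log-likelihood of opening a new cluster.
std::vector<double> crp_allocation_probs(const std::vector<int>& cluster_sizes,
                                         double a,
                                         const std::vector<double>& log_lik) {
    assert(a > 0.0);
    assert(log_lik.size() == cluster_sizes.size() + 1);

    const std::size_t K = cluster_sizes.size();
    std::vector<double> w(K + 1);
    for (std::size_t k = 0; k < K; ++k) {
        assert(cluster_sizes[k] > 0);
        // Existing cluster: weight proportional to n_k times the likelihood.
        w[k] = std::log(static_cast<double>(cluster_sizes[k])) + log_lik[k];
    }
    // New cluster: weight proportional to a times the likelihood.
    w[K] = std::log(a) + log_lik[K];

    // Normalize with the log-sum-exp trick for numerical stability.
    const double max_log = *std::max_element(w.begin(), w.end());
    double total = 0.0;
    for (double& x : w) {
        x = std::exp(x - max_log);
        total += x;
    }
    for (double& x : w) x /= total;
    return w;  // normalized probabilities, new cluster last
}
```

With cluster sizes {3, 1}, a = 1 and equal likelihoods, the weights are proportional to 3, 1 and 1, so the new-cluster probability is 0.2.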
Spatial Dependencies
- Adjacency Matrix Support: Incorporate spatial/network structure into clustering
- Geographic Clustering: Specialized algorithms for spatial data analysis
Comprehensive Analysis Suite
- Real-time Monitoring: Live convergence diagnostics and progress tracking
- Advanced Visualization: Heatmaps, trace plots, and clustering evaluation
- Performance Metrics: ARI, silhouette scores, and posterior analysis
- Reproducible Workflows: Automated analysis pipelines with devenv integration
Project Architecture
tesi/
├── R/                                # R Analysis & Visualization Suite
│   ├── sim_data_production.R         # Advanced data simulation (Natarajan method)
│   ├── simulation_data.R             # Main MCMC workflow orchestration
│   ├── analysis.R                    # Post-processing and visualization
│   └── utils.R                       # Utility functions and diagnostics
├── src/                              # High-Performance C++ Core
│   ├── Core Framework
│   │   ├── Sampler.hpp               # Abstract MCMC sampler base class
│   │   ├── Process.hpp               # Bayesian nonparametric process interface
│   │   ├── Data.hpp                  # Efficient cluster data management
│   │   ├── Likelihood.hpp            # Optimized likelihood computations
│   │   └── Params.hpp                # Comprehensive parameter management
│   ├── MCMC Samplers
│   │   ├── neal.hpp/.cpp             # Neal's Algorithm 3 (Gibbs sampling)
│   │   ├── splitmerge.hpp/.cpp       # Split-Merge sampler
│   │   └── splitmerge_SAMS.hpp/.cpp  # Sequential Allocation Merge-Split
│   ├── Process Implementations
│   │   ├── DP.hpp/.cpp               # Dirichlet Process
│   │   ├── NGGP.hpp/.cpp             # Normalized Generalized Gamma Process
│   │   ├── DPW.hpp/.cpp              # Weighted Dirichlet Process (spatial)
│   │   └── NGGPW.hpp/.cpp            # Weighted NGGP (spatial)
│   └── launcher.cpp                  # R-C++ integration interface
├── docs/                             # Auto-Generated Documentation
│   ├── Doxygen HTML documentation    # Comprehensive API reference
│   └── doxygen-theme/                # Custom documentation styling
├── simulation_data/                  # Simulated Datasets Repository
│   └── Natarajan_*sigma_*d/          # Organized by parameters (σ, dimensions)
├── results/                          # MCMC Analysis Outputs
│   └── {algorithm}_{config}/         # Results organized by method and parameters
├── devenv.nix                        # Reproducible Development Environment
├── devenv.yaml                       # Environment Configuration
├── devenv.lock                       # Locked Dependencies
└── README.MD                         # This comprehensive guide
Installation and Setup
Prerequisites
This project uses devenv for reproducible development environments with Nix.
- Install Nix (if not already installed):
curl --proto '=https' --tlsv1.2 -sSf -L https://install.determinate.systems/nix | sh -s -- install
- Install devenv:
nix profile install --accept-flake-config github:cachix/devenv/latest
- Enter the development environment:
cd /path/to/tesi
devenv shell
This will automatically set up:
- R with all required packages (Rcpp, RcppEigen, ggplot2, etc.)
- C++ compiler toolchain
- MCMC analysis libraries (salso, mcclust, etc.)
- VS Code integration (language server, plot viewer)
Manual Installation (Alternative)
If not using devenv, install the following:
R Packages:
install.packages(c(
"Rcpp", "RcppEigen", "ggplot2", "dplyr", "tidyr",
"spam", "fields", "viridisLite", "RColorBrewer",
"pheatmap", "mcclust.ext", "mvtnorm", "gtools",
"salso", "MASS"
))
System Dependencies:
- R (≥ 4.0)
- C++ compiler with C++23 support
- Eigen3 library
- BLAS/LAPACK libraries
Usage Guide
1. Data Simulation
First, generate simulated clustering data using the Natarajan method:
cd /path/to/tesi
Rscript R/sim_data_production.R
This script:
- Generates mixture data with configurable parameters (σ, dimensions)
- Creates distance matrices for spatial analysis
- Saves data to the simulation_data/ directory
- Supports different data generation methods (Gaussian, Gamma, Natarajan)
Key Parameters:
- sigma: Controls cluster separation (0.18, 0.2, 0.25)
- d: Dimensionality (10, 50)
- N: Number of data points (default: 100)
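To make the "distance matrix" step concrete, here is a minimal sketch that turns N points of dimension d into an N × N matrix of pairwise Euclidean distances. The function name euclidean_distance_matrix and the nested-vector Matrix alias are assumptions for illustration; the project itself builds its matrices in R, and the metric it uses is not stated here.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

using Matrix = std::vector<std::vector<double>>;

// Hypothetical helper: pairwise Euclidean distances between the rows of
// `points` (one row per observation, one column per dimension).
Matrix euclidean_distance_matrix(const Matrix& points) {
    const std::size_t n = points.size();
    Matrix dist(n, std::vector<double>(n, 0.0));
    for (std::size_t i = 0; i < n; ++i) {
        for (std::size_t j = i + 1; j < n; ++j) {
            assert(points[i].size() == points[j].size());
            double sq = 0.0;
            for (std::size_t c = 0; c < points[i].size(); ++c) {
                const double diff = points[i][c] - points[j][c];
                sq += diff * diff;
            }
            // The matrix is symmetric with a zero diagonal.
            dist[i][j] = dist[j][i] = std::sqrt(sq);
        }
    }
    return dist;
}
```

Only the upper triangle is computed and then mirrored, which halves the work for large N.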
2. MCMC Analysis
Run the main clustering analysis:
Rscript R/simulation_data.R
This script performs:
- Data Loading: Reads simulated data and distance matrices
- Hyperparameter Estimation: Uses k-means and elbow method for initialization
- MCMC Sampling: Runs Bayesian clustering with selected algorithm
- Real-time Monitoring: Displays progress and convergence diagnostics
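The elbow step in the hyperparameter estimation can be pictured with the sketch below. It assumes the within-cluster sum of squares from k-means is already available for k = 1, 2, ... and picks the k where the curve bends most sharply (largest second difference). This is one common heuristic and the name elbow_k is hypothetical; the rule actually used in utils.R may differ.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: wss[i] is the within-cluster sum of squares obtained
// from k-means with k = i + 1 clusters. Returns the k (1-based) at which the
// discrete second difference of the curve is largest.
std::size_t elbow_k(const std::vector<double>& wss) {
    assert(wss.size() >= 3);
    std::size_t best_k = 2;
    double best_curvature = wss[0] - 2.0 * wss[1] + wss[2];
    for (std::size_t i = 2; i + 1 < wss.size(); ++i) {
        const double curvature = wss[i - 1] - 2.0 * wss[i] + wss[i + 1];
        if (curvature > best_curvature) {
            best_curvature = curvature;
            best_k = i + 1;  // index i corresponds to k = i + 1
        }
    }
    return best_k;
}
```

The first and last values of k cannot be selected because the second difference is undefined at the ends of the curve.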
Algorithm Configuration (in src/launcher.cpp):
The framework provides multiple combinations of Process × Sampler × Spatial configurations:
Neal3 neal_sampler(data, params, likelihood, process);
MCMC Parameters:
- BI: Burn-in iterations (default: 2000)
- NI: Sampling iterations (default: 10000)
- a: Total mass parameter (default: 1)
- sigma: NGGP parameter (default: 0.5)
- tau: NGGP parameter (default: 1.0)
3. Results Analysis
Analyze and visualize results:
This generates:
- Convergence Diagnostics: Trace plots, autocorrelation analysis
- Clustering Results: Posterior similarity matrices, cluster assignments
- Performance Metrics: Adjusted Rand Index (ARI), cluster distribution
- Visualization: Heatmaps, scatter plots, distance distributions
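For reference, the Adjusted Rand Index listed above can be computed from two label vectors as in the sketch below. It is a standalone illustration using the standard pair-counting formula; the function name adjusted_rand_index is an assumption, and the project's own analysis is done in R.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <utility>
#include <vector>

// Number of unordered pairs that can be drawn from n items.
double choose2(double n) { return n * (n - 1.0) / 2.0; }

// Hypothetical helper: Adjusted Rand Index between two partitions given as
// label vectors of equal length. Label values are arbitrary integers.
double adjusted_rand_index(const std::vector<int>& truth,
                           const std::vector<int>& estimate) {
    assert(truth.size() == estimate.size());
    assert(truth.size() >= 2);

    // Contingency table and its row and column totals.
    std::map<std::pair<int, int>, long> joint;
    std::map<int, long> rows, cols;
    for (std::size_t i = 0; i < truth.size(); ++i) {
        ++joint[std::make_pair(truth[i], estimate[i])];
        ++rows[truth[i]];
        ++cols[estimate[i]];
    }

    double sum_joint = 0.0, sum_rows = 0.0, sum_cols = 0.0;
    for (const auto& kv : joint) sum_joint += choose2(static_cast<double>(kv.second));
    for (const auto& kv : rows) sum_rows += choose2(static_cast<double>(kv.second));
    for (const auto& kv : cols) sum_cols += choose2(static_cast<double>(kv.second));

    const double expected = sum_rows * sum_cols / choose2(static_cast<double>(truth.size()));
    const double maximum = 0.5 * (sum_rows + sum_cols);
    // Degenerate case: both partitions are a single cluster, or both are all
    // singletons. They are then identical, so report perfect agreement.
    if (maximum == expected) return 1.0;
    return (sum_joint - expected) / (maximum - expected);
}
```

The index is invariant to relabeling, so it can be applied directly to MCMC allocations without aligning cluster labels to the ground truth first.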
Output Files
Generated Data
simulation_data/Natarajan_{sigma}sigma_{d}d/
- all_data.rds: Raw data points
- ground_truth.rds: True cluster labels
- dist_matrix.rds: Distance matrix
MCMC Results
results/{algorithm}_{initialization}_{parameters}/
- simulation_results.rds: MCMC output (allocations, K, log-likelihood)
- simulation_ground_truth.rds: True labels
- simulation_data.rds: Original data
- simulation_distance_matrix.rds: Distance matrix
- plots/: Generated visualizations
Core Components & API
Data Generation & Simulation (utils.R)
generate_mixture_data(N, sigma, d)
distance_plot(data, labels)
Hyperparameter Optimization (utils.R)
set_hyperparameters(data, dist_matrix)
plot_k_means(data, max_k)
High-Performance MCMC Interface (launcher.cpp)
Rcpp::List mcmc(const Eigen::MatrixXd &distances, Params &param, const Rcpp::IntegerVector &initial_allocations_r = Rcpp::IntegerVector());
Main MCMC function for Bayesian nonparametric clustering (launcher.cpp). It takes the distance matrix, a reference to the Params object containing model hyperparameters and MCMC settings, and optional initial cluster allocations.
Class Hierarchy (See Doxygen Docs)
Sampler (abstract base)
├── Neal3              # Algorithm 3 implementation
├── SplitMerge         # Standard split-merge sampler
└── SplitMerge_SAMS    # Sequential allocation variant
Process (abstract base)
├── DP                 # Dirichlet Process
├── DPW                # Weighted Dirichlet Process
├── NGGP               # Normalized Generalized Gamma Process
└── NGGPW              # Weighted NGGP
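The sketch below shows how an abstract process interface of this kind lets sampler code stay agnostic about the prior. All class and method names here (ProcessSketch, DPSketch, NGGPSketch, log_weight_existing, log_weight_new) are hypothetical and do not reproduce Process.hpp. The NGGP weights assume the conditional form given in Favaro & Teh (2013), with an auxiliary variable u: existing clusters proportional to (n - sigma) and a new cluster proportional to a (u + tau)^sigma.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Hypothetical interface, not the project's Process.hpp.
class ProcessSketch {
public:
    virtual ~ProcessSketch() = default;
    // Log prior weight for joining an existing cluster of size n.
    virtual double log_weight_existing(int n) const = 0;
    // Log prior weight for opening a new cluster.
    virtual double log_weight_new() const = 0;
};

// Dirichlet Process with total mass a: weights n and a.
class DPSketch : public ProcessSketch {
public:
    explicit DPSketch(double a) : a_(a) { assert(a > 0.0); }
    double log_weight_existing(int n) const override {
        return std::log(static_cast<double>(n));
    }
    double log_weight_new() const override { return std::log(a_); }
private:
    double a_;
};

// NGGP conditional on an auxiliary variable u (assumed parametrization):
// weights (n - sigma) and a * (u + tau)^sigma.
class NGGPSketch : public ProcessSketch {
public:
    NGGPSketch(double a, double sigma, double tau, double u)
        : a_(a), sigma_(sigma), tau_(tau), u_(u) {
        assert(a > 0.0 && sigma >= 0.0 && sigma < 1.0 && u + tau > 0.0);
    }
    double log_weight_existing(int n) const override {
        return std::log(static_cast<double>(n) - sigma_);
    }
    double log_weight_new() const override {
        return std::log(a_) + sigma_ * std::log(u_ + tau_);
    }
private:
    double a_, sigma_, tau_, u_;
};

// Code written against the base class works with either process.
double prior_prob_new_cluster(const ProcessSketch& process,
                              const std::vector<int>& cluster_sizes) {
    const double w_new = std::exp(process.log_weight_new());
    double total = w_new;
    for (int n : cluster_sizes) total += std::exp(process.log_weight_existing(n));
    return w_new / total;
}
```

With sigma = 0 the NGGP weights reduce to those of the DP, which gives a quick sanity check for any implementation along these lines.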
Example Workflow
- Generate Data:
sigma <- 0.25
d <- 10
data_generation <- generate_mixture_data(N = 100, sigma = sigma, d = d)
- Run MCMC:
library(Rcpp)  # provides sourceCpp()
sourceCpp("src/launcher.cpp")
hyperparams <- set_hyperparameters(all_data, dist_matrix, k_elbow = 3)
param <- new(Params, hyperparams$delta1, hyperparams$alpha, ...)
mcmc_result <- mcmc(dist_matrix, param, hyperparams$initial_clusters)
- Analyze Results:
plot_mcmc_results(mcmc_result, ground_truth, BI = 2000)
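One of the summaries produced at this stage is the posterior similarity matrix. The sketch below shows the usual way to estimate it from sampled allocations after discarding burn-in. The function name posterior_similarity and the assumption that the stored allocations still include the burn-in draws are illustrative only.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

using Allocation = std::vector<int>;

// Hypothetical helper: entry (i, j) is the fraction of retained MCMC draws
// in which observations i and j share a cluster. The first `burn_in` draws
// are discarded.
std::vector<std::vector<double>> posterior_similarity(
        const std::vector<Allocation>& samples, std::size_t burn_in) {
    assert(burn_in < samples.size());
    const std::size_t n = samples[burn_in].size();
    const double kept = static_cast<double>(samples.size() - burn_in);

    std::vector<std::vector<double>> psm(n, std::vector<double>(n, 0.0));
    for (std::size_t s = burn_in; s < samples.size(); ++s) {
        const Allocation& z = samples[s];
        assert(z.size() == n);
        for (std::size_t i = 0; i < n; ++i)
            for (std::size_t j = 0; j < n; ++j)
                if (z[i] == z[j]) psm[i][j] += 1.0;
    }
    for (auto& row : psm)
        for (double& value : row) value /= kept;
    return psm;
}
```

Because only co-clustering is counted, the result does not depend on how cluster labels are numbered from one draw to the next.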
Advanced Configuration
Compilation & Build System
sourceCpp("src/launcher.cpp")
Spatial Dependency Configuration
param <- new(Params,
  delta1, alpha, beta,
  delta2, gamma, zeta,
  BI, NI,
  a, sigma, tau,
  coefficient,
  W  # adjacency matrix encoding the spatial structure
);
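How W is built in this project is not shown here. One common construction, sketched below, links each point to its k nearest neighbours in the distance matrix and symmetrizes the result. The function name knn_adjacency is hypothetical, and treating k_neighbors as the k of such a graph is an assumption.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

using Matrix = std::vector<std::vector<double>>;

// Hypothetical helper: symmetric 0/1 adjacency matrix in which W[i][j] = 1
// when j is among the k nearest neighbours of i, or i among those of j.
std::vector<std::vector<int>> knn_adjacency(const Matrix& dist, std::size_t k) {
    const std::size_t n = dist.size();
    assert(k < n);  // each point has at most n - 1 neighbours
    std::vector<std::vector<int>> W(n, std::vector<int>(n, 0));
    for (std::size_t i = 0; i < n; ++i) {
        assert(dist[i].size() == n);
        // Candidate neighbours: every point except i itself.
        std::vector<std::size_t> others;
        for (std::size_t j = 0; j < n; ++j)
            if (j != i) others.push_back(j);
        // Move the k closest candidates to the front.
        std::partial_sort(others.begin(),
                          others.begin() + static_cast<std::ptrdiff_t>(k),
                          others.end(),
                          [&](std::size_t x, std::size_t y) {
                              return dist[i][x] < dist[i][y];
                          });
        for (std::size_t m = 0; m < k; ++m) {
            W[i][others[m]] = 1;
            W[others[m]][i] = 1;  // keep the matrix symmetric
        }
    }
    return W;
}
```

Symmetrizing with "or" means a point can end up with more than k neighbours; an "and" rule (mutual neighbours only) would give a sparser graph.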
Hyperparameter Tuning Guide
a <- 1.0            # total mass parameter
sigma <- 0.5        # NGGP parameter
tau <- 1.0          # NGGP parameter
BI <- 2000          # burn-in iterations
NI <- 10000         # sampling iterations
coefficient <- 0.5
k_neighbors <- 5
Troubleshooting & FAQ
Installation Issues
# Missing dependencies
sudo apt-get install r-base-dev libeigen3-dev liblapack-dev libblas-dev
# Rcpp compilation errors
R -e "install.packages('Rcpp', type='source')"
# devenv not working
nix --version # Ensure Nix is installed
devenv --version # Verify devenv installation
Getting Help
- Documentation: Doxygen API Reference
- Code Examples: See R/simulation_data.R for complete workflows
- Test Cases: Run R/sim_data_production.R for validation
- Theory: References in code comments and Doxygen documentation
References & Citations
- Neal, R. M. (2000): "Markov Chain Sampling Methods for Dirichlet Process Mixture Models"
- Jain, S. & Neal, R. M. (2004): "A Split-Merge Markov Chain Monte Carlo Procedure for the Dirichlet Process Mixture Model"
- Dahl, D. B. & Newcomb, S. (2022): "Sequentially allocated merge-split samplers for conjugate Bayesian nonparametric models"
- Favaro, S. & Teh, Y. W. (2013): "MCMC for Normalized Random Measure Mixture Models"
- Martinez, A. F. & Mena, R. H. (2014): "On a Nonparametric Change Point Detection Model in Markovian Regimes"
- Natarajan, A. & De Iorio, M. (2023): "Cohesion and Repulsion in Bayesian Distance Clustering"