Advanced Clustering Tutorial
This tutorial explains the advanced clustering techniques used in VERUS.
Understanding Clustering in VERUS
VERUS uses a hybrid clustering approach that combines:
OPTICS - For detecting density-based clusters and finding initial centers
KMeans with Haversine distance - For refinement with geographic distance awareness
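To make "geographic distance awareness" concrete, here is a minimal sketch of the Haversine (great-circle) distance using only the standard library. The function name `haversine_km` and the Earth-radius constant are assumptions made for illustration; this is not VERUS's internal implementation.

```python
import math

EARTH_RADIUS_KM = 6371.0088  # assumed mean Earth radius in kilometres


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two points given in decimal degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    # Clamp to 1.0 to guard against floating-point overshoot near antipodes.
    return 2 * EARTH_RADIUS_KM * math.asin(min(1.0, math.sqrt(a)))


# One degree of longitude spans very different distances at different latitudes.
print(haversine_km(0.0, 0.0, 0.0, 1.0))    # at the equator
print(haversine_km(60.0, 0.0, 60.0, 1.0))  # at 60 degrees north
```

One degree of longitude is about 111 km at the equator but only about 56 km at 60°N, which is why Euclidean distance on raw degrees distorts cluster shapes away from the equator.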
Customizing OPTICS Parameters
The OPTICS algorithm has several key parameters that can be tuned:
from verus.clustering import GeOPTICS
# Create custom OPTICS instance
optics = GeOPTICS(
    min_samples=10,      # Minimum samples in a neighborhood
    xi=0.05,             # Steepness threshold
    min_cluster_size=8,  # Minimum cluster size
    max_eps=1000,        # Maximum neighborhood radius (meters)
    verbose=True
)
# Run clustering
optics_results = optics.run(data_source=poi_data)
# Access results
clusters = optics_results["clusters"]
centroids = optics_results["centroids"]
# Visualizing OPTICS results
import matplotlib.pyplot as plt
plt.figure(figsize=(10, 8))
plt.scatter(
    clusters["longitude"],
    clusters["latitude"],
    c=clusters["cluster"],
    cmap="viridis",
    alpha=0.6
)
plt.scatter(
    centroids["longitude"],
    centroids["latitude"],
    c="red",
    marker="x",
    s=100
)
plt.title("OPTICS Clustering Results")
plt.xlabel("Longitude")
plt.ylabel("Latitude")
plt.show()
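When tuning min_samples, xi, and min_cluster_size, a quick numeric summary of the labels can complement the plot. The sketch below is a hypothetical helper (`summarize_labels` is not part of VERUS) and assumes that unclustered points carry the label -1, which is scikit-learn's OPTICS convention and may not match VERUS's output.

```python
from collections import Counter

NOISE_LABEL = -1  # assumption: scikit-learn's convention for unclustered points


def summarize_labels(labels, noise_label=NOISE_LABEL):
    """Return cluster count, noise fraction, and smallest/largest cluster size."""
    counts = Counter(labels)
    noise = counts.pop(noise_label, 0)  # remove noise so it is not counted as a cluster
    total = noise + sum(counts.values())
    return {
        "n_clusters": len(counts),
        "noise_fraction": noise / total if total else 0.0,
        "smallest_cluster": min(counts.values()) if counts else 0,
        "largest_cluster": max(counts.values()) if counts else 0,
    }


# Toy labels: three clusters and one noise point.
summary = summarize_labels([0, 0, 0, 1, 1, -1, 2, 2, 2, 2])
print(summary)
```

With the tutorial's data this could be called as `summarize_labels(clusters["cluster"])`, assuming that column holds integer labels. A high noise fraction may suggest that min_samples or min_cluster_size is too strict for the data's density.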
Customizing KMeans Parameters
The KMeans step exposes its own parameters:
from verus.clustering import KMeansHaversine
# Create custom KMeans instance
kmeans = KMeansHaversine(
    n_clusters=10,      # Number of clusters
    init="k-means++",   # Initialization method
    random_state=42,    # For reproducibility
    max_iter=300,       # Maximum iterations
    verbose=True
)
# Run KMeans
kmeans_results = kmeans.run(data_source=poi_data)
# Access results
clusters = kmeans_results["clusters"]
centroids = kmeans_results["centroids"]
Hybrid Clustering Pipeline
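The hybrid pipeline runs OPTICS first and passes the resulting centroids to KMeans as its initial centers, so that KMeans refines density-based clusters instead of starting from random seeds: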
# 1. Run OPTICS to get initial centers
optics = GeOPTICS(min_samples=5, xi=0.05, min_cluster_size=5)
optics_results = optics.run(data_source=poi_data)
# 2. Use these centers to initialize KMeans
centers = optics_results["centroids"]
kmeans = KMeansHaversine(
    n_clusters=len(centers),
    init="predefined",
    predefined_centers=centers,
    random_state=42
)
# 3. Run KMeans with OPTICS centers
kmeans_results = kmeans.run(
    data_source=poi_data,
    centers_input=centers
)
# 4. Access final results
final_clusters = kmeans_results["clusters"]
final_centroids = kmeans_results["centroids"]
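The idea of seeding KMeans with predefined centers can be illustrated without VERUS. The sketch below is not the KMeansHaversine implementation; it is a minimal Lloyd-style loop under stated assumptions: points and centers are (latitude, longitude) pairs in decimal degrees, the update step uses a plain arithmetic mean of coordinates (a local approximation that is reasonable at city scale, away from the poles and the antimeridian), and the names `haversine_km` and `refine_centers` are hypothetical.

```python
import math


def haversine_km(p, q):
    """Great-circle distance in km between two (lat, lon) pairs in degrees."""
    phi1, phi2 = math.radians(p[0]), math.radians(q[0])
    dlmb = math.radians(q[1] - p[1])
    a = (math.sin((phi2 - phi1) / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * 6371.0088 * math.asin(min(1.0, math.sqrt(a)))


def refine_centers(points, centers, max_iter=300):
    """Refine predefined centers with assignment/update steps until stable."""
    centers = [tuple(c) for c in centers]
    labels = []
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest center.
        labels = [
            min(range(len(centers)), key=lambda k: haversine_km(p, centers[k]))
            for p in points
        ]
        # Update step: move each center to the mean of its members.
        new_centers = []
        for k, old in enumerate(centers):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:
                new_centers.append((
                    sum(m[0] for m in members) / len(members),
                    sum(m[1] for m in members) / len(members),
                ))
            else:
                new_centers.append(old)  # an empty cluster keeps its seed
        if new_centers == centers:
            break  # converged: centers no longer move
        centers = new_centers
    return labels, centers


# Two small groups of points and two rough seeds (standing in for OPTICS centroids).
points = [(41.150, -8.610), (41.152, -8.612), (41.148, -8.608),
          (41.200, -8.500), (41.202, -8.498), (41.198, -8.502)]
seeds = [(41.16, -8.60), (41.19, -8.51)]
labels, centers = refine_centers(points, seeds)
print(labels)
print(centers)
```

Because the seeds already sit near dense regions, the loop typically converges in very few iterations, and the number of clusters follows from the density structure rather than from a manually chosen value.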
Evaluating Clustering Quality
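The silhouette score is one way to evaluate clustering quality. Because the clusters are geographic, compute it from a precomputed Haversine distance matrix rather than from Euclidean distances on raw coordinates: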
# Silhouette score via scikit-learn, using Haversine distances between points
import numpy as np
from haversine import haversine
from sklearn import metrics
# Extract coordinates as (latitude, longitude) pairs in decimal degrees;
# haversine() takes degrees, so no conversion to radians is needed
coords = final_clusters[["latitude", "longitude"]].values
labels = final_clusters["cluster"].values
# Build a symmetric pairwise distance matrix (kilometers by default)
def create_distance_matrix(coords):
    n = len(coords)
    distance_matrix = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            dist = haversine(
                (coords[i][0], coords[i][1]),
                (coords[j][0], coords[j][1])
            )
            distance_matrix[i, j] = dist
            distance_matrix[j, i] = dist
    return distance_matrix
# Calculate distance matrix
distances = create_distance_matrix(coords)
# Calculate silhouette score (requires at least two distinct cluster labels)
silhouette = metrics.silhouette_score(
    distances,
    labels,
    metric="precomputed"
)
print(f"Silhouette score: {silhouette}")
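The nested Python loop above performs on the order of n² calls and becomes slow for more than a few thousand points. A vectorized alternative with NumPy broadcasting is sketched below; `haversine_matrix` is a hypothetical helper, and the Earth-radius constant is an assumption, so results may differ slightly from the `haversine` package.

```python
import numpy as np

EARTH_RADIUS_KM = 6371.0088  # assumed mean Earth radius in kilometres


def haversine_matrix(coords_deg):
    """Pairwise great-circle distances (km) for an (n, 2) array of [lat, lon] in degrees."""
    rad = np.radians(np.asarray(coords_deg, dtype=float))
    lat = rad[:, 0][:, None]  # column vector, shape (n, 1)
    lon = rad[:, 1][:, None]
    dlat = lat - lat.T        # broadcasts to shape (n, n)
    dlon = lon - lon.T
    a = (np.sin(dlat / 2) ** 2
         + np.cos(lat) * np.cos(lat.T) * np.sin(dlon / 2) ** 2)
    # Clip to [0, 1] to absorb floating-point error before the square root.
    return 2 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(np.clip(a, 0.0, 1.0)))


# Three toy points: the origin, one degree east, and one degree north.
coords = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0]])
distances = haversine_matrix(coords)
print(distances.round(3))
```

The result can be passed to `metrics.silhouette_score(distances, labels, metric="precomputed")` in the same way. Note that the full matrix needs n² float64 values (roughly 800 MB for 10,000 points, before intermediates), so for large datasets it may be preferable to score a random subsample.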
Conclusion
You’ve now learned how to customize and optimize the clustering process in VERUS. This knowledge can help you achieve better vulnerability assessments by creating more meaningful spatial clusters.