Unit 4

Syllabus →

[2 syllabus screenshots; images not available]

Endsem pyqs →

[3 screenshots of endsem PYQs; images not available]

Personal Notes →

Cluster Analysis →

Cluster analysis in Data Mining and Analysis (DMA) groups a set of objects so that objects in the same group (cluster) are more similar to each other than to objects in other groups. It is a form of unsupervised learning: it does not rely on predefined classes or labels. Key points about cluster analysis in the context of DMA:

Objective: The main goal is to discover the inherent groupings in the data, such as grouping customers by purchasing behavior or grouping documents by similar topics.
Methods: There are several clustering algorithms, each with its own method and application suitability. Common ones include:
- K-means Clustering: Given a chosen number of clusters k, alternates between assigning each point to the nearest cluster center and moving each center to the mean of its assigned points, until the assignments stop changing.
- Hierarchical Clustering: Builds a tree of nested clusters, either bottom-up (agglomerative) or top-down (divisive), which can be visualized as a dendrogram.
- DBSCAN (Density-Based Spatial Clustering of Applications with Noise): Groups together points that are closely packed, and marks points that lie alone in low-density regions as outliers (noise).
- Mean Shift Clustering: Looks for dense regions ("blobs") in a smooth density of samples. It is a centroid-based algorithm that repeatedly updates each candidate centroid to the mean of the points within a given region around it.
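As a rough illustration of the K-means idea above, here is a minimal sketch in plain Python. The function name `kmeans`, the toy `points`, and the choice to start from the first k points are all assumptions made for this example; real implementations usually use random or k-means++ initialization.

```python
import math

def kmeans(points, k, iterations=100):
    """Minimal K-means sketch: alternate assignment and center update."""
    # Assumption for this sketch: the first k points serve as initial centers.
    centers = list(points[:k])
    labels = []
    for _ in range(iterations):
        # Assignment step: each point goes to its nearest center.
        labels = [min(range(k), key=lambda c: math.dist(p, centers[c]))
                  for p in points]
        # Update step: each center moves to the mean of its assigned points.
        new_centers = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                mean = tuple(sum(col) / len(members) for col in zip(*members))
                new_centers.append(mean)
            else:
                new_centers.append(centers[c])  # empty cluster: keep old center
        if new_centers == centers:  # centers stopped moving, so we are done
            break
        centers = new_centers
    return centers, labels

# Toy data: two well-separated groups, interleaved.
points = [(1.0, 1.0), (8.0, 8.0), (1.2, 0.8), (8.2, 7.9), (0.9, 1.1), (7.8, 8.1)]
centers, labels = kmeans(points, k=2)
print(labels)   # cluster index for each point
print(centers)  # final cluster centers
```

On this toy data the points near (1, 1) end up in one cluster and the points near (8, 8) in the other. With less clearly separated data, the result can depend on the initial centers.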
Distance Measures: The choice of distance measure can significantly affect the clustering outcome. Common choices include Euclidean distance, Manhattan distance, and cosine similarity (a similarity rather than a distance; it is usually converted to a distance as 1 − cosine similarity).
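To make the difference between these measures concrete, here is a small sketch in plain Python. The function names and the two example vectors are chosen only for illustration.

```python
import math

def euclidean(a, b):
    # Straight-line distance: square root of the summed squared differences.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan(a, b):
    # Sum of absolute coordinate differences.
    return sum(abs(x - y) for x, y in zip(a, b))

def cosine_similarity(a, b):
    # Cosine of the angle between the vectors; ignores their lengths.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

a, b = (3.0, 4.0), (6.0, 8.0)  # b points the same way as a, but is twice as long
print(euclidean(a, b))              # 5.0
print(manhattan(a, b))              # 7.0
print(cosine_similarity(a, b))      # 1.0
print(1 - cosine_similarity(a, b))  # cosine distance: 0.0
```

The two vectors are far apart by Euclidean and Manhattan distance but identical by cosine similarity, which is why the choice of measure can change which points end up grouped together.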
Applications: Cluster analysis is widely used in various fields such as marketing (to segment customers), biology (to group genes with similar expression patterns), and document clustering for information organization.
Evaluation: Evaluating clustering quality is challenging because ground-truth labels are usually not known. Internal measures such as the Silhouette Score or the Dunn index assess a clustering from the data alone; external measures compare it against reference labels when those are available.
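As an example of an internal measure, the Silhouette Score can be sketched as follows. For each point, a is the mean distance to the other points in its own cluster, b is the smallest mean distance to the points of any other cluster, and the point's score is (b − a) / max(a, b). The function name, the toy data, and the use of Euclidean distance are assumptions made for this sketch.

```python
import math

def silhouette_score(points, labels):
    """Mean silhouette coefficient over all points (Euclidean distance)."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")
    scores = []
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1:
            scores.append(0.0)  # convention: a singleton cluster scores 0
            continue
        # a: mean distance to the other members of the point's own cluster
        # (the point's distance to itself is 0, so it does not affect the sum).
        a = sum(math.dist(p, q) for q in own) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster.
        b = min(
            sum(math.dist(p, q) for q in other) / len(other)
            for other_lab, other in clusters.items()
            if other_lab != lab
        )
        denom = max(a, b)
        scores.append(0.0 if denom == 0 else (b - a) / denom)
    return sum(scores) / len(scores)

points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (8.0, 8.0), (8.2, 7.9), (7.8, 8.1)]
good = silhouette_score(points, [0, 0, 0, 1, 1, 1])  # matches the two groups
bad = silhouette_score(points, [0, 1, 0, 1, 0, 1])   # mixes the two groups
print(good, bad)
```

Scores range from -1 to 1. Values near 1 indicate compact, well-separated clusters, so the labeling that matches the two groups scores higher than the mixed one.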
Challenges: Common challenges include choosing the right number of clusters, handling different data types, scaling to large datasets, and the sensitivity of some algorithms (for example, K-means) to their initial settings.