Classification of HI Galaxy Profiles using Unsupervised Learning and Convolutional Neural Networks
- Gabriel Andres Jaimes Illanes
- 15 Apr 2025
- 3 min read
Updated: 17 Oct 2025

Introduction
Note: We present a repository for an HI Profile Classification project developed by G. Jaimes (gjaimes@iaa.es) for the AMIGA (Analysis of the Interstellar Medium in Isolated GAlaxies) research group at the Instituto de Astrofísica de Andalucía (IAA).
Supervisors: Manuel Parra, Laura Darriba, Lourdes Verdes-Montenegro (PI)
Our project page is dedicated to the classification of neutral atomic hydrogen (HI) spectral profiles using advanced Machine Learning (ML) techniques. Our research focuses on harnessing the potential of ML to analyze and classify HI profiles, which are crucial for understanding the formation and evolution of galaxies. This project explores the application of Convolutional Neural Networks (CNN) and other ML algorithms to radio astronomy datasets, with the aim of enhancing the quality and efficiency of scientific analysis in the field.
arXiv:2501.11657v1 [astro-ph.GA]:
Link to: GitHub Repository
Link to: ADASS Conference
This work proposes a framework for the classification of neutral hydrogen (HI) spectral profiles using Machine Learning (ML) techniques. Unsupervised methodologies and Convolutional Neural Networks (CNN) were applied to analyze 318 profiles from the CIG catalog and 30,780 profiles from the ALFALFA survey. The process included data preprocessing using the Busyfit package, iterative fitting with polynomial, Gaussian, and Lorentzian models, and various classification techniques like KNN, SVM, and Random Forest. Classification was enhanced by adding a 2D dimension to the profiles to study asymmetry. The methodology could be applied to future studies, such as those involving the SKA.
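The preprocessing step described above can be sketched as follows. The actual pipeline uses the Busyfit package, so the simple Gaussian and Lorentzian models and the synthetic profile below are illustrative stand-ins, not the project's code:

```python
import numpy as np
from scipy.optimize import curve_fit

# Illustrative spectral line models; the real pipeline fits profiles
# with the Busyfit package instead.
def gaussian(v, amp, v0, sigma):
    return amp * np.exp(-0.5 * ((v - v0) / sigma) ** 2)

def lorentzian(v, amp, v0, gamma):
    return amp * gamma**2 / ((v - v0) ** 2 + gamma**2)

# Synthetic HI-like profile: a broad Gaussian plus noise.
rng = np.random.default_rng(42)
v = np.linspace(-300, 300, 200)          # velocity axis, km/s
flux = gaussian(v, 1.0, 10.0, 80.0) + rng.normal(0, 0.02, v.size)

# Iteratively fit each candidate model and keep the one with the
# lowest residual sum of squares.
best_name, best_rss = None, np.inf
for name, model in [("gaussian", gaussian), ("lorentzian", lorentzian)]:
    popt, _ = curve_fit(model, v, flux, p0=[1.0, 0.0, 50.0])
    rss = np.sum((flux - model(v, *popt)) ** 2)
    if rss < best_rss:
        best_name, best_rss = name, rss

print(best_name)  # the Gaussian should win on this synthetic profile
```

The same loop extends naturally to polynomial baselines: add the model to the list and compare residuals.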

Classification Results vs Physical Parameters | Input: ALFALFA - Busyfit
A. Classification: Random_Forest_Shapelets, Clustering Method: spectral_clustering.
The classification suggests that Classes 0, 1, and 2 follow the width of the spectrum. Classes 1 and 2 appear to differentiate beyond 50 Mpc.


B. Classification: KNN_Regular_Shapelets, Clustering Method: spectral_clustering.
The classification suggests that Classes 0, 1, and 2 follow the width of the spectrum. Classes 1 and 2 appear to differentiate beyond 50 Mpc.


C. Classification: KNN_Regular_Shapelets, Clustering Method: AffinityPropagation.
Class 2 appears to reach peaks over specific ranges of Vhel and Dist. There is an agglomeration of classifications from 50 to 200 Mpc.


E. Classification: Random_Forest_Shapelets, Clustering Method: k_means
Class 2 appears to reach peaks over specific ranges of Vhel and Dist. There is an agglomeration of classifications from 50 to 275 Mpc.


F. Classification: SVM_Shapelets, Clustering Method: AffinityPropagation
Classes 0, 2, 3, 4, and 5 appear to be ordered by frequency (counts) within a similar range of W50.


G. Classification: Random_Forest_Shapelets, Clustering Method: DBSCAN
A general application of DBSCAN does not produce good clustering/classification.
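The clustering methods compared in panels A–G can be exercised side by side with scikit-learn. The toy feature matrix below (standing in for shapelet-transformed profiles, with hypothetical W50/peak-flux columns) is an assumption for illustration:

```python
import numpy as np
from sklearn.cluster import (KMeans, SpectralClustering,
                             AffinityPropagation, DBSCAN)
from sklearn.preprocessing import StandardScaler

# Hypothetical features per spectrum: (W50 in km/s, peak flux).
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal([100, 1.0], [10, 0.1], (50, 2)),   # narrow profiles
    rng.normal([250, 0.6], [15, 0.1], (50, 2)),   # wide profiles
])
X = StandardScaler().fit_transform(X)

methods = {
    "k_means": KMeans(n_clusters=2, n_init=10, random_state=42),
    "spectral_clustering": SpectralClustering(n_clusters=2, random_state=42),
    "affinity_propagation": AffinityPropagation(random_state=42),
    "dbscan": DBSCAN(eps=0.5, min_samples=5),
}
for name, est in methods.items():
    labels = est.fit_predict(X)
    # DBSCAN labels low-density points as -1 (noise), which is one
    # reason it can look poor when profile densities vary.
    print(name, np.unique(labels))
```

Note that AffinityPropagation and DBSCAN choose the number of groups themselves, while k-means and spectral clustering take it as a parameter, which is why the class counts differ across panels.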


Asymmetry Analysis | Input: CIG AMIGA - Busyfit
Asymmetry analysis:

Iterations are differentiated by changes in the following (time per classification: 1.13 h; 41 iterations in total):
Input configuration.
Selection of the 2D image model.
Quantity of data (318 spectra, but Classes 0, 1 & 3 are not balanced).
Crop of the image (one quarter).
Reduction from 3 classes to 2 (0 & 2, balanced and unbalanced).
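The quantity being studied, profile asymmetry, can be illustrated in its simplest form as an integrated-flux ratio between the two halves of the spectrum. This is a common measure in the HI literature and an illustrative assumption here, not the project's 2D-image method:

```python
import numpy as np

def asymmetry_ratio(v, flux, v_sys):
    """Integrated-flux asymmetry: ratio of the flux on either side of
    the systemic velocity v_sys, arranged so the result is >= 1.
    A perfectly symmetric profile gives a ratio near 1."""
    dv = v[1] - v[0]                      # uniform channel width
    left = flux[v < v_sys].sum() * dv
    right = flux[v >= v_sys].sum() * dv
    return max(left, right) / min(left, right)

# Symmetric toy profile -> ratio close to 1.
v = np.linspace(-200, 200, 401)
flux = np.exp(-0.5 * (v / 80) ** 2)
print(round(asymmetry_ratio(v, flux, 0.0), 2))
```

Lopsided double-horn profiles push this ratio well above 1, which is the kind of structure the 2D-image classification is meant to capture.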
Classification & Clustering:
Multiple CNN classifications are made, iterating over the classification model parameters:
file_suffix = "_rot_im_2.jpeg"
...
num_groups = 3
test_size_options = [0.7, 0.5, 0.3]
random_state_options = [42, 99]
epochs_options = [50, 80, 120]
model_configs = [
    [32, 64, 128],            # Simple model
    [32, 64, 128, 256],       # Medium model
    [32, 64, 128, 256, 512],  # Complex model
]
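Iterating over this grid can be sketched as below; the training/evaluation step is only a placeholder, since the repository code is not shown here. Note that the grid size works out to the 54 configurations per iteration mentioned in the conclusions:

```python
from itertools import product

# Hyperparameter grid mirroring the configuration above.
test_size_options = [0.7, 0.5, 0.3]
random_state_options = [42, 99]
epochs_options = [50, 80, 120]
model_configs = [
    [32, 64, 128],            # Simple model
    [32, 64, 128, 256],       # Medium model
    [32, 64, 128, 256, 512],  # Complex model
]

grid = list(product(model_configs, test_size_options,
                    random_state_options, epochs_options))
print(len(grid))  # 3 * 3 * 2 * 3 = 54 configurations

best = None
for config, test_size, random_state, epochs in grid:
    # Placeholder for building and training the CNN (e.g. one conv
    # layer per entry in `config`) and evaluating it; only the
    # bookkeeping that keeps the best configuration is sketched here.
    success_rate = 0.0  # would come from the model's evaluation
    if best is None or success_rate > best[-1]:
        best = (config, test_size, random_state, epochs, success_rate)
```

Keeping the best tuple as the loop runs is what yields the single "best configuration" line reported in the output below.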


The output obtained is:
Best configuration: Model Config: [32, 64, 128], Test Size: 0.3, Random State: 42, Epochs: 80, Success Rate: 0.6333333333333333

Conclusions:
The best success rates were produced with “im_rot_2.fits”.
The CNN differentiates between colors and shapes according to the model classification.
The 2D image model increases the success rate by 13% compared with regular 1D CNN classification and shapelet transformations.
Across classifications with varying inputs and models, there is a consistent maximum classification rate of 63%. This is likely related directly to the size of the sample.
The methodology generates 54 classification configurations per iteration, applying a cleaning and transformation step for comparison with an existing classification (D. Espada 2010). In exchange, it requires considerable computational resources (1.13 h per iteration; 16.0 GB RAM, 8 cores).




