Objective: With travel restrictions lifted, it is increasingly important for airline and transport businesses to recover the profits lost during the COVID-19 pandemic. To support that goal, we perform a customer segmentation analysis on airline customers using k-means clustering.

Data Source: https://www.kaggle.com/datasets/felixign/airline-customer

Analysis: In general, clustering can be performed using one of two methods. The first is k-means clustering. Its advantages are that one can select the number of clusters, it runs on both small and large datasets, and the results are simple to understand. In particular, the optimal number of clusters can be estimated using the ‘Elbow Method’, which plots the within-cluster sum of squares against the number of clusters and looks for the point where the curve flattens. The alternative is hierarchical clustering. Here, the number of clusters does not have to be fixed in advance; it is read from the dendrogram the algorithm produces. However, this method is less practical on large datasets because the results become difficult to visualize. Hence, it is not pursued in this analysis.
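
The Elbow Method described above can be sketched roughly as follows. This is a minimal illustration, not the code used in the analysis: it assumes scikit-learn is available, uses synthetic data from make_blobs as a stand-in for the preprocessed customer features, and the helper name elbow_inertias is hypothetical.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic stand-in for the preprocessed customer features (assumption:
# the real feature matrix would be built from the airline dataset).
X, _ = make_blobs(n_samples=300, centers=3, n_features=4, random_state=0)


def elbow_inertias(X, k_values, random_state=0):
    """Fit k-means for each k and return the within-cluster sum of squares."""
    inertias = []
    for k in k_values:
        model = KMeans(n_clusters=k, n_init=10, random_state=random_state)
        model.fit(X)
        inertias.append(model.inertia_)
    return inertias


k_values = list(range(1, 9))
inertias = elbow_inertias(X, k_values)

# The "elbow" is the k after which the inertia stops dropping sharply.
for k, inertia in zip(k_values, inertias):
    print(f"k={k}: inertia={inertia:.1f}")
```

Plotting inertias against k_values (for example with matplotlib) gives the usual elbow curve; the printed values are enough to see where the decrease levels off.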

The dataset includes demographic data, such as one’s work country/province/city and gender, as well as flight timestamps, including one’s first flight date. To use k-means clustering, all numeric columns need to be normalized so that no single feature dominates the distance calculation, and all categorical variables must be ‘one-hot’ encoded. All work was performed in Python.
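
The preprocessing step above might look something like the sketch below. It assumes pandas and scikit-learn; the tiny DataFrame and its column names (AGE, FLIGHT_COUNT, GENDER, WORK_COUNTRY) are illustrative placeholders rather than a confirmed schema, and z-score standardization is used as one common way to normalize (min-max scaling would be another).

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Toy data with hypothetical column names, for illustration only.
df = pd.DataFrame({
    "AGE": [31, 42, 55, 28],
    "FLIGHT_COUNT": [10, 3, 25, 7],
    "GENDER": ["Male", "Female", "Female", "Male"],
    "WORK_COUNTRY": ["CN", "CN", "US", "JP"],
})
numeric_cols = ["AGE", "FLIGHT_COUNT"]
categorical_cols = ["GENDER", "WORK_COUNTRY"]

# Standardize numeric columns to zero mean and unit variance.
scaled = pd.DataFrame(
    StandardScaler().fit_transform(df[numeric_cols]),
    columns=numeric_cols,
    index=df.index,
)

# One-hot encode categorical columns (one 0/1 column per category).
encoded = pd.get_dummies(df[categorical_cols], dtype=float)

# Combined feature matrix that could be passed to k-means.
features = pd.concat([scaled, encoded], axis=1)
print(features.columns.tolist())
```

High-cardinality columns such as city would produce many one-hot columns, so in practice one may want to group rare categories before encoding.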

Feature Engineering: We believed the length of membership would be an important feature for determining a customer segment. We obtained this value by subtracting the date each customer joined the Frequent Flyer Program from the date the latest data was taken. We visualized the data as seen below.

Distribution Plot for membership duration in months.
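
The membership-length feature can be derived along these lines. This is a sketch under assumptions: the column names FFP_DATE (join date) and LOAD_TIME (date the data snapshot was taken) and the sample rows are hypothetical, and the duration is counted in calendar months, ignoring the day of the month.

```python
import pandas as pd

# Toy rows with hypothetical column names, for illustration only.
df = pd.DataFrame({
    "FFP_DATE": ["2010-01-15", "2012-06-01", "2013-12-20"],
    "LOAD_TIME": ["2014-03-31", "2014-03-31", "2014-03-31"],
})
df["FFP_DATE"] = pd.to_datetime(df["FFP_DATE"])
df["LOAD_TIME"] = pd.to_datetime(df["LOAD_TIME"])

# Calendar-month difference: snapshot date minus join date.
df["MEMBERSHIP_MONTHS"] = (
    (df["LOAD_TIME"].dt.year - df["FFP_DATE"].dt.year) * 12
    + (df["LOAD_TIME"].dt.month - df["FFP_DATE"].dt.month)
)
print(df["MEMBERSHIP_MONTHS"].tolist())
```

A histogram of MEMBERSHIP_MONTHS would give a distribution plot of the kind captioned above.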

Conclusion: Based on the airline customer dataset, the optimal number of clusters is about 3. As a next step, we could train a decision tree classifier on the derived clusters to determine which features characterize each cluster.
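
The proposed next step could be sketched as below. This was not part of the analysis; it assumes scikit-learn, uses synthetic data in place of the customer features, and the feature names are placeholders.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the preprocessed customer features (assumption).
X, _ = make_blobs(n_samples=300, centers=3, n_features=4, random_state=0)
feature_names = ["f0", "f1", "f2", "f3"]  # placeholder names

# Step 1: derive cluster labels with k-means (k=3, as in the conclusion).
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

# Step 2: fit a shallow decision tree that predicts the cluster label.
tree = DecisionTreeClassifier(max_depth=3, random_state=0)
tree.fit(X, labels)

# Features with higher importance contribute more to separating clusters.
importances = dict(zip(feature_names, tree.feature_importances_))
for name, score in sorted(importances.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: {score:.3f}")
```

Keeping the tree shallow makes its splits easier to read as plain rules describing each segment.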