Congratulations to our Cofounder Gurpal Bisra on Publishing, as First Author, in the journal Archives of Pathology & Laboratory Medicine!

Excerpts from the publication can be found here: https://pubmed.ncbi.nlm.nih.gov/37074865/.

The paper’s abstract is found below.

Using Pathology Synoptic Reporting Data to Create Individual Dashboards for Pathologists and Surgeons

Abstract

Context: Electronic synoptic pathology reporting using xPert from mTuitive is available to all pathologists in British Columbia, Canada. Comparative feedback reports for pathologists and surgeons were created by using the synoptic reporting software.

Objective: To use data stored in a single central data repository to provide nonpunitive confidential comparative feedback reports (dashboards) to individual pathologists and surgeons for reflection on their practice and to use aggregate data for quality improvement initiatives.

Design: Integration of mTuitive middleware in 5 different laboratory information systems to have 1 software solution (xPert) sending discrete data elements to the central data repository was performed. Microsoft Office products were used to build comparative feedback reports and made the infrastructure sustainable. Two different types of reports were developed: individual confidential feedback reports (dashboards) and aggregated data reports.

Results: Pathologists have access to an individual confidential live feedback report for the 5 major cancer sites. Surgeons get an annual confidential emailed PDF report. Several quality improvement initiatives were identified from the aggregate data.

Conclusions: We present 2 novel dashboards: a live pathologist dashboard and a static surgeon dashboard. Individual confidential dashboards incentivize use of nonmandated electronic synoptic pathology reporting tools and have increased adoption rates. Use of dashboards has also led to discussions about how patient care may be improved.

The rest of the paper can be found here: https://pubmed.ncbi.nlm.nih.gov/37074865/.

Digit Classification using Deep Learning

Context: Digitizing handwritten records remains a challenge for many organizations and makes archiving unnecessarily cumbersome. This barrier can be overcome by applying machine learning to more easily extract insights.

Objective: In 1999, the Modified National Institute of Standards and Technology (MNIST) dataset of handwritten digit images was released, and it has served as the benchmark for digit classification algorithms ever since. Our goal is to automatically detect the digits 0 through 9.

Data Source: https://www.kaggle.com/competitions/digit-recognizer/data

Each image is 28 x 28 pixels with each pixel having a value of 0-255 inclusive denoting its lightness or darkness.

Training Analysis: Using Keras, we applied a convolutional neural network (CNN) algorithm to assign learnable weights and biases to various objects in the pixelated image. We applied batch normalization to expedite training. Most importantly, we used dropout regularization to force the network to learn features in a distributed way, rather than relying too heavily on any particular weight, which improves its generalization.
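
For illustration, here is a minimal sketch of the kind of Keras CNN described above, assuming 28 x 28 grayscale inputs and 10 output classes; the layer sizes and dropout rates are illustrative assumptions, not the exact configuration used.

```python
# Minimal sketch of a CNN with batch normalization and dropout
# (illustrative layer sizes, not necessarily the project's exact configuration).
from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential([
    keras.Input(shape=(28, 28, 1)),                 # 28 x 28 grayscale digit images
    layers.Conv2D(32, (3, 3), activation="relu"),
    layers.BatchNormalization(),                    # speeds up and stabilizes training
    layers.Conv2D(64, (3, 3), activation="relu"),
    layers.MaxPooling2D((2, 2)),
    layers.Dropout(0.25),                           # forces distributed feature learning
    layers.Flatten(),
    layers.Dense(128, activation="relu"),
    layers.Dropout(0.5),
    layers.Dense(10, activation="softmax"),         # one output per digit 0-9
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
```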

We visualized loss to ensure we identified a good set of weights and biases to train our model across all samples. If the model's prediction is perfect, the loss is zero; otherwise, the loss is greater.

The loss is calculated on both the training and validation sets; its value indicates how well or poorly the model is behaving after each iteration of optimization.
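
A minimal sketch of how the training and validation loss curves can be visualized from a Keras training run (X_train and y_train are assumed to be the preprocessed images and labels, and the epoch and batch-size values are illustrative):

```python
# Minimal sketch: visualize training vs. validation loss from a Keras run
# (X_train and y_train are assumed to be the preprocessed images and labels).
import matplotlib.pyplot as plt

history = model.fit(X_train, y_train, validation_split=0.1,
                    epochs=20, batch_size=64)

plt.plot(history.history["loss"], label="training loss")
plt.plot(history.history["val_loss"], label="validation loss")
plt.xlabel("Epoch")
plt.ylabel("Loss")
plt.legend()
plt.show()
```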

Conclusion: We were able to successfully classify digits and obtained training and testing accuracies of > 99%.

Clustering for Airline Customer Segmentation

Objective: With travel restrictions lifted, it’s becoming increasingly important for airline and transport businesses to recoup profits after the difficult times of the COVID-19 pandemic. Hence, we want to perform customer segmentation analysis on airline customers using k-means clustering.

Data Source: https://www.kaggle.com/datasets/felixign/airline-customer

Analysis: In general, one can perform clustering using 2 methods. First, k-means clustering can be performed. Its advantages are that one can select the number of clusters, it runs on small and large datasets, and the results are simple to understand. In particular, the optimal number of clusters can be derived using the ‘Elbow Method.’ The alternative is hierarchical clustering, where the algorithm itself decides how many clusters should be made. However, this method is not useful on large datasets, as the results are difficult to visualize, so it won’t be pursued in this analysis.
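
A minimal sketch of the Elbow Method, assuming X is an already normalized and one-hot encoded feature matrix (that preprocessing is described in the next paragraph):

```python
# Minimal sketch of the Elbow Method
# (X is assumed to be a normalized, one-hot encoded feature matrix).
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

inertias = []
ks = range(1, 11)
for k in ks:
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    inertias.append(km.inertia_)          # within-cluster sum of squares

plt.plot(ks, inertias, marker="o")
plt.xlabel("Number of clusters k")
plt.ylabel("Inertia")
plt.title("Elbow Method")
plt.show()
```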

The dataset includes demographic data, such as one’s work country/province/city and gender, as well as flight timestamps including one’s first flight date. To use k-means clustering, all numeric columns need to be normalized and all categorical variables must be ‘one-hot’ encoded. All work was performed in Python.
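
A minimal sketch of that preprocessing step; the file name and column names below are assumptions about the dataset's schema, not its exact layout:

```python
# Minimal preprocessing sketch: scale numeric columns, one-hot encode categoricals
# (file name and column names are assumptions about the dataset's schema).
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

df = pd.read_csv("airline_customers.csv")              # hypothetical file name

categorical_cols = ["WORK_COUNTRY", "GENDER"]           # assumed categorical columns
numeric_cols = df.select_dtypes(include="number").columns

encoded = pd.get_dummies(df[categorical_cols])          # one-hot encoding
scaled = pd.DataFrame(MinMaxScaler().fit_transform(df[numeric_cols]),
                      columns=numeric_cols, index=df.index)

X = pd.concat([scaled, encoded], axis=1)                # feature matrix for k-means
```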

Feature Engineering: We believed the length of membership would be an important feature for determining a customer segment. We obtained this value by subtracting the date they joined the Frequent Flyer Program from the date the latest data was taken. We visualized the data as seen below.

Distribution Plot for membership duration in months.
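
A minimal sketch of this feature-engineering step, assuming the join date and data snapshot date are stored in columns named FFP_DATE and LOAD_TIME (an assumption about the schema), with the duration converted to months:

```python
# Minimal sketch of the membership-length feature
# (FFP_DATE and LOAD_TIME column names are assumptions about the schema).
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

df["FFP_DATE"] = pd.to_datetime(df["FFP_DATE"])         # date the customer joined
df["LOAD_TIME"] = pd.to_datetime(df["LOAD_TIME"])       # date of the latest data
df["MEMBERSHIP_MONTHS"] = (df["LOAD_TIME"] - df["FFP_DATE"]).dt.days / 30.44

sns.histplot(df["MEMBERSHIP_MONTHS"], kde=True)         # distribution plot
plt.xlabel("Membership duration (months)")
plt.show()
```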

Conclusion: One can see that the optimal number of clusters is about 3 based on the airline customer dataset. As a next step, we could use a decision tree classification algorithm, on the derived clusters, to determine which features characterize each cluster.

Image Classification: Normal vs. COVID Lung X-rays

Objective: To perform image classification to discern normal versus COVID-19 infected lungs based on X-ray images.

Data Source: https://www.kaggle.com/datasets/nabeelsajid917/covid-19-x-ray-10000-images

Analysis: We created a deep learning framework to perform image classification. We relied heavily on the Keras Python library to pre-process the data and develop our training model. After loading the images, we were able to see how COVID-19 infected lungs looked compared to normal ones (see below).

Model Training: Keras offers two different ways of defining a network. We used the Sequential API, where layers are added one at a time, starting from the input. Notably, the convolutional layers had 16-32 filters using nine weights each to transform a pixel into a weighted average of itself and its eight neighbors. After this is applied across the entire image, max pooling looks at each 2 x 2 block of neighboring values and keeps only the maximum. In addition, we applied batch normalization and dropout as regularization methods.
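
For illustration, a minimal sketch of a Sequential model along these lines; the input size, filter counts, and dropout rate are illustrative assumptions rather than the exact configuration used:

```python
# Minimal sketch of the Sequential CNN described above
# (input size, filter counts, and dropout rate are illustrative assumptions).
from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential([
    keras.Input(shape=(224, 224, 3)),                  # assumed input image size
    layers.Conv2D(16, (3, 3), activation="relu", padding="same"),
    layers.BatchNormalization(),
    layers.MaxPooling2D((2, 2)),                       # keep the max of each 2 x 2 window
    layers.Conv2D(32, (3, 3), activation="relu", padding="same"),
    layers.BatchNormalization(),
    layers.MaxPooling2D((2, 2)),
    layers.Dropout(0.3),                               # dropout as regularization
    layers.Flatten(),
    layers.Dense(64, activation="relu"),
    layers.Dense(1, activation="sigmoid"),             # normal vs. COVID-19
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
```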

One epoch is one complete forward and backward pass of the entire dataset through the neural network, while the batch size is the number of training examples processed in a single batch.
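
A short illustration of how these two values appear in a Keras training call (the values shown are assumptions, not the settings actually used):

```python
# Illustrative values only: 15 full passes over the data, 32 images per batch
# (X_train and y_train are the preprocessed X-ray images and labels).
history = model.fit(X_train, y_train,
                    validation_split=0.2,
                    epochs=15,        # one epoch = one full pass over the dataset
                    batch_size=32)    # number of examples per gradient update
```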

Output: We were able to successfully classify normal and COVID-19 infected lungs using X-ray images.

Sound Classification: Environmental Noises

Objective: To create a process that can classify any audio data; in particular, to demonstrate its usefulness by classifying environmental sounds.

Data Source: https://www.kaggle.com/datasets/mmoreaux/environmental-sound-classification-50

The dataset consists of 50 WAV files sampled at 16 kHz, one per class across 50 different classes. Each class corresponds to 40 audio samples of 5 seconds each; these samples have been concatenated by class to produce 50 WAV files of 3 minutes 20 seconds each.

Visualizing Audio Files: Mel-spectrograms and Mel-frequency Cepstral Coefficients (MFCCs): Building machine learning models to classify, describe, or generate audio typically concerns modeling tasks where the input data are audio samples. Audio samples are usually represented as time series, where the y-axis measurement is the amplitude of the waveform (see image below). However, the waveform itself may not necessarily yield clear class identifying information. It turns out one of the best features to extract from audio waveforms (and digital signals in general) has been the Mel Frequency Cepstral Coefficients (MFCCs).

Analysis: Steps for Calculating MFCCs for Audio Samples: (1) Slice the signal into short frames (of time); (2) Compute the periodogram estimate of the power spectrum for each frame; (3) Apply the mel filterbank to the power spectra and sum the energy in each filter; and (4) Take the discrete cosine transform (DCT) of the log filterbank energies.
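
A minimal sketch of extracting MFCCs and a mel spectrogram from one audio file; librosa is an assumption here, as the write-up does not name the audio library used, and the file path is hypothetical:

```python
# Minimal sketch of MFCC and mel-spectrogram extraction for one file
# (librosa is an assumption; the file path is hypothetical).
import numpy as np
import librosa
import librosa.display
import matplotlib.pyplot as plt

y, sr = librosa.load("audio_sample.wav", sr=16000)     # hypothetical file path

mfccs = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)    # steps (1)-(4) above
mel = librosa.feature.melspectrogram(y=y, sr=sr)
mel_db = librosa.power_to_db(mel, ref=np.max)          # log scale for display

librosa.display.specshow(mel_db, sr=sr, x_axis="time", y_axis="mel")
plt.title("Mel spectrogram")
plt.colorbar(format="%+2.0f dB")
plt.show()
```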

Output: A subset of the classified sounds’ waveforms (left) and corresponding Mel Spectrograms (right) are found below.

Geospatial Analysis for COVID-19

Objective: Perform spatial data analysis for COVID-19 data and visualize its spread in 2020.

Data Sources: https://github.com/CSSEGISandData/COVID-19 and https://www.worldometers.info/

Collection Methodology: https://github.com/imdevskp/covid_19_jhu_data_web_scrap_and_cleaning

Analysis: The dataset includes geographic information system (GIS) data, with counts by latitude and longitude per day. Hence, we can count how many confirmed, recovered, and active cases, as well as deaths, exist per day by location, including by country and/or WHO region.

Data by Country: The table below provides the total confirmed, active, death, and recovered cases by country as of July 27, 2020. Applying a heat map and ordering the results by total confirmed cases shows that total confirmed cases are highest in the USA, Brazil, and India. The USA has, by far, the highest number of COVID-19 deaths of any country.

[Figure: Geospatial1.JPG — heat-map table of total confirmed, active, death, and recovered cases by country]
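
A minimal sketch of how such a country-level summary table with a heat-map style could be produced with pandas; the file and column names follow the JHU-derived dataset and are assumptions:

```python
# Minimal sketch of the country-level summary with a heat-map style
# (file and column names follow the JHU-derived dataset and are assumptions).
import pandas as pd

df = pd.read_csv("covid_19_clean_complete.csv", parse_dates=["Date"])

latest = df[df["Date"] == df["Date"].max()]            # most recent snapshot
by_country = (latest.groupby("Country/Region")[["Confirmed", "Deaths", "Recovered", "Active"]]
              .sum()
              .sort_values("Confirmed", ascending=False))

# In a notebook, a background gradient renders the table as a heat map:
by_country.head(15).style.background_gradient(cmap="Reds")
```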

Rather than order by confirmed cases, let's apply a heat map again, this time ordering by deaths and calculating deaths per 100 cases, which is a better measure for comparing death rates between countries.

[Figure: Geospatial2.JPG — heat-map table ordered by deaths, with deaths per 100 cases]

Geospatial Analysis: Now, let's plot some choropleth maps, starting with the count of confirmed cases by country.

[Figure: Geospatial3.JPG — choropleth map of confirmed COVID-19 cases by country]
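
Continuing from the sketch above, a choropleth like this could be drawn with Plotly Express (an assumption, since the write-up does not name the mapping library):

```python
# Minimal sketch of a choropleth of confirmed cases by country
# (Plotly Express is an assumption; by_country comes from the sketch above).
import plotly.express as px

fig = px.choropleth(by_country.reset_index(),
                    locations="Country/Region",
                    locationmode="country names",
                    color="Confirmed",
                    color_continuous_scale="Reds",
                    title="Confirmed COVID-19 cases by country")
fig.show()
```

Passing a date column as animation_frame would animate the same map over time, which is one way to show the spread of confirmed cases discussed below.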

The same trend is observed when plotting deaths by country. While Greenland had few confirmed cases, it appears to have even fewer deaths. Interestingly, some African countries with similarly few cases have more deaths than Greenland.

Next, I can plot the spread of confirmed COVID-19 cases over time.

Conclusion: One can see that COVID-19 started in China, then spread to Europe, and then to the USA.

Kaggle Competition: Zillow’s Home Value Prediction

Objective: Zillow is trying to predict the log-error between their estimate and the actual sales price using features of a home. My goal was to determine which features, given their correlation coefficients, should be included in a predictive model.

Data: I was provided with a full list of real estate properties, with 2016 data, in three California counties (Los Angeles, Orange, and Ventura).

Analysis: The provided dataset has ~3M rows and 58 columns; but, yikes, ~49.2% of the cells are empty. Let's visualize this to see which few columns might even be helpful for making predictions.

[Figure: MissingValues.png — percentage of missing values per column]
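
A minimal sketch of the missing-value check behind this plot (df is assumed to be the properties DataFrame loaded from the Zillow CSV):

```python
# Minimal sketch of the missing-value check
# (df is assumed to be the properties DataFrame loaded from the Zillow CSV).
import matplotlib.pyplot as plt

missing_pct = df.isnull().mean().sort_values(ascending=False) * 100
missing_pct.plot(kind="barh", figsize=(8, 14))
plt.xlabel("% missing values")
plt.show()

usable_cols = missing_pct[missing_pct < 10].index      # columns with < 10% missing
```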

In particular, 23 columns of data have less than 10% missing values. For these columns, which have the datatype float, I'll replace the missing values with the average value for that column. Furthermore, 53 (or 83%) of the 60 columns are float datatypes, so I don’t really need to do any categorical encoding before inputting them into my model. Now, I can also plot the variables' correlation coefficients against each other in the form of a heat map.

[Figure: HeatMapZillow.JPG — heat map of correlation coefficients between variables]
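
Continuing from the sketch above, a minimal sketch of the mean imputation and correlation heat map; seaborn is an assumption for the plotting:

```python
# Minimal sketch of mean imputation and a correlation heat map
# (seaborn is an assumption; usable_cols comes from the sketch above).
import seaborn as sns
import matplotlib.pyplot as plt

float_cols = df[usable_cols].select_dtypes(include="float").columns
df[float_cols] = df[float_cols].fillna(df[float_cols].mean())   # mean imputation

plt.figure(figsize=(12, 10))
sns.heatmap(df[float_cols].corr(), cmap="coolwarm", center=0)
plt.show()
```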

Conclusion: The structuretaxvaluedollarcnt field seems highly correlated with most of the potentially useful fields. For predictive modeling, I would definitely include that one!