Building an Interactive NLP Demo

Using Hugging Face Spaces

Catherine Breslin
Feb 17, 2022

ML tools have come a long way in the past decade. One of the companies making great strides here is Hugging Face. I wanted to try out their new Spaces to create a public browser-based demo of some NLP tools. Demos really help with explaining NLP concepts — especially if they’re interactive.

My goal is to show how to cluster sentences together using open source libraries. The concept is easy to grasp, and many people have had to sift through large amounts of text in their jobs, so they can see the benefit this might bring.

There are many ways to tackle this problem of clustering text, but this demo brings together two ideas — sentence embeddings and clustering.

Sentence Embeddings

Computers can’t directly deal with text. We need to convert our sentences into numbers. More specifically, we convert them into a vector, usually known as an embedding — a list of numbers. How to do this conversion is an active area of NLP research and there are various options around.

Two open source models that do this are:

With these models, converting sentences to embeddings takes just a few lines of code. The underlying models are trained on large amounts of text data from the web, so they should perform well without any further training.

Once two sentences have been converted to embeddings, we can measure how similar they are. The similarity between two sentences is the cosine similarity of their embedding vectors, that is, the cosine of the angle between them. It is a number between -1 and 1, and values closer to 1 mean the sentences are more alike.
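To make the similarity measure concrete, here is a minimal sketch of cosine similarity in plain Python. The function name and the toy three-dimensional vectors are made up for illustration; real sentence embeddings typically have hundreds of dimensions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Toy "embeddings" (made up for illustration).
same_direction = cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
orthogonal = cosine_similarity([1.0, 0.0, 0.0], [0.0, 1.0, 0.0])
opposite = cosine_similarity([1.0, 2.0, 3.0], [-1.0, -2.0, -3.0])
```

Vectors pointing the same way score close to 1, unrelated (orthogonal) vectors score 0, and vectors pointing in opposite directions score close to -1. Only the direction matters, not the length of the vectors.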

To get a sense of what these sentence embedding models do, we can plot a heatmap for the similarity of some example sentences.

The heatmap plots a darker red where sentences are more similar to each other, hence the dark red line down the diagonal: each sentence is identical to itself!

As you might expect, the two sentences about the sun are most similar to each other, while the sentence about Tuesday is dissimilar to the others.
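The numbers behind a heatmap like this are just the pairwise similarities between all the sentences. Here is a rough sketch of how that matrix could be computed with NumPy. The helper name and the toy two-dimensional vectors are assumptions for illustration, standing in for real sentence embeddings.

```python
import numpy as np


def similarity_matrix(embeddings):
    """Return the matrix of pairwise cosine similarities between rows."""
    vectors = np.asarray(embeddings, dtype=float)
    # Scale every row to unit length (assumes no all-zero rows).
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    unit = vectors / norms
    # The dot product of two unit vectors is their cosine similarity.
    return unit @ unit.T


# Made-up 2-d vectors standing in for four sentence embeddings.
toy_embeddings = [
    [1.0, 0.1],
    [0.9, 0.2],
    [0.1, 1.0],
    [-1.0, 0.0],
]
sims = similarity_matrix(toy_embeddings)
```

The result is symmetric with ones down the diagonal, which is where the dark diagonal line comes from. A matrix like `sims` could then be handed to a plotting library, for example Matplotlib's `imshow`, to draw the heatmap.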

Clustering

Representing sentences as embeddings (i.e. vectors) means we can do more interesting things with them. The idea in this demo is to cluster them and discover whether there are similar topics. Clustering is a way of dividing a group of things (sentence embeddings in our case) into smaller groups that are similar in some way.

K-means clustering is a simple algorithm with a ready-to-use implementation in the scikit-learn library. We just need to give it the number of clusters and a few other options, and it’ll give us the clustering results.

With two clusters, the results are as we might expect.
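As a rough illustration of the clustering step, here is what a call to scikit-learn's `KMeans` might look like. The toy two-dimensional vectors are made up and stand in for real sentence embeddings, and the `n_init` and `random_state` settings are my own choices to make the run repeatable, not necessarily the options used in the demo.

```python
import numpy as np
from sklearn.cluster import KMeans

# Made-up 2-d vectors standing in for sentence embeddings:
# the first three point one way, the last three another.
toy_embeddings = np.array([
    [1.0, 0.1], [0.9, 0.2], [1.1, 0.0],
    [0.0, 1.0], [0.1, 0.9], [0.2, 1.1],
])

# n_clusters is the number of groups to find; n_init and random_state
# are set here only so that repeated runs give the same answer.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)

# fit_predict returns one cluster label per input row.
labels = kmeans.fit_predict(toy_embeddings)
```

Each entry of `labels` is the cluster assigned to the corresponding vector. The label numbers themselves are arbitrary; what matters is which items share a label.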

Results

Despite being new to some of the libraries involved, it only took me an hour or so to pull the demo together — a far cry from ML tools of the past!

To get the demo working I just needed to write two files: requirements.txt and app.py. The first lists the libraries that we need. The second contains the code to run, and in my demo I used a library called Streamlit to render the page.

The final demo is publicly available to try for yourself and see how well these libraries work on your own text.

I work with companies building AI technology. Get in touch to explore how we could work together.
