Natural Language Processing for Post-Livestream Analysis

Improving the Livestream Shopping Experience

The motivation behind this blog and project was to see if there was any way to gain additional insights into live streaming data besides viewer count and engagement. Currently, livestreaming platforms offer great tools for identifying subscriptions, purchase history during a stream, and engagement during a stream. However, there is a missed opportunity for more tools for livestreamers to really understand their audience by analyzing the chat log itself.

Empowering influencers with real-time and post-stream analytics from participating customers will let them build more meaningful connections with their community, especially during a time of social isolation. My coworker, Allison Youngdahl, has written a great blog about a proof of concept that assists live streamers by gathering audience insights in real-time using natural language processing (NLP). As a continuation of that work, this blog addresses how to conduct a more in-depth audience analysis after the stream is over.

In other words, would it be possible to determine what people are talking about in a three-hour live stream without sifting through all the chat messages manually? If a streamer was talking about going vegan during their stream, could we build a tool that could show veganism, meat, and food as identified topics in the post-stream analysis? If I wanted to know what specific messages related most to those topics or search chat based on a keyword, it would be useful to have a search tool to identify semantically similar messages. Through such a tool, I could possibly discover that people were having a heated debate about Impossible burgers in the chat or gain insight into how many people in the chat identify as non-meat eaters. From a livestream shopping perspective, this could provide unique consumer data and could be useful for targeted marketing efforts, among other use cases.

Natural Language Processing for Chat Analysis

This blog addresses how natural language processing can be utilized to analyze livestreams after the fact and yield valuable insights. Specifically, I will discuss how to obtain chat logs from Twitch and put them in a Pandas data frame, preprocess the chat data using Texthero and other NLP tools, perform topic modeling using Latent Dirichlet allocation (LDA), and search for semantically similar chat messages based on a keyword or sentence using the Google Universal Sentence Encoder (USE). The code for this blog is available on my GitHub.

Disclaimer: This post isn’t meant to be an in-depth explanation of machine learning or natural language processing best practices; rather, it is a guide on topic modeling and semantic search with Python and a technological exploration into whether this technique can yield effective results in the livestream space.

In order to analyze a large corpus of text, I needed four things: (1) a method to obtain the Twitch data and make it usable; (2) a method to derive topics from a large corpus of chat messages; (3) a method to semantically search through the chat messages quickly; and (4) a programming language that could accomplish all these tasks.

Thankfully, Brendan Martin has a great method for obtaining Twitch chat data that we can employ. If we clean up this data after obtaining it from Twitch (e.g., removing emojis, removing unnecessary punctuation, etc.), we can employ natural language processing to analyze the data. Specifically, we can use topic modeling, an unsupervised machine learning technique that scans a set of documents, detects word and phrase patterns, and automatically clusters the words that characterize those documents (or in this case, Twitch chat). Because this technique is unsupervised, we don’t need labeled training data; the model is fit directly on the chat messages! Assuming we can find a method to perform topic modeling on the chat data, we still need a method to semantically search a document based on a provided keyword or topic.

Fortunately, text analysis is a fairly well-documented process and Google has a pre-trained Universal Sentence Encoder that can be utilized for textual similarity. We can use this to create a semantic search tool that can scan our corpus for semantically relevant information based on a keyword or topic. Even more fortunate is the fact that all three of these tasks can be accomplished in a Jupyter Notebook using Python.

Step 1 of 3: Obtaining Livestream Chat Data from Twitch

After creating your own “.ipynb” Jupyter Notebook file and opening it (guide here), the first step is to obtain an Authorization token for your Twitch account. This requires you to have a Twitch account and to be logged in. For more clarification on this step, you can refer back to my GitHub code or Brendan Martin’s article — which is where this method is from. To obtain an authorization token, go to the following site while logged in to your Twitch account. Once logged in, you will be provided an authorization code that looks similar to the following: oauth:43rip6j6fgio8n5xly1oum1lph8ikl1. This will be one piece of information, along with your Twitch username and the channel you’re interested in, that you will provide.

The next step is to establish a connection to the Twitch IRC, which is done using Python’s socket library. We send the socket our token, channel, and nickname information as well as connect to the specified port.
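As a rough sketch of what this connection step involves (not the exact code from the original notebook), the handshake boils down to opening a socket and sending three IRC commands. The server name and port below are Twitch's public plain-text IRC endpoint as I understand it, and the function names are hypothetical.

```python
import socket

# Twitch's public IRC endpoint (plain-text port); treat these as assumptions.
SERVER = "irc.chat.twitch.tv"
PORT = 6667


def build_handshake(token, nickname, channel):
    """Return the three IRC commands that authenticate and join a channel."""
    if not channel.startswith("#"):
        channel = "#" + channel
    return [
        f"PASS {token}\r\n".encode("utf-8"),
        f"NICK {nickname}\r\n".encode("utf-8"),
        f"JOIN {channel.lower()}\r\n".encode("utf-8"),
    ]


def connect(token, nickname, channel):
    """Open the socket and send the handshake (requires network access)."""
    sock = socket.socket()
    sock.connect((SERVER, PORT))
    for command in build_handshake(token, nickname, channel):
        sock.sendall(command)
    return sock
```

Keeping the message construction separate from the network call makes it easy to check the commands without actually connecting.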

Then, we can set up a logger that writes messages to a file and a loop that’ll check for new messages as long as the socket’s open. A screenshot of the code is provided below. It’s important to note that the code below will run continuously and properly assuming all the information is input correctly and the streamer is currently streaming. In the code that configures logger, you can change chat2.log to be whatever name you want. It will create a file of that name with the chat information and continuously update it as long as the code is running. Additionally, we use the demojize function from the emoji library to convert emojis used in the Twitch chat to text when logging information. Finally, Twitch IRC requires us to send “PONG” if the server sends “PING”, which is why that is included in the conditional statement.
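A minimal sketch of that logging loop, under a few assumptions, is shown here. The log line format (timestamp, separator, raw IRC line) and the helper names are my own choices for illustration, and the emoji conversion is passed in as an optional `transform` so the snippet runs without the emoji library installed.

```python
import logging


def configure_logger(path="chat2.log"):
    """Write every logged chat line to `path`, prefixed with a timestamp."""
    logging.basicConfig(
        level=logging.INFO,
        format="%(asctime)s | %(message)s",  # separator is an assumption
        datefmt="%Y-%m-%d_%H:%M:%S",
        handlers=[logging.FileHandler(path, encoding="utf-8")],
    )


def handle_chunk(raw, send, log=logging.info, transform=lambda text: text):
    """Process one decoded chunk received from the socket.

    Answers PING keep-alives with PONG and logs every other line.
    Returns the number of lines that were logged.
    """
    logged = 0
    for line in raw.split("\r\n"):
        if not line:
            continue
        if line.startswith("PING"):
            # Echo the server's payload back, as IRC keep-alives expect.
            send("PONG" + line[4:] + "\r\n")
        else:
            log(transform(line))
            logged += 1
    return logged


# Typical loop (needs an open socket `sock` from the connection step):
# configure_logger("chat2.log")
# while True:
#     raw = sock.recv(2048).decode("utf-8")
#     handle_chunk(raw, lambda s: sock.sendall(s.encode("utf-8")),
#                  transform=emoji.demojize)
```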

It’s important to note that once you run the code block before sock.close() in the screenshot above, it will run continuously until you manually stop the code or close the socket. While the code is running, you can check if it’s logging properly by going to the terminal, navigating to the directory your Jupyter notebook is in, and running tail -f chat2.log (assuming that your file is called chat2.log). Once your code is terminated, your log file will look similar to this:

If you don’t want to obtain chat data in real-time but want to utilize data from previous chats, there are resources online that you can use. I haven’t tested them, but the methods I outline in this blog can still be utilized if you can convert that data into a Pandas data frame.

Assuming you have stopped your code and have a log file generated with a good amount of Twitch chat data, it’s time to parse and process the chat logs. The following code, which is modified slightly from Brendan Martin’s, allows you to create a Pandas data frame that stores the channel, message, and username information for the provided stream. It uses the regular expression library to parse out punctuation.
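To make the parsing step concrete, here is a small sketch that turns log lines into records. It assumes each line starts with a `YYYY-MM-DD_HH:MM:SS` timestamp followed by the raw IRC message (`:user!user@user.tmi.twitch.tv PRIVMSG #channel :text`); the exact log layout and the function name are assumptions, so adjust the pattern to whatever your logger wrote.

```python
import re
from datetime import datetime

# Timestamp, any separator without a colon, then the raw IRC chat message.
LINE_RE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2}_\d{2}:\d{2}:\d{2})[^:]*"
    r":(?P<username>[^!\s]+)!\S+ PRIVMSG #(?P<channel>\S+) :(?P<message>.*)$"
)


def parse_chat_log(lines):
    """Return one dict per chat message, skipping non-chat and blank lines."""
    records = []
    for line in lines:
        match = LINE_RE.match(line.strip())
        if match is None:
            continue  # server notices, joins, partial lines
        message = match["message"].strip()
        if not message:
            continue  # messages with no text cause problems later on
        records.append({
            "dt": datetime.strptime(match["ts"], "%Y-%m-%d_%H:%M:%S"),
            "channel": match["channel"],
            "username": match["username"],
            "message": message,
        })
    return records
```

The resulting list can be handed to `pandas.DataFrame(records)` to get a data frame with one row per message. Because blank messages are skipped while parsing, no manual row removal should be needed for this sketch.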

It’s important to note that in the screenshot above, I ran df.drop(1248,inplace=True). This is because the code will not work if the data reads a chat as having no text, which was the case for the 1248th Twitch message (the user submitted a message with no text). Please make sure to remove any blank messages from the data frame in this step! Once you finish removing the blank messages, when you run df, you can see your data in a neat format:

Now, you have your chat data from Twitch in a Pandas data frame, which we can then use for NLP (topic modeling).

Step 2 of 3: Topic Modeling on Twitch Chat Data using Latent Dirichlet Allocation (LDA)

Previously, I mentioned that topic modeling would enable us to take a large amount of text data and generate a set of words or topics that describe that data without explicitly training a model to do so. Topic modeling is a type of statistical modeling to discover topics (words that frequently occur together) in a collection of documents. Latent Dirichlet Allocation (LDA) is one of the most popular methods of performing topic modeling. LDA imagines a fixed set of topics (representing a set of words) and maps all the documents to the topics in a way where the words are captured by those topics (source). Without getting too technical, as shown in the figure below, LDA operates by using Dirichlet distributions as prior knowledge to generate documents made of topics and updates them until they match ground truth. In other words, documents can be described by a distribution of topics and each topic can be described by a distribution of words.

It makes sense to utilize this process for Twitch data if we imagine Twitch chat messages as a “document.” Admittedly, I would likely achieve more accurate results if I treated each minute of chat as a “document” for LDA, as done by Thomas Debeauvais (LDA works better with longer texts than with shorter ones). This is an improvement I would like to make in the future; the following code can thus be seen as a technical experiment that establishes a baseline.
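For readers who want to try the per-minute variant, one possible sketch is to bucket messages by the minute they were sent and join each bucket into a single document. The record layout (dicts with `dt` and `message` keys) and the function name are hypothetical.

```python
from collections import defaultdict


def messages_per_minute(records):
    """Join all messages sent within the same minute into one document."""
    buckets = defaultdict(list)
    for record in records:
        minute = record["dt"].replace(second=0, microsecond=0)
        buckets[minute].append(record["message"])
    # One string per minute, in chronological order.
    return [" ".join(buckets[minute]) for minute in sorted(buckets)]
```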

In order to perform LDA, we need to pre-process the data. This is done by completing the following steps: (1) tokenization: splitting the text into sentences and those sentences into words (lowercasing and removing punctuation); (2) removing stopwords; (3) lemmatization: reducing words to their dictionary form, e.g., changing third-person forms to first person and past or future tense verbs to the present tense; and (4) stemming: reducing words to their root or stem form. In order to conduct these steps, we will first use Texthero, a very useful Python library for NLP pre-processing. Before continuing, I’d like to mention that these methods of performing LDA on text data are either derived or modified from Barsha Saha and her courses on Topic Modeling on Coursera.

Using the very helpful TextHero library for stopword removal, word cloud creation, and preprocessing.

The first code block in the screenshot will remove punctuation, lowercase the messages, and remove URLs in all the messages. The next code block contains a list of custom stopwords appended to an existing list of default English stopwords, which are all removed from the data. Stopwords, defined as “common words that would appear to be of little value in helping select documents matching a user need [and are] excluded from the vocabulary,” were determined by manual selection via WordCloud (which is what the third code block accomplishes). A WordCloud of the Twitch data looks like the following:

Example Twitch WordCloud — which is used to detect/add custom stopwords and see frequent terms.
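The cleaning steps described above can be approximated with the standard library alone. This is a stand-in sketch, not Texthero itself: the tiny stopword sets are placeholders (the custom entries are hypothetical), and a real run would use a full English stopword list plus whatever the word cloud suggests.

```python
import re
import string

# Placeholder lists; a real run would use a full English stopword list.
DEFAULT_STOPWORDS = {"the", "a", "an", "is", "are", "to", "of", "and", "in", "it"}
CUSTOM_STOPWORDS = {"lol", "lul", "kappa"}  # hypothetical chat-specific additions

URL_RE = re.compile(r"https?://\S+|www\.\S+")
PUNCT_TABLE = str.maketrans(string.punctuation, " " * len(string.punctuation))


def clean_message(text, stopwords=DEFAULT_STOPWORDS | CUSTOM_STOPWORDS):
    """Lowercase, strip URLs and punctuation, then drop stopwords."""
    text = URL_RE.sub(" ", text.lower())
    text = text.translate(PUNCT_TABLE)
    words = [word for word in text.split() if word not in stopwords]
    return " ".join(words)
```

Applied to a pandas column, this would be something like `df["message"].apply(clean_message)`.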

Next, after removing stopwords and cleaning up the chat messages, we perform tokenization. We use the Gensim library for this and the code looks like the following:

After tokenization, we lemmatize; for this, we will use the spaCy Python library. We parse the Twitch message data, keeping only nouns, adjectives, verbs, and adverbs. This looks like the following:

After you complete this step, if you were to run print(data_lemmatized[:20]), you will receive output like the following:

“‘internet quite eye opener’, ‘sweet candy spicy food’, ‘ponynamedtony’, ‘consider play minecraft’, ‘favorite’, ‘place live’, ‘superior condiment’, ‘’, ‘sour’, ‘’, ‘superior condiment’, ‘make tho’, ‘land website first visit’, ‘ponynamedtony evening’, ‘ponynamedtony evening’, ‘’, ‘stop ask favorite’, ‘armorant sugar may spell wrong less refined white sugar’.

After performing lemmatization, we can start the process of building an LDA model to do topic modeling. To do this, we will use the pyLDAvis and scikit-learn libraries. After importing these libraries, we need to create a document-word matrix (which converts a collection of text documents to a matrix of token counts). In other words, this is a mathematical matrix that describes the frequency of terms that occur in a collection of documents, where rows refer to the documents in the collection and columns refer to the terms. The LDA topic model algorithm requires a document-word matrix as its main input, which is why this step is important.
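To illustrate what such a matrix contains, here is a plain-Python sketch that builds one by hand. It is only meant to show the structure (rows are documents, columns are terms); the function name is hypothetical, and in practice a library vectorizer does this far more efficiently.

```python
from collections import Counter


def document_word_matrix(documents, min_df=1):
    """Build a count matrix: one row per document, one column per term.

    Terms that appear in fewer than `min_df` documents are dropped.
    """
    tokenized = [doc.split() for doc in documents]
    doc_freq = Counter(word for tokens in tokenized for word in set(tokens))
    vocabulary = sorted(word for word, df in doc_freq.items() if df >= min_df)
    index = {word: column for column, word in enumerate(vocabulary)}

    matrix = []
    for tokens in tokenized:
        row = [0] * len(vocabulary)
        for word in tokens:
            if word in index:
                row[index[word]] += 1
        matrix.append(row)
    return vocabulary, matrix
```

For the messages "sweet candy spicy food" and "spicy food", the columns for "food" and "spicy" hold a 1 in both rows, while "candy" and "sweet" are 1 only in the first row.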

We create a document-word matrix using a CountVectorizer. In the case below, our CountVectorizer will only consider words that occur in at least ten documents (min_df), remove English stopwords, convert words to lowercase, and keep only words that are at least three characters long.

Next, we build the LDA model using the function provided by scikit-learn. I built my LDA model with the following parameters; I then calculated the log-likelihood and perplexity of the model. Keep in mind that, depending on your computer’s specifications, these processes may take time.
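A minimal sketch of these two steps with scikit-learn follows. The toy corpus, the number of topics, and the other hyperparameter values are illustrative assumptions rather than the settings used for the real chat data.

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus, repeated so that every word clears the min_df=10 threshold.
documents = [
    "vegan plant animal meat",
    "meat fish egg animal",
    "love friend hear explain",
    "friend love watch stream",
] * 10

vectorizer = CountVectorizer(
    min_df=10,                         # keep words found in >= 10 documents
    stop_words="english",              # drop common English stopwords
    lowercase=True,
    token_pattern=r"[a-zA-Z0-9]{3,}",  # words of three or more characters
)
data_vectorized = vectorizer.fit_transform(documents)

lda_model = LatentDirichletAllocation(
    n_components=2,            # number of topics; illustrative value
    learning_method="online",
    max_iter=10,
    random_state=100,
)
lda_output = lda_model.fit_transform(data_vectorized)  # document-topic weights

# Higher log-likelihood and lower perplexity generally indicate a better fit.
log_likelihood = lda_model.score(data_vectorized)
perplexity = lda_model.perplexity(data_vectorized)
print("Log-likelihood:", log_likelihood)
print("Perplexity:", perplexity)
```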

After evaluating this model, I utilized Grid Search to optimize the algorithm to find the best LDA model for the data. Grid Search is a computationally taxing tuning technique to compute the optimum values of hyperparameters. This code, if run, will take time as it is performing an exhaustive search on specific parameter values of a model (the model is also called an estimator). After running the code (shown below) and re-calculating the perplexity and log-likelihood, we can identify which model is best for the data.
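The grid search itself can be sketched with scikit-learn's GridSearchCV. The parameter grid below is a small illustrative assumption; a real search would cover more values and take correspondingly longer.

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import GridSearchCV

# Toy corpus standing in for the cleaned chat messages.
documents = [
    "vegan plant animal meat",
    "meat fish egg animal",
    "love friend hear explain",
    "friend love watch stream",
] * 10
data_vectorized = CountVectorizer().fit_transform(documents)

# Candidate hyperparameter values (illustrative).
search_params = {
    "n_components": [2, 3],
    "learning_decay": [0.7, 0.9],
}

# With no explicit scoring, GridSearchCV uses the estimator's own score(),
# which for LDA is an approximate log-likelihood.
search = GridSearchCV(
    LatentDirichletAllocation(max_iter=5, random_state=0),
    param_grid=search_params,
    cv=3,
)
search.fit(data_vectorized)

best_lda_model = search.best_estimator_
print("Best parameters:", search.best_params_)
print("Best log-likelihood score:", search.best_score_)
print("Perplexity:", best_lda_model.perplexity(data_vectorized))
```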

After using GridSearch and picking out the best LDA model for the data, we can use the pyLDAvis library for data visualization of our topics, which looked like the following for the data I obtained from chat.

This interactive map can provide a lot of valuable information about the data. For example, each circle represents a generated topic, and the area of each circle is proportional to the share of the corpus’s tokens that belong to that topic. Blue bars represent the overall frequency of a term in the corpus; red bars represent the estimated number of times a term was generated by a given topic. The intertopic distance map is a visualization of the topics in 2D space; topics that are closer together will have more words in common. For the specifics around the visualization, please see the following documentation or read this helpful article.

When running the visualization, we are first presented with the most prevalent terms in the corpus (Twitch chat log) on the right-hand side. In this stream’s case, we see the terms “people,” “eat,” “watch,” “love,” “meat,” “sweet,” and “animal” as some of the most prevalent terms. From this, we can infer that the chat was talking about food or what kinds of food people eat. Further delving into the individual topics by hovering over or clicking the circles can help us get more specific information on what different topics appeared at different points in the stream.

For example, if we were to look at topic three (see screenshot above), it appears that the chat was talking about something related to veganism or vegetarianism (e.g., vegan, egg, fish, kill, plant, and animal were identified terms). Because topic three’s circle is near topic one’s circle, we know that these two topics are likely to have more in common than topic three and topic six, which are far apart from one another. If we take a look at the terms in topic six on the map, we see “love,” “friend,” “tick,” “hear,” and “explain.” This may be a strange set of words without context; however, given that this stream data is taken from British YouTuber sweet_anita, who has Tourette’s syndrome, one can infer that an explanation of this condition was being discussed in the chat. It also explains why topic eight on the map brings up “Tourette.”

For better visualization of the words and topics, we can utilize the following code, which will place the topics/words in a grid-like format. Please note that these will simply list the topics and not put them in any particular order. In other words, this is a simple list of the topics and their associated words — not structured in order of the most prevalent topics.
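One way to produce such a listing is to take the highest-weighted words from each topic's row of term weights. The helper below is a hypothetical sketch; with scikit-learn, `components` would be `lda_model.components_` and `feature_names` would come from the vectorizer's `get_feature_names_out()` in recent versions.

```python
def top_words_per_topic(components, feature_names, n_words=5):
    """Return the `n_words` highest-weighted terms for each topic."""
    topics = []
    for weights in components:
        # Column indices sorted from largest to smallest weight.
        ranked = sorted(range(len(weights)), key=lambda i: weights[i], reverse=True)
        topics.append([feature_names[i] for i in ranked[:n_words]])
    return topics
```

Each inner list corresponds to one topic, in the same order as the model's topics, so wrapping the result in `pandas.DataFrame(...)` gives a simple grid with one row per topic.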

Step 3 of 3: Semantic Text Similarity & Document Search

Now that we are able to identify topics from a given corpus, it would be helpful if we could analyze that corpus to identify what people were saying about that topic. For this step, we will modify and utilize a semantic search method defined by Zayed Rais and Ali Zahid Raja in their GitHub project.

First, we conduct word tokenization and lemmatization (we can use the NLTK library). After doing this, we can add a column in our original Pandas dataframe that includes the tokenized list/keywords.

Given this list, we can add it to our Pandas dataframe using the following one-liner (adding the lemmatized words to the dataframe): df.insert(loc=4,column='Clean_Keyword',value=df_clean['Keyword_final'].tolist()). Our updated dataframe will look something like the following:

When building the semantic search tool, I explored using both TF-IDF and Google’s Universal Sentence Encoder (USE). Both methods provide a way to search Twitch chat; however, the Google USE method is much better at searching based on semantic similarity. I’ve included how to do both methods in this blog, but I highly suggest using USE. First, let’s look briefly at TF-IDF.

TF-IDF is the process of calculating the weight of each word (signifying the importance of the word in the corpus/document). The algorithm is used mainly for information retrieval and text mining. TF (Term Frequency) is how many times a word appears in a document divided by the number of words in the document. IDF (Inverse Document Frequency) is the log of the number of documents divided by the number of documents that contain the word W we’re interested in. TF-IDF is just these two numbers multiplied together; scikit-learn implements this for you in its library. The following code creates the TF-IDF weight for the dataset.

Creating the TF-IDF Weight of the whole dataset.
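To make the definition concrete, the textbook formula can be written out by hand. This sketch follows the plain definition given above; scikit-learn's TfidfVectorizer applies smoothing and normalization by default, so its numbers will differ somewhat. The function name is hypothetical.

```python
import math


def tf_idf(documents):
    """Return one {word: weight} dict per document using plain TF * IDF."""
    tokenized = [doc.split() for doc in documents]
    n_docs = len(tokenized)

    # Number of documents that contain each word.
    doc_freq = {}
    for tokens in tokenized:
        for word in set(tokens):
            doc_freq[word] = doc_freq.get(word, 0) + 1

    weights = []
    for tokens in tokenized:
        row = {}
        for word in set(tokens):
            tf = tokens.count(word) / len(tokens)       # term frequency
            idf = math.log(n_docs / doc_freq[word])     # inverse document frequency
            row[word] = tf * idf
        weights.append(row)
    return weights
```

Note that a word appearing in every document gets an IDF of log(1) = 0, so it contributes nothing to the weight, which is exactly the intent: words that appear everywhere do not help distinguish one message from another.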

Then, if you were to follow Zayed Rais and Ali Zahid Raja’s method, you would create a vector for Query/search keywords and then build a function for cosine similarity. See screenshots below for more information.
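The cosine similarity part can be sketched independently of any library. Here each document and the query are represented as sparse `{word: weight}` vectors; the function names and that representation are assumptions made for illustration, not the code from the referenced project.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse {word: weight} vectors."""
    dot = sum(weight * b.get(word, 0.0) for word, weight in a.items())
    norm_a = math.sqrt(sum(weight * weight for weight in a.values()))
    norm_b = math.sqrt(sum(weight * weight for weight in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty vector is similar to nothing
    return dot / (norm_a * norm_b)


def rank_documents(query_vector, document_vectors, top_k=3):
    """Return (document index, similarity) pairs, most similar first."""
    scored = [
        (cosine_similarity(query_vector, vector), index)
        for index, vector in enumerate(document_vectors)
    ]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [(index, score) for score, index in scored[:top_k]]
```

A document that shares no words with the query scores 0 here, which illustrates the limitation of word-overlap methods for semantic search.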

Using the TF-IDF method will enable you to search the corpus (Twitch chat messages); however, it will just return results based on words available in the documents. Because we want to semantically search, we need another method. Thankfully, Google’s USE (Universal Sentence Encoder) is a fantastic pre-trained and public tool that takes a word, sentence, or paragraph and encodes it into a 512-dimension vector for use with text classification, semantic similarity, clustering, and other NLP tasks. USE comes in two variations — one trained with a transformer encoder and one trained with a deep averaging network; we will use the deep averaging network-trained model, which trades accuracy for being computationally more efficient.

After a few installs and import statements, we load the USE model like so:

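After loading the model, we need to “train” it on our chat data, which in practice means running the messages through the pre-trained encoder to obtain their embedding vectors (we will do this batch-wise with a chunk size of 1000 rows).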

After doing so, we will load the model batch-wise, like so:

Next, we train the model once, load it, and build the semantic similarity function. This looks like the following:
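A minimal sketch of this idea (not the notebook's exact code) is given here. It assumes `embed` is any callable that maps a list of strings to a 2D array of vectors; in practice that could be the USE model loaded through TensorFlow Hub, but the function names and the batching helper are hypothetical.

```python
import numpy as np

# In practice `embed` could be the pre-trained encoder, for example
# (assumed URL for the deep averaging network variant):
#   import tensorflow_hub as hub
#   embed = hub.load("https://tfhub.dev/google/universal-sentence-encoder/4")


def embed_in_batches(messages, embed, chunk_size=1000):
    """Encode messages chunk by chunk and stack the resulting vectors."""
    chunks = []
    for start in range(0, len(messages), chunk_size):
        batch = messages[start:start + chunk_size]
        chunks.append(np.asarray(embed(batch), dtype=float))
    return np.vstack(chunks)


def semantic_search(query, messages, message_vectors, embed, top_k=5):
    """Return the `top_k` messages most similar to `query` by cosine similarity."""
    query_vector = np.asarray(embed([query]), dtype=float)[0]
    norms = np.linalg.norm(message_vectors, axis=1) * np.linalg.norm(query_vector)
    scores = message_vectors @ query_vector / np.where(norms == 0, 1.0, norms)
    best = np.argsort(-scores)[:top_k]
    return [(messages[i], float(scores[i])) for i in best]
```

The message vectors only need to be computed once; after that, each search embeds just the query and compares it against the stored vectors.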

Now, finally, running the command with a keyword or identified topic.

A semantic search of the Twitch Chat Log using the Google USE method.

In this example, I chose “meat” as a keyword because it seemed that topic three identified previously was related to vegetarianism and meat-eating. I can now identify that people were talking about veganism and eating meat without having to parse through the entirety of the Twitch streamer’s chat. Additionally, messages about veganism are showing despite me not specifically typing out “veganism” or “vegan” as a keyword.


From the preliminary results, it seems that we are able to obtain a list of topics and semantically search Twitch chat for specific messages related to a topic or keyword. For livestream shopping, this could be valuable for seeing what people are saying about a product or generating new types of consumer data for companies to utilize. In the future, I’d like to perform topic modeling based on a minute of chat data rather than each individual message; additionally, I’d like to use more contemporary deep learning techniques to perform semantic search and analysis.

Interested in hearing more?

Please contact me or my team member Allison Youngdahl. To read more about Accenture Labs and our R&D areas, please visit our website.


This code is either taken from or heavily based on work from Barsha Saha and her courses on Coursera on Topic Modeling for Business and Optimization of Topic Models using the Grid Search Method. Additionally, I would like to thank Brendan Martin for their thorough and well-documented method for streaming/logging chat from the Twitch IRC. The method for obtaining Twitch chat logs is from Brendan’s code on GitHub. Also, the code regarding text similarity is either taken from or heavily based on work from Zayed Rais and Ali Zahid Raja. Specifically, their blog on building a semantic document search engine and their GitHub project were very useful. Finally, I’d like to thank Allison Youngdahl for proofreading and for her assistance with the project.
