Anna Pillar

Data Scientist, Textmetrics
February 11, 2021

An Introduction to Automatic Keyphrase Extraction

With the growing amount of content on the Web, we have bountiful information resources in the form of web pages (henceforth called documents). At the same time, so much information can make it unwieldy to sift through and find exactly what we want. There is a need to represent documents concisely, so that a quick glance gives an idea of what a document is about. One way to do this is to describe the document with keyphrases. You might be more familiar with the term keywords, but a keyphrase can consist of more than one word and can therefore express concepts that a single word cannot. Summarizing a document is only one use case for keyphrases, and perhaps the most straightforward one; there are plenty more.

Why use keyphrases?

Here are a few examples of how keyphrases could be used:

  • Give the reader an idea of what the document is about at a quick glance. Keyphrases are short, descriptive, and few, so the reader can quickly tell what the document covers before actually reading it.
  • Organize documents based on the same or similar keyphrases. Documents with keyphrases allow for a systematic way to group them and easily find related documents.
  • Gain insight about a given topic. Given a set of documents about a specific topic, keyphrases from each document can provide the bigger picture of what is talked about within the topic.
  • Improve information retrieval systems. Indexing documents with keyphrases can allow for fast retrieval of relevant documents given a query.
  • Improve downstream Natural Language Processing (NLP) tasks. Keyphrases can, for example, be used for text summarization to guide the process, where the keyphrases must be included in the summary.

Automation

Manually assigning keyphrases is laborious, and different annotators often assign different keyphrases to the same document. Annotating all the documents on the Web by hand, or even a fraction of them, is hardly feasible. Nobody needs to annotate the whole Web, of course, but having a relatively large set of documents that needs keyphrases is realistic. Alternatively, you may just want keyphrases for a single document without reading it yourself and coming up with them. In both cases, automating the extraction of keyphrases is desirable. It offloads human labor and is faster than a human annotator (provided that the human is not the author of the document).

How to automatically extract keyphrases

There are various techniques for automatic keyphrase extraction. Let’s start with one of the simplest, yet quite effective, methods: term frequency-inverse document frequency (TF-IDF). TF-IDF is a score that intuitively captures properties of what we consider to be keywords or keyphrases. The term frequency denotes how many times the phrase occurs in the document. The inverse document frequency is inversely related to how many documents in a given background corpus contain the phrase; in practice it is usually the logarithm of the total number of documents divided by the number of documents that contain the phrase. The TF-IDF score is the product of the two. The score will be high when the phrase occurs frequently in the document, but in few documents of the background corpus. In other words, TF-IDF captures phrases that are both prominent in a document and distinctive for it, which is what we expect of a keyphrase. How well the TF-IDF score reflects keyphrases depends on the background corpus: a corpus from a different domain will make ordinary domain terms look distinctive. TF-IDF is one of the earlier and more basic approaches, so let’s now go over some more involved ones.
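
To make the scoring concrete, here is a minimal Python sketch of TF-IDF keyphrase ranking. It is an illustration under a few assumptions rather than a reference implementation: the function names (tokenize, ngrams, tfidf_keyphrases) are made up for this example, candidates are simply all n-grams of up to two words, and the IDF uses a smoothed logarithmic form, which is only one of several common variants.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and keep runs of letters and digits as tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def ngrams(tokens, max_n=2):
    """Return all contiguous phrases of 1..max_n tokens."""
    return [
        " ".join(tokens[i:i + n])
        for n in range(1, max_n + 1)
        for i in range(len(tokens) - n + 1)
    ]


def tfidf_keyphrases(document, background, top_k=3, max_n=2):
    """Rank candidate phrases of `document` by TF-IDF against `background`."""
    # Term frequency: how often each candidate occurs in the document.
    tf = Counter(ngrams(tokenize(document), max_n))
    # For document frequency we only need to know which phrases each
    # background document contains, so a set per document is enough.
    background_sets = [set(ngrams(tokenize(d), max_n)) for d in background]
    n_docs = len(background_sets)

    scores = {}
    for phrase, count in tf.items():
        df = sum(1 for phrases in background_sets if phrase in phrases)
        # Smoothed IDF (an assumption of this sketch): avoids division by
        # zero for phrases that never occur in the background corpus.
        idf = math.log((1 + n_docs) / (1 + df)) + 1
        scores[phrase] = count * idf

    # Highest score first; ties are broken alphabetically for stable output.
    return sorted(scores, key=lambda p: (-scores[p], p))[:top_k]
```

Called with a short document and a handful of background documents, the function returns the highest-scoring phrases. A word such as "the" that appears in every background document gets the lowest possible IDF, so it drops down the ranking even if it is frequent in the document itself.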

The next method represents a document as a graph. A word co-occurrence graph is constructed from the document: each word is a node, and words that appear next to each other are connected by an edge. The PageRank algorithm, originally developed at Google to rank web pages, can then be used to rank the words in the graph. The top-ranked words can be used as keywords, and top-ranked words that appear next to each other in the document can be merged into keyphrases.
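
The graph-based idea can be sketched in a few lines of Python. This is a simplified illustration, not a full system: the function names are hypothetical, PageRank is implemented as a plain power iteration with the commonly used damping factor of 0.85, and the sketch skips steps that practical systems usually add, such as filtering out stopwords and merging adjacent top-ranked words into keyphrases.

```python
import re
from collections import defaultdict


def build_cooccurrence_graph(tokens, window=2):
    """Connect each word to the words that follow it within `window`.

    With the default window of 2, only directly adjacent words are linked.
    """
    graph = defaultdict(set)
    for i, word in enumerate(tokens):
        for j in range(i + 1, min(i + window, len(tokens))):
            other = tokens[j]
            if other != word:
                # Undirected edge: store it in both directions.
                graph[word].add(other)
                graph[other].add(word)
    return graph


def pagerank(graph, damping=0.85, iterations=50):
    """Plain power-iteration PageRank on an undirected graph."""
    nodes = list(graph)
    rank = {node: 1.0 / len(nodes) for node in nodes}
    for _ in range(iterations):
        new_rank = {}
        for node in nodes:
            # Each neighbor shares its current rank equally among its edges.
            incoming = sum(rank[m] / len(graph[m]) for m in graph[node])
            new_rank[node] = (1 - damping) / len(nodes) + damping * incoming
        rank = new_rank
    return rank


def rank_words(text, top_k=5):
    """Return the `top_k` highest-ranked words of `text`."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    scores = pagerank(build_cooccurrence_graph(tokens))
    return sorted(scores, key=lambda w: (-scores[w], w))[:top_k]
```

Words that are connected to many different neighbors, and to well-connected neighbors in particular, end up with the highest rank. That is the same intuition PageRank applies to web pages and links.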

For the last two approaches we turn to supervised learning. This means we need a dataset of documents annotated with gold keyphrases to learn from. First, we have the more traditional approach using standard classifiers such as Naive Bayes and Logistic Regression. For these kinds of classifiers, the keyphrase extraction task is generally framed as binary classification. For a given candidate phrase from a document, the classifier outputs a score between zero and one that indicates how likely the candidate is to be a keyphrase: one means the classifier is certain it is a keyphrase, zero means it is certain it is not. Candidate phrases need to be represented by numerical feature vectors, because such classifiers cannot operate on raw text. Features to consider are, for example, the TF-IDF score and the position of the first occurrence of the phrase in the document. Note that you need to choose candidate phrases beforehand using some kind of heuristic, such as keeping only noun phrases.
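
As a rough sketch of this setup, the Python code below turns a candidate phrase into a two-dimensional feature vector and trains a tiny logistic regression on it. Everything here is an assumption made for illustration: the function names are hypothetical, plain relative term frequency stands in for the TF-IDF feature, the training loop is a bare-bones stochastic gradient descent, and candidate selection (for example, noun phrases only) is left out.

```python
import math


def candidate_features(phrase, tokens):
    """Features for one candidate: [relative frequency, first position].

    The first position is relative to the document length, so 0.0 means
    the phrase opens the document. Phrases that do not occur get 1.0.
    """
    words = phrase.split()
    n = len(words)
    positions = [
        i for i in range(len(tokens) - n + 1) if tokens[i:i + n] == words
    ]
    if not positions:
        return [0.0, 1.0]
    return [len(positions) / len(tokens), positions[0] / len(tokens)]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def keyphrase_score(features, weights, bias):
    """Score between 0 and 1: how likely the candidate is a keyphrase."""
    return sigmoid(bias + sum(w * f for w, f in zip(weights, features)))


def train_logistic_regression(X, y, learning_rate=0.5, epochs=2000):
    """Fit weights and bias with stochastic gradient descent on log loss."""
    weights = [0.0] * len(X[0])
    bias = 0.0
    for _ in range(epochs):
        for features, label in zip(X, y):
            error = keyphrase_score(features, weights, bias) - label
            weights = [
                w - learning_rate * error * f
                for w, f in zip(weights, features)
            ]
            bias -= learning_rate * error
    return weights, bias
```

In use, X would hold the feature vectors of the candidates from the annotated documents and y the labels (1 for gold keyphrases, 0 for all other candidates). After training, keyphrase_score gives the score for any new candidate, and the highest-scoring candidates of a document are returned as its keyphrases.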

Lastly, we have the deep learning approaches, which often use (contextual) word embeddings, such as those produced by BERT, to represent the text in an end-to-end system. Without going into too much detail, we will just list two common ways of modeling automatic keyphrase extraction with deep learning. Sequence labeling is one of them. If you are familiar with part-of-speech (POS) tagging, it works similarly: instead of assigning a part of speech to each word, we tag each word in a sequence as being part of a keyphrase or not. The other way is to take a sequence of words as input, extract n-gram candidates using a convolutional neural network (CNN), and then use a classification network to classify each candidate as a keyphrase or not a keyphrase.
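
To show what the output of a sequence labeling model looks like, the short Python sketch below turns per-word tags back into keyphrases. It assumes the widely used BIO tagging convention (B for the first word of a keyphrase, I for a word inside one, O for a word outside any keyphrase), which is one possible choice of tag set. The function name decode_bio is made up for this example, and the neural network that predicts the tags is not part of the sketch.

```python
def decode_bio(tokens, tags):
    """Collect keyphrases from per-token B/I/O tags."""
    phrases = []
    current = []
    for token, tag in zip(tokens, tags):
        if tag == "B":
            # A new keyphrase starts; close the previous one if still open.
            if current:
                phrases.append(" ".join(current))
            current = [token]
        elif tag == "I" and current:
            # Continue the keyphrase that is currently open.
            current.append(token)
        else:
            # "O", or an "I" with no open keyphrase: close any open phrase.
            if current:
                phrases.append(" ".join(current))
            current = []
    if current:
        phrases.append(" ".join(current))
    return phrases


tokens = ["Automatic", "keyphrase", "extraction", "uses", "word", "embeddings"]
tags = ["O", "B", "I", "O", "B", "I"]
print(decode_bio(tokens, tags))  # ['keyphrase extraction', 'word embeddings']
```

The model itself only has to make one small decision per word, and multi-word keyphrases fall out of the decoding step.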

Closing remarks

We have briefly discussed why keyphrases are useful and why extracting them should be automated. We also took a closer look at the techniques that go into an automatic keyphrase extraction system. As we have seen, keyphrases are versatile and can be helpful in various tasks. Automating their extraction is the logical step to deal with the ever-increasing amount of data.
