Stanford’s Natural Language Processing Software: Text Tagging and Finding Named Entities

Introduction

The Stanford Natural Language Processing (NLP) Group at Stanford University maintains an open suite of language analysis tools that are available for the public to use. Most of the tools are only available for English, but some also support Chinese, Spanish, German, and Arabic. This tutorial focuses on the English tool sets, specifically the Named Entity Recognizer and the Parts of Speech Tagger. These tools are helpful if you want to pinpoint and extract specific locations or organizations from a text, look at the complexity of sentence structure, or even look for hesitations in transcripts of English as a second language learners to see where they pause the longest. There are various applications of this technology in research and learning.

Named Entity Recognizer

The Named Entity Recognizer (or NER) labels words in a text that are names of things, such as people, organizations, locations, and even genes and proteins. Once your text has been run through NER, the output will look something like the image below, with the NER window on the left and the Terminal output on the right:

Example of Named Entity Recognizer output

Parts Of Speech Tagger

The Parts of Speech Tagger lets you copy and paste large quantities of text into the tagger, which then assigns a part of speech (noun, verb, adjective, etc.) to each word. The model used in this tutorial is reported to tag parts of speech with 96.97% accuracy. The output will look something like what you see below:

image of the output from the parts of speech tagger

Let’s get started with these tools!

Getting Started

Installing Java

You’ll need to have Java version 1.8 or later installed on your computer to run the Stanford NLP (Natural Language Processing) software. To install Java, go to Oracle’s website, click the Agree to Terms button, and then choose the download that matches your operating system.

image of how to install Java

Here are some additional instructions on how to install Java if you run into difficulties.
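Before moving on, it can help to confirm which Java version your computer actually has. The sketch below is one way to do that from Terminal on a Mac or Linux machine. It assumes a POSIX shell, so it will not work as written in the Windows Command Prompt. The helper name java_major is made up for this example; it reduces a version string such as 1.8.0_292 or 17.0.2 to its major number so you can compare it with 8.

```shell
# java_major (hypothetical helper): print the major version number from a
# Java version string. Handles the older "1.8.0_292" numbering scheme and
# the newer "17.0.2" scheme.
java_major() {
    ver=$1
    case "$ver" in
        1.*) ver=${ver#1.} ;;    # older scheme: drop the leading "1."
    esac
    echo "${ver%%[!0-9]*}"       # keep only the leading run of digits
}

# Ask the installed Java for its version (java prints it to stderr).
if command -v java >/dev/null 2>&1; then
    installed=$(java -version 2>&1 | head -n 1 | cut -d'"' -f2)
    echo "Java major version: $(java_major "$installed")"
else
    echo "Java was not found on this computer's PATH"
fi
```

If the number printed is 8 or higher, you should be ready for the steps below.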

Part 1: Using the Named Entity Recognizer (NER)

Download the Named Entity Recognizer (NER) Software

As described above, the Named Entity Recognizer (or NER) labels words in a text that are names of things. The software is free, and you can download it here.

how to download the named entity recognizer

Make sure to save the NER files on your Desktop or some easily accessible place on your computer. Once the file is done downloading, unzip the file by double clicking it:

Unzip the file by double clicking

I like to rename the unzipped folder to just stanford-ner, so that it’s easier to navigate to it from the Terminal window.

Using NER Through Terminal

Next, open Terminal and navigate to the stanford-ner folder.

To access Terminal on a Mac or Command Prompt on Windows you can check out the tutorials below:

  • If you have a Mac check out this video to learn more
  • If you have Windows 8 check out this video to learn more
  • This post shows how to open the command prompt for pre-Windows 8 systems

After you’re in the stanford-ner folder in Terminal, copy and paste the following into the Terminal window:

java -mx1000m -jar stanford-ner.jar

Doing this should cause the Stanford Named Entity Recognizer to open:

Named Entity Recognizer window image
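If the window does not open, one common cause is running the command from a folder that does not contain stanford-ner.jar, in which case Java reports that it is unable to access the jar file. The sketch below wraps the same command in a small check. The function name launch_jar is made up for this example, and the Desktop path in the comment is only an assumption about where you saved the folder.

```shell
# launch_jar (hypothetical helper): run a jar with the tutorial's memory
# setting, but only if the jar file is present in the current folder.
launch_jar() {
    jar=$1
    if [ ! -f "$jar" ]; then
        echo "Cannot find $jar in $(pwd). Use cd to move into the unzipped folder first." >&2
        return 1
    fi
    java -mx1000m -jar "$jar"
}

# Example, assuming the folder was renamed to stanford-ner and saved on the Desktop:
#   cd ~/Desktop/stanford-ner
#   launch_jar stanford-ner.jar
```

The same helper should work for the Parts of Speech Tagger in Part 2 by passing stanford-postagger.jar instead.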

Inside this box you can delete the current text and paste in your own. Next we need to load a classifier, which is a machine learning model that assigns each item (here, each word) to one of k classes, where k is simply the number of categories the model knows about. To do this, go to “Classifier” and choose “Load CRF from File”:

Loading a classifier from a file

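The classifier used here recognizes seven classes (Location, Person, Organization, Money, Percent, Date, and Time), so k is 7. Select the “english.muc.7class.distsim.crf.ser” classifier from the classifiers folder (on some systems the file name is shown with a .gz ending) and click “Open”: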

choosing the right classifier

Several tags should now appear on the right-hand side of the NER window, and the “Run NER” button at the bottom should now be available. Go ahead and click it.

Running NER

The Results

After you click “Run NER”, two things should happen. One, the NER window should highlight the words in your text that correspond to the tags on the right, like so:

output after running ner

And two, the terminal window should also list all the tags for location, organization, date, money, persons, time, etc:

Running terminal after NER
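If you save tagged text to a file, a few lines of shell can tally how many words received each tag. The sketch below assumes the text is in word/TAG form, with O marking words that are not entities. That format is an assumption for this example, so check it against what your own Terminal window shows. The sample sentence and the function name count_entities are made up for illustration.

```shell
# Hypothetical sample in word/TAG form; the output of a real run will differ.
sample='Jane/PERSON Doe/PERSON joined/O Stanford/ORGANIZATION University/ORGANIZATION in/O California/LOCATION ./O'

# count_entities (hypothetical helper): read word/TAG tokens on stdin and
# print "TAG count" for every tag except O, sorted by tag name.
count_entities() {
    tr -s ' ' '\n' |
    awk '{
        n = split($0, parts, "/")        # the tag is the piece after the last slash
        if (n < 2) next                  # skip tokens that carry no tag
        if (parts[n] != "O") seen[parts[n]]++
    }
    END { for (t in seen) print t, seen[t] }' |
    sort
}

printf '%s\n' "$sample" | count_entities
```

For the sample above this should list one LOCATION, two ORGANIZATION, and two PERSON tokens.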

And you’re done learning how to use Stanford’s Named Entity Recognizer! Now on to the Parts of Speech Tagger.

Part 2: Using the Parts of Speech Tagger

Download the Parts Of Speech Tagger

As described above, the Parts of Speech Tagger assigns a part of speech (noun, verb, adjective, etc.) to each word in the text you paste into it. If you need to tag the parts of speech in your document, you can download the tagger here.

image of the Parts of Speech Tagger download page

Go ahead and click the “basic English Stanford Tagger” download, since we’ll only be analyzing text in English.

Many of the steps here are similar to the ones described above for NER. This tagger uses the ‘english-left3words-distsim.tagger’ model, which has a reported accuracy of 96.97% when tagging the text you input. You can read answers to common questions about the Parts of Speech Tagger here.

Using the Parts of Speech Tagger Through Terminal

Open up a Terminal window and navigate to the “stanford-postagger” folder that you just downloaded. There are instructions above on how to use Terminal and navigate to a folder using it. Once you’re in the folder, copy and paste the following command into the Terminal window:

java -mx1000m -jar stanford-postagger.jar

Once this command has started, the following window will appear:

parts of speech tagger window

Copy and paste the text you’d like to tag into the first text box and click “Tag Sentence!”

The Results

The output will look something like this:

output from running the parts of speech

You’ll notice that each part-of-speech tag is attached to its word with an “_”. The tags are based on the University of Pennsylvania (Penn) Treebank tag set; the University of Leeds has a good key to the tags available here (e.g. JJ = adjective, NN = noun).
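Because every token has the word_TAG shape, the tagged text is easy to take apart once you copy it into a file or a shell variable. The sketch below splits each token at its last underscore into a word column and a tag column, then keeps only the adjectives. The sample sentence is hand-written for illustration rather than real tagger output, and split_tags is a made-up helper name.

```shell
# Hand-written sample in word_TAG form; paste your own tagger output instead.
sample='The_DT quick_JJ brown_JJ fox_NN jumps_VBZ over_IN the_DT lazy_JJ dog_NN ._.'

# split_tags (hypothetical helper): read word_TAG tokens on stdin and print
# one "word<TAB>tag" line per token, splitting at the LAST underscore so
# that words which themselves contain "_" are still handled.
split_tags() {
    tr -s ' ' '\n' |
    awk '{
        cut = 0
        for (k = length($0); k > 0; k--)
            if (substr($0, k, 1) == "_") { cut = k; break }
        if (cut > 0) print substr($0, 1, cut - 1) "\t" substr($0, cut + 1)
    }'
}

# Keep only the words tagged JJ (adjective).
# Prints quick, brown, and lazy, one per line.
printf '%s\n' "$sample" | split_tags | awk -F'\t' '$2 == "JJ" { print $1 }'
```

Changing "JJ" to another tag from the Leeds key, such as "NN", would pull out a different part of speech.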

Additional Sources

If you’d like to learn more about Stanford’s Natural Language Processing software and other free tools, visit the group’s home site, which also links to additional resources.

Thanks for reading!
