Python / nltk

If you like using Voyant but want more fine-grained control over your searches, using NLTK (the Natural Language Toolkit) with Python might be what you are looking for.  Here are some instructions to get started with NLTK from the command line.  These instructions are cobbled together from a variety of resources but are most indebted to the NLTK book.


1. Go to your terminal/command line. If you are going to use the sample texts provided by NLTK, it doesn’t matter which directory you are in.  If you are using your own texts, you need to navigate there before launching python.  Here are handy Unix commands for navigation, etc.

2. Type python at the prompt.

3.  At the python prompt (>>) type import nltk (no quotes) and press enter.

4.  If it is installed correctly, nothing will happen except that you’ll get another prompt (>>).  Type from nltk.book import * (no quotes) here and press enter.

4A.  If you are in the lab, you’ll get a message saying that the Gutenberg corpus will not load.  Once you get this message, all you need to do at the python prompt is type nltk.download() (no quotes) and press enter.  You should get a simple pop-up GUI that asks you what you want to download.  Just select “all” and then push the download button. This might take up to 5 minutes.  Once it’s done, close the popup, go back to the terminal, and you should be good to go.

5. Here are some of the easier things you can do with NLTK. I recommend that you play around with the sample texts for a bit to get a feel for the following commands.  I am partial to text1 (Moby Dick), so let’s use that here for clarity.  The >> signifies the python prompt.

CONCORDANCE: this will show you where all of the instances of a chosen word occur (here “kitty”) in a text, in context.
>>text1.concordance("kitty")
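
If you are curious what a concordance is doing, here is a minimal pure-Python sketch of the idea. It is not NLTK's implementation: the function name concordance_lines, the width parameter, and the tiny sample token list are all made up for illustration, and ignoring case when matching is a simplifying choice.

```python
def concordance_lines(tokens, word, width=4):
    """Return each occurrence of `word` with up to `width` tokens on each side."""
    target = word.lower()
    lines = []
    for i, token in enumerate(tokens):
        if token.lower() == target:
            left = tokens[max(0, i - width):i]       # tokens before the match
            right = tokens[i + 1:i + 1 + width]      # tokens after the match
            lines.append(" ".join(left + [token] + right))
    return lines

# A tiny made-up token list standing in for a real text.
sample = "the white whale swam and the old whale sank".split()
for line in concordance_lines(sample, "whale", width=2):
    print(line)
```

A word that never occurs simply gives back an empty list, which is the equivalent of a concordance search with no matches.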

SIMILAR: this will show you words that have a similar context as your chosen word (here “cat”) in a given text (still Moby Dick)
>>text1.similar("cat")

COLLOCATION: this will show you words that appear together frequently (“white whale” will come up here):
>>text1.collocations()

LEN: this will give you the length of the chosen text (still text1)
>>len(text1)

LEXICAL DIVERSITY FORMULA: this will calculate the “lexical richness” of a text
>>len(set(text1)) / len(text1)
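
The same formula can be wrapped in a small function so you can reuse it on any list of words. This is only a sketch: the name lexical_diversity and the sample sentence are illustrative, and it assumes Python 3, where / gives a decimal result (in Python 2 the same division on whole numbers is truncated to 0).

```python
def lexical_diversity(tokens):
    """Distinct tokens divided by total tokens; 0.0 for an empty list."""
    if not tokens:
        return 0.0
    return len(set(tokens)) / len(tokens)

# 11 tokens, 5 of them distinct (the, cat, saw, dog, and).
sample = "the cat saw the dog and the dog saw the cat".split()
print(lexical_diversity(sample))
```

A value close to 1 means almost every word is new; a value close to 0 means the text repeats the same words a lot.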

DISPERSION PLOT: this one can be a little tricky, but it is worth it.  It will create a visualization that shows you where the words you select (here “cat”, “dog”, “whale”) occur across a given text (still Moby Dick here), from beginning to end.  It is only tricky because you might need to install “numpy” and/or “matplotlib” to make it work:
>>text1.dispersion_plot(["cat", "dog", "whale"])
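
What the plot draws is essentially the position of every occurrence of each word. A rough sketch of that underlying data, without any plotting, might look like the following. The function name word_offsets and the sample sentence are assumptions for illustration, not part of NLTK.

```python
def word_offsets(tokens, words):
    """Map each word to the list of token positions where it occurs."""
    offsets = {word: [] for word in words}
    for position, token in enumerate(tokens):
        if token in offsets:
            offsets[token].append(position)
    return offsets

# Made-up sample: "whale" appears near the start and near the end.
sample = "the whale saw the dog and the whale left".split()
print(word_offsets(sample, ["cat", "dog", "whale"]))
```

A word with an empty list would show up as an empty row in the plot.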

FREQUENCY DISTRIBUTIONS: these give us the most frequently used words.
>>fdist1 = FreqDist(text1)
>>fdist1
>>fdist1.most_common(50)   # the 50 most frequent words; older NLTK versions used fdist1.keys()[:50]
>>fdist1["whale"]
>>fdist1.plot(50, cumulative=True)
>>fdist1.hapaxes()

HAPAXES=words that occur only once
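
In recent versions of NLTK, FreqDist builds on Python's own collections.Counter, so you can get a feel for frequency counts without NLTK at all. The sketch below uses a made-up sample sentence, and the hapaxes list is computed by hand rather than with NLTK's hapaxes() method.

```python
from collections import Counter

# A tiny made-up token list standing in for a real text.
sample = "the whale and the ship and the sea".split()
counts = Counter(sample)

# The two most frequent words with their counts.
print(counts.most_common(2))

# How many times one particular word occurs.
print(counts["whale"])

# Words that occur exactly once (hapaxes).
hapaxes = [word for word, n in counts.items() if n == 1]
print(hapaxes)
```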

6.  Once you’re comfortable with these commands, you will want to use your own texts.  To do so, do the following at the python prompt. Again, make sure you are in the correct directory, i.e., the directory with the text/s you want to use for your corpus.  (So let’s say I am going to use nltk to look at the novel Dune, which I have saved as “dune.txt” on my desktop.  From terminal I would type cd ~/Desktop and then type “python.”)  Once you’re in python do the following. There are redundancies that I haven’t yet eliminated here, but it works; each prompt is a line of code (push enter after each one):

>>import nltk

>>from nltk.book import *

>>a = open("dune.txt", encoding="utf-8")   # Python 3: set the encoding when opening the file

>>text = a.read()

>>a.close()

>>text_a = text.split()

>>dune_novel = nltk.Text(text_a)

7.  If you don’t get any error messages you should now be able to use the previous commands (LEN, COLLOCATION, etc) on “dune_novel”.

8.  Troubleshooting: If (when) you get error messages, note what they are (I find that taking a screenshot works well for this). You should always feel free to ask me for help, but I will most likely have to look it up, and you should get comfortable with doing the same. But do let me know what kind of error message you get if it remains a mystery;  I will most likely send you to Stack Overflow to look for solutions, but at least you’ll know we will be looking together.  NLTK has great documentation, but it is not always 100% up to date, nor is it always clear what effect switching versions of Python (from 2.xx to 3.xx) might have.  If you look at the NLTK documentation and don’t see a clear answer to the problem, try typing the error message into Google and adding “nltk” and “python” to the search.  You will usually find that someone else has had the same problem and it has been answered on GitHub or Stack Overflow.

Once you’re comfortable with these steps, you might want to write your own scripts and save them for easy retrieval.  You might also want to explore other possibilities with NLTK.  If so, the NLTK book is an indispensable resource.
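
As a starting point for such a script, here is a minimal sketch that packages the file-loading steps into two functions you could save in a .py file. The names load_tokens and build_text are hypothetical, the sketch assumes Python 3 and a UTF-8 text file, and build_text assumes NLTK is installed.

```python
from pathlib import Path


def load_tokens(path, encoding="utf-8"):
    """Read a plain-text file and split it on whitespace."""
    return Path(path).read_text(encoding=encoding).split()


def build_text(path):
    """Wrap the tokens in an nltk.Text so concordance() and friends work."""
    import nltk  # imported here so load_tokens can be used without NLTK
    return nltk.Text(load_tokens(path))
```

With this saved, something like dune_novel = build_text("dune.txt") would give you an object you can use with len(), concordance(), collocations(), and the other commands above.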
