
Topic Modeling With MALLET

A topic model is a statistical model that uses a set of algorithms to find clusters or "topics" of words that appear together often in a corpus of texts. It can be used to trace themes, ideas, or specific terms throughout a corpus and map out where those features of the texts occur. The advantage of topic modeling is that it finds patterns implicit in the texts that casual markup may not yield. This can be useful in breaking down the makeup of a single text, but can also be integral to seeing how often a specific topic appears chronologically throughout the corpus. For our research question, we are looking at the prevalence of different policies over time to tease out historical trends. Thus, we use the latter approach.

MALLET is a Java-based package that digital humanists use to perform topic modeling. It takes in a corpus of texts, rendered as .txt files, and produces a .mallet file. Once you have this file, you can run commands on it to produce topic models. The most important things to give the program are the input file (the aforementioned .mallet file, in this case inaugural.mallet, which contains condensed information about the corpus) and your parameters: the number of topics to produce, the number of words per topic, an optimization interval, a number of iterations, and output "keys" and "composition" files. The first thing MALLET does is strip the corpus of the most common words in the English language (e.g., a, as, the); it then applies your parameters.

We picked 60 topics, which told the machine to find 60 unique clusters of words that appear together often. If you ask for too few topics, they will be too coarse to be meaningful, and if you ask for too many, they will be too granular and none of them will appear in more than a few texts. It is a balancing act that requires some trial and error. Next, we told the machine to put 20 words in each cluster. This is the default and usually yields the best results. Interval optimization allows the topics to have different weights, which means that some will appear more than others in the corpus overall. This generally yields better and more natural topics because it doesn't rigidly require that every topic appear equally often.

Then, we told the machine to do 10,000 iterations. The number of iterations is also a balancing act: more iterations generally give a more accurate model, but the program takes longer to run. 10,000 iterations take about a minute, which is reasonable. Finally, we told the program to output a "keys" file that lists each topic with its words and weight, and a "composition" file that shows each .txt file's breakdown by topic. Within the "composition" file, the values for each text should add up to 1, or 100%.
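The settings described above can be sketched as two MALLET invocations, built here as Python argument lists so the values are easy to see and change. This is a minimal sketch, not a record of the exact commands we ran: the corpus folder name, the output file names, the path to the MALLET launcher, and the optimization interval of 10 are all assumptions made for illustration. In standard MALLET usage, stop-word removal is requested at the import step.

```python
def build_mallet_commands(corpus_dir, mallet_bin="bin/mallet", optimize_interval=10):
    """Return the import and training commands as argument lists.

    Nothing is executed here. corpus_dir, mallet_bin, the output file
    names, and optimize_interval=10 are illustrative assumptions.
    """
    import_cmd = [
        mallet_bin, "import-dir",
        "--input", str(corpus_dir),       # folder of .txt files
        "--output", "inaugural.mallet",   # condensed corpus file
        "--keep-sequence",                # keep word order, needed for topic modeling
        "--remove-stopwords",             # drop common English words (a, as, the, ...)
    ]
    train_cmd = [
        mallet_bin, "train-topics",
        "--input", "inaugural.mallet",
        "--num-topics", "60",             # number of topics
        "--num-top-words", "20",          # words listed per topic
        "--optimize-interval", str(optimize_interval),  # assumed value
        "--num-iterations", "10000",
        "--output-topic-keys", "inaugural_keys.txt",         # hypothetical name
        "--output-doc-topics", "inaugural_composition.txt",  # hypothetical name
    ]
    return import_cmd, train_cmd


import_cmd, train_cmd = build_mallet_commands("inaugural_txt")
print(" ".join(import_cmd))
print(" ".join(train_cmd))
```

To actually run them, each list could be passed to `subprocess.run(cmd, check=True)` from the MALLET installation directory.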

It is important to note that in a topic model with 60 topics, only some of them are going to be usable. Generally, the usable ones for this kind of analysis are topics with a fairly high weight (so they appear in many texts) whose 20 words are meaningful together. The computer can't do this part of the analysis; it is up to the researcher to judge each topic and give it a name. Below is a sample from the "keys" file, with the topics that we will be looking at:

Topic Number | Topic Name | Weight | Words
1 | Monetary Policy & Currency | 0.02645 | gold dollar specie pay ratification paying house coined closed twenty-five ultimately debt lower greatest resumed commanding contingency advisable safer senate
6 | Foreign Policy | 0.30722 | war commerce revenue force peace industry equally honorable produce defense reason resources practice countries sovereignty treaty effect enter wars nations
11 | States' Rights & Federalism | 0.6059 | constitution union states foreign state rights powers duties fellow-citizens general interests government executive policy principles principle power constitutional opinion proper
15 | Globalization & Industry | 0.26464 | world economic civilization opportunity industrial responsibility peoples justice cooperation hold essential ideals social cost national standards problems business private face
29 | Slavery | 0.10721 | slavery purpose attention south supreme commercial suffrage north cease grave dispute races party contest perpetual cherish universal material conclusive custom
32 | Civil Service & Rule of Law | 0.25225 | laws public service law respect provided large citizenship enforce development reform expect class advancement patriotic population officers presence oath efficient
36 | Trade & Budget | 0.21069 | business congress law policy trade tariff legislation popular taxation revenue methods currency expenditures conditions consideration failure effort mutual make session
41 | Civil Rights | 0.03067 | race proper south negro passed feeling interstate prevent amendment change hope army bill canal issue protection predecessor railroads exercise jurisdiction

Here we see 8 unique topics, with their weights and associated words. The "keys" file included everything but the Topic Name column, which we added. These topic names are not perfect, but we did our best to capture the general theme of each topic. For example, look at Topic 1. We see words like "gold, dollar, specie, pay, paying, coined, debt, lower, greatest" and can discern a pretty good general theme: monetary policy and currency. You will also notice that some of these topics have a much higher weight than others. The ones with higher weight are generally more useful, as you will see in the data visualization. However, the words in the lower-weight topics may appear more specific than the words in the higher-weight topics. This makes sense: topics with less weight appear less often, so the machine picks up on more specific words within them because there are simply fewer words to draw on. These 8 topics were the ones that most clearly pointed to a particular policy area. Topics 6, 11, 15, 29, 32, and 36 in particular yielded useful historical trends when graphed using the "composition" file.
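Sorting topics by weight is a quick first pass before naming them by hand. The sketch below assumes the usual tab-separated layout of a MALLET "keys" file (topic number, weight, then the words); the sample rows reuse three weights from the table above with the word lists shortened, and the function names are our own for illustration.

```python
def parse_keys(text):
    """Parse keys-file text into (topic_id, weight, words) tuples.

    Assumes one topic per line: topic number, tab, weight, tab,
    space-separated words.
    """
    topics = []
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        topic_id, weight, words = line.split("\t", 2)
        topics.append((int(topic_id), float(weight), words.split()))
    return topics


def heaviest_first(topics):
    """Return the topics sorted from highest weight to lowest."""
    return sorted(topics, key=lambda topic: topic[1], reverse=True)


# Shortened sample rows; the real file lists 20 words per topic.
sample = (
    "1\t0.02645\tgold dollar specie pay\n"
    "11\t0.6059\tconstitution union states foreign\n"
    "29\t0.10721\tslavery purpose attention south\n"
)

ranked = heaviest_first(parse_keys(sample))
for topic_id, weight, words in ranked:
    print(topic_id, weight, " ".join(words))
```

A ranking like this only narrows the list; deciding whether a topic's words are meaningful together still falls to the researcher.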

[Figure: Topic 11: Federalism & States' Rights. x-axis: Speech Number; y-axis: Topic Relevance (0% to 100%).]

Here we have Topic 11, States' Rights and Federalism. The x-axis is the speech number, where 1 = Washington's first speech and 59 = Biden's speech in 2021. We used numbers instead of names to avoid clutter and because we are less interested in specific presidents than in trends over time.

This graph is a great example of the efficacy of topic modeling. We can now see how the relevance of this topic shifted over time. To reiterate, Topic Relevance (the y-axis) is the percentage of the document that consisted of the topic. For example, Speech 12 (Jackson's second inaugural address) was over 30% Topic 11. This graph yields an obvious historical trend: the issue of states' rights started out important in the early presidencies, reached heightened relevance during the administrations leading up to the Civil War, stayed relevant after the Civil War, and then declined in the 1900s and is a non-issue today. This makes a great deal of sense to students of history: the clash between the states and the federal government was a central issue at the founding of the country, led to a civil war in the mid-1800s, and then declined in relevance as the primacy of the federal government was established.
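The Topic Relevance values behind a graph like this come straight from the "composition" file. The sketch below assumes the layout written by recent MALLET versions (document index, file name, then one proportion per topic in topic order, tab-separated); older versions write topic and proportion pairs instead and would need a different parser. The sample rows are made up with three topics and are not output from our model.

```python
import math


def parse_composition(text):
    """Map each document name to its list of topic proportions.

    Assumes tab-separated rows: index, file name, then one proportion
    per topic. Blank lines and '#' header lines are skipped.
    """
    docs = {}
    for line in text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.split("\t")
        docs[fields[1]] = [float(value) for value in fields[2:]]
    return docs


def topic_relevance(docs, topic):
    """Return {document name: share of that document assigned to topic}."""
    return {name: props[topic] for name, props in docs.items()}


# Made-up rows for illustration only (three topics, two speeches).
sample = (
    "0\tspeech_01.txt\t0.70\t0.20\t0.10\n"
    "1\tspeech_02.txt\t0.25\t0.25\t0.50\n"
)

docs = parse_composition(sample)

# Sanity check: each text's proportions should add up to 1 (100%).
for name, props in docs.items():
    assert math.isclose(sum(props), 1.0, abs_tol=1e-6), name

relevance = topic_relevance(docs, topic=0)
print(relevance)
```

Plotting `relevance` against speech number gives one line of the kind shown in the graphs here.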

[Figure: Topic 15: Globalization. x-axis: Speech Number; y-axis: Topic Relevance (0% to 100%).]

Here we have the graph for globalization and industry. This graph yields an interesting and believable historical trend: industry and globalization were hardly talked about before the turn of the twentieth century, but as the US became a global power in the early 1900s, presidents started talking about international industry and the implications of the US as a superpower. Once the country got used to that role, it became a topic that more modern presidents discussed less.

[Figure: Topic 36: Budget & Trade. x-axis: Speech Number; y-axis: Topic Relevance (0% to 100%).]

Here we have the graph for budget and trade. From this we can discern that budget and trade were hardly spoken about in the pre-Civil War period or after World War II, but were of relative importance from the Gilded Age through the turn of the century (~1870 - 1930), with a peak right in the middle of that period. This makes sense because tariffs were a big issue at that time, and balancing the budget was an ideal that presidents often sought to uphold. This is no longer the case, so the result is relatively unsurprising.

[Figure: Topic 29: Slavery. x-axis: Speech Number; y-axis: Topic Relevance (0% to 100%).]

Here is the graph for slavery. As you can see, it was a relevant topic in the years leading up to the Civil War and in the years immediately after, and otherwise hardly at all. It makes a slight reappearance in the inaugural speeches of the post-World War II presidents, which is interesting. Another interesting point is that even when slavery was discussed the most, it was dwarfed by the emphasis put on states' rights (see the Topic 11 graph for comparison). This suggests that presidents often framed the slavery issue as one of states' rights rather than explicitly in terms of slavery. That would be consistent with pre-Civil War presidents' southern sympathies or their desire not to escalate conflict. We can't say for certain that this is why slavery is a less weighted topic here, but it is an inference we may draw.

[Figure: Topic 32: Civil Service & Rule of Law. x-axis: Speech Number; y-axis: Topic Relevance (0% to 100%).]

This is the graph for Civil Service and the Rule of Law. This is a slightly vaguer category. We interpret it generally as an appeal to good government, and perhaps to some specific civil service reforms. We see relatively low but still present levels of relevance in the early administrations (remember that with 60 topics, an even split would give each topic only about 1.7%, so 5% relevance for one topic is not insignificant) and an increase in the post-Civil War era, when corruption reigned and presidents promised civil service reform and good government. The topic seems to peak right before the 1900s, and it persisted until roughly the 1930s, when it became irrelevant.

[Figure: Topic 6: Foreign Policy. x-axis: Speech Number; y-axis: Topic Relevance (0% to 100%).]

Foreign Policy has an interesting graph. As a topic overall, it remains fairly consistently relevant over time, though it almost completely disappears with the modern presidents. There is a neat "staircase" effect with the early administrations, however, that warrants attention. When America was a nascent country, there was a genuine fear that it might be invaded and not survive long. This culminated in the War of 1812, and it may explain why foreign policy was of increasing importance to the early presidents. Its mild importance since then also makes sense, because foreign policy is rarely central to a president's pitch to the people.

[Figure: Topic 41: Civil Rights. x-axis: Speech Number; y-axis: Topic Relevance (0% to 100%).]

And now we get to our less relevant topics, 1 and 41. This is Topic 41, Civil Rights. We include it, along with Topic 1, to demonstrate that topics with low weights can tell us more about an individual president or two than about trends over time. As you can see, only a handful of presidents address this topic, and only one with high relevance (in this case Speech 31, Taft's inaugural address). This tells us important information about President Taft's priorities, but not much about trends over time. It does tell us that specific mentions of civil rights were few and far between, and were often nested within other, more generic topics.

[Figure: Topic 1: Monetary Policy & Currency. x-axis: Speech Number; y-axis: Topic Relevance (0% to 100%).]

Again, we see that Topic 1 is only relevant for a handful of presidents, most notably in Grant's first address. This can tell us useful things about Grant's priorities, but nothing about long-term trends.

One thing you may notice is that the topics that tend to be more specific (like 1 and 41) are much less prevalent and usually appear in only one or two presidents' speeches. This makes sense, and is a reality of topic modeling with optimization. Because some topics will be bigger than others, you will get some broad, prevalent topics and some specific, less relevant ones. One way to make these data more meaningful is to create supertopics by merging similar smaller topics together. This could be a focus of future research.
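One simple way to build a supertopic, sketched here as an idea rather than something we have done, is to add up the proportions of the merged topics within each document. Because the proportions for one document sum to 1, the merged shares remain directly comparable with the single-topic ones. The function name, the four-topic example values, and the groupings are all hypothetical; which topics belong together is a judgment for the researcher.

```python
def merge_topics(doc_props, groups):
    """Sum one document's topic proportions into named supertopics.

    doc_props: list of topic proportions for a single document.
    groups: {supertopic name: [topic indices to merge]}.
    """
    return {
        name: sum(doc_props[index] for index in indices)
        for name, indices in groups.items()
    }


# Made-up proportions for one document with four topics.
doc_props = [0.1, 0.2, 0.3, 0.4]

# Hypothetical grouping, chosen only to show the mechanics.
groups = {"Supertopic A": [0, 2], "Supertopic B": [1, 3]}

merged = merge_topics(doc_props, groups)
print(merged)
```

Applied to every row of the "composition" file, this would give one relevance series per supertopic that could be graphed in the same way as the individual topics.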

Creative Commons License
This work is licensed under a Creative Commons Attribution 4.0 International License.