Building a LDA-based Book Recommender System

By Andrew McAllister, Ivana Naydenova, Quang Nguyen Duc | February 8, 2019

IS_LDA_final

Building a LDA-based Book Recommender System

Autors: Andrew McAllister, Ivana Naydenova, Quang

The task: Building a books recommendation engine

Latent Dirichlet Allocation is a type of unobserved learning algorithm in which topics are inferred from a dictionary of text corpora whose structures are not known (are latent). The practical use of such an algorithm is to solve the cold-start problem, whereby analytics can be done on texts to derive similarities in the dictionary's corpses, and further be expanded to others outside the original scope - all without user input, which is often an early stumbling block for businesses that do not meet early usage quotas to derive needed analysis.

What we will be applying this to towards the end of this article is a recommendation engine for books based on their Wikipedia articles. Our recommendation engine aims to help users find books which might be interesting for them based on their summaries. At this point the algorithm is fully content based, lacking user input for collaborative filtering completely, but serves to illustrate the potential of such algorithms in forming the basis of high-level business implementable hybrid filters.

The data for the project - all books on Wikipedia - is collected from Wikipedia Dumps from the 1st of January, 2019, in their compressed forms. Using techniques outlined by Will Koehrsen of Medium's Towards Data Science we use a process of XML handlers to separate out individual pages, myparserfromhell's Wikipedia parser to derive the article titles, article contents, and length of each article, and further paralyze the process to assure optimal speed when going through all 15.8 GBs of compressed data (Link to the original article). Our thanks to Mr. Koehrsen for his techniques on how to access and create such an interesting data set for our analysis.

What follows is an overview of the related work on this topic followed by an explanation of the theoretics behind LDA to set the groundwork of our implementation. At the end, we will work through the full codes and present specific results of the algorithm, as well as discuss some of its shortcomings.

The Latent Dirichlet Allocation (LDA) model and the Variational EM algorithm used for training the model were proposed by Blei, Ng and Jordan in 2003 (Blei et al., 2003a). Blei et al. (2003b) described LDA as a “generative probabilistic model of a corpus which idea is that the documents are represented as weighted relevancy vectors over latent topics, where a distribution over words characterizes a topic”. This topic model belongs to the family of hierarchical Bayesian models of a corpus, which has the purpose to expose the main themes of a corpus that can be used to classify, search, and investigate the documents of the corpus. In LDA models, a topic is a distribution over the feature space of the corpus, and several topics with different weights can represent each document. According to Blei et al. (2003a), the number of topics (clusters) and the proportion of vocabulary that create each topic (the number of words in a cluster) are considered to be tbe two hidden variables of the model. The conditional distribution of words in topics, given these variables for an observed set of documents, is regarded as the primary challenge of the model.

Griffiths and Steyvers (2004), used a derivation of the Gibbs sampling algorithm for learning LDA models to analyze abstracts from PNAS by using Bayesian model selection to set the number of topics. They proved that the extracted topics capture essential structure in the data, and are further compatible with the class designations provided by the authors of the articles. They drew further applications of this analysis, including identifying ‘‘hot topics’’ by examining temporal dynamics and tagging abstracts to illustrate semantic content. The work of Griffiths and Steyvers (2004) proved the Gibbs sampling algorithm is more efficient than other LDA training methods (e.g., Variational EM). The efficiency of the Gibbs sampling algorithm for inference in a variety of models that extend LDA is associated with the "the conjugacy between the Dirichlet distribution and the multinomial likelihood". Thus, when one does sampling, the posterior sampling becomes easier, because the posterior distribution is the same as the prior, and it makes inference feasible. Therefore, when we are doing sampling, the posterior sampling becomes easier. (Blei et al., 2009; McCallum et al., 2005).

Mimno et al. (2012) introduced a hybrid algorithm for Bayesian topic modeling, in which the main effort is to merge the efficiency of sparse Gibbs sampling with the scalability of online stochastic inference. Their approach decreases the bias of variational inference and can be generalized by many Bayesian hidden-variable models.

LDA is being applied in various Natural Language Processing tasks such as for opinion analysis (Zhao et al., 2010), for native language identification (Jojo et al., 2011) and for learning word classes (Chrupala, 2011).

In this blog post, we focus on the task of automatic text classification into a set of topics using LDA, in order to help users find books which might be interesting for them based on their summaries.

What is LDA?

Latent Dirichlet Allocation is a type of unobserved learning algorithm in which topics are inferred from a dictionary of text corpora whose structures are not known (are latent). The practical use of such an algorithm is to solve the cold-start problem, whereby analytics can be done on texts to derive similarities in the dictionary's corpses, and further be expanded to others outside the original scope - all without user input, which is often an early stumbling block for businesses that do not meet early usage quotas to derive needed analysis.

What we will be applying this to towards the end of this article is a recommendation engine for books based on their Wikipedia articles. Our recommendation engine aims to help users find books which might be interesting for them based on their summaries. At this point the algorithm is fully content based, lacking user input for collaborative filtering completely, but serves to illustrate the potential of such algorithms in forming the basis of high-level business implementable hybrid filters.

The data for the project - all books on Wikipedia - is collected from Wikipedia Dumps from the 1st of January, 2019, in their compressed forms. Using techniques outlined by Will Koehrsen of Medium's Towards Data Science we use a process of XML handlers to separate out individual pages, myparserfromhell's Wikipedia parser to derive the article titles, article contents, and length of each article, and further paralyze the process to assure optimal speed when going through all 15.8 GBs of compressed data (Link to the original article). Our thanks to Mr. Koehrsen for his techniques on how to access and create such an interesting data set for our analysis.

What follows is an overview of the related work on this topic followed by an explanation of the theoretics behind LDA to set the groundwork of our implementation. At the end, we will work through the full codes and present specific results of the algorithm, as well as discuss some of its shortcomings.

In order to understand the kind of books that a certain user likes to read, we used a natural language processing technique called Latent Dirichlet Allocation (LDA )used for identifying hidden topics of documents based on the co-occurrence of words collected from those documents.

The general idea of LDA is that

each document is generated from a mixture of topics and each of those topics is a mixture of words.

Having this in mind, one could create a mechanism for generating new documents, i.e. we know the topics a priori, or for inferring topics present in a set of documents which is already known for us. This bayesian topic modelling technique can be used to find out how high the share of a certain document devoted to a particular topic is, which allows the recommendation system to categorize a book topic, for instance, as 30% thriller and 20% politics. These topics will not and do not have to be explicitly defined. Managers looking to apply LDA will often expect that outputs of specific topic classes will be provided by the analysis, but this is often extraneous - or potentially harmful to the robust potential of the analysis when a topic set is chosen such that they can be easily defined by human intuition. In our analysis, we will directly name any of our found topics, but will draw the reader's attention to when the words do seem to fit in a potential topic.

Concerning the model name, one can think of it as follows (Blei et al., 2003a):

Latent: Topic structures in a document are latent meaning they are hidden structures in the text.

Dirichlet: The Dirichlet distribution determines the mixture proportions of the topics in the documents and the words in each topic.

Allocation: Allocation of words to a given topic.

Parameter estimation

LDA is a generative probabilistic model, so to understand exactly how it works one needs first to understand the underlying probability distributions. The idea behind probabilistic modeling is (Blei et al., 2003a):

  • to treat data as observations that arise from some kind of generative probabilistic process (the hidden variables reflect the thematic structure of the documents (books) collection),
  • to infer the hidden structure using posterior inference (What are the topics that describe this collection?) and
  • to situate new data into the estimated model (How does a new document fit into the estimated topic structure?)

In the next sectios we will focus on the multinomial and Dirichlet distributions utilized by LDA.

Inference

Having specified the statistical model structure, the crucial part now is to find a method for estimating the relevant parameters. As shown in the previous section, our model is based on the joint probability distribution over both observable data and hidden model parameters. Such models are considered as probabilistic generative models. In comparison, discriminative models like e.g. logistic regression or SVMs, directly model the conditional probability $P(Y|X=x)$ of some target variable $Y$ with given observations $x$.

Parameter estimation, or more precisely statistical inference, on generative models are generally approached by Bayesien inference. In essence, bayesian inference applies the Bayes theorem on statistical models to draw conclusions from the data about the parameter distribution of interest.

Bayes Theorem

Before introducing the concept of Baysian inference, it is necessary to provide the Bayes' theorem. An advantage of this theorem is the ability to include prior knowledge or belief to calculate the probability of related events.

Mathematically, the Bayes' theorem can be stated as follows: $$P(A|B)=\frac{P(B|A)P(A)}{P(B)}$$ where

  • $A$ and $B$ are events with $P(B)\neq0$

Model-Based Bayesian Inference

The introductory definition of Bayes' rule will now be adopted to the model form in order to represent the data and parameters. In the section above we have only used single numbers for the terms in the Bayes' theorem equation. However, when applying Bayes' rule to more complex machine learning models, this is not sufficient. As prior beliefs often contain uncertainty, probability distributions over a range of values are more appropriate. The same logic applies to the other terms of the equation. Therefore, probabilities $P(.)$ are replaced by densities $p(.)$.

Furthermore, the events $A$ and $B$ are replaced by the model parameters and the observed data. The observed data is represented by $y=\{y_1,y_2,...,y_N\}$ and the set of parameters (which we are typically interested in) is denoted as $\Theta$ Taking everything into account and using the Bayes' rule, the Baysian inference model is formalized as

$$p(\Theta|y)= \frac{p(y|\Theta)p(\Theta)}{p(y)}$$

with the following components:

  • $p(\Theta)$ is the prior distribution (or simply: prior) of the parameter set $\Theta$ and is used as a means of expressing uncertainty about $\theta$ before observing the data. As a probability distribution, it contains its own set of parameters which are called hyperparameters, to differentiate them from the model parameters $\Theta$.
  • $p(y|\Theta)$ is the sampling distribution, which models the probability of observable data $y$ given latent model parameters $\Theta$. It is also often called the likelihood (function) when it is viewed as a function of $\Theta$.
  • $p(y)$ defines the marginal likelihood of $y$ and is also called the evidence.
  • $p(\Theta|y)$ is the posterior distribution giving the updated probability of the parameters $\Theta$ after taken both the prior and the data into account

Multinomial Distribution

Instead of maximum-likelihood, Bayesian inference encourages the use of predictive densities and evidence scores. This is illustrated in the context of the multinomial distribution, where predictive estimates are often used but rarely described as Bayesian (Minka, 2003).

Now we will describe the multinomial distribution which is used to model the probability of words in a document. For this reason, we will also discuss the conjugate prior for the multinomial distribution, the Dirichlet distribution.

Given a set of documents, one can choose values for the parameters of a probabilistic model that make the training documents have high probability (Elkan, 2014). Considering a test document, one can evaluate its probability according to the model: The higher the probability is, the more similar the test document is to the initial set of documents. For this purpose, the multinomial distribution could be applied with the following probability distribution function:

$$ p(x; \theta)=\bigg(\frac{n!}{\prod^{m}_{j=1}}\bigg)\bigg(\prod^{m}_{j=1}\theta_{j}^{x_{j}}\bigg), $$ where $\theta_{j}$ is the probability of word j while $x_{j}$ is the count of a word j.

So every time, a certain word j appears in the document it contributes an amount $\theta_{j}$ to the total probability ( $\theta_{j}^{x_{j}}$).
In this equation, the first factor in parentheses is called a multinomial coefficient, which represents the size of the equivalence class of x, while the second factor in parentheses is the probability of any individual member of the equivalence class of x.

Dirichlet distribution — what is it, and why is it useful?

The Dirichlet distribution is used for representation of the distributions over the simplex (the set of N-vectors whose components sum to 1) and is a useful prior distribution on discrete probability distributions over categorical variables.

It is a conjugate prior to the categorical and multinomial distributions (meaning that multiplying a Dirichlet prior by a multinomial or categorical likelihood will yield another Dirichlet distribution of a certain form). However, the primary difference between Dirichlet and multinomial distributions is that Dirichlet random variables are real-valued, where each element of the vector is in the interval $[0, 1]$, and multinomial random variables are integer-valued.

The Dirichlet distribution defines a probability density for a vector valued input having the same characteristics as our multinomial parameter $\theta$.

\begin{equation} Dir (\vec\theta\mid \vec\alpha)={\frac {\Gamma (\Sigma_{i=1}^{K}\alpha_{i})}{\prod _{i=1}^{K}\Gamma (\alpha_{i})}}\prod _{i=1}^{K}\theta_{i}^{\alpha_{i}-1} \end{equation}

This equation is often represented using the Beta function in place of the first term as seen below:

\begin{equation} Dir (\vec\theta\mid \vec\alpha)={\frac {1}{B(\alpha)}}\prod _{i=1}^{K}\theta_{i}^{\alpha_{i}-1} \end{equation}

Where:

\begin{equation} \frac {1}{B(\alpha)} = {\frac {\Gamma (\Sigma_{i=1}^{K}\alpha_{i})}{\prod _{i=1}^{K}\Gamma (\alpha_{i})}} \end{equation}

The Dirichlet distribution is parameterized by the vector $\alpha$, which has the same number of elements K as our multinomial parameter $\theta$ (Thomas Boggs, 2014). Thus, one could interpret $p(\theta \mid\alpha)$ as responding to the question “what is the probability density associated with multinomial distribution $\theta$, given that our Dirichlet distribution has parameter $\alpha$.

To get a better sense of what the distributions look like, let us visualize a few examples in the context of topic modeling.

The Dirichlet distribution takes a number ($\alpha$) for each topic. The GIF below shows the iterations of taking 1000 samples from a Dirichlet distribution using an increasing alpha value. Every topic gets the same alpha value one could see displayed, and the dots represent a distribution or mixture of the three topics having certain coordinates such as (0.6, 0.0, 0.4) or (0.4, 0.3, 0.3). Low $\alpha$-values account for the case where most of the topic distribution samples are in the corners (near the topics). Having values (1.0, 0.0, 0.0), (0.0, 1.0, 0.0), or (0.0, 0.0, 1.0) would mean that the observed document would only ever have one topic in case a three topic probabilistic topic model is build.

Usually, one aims to end up with a sample favoring only one topic or a sample that gives a mixture of all topics, or something in between. In case the $\alpha$-values>1, the samples start to congregate to the center. With greater $\alpha$ the samples will more likely be uniform or a homogeneous mixture of all the topics.

Typically one aims to find more than three topics depending on the number of observed documents. In the ideal case, the composites to be made up of only a few topics and our parts to belong to only some of the topics.

In [5]:
from IPython.display import Image
from IPython.core.display import HTML 
Image(url= "https://cdn-images-1.medium.com/max/800/1*_NdnljMqi8L2_lAYwH3JDQ.gif", width=700)
Out[5]:

LDA as a Generative process

LDA is being often described as the simplest topic model (Blei and Lafferty, 2009): The intuition behind this model is that documents exhibit multiple topics. Furthermore, one could most easily describe the model by its generative process by which the model assumes the documents in the collection arose (Blei, 2012).

LDA can be described more formally with the following notation:

[Source: Blei, 2012.]

  • A document-to-topic model parameter $\theta_{m}$ is used to generate the topic for the $n^{th}$ word in the doument d denoted by $z_{m,n}$, for all n of [1; N].
  • A topic-to-word model $\phi_{k}$ is used to generate the word $w_{m,n}$ for each topic $k=z_{m,n}$

Those two steps can be seen on the graphical model above, where the document-to-topic model and the topic-to-word model essentially follow multinomial distributions (counts of each topic in a document or each word in a topic), a good prior for their parameters, would be Dirichlet distribution, explained here.

Assuming that the word distributions for each topic vary based on a Dirichlet distribution, as do the topic distribution for each document, and the document length is drawn from a Poisson distribution, one can generate the words in a two-stage process for each document in the whole data collection:

  1. Randomly choose a distribution over topics.
  2. For each word in the document:

    • A topic is being randomly chosen from the distribution over topics in the first step.
      • Sample parameters for document topic distribution
      • $\theta_{m}$ follows the $Dir(\alpha_{m}\mid \alpha)$ distribution, after observing topics in the document and obtaining the counting result $n_{m}$, one could calculate the posterior for $\theta_{m}$ as $Dir(\alpha_{m}\mid n_{m}+\alpha)$. Thus, the topic distribution for the mth document is $$ p(z_{m}\mid \alpha)=\frac{\Delta(n_{m}+\alpha)}{\Delta(\alpha)} $$ and the topic distribution for the all document is

    $$ p(z\mid\alpha)= p(w,z\mid\alpha,\beta)= \displaystyle\prod_{m=1}^{M}\frac{\Delta(n_{m}+\alpha)}{\Delta{\alpha}} $$

    • we assume, similarly, for the kth topic, that the prior for the topic-to-word model's parameter $\phi_k$ follows the $Dir(\phi_k\mid \beta)$ distribution, after observing words in the topic and obtaining the counting result $n_{k}$, one could calculate the posterior for $\phi_{k}$ as $Dir(\phi_k\mid n_{k}+\beta)$. Thus, the word distribution for the kth topic is

      $$ p(w_{k}\mid z_{k}, \beta)=\frac{\Delta(n_{k}+\beta)}{\Delta(\beta)} $$ and the word distribution for the all topics is

    $$ p(w_{k}\mid z, \beta)=\displaystyle\prod_{m=1}^{M}\frac{\Delta(n_{k}+\beta)}{\Delta(\beta)} $$

The distinctive characteristic of LDA is that all the documents in the collection share the same set of topics and each document exhibits those topics in different proportion.

Using this notation, the generative process for LDA is equivalent to the following joint distribution of the hidden and observed variables (Blei, 2012):

\begin{equation*} p(w,z\mid\alpha,\beta)=p(w\mid z,\beta)p(z\mid \alpha)=\displaystyle\prod_{k=1}^{K}\frac{\Delta(n_{k}+\beta)}{\Delta{\beta}}\displaystyle\prod_{m=1}^{M}\frac{\Delta(n_{m}+\alpha)}{\Delta{\alpha}} \end{equation*}

Posterior computation for LDA

As already mentioned,

the aim of topic modelling is to automatically discover the topics from a collection of documents, which are observed, while the topic structure is hidden.

This can be thought of as “reversing” the generative process by asking what is the hidden structure that likely generated the observed collection?

We now turn to the computational problem, computing the conditional distribution of the topic structure given the observed documents (called the posterior.) Using our notation, the posterior is

$$p(\theta,\phi, z \mid w, \alpha, \beta) = \frac{p(\theta,\phi, z, w, \mid \alpha, \beta)}{p(w, \mid \alpha, \beta)}$$

The left side of the equation gives us the probability of the document topic distribution, the word distribution of each topic, and the topic labels given all words (in all documents) and the hyperparameters $\alpha$ and $\beta$. In particular, we are interested in estimating the probability of topic (z) for a given word (w) (and our prior assumptions, i.e. hyperparameters) for all words and topics.

A central research goal of modern probabilistic modelling is to develop efficient methods for approximating the posterior inference (Blei, 2012).

Gibbs sampling

The general Idea of the Inference process is the following (Tufts, 2019):

  1. Initialization: Randomly select a topic for each word in each document from a multinomial distribution.
  2. Gibbs Sampling:
    • For i iterations
    • For document d in documents:
      • For each word in document d:
        • assign a topic to the current word based on probability of the topic given the topic of all other words (except the current word).

This procedure is repeated until the samples begin to converge to what would be sampled from the actual distribution. While convergence is theoretically guaranteed with Gibbs Sampling, there is no way of knowing how many iterations are required to reach the stationary distribution (Darling, 2011). Usually, an acceptable estimation of convergence can be obtained by calculating the log-likelihood or even, in some situations, by inspection of the posteriors.

When applying LDA, one is interested in the latent document-topic portions $\theta_{d}$, the topic-word distributions $\phi(z)$, and the topic index assignments for each word $z_{i}$. While conditional distributions – and therefore an LDA Gibbs Sampling algorithm – can be derived for each of these latent variables, we note that both $\theta_{d}$ and $\phi(z)$ can be calculated using just the topic index assignments $z_{i}$ (Darling, 2011). To this reason, Griffiths and Steyvers (2004) applied a simpler algorithm, called collapsed Gibbs sampler, which integrates out the multinomial parameters and sample $z_{i}$.

To compute the collapsed Gibbs sampler updates efficiently, the algorithm maintains and makes use of several count matrices during sampling (Papanikolaou et al., 2017):

  • $n_{kv}$ represents the number of times that word type v is assigned to topic k across the corpus,
  • $n_{dk}$ represents the number of word tokens in document d that have been assigned to topic k.

The equation giving the probability of setting $z_{i}$ to topic k, conditional on $w_{i},d$, the hyperparameters $\alpha$ and $\beta$, and the current topic assignments of all other word tokens (represented by ·) is:

$$p(z_{i}=k\mid w_{i}=v, d, \alpha, \beta, ·)\propto \frac{n_{kv\neg i}+\beta_{v}}{\Sigma_{v'=1}^{V}(n_{kv\neg i}+\beta_{v})} . \frac{n_{dk\neg i}+\alpha_{k}}{N_{d}+\Sigma_{k'=1}^{K}\alpha_{k'}}$$

The exclusion of the current topic assignment of $w_{i}$ of from all count matrices is annotated by $\neg i $. A common practice is to run the Gibbs sampler for many iterations before retrieving estimates for the parameters, and the one can retrieve estimates for the variables of interest.

By using Rao-Blackwell estimators Griffiths and Steyvers, (2004) computed a point estimate of the probability of word type v given topic k ($\hat{\phi_{kv}}$) and a point estimate for the probability of the topic k given document d ($\hat{\theta_{dk}}$):

$$ \hat{\phi_{kv}}=\frac{n_{kv}+\beta_{v}}{\Sigma_{v'=1}^{V}(n_{kv}+\beta_{v})} $$

$$ \hat{\theta_{dk}}=\frac{n_{dk}+\alpha_{k}}{N_{d}+\Sigma_{k'=1}^{K}\alpha_{k'}} $$

When applying the collapsed Gibbs sampler on new documents during the prediction stage, a common practice is to fix the $\phi$ distributions and set them to the ones learned during estimation:

$$p(z_{i}=k\mid w_{i}=v, d, \alpha,\hat{\phi_{kv}} , ·)\propto \hat{\phi_{kv}} . \frac{n_{dk\neg i}+\alpha_{k}}{N_{d}+\Sigma_{k'=1}^{K}\alpha_{k'}}$$

Variational Inference

Blei et al. (2003) first introduced LDA using an inference method based on variational inference. As the derivation and its mathematical details are out of scope for this introduction to the LDA, we provide an intuition for the steps involved. The general idea of this method is to approximate the intractable posterior distribution $P$ by a tractable distribution $Q_V$ with a set of free variational parameters $V$. The goal then is to eventually find parameters $V$ which minimizes the Kullback-Leibler (KL) divergence $KL(Q_V||P)=\sum Q_V \log{\frac{Q_V}{P}}$. The key advantage is that this process can be done without computing the actual posterior distribution and still approach the posterior up to an constant, which is independent of the variational parameters. Moreover, these parameters $V$ are only dependend on the obervable data $w$.

Model

The specific approach taken here is referred as a mean field approximation, where we assume the following:

  • All target variables are independent of each other
  • Each variable has its own variational parameter

With these assumptions, we can set up the model:

  • Probability of topic $z$ given document $d$: $q(\theta_d|\gamma_d)$
    Each document has its own Dirichlet prior $\gamma_d$
  • Probability of topic assignment to word $w_{d,n}$: $q(z_{d,n}|\phi_{d,n})$
    Each $n^{th}$-word in document $d$ has its own multinomial over topics $\phi_{d,n}$

Inference

The inferential step estimates the parameters of $q(\theta,z|\gamma,\phi)$ by minimizing the KL divergence

$$KL[q(\theta,z|\gamma,\phi) \ || \ p(\theta,z|w,\alpha,\beta)]$$

The KL divergence, also known as relative entropy, is a function which quantitatively measures the "difference" between two probability distributions. Thus, the lower the value of KL, the more similar are both distributions $q(\theta,z|\gamma,\phi)$ and $p(\theta,z|w,\alpha,\beta)$. The optimization problem is therefore given by

$$\gamma^*,\phi^* = argmin_{(\gamma,\phi)} KL[q(\theta,z|\gamma,\phi) \ || \ p(\theta,z|w,\alpha,\beta)]$$

Using the definition of the KL divergence function, Blei et. al (2003) derive the equation

$$KL[q(\theta,z|\gamma,\phi) \ || \ p(\theta,z|w,\alpha,\beta)] = - L(\gamma, \phi|\alpha,\beta) + \log p(w, \alpha,\beta)$$
with $L(\gamma, \phi|\alpha,\beta) = E_q[\log p(\theta, z, w|\alpha,\beta)]-E_q[\log q(\theta,z|\gamma, \phi)]$

and additionally applying Jensen's inequality, it follows the inequality

$$\log p(w|\alpha, \beta) \geq L(\gamma, \phi|\alpha,\beta)$$

In the first equation we can see that the minimization problem turns into a maximization problem of some likelihood function $L(\gamma, \phi|\alpha,\beta)$, as the second term is constant and can be ignored. The left hand side of the second equation shows the log-likelihood of the evidence of our posterior distribution. This inequality is commonly knows as the evidence lower bound, abbreviated as ELBO. The interpretation of this ELBO has some positive insights: As we aim to maximize $L(\gamma, \phi|\alpha,\beta)$ (which is equivalent to minimize the KL divergence), we simultaneously maximize the lower bound of our evidence. Remember that the evidence gives the probability of our observed data, given our model parameters. Hence, by optimizing the parameters, we increase our confidence of the observed data. In fact, this is similar to the intuition of the maximum likelihood estimation, where we estimate parameters such that the probability of our observed data is maximum. What we can see on the left hand side is the log-likelihood of the evidence of our posterior probability, whereas $L(\gamma, \phi|\alpha,\beta)$ represents the equivalent optimzation

Variational EM-algorithm

However, a closed-form solution can not be derived for maximizing the objective function $L(\gamma, \phi|\alpha,\beta)$. Hence, Blei et. al have proposed a variational expectation maximization (variational EM) algorithm to iteratively solve this optimization problem.

The derivation of the parameter estimatiosn by using the variational EM algorithm is quite lengthy but the intuition can be summarized as follows:

Goal: Find $\alpha, \beta$ which maximize the marginal log-likelihood $l(\alpha,\beta|w)=\sum_d \log p(w_d|\alpha,\beta)

  1. Define and initialize a tractable distribution $q(\theta,z|\gamma,\phi)$ that defines a lower bound of the log-likelihood of the actual data.
  2. Iterate the two following steps until $l(\alpha,\beta|w)$ converges:
    • E-step: Update $\gamma^*, \phi^*$ to get new approximated posteriors $q(\theta,z|\gamma,\phi)$. For each document, find optimal values of its variational parameters. This is basically the optimization equation above.
    • M-step: Update $\alpha, \beta$ to get new evidence $p(\theta, z, w|\alpha,\beta)$. The parameters $\alpha, \beta$ are chosen such that the resulting lower bound of the log likelihood are maximum. Usually this corresponds to finding maximum likelihood estimates of $\alpha, \beta$ for each document under the new approximate posterior $q(\theta,z|\gamma^*,\phi^*)$

Book Similarity Computation

LDA and Jaccard Similarity

As already mentioned, we will apply the LDA algorithm to extract the latent topics within Wikipedia's book summaries. This is followed by aggregation of most frequently occurring topics and their utilization in order to calculate the Jaccard Similarity between the topics of other book summaries. The combination of latent topics and similarity measurement allows one to uncover both topics and apparent relationships between the summaries of two establishments (Yang and Provinciali, 2015).

The Jaccard similarity coefficient measures similarity between two finite sample sets A and B by counting the number of items they have in common and dividing by the total number of unique items between them:

\begin{equation*} J(A,B)= \frac{\mid A \cap B \mid}{\mid A \cup B\mid } \end{equation*}

When the Jaccard similarity coefficient is above a certain threshold, a high correlation between the latent characteristics of two books is implied.

However, a content-based recommendation system will not perform on the highest level, if there is no data on user’s preferences, regardless of how detailed our metadata is.

Implementation - an LDA Recommendation Engine for Books

Let's start by uploading the base Python packages, as well as tools to remove expressions from the texts, pickle to save us some time, plotting packages, and json to load the data.

In [1]:
import os
import pandas as pd
import numpy as np
import datetime as dt
import time

%load_ext ipycache

import random
import re
import pickle

import matplotlib.pyplot as plt
import seaborn as sns

import json

from IPython.core.display import display, HTML
display(HTML("<style>.container { width:99% !important; }</style>"))
/Users/andrew/anaconda3/lib/python3.6/site-packages/IPython/config.py:13: ShimWarning: The `IPython.config` package has been deprecated since IPython 4.0. You should import from traitlets.config instead.
  "You should import from traitlets.config instead.", ShimWarning)
/Users/andrew/anaconda3/lib/python3.6/site-packages/ipycache.py:17: UserWarning: IPython.utils.traitlets has moved to a top-level traitlets package.
  from IPython.utils.traitlets import Unicode
In [2]:
project_folder = os.getcwd()

Loading the Data

We begin by loading the book data that we derived from the January 1st 2019 Wikipedia dumps, removing those with 'Wikipedia:' in their name, as these are pages for Wikipedia contributors (see below). To start, we have 37,523 books. This total number will be reduced, and the total words within each corpus will be further reduced after tokenization.

In [788]:
books = []

with open(project_folder+'/wiki/found_books_filtered.ndjson', 'r') as fin:
    books = [json.loads(l) for l in fin]

books_with_wikipedia = [book for book in books if 'Wikipedia:' in book[0]]
books = [book for book in books if 'Wikipedia:' not in book[0]]
print(f'Found a total of {len(books)} books.')
Found a total of 37523 books.

As stated before, each book entry contains the title, the text of the article (not necessarily including the title in the first part), the date of last edit (unused), and a rough estimate of how long the article is in charachters.

Data Exploration and Preparation

To begin our data exploration, let's derive first what data types we're dealing with.

In [790]:
print('List of Books type: {}'.format(type(books)))
print('Book type: {}'.format(type(books[0])))
print('Title type: {}'.format(type(books[0][0])))
print('Text type: {}'.format(type(books[0][1])))
print('Edit date type: {}'.format(type(books[0][2])))
print('Length type: {}'.format(type(books[0][3])))
List of Books type: <class 'list'>
Book type: <class 'list'>
Title type: <class 'str'>
Text type: <class 'str'>
Edit date type: <class 'str'>
Length type: <class 'int'>

Now let's look at some elements of each book list element - first the titles, then the descriptions, and then the lengths.

In [791]:
for i in range(5):
    print(books[i][0])
The Market for Liberty
Complete Mage
Live from Golgotha: The Gospel According to Gore Vidal
The Night of the Mary Kay Commandos
Confessions of a Yakuza
In [792]:
print(books[0][1])
The Market for Liberty is an anarcho-capitalist book written by Linda and Morris Tannehill, < !--do not link. redirects to this same article-- >  which according to Karl Hess has become  " something of a classic. " < ref > Foreword by Karl Hess, in The Market for Liberty by the Tannehills < /ref >  It was preceded by the self-published Liberty via the Market in 1969. Mary Ruwart credits the Tannehill's book with winning her over to anarcho-capitalism. < ref >  < /ref >  Doug Casey was also converted to anarcho-capitalism after reading the book at the behest of Jarret Wollstein. According to the Ludwig von Mises Institute, it was written just following a period of intense study of the writings of both Ayn Rand and Murray Rothbard. < ref >  < /ref >  It was the first significant anarcho-capitalist work to hit the libertarian movement, coming into print a year before Rothbard's Power and Market although Rothbard's book had been written earlier. 
 
 Like Rothbard's Man, Economy, and State, The Market For Liberty opposes statutory law and advocates natural law as the basis for society; however, unlike Man, Economy, and State, which explains what sort of libertarian legal code the market would create in an anarcho-capitalist society, The Market for Liberty points out that society would not be lawless in the absence of the state. < ref > Brown, Susan Love, The Free Market as Salvation from Government: The Anarcho-Capitalist View, Meanings of the Market: The Free Market in Western Culture, edited by James G. Carrier, Berg/Oxford, 1997, p. 113. < /ref >  Conversely, The Market for Liberty spends a great deal of time outlining how different businesses and organizational structures would interact in a laissez-faire society and how these interactions would create checks which would ultimately keep the tendency for crime low. In keeping with radical free-market principles, the book is skeptical about the potential for violent anarcho-capitalist revolution to bring about good outcomes. < ref >  < /ref > 
 
 ==Summary== 
 
 ===Part I – The Great Conflict=== 
 Chapter 1, If We Don't Know Where We're Going..., notes the growing dissatisfaction among youth, the many problems society faces, and the need for a clear goal rather than just an adversary (e.g. the state). It claims that the authors are not advocating any type of utopia that depends on the infallibility of man in order to function. It contends that if the present system is brought crashing down without valid ideas having been disseminated about how society can function without governmental rule, people will demand a strong leader, and a Hitler will rise to answer their plea. 
 
 Chapter 2, Man and Society, argues that the nature of man is such that he must think and produce in order to live; and that in order to reach his full potential, he must have the right not only to do these things but to enjoy the rewards of his productive actions. It defines a laissez-faire society as one that  " does not institutionalize the initiation of force and in which there are means for dealing with aggression justly when it does occur. "  It notes that only the possessor of a right can alienate himself from that right. If one does $100 of damage to a taxicab, for instance, then he alienates himself from his right to that $100. The cabbie then has a moral right to use force to collect it. 
 
 Chapter 3, The Self-Regulating Market, states that state interference causes the buyer, the seller, or both in a transaction to lose and that only a voluntary trade can be a completely satisfactory trade. It notes that markets clear; that taxation is economic hemophilia; regulation amounts to slow strangulation; that market monopolies can only attain and maintain monopoly status through excellence and low prices; and that without freedom of the market, no other freedom is meaningful. It criticizes the government for red tape which denies entrepreneurs opportunities to rise out of poverty. 
 
 Chapter 4, Government – An Unnecessary Evil, states that government is a coercive monopoly; that democratic governments decide issues largely on the basis of pressure from special interest groups; and that the notion of  " a government of laws, not of men "  is meaningless because laws must be written and enforced by men, and therefore a government of laws is a government of men. It argues that the eternal vigilance which is held to be the price of liberty is a constant non-productive expenditure of energy, and that is it grossly unreasonable to expect men to keep expending their energy in such a way out of unselfish idealism. It also argues that because of the danger of one interest group using the government to impose laws favoring itself or crippling its opponents, people are constantly in fear of different interest groups. Thus, blacks fear suppression by whites; whites worry about blacks gaining too much power; and any number of other groups, such as labor and management, urbanites and suburbanites, etc., are pitted against each other. Government is identified as a cause of strife. The checks and balances of government are also recognized as a source of waste that is no substitute for external checks such as competition. This chapter identifies many tools by which the government convinces people that government is necessary, such as state schools that brainwash the young into accepting pro-State ideas, investing government with tradition and pomp and identifying it with  " our way of life. "  It also blames people for having a fear of self-responsibility. 
 
 ===Part II – A Laissez-Faire Society=== 
 Chapter 5, A Free and Healthy Economy, begins by noting the difficulties people have in picturing a society radically different from their own. It concludes that poverty would be better addressed by a laissez-faire society for many reasons, including the fact that unemployment is caused by the government, that untaxed businesses would have more profits to reinvest in productivity-enhancing technology, that private charities are more efficient than government, that parents would be more likely to avoid having excess children in the absence of social safety nets, etc. It argues that a plethora of choices in education would emerge in a free market. It also notes that the focus of media in a laissez-faire society would shift from covering government to covering business and individuals and that abuses would be checked by reporters looking for stories on aggression or fraud. The chapter argues that the quality of health care could be more efficiently kept at an adequate level through reputation, standards instituted by insurance companies, etc. It also discusses how currency could be provided without government. 
 
 Chapter 6, Property – The Great Problem Solver, argues that most social problems could be solved through an increase in the amount and type of property owned. It claims that taxation is theft and that regulation by initiated force is slavery. It argues that it should be possible to claim ownership over the ocean floor, the surface of other planets, corridors of airspace, radio wavelengths, and so on, by being the first to occupy them or otherwise clearly stake out territory. It also argues that all public property should be privatized in order to reduce crime and pollution. 
 
 Chapter 7, Arbitration of Disputes, argues that governmental arbiters are not necessary, since a man who agrees to the settlement of disputes by a third party and then breaks the contract would suffer harm to his reputation and be ostracized, thus solving the problem of noncompliance. It notes that the government's judges will tend to be biased in favor of government, since that is the entity from which they receive their salaries and power. It promotes the concept of insurance companies as a substitute for government as the institution used to pursue claims; in the event a person were defrauded, they could file a claim with their insurance company, and the insurer would obtain the right of subrogation. Insurers who, themselves, committed abuses would suffer loss of reputation and be at a competitive disadvantage to more reputable insurers. 
 
 Chapter 8, Protection of Life and Property, asserts that a person has the right to defend his life against aggression; and that he therefore has the right to defend his possessions as well, since they are the results of his investment of parts of his life and are, thus, extensions of that life. It notes,  " Pacifism encourages every thug to continue his violent ways, even though the pacifist may devoutly wish he wouldn't (wishes don't create reality). Pacifistic behavior teaches the aggressor that crime does pay and encourages him to more and bigger aggressions. Such sanctioning of injustices is immoral, and because it is immoral, it is also impractical. "  It argues that self-defense is a personal responsibility, which one can fulfill by hiring an agent to protect him, such as a private defense agency. It distinguishes initiated force from retaliatory force, noting that the former is not a market phenomenon because it acts to destroy the market; but the latter is a market phenomenon because it restrains aggressors who would destroy it and/or exacts reparations from them. It notes that government creates a social environment which breeds crime through its prohibitions on gambling, prostitution, drugs, and so on. It argues that the main role of police is to protect the government, rather than the citizens. It contrasts the police with private defense agencies, which would focus on preventing aggression and whose officers would lack immunity for any offenses they might commit. It also notes that insurance companies might sell policies covering the insured against loss resulting from any type of coercion and that these insurers could bring unruly defense agencies to their knees through ostracism and boycotts. Yet, at the same time, the insurers would seek to avoid taking such action without cause, since it could be costly and result in boycotts against the insurer itself. 
 
 Chapter 9, Dealing with Coercion, argues that punishment in the form of eye for an eye vengeance does nothing to compensate the victim, and therefore opposes justice. It argues that an aggressor should repay the victim for his loss and for all expenses occasioned by the aggression, such as the cost of apprehending the offender. It further states that when an offender could not pay the restitution for a crime in his lifetime, the additional expenses could be paid by the insurance company. 
 
 Chapter 10, Rectification of Injustice, notes that some criminals of a particularly untrustworthy nature might need to work off their debt in workhouses. To insure against refusal to work, the reparations payments would be deducted from each pay before room and board costs, and those who refused to work would not eat or would have only a minimal diet. A variety of degrees of confinement would exist. The argument that the rich would buy crime is refuted by the argument that even a wealthy man could be killed in self-defense if he attempted a violent act and that he would risk his reputation as well. 
 
 Chapter 11, Warring Defense Agencies and Organized Crime, asserts the falsity of the assumption that government is necessary to prevent the initiation of force by arguing that government, as a coercive monopoly, must initiate force in order to survive. It notes several factors that would make a private defense agency avoid aggression. It would put itself at risk of retaliation and would lead its customers to fear that, in the event of a falling-out, it would turn its aggressive force against them. Moreover, insurers would consider the company to be a poor risk. Its employees would also be liable for any damages they caused, which would cause problems between the companies and its employees if it ordered unjustified attacks. It also speculates that a mafia-style agency would be unlikely to survive since there would be no black market to support it. The chapter argues that a tyrant would have more difficulty rising to power under a system of competing private defense agencies than under a governmental system, because customers oppressed by their company could simply switch to another company and obtain protection from the tyrant. 
 
 Chapter 12, Legislation and Objective Law, argues that free men, acting in a free market, would manage their affairs in accordance with natural law. It calls statutory law a clumsy, anachronistic, and unjust hindrance. It also argues that government judges have no market signals to guide their decisions, in contrast to free-market arbiters, who have profit and loss as a built-in correction mechanism. 
 
 Chapter 13, Foreign Aggression, notes that governments obtain the resources used for defense from the people, and those same resources can be used by private defense agencies to protect the people from aggression. It argues that governments aggravate or threaten other governments to the point of armed conflict and then coerce their citizens into protecting them. It notes that the ability of aggression insurers to pay claims would be enhanced by the limited damage resulting from the fact that foreign aggressors would need to use conventional warfare in wars of conquest to avoid destroying the property and slaves they seek to gain. It notes that in a laissez-faire society, there would be no government that could surrender to the enemy; defenders would fight as long as they perceived it was in their best interest. 
 
 Chapter 14, The Abolition of War, argues that government, not business, is responsible for the formation of the military–industrial complex. It notes that the burden of supporting wars falls heavily on business, since taxes are taken out of the pocket of the consumer. Moreover, businesses are a society's producers, and it is a society's producers who pay the bills. Business also suffers from wars because of the disruption of trade and the ruin and poverty that result. Government, however, gains from wars because it is left with more power, more money, and more territory. War also helps unite the people behind the government in the face of a  " common enemy. "  The authors conclude that all that is needed to abolish war is to abolish government. 
 
 ===Part III – How Do We Get There?=== 
 Chapter 15, From Government to Laissez Faire, argues that first and foremost, the economy should be provided with media of exchange to replace the dollar. It states that possession of public property should be taken by individuals who, simply by clearly marking their claims, become the rightful owners. It argues against disposing of public property at auction, since bureaucrats would figure out ways to divert the proceeds into their own pockets, and the system would be biased toward the rich, many of whom obtained their wealth through political pull. The process of auctioning off the property would also prolong the politicians' power. 
 
 Chapter 16, The Force Which Shapes the World, argues that it is immoral to destroy the private property or life of an individual who has not aggressed against one. It argues that violent revolution is not only destructive, but actually strengthens the government. It also notes that a revolution's leaders could then become the next rulers. Because of the people's desire for a leader to get them out of chaos, the chapter opines that a violent revolution would pave the way for a new Hitler. This chapter calls for people to share ideas related to freedom, which can eventually lead to widespread non-cooperation with government. 
 
 ==References== 
  
 
 ==External links== 
 * Morris and Linda Tannehill, The Market for Liberty (Lansing, Michigan: self-published, 1970). 
 * Another source for the book, The Market for Liberty 
 * Morris and Linda Tannehill, Foreword by Karl Hess, Introduction by Douglas Casey, The Market for Liberty (San Francisco, California: Fox  &  Wilkes, third edition, 1993),  (hardcover),  
 * Foreword by Karl Hess and Chapters 1, 3 and 5 from The Market for Liberty by Linda and Morris Tannehill 
 * Free audio book of The Market for Liberty (Read by Ian Bernard) 
 
  
 
  
 Category:1970 books 
 Category:Anarcho-capitalist books
In [793]:
for i in range(5):
    print(books[i][3])
16290
6065
826
5225
2805

The maximum and minimum article lengths should be found to set basic bounds to reduce our articles corpus based on unsuitible lengths.

In [794]:
def max_article_length(books_list):
    return max([books_list[x][3] for x in range(len(books_list))])

def min_article_length(books_list):
    return min([books_list[x][3] for x in range(len(books_list))])

print('Maximum article text length is: {}'.format(max_article_length(books)))
print('Minimum article text length is: {}'.format(min_article_length(books)))
Maximum article text length is: 251164
Minimum article text length is: 0

Let's now find which books have the longest and shortest article lengths. Cache the data after the first run to avoid running it again.

In [795]:
%%cache mycache_article_lengths.pkl longest_article, shortest_article
longest_article = None
shortest_article = None

# Longest and shortest book articles:
for i in range(len(books)):
    if books[i][3] == max_article_length(books):
        longest_article = [books[i][0], books[i][3]]

for i in range(len(books)):
    if books[i][3] == min_article_length(books):
        shortest_article = [books[i][0], books[i][3]]
[Skipped the cell's code and loaded variables longest_article, shortest_article from file '/Users/andrew/Work/Learning/HU_IS/mycache_article_lengths.pkl'.]
In [796]:
print(longest_article)
print(shortest_article)
['The Spirit of the Age', 251164]
['Draft:Treecat Wars', 0]

Removing Books

The following defines a function that removes a book from the list based on indexes that we will derive. The list object that is the book will be replaced by the string 'Delete me!', and all entries that are not this string will be kept. In looking at the titles of the books collected, those that had a semicolon were shown to be those that were most likely not singular books, so I focused on that to remove unneeded articles.

In [797]:
def book_remover(books_list, removal_index_list):
    for i in removal_index_list:
        books_list[i] = 'Delete me!'
    books_list = [book for book in books_list if book != 'Delete me!']
    return books_list

We use book remover to delete the following kinds of articles:

  • Commentaries
  • Dr. Who Anthologies
  • Draft Articles
  • Lists of Books
  • Anthologies
  • Short Story Collections

Deriving Most Popular Books

For further exploration of the corpus, let's look into which book titles appear most often in the article texts of other books. Again pickling the results, as this cell takes a very long time to run.

In [891]:
%%cache mycache_title_frequencies.pkl title_frequencies

def title_freq_generator(books_list):
    title_frequencies = []
    prog_marker = 0
    for i in range(len(books_list)):
        freq = 0
        book = books_list[i]
        title = book[0]
        other_book_indexes = list(range(len(books_list)))
        other_book_indexes.remove(i)
        for o in other_book_indexes:
            if title in books_list[o][1]:
                freq += 1
        title_frequencies.append((title, freq))
        prog_marker += 1
        
        if prog_marker % 1000 == 0:
            print('{} title frequencies have been found.'.format(prog_marker))
    return title_frequencies

title_frequencies = title_freq_generator(books)

What's the result?

In [811]:
most_frequent_mentions = sorted(title_frequencies, key=lambda x: x[1], reverse=True)
most_frequent_mentions[25:50]
Out[811]:
[('Don Quixote', 61),
 ("Uncle Tom's Cabin", 59),
 ('Concluding', 52),
 ('The Hollow', 50),
 ('On the Road', 49),
 ('Brave New World', 48),
 ('Eldest', 45),
 ('Oxford English Dictionary', 45),
 ('Lolita', 45),
 ("Alice's Adventures in Wonderland", 44),
 ('Ramona', 44),
 ('Paradise Lost', 43),
 ('The Great Gatsby', 43),
 ('American Psycho', 43),
 ('The Wonderful Wizard of Oz', 42),
 ('War and Peace', 42),
 ('The Fur', 41),
 ('Nineteen Eighty-Four', 41),
 ('Decline and Fall', 40),
 ('Robinson Crusoe', 39),
 ('The Secret History', 38),
 ('The Hobbit', 38),
 ('Jane Eyre', 38),
 ('Another Kind', 38),
 ("Gulliver's Travels", 37)]

We used the above range to remove common English word books that appear at the top. Some famous books that appeard before place 25 were: The Road by Cormac McCarthy, The Trial (Der Process) by Franz Kafka, Encyclopedia Britannica, The Encyclopedia of Science Fiction, Dracula by Bram Stoker, and The Prince by Niccolò Machiavelli. The novels obviously benefited from having relatively common names, which is to say nothing of their content. Furthermore, the obvious omission is The Bible, which is does not have a book infobox, and is thus not a part of the dataset.

The encyclopedias would thus rank first and second, and Cervantes gets the award for the most mentioned novel on English Wikipedia with Don Quixote. May Don Quixote and Sancho Panza be ever appreciated for their battles against ferocious giants that are actually windmills.

Cleaning Book Names

Now that we've done an analysis of titles in text bodies, let's remove the parethases from them. This was not done before this step so as to avoid further common English words from being added to the frequency aggregation. Famous books rarely have (novel) or (book).

In [812]:
for i in range(15):
    print(books[i][0])

def book_tag_remover(books_new, books_old):
    for x in range(len(books_old)):
        books_old[x][0] = re.sub(r'\(.*?\)', '', books_old[x][0])
        if books_old[x][0][-1] == ' ': # Removes the space that's left at the end of the books that have the tag removed
            books_old[x][0] = books_old[x][0][:-1]
        books_new.append(books_old[x])
    
    return books_new
books_no_tag = []

print('-----')
books_no_tag = book_tag_remover(books_no_tag, books)
for i in range(15):
    print(books_no_tag[i][0])
The Market for Liberty
Complete Mage
Live from Golgotha: The Gospel According to Gore Vidal
The Night of the Mary Kay Commandos
Confessions of a Yakuza
90 Minutes in Heaven
Egalitarianism as a Revolt Against Nature and Other Essays
Hyperion (Hölderlin novel)
The Blind Side: Evolution of a Game
Katrina (novel)
Family Without a Name
Leonardo da Vinci: The Flights of the Mind
The First Intimate Contact
Krabat (novel)
The Ultimate Sniper
-----
The Market for Liberty
Complete Mage
Live from Golgotha: The Gospel According to Gore Vidal
The Night of the Mary Kay Commandos
Confessions of a Yakuza
90 Minutes in Heaven
Egalitarianism as a Revolt Against Nature and Other Essays
Hyperion
The Blind Side: Evolution of a Game
Katrina
Family Without a Name
Leonardo da Vinci: The Flights of the Mind
The First Intimate Contact
Krabat
The Ultimate Sniper

Dealing With the Article Length Distribution

And now we proceed further with some necessary cleaning. Let's look at the distribution of article lengths that we have.

In [819]:
def roundup_100(x):
    return x if x % 100 == 0 else x + 100 - x % 100

def round_lengths(books_list):
    return [roundup_100(books_list[x][3]) for x in range(len(books_list))]

rounded_to_100_lengths = round_lengths(books_no_tag)
print(rounded_to_100_lengths[:5])
[16300, 6100, 900, 5300, 2900]
In [820]:
length_counts = {x:rounded_to_100_lengths.count(x) for x in rounded_to_100_lengths}
print(sorted(length_counts.items())[:5])

x = [tup[0] for tup in sorted(length_counts.items())]
y = [tup[1] for tup in sorted(length_counts.items())]

plt.figure(figsize=(20,10))
plt.bar(x,y, width=100)
plt.title('Number of Articles by Article Length', fontsize=30)
plt.xlabel("Ranges of Article Lengths (by 100 charaters)", fontsize=20)
plt.ylabel("Number of Articles", fontsize=20)
plt.show()
[(100, 2), (200, 147), (300, 318), (400, 523), (500, 822)]

As we can see, there is an inordinate number of books with almost no length, and that is matched by books at the higher end of the spectrum that upon inspection are shown to have large sections on cultural significance.

The shortest articles were found to simply be those that said that an author wrote a book, with nothing else in its content.

And now the longest books should be inspected.

In [822]:
longest_articles = books_sorted[round(len(books_sorted)*0.9995):]

# Longest books:
for i in range(len(longest_articles)):
    print(longest_articles[i][0])
Timeline of science fiction
Sir Gawain and the Green Knight
To Kill a Mockingbird
States and Social Revolutions
Marthandavarma
Gone with the Wind
The Little Prince
Tetrabiblos
Critique of Pure Reason
On the Origin of Species
Moby-Dick
The General Theory of Employment, Interest and Money
Mansfield Park
A Treatise of Human Nature
Finnegans Wake
Characters of Shakespear's Plays
The Britannia's Fist Trilogy
The Spirit of the Age

Obviously we do not need the books at the short end of the article length spectrum. LDA is not going to assign meaningful topics when the only words to derive topics from are the author's name, year or release, and the collection it appears in - even more so when considering that some words may be removed later post tokenization, where some of these words may be removed because they're stop words, appear to frequently, or aren't long enough.

We'll thus remove those books in the bottom 20% of article length and the last three books in the catalogue, which are all collections. Ultimately the decision to not remove a percentage of books at the end of the distribution was made to keep certain extremely influential books in the greater text corpus.

In [823]:
books_no_stub = books_sorted[round(len(books_sorted)*0.20):len(books_sorted)-3]
print('Number deleted: {} | Current catalogue: {} books'.format(len(books_sorted) - len(books_no_stub), len(books_no_stub)))
Number deleted: 7389 | Current catalogue: 29543 books

Let's look again and what we have for the length distribution, and note that we will be dealing with the longer articles in another way once the texts have been tokenized.

In [824]:
rounded_to_100_lengths = round_lengths(books_no_stub)

length_counts = {x:rounded_to_100_lengths.count(x) for x in rounded_to_100_lengths}
print(sorted(length_counts.items())[:5])

x = [tup[0] for tup in sorted(length_counts.items())]
y = [tup[1] for tup in sorted(length_counts.items())]

plt.figure(figsize=(20,10))
plt.bar(x,y, width=100)
plt.title('Number of Articles by Article Length', fontsize=30)
plt.xlabel("Ranges of Article Lengths (by 100 charaters)", fontsize=20)
plt.ylabel("Number of Articles", fontsize=20)
plt.show()
[(1200, 918), (1300, 868), (1400, 868), (1500, 836), (1600, 801)]

Final Text Cleanup

Before we switch over to the tokenized articles, we should switch all articles to one line bodies, remove real expression references, remove those parts of the text that are used to distinguish between parts of an article, remove possessives, remove websites, and finally reduce all spaces in the articles to single spaces such that post cleaning we'll have a readable body.

In [825]:
def text_one_liner(books_list):
    for x in range(len(books_list)):
        books_list[x][1] = books_list[x][1].replace('\n', ' ')
    return books_list

books_clean = text_one_liner(books_no_stub)
In [826]:
def real_exp_remover(books_list):
    for x in range(len(books_list)):
        books_list[x][1] = re.sub(r'\<.*?\>', '', books_list[x][1])
        books_list[x][1] = re.sub(r'\[.*?\]', '', books_list[x][1])
    return books_list

books_clean = real_exp_remover(books_clean)
In [827]:
unneeded_filler = ['Category:', '(hardcover)', '(paperback)', '==External links==', '==Consequences==', '==Discussion==', 
                   '==References==', '==Plot==', '==Contents==', '==Reception==', '==Notes==', '==Themes==', '==Miniseries adaptation==',
                   '==Sources==', '==See also==', '==Plot summary==', '==Quotations==', '==Publication==',
                   '==Title==', '== References ==', '== External links ==', '==Further reading==', '==Introduction==',
                   '==Main characters==', '==Release details==', '===Summary===', '*', '====', '===', '==', '& nbsp;', '.com', 'jpg', '❤']

def filler_remover(books_list):
    for x in range(len(books_list)):
        for filler in unneeded_filler:
            books_list[x][1] = books_list[x][1].replace(filler, '')
    return books_list

books_clean = filler_remover(books_clean)
In [828]:
def possessive_fixer(books_list):
    for x in range(len(books_list)):
        books_list[x][1] = books_list[x][1].replace("\'", "")
    return books_list

books_clean = possessive_fixer(books_clean)
In [829]:
def website_remover(books_list):
    for x in range(len(books_list)):
        websites = []
        for word in books_list[x][1].split(' '):
            if word[:4] == 'http':
                websites.append(word)
        
        for website in websites:
            books_list[x][1] = books_list[x][1].replace(website, '')
    return books_list

books_clean = website_remover(books_clean)
In [830]:
def space_reducer(books_list):
    for x in range(len(books_list)):
        for i in range(25, 0, -1): #Loop backwards to assure that smaller spaces aren't made by larger deletions and then missed
            large_space = str(i*' ')
            if large_space in books_list[x][1]:
                books_list[x][1] = books_list[x][1].replace(large_space, ' ')
    return books_list
                
books_clean = space_reducer(books_clean)
books_clean[0]
Out[830]:
['The Great Sea: A Human History of the Mediterranean',
 'The Great Sea: A Human History of the Mediterranean is an award-winning book by the British historian David Abulafia. First published in 2011, it is a history of the Mediterranean Sea from 22,000 BC to the present time, and provides one of the most comprehensive treatments of the subject since the works of Fernand Braudel. The book has been critically acclaimed and received the Mountbatten Literary Award from the Maritime Foundation, and the British Academy Medal. Richard Bosworth review, Times Higher Education Supplement Jane Liddell-King review, Jewish Chronicle It has so far been translated into Dutch, Greek, Turkish, Spanish, Korean, German, Italian, Romanian and Portuguese. Notes and references 2011 books History books about Europe History books about civilization Penguin Books books',
 '2018-07-29T13:53:26Z',
 1102]

Preparing Articles for Data Analysis

The following was done to prepare articles for dataanalysis:

  • The book articles are removed from the initial list of book properties
  • Punctuation is removed
  • Article strings are turned into tokens, and stop words are removed
  • Numeric tokens were removed
  • Bigrams are created
  • Words are lemmatized to reduce words of similar roots to similar tokens
  • English first names are removed, as these are often very prevalent in articles (note that bigrams of full names are retained)
  • Token amounts per article are rounded off at an exponentially increasing rate (we'll more explain when we get there)
  • Tokens of that are less than four characters are removed
  • Only tokens that appear in at least ten articles are kept
  • Those tokens that appear in more than 15% of all articles will be removed

And here we are to the stage where we need to get rid of the extra information at the end of articles. To do this, we looked at the longest book article in our text corpus - Finnigan's Wake (article here: https://en.wikipedia.org/wiki/Finnegans_Wake). The sections slowly decrease in topic modeling importance starting with "Allusions to Other Works," which would help us connect it to other novels via bigrams, and its Norwegian influences, leading to the section on 100 letter words that Joyce uses.

The hundred letter words, for example:

bababadalgharaghtakamminarronnkonnbronntonnerronntuonnthunntrovarrhounawnskawntoohoohoordenenthurnuk

... were all deleted when the real expressions were removed from the text. In the section though the word "hundredlettered" appears, which appears only once in the text. We'll use this as our reference point for where we want the text to be shortened. Let's first find at what point this word is found.

In [877]:
articles_no_name[-1].index('hundredlettered')
Out[877]:
7117
In [878]:
articles_no_name[-1].index('hundredlettered')/len(articles_no_name[-1])
Out[878]:
0.7212931995540691

Let's further look at our distribution of tokens per article.

In [879]:
def round_lengths_exp(articles_list):
    return [roundup_100(len(articles_list[x])) for x in range(len(articles_list))]
In [880]:
rounded_to_100_now = round_lengths_exp(articles_no_name)

length_counts_now = {x:rounded_to_100_now.count(x) for x in rounded_to_100_now}
print(sorted(length_counts_now.items())[:20])

x_now = [tup[0] for tup in sorted(length_counts_now.items())]
y_now = [tup[1] for tup in sorted(length_counts_now.items())]

plt.figure(figsize=(20,10))
plt.bar(x_now, y_now, width=100)
plt.title('Current Number of Articles by Token Amount', fontsize=30)
plt.xlabel("Ranges of Token Amounts (by 100)", fontsize=20)
plt.ylabel("Number of Articles", fontsize=20)
plt.show()
[(100, 918), (200, 8114), (300, 5137), (400, 3419), (500, 2510), (600, 1903), (700, 1410), (800, 1098), (900, 884), (1000, 677), (1100, 541), (1200, 419), (1300, 328), (1400, 278), (1500, 249), (1600, 212), (1700, 179), (1800, 132), (1900, 123), (2000, 121)]

Our goal is thus to write a function to remove the last 30% of the Finnegans Wake article and other similarly long articles, but remove little or no words from smaller articles. We'll achive this by using the fact that our articles are ordered by length, and write an exponential funciton. We'll count up by index in the list of books as we go through it, and put this value into a sixth degree polynomial such that the rate of increase will be rapid.

To prevent it from increasing in the early part of the article distribution, we'll multiply the funciton by a very small decimal to keep all initial values close to 0 until such a time when the exponential increase is fast enough to overcome the initial reduction. Looking at the table above, with roughly half of our articles having less than 400 tokens, and then the amounts drastically increase, let's design a function that's close to 0 until 15,000, and then increases rapidly to around 0.3 at our maximum article index of 29543.

In [881]:
x_vals = list(range(len(articles_no_name)))
x_vals[:5]
Out[881]:
[0, 1, 2, 3, 4]
In [882]:
length = len(x_vals)
print(length)

y_vals = [0.00000000000000000000000000045*(x**(6)) for x in x_vals]
y_vals[:5]
29543
Out[882]:
[0.0, 4.5e-28, 2.88e-26, 3.2805e-25, 1.8432e-24]
In [883]:
plt.figure(figsize=(10,5))
plt.plot(x_vals, y_vals, linestyle='solid')
Out[883]:
[<matplotlib.lines.Line2D at 0x18f609828>]

Trial and error at its finest. The y-values of the above function will thus be the amounts rouded off of the end of each article, or more specifically 1 minus these amounts will be multiplied by the length of the token amount per article and used as the upper bound for token subsetting, with the lower bound of course being 0.

In [884]:
import math

def remove_last(new_list, old_list):
    prog_marker = 0
    for article in old_list:
        article_percent_remaining = 1-(0.00000000000000000000000000045*(prog_marker**(6)))
        new_article = article[:round(len(article)*article_percent_remaining)]
        new_list.append(new_article)
        prog_marker += 1
        
#         if prog_marker % 1000 == 0:
#             print('{} articles have had their sizes reduced.'.format(prog_marker))
    
    print(token_counter(new_list, old_list))
    
    return new_list
articles_last = []

articles_last = remove_last(articles_last, articles_no_name)
print(articles_last[0])
Unique tokens deleted: 31974 | New total unique tokens in all articles: 322076
None
['notes_references', 'german_italian', 'translated_dutch', 'education_supplement', 'times_higher', 'literary_award', 'critically_acclaimed', 'mediterranean_sea', 'award_winning', 'great', 'sea', 'human', 'history', 'mediterranean', 'award', 'win', 'british', 'historian', 'abulafia', 'publish', 'history', 'mediterranean', 'sea', 'bc', 'present', 'time', 'provide', 'one', 'comprehensive', 'treatment', 'subject', 'since', 'work', 'fernand', 'braudel', 'critically', 'acclaim', 'receive', 'mountbatten', 'literary', 'award', 'maritime', 'foundation', 'british', 'academy', 'medal', 'bosworth', 'review', 'time', 'higher', 'education', 'supplement', 'liddell', 'king', 'review', 'jewish', 'chronicle', 'far', 'translate', 'dutch', 'greek', 'turkish', 'spanish', 'korean', 'german', 'italian', 'romanian', 'portuguese', 'note', 'reference', 'history', 'europe', 'history', 'civilization', 'penguin']

Let's take a look at our new distribution.

In [885]:
rounded_to_100_new = round_lengths_exp(articles_last)

length_counts_new = {x:rounded_to_100_new.count(x) for x in rounded_to_100_new}
print(sorted(length_counts_new.items())[:20])

x_new = [tup[0] for tup in sorted(length_counts_new.items())]
y_new = [tup[1] for tup in sorted(length_counts_new.items())]

plt.figure(figsize=(20,10))
plt.bar(x_new, y_new, width=100)
plt.title('New Number of Articles by Token Amount', fontsize=30)
plt.xlabel("Ranges of Token Amounts (by 100)", fontsize=20)
plt.ylabel("Number of Articles", fontsize=20)
plt.show()
[(100, 918), (200, 8114), (300, 5185), (400, 3535), (500, 2719), (600, 2095), (700, 1659), (800, 1275), (900, 926), (1000, 642), (1100, 486), (1200, 383), (1300, 317), (1400, 195), (1500, 197), (1600, 144), (1700, 119), (1800, 94), (1900, 76), (2000, 71)]

Our upper bound has decreased by 30%, and the early articles were unaffected. It does seem to cluster the articles at the tail a bit more than in the original graph, but let's not split hairs over whether a seventh or eigth degree polynomial with an even more restrictive initial divisor would be more appropriate.

We continue now to short tokens. As stated before, tokens under four characters are removed.

From here we continue with the other cleaning processes that can be found in the main notebook.

Let's see what the most common words in our text corpus are!

In [715]:
text_word_frequency = Counter(chain.from_iterable(articles_model))
text_word_frequency.most_common()[:15]
Out[715]:
[('doctor', 11013),
 ('king', 10593),
 ('black', 9682),
 ('science_fiction', 9638),
 ('volume', 9123),
 ('london', 8792),
 ('murder', 8782),
 ('united_states', 8513),
 ('political', 8313),
 ('social', 8030),
 ('ship', 8019),
 ('sister', 7918),
 ('childrens', 7914),
 ('white', 7840),
 ('earth', 7778)]

Model Prep

We've already imported our LDA package gensim for its Phrases tool within its models, and now we'll import the rest of the parts we'll need.

In [716]:
from gensim import corpora, models, similarities

First, let's create our dictionary, which is a mapping between the words in our corpus and their integer representations. The cell after next will display a few examples of the mappings.

In [717]:
%%time

dictionary = corpora.Dictionary(articles_model)
CPU times: user 9.02 s, sys: 13.3 ms, total: 9.03 s
Wall time: 9.04 s
In [718]:
from itertools import islice
def take(n, iterable):
    return list(islice(iterable, n))

items_5 = take(5, dictionary.iteritems())
items_5
Out[718]:
[(0, 'academy'),
 (1, 'acclaim'),
 (2, 'award_winning'),
 (3, 'bosworth'),
 (4, 'chronicle')]

Now we convert our documents into bag of words (BoW) format, where each will be represented by tuples of token id and token count.

In [719]:
%%time

bow_corpus = [dictionary.doc2bow(text) for text in articles_model]
CPU times: user 6.92 s, sys: 111 ms, total: 7.03 s
Wall time: 7.03 s

How many tokens and articles do we have?

In [720]:
print('Number of unique tokens: %d' % len(dictionary)) # roughly 400 less than in presentation
print('Number of articles: %d' % len(bow_corpus)) # roughly 500 less than in presentation
Number of unique tokens: 44713
Number of articles: 29543

And what does The Great Sea: A Human History of the Mediterranean look like now?

In [721]:
print(bow_corpus[0])
[(0, 1), (1, 1), (2, 1), (3, 1), (4, 1), (5, 1), (6, 1), (7, 1), (8, 1), (9, 1), (10, 1), (11, 1), (12, 1), (13, 1), (14, 1), (15, 1), (16, 1), (17, 1), (18, 1), (19, 1), (20, 1), (21, 1), (22, 1), (23, 1), (24, 1), (25, 1), (26, 1), (27, 1), (28, 2), (29, 1), (30, 1), (31, 1), (32, 1), (33, 1), (34, 1), (35, 1), (36, 1), (37, 1), (38, 1), (39, 1), (40, 1), (41, 1)]
In [722]:
def tokenization_report(bow, article_number):
    article_tokens = bow[article_number]
    sorted_tokens = sorted(article_tokens, key=lambda x: x[1], reverse=True)
    for i in range(len(bow[article_number])):
        print("Word {} (\"{}\") appears {} time(s).".format(sorted_tokens[i][0], 
                                                         dictionary[sorted_tokens[i][0]], 
                                                         sorted_tokens[i][1]))
tokenization_report(bow_corpus, 0)
Word 28 ("mediterranean") appears 2 time(s).
Word 0 ("academy") appears 1 time(s).
Word 1 ("acclaim") appears 1 time(s).
Word 2 ("award_winning") appears 1 time(s).
Word 3 ("bosworth") appears 1 time(s).
Word 4 ("chronicle") appears 1 time(s).
Word 5 ("civilization") appears 1 time(s).
Word 6 ("comprehensive") appears 1 time(s).
Word 7 ("critically") appears 1 time(s).
Word 8 ("critically_acclaimed") appears 1 time(s).
Word 9 ("dutch") appears 1 time(s).
Word 10 ("education") appears 1 time(s).
Word 11 ("education_supplement") appears 1 time(s).
Word 12 ("europe") appears 1 time(s).
Word 13 ("fernand") appears 1 time(s).
Word 14 ("foundation") appears 1 time(s).
Word 15 ("german") appears 1 time(s).
Word 16 ("german_italian") appears 1 time(s).
Word 17 ("greek") appears 1 time(s).
Word 18 ("higher") appears 1 time(s).
Word 19 ("historian") appears 1 time(s).
Word 20 ("italian") appears 1 time(s).
Word 21 ("jewish") appears 1 time(s).
Word 22 ("king") appears 1 time(s).
Word 23 ("korean") appears 1 time(s).
Word 24 ("liddell") appears 1 time(s).
Word 25 ("literary_award") appears 1 time(s).
Word 26 ("maritime") appears 1 time(s).
Word 27 ("medal") appears 1 time(s).
Word 29 ("mediterranean_sea") appears 1 time(s).
Word 30 ("notes_references") appears 1 time(s).
Word 31 ("penguin") appears 1 time(s).
Word 32 ("portuguese") appears 1 time(s).
Word 33 ("romanian") appears 1 time(s).
Word 34 ("spanish") appears 1 time(s).
Word 35 ("subject") appears 1 time(s).
Word 36 ("supplement") appears 1 time(s).
Word 37 ("times_higher") appears 1 time(s).
Word 38 ("translate") appears 1 time(s).
Word 39 ("translated_dutch") appears 1 time(s).
Word 40 ("treatment") appears 1 time(s).
Word 41 ("turkish") appears 1 time(s).

We can see from the above words that it will be quite difficult to come up with an accurate prediction for the above book. We'll find that with books tht have more specific tokens that link them to similar books, that the results are more obvious, but there is a lot of noise above. There are the languages at the end, that it references countless European countries - so we're hoping it would be referred to other books that are a summary of a European region and are "critically acclaimed history that chronicles civilization" - to use other tokens.

Determining Optimal Topic Number

At this point we need to determine our optimal number of topics. We do this as we would using cluthering methods like k-nearest neighbors, by producing models at a regularly increasing interval of topic amounts, and using an elbow method based on the Jaccard similarity to select an amount that gives sufficient coverage, but doesn't over cover such that the topics are overly similar and incldue too many articles.

For the upcoming LDA model, we'll need to understand the following:

  • num_topics: number of requested latent topics to be extracted from the training corpus.
  • passes: number of passes through the corpus during training.
  • update_every: number of documents to be iterated through for each update. 1 = iterative learning.

And of course, we'll be using random state 42. What respectable programming project on books would not use 42 as is practice - in homage to the Hitchhiker's Guide to the Galaxy. It is "The Answer to the Ultimate Question of Life, The Universe, and Everything."

In [723]:
topicnums = [1,5,10,15,20,25,30,35,40,45,50]
project_folder = os.getcwd()

ldamodels_bow = {}
for i in topicnums:
    random.seed(42)
    if not os.path.exists(project_folder+'/models/ldamodels_bow_'+str(i)+'.lda'):
        %time ldamodels_bow[i] = models.LdaModel(bow_corpus, num_topics=i, random_state=42, update_every=1, passes=10, id2word=dictionary)
        ldamodels_bow[i].save(project_folder+'/models/ldamodels_bow_'+str(i)+'.lda')
        print('ldamodels_bow_{}.lda created.'.format(i))
    else: 
        print('ldamodels_bow_{}.lda already exists.'.format(i))
ldamodels_bow_1.lda already exists.
ldamodels_bow_5.lda already exists.
ldamodels_bow_10.lda already exists.
ldamodels_bow_15.lda already exists.
ldamodels_bow_20.lda already exists.
ldamodels_bow_25.lda already exists.
ldamodels_bow_30.lda already exists.
ldamodels_bow_35.lda already exists.
ldamodels_bow_40.lda already exists.
ldamodels_bow_45.lda already exists.
ldamodels_bow_50.lda already exists.
In [724]:
lda_topics = {}
for i in topicnums:
    lda_model = models.LdaModel.load(project_folder+'/models/ldamodels_bow_'+str(i)+'.lda')
    lda_topics_string = lda_model.show_topics(i)
    lda_topics[i] = ["".join([c if c.isalpha() else " " for c in topic[1]]).split() for topic in lda_topics_string]

pickle.dump(lda_topics,open(project_folder+'/models/pub_lda_bow_topics.pkl','wb'))

Now that we have our models, we need to find what the best number of topics is. As stated before, we'll be using Jaccard Similarity:

  • A statistic used for comparing the similarity and diversity of sample sets.
  • J(A,B) = (A ∩ B)/(A U B)
  • Goal is low Jaccard scores for coverage of the diverse elements

We'll first define a function, then compute the Jaccard similarities accross the topics at each topic number level, and then graph these to find our elbow point.

In [725]:
def jaccard_similarity(query, document):
    intersection = set(query).intersection(set(document))
    union = set(query).union(set(document))
    return float(len(intersection))/float(len(union))
In [726]:
lda_stability = {}
for i in range(0,len(topicnums)-1):
    jacc_sims = []
    for t1,topic1 in enumerate(lda_topics[topicnums[i]]):
        sims = []
        for t2,topic2 in enumerate(lda_topics[topicnums[i+1]]):
            sims.append(jaccard_similarity(topic1,topic2))    
        jacc_sims.append(sims)    
    lda_stability[topicnums[i]] = jacc_sims
    
pickle.dump(lda_stability,open(project_folder+'/models/pub_lda_bow_stability.pkl','wb'))
In [727]:
lda_stability = pickle.load(open(project_folder+'/models/pub_lda_bow_stability.pkl','rb'))
mean_stability = [np.array(lda_stability[i]).mean() for i in topicnums[:-1]]

with sns.axes_style("darkgrid"):
    x = topicnums[:-1]
    y = mean_stability
    plt.figure(figsize=(20,10))
    plt.plot(x,y,label='Average Overlap Between Topics')
    plt.xlim([1, 55])
    plt.ylim([0, 0.25])
    plt.xlabel('Number of topics')
    plt.ylabel('Average Jaccard similarity')   
    plt.title('Average Jaccard Similarity Between Topics')
    plt.legend()    
    plt.show()

Deploying Optimal Model

From above we can see that the similarity continues to decrease after 30, and the decrease is farily consistent. We don't want too many topics, so we'll use that as our optimal model.

In [728]:
num_topics = 30
lda_model_final = models.LdaModel.load(project_folder+'/models/ldamodels_bow_'+str(num_topics)+'.lda')

Here's where it starts to get really good! Let's look at our topics. The next cell will print all 30 topics. Each entry will have the top 10 words by significance, followed by the whole topics coherence - the average of the pairwise word-similarity scores of all words within the topic.

In [729]:
from pprint import pprint
top_topics = lda_model_final.top_topics(bow_corpus, topn=10)

avg_topic_coherence = sum([t[1] for t in top_topics]) / num_topics
print('Average topic coherence: %.4f.' % avg_topic_coherence)

counter = 0
for topic in top_topics:
    print('Topic {}:'.format(counter))
    counter += 1
    pprint(topic)
    
pickle.dump(lda_topics,open(project_folder+'/models/pub_lda_bow_topics.pkl','wb'))
Average topic coherence: -2.3491.
Topic 0:
([(0.014210283, 'theory'),
  (0.0112899225, 'argue'),
  (0.008718467, 'social'),
  (0.00857375, 'nature'),
  (0.007735236, 'religion'),
  (0.00772367, 'understand'),
  (0.007135545, 'individual'),
  (0.006946929, 'philosophy'),
  (0.006513779, 'belief'),
  (0.0064604846, 'question')],
 -1.3977855682235243)
Topic 1:
([(0.022364164, 'earth'),
  (0.016499098, 'science_fiction'),
  (0.015664991, 'planet'),
  (0.014607186, 'ship'),
  (0.014023151, 'space'),
  (0.008566551, 'system'),
  (0.0081997765, 'universe'),
  (0.007855967, 'future'),
  (0.007540402, 'alien'),
  (0.00751091, 'crew')],
 -1.5175084477989171)
Topic 2:
([(0.016778223, 'king'),
  (0.008709668, 'fight'),
  (0.008248136, 'ship'),
  (0.008135976, 'land'),
  (0.007879827, 'magic'),
  (0.007843292, 'battle'),
  (0.00744357, 'captain'),
  (0.007309842, 'save'),
  (0.00702877, 'rescue'),
  (0.0067325197, 'capture')],
 -1.5350774502960836)
Topic 3:
([(0.011919974, 'narrator'),
  (0.009990753, 'letter'),
  (0.009893245, 'narrative'),
  (0.007654871, 'style'),
  (0.006902002, 'critic'),
  (0.0051097968, 'period'),
  (0.0048818165, 'influence'),
  (0.004799381, 'represent'),
  (0.004794581, 'refer'),
  (0.0046267537, 'modern')],
 -1.5597872866477989)
Topic 4:
([(0.018360177, 'political'),
  (0.012718209, 'government'),
  (0.009647229, 'social'),
  (0.008784685, 'united_states'),
  (0.007923653, 'economic'),
  (0.007060242, 'country'),
  (0.0067271455, 'nation'),
  (0.006516468, 'revolution'),
  (0.0062044268, 'america'),
  (0.0061669247, 'public')],
 -1.5599439258697319)
Topic 5:
([(0.0116070965, 'husband'),
  (0.0108911125, 'marriage'),
  (0.0058457297, 'affair'),
  (0.0056121396, 'money'),
  (0.005192735, 'lover'),
  (0.0051382044, 'suicide'),
  (0.0042379536, 'commit'),
  (0.0041104686, 'agree'),
  (0.0038668262, 'past'),
  (0.003665546, 'wealthy')],
 -1.646380636648313)
Topic 6:
([(0.024184082, 'army'),
  (0.023187194, 'soldier'),
  (0.020261638, 'general'),
  (0.019815935, 'military'),
  (0.019323546, 'officer'),
  (0.013818595, 'battle'),
  (0.010859633, 'command'),
  (0.010739857, 'train'),
  (0.010175608, 'colonel'),
  (0.009874464, 'world_war')],
 -1.6867478830151577)
Topic 7:
([(0.037160113, 'murder'),
  (0.026238658, 'police'),
  (0.02024448, 'crime'),
  (0.020038037, 'case'),
  (0.01935841, 'chinese'),
  (0.013693445, 'detective'),
  (0.011121611, 'arrest'),
  (0.010452749, 'mystery'),
  (0.009959042, 'shoot'),
  (0.009245443, 'investigation')],
 -1.7432272774248974)
Topic 8:
([(0.005759434, 'drug'),
  (0.0057452354, 'talk'),
  (0.005614311, 'leaf'),
  (0.0055833454, 'sleep'),
  (0.0054389895, 'realize'),
  (0.0054172096, 'something'),
  (0.004769118, 'drink'),
  (0.0043631014, 'hear'),
  (0.0042530345, 'farm'),
  (0.003797971, 'drive')],
 -1.7478979570226194)
Topic 9:
([(0.01959422, 'sister'),
  (0.016744334, 'parent'),
  (0.0118366135, 'student'),
  (0.008595059, 'teacher'),
  (0.0084526315, 'class'),
  (0.0077684075, 'adult'),
  (0.007369701, 'older'),
  (0.006873682, 'teach'),
  (0.005862473, 'dream'),
  (0.0057722516, 'best_friend')],
 -1.778207665391798)
Topic 10:
([(0.0128823975, 'volume'),
  (0.008358772, 'section'),
  (0.007910455, 'church'),
  (0.007177777, 'example'),
  (0.0066198693, 'source'),
  (0.0064351214, 'information'),
  (0.00595193, 'method'),
  (0.005746086, 'research'),
  (0.005351031, 'print'),
  (0.0052332627, 'article')],
 -1.794099789040294)
Topic 11:
([(0.01124156, 'poem'),
  (0.008071868, 'interview'),
  (0.0071447953, 'song'),
  (0.0070351516, 'music'),
  (0.0064988206, 'march'),
  (0.006423904, 'april'),
  (0.0063485648, 'sell'),
  (0.0063280785, 'retrieve'),
  (0.006230698, 'october'),
  (0.006212553, 'november')],
 -1.8048452592689153)
Topic 12:
([(0.0323321, 'german'),
  (0.022588069, 'russian'),
  (0.019555489, 'soviet'),
  (0.014192136, 'germany'),
  (0.0137838535, 'world_war'),
  (0.013041501, 'jewish'),
  (0.012456559, 'empire'),
  (0.011271158, 'europe'),
  (0.011054829, 'union'),
  (0.011014814, 'nazi')],
 -1.8741724735337473)
Topic 13:
([(0.03949716, 'game'),
  (0.018377122, 'dragon'),
  (0.013591324, 'player'),
  (0.011237182, 'rule'),
  (0.008202068, 'adventure'),
  (0.007832079, 'monster'),
  (0.0065095993, 'class'),
  (0.0064722607, 'master'),
  (0.0058553563, 'spell'),
  (0.0053168484, 'dungeon')],
 -1.9477208205549605)
Topic 14:
([(0.013901363, 'television'),
  (0.011346217, 'agent'),
  (0.010947663, 'team'),
  (0.010700805, 'movie'),
  (0.010332861, 'simon'),
  (0.009609766, 'episode'),
  (0.009248384, 'production'),
  (0.009111157, 'intelligence'),
  (0.0080780545, 'director'),
  (0.007920342, 'secret')],
 -2.0763130165604404)
Topic 15:
([(0.01851417, 'town'),
  (0.010858672, 'miss'),
  (0.009146825, 'local'),
  (0.008641151, 'village'),
  (0.008489788, 'river'),
  (0.0074412143, 'island'),
  (0.006559046, 'uncle'),
  (0.005735745, 'aunt'),
  (0.005554381, 'land'),
  (0.0054394323, 'fish')],
 -2.1363899662573163)
Topic 16:
([(0.034541868, 'french'),
  (0.023374217, 'translation'),
  (0.023141049, 'london'),
  (0.016712416, 'translate'),
  (0.015694914, 'france'),
  (0.014250788, 'paris'),
  (0.010814258, 'freud'),
  (0.009870471, 'thumb'),
  (0.009161403, 'volume'),
  (0.008418526, 'italian')],
 -2.139195050739163)
Topic 17:
([(0.01193989, 'blue'),
  (0.010204266, 'paint'),
  (0.010081834, 'artist'),
  (0.009644983, 'plant'),
  (0.009514295, 'illustration'),
  (0.009123517, 'comic'),
  (0.008370122, 'food'),
  (0.007789589, 'image'),
  (0.0076526846, 'white'),
  (0.0073237712, 'color')],
 -2.149956956229622)
Topic 18:
([(0.024467716, 'wolfe'),
  (0.018596414, 'science_fiction'),
  (0.017780738, 'carter'),
  (0.015258154, 'tale'),
  (0.014172577, 'paperback'),
  (0.011525531, 'anthology'),
  (0.010885688, 'hugo'),
  (0.010555099, 'short_collections'),
  (0.010502694, 'novella'),
  (0.009911158, 'camp')],
 -2.308774994925293)
Topic 19:
([(0.045170207, 'doctor'),
  (0.0077072703, 'serial'),
  (0.0064508836, 'angel'),
  (0.005989482, 'martin'),
  (0.0056082993, 'memory'),
  (0.0055338745, 'episode'),
  (0.005497826, 'room'),
  (0.0054322276, 'persuade'),
  (0.0054237135, 'lucy'),
  (0.00437867, 'realise')],
 -2.5331837746137498)
Topic 20:
([(0.03563244, 'childrens'),
  (0.023687357, 'vampire'),
  (0.019932605, 'animal'),
  (0.016886814, 'tale'),
  (0.01330407, 'fairy'),
  (0.010796315, 'weekly'),
  (0.0107140755, 'rabbit'),
  (0.010367858, 'publishers_weekly'),
  (0.009816919, 'mouse'),
  (0.009730396, 'kirkus_reviews')],
 -2.7010429292631466)
Topic 21:
([(0.018113224, 'rise'),
  (0.01474904, 'monk'),
  (0.013873671, 'ghost'),
  (0.012749598, 'christmas'),
  (0.0079808505, 'london'),
  (0.007388651, 'new_yorker'),
  (0.0067818156, 'flower'),
  (0.00635282, 'club'),
  (0.006333892, 'room'),
  (0.0062948302, 'glass')],
 -2.736775087598357)
Topic 22:
([(0.022093631, 'lady'),
  (0.018747235, 'prince'),
  (0.017867122, 'lord'),
  (0.012942275, 'poirot'),
  (0.011317158, 'castle'),
  (0.0111694345, 'king'),
  (0.009351824, 'princess'),
  (0.0089676455, 'bird'),
  (0.00871652, 'tower'),
  (0.008394809, 'christie')],
 -2.7935406250956714)
Topic 23:
([(0.0542325, 'black'),
  (0.0363144, 'sexual'),
  (0.031445514, 'white'),
  (0.022780227, 'african'),
  (0.022344515, 'male'),
  (0.021040868, 'anne'),
  (0.020896407, 'female'),
  (0.018415984, 'dictionary'),
  (0.017478526, 'african_american'),
  (0.016358454, 'hitler')],
 -2.9508915936515616)
Topic 24:
([(0.031507593, 'ndash'),
  (0.02928748, 'date'),
  (0.020967891, 'roman'),
  (0.013960156, 'rome'),
  (0.012234584, 'ancient'),
  (0.011939464, 'pub_date'),
  (0.011450172, 'temple'),
  (0.011298914, 'paperback'),
  (0.0105419485, 'campbell'),
  (0.009935377, 'dynasty')],
 -3.025732807006353)
Topic 25:
([(0.017457977, 'tribe'),
  (0.014439859, 'insist'),
  (0.013865065, 'energy'),
  (0.0131049575, 'warrior'),
  (0.012810626, 'charlie'),
  (0.009723709, 'capital'),
  (0.009708381, 'howard'),
  (0.007992722, 'conan'),
  (0.007630276, 'pure'),
  (0.0073829642, 'green')],
 -3.2385321403295637)
Topic 26:
([(0.014409648, 'brown'),
  (0.01032986, 'president'),
  (0.008803303, 'false'),
  (0.008038997, 'wilson'),
  (0.007016318, 'white'),
  (0.0063371467, 'v_onepage'),
  (0.006001991, 'bush'),
  (0.005494745, 'navy'),
  (0.005481764, 'disorder'),
  (0.0054458263, 'davis')],
 -3.3020491609449056)
Topic 27:
([(0.022676716, 'australian'),
  (0.019863538, 'australia'),
  (0.018165965, 'spanish'),
  (0.0155634815, 'prize'),
  (0.014795829, 'franklin'),
  (0.014625945, 'irish'),
  (0.013771835, 'civil'),
  (0.012497698, 'nietzsche'),
  (0.012015728, 'ireland'),
  (0.01182726, 'south')],
 -3.5232648749880746)
Topic 28:
([(0.04409739, 'bond'),
  (0.03681842, 'japanese'),
  (0.02122669, 'fleming'),
  (0.019946096, 'ring'),
  (0.017689666, 'japan'),
  (0.017170554, 'miller'),
  (0.01614258, 'victor'),
  (0.011093314, 'tolkien'),
  (0.00995695, 'psychoanalysis'),
  (0.008568351, 'can_seen')],
 -4.583769126673823)
Topic 29:
([(0.02436547, 'slave'),
  (0.022208037, 'indian'),
  (0.014329793, 'wolf'),
  (0.01408431, 'greek'),
  (0.012247402, 'native'),
  (0.010461207, 'oxford'),
  (0.00952866, 'trump'),
  (0.00892509, 'austen'),
  (0.0084893145, 'aubrey'),
  (0.008100312, 'myth')],
 -4.681422887917481)

Again let's return to The Great Sea: A Human History of the Mediterranean. Of which topics is it composed?

In [730]:
corpus_lda_model = lda_model_final[bow_corpus]
for article in corpus_lda_model[0]:
    print(article)
(7, 0.020898186)
(10, 0.15291192)
(11, 0.02973868)
(13, 0.49154004)
(16, 0.032740977)
(17, 0.04881642)
(23, 0.102105364)
(25, 0.104564406)

It's highest correlated topic was 13, which is: game, dragon, player, rule, adventure, monster, class, master, spell, dungeon. Not the best, but with the others we should be able to get more influences to culture and the Mediterranean. (Full disclosure, there are countless books on Dungeons and Dragons in the Corpus that we couldn't fully eliminate, and they do have a tendency to muddy up the results, as seen in topic 13.)

Let's do one final check based on our articles being ordered by length. It would make sense that longer articles would be composed of more topics than shorter ones, so let's graph the topic correlation of the highest correlated optic for each article and see if we can a decreasing relation.

In [731]:
from operator import itemgetter
index_high_corr_list = []

index_counter = 0

for article in corpus_lda_model:
    highest_feature = max(article, key=itemgetter(1))[0]
    max_correlation = max(article, key=itemgetter(1))[1]
    index_high_corr_list.append([index_counter, highest_feature, max_correlation])
    index_counter += 1

index_high_corr_list[:5]
Out[731]:
[[0, 13, 0.49167556],
 [1, 26, 0.41336057],
 [2, 22, 0.29104012],
 [3, 11, 0.21905822],
 [4, 24, 0.46358356]]
In [732]:
x = [tup[0] for tup in index_high_corr_list]
y = [tup[2] for tup in index_high_corr_list]

plt.figure(figsize=(20,10))
scatter = plt.plot(x,y, '.')
line = plt.plot(np.unique(x), np.poly1d(np.polyfit(x, y, 1))(np.unique(x)), 'r')
plt.title('Highest Correlation of Topics/Article Length', fontsize=30)
plt.xlabel("Article Length Rank Ordered", fontsize=20)
plt.ylabel("Highest Topic Correlation", fontsize=20)
plt.show(scatter, line)

Not quite as steep of a decline as we expected, but enough to confirm that we're on the right track (AKA haven't done anything too wrong).

Recommender Demonstration

We're now ready to make our recommendations! Just one final explanation of cosine similarity, which we will use to compare the vectors of assigned topics and their coherences per article.

  • Cosine similarity measures between bag of words vectors give us similarities in the range of -1 to 1 (the greater, the more similar)
  • Similarity = cos(θ) = (A‧B)/(‖A‖‖B‖), ‖X‖ being its norm

We order these outputs for recommendations based on a user chosen book. In the following cell we initialize the similarity matrix between all BoW articles, which we will order for querried books and use as the basis for our recommendations. Be sure to cache the following cell.

In [738]:
%%cache mycache_lda_index.pkl index

index = similarities.MatrixSimilarity(lda_model_final[bow_corpus])
[Saved variables 'index' to file '/Users/andrew/Work/Learning/HU_IS/mycache_lda_index.pkl'.]

And now for the recommender!

It has the following steps:

  • A book title is entered by the user
  • The matching book title is found in the corpus (or a similar one is suggested as before)
  • The similarities of all orther books are quierried from index, our similarity matrix
  • These similarities are enumerated such that a similarity score can be matched to the corresponding title
  • The recommendation scores for all books are appended to a previously created empty list
  • These recommendation scores are ordered
  • The following is printed:
    • The first ten most common tokens in the books article
    • The main topic for the book
    • The 2nd to 11th recommendations (first element would be the querried book itself)
In [739]:
def book_recommender(title):
    books_checked = 0
    for i in range(len(books_clean)):
        recommendation_scores = []
        if books_clean[i][0] == title:
            lda_vectors = corpus_lda_model[i]
            sims = index[lda_vectors]
            sims = list(enumerate(sims))
            for sim in sims:
                book_num = sim[0]
                recommendation_score = [books_clean[book_num][0], sim[1]]
                recommendation_scores.append(recommendation_score)
            
            recommendation = sorted(recommendation_scores, key=lambda x: x[1], reverse=True)
            print("Your book's most prominent tokens are:")
            article_tokens = bow_corpus[i]
            sorted_tokens = sorted(article_tokens, key=lambda x: x[1], reverse=True)
            sorted_tokens_10 = sorted_tokens[:10]
            for i in range(len(sorted_tokens_10)):
                print("Word {} (\"{}\") appears {} time(s).".format(sorted_tokens_10[i][0], 
                                                             dictionary[sorted_tokens_10[i][0]], 
                                                             sorted_tokens_10[i][1]))
            print('-----')
            print("Your book's most prominant topic is:")
            print(lda_model_final.print_topic(max(lda_vectors, key=lambda item: item[1])[0]))
            print('-----')
            print('Here are your recommendations for "{}":'.format(title))
            display(recommendation[1:11])
        
        else:
            books_checked +=1
            
        if books_checked == len(books_clean):
            book_suggestions = []
            print('Sorry, but it looks like "{}" is not available.'.format(title))
            other_books = []
            for x in range(len(books_clean)):
                other_book = books_clean[x][0]
                book_silimarity = round(similar_books(other_book, title), 2)
                similarity_score = [other_book, book_silimarity]
                book_suggestions.append(similarity_score)

            print('-----')
            ordered_suggestions = sorted(book_suggestions, key=lambda x: x[1], reverse=True)
            print('Were any of the following maybe what you were looking for?')
            print(ordered_suggestions[:10])

Now to check some books! The following list was made up to comprise a large span of different types of books including popular fantasy books like Harry Potter and the Lord of the Rings, to one of the most loved children's books, to James Bond for action and spy novels, to one of the pinacles of historical fiction, to a major work in science finction, and finishing with a work of philosophy that coincedentally has one of our longest articles. Each recommendation will be followed by a short discussion of the results.

Books to check:

  1. Harry Potter and the Philosopher's Stone
  2. The Fellowship of the Ring
  3. Where the Wild Things Are
  4. The Man with the Golden Gun
  5. Adventures of Huckleberry Finn
  6. Critique of Pure Reason

Harry Potter and the Philosopher's Stone

In [740]:
book_recommender("Harry Potter and the Philosopher's Stone")
Your book's most prominent tokens are:
Word 7491 ("harry_potter") appears 36 time(s).
Word 7499 ("rowling") appears 24 time(s).
Word 115 ("stone") appears 22 time(s).
Word 4409 ("wizard") appears 20 time(s).
Word 33742 ("philosophers_stone") appears 20 time(s).
Word 7496 ("potter") appears 18 time(s).
Word 9449 ("hermione") appears 15 time(s).
Word 44378 ("voldemort") appears 15 time(s).
Word 2752 ("philosopher") appears 13 time(s).
Word 42193 ("hogwarts") appears 11 time(s).
-----
Your book's most prominant topic is:
0.011*"poem" + 0.008*"interview" + 0.007*"song" + 0.007*"music" + 0.006*"march" + 0.006*"april" + 0.006*"sell" + 0.006*"retrieve" + 0.006*"october" + 0.006*"november"
-----
Here are your recommendations for "Harry Potter and the Philosopher's Stone":
[['Harry Potter and the Half-Blood Prince', 0.9549741],
 ['Harry Potter and the Chamber of Secrets', 0.9454624],
 ['Harry Potter and the Deathly Hallows', 0.9347644],
 ['Harry Potter and the Goblet of Fire', 0.9308921],
 ['The Magical Worlds of Harry Potter', 0.9173715],
 ['Harry Potter and the Order of the Phoenix', 0.91128325],
 ['Jaws', 0.90166795],
 ['Magic Under Glass', 0.9015246],
 ['The Girl with the Lower Back Tattoo', 0.9001719],
 ['Mad White Giant', 0.899832]]

Any Harry Potter book always come up with other books in the series, and in one case it came up with the other six, but this was a run where what could be considered a Harry Potter topic was made. We're missing the Prisoner of Azkaban. With the bigram harry_potter appearing so much, as well as rowling, hermione, and voldomort, you'd expect that the model would be able to consistently connect them all together, but these results are fine.

Let's take a quick look at the connection between the Philosopher's Stone and the Prisoner of Azkaban. For that, we'll need another function.

In [771]:
def book_comparer(title1, title2):
    books_checked = 0
    for i in range(len(books_clean)):
        similarity = []
        if books_clean[i][0] == title1:
            lda_vectors1 = corpus_lda_model[i]
            sims = index[lda_vectors1]
            for j in range(len(books_clean)):
                if books_clean[j][0] == title2:
                    lda_vectors2 = corpus_lda_model[j]
                    similarity.append(sims[j])
                    print('The similarity between {} and {} is: {}'.format(title1, title2, similarity[0]))

                    print('-----')
                    print("{}'s most prominent tokens are:".format(title1))
                    article_tokens = bow_corpus[i]
                    sorted_tokens = sorted(article_tokens, key=lambda x: x[1], reverse=True)
                    sorted_tokens_10 = sorted_tokens[:10]
                    for i in range(len(sorted_tokens_10)):
                        print("Word {} (\"{}\") appears {} time(s).".format(sorted_tokens_10[i][0], 
                                                                     dictionary[sorted_tokens_10[i][0]], 
                                                                     sorted_tokens_10[i][1]))
                    
                    print('-----')
                    print("{}'s most prominant topic is:")
                    print(lda_model_final.print_topic(max(lda_vectors1, key=lambda item: item[1])[0]))
                    
                    print('-----')
                    print("{}'s most prominent tokens are:".format(title2))
                    article_tokens = bow_corpus[j]
                    sorted_tokens = sorted(article_tokens, key=lambda x: x[1], reverse=True)
                    sorted_tokens_10 = sorted_tokens[:10]
                    for i in range(len(sorted_tokens_10)):
                        print("Word {} (\"{}\") appears {} time(s).".format(sorted_tokens_10[i][0], 
                                                                     dictionary[sorted_tokens_10[i][0]], 
                                                                     sorted_tokens_10[i][1]))
                    
                    print('-----')
                    print("{}'s most prominant topic is:")
                    print(lda_model_final.print_topic(max(lda_vectors2, key=lambda item: item[1])[0]))
In [772]:
book_comparer("Harry Potter and the Philosopher's Stone", "Harry Potter and the Prisoner of Azkaban")
The similarity between Harry Potter and the Philosopher's Stone and Harry Potter and the Prisoner of Azkaban is: 0.8810135722160339
-----
Harry Potter and the Philosopher's Stone's most prominent tokens are:
Word 7491 ("harry_potter") appears 36 time(s).
Word 7499 ("rowling") appears 24 time(s).
Word 115 ("stone") appears 22 time(s).
Word 4409 ("wizard") appears 20 time(s).
Word 33742 ("philosophers_stone") appears 20 time(s).
Word 7496 ("potter") appears 18 time(s).
Word 9449 ("hermione") appears 15 time(s).
Word 44378 ("voldemort") appears 15 time(s).
Word 2752 ("philosopher") appears 13 time(s).
Word 42193 ("hogwarts") appears 11 time(s).
-----
{}'s most prominant topic is:
0.011*"poem" + 0.008*"interview" + 0.007*"song" + 0.007*"music" + 0.006*"march" + 0.006*"april" + 0.006*"sell" + 0.006*"retrieve" + 0.006*"october" + 0.006*"november"
-----
Harry Potter and the Prisoner of Azkaban's most prominent tokens are:
Word 1239 ("black") appears 18 time(s).
Word 7491 ("harry_potter") appears 17 time(s).
Word 44096 ("prisoner_azkaban") appears 17 time(s).
Word 7496 ("potter") appears 15 time(s).
Word 20634 ("lupin") appears 14 time(s).
Word 44095 ("azkaban") appears 14 time(s).
Word 1844 ("prisoner") appears 13 time(s).
Word 9449 ("hermione") appears 9 time(s).
Word 42193 ("hogwarts") appears 6 time(s).
Word 42965 ("pettigrew") appears 6 time(s).
-----
{}'s most prominant topic is:
0.011*"poem" + 0.008*"interview" + 0.007*"song" + 0.007*"music" + 0.006*"march" + 0.006*"april" + 0.006*"sell" + 0.006*"retrieve" + 0.006*"october" + 0.006*"november"

We were 1% off of having all six other Harry Potter books suggested. The disimilarity of the Prisoner of Azkaban is aparent though. The most frequent word is black - for Serius Black - which would strongly correlate to other books, and the bigram harry_potter appears half as much in prisoner as in the Philosopher's Stone. That they both still had the same main topic is promising as well.

The Fellowship of the Ring

In [742]:
book_recommender("The Fellowship of the Ring")
Your book's most prominent tokens are:
Word 809 ("ring") appears 57 time(s).
Word 29628 ("frodo") appears 41 time(s).
Word 34488 ("gandalf") appears 24 time(s).
Word 21835 ("hobbit") appears 17 time(s).
Word 15323 ("lord_rings") appears 13 time(s).
Word 41147 ("bilbo") appears 13 time(s).
Word 38460 ("fellowship_ring") appears 10 time(s).
Word 403 ("lord") appears 9 time(s).
Word 13711 ("tolkien") appears 9 time(s).
Word 22029 ("shire") appears 9 time(s).
-----
Your book's most prominant topic is:
0.017*"king" + 0.009*"fight" + 0.008*"ship" + 0.008*"land" + 0.008*"magic" + 0.008*"battle" + 0.007*"captain" + 0.007*"save" + 0.007*"rescue" + 0.007*"capture"
-----
Here are your recommendations for "The Fellowship of the Ring":
[['The Two Towers', 0.93340415],
 ['The Children of Húrin', 0.92252624],
 ['The Three-Arched Bridge', 0.92227304],
 ['Blood Fever', 0.9110374],
 ['The Sword of Shannara', 0.91094744],
 ['The Return of the King', 0.90531075],
 ['The Silmarillion', 0.90407443],
 ['MedStar I: Battle Surgeons', 0.90183985],
 ['The Black Irix', 0.90026295],
 ['Dragon Sword and Wind Child', 0.900076]]

The Lord of the Rings is another positive yet problematic entry. As with Harry Potter, you get some, but sometimes not them all. For full disclosure, if you do the Return of the King, you just get the two Towers, but the first book of the series gives us not only the the other two, but also the prequil in the Silmarilion, and a book finished by Tolkein's son in the Children of Hurin. Other books seem random, or are of the fantasy genre. Maybe looking into a version of LDA that places higher importance on those tokens at the top of the frequency list would be a worthwhile endeavor. This is certainly the case for the next entry.

Interestingly, the Hobbit is in many ways unrelated to the Lord of the Rings, but often comes up with books by C.S. Lewis like The Lion, The Witch, and the Wardrobe. This is to a degree unsurprising, as Lewis and Tolkein were members of the Inlkings, a writing group at Oxford, and thus were highly influential on one another. Lewis is mentioned three times in the Hobbit's article.

Let's compare The Fellowship of the Ring and The Hobbit. First, here are the recommendations for the Hobbit.

In [835]:
book_recommender("The Hobbit")
Your book's most prominent tokens are:
Word 13711 ("tolkien") appears 75 time(s).
Word 21835 ("hobbit") appears 45 time(s).
Word 41147 ("bilbo") appears 40 time(s).
Word 5668 ("dwarf") appears 20 time(s).
Word 809 ("ring") appears 19 time(s).
Word 371 ("illustration") appears 17 time(s).
Word 1214 ("influence") appears 16 time(s).
Word 15323 ("lord_rings") appears 16 time(s).
Word 2734 ("mountain") appears 14 time(s).
Word 4573 ("goblin") appears 14 time(s).
-----
Your book's most prominant topic is:
0.017*"king" + 0.009*"fight" + 0.008*"ship" + 0.008*"land" + 0.008*"magic" + 0.008*"battle" + 0.007*"captain" + 0.007*"save" + 0.007*"rescue" + 0.007*"capture"
-----
Here are your recommendations for "The Hobbit":
[["The Magician's Nephew", 0.87710434],
 ['The Lion, the Witch and the Wardrobe', 0.87341166],
 ['Staurofila', 0.86331123],
 ['Beren and Lúthien', 0.8629377],
 ["J. R. R. Tolkien's influences", 0.85555565],
 ['Haroun and the Sea of Stories', 0.847633],
 ['The Final Unfinished Voyage of Jack Aubrey', 0.84714293],
 ['Juan Masili: Ang Pinuno ng Tulisan', 0.84470636],
 ['The Patchwork Girl of Oz', 0.8445532],
 ['Santa Esperanza', 0.8444371]]

As stated, the top two recommendations come from CS Lewis' The Chronicle's of Narnia. Beren and Lúthien is a book finished by Christopher Tolkein, and J. R. R. Tolkien's influences makes sense given the focus of the article on Tolkein's influences when writing the Hobbit.

In [778]:
book_comparer("The Fellowship of the Ring", "The Hobbit")
The similarity between The Fellowship of the Ring and The Hobbit is: 0.7991142272949219
-----
The Fellowship of the Ring's most prominent tokens are:
Word 809 ("ring") appears 57 time(s).
Word 29628 ("frodo") appears 41 time(s).
Word 34488 ("gandalf") appears 24 time(s).
Word 21835 ("hobbit") appears 17 time(s).
Word 15323 ("lord_rings") appears 13 time(s).
Word 41147 ("bilbo") appears 13 time(s).
Word 38460 ("fellowship_ring") appears 10 time(s).
Word 403 ("lord") appears 9 time(s).
Word 13711 ("tolkien") appears 9 time(s).
Word 22029 ("shire") appears 9 time(s).
-----
{}'s most prominant topic is:
0.017*"king" + 0.009*"fight" + 0.008*"ship" + 0.008*"land" + 0.008*"magic" + 0.008*"battle" + 0.007*"captain" + 0.007*"save" + 0.007*"rescue" + 0.007*"capture"
-----
The Hobbit's most prominent tokens are:
Word 13711 ("tolkien") appears 75 time(s).
Word 21835 ("hobbit") appears 45 time(s).
Word 41147 ("bilbo") appears 40 time(s).
Word 5668 ("dwarf") appears 20 time(s).
Word 809 ("ring") appears 19 time(s).
Word 371 ("illustration") appears 17 time(s).
Word 1214 ("influence") appears 16 time(s).
Word 15323 ("lord_rings") appears 16 time(s).
Word 2734 ("mountain") appears 14 time(s).
Word 4573 ("goblin") appears 14 time(s).
-----
{}'s most prominant topic is:
0.017*"king" + 0.009*"fight" + 0.008*"ship" + 0.008*"land" + 0.008*"magic" + 0.008*"battle" + 0.007*"captain" + 0.007*"save" + 0.007*"rescue" + 0.007*"capture"

Again we have the same first topic, but the simiarity is 10% lower than our top 10 threshhold. Tolkein appears an astounding 75 times in the text for the Hobbit, as the author's first major work. Further, the topics of the Lord of the Rings and the Hobbit are distinct for LDA in ways that they might not be for us. The Hobbit has goblins and far more dwarves, the Lord of the Rings has orcs, and the felowship itself is an important token.

Where the Wild Things Are

In [743]:
book_recommender("Where the Wild Things Are")
Your book's most prominent tokens are:
Word 218 ("wild") appears 19 time(s).
Word 34697 ("wild_things") appears 17 time(s).
Word 23028 ("sendak") appears 14 time(s).
Word 603 ("picture") appears 7 time(s).
Word 1013 ("opera") appears 7 time(s).
Word 155 ("produce") appears 5 time(s).
Word 23026 ("maurice_sendak") appears 5 time(s).
Word 662 ("childrens") appears 4 time(s).
Word 744 ("horse") appears 4 time(s).
Word 1529 ("performance") appears 4 time(s).
-----
Your book's most prominant topic is:
0.036*"childrens" + 0.024*"vampire" + 0.020*"animal" + 0.017*"tale" + 0.013*"fairy" + 0.011*"weekly" + 0.011*"rabbit" + 0.010*"publishers_weekly" + 0.010*"mouse" + 0.010*"kirkus_reviews"
-----
Here are your recommendations for "Where the Wild Things Are":
[['Yertle the Turtle and Other Stories', 0.9229879],
 ['How to Eat a Small Country', 0.89570254],
 ['The Amazing Adventures of Kavalier  &  Clay', 0.88564926],
 ['The Tale of Mr. Tod', 0.88387096],
 ['From the Mixed-Up Files of Mrs. Basil E. Frankweiler', 0.86507726],
 ['Little Toot', 0.8636187],
 ['Fungus the Bogeyman', 0.8635562],
 ['Daisy-Head Mayzie', 0.8525672],
 ['The Polar Express', 0.852564],
 ['You Deserve a Drink', 0.8513284]]

Famous kids books are hard to predict, as their articles are often full of parts on cultural significance that may not have been entirely reduced. In this model, if we do say Charlotte's Web, then the results are chaotic. With Where the Wild Things Are we get all kids books, and beyond that many famous ones in Yertle the Turtle and The Polar Express.

What we didn't get was another book by the author, Maurice Sendak. Let's compare Where the Wild Things Are with In the Night Kitchen, another great book by him.

In [836]:
book_comparer("Where the Wild Things Are", "In the Night Kitchen")
The similarity between Where the Wild Things Are and In the Night Kitchen is: 0.7176851630210876
-----
Where the Wild Things Are's most prominent tokens are:
Word 218 ("wild") appears 19 time(s).
Word 34697 ("wild_things") appears 17 time(s).
Word 23028 ("sendak") appears 14 time(s).
Word 603 ("picture") appears 7 time(s).
Word 1013 ("opera") appears 7 time(s).
Word 155 ("produce") appears 5 time(s).
Word 23026 ("maurice_sendak") appears 5 time(s).
Word 662 ("childrens") appears 4 time(s).
Word 744 ("horse") appears 4 time(s).
Word 1529 ("performance") appears 4 time(s).
-----
{}'s most prominant topic is:
0.036*"childrens" + 0.024*"vampire" + 0.020*"animal" + 0.017*"tale" + 0.013*"fairy" + 0.011*"weekly" + 0.011*"rabbit" + 0.010*"publishers_weekly" + 0.010*"mouse" + 0.010*"kirkus_reviews"
-----
In the Night Kitchen's most prominent tokens are:
Word 1327 ("kitchen") appears 9 time(s).
Word 13885 ("mickey") appears 7 time(s).
Word 23028 ("sendak") appears 7 time(s).
Word 662 ("childrens") appears 6 time(s).
Word 3482 ("comic") appears 5 time(s).
Word 10078 ("batter") appears 5 time(s).
Word 2682 ("bottle") appears 4 time(s).
Word 6503 ("cake") appears 4 time(s).
Word 10087 ("naked") appears 4 time(s).
Word 11648 ("milk") appears 4 time(s).
-----
{}'s most prominant topic is:
0.006*"drug" + 0.006*"talk" + 0.006*"leaf" + 0.006*"sleep" + 0.005*"realize" + 0.005*"something" + 0.005*"drink" + 0.004*"hear" + 0.004*"farm" + 0.004*"drive"

What we see is that the content of the books is simply too different. They don't even have the same main topic. In the first we have a journey to a mysterious island with friendly monsters, and in the second we again have a journey, but this time to a kitchen where the main character bakes cakes. Sandek appears mutlipe times in each article, but not enough to make up for the overarching difference. This could only be solved if say the articles themselves were tagged by author in a process within the recommendation, but outside of LDA.

The Man with the Golden Gun

In [744]:
book_recommender("The Man with the Golden Gun")
Your book's most prominent tokens are:
Word 9941 ("bond") appears 42 time(s).
Word 13632 ("fleming") appears 23 time(s).
Word 44517 ("golden_gun") appears 19 time(s).
Word 4352 ("golden") appears 15 time(s).
Word 13636 ("james_bond") appears 9 time(s).
Word 11995 ("jamaica") appears 8 time(s).
Word 13635 ("ian_fleming") appears 7 time(s).
Word 1274 ("service") appears 6 time(s).
Word 1686 ("cape") appears 6 time(s).
Word 13628 ("benson") appears 6 time(s).
-----
Your book's most prominant topic is:
0.044*"bond" + 0.037*"japanese" + 0.021*"fleming" + 0.020*"ring" + 0.018*"japan" + 0.017*"miller" + 0.016*"victor" + 0.011*"tolkien" + 0.010*"psychoanalysis" + 0.009*"can_seen"
-----
Here are your recommendations for "The Man with the Golden Gun":
[['Diamonds Are Forever', 0.966241],
 ['Casino Royale', 0.95011246],
 ['You Only Live Twice', 0.925905],
 ['For Special Services', 0.9222841],
 ['Moonraker', 0.9218398],
 ['Colonel Sun', 0.9195647],
 ['Octopussy and The Living Daylights', 0.91717386],
 ['Goldfinger', 0.9168156],
 ['Dr. No', 0.91662884],
 ['From Russia, with Love', 0.9151635]]

The recommender is good at picking out James Bond books. Period. Every single one of those books is another James Bond book. It is rare that it doesn't come up with at least eight others in the top 10, but that's also because Flemming wrote 14, and in total there are over 50. Flemmings novels are thankfully those that are predicted, but when you have such a large portion of the corpus on one character, it's no surprise that it's easy to find other books about him.

Adventures of Huckleberry Finn

In [745]:
book_recommender("Adventures of Huckleberry Finn")
Your book's most prominent tokens are:
Word 4594 ("huck") appears 82 time(s).
Word 4607 ("twain") appears 40 time(s).
Word 4589 ("finn") appears 32 time(s).
Word 4597 ("huckleberry_finn") appears 32 time(s).
Word 4600 ("mark_twain") appears 30 time(s).
Word 4586 ("adventures_huckleberry") appears 26 time(s).
Word 4596 ("huckleberry") appears 25 time(s).
Word 389 ("adventure") appears 21 time(s).
Word 4606 ("tom_sawyer") appears 18 time(s).
Word 4595 ("huck_finn") appears 15 time(s).
-----
Your book's most prominant topic is:
0.018*"political" + 0.013*"government" + 0.010*"social" + 0.009*"united_states" + 0.008*"economic" + 0.007*"country" + 0.007*"nation" + 0.007*"revolution" + 0.006*"america" + 0.006*"public"
-----
Here are your recommendations for "Adventures of Huckleberry Finn":
[['The Adventures of Tom Sawyer', 0.95865864],
 ['The Strange Death of Tory England', 0.9450343],
 ['Annie Besant', 0.9399871],
 ['Welcome to Obamaland', 0.929611],
 ['It Takes a Family', 0.9199818],
 ['Hello, Bastar', 0.9175509],
 ['The In-Between World of Vikram Lall', 0.91620666],
 ["A Mother's Ordeal", 0.91603124],
 ['Such a Long Journey', 0.9130387],
 ['Freefall: America, Free Markets, and the Sinking of the World Economy',
  0.9119878]]

Adventures of Huckleberry Finn would be a very hard book to make recommendations for based on token content. There're so many topics in play. Getting Tom Sawyer is necessary, and that it's so high is a plus. The natural suggestion would be other classics of US-American literature, but that would require words that describe that to be more weighted than others. What we get instead is solcio-political commentary about the US, which makes sense, but getting major works like The Grapes of Wrath would be better.

Critique of Pure Reason

In [748]:
book_recommender("Critique of Pure Reason")
Your book's most prominent tokens are:
Word 6181 ("kant") appears 160 time(s).
Word 26128 ("transcendental") appears 86 time(s).
Word 7751 ("pure") appears 83 time(s).
Word 941 ("reason") appears 78 time(s).
Word 1859 ("concept") appears 69 time(s).
Word 32991 ("priori") appears 63 time(s).
Word 1174 ("knowledge") appears 59 time(s).
Word 1746 ("object") appears 56 time(s).
Word 1704 ("space") appears 45 time(s).
Word 2140 ("category") appears 43 time(s).
-----
Your book's most prominant topic is:
0.014*"theory" + 0.011*"argue" + 0.009*"social" + 0.009*"nature" + 0.008*"religion" + 0.008*"understand" + 0.007*"individual" + 0.007*"philosophy" + 0.007*"belief" + 0.006*"question"
-----
Here are your recommendations for "Critique of Pure Reason":
[['Critique of Practical Reason', 0.9930755],
 ['Prolegomena to Any Future Metaphysics', 0.9921301],
 ['Critique of Judgment', 0.9872812],
 ["Introduction to Kant's Anthropology", 0.9872484],
 ['The Foundations of Arithmetic', 0.9793486],
 ['God: The Failed Hypothesis', 0.9754984],
 ['Boundaries of the Mind', 0.97432137],
 ['The Phenomenology of Spirit', 0.9741832],
 ['The Bounds of Sense', 0.9736892],
 ['Difference and Repetition', 0.9733588]]

Critique of Pure Reason and all other major works of philosophy fare well with the analysis. This is also likely because there are so many "acceptable suggestions." The same can be said for conceptual science books.

With that being said, and to our credit, Critique of Pure Reason is the first of a trilogy, with the other two being found in the top three, and the top four all being by or about Kant. Further, the Ninth book is about Critique of Pure Reason, and the eigth book being Hegel's master work is another plus due to the influence of Kant on Hegel.

Overall:

  • The model does a fairly good job of suggesting books in a series to others in the same series
  • It does an ok good job at suggesting books by the same author that are not in a series
  • It does a good job of making general genre suggestions
  • The above suggestions though are not related to the cultural significance of the book, so relatively unkown books are still easily recommended
  • It does a bad job on books that have significance beyond their base story
  • It's hard to judge efficiency of books with short vs. long articles, as we frankly wouldn't know what the ones with short articles should be recommended in the first place. What we do note though is that such books are recommended to those with long articles where they are seen as not belonging

As a side note to one wanting to try this - pay very very careful attention to the final step of the cleaning. Dropping words that are in too many articles has a dramatic effect on the results based on the chosen threshhold. Ours is set at 15%, and both 10% and 20% had dramatically worse results - say only two Harry Potter Books, and almost random results for Where the Wild Things Are.

Graphically Representing Topics

In [842]:
import pyLDAvis.gensim

First and foremost, and heartfelt thanks to pyLDAvis for such an amazing package to allow us to really show the data and let it be manipulated to boot. The following graph is composed of two pannels:

The left panel, labeld Intertopic Distance Map: circles represent different topics, with the distance corresponding to their relative similarities. The relative frequency of the topic in the corpus is further represented by he size of a topic's circle. An individual topic may be selected for closer scrutiny by clicking on its circle, or entering its number in the "selected topic" box in the upper-left.

The right panel, the bar chart of the top 30 terms: when no topic is selected in the plot on the left, the bar chart shows the top-30 most saliant (footer 1) terms in the corpus. By saliant we mean not simply the frequency of said term in the corpus, but further its ability to distinguish between different topics. By selecting each topic on the left, one modifies the bar chart to show the relevant (footer 2) terms for the selected topic. Relevence can be tuned by parameter λ. A smaller λ gives higher weight to the term's distinctiveness, while a larger λ s corresponds to probablity of the term occurance per topics.

Further, the axes of the graph are two principle coordinates used in the multidimmensional scaling.

In [893]:
vis = pyLDAvis.gensim.prepare(lda_model_final, bow_corpus, dictionary)
pyLDAvis.display(vis)
Out[893]: