Here are some of my takeaways from the chapter Mathematical Foundations Behind Latent Semantic Analysis, written by Dian I. Martin and Michael W. Berry in the book Handbook of Latent Semantic Analysis.

Latent Semantic Analysis (LSA) is a vector space model (VSM) that handles text retrieval from a large information database where the text is heterogeneous and the vocabulary varies. A key advantage is its ability to find similarities between items even when they do not directly share terms. For instance, LSA can determine that two documents are conceptually related even if they have no words in common.

A summary of how to implement LSA:

Suppose you have a collection of documents and a set of terms (types). You may want to figure out how the documents, or terms, or even terms and documents are associated with each other.

  • First, construct a type-by-document matrix from the frequency of each type in each document.
  • Apply a weighting to the matrix to amplify important types and diminish unimportant ones. A common weighting function is log-entropy.
  • Next, transform the weighted matrix into a type-and-document vector space with orthogonal decompositions. The most popular method for this is the singular value decomposition (SVD), and the Lanczos algorithm provides an efficient method for performing SVD.
  • With the vector space built, we can do the analysis by calculating the similarity between type or document vectors (which must be scaled by the singular values before the calculation) in the vector space. When a type (or a query of types) is compared with a document, first derive the corresponding pseudo-document vector for the type(s).
  • Additional types or documents can be added into the existing vector space as needed by “folding-in”, and one can also ignore some types or documents by “folding-out”.

Step 1: Create the Input Matrix

First, construct an \(m \times n\) type-by-document matrix \(\textbf{A}\), where each element \(a_{ij}\) is the frequency of the ith type in the jth document. Then apply a weighting function to each nonzero \(a_{ij}\) so that a high-frequency type occurring in many documents gets a low weight, while a type occurring in fewer documents gets a high weight. Why? The types with higher weights are the ones that best distinguish particular documents.

In general, one can have

\[a_{ij} = local(i, j) \cdot global(i)\]

where

  • \(local(i, j)\) is a local weighting function on each type and document, reflecting how frequently type \(i\) occurs within document \(j\). Some options for the local weighting:
    • Binary frequency. \(local(i, j) = 1\) if type \(i\) occurred in document \(j\), otherwise \(0\).
    • Log function. \(local(i, j) = \log(f_{ij}+1)\) where \(f_{ij}\) is the (raw) count of type \(i\) in document \(j\).
  • \(global(i)\) is a global weighting function on each type, reflecting how frequently type \(i\) occurs across the whole collection. A global weighting function can be any function that maps the raw counts into \([0,1]\) and assigns lower weight to types occurring in more documents. Some options for the global weighting function:
    • Inverse Document Frequency (IDF).
    \[global(i)=\log\frac{n}{n_i}\]

    where \(n_i\) is the number of documents in which type \(i\) occurs.

    • Entropy.
    \[global(i)=1+\sum_{j=1}^n \frac{p_{ij}\log_2 p_{ij}}{\log_2 n}\]

    where

    \[p_{ij} = \frac{f_{ij}}{f_i}\]

    and \(f_i\) is the total number of times type \(i\) appears in the entire collection of \(n\) documents (terms with \(p_{ij}=0\) contribute \(0\) to the sum).

According to the authors, the most popular choice is the log-entropy weighting, i.e. the log local weight combined with the entropy global weight:

\[a_{ij} = \log(f_{ij}+1) \cdot \left(1+\sum_{k=1}^n \frac{p_{ik}\log_2 p_{ik}}{\log_2 n}\right)\]

This completes the input matrix. Note that a type-by-document matrix is typically very sparse, since most types occur in only a few documents.
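To make Step 1 concrete, here is a minimal numpy sketch (the toy documents and all variable names are my own, not the chapter's) that builds a raw type-by-document matrix and applies the log-entropy weighting above:

```python
import numpy as np

# Toy example documents, purely for illustration.
docs = [
    "human machine interface for computer applications",
    "a survey of user opinion of computer system response time",
    "relation of user perceived response time to error measurement",
]

# vocabulary (types) and raw frequency matrix F (m types x n documents)
types = sorted({w for d in docs for w in d.split()})
F = np.array([[d.split().count(t) for d in docs] for t in types], dtype=float)
m, n = F.shape

# local weight: log(f_ij + 1)
local = np.log(F + 1.0)

# global weight (entropy): 1 + sum_j p_ij log2(p_ij) / log2(n), with 0 * log(0) taken as 0
f_i = F.sum(axis=1, keepdims=True)                      # total count of each type
p = np.divide(F, f_i, out=np.zeros_like(F), where=f_i > 0)
with np.errstate(divide="ignore", invalid="ignore"):
    plogp = np.where(p > 0, p * np.log2(p), 0.0)
global_w = 1.0 + plogp.sum(axis=1) / np.log2(n)         # 1 = concentrated in one doc, 0 = evenly spread

A = local * global_w[:, None]                           # weighted type-by-document matrix (sparse in practice)
```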

Step 2: Build the Vector Space

After creating the input matrix \(\textbf{A}\), it is transformed into a type and document vector space using an orthogonal decomposition (orthogonal matrices). Orthogonal decompositions allow the resulting vectors to be truncated while preserving important properties of the \(m \times n\) type-by-document matrix, such as the norms (vector lengths) of, and distances between, its row and column vectors. Specifically, orthogonal matrix decompositions preserve the 2-norm and the Frobenius norm of \(\textbf{A}\).

To be clear, we want to decompose \(\textbf{A}\) into orthogonal matrices, because an orthogonal matrix is perfect for building a vector space:

  • Its column vectors are orthonormal: they are pairwise orthogonal (at 90-degree angles to each other) and each has norm 1.
  • The column vectors \([q_1,q_2,\dots,q_n]\) form an orthonormal basis for a vector space: they are linearly independent, they span the space, and every vector in the space can be written as a linear combination of them.

Here are some methods to do the decomposition:

  • QR factorization.
  • ULV low-rank orthogonal decomposition.
  • Semi-discrete decomposition (SDD).
  • Singular value decomposition (SVD). Here we focus on the SVD approach, as it is the most popular one in this context.

Given the rank \(r\) of \(\textbf{A}\) (the number of linearly independent rows or columns of the matrix, i.e. the dimension of the subspace spanned by its column vectors), the SVD of \(\textbf{A}\) is given by \(A = U\Sigma V^T\), where

  • \(U\) is an orthogonal matrix whose first \(r\) columns are the \(r\) orthonormal eigenvectors associated with the \(r\) nonzero eigenvalues of \(AA^T\). Its columns are the left singular vectors; its rows are called type vectors.
  • \(V\) is also an orthogonal matrix whose first \(r\) columns are the \(r\) orthonormal eigenvectors associated with the \(r\) nonzero eigenvalues of \(A^T A\). Its columns are the right singular vectors; its rows are called document vectors.
  • \(\Sigma = diag(\sigma_1, \sigma_2, ..., \sigma_n)\) is a diagonal matrix whose first \(r\) diagonal entries are the nonnegative square roots of the \(r\) nonzero eigenvalues of \(AA^T\) (equivalently, of \(A^T A\)); these nonzero diagonal entries are the singular values.

The best rank-\(k\) approximation \(A_k = U_k\Sigma_k V_k^T\) of \(\textbf{A}\) is obtained by setting to zero (or simply dropping) all but the first \(k\) elements of the type vectors (the first \(k\) columns of \(U\)), all but the first \(k\) singular values in \(\Sigma\), and all but the first \(k\) elements of the document vectors (the first \(k\) columns of \(V\)). This reduces the dimension from \(r\) to \(k\), which removes “noise” from the database.

The derivation of the decomposition is indeed an eigenproblem.

  • If \(m \ge n\), first find the \(k\) largest eigenvalues and eigenvectors of the \(n \times n\) matrix \(A^T A\); the eigenvectors give the \(k\) document vectors, and the singular values are the square roots of the eigenvalues (a small sketch of this case follows the list). Then compute
\[U_k = AV_k\Sigma_k^{-1}\]
  • If \(m < n\), first find the \(k\) largest eigenvalues and eigenvectors of the \(m \times m\) matrix \(AA^T\); the eigenvectors give the \(k\) type vectors, and the singular values are the square roots of the eigenvalues. Then compute
\[V_k = A^T U_k\Sigma_k^{-1}\]
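As a small illustration of the \(m \ge n\) case, here is a numpy sketch (random placeholder matrix; all names are mine, not the chapter's) that solves the symmetric eigenproblem on the smaller matrix \(A^T A\) and recovers \(U_k\) from \(V_k\) and \(\Sigma_k\):

```python
import numpy as np

# Solve the eigenproblem on the n x n matrix A^T A, take square roots of the
# k largest eigenvalues as singular values, and recover U_k = A V_k Sigma_k^{-1}.

rng = np.random.default_rng(0)
A = rng.random((8, 5))                       # toy weighted type-by-document matrix, m=8 >= n=5
k = 2

eigvals, eigvecs = np.linalg.eigh(A.T @ A)   # eigh returns eigenvalues in ascending order
top = np.argsort(eigvals)[::-1][:k]          # indices of the k largest eigenvalues
Vk = eigvecs[:, top]                         # k document vectors (as columns)
sigma = np.sqrt(eigvals[top])                # k largest singular values
Uk = A @ Vk / sigma                          # U_k = A V_k Sigma_k^{-1} (column-wise division)

# sanity check against the library SVD (singular values must agree)
_, s, _ = np.linalg.svd(A, full_matrices=False)
assert np.allclose(sigma, s[:k])
```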

The Lanczos algorithm is proven to be accurate and efficient for large, sparse symmetric eigenproblems where only a modest number of the largest or smallest eigenvalues of a matrix are desired.

Now we have the \(k\)-dimensional vector space model. Every type is represented by a type vector (a row of \(U_k\)) and every document by a document vector (a row of \(V_k\)) in this space.
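To sketch Step 2 in practice, one option (my choice of tool, not prescribed by the chapter) is scipy's sparse `svds`, whose default ARPACK solver is a Lanczos-type iterative method that computes only the \(k\) largest singular triplets of a large sparse matrix:

```python
import numpy as np
from scipy.sparse import random as sparse_random
from scipy.sparse.linalg import svds

# Random placeholder data standing in for a weighted type-by-document matrix.
A = sparse_random(10_000, 2_000, density=0.01, format="csr", random_state=0)
k = 100

Uk, sk, Vkt = svds(A, k=k)                   # k largest singular triplets only
order = np.argsort(sk)[::-1]                 # svds does not guarantee descending order
Uk, sk, Vkt = Uk[:, order], sk[order], Vkt[order, :]

type_vectors = Uk        # rows are the k-dimensional type vectors   (shape m x k)
doc_vectors = Vkt.T      # rows are the k-dimensional document vectors (shape n x k)
```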

Step 3: Do the Analysis

In general, comparisons between documents and types are based on computing the similarity between the corresponding vectors.

Derivation of Comparison

For example, suppose we measure similarity with the dot product. All the type-to-type dot products are given by:

\[A_k A_k^T = (U_k\Sigma_k V_k^T)(U_k\Sigma_k V_k^T)^T = U_k\Sigma_k V_k^T V_k \Sigma_k U_k^T = U_k\Sigma_k\Sigma_k U_k^T = (U_k\Sigma_k)(U_k\Sigma_k)^T\]

since \(V_k^T V_k = I\). Each entry is thus the dot product of two rows of \(U_k\Sigma_k\), i.e. of two type vectors scaled by the singular values. Similarly, all the document-to-document dot products are given by:

\[A_k^T A_k = (U_k\Sigma_k V_k^T)^T(U_k\Sigma_k V_k^T) = V_k\Sigma_k U_k^T U_k \Sigma_k V_k^T = V_k\Sigma_k\Sigma_k V_k^T = (V_k\Sigma_k)(V_k\Sigma_k)^T\]

since \(U_k^T U_k = I\); these are the dot products between rows of \(V_k\Sigma_k\), i.e. between document vectors scaled by the singular values. Therefore, whenever we compute the similarity between two vectors in this model, the vectors should first be scaled by the singular values.

Calculate the Similarity

Types are compared by computing the similarity between their singular-value-scaled type vectors, and documents are compared by computing the similarity between their singular-value-scaled document vectors. The similarity measure can be the dot product, cosine similarity, Minkowski distance, or any other measure.
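Here is a small numpy sketch (toy matrix and names are my own) of type-to-type and document-to-document comparisons with singular-value-scaled vectors, including a check that the scaled dot products reproduce \(A_k A_k^T\) and \(A_k^T A_k\) as derived above:

```python
import numpy as np

# Toy factors for illustration only; in practice Uk, Sk, Vk come from Step 2.
rng = np.random.default_rng(1)
A = rng.random((8, 5))
k = 2
U, s, Vt = np.linalg.svd(A, full_matrices=False)
Uk, Sk, Vk = U[:, :k], np.diag(s[:k]), Vt[:k, :].T

scaled_types = Uk @ Sk       # rows: type vectors scaled by the singular values
scaled_docs = Vk @ Sk        # rows: document vectors scaled by the singular values

def cosine(x, y):
    return x @ y / (np.linalg.norm(x) * np.linalg.norm(y))

sim_types_0_1 = cosine(scaled_types[0], scaled_types[1])   # type-to-type comparison
sim_docs_0_1 = cosine(scaled_docs[0], scaled_docs[1])      # document-to-document comparison

# the scaled dot products reproduce A_k A_k^T and A_k^T A_k
Ak = Uk @ Sk @ Vk.T
assert np.allclose(scaled_types @ scaled_types.T, Ak @ Ak.T)
assert np.allclose(scaled_docs @ scaled_docs.T, Ak.T @ Ak)
```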

But how do we compare a type (or a query of types) with a document? Suppose we have a query vector \(q\) whose entries are zeros and weighted type frequencies for the types specified in the query. Based on the query, we can generate a pseudo-document vector:

\[q^T U_k \Sigma_k^{-1}\]

which is a projection (mapping) of the query into the document vector space. To also take specified relevant documents into account, we can add an additional component:

\[q^T U_k \Sigma_k^{-1} + d^T V_k\]

where \(d^T\) is the vector whose elements specify which documents to add to the query. After generating the corresponding pseudo-document vector for the query, we can compare it with some other document as if comparing two documents.
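A minimal sketch of querying (toy data; my own names). Following the document-to-document rule above, the pseudo-document vector is treated like any other document vector and scaled by the singular values before the comparison:

```python
import numpy as np

# Toy factors for illustration; in practice they come from the truncated SVD of A.
rng = np.random.default_rng(2)
m, n, k = 8, 5, 2
A = rng.random((m, n))
U, s, Vt = np.linalg.svd(A, full_matrices=False)
Uk, sk, Vk = U[:, :k], s[:k], Vt[:k, :].T

q = np.zeros(m)
q[[1, 4]] = 1.0                         # weighted frequencies of the query's types

pseudo_doc = q @ Uk / sk                # q^T U_k Sigma_k^{-1}: the query mapped into the space

def cosine(x, y):
    return x @ y / (np.linalg.norm(x) * np.linalg.norm(y))

# compare like two documents: scale both sides by the singular values
scaled_query = pseudo_doc * sk
scaled_docs = Vk * sk                   # rows of V_k Sigma_k
scores = np.array([cosine(scaled_query, d) for d in scaled_docs])
ranking = np.argsort(scores)[::-1]      # documents ordered by similarity to the query
```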

Update to the Vector Space

We can add new types and documents to the reduced-rank type-document vector space by “folding-in”. A new document \(d\), whose nonzero elements are the weighted frequencies of the types it contains, is folded into the vector space by projecting \(d\) onto the span of the current type vectors:

\[d_{new} = d^TU_k\Sigma_k^{-1}\]

Similarly, a new type \(t\), whose nonzero elements are its weighted frequencies across the existing documents, is folded into the vector space by projecting \(t\) onto the span of the current document vectors:

\[t_{new} = t V_k\Sigma_k^{-1}\]

We can also ignore some existing types or documents by “folding-out”: simply treat the unwanted types or documents as if they were absent from the vector space, so they are no longer used in any comparison.
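A small sketch of folding-in and folding-out on toy data (the projection formulas are from the text; everything else is my own illustration):

```python
import numpy as np

# Toy factors for illustration; in practice Uk, sk, Vk come from the truncated SVD.
rng = np.random.default_rng(3)
m, n, k = 8, 5, 2
A = rng.random((m, n))
U, s, Vt = np.linalg.svd(A, full_matrices=False)
Uk, sk, Vk = U[:, :k], s[:k], Vt[:k, :].T   # rows of Uk: type vectors; rows of Vk: document vectors

# folding-in a new (weighted) document column d: d_new = d^T U_k Sigma_k^{-1}
d = rng.random(m)
d_new = d @ Uk / sk
doc_vectors = np.vstack([Vk, d_new])        # append to the existing document vectors

# folding-in a new (weighted) type row t: t_new = t V_k Sigma_k^{-1}
t = rng.random(n)
t_new = t @ Vk / sk
type_vectors = np.vstack([Uk, t_new])       # append to the existing type vectors

# folding-out: drop the rows you want to ignore in future comparisons
keep = np.ones(len(doc_vectors), dtype=bool)
keep[0] = False                             # e.g. ignore the first document
doc_vectors = doc_vectors[keep]
```

(Folding-in only projects new items into the existing space; the SVD factors themselves are not recomputed.)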

References

Martin, D. I., & Berry, M. W. (2007). Mathematical foundations behind latent semantic analysis. Handbook of latent semantic analysis, 35-55.

Golub, G. H., & Van Loan, C. F. (1989). Matrix computations (2nd ed.). Johns Hopkins University Press.