a window onto the Sanskrit pramāṇa-NLP corpus
Vātāyana (meaning “window” in Sanskrit) is a tool for exploring intertextual relations within a corpus. It was developed as part of a dissertation on Sanskrit philosophy (see here), but the algorithm is language-independent and should be able to accommodate any tokenized natural language corpus on which topic modeling can be performed.
The current Pramāṇa NLP corpus featured here consists of approximately 28,000 paragraph-sized passages, or “documents”, drawn from about 50 texts and totaling more than 2 million words.
Vātāyana illuminates intertextual connections between these corpus documents by performing three kinds of pairwise comparison. Comparison by modeled LDA topics is fast and captures abstract semantics, whereas TF-IDF and Smith-Waterman emphasize word-level correspondences at relatively greater computational cost. Staggering these methods within a composite similarity search algorithm produces results that are fast, multifaceted, and accurate. A performance assessment of the system in the above-mentioned dissertation (see §7) found that Vātāyana correctly identifies 80% or more of the parallel passages a careful Sanskrit scholar can find. It thereby provides an excellent starting point for intertextual reading and research.
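To make the idea of staggered comparison concrete, here is a minimal Python sketch of a three-stage search. It is an illustration under stated assumptions, not Vātāyana's actual code: the function names (cosine, tfidf_vectors, smith_waterman, staged_search), the stage order (topics, then TF-IDF, then Smith-Waterman), the cutoffs, the scoring weights, and the use of cosine similarity for topic vectors are all hypothetical. The topic vectors are written by hand rather than inferred by LDA, and the toy documents are English stand-ins for tokenized Sanskrit.

```python
import math
from collections import Counter


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(k, 0.0) for k, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def tfidf_vectors(docs):
    """Build one TF-IDF dict per tokenized document (smoothed IDF)."""
    n = len(docs)
    df = Counter(tok for doc in docs for tok in set(doc))
    return [
        {tok: (count / len(doc)) * (math.log((1 + n) / (1 + df[tok])) + 1)
         for tok, count in Counter(doc).items()}
        for doc in docs
    ]


def smith_waterman(a, b, match=2, mismatch=-1, gap=-1):
    """Best local alignment score between two token lists."""
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # Local alignment: scores never drop below zero.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


def staged_search(query_idx, docs, topic_vecs, keep_topic=2, keep_tfidf=2):
    """Cheap topic filter, then TF-IDF filter, then Smith-Waterman rerank."""
    topics = [dict(enumerate(vec)) for vec in topic_vecs]
    candidates = [i for i in range(len(docs)) if i != query_idx]

    # Stage 1 (cheapest): keep the documents closest in topic space.
    candidates = sorted(
        candidates,
        key=lambda i: cosine(topics[query_idx], topics[i]),
        reverse=True,
    )[:keep_topic]

    # Stage 2: keep the survivors with the most similar TF-IDF profile.
    tfidf = tfidf_vectors(docs)
    candidates = sorted(
        candidates,
        key=lambda i: cosine(tfidf[query_idx], tfidf[i]),
        reverse=True,
    )[:keep_tfidf]

    # Stage 3 (most expensive): align only the few remaining documents.
    scored = [(i, smith_waterman(docs[query_idx], docs[i])) for i in candidates]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy corpus (hypothetical data).
docs = [
    "the pot is impermanent because it is produced".split(),
    "sound is impermanent because it is produced".split(),
    "the pot is blue".split(),
    "perception is free of conceptual construction".split(),
]
topic_vecs = [[0.9, 0.1], [0.8, 0.2], [0.6, 0.4], [0.1, 0.9]]

print(staged_search(0, docs, topic_vecs))  # [(1, 12), (2, 6)]
```

In this sketch the quadratic-time alignment runs only on the handful of documents that survive the two cheaper filters, which is the general reason a staged design can be both fast and sensitive to word-level parallels.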
app version: 1.2.3
code and documentation on GitHub
I'd love to hear from you. Message me on GitHub or email me (Gmail) at tyler.g.neill.
You can read more about this and other projects at tylerneill.info/projects.