The pdf-correlator is a program that analyses a given collection of local pdf files and provides a means to visualise their similarity to one another. Additionally, it can compute a similarity score between the entire collection and a single incoming (new) pdf.
While writing my academic paper, I became interested in the idea of confirmation bias, whereby the researcher accumulates knowledge that confirms either what they already know or what they want to prove. I considered that one way to combat this could be a computer program that looks at your current collection of pdfs (like your Zotero library) and then at one or more new incoming pdfs. From these two data sources, the program could then infer the similarity between the entire current library and the new pdfs.
If the researcher then found that the new pdf was too similar to their current library, they could be prompted to search more widely (or get annoyed and close the program).
After some research I found the Gensim doc2vec machine learning model, which was specifically designed to analyse ‘documents’ rather than single words (as the word2vec model does, and from which doc2vec stems). A ‘document’ in the language of this model is all the text that resides on a single line of a .txt file. The model derives patterns from the text it is given and turns each ‘document’ into a vector, so that relationships between ‘documents’ can be inferred by comparing the distances and angles between their vectors in vector space.
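To make the idea of comparing angles between document vectors concrete, here is a minimal sketch in plain Python. It is not taken from the pdf-correlator script: the function names `cosine_similarity` and `mean_similarity` are hypothetical, the three-dimensional vectors are toy stand-ins for real doc2vec vectors, and averaging the similarities is only one assumed way of scoring a new document against a whole library.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (1.0 = same direction)."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def mean_similarity(new_vector, library_vectors):
    """Average similarity of one vector against a collection.

    Assumption: a plain mean is used here purely for illustration.
    """
    if not library_vectors:
        raise ValueError("library is empty")
    total = sum(cosine_similarity(new_vector, v) for v in library_vectors)
    return total / len(library_vectors)


# Toy vectors standing in for vectorized 'documents' (hypothetical data)
library = [[1.0, 0.0, 0.0], [0.8, 0.6, 0.0]]
new_doc = [1.0, 0.0, 0.0]
score = mean_similarity(new_doc, library)
```

A score close to 1 would suggest that the new document points in much the same direction as the existing library, which is the situation the researcher might want to be warned about.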
To implement my idea I had to go through these major steps:
glob.glob with recursivity, allowing searches through all subdirectories as well.
model.save_word2vec_format method learned from this jupyter notebook tutorial.
infer_vector method. See the very last code block in the pdf-correlator.script
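The recursive file search mentioned in the steps above can be sketched with the standard library alone. This is an illustration under stated assumptions, not the code from the pdf-correlator script: the helper name `find_pdfs` is hypothetical, and the pattern assumes that files end in a lowercase `.pdf` extension.

```python
import glob
import os


def find_pdfs(root_dir):
    """Return sorted paths of all .pdf files under root_dir, including subdirectories."""
    # "**" combined with recursive=True matches zero or more nested directories,
    # so files directly inside root_dir are found as well.
    pattern = os.path.join(root_dir, "**", "*.pdf")
    return sorted(glob.glob(pattern, recursive=True))
```

Calling `find_pdfs` on a library folder would return every pdf path beneath it, ready to be converted to text. On case-sensitive file systems, files named with an uppercase `.PDF` extension would not match this pattern and would need a second pattern.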