Exaptive uses sophisticated technology to make discoveries easier, especially when researchers are looking through data contained within millions of documents. We use machine learning to facilitate the exploration of data that would otherwise be too vast to support valuable insights.
Machine learning allows a model to improve over time as it is given new training data, without requiring more human effort. For example, a common text-classification benchmark task is to train a model on messages from multiple discussion board threads and then use it to predict the topic of a new message, whether that is space, computers, religion, or anything else. Besides classifying new texts, machine learning approaches can also attempt to identify authors or find similar documents. The ability to identify similar documents can lead to a recommender system for new content that a user might find interesting.
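As a rough illustration of the similar-documents idea, the sketch below scores documents by cosine similarity over raw word counts and returns the closest match. The toy corpus, the function names, and the choice of cosine similarity are all assumptions made for this example; nothing above says which similarity measure a production recommender would use.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two word-count dictionaries."""
    shared = a.keys() & b.keys()
    numerator = sum(a[t] * b[t] for t in shared)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return numerator / (norm_a * norm_b) if norm_a and norm_b else 0.0


def most_similar(query, corpus):
    """Return the index of the corpus document closest to the query."""
    q = Counter(query.lower().split())
    scores = [cosine(q, Counter(doc.lower().split())) for doc in corpus]
    return max(range(len(corpus)), key=scores.__getitem__)


# Hypothetical messages, one per topic (space, computers, religion).
corpus = [
    "the shuttle reached orbit around the moon",
    "the new graphics card needs a driver update",
    "the sermon discussed faith and scripture",
]
best = most_similar("which driver works with my graphics card", corpus)
print(corpus[best])
```

A recommender built along these lines would rank every document by its score instead of keeping only the top one.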
Machine learning models are commonly expected to be “black boxes,” in the sense that a user wants to put data in and get answers out without having to know the details of how this is achieved. However, users usually also want to understand the resulting model and why a recommendation is given. They may also want to understand a collection of texts, such as search results, where a summary would be more useful than a 100-page list of a thousand ranked results. In this use case, we build a data landscape: a visualization of the documents that conveys their similarities as well as their relationships to the key terms that were identified when learning the model.
In one application of data landscape technology, Exaptive processed over 100 million documents that had been machine-read from scans via optical character recognition (OCR). Some of the documents were hundreds of years old. As a feature engineering step, we recorded counts for roughly 200,000 words and then estimated the importance of each word to each document. This measure is known as term frequency-inverse document frequency, or TF-IDF. Singular value decomposition (SVD) was then used to find high-level concepts, each of which is defined by many words. At that point in the process, documents are described by high-level concepts that align with areas such as medicine, economics, religion, and politics.
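To make the weighting step concrete, here is a minimal TF-IDF sketch. It assumes one common variant (raw term count multiplied by log(N / document frequency)); the exact formula used in the project is not stated, and the three example documents are invented.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: weight} dictionary per document."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)

    # Document frequency: in how many documents each term appears.
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))

    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        # Assumed variant: raw count * log(N / df).
        weights.append({t: c * math.log(n_docs / df[t]) for t, c in counts.items()})
    return weights


docs = [
    "the patient received the vaccine",
    "the market price of grain rose",
    "the vaccine trial enrolled each patient",
]
weights = tf_idf(docs)
for doc_weights in weights:
    print(sorted(doc_weights.items(), key=lambda item: -item[1])[:3])
```

A word that appears in every document, such as "the" here, gets a weight of zero, while a word confined to one document is weighted most heavily. That is the sense in which the measure estimates a word's importance to a document.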
The concepts that are learned are data-dependent. If only medical documents are used, then the model’s capacity will be spent on identifying more finely detailed categories. We then clustered the documents in that topic space to find which documents are similar. The silhouette coefficient allowed us to automatically select a good number of clusters. Next, we projected the documents down to a two-dimensional scatterplot using a combination of SVD and multidimensional scaling. Based on the density of the documents, we fit a contour map, which looks like a topographic map. Color varies across the contour map according to the cluster assignment of the documents in that area. Finally, we solved for landmarks, which correspond to the x-y locations of the key driver terms for each cluster.
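The cluster-count selection can be sketched as follows: cluster the data for several candidate values of k, compute the mean silhouette coefficient for each, and keep the k with the highest score. The clustering algorithm is not named above, so this sketch assumes plain k-means with a deterministic farthest-point seeding; the 2-D toy points and all function names are likewise illustrative.

```python
import math


def kmeans(points, k, iters=50):
    """Lloyd's k-means with deterministic farthest-point seeding (an assumption)."""
    centers = [points[0]]
    while len(centers) < k:
        # Next seed: the point farthest from its nearest existing seed.
        centers.append(max(points, key=lambda p: min(math.dist(p, c) for c in centers)))
    labels = []
    for _ in range(iters):
        labels = [min(range(k), key=lambda j: math.dist(p, centers[j])) for p in points]
        updated = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            # Keep the old center if a cluster ever ends up empty.
            updated.append(tuple(sum(axis) / len(members) for axis in zip(*members))
                           if members else centers[j])
        if updated == centers:
            break
        centers = updated
    return labels


def mean_silhouette(points, labels):
    """Average of (b - a) / max(a, b) over all points."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    total = 0.0
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1:
            continue  # a singleton cluster contributes a score of 0
        # a: mean distance to the other members of the same cluster.
        a = sum(math.dist(p, q) for q in own) / (len(own) - 1)
        # b: mean distance to the members of the nearest other cluster.
        b = min(sum(math.dist(p, q) for q in other) / len(other)
                for key, other in clusters.items() if key != lab)
        total += (b - a) / max(a, b)
    return total / len(points)


# Three well-separated toy groups standing in for documents in topic space.
points = [(0, 0), (1, 0), (0, 1), (1, 1),
          (10, 0), (11, 0), (10, 1), (11, 1),
          (5, 9), (6, 9), (5, 10), (6, 10)]
scores = {k: mean_silhouette(points, kmeans(points, k)) for k in range(2, 6)}
best_k = max(scores, key=scores.get)
print(best_k, scores)
```

Merging two groups (k = 2) or splitting one (k = 4 or 5) lowers the score, so the maximum lands on the number of groups actually present in the toy data.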
Using these same concepts, the Exaptive team designed the PubMed® Explorer. PubMed Explorer makes it easy to search PubMed’s extensive collection of papers. One of the visualizations provided is a term landscape, which is similar to the key term landmarks in the previously described data landscape. Here the positions are found more directly, by projecting TF-IDF values straight to 2-D. For a collection of search results, the user may then view a two-dimensional landscape where related terms are grouped together spatially. Depending on how this projection is performed, it is easy to obtain either the document locations or the term locations. This allows us to give the user options to create the same visualization for articles or journals instead of topics. As with the previously described visualizations, the documents are categorized using clustering, and the clusters are distinguished by the term and cluster colors.
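One way to see why a single projection can yield either document or term locations is through a rank-2 SVD of the document-by-term TF-IDF matrix: the scaled left singular vectors place the documents, and the scaled right singular vectors place the terms. The sketch below assumes SVD is the projection, which the paragraph above does not confirm, and uses a hand-rolled power iteration only so that it runs on the standard library; a library SVD routine would normally take its place. The tiny matrix and term names are invented.

```python
import math


def dot(x, y):
    return sum(a * b for a, b in zip(x, y))


def top_singular_triplets(matrix, k=2, iters=200):
    """Approximate the k largest singular triplets by power iteration with deflation."""
    a = [row[:] for row in matrix]
    n_rows, n_cols = len(a), len(a[0])
    triplets = []
    for _ in range(k):
        v = [1.0 / math.sqrt(n_cols)] * n_cols
        for _step in range(iters):
            u = [dot(row, v) for row in a]  # u = A v
            w = [sum(a[i][j] * u[i] for i in range(n_rows)) for j in range(n_cols)]  # w = A^T u
            norm = math.sqrt(dot(w, w))
            v = [x / norm for x in w]
        u = [dot(row, v) for row in a]
        sigma = math.sqrt(dot(u, u))
        u = [x / sigma for x in u]
        triplets.append((sigma, u, v))
        # Deflate: remove the component just found before looking for the next one.
        a = [[a[i][j] - sigma * u[i] * v[j] for j in range(n_cols)] for i in range(n_rows)]
    return triplets


# Hypothetical TF-IDF weights: rows are documents, columns are terms.
terms = ["tumor", "biopsy", "inflation", "tariff"]
tfidf = [
    [2.0, 1.0, 0.0, 0.0],
    [1.0, 2.0, 0.0, 0.0],
    [0.0, 0.0, 3.0, 1.0],
    [0.0, 0.0, 1.0, 3.0],
]
triplets = top_singular_triplets(tfidf, k=2)

# Same decomposition, two views: documents use U * sigma, terms use V * sigma.
doc_xy = [tuple(s * u[i] for s, u, _ in triplets) for i in range(len(tfidf))]
term_xy = [tuple(s * v[j] for s, _, v in triplets) for j in range(len(terms))]
for name, xy in zip(terms, term_xy):
    print(name, [round(c, 3) for c in xy])
```

In this toy case the two medical terms land on one point and the two economic terms on another, and the documents separate the same way.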
Many people associate machine learning with AI, or artificial intelligence. At Exaptive, we use it to support IA, or intelligence augmentation. The difference is that instead of using machine learning to eliminate the need for humans in a process, the technology supports the intelligence of the human researcher, so researchers can accomplish more than would otherwise be humanly possible.