'KinderMining': Tackling big data sets by keeping things simple

March 29, 2017

MADISON -- With about 100 lines of code, a Morgridge Institute for Research team has unleashed a fast, simple and predictive text-mining tool that may turbo-charge big biomedical pursuits such as drug repurposing and stem cell treatments.

The algorithm, named "KinderMiner" by its inventors, has been put to use exploring one of the largest single archives of research journal papers, Europe PubMed Central (PMC). Within hours, the algorithm can scan the more than 30 million papers online in Europe PMC and return ranked associations between selected target terms and key phrases.
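The article does not spell out how KinderMiner scores an association, so the following is only a minimal sketch of the general idea: count how many papers mention each target term together with a key phrase, then rank the terms by that count. The function name, the plain substring matching and the three sample sentences are all assumptions made for illustration, not the team's code or data.

```python
from collections import Counter

def rank_cooccurrence(papers, key_phrase, target_terms):
    """Rank target terms by the number of papers that mention both
    the term and the key phrase (simple case-insensitive substring match)."""
    key = key_phrase.lower()
    counts = Counter()
    for text in papers:
        lowered = text.lower()
        if key not in lowered:
            continue  # paper never mentions the key phrase
        for term in target_terms:
            if term.lower() in lowered:
                counts[term] += 1
    # most_common() sorts by count, highest first; terms with no hits are omitted
    return counts.most_common()

# Made-up sample sentences standing in for a real corpus
papers = [
    "OCT4 and SOX2 help induce pluripotent stem cells.",
    "SOX2 expression in pluripotent stem cells.",
    "GATA4 drives cardiomyocyte formation.",
]
ranking = rank_cooccurrence(papers, "pluripotent", ["OCT4", "SOX2", "GATA4"])
print(ranking)  # [('SOX2', 2), ('OCT4', 1)]
```

A raw count favors terms that are simply common in the literature, so a real tool would likely normalize or test the counts statistically; that step is left out here.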

"We started this project to try to find a text mining approach that works more effectively for scientists," says senior author Ron Stewart, associate director of bioinformatics at Morgridge. "Most often, researchers are running manual Google searches and combing through millions of hits to find, for example, certain genes that are important to a biological process or disease. It's often based on hunches and intuition. We're trying to automate and formalize that process."

Finn Kuusisto, a postdoctoral researcher at the Morgridge Institute and the first author on the KinderMiner paper, presented results on Wednesday, March 29 at the American Medical Informatics Association's annual Joint Summits on Translational Science in San Francisco. The summit showcases new applications in bioinformatics that are improving healthcare.

"There are other techniques out there that require a lot more data-wrangling," says Kuusisto. "But in our case, we write about 100 lines of Python code, and our users can be given answers that may significantly speed up their scientific process."

The scientists emphasize that while their queries focused on biomedicine, KinderMiner can be applied to any discipline -- the only constant is the need for a massive corpus to search. The next step will be to create an online search interface available for the scientific community.

To test KinderMining, the team chose two kinds of scientific problems that are time-consuming and often intractable. The first is identifying relevant transcription factors to reprogram stem cells, and the second is finding potential drugs with off-label benefits or adverse effects.

For cell reprogramming, there are about 2,000 known transcription factors that might be useful in changing a cell from one state to another, such as creating induced pluripotent stem (iPS) cells from skin cells. The team used KinderMining on three reprogramming efforts that are well established in the research literature: creating iPS cells, creating cardiomyocytes, and maturing liver cells.

To show the predictive power of the algorithm, the team censored the literature by date, excluding every paper published within two years of each discovery's publication date or later. They queried only papers up to 2004 for iPS cells, 2008 for cardiomyocytes and 2009 for liver cells.
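The date censoring can be pictured as a simple filter applied before any ranking is done. The sketch below uses the cutoff years quoted above; the record layout (a dict with a "year" field) and the two sample records are hypothetical placeholders, not the team's data format.

```python
# Cutoff years quoted in the article; the record layout is a hypothetical example.
CUTOFF_YEARS = {"iPS cells": 2004, "cardiomyocytes": 2008, "liver cells": 2009}

def censor_by_year(papers, cutoff_year):
    """Return only the papers published in or before cutoff_year."""
    return [paper for paper in papers if paper["year"] <= cutoff_year]

# Made-up placeholder records
papers = [
    {"title": "Early pluripotency study", "year": 2003},
    {"title": "Later reprogramming report", "year": 2006},
]
visible = censor_by_year(papers, CUTOFF_YEARS["iPS cells"])
print([paper["title"] for paper in visible])  # ['Early pluripotency study']
```

Any ranking computed from the filtered list can only draw on what was in print before the discovery, which is what makes the test a prediction and not a lookup.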

The results in all three tests identified numerous relevant transcription factors in the top 20 hits -- again, from a potential pool of about 2,000 factors. This is a substantial benefit to wet lab scientists, given that the factors likely need to act in combination. For instance, testing all 2,000 factors four at a time would take more than 100 billion experiments (about 665 billion distinct combinations), clearly outside the realm of possibility.

Stewart notes that KinderMining ranks the factors, and it is likely that the important factors will be in the top 10 or 20. If scientists then test only the top 10 factors four at a time, that requires a manageable 210 experiments, Stewart says.
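The experiment counts follow from counting unordered groups of four. As a quick check, assuming each experiment tests one distinct combination of four factors and that order does not matter, Python's math.comb gives the numbers directly. The helper name below is only illustrative.

```python
from math import comb

def experiments_needed(pool_size, group_size=4):
    """Number of distinct unordered groups of group_size drawn from pool_size factors."""
    return comb(pool_size, group_size)

full_pool = experiments_needed(2000)  # every known transcription factor
top_ten = experiments_needed(10)      # only the ten highest-ranked factors

print(f"{full_pool:,}")  # 664,668,499,500
print(top_ten)           # 210
```

The full pool works out to roughly 665 billion combinations, well past the 100 billion mark, while a shortlist of 10 gives the 210 experiments Stewart cites. A shortlist of 20 would give comb(20, 4) = 4,845.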

The team compared its results against a state-of-the-art data mining tool called Mogrify, and the KinderMining results overlapped on a large proportion of accurate hits.

"This is kind of like a 'time machine' for biology, where we can go back before any of the big publications came out on reprogramming, and still make a good guess about what genes are most important," says Stewart.

Stewart works in the Morgridge regenerative biology team led by stem cell pioneer James Thomson, and many of Thomson's landmark discoveries provided the original inspiration for this project. "It would be great if we could help someone in the Thomson lab or a related lab come up with a discovery that has great clinical benefit -- but instead of taking 15 years, we do it in three years."

The second big test involved scanning Europe PMC to identify drugs that reduce blood glucose. Of the top 50 drugs found, 43 are known diabetes treatments, and the remaining seven either raise or lower blood glucose as a secondary, off-label effect. Those seven hits are especially important because they demonstrate that the tool may be able to predict candidates for drug repurposing.

Repurposed drugs make up about 30 percent of all new drugs or vaccines approved by the U.S. Food and Drug Administration. David Page, a co-author on the study and a professor of biostatistics and medical informatics at the University of Wisconsin-Madison, says he is excited about the potential of KinderMiner to identify promising drugs to repurpose.

"You could spend all your time -- and all your students' time -- scanning the literature for this kind of secondary drug effect and only scratch the surface of what's out there," Page says. "It's better to write an automated machine learning package to do it instead."

Kuusisto and Page have received approval to use de-identified electronic health records from the National VA Hospital, with approximately 10 million records, to continue the drug repurposing work, examining several drug effects such as lowering of cholesterol levels or blood pressure.

Morgridge computational biologist John Steill, another co-author of the KinderMining study, is using the KinderMiner tool to improve gene marker lists, which have numerous uses such as classifying cells or samples by cell type and identifying samples that may produce tumors.

Morgridge Institute for Research
