
Peter Murray-Rust

711 days ago
Fiona N Notes for the day #CMDNAHack
Tweet with hashtag: 
Nadia #CMDNAHack (please use it when tweeting!)
Fiona N And find us on Twitter: @DNAdigest @theContentMine @linguamatics 
•Intro from DNAdigest :ok: 
•Intro from ContentMine :ok: 
•Installation of tools :ok:
And off you go! :checkered_flag: 
Command to reset the keyboard mapping in the VM (UK layout):
  • setxkbmap gb
Fiona N download papers using getpapers (different year for each table): 
  • getpapers --query '"human genomic" AND PUB_YEAR:[2010 TO 2010]' -o genome2010 -x
Peter M then normalize:
  • norma -q genome2010 -i fulltext.xml -o scholarly.html --transform nlm2html
Fiona N Challenge: Can we build a word cloud? 
Peter M Run word frequencies
  • ami2-word -q genome2010 -i scholarly.html --w.words wordFrequencies --w.stopwords /org/xmlcml/ami2/plugins/word/stopwords.txt
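For the word-cloud challenge, the counting step can be sketched in a few lines of Python. This is only an illustration of the idea, not how ami2-word works internally; the stopword set and the min_length parameter are assumptions made up for the example.

```python
import re
from collections import Counter

# Hypothetical, tiny stopword set; ami2-word ships its own stopwords.txt.
STOPWORDS = {"the", "of", "and", "a", "in", "to", "is", "was", "for", "our", "that"}

def word_frequencies(text, stopwords=STOPWORDS, min_length=3):
    """Count lower-cased alphabetic words, skipping stopwords and short tokens."""
    words = re.findall(r"[a-z]+", text.lower())
    return Counter(w for w in words if w not in stopwords and len(w) >= min_length)

sample = "Our data show that the human genome data was analyzed. Genome data matter."
freqs = word_frequencies(sample)

# The most frequent words are the natural input for a word cloud.
for word, count in freqs.most_common(3):
    print(word, count)
```

The resulting counts could be handed to any word-cloud renderer; which one to use is left open here.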
A regular expression for EGA can be found at http://www.ebi.ac.uk/miriam/main/collections/MIR:00000512
Antony Q ENA (European Nucleotide Archive): ^[A-Z]+[0-9]+$
Fiona N Here is the regular expression for ArrayExpress: ^[AEP]-\w{4}-\d+$
A regular expression to find DOIs: 
Jana G A regular expression for ENCODE data: 
Other repositories to look for the data identifiers
Need regex accession for: 
dbGaP http://www.ncbi.nlm.nih.gov/gap : see http://www.ncbi.nlm.nih.gov/books/NBK110024/ e.g. study accession ID: "phs000879.v1.p1" 
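A possible starting point for the missing dbGaP regex, as a minimal Python sketch. The pattern is an assumption inferred only from the example accession above ("phs" plus six digits, with optional ".v<n>" and ".p<n>" suffixes); check it against the NCBI documentation linked above before relying on it.

```python
import re

# Assumed pattern, inferred from the single example "phs000879.v1.p1".
DBGAP_STUDY = re.compile(r"\bphs\d{6}(?:\.v\d+)?(?:\.p\d+)?\b")

text = "Data are in dbGaP (phs000879.v1.p1) and in study phs000001.v3.p1."
found = DBGAP_STUDY.findall(text)
print(found)
```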
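Jee-Hyub K Remove the ^ ... $ anchors from the regular expressions.
The published patterns are anchored, so they only match a string that consists of the identifier alone and find nothing in running text. If you remove ^ and $ from the regular expression, you will get some matches (it worked with ArrayExpress and RefSeq).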
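The effect of the anchors can be checked quickly in Python. This is a sketch using the ArrayExpress pattern from above; the sample sentence and accession are made up for illustration, and the \b word boundaries are an added assumption to avoid matching inside longer tokens.

```python
import re

# Pattern exactly as published: only matches when the whole string is an accession.
ARRAYEXPRESS_ANCHORED = re.compile(r"^[AEP]-\w{4}-\d+$")
# Same pattern without ^ and $, with word boundaries added (an assumption).
ARRAYEXPRESS_IN_TEXT = re.compile(r"\b[AEP]-\w{4}-\d+\b")

# Illustrative sentence; the accession is invented for the example.
sentence = "Raw data were deposited in ArrayExpress under accession E-MTAB-1234."

print(ARRAYEXPRESS_ANCHORED.search(sentence))       # None: the sentence is not just an accession
print(ARRAYEXPRESS_ANCHORED.search("E-MTAB-1234"))  # matches the bare identifier
print(ARRAYEXPRESS_IN_TEXT.findall(sentence))       # finds the accession inside the sentence
```

Note that a very loose pattern such as the ENA one is likely to produce many false positives once it is unanchored, since any run of capital letters followed by digits would match.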
Peter M Run ami-regex [1]
(assuming the directory is called "genome2010")
We create a file genome2010/regex.xml containing:
<compoundRegex title="genome">
<regex weight="1.0" fields="genome">([Gg]enome)</regex>
<regex weight="1.0" fields="data">([Dd]ata)</regex>
</compoundRegex>
then run
ami2-regex -q genome2010 --r.regex genome2010/regex.xml
Typical output
<results title="genome"><result pre="e of 10 primary tumors and 10 effusions was analyzed using the Array-Ready Oligo set for the Human " name0="genome" value0="Genome" post="platform. Results for selected genes were validated using PCR, Western blotting, and immunohistoche" xpath="/html[1]/body[1]/div[1]/div[2]/p[1]"/>
<result pre="otting, and immunohistochemistry confirmed the array findings for BCAR1, CLDN4, VIL2, and DCN. Our " name0="data" value0="data" post="show that breast carcinoma cells in primary carcinomas and effusions have different gene expression" xpath="/html[1]/body[1]/div[1]/div[2]/p[1]"/>
<result pre="l, and their clinical relevance was analyzed in a larger series of breast carcinoma effusions. Our " name0="data" value0="data" post="demonstrate that in agreement with our previous observations, breast carcinoma cells in effusions a" xpath="/html[1]/body[1]/div[1]/div[3]/p[3]"/>
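Output in this shape is easy to summarise with Python's standard library. The sketch below assumes the results file is well-formed XML with a closing </results> tag; the sample is a shortened copy of the output above, and the function names are made up for the example.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Shortened sample in the shape of the output above; the closing tag is
# added here so that the snippet parses on its own.
SAMPLE = """<results title="genome">
<result pre="Oligo set for the Human " name0="genome" value0="Genome" post="platform." xpath="/html[1]/body[1]/div[1]/div[2]/p[1]"/>
<result pre="Our " name0="data" value0="data" post="show that" xpath="/html[1]/body[1]/div[1]/div[2]/p[1]"/>
<result pre="Our " name0="data" value0="data" post="demonstrate that" xpath="/html[1]/body[1]/div[1]/div[3]/p[3]"/>
</results>"""

def count_matches(xml_text):
    """Count hits per regex field name (the name0 attribute)."""
    root = ET.fromstring(xml_text)
    return Counter(r.get("name0") for r in root.iter("result"))

def matches_in_context(xml_text):
    """Return (pre, value, post) triples, i.e. each hit with its surrounding text."""
    root = ET.fromstring(xml_text)
    return [(r.get("pre", ""), r.get("value0", ""), r.get("post", ""))
            for r in root.iter("result")]

counts = count_matches(SAMPLE)
contexts = matches_in_context(SAMPLE)
print(counts)
```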
now create your own file ids.xml like:
<compoundRegex title="ids">
<regex weight="1.0" fields="egad">(EGAD\d{11})</regex>
</compoundRegex>
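If several identifier patterns are collected during the day, a file like ids.xml could be generated rather than typed by hand. A minimal sketch, assuming the compoundRegex layout shown above is all that is needed; the "arrayexpress" field name and the helper function are made up for the example, and the ArrayExpress pattern is the unanchored form of the one given earlier.

```python
import xml.etree.ElementTree as ET

# Field name -> pattern. "egad" comes from the notes above;
# "arrayexpress" is a hypothetical field name for the unanchored pattern.
PATTERNS = {
    "egad": r"EGAD\d{11}",
    "arrayexpress": r"[AEP]-\w{4}-\d+",
}

def build_compound_regex(title, patterns, weight="1.0"):
    """Build a <compoundRegex> element with one <regex> child per pattern."""
    root = ET.Element("compoundRegex", {"title": title})
    root.text = "\n"
    for field, pattern in patterns.items():
        regex = ET.SubElement(root, "regex", {"weight": weight, "fields": field})
        regex.text = f"({pattern})"  # capture group, as in the examples above
        regex.tail = "\n"
    return root

root = build_compound_regex("ids", PATTERNS)
xml_text = ET.tostring(root, encoding="unicode")
print(xml_text)
```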
Fiona N •Challenge 1 : Can we find data accession numbers? 
Jee-Hyub K
  • Europe PubMed Central provides web services for mined accession numbers.
  • Accession number search
Fiona N •Challenge 2 : Can we find data DOIs? 
•Challenge X : …?
•Interrupted by coffee breaks and lunch :)
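For Challenge 2, a DOI pattern was not recorded in these notes. The sketch below uses a commonly used heuristic for modern DOIs, not an official definition, and the sample DOIs are invented for illustration. Stripping trailing punctuation is a pragmatic assumption that will occasionally cut a DOI that really ends in ")".

```python
import re

# Heuristic: "10." + 4-9 digit registrant prefix + "/" + suffix characters.
DOI = re.compile(r"\b10\.\d{4,9}/[-._;()/:A-Za-z0-9]+")

def find_dois(text):
    """Return DOI-like strings, minus trailing sentence punctuation."""
    return [m.rstrip(".,;)") for m in DOI.findall(text)]

# Illustrative text; both DOIs are made up.
sample = ("Data are at doi:10.5061/dryad.abc12 "
          "(see also https://doi.org/10.1371/journal.pone.0000000).")
dois = find_dois(sample)
print(dois)
```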
Open Access scientific journals in biology that can be readily mined: 
  • PLoS
  • BioMed Central
  • eLife
711 days ago
Proactive P Notes for the day 11 Dec 2015: 
Fiona N Tweet with hashtag: 
#CMDNAHack (please use it when tweeting!)
And find us on Twitter: @DNAdigest @theContentMine @linguamatics 
Peter M THIS COMMAND WORKS for PMR; I suggest each table uses a different year:
getpapers --query '"human genomic" AND PUB_YEAR:[2010 TO 2010]' -o genome2010 -x
Antony Q Delete empty directories:
cd genome2010
find . -empty -delete
cd ..
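The same clean-up can be sketched in Python for anyone without GNU find. Unlike the find command above, which also deletes empty files, this version only removes empty directories. The directory and file names in the demo are illustrative.

```python
import os
import tempfile

def remove_empty_dirs(root):
    """Remove empty directories below root, deepest first; return removed paths."""
    removed = []
    # topdown=False visits children before parents, so a parent that becomes
    # empty after its children are removed is caught as well.
    for dirpath, _dirnames, _filenames in os.walk(root, topdown=False):
        if dirpath != root and not os.listdir(dirpath):
            os.rmdir(dirpath)
            removed.append(dirpath)
    return removed

# Demo on a throwaway directory: one empty paper directory, one with a file.
with tempfile.TemporaryDirectory() as tmp:
    os.makedirs(os.path.join(tmp, "PMC1"))
    os.makedirs(os.path.join(tmp, "PMC2"))
    with open(os.path.join(tmp, "PMC2", "fulltext.xml"), "w") as fh:
        fh.write("<article/>")
    removed = remove_empty_dirs(tmp)
    remaining = sorted(os.listdir(tmp))
    print(remaining)
```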
Peter M then normalize:
norma -q genome2010 -i fulltext.xml -o scholarly.html --transform nlm2html
then word frequencies
ami2-word -q genome2010 -i scholarly.html --w.words wordFrequencies  --w.stopwords /org/xmlcml/ami2/plugins/word/stopwords.txt
and regex
Antony Q ENA (European Nucleotide Archive) regex (from http://www.ebi.ac.uk/miriam/main/collections/MIR:00000372):
José M I usually use https://regex101.com to develop and test regular expressions; it might be useful for others here, particularly regular expression newbies. For example: https://regex101.com/r/cA9aK7/1
  • Notes
1029 days ago
Unfiled. Edited by Fiona Nielsen , Peter Murray-Rust 1029 days ago
Fiona N Tim Richardson, Adrian Alexa Ines: We can use some scraper tools from the ContentMine if we want to use journal publications and citations as a starting point, see list of cool tools here: http://contentmine.org/software 
Peter M I'm away but will suggest someone comes along
1207 days ago
Organising groups - Brainstorming
Liang S Backend prototype:
  • process data into "DNADigest format" (uri, id, study title & description)
  • example: 
  • id: phs000001.v3.p1
  • title: <StudyNameReportPage>National Eye Institute (NEI) Age-Related Eye Disease Study (AREDS)</StudyNameReportPage>
  • description: <Description><![CDATA[<p> The Age-Related Eye Disease Study (AREDS) was initially designed as a long-term multi-center, prospective study of the clinical course of age-related macular degeneration (AMD) and age-related cataract. In addition to collecting natural history data, AREDS included a clinical trial of]]>...</Description>
  • post each data record into Solr to build the index
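The backend steps above can be sketched as follows. This is a rough illustration, not the prototype's actual code: the Solr URL, the core name "dnadigest", the example URI and the function names are all assumptions, and the request is only built, not sent. It relies on Solr's update handler accepting a JSON array of documents.

```python
import json
import urllib.request

# Hypothetical Solr location and core name; adjust to the real deployment.
SOLR_UPDATE_URL = "http://localhost:8983/solr/dnadigest/update?commit=true"

def to_record(uri, study_id, title, description):
    """Normalise one study into the flat record described above."""
    return {"uri": uri, "id": study_id, "title": title, "description": description}

def build_update_request(records, url=SOLR_UPDATE_URL):
    """Build (but do not send) a POST request carrying the records as JSON."""
    body = json.dumps(records).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

record = to_record(
    uri="https://example.org/studies/phs000001.v3.p1",  # placeholder URI
    study_id="phs000001.v3.p1",
    title="National Eye Institute (NEI) Age-Related Eye Disease Study (AREDS)",
    description="The Age-Related Eye Disease Study (AREDS) was initially designed as ...",
)
request = build_update_request([record])

# To index for real, against a running Solr instance:
# urllib.request.urlopen(request)
print(request.get_method(), request.full_url)
```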
Fiona N Please list your topics of interest: 
  • Connect current prototype to available meta-data from repositories, e.g. EGA, dbGaP. 
  • Francis? Seb?
Francis M
  •  A more detailed overview of how Symfony works by Prathap.
  • How best to automate data input from the CSV files to the database, to avoid having to manually enter fields from datasets in the API sandbox? (Perhaps using Doctrine from the command line after SSHing into the server?)
Sebastian P
  • Identify the data to display. What do the users want to see?
  • Extract the data we need to display
  • Convert both XML & XLS to CSV, or perhaps use a different method?
  • How much data about the data sets should the DNADigest tool hold and manage?
Fiona N Who would like to be in this group?
Prathap C Prathap
Thomas D ThomasD
Fiona N Please insert relevant links etc. 
Peter M How can we optimise search? Text-mine descriptions? Easiest is mentions ("named entities")  in running text. 
  • extracting identifiers (and their systems) from text - e.g. Genbank, DOIs
  • extracting text likely to refer to datasets (e.g. "deposited in dataset")
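Both ideas can be combined in a small sentence tagger. This is a minimal sketch only: the two identifier patterns come from the hackday notes on this pad (used unanchored, with word boundaries added), the cue phrases are guesses at typical wording, and the sample accession is invented.

```python
import re

# Identifier systems and patterns (unanchored; \b boundaries are an assumption).
ID_PATTERNS = {
    "EGA dataset": re.compile(r"\bEGAD\d{11}\b"),
    "ArrayExpress": re.compile(r"\b[AEP]-\w{4}-\d+\b"),
}

# Hypothetical cue phrases suggesting that a sentence talks about deposited data.
DEPOSIT_CUES = re.compile(r"\b(deposited|submitted|available)\s+(in|to|at|from)\b",
                          re.IGNORECASE)

def tag_sentence(sentence):
    """Return identifiers (with their system) and whether a deposit cue occurs."""
    identifiers = [(system, m.group(0))
                   for system, pattern in ID_PATTERNS.items()
                   for m in pattern.finditer(sentence)]
    return {"identifiers": identifiers,
            "mentions_deposit": bool(DEPOSIT_CUES.search(sentence))}

# Illustrative sentence with a made-up accession.
sentence = "Sequence data have been deposited in the EGA under accession EGAD00001000001."
tagged = tag_sentence(sentence)
print(tagged)
```

Sentences that carry a deposit cue but no recognised identifier may be the most interesting ones to review by hand, since they can point to repositories not yet covered by a pattern.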
Johan H
  • Is it possible to do something like a literature "psi-blast"? To find a set of interesting articles, and then expand/narrow on this
Peter M
  • diseases are harder because there are fuzzy boundaries and variable usage ("cancer", "tumo(u)r"). MeSH and UMLS can solve a lot of the synonymy and disambiguation
Johan H
  • Can we have a search engine that suggests more specific categories, from user annotation?
Javier D
  •  Depth of search: from disease, condition, other phenotype descriptors to type of variation and coordinates?
Liang S
  • Is it possible to store all the metadata from different organisations in one place?
  • How to collect metadata (Web crawler, API or user submit)? Metadata format/standard? 
Javier D
  • Object-oriented databases may offer the flexibility needed? They make it easy to add annotation and group data based on different attributes. Ontology structures are evolving fast and their hierarchies change.
  • Allow automatic structured searches based on formats like JSON or XML, (text ...)
  • Is the data curated? Yes/No (type of curation).
  • How do we improve the quality of data curation/tagging/description in datasets as they are published/worked on?
  • What are funders/publishers doing to help the DNADigest effort?
Sebastian P
  • Software for this? Solr? etc
Peter M Resource from Fiona Nielsen
Handcurated multiple alignments 
Fiona N Who would like to be in this group?
Peter M PeterMR 
Fiona N Robert? 
How can we measure the usage for both data contributors and data users? Researchers and institutions want to measure their impact
Jana G
  • create 'own research' and option to track the progress, results, conclusions 
  • option to contact publishers for more information
Johan H
  • For incentive: can "regular people" who do a lot of annotation gain free access to paid journals? The journals should be interested in this as well as it increases their visibility 
Peter M
  • [prejudiced view] They aren't interested in visibility [they want subscriptions] or in text-mining unless they control everything. The "free" access in public libraries ("access to research") is limited to one hour and [probably] involves DRM and no copying or scraping
Johan H
  • True, but visibility -> impact, so that could be the way to argue it... It just has to connect to researchers in the end (who pay)
Fiona N Who would like to be in this group?
How do we build in incentives for collaboration and data sharing?
  • feedback system? How much is a good reputation on the DNADigest site worth?
Fiona N
  • 'credits' for good behaviour? (Similar to Stack Overflow?)
Jana G
  • rating/credits for relevancy
Fiona N
  • unlocking more functionality if you contribute to the system? 
Lambert M
  • Contribute as a registered user? Or unlock functionality for registered vs unregistered users?
Fiona N
  • And what would that look like from a UI/UX perspective (something like Stack Overflow?)
  • more suggestions? What does e.g. TripAdvisor do to capture feedback from users? What is the incentive? 
  • How can the public/informed/interested users be involved?
Fiona N Who would like to be in this group?
Sobia R Sobia
Lambert M Lambert
Peter M Data stored in grey literature (Lancet: ?2009 85% of science is wasted).
  • Theses
  • Government/NGO/Charity publications
  • NICE and NIHR publications
Work with EuropePubMedCentral (http://europepmc.org). 
  • PMR is on the advisory board and we are looking for use cases and added value
 More topics for discussion:
Karamjit G   ethical, privacy, access, public engagement issues of design, prototyping of tool sets for metadata
Thomas D  How many datasets of the type we're interested in exist?  How many will exist in the future?  Does it make a difference if most of the data ends up in a small number of (maybe national-scale?) genomics projects?
