Justin Clark-Casey

711 days ago
Fiona N Notes for the day #CMDNAHack
Tweet with hashtag: 
Nadia #CMDNAHack (please use it when tweeting!)
 
Fiona N And find us on Twitter: @DNAdigest @theContentMine @linguamatics 
 
Schedule: 
•Intro from DNAdigest :ok: 
•Intro from ContentMine :ok: 
•Installation of tools :ok:
And off you go! :checkered_flag: 
 
Command line to reset keyboard mappings in VM
  • setxkbmap gb
 
Fiona N download papers using getpapers (different year for each table): 
  • getpapers --query '"human genomic" AND PUB_YEAR:[2010 TO 2010]' -o genome2010 -x
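Since each table uses a different year, the command line can be generated rather than retyped. A minimal sketch, assuming only the flags shown above (--query, -o, -x); the helper name build_getpapers_command is made up for illustration:

```python
import shlex


def build_getpapers_command(year, query_term="human genomic"):
    """Return the getpapers command line for one publication year.

    Hypothetical helper: it only reproduces the flags used in the notes above.
    """
    query = f'"{query_term}" AND PUB_YEAR:[{year} TO {year}]'
    output_dir = f"genome{year}"
    # shlex.join quotes the query so it survives the shell as one argument
    return shlex.join(["getpapers", "--query", query, "-o", output_dir, "-x"])


for year in (2010, 2011, 2012):
    print(build_getpapers_command(year))
```

Each printed line can be pasted into the terminal in the VM.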
 
Peter M then normalize:
  • norma -q genome2010 -i fulltext.xml -o scholarly.html --transform nlm2html
 
Fiona N Challenge: Can we build a word cloud? 
Peter M Run word frequencies
  • ami2-word -q genome2010 -i scholarly.html --w.words wordFrequencies --w.stopwords /org/xmlcml/ami2/plugins/word/stopwords.txt
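For anyone who wants to see what a word-frequency count with stopwords amounts to before feeding a word cloud, here is a minimal sketch in plain Python. It is not how ami2-word is implemented; the function name and the tiny stopword set are assumptions for illustration:

```python
import re
from collections import Counter


def word_frequencies(text, stopwords=frozenset()):
    """Count lower-cased alphabetic words, skipping any in `stopwords`."""
    words = re.findall(r"[a-z]+", text.lower())
    return Counter(word for word in words if word not in stopwords)


sample = "The genome data and the Genome"
print(word_frequencies(sample, stopwords={"the", "and"}).most_common())
```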
A regular expression for EGA can be found at http://www.ebi.ac.uk/miriam/main/collections/MIR:00000512
 
Antony Q ENA (European Nucleotide Archive): ^[A-Z]+[0-9]+$
 
Fiona N Here is the regular expression for ArrayExpress: ^[AEP]-\w{4}-\d+$
 
A regular expression to find DOIs: 
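No pattern was recorded here. As a possible starting point (an assumption, not something agreed on the day), the sketch below uses the commonly cited shape of modern DOIs: the prefix "10.", a registrant code of 4 to 9 digits, a slash, then a suffix. It will miss unusual or very old DOIs, and the function name find_dois is made up:

```python
import re

# Assumed pattern: "10." + 4 to 9 digits + "/" + suffix characters.
DOI_PATTERN = re.compile(r"\b10\.\d{4,9}/[-._;()/:A-Za-z0-9]+")


def find_dois(text):
    """Return DOI-like strings found in text, trimming trailing punctuation."""
    return [match.rstrip(".,;:") for match in DOI_PATTERN.findall(text)]


text = "Data at doi:10.5061/dryad.abc12, see also 10.1371/journal.pone.0123456."
print(find_dois(text))
```

The DOIs in the sample text are invented for the example.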
 
Jana G A regular expression for ENCODE data: 
ENC(SR|BS|DO|AB|LB|FF)[0-9]{3}[A-Z]{3}
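A note on the two-letter type codes: square brackets match a single character, so the alternatives need a parenthesised group. A small sketch comparing the two forms; the accession ENCSR000AAA is made up to fit the format:

```python
import re

# Character class: matches exactly ONE character out of S, R, |, B, D, O, A, L, F
class_pattern = re.compile(r"ENC[SR|BS|DO|AB|LB|FF][0-9]{3}[A-Z]{3}")
# Grouped alternation: matches one of the two-letter codes
group_pattern = re.compile(r"ENC(?:SR|BS|DO|AB|LB|FF)[0-9]{3}[A-Z]{3}")

sample = "Experiment ENCSR000AAA was used."  # made-up accession
print(class_pattern.search(sample))
print(group_pattern.search(sample).group(0))
```

The bracketed form finds nothing in the sample, while the grouped form finds the accession.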
 
Other repositories to look for the data identifiers
 
Need an accession regex for: 
dbGaP http://www.ncbi.nlm.nih.gov/gap : see http://www.ncbi.nlm.nih.gov/books/NBK110024/ e.g. study accession ID: "phs000879.v1.p1" 
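A candidate pattern for dbGaP study accessions, inferred only from the single example above ("phs", six digits, then version and participant-set suffixes). This is an assumption and should be checked against the NCBI documentation linked above before relying on it:

```python
import re

# Assumed shape, based solely on the example "phs000879.v1.p1"
DBGAP_STUDY = re.compile(r"phs\d{6}\.v\d+\.p\d+")

text = 'Genotypes are available from dbGaP under accession "phs000879.v1.p1".'
print(DBGAP_STUDY.findall(text))
```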
 
 
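Jee-Hyub K Remove ^ ... $ in regular expressions.
The ^ and $ anchors only allow a match when the identifier is the entire string, so identifiers embedded in running text are missed. If you remove ^ and $ from the regular expression, you will get matches (it worked with ArrayExpress and RefSeq).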
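A quick sketch of the effect, using the ArrayExpress pattern from above on a made-up sentence. The \b word boundaries in the last pattern are an optional extra (an assumption, not something tested on the day) to avoid matching inside longer tokens:

```python
import re

sentence = "Data were deposited in ArrayExpress under accession E-MTAB-1234."

anchored = re.compile(r"^[AEP]-\w{4}-\d+$")      # as listed by the registry
unanchored = re.compile(r"[AEP]-\w{4}-\d+")      # ^ and $ removed
bounded = re.compile(r"\b[AEP]-\w{4}-\d+\b")     # optional word boundaries

print(anchored.search(sentence))
print(unanchored.findall(sentence))
print(bounded.findall(sentence))
```

The anchored pattern finds nothing in the sentence, while the other two find the accession.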
 
Peter M Run ami-regex [1]
(assuming the directory is called "genome2010")
we create a file genome2010/regex.xml containing:
<compoundRegex title="genome">
<regex weight="1.0" fields="genome">([Gg]enome)</regex>
<regex weight="1.0" fields="data">([Dd]ata)</regex>
</compoundRegex>
 
then run
ami2-regex -q genome2010 --r.regex genome2010/regex.xml
 
Typical output
 
containing:
 
<results title="genome"><result pre="e of 10 primary tumors and 10 effusions was analyzed using the Array-Ready Oligo set for the Human " name0="genome" value0="Genome" post="platform. Results for selected genes were validated using PCR, Western blotting, and immunohistoche" xpath="/html[1]/body[1]/div[1]/div[2]/p[1]"/>
<result pre="otting, and immunohistochemistry confirmed the array findings for BCAR1, CLDN4, VIL2, and DCN. Our " name0="data" value0="data" post="show that breast carcinoma cells in primary carcinomas and effusions have different gene expression" xpath="/html[1]/body[1]/div[1]/div[2]/p[1]"/>
<result pre="l, and their clinical relevance was analyzed in a larger series of breast carcinoma effusions. Our " name0="data" value0="data" post="demonstrate that in agreement with our previous observations, breast carcinoma cells in effusions a" xpath="/html[1]/body[1]/div[1]/div[3]/p[3]"/>
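To summarise output like this, the result elements can be tallied by their name0 attribute. A minimal sketch, assuming the output is a well-formed results document; the sample below shortens the pre/post text and adds a closing tag, and the function name is made up:

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Shortened sample in the shape of the output above (closing tag assumed)
SAMPLE = """<results title="genome">
<result pre="for the Human " name0="genome" value0="Genome" post="platform." xpath="/html[1]/body[1]/div[1]/div[2]/p[1]"/>
<result pre="Our " name0="data" value0="data" post="show that" xpath="/html[1]/body[1]/div[1]/div[2]/p[1]"/>
<result pre="Our " name0="data" value0="data" post="demonstrate that" xpath="/html[1]/body[1]/div[1]/div[3]/p[3]"/>
</results>"""


def count_matches(xml_text):
    """Count result elements per regex field name (the name0 attribute)."""
    root = ET.fromstring(xml_text)
    return Counter(result.get("name0") for result in root.iter("result"))


print(count_matches(SAMPLE))
```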
 
now create your own file ids.xml like:
<compoundRegex title="ids">
<regex weight="1.0" fields="egad">(EGAD\d{11})</regex>
</compoundRegex>
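With several identifier patterns to try, the file can also be generated. A sketch using Python's standard ElementTree; it only mirrors the element and attribute names shown above, and whether ami2-regex accepts every pattern written this way has not been checked. The helper name is made up:

```python
import xml.etree.ElementTree as ET


def build_compound_regex(title, patterns):
    """Return compoundRegex XML for a {field_name: pattern} mapping."""
    root = ET.Element("compoundRegex", title=title)
    for field, pattern in patterns.items():
        regex = ET.SubElement(root, "regex", weight="1.0", fields=field)
        # Wrap each pattern in a capture group, as in the examples above
        regex.text = f"({pattern})"
    return ET.tostring(root, encoding="unicode")


xml_text = build_compound_regex("ids", {"egad": r"EGAD\d{11}"})
print(xml_text)
```

The printed XML can be saved as ids.xml and passed to --r.regex in the same way as regex.xml.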
 
 
Fiona N •Challenge 1 : Can we find data accession numbers? 
Jee-Hyub K
  • Europe PubMed Central provides web services for mined accession numbers.
  • Accession number search
 
Fiona N •Challenge 2 : Can we find data DOIs? 
•Challenge X : …?
•Interrupted by coffee breaks and lunch :)
•Summary 
Open Access scientific journals in biology that can be readily mined: 
 
  • PLoS
  • BioMed Central
  • eLife
 
...
713 days ago
Unfiled. Edited by Justin Clark-Casey 713 days ago
 
Look at very recent PMC release of papers in text-minable form (about 2 days old)
 
752 days ago
Unfiled. Edited by Justin Clark-Casey 752 days ago
Justin C Hackday with ContentMine/DNA Digest Eventbrite description
 
  • Eventbrite Description (draft)
From the volunteers and organizers of ContentMine and DNAdigest  comes a new hackday!
 
ContentMine is an open-source project, funded by the Shuttleworth Foundation, that extracts facts from scientific literature in machine readable form. 
 
DNAdigest is a charity that promotes efficient sharing of genomics data to advance scientific research.
 
We think that there's a lot we can do together.  For one thing, we want to see if we can use the tools of ContentMine to automatically extract references to genomic data sets from new journal publications.  Then we can put them into a big machine-readable database that can be searched by interested humans and machines alike.  We'd love your expertise and help with that experiment :)
 
But we also want to explore with you any other areas where automatic fact extraction could be really useful.  Perhaps you work in a field where you have to manually transcribe data from graphs?  Or perhaps you'd like to scan open-access journals that are only available via the web for a particular word or phrase?  ContentMine has tools to do all these things and more.
 
So if you're interested in machine-mining journals, whether that's for genomics dataset DOIs, your own research or just out of general interest, please come along and join us!  In-depth programming knowledge is not essential - ContentMine is designed to make mining easy and we'll have ContentMine mentors there on the day to help you get familiar with the system.
 
  • Notes
 
 
823 days ago
Unfiled. Edited by Justin Clark-Casey 823 days ago
Justin C Group 3
 
Topics
Consent
Government/big picture
Researchers
Who uses/accesses data
Problems with databases
 
