Command line to reset keyboard mappings in VM
  • setxkbmap gb
Fiona N Download papers using getpapers (a different year for each table):
  • getpapers --query '"human genomic" AND PUB_YEAR:[2010 TO 2010]' -o genome2010 -x
Peter M Then normalize the downloaded XML to scholarly HTML:
  • norma -q genome2010 -i fulltext.xml -o scholarly.html --transform nlm2html
Fiona N Challenge: Can we build a word cloud? 
Peter M Run word frequencies
  • ami2-word -q genome2010 -i scholarly.html --w.words wordFrequencies --w.stopwords /org/xmlcml/ami2/plugins/word/stopwords.txt
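As a rough illustration of what a word-frequency step computes before a word cloud can be drawn, here is a minimal Python sketch. It is not how ami2-word is implemented; the tiny stopword set, the `word_frequencies` helper and the sample sentence are all made up for this example.

```python
import re
from collections import Counter

# Tiny illustrative stopword set; ami2-word reads its own list from stopwords.txt.
STOPWORDS = {"the", "of", "and", "a", "in", "to", "is"}

def word_frequencies(text, stopwords=STOPWORDS):
    """Count lowercase alphabetic words in text, skipping stopwords."""
    words = re.findall(r"[a-z]+", text.lower())
    return Counter(w for w in words if w not in stopwords)

# Hypothetical sentence, not taken from any downloaded paper.
sample = "The human genome and the mouse genome differ in the size of the genome."
print(word_frequencies(sample).most_common(3))
```

The resulting counts are the weights a word cloud would use to size each word.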
A regular expression for EGA can be found at http://www.ebi.ac.uk/miriam/main/collections/MIR:00000512
Antony Q ENA (European Nucleotide Archive): ^[A-Z]+[0-9]+$
Fiona N Here is the regular expression for ArrayExpress: ^[AEP]-\w{4}-\d+$
A regular expression to find DOIs: 
Jana G A regular expression for ENCODE data: 
Other repositories to look for the data identifiers
Need regex accession for: 
dbGaP (http://www.ncbi.nlm.nih.gov/gap): see http://www.ncbi.nlm.nih.gov/books/NBK110024/ e.g. study accession id: "phs000879.v1.p1"
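A possible starting point for the missing dbGaP pattern, sketched in Python. The shape (`phs`, six digits, `.v` plus a version number, `.p` plus a participant-set number) is an assumption inferred only from the single example above and should be checked against the NCBI documentation linked there. The name `DBGAP_STUDY` and the sample sentence are made up.

```python
import re

# Assumed shape, inferred from the example "phs000879.v1.p1" only:
# "phs" + six digits, then ".v<number>.p<number>".
DBGAP_STUDY = re.compile(r"\bphs\d{6}\.v\d+\.p\d+\b")

# Hypothetical sentence wrapped around the example accession from the notes.
sample = 'Data are available from dbGaP under study accession "phs000879.v1.p1".'
print(DBGAP_STUDY.findall(sample))
```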
Jee-Hyub K Remove the ^ ... $ anchors from the regular expressions.
With the anchors, a pattern only matches when the identifier is the whole string. If you remove ^ and $, you will get matches inside running text (it worked with ArrayExpress and RefSeq).
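A minimal Python sketch of the difference, using the ENA and ArrayExpress patterns quoted above. The sample sentence and the identifiers in it are made up for illustration, and replacing `^ ... $` with `\b` word boundaries is a suggestion of this sketch, not something the notes prescribe.

```python
import re

# Patterns from the notes, anchored as published (they match a whole string only).
ANCHORED = {
    "arrayexpress": r"^[AEP]-\w{4}-\d+$",
    "ena": r"^[A-Z]+[0-9]+$",
}

def unanchor(pattern):
    """Swap the leading ^ and trailing $ for word boundaries (an assumption of this sketch)."""
    return r"\b" + pattern.lstrip("^").rstrip("$") + r"\b"

# Hypothetical sentence; the identifiers are illustrative, not real citations.
text = "Arrays were deposited in ArrayExpress (E-MEXP-1234) and reads in ENA (ERP000123)."

for name, pattern in ANCHORED.items():
    anchored_hits = re.findall(pattern, text)          # nothing: text is not one bare ID
    free_hits = re.findall(unanchor(pattern), text)    # IDs found inside the sentence
    print(name, anchored_hits, free_hits)
```

Note that the unanchored ENA pattern is very loose: any run of capital letters followed by digits (gene names such as CLDN4, for instance) will also match, so hits need to be reviewed.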
Peter M Run ami-regex [1]
Assuming the directory is called "genome2010", we create a file genome2010/regex.xml containing:
<compoundRegex title="genome">
<regex weight="1.0" fields="genome">([Gg]enome)</regex>
<regex weight="1.0" fields="data">([Dd]ata)</regex>
</compoundRegex>
then run
ami2-regex -q genome2010 --r.regex genome2010/regex.xml
Typical output
<results title="genome"><result pre="e of 10 primary tumors and 10 effusions was analyzed using the Array-Ready Oligo set for the Human " name0="genome" value0="Genome" post="platform. Results for selected genes were validated using PCR, Western blotting, and immunohistoche" xpath="/html[1]/body[1]/div[1]/div[2]/p[1]"/>
<result pre="otting, and immunohistochemistry confirmed the array findings for BCAR1, CLDN4, VIL2, and DCN. Our " name0="data" value0="data" post="show that breast carcinoma cells in primary carcinomas and effusions have different gene expression" xpath="/html[1]/body[1]/div[1]/div[2]/p[1]"/>
<result pre="l, and their clinical relevance was analyzed in a larger series of breast carcinoma effusions. Our " name0="data" value0="data" post="demonstrate that in agreement with our previous observations, breast carcinoma cells in effusions a" xpath="/html[1]/body[1]/div[1]/div[3]/p[3]"/>
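To get a quick tally from output like this, the results can be parsed with Python's standard library. This is a sketch only: the `sample` string is a shortened stand-in for the output above (with a closing `</results>` added so that it parses), and reading the real output file is left out because its path is not given in these notes.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Shortened stand-in for the ami2-regex output; the closing tag is added here.
sample = """<results title="genome">
<result pre="Oligo set for the Human " name0="genome" value0="Genome" post="platform." xpath="/html[1]/body[1]/div[1]/div[2]/p[1]"/>
<result pre="and DCN. Our " name0="data" value0="data" post="show that" xpath="/html[1]/body[1]/div[1]/div[2]/p[1]"/>
<result pre="effusions. Our " name0="data" value0="data" post="demonstrate that" xpath="/html[1]/body[1]/div[1]/div[3]/p[3]"/>
</results>"""

root = ET.fromstring(sample)
# Count hits per regex field, which the output stores in the name0 attribute.
counts = Counter(result.get("name0") for result in root.iter("result"))
print(counts)
```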
Now create your own file, ids.xml, for example:
<compoundRegex title="ids">
<regex weight="1.0" fields="egad">(EGAD\d{11})</regex>
</compoundRegex>
Fiona N • Challenge 1: Can we find data accession numbers?
Jee-Hyub K
  • Europe PubMed Central provides web services for mined accession numbers.
  • Accession number search
Fiona N • Challenge 2: Can we find data DOIs?
• Challenge X: …?
• Interrupted by coffee breaks and lunch :)
Open Access scientific journals in biology that can be readily mined: 
  • PLoS
  • BioMed Central
  • eLife
Look at the very recent PMC release of papers in text-minable form (about 2 days old)
Justin C Hackday with ContentMine/DNAdigest: Eventbrite description
  • Eventbrite Description (draft)
From the volunteers and organizers of ContentMine and DNAdigest comes a new hackday!
ContentMine is an open-source project, funded by the Shuttleworth Foundation, that extracts facts from scientific literature in machine readable form. 
DNAdigest is a charity that promotes efficient sharing of genomics data to advance scientific research.
We think that there's a lot we can do together.  For one thing, we want to see if we can use the tools of ContentMine to automatically extract references to genomic data sets from new journal publications.  Then we can put them into a big machine-readable database that can be searched by interested humans and machines alike.  We'd love your expertise and help with that experiment :)
But we also want to explore with you any other areas where automatic fact extraction could be really useful.  Perhaps you work in a field where you have to manually transcribe data from graphs?  Or perhaps you'd like to scan open-access journals that are only available via the web for a particular word or phrase?  ContentMine has tools to do all these things and more.
So if you're interested in machine-mining journals, whether that's for genomics dataset DOIs, your own research or just out of general interest, please come along and join us!  In-depth programming knowledge is not essential - ContentMine is designed to make mining easy and we'll have ContentMine mentors there on the day to help you get familiar with the system.
  • Notes
