If possible, it would be interesting to collect contextual information (e.g., the words preceding and following each data DOI) to find out how the data are used. This context could then be used to classify whether a DOI is used as a data citation.
A section tagger could also provide contextual information, such as the section in which a DOI appears.
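To make the suggestion about surrounding words concrete, here is a minimal sketch of how a context window around each DOI might be collected. The names `DOI_PATTERN` and `doi_contexts`, the window size, and the whitespace tokenization are all assumptions made for illustration, not a description of your pipeline; the regular expression is a common heuristic for DOI-like strings and will not match every valid DOI.

```python
import re

# Assumed heuristic for DOI-like strings; not an exhaustive DOI grammar.
DOI_PATTERN = re.compile(r"10\.\d{4,9}/[^\s\"<>]+")


def doi_contexts(text, window=5):
    """Return (doi, words_before, words_after) for each DOI-like token.

    Tokenization is plain whitespace splitting, which is an assumption;
    a real pipeline would likely use its own tokenizer.
    """
    tokens = text.split()
    results = []
    for i, token in enumerate(tokens):
        match = DOI_PATTERN.search(token)
        if match is None:
            continue
        # Drop trailing punctuation that is usually not part of the DOI.
        doi = match.group(0).rstrip(".,;:)")
        before = tokens[max(0, i - window):i]
        after = tokens[i + 1:i + 1 + window]
        results.append((doi, before, after))
    return results


# Made-up sentence and DOI, used only to show the shape of the output.
sample = ("The raw reads are available from Dryad "
          "(doi:10.5061/dryad.abc123) under a CC0 license.")
contexts = doi_contexts(sample, window=3)
```

The resulting word lists could serve as features for a classifier that decides whether a DOI is used as a data citation, possibly combined with the section label from a section tagger.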
Are you also mining accession numbers? If so, which types would be on the list?