Database queries established that public patent databases now contain more than one million protein sequences extracted from patent publications submitted to the European (EPO), Japanese (JPO) and United States (USPTO) patent offices. However, no analysis of these data has previously been published. This thesis investigates the composition of these sequences and includes an assessment of their potential value as an expanding sequence resource. Queries developed to interrogate these databases made it possible to establish growth rates, redundancy, significant applicants, size distribution, species, genomic coverage and drug target distribution. Assessments of target family distribution and total sequence content indicated that, while there is still a focus on potential drug targets, the recent publication of large sequence filings has extended patent database coverage to include most of the human genome. Overall, the results suggest that the scientific value of the patent sequences as a source of independent, experimentally determined sequences is diminishing. However, some value was extracted by developing an analysis protocol and filtering strategies to discover those patented proteins that independently confirmed novel gene predictions. By this process, verification data were found for 15 unknown human proteins that were absent from UniProt.