OBS: From simple patent sequence search to variant analysis
DNA and RNA sequences, as well as proteins, have been disclosed in patents since the 1960s, and a few even earlier. Many laws have been created and modified over time to allow different types of biological material to be patented, such as naturally occurring sequences, modified sequences, sequences used in diagnostics, sequences from plants and many other types. We have recently seen that vaccines are a hot topic and some, such as RNA vaccines, do include sequences. Some of the industries that publish sequences may seem surprising: the food industry and detergent manufacturers, for instance. Unsurprisingly, pharmaceutical, biotech, agrochemical and seed companies produce the bulk of sequence patents. So, why is patent sequence searching important, and why is it different from other types of patent searching?
Patent Sequence Data
Starting in the 1990s with the Human Genome Project, genomic and mRNA sequences became more common in patents. In some cases, whole genomes (from bacteria or fungi), which can be made up of millions of base pairs, were published. Private companies disclosed and, in some cases, claimed millions of short sequences. All this happened while patents were still filed purely on paper. Electronic filing of patents and supplemental materials such as sequence listings finally became available, in part due to sequence patents. Since then, we have seen an increase in the number of patents with sequences, and despite the massive rise in Chinese patents, the number of newly published patents with sequences worldwide still follows a linear trend.
Historical trend of newly published sequence patents from 2000 to 2020 available in Orbit BioSequence.
The historical big three authorities (USPTO, EPO and WIPO) publish their sequences. Some other authorities, such as JPO, KIPO and CIPO, are also very compliant. Others are less systematic or stuck in the past, unfortunately. But even for highly compliant authorities, rules and laws on what sequences should be disclosed vary. It is thus highly recommended to have a family view of your patents, since the sequences in an EPO document might differ from those in the USPTO or WIPO documents of the same family.
Why is patent sequence search different?
Traditional IP searching is done with keywords. Since searching with keywords is imperfect, they are often combined with patent classes, synonym lists, and many other features that, basically, attempt to alleviate the pain caused by the lack of accuracy of keywords.
Biological sequence searches are different for several reasons. First, there is a common language to describe DNA/RNA and amino acid sequences, entirely independent from the native language the patent is written in. So, no need for natural language translation.
Second, since sequences can be very long, several publication standards have existed over time to treat them separately in a sequence listing. Thus, a large majority of published sequences are simple to process electronically. This can be contrasted with chemistry, where images are still an acceptable form of publication.
Third, unless your sequence is very short, you will always want to find sequences similar to yours, not just identical ones. This is particularly important because a similarity search absorbs small errors (OCR mistakes, publisher errors). By contrast, if you searched for the keywords “bread yeast”, you would not find “bead yeast”, even if the latter were just a spelling mistake.
Fourth, for the last 20 years or so, sequences published in a patent have been numbered and referred to by the label SEQ ID NO. In most cases it is easy to know whether, say, hit sequence 5 is claimed, since the claims section refers to it by its number as SEQ ID NO. 5. This feature is unique to sequences and critically important: it allows us, for example, to flag individual sequence occurrences as (claimed).
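To make the idea concrete, here is a minimal Python sketch of how SEQ ID NO references could be picked out of a claims text. It is an illustration only, not how Orbit BioSequence does it: the function name, the regular expression and the sample claims are all assumptions made for this example.

```python
import re
from typing import Set

# Assumed pattern for illustration: matches "SEQ ID NO. 5", "SEQ ID NO: 5",
# "SEQ ID NO 5", "SEQ. ID. NO. 5" and the plural "SEQ ID NOs: 5".
# It only captures the number that directly follows each mention.
SEQ_ID_PATTERN = re.compile(
    r"SEQ\.?\s*ID\.?\s*NOS?\.?\s*:?\s*(\d+)", re.IGNORECASE
)

def claimed_seq_ids(claims_text: str) -> Set[int]:
    """Return the set of sequence numbers mentioned in a claims text."""
    return {int(number) for number in SEQ_ID_PATTERN.findall(claims_text)}

# Hypothetical claims text, invented for this example.
claims = (
    "1. An isolated polypeptide comprising SEQ ID NO. 5. "
    "2. The polypeptide of claim 1, further comprising SEQ ID NO: 12."
)
claimed = claimed_seq_ids(claims)
print(5 in claimed)  # is hit sequence 5 mentioned in the claims?
```

A real implementation would also have to handle lists and ranges (for example "SEQ ID NOs: 5, 7 and 9" or "SEQ ID NOs: 1-3"), which this sketch deliberately ignores.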
A query sequence aligned to a patent sequence that appears in three USPTO documents, flagged as claimed or not, in Orbit BioSequence.
Alignments and algorithms
Patent sequence searching consists of aligning your query sequence to sequences in a database using specific algorithms and parameters. This is all fairly complicated, but it can be reduced to a few use cases. At the risk of oversimplifying, either you use a long gene sequence to find similar sequences, or you use a short sequence. In the former case, everything works well; just make sure to compare your gene against both nucleotide and protein databases, since you don’t know beforehand what a patent could claim or disclose. In the latter case, things are a bit more difficult. Do you want to find only sequences that match your query perfectly, or do you permit a few mismatches? Do you want to allow gaps? If you use 3 or 6 antibody CDRs, do you want them all aligned to other CDRs, or to heavy or light chains? All those questions might lead to different algorithms and parameters. But don’t worry, we have extensive documentation and a great helpdesk. Complex problems don’t always lead to simple solutions!
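For the short-sequence case, the "perfect match or a few mismatches, no gaps" question can be sketched in a few lines of Python. This is a naive sliding-window comparison written for illustration, not the algorithm used by Orbit BioSequence or any production search engine; the function name and the sample sequences are assumptions.

```python
from typing import List, Tuple

def find_with_mismatches(
    query: str, subject: str, max_mismatches: int = 0
) -> List[Tuple[int, int]]:
    """Slide the query along the subject without gaps.

    Returns (start, mismatches) for every window of the subject that
    differs from the query at no more than max_mismatches positions.
    Start positions are 0-based.
    """
    query, subject = query.upper(), subject.upper()
    hits = []
    for start in range(len(subject) - len(query) + 1):
        window = subject[start:start + len(query)]
        mismatches = sum(1 for q, s in zip(query, window) if q != s)
        if mismatches <= max_mismatches:
            hits.append((start, mismatches))
    return hits

# Invented sequences for this example.
subject = "TTACGTTTACCTAA"
exact_hits = find_with_mismatches("ACGT", subject)                    # perfect matches only
fuzzy_hits = find_with_mismatches("ACGT", subject, max_mismatches=1)  # allow one mismatch
print(exact_hits, fuzzy_hits)
```

Allowing one mismatch returns an extra hit that the exact search misses. Allowing gaps as well requires a proper alignment algorithm rather than a sliding window, which is one reason the choice of algorithm and parameters depends on the question you are asking.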
Pairwise vs. Variant multiple sequence alignment
In the end, you will look at pairwise alignments, in other words, your query sequence aligned to a patent sequence. This will give you intricate details of the differences between the two sequences and, combined with the available patent information, will help you decide if this alignment is relevant to your FTO, patentability, etc. There can be a lot of sequences in the same patent family, and many families. You will need to browse through many of them, but we can help with filters that narrow the results down to the most relevant alignments and families. However, pairwise alignments alone do not give you a global view. How many patent sequences have a lysine at position 34 of your query? Answering that kind of question takes a variant analysis.
A variant analysis stacks all patent sequences aligned to your query and gives you a global view for each query position. In other words, it creates a multiple alignment based on your query sequence. You can query, modify, and export the dataset and, most importantly, explore the variations to gain new insight into what your competitors are doing, or which regions are never modified, for instance.
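The stacking idea can be sketched as follows, assuming each pairwise alignment is available as a pair of gapped strings plus the query position where it starts. This is a simplified illustration and not the Orbit BioSequence implementation: the input format, the function name and the toy sequences are assumptions, and insertions relative to the query are simply skipped.

```python
from collections import Counter
from typing import Dict, List, Tuple

# Assumed input format: (query_start, aligned_query, aligned_subject),
# with query_start 1-based and "-" marking a gap in either string.
Alignment = Tuple[int, str, str]

def stack_on_query(query_length: int, alignments: List[Alignment]) -> Dict[int, Counter]:
    """Count, for each query position, the residues aligned to it."""
    counts = {pos: Counter() for pos in range(1, query_length + 1)}
    for query_start, aligned_query, aligned_subject in alignments:
        pos = query_start
        for q, s in zip(aligned_query, aligned_subject):
            if q == "-":
                continue  # insertion relative to the query: ignored in this sketch
            counts[pos][s] += 1  # s is "-" when the hit has a deletion here
            pos += 1
    return counts

# Toy alignments against the invented 5-residue query "MKTAY".
alignments = [
    (1, "MKTAY", "MKSAY"),    # substitution at position 3
    (1, "MK-TAY", "MKGTAF"),  # insertion after position 2, substitution at 5
    (2, "KTAY", "K-AY"),      # partial hit with a deletion at position 3
]
profile = stack_on_query(5, alignments)
print(profile[2]["K"])  # how many hits have a lysine at position 2?
```

With a table like `profile`, questions such as "how many patent sequences have a lysine at position 34 of my query?" become a simple lookup, and positions where only one residue ever appears stand out as regions that are never modified.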
Variations at several positions using Orbit BioSequence Variant Analysis
Orbit BioSequence (OBS)
With extensive access to patent sequences as well as non-patent sequences, Orbit BioSequence is the perfect tool for your FTO, patentability and business intelligence searches. By easily combining patent data and sequences, OBS makes your patent sequence searches a lot easier than tools dedicated purely to sequences. Antibodies and CDRs, genes, and primers can all be used, combined, and explored.
Interested in finding out more? Contact us for specific advice or support, or watch the recording of our recent webinar, "Smart & visual sequence variations explorer in patent data", by Orbit BioSequence.