Patsnap: Supercharging FTO Search with Degenerate Sequence Searching

July 10, 2023

|By

LONDON, July 10, 2023 /PRNewswire/ — Patsnap’s biological sequence database (Bio) statistics show that the occurrence of such special sequences in global patent literature is not insignificant. There are approximately 7.4 million nucleotide sequences, accounting for 7.12% of the total number of nucleotides, and 1.31 million protein sequences, accounting for 7.55%. This indicates a significant number of generic sequences that can affect search results due to the presence of special symbols, posing substantial risks for FTO analyses. 

Patsnap’s Solution:

Therefore, to mitigate the risk of overlooking these critical sequences, Patsnap’s Algorithm Engineering Team has developed a deep learning model using in-house NLP, CV, entity recognition, and coreference resolution technologies.

This model is designed to identify and parse degenerate sequences and their substitutions in sequence listings and full-text patents, and it established a Degenerate Sequence Searching Database as part of our Bio Professional package.

Using a specialized sequence alignment algorithm, this database not only enables the retrieval of such sequences but also provides a true similarity score. Therefore, by performing searches within the degenerate sequence database, we can effectively mitigate the risk of inadvertently overlooking crucial information during freedom to operate (FTO) and novelty searches.

Given the potential scale of variations in degenerate sequences, which can reach the tens of billions, traditional sequence alignment algorithms fail to meet the real-time retrieval demands. Patsnap tackles this challenge by employing a deeply customized sequence alignment algorithm that dynamically loads substitution information for degenerate sequences during the retrieval process, ensuring precise retrieval within reasonable time frames.

During the scanning phase, Patsnap introduces a compression algorithm to construct a seed word table for heuristic searches, significantly reducing unnecessary comparisons and improving retrieval efficiency. When aligning query sequences with target sequences, Patsnap’s proprietary algorithm incorporates degenerate substitution information, resulting in more accurate alignment and query results, as well as more intuitive and visually appealing alignment outcomes for different variants of the query sequence and target sequence.

Biological sequences form the bedrock of innovation in biotechnology, with countless advancements revolving around these sequences. However, the unique nature of biological sequences poses a challenge for conventional keyword-based information retrieval methods, often leading to the oversight of crucial information and potential risks.

The sequences presented in patent claims encompass a wide range of variations, not only describing the sequences themselves but also requiring a specific level of homology. As a result, researchers heavily rely on homology sequence alignment algorithms to explore sequence databases, using predefined homology thresholds to ensure comprehensive results. This approach is widely employed in current biological sequence database searches.

Nevertheless, a pressing question remains: can these similar sequence searches genuinely identify all potential target sequences? While these methods have proven effective, their ability to capture every relevant sequence warrants further examination. It is crucial to explore the limitations of current search methodologies and strive for enhanced approaches that leave no potential target sequence undiscovered. 

Special Sequences in Patents 

Combining similar sequence searches with keyword based results aggregation significantly reduces the risk of overlooking crucial information and FTO issues.

However, sequences in patents differ from those found in other biological databases as they exhibit many “patent-specific” characteristics. To expand the scope of patent protection and create search barriers for competitors, patent drafters often employ a description method similar to the “Markush structure” used in chemistry. By introducing degenerate symbols, wildcards, operators, and other information between positions in the parent sequence, and describing the specific parameters of these symbols through explanatory documents, we refer to them as “Degenerate Sequences.”

The image below illustrates a degenerate sequence described in patent claims: 

25. The library of any one of claims 1-24, wherein the polypeptide comprises an amino acid sequence according to Formula (III):

EVGSYX₁X₂X₃X₄X₅X₆CX₇X₈X₉X₁₀X₁₁X₁₂CX₁₃X₁4SGRSAGGGGTENLYFQGSGGS (SEQ ID NO: 3), wherein X1 is A,D,I,N,P,or Y,x2is A,F,N,S,or V,X3 is A,H,L,P,S,V,or Y,X4 is A,H,S,or Y,X5 is A,D,P,S,V,or Y,X6 is A,D,L,S,or Y, X7is D,P,or V, X8 is A,D,

H,P,S,orT,X9is A,D,F,H,P,or Y,X10 is L,P,or Y,Xl1is F,P,or Y,X12 is A,P,S, or Y, X13 is A,D,N,S,T, or Y,and X14 is A,S, or Y.

26. The library of claim 25, wherein each of the polynucleotides in the library encodes a polypeptide comprising an amino acid sequence according to Formula (III).

27. The library of any one of claims 1-26, wherein the polypeptide comprises an amino acid sequence selected from the group consisting of SEQ ID NOS: 25-46.

28. The library of any one of claims 1-27, wherein the TBM comprises an antibody light chain variable region.

29. The library of claim 28, wherein the polypeptide further comprises a heavy chain variable region C-terminal to the light chain variable region.

Degenerate sequences themselves do not possess any biological significance; they solely serve the purpose of the patent. However, when combined with the description of the homology range, such an approach not only comprehensively protects innovative achievements but also becomes a “decisive blow” against the current conventional sequence homology search methods.  Let’s take a look at an example below.

Query sequence:

“EVGSYPAPSDACPSDYFYCDASGRSAGGGGTENLYFQGSGGS” 

Target sequence: 

“EVGSYXXXXXXCXXXXXXCXXSGRSAGGGG TENLYFQGSG GS” 

The similarity score obtained from the BLAST algorithm is only 67%, but the actual similarity is 100%. 

This happens because conventional sequence homology alignment algorithms do not consider scenarios involving degenerate sequences during their initial development. Therefore, without special processing, excluding degenerate sequences would lead to two situations when using conventional algorithms: 

1) Inability to search for the sequence.

2) Exclusion of sequences due to similarity scores falling below the threshold. 

Both scenarios pose significant challenges for sequence searchers, as they not only impede the comparison of sequences with patent claims but also increase the likelihood of overlooking critical sequence information. 

Experience Degenerate Sequence Searching Now

In June of 2023, Patsnap’s biological sequence Bio database introduced a powerful degenerate sequence search feature, causing a paradigm shift in the patent domain. This disruptive advancement provides researchers with an immensely robust tool that offers an extensive collection of degenerate sequences, allowing users to effortlessly obtain the most accurate and relevant information in their searches.

To schedule a demo or learn more, visit patsnap.com/solutions/bio.

About Patsnap: Founded in 2007, Patsnap is the company behind the world’s leading AI-powered innovation intelligence platform. Patsnap provides global businesses with a connected, easy-to-use platform that helps them make better decisions in the innovation process. Customers are innovators across multiple industry sectors, including agriculture and chemicals, consumer goods, food and beverage, life sciences, automotive, oil and gas, professional services, aviation and aerospace, and education.