Desiree Sainthrope stands at the intersection of traditional legal scholarship and the fast-paced world of legal technology. With her deep background in drafting complex trade agreements and navigating the labyrinth of global compliance, she has witnessed firsthand how data accessibility and artificial intelligence are dismantling the monopolies that once defined legal research. As an authority on the evolving implications of AI in the legal field, she offers a unique perspective on why the sector is currently experiencing what many call a “Cambrian explosion” of new startups. Today, she shares her insights on the technological shifts and historical initiatives that have made this new era of legal innovation possible.
Building comprehensive case law databases once required thousands of workers for manual tagging and cross-referencing. How has agentic AI specifically automated the identification of case treatments and citations, and what are the remaining engineering hurdles for a small team trying to match legacy accuracy?
In the traditional landscape of legal tech, the sheer scale of human labor required was staggering, with legacy players employing thousands of people just to collect data, label overruled cases, and manage citation linking. Today, the game has changed entirely because we can deploy “AI editors” that function as tireless digital laborers, running 24/7 to analyze every single citation and determine case treatments with remarkable precision. For a modern startup, this means agentic coding allows a tiny team, sometimes as small as nine people or even just a husband-and-wife duo, to build hundreds of specialized web scrapers that can navigate different court websites and varying data structures. The engineering hurdles remain significant, however, because matching the accuracy of a legacy giant isn’t just about the AI; it takes elite engineering to keep the scrapers from breaking and the data pristine. Even with these tools, it took one modern startup three full years to build a database covering all federal and state appellate courts, proof that while AI lowers the floor, the ceiling for professional-grade accuracy is still incredibly high.
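To make the scraping point concrete, here is a minimal Python sketch of what one of those per-court scrapers might look like. The court URL, table markup, and field names below are invented for illustration; a real system would need hundreds of such scripts, each tuned to a particular court’s site and monitored so it doesn’t silently break when the markup changes.

```python
# Minimal sketch of one per-court opinion scraper. The site, table layout,
# and fields are hypothetical; every real court portal differs.
import requests
from bs4 import BeautifulSoup
from dataclasses import dataclass


@dataclass
class Opinion:
    case_name: str
    docket_number: str
    decided: str
    pdf_url: str


def scrape_recent_opinions(index_url: str) -> list[Opinion]:
    """Fetch the opinion index page and pull one record per table row."""
    resp = requests.get(index_url, timeout=30)
    resp.raise_for_status()
    soup = BeautifulSoup(resp.text, "html.parser")

    opinions = []
    # Assumed structure: each opinion is a <tr> holding case name (with a
    # link to the PDF), docket number, and decision date. Production code
    # needs per-court selectors plus alerting for markup changes.
    for row in soup.select("table#opinions tr"):
        cells = row.find_all("td")
        if len(cells) < 3:
            continue  # skip header or malformed rows
        link = cells[0].find("a")
        opinions.append(Opinion(
            case_name=cells[0].get_text(strip=True),
            docket_number=cells[1].get_text(strip=True),
            decided=cells[2].get_text(strip=True),
            pdf_url=link.get("href", "") if link else "",
        ))
    return opinions


if __name__ == "__main__":
    for op in scrape_recent_opinions("https://example-appellate-court.example/opinions"):
        print(op.docket_number, op.case_name, op.decided)
```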
Open-access initiatives have released tens of millions of digitized pages dating back to the pre-constitutional era. How do startups integrate these massive historical datasets with modern filings, and what specific “cleaning” steps are necessary to ensure this data is reliable for professional litigation?
The integration of historical data is largely fueled by massive open-access projects, such as the Harvard Law School Caselaw Access Project, which released digital scans of 40 million pages of court decisions covering some 40,000 volumes. These records are vital because they reach back to the pre-constitutional era, providing a foundational depth that was previously locked behind the paywalls of companies like LexisNexis or Westlaw. To make this data reliable for modern litigation, startups must perform what we call “cleaning” and enhancement, a process where nonprofits like the Free Law Project have already made over a million individual improvements to the raw data. This involves merging different datasets, correcting OCR errors from old book scans, and standardizing the digital format so that a case from 1780 can be cross-referenced against a filing from 2024. This rigorous refinement process ensures that when an attorney relies on a historical precedent, the digital version is as authoritative and accurate as the physical volume it was scanned from.
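As a rough illustration of that cleaning work, the sketch below normalizes reporter citations and merges a scanned historical record with a modern one so both can be joined on the same key. The regular expression, OCR correction table, and field names are simplified assumptions, not any project’s actual pipeline.

```python
# Illustrative "cleaning" step: canonicalize citations from OCR'd scans so
# old and new records can be cross-referenced. All rules here are toy examples.
import re

# Common OCR confusions in scans of old reporter volumes (assumed examples).
OCR_FIXES = {"O": "0", "l": "1"}

CITATION_RE = re.compile(r"(?P<vol>\w+)\s+(?P<reporter>[A-Za-z.\s]+?)\s+(?P<page>\w+)")


def normalize_citation(raw: str) -> str | None:
    """Return a canonical 'volume Reporter page' string, or None if unparseable."""
    match = CITATION_RE.search(raw)
    if not match:
        return None
    vol, reporter, page = match.group("vol"), match.group("reporter"), match.group("page")
    # Apply OCR fixes only to the numeric volume and page fields.
    for wrong, right in OCR_FIXES.items():
        vol = vol.replace(wrong, right)
        page = page.replace(wrong, right)
    reporter = " ".join(reporter.split())  # collapse stray whitespace from scans
    return f"{vol} {reporter} {page}"


def merge_records(scanned: dict[str, dict], modern: dict[str, dict]) -> dict[str, dict]:
    """Merge two datasets keyed by normalized citation, preferring modern metadata."""
    merged = dict(scanned)
    for cite, record in modern.items():
        merged[cite] = {**merged.get(cite, {}), **record}
    return merged


if __name__ == "__main__":
    print(normalize_citation("2 Dall. 4O9"))  # -> "2 Dall. 409"
```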
Startups often choose between licensing data from established entities or building proprietary scrapers from scratch. What are the long-term risks of relying on third-party licenses during market consolidations, and how does owning a database change a company’s valuation and operational independence?
The decision to license data or to build a proprietary database is perhaps the most critical strategic choice a legal tech founder can make, because relying on third-party licenses creates a precarious dependency. We saw a vivid example of this risk when the startup Alexi, which relied on data licensed from Fastcase, faced massive turmoil after Fastcase merged with vLex and was subsequently acquired by Clio. That market consolidation led to litigation that forced Alexi to cut its headcount from 60 people down to just 17, illustrating how a shift in corporate ownership can suddenly threaten a startup’s core functionality. By contrast, owning a proprietary database gives a company total operational independence and a significantly higher valuation, because it is no longer vulnerable to the whims or legal disputes of a data provider. While building a database from scratch is an arduous journey, it transforms the company from a mere software layer into a robust infrastructure provider that controls its own destiny in an increasingly consolidated market.
Most emerging platforms focus on federal and state appellate courts while excluding trial court rulings. What is the strategic logic behind omitting trial-level data, and how does this omission affect the depth of insights available to attorneys handling niche or emerging legal issues?
The strategic logic behind focusing on appellate courts is rooted in the concept of legal precedent: trial court rulings generally do not set the binding precedent that forms the backbone of American law. For a lean startup, omitting trial-level data is a pragmatic way to manage resources, as collecting every trial ruling is an exponentially larger and more chaotic task than gathering appellate decisions. However, this omission creates a notable gap for attorneys, particularly those working on niche or emerging legal issues where there may not yet be a wealth of appellate guidance. Without trial court data, lawyers lose the ability to see how specific judges are ruling on new types of motions or how similar cases are being resolved at the ground level before they ever reach an appeal. While this focus allows startups to launch more quickly, it means the depth of “boots on the ground” legal intelligence is often sacrificed for the sake of covering the more influential higher court rulings.
Recent rulings have affirmed that official state code annotations are not copyrightable, while court modernization has simplified web scraping. How do these legal and technical shifts lower the barrier to entry, and what specific steps should a new player take to navigate remaining paywalls like PACER?
The landscape of legal data changed forever in 2020 when the U.S. Supreme Court ruled that annotations in official state codes cannot be copyrighted, effectively declaring that the law belongs to the public rather than private publishers. That ruling, combined with a $125 million settlement over PACER overcharges, has significantly lowered the financial and legal barriers that once kept new players out of the market. For a new startup, navigating the remaining hurdles requires a combination of aggressive web scraping of modernized court portals and leveraging the commercial licenses offered by organizations like the Free Law Project. The startup must also stay abreast of ongoing litigation and settlements that are slowly dismantling the paywalls of systems like PACER, making it cheaper and more legally defensible to aggregate federal filings. By taking advantage of these shifts, a new player can now assemble a competitive database that would have cost tens of millions of dollars to acquire just a decade ago.
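As one hedged example of the licensing route, the snippet below shows how a new entrant might pull already-cleaned opinions from the Free Law Project’s CourtListener REST API instead of scraping a court site directly. The endpoint path, filter names, and response fields are assumptions for illustration; check the current API documentation and license terms before building on them.

```python
# Rough sketch of fetching opinions from CourtListener's REST API.
# Endpoint version, query parameters, and response fields are assumed here,
# not confirmed against the live API.
import os
import requests

API_ROOT = "https://www.courtlistener.com/api/rest/v4"  # assumed current version


def fetch_opinions(court: str, page_size: int = 20) -> list[dict]:
    """Fetch one page of opinions for a given court identifier."""
    headers = {}
    token = os.environ.get("COURTLISTENER_TOKEN")  # hypothetical env var for an API token
    if token:
        headers["Authorization"] = f"Token {token}"
    resp = requests.get(
        f"{API_ROOT}/opinions/",
        params={"cluster__docket__court": court, "page_size": page_size},  # assumed filter name
        headers=headers,
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json().get("results", [])


if __name__ == "__main__":
    for opinion in fetch_opinions("scotus"):
        print(opinion.get("id"), opinion.get("absolute_url"))
```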
Early innovators often shared data with potential competitors to foster market growth and accessibility. How has this culture of data sharing evolved in the current era of generative AI, and what role do consultants with “database-building muscles” play in the rapid launch of new services?
The early days of the legal research “Cambrian explosion” were defined by a unique spirit of collaboration, where pioneers like Ed Walters and Phil Rosenthal at Fastcase performed what many call “the Lord’s work” by sharing their hard-earned data with startups that were technically their competitors. In the current generative AI era, this culture has evolved from direct data sharing to a “dissipation of knowledge,” where the specific skills required to build a massive legal database are now more widely available in the labor market. We now see a specialized class of consultants who have “exercised the muscle” of database building at previous companies and can be hired to help new services launch with incredible speed. This “rinse and repeat” process means that the institutional knowledge once guarded by a few legacy giants has leaked out, allowing new founders to bypass years of trial and error by hiring experts who have already mastered the art of large-scale legal data ingestion.
What is your forecast for the legal research market?
I anticipate that the gap between legacy providers and these agile new startups will close almost entirely within the next few years, as initiatives to digitize commercially published volumes of opinions become comprehensive. We are moving toward a market where the “raw material” of the law—the cases and statutes themselves—is a commodity, and the real value will lie in the sophisticated AI layers that can provide deep, predictive insights. I expect to see even more specialized, niche research tools that cater to specific practice areas, driven by the fact that building a proprietary, high-quality database is no longer a feat reserved for billion-dollar corporations. Ultimately, this will lead to a more democratic legal system where affordable, high-end research tools are available to every attorney, not just those at the largest firms.
