Groundbreaking DocILE dataset and benchmark to elevate what’s possible with AI-enabled data extraction
Rossum, the pioneer in cloud-native Intelligent Document Processing (IDP), announced that it has published the world’s largest research dataset to accelerate scientific progress in business document information extraction (IE). Large datasets are crucial to improving and measuring how AI models perform, which is why the DocILE (Document Information Localization and Extraction) benchmark is so important. It is the world’s largest collection of business documents for the most practical information extraction tasks in IDP. Rossum’s R&D efforts continue to focus on delivering faster and more accurate document information extraction methods, so customers can minimize slow, tedious, and error-prone manual document processing.
“This is an important milestone because it advances IDP research as a whole, where everyone can now develop and test more advanced algorithms on a benchmark of challenging and highly practical tasks,” said Milan Šulc, Ph.D., Head of Rossum’s AI Labs. “The new dataset will increase accuracy levels in document information extraction by accelerating research in areas such as novel machine learning architectures and training objectives. This will ultimately lead to global optimization of business communication and workflows, further increasing the amount of time saved for our customers.”
Datasets and benchmarks for business document IE are very rare because such documents often contain sensitive information and are legally protected. DocILE addresses this issue by building a benchmark composed of documents from two public data sources: the UCSF Industry Documents Library and Public Inspection Files (PIF). The dataset consists of more than a hundred thousand documents, both real and synthetically generated (6,700 annotated business documents and 100,000 synthetically generated documents), with labels for practical IE tasks. Additionally, it comes with a large dataset of approximately a million unlabeled documents that can be used for unsupervised learning.
The DocILE benchmark was created through a collaboration of researchers from Rossum, Czech Technical University in Prague, University of La Rochelle, and the Autonomous University of Barcelona. It follows the peer-reviewed position paper Business Document Information Extraction: Towards Practical Benchmarks, presented by Rossum’s AI Labs at the recent CLEF 2022 conference.
The benchmark is hosted as a competition at ICDAR 2023, the largest research conference on document analysis, and as a CLEF 2023 lab; see the lab teaser (arXiv preprint, accepted to ECIR 2023). Rossum sponsors the competition with a prize pool of $9,000 to attract open-source contributions. To find out more about the dataset, download the detailed dataset paper (arXiv preprint). By utilizing real-world business documents, the research community can focus on advances that will have a large impact on how businesses operate globally.
While Rossum continues to lead the IDP market with its AI and machine learning capabilities, this technology is rapidly evolving. It is paramount that any company focused on AI consistently research its next technological expansions. Utilizing the new dataset will enable ongoing innovation within the IDP field.