Rossum Publishes World’s Largest Research Dataset and Benchmark to Accelerate Scientific Progress in Intelligent Document Processing (IDP)

STS News Desk

February 27, 2023

***Groundbreaking DocILE dataset and benchmark to elevate what’s possible with AI-enabled data extraction***

Rossum, the pioneer in cloud-native Intelligent Document Processing (IDP), announced that it has published the world’s largest research dataset to accelerate scientific progress in business document information extraction (IE). Large datasets are crucial to improving and measuring how AI models perform, which is why the DocILE (Document Information Localization and Extraction) benchmark is so important. It is the world’s largest collection of business documents for the most practical information extraction tasks in IDP. Rossum’s R&D efforts continue to focus on delivering faster and more accurate document information extraction methods, so customers can minimize slow, tedious, and error-prone manual document processing.

“This is an important milestone because it advances IDP research as a whole, where everyone can now develop and test more advanced algorithms on a benchmark of challenging and highly practical tasks,” said Milan Šulc, Ph.D., Head of Rossum’s AI Labs. “The new dataset will increase accuracy levels in document information extraction by accelerating research in areas such as novel machine learning architectures and training objectives. This will ultimately lead to global optimization of business communication and workflows, further increasing the amount of time saved for our customers.”

Datasets and benchmarks for business document IE are very rare because such documents often contain sensitive information and are legally protected. DocILE addresses this issue by building a benchmark composed of documents from two public data sources: the UCSF Industry Documents Library and Public Inspection Files (PIF). The dataset consists of more than a hundred thousand documents with labels for practical IE tasks: 6,700 annotated real business documents and 100,000 synthetically generated documents. Additionally, it comes with a large dataset of approximately a million unlabeled documents that can be used for unsupervised learning.

The DocILE benchmark was created through a collaboration of researchers from Rossum, Czech Technical University in Prague, University of La Rochelle, and the Autonomous University of Barcelona. It follows the peer-reviewed position paper “Business Document Information Extraction: Towards Practical Benchmarks,” presented by Rossum’s AI Labs at the recent CLEF 2022 conference.

The benchmark is hosted as a competition at ICDAR 2023, the largest research conference on document analysis, and as a CLEF 2023 lab; see the lab teaser (arXiv preprint, accepted to ECIR 2023). Rossum sponsors the competition with a prize pool of $9,000 to attract open-source contributions. To find out more about the dataset, download the detailed dataset paper (arXiv preprint). By using real-world business documents, the research community can focus on advances that will have a large impact on how businesses operate globally.

While Rossum continues to lead the IDP market with its AI and machine learning capabilities, this technology is rapidly evolving. It is paramount that any company focused on AI consistently research its next technological advances. Using the new dataset will enable ongoing innovation within the IDP field.
