MinerU HTML

MinerU-HTML: An SLM-powered HTML main content extractor that outputs clean HTML bodies. Perfect for Deep Research Age...

13 · Emerging · Unknown
License
Apache-2.0
Updated
Today

What it does

MinerU-HTML (Dripper) is an HTML main content extraction tool based on Large Language Models (LLMs). It provides a complete pipeline for extracting the primary content of HTML pages using LLM-based classification and state machine-guided generation.

- 2025.12.1 🎉 The AICC dataset is released. It contains 7.3T web pages extracted and converted to Markdown format by ...
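To make the classification idea concrete, here is a minimal sketch of block-level main content extraction. It is not MinerU-HTML's API: the function names `split_blocks`, `classify_block`, and `extract_main` are hypothetical, the regex splitter only handles a flat fragment, and a crude word-count heuristic stands in for the language model. The state machine-guided generation step is not modeled.

```python
import re

# Tags treated as page chrome in this toy example (an assumption, not the tool's rule set).
CHROME_TAGS = {"nav", "footer", "aside", "header"}


def split_blocks(html: str) -> list[str]:
    """Split a flat HTML fragment into its top-level elements.

    Toy splitter: assumes an element never contains a nested element
    with the same tag name.
    """
    pattern = r"<(\w+)[^>]*>.*?</\1>"
    return [m.group(0) for m in re.finditer(pattern, html, flags=re.S)]


def classify_block(block: str) -> str:
    """Label a block 'main' or 'other'.

    Hypothetical stand-in for the model: chrome tags are always 'other',
    anything else is 'main' only if it has at least five words of text.
    """
    tag = re.match(r"<(\w+)", block).group(1).lower()
    if tag in CHROME_TAGS:
        return "other"
    text = re.sub(r"<[^>]+>", "", block).strip()
    return "main" if len(text.split()) >= 5 else "other"


def extract_main(html: str) -> str:
    """Keep only the blocks labeled 'main', still as HTML."""
    return "\n".join(b for b in split_blocks(html) if classify_block(b) == "main")


page = (
    '<nav><a href="/">Home</a> <a href="/about">About</a></nav>'
    "<h1>Extracting main content from noisy web pages</h1>"
    "<p>Main content extraction keeps the article body and drops page chrome.</p>"
    "<footer>Copyright notice and many other footer links go here</footer>"
)

print(extract_main(page))
```

In this sketch the output stays HTML rather than plain text, matching the listing's description of a tool that outputs clean HTML bodies; swapping `classify_block` for a model call would leave the rest of the flow unchanged.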

Getting Started

git
git clone https://github.com/opendatalab/MinerU-HTML

Platforms

🪟 windows · 🍎 mac · 🐧 linux

Install Difficulty

moderate

Built With

html
