MINERU HTML
// MinerU-HTML: An SLM-powered HTML main content extractor that outputs clean HTML bodies. Perfect for Deep Research Age...
What it does
MinerU-HTML (Dripper) is an advanced HTML main content extraction tool based on Large Language Models (LLMs). It provides a complete pipeline for extracting the primary content of HTML pages using LLM-based classification and state machine-guided generation.
- 2025.12.1: The AICC dataset is released and available for use. The AICC dataset contains 7.3T web pages extracted and converted to Markdown format by
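To make the "classification plus state machine-guided generation" idea in the description above more concrete, here is a toy Python sketch. It is not the project's API: the function names, the block-level labels ("main" / "other"), and the word-count scorer standing in for a language model are all assumptions made for illustration. The point it tries to show is that a small state machine can restrict the output to one valid label per block, so the result is always well formed regardless of what the model would otherwise emit.

```python
from typing import Callable, Dict, List

# Hypothetical label set; the real project may use different categories.
LABELS = ("main", "other")


def keyword_scorer(block: str, label: str) -> float:
    """Stand-in for a language model's score of `label` for `block`.

    Assumption for illustration only: blocks with 8 or more words look
    like main content, shorter ones look like navigation or footer text.
    """
    looks_like_content = len(block.split()) >= 8
    if label == "main":
        return 1.0 if looks_like_content else 0.0
    return 0.0 if looks_like_content else 1.0


def constrained_labels(
    blocks: List[str], scorer: Callable[[str, str], float]
) -> Dict[int, str]:
    """Walk a two-state machine (EXPECT_ID -> EXPECT_LABEL) over the blocks.

    In each state only a fixed set of outputs is allowed, so the result
    always contains exactly one valid label per block id.
    """
    state = "EXPECT_ID"
    output: Dict[int, str] = {}
    index = 0
    current_id = 0
    while index < len(blocks):
        if state == "EXPECT_ID":
            # The only valid id is the index of the next unlabeled block.
            current_id = index
            state = "EXPECT_LABEL"
        else:
            # Only tokens from LABELS are allowed; pick the best-scoring one.
            block = blocks[current_id]
            output[current_id] = max(LABELS, key=lambda lab: scorer(block, lab))
            index += 1
            state = "EXPECT_ID"
    return output


def extract_main(
    blocks: List[str], scorer: Callable[[str, str], float] = keyword_scorer
) -> List[str]:
    """Keep only the blocks that were labeled as main content."""
    labels = constrained_labels(blocks, scorer)
    return [b for i, b in enumerate(blocks) if labels[i] == "main"]
```

In this sketch, a real system would replace `keyword_scorer` with model scores over the allowed tokens; the state machine itself stays the same, which is what keeps the output parseable.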
Getting Started
git clone https://github.com/opendatalab/MinerU-HTML