HTML

Production-grade HTML content extraction library with automatic encoding detection (15+ encodings), intelligent article recognition, link/media extraction, and multi-format output.

Features

  • Intelligent Article Recognition - Automatically identifies and extracts page body content, removing navigation, ads, and other noise
  • Content Sanitization - Automatically sanitizes HTML, removing dangerous tags and attributes to prevent XSS attacks
  • Metadata Extraction - Automatically extracts titles, images, links, videos, audio, and other structured information
  • Multi-Format Output - Plain text, Markdown, and JSON output formats
  • Automatic Encoding Detection - Supports UTF-8, GBK, Shift_JIS, Windows-1252, and other encodings (15+ in total)
  • Batch Processing - Concurrent batch extraction with built-in Processor object pool reuse
  • Link Extraction - Standalone link extraction API with type-based grouping
  • Audit System - Pluggable audit pipeline with multiple sinks and event filtering
  • Security Protection - Input size limits, depth limits, path traversal prevention, and panic recovery
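
The batch-processing feature above pairs concurrent extraction with object-pool reuse. As a rough illustration of that general pattern only, here is a minimal standard-library sketch. It does not use or reproduce this library's API: `extractTitle`, `extractBatch`, and `bufPool` are hypothetical names, and the toy title lookup merely stands in for real extraction.

```go
package main

import (
	"bytes"
	"fmt"
	"strings"
	"sync"
)

// bufPool reuses scratch buffers across documents. It is a hypothetical
// stand-in for a processor pool, not the library's real internals.
var bufPool = sync.Pool{New: func() any { return new(bytes.Buffer) }}

// extractTitle is a toy stand-in for extraction: it returns the text between
// <title> and </title>, or "" when no complete title element is found.
func extractTitle(doc []byte) string {
	buf := bufPool.Get().(*bytes.Buffer)
	defer func() {
		buf.Reset() // clear state before handing the buffer back
		bufPool.Put(buf)
	}()

	buf.Write(doc)
	s := buf.String() // String copies, so s stays valid after Reset
	start := strings.Index(s, "<title>")
	if start < 0 {
		return ""
	}
	start += len("<title>")
	end := strings.Index(s[start:], "</title>")
	if end < 0 {
		return ""
	}
	return strings.TrimSpace(s[start : start+end])
}

// extractBatch processes docs with at most `workers` goroutines and returns
// the results in input order.
func extractBatch(docs [][]byte, workers int) []string {
	if workers < 1 {
		workers = 1
	}
	results := make([]string, len(docs))
	jobs := make(chan int)

	var wg sync.WaitGroup
	for w := 0; w < workers; w++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for i := range jobs {
				// Each index is written by exactly one goroutine.
				results[i] = extractTitle(docs[i])
			}
		}()
	}
	for i := range docs {
		jobs <- i
	}
	close(jobs)
	wg.Wait()
	return results
}

func main() {
	docs := [][]byte{
		[]byte("<html><head><title>First</title></head></html>"),
		[]byte("<html><head><title>Second</title></head></html>"),
		[]byte("<html><body>no title</body></html>"),
	}
	for i, title := range extractBatch(docs, 2) {
		fmt.Printf("%d: %q\n", i, title)
	}
}
```

Writing each result to its own slice index keeps the output in input order without extra locking. The pool only saves allocations when many documents pass through the same workers.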

Installation

bash
go get github.com/cybergodev/html

Quick Start

go
package main

import (
    "fmt"
    "log"

    "github.com/cybergodev/html"
)

func main() {
    data := []byte(`<html><head><title>Example</title></head>
        <body><h1>Title</h1><p>Body content</p></body></html>`)

    result, err := html.Extract(data)
    if err != nil {
        log.Fatal(err)
    }

    fmt.Println(result.Title) // Output: Example
    fmt.Println(result.Text)  // Output: Title\n\nBody content
}

Architecture Overview

The HTML library is built around three core types:

text
                Config
                  │
                  ▼
             Processor ──→ Result
              │    │         │
              │    │         ├── Text / Title
              │    │         ├── Images / Videos / Audios
              │    │         ├── Links
              │    │         └── WordCount / ReadingTime
              │    │
              │    ├── Cache
              │    ├── Statistics
              │    └── AuditLog
              │
              ├── Scorer (custom scoring ── extensible)
              └── AuditSink (audit output ── extensible)

Type      | Responsibility | Description
Config    | Configuration  | Control center for all behavior, provides 4 presets
Processor | Engine         | Stateful processing engine managing cache, statistics, and audit
Result    | Result         | Structured output containing text and all metadata

Package Functions vs Processor

Aspect          | Package Functions                    | Processor
Usage           | html.Extract(data)                   | p, _ := html.New(cfg); p.Extract(data)
Cache           | None (uses internal temporary pool)  | Yes, configurable TTL and capacity
Statistics      | None                                 | Yes, query hit rate and other metrics
Audit           | None                                 | Yes, configurable audit pipeline
Lifecycle       | No management needed                 | Requires defer p.Close()
Concurrent Safe | Yes                                  | Yes

Choosing the Right Approach

  • One-time extraction (CLI tools, scripts) → Package functions
  • High-frequency server calls (web services, crawlers) → Processor
  • Need audit/monitoring → Processor

Stage           | Page               | What You'll Learn
Getting Started | Quick Start        | Installation, basic usage, two calling modes
Core            | Content Extraction | Extract family, Config, Result interpretation
Formats         | Output Formats     | Markdown / JSON output, custom templates
Performance     | Cache & Reuse      | Processor lifecycle, cache tuning, batch processing
Extensions      | Link Extraction    | Link extraction, grouping, resource discovery
Security        | Audit Pipeline     | Audit system, custom Sinks, security monitoring
Advanced        | Testing & Custom   | Custom Scorer, ContentNode, test mode
Reference       | Cheat Sheet        | Common API quick reference

Next Steps