Overview

Kokokor processes OCR observations through a three-stage pipeline that progressively reconstructs document structure:
1. Observations → Text Lines: groups OCR observations into lines using vertical proximity analysis.
2. Text Lines → Paragraphs: merges lines into paragraphs while preserving poetry and special formatting.
3. Paragraphs → Formatted Text: converts structured blocks into readable text with proper spacing.

Pipeline Architecture

The complete pipeline is orchestrated by the reconstructParagraphs function:
export const reconstructParagraphs = (
  input: ReconstructInput,
  options: ReconstructOptions = {}
): ReconstructResult => {
  // Stage 1: Observations → Text Lines
  const lines = mapObservationsToTextLines(
    input.observations,
    input.page,
    {
      horizontalLines: input.layout?.horizontalLines,
      rectangles: input.layout?.rectangles,
      ...(options.line ?? {}),
    }
  );

  // Stage 2: Text Lines → Paragraphs
  const paragraphs = mapTextLinesToParagraphs(
    lines,
    options.paragraph ?? {}
  );

  // Stage 3: Paragraphs → Formatted Text
  const text = formatTextBlocks(
    paragraphs,
    options.format?.footerSymbol
  );

  return { lines, paragraphs, text };
};
Reference: src/index.ts:35

Stage 1: Observations to Text Lines

Purpose

Convert raw OCR observations (individual words) into structured text lines with rich metadata.

Process

1. Preprocessing

Normalize coordinates, filter noise, and handle RTL text direction
observations = flipAndAlignObservations(
  observations,
  page.width,
  page.dpiX,
  options
);
2. Line Grouping

Group observations by vertical proximity using adaptive spacing analysis
const marked = indexItemsAsLines(
  observations,
  page.dpiY,
  options.pixelTolerance,
  options.lineHeightFactor
);
3. Metadata Detection

Identify centered text, headings, and footnotes
// Internal: centering detection algorithm
if (textIsCentered(o.bbox, page.width, options)) {
  e.isCentered = true;
}
if (footerLineY !== undefined && o.bbox.y > footerLineY) {
  e.isFootnote = true;
}
4. Poetry Detection

Apply multiple heuristics to identify poetic content
// Internal: poetry detection uses multiple heuristics
if (groupMatchesPoetryCriteria(group, page.width, avgProseWordDensity, options)) {
  for (const observation of group) {
    observation.isPoetic = true;
  }
}
Reference: src/utils/paragraphs.ts:82
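
The preprocessing, grouping, and centering steps above can be sketched in a few lines. This is a minimal illustration, not Kokokor's implementation: `flipHorizontally`, `groupIntoLines`, and `isCentered` are hypothetical helpers, the fixed tolerance replaces the adaptive spacing analysis, and the way the two centering ratios are interpreted is an assumption based on the option names and documented defaults.

```typescript
type BoundingBox = { x: number; y: number; width: number; height: number };
type Observation = { bbox: BoundingBox; text: string };

// Hypothetical helper: mirror x-coordinates across the page so RTL pages
// can be handled with the same left-to-right logic.
const flipHorizontally = (obs: Observation[], pageWidth: number): Observation[] =>
  obs.map((o) => ({
    ...o,
    bbox: { ...o.bbox, x: pageWidth - (o.bbox.x + o.bbox.width) },
  }));

// Hypothetical helper: an observation joins the current line when its top
// edge is within `tolerance` pixels of the previous observation's top edge.
const groupIntoLines = (obs: Observation[], tolerance: number): Observation[][] => {
  const sorted = [...obs].sort((a, b) => a.bbox.y - b.bbox.y);
  const lines: Observation[][] = [];
  let current: Observation[] = [];
  let lastY = Number.NEGATIVE_INFINITY;
  for (const o of sorted) {
    if (current.length > 0 && Math.abs(o.bbox.y - lastY) <= tolerance) {
      current.push(o);
    } else {
      current = [o];
      lines.push(current);
    }
    lastY = o.bbox.y;
  }
  // Order the words of each line from left to right
  return lines.map((line) => [...line].sort((a, b) => a.bbox.x - b.bbox.x));
};

// ASSUMPTION: a box counts as centered when its left and right margins differ
// by at most centerToleranceRatio of the page width and both margins are at
// least minMarginRatio of the page width.
const isCentered = (
  bbox: BoundingBox,
  pageWidth: number,
  centerToleranceRatio = 0.05,
  minMarginRatio = 0.2
): boolean => {
  const left = bbox.x;
  const right = pageWidth - (bbox.x + bbox.width);
  const balanced = Math.abs(left - right) <= pageWidth * centerToleranceRatio;
  const wideMargins = Math.min(left, right) >= pageWidth * minMarginRatio;
  return balanced && wideMargins;
};

const words: Observation[] = [
  { bbox: { x: 300, y: 102, width: 80, height: 20 }, text: 'world' },
  { bbox: { x: 200, y: 100, width: 90, height: 20 }, text: 'Hello' },
  { bbox: { x: 200, y: 140, width: 120, height: 20 }, text: 'Next' },
];

const lineTexts = groupIntoLines(words, 5).map((line) =>
  line.map((o) => o.text).join(' ')
);
// lineTexts is ['Hello world', 'Next']
```

The first two words differ by only 2 pixels vertically, so they share a line; the third sits 38 pixels lower and starts a new one.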

Adaptive Line Detection

The algorithm automatically adjusts to document characteristics:
  • Spacing Analysis: Calculates median and 75th percentile gaps between observations
  • Line Height Factor: Adapts based on gap-to-height ratio:
    • 0.15 for small gaps (tight line grouping)
    • 0.25 for medium gaps (standard spacing)
    • 0.4 for large gaps (widely spaced lines)
  • DPI Scaling: Adjusts pixel tolerances based on document resolution
Reference: src/utils/layout.ts:196
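
As a rough sketch of how these adjustments could fit together, the following computes the gap statistics, picks one of the three documented factors, and scales a pixel tolerance by DPI. The helper names, the choice of the median gap as the numerator of the gap-to-height ratio, and the cut-off values 0.5 and 1.0 are all assumptions made for illustration; only the factors 0.15, 0.25, and 0.4 and the 72 DPI baseline come from this page.

```typescript
// Linear-interpolated percentile of an ascending array (p between 0 and 1).
const percentile = (sorted: number[], p: number): number => {
  if (sorted.length === 0) return 0;
  const pos = (sorted.length - 1) * p;
  const lower = Math.floor(pos);
  const upper = Math.ceil(pos);
  return sorted[lower] + (sorted[upper] - sorted[lower]) * (pos - lower);
};

// Hypothetical helper: summarize vertical gaps and choose a line height factor.
const analyzeSpacing = (gaps: number[], avgHeight: number) => {
  const sorted = [...gaps].sort((a, b) => a - b);
  const medianGap = percentile(sorted, 0.5);
  const p75Gap = percentile(sorted, 0.75);
  const ratio = avgHeight > 0 ? medianGap / avgHeight : 0;
  // ASSUMPTION: the 0.5 and 1.0 cut-offs are illustrative, not from the library.
  const lineHeightFactor = ratio < 0.5 ? 0.15 : ratio < 1.0 ? 0.25 : 0.4;
  return { medianGap, p75Gap, lineHeightFactor };
};

// Hypothetical helper: scale a tolerance expressed at 72 DPI to the page DPI.
const scaleTolerance = (pixelTolerance: number, dpi: number): number =>
  (pixelTolerance * dpi) / 72;

const spacing = analyzeSpacing([4, 6, 8, 30], 20);
// spacing.medianGap is 7, spacing.p75Gap is 13.5, spacing.lineHeightFactor is 0.15

const scaledTolerance = scaleTolerance(5, 300);
// about 20.83 pixels at 300 DPI
```

Using percentiles rather than the mean keeps a single large gap (here, 30) from dominating the estimate of typical spacing.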

Stage 2: Text Lines to Paragraphs

Purpose

Merge text lines into coherent paragraphs while preserving poetry and special formatting.

Process

The algorithm separates body content from footnotes and processes each independently:
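export const mapTextLinesToParagraphs = (
  textLines: TextBlock[],
  options: ParagraphOptions = {}
) => {
  // Apply the documented defaults (see Paragraph Grouping Options);
  // the library may resolve its defaults differently.
  const resolvedOptions = {
    verticalJumpFactor: options.verticalJumpFactor ?? 2,
    widthTolerance: options.widthTolerance ?? 0.85,
  };

  // Body lines are grouped first
  const bodyBlocks = groupProseToParagraphs(
    textLines.filter((t) => !t.isFootnote),
    resolvedOptions.verticalJumpFactor,
    resolvedOptions.widthTolerance
  );

  // Footnote lines are grouped separately with the same thresholds
  const footerBlocks = groupProseToParagraphs(
    textLines.filter((t) => t.isFootnote),
    resolvedOptions.verticalJumpFactor,
    resolvedOptions.widthTolerance
  );

  // Footnotes always follow the body content
  return bodyBlocks.concat(footerBlocks);
};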
Reference: src/utils/paragraphs.ts:236

Break Detection Signals

The paragraph grouping algorithm uses four coordinated signals:

  • Vertical Jump: Significant spacing increase between full-width lines
  • Indent Start: Line that newly indents from the right-edge baseline
  • List Start: Repeated left-edge starts with short continuations
  • Short Line: Lines significantly narrower than reference width
Reference: src/utils/marking.ts:525
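
To make two of these signals concrete, here is a minimal sketch of the vertical-jump and short-line checks. It is not the library's algorithm: `findParagraphBreaks` is a hypothetical helper, the median line pitch and the widest line are assumed stand-ins for the "typical" spacing and the reference width, and the indent-start and list-start signals are left out entirely.

```typescript
type Line = { text: string; bbox: { x: number; y: number; width: number; height: number } };

const median = (values: number[]): number => {
  const s = [...values].sort((a, b) => a - b);
  const mid = Math.floor(s.length / 2);
  return s.length % 2 ? s[mid] : (s[mid - 1] + s[mid]) / 2;
};

// Hypothetical helper: returns the indices of lines that start a new paragraph.
const findParagraphBreaks = (
  lines: Line[],
  verticalJumpFactor = 2,
  widthTolerance = 0.85
): number[] => {
  if (lines.length < 2) return [];
  // Distance from each line's top edge to the next line's top edge
  const gaps = lines.slice(1).map((l, i) => l.bbox.y - lines[i].bbox.y);
  const typicalGap = median(gaps);
  // ASSUMPTION: the widest line serves as the reference width
  const referenceWidth = Math.max(...lines.map((l) => l.bbox.width));

  const breaks: number[] = [];
  for (let i = 1; i < lines.length; i++) {
    // Vertical jump: this gap is much larger than the typical gap
    const verticalJump = gaps[i - 1] > typicalGap * verticalJumpFactor;
    // Short line: the previous line ended well before the reference width
    const previousIsShort = lines[i - 1].bbox.width < referenceWidth * widthTolerance;
    if (verticalJump || previousIsShort) breaks.push(i);
  }
  return breaks;
};

const makeLine = (y: number, width: number): Line => ({
  text: '',
  bbox: { x: 0, y, width, height: 20 },
});

const sampleLines = [
  makeLine(0, 600),
  makeLine(30, 400), // short line: ends a paragraph
  makeLine(60, 600),
  makeLine(90, 600),
  makeLine(180, 600), // preceded by a 90px gap, three times the typical 30px
];

const breakIndices = findParagraphBreaks(sampleLines);
// breakIndices is [2, 4]
```

With the default `widthTolerance` of 0.85, a line must be narrower than 510 pixels on this 600-pixel reference to count as short; with `verticalJumpFactor` at 2, a gap must exceed 60 pixels to count as a jump.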

Poetry Preservation

Poetic content receives special treatment:
for (const line of textLines) {
  if (line.isPoetic) {
    // Poetry lines are NOT merged into paragraphs
    result.push(line);
  } else {
    // Prose lines accumulate for paragraph grouping
    current.push(line);
  }
}
Reference: src/utils/paragraphs.ts:204
Poetry lines maintain their individual line breaks to preserve artistic and structural integrity.
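
The excerpt above omits what happens to accumulated prose when a poetic line is reached. A self-contained sketch of one plausible completion follows; `mergeProseKeepPoetry` and the simplified `Block` type are hypothetical, and the prose run is joined into a single paragraph here, whereas the library applies its break-detection signals to each run.

```typescript
type Block = { text: string; isPoetic?: boolean };

// Hypothetical helper: keep poetic lines as-is and merge runs of prose lines.
const mergeProseKeepPoetry = (textLines: Block[]): Block[] => {
  const result: Block[] = [];
  let current: Block[] = [];

  // Emit the accumulated prose lines as one paragraph, then reset
  const flush = () => {
    if (current.length > 0) {
      result.push({ text: current.map((l) => l.text).join(' ') });
      current = [];
    }
  };

  for (const line of textLines) {
    if (line.isPoetic) {
      flush(); // close any open prose run so ordering is preserved
      result.push(line);
    } else {
      current.push(line);
    }
  }
  flush(); // emit trailing prose
  return result;
};

const merged = mergeProseKeepPoetry([
  { text: 'Prose line one' },
  { text: 'continues here.' },
  { text: 'Poetic line one', isPoetic: true },
  { text: 'Poetic line two', isPoetic: true },
  { text: 'Closing prose.' },
]).map((b) => b.text);
// merged is ['Prose line one continues here.', 'Poetic line one', 'Poetic line two', 'Closing prose.']
```

Flushing before each poetic line is what keeps the output in reading order; without it, all poetry would be emitted ahead of the prose that precedes it.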

Stage 3: Paragraphs to Formatted Text

Purpose

Convert structured text blocks into a readable string with proper line breaks and spacing.

Process

export const formatTextBlocks = (
  textBlocks: TextBlock[],
  footerSymbol?: string
) => {
  let isAtLeastOneFootnoteHit = false;

  const paragraphs = textBlocks.flatMap((t) => {
    // Insert footer symbol before first footnote
    if (footerSymbol && t.isFootnote && !isAtLeastOneFootnoteHit) {
      isAtLeastOneFootnoteHit = true;
      return [footerSymbol, t.text];
    }

    // Add blank line after headings
    if (t.isHeading) {
      return [t.text, ''];
    }

    return [t.text];
  });

  return paragraphs.join('\n');
};
Reference: src/index.ts:11

Formatting Rules

Headings receive a blank line after them for visual separation:
Chapter Title

First paragraph text...
An optional footer symbol marks the start of the footnote section:
...main text ends
---
1. First footnote
2. Second footnote
Each poetic line appears on its own line:
Poetic line one
Poetic line two
Paragraphs are separated by single newlines:
First paragraph text.
Second paragraph text.
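
The rules above combine as follows when applied to a small set of blocks. This sketch uses `formatBlocks`, a hypothetical loop-based equivalent of the `formatTextBlocks` logic shown under Process, with a trimmed-down `SimpleBlock` type standing in for the library's `TextBlock`, so that it runs without installing the package.

```typescript
type SimpleBlock = { text: string; isHeading?: boolean; isFootnote?: boolean };

// Hypothetical stand-in that mirrors the formatting rules described above.
const formatBlocks = (textBlocks: SimpleBlock[], footerSymbol?: string): string => {
  const out: string[] = [];
  let footnoteSeen = false;

  for (const t of textBlocks) {
    if (footerSymbol && t.isFootnote && !footnoteSeen) {
      // Insert the footer symbol once, before the first footnote
      footnoteSeen = true;
      out.push(footerSymbol, t.text);
    } else if (t.isHeading) {
      // An empty entry becomes a blank line after the heading
      out.push(t.text, '');
    } else {
      out.push(t.text);
    }
  }
  return out.join('\n');
};

const formatted = formatBlocks(
  [
    { text: 'Chapter Title', isHeading: true },
    { text: 'First paragraph text.' },
    { text: 'Second paragraph text.' },
    { text: '1. First footnote', isFootnote: true },
    { text: '2. Second footnote', isFootnote: true },
  ],
  '---'
);
// formatted is:
// Chapter Title
//
// First paragraph text.
// Second paragraph text.
// ---
// 1. First footnote
// 2. Second footnote
```

Because every entry is joined with a single newline, the only blank line in the output is the one deliberately added after the heading.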

Configuration Options

Line Detection Options

type MapObservationsToTextLinesOptions = {
  // RTL text handling
  isRTL?: boolean;

  // Spacing tolerance
  pixelTolerance?: number;        // Default: 5px at 72 DPI
  lineHeightFactor?: number;       // Adaptive if not provided

  // Layout elements
  horizontalLines?: BoundingBox[]; // For footnote detection
  rectangles?: BoundingBox[];      // For heading detection

  // Centering detection
  centerToleranceRatio?: number;   // Default: 0.05 (5%)
  minMarginRatio?: number;         // Default: 0.2 (20%)

  // Poetry detection
  poetryDetectionOptions?: Partial<PoetryDetectionOptions>;
  poetryPairDelimiter?: string;    // Default: " "

  // Debugging
  log?: (message: string, ...args: any[]) => void;
};
Reference: src/types.ts:69

Paragraph Grouping Options

type ParagraphOptions = {
  // Vertical spacing threshold
  verticalJumpFactor?: number;  // Default: 2

  // Short line threshold
  widthTolerance?: number;       // Default: 0.85 (85%)
};
Reference: src/types.ts:142

Complete Example

import { reconstructParagraphs } from 'kokokor';

const result = reconstructParagraphs(
  {
    observations: [
      { bbox: { x: 100, y: 0, width: 200, height: 20 }, text: "First" },
      { bbox: { x: 310, y: 0, width: 200, height: 20 }, text: "line" },
      // ... more observations
    ],
    page: {
      width: 800,
      height: 1200,
      dpiX: 300,
      dpiY: 300,
    },
    layout: {
      horizontalLines: [],
      rectangles: [],
    },
  },
  {
    line: {
      isRTL: true,
      poetryDetectionOptions: {
        minWordCount: 2,
      },
    },
    paragraph: {
      verticalJumpFactor: 2,
      widthTolerance: 0.85,
    },
    format: {
      footerSymbol: '---',
    },
  }
);

console.log(result.text);
// Prints the reconstructed text; illustrative output:
// First line
// Second paragraph...

Next Steps

  • TextBlock Type: Learn about metadata and properties
  • Poetry Detection: Understand poetry identification algorithms
  • RTL Support: Explore right-to-left text handling
  • API Reference: View complete API documentation