Overview

Kokokor groups text lines into paragraphs using layout heuristics. Two parameters control most of this behavior: verticalJumpFactor and widthTolerance. Understanding how they work helps you tune paragraph detection for different document types.

Core Concepts

Paragraph detection in Kokokor uses four coordinated signals:
  1. Vertical Jump Detection - Detects spacing increases between lines
  2. Indent Detection - Identifies right-edge indentation from baseline
  3. List-Start Detection - Recognizes repeated left-edge patterns
  4. Short Line Detection - Marks paragraph-ending lines
The verticalJumpFactor and widthTolerance options primarily control signals 1 and 4.

verticalJumpFactor

What It Does

The verticalJumpFactor determines when a vertical gap between lines is large enough to indicate a new paragraph. It works by comparing consecutive gaps:
// A new paragraph starts when:
// currentGap > previousGap * verticalJumpFactor

Default Value

{
  paragraph: {
    verticalJumpFactor: 2.0 // default
  }
}

How It Works

Consider three consecutive lines:
Line A         (y: 100)
  gap: 25px
Line B         (y: 125)
  gap: 60px    ← Is this a paragraph break?
Line C         (y: 185)
With verticalJumpFactor = 2.0:
  • currentGap = 60
  • previousGap = 25
  • threshold = 25 * 2.0 = 50
  • 60 > 50 → Yes, new paragraph
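
The same check can be run over a whole column of line positions. The following is a minimal sketch, assuming gaps are measured as the difference between consecutive y values as in the example above; `findVerticalJumpBreaks` is a hypothetical helper for illustration and is not part of the Kokokor API.

```javascript
// Hypothetical helper (not a Kokokor export): returns the indices of lines
// that would start a new paragraph under the vertical jump rule alone.
function findVerticalJumpBreaks(lineTops, verticalJumpFactor = 2.0) {
  const breaks = [];
  // Two gaps are needed for a comparison, so the first candidate is the third line.
  for (let i = 2; i < lineTops.length; i++) {
    const previousGap = lineTops[i - 1] - lineTops[i - 2];
    const currentGap = lineTops[i] - lineTops[i - 1];
    if (currentGap > previousGap * verticalJumpFactor) {
      breaks.push(i);
    }
  }
  return breaks;
}

// Lines A, B, C from the example: gaps of 25px and 60px.
const jumpBreaks = findVerticalJumpBreaks([100, 125, 185]);
console.log(jumpBreaks); // index 2 (Line C) is flagged
```

With a factor of 3.0 the threshold becomes 75px, so the same 60px gap would no longer be flagged.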

Tuning the Factor

// verticalJumpFactor: 1.5
// Even small spacing increases create new paragraphs
const result = reconstructParagraphs(input, {
  paragraph: {
    verticalJumpFactor: 1.5
  }
});

// Example: gap of 40px after 30px → same paragraph
// 40 > 30 * 1.5 (45)? No
// But gap of 50px after 30px → new paragraph
// 50 > 30 * 1.5 (45)? Yes
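
To see how the factor shifts the decision, it can help to sweep a few values against the same gaps. This sketch only applies the documented inequality; `isVerticalJump` is a hypothetical name, and the 70px gap is an extra sample value added for illustration.

```javascript
// Hypothetical helper: the documented rule currentGap > previousGap * factor.
function isVerticalJump(currentGap, previousGap, factor) {
  return currentGap > previousGap * factor;
}

const basePreviousGap = 30;
for (const factor of [1.5, 2.0, 2.5]) {
  for (const currentGap of [40, 50, 70]) {
    const verdict = isVerticalJump(currentGap, basePreviousGap, factor)
      ? 'new paragraph'
      : 'same paragraph';
    console.log(`factor ${factor}: ${basePreviousGap}px then ${currentGap}px: ${verdict}`);
  }
}
```

Lower factors flag more gaps as breaks; at 2.5 none of these three gaps exceeds the 75px threshold.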

Important Notes

The vertical jump signal only activates when the preceding lines are full-width (not short lines). This prevents false breaks after natural line endings.
// This will NOT trigger a vertical break:
This is a long line that ends the paragraph.
Short line.        ← Short line
  (big gap)
Next paragraph.    ← Gap is ignored due to short line above

// The short line already signals the paragraph break,
// so vertical jump detection is suppressed

widthTolerance

What It Does

The widthTolerance sets how narrow a line must be, as a fraction of the reference width, to count as a “short line” that ends a paragraph. When a line is narrower than that threshold, the following line starts a new paragraph.

Default Value

{
  paragraph: {
    widthTolerance: 0.85 // default (85% of reference width)
  }
}

How It Works

Kokokor computes a reference width from the document:
  1. Collects all line widths
  2. Calculates the 75th percentile (p75) width
  3. This becomes the reference width
Then for each line:
thresholdWidth = referenceWidth * widthTolerance

if (line.width < thresholdWidth) {
  // This is a "short line" - next line starts new paragraph
}

Example Calculation

// Document with line widths: [400, 420, 410, 300, 415, 405, 350]
// Sorted: [300, 350, 400, 405, 410, 415, 420]
// p75 width = 415 (75th percentile)

// With widthTolerance = 0.85:
thresholdWidth = 415 * 0.85 = 352.75

// Line classification:
// 400px → full-width (400 > 352.75)
// 420px → full-width (420 > 352.75)
// 350px → SHORT LINE (350 < 352.75) → triggers paragraph break
// 300px → SHORT LINE (300 < 352.75) → triggers paragraph break
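
The classification above can be reproduced in a few lines. This is a sketch under stated assumptions: `findShortLines` is a hypothetical helper, and the reference width of 415 is passed in directly from the example rather than computed.

```javascript
// Hypothetical helper: flags lines narrower than referenceWidth * widthTolerance.
function findShortLines(widths, referenceWidth, widthTolerance = 0.85) {
  const thresholdWidth = referenceWidth * widthTolerance;
  return widths
    .map((width, index) => ({ index, width }))
    .filter((line) => line.width < thresholdWidth);
}

const exampleWidths = [400, 420, 410, 300, 415, 405, 350];
// 415 is the reference width quoted in the example, not a computed value.
const shortLines = findShortLines(exampleWidths, 415);
console.log(shortLines.map((line) => line.width)); // the 300px and 350px lines
```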

Tuning the Tolerance

// widthTolerance: 0.75
// Fewer lines are considered "short"
const result = reconstructParagraphs(input, {
  paragraph: {
    widthTolerance: 0.75
  }
});

// With reference width 400:
// threshold = 400 * 0.75 = 300 (the default 0.85 gives 340)
// Only lines < 300px are short
// Fewer paragraph breaks from short lines
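
The threshold arithmetic can be checked for several tolerances at once. The sketch below assumes the short-line rule given under How It Works (width < referenceWidth * widthTolerance); `countShortLines` is a hypothetical helper and the sample widths are made up for illustration.

```javascript
// Hypothetical helper: counts lines that fall under the short-line threshold.
function countShortLines(widths, referenceWidth, widthTolerance) {
  const thresholdWidth = referenceWidth * widthTolerance;
  return widths.filter((width) => width < thresholdWidth).length;
}

const sweepReferenceWidth = 400;
const sweepWidths = [390, 370, 330, 310, 290]; // illustrative values only

for (const tolerance of [0.75, 0.85, 0.95]) {
  const count = countShortLines(sweepWidths, sweepReferenceWidth, tolerance);
  console.log(`widthTolerance ${tolerance}: ${count} short line(s)`);
}
```

Under this rule the count of short lines grows as widthTolerance rises, because the threshold moves closer to the reference width.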

How They Work Together

The two parameters work in coordination:
// Example document:
This is a long line of text that continues.  (width: 420)
This is another long line in same paragraph. (width: 415)
Short line.                                  (width: 300)
                                             (gap: 50px)
This starts a new paragraph with more text.  (width: 410)
And this continues that paragraph.           (width: 405)
                                             (gap: 80px, previous gap: 25px)
This is another paragraph after big gap.     (width: 418)

// With defaults (verticalJumpFactor: 2.0, widthTolerance: 0.85):
// Reference width (p75): ~415
// Threshold width: 415 * 0.85 = 352.75

// Paragraph 1: Lines 1-3
//   Line 3 is short (300 < 352.75) → the next line starts a new paragraph

// Paragraph 2: Lines 4-5
//   Gap of 80px vs previous 25px
//   80 > 25 * 2.0? Yes → triggers break

// Paragraph 3: Line 6
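
The walkthrough above can be approximated in code. This is a minimal sketch, not Kokokor's implementation: it models only the short-line and vertical-jump signals, takes the reference width (415) as an input, and assumes a 25px gap wherever the example does not state one. `groupLines` and the `gapBefore` field are hypothetical names.

```javascript
// Sketch of two-signal grouping. Returns paragraphs as arrays of 1-based line numbers.
function groupLines(lines, { referenceWidth, verticalJumpFactor = 2.0, widthTolerance = 0.85 }) {
  const thresholdWidth = referenceWidth * widthTolerance;
  const paragraphs = [];
  let current = [];

  lines.forEach((line, i) => {
    if (i > 0) {
      const previous = lines[i - 1];
      const previousIsShort = previous.width < thresholdWidth;
      // A short previous line ends the paragraph by itself; the gap rule is
      // evaluated only when the previous line is full-width, and it needs
      // two gaps, so it cannot fire before the third line.
      const startsNew =
        previousIsShort ||
        (i >= 2 && line.gapBefore > previous.gapBefore * verticalJumpFactor);
      if (startsNew) {
        paragraphs.push(current);
        current = [];
      }
    }
    current.push(i + 1);
  });

  if (current.length > 0) paragraphs.push(current);
  return paragraphs;
}

const documentLines = [
  { width: 420, gapBefore: 0 },
  { width: 415, gapBefore: 25 }, // assumed gap
  { width: 300, gapBefore: 25 }, // assumed gap
  { width: 410, gapBefore: 50 },
  { width: 405, gapBefore: 25 },
  { width: 418, gapBefore: 80 },
];

const grouped = groupLines(documentLines, { referenceWidth: 415 });
console.log(grouped); // lines 1-3, lines 4-5, line 6
```

The 50px gap before line 4 never reaches the gap rule, because the short line 3 has already ended the paragraph.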

Tuning for Document Types

Dense Academic Papers

Academic papers often have consistent spacing and few short lines:
const result = reconstructParagraphs(input, {
  paragraph: {
    verticalJumpFactor: 1.8,  // Sensitive to spacing
    widthTolerance: 0.90       // Lines under 90% of reference width end a paragraph
  }
});

Books and Novels

Books have clear paragraph breaks with indentation and spacing:
const result = reconstructParagraphs(input, {
  paragraph: {
    verticalJumpFactor: 2.0,  // Standard sensitivity
    widthTolerance: 0.85       // Standard threshold
  }
});

Technical Documents

Technical docs may have lists, code blocks, and varied formatting:
const result = reconstructParagraphs(input, {
  paragraph: {
    verticalJumpFactor: 2.5,  // Less sensitive
    widthTolerance: 0.75       // Only lines under 75% of reference width end a paragraph
  }
});

Poetry Collections

Poetry is handled separately, but for prose sections:
const result = reconstructParagraphs(input, {
  paragraph: {
    verticalJumpFactor: 1.5,  // Very sensitive
    widthTolerance: 0.95       // Preserve short lines
  }
});
Poetry detection (isPoetic flag) happens before paragraph grouping. Poetic lines are never merged into paragraphs regardless of these settings.

Multi-Column Layouts

const result = reconstructParagraphs(input, {
  paragraph: {
    verticalJumpFactor: 2.5,  // Conservative on spacing
    widthTolerance: 0.70       // Low short-line threshold (columns are narrow)
  }
});

Diagnostic Tips

Too Many Paragraphs?

// Reduce paragraph breaks by:
// 1. Increase verticalJumpFactor (less sensitive to spacing)
// 2. Decrease widthTolerance (fewer lines count as short)

const result = reconstructParagraphs(input, {
  paragraph: {
    verticalJumpFactor: 2.5,  // Was: 2.0
    widthTolerance: 0.80       // Was: 0.85
  }
});

Too Few Paragraphs?

// Increase paragraph breaks by:
// 1. Decrease verticalJumpFactor (more sensitive to spacing)
// 2. Increase widthTolerance (more lines count as short)

const result = reconstructParagraphs(input, {
  paragraph: {
    verticalJumpFactor: 1.5,  // Was: 2.0
    widthTolerance: 0.90       // Was: 0.85
  }
});

Debug Paragraph Detection

// Use low-level API for detailed control
import { mapObservationsToTextLines, mapTextLinesToParagraphs } from 'kokokor';

const lines = mapObservationsToTextLines(observations, page, {
  log: console.log // Enable debug logging
});

console.log('Line widths:', lines.map(l => l.bbox.width));
console.log('Line gaps:', lines.map((l, i) => 
  i > 0 ? l.bbox.y - lines[i-1].bbox.y : 0
));

const paragraphs = mapTextLinesToParagraphs(lines, {
  verticalJumpFactor: 2.0,
  widthTolerance: 0.85
});

console.log(`${lines.length} lines → ${paragraphs.length} paragraphs`);
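
When reading the logged gaps, the ratio of each gap to the one before it is the number to compare against verticalJumpFactor. The helper below is a hypothetical debugging aid, assuming each line exposes bbox.y and bbox.width as in the snippet above; the sample lines are stand-ins for real output.

```javascript
// Hypothetical debugging helper: reports each line's gap and its ratio to the
// previous gap, which is the value the vertical jump rule compares to the factor.
function describeGaps(lines) {
  return lines.map((line, i) => {
    const gap = i > 0 ? line.bbox.y - lines[i - 1].bbox.y : null;
    const previousGap = i > 1 ? lines[i - 1].bbox.y - lines[i - 2].bbox.y : null;
    // Skip the ratio when there is no previous gap or it is zero.
    const ratio = gap !== null && previousGap ? gap / previousGap : null;
    return { line: i + 1, width: line.bbox.width, gap, ratio };
  });
}

const sampleLines = [
  { bbox: { y: 100, width: 420 } },
  { bbox: { y: 125, width: 415 } },
  { bbox: { y: 185, width: 410 } },
];

console.table(describeGaps(sampleLines));
```

A ratio above the configured factor on a line whose predecessors are full-width is where a vertical-jump break would be expected.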

Advanced: Reference Width Calculation

Kokokor uses the 75th percentile (p75) of line widths as the reference width:
// Internal algorithm (simplified):
function computeReferenceWidth(lines) {
  const widths = lines.map(l => l.bbox.width).sort((a, b) => a - b);
  
  // Use p75 if we have enough lines, otherwise max width
  if (widths.length >= 4) {
    const p75Index = Math.floor((widths.length - 1) * 0.75);
    return widths[p75Index];
  }
  
  return widths[widths.length - 1];
}
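Because the reference is a percentile rather than the maximum or the mean, an occasional unusually wide line does not inflate the short-line threshold, and a handful of short lines does not drag it down.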
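
Percentile conventions differ on small samples, which matters for lines near the threshold. The sketch below compares the floor-index selection from the simplified function above with the nearest-rank convention, using the seven widths from the Example Calculation. Which convention the library applies is not confirmed here; both function names are hypothetical.

```javascript
// Floor-index selection, mirroring the simplified function above.
function p75FloorIndex(widths) {
  const sorted = [...widths].sort((a, b) => a - b);
  return sorted[Math.floor((sorted.length - 1) * 0.75)];
}

// Nearest-rank selection: the smallest value with at least 75% of the
// sample at or below it.
function p75NearestRank(widths) {
  const sorted = [...widths].sort((a, b) => a - b);
  return sorted[Math.ceil(sorted.length * 0.75) - 1];
}

const calcWidths = [400, 420, 410, 300, 415, 405, 350];
const floorRef = p75FloorIndex(calcWidths);    // 410
const nearestRef = p75NearestRank(calcWidths); // 415

for (const [label, ref] of [['floor-index', floorRef], ['nearest-rank', nearestRef]]) {
  const threshold = ref * 0.85;
  console.log(`${label}: reference ${ref}, threshold ${threshold}, 350px short? ${350 < threshold}`);
}
```

On this sample the two conventions give 410 and 415, so the 350px line sits just above one threshold (348.5) and just below the other (352.75). When a borderline line is misclassified, logging the computed reference width is a quick first check.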

Next Steps

Advanced Configuration

Explore all configuration options

Basic Usage

Back to basic usage patterns