Overview

Kokokor provides several option types to configure the text reconstruction pipeline, including line detection, paragraph grouping, poetry detection, and centering analysis.

ReconstructOptions

Optional knobs for one-shot paragraph reconstruction via reconstructParagraphs.

Type Definition

type ReconstructOptions = {
  line?: Partial<MapObservationsToTextLinesOptions>;
  paragraph?: ParagraphOptions;
  format?: {
    footerSymbol?: string;
  };
};

Fields

line
Partial<MapObservationsToTextLinesOptions>
Line-detection options. See MapObservationsToTextLinesOptions for available fields.
paragraph
ParagraphOptions
Paragraph-detection options. See ParagraphOptions for available fields.
format
object
Text formatting options.
footerSymbol
string
Optional symbol to insert before the first footnote in the formatted output. Example: "---" or "***"

Usage Example

import { reconstructParagraphs } from 'kokokor';

const result = await reconstructParagraphs(
  {
    observations: [...],
    page: { width: 2550, height: 3300, dpiX: 300, dpiY: 300 }
  },
  {
    line: {
      pixelTolerance: 10,
      poetryPairDelimiter: ' ... '
    },
    paragraph: {
      verticalJumpFactor: 2.5,
      widthTolerance: 0.8
    },
    format: {
      footerSymbol: '---'
    }
  }
);

MapObservationsToTextLinesOptions

Configuration options for the main text line mapping function. This type combines all the configuration needed to process OCR observations into structured text lines.

Type Definition

type MapObservationsToTextLinesOptions = CenteringOptions & {
  horizontalLines?: BoundingBox[];
  isRTL?: boolean;
  lineHeightFactor?: number;
  log?: (message: string, ...args: any[]) => void;
  pixelTolerance?: number;
  poetryDetectionOptions?: Partial<PoetryDetectionOptions>;
  poetryPairDelimiter?: string;
  rectangles?: BoundingBox[];
};

Fields

centerToleranceRatio
number
default:"0.05"
Inherited from CenteringOptions. The tolerance for center point alignment as a ratio of image width.
minMarginRatio
number
default:"0.1"
Inherited from CenteringOptions. The minimum margin required on each side as a ratio of image width.
horizontalLines
BoundingBox[]
Optional array of horizontal line elements detected in the document. These are typically used to identify sections, headers, footers, or decorative elements. When provided, text appearing below these lines may be classified as footnotes.
isRTL
boolean
Whether the coordinates come from a right-to-left (RTL) language. If true, the x-axis is flipped.
lineHeightFactor
number
Optional fixed line height factor for grouping observations into lines. If not provided, the system will compute an adaptive factor based on document analysis. Typical values:
  • 0.5: Very tight line grouping
  • 1.0: Standard line height
  • 1.5: Generous line spacing tolerance
log
(message: string, ...args: any[]) => void
Optional logging function for debugging and monitoring the text processing pipeline. When provided, the system will output detailed information about its decisions and intermediate processing steps.
pixelTolerance
number
default:"5"
Additional vertical tolerance in pixels (at 72 DPI) for line grouping. This value is automatically scaled based on the document’s actual DPI.
  • Higher values make line grouping more permissive (more text on same line)
  • Lower values make line grouping stricter (more separate lines)
Default: 5 pixels at 72 DPI
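
The scaling described above can be sketched as follows. This is a minimal illustration, assuming the tolerance scales linearly with the page's vertical DPI; the helper name scalePixelTolerance and the constant BASE_DPI are hypothetical and not part of kokokor's public API.

```typescript
// Hypothetical helper: NOT part of kokokor's public API.
// Assumption: the tolerance scales linearly with the page's vertical DPI.
const BASE_DPI = 72;

function scalePixelTolerance(pixelTolerance: number, dpiY: number): number {
  // A tolerance given at 72 DPI covers proportionally more pixels at higher DPI.
  return pixelTolerance * (dpiY / BASE_DPI);
}

// The default tolerance of 5 on a 300 DPI scan.
console.log(scalePixelTolerance(5, 300).toFixed(2)); // prints "20.83"
```

Under this assumption, the same pixelTolerance value behaves consistently across scans of different resolutions.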
poetryDetectionOptions
Partial<PoetryDetectionOptions>
Configuration options for poetry detection algorithms. If not provided, default poetry detection settings will be used. Set to null or undefined to disable poetry detection entirely. See PoetryDetectionOptions for available fields.
poetryPairDelimiter
string
default:"' '"
Delimiter used when merging a detected poetry pair (hemistichs) into a single line. Example: " ... " formats a pair as صدر ... عجز.
rectangles
BoundingBox[]
Optional array of rectangular elements detected in the document. These are typically used to identify text boxes, highlighted sections, or headers. When provided, text within these rectangles may be classified as headings.

Usage Example

import { mapObservationsToTextLines } from 'kokokor';

const lines = mapObservationsToTextLines(
  observations,
  { width: 2550, height: 3300, dpiX: 300, dpiY: 300 },
  {
    pixelTolerance: 10,
    centerToleranceRatio: 0.03,
    minMarginRatio: 0.15,
    poetryPairDelimiter: ' ... ',
    horizontalLines: [
      { x: 50, y: 2800, width: 2450, height: 5 }
    ],
    rectangles: [
      { x: 100, y: 100, width: 2350, height: 80 }
    ],
    log: (msg, ...args) => console.log(msg, ...args)
  }
);

ParagraphOptions

Options for grouping text lines into paragraphs.

Type Definition

type ParagraphOptions = {
  verticalJumpFactor?: number;
  widthTolerance?: number;
};

Fields

verticalJumpFactor
number
default:"2"
Factor for detecting paragraph breaks based on vertical spacing. Higher values make break detection stricter (require larger gaps between paragraphs).
widthTolerance
number
default:"0.85"
Threshold for identifying short lines that indicate paragraph endings. Lower values mark fewer lines as “short” (stricter definition of short lines).

Usage Example

import { mapTextLinesToParagraphs } from 'kokokor';

const paragraphs = mapTextLinesToParagraphs(
  lines,
  {
    verticalJumpFactor: 2.5,  // Require larger gaps for breaks
    widthTolerance: 0.8       // Stricter short-line detection
  }
);

PoetryDetectionOptions

Configuration options to fine-tune the poetry detection algorithm.

Type Definition

type PoetryDetectionOptions = Partial<CenteringOptions> & {
  maxVerticalGapRatio: number;
  minWidthRatioForMerged: null | number;
  minWordCount: number;
  pairWidthSimilarityRatio: number;
  pairWordCountSimilarityRatio: number;
  wordDensityComparisonRatio: number;
};

Detection Heuristics

Poetry detection uses multiple heuristics:
  1. Paired hemistichs: Two short lines that appear to be halves of a verse
  2. Merged lines: Single lines with distinctive spacing/density characteristics
  3. Centering: Lines that are centered with appropriate margins

Fields

centerToleranceRatio
number
default:"0.05"
Inherited from CenteringOptions. Used for centering-based poetry detection.
minMarginRatio
number
default:"0.1"
Inherited from CenteringOptions. Used for centering-based poetry detection.
maxVerticalGapRatio
number
default:"2.0"
Maximum allowed vertical gap between observations to be considered a poetry pair. This controls how close two lines must be to be considered hemistichs (halves of a verse). The gap is measured as a ratio of the average height of the two observations:
  • 2.0 means the gap can be up to 200% of the average line height
  • 1.5 would require closer spacing (150% of average height)
  • 3.0 would allow wider spacing (300% of average height)
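
A rough sketch of the gap check described above. The function name withinVerticalGap is hypothetical, and measuring the gap between the top edges of the two boxes is an assumption; the library may measure it differently. The Box shape mirrors the { x, y, width, height } objects used in the examples on this page.

```typescript
// Hypothetical sketch: NOT kokokor's actual implementation.
type Box = { x: number; y: number; width: number; height: number };

function withinVerticalGap(a: Box, b: Box, maxVerticalGapRatio = 2.0): boolean {
  const averageHeight = (a.height + b.height) / 2;
  // Assumption: the gap is the distance between the two top edges.
  const gap = Math.abs(a.y - b.y);
  return gap <= maxVerticalGapRatio * averageHeight;
}

const first: Box = { x: 1400, y: 500, width: 900, height: 40 };
const second: Box = { x: 250, y: 510, width: 900, height: 40 };
console.log(withinVerticalGap(first, second)); // gap of 10 against a limit of 80
```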
minWidthRatioForMerged
null | number
default:"0.6"
For merged lines heuristic: The minimum width a line must have to be considered. This prevents very short lines (like page numbers) from being analyzed for poetry. Specified as a ratio of the image width:
  • 0.6 means the line must span at least 60% of the page width
  • 0.4 would include shorter lines (40% of page width)
  • 0.8 would require longer lines (80% of page width)
  • null disables width filtering
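
The width filter above can be sketched like this. The helper name passesWidthFilter is hypothetical and only illustrates the documented behavior, including the null case; it is not the library's own code.

```typescript
// Hypothetical sketch: NOT kokokor's actual implementation.
function passesWidthFilter(
  lineWidth: number,
  pageWidth: number,
  minWidthRatioForMerged: number | null = 0.6,
): boolean {
  // null disables width filtering, so every line is analyzed.
  if (minWidthRatioForMerged === null) {
    return true;
  }
  return lineWidth / pageWidth >= minWidthRatioForMerged;
}

console.log(passesWidthFilter(1600, 2550)); // true: about 63% of the page width
console.log(passesWidthFilter(100, 2550)); // false: likely a page number
console.log(passesWidthFilter(100, 2550, null)); // true: filtering disabled
```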
minWordCount
number
default:"2"
The minimum number of words a line must contain to be considered poetry. This helps filter out noise like page numbers, single-word labels, or artifacts. Higher values reduce false positives but may miss short poetic lines.
pairWidthSimilarityRatio
number
default:"0.4"
For paired lines heuristic: How similar in width two hemistichs must be. This determines whether two lines are balanced enough to be verse halves. The similarity check: |width1 - width2| / average(width1, width2) < ratio
  • 0.4 means widths can differ by up to 40% of their average
  • 0.2 would require more similar widths (20% difference)
  • 0.6 would allow more variation (60% difference)
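
The similarity check above translates directly into code. The function name hasSimilarWidths is hypothetical, and the zero-width guard is an added assumption; only the formula itself comes from this page.

```typescript
// Hypothetical sketch of the documented formula:
// |width1 - width2| / average(width1, width2) < ratio
function hasSimilarWidths(width1: number, width2: number, ratio = 0.4): boolean {
  const average = (width1 + width2) / 2;
  // Assumption: degenerate zero-width input is treated as not similar.
  if (average === 0) {
    return false;
  }
  return Math.abs(width1 - width2) / average < ratio;
}

console.log(hasSimilarWidths(900, 1000)); // true: differs by about 10.5% of the average
console.log(hasSimilarWidths(500, 1000)); // false: differs by about 66.7% of the average
```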
pairWordCountSimilarityRatio
number
default:"0.5"
For paired lines heuristic: How similar in word count two hemistichs must be. This ensures that verse halves have balanced content length. The similarity check: |count1 - count2| / max(count1, count2) < ratio
  • 0.5 means word counts can differ by up to 50% of the larger count
  • 0.3 would require more similar counts (30% difference)
  • 0.7 would allow more variation (70% difference)
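
The word-count check can be written the same way. The function name hasSimilarWordCounts is hypothetical, and the guard for empty lines is an added assumption; the formula is the one stated above.

```typescript
// Hypothetical sketch of the documented formula:
// |count1 - count2| / max(count1, count2) < ratio
function hasSimilarWordCounts(count1: number, count2: number, ratio = 0.5): boolean {
  const larger = Math.max(count1, count2);
  // Assumption: two empty lines are treated as not similar.
  if (larger === 0) {
    return false;
  }
  return Math.abs(count1 - count2) / larger < ratio;
}

console.log(hasSimilarWordCounts(4, 5)); // true: 1 / 5 = 0.2
console.log(hasSimilarWordCounts(2, 5)); // false: 3 / 5 = 0.6
```

Because the comparison is strict, a pair such as 2 and 4 words (exactly 0.5) does not pass at the default ratio.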
wordDensityComparisonRatio
number
default:"0.8"
For merged lines heuristic: Word density threshold for identifying poetry. Poetry typically has lower word density (more spacing) than prose. A line is considered poetic if its density (words per pixel) is less than this ratio multiplied by the average prose density of the document.
  • 0.8 means poetic lines should have ≤80% of average prose density
  • 0.9 would require density closer to prose (≤90%)
  • 0.7 would require sparser text (≤70% of prose density)
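
A minimal sketch of the density comparison described above. The function name isSparseEnoughForPoetry and its parameters are hypothetical; how the library computes the average prose density is not covered here, so it is passed in as a plain number.

```typescript
// Hypothetical sketch: NOT kokokor's actual implementation.
function isSparseEnoughForPoetry(
  wordCount: number,
  lineWidthPx: number,
  averageProseDensity: number, // words per pixel, assumed to be precomputed
  wordDensityComparisonRatio = 0.8,
): boolean {
  const density = wordCount / lineWidthPx; // words per pixel
  return density < wordDensityComparisonRatio * averageProseDensity;
}

const proseDensity = 0.006; // illustrative value: 6 words per 1000 pixels
console.log(isSparseEnoughForPoetry(6, 1800, proseDensity)); // true: sparse line
console.log(isSparseEnoughForPoetry(10, 1800, proseDensity)); // false: prose-like line
```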

Usage Example

import { reconstructParagraphs } from 'kokokor';

// Custom poetry detection
const result = await reconstructParagraphs(
  {
    observations: [...],
    page: { width: 2550, height: 3300, dpiX: 300, dpiY: 300 }
  },
  {
    line: {
      poetryDetectionOptions: {
        maxVerticalGapRatio: 1.5,           // Require closer line spacing
        minWordCount: 3,                    // Require at least 3 words
        pairWidthSimilarityRatio: 0.3,      // Require more similar widths
        wordDensityComparisonRatio: 0.7,    // Require sparser text
        centerToleranceRatio: 0.03,         // Stricter centering
        minMarginRatio: 0.15                // Larger margins required
      },
      poetryPairDelimiter: ' ... '
    }
  }
);

// Disable poetry detection
const noPoetryResult = await reconstructParagraphs(
  {
    observations: [...],
    page: { width: 2550, height: 3300, dpiX: 300, dpiY: 300 }
  },
  {
    line: {
      poetryDetectionOptions: undefined
    }
  }
);

CenteringOptions

Configuration options for determining if text content is centered on a page.

Type Definition

type CenteringOptions = {
  readonly centerToleranceRatio: number;
  readonly minMarginRatio: number;
};

Fields

centerToleranceRatio
number
default:"0.05"
The tolerance for center point alignment as a ratio of image width. This determines how precisely text must be centered to be considered “centered”. Examples:
  • 0.05 means the observation’s center can be within 5% of the page width from the true center
  • 0.02 would require more precise centering (within 2%)
  • 0.1 would allow looser centering requirements (within 10%)
minMarginRatio
number
default:"0.1"
The minimum margin required on each side as a ratio of image width. This ensures that “centered” text has adequate whitespace around it. Examples:
  • 0.1 means there must be at least 10% of the page width as whitespace on both left and right sides
  • 0.2 would require larger margins (20% on each side)
  • 0.05 would allow tighter margins (5% on each side)
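
The two ratios above can be combined into a single check, sketched below. The function name looksCentered is hypothetical and this is only one plausible reading of the field descriptions, not the library's confirmed logic. The Box shape mirrors the { x, y, width, height } objects used elsewhere on this page.

```typescript
// Hypothetical sketch: NOT kokokor's actual implementation.
type Box = { x: number; y: number; width: number; height: number };
type Centering = { centerToleranceRatio: number; minMarginRatio: number };

function looksCentered(
  box: Box,
  pageWidth: number,
  options: Centering = { centerToleranceRatio: 0.05, minMarginRatio: 0.1 },
): boolean {
  const leftMargin = box.x;
  const rightMargin = pageWidth - (box.x + box.width);
  // Distance between the box's horizontal center and the page's center.
  const centerOffset = Math.abs(box.x + box.width / 2 - pageWidth / 2);
  const minMargin = options.minMarginRatio * pageWidth;
  return (
    centerOffset <= options.centerToleranceRatio * pageWidth &&
    leftMargin >= minMargin &&
    rightMargin >= minMargin
  );
}

const heading: Box = { x: 800, y: 200, width: 950, height: 60 };
console.log(looksCentered(heading, 2550)); // true: centered with 800px margins
```

A full-width line can have its center on the page's center and still fail this check, because its margins fall below minMarginRatio.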

Usage Example

import { reconstructParagraphs } from 'kokokor';

const result = await reconstructParagraphs(
  {
    observations: [...],
    page: { width: 2550, height: 3300, dpiX: 300, dpiY: 300 }
  },
  {
    line: {
      centerToleranceRatio: 0.03,  // Stricter centering (3%)
      minMarginRatio: 0.15          // Larger margins (15%)
    }
  }
);

// Check if text is centered
for (const line of result.lines) {
  if (line.isCentered) {
    console.log('Centered text:', line.text);
  }
}
