Overview

Kokokor uses layout heuristics (centering, line width, and word density) to detect and preserve poetry formatting. Lines identified as poetry are kept separate rather than merged into paragraphs, which maintains the structure of the verse.

Poetry Detection Methods

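Kokokor identifies poetry using two complementary heuristics, hemistich pairs and wide centered lines, plus a centering check that both of them share:
Hemistich pairs: two observations on the same line with similar width and word count that are centered as a unit. Common in Arabic and classical poetry. Detection criteria: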
  • Similar widths (within 40% difference by default)
  • Similar word counts (within 50% difference)
  • Centered when considered together
  • Compatible vertical spacing
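
As a rough illustration of the first two criteria, the sketch below compares the widths and word counts of two segments. The `isSimilarPair` helper, the `Segment` type, and the exact formulas are assumptions made for this example (width difference relative to the average width, word-count difference relative to the larger count); they are not Kokokor's internal code. The centering and vertical-spacing criteria are left out.

```typescript
// Hypothetical sketch for illustration; not Kokokor's internal API.
type Segment = { text: string; width: number };

// Count whitespace-separated words in a string.
const countWords = (text: string): number =>
  text.trim().split(/\s+/).filter(Boolean).length;

function isSimilarPair(
  a: Segment,
  b: Segment,
  pairWidthSimilarityRatio = 0.4,
  pairWordCountSimilarityRatio = 0.5,
): boolean {
  // Width difference relative to the average width of the two segments.
  const avgWidth = (a.width + b.width) / 2;
  const widthDiff = Math.abs(a.width - b.width) / avgWidth;

  // Word-count difference relative to the larger of the two counts.
  const wordsA = countWords(a.text);
  const wordsB = countWords(b.text);
  const wordDiff = Math.abs(wordsA - wordsB) / Math.max(wordsA, wordsB, 1);

  return (
    widthDiff <= pairWidthSimilarityRatio &&
    wordDiff <= pairWordCountSimilarityRatio
  );
}

// Two balanced halves: widths differ by about 3%, word counts match.
const balanced = isSimilarPair(
  { text: 'صدر البيت الأول', width: 340 },
  { text: 'عجز البيت الأول', width: 350 },
);

// A long segment next to a short one: widths differ by far more than 40%.
const unbalanced = isSimilarPair(
  { text: 'A long opening half of a line', width: 600 },
  { text: 'short', width: 120 },
);

console.log(balanced, unbalanced);
```
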
Wide lines: single lines that are centered with lower word density than prose. Detection criteria:
  • Centered on the page
  • Width ≥ 60% of page width
  • Word density < 80% of average prose density
  • At least 2 words (configurable)
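
The width, density, and word-count criteria can be sketched as a single predicate. This is only an illustration: `passesWideLineChecks` is a hypothetical helper, and "word density" is assumed here to mean words per unit of line width, which the documentation does not define precisely. The centering criterion is omitted.

```typescript
// Hypothetical sketch; "word density" is assumed to be words per unit of width.
type TextLine = { text: string; width: number };

const wordsIn = (text: string): number =>
  text.trim().split(/\s+/).filter(Boolean).length;

function passesWideLineChecks(
  line: TextLine,
  pageWidth: number,
  proseDensity: number,
  minWidthRatioForMerged = 0.6,
  wordDensityComparisonRatio = 0.8,
  minWordCount = 2,
): boolean {
  const words = wordsIn(line.text);
  const density = words / line.width;
  return (
    line.width >= minWidthRatioForMerged * pageWidth && // wide enough
    density < wordDensityComparisonRatio * proseDensity && // sparser than prose
    words >= minWordCount // not just a fragment
  );
}

// Assumed prose baseline: 0.02 words per unit (for example, 8 words across 400 units).
const sparseVerse = passesWideLineChecks(
  { text: 'Mountains high and valleys deep', width: 650 },
  1000,
  0.02,
);

// Same width, but twelve words make the line as dense as ordinary prose.
const denseProse = passesWideLineChecks(
  { text: 'The story begins with a long journey across the high mountains today', width: 650 },
  1000,
  0.02,
);

console.log(sparseVerse, denseProse);
```
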
Centering: both methods use centering detection with configurable tolerances. Parameters:
  • centerToleranceRatio: How close to center (default: 5% of page width)
  • minMarginRatio: Minimum whitespace on each side (default: 10%)
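
A minimal sketch of how these two parameters might combine, assuming both are measured as fractions of the page width. The `isCentered` helper and `Box` type are hypothetical and only illustrate the arithmetic; Kokokor's own check may differ.

```typescript
// Hypothetical helper for illustration; not Kokokor's internal API.
type Box = { x: number; width: number };

function isCentered(
  box: Box,
  pageWidth: number,
  centerToleranceRatio = 0.05,
  minMarginRatio = 0.1,
): boolean {
  // Distance between the box's midpoint and the page's midpoint.
  const boxCenter = box.x + box.width / 2;
  const offset = Math.abs(boxCenter - pageWidth / 2);

  // Whitespace on each side of the box.
  const leftMargin = box.x;
  const rightMargin = pageWidth - (box.x + box.width);
  const minMargin = minMarginRatio * pageWidth;

  return (
    offset <= centerToleranceRatio * pageWidth &&
    leftMargin >= minMargin &&
    rightMargin >= minMargin
  );
}

// Verse line on a 1000-unit page: midpoint at 500, margins of 250 on each side.
const centeredLine = isCentered({ x: 250, width: 500 }, 1000);

// Left-aligned prose line: midpoint at 300, which is 200 units off center.
const proseLine = isCentered({ x: 100, width: 400 }, 1000);

console.log(centeredLine, proseLine);
```
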

Basic Poetry Example

import { reconstructParagraphs } from 'kokokor';

const document = {
  observations: [
    // Prose paragraph
    { text: 'This is a regular paragraph', bbox: { x: 100, y: 100, width: 400, height: 20 } },
    { text: 'of prose text that will', bbox: { x: 100, y: 125, width: 380, height: 20 } },
    { text: 'be merged together.', bbox: { x: 100, y: 150, width: 320, height: 20 } },
    
    // Poetry (centered, lower density)
    { text: 'Roses are red', bbox: { x: 350, y: 210, width: 300, height: 20 } },
    { text: 'Violets are blue', bbox: { x: 340, y: 240, width: 320, height: 20 } },
    
    // More prose
    { text: 'Back to regular prose text', bbox: { x: 100, y: 300, width: 400, height: 20 } },
  ],
  page: {
    width: 1000,
    height: 1200,
    dpiX: 72,
    dpiY: 72,
  },
};

const result = reconstructParagraphs(document, {
  line: {
    poetryDetectionOptions: {
      minWordCount: 2,
      centerToleranceRatio: 0.05,
      minMarginRatio: 0.1,
      wordDensityComparisonRatio: 0.8,
    },
  },
});

console.log(result.text);
// Output:
// This is a regular paragraph of prose text that will be merged together.
//
// Roses are red
// Violets are blue
//
// Back to regular prose text

// Check which lines are poetry
result.lines.forEach((line, i) => {
  console.log(`Line ${i}: ${line.isPoetic ? 'POETRY' : 'prose'}`);
});

Arabic Poetry (Hemistichs)

Arabic poetry often divides each verse into two hemistichs, balanced halves that sit side by side on the same line:
import { reconstructParagraphs } from 'kokokor';

const arabicPoetry = {
  observations: [
    // First verse: in an RTL layout the opening hemistich sits on the right
    { text: 'صدر البيت الأول', bbox: { x: 600, y: 100, width: 340, height: 30 } },
    { text: 'عجز البيت الأول', bbox: { x: 200, y: 100, width: 340, height: 30 } },
    
    // Second verse: same arrangement
    { text: 'صدر البيت الثاني', bbox: { x: 590, y: 150, width: 350, height: 30 } },
    { text: 'عجز البيت الثاني', bbox: { x: 190, y: 150, width: 350, height: 30 } },
  ],
  page: {
    width: 1240,
    height: 1754,
    dpiX: 72,
    dpiY: 72,
  },
};

const result = reconstructParagraphs(arabicPoetry, {
  line: {
    isRTL: true,
    poetryPairDelimiter: ' ... ',  // Traditional delimiter
    poetryDetectionOptions: {
      pairWidthSimilarityRatio: 0.4,
      pairWordCountSimilarityRatio: 0.5,
      maxVerticalGapRatio: 2.0,
    },
  },
});

console.log(result.text);
// Output:
// صدر البيت الأول ... عجز البيت الأول
// صدر البيت الثاني ... عجز البيت الثاني

Mixed Prose and Poetry Document

import { reconstructParagraphs } from 'kokokor';

const mixedDocument = {
  observations: [
    // Chapter title (will be detected as heading if in rectangle)
    { text: 'Chapter 1: The Journey Begins', bbox: { x: 100, y: 80, width: 500, height: 25 } },
    
    // Prose introduction
    { text: 'The story begins with a', bbox: { x: 100, y: 130, width: 380, height: 18 } },
    { text: 'long journey across the', bbox: { x: 100, y: 153, width: 370, height: 18 } },
    { text: 'mountains.', bbox: { x: 100, y: 176, width: 180, height: 18 } },
    
    // Embedded poetry (centered)
    { text: 'Mountains high and valleys deep', bbox: { x: 250, y: 220, width: 500, height: 18 } },
    { text: 'Where ancient secrets sleep', bbox: { x: 270, y: 243, width: 460, height: 18 } },
    
    // Continuation of prose
    { text: 'The traveler continued', bbox: { x: 100, y: 290, width: 360, height: 18 } },
    { text: 'onward, inspired by the', bbox: { x: 100, y: 313, width: 370, height: 18 } },
    { text: 'verse above.', bbox: { x: 100, y: 336, width: 200, height: 18 } },
  ],
  page: {
    width: 1000,
    height: 1400,
    dpiX: 150,
    dpiY: 150,
  },
};

const result = reconstructParagraphs(mixedDocument, {
  line: {
    poetryDetectionOptions: {
      minWordCount: 3,
      minWidthRatioForMerged: 0.5,
      centerToleranceRatio: 0.05,
      minMarginRatio: 0.15,
      wordDensityComparisonRatio: 0.8,
    },
  },
  paragraph: {
    verticalJumpFactor: 2,
    widthTolerance: 0.85,
  },
});

console.log(result.text);
// Output:
// Chapter 1: The Journey Begins
//
// The story begins with a long journey across the mountains.
//
// Mountains high and valleys deep
// Where ancient secrets sleep
//
// The traveler continued onward, inspired by the verse above.

// Analyze the structure
result.paragraphs.forEach((para, i) => {
  console.log(`Paragraph ${i + 1}:`);
  console.log(`  Text: ${para.text}`);
  console.log(`  Is Poetry: ${para.isPoetic || false}`);
  console.log(`  Is Centered: ${para.isCentered || false}`);
  console.log(`  Is Heading: ${para.isHeading || false}`);
});

Configuration Options

poetryDetectionOptions.minWordCount (number, default: 2)
Minimum number of words for a line to be considered poetry. Filters out noise like page numbers.

poetryDetectionOptions.centerToleranceRatio (number, default: 0.05)
How close to center a line must be (as ratio of page width). 0.05 = within 5% of page width from true center.

poetryDetectionOptions.minMarginRatio (number, default: 0.1)
Minimum whitespace required on each side (as ratio of page width). 0.1 = 10% margin on each side.

poetryDetectionOptions.wordDensityComparisonRatio (number, default: 0.8)
For wide poetry: maximum word density as ratio of prose density. 0.8 = poetry must have ≤80% of prose density.

poetryDetectionOptions.minWidthRatioForMerged (number | null, default: 0.6)
Minimum width for wide poetry lines (as ratio of page width). Set to null to disable wide poetry detection.

poetryDetectionOptions.pairWidthSimilarityRatio (number, default: 0.4)
For hemistichs: maximum width difference (as ratio of average width). 0.4 = widths can differ by up to 40%.

poetryDetectionOptions.pairWordCountSimilarityRatio (number, default: 0.5)
For hemistichs: maximum word count difference (as ratio of max count). 0.5 = counts can differ by up to 50%.

poetryDetectionOptions.maxVerticalGapRatio (number, default: 2.0)
For hemistichs: maximum vertical gap (as ratio of average height). 2.0 = gap can be up to 200% of line height.

poetryPairDelimiter (string, default: ' ')
Delimiter used when merging hemistichs. Use ' ... ' for traditional Arabic poetry formatting.
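
For reference, the documented defaults for `poetryDetectionOptions` can be collected into one object. The `PoetryDetectionOptions` type below is a local mirror written for this example (the library's exported types may be named or shaped differently), and `withOverrides` is a hypothetical helper that makes the merge with defaults explicit.

```typescript
// Local type mirroring the documented options; the library's own types may differ.
type PoetryDetectionOptions = {
  minWordCount: number;
  centerToleranceRatio: number;
  minMarginRatio: number;
  wordDensityComparisonRatio: number;
  minWidthRatioForMerged: number | null;
  pairWidthSimilarityRatio: number;
  pairWordCountSimilarityRatio: number;
  maxVerticalGapRatio: number;
};

// Default values as listed in the option reference.
const documentedDefaults: PoetryDetectionOptions = {
  minWordCount: 2,
  centerToleranceRatio: 0.05,
  minMarginRatio: 0.1,
  wordDensityComparisonRatio: 0.8,
  minWidthRatioForMerged: 0.6,
  pairWidthSimilarityRatio: 0.4,
  pairWordCountSimilarityRatio: 0.5,
  maxVerticalGapRatio: 2.0,
};

// Hypothetical helper: start from the defaults and override selected fields.
function withOverrides(
  overrides: Partial<PoetryDetectionOptions>,
): PoetryDetectionOptions {
  return { ...documentedDefaults, ...overrides };
}

// Disable wide poetry detection and keep only hemistich pair detection.
const pairsOnly = withOverrides({ minWidthRatioForMerged: null });

console.log(pairsOnly);
```

The resulting object could then be passed as `line.poetryDetectionOptions`, as in the examples above.
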

How Poetry Detection Works

1. Calculate Prose Baseline

Kokokor analyzes the entire document to calculate average word density for prose content. This serves as a baseline for comparison.
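
One plausible way to compute such a baseline is total words divided by total line width. The sketch below uses that definition as an assumption; `averageWordDensity` is a hypothetical helper, and the documentation does not specify how Kokokor measures density or which lines it includes.

```typescript
// Hypothetical sketch: the baseline is assumed to be total words / total width.
type MeasuredLine = { text: string; width: number };

function averageWordDensity(lines: MeasuredLine[]): number {
  let totalWords = 0;
  let totalWidth = 0;
  for (const line of lines) {
    totalWords += line.text.trim().split(/\s+/).filter(Boolean).length;
    totalWidth += line.width;
  }
  // Guard against an empty document.
  return totalWidth > 0 ? totalWords / totalWidth : 0;
}

// The three prose lines from the basic example: 13 words across 1100 units.
const baseline = averageWordDensity([
  { text: 'This is a regular paragraph', width: 400 },
  { text: 'of prose text that will', width: 380 },
  { text: 'be merged together.', width: 320 },
]);

console.log(baseline.toFixed(4)); // "0.0118"
```
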
2. Check Wide Lines

For single lines that are wide enough (≥60% of page width by default), Kokokor checks:
  • Is it centered with sufficient margins?
  • Is word density lower than prose baseline?
  • Does it have enough words (not just fragments)?
3. Check Pairs

For pairs of observations on the same line, Kokokor checks:
  • Do they have similar widths?
  • Do they have similar word counts?
  • Are they centered when combined?
  • Is the vertical gap appropriate?
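
The vertical-gap question can be illustrated with a short sketch. It assumes the gap is the distance between the two boxes' vertical midpoints and compares it with `maxVerticalGapRatio` times their average height; both the measurement and the `verticalGapOk` helper are assumptions for this example.

```typescript
// Hypothetical sketch; how Kokokor measures the gap is an assumption here.
type VBox = { y: number; height: number };

function verticalGapOk(a: VBox, b: VBox, maxVerticalGapRatio = 2.0): boolean {
  const avgHeight = (a.height + b.height) / 2;
  // Assumed definition: distance between the vertical midpoints of the boxes.
  const gap = Math.abs(a.y + a.height / 2 - (b.y + b.height / 2));
  return gap <= maxVerticalGapRatio * avgHeight;
}

// Two hemistichs on the same baseline: the gap is 0.
const sameLine = verticalGapOk({ y: 100, height: 30 }, { y: 100, height: 30 });

// Boxes 200 units apart with an average height of 30: the limit is 60.
const farApart = verticalGapOk({ y: 100, height: 30 }, { y: 300, height: 30 });

console.log(sameLine, farApart);
```
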
4. Mark as Poetry

Lines identified as poetry receive the isPoetic: true flag and are preserved as separate lines in the output.
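
Once lines carry the flag, consecutive poetic lines can be grouped into stanzas. The sketch below works on a local `OutputLine` type that assumes each line exposes `text` alongside `isPoetic`; `extractStanzas` is a hypothetical helper and not part of Kokokor.

```typescript
// Local type for illustration; assumes each output line has `text` and `isPoetic`.
type OutputLine = { text: string; isPoetic?: boolean };

// Group runs of consecutive poetic lines into stanzas.
function extractStanzas(lines: OutputLine[]): string[][] {
  const stanzas: string[][] = [];
  let current: string[] = [];
  for (const line of lines) {
    if (line.isPoetic) {
      current.push(line.text);
    } else if (current.length > 0) {
      // A prose line ends the current stanza.
      stanzas.push(current);
      current = [];
    }
  }
  if (current.length > 0) stanzas.push(current);
  return stanzas;
}

const stanzas = extractStanzas([
  { text: 'This is a regular paragraph' },
  { text: 'Roses are red', isPoetic: true },
  { text: 'Violets are blue', isPoetic: true },
  { text: 'Back to regular prose text' },
]);

console.log(stanzas); // [ [ 'Roses are red', 'Violets are blue' ] ]
```
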

Best Practices

Tune for Your Content: Different document types may need different thresholds. Poetry anthologies might use stricter detection, while mixed documents might need looser settings.
Centered Headings: Very short centered headings might be misidentified as poetry. Use the minWordCount parameter to filter these out, or provide rectangle layout elements to mark headings explicitly.
Prose Punctuation: Kokokor automatically excludes lines containing parentheses, commas, or semicolons from wide poetry detection, as these characters are more common in prose.
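
A filter of this kind could be as simple as a character-class test. The sketch below is an assumption about the approach, not Kokokor's code: the `hasProsePunctuation` helper is hypothetical, and the Arabic comma and semicolon are included only as a guess at what a multilingual check might cover.

```typescript
// Hypothetical sketch; the exact character set Kokokor checks is an assumption.
// Covers ( ) , ; plus the Arabic comma (،) and Arabic semicolon (؛).
const PROSE_PUNCTUATION = /[(),;،؛]/;

function hasProsePunctuation(text: string): boolean {
  return PROSE_PUNCTUATION.test(text);
}

const verse = hasProsePunctuation('Mountains high and valleys deep');
const prose = hasProsePunctuation('The traveler, tired (and hungry), continued');

console.log(verse, prose);
```
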

See Also

Arabic Text: RTL text processing and Arabic hemistichs

Multi-column: Complex layouts with headings and footnotes