Overview
Kokokor uses sophisticated heuristics to identify poetic content in OCR output, distinguishing poetry from prose based on visual layout, word density, and structural patterns.
Poetry detection is crucial for preserving the artistic and structural integrity of verses, where line breaks carry semantic meaning.
Why Poetry Detection Matters
Poetry requires different handling than prose:
Line Breaks: Each line must remain separate (not merged into paragraphs)
Visual Layout: Centering and spacing are semantically meaningful
Hemistichs: Two-part verses are common in Arabic and Persian poetry
Word Density: Poetry typically has more spacing than prose
Detection Strategy
The algorithm uses multiple coordinated heuristics:
// Internal detection algorithm (not exported).
// Kokokor applies this logic inside mapObservationsToTextLines.
function detectPoetryInGroup(
    group: Observation[],
    imageWidth: number,
    avgProseWordDensity: number,
    options: PoetryDetectionOptions
) {
    // Single wide poetic line (Heuristic 2 below); skipped when minWidthRatioForMerged is null
    if (group.length === 1 && options.minWidthRatioForMerged !== null) {
        return isWidePoeticLine(group[0], imageWidth, avgProseWordDensity, options);
    }
    // Paired hemistichs (Heuristic 1 below)
    if (group.length === 2) {
        return isPoetryPair(group[0], group[1], imageWidth, options);
    }
    return false;
}
These algorithms are internal to Kokokor. Configure behavior via poetryDetectionOptions in mapObservationsToTextLines.
Heuristic 1: Paired Hemistichs
Concept
Traditional poetry (especially Arabic and Persian) often splits verses into two balanced parts called hemistichs:
صدر البيت (first hemistich) عجز البيت (second hemistich)
Detection Criteria
Two observations form a poetry pair when they meet ALL conditions:
Minimum Word Count
Both hemistichs must have at least minWordCount words (default: 2):
const words1 = getWordCount(obs1.text);
const words2 = getWordCount(obs2.text);
if (words1 < minWordCount || words2 < minWordCount) {
    return false;
}
Compatible Widths
Widths must be similar within tolerance (default: 40% of the average width):
const avgWidth = (obs1.bbox.width + obs2.bbox.width) / 2;
const widthDiffRatio = Math.abs(obs1.bbox.width - obs2.bbox.width) / avgWidth;
return widthDiffRatio < pairWidthSimilarityRatio; // Default: 0.4
Reference: src/utils/poetry.ts:67
Compatible Word Counts
Word counts must be similar within tolerance (default: 50% of the larger count):
const maxWords = Math.max(words1, words2);
const wordCountDiffRatio = Math.abs(words1 - words2) / maxWords;
return wordCountDiffRatio < pairWordCountSimilarityRatio; // Default: 0.5
Reference: src/utils/poetry.ts:74
Compatible Vertical Gap
The vertical distance between centers must be within tolerance (default: 200% of the average height):
const centerY1 = obs1.bbox.y + obs1.bbox.height / 2;
const centerY2 = obs2.bbox.y + obs2.bbox.height / 2;
const dy = Math.abs(centerY1 - centerY2);
const avgHeight = (obs1.bbox.height + obs2.bbox.height) / 2;
return dy <= maxVerticalGapRatio * avgHeight; // Default: 2.0
Reference: src/utils/poetry.ts:81
Combined Centering
When combined, the hemistichs must be centered on the page:
const combinedBbox = {
    x: Math.min(obs1.bbox.x, obs2.bbox.x),
    width: Math.max(
        obs1.bbox.x + obs1.bbox.width,
        obs2.bbox.x + obs2.bbox.width
    ) - Math.min(obs1.bbox.x, obs2.bbox.x),
    // ... height and y
};
return textIsCentered(combinedBbox, imageWidth, centeringOptions);
Reference: src/utils/poetry.ts:276
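The first four criteria above (word count, width, word-count balance, vertical gap) can be combined into one predicate. The following is a minimal sketch rather than Kokokor's implementation: the names Obs, PairOptions, PAIR_DEFAULTS, countWords, and passesPairSimilarity are hypothetical, word counting is assumed to be whitespace-based, and the centering step is deliberately left out.

```typescript
// Hypothetical shapes for illustration; Kokokor's internal types may differ.
type BBox = { x: number; y: number; width: number; height: number };
type Obs = { bbox: BBox; text: string };

type PairOptions = {
    minWordCount: number;
    pairWidthSimilarityRatio: number;
    pairWordCountSimilarityRatio: number;
    maxVerticalGapRatio: number;
};

// Default values as documented on this page.
const PAIR_DEFAULTS: PairOptions = {
    minWordCount: 2,
    pairWidthSimilarityRatio: 0.4,
    pairWordCountSimilarityRatio: 0.5,
    maxVerticalGapRatio: 2.0,
};

// Assumption: a word is a whitespace-separated token.
const countWords = (text: string): number =>
    text.trim().split(/\s+/).filter(Boolean).length;

function passesPairSimilarity(a: Obs, b: Obs, opts: PairOptions = PAIR_DEFAULTS): boolean {
    // 1. Minimum word count (cheapest check first)
    const wordsA = countWords(a.text);
    const wordsB = countWords(b.text);
    if (wordsA < opts.minWordCount || wordsB < opts.minWordCount) return false;

    // 2. Width difference relative to the average width
    const avgWidth = (a.bbox.width + b.bbox.width) / 2;
    const widthDiffRatio = Math.abs(a.bbox.width - b.bbox.width) / avgWidth;
    if (widthDiffRatio >= opts.pairWidthSimilarityRatio) return false;

    // 3. Word-count difference relative to the larger count
    const wordDiffRatio = Math.abs(wordsA - wordsB) / Math.max(wordsA, wordsB);
    if (wordDiffRatio >= opts.pairWordCountSimilarityRatio) return false;

    // 4. Vertical distance between centers relative to the average height
    const centerYA = a.bbox.y + a.bbox.height / 2;
    const centerYB = b.bbox.y + b.bbox.height / 2;
    const avgHeight = (a.bbox.height + b.bbox.height) / 2;
    return Math.abs(centerYA - centerYB) <= opts.maxVerticalGapRatio * avgHeight;
}
```

A pair that passes this predicate would still need to pass the combined centering check before being treated as poetry.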
Adaptive Centering
For hemistichs with significant gaps (visual separation), centering tolerance is relaxed:
const hasSignificantGap = gap > imageWidth * 0.07 || gap > avgWidth * 0.15;
if (hasSignificantGap) {
    return {
        centerToleranceRatio: (options.centerToleranceRatio ?? 0.05) * 2.5,
        minMarginRatio: (options.minMarginRatio ?? 0.1) * 0.75,
    };
}
Reference: src/utils/poetry.ts:116
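To see the effect of the relaxation with concrete numbers, here is a small self-contained sketch. The function name relaxCenteringForGap and the CenteringOptions type are hypothetical, and returning the unmodified defaults when the gap is not significant is an assumption, since the excerpt above only shows the relaxed branch.

```typescript
// Hypothetical option shape for illustration.
type CenteringOptions = { centerToleranceRatio?: number; minMarginRatio?: number };

function relaxCenteringForGap(
    gap: number,
    avgWidth: number,
    imageWidth: number,
    options: CenteringOptions = {}
): Required<CenteringOptions> {
    const base = {
        centerToleranceRatio: options.centerToleranceRatio ?? 0.05,
        minMarginRatio: options.minMarginRatio ?? 0.1,
    };
    const hasSignificantGap = gap > imageWidth * 0.07 || gap > avgWidth * 0.15;
    // Assumption: without a significant gap, the base options apply unchanged.
    if (!hasSignificantGap) return base;
    return {
        centerToleranceRatio: base.centerToleranceRatio * 2.5, // wider center tolerance
        minMarginRatio: base.minMarginRatio * 0.75, // smaller required margin
    };
}

// A 60px gap on an 800px page exceeds 7% of the page width (56px),
// so the defaults relax to roughly 0.125 and 0.075.
console.log(relaxCenteringForGap(60, 215, 800));
```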
Asymmetry Detection
Rejects pairs with asymmetric sparse gaps (likely not poetry):
const pageCenter = imageWidth / 2;
const innerLeft = leftObs.bbox.x + leftObs.bbox.width;
const innerRight = rightObs.bbox.x;
const leftDelta = Math.abs(pageCenter - innerLeft);
const rightDelta = Math.abs(innerRight - pageCenter);
const asymmetry = Math.abs(leftDelta - rightDelta);
const isVerySparsePair = gap > avgWidth * 2;
return isVerySparsePair && asymmetry > imageWidth * 0.12;
Reference: src/utils/poetry.ts:98
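The asymmetry test can be tried on sample boxes with the sketch below. It is an illustration under assumptions: the function name hasAsymmetricSparseGap is hypothetical, and gap is assumed to be the horizontal distance between the inner edges of the two observations, which the excerpt does not define.

```typescript
type BBox = { x: number; y: number; width: number; height: number };

// Returns true when the pair should be rejected as an asymmetric sparse pair.
function hasAsymmetricSparseGap(left: BBox, right: BBox, imageWidth: number): boolean {
    const innerLeft = left.x + left.width; // right edge of the left observation
    const innerRight = right.x; // left edge of the right observation
    const gap = innerRight - innerLeft; // assumption: gap is measured between inner edges
    const avgWidth = (left.width + right.width) / 2;

    const pageCenter = imageWidth / 2;
    const leftDelta = Math.abs(pageCenter - innerLeft);
    const rightDelta = Math.abs(innerRight - pageCenter);
    const asymmetry = Math.abs(leftDelta - rightDelta);

    const isVerySparsePair = gap > avgWidth * 2;
    return isVerySparsePair && asymmetry > imageWidth * 0.12;
}

// Sparse but symmetric around the 400px page center: not rejected.
const symmetric = hasAsymmetricSparseGap(
    { x: 100, y: 0, width: 100, height: 18 },
    { x: 600, y: 0, width: 100, height: 18 },
    800
);

// Sparse and shifted left (inner edges at 150px and 450px): rejected.
const shifted = hasAsymmetricSparseGap(
    { x: 50, y: 0, width: 100, height: 18 },
    { x: 450, y: 0, width: 100, height: 18 },
    800
);

console.log(symmetric, shifted);
```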
Heuristic 2: Wide Poetic Lines
Concept
Some poetry appears as single wide lines rather than split hemistichs. These are identified by comparing to prose characteristics.
Detection Criteria
Minimum Word Count
Must have at least minWordCount words (default: 2)
No Prose Punctuation
Lines containing prose punctuation are filtered out, since they might otherwise match:
const PROSE_PUNCTUATION_PATTERN = /[،,؛;؟?۔.:()]/;
if (PROSE_PUNCTUATION_PATTERN.test(obs.text)) {
    return false; // Likely prose
}
Reference: src/utils/constants.ts:73
Centered on Page
Must be centered with adequate margins:
if (!textIsCentered(obs.bbox, imageWidth, options)) {
    return false;
}
Poetry-Like Density
Word density must be lower than average prose:
const obsDensity = wordCount / obs.bbox.width;
const densityRatio = obsDensity / avgProseWordDensity;
// The threshold depends on how much of the page the line spans
const widthRatio = obs.bbox.width / imageWidth;
const requiredDensityRatio = widthRatio > 0.75
    ? wordDensityComparisonRatio * 0.95 // Very wide lines (over 75% of page width)
    : 0.5; // All other lines: fixed threshold
return densityRatio < requiredDensityRatio;
Reference: src/utils/poetry.ts:142
Minimum Width Check
Only lines spanning significant page width are considered:
if (obs.bbox.width <= imageWidth * minWidthRatioForMerged) {
    return false; // Too narrow
}
// Default minWidthRatioForMerged: 0.6 (60% of page width)
Reference: src/utils/poetry.ts:151
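The wide-line criteria above can be chained into a single check. The sketch below is a hedged illustration, not the library's code: looksCentered is a hypothetical stand-in for the internal textIsCentered (assumed here to require the box center to lie near the page center and both side margins to exceed minMarginRatio), word counting is assumed to be whitespace-based, and the order of the checks is a guess.

```typescript
// Hypothetical shapes for illustration; Kokokor's internal types may differ.
type BBox = { x: number; y: number; width: number; height: number };

type WideLineOptions = {
    centerToleranceRatio: number;
    minMarginRatio: number;
    minWidthRatioForMerged: number | null;
    minWordCount: number;
    wordDensityComparisonRatio: number;
};

// Default values as documented on this page.
const WIDE_LINE_DEFAULTS: WideLineOptions = {
    centerToleranceRatio: 0.05,
    minMarginRatio: 0.1,
    minWidthRatioForMerged: 0.6,
    minWordCount: 2,
    wordDensityComparisonRatio: 0.95,
};

const PROSE_PUNCTUATION = /[،,؛;؟?۔.:()]/;

// Assumption: "centered" means the box center is near the page center
// and both side margins are at least minMarginRatio of the page width.
function looksCentered(bbox: BBox, imageWidth: number, opts: WideLineOptions): boolean {
    const leftMargin = bbox.x;
    const rightMargin = imageWidth - (bbox.x + bbox.width);
    const centerOffset = Math.abs(bbox.x + bbox.width / 2 - imageWidth / 2);
    return (
        centerOffset <= imageWidth * opts.centerToleranceRatio &&
        Math.min(leftMargin, rightMargin) >= imageWidth * opts.minMarginRatio
    );
}

function looksLikeWidePoeticLine(
    text: string,
    bbox: BBox,
    imageWidth: number,
    avgProseWordDensity: number,
    opts: WideLineOptions = WIDE_LINE_DEFAULTS
): boolean {
    if (opts.minWidthRatioForMerged === null) return false; // heuristic disabled
    const wordCount = text.trim().split(/\s+/).filter(Boolean).length;
    if (wordCount < opts.minWordCount) return false;
    if (PROSE_PUNCTUATION.test(text)) return false;
    if (bbox.width <= imageWidth * opts.minWidthRatioForMerged) return false;
    if (!looksCentered(bbox, imageWidth, opts)) return false;

    // Compare this line's words-per-pixel density against the prose baseline.
    const densityRatio = wordCount / bbox.width / avgProseWordDensity;
    const requiredDensityRatio =
        bbox.width / imageWidth > 0.75 ? opts.wordDensityComparisonRatio * 0.95 : 0.5;
    return densityRatio < requiredDensityRatio;
}
```

For a five-word line 760px wide on a 1000px page with a prose baseline of 0.012 words per pixel, the density ratio is about 0.55, which is below the 0.9025 threshold that applies to lines wider than 75% of the page.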
Prose Density Baseline
Both heuristics rely on calculating average prose word density as a baseline:
// Internal function: calculates the baseline word density.
// mapObservationsToTextLines does this automatically.
function calculateProseDensityBaseline(
    observations: Observation[],
    imageWidth: number,
    options: PoetryDetectionOptions
): number {
    const { minWordCount } = options;
    let totalWords = 0;
    let totalWidth = 0;
    for (const obs of observations) {
        const wordCount = getWordCount(obs.text);
        // Identify likely prose (not centered, wide, moderate word count)
        const isLikelyProse =
            !textIsCentered(obs.bbox, imageWidth, options) &&
            obs.bbox.width > imageWidth * 0.4 &&
            wordCount >= minWordCount &&
            wordCount <= MAX_PROSE_WORD_COUNT; // 25
        if (isLikelyProse) {
            totalWords += wordCount;
            totalWidth += obs.bbox.width;
        }
    }
    return totalWords / totalWidth; // Words per pixel
}
Prose Identification
Prose is identified by:
Not centered (left-aligned text)
Width > 40% of page width
Word count between minimum and maximum (2-25 words)
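As a worked illustration of the baseline, the sketch below applies the three prose criteria to pre-measured lines. The LineStats type and proseDensityBaseline function are hypothetical; the centered flag stands in for the internal centering check, and returning 0 when no line qualifies is an assumption made here to avoid dividing by zero.

```typescript
// Hypothetical pre-measured line; `centered` stands in for the internal centering check.
type LineStats = { wordCount: number; width: number; centered: boolean };

function proseDensityBaseline(
    lines: LineStats[],
    imageWidth: number,
    minWordCount = 2,
    maxProseWordCount = 25
): number {
    let totalWords = 0;
    let totalWidth = 0;
    for (const line of lines) {
        const isLikelyProse =
            !line.centered &&
            line.width > imageWidth * 0.4 &&
            line.wordCount >= minWordCount &&
            line.wordCount <= maxProseWordCount;
        if (isLikelyProse) {
            totalWords += line.wordCount;
            totalWidth += line.width;
        }
    }
    // Assumption: fall back to 0 when no prose lines were found.
    return totalWidth > 0 ? totalWords / totalWidth : 0;
}

const sampleLines: LineStats[] = [
    { wordCount: 12, width: 700, centered: false }, // prose
    { wordCount: 10, width: 650, centered: false }, // prose
    { wordCount: 5, width: 600, centered: true }, // centered: excluded
    { wordCount: 3, width: 200, centered: false }, // narrower than 40% of 800px: excluded
];

// (12 + 10) words / (700 + 650) pixels, roughly 0.0163 words per pixel
const baseline = proseDensityBaseline(sampleLines, 800);
console.log(baseline);
```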
Configuration Options
type PoetryDetectionOptions = {
    // Centering detection
    centerToleranceRatio: number; // Default: 0.05 (5%)
    minMarginRatio: number; // Default: 0.1 (10%)
    // Paired hemistichs
    maxVerticalGapRatio: number; // Default: 2.0 (200%)
    pairWidthSimilarityRatio: number; // Default: 0.4 (40%)
    pairWordCountSimilarityRatio: number; // Default: 0.5 (50%)
    // Wide poetic lines
    minWidthRatioForMerged: number | null; // Default: 0.6 (60%); null disables wide-line detection
    wordDensityComparisonRatio: number; // Default: 0.95 (95%)
    // General
    minWordCount: number; // Default: 2
};
Reference: src/types.ts:283
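For reference, the documented defaults can be written out as a single object and merged with partial overrides. This is only a sketch: DEFAULT_POETRY_OPTIONS and resolvePoetryOptions are hypothetical names, and how Kokokor itself merges user options with its defaults is not shown on this page.

```typescript
type PoetryDetectionOptions = {
    centerToleranceRatio: number;
    minMarginRatio: number;
    maxVerticalGapRatio: number;
    pairWidthSimilarityRatio: number;
    pairWordCountSimilarityRatio: number;
    minWidthRatioForMerged: number | null;
    wordDensityComparisonRatio: number;
    minWordCount: number;
};

// Values taken from the "Default" comments in the type above.
const DEFAULT_POETRY_OPTIONS: PoetryDetectionOptions = {
    centerToleranceRatio: 0.05,
    minMarginRatio: 0.1,
    maxVerticalGapRatio: 2.0,
    pairWidthSimilarityRatio: 0.4,
    pairWordCountSimilarityRatio: 0.5,
    minWidthRatioForMerged: 0.6,
    wordDensityComparisonRatio: 0.95,
    minWordCount: 2,
};

// Hypothetical helper: later keys win, so overrides replace defaults.
// An explicit null for minWidthRatioForMerged is preserved.
function resolvePoetryOptions(
    overrides: Partial<PoetryDetectionOptions> = {}
): PoetryDetectionOptions {
    return { ...DEFAULT_POETRY_OPTIONS, ...overrides };
}

const resolved = resolvePoetryOptions({ minWordCount: 3, minWidthRatioForMerged: null });
console.log(resolved.minWordCount, resolved.minWidthRatioForMerged, resolved.centerToleranceRatio);
```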
Real-World Examples
Example 1: Arabic Poetry Pair (Hemistichs)
Input Observations:
[
    {
        bbox: { x: 150, y: 200, width: 220, height: 18 },
        text: "في البدء كانت الكلمة" // 4 words
    },
    {
        bbox: { x: 430, y: 200, width: 210, height: 18 },
        text: "والكلمة عند الله" // 3 words
    }
]
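Analysis (the 400px page center implies an 800px-wide page):
✓ Word counts: 4 and 3 (difference ratio 1/4 = 0.25, below the 0.5 tolerance)
✓ Widths: 220px and 210px (difference ratio 10/215 ≈ 0.05, below the 0.4 tolerance)
✓ Vertical gap: 0px (same Y coordinate)
✓ Combined span: x=150 to x=640 (490px wide)
✓ Combined center: (150 + 640) / 2 = 395px
✓ Page center: 400px (5px offset, within the 5% tolerance)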
Result: isPoetic = true
Output:
في البدء كانت الكلمة والكلمة عند الله
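The combined-box arithmetic in this example can be reproduced with a few lines of code. The helper name combineHorizontally is hypothetical, the 800px page width is inferred from the stated 400px page center, and only the center-offset part of the centering check is shown.

```typescript
type BBox = { x: number; y: number; width: number; height: number };

// Horizontal extent of the box that encloses both observations.
function combineHorizontally(a: BBox, b: BBox): { x: number; width: number } {
    const x = Math.min(a.x, b.x);
    const right = Math.max(a.x + a.width, b.x + b.width);
    return { x, width: right - x };
}

const firstHemistich: BBox = { x: 150, y: 200, width: 220, height: 18 };
const secondHemistich: BBox = { x: 430, y: 200, width: 210, height: 18 };
const pageWidth = 800; // assumption: implied by the 400px page center

const combined = combineHorizontally(firstHemistich, secondHemistich);
const combinedCenter = combined.x + combined.width / 2; // 150 + 490 / 2 = 395
const centerOffset = Math.abs(combinedCenter - pageWidth / 2); // |395 - 400| = 5
const withinTolerance = centerOffset <= pageWidth * 0.05; // 5px against a 40px tolerance

console.log(combined, combinedCenter, withinTolerance);
```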
Example 2: Wide Poetic Line
Input Observation:
{
    bbox: { x: 100, y: 150, width: 600, height: 20 },
    text: "يا ليل الصب متى غده" // 5 words
}
Page Width: 800px
Avg Prose Density: 0.015 words/pixel
Analysis:
✓ Word count: 5 (>= 2)
✓ No prose punctuation
✓ Width: 600px (75% of page, above the 60% threshold)
✓ Centered: x=100, width=600, center=400 vs page center=400
✓ Density: 5/600 = 0.0083 words/pixel
✓ Density ratio: 0.0083/0.015 = 0.55 (< 0.95)
Result: isPoetic = true
Example 3: Prose (Not Poetry)
Input Observation:
{
    bbox: { x: 50, y: 300, width: 700, height: 20 },
    text: "This is a regular paragraph of text, with commas and punctuation."
}
Analysis:
✗ Contains prose punctuation (commas, period)
✗ High word density (prose-like)
✗ Fails the centering check (x=50 leaves only a 50px left margin)
Result: isPoetic = false
Custom Configuration Example
import { reconstructParagraphs } from 'kokokor';

const result = reconstructParagraphs(
    { observations, page, layout },
    {
        line: {
            poetryDetectionOptions: {
                // Stricter centering for poetry
                centerToleranceRatio: 0.03, // 3% instead of 5%
                minMarginRatio: 0.15, // 15% instead of 10%
                // More lenient hemistich matching
                pairWidthSimilarityRatio: 0.5, // 50% instead of 40%
                pairWordCountSimilarityRatio: 0.6, // 60% instead of 50%
                // Require minimum 3 words
                minWordCount: 3,
                // Disable wide poetic line detection
                minWidthRatioForMerged: null,
                // Lower density threshold (unused while wide-line detection is disabled)
                wordDensityComparisonRatio: 0.85, // 85% instead of 95%
            },
            poetryPairDelimiter: ' ... ', // Custom separator
        },
    }
);
Integration with Pipeline
Poetry detection runs during Stage 1 (Observations → Text Lines):
const avgProseWordDensity = calculateProseDensityBaseline(
    observations,
    page.width,
    options.poetryDetectionOptions
);
for (const group of groups) {
    if (groupMatchesPoetryCriteria(
        group,
        page.width,
        avgProseWordDensity,
        options.poetryDetectionOptions
    )) {
        for (const observation of group) {
            observation.isPoetic = true;
        }
    }
}
Reference: src/utils/paragraphs.ts:159
Disabling Poetry Detection
To disable poetry detection entirely:
const result = reconstructParagraphs(
    { observations, page, layout },
    {
        line: {
            poetryDetectionOptions: undefined, // Disable detection
        },
    }
);
Poetry detection runs once per document during line grouping. The prose density calculation is O(n) where n is the number of observations.
Optimizations:
Prose density calculated once, reused for all groups
Early rejection based on word count (cheapest check)
Width and word count checks before expensive centering calculations
Next Steps
TextBlock Metadata: Learn about the isPoetic flag
Processing Pipeline: See where poetry detection fits
Configuration: Explore all configuration options
RTL Support: Poetry detection for RTL languages