Overview

TextBlock represents a higher-level text unit (typically a line or paragraph) that has been assembled from individual OCR observations and enriched with semantic information about its role and characteristics within the document. This type extends the basic Observation with additional metadata that helps in document structure analysis, formatting preservation, and content classification.

Type Definition

type TextBlock = Observation & {
  isCentered?: boolean;
  isFootnote?: boolean;
  isHeading?: boolean;
  isPoetic?: boolean;
};

Fields

bbox
BoundingBox
required
The bounding box defining the exact position and dimensions of the text within the document coordinate system. Inherited from Observation.
text
string
required
The recognized text content. Inherited from Observation.
isCentered
boolean
Indicates whether the text is centered on the page. This is determined by analyzing the text’s position relative to page margins and ensuring adequate whitespace on both sides. Centered text often indicates:
  • Document titles and headings
  • Poetry or verse content
  • Section headers
  • Epigraphs or quotes
The centering detection uses configurable tolerance and margin ratios to account for slight misalignments and varying document layouts.
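
The centering check described above can be sketched roughly as follows. This is a minimal illustration, not the library's implementation: the box shape (x and width in page pixels), the function name looksCentered, and the tolerance and minMarginRatio parameters with their default values are all assumptions made for this example.

```typescript
// Hypothetical box shape: left edge and width, both in page pixels.
type HorizontalExtent = { x: number; width: number };

// Returns true when the left and right margins are roughly equal and both
// are wide enough to count as deliberate whitespace.
// `tolerance` and `minMarginRatio` are illustrative names and defaults,
// expressed as fractions of the page width.
function looksCentered(
  box: HorizontalExtent,
  pageWidth: number,
  tolerance = 0.05,
  minMarginRatio = 0.1,
): boolean {
  const leftMargin = box.x;
  const rightMargin = pageWidth - (box.x + box.width);

  // Slight misalignment is allowed, up to `tolerance` of the page width.
  const marginsBalanced =
    Math.abs(leftMargin - rightMargin) <= tolerance * pageWidth;

  // Full-width body text has balanced margins too, so also require whitespace.
  const enoughWhitespace =
    Math.min(leftMargin, rightMargin) >= minMarginRatio * pageWidth;

  return marginsBalanced && enoughWhitespace;
}
```

The second condition matters because a justified paragraph spanning the full text column also has equal margins; requiring a minimum margin on both sides keeps such lines from being flagged.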
isFootnote
boolean
Indicates whether this text is identified as a footnote. Footnotes are typically detected by their position relative to horizontal line elements in the document. Text appearing below the last significant horizontal line is often classified as footnote content. This classification helps in:
  • Separating main content from supplementary information
  • Proper document structure reconstruction
  • Academic and formal document processing
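
As a rough sketch of the positional rule described above, the following marks every line that starts below the lowest horizontal rule on the page. It assumes a top-left origin with y increasing downward, and the Rect shape and markFootnotes name are hypothetical. It also treats every rule as significant, which is a simplification of whatever filtering the library applies.

```typescript
// Hypothetical rectangle shape; y grows downward from the top of the page.
type Rect = { x: number; y: number; width: number; height: number };
type PositionedLine = { bbox: Rect; text: string; isFootnote?: boolean };

// Flags lines whose top edge lies below the lowest horizontal rule.
// Returns new objects and leaves the input untouched.
function markFootnotes(
  lines: PositionedLine[],
  horizontalRules: Rect[],
): PositionedLine[] {
  // Without any rule there is nothing to separate footnotes from body text.
  if (horizontalRules.length === 0) {
    return lines.map((line) => ({ ...line }));
  }

  const lastRuleY = Math.max(...horizontalRules.map((rule) => rule.y));

  return lines.map((line) =>
    line.bbox.y > lastRuleY ? { ...line, isFootnote: true } : { ...line },
  );
}
```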
isHeading
boolean
Indicates whether the text represents a heading or title. Headings are often identified by their visual presentation, such as:
  • Being enclosed within rectangular borders or boxes
  • Having distinctive spacing or positioning
  • Being centered or specially formatted
This classification helps in:
  • Document outline generation
  • Hierarchical content structuring
  • Navigation and indexing
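
One way a consumer might use this flag for outline generation is sketched below. The FlaggedText and OutlineSection types and the buildOutline helper are hypothetical and not part of the library; the sketch produces a flat, single-level outline because the flag itself carries no heading depth.

```typescript
// Minimal stand-in for the fields of TextBlock that this example reads.
type FlaggedText = { text: string; isHeading?: boolean };
type OutlineSection = { heading: string | null; body: string[] };

// Groups blocks into sections: each heading opens a new section, and
// non-heading blocks are appended to the most recent one.
// Text that appears before the first heading goes into a section whose
// heading is null.
function buildOutline(blocks: FlaggedText[]): OutlineSection[] {
  const sections: OutlineSection[] = [];
  let current: OutlineSection | undefined;

  for (const block of blocks) {
    if (block.isHeading) {
      current = { heading: block.text, body: [] };
      sections.push(current);
    } else {
      if (!current) {
        current = { heading: null, body: [] };
        sections.push(current);
      }
      current.body.push(block.text);
    }
  }
  return sections;
}
```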
isPoetic
boolean
Indicates whether this text is identified as a line of poetry or verse. Poetic content is detected using multiple heuristics including:
  • Line length and spacing patterns
  • Word density analysis
  • Centering and alignment characteristics
  • Paired hemistich detection
Poetic lines receive special treatment during processing:
  • They are not merged into standard paragraphs
  • Line breaks are preserved as semantically significant
  • Spacing and formatting are maintained more strictly
This is crucial for preserving the artistic and structural integrity of poems, verses, and other formatted literary content.
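
The effect of this flag on downstream rendering can be illustrated with a small sketch. The VerseAwareBlock type and renderBlocks helper are hypothetical, and joining prose lines with a single space is an assumption made for the example rather than the library's actual merging logic.

```typescript
// Minimal stand-in for the fields of TextBlock that this example reads.
type VerseAwareBlock = { text: string; isPoetic?: boolean };

// Joins consecutive prose lines into one paragraph, while every poetic
// line stays on its own output line so its line break survives.
function renderBlocks(blocks: VerseAwareBlock[]): string {
  const output: string[] = [];
  let prose: string[] = [];

  // Emits the buffered prose lines as a single space-joined paragraph.
  const flushProse = (): void => {
    if (prose.length > 0) {
      output.push(prose.join(' '));
      prose = [];
    }
  };

  for (const block of blocks) {
    if (block.isPoetic) {
      flushProse();
      output.push(block.text);
    } else {
      prose.push(block.text);
    }
  }
  flushProse();

  return output.join('\n');
}
```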

Usage Example

import { reconstructParagraphs } from 'kokokor';

const result = await reconstructParagraphs({
  observations: [/* OCR observations go here */],
  page: { width: 2550, height: 3300, dpiX: 300, dpiY: 300 }
});

// Access text blocks with metadata
for (const block of result.lines) {
  console.log(block.text);
  
  if (block.isHeading) {
    console.log('This is a heading');
  }
  
  if (block.isPoetic) {
    console.log('This is poetry - preserve line breaks');
  }
  
  if (block.isCentered) {
    console.log('This text is centered on the page');
  }
  
  if (block.isFootnote) {
    console.log('This is a footnote');
  }
}
