GoodocWord

AI Overview😉

  • The potential purpose of this module is to analyze and understand the layout and structure of words within a document, including their bounding boxes, font styles, and recognition confidence. This module is likely part of an Optical Character Recognition (OCR) system that extracts text from images or scanned documents.
  • This module could impact search results by improving the accuracy of document indexing and retrieval. By better understanding the layout and structure of words within a document, the search engine can more effectively identify relevant documents and rank them accordingly. Additionally, this module could enable advanced search features, such as searching for specific fonts or layouts.
  • To be more favorable for this function, a website could ensure that its document structures are well-formatted and easily parseable, with clear and consistent font styles and layouts. Additionally, providing high-quality images or scans of documents could improve the accuracy of the OCR system and subsequent search results. Websites could also consider providing additional metadata, such as font information or document structure, to aid in the analysis and indexing process.

Interesting Module? Vote 👇

Voting helps other researchers find interesting modules.

Current Votes: 0

GoogleApi.ContentWarehouse.V1.Model.GoodocWord (google_api_content_warehouse v0.4.0)

A word representation

Attributes

  • Baseline (type: integer(), default: nil) - The baseline's y-axis offset from the bottom of the word's bounding box, given in pixels. (A value of 2, for instance, indicates the baseline is 2px above the bottom of the box.)
  • Box (type: GoogleApi.ContentWarehouse.V1.Model.GoodocBoundingBox.t, default: nil) -
  • Capline (type: integer(), default: nil) - The capline is the y-axis offset from the top of the word bounding box. A positive value n indicates that capline is n-pixels above the top of this word.
  • CompactSymbolBoxes (type: GoogleApi.ContentWarehouse.V1.Model.GoodocBoxPartitions.t, default: nil) - For space efficiency, we sometimes skip the detailed per-symbol bounding boxes in Symbol.Box, and use this coarser representation instead, where we just store Symbol boundaries within the Word box. Most client code should not have to worry directly about this, it should be handled in the deepest layers of writing/reading goodocs (for example, see Compress() and Uncompress() in ocean/goodoc/goovols-bigtable-volume.h). Note(viresh): I experimented with this compression, and here are some numbers for reference. If the zlib-compressed page goodoc string size was 100 to start with, then this compaction makes it 65. As a possible future relaxation to consider: if we add in, for each symbol, a "top" and "bottom" box offset then the size would be 75 (that's with "repeated int32 top/bottom_offset" fields inside BoxPartitions, instead of inside each symbol).
  • Confidence (type: integer(), default: nil) - Word recognition confidence. Range depends upon OCR Engine.
  • IsFromDictionary (type: boolean(), default: nil) - word. The meaning and range depends on the OCR engine or subsequent processing. Specifies whether the word was found
  • IsIdentifier (type: boolean(), default: nil) - a number True if word represents
  • IsLastInSentence (type: boolean(), default: nil) - True if the word is the last word in any sub-paragraph unit that functions at the same level of granularity as a sentence. Examples: "She hit the ball." (regular sentence) "Dewey defeats Truman" (heading) "The more, the merrier." (no verb) Note: not currently used. Code to set this was introduced in CL 7038338 and removed in OCL=10678722.
  • IsNumeric (type: boolean(), default: nil) - in the dictionary True if the word represents
  • Label (type: GoogleApi.ContentWarehouse.V1.Model.GoodocLabel.t, default: nil) -
  • Penalty (type: integer(), default: nil) - Penalty for discordance of characters in a
  • RotatedBox (type: GoogleApi.ContentWarehouse.V1.Model.GoodocRotatedBoundingBox.t, default: nil) - If RotatedBox is set, Box must be set as well. See RotatedBoundingBox.
  • Symbol (type: list(GoogleApi.ContentWarehouse.V1.Model.GoodocSymbol.t), default: nil) - Word characters, the text may
  • alternates (type: GoogleApi.ContentWarehouse.V1.Model.GoodocWordAlternates.t, default: nil) -
  • text (type: String.t, default: nil) - As a shortcut, the content API provides the text of words instead of individual symbols (NOTE: this is experimental). This is UTF8. And the main font for the word is stored in Label.CharLabel.
  • writingDirection (type: String.t, default: nil) - Writing direction for this word.

Summary

Types

t()

Functions

decode(value, options)

Unwrap a decoded JSON object into its complex fields.

Types

Link to this type

t()

@type t() :: %GoogleApi.ContentWarehouse.V1.Model.GoodocWord{
  Baseline: integer() | nil,
  Box: GoogleApi.ContentWarehouse.V1.Model.GoodocBoundingBox.t() | nil,
  Capline: integer() | nil,
  CompactSymbolBoxes:
    GoogleApi.ContentWarehouse.V1.Model.GoodocBoxPartitions.t() | nil,
  Confidence: integer() | nil,
  IsFromDictionary: boolean() | nil,
  IsIdentifier: boolean() | nil,
  IsLastInSentence: boolean() | nil,
  IsNumeric: boolean() | nil,
  Label: GoogleApi.ContentWarehouse.V1.Model.GoodocLabel.t() | nil,
  Penalty: integer() | nil,
  RotatedBox:
    GoogleApi.ContentWarehouse.V1.Model.GoodocRotatedBoundingBox.t() | nil,
  Symbol: [GoogleApi.ContentWarehouse.V1.Model.GoodocSymbol.t()] | nil,
  alternates:
    GoogleApi.ContentWarehouse.V1.Model.GoodocWordAlternates.t() | nil,
  text: String.t() | nil,
  writingDirection: String.t() | nil
}

Functions

Link to this function

decode(value, options)

@spec decode(struct(), keyword()) :: struct()

Unwrap a decoded JSON object into its complex fields.