
Structuring Contextual Chunks: Bringing Human-Like Understanding to Scraped Web Data


When humans read a webpage, we don’t just consume words – we perceive structure. Titles tell us where we are, subheadings group related ideas, and nested lists give us relationships. This hierarchy is what makes web pages intuitive.

Now imagine stripping that away. If you scrape the text of a page without preserving its headings and structure, you’re left with a flat blob of words. To a machine, “Payment Options” looks no different from “Refund Policy.” That loss of hierarchy can seriously degrade the accuracy of downstream tasks like Retrieval-Augmented Generation (RAG), search, and summarization. This was the exact problem we faced: scraped data didn’t contain the hierarchical context that humans naturally perceive.

The Problem: Flat Text Is Contextless

Consider this raw scraped output:
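As a minimal illustration (the sample text here is invented for demonstration), flat scraped output might look like:

```text
Payment Options
We offer several ways to pay.
Credit Cards
We accept Visa and Mastercard.
PayPal
Link your PayPal account at checkout.
```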

A human instantly recognizes that “Credit Cards” and “PayPal” are children of “Payment Options.” But to a machine, unless we explicitly encode that relationship, they’re just three unrelated strings. If we were to chunk this directly without structure, our RAG system might retrieve “PayPal” under a query about “Refund Policy,” simply because both sections mention “payment.”

Two Options: Plain Text vs. Markdown-Aware Chunking

When designing a chunking pipeline, there are two paths:

  1. Plain-Text Chunking
  • Split text into chunks of fixed size (say 1,000 characters).
  • Pros: Simple, fast, works when no formatting is available.
  • Cons: No sense of hierarchy, risks mixing unrelated concepts, and chunks may cut off headings mid-way.

Example (plain chunking):
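A minimal sketch of fixed-size plain-text chunking; the sample text and the 60-character limit are invented for demonstration:

```python
# Illustrative sketch: fixed-size chunking of flat scraped text,
# with no awareness of headings or structure.
def chunk_plain(text: str, size: int) -> list[str]:
    """Split text into fixed-size chunks, ignoring structure."""
    return [text[i:i + size] for i in range(0, len(text), size)]

flat = (
    "Payment Options We offer several ways to pay. "
    "Credit Cards We accept Visa and Mastercard. "
    "PayPal Link your account at checkout."
)
chunks = chunk_plain(flat, 60)
# The "Credit Cards" heading lands near a chunk boundary,
# severed from the body text that belongs to it.
```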

The heading “Credit Cards” might get buried in the middle of a chunk, losing its importance.

  2. Markdown-Aware Chunking (Our Approach)

Leverage the markdown heading markers (#, ##, ###) present in the scraped output. Parse heading levels to reconstruct parent-child relationships. Chunk the data in a way that preserves context.

Example (contextual chunking):
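As a sketch, with the “parent > child” prefix format as an assumption, contextual chunks for the same content might look like:

```text
Payment Options > Credit Cards
We accept Visa and Mastercard.

Payment Options > PayPal
Link your PayPal account at checkout.
```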

Here, every chunk carries its context path, making it much closer to how humans understand the page.

Clearly, markdown-aware chunking is more powerful when the structure is available.

Our Solution: Contextual Chunking Pipeline

To overcome the problem of context loss, we built a pipeline that:

  1. Parses Markdown Headings
  • Each heading is analyzed based on its number of # characters.
  • Heading hierarchy is rebuilt: # is top-level, ## is nested, and so on.
  • Skipped levels are handled gracefully (e.g., jumping from # to ###).

Example:
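For instance, given this illustrative markdown input:

```markdown
# Payment Options
## Credit Cards
## PayPal
```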

Reconstructed hierarchy:

  • Payment Options
    • Credit Cards
    • PayPal
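The heading parsing above can be sketched in Python; the function name and the stack-based approach are illustrative, not our exact implementation:

```python
import re

# Matches ATX-style markdown headings: 1-6 "#" characters, then the title.
HEADING = re.compile(r"^(#{1,6})\s+(.*)$")

def heading_paths(markdown: str) -> list[list[str]]:
    """Return the ancestor path (root..self) for every heading."""
    paths: list[list[str]] = []
    stack: list[tuple[int, str]] = []  # (level, title)
    for line in markdown.splitlines():
        m = HEADING.match(line)
        if not m:
            continue
        level, title = len(m.group(1)), m.group(2).strip()
        # Pop anything at the same or deeper level. A skipped level
        # (e.g. "#" straight to "###") simply nests under the last
        # shallower heading left on the stack.
        while stack and stack[-1][0] >= level:
            stack.pop()
        stack.append((level, title))
        paths.append([t for _, t in stack])
    return paths

doc = "# Payment Options\n## Credit Cards\n## PayPal\n"
paths = heading_paths(doc)
# Each entry is the full heading path, e.g. the second one is
# ["Payment Options", "Credit Cards"].
```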
  2. Builds Sections with Context + Content

Each section is represented as:
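One plausible shape for such a section record, sketched in Python (the field names are assumptions, not the pipeline’s actual schema):

```python
from dataclasses import dataclass

@dataclass
class Section:
    context: list[str]  # heading path from the root down to this section
    content: str        # body text under the heading

    def context_path(self) -> str:
        """Render the heading path as a single breadcrumb string."""
        return " > ".join(self.context)

# Illustrative sample data.
section = Section(
    context=["Payment Options", "Credit Cards"],
    content="We accept major credit cards.",
)
# section.context_path() -> "Payment Options > Credit Cards"
```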

This ensures that even when separated, the chunk still knows “who its parent is.”

  3. Combines Sections Into Chunks
  • Small sections can share a chunk if they fit within the size limit.
  • Larger sections stand alone.
  • If a section exceeds the chunk size limit, it is split into multiple sub-chunks, each retaining the same context.

Example of splitting a long section:
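A rough sketch of size-limited splitting that keeps the context prefix on every sub-chunk; the prefix format and the sample limit are assumptions:

```python
# Split an oversized section body into sub-chunks, each prefixed
# with the same heading path so context survives the split.
def split_section(context: str, body: str, limit: int) -> list[str]:
    header = context + "\n"
    room = limit - len(header)  # space left for body text per chunk
    parts = [body[i:i + room] for i in range(0, len(body), room)]
    return [header + p for p in parts]

# Illustrative: a 100-character body under an 80-character chunk limit.
chunks = split_section("Payment Options > Refund Policy", "x" * 100, 80)
# Every sub-chunk begins with the same context path.
```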

Even after splitting, every chunk carries the heading path.

  4. Outputs Context-Aware Chunks

Final chunks are formatted like this:
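As one plausible rendering (the exact delimiter format is an assumption), a final chunk could look like:

```text
[Context: Payment Options > Credit Cards]
We accept Visa and Mastercard.
```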

These are then fed into the embedding model, ensuring both semantic content and structural context are preserved.
 

Why This Approach Works

Contextual chunking solves multiple issues simultaneously:

  • Better Retrieval: Queries like “What payment methods are available?” now correctly retrieve “Credit Cards” and “PayPal” under the “Payment Options” parent.
  • Reduced Noise: Unrelated content doesn’t get grouped together simply because it fits in a fixed-size chunk.
  • Scalable Handling of Large Content: Huge sections (like policies, manuals, or documentation pages) can be split into multiple smaller but still contextually linked chunks.
  • Improved RAG Accuracy: Models work best when given chunks that mirror human-readable sections.
 

Practical Example: Plain vs. Contextual Chunking

Let’s compare how a RAG system might behave:

Plain-Text Chunking Query:

  • Question: “Do you accept American Express?”
  • Retrieved Chunk: Might include “We offer several ways to pay… Refund Policy…” → inaccurate context, potential hallucination.

Contextual Chunking Query:

  • Question: “Do you accept American Express?”
  • Retrieved Chunk: the “Credit Cards” section, delivered with its “Payment Options” context path attached.
  • Answer: Clear, accurate, and aligned with the source page.

Conclusion

Scraping web data is easy. Making that data useful is the real challenge. Plain-text chunking is quick, but it strips away the navigational cues that humans rely on. By contrast, markdown-aware contextual chunking restores hierarchy, allowing downstream AI systems to reason about data the way humans naturally do.

In practice, this has been transformative for our RAG workflows:

  • Retrieval is more precise.
  • Responses are more contextually accurate.
  • Large documents remain manageable. 

It’s one of those behind-the-scenes engineering details that visitors never notice directly – but they feel the impact in the quality of the answers they receive. At the end of the day, good AI doesn’t just process data.

It understands its structure.
