When humans read a webpage, we don’t just consume words – we perceive structure. Titles tell us where we are, subheadings group related ideas, and nested lists give us relationships. This hierarchy is what makes web pages intuitive.
Now imagine stripping that away. If you scrape the text of a page without preserving its headings and structure, you’re left with a flat blob of words. To a machine, “Payment Options” looks no different from “Refund Policy.” That loss of hierarchy can seriously degrade the accuracy of downstream tasks like Retrieval-Augmented Generation (RAG), search, and summarization. This was the exact problem we faced: scraped data didn’t contain the hierarchical context that humans naturally perceive.
The Problem: Flat Text Is Contextless
Consider this raw scraped output:

A human instantly recognizes that “Credit Cards” and “PayPal” are children of “Payment Options.” But to a machine, unless we explicitly encode that relationship, they’re just three unrelated strings. If we were to chunk this directly without structure, our RAG system might retrieve “PayPal” under a query about “Refund Policy,” simply because both sections mention “payment.”
Two Options: Plain Text vs. Markdown-Aware Chunking
When designing a chunking pipeline, there are two paths:
- Plain-Text Chunking
  - Split text into chunks of fixed size (say 1,000 characters).
  - Pros: Simple, fast, works when no formatting is available.
  - Cons: No sense of hierarchy, risks mixing unrelated concepts, and chunks may cut off headings mid-way.
Example (plain chunking):

The heading “Credit Cards” might get buried in the middle of a chunk, losing its importance.
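To make the failure mode concrete, here is a minimal sketch of fixed-size chunking. The function name `chunk_fixed`, the sample text, and the 60-character limit are all invented for illustration; they are not taken from a real pipeline.

```python
def chunk_fixed(text: str, size: int = 1000) -> list[str]:
    """Split text into consecutive chunks of at most `size` characters."""
    if size <= 0:
        raise ValueError("size must be positive")
    return [text[i:i + size] for i in range(0, len(text), size)]


# Hypothetical flattened page text, invented for this example.
flat = (
    "Payment Options We offer several ways to pay. "
    "Credit Cards We accept major cards. "
    "PayPal You can also pay with PayPal."
)

for chunk in chunk_fixed(flat, size=60):
    print(repr(chunk))
```

With this input, “Credit Cards” lands in the middle of the first chunk rather than at the start of one, and nothing marks it as a heading.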
- Markdown-Aware Chunking (Our Approach)
  - Leverage the markdown (#, ##, ###) present in the scraped output.
  - Parse heading levels to reconstruct parent-child relationships.
  - Chunk data in a way that preserves context.
Example (contextual chunking):

Here, every chunk carries its context path, making it much closer to how humans understand the page.
When the structure is available, markdown-aware chunking is the stronger option: it keeps the context that plain-text chunking throws away.
Our Solution: Contextual Chunking Pipeline
To overcome the problem of context loss, we built a pipeline that:
- Parses Markdown Headings
  - Each heading is analyzed based on the number of # characters it starts with.
  - Heading hierarchy is rebuilt: # is top-level, ## is nested, and so on.
  - Skipped levels are handled gracefully (e.g., jumping from # to ###).
Example:

Reconstructed hierarchy:
- Payment Options:
  - Credit Cards
  - PayPal
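One common way to rebuild this kind of hierarchy is a stack of open headings. The sketch below assumes ATX-style headings (one to six # characters followed by a space) and does not try to skip # lines inside fenced code blocks. The names `HEADING_RE` and `heading_paths`, and the “Exceptions” heading, are hypothetical.

```python
import re

# One to six '#' characters, whitespace, then the heading title.
HEADING_RE = re.compile(r"^(#{1,6})\s+(.*\S)\s*$")


def heading_paths(markdown: str) -> list[list[str]]:
    """Return the full heading path for each heading, in document order."""
    stack: list[tuple[int, str]] = []  # (level, title) of the open ancestors
    paths: list[list[str]] = []
    for line in markdown.splitlines():
        match = HEADING_RE.match(line)
        if not match:
            continue
        level, title = len(match.group(1)), match.group(2)
        # Close every heading at the same or a deeper level.
        while stack and stack[-1][0] >= level:
            stack.pop()
        stack.append((level, title))
        paths.append([t for _, t in stack])
    return paths


doc = """# Payment Options
## Credit Cards
## PayPal
# Refund Policy
### Exceptions
"""

for path in heading_paths(doc):
    print(" > ".join(path))
```

Because the stack only compares levels, a jump from # to ### simply attaches the deeper heading to the nearest shallower ancestor, so “Exceptions” ends up directly under “Refund Policy.”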
- Builds Sections with Context + Content
Each section is represented as:

This ensures that even when separated, the chunk still knows “who its parent is.”
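As a rough illustration, a section can be modeled as a heading path plus the body text under the innermost heading. The `Section` dataclass, its field names, and `build_sections` are assumptions made for this sketch, and the sample page text is invented.

```python
import re
from dataclasses import dataclass

HEADING_RE = re.compile(r"^(#{1,6})\s+(.*\S)\s*$")


@dataclass
class Section:
    context: list[str]  # heading path, outermost heading first
    content: str        # body text under the innermost heading


def build_sections(markdown: str) -> list[Section]:
    """Pair each block of body text with the heading path above it."""
    stack: list[tuple[int, str]] = []
    sections: list[Section] = []
    body: list[str] = []

    def flush() -> None:
        # Emit the text collected so far under the current heading path.
        text = "\n".join(body).strip()
        if text:
            sections.append(Section([t for _, t in stack], text))
        body.clear()

    for line in markdown.splitlines():
        match = HEADING_RE.match(line)
        if match:
            flush()  # the previous section ends where a new heading starts
            level, title = len(match.group(1)), match.group(2)
            while stack and stack[-1][0] >= level:
                stack.pop()
            stack.append((level, title))
        else:
            body.append(line)
    flush()
    return sections


page = """# Payment Options
We offer several ways to pay.
## Credit Cards
We accept major cards.
## PayPal
Pay with your PayPal account.
"""

for section in build_sections(page):
    print(section.context, "->", section.content)
```

In this sketch a heading with no body text produces no section of its own; it only shows up in the context of its children. Whether to keep such empty sections is a design choice.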
- Combines Sections Into Chunks
  - Small sections can share a chunk if they fit within the size limit.
  - Larger sections stand alone.
  - If a section exceeds the chunk size limit, it is split into multiple sub-chunks, each retaining the same context.
Example of splitting a long section:

Even after splitting, every chunk carries the heading path.
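The packing rules can be sketched as a greedy pass over the sections. Several details here are assumptions: size is measured in characters of content only, oversized sections are split on word boundaries, and each split piece becomes its own chunk. `Section`, `split_long`, and `pack` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class Section:
    context: list[str]
    content: str


def split_long(section: Section, limit: int) -> list[Section]:
    """Split an oversized section on word boundaries, keeping its context."""
    pieces: list[str] = []
    current = ""
    for word in section.content.split():
        candidate = f"{current} {word}".strip()
        if current and len(candidate) > limit:
            pieces.append(current)
            current = word
        else:
            current = candidate
    if current:
        pieces.append(current)
    return [Section(section.context, piece) for piece in pieces]


def pack(sections: list[Section], limit: int) -> list[list[Section]]:
    """Group small sections together; split oversized ones into sub-chunks."""
    chunks: list[list[Section]] = []
    current: list[Section] = []
    used = 0
    for section in sections:
        size = len(section.content)
        if size > limit:
            if current:  # close the chunk in progress first
                chunks.append(current)
                current, used = [], 0
            chunks.extend([piece] for piece in split_long(section, limit))
            continue
        if current and used + size > limit:
            chunks.append(current)
            current, used = [], 0
        current.append(section)
        used += size
    if current:
        chunks.append(current)
    return chunks


demo = [
    Section(["A"], "aaaa"),
    Section(["A", "B"], "bbbb"),
    Section(["C"], "one two three four five six"),
]
for chunk in pack(demo, limit=10):
    print([(s.context, s.content) for s in chunk])
```

A single word longer than the limit would still yield an oversized piece in this sketch; a real pipeline would need a fallback such as a hard character split.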
- Outputs Context-Aware Chunks
Final chunks are formatted like this:
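One possible rendering, purely as an assumption for illustration, is a breadcrumb line built from the heading path followed by the content. The `Context:` label, the ` > ` separator, and the `format_chunk` name are hypothetical choices, not a fixed format.

```python
def format_chunk(context: list[str], content: str) -> str:
    """Render a chunk as a heading-path breadcrumb followed by its content."""
    header = " > ".join(context) if context else "(no heading)"
    return f"Context: {header}\n\n{content}"


print(format_chunk(["Payment Options", "Credit Cards"], "We accept major cards."))
```

Putting the path in the chunk text itself means an embedding model or keyword index sees the parent headings alongside the content.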

Why This Approach Works
Practical Example: Plain vs. Contextual Chunking
Plain Chunking Query:
- Question: “Do you accept American Express?”
- Retrieved Chunk: Might include “We offer several ways to pay… Refund Policy…” → inaccurate context, potential hallucination.
Contextual Chunking Query:
- Question: “Do you accept American Express?”
- Retrieved Chunk:

- Answer: Clear, accurate, and aligned with the source page.
Conclusion
Scraping web data is easy. Making that data useful is the real challenge. Plain-text chunking is quick, but it strips away the navigational cues that humans rely on. By contrast, markdown-aware contextual chunking restores hierarchy, allowing downstream AI systems to reason about data the way humans naturally do.
In practice, this has been transformative for our RAG workflows:
- Retrieval is more precise.
- Responses are more contextually accurate.
- Large documents remain manageable.
It’s one of those behind-the-scenes engineering details that visitors never notice directly – but they feel the impact in the quality of the answers they receive. At the end of the day, good AI doesn’t just process data.
It understands its structure.