Chunked Content Meets Taxonomy for Better Information Retrieval

July 9, 2025

Imagine searching a giant library of documents for one specific answer. It can feel like finding a needle in a haystack. When there’s too much information, getting the useful bits quickly is a big challenge. This is a problem I faced with a large content repository. In response, I built a system that breaks content into chunks and connects those chunks to a taxonomy (an organized set of topics) using important terms found in each chunk. By linking each content piece to key terms and grouping those terms into categories, I created a “map” that makes it easier to find what you need. In this post, I’ll introduce how this works and why it helps with information retrieval (finding the right info when you search).

The library in The Name of the Rose

The library in The Name of the Rose is a labyrinthine tower at the heart of the monastery, filled with countless ancient manuscripts. It’s designed as a maze, with rooms arranged in a cryptic pattern and guarded by secrecy, riddles, and fear. Access is strictly limited, and only the librarian knows its true layout. The monks believe some knowledge is dangerous, especially heretical or pagan works, so many books are hidden, restricted, or even poisoned. Finding a specific book is nearly impossible without insider knowledge—by design, the library protects its contents from discovery as much as it preserves them.

Most enterprise information systems are just as impossible to navigate, not by design but through organic and unfettered growth, like a pile under a crumbling cliff.

Chunked Content and Taxonomy: Organizing the Chaos

Chunked content means I split documentation into smaller, manageable pieces (“chunks”). Instead of treating a whole manual or huge article as one blob of text, I handle it in parts; for example, each article or section is a chunk. Each chunk is then linked to a taxonomy through the terms it contains. A taxonomy is a fancy word for a classification system, basically a way to group related things. Here, it’s like an organized list of topics or categories. I connect content chunks to the taxonomy by identifying terms (important keywords or phrases) that occur in the chunk and tagging the chunk with those terms.

For instance, if one document chunk is about cloud storage, it might contain terms like “Azure Blob Storage” or “data backup.” My system will tag that chunk with those terms. Each term in turn belongs to a category in the taxonomy (for example, “Cloud Services” or “Data Management”). By doing this, each content chunk “knows” what topics it falls under based on the terms it mentions. This structure supports information retrieval because it creates multiple pathways to find the content:

  • By Term: You can search for a specific keyword and find all chunks that mention that term.

  • By Category: You can look at a broad topic (category) and find chunks related to that topic (through the terms in that category).

  • By Connections: Chunks that share terms are linked indirectly, so you can discover related content easily.

In short, chunking and tagging content with a taxonomy turns unstructured information into a navigable graph of knowledge. It’s like turning a messy pile of books into a well-organized library catalog.
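To make the structure concrete, here is a minimal Python sketch of the chunk-to-term-to-category mapping; the chunk IDs, texts, and category groupings are made-up examples for illustration, not the actual data model from my project:

    # Minimal, made-up illustration of the chunk -> term -> category mapping.
    chunks = {
        "doc-a-section-1": {
            "text": "Configure Azure Blob Storage for nightly data backup ...",
            "terms": ["Azure Blob Storage", "data backup"],
        },
        "doc-b-section-3": {
            "text": "Retention rules in the data backup compliance policy ...",
            "terms": ["data backup", "compliance policy"],
        },
    }

    taxonomy = {
        "Cloud Services": ["Azure Blob Storage"],
        "Data Management": ["data backup", "compliance policy"],
    }

    # By term: every chunk that mentions "compliance policy".
    by_term = [cid for cid, c in chunks.items() if "compliance policy" in c["terms"]]

    # By category: every chunk that mentions any term under "Data Management".
    cat_terms = set(taxonomy["Data Management"])
    by_category = [cid for cid, c in chunks.items() if cat_terms & set(c["terms"])]

    print(by_term)      # ['doc-b-section-3']
    print(by_category)  # ['doc-a-section-1', 'doc-b-section-3']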

[Image: Sweet, sweet taxonomy]

Unsupervised vs. Manual Taxonomy

In my project, the taxonomy wasn’t hand-made by a person ahead of time - it was unsupervised, meaning it was generated from the content itself using algorithms. I used tools like NLTK (Natural Language Toolkit) for text processing and scikit-learn (a machine learning library) for clustering similar terms. Essentially, the computer looked at all the terms in the content and grouped those that often appear together or have similar contexts, forming categories automatically. I ended up with about 20 categories (and a few broader groups I called supercategories) discovered this way.
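For a flavor of how the clustering step can work, here is a minimal sketch using scikit-learn; the term contexts are made-up and my actual pipeline differs in its details, but the idea is the same: represent each term by the text it appears in and group terms with similar representations.

    # Sketch: group extracted terms into rough categories by clustering their contexts.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.cluster import KMeans

    # Hypothetical input: each term mapped to the chunk text it appears in.
    term_contexts = {
        "Azure Blob Storage": "store objects in the cloud, snapshots, redundancy",
        "data backup": "nightly backup jobs, restore from snapshots, retention",
        "compliance policy": "retention rules, audits, governance requirements",
        "XCI Data Model": "entities, attributes, schema design, modeling",
    }

    terms = list(term_contexts)
    vectors = TfidfVectorizer(stop_words="english").fit_transform(term_contexts.values())

    # k is a tuning choice; my run ended up with about 20 categories.
    kmeans = KMeans(n_clusters=2, n_init=10, random_state=42).fit(vectors)

    for term, label in zip(terms, kmeans.labels_):
        print(f"category-{label}: {term}")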

Pros of an Unsupervised Taxonomy:

  • Scalable: The algorithm can churn through a huge corpus of text and create groupings quickly, which would be hard for a human to do one-by-one.

  • Data-Driven: It might find hidden patterns or topic groupings that a person might not think of. The structure emerges from the content itself, so it’s tailored to the actual data.

  • Little Upfront Effort: You don’t need an expert to define all categories beforehand. The categories “grow” from the content, which is great if you don’t have a pre-existing taxonomy.

Cons of an Unsupervised Taxonomy:

  • Sometimes Unintuitive: Because a computer groups terms by statistics, the resulting categories might not always make immediate sense to humans. The naming of these clusters can be odd or unclear (for example, you might get a category that’s just labeled by the most frequent term, which isn’t always descriptive). In my case, I noticed the category names could be refined for clarity.

  • Needs Cleanup: The automatically generated taxonomy might include redundant or overly broad/narrow categories. Often, you still need a person to review the clusters and possibly merge or rename them to align with real-world understanding.

  • Lacks Expert Judgment: The algorithm doesn’t know your domain’s nuances. A manual taxonomy made by an expert might group things more meaningfully from a user’s perspective, whereas the unsupervised one might group by term usage that isn’t important conceptually.

By contrast, a manual taxonomy is crafted by human experts in a top-down way (deciding the main categories, sub-categories, etc.).

Pros of a Manual Taxonomy:

  • Human-Friendly: Categories are typically well-defined and intuitive because they come from domain knowledge. For example, an expert might decide on clear categories like “Storage Services” vs “Database Services” in a cloud documentation site, which make sense to readers.

  • Precise and Consistent: Humans can ensure each content piece is tagged correctly and consistently according to meaning, avoiding some false connections an automatic method might make.

Cons of a Manual Taxonomy:

  • Time-Consuming: Building and maintaining it is a lot of work. Someone has to continually update categories and tags as new content comes in.

  • Doesn’t Scale Easily: For very large or constantly changing content sets, a manual approach might struggle to keep up. It might also miss emergent topics that weren’t obvious to the taxonomy designers.

  • Subjective: Different experts might categorize the same content differently. It can introduce bias based on what people expect to see, rather than what’s in the content.

In my project, I chose the unsupervised route, letting the data speak for itself. This gave me a starting taxonomy that I could refine later. The key is that whichever approach you use, connecting content chunks to a taxonomy via terms makes it easier to retrieve information, because you’ve added structure and meaning on top of plain text.

Building the Knowledge Graph with Neo4j

To make all these connections useful, I built a knowledge graph (essentially a network of nodes and relationships) using Neo4j (a graph database), Python, and NLTK. Each node in the graph represents an entity: I have Content nodes for each content chunk, Term nodes for each important term, and Category nodes for each category in the taxonomy (and even SuperCategory nodes for the higher-level groupings). I also define relationships between these nodes:

  • MENTION: This relationship links a Term to a Content node. For example, if an article (Content) mentions “XCI Data Model”, I create a link (Term: "XCI Data Model") -[:MENTION]-> (Content: Article A). In Neo4j queries, I often treat this as an undirected connection. I just know the content and term are connected by a mention.

  • HAS_TERM: This relationship connects a Category node to a Term node. It means that the term is part of that category (perhaps discovered by my clustering algorithm). For instance, if “Data Governance” is a category and it includes the term “compliance policy”, I’d have (Category: Data Governance) -[:HAS_TERM]-> (Term: "compliance policy").

  • HAS_CHILD: This relationship connects a SuperCategory to a Category (or sometimes a Category to a sub-category). It represents the hierarchy in the taxonomy. For example, I had a super-category called “Enterprise Data Management” with child categories like “Data Governance” and “Data Quality”, giving links such as (SuperCategory: Enterprise Data Management) -[:HAS_CHILD]-> (Category: Data Governance).
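To make the schema concrete, here is a minimal sketch of creating one content chunk, a term, and the taxonomy links with the official neo4j Python driver; the connection details and property names are assumptions for illustration, not the exact schema from my scripts.

    # Sketch: create Content, Term, Category, and SuperCategory nodes plus their links.
    from neo4j import GraphDatabase

    # Placeholder connection details.
    driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

    CREATE_LINKS = """
    MERGE (c:Content {node_id: $content_id, title: $title})
    MERGE (t:Term {name: $term})
    MERGE (t)-[:MENTION]->(c)
    MERGE (cat:Category {name: $category})
    MERGE (cat)-[:HAS_TERM]->(t)
    MERGE (sc:SuperCategory {name: $supercategory})
    MERGE (sc)-[:HAS_CHILD]->(cat)
    """

    with driver.session() as session:
        session.run(
            CREATE_LINKS,
            content_id="doc-a",
            title="Document A",
            term="Compliance Policy",
            category="Data Governance",
            supercategory="Enterprise Data Management",
        )

    driver.close()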

[Image: Neo4j information retrieval]

Altogether, this graph forms a semantic map of my content. Content nodes link to Term nodes (the terms they mention), and those term nodes link to Category nodes (the topics those terms fall under). This way, content is indirectly connected to categories as well, via the terms. I can visualize a small portion of the graph like this:

[Figure: Information retrieval graph structure]

In the diagram above, Document A mentions two terms (“XCI Data Model” and “Compliance Policy”). Document B also mentions “Compliance Policy.” This creates a connection: both documents share a term, so they’re related through that term node. Each term belongs to a category (the term XCI Data Model is under the Data Modeling category, and Compliance Policy is under the Data Governance category). The Data Governance category is further grouped under a super-category Enterprise Data Management.

Such a graph structure is powerful for retrieval. It means I can traverse the graph to answer questions like:

  • “Find all content that mentions XCI Data Model”.
    (follow the MENTION links from the “XCI Data Model” term node to content nodes).

  • “What topics does Document A cover?”
    (see all the Term nodes connected to that content, and what categories they belong to).

  • “Give me content related to Data Governance”.
    (find the category node for Data Governance, get all Term nodes under it, then all content that mentions those terms).

  • “How are two terms related?”
    (check if they share content nodes or share a category, etc., by traversing the connections).

Neo4j is ideal for this because it’s designed for managing nodes and relationships, and it lets us query the graph with a query language called Cypher. I wrote Python scripts to populate the graph: the script reads each markdown document, extracts terms using NLTK (I focused on nouns and noun phrases as key entities), then adds nodes and links accordingly. After building the graph database, I could use Cypher queries to retrieve information.
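The term-extraction step can be sketched roughly like this; it is a simplified, assumed version of the script (my actual code also cleans up the markdown and captures multi-word noun phrases):

    # Sketch: extract candidate terms (nouns) from a chunk of text with NLTK.
    import nltk

    # Assumes the tokenizer and POS tagger models are available locally.
    nltk.download("punkt", quiet=True)
    nltk.download("averaged_perceptron_tagger", quiet=True)

    def extract_terms(chunk_text):
        """Return the noun tokens in a chunk as candidate key terms."""
        tokens = nltk.word_tokenize(chunk_text)
        tagged = nltk.pos_tag(tokens)
        # Keep nouns and proper nouns (NN, NNS, NNP, NNPS) as candidates.
        return sorted({word for word, tag in tagged if tag.startswith("NN")})

    print(extract_terms("The compliance policy defines retention rules for data backups."))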

Querying the Graph for Answers

Once the graph was built, asking questions was much easier. Instead of combing through documents, I query the graph. For example, suppose I want to find all content related to the term “XCI Data Model.” I can write a Cypher query like this:

    MATCH (t:Term)-[:MENTION]-(c:Content)
    WHERE t.name = $term
    RETURN c.node_id AS content_id;

This query says: find a Term node with a given name (here $term would be “XCI Data Model”) and get the content IDs of all Content nodes that have a MENTION relationship with that term. In simpler words, “find me all the content chunks that mention the term ‘XCI Data Model’.” The result would be a list of content IDs or titles that are relevant.
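From Python, the same parameterized query can be executed with the neo4j driver, binding a value for $term at run time; a minimal sketch (with placeholder connection details) might look like this:

    # Sketch: run the term lookup with $term bound to an actual value.
    from neo4j import GraphDatabase

    QUERY = """
    MATCH (t:Term)-[:MENTION]-(c:Content)
    WHERE t.name = $term
    RETURN c.node_id AS content_id
    """

    driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))
    with driver.session() as session:
        result = session.run(QUERY, term="XCI Data Model")
        content_ids = [record["content_id"] for record in result]
    driver.close()

    print(content_ids)  # list of matching content IDs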

I can also query by category. Let’s say I have a category called “Data Governance” and I want all content under that category. Because categories connect to content through terms, the query can traverse two hops: Category -> Term -> Content:

    MATCH (cat:Category)-[:HAS_TERM]->(t:Term)-[:MENTION]-(c:Content)
    WHERE cat.name = $category
    RETURN c.title AS content_title;

This would retrieve the titles of content chunks associated with the category “Data Governance” by finding all terms in that category and then all content that mentions those terms. I could extend this further for supercategories by adding another hop from SuperCategory to Category.

Cypher queries make it straightforward to navigate the graph structure. Even complex questions can be answered with a few lines. For example, if I wanted to see content that shares two different terms (an overlap between topics), I could match a pattern of one content node connected to two different term nodes. The graph’s flexibility means my search can be more semantic: the graph understands relationships, not just keywords in isolation.

Measuring Retrieval Performance with Golden Questions and F-Score

Building a cool graph-based retrieval system is one thing, but I also wanted to measure how well it works. I did this by creating golden questions (or golden queries) and evaluating precision and recall, culminating in an F-score for each query. The F-score is a common metric in information retrieval that balances precision (how many of the returned results were relevant) and recall (how many of the relevant results were returned). An F-score of 1.0 means perfect precision and recall (you got everything you should, and nothing you shouldn’t), whereas a lower score means there’s room for improvement.

Golden Questions are essentially test cases with known answers. For example, I might pose the question: “What content should be retrieved for the term ‘XCI Data Model’?” As the project owner, I know which documents are truly about the XCI Data Model, so I create a golden answer set for that term. In practice, I did this by manually identifying the relevant content IDs for a set of test terms and storing them in a YAML configuration file. Each entry in the golden set maps a query (like a term or category name) to the list of content IDs that should be returned.

I then add the corresponding Cypher query that my system uses for that term or category in the same YAML file under a queries section. Here’s an example of how the YAML config looks (simplified for illustration):

    golden_queries:
        "XCI Data Model": ["6b83810b-1ca4-4d10-bb00-32036dce3e66"]
        "Data Governance": ["12345", "67890"]

    queries:
        "XCI Data Model": "MATCH (t:Term)-[:MENTION]-(c:Content) WHERE t.name = $term RETURN c.node_id"
        "Data Governance": "MATCH (cat:Category)-[:HAS_TERM]->(t:Term)-[:MENTION]-(c:Content) WHERE cat.name = $category RETURN c.node_id"

In the snippet above, under golden_queries I list the expected content IDs for two example queries. For “XCI Data Model” I expect one particular content (with ID 6b83810b-...), and for “Data Governance” I expect two contents (IDs 12345 and 67890) as the correct answers. Under queries, I provide the actual Cypher query strings the system will run for each of those search terms.

With this setup, I run a Python script that executes each query against the Neo4j graph and compares the results with the golden set.

The script calculates:

  • Precision: What fraction of the content returned by the query was relevant (that is, appears in my golden list)?

  • Recall: What fraction of the relevant content (golden list) was returned by the query?

  • F-score: The harmonic mean of precision and recall, giving a single score that balances both.
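A minimal sketch of the comparison logic is shown below; it assumes the golden IDs and the query results have already been collected as sets of content IDs (the helper name is illustrative, not the actual function from my script):

    # Sketch: precision, recall, and F-score for one golden question.
    def precision_recall_fscore(returned, golden):
        """returned and golden are sets of content IDs."""
        if not returned or not golden:
            return 0.0, 0.0, 0.0
        true_positives = len(returned & golden)
        precision = true_positives / len(returned)
        recall = true_positives / len(golden)
        if precision + recall == 0:
            return precision, recall, 0.0
        # Harmonic mean of precision and recall: 2PR / (P + R).
        return precision, recall, 2 * precision * recall / (precision + recall)

    # Illustrative check: two results returned, one of them is in the golden set.
    returned = {"6b83810b-1ca4-4d10-bb00-32036dce3e66", "some-other-id"}
    golden = {"6b83810b-1ca4-4d10-bb00-32036dce3e66"}
    print(precision_recall_fscore(returned, golden))  # (0.5, 1.0, 0.666...)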

I got a report showing these metrics for each golden question. For instance, one term might have an F-score of 0.67, meaning it retrieved some correct content but missed some or included extra results, while another term scored a perfect 1.0, meaning my query brought back exactly the right content. This quantitative feedback is extremely useful. It helped me identify where my graph or queries might need tweaking - maybe a term wasn’t linking to a content piece it should have (a missing edge), or perhaps my query logic was too broad and grabbed irrelevant content. By reviewing precision and recall, I could refine the taxonomy or relationships. For example, if recall was low, maybe I needed to add synonyms or alternate term forms to the graph. If precision was low, maybe my category grouping was too general, pulling in unrelated content.

Using golden questions and F-score evaluation turns development into a measurable process. I’m not just guessing that the system works; I can prove it with numbers. It’s a bit like a unit test suite for a search system.

Conclusion and Next Steps

Organizing chunked content with an unsupervised taxonomy and mapping it in a graph has proven to be a powerful approach for information retrieval. I took unstructured content and gave it structure: content nodes, term nodes, and category nodes, all connected in a graph, which made it easier to ask questions and get answers. By letting the taxonomy form from the content itself, I sped up the initial setup and discovered natural groupings in my data. Of course, I also saw the downside: some automatically generated categories were not immediately clear and needed better naming. In the future, a hybrid approach could be best - use unsupervised methods to get a baseline, then refine the taxonomy manually for clarity.

The graph-based system not only helps in finding information today but can serve as a foundation for more advanced applications. For example, it aligns well with emerging needs like feeding structured knowledge to AI systems or powering smart search features that go beyond keyword matching. Technical writers, content strategists, and knowledge managers can leverage this approach to make large document collections more navigable and user-friendly.

Ready to dive deeper or try it out yourself? My project is open-source. Check out the GitHub repository for the information retrieval graph to see the code, data, and documentation behind this solution. It’s a great starting point if you’re looking to implement a content graph or just curious about the technical details. By exploring the repo, you can learn more about how I built the graph, view the full YAML configs, and even run the system on your own content. I encourage you to visit the repository and join the conversation on making content easier to find through smart architecture.

You can review my repo; it came out of a recent hackathon, so it’s still a bit messy.
