Project Dolly Shield, Reborn: Using LLMs to Improve Enterprise Data Architecture Frameworks


Introduction
I recently submitted my research proposal for my master of data science thesis project. It represents the culmination of preparation work I’ve been doing for the last year, which is extensively documented on this website.
My research question is: Can deep learning concepts like semantic search, text-to-SQL, and RAG be used to improve data insights in an enterprise setting?
I proposed researching how text-to-SQL, semantic search, and Retrieval-Augmented Generation (RAG) can serve as natural next-step evolutions of enterprise data architectures. Enterprise data architectures are all over the map and have evolved over time as new technologies emerge; different pieces from different use cases get stitched together along the way. You could try to replace the whole thing through compartmentalized parallel builds, but often you need to make smaller evolutions from what exists today toward where you want to go.
The goal of my research is to focus on areas where a large majority of companies could make evolutionary changes to their existing data architecture and begin using modern deep learning tools in smart and effective ways.
This proposal is written in research paper form and uses IEEE citations. Please see the References section for cited sources.
Overview
Problem setup
If you walk into any given company of a certain size, chances are they
have data that's generated in the standard course of their business.
This can be anything from audience data generated from a consumer seeing
an ad on TV to procurement data for contracts at the Pentagon. All of
this data is collected, organized, and stored in some type of database.
The vast majority of the time, for a large majority of companies, this is a
relational database. Data can be accessed either by using SQL directly
or by putting a reporting semantic layer on top of it that provides a UI
for end users.
The fundamental principle of a relational database is that it's a
collection of tables that have a defined set of relationships between
them. These tables have primary and foreign keys that can be used to
link to each other, if a direct relationship exists. This is
perfectly fine if all of your tables are designed to create perfect
links, but that is often not practical. Different data comes in
from different sources; it could be related but might
not have a direct link when it lands. In addition, it's only tabular
data and not all data is tabular. You can have contracts written in text
that are relevant or product requirement documents that need to be
compared to the data. This is all unstructured data and the vast
majority of companies are unequipped to handle it.
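The relational principle described above can be shown with a toy example. This is a minimal sketch using Python's built-in sqlite3 module; the `customers`/`orders` schema is invented for illustration, not taken from any real enterprise system.

```python
import sqlite3

# Two tables linked by a primary/foreign key pair can be joined into one
# result set; data without a shared key has no such direct link.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, name TEXT)")
cur.execute("""CREATE TABLE orders (
    order_id INTEGER PRIMARY KEY,
    customer_id INTEGER REFERENCES customers(customer_id),
    total REAL)""")
cur.execute("INSERT INTO customers VALUES (1, 'Acme Corp')")
cur.execute("INSERT INTO orders VALUES (100, 1, 250.0)")

# The foreign key is what makes this combination possible.
rows = cur.execute("""
    SELECT c.name, o.total
    FROM orders o
    JOIN customers c ON c.customer_id = o.customer_id
""").fetchall()
print(rows)  # [('Acme Corp', 250.0)]
```

A contract PDF or a requirements document has no key to join on, which is exactly the gap the rest of this proposal is about.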
My perspective
I believe that the data architectures most enterprises have adopted over
the last twenty years need to be updated to better utilize emerging deep
learning tools. I also believe this can be done as a natural evolution
of existing architecture frameworks rather than having to rebuild
everything. There will continue to be a need for relational data and
databases, so it's delusional to expect enterprises to move completely
away from them.
The existing architectures rely almost exclusively on creating
deterministic relationships between datasets in order to derive insights
from their various combinations. To the extent that probabilistic
determination is present, it's mainly used in research to create certain
types of products that can get away with it.
By exploring new tools like semantic search and RAG, I believe there's a
way to naturally evolve the existing architecture into one that can more
readily combine structured and unstructured data to produce holistic
insights and analysis. If you later combine it with existing tools that
create dashboards or other types of reports, you can see a path towards
more useful analytics and insights being used by a larger group of
people.
Why care
Under traditional enterprise setups, there are a few ways that data is
accessed:
- A user has the permissioning and skillset to run a SQL query directly against a relational database.
- A semantic dashboard or self-service reporting layer has been connected to the relational database, and the user can click and drag fields to get a report.
- The user has neither of the above and asks someone who does, either through a formal intake procedure or as a friendly request for help to a coworker.
- When data is not available in a semantic layer but is needed for a one-off analysis, an analyst team can be engaged to run a SQL query and help pull the data together.
In none of these instances is the user getting much in the way of
contextual insight. Sure, a dashboard with enough bells and whistles can
tell a story, but dashboards take a lot of effort to create and maintain. In
addition, the fragmentation of skillsets and knowledge can create both
bottlenecks and fiefdoms, allowing people to control information flow.
The more end users that have direct access to data, the bigger the
security risk. The ability to query a relational database is tightly
controlled, and ideally getting tighter, to ensure compliance with
applicable company policies and laws around data access and security.
In addition, inexperienced users may attempt to run queries that take a
significant amount of time to process, which can become hugely expensive.
To better understand how and why things can change based on what's available in deep learning today, let me take you on a walk through the research that brings us where we are today.
Prior Research
Embeddings
Embeddings have a long and colorful history that stretches well before the modern LLM era. Word2Vec was introduced in 2013 as a way to create static embeddings where words with similar meanings are close together in a vector space[34]. ELMo was released in 2018 to create embeddings that change based on the surrounding context[35]. In 2019, Sentence-BERT was introduced as a way to derive semantic meaning from sentence embeddings, rather than embeddings of individual tokens[2]. All of these embeddings get stored in vector stores, which can be efficiently searched using Facebook AI Similarity Search (FAISS). First introduced in 2017, FAISS created a framework to efficiently search dense vectors for relevant results[18]. Instead of having to do keyword searches across sparse vectors, you could now perform semantic search and capture true intent.
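The core mechanic of semantic search is ranking dense vectors by similarity to a query vector. A minimal sketch follows; the tiny 3-dimensional vectors are hand-made stand-ins for real embedding model output, and a production system would use an embeddings model plus a library like FAISS rather than this loop.

```python
import math

# Documents and a query are represented as dense vectors; we rank documents
# by cosine similarity to the query vector instead of matching keywords.
docs = {
    "contract renewal terms": [0.9, 0.1, 0.0],
    "quarterly ad audience report": [0.1, 0.8, 0.2],
    "pentagon procurement data": [0.0, 0.2, 0.9],
}

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Pretend embedding of the question "when does the contract renew?"
query_vec = [0.85, 0.15, 0.05]
best_doc = max(docs, key=lambda d: cosine(query_vec, docs[d]))
print(best_doc)  # contract renewal terms
```

Note the query shares no literal keyword with the winning document title; the closeness in vector space is what surfaces it.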
Large Language Models
Large Language Models (LLMs) trace their roots to the groundbreaking 2017 research paper 'Attention Is All You Need' by researchers at Google[32]. It introduced the Transformer architecture, which eliminated the need for Recurrent Neural Networks (RNNs) by relying entirely on self-attention mechanisms. Before self-attention, Natural Language Processing (NLP) models primarily processed tokens sequentially. The Transformer architecture allowed for parallel processing, letting larger datasets be processed significantly faster[32].
This set the stage for the introduction of the first Generative Pretrained Transformer (GPT) model by OpenAI in 2018[33]. GPT-1 generated text using a Transformer decoder with an autoregressive objective, where each token is predicted sequentially while the input is processed in parallel using the self-attention mechanism. Later the same year, BERT (Bidirectional Encoder Representations from Transformers) introduced the ability to read text in both directions instead of just one[1]. BERT is an encoder-only model that wasn't designed with text generation in mind, while GPT-1 is decoder-only and supports text generation.
Retrieval-Augmented Generation
In 2020, Patrick Lewis, Ethan Perez, and others published research showing how relevant documents for specific use cases could be stored and retrieved for use with an LLM, instead of the model having to have seen them in pretraining or finetuning[3]. The documents get converted to embeddings and then stored in a vector store database, where semantic search can be performed and relevant results surfaced to the LLM as an augmentation of what it already knows. This research set the stage for Naive RAG in 2023[21] and Modular RAG in 2024[22], which are increasingly advanced ways of creating RAG systems. While Naive RAG follows a more traditional setup, Modular RAG allows RAG systems to become "lego-like reconfigurable frameworks"[22].
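The retrieve-then-augment flow can be sketched in a few lines. The word-overlap "retriever" below is a deliberate simplification standing in for embedding-based semantic search over a vector store, and the documents are invented; the final LLM call is omitted.

```python
# RAG in miniature: retrieve the most relevant documents for a question,
# then prepend them to the prompt so the model can answer with knowledge
# it never saw in pretraining or finetuning.
documents = [
    "The 2024 supplier contract renews automatically every March.",
    "Track metadata includes BPM, key, and release date.",
    "Procurement requests above $10k require director approval.",
]

def retrieve(question, docs, k=2):
    """Rank documents by word overlap with the question (toy retriever)."""
    q_words = set(question.lower().split())
    return sorted(
        docs,
        key=lambda d: len(q_words & set(d.lower().split())),
        reverse=True,
    )[:k]

def build_prompt(question, docs):
    context = "\n".join(f"- {d}" for d in retrieve(question, docs))
    return f"Answer using this context:\n{context}\n\nQuestion: {question}"

prompt = build_prompt("When does the supplier contract renew?", documents)
print(prompt)
```

Swapping the toy retriever for a vector store query is exactly the kind of "lego-like" module replacement Modular RAG describes.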
If you're interested in learning more about RAG, I recommend reading the 2023 survey research paper 'Retrieval-augmented generation for large
language models: A survey'[23].
Other research papers I found relevant and interesting include:
- 'KnowledGPT: Enhancing large language models with retrieval and storage access on knowledge bases', published in 2023[24]
- 'Enhancing RAG with iterative retrieval-generation synergy', published in 2023[25]
- 'Information seeking question generation using dynamic meta-information retrieval and knowledge graphs', published in 2022[26]
- 'Retrieval-augmented multimodal language modeling', published in 2022[27]
- 'Bridging the preference gap between retrievers and LLMs', published in 2024[28]
- 'Retrieval-augmented dual instruction tuning', published in 2023[29]
Text to SQL
In 2017, Victor Zhong et al. published research showing how
sequence-to-sequence processing could be used to take natural language
inputs and turn them into SQL queries[30]. Instead of generating SQL
queries word-by-word, it showed how predicting by SQL clause (SELECT,
WHERE, etc.) allowed for improved syntax. Another key part was training
the model on the schema it was intended to run on, allowing the
queries to have contextual relevance. Today, most SQL-generating tools
are built with GPT models that have been trained on tons of relevant
code. For example, DeepSeek has an open source LLM specializing in code
generation[39]. There are entire LLM leaderboards dedicated to finding the best coding models, including the BigCodeBench Leaderboard[5].
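The schema-conditioning idea from that 2017 work carries straight into today's prompt-based tools: the table schema goes into the prompt so the generated query has contextual relevance. Here's a minimal sketch of assembling such a prompt; the `tracks`/`artists` schema and the question are made-up examples, and the actual call to a code model is omitted.

```python
# Schema-conditioned prompting for text-to-SQL: the model sees the tables
# and columns it is expected to query, not just the question.
schema = {
    "tracks": ["track_id", "title", "artist_id", "bpm"],
    "artists": ["artist_id", "name"],
}

def build_sql_prompt(question, schema):
    # Render the schema as simplified DDL-style lines for the prompt.
    ddl = "\n".join(
        f"CREATE TABLE {table} ({', '.join(cols)});"
        for table, cols in schema.items()
    )
    return (
        "Given this database schema:\n"
        f"{ddl}\n"
        f"Write a single SQL SELECT statement that answers: {question}"
    )

prompt = build_sql_prompt("Which artists have tracks above 170 BPM?", schema)
print(prompt)
```

Without the schema block, a code model has no way to know that `bpm` lives in `tracks` and `name` lives in `artists`.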
Relational deep learning
Relational deep learning is an emerging field based on 2023 research showing how Graph Neural Networks can be used to learn on relational data, which traditional LLMs have struggled with due to their general inability to know which fields from which tables go together. This research also introduced RelBench, a benchmarking tool for assessing relational deep learning models[31].
Research setup
The introduction and evolution of LLMs, embeddings, and RAG have created new options for how enterprises store and utilize their data. Being able to effectively combine structured and unstructured data in an automated and efficient way could create a path away from legacy data infrastructures that are not equipped for the times.
My research setup is designed to evaluate realistic setups enterprises can implement today.
Datasets and storage
For relational data, I will be using the "10+ M. Beatport Tracks / Spotify Audio Features" dataset from Kaggle[4]. I will preprocess the data and store it in PostgreSQL.
Please note: the only Spotify API data I'll use is what's included in the dataset above. The Spotify API bans the use of its data in machine learning, including training or enhancing large language models[36].
For unstructured data, I will scrape relevant data from publicly available sources. My preferred scraping tool is Scraping Fish, which has a pay-as-you-go model and proved easy to implement at scale[37].
The scraped data will be turned into embeddings and stored in a vector database.
Model architectures
I will compare multiple different model architecture setups and assess performance under similar conditions. Only models and tools that are open source for commercial use are under consideration.
The planned setups are:
Classic
The data is organized as relational tabular data and loaded directly into
a relational database (TBD, but either Databricks, PostgreSQL, or MySQL).
These tables will have primary and foreign keys that can be used to
establish relationships and run SQL against. Care will be taken to
ensure the same IDs across multiple tables are the same data type so
they can be joined.
- Input: SQL query
- Output: tabular results of the SQL query on the screen
Text-to-SQL
This will share a decent amount with the Classic setup above. I will
choose a relational database that can be repurposed to also support
text-to-SQL, so minimal changes will be needed. The difference
is that a text-to-SQL model will be connected to it so a user can type
natural language and have it turned into SQL and automatically run.
- Input: natural language question about the data
- Output: the SQL the model generated and the tabular response on the screen
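The execution half of this setup (run the model's SQL, return both the SQL and a tabular result) can be sketched as follows, with a simple read-only guard since the model's output is untrusted. The query string and the `tracks` table are stand-ins, and sqlite3 stands in for the eventual database.

```python
import sqlite3

# A hypothetical tracks table to run generated SQL against.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tracks (title TEXT, bpm INTEGER)")
conn.executemany(
    "INSERT INTO tracks VALUES (?, ?)",
    [("Opening Set", 172), ("Warehouse Nights", 128)],
)

def run_generated_sql(sql):
    """Execute model-generated SQL, rejecting anything that isn't a SELECT."""
    if not sql.strip().lower().startswith("select"):
        raise ValueError("Only SELECT statements are allowed")
    return conn.execute(sql).fetchall()

# Pretend this string came back from the text-to-SQL model.
model_sql = "SELECT title FROM tracks WHERE bpm > 150"
rows = run_generated_sql(model_sql)
print(model_sql)
print(rows)  # [('Opening Set',)]
```

Showing the generated SQL alongside the rows, as in the output spec above, also lets skilled users sanity-check what actually ran.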
Code generation model options:
I used the BigCodeBench Leaderboard to
find models specializing in coding[5]. I filtered for open source chat
models that are less than 14B parameters and came up with the following
list to run tests with. After these small tests, I will pick one to use.
- microsoft/Phi-3-mini-128k-instruct[6]
- Qwen/Qwen2.5-Coder-14B-Instruct[7]
- deepseek-ai/deepseek-coder-1.3b-instruct[39]
Text-to-SQL + semantic search + RAG
This will share the same relational data model as the text-to-SQL
option. The new component is that unstructured supporting documents will
be converted to embeddings using an embeddings model and stored in a
vector store database. Based on the natural language input of the user,
the model will both run a relevant SQL query and retrieve the most
relevant documents from the vector store using semantic search. These
outputs will be combined into one natural language response that's
returned to the user.
- Input: natural language question about the data
- Output: a holistic natural language response that takes into account the tabular data and the unstructured supporting documents.
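The new step in this setup is fusing the two outputs into one prompt for the chat LLM. A minimal sketch of that combination follows; all values are illustrative, and the final LLM call that turns the prompt into a natural language answer is omitted.

```python
# Fuse a tabular SQL result and retrieved documents into a single prompt
# from which a chat LLM would produce one holistic answer.
sql_result = [("Opening Set", 172), ("Closing Act", 174)]
retrieved_docs = [
    "Tracks above 170 BPM are typically tagged as hard dance.",
]

def build_combined_prompt(question, rows, docs):
    table = "\n".join(f"| {title} | {bpm} |" for title, bpm in rows)
    context = "\n".join(f"- {d}" for d in docs)
    return (
        f"Question: {question}\n"
        f"SQL results:\n{table}\n"
        f"Supporting documents:\n{context}\n"
        "Write one holistic answer using both sources."
    )

prompt = build_combined_prompt(
    "What do our fastest tracks have in common?", sql_result, retrieved_docs
)
print(prompt)
```

In practice an orchestration framework handles this fan-in, but the shape is the same: structured rows and unstructured context meet in one prompt.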
Chat LLMs:
- meta-llama/Llama-3.1-8B-Instruct[8]
- Qwen/Qwen2.5-VL-7B-Instruct[9]
- deepseek-ai/DeepSeek-R1-Distill-Qwen-7B[10]
Embeddings models:
- sentence-transformers/all-mpnet-base-v2[11]
- NovaSearch/stella_en_1.5B_v5[12]
Orchestration options:
Orchestration tools are used to connect and coordinate different pieces
of a model setup together. Different models specialize in different
skillsets and orchestration binds them together.
These are three of the current favorite orchestration tools that I'm
considering using:
- Haystack[13]
- LlamaIndex[14]
- Langchain[15]
I've done extensive work over the last year for fun on trying to tie
different systems together. I could have made a whole project of
comparing models to each other but I was more interested in comparing
architectures since that's most relevant to what I do for work.
Metrics
In addition to the higher level metrics listed below, I'm looking at a
few other resources for more detailed metrics:
- HuggingFace lighteval[16], which provides tools for monitoring and measuring backend model performance
- Langfuse has a writeup of LLM evaluation tools I've been using to sort through some other options[38]
Execution time
This will be defined as the time it takes for a user to start typing a
question or query and then get a response back.
Classic Setup
From when the user starts typing the SQL to when a response is returned
on the screen. A valid response is required.
Text-to-SQL
From when the user starts typing a natural language question to when the
model finishes running the SQL it generated and a valid result is
returned. This will include the time it takes for the model to generate
the SQL.
Text-to-SQL + Semantic Search + RAG
From when the user starts typing the natural language question through
the model running the SQL it generated, retrieving contextual documents,
and producing one combined response for the end user.
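The same timing definition applies to all three setups, so a shared harness can wrap each one. A minimal sketch, where `run_pipeline` is a hypothetical stand-in for whichever setup is being measured:

```python
import time

def run_pipeline(question):
    """Placeholder for SQL generation, execution, and retrieval."""
    time.sleep(0.01)
    return [("row_count", 42)]

def timed_run(question):
    # Clock starts at question submission, stops when a valid response is back.
    start = time.perf_counter()
    result = run_pipeline(question)
    elapsed = time.perf_counter() - start
    if not result:
        raise ValueError("invalid response; only valid results are timed")
    return result, elapsed

result, elapsed = timed_run("How many tracks are above 170 BPM?")
print(f"{elapsed:.3f}s")
```

Since a valid response is required, runs that error out or return nothing are excluded rather than recorded as fast failures.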
Answer correctness and relevance
This will measure whether the response the user receives is factually
correct, compared to a known set of truthful responses.
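A first-pass version of this check is normalized exact match against the known truthful responses, sketched below. The question/answer pairs are invented, and a real evaluation of free-form natural language outputs would likely add fuzzier matching (e.g. embedding similarity) on top.

```python
# Score predictions against a known set of truthful responses.
ground_truth = {
    "how many tracks are above 170 bpm?": "2",
    "which artist released opening set?": "dj example",
}

def normalize(text):
    """Lowercase and collapse whitespace so trivial differences don't count."""
    return " ".join(text.lower().split())

def correctness(predictions):
    hits = sum(
        1
        for question, answer in predictions.items()
        if ground_truth.get(normalize(question), "") == normalize(answer)
    )
    return hits / len(predictions)

score = correctness({
    "How many tracks are above 170 BPM?": "2",          # correct
    "Which artist released Opening Set?": "MC Example",  # wrong
})
print(score)  # 0.5
```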
Compute use and storage
Some questions to answer here include:
- How much CPU, GPU, and RAM are used to run each model and store the outputs while it runs?
- How much permanent cold storage is required to store the vector and relational databases and the model outputs that need to be retained? This is permanent cold storage where data lives until purposefully deleted; the RAM used to run the model will be wiped clean after the data is exported to cold storage.
Quality and useability of the results
This will measure the quality and useability of the results and will
involve subjective analysis. It's common to collect human feedback on
the quality of responses. Initial question ideas include:
1. Rank the outputs from first to last
2. Rate the useability of the model outputs on a scale of 1-10
The two metrics combined will let me gauge if the top ranked model is
top ranked due to the quality of the results or if it's just the best of
the worst.
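The combination of the two metrics can be summarized as below: a model that wins the rankings but still rates poorly is "best of the worst". The feedback values are invented for illustration.

```python
# Per-model human feedback: ranks (1 = best) and 1-10 usability ratings.
feedback = {
    "model_a": {"ranks": [1, 1, 2], "ratings": [4, 5, 3]},
    "model_b": {"ranks": [2, 2, 1], "ratings": [3, 2, 4]},
}

def summarize(fb):
    return {
        name: {
            "avg_rank": sum(m["ranks"]) / len(m["ranks"]),
            "avg_rating": sum(m["ratings"]) / len(m["ratings"]),
        }
        for name, m in fb.items()
    }

summary = summarize(feedback)
top = min(summary, key=lambda name: summary[name]["avg_rank"])
# A low avg_rating despite the top avg_rank would flag "best of the worst".
print(top, summary[top])
```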
Closing thoughts
By creating more natural language interfaces to company data, a company
can both expand who can find relevant information and increase overall
data security, as more users are pushed away from direct database
access. Microsoft Copilot for enterprise allows you to ask it to produce
Python code based on natural language inputs and allows the enterprise
IT team to control the number of code generation requests per user per
day.
These types of tools can also bring important contextual information to
more people and be designed to serve different users differently. An
executive can have answers that reflect overall strategic priorities
while a data analyst can see recent trends. Combining structured
and unstructured data is a huge step most companies have to take
manually, but it doesn't have to be this way. Goldman Sachs' CEO reported
that AI tools they've started using can generate 95% of an IPO
prospectus in minutes[17].
At most large companies, you could rewind the clock ten years and see
almost the exact same data architecture and slightly newer flavors of
legacy reporting tools. Sure, different components might get a refresh
here and there but the overall architecture remains the same. As a
company gets bigger and their data becomes more complex, it becomes
harder and harder to implement changes since you need to try and bring
together multiple disparate systems that were never designed to work
together.
However, I don't believe in burning the whole thing to the ground and
restarting. That's not realistic for the vast majority of
companies, nor is it required. This is about an important evolution to
keep up with the next wave of technology. Data is only as good as the
tools you have to use it. Otherwise, it's just dead weight.
I hope you all enjoyed reading the above. There's a lot of work to do but a lot has already been done. Please reach out using the Contact tab or via hello@uwsthoughts.com with any questions.
References
[1] J. Devlin, "Bert: Pre-training of deep bidirectional transformers
for language understanding," arXiv preprint arXiv:1810.04805, 2018.
[2] N. Reimers, "Sentence-BERT: Sentence embeddings using siamese
BERT-networks," arXiv preprint arXiv:1908.10084, 2019.
[3] P. Lewis et al., "Retrieval-augmented generation for
knowledge-intensive nlp tasks," Advances in Neural Information
Processing Systems, vol. 33, pp. 9459--9474, 2020.
[4] Kaggle, "10+ m. Beatport tracks / spotify audio features." 2023.
Available:
https://www.kaggle.com/datasets/mcfurland/10-m-beatport-tracks-spotify-audio-features
[5] T. Y. Zhuo et al., "BigCodeBench: Benchmarking code generation
with diverse function calls and complex instructions," arXiv preprint
arXiv:2406.15877, 2024.
[6] Microsoft, "Microsoft/phi-3-mini-128k-instruct." 2024. Available:
https://huggingface.co/microsoft/Phi-3-mini-128k-instruct
[7] Qwen, "Qwen/Qwen2.5-coder-14B-instruct." 2024. Available:
https://huggingface.co/Qwen/Qwen2.5-Coder-14B-Instruct
[8] Meta, "Meta-llama/llama-3.1-8B-instruct." 2024. Available:
https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct
[9] Qwen, "Qwen/Qwen2.5-VL-7B-instruct." 2024. Available:
https://huggingface.co/Qwen/Qwen2.5-VL-7B-Instruct
[10] DeepSeek, "Deepseek-ai/DeepSeek-R1-distill-qwen-7B." 2024.
Available:
https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Qwen-7B
[11] SBERT.net, "Sentence-transformers/all-mpnet-base-v2." 2021.
Available:
https://huggingface.co/sentence-transformers/all-mpnet-base-v2
[12] NovaSearch, "NovaSearch/stella_en_1.5B_v5." 2024. Available:
https://huggingface.co/NovaSearch/stella_en_1.5B_v5/tree/main
[13] Deepset, "Haystack." 2025. Available:
https://haystack.deepset.ai/
[14] LlamaIndex, "LlamaIndex." 2025. Available:
https://www.llamaindex.ai/
[15] Langchain, "Langchain." 2025. Available:
https://www.langchain.com/
[16] Hugging Face, "LightEval." 2025. Available:
https://huggingface.co/docs/lighteval/index
[17] P. Confino, "Goldman Sachs CEO says that AI can draft 95% of an IPO
prospectus in minutes," Fortune, Jan. 2025, Available:
https://fortune.com/2025/01/17/goldman-sachs-ceo-david-solomon-ai-tasks-ipo-prospectus-s1-filing-sec/
[18] J. Johnson, M. Douze, and H. Jégou, "Billion-scale similarity
search with GPUs," IEEE Transactions on Big Data, vol. 7, no. 3, pp.
535--547, 2019.
[19] A. Yates, R. Nogueira, and J. Lin, "Pretrained transformers for
text ranking: BERT and beyond," in Proceedings of the 14th ACM
international conference on web search and data mining, 2021, pp.
1154--1156.
[20] J. J. Pan, J. Wang, and G. Li, "Survey of vector database
management systems," The VLDB Journal, vol. 33, no. 5, pp. 1591--1615,
2024.
[21] X. Ma, Y. Gong, P. He, H. Zhao, and N. Duan, "Query rewriting for
retrieval-augmented large language models," arXiv preprint
arXiv:2305.14283, 2023.
[22] Y. Gao, Y. Xiong, M. Wang, and H. Wang, "Modular rag:
Transforming rag systems into lego-like reconfigurable frameworks,"
arXiv preprint arXiv:2407.21059, 2024.
[23] Y. Gao et al., "Retrieval-augmented generation for large
language models: A survey," arXiv preprint arXiv:2312.10997, 2023.
[24] X. Wang et al., "KnowledGPT: Enhancing large language models
with retrieval and storage access on knowledge bases," arXiv preprint
arXiv:2308.11761, 2023.
[25] Z. Shao, Y. Gong, Y. Shen, M. Huang, N. Duan, and W. Chen,
"Enhancing retrieval-augmented large language models with iterative
retrieval-generation synergy," arXiv preprint arXiv:2305.15294, 2023.
[26] M. Gaur, K. Gunaratna, V. Srinivasan, and H. Jin, "Iseeq:
Information seeking question generation using dynamic meta-information
retrieval and knowledge graphs," in Proceedings of the AAAI conference
on artificial intelligence, 2022, pp. 10672--10680.
[27] M. Yasunaga et al., "Retrieval-augmented multimodal language
modeling," arXiv preprint arXiv:2211.12561, 2022.
[28] Z. Ke, W. Kong, C. Li, M. Zhang, Q. Mei, and M. Bendersky,
"Bridging the preference gap between retrievers and llms," arXiv
preprint arXiv:2401.06954, 2024.
[29] X. V. Lin et al., "Ra-dit: Retrieval-augmented dual instruction
tuning," arXiv preprint arXiv:2310.01352, 2023.
[30] V. Zhong, C. Xiong, and R. Socher, "Seq2sql: Generating
structured queries from natural language using reinforcement learning,"
arXiv preprint arXiv:1709.00103, 2017.
[31] M. Fey et al., "Relational deep learning: Graph representation
learning on relational databases," arXiv preprint arXiv:2312.04615,
2023.
[32] A. Vaswani et al., “Attention is all you need,” Adv. Neural Inf. Process. Syst., vol. 30, 2017.
[33] A. Radford et al., “Improving language understanding by generative pre-training,” San Francisco, CA, USA, 2018.
[34] T. Mikolov, K. Chen, G. Corrado, and J. Dean, “Efficient estimation of word representations in vector space,” arXiv preprint arXiv:1301.3781, 2013.
[35] J. Sarzynska-Wawer, A. Wawer, A. Pawlak, J. Szymanowska, I. Stefaniak, M. Jarkiewicz, and L. Okruszek, “Detecting formal thought disorder by deep contextualized word representations,” Psychiatry Research, vol. 304, p. 114135, 2021.
[36] Spotify, "Spotify Developer Terms." 2023. Available:
https://developer.spotify.com/terms
[37] Scraping Fish, "The solution for your web scraping problems." 2023. Available:
https://scrapingfish.com/
[38] Langfuse, "Langfuse." 2025. Available:
https://langfuse.com/
[39] DeepSeek, "deepseek-ai/deepseek-coder-1.3b-instruct." 2024.
Available:
https://huggingface.co/deepseek-ai/deepseek-coder-1.3b-instruct