First they took my code, now they have my data!

Mario Rozario
Published in Technology Hits
8 min read · May 1, 2024


Photo by Kevin Ku on Unsplash

Decades ago, when I wrote my first GW-BASIC program in college to add two numbers and display them on the screen, I had no clue about the code I was writing. At the dawn of the era of desktop computing, with those now-archaic monochrome screens, making the leap from theoretical learning in college to hands-on computer programming was no cakewalk for the uninitiated.

Today, kids can do the same thing at a much younger age.

Hello World !!

I worked my way up from programming languages like Basic, Pascal, and C to object-oriented programming. The latter ended up being much more complicated than I had anticipated. An unexpected consequence of this was the plethora of publications that emerged from Oxford University Press, O’Reilly, McGraw Hill, and other publishers.

Racks and piles of Dummies guides, tech bibles, and tips-and-tricks books filled rooms with volumes of code, written once but probably never executed anywhere.

If you ever see someone attend a video-conference call in today’s world with a library of technology books stacked nicely behind them, a glance across the books should tell you which era that person came from.

Let’s trace this journey.

The Great Caffeine Addiction

Photo by Jonas Jacobsson on Unsplash

Java made its debut in the mid-1990s, just as general-purpose computing was about to take off. It sparked a generation of applications in fields it was never intended for. Initially, people expected Java applets to revolutionize the browser, but their excessive memory usage and slowness led to their quick demise.

On the server side, Java went full pelt and was far more successful. There were servlets, JavaServer Pages, and soon myriad application programming interfaces built on Java began to emerge.

So then Java made its home on the server side.

The Digital Earthmoving Equipment

Photo by Troy Mortier on Unsplash

Shortly afterwards, the era of large tools began, even as data began to expand in size and became more difficult to manage. This is when the digital earth-moving equipment makers and warehouses of the data world arose, building walled gardens around their customers and smothering them tightly with annual recurring fees. This was a period of big tech emergence, most of which is still around today. Companies like IBM, Oracle, and others spread their unseen cloak of power widely.

More tools began to emerge, both for the enterprise and for the home user. There were office productivity tools, enterprise productivity tools, and even efficient open-source tools that could do both.

Soon experts in these fields began to emerge.

Demand grew for people who could comprehend business requirements and tailor these tools to specific needs. The work they did didn’t require a Computer Science degree. In hindsight, these were the early years of low-code.

All said and done, I still had my code and access to my data.

The Rise of the Machines

Photo by Kenny Eliason on Unsplash

Artificial intelligence started to emerge around 2010, like an egg that had been slowly incubating over the preceding forty years. Once it broke out of its shell, however, it expanded quickly, spawning new tools and laying waste to existing tools and technologies in its path.

Initially, they called it machine learning (ML). A pretty harmless name to indicate that machines, just like humans, were also in the business of learning.

That name could not be further from the truth, for machines, unlike humans, learn far more efficiently and can be programmed never to forget.

Today, thousands of firms use common machine learning (ML) models in production to predict things like sales and customer churn, and even to plan marketing campaigns. Then along came unsupervised ML models, which were still not so terrifying, since humans were still creating them and checking in on them from time to time through a process called ModelOps.

But this was not the end; in fact, it was the beginning of the end.

Deep Learning

https://www.midjourney.com/jobs/d6864648-7c4f-4793-b39b-28210ae36b71?index=1

This was the real breakthrough.

If magic is said to happen in AI, then this is ground zero!!

Neural networks are the beating heart of deep learning. The best thing about them is that they are modeled on the way our own brains function. Light from objects around us passes through our eyes, transmitting signals through an intricate network of neurons that interact through synapses. Our brain fires gentle electric pulses through this convoluted maze and learns patterns of the real world in real time.

AI simulates a similar structure, but calls its components layers and weights instead. These are networks of layers, each carrying its own weights, chained together and represented on paper in those contorted mathematical structures we learned in school: matrices.
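The layers-and-weights idea can be sketched in a few lines of code. This is a toy illustration, not a real framework: each layer is just a weight matrix and a bias vector, and chaining two layers means feeding one layer's output into the next. All numbers here are made up for the example.

```python
import math

# One fully connected layer: output_j = sigmoid(sum_i W[j][i] * x[i] + b[j]).
def dense_layer(x, weights, biases):
    """Multiply inputs by the weight matrix, add biases, squash with a sigmoid."""
    return [
        1.0 / (1.0 + math.exp(-(sum(w * xi for w, xi in zip(row, x)) + b)))
        for row, b in zip(weights, biases)
    ]

# Two chained layers: 3 inputs -> 2 hidden units -> 1 output.
x = [0.5, -1.0, 2.0]
hidden = dense_layer(x, weights=[[0.1, 0.4, -0.2], [0.3, -0.1, 0.2]], biases=[0.0, 0.1])
output = dense_layer(hidden, weights=[[0.7, -0.5]], biases=[0.2])
```

Training a real network means nudging those weight matrices until the output matches the data, which is exactly where the matrix math and calculus mentioned below come in.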

https://www.midjourney.com/jobs/8c9b1f0b-749b-44c3-ae6d-95dfc4b06612?index=2

I’m glad that those hours I spent learning what eigenvectors and matrix determinants were didn’t go to waste after all. I was also pleased to see partial differential calculus make a grand entrance here. I had to wait a few decades, though, to discover their relevance.

Imagine mathematically driven, mathematically written, and mathematically optimized programs that provide us with the best possible results. Yes, this is true, despite the occasional mistakes we term outliers and fixate on.

Advanced math did the job that code was meant to do. Code became the wrapper around math, and this is where my code really disappeared.

Generative AI

Photo by Growtika on Unsplash

Since I use ChatGPT frequently, I really shouldn’t be complaining about generative AI. I have to admit it has made me slack when typing lengthy emails. Come to think of it, it has been a positive influence on me: it has helped condense my verbose, long-drawn-out emails (which I’m sure very few people read) into concise messages with much greater impact.

When ChatGPT became mainstream, enterprises lined up, confused as to how they could milk this new cow without compromising their internal security.

Enabling end-users to request anything from a company’s data store in plain English seemed unreal, and in hindsight, that was precisely the outcome they had been pouring billions of IT budget into for decades.

The catch, though, was governance. They realised they couldn’t rush in for the quick money only to have regulators later haul them up and fine them billions of dollars for their negligence.

Enter Retrieval Augmented Generation (RAG), to solve this.

So what does RAG do?

RAG enables a generative AI model to give you answers from data beyond what it was originally trained on.

Why do we need this?

Large Language Models (LLMs) like ChatGPT, Llama, and Mistral have been extensively trained on data from the World Wide Web. This is what gives them their versatility, depth and breadth of knowledge, and particularly their ability to converse with us the way humans do.

We know this already, so what’s the problem here?

LLMs are not trained on internal company data, because that data isn’t available on the internet; it sits in company databases guarded by firewalls and multiple layers of security. So if you ask ChatGPT anything internal to your company, it will not give you a proper answer, or may even hallucinate.

So how does RAG fix this?

RAG is a technique that allows LLMs to tap into additional external data sources without retraining them. Simply put, I can point the LLM at an external data store, such as a company database, and allow it to access data from there. This broadens the range of questions users can ask the LLM.
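The pattern is simpler than it sounds: retrieve the passages relevant to the question from the private store, then prepend them to the prompt before it ever reaches the LLM. Here is a minimal sketch of that flow; the documents, the word-overlap retriever, and the prompt template are all hypothetical stand-ins (real systems retrieve by embedding similarity, covered below).

```python
# Hypothetical private documents an LLM was never trained on.
COMPANY_DOCS = [
    "Q3 revenue for the widgets division was $4.2M.",
    "The VPN policy requires hardware tokens for remote access.",
    "Annual leave requests must be filed 14 days in advance.",
]

def retrieve(question: str, docs: list[str], k: int = 1) -> list[str]:
    """Toy retriever: rank documents by how many words they share with the question."""
    q_words = set(question.lower().split())
    ranked = sorted(docs, key=lambda d: -len(q_words & set(d.lower().split())))
    return ranked[:k]

def build_prompt(question: str) -> str:
    """Stuff the retrieved context into the prompt sent to the LLM."""
    context = "\n".join(retrieve(question, COMPANY_DOCS))
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}"

prompt = build_prompt("What was Q3 revenue for the widgets division?")
print(prompt)
```

The LLM itself is unchanged; only the prompt it receives is augmented, which is why no retraining is needed.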

Are there any other benefits of using RAG?

Well, RAG also makes the model more accurate, since you’re feeding it the data it needs to produce the results you want. This means the responses will be more grounded and less prone to hallucination.

Are there any caveats that I should know here with respect to RAG?

The catch is right here. RAG stores data in structures known as vector embeddings. A vector embedding is a list of abstract numbers that makes sense to a machine but not at all to a human. For instance, rather than storing a city named “New York” in a column of a regular table, RAG could turn it into a list of numbers and store that in a database, as in the example below.

[1.2, 1.5, -0.34, -0.44, 1.1201, ... -0.65]

The upside is that enterprise users can prompt the LLM and retrieve answers that are actually relevant to them.

The downside? Look at the data above again. This is where I lost my data.

The Black Box

So let me be clear: both the code and the data are still around, but getting at them is not as easy as you might think. The two most important parts of any automation system are the code that controls it and the data it acts upon.

  1. The Code: In this context, LLMs are internally represented by complex mathematical structures. Consider a classroom blackboard filled with equations from end to end. If we know the error is somewhere in the middle, how does one fix it, let alone identify the root cause?
  2. The Data: Making data accessible to LLMs involves converting it into vector embeddings. Of course, you can always decode these embeddings back to human-readable form, but it means the data is no longer stored in human-readable form. Moreover, do we know what could happen when data in this format starts to change? Imagine looking at an Excel spreadsheet full of rows of random numbers and having to convert it back to human-readable text every time we need to access it.

Everything is still within our reach, and yet we can feel our grip on the ecosystem loosening.

We ended up ceding more autonomy to gain greater convenience. That is the tale of the last century. Humanity’s obsession is to stuff more conveniences into our already oversized bubble, no matter the cost.

Have you ever lived in a place where you could get anything you wanted? The only catch: if something breaks down, you must rely on someone you don’t know to fix it. And if they can’t, you would need to rebuild it from scratch without knowing how it was set up.

Right now, my code is gone, but my data is still within my reach.

What about you?


Tech Evangelist, voracious reader, aspiring thought leader, public speaker