As a data engineer do you find it your job to transform and clean data? How much...

blagie · on Feb 8, 2024

Not a data engineer per-se, but I do a lot of data stuff.

I find AI has revolutionized anything one-off, and data stuff has a lot of one-off. 80% of the time, I can ask the LLM and get a solution which would take 1-2 hours to code which works.

* Verification of correctness is unnecessary or nominal, since it only needs to work in one case. If it doesn't handle complex corner cases as come up in software systems, there aren't any. And in most cases, the code is simple enough you can verify at a glance.

* Code quality doesn't matter since it's throw-away code.

It's something along the lines of:

"I have:

[cut-and-paste some XML with a hairy nested JSON structure embedded]

I want:

[write three columns with the data format I want, e.g. 3 columns of CSV with the only the data I need]"

Can you make a Python script to do that?

[Cut-and-paste script, and see if it works]

If it does, I'm done. If it doesn't, I can ask again, break it into simpler steps, ask it to debug, or do it by hand. Almost no time lost up to this point, though, and 80% of the time, I just saved two hours.

In practice, this means I can do a lot more prototyping and preliminary analysis, so I get to better results. Deadlines, commitments, and working time has not changed, so the net result is much higher quality output, holistically.

gigatexal · on Feb 8, 2024

I think I need to revisit my hesitation to using LLMs. I think it stems from stubbornness. I’d rather write the boilerplate and the code to do the transformation or do so in DuckDb in SQL but if the tool can do it well enough so be it.

The bit about one-offs is not my experience. The idea being that writing connector code or extract stuff or even data cleaning changes based on the source and is usually put in production.

Ideal would be an endpoint to send data to like your example with sample data and then have it return after a prompt with the code needed or bypass the code just give me the subset of data that I request.

blagie · on Feb 8, 2024

This discussion started with what a data engineer does, and the diversity of roles. I wasn't trying to push my workflow on anyone. With what I'm doing right now (which includes data and SWE; I'm now writing more holistically), there is a flow:

---

Step 1:

- I do a lot of exploratory and one-off analysis, some of which leads to internal memos and similar and some of which goes nowhere. I do a lot of prototyping to.

- I do a lot of whiteboarding with stakeholders. This is also open-ended and exploratory. I might have a hundred mock-ups before I build something which would go into prod (which isn't a lot of time; a mock-up might be 5 minutes, so a hundred represents a few days' time).

This helps make sure: (1) I have enough flexibility in my architecture to guide likely use-cases, and I don't overengineer for things which will never happen (2) I pick the right set of things.

---

Step 2:

I build high-fidelity versions of the above. These, I can review e.g. with focus groups, in 1:1s, and in meetings.

---

Step 3:

I build production-ready deployable code. Probably about a third of the thing in step 2 reach step 3.

---

LLMs do relatively little for step 3. If I have time, I'll have GPT do a code review. It's sometimes helpful. It sounds like you spend most of your time here, so you might get less benefit than I do.

For step 2, they can often build my high-fidelity mockup for me, which is nice. What they can't do yet is do so in a way which is consistent with the rest of my codebase (front-end theming, code style, tools used, etc.). I'll get something working end-to-end quickly, but not necessarily something I can leverage directly for step 3.

However, in step 1, they've had a transformational impact. Exploratory work is 99% throw-away code (even the stuff which eventually makes it to prod; by that point, it has a clean rewrite).

One more change is that in step 1, I can try different libraries and tools too. LLMs are at the level of a very junior programmer, which is a lot better than me in a tool I've never used. Evaluating e.g. a library might be a couple of days of learning with the equivalent of 5 minutes - 1 day of building (usually, to figure out it's useless for my use-case). With an LLM, I have a feasible lousy first version in minutes. This means I can try a half-dozen libraries in an hour. That didn't fit into my timelines pre-LLM, and definitely does now. I end up using better libraries in my code, which leads to better architecture.

So YMMV.

I'm posting since I like reading stories like the above myself. Contexts vary, and it's helpful to see how things are done in contexts others than my own. If others have them, please feel free to share too.

gigatexal · on Feb 8, 2024

Hmm. I’m pretty convinced. I often get mired in steps 1 and 2. Partly because of my ADHD but also cuz of the tedium of writing code.

Which LLM are you using? ChatGPT enterprise? Something offline data / sql centric?

blagie · on Feb 9, 2024

ChatGPT a plurality of the time. I go through the API with a little script I wrote. If it doesn't work, I step up to GPT4. The costs are nominal (<$3/month). ChatGPT has gotten worse over time, so the need to escalate is more frequent; when it first came out, it was excellent.

API (rather than web) is more convenient and avoids a lot of privacy / data security issues. I wouldn't use it for highly secure things, but most of what I do is open-source, or just isn't that special.

I have analogous scripts to run various local LLMs, but with my setup, the init / cooldown time is long enough that it's easier to use a web API. Plus my GPU is often otherwise occupied. Most of what I use my GPU for are text (not code) tasks, and I find the open source models are good enough. I've heard worse things about them for code, but I haven't experimented enough to see if they'd be adequate. Some of that is getting used to how the system works, good / bad prompts, etc.

ollama + a second GPU + a running chat process would likely solve the problem for around ≈$2k, so about the equivalent of a bit over a half-century of calls to the OpenAI API. If I were dealing with something secure, that'd probably make sense. What I'm doing now, it doesn't.

gigatexal · on Feb 7, 2024

That’s a good question. I think LLMs will have a place in the connector space. It would be really cool if they could dynamically handle changes in the source (the api changed and added some new data new columns etc). But right now — at least I — don’t trust AI to do much of anything in terms of ingestion. When data is extracted from the source it’s got to be as close to a 1:1 of the source as possible. Any errors introduced will have a snowball effect down the line.

For data cleaning we do tend to write the same sort of things over and over. And that’s where I think things could improve. Though what makes a data engineer special in my mind is that they get to know the nuances of data in detail. They get familiar with the columns and their meanings to the business and the expected volume and all sorts of things. And when you get that deeply involved with the data you clearly see where things are jarringly and almost like a vet to a sick animal you write data cleaning things because you care about the data that much.

zeristor · on Feb 7, 2024

As a Data Engineer I find my job is about finding managing, and deleting PII data.

gigatexal · on Feb 7, 2024

Hah. If tou have found a systematic way of doing this please share.

blagie · on Feb 8, 2024

Reach out to me in six months, and I should have one up on github if I'm lucky :)

(I've had a design in the works for years, and finally should have time and budget to implement it. Probably not helpful for legacy systems, though.)

gigatexal · on Feb 8, 2024

I hope to see this on HN’s show HN! But let me put something in the calendar ;-)

There’s no email or other such thing in your bio. :(

zeristor · on Feb 9, 2024

PII data innit