Labnotes

Don’t judge me, this is just my train-of-thought log of tech stuff I’m looking at so I can review it later and see how wrong I was.

Wed Jan 21

Loss Functions vs Confidence #

We’ve been running into issues trying to assess “accuracy” for things reported as a single confidence score. The issue is that the number winds up not making sense as a scalar.

For example:

We do OCR in a 2-phase process. We extract text from an image; this tends to accurately capture all of the text chunks. A low confidence here means we’re basically toast. The second step is just an LLM prompt to find certain pieces of data like item numbers. In one case, a particular type of barcode image was getting picked up as an item number, despite being a completely different type of identifier, often reporting high confidence.

The folks crafting the LLM prompts are pretty good at that art, and I think we’ve gotten about as far as we can with this approach.

We probably need to be capturing per-phase, per-field metrics.

field: "rart_number"
  ocr_confidence: 0.92
  ocr_ground_truth: correct/incorrect/unknown  (from user feedback)
  classification_confidence: 0.87
  classification_ground_truth: correct/incorrect/unknown
field: "doctor_name"
  ocr_confidence: 0.49
  ocr_ground_truth: correct/incorrect/unknown  (from user feedback)
  classification_confidence: 0.38
  classification_ground_truth: correct/incorrect/unknown
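
Something like this, maybe - a sketch where the record shape and field names are placeholders, not what the system actually stores:

from dataclasses import dataclass
from typing import Optional

@dataclass
class FieldMetric:
    """One per-phase, per-field observation, matched against user feedback later."""
    field: str                                     # e.g. "part_number"
    ocr_confidence: float                          # from the text-extraction phase
    classification_confidence: float               # from the LLM extraction phase
    ocr_correct: Optional[bool] = None             # None until we have ground truth
    classification_correct: Optional[bool] = None  # None until we have ground truth

records = [
    FieldMetric("part_number", ocr_confidence=0.92, classification_confidence=0.87),
    FieldMetric("doctor_name", ocr_confidence=0.49, classification_confidence=0.38),
]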

So we have:

  • Calibration errors (if it says 90% confident, is it right 90% of the time?)
  • Field-specific accuracy (facility names are easy, part numbers are hard, etc.)
  • Failure correlation across phases (how does OCR failure predict classification failure?)

The user just needs to know if they need to manually inspect the document or not.

I think the text-extraction phase does something like token probability, based on the weights of the paths activated in the model (e.g. if it followed a bunch of very weak links between neurons, that’d be a low score; if the links traversed in the model weights were higher numbers, we’d have a higher score). I have no idea if this is the case, honestly.

I was thinking we could write up a loss function (probably MSE - we’d really like to punish bad confidence scores) to help figure out “wrongness” and then apply back-propagation, but we don’t really have gradients to propagate here. What we have is:

  • Categorical fields: Is the facility name correct? (yes/no)
  • String matching: Is “W1234” the same as “W1284”? (no, but it’s close)
  • Presence detection: Is there a signature? (yes/no/uncertain)
  • Confidence calibration: When we said 90%, were we right 90% of the time?

The loss function for “presence/absence” is, evidently, binary cross-entropy:

$$Loss = -[y\log(p) + (1-y)\log(1-p)]$$
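
A quick sketch of that in Python, with made-up numbers, just to show the shape of it:

import math

def bce_loss(y: int, p: float, eps: float = 1e-12) -> float:
    """Binary cross-entropy for one presence/absence prediction.
    y is ground truth (1 = present, 0 = absent), p is our confidence that it's present."""
    p = min(max(p, eps), 1 - eps)  # clamp so log(0) can't blow up
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

print(bce_loss(1, 0.91))  # confident and right: ~0.09
print(bce_loss(0, 0.91))  # confident and wrong: ~2.41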

For calibration, measuring how well confidence matches accuracy:

$$CalibrationLoss = \sum_{buckets} n_{bucket} \times (confidence_{bucket} - accuracy_{bucket})^2$$

If your 90% confidence bucket has 80% accuracy, that contributes:

$$n \times (0.9 - 0.8)^2 = n \times 0.01$$

If your 90% confidence bucket has 50% accuracy, that contributes:

$$n \times (0.9 - 0.5)^2 = n \times 0.16$$
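
Sketched in Python (the tenth-wide buckets are my own arbitrary choice):

from collections import defaultdict

def calibration_loss(predictions):
    """predictions: list of (confidence, was_correct) pairs.
    Buckets by confidence and sums n * (avg_confidence - accuracy)^2 per bucket."""
    buckets = defaultdict(list)
    for conf, correct in predictions:
        buckets[round(conf, 1)].append((conf, correct))  # crude 0.1-wide buckets

    total = 0.0
    for members in buckets.values():
        n = len(members)
        avg_conf = sum(c for c, _ in members) / n
        accuracy = sum(1 for _, ok in members if ok) / n
        total += n * (avg_conf - accuracy) ** 2
    return total

# ten predictions at 0.9 confidence, only eight of them actually correct:
preds = [(0.9, True)] * 8 + [(0.9, False)] * 2
print(calibration_loss(preds))  # 10 * (0.9 - 0.8)^2 = 0.1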

Combining losses across fields

Option A: Sum them

$$TotalLoss = Loss_{name} + Loss_{partnum} + Loss_{facility} + Loss_{signature} + …$$

Problem: treats all fields equally.

Option B: Weighted sum

$$TotalLoss = w_1 \cdot Loss_{name} + w_2 \cdot Loss_{partnum} + …$$

Where weights reflect business importance. A wrong part number might matter more than a misspelled facility.

Option C: Max (worst field)

$$TotalLoss = \max(Loss_{name}, Loss_{partnum}, …)$$

This says: “a document is only as good as its worst field.”
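
All three options are one-liners once the per-field losses exist; the numbers and weights below are invented placeholders:

per_field_loss = {"name": 0.09, "partnum": 1.2, "facility": 0.02, "signature": 0.11}

# Option A: plain sum - every field counts equally
total_sum = sum(per_field_loss.values())

# Option B: weighted sum - weights encode business importance
weights = {"name": 1.0, "partnum": 5.0, "facility": 0.5, "signature": 2.0}
total_weighted = sum(w * per_field_loss[f] for f, w in weights.items())

# Option C: max - the document is only as good as its worst field
total_max = max(per_field_loss.values())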

Loss outside of training context #

I’m only aware of loss functions in the context of training CNNs, but I think the concept is lifted from a broader area of mathematics (statistical decision theory, at least). I’m not sure whether it’s a huge leap to apply it to scoring user data or not. The idea would be to:

  • Score individual entries: Compute per-field losses, combine them, show users which entries need attention
  • Track system performance over time: Average loss per document should decrease as you improve
  • Identify problem areas: Which field contributes most to total loss? Focus engineering effort there.
  • Set thresholds meaningfully: Instead of arbitrary confidence cutoffs, use “expected loss” cutoffs

Example #

patient_name:    "ANDREW JONES" (confidence 0.94)
part_number:     "W1234" (confidence 0.72)
facility:        "CHI HEALTH BERGEN MERCY" (confidence 0.98)
doctor_sig:      present (confidence 0.91)

We don’t have ground truth, but maybe we penalize certain things - we really don’t care about everything equally. In fact we kind of do something like this in the current system already; however, we’re still keeping the accuracy (?) as a vector quantity.

patient_name:    0.06 × (some penalty) = low expected loss
part_number:     0.28 × (some penalty) = higher expected loss <- flag this
facility:        0.02 × (some penalty) = very low
doctor_sig:      0.09 × (some penalty) = low
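
Roughly what that flagging could look like - the penalties and the threshold here are made-up placeholders, not tuned values:

# (1 - confidence) stands in for "probability we got it wrong";
# the penalty is how much a wrong value hurts the user.
penalties   = {"patient_name": 1.0, "part_number": 5.0, "facility": 0.5, "doctor_sig": 2.0}
confidences = {"patient_name": 0.94, "part_number": 0.72, "facility": 0.98, "doctor_sig": 0.91}

FLAG_THRESHOLD = 0.5  # arbitrary cutoff for "a human should look at this"

for field, conf in confidences.items():
    expected_loss = (1 - conf) * penalties[field]
    flag = " <- flag this" if expected_loss > FLAG_THRESHOLD else ""
    print(f"{field:14s} expected loss {expected_loss:.2f}{flag}")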

(could also be that none of this is necessary)

stuff I heard #

  • to ai or not to ai is not dissimilar to the choice of whether or not to use a car’s backup camera
  • “Uncle Bob writes in 5NF” how have I not heard this before
  • not only do people use AI to varying degrees, you can’t trust them to accurately report how much they use it!

Two last observations from me:

The one case I’ve seen of a junior developer getting in over his head was a guy who simply had no loss function. He could not assess even whether a path he was going down was directionally correct, much less the veracity of the LLM’s claims that it was somehow done - an assessment the seasoned AI developer makes so routinely that he ceases to even think about it.

and

Once we start talking about mandating AI usage we’re going to run up against Goodhart’s Law

When a measure becomes a target, it ceases to be a good measure

Tue Jan 20

finetuning/LoRA followup #

Yesterday’s training runs were a total disaster. The finetuned model outputs “unknown” for every classification attempt. I think this may have been inevitable - we hit some kind of local maximum and I didn’t even mark where it was. The EXPERIMENT_LOG reads more like plausible slop than anything else.

This ceased being fun and I pushed further into “LLM-leads-the-way” territory than I normally do, so it became less of a learning exercise and more an exercise in how to manage task memory over several days of intermittent hacking.

stuff I heard today #

  • instead of steering for a yes, steer toward a no, then “negotiate”

Mon Jan 19

orcaswarm #

Reviewed devobsessed’s orcaswarm, which seems to have materialized nicely into something usable. Whether or not it is more powerful than a knowledgeable developer composing his tooling in a unix-like way is what I’ll be looking to find out. I do like that it at least gives an opinionated starting point, which is not nothing.

finetuning/LoRA #

Meanwhile, locally I’ve been reading up on what LoRA actually does while waiting for the finetuning jobs to complete - which, ironically, are using LoRA - I guess it’s ready-fire-aim for me these days.

The finetune experiments that ran up to my context limit yesterday were AWFUL. The first batch was WAY overfitted to the training data, the second batch had reverted back to not even getting the text format correct - the untuned qwen-coder-2.5 0.5b and 1.5b models did better. I’m not particularly surprised by this, but it is disheartening.

bd for handoffs #

I’ve got a claude code instance in a devcontainer blithely running with --dangerously-skip-permissions, but many of the tasks I need it to run require direct access to the hardware. The devcontainers run in colima, and the lima vm is using apple’s virtualization (not qemu). It is aarch64, but that doesn’t mean it can run MLX. This means I have to take code generated by the containerized claude code and run it on the native host. This is fine for a few commands, but it means I have to sit and babysit it.

However - I am having great luck with having it generate beads (bd), then having the host machine grab those beads and run them. I still have to babysit these tasks a little bit, but at least the native claude code can do its normal devloop and chain different things together. For example, if I finetune a model and want to then score it against my crappy benchmarks, then write the results of the experiment to my EXPERIMENT_LOG.md, that’s 3 tasks that must run in sequence with some coordination between them; I might even let claude modify the scripts as needed (since the devcontainer’d claude code instance can’t really debug the scripts, due to using MLX for training). This is still a little unsafe, but at least the YOLO aspect is contained to a devcontainer. The considerably-less-permission’d “native host” claude code instance is sandboxed with a permission list that lets it run its python scripts; otherwise it’s basically constrained to its local dir (shared with the container) and to the local ollama instance on :11434.

This is a far cry from gastown, but it’s an interesting way to farm out work, with two CC instances on two different architectures each running bd ready to see what’s available, and each deciding whether a task is theirs (“am I the native cc with MLX? then I’ll take this task”) or not (“am I in a devcontainer that just needs to write some python and google stuff?”). My life is made easier by the fact that I’m not having to do stuff in worktrees - each of the experiments I’m running is more or less its own script.
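
The dispatch logic each instance runs is conceptually just this. A sketch - list_ready_tasks, claim, run, and the capability tags are hypothetical stand-ins, not real bd commands:

import os
import platform

def can_work(task_tags: set) -> bool:
    """Decide whether this particular claude code instance should pick up a task."""
    in_container = os.path.exists("/.dockerenv")  # common (if crude) docker heuristic
    on_native_apple_silicon = platform.machine() == "arm64" and not in_container
    if "needs-mlx" in task_tags:
        return on_native_apple_silicon   # MLX training only runs on the native host
    if "yolo-ok" in task_tags:
        return in_container              # keep the dangerous stuff in the sandbox
    return True                          # anything else, either side can take it

# pseudo-loop each instance runs:
# for task in list_ready_tasks():        # i.e. whatever `bd ready` surfaces
#     if can_work(task.tags):
#         claim(task)
#         run(task)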

Sun, Jan 18

Colima #

  • It looks like colima is a little lighter than rancher desktop, which I had been using previously. I think in previous attempts with this, I ran into issues with compatibility with the docker cli and some kind of permission issue on the unix socket. This seems to be a non-issue these days
  • Colima adds to ~/.ssh/config - I didn’t even know ssh configs HAD an include directive until today, and I thought I was pretty knowledgeable
  • The beads install failed because the container’s /tmp was too small; bumped it to 4g. It’s in memory, so we’ll see how that works out for me
  • The go linker is a lot more memory hungry than I thought. When it tries to build bd, 4GB isn’t enough in the container; I had to give it 8, which seems absolutely insane. I don’t know if this is just because bd is vibe coded and uses a bunch of other stuff, or if it’s my container, or something else.
# foreground colima with 8gb memory
colima start -f --memory 8

Rule-Based Classifier #

I keep saying LLMs aren’t a panacea and that we should all be cognizant of faster/cheaper/more-open approaches for doing things, so let’s see if that’s true.

Sat, Jan 17

devcontainers #

Time to revisit devcontainers, which I haven’t seriously investigated because I’ve thought of them mostly as a VSCode thing - BUT there is a cli. I’ve been relying too heavily on sandbox mode, especially given that I’ve spent the last 2-3 months doing a big AWS migration and was using beads to guide me through complex runbooks, e.g. for stuff like DNS updates, where there are some synchronous actions I needed to be taking outside of cloudformation or opentofu (or whatever, in the case of stuff like setting up Vercel OIDC with AWS).

I took an old NUC I have lying around and installed debian on it. The machine isn’t super powerful - I originally got it just to display Zwift on a cheap monitor, and it even kind of struggled at that - but it shows 4 cpu cores and 16GB of memory, which is at least a nice amount of power to have if I need to spin something up and don’t want it on my laptop. It’s also basically a live sacrifice to --dangerously-skip-permissions for these kinds of non-client throwaway experiments in development.

I usually leave my heavier-but-beefier machine at home when I’m out and about, but given yesterday’s experiments, I thought I’d try some more experiments with attention and context management. So I set up a devcontainer with claude code and had it work through tasks in parallel. This actually worked well - work proceeded until there were no remaining unblocked tasks, which I think is the goal. This was just using this pattern:

  • define goals
  • decompose work functionally
  • assemble work graph (with blocker management)
  • in successive parallel waves, select unblocked tasks and work them
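
The wave loop itself is tiny once you strip beads away. A toy sketch over an in-memory blocker graph, just to show the shape of it:

def run_in_waves(tasks: dict, work) -> None:
    """tasks maps task id -> set of blocker ids; work() is called once per unblocked task.
    Each wave runs everything currently unblocked, then unblocks its dependents."""
    done = set()
    while len(done) < len(tasks):
        wave = [t for t, blockers in tasks.items() if t not in done and blockers <= done]
        if not wave:
            raise RuntimeError("cycle or unresolvable blockers")
        for task in wave:   # in practice these run in parallel
            work(task)
        done.update(wave)

run_in_waves(
    {"schema": set(), "api": {"schema"}, "ui": {"api"}, "docs": {"schema"}},
    work=lambda t: print("working", t),
)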

What’s interesting here is that we don’t really even put any effort into assigning priorities to these - either a task needs to be done, or it doesn’t. No use just carrying around P5 issues that’ll never get worked. I do wonder if this will help keep backlog sprawl contained in a real enterprise.

I let claude work through some number-crunching tasks on that NUC. Surprisingly it was able to finetune qwen coder 2.5 0.5b on CPU alone, resulting in a model that’s about 450 megabytes. It works about 40% better (by our shitty metric, so maybe take this with a grain of salt), and gets things structurally 100% correct over many iterations. So an improvement.

This journey did eventually land me in Google Colab, where I used a free T4 instance to do basically the same thing (with the same results, but in about 5 minutes vs 2 hours). I think going forward though I’d rather be using MLX on local hardware as much as I can.

Fri, Jan 16

another opencommit clone #

Started on yet-another opencommit clone, this one in Java as a GraalVM native app. Since I was working on a relatively constrained machine, this was a bit of a mess.

I wanted to try a bunch of experiments to find the best prompts and best smallish local models for this type of work. It’s been a while since I’ve dipped my toes into finetuning models, so I figured I’d give it a whack on one of the qwen coder instruct models. I was able to get 100% structure correctness for conventional commits, but a lot of the time the commit type would be wrong.

For test data, I used commits from a bunch of my favorite programmers (Cantrell, Hickey, Carmack, K&R…) but this was probably a bad idea, since these guys aren’t doing conventional commits (in fact I myself don’t care about conventional commits; this is more of a kata for me). I then went to the Angular and Vue repositories and gathered a bunch of training data, since they DO use conventional commits - but I didn’t do any verification to see if they classify things correctly. They just seemed like they might be a good training corpus.

I’m usually pretty happy with claude code’s commit messages, so I also just used claude -p {{prompt}} to generate a bunch of expected outputs.

Some things that really did not work well:

  • scoring the output; we tried to come up with a kind of confidence score based on a handful of factors, but it seems like having a single numerical score just doesn’t work very well. We have roughly this same problem on internal DevObsessed apps. We should probably consider using a vector quantity to score these, since “80%” isn’t really useful if the content is right but the format is wrong. So more of a “necessary but not sufficient” qualification.
  • initial commit messages tended to have the wrong structure (again, unsurprising). This article about Structured LLM Outputs (nanonets.com) showed up on HN. Although I was able to easily “prompt away” the format errors, I’d really like to know more about how to reliably constrain output, since it seems like it’s often nice to get JSON or XML out of the model, and I don’t love just retrying prompts with different temps until it validates.
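
For now the dumb fallback is validate-and-retry. A sketch where generate() is a hypothetical stand-in for whatever model call is being made:

import json

def generate(prompt: str, temperature: float) -> str:
    """Hypothetical stand-in for an ollama / claude -p call that returns raw text."""
    raise NotImplementedError

def commit_message_json(diff: str, max_attempts: int = 3) -> dict:
    """Ask for JSON, validate the shape, bump the temperature and retry if it doesn't parse."""
    required = {"type", "scope", "subject"}
    for attempt in range(max_attempts):
        raw = generate(
            f"Return ONLY JSON with keys type/scope/subject for this diff:\n{diff}",
            temperature=0.2 + 0.3 * attempt,
        )
        try:
            parsed = json.loads(raw)
            if isinstance(parsed, dict) and required <= parsed.keys():
                return parsed
        except json.JSONDecodeError:
            pass  # fall through and retry
    raise ValueError("model never produced valid JSON")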

I kind of forgot I wasn’t on my workstation and wound up down a rabbit hole of fine tuning the model. This is the first time my macbook air has ever gotten hot, and the first time it’s ever completely frozen.

Thursday, Jan 15

Janki #

I need to refresh my Java knowledge since I haven’t actively done development with Java in about a year. A logical starting point might be a web-based clone of the flashcard program Anki, as a dogfooding exercise.

Things to check out:

  • graalvm - how’s this working in 2026 (answer: pretty slick!)
  • api

TODOS:

The UI on the app is quite awful since I let Claude Code go at the UI without any direction; it defaulted to a horrorshow of cards and way too many colors. This isn’t a surprise, but I was kind of hoping I could slack here because the interface is small and there’s plenty of prior art to copy.

Joco #

Time to check in on the “fully autonomous development” tooling since I haven’t done anything since my AgentMail/Beads experiments. The approach here was slightly different, though, and heavily influenced by the fact that we have a DAG tracking blockers (bd). If our blocker tracking is correct, then there is really nothing preventing us from working epics in parallel. Since this was new code though, I was able to kind of partition it without needing worktrees.

The system then was:

Define epics (in beads, not GH). Epic’s probably not the correct term here; what I’m doing is closer to what ADO might call a “feature”. Regardless, define features with tasks. At this point, I’m just using plan mode. When the big-doc plan is assembled, fork a background task (general purpose) to decompose each feature into a series of tasks, with a reasonable facsimile of what the blockers are. After this fan-out is complete, on fan-in we take another run through and assess blockers across epics. The epics are a pretty permeable barrier - claude doesn’t give a shit about them when it’s implementing.

At this point we kick off a cycle of

  • Find all unblocked tasks
  • Work them in parallel waves, in batches of ~4. I usually struggle to manage the work with much more than this; it gets geometrically more confusing the more tasks we run per wave or the more waves we let run before reviewing.

I let this grind without much attention given to keeping it unstuck (on an 8GB macbook air, no less).