Generating data for uh-halp model

Making a dataset for command-line help completion, so I can run my uh-halp program locally.

Requirements

Run on Ubuntu
Install Docker
Install make

The pipeline itself runs Docker, so it won't work from inside a container.

Steps

Get a list of all binaries in Ubuntu's package manager
Do a tournament using llama to figure out which ones are most important
Install all those packages into a Docker container. It's over 200GB and is here.
Extract all the help files and manpages.
Use llama to generate a narrative about how people generally use each program.
📌 YOU ARE HERE: Combine the docs and the narrative with uh-halp description to generate training data.
Filter the data - get rid of things that don't look right.
Fine-tune a small model with the generated data.
Squish it, push it
Sell underpants
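
The tournament step above can be sketched roughly like this. It is a minimal illustration, not the code in scripts: the round-based pairing, the `run_tournament` name and the `toy_judge` stand-in are all assumptions. In the real pipeline the judge would be a prompt to llama asking which of two binaries matters more.

```python
import random
from collections import Counter
from typing import Callable, Sequence


def run_tournament(
    names: Sequence[str],
    pick_winner: Callable[[str, str], str],
    rounds: int = 5,
    seed: int = 0,
) -> list[str]:
    """Rank names by pairwise wins over several randomly paired rounds."""
    rng = random.Random(seed)  # fixed seed keeps reruns reproducible
    wins = Counter({name: 0 for name in names})
    for _ in range(rounds):
        order = list(names)
        rng.shuffle(order)
        # Pair neighbours; with an odd count the last entry sits the round out.
        for a, b in zip(order[::2], order[1::2]):
            wins[pick_winner(a, b)] += 1
    # Most wins first; ties broken alphabetically so the output is stable.
    return sorted(names, key=lambda n: (-wins[n], n))


def toy_judge(a: str, b: str) -> str:
    """Hypothetical stand-in for the llama call: prefers the shorter name."""
    return min(a, b, key=lambda n: (len(n), n))


ranking = run_tournament(["ls", "grep", "xeyes", "ffmpeg"], toy_judge, rounds=10)
print(ranking)
```

Counting wins over many cheap pairwise questions tends to be more forgiving of a noisy judge than asking for one big sorted list, which is presumably why a tournament is used here.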

Ideas

TODO

Musty havelys

Detect the command + subcommand pattern, with nested --halps going on.
Add info helps too.
Popularity contest needs to filter out GUI apps.
¿que halp? - translate training data
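
One possible shape for the subcommand detection item above, as a sketch only. Nothing like this exists in the repo yet as far as this readme says; the `find_subcommands` name, the "Commands:" heading heuristic and the sample help text for a made-up `tool` are all assumptions.

```python
import re

# Made-up help output for a hypothetical program called "tool".
SAMPLE_HELP = """\
Usage: tool [OPTIONS] COMMAND

Options:
  --help  Show this message and exit.

Commands:
  build   Build the thing
  push    Push the thing
"""


def find_subcommands(help_text: str) -> list[str]:
    """Return names listed under a 'Commands:'-style heading in help output."""
    subcommands = []
    in_section = False
    for line in help_text.splitlines():
        is_heading = bool(line.strip()) and not line[0].isspace()
        if is_heading and re.search(r"commands?:\s*$", line, re.IGNORECASE):
            in_section = True
            continue
        if in_section:
            match = re.match(r"^\s+([a-z][\w-]*)\b", line)
            if match:
                subcommands.append(match.group(1))
            elif is_heading:
                in_section = False  # a new unindented heading ends the section
    return subcommands


found = find_subcommands(SAMPLE_HELP)
# Each hit is a candidate for another round of --help extraction.
nested_calls = [f"tool {name} --help" for name in found]
print(nested_calls)
```

Real help output is far messier than this, so a heuristic like the one above would only be a first pass before filtering.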

Never to have

Look at other package sources (brew, pip, npm, cpan, rubygems, go)
Mac + Windows specific stuff.
- Might make a separate repo for these.
Use Provides/Depends etc. package metadata to get alternatives and ++priority.
Add tests / test framework.

data

Break this apart and do the same for brew
Get collections of scripts and summarize each line, "what was the author thinking" -> "what would they ask 'uh' to get this line"
Add tldr dataset

model

Snag Gemini Nano from Chrome?
LaMini-T5-223M?
qwen2.5-coder?

Notes for AI assistants

You will have been given a project snapshot made with git-dump, which is the following alias: alias git-dump='git diff 4b825dc642cb6eb9a060e54bf8d69288fbee4904 HEAD'. If you can run code and don't have this dump in your context, clone this repo and run it.
The ideas and notes listed in this file are mainly notes for the human rather than instructions. Your main focus should be on the "you are here" marker at the top of this readme, if there is one, then the "must have" TODO section after that. Stay on target.
It is important to remember that some steps take a long time to run, so take care not to introduce cache-blowing changes.
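
The long hash in the git-dump alias is git's well-known empty tree object, so diffing it against HEAD prints every tracked file as an addition. A small check of where that ID comes from, assuming the default SHA-1 object format (the `git_object_id` helper is just for illustration):

```python
import hashlib

EMPTY_TREE = "4b825dc642cb6eb9a060e54bf8d69288fbee4904"


def git_object_id(kind: str, body: bytes) -> str:
    """SHA-1 object ID as git computes it: 'kind size', a NUL byte, then the body."""
    header = f"{kind} {len(body)}".encode() + b"\0"
    return hashlib.sha1(header + body).hexdigest()


# A tree with no entries has an empty body, so only the header is hashed.
empty_tree_id = git_object_id("tree", b"")
print(empty_tree_id == EMPTY_TREE)
```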
