Also - don't write CLI programs in languages that don't compile to native binaries. I don't want to have to drag around your runtime just to execute a command line tool.
MathMonkeyMan 11 hours ago [-]
Almost every command line tool has runtime dependencies that must be installed on your system.
The worst is compiling a C program with a compiler that uses a more recent libc than is installed on the installation host.
craftkiller 8 hours ago [-]
Don't let your dreams be dreams
$ wget 'https://github.com/BurntSushi/ripgrep/releases/download/14.1.1/ripgrep-14.1.1-x86_64-unknown-linux-musl.tar.gz'
$ tar -xvf 'ripgrep-14.1.1-x86_64-unknown-linux-musl.tar.gz'
$ ldd ripgrep-14.1.1-x86_64-unknown-linux-musl/rg
ldd (0x7f1dcb927000)
$ file ripgrep-14.1.1-x86_64-unknown-linux-musl/rg
ripgrep-14.1.1-x86_64-unknown-linux-musl/rg: ELF 64-bit LSB pie executable, x86-64, version 1 (SYSV), static-pie linked, stripped
3836293648 3 hours ago [-]
Which only works on Linux. No other OS allows static binaries; you always need to link to libc for syscalls.
craftkiller 2 hours ago [-]
Also works on FreeBSD. FreeBSD maintains ABI compatibility within each major version (so 14.0 is compatible with 14.1, 14.2, and 14.3, but not 15.0). You can also install compatibility packages that make binaries compiled for older major versions run on newer major versions.
$ pkg install git rust
$ git clone https://github.com/BurntSushi/ripgrep.git
$ cd ripgrep
$ RUSTFLAGS='-C target-feature=+crt-static' cargo build --release
$ ldd target/release/rg
ldd: target/release/rg: not a dynamic ELF executable
$ file target/release/rg
target/release/rg: ELF 64-bit LSB executable, x86-64, version 1 (FreeBSD), statically linked, for FreeBSD 14.3, FreeBSD-style, with debug_info, not stripped
exe34 22 minutes ago [-]
try cosmopolitan!
Sharlin 8 hours ago [-]
Sure, but Rust specifically uses static linking for everything but the very basics (i.e. libc) in order to avoid DLL hell.
bschwindHN 9 hours ago [-]
Yes but I've never had a native tool fail on a missing libc. I've had several Python tools and JS tools fail on missing the right version of their interpreter. Even on the right interpreter version Python tools frequently shit the bed because they're so fragile.
mjevans 8 hours ago [-]
I have. During system upgrades, usually along unsupported paths.
If you're ever living dangerously, bring along busybox-static. It might not be the best, but you'll thank yourself later.
sestep 8 hours ago [-]
I statically link all my Linux CLI tools against musl for this reason. Or use Nix.
dboon 9 hours ago [-]
That’s the first rule anyone writing portable binaries learns. Compile against an old libc, and stuff tends to just work.
delta_p_delta_x 9 hours ago [-]
> Compile against an old libc
This clause is abstracting away a ton of work. If you want to compile the latest LLVM and get 'portable C++26', you need to bootstrap everything, including CMake from that old-hat libc on some ancient distro like CentOS 6 or Ubuntu 12.04.
I've said it before, I'll say it again: the Linux kernel may maintain ABI compatibility, but the fact that GNU libc breaks it anyway makes it a moot point. It is a pain to target older Linux with a newer distro, which is by far the most common development use case.
dboon 7 hours ago [-]
Definitely, and I know this sounds like ignoring the problem, but in my experience the best solution is to just not use the bleeding edge.
Write your code such that you can load it onto (for example) the oldest supported Ubuntu and compile cleanly and you’ll have virtually zero problems. Again, I know that if your goal is to truly ship something written in e.g. C++26 portably then it’s a huge pain. But as someone who writes plain C and very much enjoys it, I think it’s better to skip this class of problem.
delta_p_delta_x 5 hours ago [-]
> I think it’s better to skip this class of problem.
I'll keep my templates, smart pointers, concepts, RAII, and now reflection, thanks. C and its macros are good for compile times but nothing much else. Programming in C feels like banging rocks together.
1718627440 2 hours ago [-]
> The worst is compiling a C program with a compiler that uses a more recent libc than is installed on the installation host.
This is only a problem when the program USES a symbol that was only introduced in the newer libc. In other words, when the program deliberately chose to need that newer symbol.
majorbugger 12 hours ago [-]
I will keep writing my CLI programs in the languages I want, thanks. Has it crossed your mind that these programs might be for yourself or for internal consumption? When you know the runtime will be installed anyway?
dcminter 12 hours ago [-]
You do you, obviously, but "now let npm work its wicked way" is an offputting step for some of us when narrowing down which tool to use.
My most comfortable tool is Java, but I'm not going to persuade most of the HN crowd to install a JVM unless the software I'm offering is unbearably compelling.
Internal to work? Yeah, Java's going to be an easy sell.
I don't think OP necessarily meant it as a political statement.
goku12 11 hours ago [-]
There should be some way to define the CLI argument format and its constraints in some sort of DSL that can be compiled into the target language before the final compilation of the application. This way, it can be language agnostic (though I don't know why you would need this) without the need for another runtime. The same interface specification should be able to represent a customizable help/usage message with sane defaults, generate dynamic tab-completion code for multiple shells, generate code for good-quality customizable error messages in case of CLI argument errors, and generate a neatly formatted man page with provisions for additional content, etc.
In fact, I think something like this already exists. I just can't recollect the project.
This is not an issue with Java and the other JVM languages; it's simple to use GraalVM and package a static binary.
lazide 4 hours ago [-]
most Java CLIs (well, the non-shitty ones), and most distributed Java programs in general, package their own JVMs in a hermetic environment. It's just saner.
bschwindHN 9 hours ago [-]
That's fine, I'll be avoiding using them :)
perching_aix 8 hours ago [-]
You'll avoid using his personal tooling he doesn't share, and his internal tooling he shares where you don't work?
Are you stuck in write-only mode or something? How does this make any sense to you?
jampekka 3 hours ago [-]
> Also - don't write CLI programs in languages that don't compile to native binaries. I don't want to have to drag around your runtime just to execute a command line tool.
And don't write programs in languages that depend on CMake and random tarballs to build and/or shared libraries to run.
I usually have a lot fewer issues dragging a runtime around than fighting with builds.
rs186 11 hours ago [-]
Apparently that ship has sailed. Claude Code (https://www.anthropic.com/claude-code) and Gemini CLI (https://github.com/google-gemini/gemini-cli) require a Node.js installation, and the Gemini README reads as if npm is a tool that everybody knows and has already installed.
Opencode is a great model-agnostic alternative which does not require a separate runtime.
yunohn 59 minutes ago [-]
Opencode uses TS and Golang; it definitely needs a runtime for the TS part. CPU usage hovers around 100% for me on an MBP M3 Max.
Sharlin 8 hours ago [-]
That's terrible, but at the very least there's the tiny justification that those are web API clients rather than standalone/local tools.
perching_aix 8 hours ago [-]
Like shell scripts? Cause I mean, I agree, I think this world would be a better place if starting tomorrow shell scripts were no longer a thing. Just probably not what you meant.
ycombobreaker 6 hours ago [-]
Shell scripts are a byproduct of the shell existing. Generations of programmers have cut their teeth in CLI environments. Anything that made shell scripts "no longer a thing" would necessarily destroy the interactive environment, and sounds like a ladder-pull to the curiosity of future generations.
bschwindHN 7 hours ago [-]
> I think this world would be a better place if starting tomorrow shell scripts were no longer a thing.
Pretty much agreed - once any sort of complicated logic enters a shell script it's probably better off written in C/Rust/Go or something akin to that.
dcminter 12 hours ago [-]
The declarative form of clap is not quite as well documented as the programmatic approach (but it's not too bad to figure out usually).
One of the things I love about clap is that you can configure it to automatically spit out --help info, and you can even get it to generate shell autocompletions for you!
I think there are some other libraries that are challenging it now (fewer dependencies or something?) but clap sets the standard to beat.
LtWorf 4 hours ago [-]
> Also - don't write CLI programs in languages that don't compile to native binaries. I don't want to have to drag around your runtime just to execute a command line tool.
Go programs compile to native executables, but they're still rather slow to start, especially if you just want to do --help
ndsipa_pomu 5 hours ago [-]
> don't write CLI programs in languages that don't compile to native binaries. I don't want to have to drag around your runtime just to execute a command line tool.
Well, that's confused me. I write a lot of scripts in Bash specifically to make it easy to move them to different architectures etc. and not require a custom runtime. Interpreted scripts also have the advantage that they're human readable/editable.
jmull 24 hours ago [-]
> Think about it. When you get JSON from an API, you don't just parse it as any and then write a bunch of if-statements. You use something like Zod to parse it directly into the shape you want. Invalid data? The parser rejects it. Done.
Isn’t writing code and using zod the same thing? The difference being who wrote the code.
Of course, you hope zod is robust, tested, supported, extensible, and has docs so you can understand how to express your domain in terms it can help you with. And you hope you don’t have to spend too much time migrating as zod’s api changes.
MrJohz 18 hours ago [-]
I think the key part, although the author doesn't quite make it explicit, is that (a) the parsing happens all up front, rather than weaving validation and logic together, and (b) the parsing creates a new structure that encodes the invariants of the application, so that the rest of the application no longer needs to check anything.
Whether you do that with Zod or manually or whatever isn't important, the important thing is having a preprocessing step that transforms the data and doesn't just validate it.
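A rough sketch of that preprocessing step (the names here are made up purely for illustration, not taken from the article):
    // Loose shape straight from the argument list.
    interface RawArgs { server?: boolean; port?: string; verbose?: boolean }

    // Rich shape: a port only exists when server mode is on.
    type Config =
      | { mode: "server"; port: number; verbose: boolean }
      | { mode: "client"; verbose: boolean };

    function parseConfig(raw: RawArgs): Config {
      const verbose = raw.verbose ?? false;
      if (!raw.server) return { mode: "client", verbose };
      const port = Number(raw.port);
      if (!Number.isInteger(port) || port < 1 || port > 65535) {
        throw new Error("--server requires a valid --port (1-65535)");
      }
      return { mode: "server", port, verbose };
    }
After this point the rest of the program only ever sees Config, so "does the port exist?" is no longer a question anyone has to ask.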
1718627440 2 hours ago [-]
But when you parse all arguments first before throwing error messages, you can create much better error messages, since they can be more holistic. To do that you need to represent the invalid configuration as a type.
12_throw_away 1 hours ago [-]
> To do that you need to represent the invalid configuration as a type
Right - and one thing that keeps coming up for me is that, if you want to maintain complex invariants, it's quite natural to express them in terms of the domain object itself (or maybe, ugh, a DTO with the same fields), rather than in terms of input constraints.
makeitdouble 15 hours ago [-]
The base assumption is that parsing upfront costs less than validating along the way. I think it's a common case, but not common enough to apply as a generic principle.
For instance if validating parameter values requires multiple trips to a DB or other external system, weaving the calls in the logic can spare duplicating these round trips. Light "surface" validation can still be applied, but that's not what we're talking about here I think.
MrJohz 15 hours ago [-]
It's not about costing less, it's about program structure. The goal should be to move from interface type (in this case a series of strings passed on the command line) to internal domain type (where we can use rich data types and enforce invariants like "if server, then all server properties are specified") as quickly as possible. That way, more of the application can be written to use those rich data types, avoiding errors or unnecessary defensive programming.
Even better, that conversion from interface type to internal type should ideally happen at one explicit point in the program - a function call which rejects all invalid inputs and returns a type that enforces the invariants we're interested in. That way, we have a clean boundary point between the outside world and the inside one.
This isn't a performance issue at all, it's closer to the "imperative shell, functional core" ideas about structuring your application and data.
lmm 10 hours ago [-]
> if validating parameter values requires multiple trips to a DB or other external system, weaving the calls in the logic can spare duplicating these round trips
Sure, but probably at the cost of leaving everything in a horribly inconsistent state when you error out partway through. Which is almost always not worth it.
bigstrat2003 19 hours ago [-]
Yeah, the "parse, don't validate" advice seems vacuous to me because of this. Someone is doing that validation. I think the advice would perhaps be phrased better as "try to not reimplement popular libraries when you could just use them".
lock1 16 hours ago [-]
When I first saw "Parse, don't validate" title, it struck me as a catchy but perhaps unnecessarily clever catchphrase. It's catchy, yes, but it felt too ambiguous to be meaningful for anyone outside of the target audience (Haskellers in this case).
That said, I fully agree with the article content itself. It basically just boils down to:
When you create a program, eventually you'll need to process & check whether input data is valid or not. In a C-like language, you have two options:
void validate(struct Data d);
or
struct ValidatedData;
ValidatedData validate(struct Data d);
"Parse, don't validate" is just trying to say don't do `void validate(struct Data d)` (procedure with `void`), but do `ValidatedData validate(struct Data d)` (function returning `ValidatedData`) instead.
It doesn't mean you need to explicitly create or name everything as a "parser". It also doesn't mean "don't validate" either; in `ValidatedData validate(struct Data d)` you'll eventually have "validation" logic similar to the procedure `void` counterpart.
Specifically, the article tries to teach folks to utilize the type system to their advantage. Rather than praying to never forget invoking `validate(d)` on every single call site, make the type signature only accept `ValidatedData` type so the compiler will complain loudly if future maintainers try to shove `Data` type to it. This strategy offloads the mental burden of remembering things from the dev to the compiler.
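A minimal TypeScript rendering of that pattern (placeholder names; the private constructor is what lets the compiler do the remembering):
    interface Data { port: string }

    class ValidatedData {
      // Private constructor: validate() is the only way to get an instance.
      private constructor(readonly port: number) {}

      static validate(d: Data): ValidatedData | null {
        const port = Number(d.port);
        if (!Number.isInteger(port) || port < 1 || port > 65535) return null;
        return new ValidatedData(port);
      }
    }

    // Downstream code only accepts ValidatedData, so the compiler rejects
    // any attempt to pass an unchecked Data through.
    function startServer(config: ValidatedData): void {
      console.log(`listening on ${config.port}`);
    }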
I'm not exactly sure why the "Parse, don't validate" catchphrase keeps getting reused in other language communities. It's not clear to people outside the FP community what the distinction between "parse" and "validate" is, let alone what a "parser combinator" is. Yet somehow other articles keep reusing this same catchphrase.
Lvl999Noob 7 hours ago [-]
The difference, in my opinion, is in the form you receive the CLI args in.
Before parsing, the argument array can contain both the flag that enables the option and the flag that disables it. Validation would either throw an error or accept it as either enabled or disabled. But importantly, it wouldn't change the arguments. If the assumption is that the last option overwrites anything before it, then the CLI command is valid with the option disabled.
And now, correct behaviour relies on all the code using that option to always make the same assumption.
Parsing, on the other hand, would create a new config where `option` is an enum - either enabled, disabled, or not given. No confusion about multiple flags or anything. It provides a single view for the rest of the program of what the input config was.
Whether that parsing is done by a third-party library or first-party code, declaratively or imperatively, is beside the point.
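A small sketch of what that single parsed view might look like (hypothetical names, assuming last occurrence wins):
    type OptionState = "enabled" | "disabled" | "unset";

    // Last occurrence wins; the rest of the program only ever sees the enum.
    function parseOption(argv: string[]): OptionState {
      let state: OptionState = "unset";
      for (const arg of argv) {
        if (arg === "--option") state = "enabled";
        if (arg === "--no-option") state = "disabled";
      }
      return state;
    }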
andreygrehov 6 hours ago [-]
What is ValidatedData? A subset of the Data that is valid? This makes no sense to me. The way I see it is you use ‘validate’ when the format of the data you are validating is the exact same format you are gonna be working with right after, meaning the return type doesn’t matter. The return type implies transformation – a write operation per se, whereas validation is always a read operation only.
lock1 2 hours ago [-]
> What is ValidatedData? A subset of the Data that is valid?
Usually, but not necessarily. `validate()` might add some additional information too, for example: `validationTime`.
More often than not, in a real case of applying algebraic data type & "Parse, don't validate", it's something like `Option<ValidatedData>` or `Result<ValidatedData,PossibleValidationError>`, borrowing Rust's names. `Option` & `Result` expand the possible return values that function can return to cover the possibility of failure in the validation process, but it's independent from possible values that `ValidatedData` itself can contain.
> The way I see it is you use ‘validate’ when the format of the data you are validating is the exact same format you are gonna be working with right after, meaning the return type doesn’t matter.
The main point of "Parse, don't validate" is to distinguish between "machine-level data representation" vs "possible set of values" of a type and utilize this "possible set of values" property.
Your "the exact same format" point is correct; oftentimes, the underlying data representation of a type is exactly the same between pre- & post-validation. But more often than not "possible set of values" of `ValidatedData` is a subset of `Data`. These 2 different "possible set of values" are given their own names in the form of a type `Data` and `ValidatedData`.
This distinction is actually very handy because types can be checked automatically by the (nominal) type system. If you make the `ValidatedData` constructor private & the only way to produce is function `ValidatedData validate(Data)`, then in any part of the codebase, there's no way any `ValidatedData` instance is malformed (assuming `validate` doesn't have bugs).
Extra note: I forgot to mention that the "Parse, don't validate" article implicitly assumes a nominal type system, where two objects with an equivalent "data representation" don't necessarily have the same type. This differs from Typescript's structural type system, where as long as the "data representation" is the same, both objects are considered to have the same type.
Typescript will happily accept something like this because of structural typing:
type T1 = { x: String };
type T2 = { x: String };
function f(arg: T1): void { ... }
const t2: T2 = { x: "foo" };
f(t2);
While nominal type systems like Haskell's or Java's will reject such expressions:
class T1 { String x; }
class T2 { String x; }
void f(T1 t) { ... }
// f(new T2()); // Compile error: type mismatch
Because of this, the idea of using a type as a "possible set of values" probably feels unintuitive to Typescript folks, as everything is just stringly-typed and a different type feels synonymous with a different "underlying data representation" there.
You can simulate this "same structure, but different meaning" concept of nominal type system in Typescript with some hacky workaround with Symbol.
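One such workaround is a branded type, roughly like this (just a sketch, not the only way to do it):
    // A unique symbol used purely as a phantom "brand" at the type level.
    declare const validated: unique symbol;

    type Data = { x: string };
    type ValidatedData = Data & { readonly [validated]: true };

    function validate(d: Data): ValidatedData | null {
      return d.x.length > 0 ? (d as ValidatedData) : null;
    }

    function f(d: ValidatedData): void { console.log(d.x); }

    // f({ x: "foo" });                              // compile error: the brand is missing
    // const v = validate({ x: "foo" }); if (v) f(v); // ok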
> The return type implies transformation – a write operation per se, whereas validation is always a read operation only
Why does the return type need to imply transformation, and why is "validation" here always read-only? A no-op function returns the exact same value you give it (in other words, the identity transformation), and Java & Javascript procedures never guarantee a read-only operation.
dwattttt 19 hours ago [-]
Sibling says this with code, but to distil the advice: reflect the result of your validation in the type system.
Then instead of validating a loose type & still using the loose type, you're parsing it from a loose type into a strict type.
The key point is you never need to look at a loose type and think "I don't need to check this is valid, because it was checked before"; the type system tracks that for you.
8n4vidtmkvmk 15 hours ago [-]
Everyone seems hung up on the type system, but I think the validity of the data is the important part. I'd still want to convert strings to ints, trim whitespace, drop extraneous props and all of that jazz even if I was using plain JS without types.
I still wouldn't need to check the inputs again because I know it's already been processed, even if the type system can't help me.
dwattttt 14 hours ago [-]
The type isn't just there to make it easy to understand when you do it, it's for you a year later when you need to make a change further inside a codebase, far from where it's validated. Or for someone else who's never even seen the validation section of code.
I'm hung up on the type system because it's a great way to convey the validity of the data; it follows the data around as it flows through your program.
I don't (yet) use Typescript, but jsdoc and linting give me enough type checking for my needs.
k3vinw 8 hours ago [-]
jsdoc types are better than nothing. You could switch to using Typescript today and it will understand them.
Lvl999Noob 5 hours ago [-]
Pure js without typescript also has "types". Typescript doesn't give you nominal types either. It's only structural. So when you say that you "know it's already been processed", you just have a mental type of "Parsed" vs "Raw". With a type system, it's like you have a partner dedicated to tracking that. But without that, it doesn't mean you aren't doing any parsing or type tracking of your own.
remexre 19 hours ago [-]
The difference between parse and validate is
function parse(x: Foo): Bar { ... }
const y = parse(x);
and
function validate(x: Foo): void { ... }
validate(x);
const y = x as Bar;
Zod has a parser API, not a validator API.
yakshaving_jgt 15 hours ago [-]
Parsing includes validation.
The point is you don’t check that your string only contains valid characters and then continue passing that string through your system. You parse your string into a narrower type, and none of the rest of your system needs to be programmed defensively.
To describe this advice as “vacuous” says more about you than it does about the author.
akoboldfrying 20 hours ago [-]
Yes, both are writing code. But nearly all the time, the constraints you want to express can be expressed with zod, and in that case using zod means you write less code, and the code you do write is more correct.
> Of course, you hope zod is robust, tested, supported, extensible, and has docs so you can understand how to express your domain in terms it can help you with. And you hope you don’t have to spend too much time migrating as zod’s api changes.
Yes, judgement is required to make depending on zod (or any library) worthwhile. This is not different in principle from trusting those same things hold for TypeScript, or Node, or V8, or the C++ compiler V8 was compiled with, or the x86_64 chip it's running on, or the laws of physics.
jmull 18 hours ago [-]
Sure... the laws of physics last broke backwards compatibility at the Big Bang, Zod last broke backwards compatibility a few months ago.
12_throw_away 20 hours ago [-]
I like this advice, and yeah, I always try to make illegal states unrepresentable, possibly even to a fault.
The problem I run into here is - how do you create good error messages when you do this? If the user has passed you input with multiple problems, how do you build a list of everything that's wrong with it if the parser crashes out halfway through?
ffsm8 17 hours ago [-]
I think you're looking at it too literally - what people usually mean with "making invalid state unrepresentable" is in the main application which has your domain code - which should be separate from your inputs.
He even gives the example of zod, which is a validation library he defines to be a parser.
What he wants to say : "I don't want to write my own validation in a CLI, give me a good API already that first validates and then converts the inputs into my declared schema"
MrJohz 10 hours ago [-]
> I don't want to write my own validation in a CLI, give me a good API already that first validates and then converts the inputs into my declared schema
But that _is_ parsing, at least in the sense of "parse, don't validate". It's about turning inputs into real objects representing the domain code that you're about to be working with. The result is still going to be a DTO of some description, but it will be a DTO with guaranteed invariants that are useful to you. For example, a post request shouldn't be parsed into a user object just because it shares a lot of fields in common with a user. Instead it should become a DTO with the invariants fulfilled that makes sense for a DTO. Some of those invariants are simple (like "dates should be valid" -> the DTO contains Date objects not strings), and some will be more complex like the "if the server is active, then the port also needs to be provided" restriction from the article.
This is one of the key ideas behind Zod - it isn't just trying to validate whether an object matches a certain schema, but it converts the result into a type that accurately expresses the invariants that must be in place if the object is valid.
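For instance, a schema along these lines (my own sketch with Zod, reusing the article's server/port invariant - not code from the article):
    import { z } from "zod";

    const Config = z
      .object({
        server: z.boolean().default(false),
        port: z.coerce.number().int().min(1).max(65535).optional(),
      })
      .refine((c) => !c.server || c.port !== undefined, {
        message: "--server requires --port",
      });

    type Config = z.infer<typeof Config>;

    // Rejected with the refine message: server mode without a port.
    // Config.parse({ server: true });
    const config: Config = Config.parse({ server: true, port: "8080" });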
ffsm8 9 hours ago [-]
I dont disagree with the desire to get a good API like that.
I was just pointing out that this was the core of the author's desire, as 12_throw_away correctly pointed out that _true_ parsing and making invalid state unrepresentable force you to error out on the first mismatch, which makes it impossible to raise multiple issues. The only way around that is to allow invalid state during the input phase.
Zod also allows invalid state as input, then attempts to shoehorn it into the desired schema, which still runs the validations the author was complaining about - just not in the code he wrote.
Lvl999Noob 5 hours ago [-]
Why does "true" parsing have to error out on the very first problem? It is more than possible (though maybe not easy) to keep parsing and collecting errors as they appear. Zod, as the given example in the post, does it.
1718627440 2 hours ago [-]
Because then it would need to represent invalid data in its output type.
MrJohz 4 hours ago [-]
I don't know that I understand why parsing necessarily has to error out on the first mismatch. Good parsers will collect errors as they go along.
Zod does take in invalid state as input, but that is what a parser does. In this case, the parser is `any -> T` as opposed to `string -> T`, but that's still a parsing operation.
12_throw_away 2 hours ago [-]
Well, if you want to collect errors, then you need to have a way to store the transformed input in a form that allows you to check the invariants, which can be arbitrarily complex. So naturally there must be some intermediate representations that allow illegal states. And there must be functions that take these IRs that return either domain objects or lists of errors.
So, having used this thread to rubber-duck about how the principle of "parse-don't-validate" works with the principle of "provide good error messages", I'm arriving at these rules, which are really more about encapsulation than parsing:
1. Encapsulate both parsing and validation in a single function: `parse(RawInput) -> Result<ValidDomainObject,ListOfErrors>` (sketched below)
2. Ideally, `parse` is implemented by a robust parsing/validation library for the type of input that you're dealing with. It will create some intermediate representations that you need not concern yourself with.
3. If there isn't a good parser library for your use case, your implementation of `parse` will necessarily contain intermediate representations of potentially illegal state. This is both fine and unavoidable, just don't let them leak out of your parser.
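For rule 1, a rough TypeScript shape (all names here are hypothetical):
    type Result<T> = { ok: true; value: T } | { ok: false; errors: string[] };

    interface Config { host: string; port: number }

    function parse(raw: Record<string, string | undefined>): Result<Config> {
      const errors: string[] = [];
      // Intermediate, possibly-illegal state lives only inside this function.
      const host = raw["host"] ?? "";
      if (host === "") errors.push("missing --host");
      const port = Number(raw["port"]);
      if (!Number.isInteger(port) || port < 1 || port > 65535) {
        errors.push("--port must be an integer between 1 and 65535");
      }
      if (errors.length > 0) return { ok: false, errors };
      return { ok: true, value: { host, port } };
    }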
8n4vidtmkvmk 15 hours ago [-]
Zod might be a validation library, but it also does type coercion and transforms. I believe that's what the author means by a parser.
goku12 12 hours ago [-]
Apparently not. The author cites the example of JSON parsing for APIs. You usually don't split it into a generic parse into native data types followed by validating the result in memory (unless you're in a dynamically typed language and don't use a validation schema). Instead, the expected native data type of the result (composed using structs, enums, unions, vectors, etc.) is defined first and then you try to parse the JSON into that data type. Any JSON errors and schema violations will error out in a single step.
mark38848 14 hours ago [-]
Just use optparse-applicative in PureScript. Applicatives are great for this and the library gives it to you for free.
bradrn 12 hours ago [-]
> Just use optparse-applicative in PureScript.
Or in Haskell!
adinisom 16 hours ago [-]
If we're talking about UI, the flip side is not to harm the user's data. So despite containing errors, it needs to be representable, even if it can't be passed further along to back-end systems.
For parsing specifically, there's literature on error recovery to try to make progress past the error.
ambicapter 20 hours ago [-]
Most validation libraries worth their salt give you options to deal with this sort of thing? They'll hand you an aggregate error with an 'errors' array, or they'll let you write an error message "prettify-er" to make a particular validation error easier to read.
pmarreck 17 hours ago [-]
Right, but that's validation, and this article is talking about parsing (not validating) into an already-correct structure by making invalid inputs unrepresentable.
So maybe the reason why they were able to reduce the code is because they lost the ability to do good error reporting.
jpc0 12 hours ago [-]
How is getting an error array not making invalid input unrepresentable?
You either get the correctly parsed data or you get an error array. The incorrect input was never represented in code, vs a 0 value being returned or even worse random gibberish.
A trivial example: 1/0 should return DivisionByZero not 0 or infinity or NaN or whatever else. You can then decide in your UI whether that is a case you want to handle as an error or as an edge case but the parser knows that is not possible to represent.
lmm 10 hours ago [-]
You parse into an applicative validation structure, combine those together, and then once you've brought everything together you handle that as either erroring out with all the errors or continuing with the correct config. It's easier to do that with a parsing approach than a validating approach, not harder.
Ygg2 10 hours ago [-]
Parsers can be made to not fail on the first error. You return either a parsed structure or an array of found errors.
The HTML5 parser is notoriously friendly to errors. See the adoption agency algorithm.
Thaxll 19 hours ago [-]
This works if all errors are self-contained; stopping at the first one is fine too.
geysersam 15 hours ago [-]
Maybe you can use his `or` construct to allow a `--server` without `--port`, but then also add a default `error_message` property.
After parsing you check if `error_message` exists and raise that error.
akoboldfrying 20 hours ago [-]
Agree. It should definitely be possible to get error messages on par with what TypeScript gives you when you try to assign an object literal to an incompatibly typed variable; whether that's currently the case, and how difficult it would be to get there if not, I don't know.
nine_k 23 hours ago [-]
This is a recurring idea: "Parse, don't validate". Previously:
The author credits Alexis King at the beginning and links to that post.
SloopJon 22 hours ago [-]
I don't see anything in the post or the linked tutorial that gives a flavor of the user experience when you supply an invalid option. I tried running the example, but I've forgotten too much about Node and TypeScript to make it work. (It can't resolve the @optique references.) What happens when you pass --foo, --target bar, or --port 3.14?
macintux 20 hours ago [-]
I had a similar question: to me, the output format “or” statement looks like it might deterministically pick one winner instead of alerting the user that they erred. A good parser is terrific, but it needs to give useful feedback.
Dragging-Syrup 10 hours ago [-]
Absolutely; I think calling the function xor would be more appropriate.
esafak 19 hours ago [-]
The "problem" is that some languages don't have rich enough type systems to encode all the constraints that people want to support with CLI options. And many programmers aren't that great at wielding the type systems at their disposal.
Make the usage string be the specification!
A criminally underused library.
tomjakubowski 18 hours ago [-]
A great example of "declaration follows use" outside of C syntax.
fragmede 20 hours ago [-]
My favorite. A bit too much magic for some, but it seems well specified to me.
SoftTalker 21 hours ago [-]
I like just writing functions for each valid combination of flags and parameters. Anything that isn’t handled is default rejected. Languages like Erlang with pattern matching and guards make this a breeze.
kiliancs 6 hours ago [-]
Great project. Clear goal, well executed, very nice API (safe, terse, clear).
I don't understand. Why is this a parser? Isn't it just a way of enforcing a type in a language that doesn't have types?
I was expecting something like a state machine that takes the command line text and parses it to validate the syntax and values.
hansvm 20 hours ago [-]
The heavy lifting happens in the definitions of `option` and `integer`. Those will take in whatever arguments they take in and output some sort of `Stream -> Result<Tuple<T, Stream>>` function.
That might sound messy but to the author's point about parser combinators not being complicated, they really don't take much time to get used to, and they're quite simple if you wanted to build such a library yourself. There's not much code (and certainly no magic) going on under the hood.
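As a toy sketch of that shape (my own illustration of the general technique, not this library's actual code):
    // A parser consumes part of the argv list and returns a value plus the
    // remaining arguments, or null on failure.
    type Parser<T> = (args: string[]) => { value: T; rest: string[] } | null;

    const option = (name: string): Parser<string> => (args) => {
      const i = args.indexOf(name);
      if (i === -1 || i + 1 >= args.length) return null;
      return { value: args[i + 1], rest: [...args.slice(0, i), ...args.slice(i + 2)] };
    };

    const integer = (p: Parser<string>): Parser<number> => (args) => {
      const r = p(args);
      if (r === null) return null;
      const n = Number(r.value);
      return Number.isInteger(n) ? { value: n, rest: r.rest } : null;
    };

    // Combine small parsers into bigger ones.
    const port = integer(option("--port"));
    console.log(port(["--port", "8080", "file.txt"]));
    // -> { value: 8080, rest: ["file.txt"] }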
The advantage of that parsing approach:
It's reasonably declarative. This seems like the author's core point. Parser-combinator code largely looks like just writing out the object you want as a parse result, using your favorite combinator library as the building blocks, and everything automagically works, with amazing type-checking if your language has such features.
The disadvantages:
1. Like any parsing approach, you have to actually consider all the nuances of what you really want parsed (e.g., conditional rules around whitespace handling). It looks a little to me (just from the blog post, not having examined the inner workings yet) like this project side-stepped that by working with the `Stream` type as just the `argv` list, allowing you to be able to say things like "parse the next blob as a string" without also having to encode whitespace and blob boundaries.
2. It's definitely slower (and more memory-intensive) than a hand-rolled parser, and usually also worse in that regard than other sorts of "auto-generated" parsing code.
For CLI arguments, especially if they picked argv as their base stream type, those disadvantages mostly don't exist. I could see it performing poorly for argv parsing for something like `cp` though (maybe not -- maybe something like `git cp`, which has more potential parse failures from delimiters like `--`?), which has both options and potentially ginormous lists of files; if you're not very careful in your argument specification then you might have exponential backtracking issues, and where that would be blatantly obvious in a hand-rolled parser it'll probably get swept under the rug with parser combinators.
baroninthetrees 5 hours ago [-]
I too got tired of dealing with CLI arg parsing and am experimenting with passing a natural language description of the program and its args to a tiny LLM to sort out, offer suggestions (did you mean?), type conversions, etc.
So far, it’s working great and given enough detail is deterministic.
foundart 7 hours ago [-]
The author of the article also wrote a CLI parser library for Typescript, called Optique. I really appreciate them including a "When Optique makes sense" section in the docs. It would be great if more projects did that.
This kind of stuff is what makes me appreciate python's argparse.
It's a genuine pleasure to use, and I use it often.
If you dig a little deeper into it, it does all the type and value validation, file validation, it does required and mutually exclusive args, it does subargs. And it lets you do special cases of just about anything.
And of course it does the "normal" stuff like short + long args, boolean args, args that are lists, default values, and help strings.
MrJohz 18 hours ago [-]
Actually, I think argparse falls into the same trap that the author is talking about. You can define lots of invariants in the parser, and say that these two arguments can't be passed together, or that this argument, if specified, requires these arguments to also be specified, etc. But the end result is a namespace with a bunch of key-value pairs on it, and argparse doesn't play well with typing systems like mypy or pyright. So the rest of the tool has to assume that the invariants were correctly specified up-front.
The result is that you often still see this kind of defensive programming, where argparse ensures that an invariant holds, but other functions still check the same invariant later on because they might have been called a different way, or just because the developer isn't sure whether everything was checked where they are in the program.
What I think the author is looking for is a combination of argparse and Pydantic, such that when you define a parser using argparse, it automatically creates the relevant Pydantic classes that define the type of the parsed arguments.
bvrmn 14 hours ago [-]
In the general case, generating CLI options from app models leads to horrible CLI UX. The opposite is also true: working with "nice" CLI options as direct app models is horrendous.
You need a boundary to convert nice opts into nice types. Like pydantic models could take argparse namespace and convert it to something manageable.
MrJohz 11 hours ago [-]
I mean, that's much the same as working with web APIs or any other kind of interface. Your DTO will probably be different from your internal models. But that doesn't mean it can't contain invariants, or that you can't parse it into a meaningful type. A DTO that's just a grab-bag of optional values is a pain to work with.
Although in practice, I find clap's approach works pretty well: define an object that represents the parsed arguments as you want them, with annotations for details that can't be represented in the type system, and then derive a parser from that. Because Rust has ADTs and other tools for building meaningful types, and because the derive process can do so much, that creates an arguments object that you can quite easily pass to a function which runs the command.
sgarland 18 hours ago [-]
Precisely my thought. I love argparse, but you can really back yourself into a corner if you aren’t careful.
js2 15 hours ago [-]
> What I think the author is looking for is a combination of argparse and Pydantic
It's almost like you want compile time type safety
MrJohz 15 hours ago [-]
You can have that with Mypy and friends in Python, and Typescript in the JS world. The problem is that older libraries often don't utilise that type safety very well because their API wasn't designed for it.
The library in the original post is essentially a Javascript library, but it's one designed so that if you use it with Typescript, it provides that type safety.
jappgar 8 hours ago [-]
I really think parse don't validate gives people a false sense of security (particularly false in dynamic languages like javascript and python).
"Well, I already know this is a valid uuid, so I don't really need to worry about sql injection at this point."
Sure, this is a dumb thing to do in any case, but I've seen this exact thing happen.
Typesafety isn't safety.
yakshaving_jgt 8 hours ago [-]
Type safety is absolutely some degree of safety. And I don’t know why anyone would think parsing a value into a type that has fewer inhabitants would absolve them of having to prevent SQL injection — these are orthogonal things.
The quote here — which I suspect is a straw man — is such a weird non sequitur. What would logically follow from “I already know this is a valid UUID” is “so I don’t need to worry about this not being a UUID at this point”.
jappgar 5 hours ago [-]
In Python or TypeScript, the most popular languages in the world, it offers no runtime safety.
Even in languages like Haskell, "safety" is an illusion. You might create a NumberGreaterThanFive type with smart constructors but that doesn't stop another dev from exporting and abusing the plain constructor somewhere else.
For the most part it's fine to assume the names of types are accurate, but for safety critical operations it absolutely makes sense to revalidate inputs.
yakshaving_jgt 5 hours ago [-]
> that doesn't stop another dev from exporting and abusing the plain constructor somewhere else.
That seems like a pretty unfair constraint. Yes, you can deliberately circumvent safeguards and you can deliberately write bad code. That doesn't mean those language features are bad.
nickdothutton 7 hours ago [-]
It’s been about 30 years but I seem to remember the compiler taking care of this for me (in Ada) with types.
AndrewDucker 11 hours ago [-]
This is one of the things that makes me glad that PowerShell does all of this intrinsically. I define the parameters, it makes sure that the arguments make sense and match them (and their validation).
dvdkon 24 hours ago [-]
I, for one, do think the world needs more CLI argument parsers :)
This project looks neat, I've never thought to use parser combinators for something other than left-to-right string/token stream parsing.
And I like how it uses Typescript's metaprogramming to generate types from the parser code. I think that would be much harder (or impossible) in other languages, making the idiomatic design of a similar library very different.
lihaoyi 22 hours ago [-]
That's basically what my MainArgs Scala library does: take either a method definition or class structure and use its structure to parse your command line arguments. You get the final fields you want immediately without needing to imperatively walk the args array (and probably getting it wrong!)
Some other libraries I’ve been enjoying building CLIs with in TS that do more or less the same thing, though perhaps with slightly worse composability than Optique:
Not all of this validation belongs in the same layer. A lot of the problems people seem to have are due to thinking it all has to be done in the I/O layer.
A CLI and an API should indeed occupy the same layer of a program architecture, namely they are entry points that live on the periphery. But really all you should be doing there is lifting the low-level byte stream you are getting from users to something higher level you can use to call your internals.
So "CLI validation" should be limited to just "I need an int here, one of these strings here, optionally" etc. Stuff like "is this port out of range" or "if you give me this I need this too" should be handled by your internals by e.g. throwing an exception. Your CLI can then display that as an error message in a nice way.
slifin 13 hours ago [-]
So use Clojure Spec or better yet Malli to parse your input data at the edges of your program
Makes sense, I think a lot of developers would want to complect this problem with their runtime type system of choice without considering the set of downsides for the users
panzi 12 hours ago [-]
No mention of yargs?
thealistra 24 hours ago [-]
Isn’t this like argparse from Python for typescript?
whilenot-dev 23 hours ago [-]
What OP calls a "combinatorial parser" I'd call object schema validation, and that's more similar to pydantic[0] than argparse in Python land.
I've noticed that many programmers believe that parsing is some niche thing that the average programmer likely won't need to contend with, and that it's only applicable in a few specific low-level cases, in which you'll need to reach for a parser combinator library, etc.
But this is wrong. Programmers should be writing parsers all the time!
WJW 24 hours ago [-]
Last week my primary task was writing a GitHub action that needed to log in to Heroku and push the current code on the main and development branches to the production and staging environments respectively. The week before that, I wrote some code to make sure the type of the object was included in the filters passed to an API call.
Don't get me wrong, I actually love writing parsers. It's just not required all that often in my day-to-day work. 99% of the time when I need to write a parser myself it's for an Advent of Code problem; usually I just import whatever JSON or YAML parser is provided for the platform and go from there.
yakshaving_jgt 23 hours ago [-]
Do you not write validation? Or handle user input? Or handle server responses? Surely there’s some data processing somewhere.
eska 23 hours ago [-]
I think most security issues are just due to people not parsing input at all/properly. Then security consultants give each one a new name as if it was something new. :-)
dkubb 23 hours ago [-]
The three most common things I think about when coding are DAGs, State Machines and parsing. The latter two come up all the time in regexps which I probably write at least once a day, and I’m always thinking about state transitions and dependencies.
nine_k 23 hours ago [-]
I'd say that engineers should use the highest-level tools that are adequate for the task.
Sometimes it's going down to machine code, or rolling your own hash table, or writing your own recursive-descent parser from first principles. But most of the time you don't have to reach that low, and things like parsing are but a minor detail in the grand scheme. The engineer should not spend time on building them, but should be able to competently choose a ready-made part.
I mean, creating your own bolts and nuts may be fun, but most of the time, if you want to build something, you just pick a few from an appropriate box, and this is exactly right.
yakshaving_jgt 20 hours ago [-]
I don’t understand. Every mainstream language has libraries for parsing into general types, but none of them will have libraries for parsing values specific to your application.
TFA links to Alexis King’s Parse, Don’t Validate article, which explains this well. Did you not read it?
ThinkBeat 22 hours ago [-]
And that is why there are plenty of parser generators, so you don't have to write the parser yourself every time.
sudahtigabulan 20 hours ago [-]
Is there no getopt implementation for Typescript?
The input this library tries to handle better looks to me like bad design.
"options that depend on options" should not be a thing.
Every option should be optional.
Even if you have working code that can handle some complex situation, this doesn't make the situation any less unintuitive for the users.
If you need more complex relationships, consider using arguments as well. Top level, or under an option.
Yes, they are not named, but since they are mandatory anyway, you are likely to remember their meaning (spaced repetition and all that).
They can still be optional (if they come last).
Sometimes an argument may need to have multiple parts, like user@host:port
You can still parse it instead of validating, if you want.
> mutually exclusive --json, --xml, --yaml.
Use something like -t TYPE instead, where TYPE can be one of json, xml, or yaml.
(Make illegal states unrepresentable.)
> debug: optional(option("--debug")),
Again, I believe it's called "option" because it's meant to be optional already.
optional(optional(option("--common-sense")))
EOR
dwattttt 18 hours ago [-]
> options that depend on options
What would you do for "top level option, which can be modified in two other ways"?
would solve invalid representation, but is unwieldy.
Something that results in the usage string
[--option [--flag1 --flag2]]
doesn't seem so bad at that point.
sudahtigabulan 18 hours ago [-]
I think I've seen it done like that
--option flag1,flag2
(Maybe with another separator, as long as it doesn't need to be escaped.)
Another possibility is to make the main option an argument, like the subcommands in git, systemctl, and others:
command option --flag1 --flag2
This depends on the specifics, though.
dwattttt 14 hours ago [-]
> --option flag1,flag2
Embedding a second parse step that the first parser doesn't deal with is done, but it's a rough compromise.
It feels like the difficulty in dealing with
[--option [--flag1 --flag2]]
Is more to do with its expression in the language parsed to, than CLI elegance.
Spivak 18 hours ago [-]
I think ultimately you're trying to tell a river that it's going the wrong way. Programs have had required options for decades at this point. I think they can make sense as alternatives to heterogeneously typed positional arguments. By making the user name them explicitly you remove ambiguity and let the user specify them in whatever order they please.
In Python this was a motivating factor for letting functions demand their arguments be passed as named keywords. Something like send("foo", "bar") is easier to understand and call correctly when you have to say send(channel="foo", message="bar")
einpoklum 10 hours ago [-]
Exactly the opposite of this. We should parse the command line using _no_ strict types. Not even integers. Nothing beyond parsing its structure, e.g. which option names get which (string) values, and which flags are enabled. This can be done without knowing _anything_ about the application domain, and provides a generic options structure which is no longer a sequence of characters.
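Concretely, the kind of generic structure and separate declarative validation I mean would be roughly this (a sketch, with made-up names):
    // Structure only: which flags are set, which options carry which string values.
    interface ParsedArgs {
      flags: Set<string>;
      options: Map<string, string>;
      positional: string[];
    }

    // Validation is a separate, generic, declarative layer applied afterwards.
    interface Rule { option: string; check: (value: string) => boolean; message: string }

    function check(args: ParsedArgs, rules: Rule[]): string[] {
      const errors: string[] = [];
      for (const rule of rules) {
        const value = args.options.get(rule.option);
        if (value !== undefined && !rule.check(value)) errors.push(rule.message);
      }
      return errors;
    }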
This approach IMNSHO is much cleaner than the intrication of cmdline parser libraries with application logic and application-domain-related types.
Then one can specify validation logic declaratively, and apply it generically.
This has the added benefit - for a compiled rather than an interpreted library - of not having to recompile the CLI parsing library for each different app and each different definition of options.
MrJohz 10 hours ago [-]
Can you give some examples of this working well? It certainly goes against all of my experience working with CLIs and with parsing inputs in general (e.g. web APIs etc). In general, I've found that the quicker I can convert strings into rich types, the easier that code is to work with and the less likely I am to have troubles with invalid data.
bvrmn 14 hours ago [-]
A valid type for server and port should be a single value. Stop parsing them separately, please.
":3000" -> use port 3000 with a default host.
"some-host" -> use host with a default port.
"some-host:3000" -> you guess it.
It also allows you to extend it to other sources/destinations like Unix domain sockets and other stuff without cluttering your CLI options.
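A sketch of parsing that combined value (names and defaults here are placeholders):
    interface Endpoint { host: string; port: number }

    // ":3000" -> default host; "some-host" -> default port; "some-host:3000" -> both.
    function parseEndpoint(input: string, defaults: Endpoint): Endpoint {
      const i = input.lastIndexOf(":");
      if (i === -1) return { host: input || defaults.host, port: defaults.port };
      const host = input.slice(0, i) || defaults.host;
      const port = Number(input.slice(i + 1));
      if (!Number.isInteger(port) || port < 1 || port > 65535) {
        throw new Error(`invalid port in "${input}"`);
      }
      return { host, port };
    }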
Also, please consider using a DSN or URI to define database configurations. Host, port, dbname, and credentials as separate options or environment variables are quite painful to use.
parhamn 23 hours ago [-]
> Try to access it and TypeScript yells at you. No runtime validation needed.
I was recently thinking about how type safety and validation strategies are particularly thorny in languages where the typings are just annotations. E.g. the Typescript/Zod or Python/Pydantic universes. Especially in IO cases where the data doesn't originate in the same type system.
In a language like Go (just an example, not endorsing) if you parse something into say a struct you know worst case you're getting that struct with all the fields set to zero, and you just have to handle the zero values. In typescript-likes you can get a totally different structure and run into all sorts of errors.
All that is to say, the runtime validation is always somewhere (perhaps in the library, as they often are?), and the feature here isn't no runtime validation but typed cli arguments. Which is cool and great.
metaltyphoon 22 hours ago [-]
> worst case you're getting that struct with all the fields set to zero, and you just have to handle the zero values
In the field I work in, zero values are valid, and doing it in Go would be a nightmare.
mjevans 8 hours ago [-]
Database NULL is a valid pattern that any parser SHOULD support and I do consider that a design bug in every parser Go has. Offhand most of them effectively 'update' an object, but make it difficult or impossible to tell if something was __set__ with a value, or merely inherited a default.
parhamn 20 hours ago [-]
Agreed, the pointer or "<field>_empty: bool" patterns are annoying. Point still stands though, you always get the structure you ask for.
jiggawatts 17 hours ago [-]
This is one of the many reasons I like PowerShell: it parses strongly typed parameters for you and outputs human readable error messages for every kind of validation failure.
HL33tibCe7 24 hours ago [-]
Stopped reading after realising this is written by ChatGPT
bfung 24 hours ago [-]
Looked human-ish to me, what signs did you see?
bobbiechen 17 hours ago [-]
I thought the style was like ChatGPT in a "clever, casual, snarky" prompt flavor as well. I see it a lot on LinkedIn especially in sentence structures like these:
"Invalid data? The parser rejects it. Done."
"That validation logic that used to be 30% of my CLI code? Gone."
For me this really piled on at the end of the blog post. But maybe it's just personal style too.
akoboldfrying 20 hours ago [-]
I found the content novel and helpful (applying a known but underappreciated technique (Parse, Don't Validate) to a common problem where I hadn't thought to use it before) and the tone very enjoyable. In fact, it's so idiomatically written that I can't even believe it's just a machine translation of something written in another language.
In short, a great article.
cazum 24 hours ago [-]
What makes you think that and not that it's just an average auto-translate job from the author's native language (Korean)?
urxvtcd 23 hours ago [-]
I’ll go one step further: what makes you think it’s an average auto-translate job? I didn’t notice anything weird, felt like your average, slightly ranty HN post. I’m not a native speaker though.
AfterHIA 21 hours ago [-]
You've got to be careful; if you validate the CLI too much you might get URA in your validator. #chugalug #house
Also - don't write CLI programs in languages that don't compile to native binaries. I don't want to have to drag around your runtime just to execute a command line tool.
If you're ever living dangerously, bring along busybox-static. It might not be the best, but you'll thank yourself later.
This clause is abstracting away a ton of work. If you want to compile the latest LLVM and get 'portable C++26', you need to bootstrap everything, including CMake from that old-hat libc on some ancient distro like CentOS 6 or Ubuntu 12.04.
I've said it before, I'll say it again: the Linux kernel may maintain ABI compatibility, but the fact that GNU libc breaks it anyway makes it a moot point. It is a pain to target older Linux with a newer distro, which is by far the most common development use case.
Write your code such that you can load it onto (for example) the oldest supported Ubuntu and compile cleanly and you’ll have virtually zero problems. Again, I know that if your goal is to truly ship something written in e.g. C++26 portably then it’s a huge pain. But as someone who writes plain C and very much enjoys it, I think it’s better to skip this class of problem.
I'll keep my templates, smart pointers, concepts, RAII, and now reflection, thanks. C and its macros are good for compile times but nothing much else. Programming in C feels like banging rocks together.
This is only a problem, when the program USES a symbol that was only introduced in the newer libc. In other words, when the program made a choice to deliberately need that newer symbol.
My most comfortable tool is Java, but I'm not going to persuade most of the HN crowd to install a JVM unless the software I'm offering is unbearably compelling.
Internal to work? Yeah, Java's going to be an easy sell.
I don't think OP necessarily meant it as a political statement.
In fact, I think something like this already exists. I just can't recollect the project.
Are you stuck in write-only mode or something? How does this make any sense to you?
And don't write programs with languages that depend on CMake and random tarballs to build and/or shared libraries to run.
I usually have a lot less issues with dragging a runtime than fighting with builds.
https://www.anthropic.com/claude-code
https://github.com/google-gemini/gemini-cli
Pretty much agreed - once any sort of complicated logic enters a shell script it's probably better off written in C/Rust/Go or something akin to that.
One of the things I love about clap is that you can configure it to automatically spit out --help info, and you can even get it to generate shell autocompletions for you!
I think there are some other libraries that are challenging it now (fewer dependencies or something?) but clap sets the standard to beat.
Go programs compile to native executables, but they're still rather slow to start, especially if you just want to run --help
Well, that's confused me. I write a lot of scripts in Bash specifically to make it easy to move them to different architectures etc. without requiring a custom runtime. Interpreted scripts also have the advantage of being human-readable and editable.
Isn’t writing code and using zod the same thing? The difference being who wrote the code.
Of course, you hope zod is robust, tested, supported, extensible, and has docs so you can understand how to express your domain in terms it can help you with. And you hope you don’t have to spend too much time migrating as zod’s api changes.
Whether you do that with Zod or manually or whatever isn't important, the important thing is having a preprocessing step that transforms the data and doesn't just validate it.
Right - and one thing that keeps coming up for me is that, if you want to maintain complex invariants, it's quite natural to express them in terms of the domain object itself (or maybe, ugh, a DTO with the same fields), rather than in terms of input constraints.
For instance, if validating parameter values requires multiple trips to a DB or other external system, weaving those calls into the logic can avoid duplicating the round trips. Light "surface" validation can still be applied, but that's not what we're talking about here, I think.
Even better, that conversion from interface type to internal type should ideally happen at one explicit point in the program - a function call which rejects all invalid inputs and returns a type that enforces the invariants we're interested in. That way, we have a clean boundary point between the outside world and the inside one.
This isn't a performance issue at all, it's closer to the "imperative shell, functional core" ideas about structuring your application and data.
Sure, but probably at the cost of leaving everything in a horribly inconsistent state when you error out partway through. Which is almost always not worth it.
That said, I fully agree with the article content itself. It basically just boils down to:
When you create a program, eventually you'll need to process and check whether input data is valid or not. In a C-like language, you have two options: a procedure like `void validate(struct Data d)`, or a function like `ValidatedData validate(struct Data d)`.
"Parse, don't validate" is just trying to say: don't do the former, do the latter. It doesn't mean you need to explicitly create or name everything as a "parser". It also doesn't mean "don't validate" either; in `ValidatedData validate(struct Data d)` you'll eventually have "validation" logic similar to the `void` procedure counterpart.
Specifically, the article tries to teach folks to utilize the type system to their advantage. Rather than praying to never forget invoking `validate(d)` on every single call site, make the type signature only accept `ValidatedData` type so the compiler will complain loudly if future maintainers try to shove `Data` type to it. This strategy offloads the mental burden of remembering things from the dev to the compiler.
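A tiny TypeScript sketch of the same idea (the `RawConfig`/`ValidatedConfig` names are invented for illustration, not taken from the article):
```
// A loose type straight from the outside world.
type RawConfig = { port: string };

// The narrow type the rest of the program is allowed to see.
type ValidatedConfig = { port: number };

// Instead of `function validate(c: RawConfig): void`, return the narrower type.
function validate(c: RawConfig): ValidatedConfig {
  const port = Number(c.port);
  if (!Number.isInteger(port) || port < 1 || port > 65535) {
    throw new Error(`invalid port: ${c.port}`);
  }
  return { port };
}

// Downstream code only accepts ValidatedConfig, so the compiler complains
// if anyone tries to pass a RawConfig that was never checked.
function startServer(c: ValidatedConfig): void {
  console.log(`listening on ${c.port}`);
}

startServer(validate({ port: "3000" }));
```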
I'm not exactly sure why the "Parse, don't validate" catchphrase keeps getting reused in other language communities. It's not clear to the non-FP community what the distinction between "parse" and "validate" is, let alone what a "parser combinator" is. Yet somehow other articles keep reusing this same catchphrase.
```
some_cli <some args> --some-option --no-some-option
```
Before parsing, the argument array contains both the flag to enable and the flag to disable the option. Validation would either throw an error or accept it as either enabled or disabled, but importantly, it wouldn't change the arguments. If the assumption is that the last option overrides anything before it, then the CLI command is valid with the option disabled.
And now, correct behaviour relies on all the code using that option to always make the same assumption.
Parsing, on the other hand, would create a new config where `option` is an enum - either enabled, disabled, or not given. No confusion about multiple flags or anything. It provides a single view for the rest of the program of what the input config was.
Whether that parsing is done by a third-party library or first-party code, declaratively or imperatively, is beside the point.
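A rough sketch of that parse step in plain TypeScript (no particular library; the `OptionState` name is made up):
```
// "Last one wins" is decided exactly once, here, and captured in a
// small union type instead of being re-derived all over the program.
type OptionState = "enabled" | "disabled" | "not-given";

function parseSomeOption(argv: string[]): OptionState {
  let state: OptionState = "not-given";
  for (const arg of argv) {
    if (arg === "--some-option") state = "enabled";
    else if (arg === "--no-some-option") state = "disabled";
  }
  return state;
}

console.log(parseSomeOption(["--some-option", "--no-some-option"])); // "disabled"
```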
More often than not, in a real case of applying algebraic data type & "Parse, don't validate", it's something like `Option<ValidatedData>` or `Result<ValidatedData,PossibleValidationError>`, borrowing Rust's names. `Option` & `Result` expand the possible return values that function can return to cover the possibility of failure in the validation process, but it's independent from possible values that `ValidatedData` itself can contain.
The main point of "Parse, don't validate" is to distinguish between the "machine-level data representation" of a type and its "possible set of values", and to utilize that "possible set of values" property. Your "the exact same format" point is correct; oftentimes, the underlying data representation of a type is exactly the same between pre- and post-validation. But more often than not, the "possible set of values" of `ValidatedData` is a subset of that of `Data`. These two different sets of possible values are given their own names in the form of the types `Data` and `ValidatedData`.
This distinction is actually very handy because types can be checked automatically by the (nominal) type system. If you make the `ValidatedData` constructor private and the only way to produce one is the function `ValidatedData validate(Data)`, then in no part of the codebase can a `ValidatedData` instance be malformed (assuming `validate` doesn't have bugs).
Extra note: I forgot to mention that the "Parse, don't validate" article implicitly assumes a nominal type system, where two objects with equivalent "data representation" don't necessarily have the same type. This differs from Typescript's structural type system, where as long as the "data representation" is the same, both objects are considered to have the same type.
Typescript will happily accept something like this because of structural typing:
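(A made-up illustration of the kind of thing that compiles without complaint:)
```
type Data = { value: string };
type ValidatedData = { value: string };

function useValidated(d: ValidatedData): void {
  console.log(d.value);
}

const raw: Data = { value: "not checked" };
// Accepted: Data and ValidatedData have the same shape, so the
// structural type system treats them as interchangeable.
useValidated(raw);
```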
Nominal type systems like Haskell or Java, on the other hand, will reject such expressions. Because of this, the idea of using a type as a "possible set of values" probably feels unintuitive to Typescript folks, as everything is just stringly-typed and a different type feels synonymous with a different "underlying data representation" there. You can simulate this "same structure, but different meaning" concept of a nominal type system in Typescript with a somewhat hacky workaround using Symbol.
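One common shape of that workaround (a sketch, not anything from the article) is a type-level brand plus a single sanctioned constructor function:
```
// The brand exists only at the type level; at runtime a ValidatedData
// is still just a string, but the compiler now tells them apart.
declare const validated: unique symbol;
type ValidatedData = string & { readonly [validated]: true };

// The only sanctioned way to produce a ValidatedData.
function validate(value: string): ValidatedData | null {
  if (value.length === 0) return null;
  return value as ValidatedData;
}

function useValidated(d: ValidatedData): void {
  console.log(d.toUpperCase());
}

// @ts-expect-error -- same runtime representation, but missing the brand
useValidated("oops");

const ok = validate("hello");
if (ok !== null) useValidated(ok); // fine
```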
Why does the return type need to imply transformation, and why is "validation" here always read-only? A no-op function returns the exact same value you give it (in other words, an identity transformation), and Java & Javascript procedures never guarantee a read-only operation. Then instead of validating a loose type and still using the loose type, you're parsing it from a loose type into a strict type.
The key point is you never need to look at a loose type and think "I don't need to check this is valid, because it was checked before"; the type system tracks that for you.
I still wouldn't need to check the inputs again because I know it's already been processed, even if the type system can't help me.
I'm hung up on the type system because it's a great way to convey the validity of the data; it follows the data around as it flows through your program.
I don't (yet) use Typescript, but jsdoc and linting give me enough type checking for my needs.
The point is you don’t check that your string only contains valid characters and then continue passing that string through your system. You parse your string into a narrower type, and none of the rest of your system needs to be programmed defensively.
To describe this advice as “vacuous” says more about you than it does about the author.
> Of course, you hope zod is robust, tested, supported, extensible, and has docs so you can understand how to express your domain in terms it can help you with. And you hope you don’t have to spend too much time migrating as zod’s api changes.
Yes, judgement is required to make depending on zod (or any library) worthwhile. This is not different in principle from trusting those same things hold for TypeScript, or Node, or V8, or the C++ compiler V8 was compiled with, or the x86_64 chip it's running on, or the laws of physics.
The problem I run into here is - how do you create good error messages when you do this? If the user has passed you input with multiple problems, how do you build a list of everything that's wrong with it if the parser crashes out halfway through?
He even gives the example of zod, which is a validation library he defines to be a parser.
What he wants to say : "I don't want to write my own validation in a CLI, give me a good API already that first validates and then converts the inputs into my declared schema"
But that _is_ parsing, at least in the sense of "parse, don't validate". It's about turning inputs into real objects representing the domain your code is about to work with. The result is still going to be a DTO of some description, but it will be a DTO with guaranteed invariants that are useful to you. For example, a post request shouldn't be parsed into a user object just because it shares a lot of fields in common with a user. Instead it should become a DTO with the invariants that make sense for that DTO fulfilled. Some of those invariants are simple (like "dates should be valid" -> the DTO contains Date objects, not strings), and some will be more complex, like the "if the server is active, then the port also needs to be provided" restriction from the article.
This is one of the key ideas behind Zod - it isn't just trying to validate whether an object matches a certain schema, but it converts the result into a type that accurately expresses the invariants that must be in place if the object is valid.
Zod also allows invalid state as input, then attempts to shoehorn it into the desired schema, which still runs the validations the author was complaining about - just not in the code he wrote.
Zod does take in invalid state as input, but that is what a parser does. In this case, the parser is `any -> T` as opposed to `string -> T`, but that's still a parsing operation.
So, having used this thread to rubber-duck about how the principle of "parse-don't-validate" works with the principle of "provide good error messages", I'm arriving at these rules, which are really more about encapsulation than parsing:
1. Encapsulate both parsing and validation in a single function: `parse(RawInput) -> Result<ValidDomainObject,ListOfErrors>`
2. Ideally, `parse` is implemented by a robust parsing/validation library for the type of input that you're dealing with. It will create some intermediate representations that you need not concern yourself with.
3. If there isn't a good parser library for your use case, your implementation of `parse` will necessarily contain intermediate representations of potentially illegal state. This is both fine and unavoidable, just don't let them leak out of your parser.
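A rough TypeScript shape of rule 1 (names are invented for illustration): errors accumulate inside `parse`, and only a fully valid object or the error list ever escapes.
```
type Result<T, E> = { ok: true; value: T } | { ok: false; errors: E };

// The valid domain object: its invariants hold by construction.
interface ServerConfig {
  host: string;
  port: number;
}

// Raw, untrusted input (e.g. already-split CLI values or env vars).
interface RawInput {
  host?: string;
  port?: string;
}

function parse(raw: RawInput): Result<ServerConfig, string[]> {
  const errors: string[] = [];

  if (!raw.host) errors.push("missing --host");

  let port = NaN;
  if (!raw.port) {
    errors.push("missing --port");
  } else {
    port = Number(raw.port);
    if (!Number.isInteger(port) || port < 1 || port > 65535) {
      errors.push(`--port must be an integer in 1-65535, got "${raw.port}"`);
    }
  }

  // Intermediate, possibly-invalid state never leaves this function.
  if (errors.length > 0) return { ok: false, errors };
  return { ok: true, value: { host: raw.host!, port } };
}

const result = parse({ port: "99999" });
if (!result.ok) console.error(result.errors.join("\n"));
```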
Or in Haskell!
For parsing specifically, there's literature on error recovery to try to make progress past the error.
So maybe the reason why they were able to reduce the code is because they lost the ability to do good error reporting.
You either get the correctly parsed data or you get an error array. The incorrect input was never represented in code, vs a 0 value being returned or even worse random gibberish.
A trivial example: 1/0 should return DivisionByZero not 0 or infinity or NaN or whatever else. You can then decide in your UI whether that is a case you want to handle as an error or as an edge case but the parser knows that is not possible to represent.
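In TypeScript terms, that might look something like this (just a sketch):
```
// Division by zero becomes a value the caller has to handle,
// instead of a silent Infinity or NaN.
type DivisionResult =
  | { kind: "ok"; value: number }
  | { kind: "divisionByZero" };

function divide(a: number, b: number): DivisionResult {
  if (b === 0) return { kind: "divisionByZero" };
  return { kind: "ok", value: a / b };
}

const r = divide(1, 0);
if (r.kind === "divisionByZero") console.error("cannot divide by zero");
else console.log(r.value);
```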
The HTML5 parser is notoriously friendly to errors; see the adoption agency algorithm.
After parsing you check if `error_message` exists and raise that error.
https://lexi-lambda.github.io/blog/2019/11/05/parse-don-t-va... (2019, using Haskell)
https://www.lelanthran.com/chap13/content.html (April 2025, using C)
http://docopt.org/
Make the usage string be the specification!
A criminally underused library.
I use Effect CLI https://github.com/Effect-TS/effect/tree/main/packages/cli for the same reasons. It has the advantage of fitting within the ecosystem. For example, I can reuse existing schemas.
>> const port = option("--port", integer());
I don't understand. Why is this a parser? Isn't it just a way of enforcing a type in a language that doesn't have types?
I was expecting something like a state machine that takes the command line text and parses it to validate the syntax and values.
That might sound messy but to the author's point about parser combinators not being complicated, they really don't take much time to get used to, and they're quite simple if you wanted to build such a library yourself. There's not much code (and certainly no magic) going on under the hood.
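A toy sketch of what "under the hood" can look like. This is not the library from the post, just a few hand-rolled combinators over argv (the `option`/`integer`/`or` names here are my own, not the library's API):
```
// A parser consumes some prefix of argv and either fails or
// returns a value plus the remaining arguments.
type Parser<T> = (argv: string[]) => { value: T; rest: string[] } | null;

// Match a literal flag like "--port" and return the token after it.
const option = (name: string): Parser<string> => (argv) =>
  argv[0] === name && argv.length >= 2
    ? { value: argv[1], rest: argv.slice(2) }
    : null;

// Refine a parser's result, e.g. string -> number.
const integer = (p: Parser<string>): Parser<number> => (argv) => {
  const r = p(argv);
  if (r === null) return null;
  const n = Number(r.value);
  return Number.isInteger(n) ? { value: n, rest: r.rest } : null;
};

// Try parsers in order: the CLI equivalent of alternation.
const or = <T>(a: Parser<T>, b: Parser<T>): Parser<T> => (argv) =>
  a(argv) ?? b(argv);

const port = integer(or(option("--port"), option("-p")));
console.log(port(["--port", "3000"])); // { value: 3000, rest: [] }
console.log(port(["--port", "abc"]));  // null
```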
The advantage of that parsing approach:
It's reasonably declarative. This seems like the author's core point. Parser-combinator code largely looks like just writing out the object you want as a parse result, using your favorite combinator library as the building blocks, and everything automagically works, with amazing type-checking if your language has such features.
The disadvantages:
1. Like any parsing approach, you have to actually consider all the nuances of what you really want parsed (e.g., conditional rules around whitespace handling). It looks a little to me (just from the blog post, not having examined the inner workings yet) like this project side-stepped that by working with the `Stream` type as just the `argv` list, allowing you to be able to say things like "parse the next blob as a string" without also having to encode whitespace and blob boundaries.
2. It's definitely slower (and more memory-intensive) than a hand-rolled parser, and usually also worse in that regard than other sorts of "auto-generated" parsing code.
For CLI arguments, especially if they picked argv as their base stream type, those disadvantages mostly don't exist. I could see it performing poorly for argv parsing for something like `cp` though (maybe not -- maybe something like `git cp`, which has more potential parse failures from delimiters like `--`?), which has both options and potentially ginormous lists of files; if you're not very careful in your argument specification then you might have exponential backtracking issues, and where that would be blatantly obvious in a hand-rolled parser it'll probably get swept under the rug with parser combinators.
https://optique.dev/why#when-optique-makes-sense
It's a genuine pleasure to use, and I use it often.
If you dig a little deeper into it, it does all the type and value validation, file validation, it does required and mutually exclusive args, it does subargs. And it lets you do special cases of just about anything.
And of course it does the "normal" stuff like short + long args, boolean args, args that are lists, default values, and help strings.
The result is that you often still see this kind of defensive programming, where argparse ensures that an invariant holds, but other functions still check the same invariant later on, because they might have been called a different way or just because the developer isn't sure whether everything was checked by that point in the program.
What I think the author is looking for is a combination of argparse and Pydantic, such that when you define a parser using argparse, it automatically creates the relevant Pydantic classes that define the type of the parsed arguments.
You need a boundary to convert nice opts into nice types. Pydantic models, for example, could take the argparse namespace and convert it into something manageable.
Although in practice, I find clap's approach works pretty well: define an object that represents the parsed arguments as you want them, with annotations for details that can't be represented in the type system, and then derive a parser from that. It works because Rust has ADTs and other tools for building meaningful types, and because the derive process can do so much. The result is an arguments object that you can quite easily pass to a function which runs the command.
Not quite that, but https://typer.tiangolo.com/ is fully type driven.
The library in the original post is essentially a Javascript library, but it's one designed so that if you use it with Typescript, it provides that type safety.
"Well, I already know this is a valid uuid, so I don't really need to worry about sql injection at this point."
Sure, this is a dumb thing to do in any case, but I've seen this exact thing happen.
Typesafety isn't safety.
The quote here — which I suspect is a straw man — is such a weird non sequitur. What would logically follow from “I already know this is a valid UUID” is “so I don’t need to worry about this not being a UUID at this point”.
Even in languages like Haskell, "safety" is an illusion. You might create a NumberGreaterThanFive type with smart constructors but that doesn't stop another dev from exporting and abusing the plain constructor somewhere else.
For the most part it's fine to assume the names of types are accurate, but for safety critical operations it absolutely makes sense to revalidate inputs.
That seems like a pretty unfair constraint. Yes, you can deliberately circumvent safeguards and you can deliberately write bad code. That doesn't mean those language features are bad.
This project looks neat, I've never thought to use parser combinators for something other than left-to-right string/token stream parsing.
And I like how it uses Typescript's metaprogramming to generate types from the parser code. I think that would be much harder (or impossible) in other languages, making the idiomatic design of a similar library very different.
https://github.com/com-lihaoyi/mainargs
https://cliffy.io/
https://github.com/tj/commander.js
[1] https://en.wikipedia.org/wiki/Parser_combinator
A CLI and an API should indeed occupy the same layer of a program architecture, namely they are entry points that live on the periphery. But really all you should be doing there is lifting the low byte stream you are getting from users to something higher level you can use to call your internals.
So "CLI validation" should be limited to just "I need an int here, one of these strings here, optionally" etc. Stuff like "is this port out of range" or "if you give me this I need this too" should be handled by your internals by e.g. throwing an exception. Your CLI can then display that as an error message in a nice way.
Makes sense, I think a lot of developers would want to complect this problem with their runtime type system of choice without considering the set of downsides for the users
[0]: https://docs.pydantic.dev/latest/
But this is wrong. Programmers should be writing parsers all the time!
Don't get me wrong, I actually love writing parsers. It's just not required all that often in my day-to-day work. 99% of the time when I need to write a parser myself it's for an Advent of Code problem; usually I just import whatever JSON or YAML parser is provided for the platform and go from there.
Sometimes it's going down to machine code, or rolling your own hash table, or writing your own recursive-descent parser from first principles. But most of the time you don't have to reach that low, and things like parsing are but a minor detail in the grand scheme. The engineer should not spend time on building them, but should be able to competently choose a ready-made part.
I mean, creating your own bolts and nuts may be fun, but most of the time, if you want to build something, you just pick a few from an appropriate box, and that's exactly right.
TFA links to Alexis King’s Parse, Don’t Validate article, which explains this well. Did you not read it?
"options that depend on options" should not be a thing. Every option should be optional. Even if you have working code that can handle some complex situation, this doesn't make the situation any less unintuitive for the users.
If you need more complex relationships, consider using arguments as well. Top level, or under an option. Yes, they are not named, but since they are mandatory anyway, you are likely to remember their meaning (spaced repetition and all that). They can still be optional (if they come last). Sometimes an argument may need to have multiple parts, like user@host:port. You can still parse it instead of validating, if you want.
> mutually exclusive --json, --xml, --yaml.
Use something like -t TYPE instead, where TYPE can be one of json, xml, or yaml. (Make illegal states unrepresentable.)
> debug: optional(option("--debug")),
Again, I believe it's called "option" because it's meant to be optional already.
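For the -t TYPE suggestion above, the parsed value can then live in a closed union, so an illegal format never exists past the boundary (a sketch):
```
// The only formats the rest of the program can ever see.
type OutputFormat = "json" | "xml" | "yaml";

function parseFormat(raw: string): OutputFormat {
  if (raw === "json" || raw === "xml" || raw === "yaml") return raw;
  throw new Error(`-t expects json, xml, or yaml, got "${raw}"`);
}
```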
What would you do for "a top-level option which can be modified in two other ways"?
would solve the invalid representation, but is unwieldy. Something that results in the usage string
doesn't seem so bad at that point. Another possibility is to make the main option an argument, like the subcommands in git, systemctl, and others:
This depends on the specifics, though. Embedding a second parse step that the first parser doesn't deal with is sometimes done, but it's a rough compromise.
It feels like the difficulty in dealing with
is more to do with its expression in the language being parsed into than with CLI elegance. In Python this was a motivating factor for letting functions demand that their arguments be passed as named keywords: something like send("foo", "bar") is easier to understand and call correctly when you have to say send(channel="foo", message="bar").
This approach, IMNSHO, is much cleaner than entangling command-line parser libraries with application logic and application-domain types.
Then one can specify validation logic declaratively, and apply it generically.
This has the added benefit, for a compiled rather than interpreted library, of not having to recompile the CLI parsing library for each different app and each different definition of options.
":3000" -> use port 3000 with a default host.
"some-host" -> use host with a default port.
"some-host:3000" -> you guess it.
It also allows you to extend it to other sources/destinations, like Unix domain sockets and other stuff, without cluttering your CLI options.
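A small sketch of that kind of endpoint parsing (the defaults and names here are invented):
```
interface Endpoint {
  host: string;
  port: number;
}

// Accepts ":3000", "some-host", or "some-host:3000",
// filling in defaults for whichever part is missing.
function parseEndpoint(raw: string, defaults: Endpoint): Endpoint {
  const idx = raw.lastIndexOf(":");
  if (idx === -1) return { host: raw || defaults.host, port: defaults.port };

  const host = raw.slice(0, idx) || defaults.host;
  const portPart = raw.slice(idx + 1);
  const port = portPart === "" ? defaults.port : Number(portPart);
  if (!Number.isInteger(port) || port < 1 || port > 65535) {
    throw new Error(`invalid port in "${raw}"`);
  }
  return { host, port };
}

const defaults = { host: "localhost", port: 8080 };
console.log(parseEndpoint(":3000", defaults));          // { host: "localhost", port: 3000 }
console.log(parseEndpoint("some-host", defaults));      // { host: "some-host", port: 8080 }
console.log(parseEndpoint("some-host:3000", defaults)); // { host: "some-host", port: 3000 }
```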
Also, please consider using a DSN or URI to define database configurations. Host, port, dbname, and credentials as separate options or environment variables are quite painful to use.
I was recently thinking about how type safety and validation strategies are particularly thorny in languages where the typings are just annotations, e.g. the Typescript/Zod or Python/Pydantic universes, especially in IO cases where the data doesn't originate in the same type system.
In a language like Go (just an example, not endorsing) if you parse something into say a struct you know worst case you're getting that struct with all the fields set to zero, and you just have to handle the zero values. In typescript-likes you can get a totally different structure and run into all sorts of errors.
All that is to say, the runtime validation is always somewhere (perhaps in the library, as is often the case), and the feature here isn't "no runtime validation" but typed CLI arguments. Which is cool and great.
In the field I work in, zero values are valid, and doing it in Go would be a nightmare.
"Invalid data? The parser rejects it. Done."
"That validation logic that used to be 30% of my CLI code? Gone."
"Mutually exclusive groups? Sure. Context-dependent options? Why not."
For me this really piled on at the end of the blog post. But maybe it's just personal style too.
In short, a great article.