In the end it all boils down to a very simple argument. The C programmers want the C compilers to behave one way; the C implementers want them to behave another. Since the power structure is what it is — the C implementers are the ones who write the C standard and the ones who actually get to implement the C compilers — the C compilers do, and will, behave the way the C implementers want them to.
In this situation the C programmers can either a) accept that they're programming in a language that exists as it exists, not as they'd like it to exist; b) angrily deny a); or c) switch to some other system-level language with defined semantics.
Given what most C compilers are written in, are C programmers also C implementers?
I suspect it also depends on who exactly the compiler writers are; the GCC and LLVM folks seem to include more theoreticians/academics and thus think of the language more abstractly, leading to UB being treated as truly unconstrained, while MSVC and ICC are more on the practical side, and their interpretation of it is, as the standard says, "in a documented manner characteristic of the environment". IMHO the "spirit of C" and the more commonsense approach is definitely the latter, and K&R themselves always leaned in that direction. This is very much a "letter of the law vs. spirit of the law" argument. The fact that these two different camps have produced compilers with nearly the same performance characteristics shows, IMHO, that the claim that exploiting UB is mandatory for performance is a debunked myth.
I doubt it, but that's just a hunch. Is there data out there regarding compiler/language maintainer/standards committee members' contributions to other projects (beyond "so and so person works on $compiler and $application, both written in C"-type anecdotes)?
If not, then, like ... sure, C compiler maintainers are people who program in C, but they're not "C programmers" as the term was intended (people who develop non-compiler software in C).
My hunch is that that statement is overwhelmingly true if measured by influence of a given C compiler/implementation stack (because GCC/LLVM/MSVC take up a huge slice of the market, and their maintainers are in many cases paid specialists who don't do significant work on other projects), but untrue if measured by count of people who have worked on C compilers (because there are a huge number of small-market-share/niche compilers out there, often maintained by groups who develop those compilers for a specific, often closed-source, platform/SoC/whatever).
Another alternative is for the programmer to write their own C compiler and be free of these politics. Maybe I'm biased since I'm working on exactly such a project, but I have been seeing more and more in-progress compiler implementations for C or C-like languages over the past couple of years.
Proposals for Boring C or a "Friendly Dialect of C" or whatever have been around for a while. None went beyond the early design stages because, it turns out, no two experienced C programmers could agree on which parts of C are reasonable/unreasonable (and should be kept/left out); see [0] for a first-hand account.
[0] https://blog.regehr.org/archives/1287
> In contrast, we want old code to just keep working, with latent bugs remaining latent.
Well, just keep compiling it with the old compilers. "But we'd like to use new compilers for some 'free' gains!" Well, sucks, you can't. "But we have to use new compilers because the old ones just plain don't work on the newer systems!" Well, that sucks too, and this here is why "technical debt" is called "debt": you've managed to put off paying it until now, and the repo team is here knocking at your door.
I can't upvote this enough.
I mostly work in compiled languages now, but started in interpreted/runtime languages.
When I made that switch, it was baffling to me that the compiled-language folks don't do compatibility-breaking changes more often during big language/compiler revision updates.
Compiled code isn't like runtime code--you can build it (in many cases bit-deterministically!) on any compiler version and it stays built! There's no risk of a toolchain upgrade preventing your software from running, just compiling.
After having gone through the browser compatibility trenches and the Python 2->3 wars, I have no idea why your proposal isn't implemented more often: old compiler/language versions get critical/bugfix updates where practical, new versions get new features and aggressively deprecate old ones. For example: "you want some combination of {the latest optimizations, loongarch support, C++-style attributes, #embed directives, auto vector zero-init}? Great! Those are only available on the new revision of the compiler where -Werror is the default and only behavior. Don't want those? The old version will still get bugfixes."
Don't get me wrong, backwards compatibility is golden...when it comes to making software run. But I think it's a mistake that back compat is taken even further when it comes to compilers, rather than the reverse. I get that there are immense volumes of C/C++ out there, but I don't get why new features/semantics/optimizations aren't rolled out more aggressively (well, I do--maintainers of some of those immense volumes are on language steering committees and don't want to spin up projects to modernize their codebases--but I'm mad about it).
"Just use an old compiler" seems like such a gimme--especially in the modern era of containers etc. where making old toolchains available is easier than ever. I get that it feels bad and accumulates paper cuts, but it is so much easier to deploy compiled code written on an old revision on a new system than it is to deploy interpreted/managed code.
(There are a few cases where compilers need to be careful there--thinking about e.g. ELF format extensions and how to compile code with consideration for more aggressive linker optimizations that might be developed in the future--but they're the minority.)
There are C codebases many decades old still being actively maintained and used. I don't think the same is true for Python on the same scale. It's easy to remodel when you are at the top of the abstraction stack, but you don't want to mess around with the foundational infrastructure unnecessarily.
Absolutely. But there’s so much more liberty in C land in that you can stay on an old compiler/language version for such codebases.
I know it’s not pleasant per se, but the level of support needed (easier now with docker and better toolchain version management utils than were the norm previously) surely doesn’t merit compilers carrying around the volume of legacy cruft and breaking-change aversion they do, no?
And please provide feedback to WG14. Also please give feedback and file bugs for GCC / clang. There are users of C in the committee and we need your support. Also keeping C implementable for small teams is something that is at risk.
> behave the way the C implementers want them to
If you don't please your users, you won't have any users.
It's ironic that I have to tell you of all people this, but many users of C (or at least, backends of compilers targeted by C) do actually want the compiler to aggressively optimize around UB.
I'm well aware of that. We've had many, many discussions of that in the D forums.
If you're self hosting your compiler on C, you are your own user.
Which users?
Consider that most programmers have long since fled for other languages.
And yet, C++.
> And yet, C++.
By any metric, C++ is one of the most successful programming languages devised by mankind, if not the most successful.
What point were you trying to make?
That it doesn't please lots of its users, I imagine. I, personally, certainly never enjoyed it, but sometimes you don't have a realistic alternative and have to use C++ (or C). In which case your pleasure or displeasure doesn't really matter; you just use that one tool with very sharp edges in the most unexpected (and ridiculously exposed) places with as much care as you can, then bandage your wounds and move on.
that it has millions of users while pleasing approximately none of them
True! But C++ is popular almost entirely because of when (in history/what alternatives existed at the time) and where (on what platforms) it first became available, and how much adoption momentum was created during that era.
I think claiming that C++ is successful because of the unintuitive-behavior-causing compiler behaviors/parts of the spec is an extraordinary claim--if that's what you mean, then I disagree. TFA discusses that many of the most pernicious UB-causing optimizations yield paltry performance gains.
If I may pontificate a bit, I was a major contributor to the success of C++.
Back in the 80s, I was looking for a way to enhance my C compiler. I looked at Objective-C and C++. There was a newsgroup for each, and each had about the same amount of traffic. I had to pick one.
Objective-C required a license to implement it. I asked AT&T if I needed a license to implement C++, and could I call it C++. AT&T's lawyer laughed and said feel free to do whatever you want.
So that decided it for me. At the time, C++ did not exist on the PC other than the awkward, nearly unusable cfront (which translated C++ to C). At the time, 90% of programming was done on the PC.
I implemented it. It was the first native C++ compiler for the PC. (It is arguable that it was the first native C++ compiler, depending on whether a gcc beta is considered a release.)
The usage of it exploded. The newsgroup traffic for C++ zoomed upwards, and Objective-C interest fell away. C++ built critical mass because of Zortech C++.
Borland dropped their plans for an OOP language and went for Turbo C++. Microsoft also had a secret OOP C language called C*, which was also abandoned in favor of implementing C++.
And the rest is history!
P.S. cfront on the PC was unusable because it was 1) incredibly slow and 2) did not support near/far pointers which was required for the mixed PC memory models.
P.P.S. Bjarne Stroustrup never mentioned any of this in his book "The Design and Evolution of C++".
How about we agree on the ABI and everyone can have their own C compiler. Everyone C's the world through their own lenses.
We're not too far away from that. At the very least, Claude can provide feedback and help decide which compiler options to use, as per developer preference.
Compiler developers hijacked and twisted the term "Undefined Behavior". Everyone understood what UB was in K&R C - if you write code that the standard doesn't define a meaning to, and the compiler outputs what it outputs. If you dereference a null pointer, the compiler outputs a null pointer dereference, and when you hit it at runtime you get the undefined behavior (page fault on modern systems).
Nowadays, UB means something completely different - if at any point in time, the compiler reasons out that a piece of code is only reachable via UB, it will assume that this can never happen, and will quietly delete everything downstream:
https://godbolt.org/z/EYxWqcfjx
Sorry if I’m missing something as this isn’t my field, but shouldn’t the two meanings be roughly equivalent to the user?
As in, everything down from UB is only working by an accident of implementation that does not need to hold, and you should explicitly not rely on that. Whether the compiler happens to explicitly make it not ever work or just leaves it to fate should not be relevant.
No, because the former definition is still something you can rely on given a specific compiler and a specific machine. Hell, a bunch of UB was pretty much universal anyway. Compilers would usually still emit sensible code for UB.
UB just meant "the spec doesn't define what happens". It didn't use to mean "the compiler can just decide to do any wild thing if your program touches UB anywhere at any time". Hell, with the modern definition UB can apparently time travel: you don't even need to execute the UB code for it to start doing weird shit in some cases.
UB went from "whatever happens when your compiler/hardware runs this is what happens" to "once a program contains UB, the compiler doesn't need to conform to the rest of the spec anymore."
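To make the "time travel" point concrete, here is a minimal sketch (my own example, not the one from the godbolt link above): the unconditional dereference below the check lets the compiler assume p is non-null for the whole function, so the check that executes before the UB can be deleted, and the warning is never printed.

#include <stdio.h>

int deref(int *p) {
    if (p == NULL)
        fprintf(stderr, "null pointer!\n");  /* note: no early return */
    return *p;   /* UB when p == NULL, so the compiler may assume p != NULL
                    everywhere, including in the branch above */
}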
>the former definition is still something you can rely on given a specific compiler and a specific machine.
>UB just meant "the spec doesn't define what happens"
What comes to mind is that the written code is then operating on a sub-spec, one that is probably undocumented and maybe even unintended by the specifics of that compiler version and platform.
It sounds like it could create a ton of issues, from code that can't be ported to the difficulty another person will have grokking the undocumented behavior being relied on.
In this regard, as someone that could potentially inherit this code I’d actually want the compiler to stop this potential behavior. Am I missing something? Is the spec not functional enough on its own to rely just on that?
Very simple code is UB:
int handle_untrusted_numbers(int a, int b) {
    if (a < 0) return ERROR_EXPECTED_NON_NEGATIVE;
    if (b < 0) return ERROR_EXPECTED_NON_NEGATIVE;
    int sum = a + b;
    if (sum < 0) {
        return ERROR_INTEGER_OVERFLOW;
    }
    return do_something_important_with(sum);
}
Every computer you will ever use has two's complement for signed integers, and the standard recently recognized and codified this fact. However, the UB fanatics (heretics) insisted that not allowing signed overflow is an important opportunity for optimizations, so that last if-statement can be deleted by the compiler and your code quietly doesn't check for overflow any more.
There are plenty more examples, but I think this is one of the simplest.
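Since a and b are both known non-negative at that point, the check can be written in a way the standard fully defines. A minimal sketch of such a rewrite, testing before the addition so no signed overflow can ever occur:

#include <limits.h>

/* defined-behavior variant of the check above: a and b are already
   known non-negative here, so INT_MAX - b cannot itself overflow */
if (a > INT_MAX - b) return ERROR_INTEGER_OVERFLOW;
int sum = a + b;   /* now guaranteed not to overflow */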
My main gripe with UB is that if a compiler is able to detect undefined behavior invocation, it is still allowed to compile (or rather omit) said code instead of crashing.
ISO C99 actually defines multiple types of deviating behaviour. What you're describing is closer to implementation-defined behaviour than anything else.
The three behaviours relevant in this discussion, from section 3.4:
3.4.1 implementation-defined behavior
unspecified behavior where each implementation documents how the choice is made
EXAMPLE An example of implementation-defined behavior is the propagation of the high-order bit when a signed integer is shifted right.
3.4.3 undefined behavior
behavior, upon use of a nonportable or erroneous program construct or of erroneous data, for which this International Standard imposes no requirements
Possible undefined behavior ranges from ignoring the situation completely with unpredictable results, to behaving during translation or program execution in a documented manner characteristic of the environment (with or without the issuance of a diagnostic message), to terminating a translation or execution (with the issuance of a diagnostic message).
An example of undefined behavior is the behavior on integer overflow.
3.4.4 unspecified behavior
behavior where this International Standard provides two or more possibilities and imposes no further requirements on which is chosen in any instance
An example of unspecified behavior is the order in which the arguments to a function are evaluated.
K&R also mentions "undefined" and "implementation-defined" behaviour on several occasions. It doesn't specify what is meant by undefined behaviour, but it does indeed seem to be "whatever happens, happens" instead of "you can do whatever you want." ISO C99, though, seems to be a lot looser with its definition.
Using integer overflow, as in your example, for optimization was shown to be beneficial by Chandler Carruth in a talk he did at CppCon in 2016.[1] I think it would be best to have something similar to Zig's wrapping and saturating addition operators instead, but at that point I think it is better to just use Zig (which I personally am very willing to do once it reaches 1.0 and other compiler implementations are available).[2]
[1] https://youtu.be/yG1OZ69H_-o?si=x-9ALB8JGn5Qdjx_&t=2357 [2] https://ziglang.org/documentation/0.15.2/#Operators
[1] is probably the best counterpoint I've seen, but there are other ways to enable this optimization - the most obvious being to use a register-sized index, which is what's passed to the function anyways. I'd be fine with an intrinsic for this as well (I don't think you'll use it often enough to justify the +%= syntax)
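For illustration, a sketch of that alternative (a hypothetical loop, not the talk's actual code): with a register-sized size_t index there is no sign-extension to hoist and no reliance on signed-overflow UB for induction-variable analysis.

#include <stddef.h>

/* hypothetical example: a size_t index is already register-sized, so
   the compiler can widen/vectorize the loop without assuming that a
   signed 32-bit index never overflows */
void scale(float *out, const float *in, size_t n, float k) {
    for (size_t i = 0; i < n; i++)
        out[i] = in[i] * k;
}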
It's also worth noting that even with the current very liberal handling of UB, the actual code sample in [1] was still lacking this optimization; so it's not like the liberal UB handling automatically lead to faster code, understanding of the compiler was still needed.
The question is one of risk: if the compiler is conservative, you're risking slightly less optimized code. If the compiler is very liberal and assumes UB never happens, you're risking that it will wipe out your overflow check like in my godbolt link (I've seen actual CVEs due to that, although I don't remember the project).
Previously:
https://news.ycombinator.com/item?id=11219874 (2016)
https://news.ycombinator.com/item?id=19659555 (2019)
Thanks! Macroexpanded:
What every compiler writer should know about programmers (2015) [pdf] - https://news.ycombinator.com/item?id=19659555 - April 2019 (62 comments)
What every compiler writer should know about programmers [pdf] - https://news.ycombinator.com/item?id=11219874 - March 2016 (106 comments)
Note that all the examples come from lack of bounds checking.
A C compiler is a relatively simple program (especially if you don't want any optimizations based on undefined behavior). If a large part of the userbase is unhappy with the way most modern C compilers work, they could easily write a "friendly"/"boring" C compiler.
Some of those already exist, e.g. https://bellard.org/tcc/
However, they're not in widespread use. I would be curious to learn if there's any data/non-anecdotal information as to why. Is it momentum/inertia of GCC/LLVM/MSVC? Are alternative compilers incomplete and unable to actually compile a lot of practical programs (belying the "relatively simple program" claim)? Or is the performance differential due to optimizations really so significant that ordinary programs like e.g. vim or libjpeg or VLC have significant degradations when built with an alternative compiler?
This was 2015, and we still have no -Wdeadcode, warning of the removal of "dead code", i.e. what compilers think is dead code. If a program writer writes code, it is never dead. It is written. It had purpose. If the compiler thinks this is wrong, it needs to warn about it.
The only dead code is generated code by macros.
Dead code is extremely common in C or C++ after inlining, other optimizations.
Or stubs. I'll often flesh out a class before implementing the methods.
OP means that the code has a dual purpose: one purpose is to be compiled, the other is to communicate structure or intent to programmers.
Do we know that? I've written "dead" code. Its point was to communicate structure or intent, but it was also still dead. This pattern, in one form or another, crops up a lot IME (in multiple languages, even, with varying abilities to optimize it):
if condition that is "always" false:
    abort with message detailing the circumstances
That `if` is "dead", in the sense that the condition is always false. But "dead" sometimes is just a proof — or, if I'm not rigorous enough, an assumption — in my head. If the compiler can prove the same proof I have in my head, then the dead code is eliminated. If it can't, well, presumably it is left in the binary, either never to be executed, or to be executed in the case that the proof in my head is wrong.
What about assertions that are meant to detect bad hardware? I'd think that's not too uncommon, particularly in shops building their own hardware. Noise on the bus, improper termination, ESD, dirty clock signal, etc. -- there are a million reasons why a bit might flip. I wouldn't want the compiler to optimize "obviously wrong" code out any more than empty loops.
Some conditions depend strictly on inputs and the compiler can't reason much about them, and the developers can't be sure about what their users will do. So that pattern is common. It's a sibling of assertions.
There are even languages with mandatory else branch.
That's the problem
Why is that a problem? Inlining and optimization aren't minor aspects of compiling to native code, they are responsible for order-of-magnitude speedups.
My point is that it is easy to say "don't remove my code" while looking at a simple single-function example, but in actual compilation huge portions of a function are "dead" after inlining, constant propagation and other optimizations, to say nothing of C-specific UB or other shenanigans. You don't want to throw that out.
Removing unused inlined functions or false constexprs is trivial to see. We already have -Winline. We care about removed branches, expressions and statements due to some optimizer logic.
I'm talking about the optimizer, not the linker, which thankfully does a lot of pruning.
Apologies for the flippant one-liner; you made a good point and deserve more than that.
On the one hand, having the optimizer save you from your own bad code is a huge draw. This is my desperate hope with SQL: I can write garbage queries and the optimizer will save me from myself.
But... someone put that code there, spent time and effort to get that machinery into place with the expectation that it is doing something, and when the optimizer takes that away with no hint, that does not feel right either. Especially when the program now behaves differently when "optimized" vs. unoptimized.
What I mean is that we look at a function in isolation and see that it doesn't have any "dead code", e.g.:
int factorial(int x) {
    if (x < 0) throw invalid_input();
    // compute factorial ...
}
This doesn't have any dead code on static examination. At compilation time, however, this function may be compiled multiple times, e.g., as factorial(5) or as factorial(x) where x is known to be non-negative by range analysis. In those cases, the `if (x < 0)` is simply pruned away as "dead code", and you definitely want this! It's not a minor thing; it's a core component of an optimizing compiler.
This same pruning is also responsible for the objectionable pruning in the examples of compilers working at cross purposes to programmers, but it's not easy to have the former behavior without the latter, and that's also why something like -Wdead-code is hard to implement in a way that wouldn't give constant false positives.
If dead code (1) is common in your codebase then your code base is missing heaps of refactors
(1) "dead" meaning unused types, unreachable branches
Not really, no. If you use a regex library it is very likely that 80% of that code is effectively dead code.
public interfaces are not dead code
I'd love for you to write a C compiler that does this and then realize how much dead code there is in your C projects.
Yes, and I'd love to see it when the removal of a single line causes security issues. Many others would too.
That's not true for all code bases. Two common examples:
It's very common for inline functions in headers to be written so that inlining and constant propagation from arguments results in dead code and better generated code. There is even __builtin_constant_p() to help with such things (e.g., you can use it to provide a fast folded inline variant if an argument is constant, or call big out-of-line library code if it is variable).
There are also configuration systems that end up with config options in headers that code tests with if (CONFIG_BLAH) {...} that can evaluate to zero in valid builds.
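A sketch of the first pattern (popcount_slow() is a hypothetical out-of-line library routine; __builtin_constant_p() is a GCC/Clang extension):

int popcount_slow(unsigned long x);  /* hypothetical big out-of-line variant */

static inline int my_popcount(unsigned long x)
{
    if (__builtin_constant_p(x)) {
        /* constant argument: this whole loop folds to a constant */
        int n = 0;
        for (; x; x >>= 1)
            n += (int)(x & 1);
        return n;
    }
    return popcount_slow(x);  /* variable argument: call the library */
}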
Making C compilers better and more predictable is impossible with so many UB cases listed in the standard. A better language should be used instead, where UB and implementation-defined behavior cases are minimized.
Where there is UB in the standard it means that a C compiler is free to define the behavior. So of course, somebody could write a C implementation which does this. See also Fil-C for a perfectly memory safe version of C. So the first sentence makes no sense.
But also note that there is an ongoing effort to remove UB from the standard. We have eliminated already about 30% of UB in the core language for the upcoming version C2Y.
C is designed in such a way that writing a safe compiler without big performance penalties isn't possible. How much slower is Fil-C compared to something like GCC? 2 to 5 times?
This is only relevant for specific types of UB, and even there it is not entirely clear. One of the main challenges is ABI compatibility and separate compilation. Both are not necessarily part of the "design of C". If you are willing to give this up, a lot can be done. Annotations are another possibility to get full memory safety without performance cost.
Have you seen Rust? I'm loving it.
Rust is not super appealing to me as C user: too complex, slow compilation, etc.
Slow compilation and complexity aren't an issue. They're the price for much better resulting code quality and the elimination of many errors.
You are correct about the complexity. I write it with significant assistance from an LLM. Not quite vibe coding, but close. But I'm coming from Python, not C.
Maybe Zig, Hare or C3 then?
Also, what I like about C is that it has mature tooling, is very portable with multiple implementations, and is very stable. I would not use a language for any serious project that does not offer all this.
Honestly, I do not think the problem with C is so big that one needs to jump ship. There are real issues, yes, but there are also plenty of good tools and strategies to deal with UB; it is not really an issue for me.
Zig isn't the language I mean. It's still full of footguns. It doesn't address the fundamental reliability issues of C.
UB is the definition of free will; that's why you can't control it, and for a programmer, something that cannot be controlled feels dangerous.
C programs with undefined behaviour were never conforming or well-working.
I stopped reading at the abstract; garbage rant full of contradictions.
Here's a cogent argument that any decision by compiler writers that they can do whatever they wish whenever they encounter an "undefined behavior" construct is rubbish:
https://www.yodaiken.com/2021/05/19/undefined-behavior-in-c-...
And here's a cautionary tale of how a compiler writer doing whatever they wish once they encounter undefined behavior makes debugging intractable:
https://www.quora.com/What-is-the-most-subtle-bug-you-have-h...
> undefined behavior makes debugging intractable:
By their own admission, the compiler warns about the UB. "-Wanal"¹, as some call it, makes it an error. Under UBSan the program aborts with:
…
"intractable"?
¹ a humorous name for -Wextra -Wall -Werror
Not everybody has full control over their environment.
The -Werror flag is not even religiously used for building, e.g. the linux kernel, and -Wextra can introduce a lot of extraneous garbage.
This will often make it easier (though still difficult) to winnow the program down to a smaller example, as that person did, rather than to enable everything and spend weeks debugging stuff that isn't the actual problem.
Yes, this is the funny thing. People do not want to spend time using the stricter language already supported by C compilers via compiler flags because "it is a waste of time", while others argue that we need to switch to much stricter languages. Both positions cannot be true at the same time.
Wow, that's a very torturous reading of a specific line in a standard. And it doesn't really matter what Yodaiken thinks this line means, because the standard is written by C implementers for (mostly) C implementers. So if C compiler writers think this line means they can use UB for optimization purposes, then that's what it means.
Yeah, I know it breaks the common illusion among C programmers that they're "close to the bare metal", but illusions should be dispelled, not indulged. C programmers program against the abstract C machine, which is then mediated by the C compilers into machine code in the way the implementers of those compilers have publicly documented.
Yeah, this is basically Sovereign Citizen-tier argumentation: through some magic of definitions and historical readings and arguing about commas, I prove that actually everyone is incorrect. That's not how programming languages work! If everyone for 10+ years has been developing compilers with some definition of undefined behavior, and all modern compilers use undefined behavior in order to drive optimization passes which depend on those invariants, there is no possible way to argue that they're wrong and you know the One True C Programming Language interpretation instead.
Moreover, compiler authors don't just go out maliciously trying to ruin programs through finding more and more torturous undefined behavior for fun: the vast majority of undefined behavior in C are things that if a compiler wasn't able to assume were upheld by the programmer would inhibit trivial optimizations that the programmer also expects the compiler to be able to do.
Where I find the argument gets lost is when undefined behavior is assumed to be exactly that: an invariant.
That is to say, I find "could not happen" the most bizarre reading to make when optimizing around undefined behavior. "Whatever the machine does" makes sense, as does "we don't know". But "could not happen"??? If it could not happen, the spec would have said "could not happen"; instead, the spec does not know what will happen and so punts on the outcome, knowing full well that it will happen all the time.
The problem is that there is no optimization to make around "whatever the hardware does" or "we have no clue" so the incentive is to choose the worst possible reading "undefined behavior is incorrect code and therefore a correct program will never have it".
Some behaviors are left unspecified instead of undefined, which allows each implementation to choose whatever behavior is convenient, such as, as you put it, whatever the hardware does. IIRC this is the case in C for modulo with both negative operands.
I would imagine that the standard writers choose one or the other depending on whether the behavior is useful for optimizations. There's also the matter that if a behavior is currently undefined, it's easy to later on make it unspecified or specified, while if a behavior is unspecified it's more difficult to make it undefined, because you don't know how much code is depending on that behavior.
But even integer overflow is undefined.
It's practically impossible to find a program without UB.
I think this is not really true. Or rather, it depends on the UB you are talking about. There is UB which is simply UB because it is out of scope for the C standard, and there is UB such as signed integer overflow that can cause issues. It is realistic to deal with the latter, e.g. by converting it to traps with compiler flags.
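For example, with GCC or Clang (all three flags exist in both compilers):

cc -O2 -ftrapv foo.c                              # trap on signed integer overflow
cc -O2 -fsanitize=signed-integer-overflow foo.c   # UBSan: report it at runtime
cc -O2 -fwrapv foo.c                              # define overflow as wrapping instead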
> I think this is not really true. Or rather, it depends on the UB you are talking about.
I mean, if you're going to argue that a compiler can do anything with any UB, then by all means make that argument.
Otherwise, then no, I don't think it's reasonable for a compiler to cause an infinite loop inside a function simply because that function itself doesn't return a value.
When you say "cause", do you mean insert on purpose, or do you mean cause by accident? I could see the latter happening, for example because the compiler doesn't generate a ret if the non-void function doesn't return anything, so control flow falls through to whatever code happens to be next in memory. I'm not aware of any compiler that does that, but it's something I could see happening, and the developers would have no reason to "fix" it, because it's perfectly up to spec.
I am not sure what statement you are responding to. I am certainly not arguing that. I disagree with your claim that "it's practically impossible to find a program without UB".
Aliasing being the classic example. If code generation for every pointer dereference has to assume that it’s potentially aliasing any other value in scope, things get slow in a hurry.
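A sketch of the cost (a hypothetical function): if the stores through out may alias *len, the compiler must re-read *len on every iteration instead of hoisting it into a register; restrict qualifiers, or stricter aliasing rules, are what license the hoist.

/* without any no-alias guarantee, *len must be reloaded each pass,
   because out[i] = ... might have modified it */
void copy_ints(int *out, const int *in, const int *len) {
    for (int i = 0; i < *len; i++)
        out[i] = in[i];
}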
Compiler writers are free to make whatever intentional choices they want and document them. UB is especially nasty compared to other kinds of bugs because implementors can't/refuse to commit to any specific behavior, not because they've chosen the wrong behaviors.
> Compiler writers are free to make whatever intentional choices they want and document them.
Sure, but it's unlikely it's an intentional choice to cause an infinite loop simply because your boolean function didn't return a boolean.
> Wow, that's a very torturous reading of a specific line in a standard.
It's actually a much more torturous reading to say "if any line in the program contains undefined behavior (such as the example given in the standard, integer overflow), then it's OK for the compiler to treat the entire program as garbage and create any behavior whatsoever in the executable."
Which is exactly the claim that he was addressing.