Subclassing errors, OOP style and practically checkable rules to prevent them

The previous article "Does OOP really separate interface from implementation?" gave an example when blindly following OOP principles may lead to rather subtle errors: a "transparent" change in an implementation suddenly breaks client code that was written according to public interfaces. The article used a familiar example of Bags and Sets. The common criticism was that the example violated the Liskov Substitution Principle (LSP), and that a Set is not a subtype of a Bag.

This article will demonstrate that a set truly is-a bag; a set is substitutable for a bag, a set can always be manipulated as a bag, a set maintains every invariant of a bag -- and it is still a set. And C++ can be a pure functional language. We will prove that all the above assertions hold at the same time.

The article will also show that if we abide by practically checkable rules we obtain a guarantee that the subtle subclassing errors cannot occur in principle. The rules do not diminish the power of a language. The Liskov Substitution Principle (LSP) cannot be practically checkable as it is generally undecidable [LSP].

To make clear how the proposed rules preclude the subclassing problem illustrated before, this article follows the format of the previous one. We aim to give a more-or-less "real" example, which you can run and see the result for yourself. By necessity the example had to be coded in some language. The present article uses C++, again, for the sake of comparison. It appears however that similar code (with similar conclusions) can be carried on in many other OO languages (e.g., Java, Python, etc). For style comparisons, one example is coded in Haskell. Makefiles in the complete code archive [Code] tell how to build projects and run the validation tests and examples.

BRules
Interface and implementation of a FBag
Implementation of a FSet. A proof that a FSet is a subtype of a FBag
Discussion
Polymorphic programming and abstract data types with BRules
Conclusions
Footnotes and References [a separate page]

BRules

Suppose I was given a task to implement a Bag -- an unordered collection of possibly duplicate items (integers in this example). This time my boss laid out the rules, which we will refer to as BRules:

no virtual methods or virtual inheritance
no visible members or methods in any public data structure (that is, in any class declared in an .h file)
no mutations to public data structures
- a strict form: no assignments or mutations whatsoever
- a less strict form: no function may alter, directly or indirectly, any data it receives as arguments

The rules break the major tenets of OOP: for example, values no longer have a state that is separate from their identity. Prohibitions on virtual methods and on modifications of public objects are severe. It appears that not much of C++ is left. Surprisingly I still can implement my assignment without losing expressiveness -- and perhaps even gaining some. The exercise will also illustrate that that C++ does indeed have a pure functional subset [CPP-multiparadigm] [FCPP], and that you can program in C++ without assignments.

Interface and implementation of a FBag

class FBag { public: FBag(void); FBag(const FBag& another); // Copy-constructor ~FBag(void); private: class Cell; // Opaque type const Cell * const head; FBag(const Cell * const cell); // Private constructor // Declaration of three friends elided };

Indeed, there are no virtual functions, no methods or public members.

int size(const FBag& bag); // The number of elements in the bag int count(const FBag&, const int elem); // Count the number of occurrences // Put an element into the bag FBag put(const FBag& bag, const int elem); // Remove an element from the bag. No error if the element didn't exist FBag del(const FBag& bag, const int elem); // If a bag is empty, return the seed // Otherwise, return // functor(bag.e1,functor(bag.e2, ... functor(bag.en,seed))) template<typename Functor, typename Seed> Seed fold(const FBag& bag, const Functor& functor, const Seed seed); // Standard "print-on" operator ostream& operator << (ostream& os, const FBag& bag); // Union (merge) of the two bags FBag operator + (const FBag& bag1, const FBag& bag2); // Structural equivalence of the bags // Two bags are equal if they contain the same number of the same elements inline bool operator == (const FBag& a, const FBag& b) { return a <= b && a >= b; } // Verify bag's invariants void verify(const FBag& bag);

It must be stressed that the functions that operate on a FBag are not FBag's methods; in particular, they are not a part of the class FBag, they are not inherited and they cannot be overridden. The implementation is also written in a functional style. FBag's elements are held in a linked list of cells, which are allocated from a pre-defined pool. The pool implements a mark-and-sweep garbage collection, in C++.

Forgoing assignments does not reduce expressiveness as the following snippets from the FBag implementation code show.

// Union (merge) of the two bags struct union_f { FBag operator() (const int elem, const FBag seed) const { return put(seed,elem); } }; FBag operator + (const FBag& bag1, const FBag& bag2) { return fold(bag1,union_f(),bag2); } // Standard "print-on" operator struct print_f { ostream& os; const bool is_first; print_f(ostream& _os, const bool _is_first) : os(_os), is_first(_is_first) {} print_f operator() (const int elem, const print_f seed) const { return print_f(seed.os << (seed.is_first ? "{" : ", ") << elem, false); } }; ostream& operator << (ostream& os, const FBag& bag) { print_f seed(os,true); print_f result = fold(bag,seed,seed); return result.os << (result.is_first ? "{}" : "}"); }

The output stream in the latter example is a linear, read-once value. Monadic programming is possible in C++, and easy.

Following good practice, I wrote a validation code (file vFBag.cc [Code]) that tests all the functions and methods of the FBag package and verifies common invariants.

Implementation of a FSet. A proof that a FSet is a subtype of a FBag

Suppose you are tasked with implementing a Set package. Your boss defined a set as an unordered collection where each element has a single occurrence. In fact, your boss even said that a set is a bag with no duplicates. You have found my FBag package and realized that it can be used with few additional changes. The definition of a Set as a Bag (with some constraints) made the decision to reuse the FBag code even easier.

class FSet : public FBag { public: FSet(void) {} FSet(const FBag& bag) : FBag(remove_duplicates(bag)) {} }; bool memberof(const FSet& set, const int elem) { return count(set,elem) > 0; }

Surprisingly, that is the whole implementation of a FSet. A set is fully a bag. Because FSet constructors eventually call FBag constructors and do no alter the latter's result, every post-condition of a FSet constructor implies a post-condition of a FBag constructor. Since FBag and FSet values are immutable, the post-conditions that hold at their birth remain true through their lifespan. Because all FSet values are created by an FBag constructor, all FBag operations automatically apply to an FSet value. This concludes the proof that an FSet is a subtype of a FBag.

The FBag.cc package [Code] has a function verify(const FBag&) that checks to make sure its argument is indeed a bag. The function tests FBag's invariants, for example:

const FBag bagnew = put(put(bag,5),5); assert( count(bagnew,5) == 2 + count(bag,5) && size(bagnew) == 2 + size(bag) ); assert( count(del(bagnew,5),5) == 1 + count(bag,5) );

Your validation code passes a non-empty set to this function to verify the set is indeed a bag. You can run the validation code vFSet.cc [Code] to see for yourself that the test passes. On the other hand, FSets do behave like Sets:

const FSet a112 = put(put(put(FSet(),1),1),2); assert( count(a112,1) == 1 ); const FSet donce = FSet() + a112; const FSet dtwice = donce + a112; assert( dtwice == a112 );

where a112 is a non-empty set. The validation code vFSet.cc you wrote contains many more tests like the above. The code shows that a FSet is able to pass all of FBag's tests as well as its own. The implementation of FSets makes it possible to take a union of a set and a bag; the result is always a bag, which can be made a set if desired. There are corresponding test cases as well.

To clarify how an FSet may be an FBag at the same time, let's consider one example in more detail:

// An illustration that an FSet is an FBag int cntb(const FBag v) { FBag b1 = put(v, 5); FBag b2 = put(b1, 5); FBag b3 = del(b2, 5); return count(b3, 5); } const int cb1 = cntb(FBag()); // cb1 has the value of 1 const int cb2 = cntb(FSet()); // cb2 has the value of 1 // An illustration that an FSet does act as a set int cnts(const FSet v) { FSet s1 = put(v, 5); FSet s2 = put(s1, 5); FSet s3 = del(s2, 5); return count(s3, 5); } const int cs = cnts(FSet()); // cs has the value of 0

This example is one of the test cases in Subclassing/No-problem/vFSet.cc [Code]. You can run it and check the results for yourself. Yet it is puzzling: how come cs has the value different from that of cb1 if there is no custom del() function for FSets?

The statement
FSet s2 = put(s1, 5);
is the most illuminating. On the right-hand side is an expression: putting an element 5 to a FBag/FSet that already has this element in it. The result of that expression is a FBag {5,5}, with two instances of element 5. The statement then constructs a FSet s2 from that bag. A FSet constructor is invoked:

class FSet : public FBag { public: FSet(void) {} FSet(const FBag& bag) : FBag(remove_duplicates(bag)) {} };

The constructor takes the bag {5,5}, removes the duplicate element 5 from it, and "blesses" the resulting FBag to be a FSet as well. Thus s2 will be a FBag and a FSet, with one instance of element 5. In fact, s1 and s2 are identical. A FSet constructor guarantees that a FBag it constructs contains no duplicates. As objects are immutable, this invariant holds forever.

Discussion

Surprising as it may be, assertions "a Set is a Bag with no duplicates" and "a Set always acts as a Bag" do not contradict each other, as the following two examples illustrate:

Let ( value ... ) be an unordered collection of values: a Bag. Let's consider the following values:

ValA: 42
ValB: ( 42 )
ValC: ( 43 )
ValD: ( 42 43 )
ValE: ( 42 43 42 )
ValA is not a collection; ValB, ValC, ValD and ValE are bags. ValB, ValC and ValD are also Sets: unordered collections without duplicates. ValE is not a Set. Every Set is a Bag but not every Bag is a Set. Let uf-integer denote a natural number whose prime factors are unique. Let's consider the following values:

ValA: 5/4
ValB: 42
ValC: 43
ValD: 1806
ValE: 75852
ValA is not an integer; ValB, ValC, ValD and ValE are integers. ValB, ValC and ValD are also uf-integers. ValE is not a uf-integer as it is a product 2*2*3*3*7*7*43 with factors 2, 3, and 7 occurring several times. Every uf-integer is an integer but not every integer is a uf-integer.

We introduce operations merge (infix +) and subtract (infix -). Both operations take two Bags and return a Bag. Either of the operands, or both, may also be a Set. The result, a Bag, may or may not be a Set. For example,

ValB + ValC => ValD
both of the operands and the result are also Sets
ValB + ValD => ValE
the argument Bags are also Sets, but the resulting Bag is not a Set
ValE + ValE => ( 42 43 42 42 43 42 )
None of the Bags here are Sets

ValD - ValC => ValB
The argument Bags are also Sets, so is the result.
ValE - ValC => ( 42 42 )
One of the arguments is a Set, the resulting Bag is not a Set.
ValE - ValE => ()
The argument Bags are not Sets, but the resulting Bag is.

We introduce operations multiply (infix *) and reduce (infix %): a % b = a/gcd(a,b). Both operations take two integers and return an integer. Either of the operands, or both, may also be a uf-integer. The result, an integer, may or may not be a uf-integer. For example,

ValB * ValC => ValD
both of the operands and the result are also uf-integers
ValB * ValD => ValE
the argument integers are also uf-integers, but the resulting integer is not a uf-integer
ValE * ValE => 5753525904
None of the integers here are uf-integers

ValD % ValC => ValB
The argument integers are also uf-integers, so is the result.
ValE % ValC => 1764
One of the arguments is a uf-integer, the resulting integer is not a uf-integer.
ValE % ValE => 1
The argument integers are not uf-integers, but the resulting integer is.

Bags are closed under operation merge but the subset of Bags -- Sets -- is not closed under merge. On the other hand, both Bags and Sets are closed under subtract.
We may wish for a merge-like operation that, being applied to Sets, always yields a Set. We can introduce a new operation: merge-if-not-there. We can define it specifically for Sets. Alternatively, the operation can be defined on Bags; it would apply to Sets by the virtue of inclusion polymorphism as every Set is a Bag. Sets are closed with respect to merge-if-not-there.
On the other hand, to achieve closure of Sets under merge we can project -- coerce -- the result of merging of two Sets back into Sets, a subset of Bags. The FBag/FSet package took this approach. If we merge two FSets and want to get an FSet in result we have to specifically say so, by applying a projection (coercion) operator: FSet::FSet(const FBag& bag). That operator creates a new FBag without duplicates. This fact makes the latter a FSet. Thus
FSet(ValB + ValD) => ValD, an FSet.

Integers are closed under operation multiply but the subset of integers -- uf-integers -- is not closed under multiply. On the other hand, both integers and uf-integers are closed under reduce.
We may wish for a multiply-like operation that, being applied to uf-integers, always yields a uf-integer. We can introduce a new operation: lcm, the least common multiple. This operation is well-defined on integers; it would apply to uf-integers by the virtue of inclusion polymorphism as every uf-integer is an integer. uf-integers are closed with respect to the lcm operation.
On the other hand, to achieve closure of uf-integers under multiply we can project -- coerce -- the product of two uf-integers back into uf-integers, a subset of integers. If we multiply two uf-integers and want to get a uf-integer in result we have to specifically say so, by applying a projection (coercion) operator: remove-duplicate-factors. That operator creates a new integer without duplicate factors. This fact makes the resulting integer a uf-integer. Thus
uf-integer(ValB * ValD) => ValD, a uf-integer

It has to be stressed that the two columns of the above table are not merely similar: they are isomorphic. Indeed, the right column is derived from the left column by the following substitution of words:

Bag -> integer
Set -> uf-integer
merge -> multiply
subtract -> reduce

that preserves the meaning: an isomorphism. The right column sounds more "natural" -- so should the left column as integers and uf-integers are representations for resp. FBags and FSets.

It may appear troublesome that
    count(put(put(FSet(),1),1),1)
yields 2, while
    count(FSet(put(put(FSet(),1),1)),1)
yields 1. However, this example is not much different from the following:
    cout << (1 + 0.5) << endl;
will print 1.5, yet
    cout << int(1+0.5) << endl;
will print 1. By the same token, expressions
    (int)(1/3.0 + 2/3.0 + 4/3.0)
    (int)(1/3.0) + (int)(2/3.0) + (int)(4/3.0)
will yield different results, although in both cases it will be an integer. Often introducing intermediary variables, with an explicitly declared type, helps disambiguate the expressions and clarify the meaning. Generally writing expressions with FBags and FSets is not very different from writing arithmetic expressions with integer and real numbers. One has to watch the type of every operation.

From an extensional point of view [CARD-1985], a type denotes a set of values. By definition of a FSet, it is a particular kind of FBag. Therefore, a set of all FSets is a subset of all FBags: FSet is a subtype of FBag. A FBag or a FSet do not have any "embedded" behavior -- just as integers don't have an embedded behavior. Behavior of numbers is defined by operations, mapping from numbers to numbers. Any function that claims to accept every member of a set of values identified by a type T will also accept any value in a subset of T.

Frequently a value can participate in several sets of operations: a value can have several types at the same time. For example, a collection ( 42 ) is both a Bag and a Set. This fact should not be surprising. In C++, a value typically denoted by a numeral 0 can be considered to have a character type, an integer type, a float type, a complex number type, or a pointer type, for any declared or yet to be declared pointer type.

This lack of behavior is what puts FBag and FSet apart from CBag and CSet discussed in the previous article. FSet is indeed a subtype of FBag, while CSet is not a subtype of a CBag as CSet has a different behavior. Incidentally LSP is trivially satisfied for values that do not carry their own behavior. FBags and FSets are close to so-called predicate classes. Since instances of FSets are immutable, the predicate needs to be checked only at value construction time.

This section has talked about subtyping rather than subclassing or inheritance. Subtyping is a far more general concept: it applies to languages which have no classes (e.g., Matlab) and to types that are not related by inheritance (e.g, char and int, which are both primitive types).

Polymorphic programming with BRules

The FSet/FBag example above showed BRules in the context of subtypes formed by a restriction on a base type. As it turns out, BRules work equally well with existential (abstract) types.

Shapes-oop.cc
A conventional OOP implementation of a collection of polymorphic values, considered as instances of an abstract datatype
Shapes-no-oop.cc
A non-traditional implementation, with BRules in effect. There are no virtual functions, no assignments, no overriding.
Shapes-h.hs
The code similar to the above but implemented in Haskell. BRules are in effect as they are implicit in Haskell.

The code archive [Code] contains a third directory, Shapes, with three source code files. The files illustrate various ways of handling a collection of polymorphic values. The collection is populated by Rectangles and Ellipses. They are instances of concrete classes implementing a common abstract base class Shape. A Shape is an existential type that knows how to draw, move and resize itself. A file Shapes-oop.cc gives the conventional, OOP-like implementation, with virtual functions and such. A file Shapes-no-oop.cc is another implementation, also in C++. The latter follows BRules, has no assignments or virtual functions. Any particular Shape value is created by a Shape constructor and is not altered after that. Shapes-no-oop.cc achieves polymorphic programming with the full separation of interface and implementation: If an implementation of a concrete Shape is changed, the code that constructs and uses Shapes does not even have to be recompiled! The file defines two concrete instances of the Shape: a Square and a Rectangle. The absence of mutations and virtual functions guarantees that any post-condition of a Square or a Rectangle constructor implies the post-condition of a Shape. Both shapes can be passed to a function
set_dim(const Shape& shape, const float width, const float height);
Depending on the new dimensions, a square can become a rectangle or a rectangle square. You can compile Shapes-no-oop.cc and run it to see that fact for yourself.

It is instructive to compare Shapes-no-oop.cc with Shapes-h.hs, which implements the same problem in a purely functional, strongly-typed language Haskell. All three code files in the Shapes directory solve the same problem the same way. Two C++ code files -- Shapes-oop.cc and Shapes-no-oop.cc -- look rather different. On the other hand, the purely functional Shapes-no-oop.cc and the Haskell code Shapes-h.hs are uncanny similar -- in some places, frighteningly similar.

This exercise shows that BRules do not constrain the power of a language even when abstract data types are involved.

Conclusions

It is known, albeit not so well, that following the OOP letter and practice may lead to insidious errors [OOP-problems]. The previous article showed how subtle the errors can be even in simple cases. In theory, there are rules -- LSP -- that could prevent the errors. Alas, the rules are in general undecidable and not practically reinforceable [LSP].

In contrast, BRules introduced in this article can be statically checked at compile time. The rules outlaw certain syntactic constructions (for example, assignments in some contexts, and non-private methods) and keywords (e.g., virtual). It is quite straightforward to write a lint-like application that scans source code files and reports if they conform to the rules. When BRules are in effect, subtle subclassing errors like the ones shown in the previous article become impossible. To be more precise, with BRules, subclassing implies subtyping. Subclassing by definition is a way of creating (derived) values by extending, restricting, or otherwise specializing other, parent values. A derived value constructor must invoke a parent value constructor to produce the parent value. The former constructor often has a chance to alter the parent constructor's result before it is cast or incorporated into the derived value. If this chance is taken away, the post-condition of a derived value constructor implies the post-condition of the parent value. Disallowing any further mutations guarantees the behavioral substitutability of derived values for parent values at all times.

As the examples in this article showed, following BRules does not diminish the power of the language. We can still benefit from polymorphism, we can still develop practically relevant code. Yet BRules blur the distinction between the identity and the state, a characteristic of objects. BRules are at odds with the practice if not the very mentality of OOP. This begs the question: Is OOP indeed conducive to software development?

One can argue that OOP -- as every powerful technique -- requires extreme care: knives are sharp. Likewise, goto is expressive, and assembler- or microcode-level programming are very efficient. All of them can lead to bugs that are very difficult, statically impossible, to find. On the other hand, if you program, for example, in Scheme, you never have to deal with an "invalid opcode" exception. That error becomes simply impossible. Furthermore, "while opinions concerning the benefits of OOSD [Object-Oriented Software Development] abound in OO literature, there is little empirical proof of its superiority" [OOP-Perceptions].

Articles' Posting Headers

From: oleg
Message-ID: <8katsh$fmf$1@nnrp1.deja.com> <8kau0t$fn3$1@nnrp1.deja.com>
Subject: Subclassing errors, OOP style and mechanical rules to prevent them
Date: Sun, 09 Jul 2000 22:21:18 GMT
Newsgroups: comp.object,comp.lang.functional,comp.lang.c++.moderated
X-Article-Creation-Date: Sun Jul 09 22:21:18 2000 GMT

Last updated May 1, 2001

This site's top page is http://okmij.org/ftp/

oleg-at-okmij.org
Your comments, problem reports, questions are very welcome!