Missing source code for non-software works in free GNU/Linux distributions
Most software cannot be edited without its source, making source availability necessary for software freedom. Free GNU/Linux distributions have an explicit requirement to provide the sources of included software. Despite this, they include works without source. I believe this is acceptable in practice, although it restricts potential uses of the software and limits our ability to reason about software freedom.
The source
Section 1 of the GNU General Public License, version 3, defines the source code of a work as ‘the preferred form of the work for making modifications to it’. This definition is also used outside of the GPL.
However, only the author of a program can know whether a given text is its source. C ‘source’ code is usually the source of the program compiled from it, but not when it was itself generated, for example by Bison from a grammar file. (Free software projects do sometimes accidentally omit the sources of such files.)
Let’s simplify the issue: a source is a form of the work that a skilled user can reasonably modify. Some works, usually not C programs, are distributed in modifiable forms that might nevertheless be compiled from forms the author prefers for editing. (Some generated parsers do get modified directly, which makes GPL compliance for them slightly harder.)
(For GPL compliance there is a more important issue: providing the corresponding source for a non-source work, which is certainly harder than deciding whether a work is a source at all. It is beyond the scope of this essay.)
I believe these issues are trivial for C programs like the printer drivers that inspired the free software philosophy and its rules. For other works, once only the distributed form is available, deciding whether it is the source is probably impossible for anyone but the author.
Fonts
Fonts are ‘information for practical use’. They describe the shapes and metrics of letters and symbols; editing them is useful to support minority languages or special symbols needed in computer science. Most fonts today are vector (outline) fonts in formats like TrueType. Bitmap fonts have different practical and legal issues.
Legally, fonts are considered programs, although their description of glyph shapes merely lists points and the curves connecting them, with none of the features expected of every programming language. Editors like FontForge can modify TrueType fonts, although FontForge has its own native format that it prefers for editing, with lossy conversion to TrueType.
Hinting is where TrueType contains ‘real’ programs: they adapt these shapes to low-resolution grids, making them legible on screen. The programs are distributed in a Turing-complete, assembly-like language interpreted by a stack-based virtual machine; tools like Xgridfit can compile a higher-level language into them. The other popular font formats, PostScript Type 1 and its derivatives, instead use high-level ‘hints’ like stem positions and standard heights, which the rasterizer uses in unspecified ways to grid-fit the glyph.
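To make ‘stack-based virtual machine’ concrete, here is a toy interpreter in Python. The opcode names and operations are invented stand-ins for illustration only; the real TrueType instructions operate on point coordinates and rasterizer state, but follow the same push-and-execute model:

    def run(program):
        """Interpret a list of (opcode, argument) pairs on a value stack."""
        stack, pc = [], 0
        while pc < len(program):
            op, arg = program[pc]
            if op == "PUSH":             # push a constant
                stack.append(arg)
            elif op == "ADD":            # pop two values, push their sum
                stack.append(stack.pop() + stack.pop())
            elif op == "ROUND":          # snap a coordinate to the pixel grid
                stack.append(round(stack.pop()))
            elif op == "JMPZ" and stack.pop() == 0:
                pc = arg                 # conditional jump: enough for loops
                continue
            pc += 1
        return stack

    # shift a coordinate by half a pixel, then snap it to the grid
    print(run([("PUSH", 3.1), ("PUSH", 0.5),
               ("ADD", None), ("ROUND", None)]))  # prints [4]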
While there is some benefit in editing the source instead of TrueType files, the situation is much different for meta-fonts. The Computer Modern family developed by Donald E. Knuth for use with TeX consists of programs that use 62 parameters to generate 96 fonts. Modern technologies require drawing every font separately; with a meta-font, the same program describes e.g. a Roman letter for all fonts containing it and needs few changes for new fonts. Meta-fonts make it possible to produce a separate set of fonts in a much different style for a single book, or to change gradually between two different fonts within a single article. (I have made a narrow sans-serif monospace style for a Computer Modern derivative in several hours. It is not published due to a licensing issue.)
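The economy of this approach is easy to sketch in Python (a toy for illustration, not how METAFONT works: the parameter names and the rectangular ‘I’ are invented, while Computer Modern draws real curves from its 62 parameters):

    def letter_I(stem_width, x_height, slant):
        """Outline of a toy sans-serif 'I' as a list of (x, y) points."""
        w, h = stem_width, x_height
        upright = [(0, 0), (w, 0), (w, h), (0, h)]
        # one description serves every style: slanting or widening the
        # stem yields italic or bold variants of the same letter
        return [(x + slant * y, y) for x, y in upright]

    light_roman = letter_I(stem_width=20, x_height=430, slant=0.0)
    bold_italic = letter_I(stem_width=42, x_height=430, slant=0.25)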
However, there are nearly no other uses of meta-fonts as effective as this one. METAFONT, the program that interprets Computer Modern, generates device-specific bitmaps with no Unicode support. All programs that compile meta-fonts to outline font formats either trace bitmaps produced by METAFONT (resulting in big, unoptimized fonts) or generate outlines directly, without supporting important features used in Computer Modern. That recent meta-font projects rebuild their sources from generated outline fonts, or do not publish sources at all, does not suggest that this is a successful style today.
Hyphenation patterns
While some languages have reliable rules for hy-phen-a-tion, in English it was done using dictionaries of hyphenated words. This approach has significant problems, which were solved by Franklin Liang’s hyphenation algorithm used in TeX: it generates rule-like hyphenation patterns from a dictionary. The 4447 patterns generated from a non-public dictionary allow TeX to find 89.3% of the hyphens in the dictionary’s words.
The patterns are subwords with multiple levels of hyphens to be added or removed. The word hyphenation is hyphenated using hy3ph, he2n, hena4 and six other patterns, resulting in hy-phen-ation. (Not all hyphens are found; this will be fixed by future dictionaries that use TeX to derive their hyphens.)
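Liang’s pattern application is short enough to sketch in Python. I believe the nine patterns below are the ones from TeX’s English patterns that match this word; a real implementation loads thousands of patterns into a packed trie instead of scanning a list:

    PATTERNS = ["hy3ph", "he2n", "hena4", "hen5at",
                "1na", "n2at", "1tio", "2io", "o2n"]

    def hyphenate(word, patterns=PATTERNS, left=2, right=3):
        dotted = "." + word.lower() + "."    # dots mark the word edges
        values = [0] * (len(dotted) + 1)     # hyphen level at each position
        for pat in patterns:
            # split the pattern into letters and the digit before each letter
            levels, letters = [0], ""
            for ch in pat:
                if ch.isdigit():
                    levels[-1] = int(ch)
                else:
                    letters += ch
                    levels.append(0)
            # wherever the letters match, keep the highest level seen
            for i in range(len(dotted) - len(letters) + 1):
                if dotted[i:i + len(letters)] == letters:
                    for j, level in enumerate(levels):
                        values[i + j] = max(values[i + j], level)
        # an odd final level allows a hyphen; 'left' and 'right' mimic
        # TeX's \lefthyphenmin and \righthyphenmin
        return "".join(
            ("-" if left <= k <= len(word) - right and values[k + 1] % 2
             else "") + ch
            for k, ch in enumerate(word))

    print(hyphenate("hyphenation"))  # hy-phen-ation

Since only the maximum level at each position survives, a pattern with a higher even digit can suppress a hyphen that a lower odd digit would have allowed; this is what the multiple levels are for.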
The same algorithm is used for many other languages with different patterns. These are usually generated from dictionaries that are restricted by copyright and not available to the users; for some languages the patterns are distributed with the source dictionary. (I believe patterns could easily be written by hand for a language with reliable hyphenation rules depending only on the characters in words, although I haven’t seen any example of this.)
The patterns can be and are edited, although the source dictionaries would be more useful for developing other hyphenation algorithms. This makes the patterns ‘a source’, but not ‘the source’.
(Technically, TeX doesn’t use the patterns directly. INITeX loads macro definitions, hyphenation patterns and font metrics, and dumps its memory into a ‘format’: a very build-specific file, representing the patterns in a packed trie that is difficult to edit, for fast loading by VIRTeX, which is what normally builds documents. VIRTeX cannot load patterns itself, since compiling them needs extra memory and code; nowadays the same program is used for both purposes. Many other macro processors and Lisp implementations have a similar feature under a different name.)
Game data
Video games provide a bigger store of sourceless binary data. Many contain bitmaps or animations made with 3D rendering software from unpublished sources. Some games, like Flight of the Amazon Queen, are published as a single binary with no source and no tools for editing it. (A Trisquel users forum thread about this game originally motivated me to write this essay.)
This game has another interesting issue: a license that forbids selling it alone while allowing it to be sold as part of larger software distributions. Well-known free licenses for fonts, like the SIL Open Font License, have the same restriction. The restriction is ‘useless’, since distributing the work together with a Hello World program is allowed, and this is why it still qualifies as a free license.
The lack of both source and tools to edit the work is more interesting. The Debian package includes an explanation of its compatibility with the DFSG: the binary is ‘the preferred form for modification’, and since the tools for editing it were lost, modifications are equally hard for Debian users and for the game’s authors. This is consistent with the source requirement existing to prevent authors from having a monopoly over their works (an explanation that looks equivalent to the user’s-freedom argument).
In the GNU/Linux distributions endorsed by the FSF this is not an issue: game data is considered non-functional, and the only permission required is to distribute unmodified copies. (Debian excludes from its main repository games that these distributions include, and they in turn exclude games that other distributions include. The first is commonly due to a lack of data sources or of modification permission, the second to restrictions on commercial distribution.)
Documentation
Documentation of free software should be free, so it can be shared with the software and updated for modified versions. Most documentation is distributed as HTML or PDF files, which are usually generated from various other markup languages.
Not all such documentation has a published source, and sometimes a program’s source release includes only the compiled documentation. (Sourceless PDFs often use nonfree fonts, too.)
HTML can be edited and often is the source, while in other cases it is compiled from sources that preserve more semantic information about the document and support printing better. For this reason we should not consider it the source if the author has a source from which it is compiled. But can we know that?
While the most popular free software licenses require providing the source with binaries, this isn’t true of most documentation licenses. No Creative Commons license protects the practical freedom to modify, due to their focus on non-textual works. The GNU FDL does, and unlike software licenses it also requires the source to be in a free format.
The program-data dualism
Most of the above cases suggest that source code access is needed only for programs, not for data. This isn’t true, and the distinction is not strict enough to be a useful criterion.
TrueType fonts are both programs and data. The PostScript page description language and typesetting systems based on TeX use Turing-complete programming languages for formatting documents, which sometimes do contain nontrivial programs. Scripts describing events (and dialogue) in games are programs.
There is another difference between these works and compiled C programs: they work on multiple architectures. This is not a sufficient criterion for requiring sources, since we do not consider Java programs distributed as class files without source free, even though they run on all architectures with a Java virtual machine. Architecture-specific binaries do, however, make building distribution packages for unpopular architectures like MIPS a more useful way of finding missing sources.
Version control
Most recent free software projects distribute their source in two ways: in distributed version control system repositories, and as archives of specific versions: tarballs, which often include generated files whose regeneration requires ‘special’ tools that not all Unix systems had.
For development, the source is obtained from the version control system, since it has the whole project history explaining why changes were made. For fulfilling source distribution requirements, the tarball is used. Does this mean that the tarball isn’t ‘the preferred form of the work for making modifications to it’?
Conclusions
We should provide the sources of the works that we make, since only for our own works do we know what the source is. The source should be kept in a public distributed version control system and include the tools needed to build all non-source files of the work.
Verifying whether software written by others has a source is harder. If you can edit it, then maybe it is free and what you have is a source. Don’t distribute software that you don’t use, since you don’t know whether it respects the freedom of its users.