Most software cannot be edited without a source, making source availability necessary for software freedom. Free GNU/Linux distributions have an explicit requirement to provide the sources of included software. Despite this, they include works without source. I believe this is acceptable in practice, although it restricts potential uses of the software and limits our ability to reason about software freedom.
The source
Section 1 of the GNU General Public License, version 3, defines the source code of a work as ‘the preferred form of the work for making modifications to it’. This definition is also used outside of the GPL. However, only the author of the program can know whether a given text is the source. C ‘source’ code is usually the source of the program compiled from it, but it isn’t if it was itself generated, for example from a Bison parser grammar. (Free software projects sometimes do accidentally omit the source for such files.)
Let’s simplify the issue: a source is a form of the work that a skilled user can reasonably modify. Some works, usually not C programs, are distributed in modifiable forms that might be compiled from forms that the author prefers for editing. (Some generated parsers do get modified, making GPL compliance for them slightly harder.)
(For GPL compliance there is a more important issue: identifying the corresponding source of a non-source work, which is certainly harder than deciding whether a work is itself a source. It is beyond the scope of this essay.)
I believe these issues are trivial in the case of C programs like the printer drivers that inspired the free software philosophy and rules. For other works, deciding whether a given form is the source is probably impossible once the software has been distributed.
Fonts
Fonts are ‘information for practical use’. They describe the shapes and metrics of letters and symbols; editing them is useful to support minority languages or special symbols needed in computer science. Most fonts today are vector (outline) fonts in formats like TrueType. Bitmap fonts have different practical and legal issues.
Legally, fonts are considered programs, although their description of glyph shapes merely lists points and the curves connecting them, with none of the features expected from a programming language. Editors like FontForge can edit TrueType fonts, though FontForge has its own native format, preferred for editing, whose conversion to TrueType is lossy.
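For illustration, such an outline can be pictured as plain data, sketched here in Python; the coordinates and structure are simplified, but TrueType really does store contours of points flagged as on-curve or off-curve quadratic Bézier controls:

    # A glyph outline as pure data: one closed contour of points,
    # each flagged on-curve (True) or off-curve (False), the latter
    # being quadratic Bezier control points. Coordinates invented.
    glyph_o = [
        [
            (100, 0, True),     # on-curve point
            (0, 100, False),    # off-curve control point
            (100, 200, True),
            (200, 100, False),  # contour closes back to the first point
        ],
    ]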
Hinting in TrueType contains ‘real’ programs that adapt these shapes to low-resolution grids, making them legible on screen. These programs are distributed in a Turing-complete, assembly-like language interpreted by a stack-based virtual machine. Tools like Xgridfit can compile a higher-level language into these programs. The other popular font formats, PostScript Type 1 and its derivatives, use high-level ‘hints’ like positions of stems and standard heights that the rasterizer uses in unspecified ways to grid-fit the glyph.
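As a minimal sketch of what such an interpreter does, in Python: the opcodes below are invented for illustration and are not the real TrueType instruction set, though TrueType bytecode is likewise stack-based and works on 26.6 fixed-point coordinates:

    # A toy stack machine in the spirit of TrueType hinting.
    def run(program):
        stack = []
        for op in program:
            if isinstance(op, int):        # push a literal, like PUSHB/PUSHW
                stack.append(op)
            elif op == "DUP":
                stack.append(stack[-1])
            elif op == "ADD":
                b, a = stack.pop(), stack.pop()
                stack.append(a + b)
            elif op == "ROUND":            # snap a 26.6 fixed-point value
                stack.append(round(stack.pop() / 64) * 64)  # to the pixel grid
            else:
                raise ValueError("unknown opcode: " + op)
        return stack

    # Round a coordinate of 150/64 (about 2.34) pixels to the grid:
    print(run([150, "ROUND"]))             # -> [128], i.e. exactly 2 pixels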
While there is some benefit in editing the source instead of TrueType files, the situation is much different for meta-fonts. The Computer Modern project, developed by Donald E. Knuth for use with TeX, consists of programs that use 62 parameters to generate 96 fonts. Modern technologies require drawing every font separately, while the same meta-font program describes e.g. a Roman letter for all fonts that contain it and doesn’t need many changes for new fonts. Meta-fonts make it possible to produce a separate set of fonts in a much different style for a single book, or to change gradually between two different fonts within a single article. (I have made a narrow sans-serif monospace style for a Computer Modern derivative in several hours. It is not published due to a licensing issue.)
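The idea can be sketched in a few lines of Python (not MetaFont’s own language; the parameters and the crude stroke model are invented for this illustration):

    # One glyph description, many fonts from different parameters.
    def letter_l(stem_width, x_height, serif_length):
        strokes = [("stem", (0, 0), (0, x_height), stem_width)]
        if serif_length > 0:               # serifs vanish in a sans style
            strokes.append(("serif", (-serif_length, 0),
                            (serif_length, 0), stem_width // 2))
        return strokes

    roman     = letter_l(stem_width=28, x_height=430, serif_length=20)
    sans_bold = letter_l(stem_width=50, x_height=450, serif_length=0)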
However, there are almost no other uses of meta-fonts as effective as this one. MetaFont, the program that interprets Computer Modern, generates device-specific bitmaps with no Unicode support. All programs that compile meta-fonts to outline font formats either trace bitmaps produced by MetaFont (resulting in big and unoptimized fonts) or generate outlines directly without supporting important features used in Computer Modern. Recent meta-font projects that rebuild their sources from generated outline fonts, or do not publish sources at all, do not suggest that this is a successful style today.
Hyphenation patterns
While some languages have reliable rules for hy-phen-a-tion, in English it was traditionally done using dictionaries of hyphenated words. This approach has significant problems, which were solved by Franklin Liang’s hyphenation algorithm used in TeX: it generates rule-like hyphenation patterns from a dictionary. The 4447 patterns generated from a non-public dictionary allow TeX to recognize 89.3% of the hyphens in the dictionary’s words.
The patterns are subwords with multiple levels of hyphens to be added or removed: an odd digit between letters allows a hyphen at that position, an even digit forbids one, and the highest level found wins. The word ‘hyphenation’ is hyphenated using hy3ph, he2n, hena4 and six other patterns, resulting in ‘hy-phen-ation’. (Not all hyphens are found; this will be fixed by future dictionaries that use TeX to derive their hyphens.)
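The matching step is simple enough to sketch in Python (the real TeX implementation walks a packed trie and also enforces minimum distances from the word boundaries; hen5at below is one of the ‘six other patterns’):

    # Liang's pattern application: digits between letters are levels.
    # At each position the highest level wins; odd allows a hyphen,
    # even forbids one. '.' marks the word boundaries.
    def hyphenate(word, patterns):
        parsed = []
        for pat in patterns:               # split "hy3ph" into letters/levels
            letters, levels = "", [0]
            for ch in pat:
                if ch.isdigit():
                    levels[-1] = int(ch)
                else:
                    letters += ch
                    levels.append(0)
            parsed.append((letters, levels))

        text = "." + word.lower() + "."
        points = [0] * (len(text) + 1)     # points[i] sits before text[i]
        for letters, levels in parsed:
            start = text.find(letters)
            while start != -1:
                for i, level in enumerate(levels):
                    points[start + i] = max(points[start + i], level)
                start = text.find(letters, start + 1)

        out = ""
        for i, ch in enumerate(word):      # word[i] is text[i + 1]
            if i > 0 and points[i + 1] % 2 == 1:
                out += "-"
            out += ch
        return out

    print(hyphenate("hyphenation", ["hy3ph", "he2n", "hena4", "hen5at"]))
    # -> hy-phen-ation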
The same algorithm is used for multiple other
languages with different patterns.
They are usually generated from dictionaries that are restricted by copyright and not available to users. Some languages have patterns distributed with the source dictionary. (I believe patterns could be easily written by hand for a language with reliable hyphenation rules that depend only on the characters in words, although I haven’t seen any example of this.)
The patterns can be and are edited, while the source dictionaries would be more useful for developing other hyphenation algorithms. This makes the patterns ‘a source’, but not ‘the source’.
(Technically, TeX doesn’t use the patterns directly. INITeX loads macro definitions, hyphenation patterns and font metrics, and saves its memory into a format: a very build-specific file for fast loading by VIRTeX, the variant normally used to build documents. The format represents the patterns as a packed trie that is difficult to edit. VIRTeX does not support loading patterns, since compiling them needs extra memory and code; nowadays the same program is used for both purposes. Many other macro processors and Lisp implementations have a similar feature under a different name.)
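The trie itself is easy to picture in Python as nested dictionaries (a sketch; TeX instead packs it into flat arrays inside the format file, which is exactly what makes it fast to load and hard to edit):

    # Patterns keyed by their letters; the full pattern (with its
    # digit levels) is kept at the node where its letters end.
    def build_trie(patterns):
        root = {}
        for pat in patterns:
            node = root
            for ch in pat:
                if not ch.isdigit():
                    node = node.setdefault(ch, {})
            node["pattern"] = pat
        return root

    trie = build_trie(["hy3ph", "he2n", "hena4"])
    assert trie["h"]["y"]["p"]["h"]["pattern"] == "hy3ph"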
Game data
Video games are a bigger source of binary data. Many contain bitmaps or animations made using 3D rendering software from unpublished sources. Some games, like Flight of the Amazon Queen, are published as a single binary with no source and no tools for editing it. (A Trisquel users forum thread about this game originally motivated me to write this essay.)
This game has another interesting issue: a license that forbids selling it alone but allows selling it in larger software distributions. Well-known free licenses for fonts, like the SIL Open Font License, have the same restriction. The restriction is ‘useless’, since distributing the work with a Hello World program is allowed, and this is what makes it a free software license.
The lack of both source and tools to edit it is more interesting. The Debian package includes an explanation of its compatibility with the DFSG. The binary is the ‘preferred form for modification’, and the loss of the tools for editing it made modifications equally hard for both Debian users and the authors of the game. This is consistent with the source requirement being meant to prevent authors from having a monopoly over their works (this explanation looks equivalent to the user’s-freedom argument).
In GNU/Linux distributions endorsed by the FSF this is not an issue. Game data is considered non-functional, and the only permission required is to distribute unmodified copies. (Debian excludes from its main repository games that these distributions include, while they exclude games that other distributions include. The first commonly happens because of a missing data source or modification permission, the second because of restrictions on commercial distribution.)
Documentation
Documentation of free software should be free, so that it can be shared with the software and updated for modified versions. Most documentation is distributed as HTML or PDF files, which are usually generated from various other markup languages. Not all such documentation has a published source, and sometimes even the software source is distributed with the binary documentation only. (Sourceless PDFs often use nonfree fonts too.)
HTML can be edited and often is the source, while in other cases it is compiled from sources that preserve more semantic information about the document and have better printing support. For this reason we should not consider HTML the source if the author has a source from which it is compiled. But can we know this?
While the most popular free software licenses require providing the source with binaries, this isn’t true for most documentation licenses. No Creative Commons license protects the practical freedom to modify, due to their focus on non-textual works. The GNU FDL does, and unlike software licenses it also requires the source to be in a free format.
The program-data dualism
Most of the above cases suggest that source code access is needed only for programs, not for data. This isn’t true, and the distinction is not strict enough to be a useful criterion.
TrueType fonts are both programs and data. The PostScript page description language and TeX-based typesetting systems use Turing-complete programming languages for formatting documents, and these documents sometimes do contain nontrivial programs. Scripts describing events (and dialogue) in games are programs.
There is another difference between these works and compiled C programs: they work on multiple architectures. This is not a sufficient criterion for requiring sources, since we do not consider Java programs distributed as class files without source free, although they run on all architectures supported by Java virtual machines. However, because ordinary binaries are architecture-specific, building distribution packages for unpopular architectures like MIPS is a useful way of finding missing sources.
Version control
Most recent free software projects distribute the source in two ways: in distributed version control system repositories and as archives of specific versions: tarballs, which often include generated files whose builds require ‘special’ tools that not all Unix systems had. For development, the source is obtained from the version control system, since it has the whole project history explaining why changes were made. For fulfilling source distribution requirements, the tarball is used. Does this mean that the tarball isn’t ‘the preferred form of the work for making modifications to it’?
Conclusions
We should provide the sources of the works we make, since only in this case do we know that what we provide is the source. The source should be kept in a public, distributed version control system and include the tools to build all non-source files of the work.
Verifying whether software written by others has a source is harder. If you can edit it, then maybe it is free and what you have is a source. Don’t distribute software that you don’t use, since you don’t know whether it respects the freedom of its users.