Table of Contents
Preface
PART 1: Overview
Chapter 1. An Overview of Perl
Getting Started
Natural and Artificial Languages
An Average Example
Filehandles
Operators
Control Structures
Regular Expressions
List Processing
What You Don't Know Won't Hurt You (Much)
PART 2: The Gory Details
Chapter 2. Bits and Pieces
Atoms
Molecules
Built-in Data Types
Variables
Names
Scalar Values
Context
List Values and Arrays
Hashes
Typeglobs and Filehandles
Input Operators
Chapter 3. Unary and Binary Operators
Terms and List Operators (Leftward)
The Arrow Operator
Autoincrement and Autodecrement
Exponentiation
Ideographic Unary Operators
Binding Operators
Multiplicative Operators
Additive Operators
Shift Operators
Named Unary and File Test Operators
Relational Operators
Equality Operators
Bitwise Operators
C-Style Logical (Short-Circuit) Operators
Range Operator
Conditional Operator
Assignment Operators
Comma Operators
List Operators (Rightward)
Logical and, or, not, and xor
C Operators Missing from Perl
Chapter 4. Statements and Declarations
Simple Statements
Compound Statements
if and unless Statements
Loop Statements
Bare Blocks
goto
Global Declarations
Scoped Declarations
Pragmas
Chapter 5. Pattern Matching
The Regular Expression Bestiary
Pattern-Matching Operators
Metacharacters and Metasymbols
Character Classes
Quantifiers
Positions
Capturing and Clustering
Alternation
Staying in Control
Fancy Patterns
Chapter 6. Subroutines
Syntax
Semantics
Passing References
Prototypes
Subroutine Attributes
Chapter 7. Formats
Format Variables
Footers
Chapter 8. References
What Is a Reference?
Creating References
Using Hard References
Symbolic References
Braces, Brackets, and Quoting
Chapter 9. Data Structures
Arrays of Arrays
Hashes of Arrays
Arrays of Hashes
Hashes of Hashes
Hashes of Functions
More Elaborate Records
Saving Data Structures
Chapter 10. Packages
Symbol Tables
Autoloading
Chapter 11. Modules
Using Modules
Creating Modules
Overriding Built-in Functions
Chapter 12. Objects
Brief Refresher on Object-Oriented Lingo
Perl's Object System
Method Invocation
Object Construction
Class Inheritance
Instance Destructors
Managing Instance Data
Managing Class Data
Summary
Chapter 13. Overloading
The overload Pragma
Overload Handlers
Overloadable Operators
The Copy Constructor (=)
When an Overload Handler Is Missing (nomethod and fallback)
Overloading Constants
Public Overload Functions
Inheritance and Overloading
Run-Time Overloading
Overloading Diagnostics
Chapter 14. Tied Variables
Tying Scalars
Tying Arrays
Tying Hashes
Tying Filehandles
A Subtle Untying Trap
Tie Modules on CPAN
PART 3: Perl as Technology
Chapter 15. Unicode
Building Character
Effects of Character Semantics
Caution, \[ren2bold] Working
Chapter 16. Interprocess Communication
Signals
Files
Pipes
System V IPC
Sockets
Chapter 17. Threads
The Process Model
The Thread Model
Chapter 18. Compiling
The Life Cycle of a Perl Program
Compiling Your Code
Executing Your Code
Compiler Backends
Code Generators
Code Development Tools
Avant-Garde Compiler, Retro Interpreter
Chapter 19. The Command-Line Interface
Command Processing
Environment Variables
Chapter 20. The Perl Debugger
Using the Debugger
Debugger Commands
Debugger Customization
Unattended Execution
Debugger Support
The Perl Profiler
Chapter 21. Internals and Externals
How Perl Works
Internal Data Types
Extending Perl (Using C from Perl)
Embedding Perl (Using Perl from C)
The Moral of the Story
PART 4: Perl as Culture
Chapter 22. CPAN
The CPAN modules Directory
Using CPAN Modules
Creating CPAN Modules
Chapter 23. Security
Handling Insecure Data
Handling Timing Glitches
Handling Insecure Code
Chapter 24. Common Practices
Common Goofs for Novices
Efficiency
Programming with Style
Fluent Perl
Program Generation
Chapter 25. Portable Perl
Newlines
Endianness and Number Width
Files and Filesystems
System Interaction
Interprocess Communication (IPC)
External Subroutines (XS)
Standard Modules
Dates and Times
Internationalization
Style
Chapter 26. Plain Old Documentation
Pod in a Nutshell
Pod Translators and Modules
Writing Your Own Pod Tools
Pod Pitfalls
Documenting Your Perl Programs
Chapter 27. Perl Culture
History Made Practical
Perl Poetry
PART 5: Reference Material
Chapter 28. Special Names
Special Names Grouped by Type
Special Variables in Alphabetical Order
Chapter 29. Functions
Perl Functions by Category
Perl Functions in Alphabetical Order
Chapter 30. The Standard Perl Library
Library Science
A Tour of the Perl Library
Chapter 31. Pragmatic Modules
use attributes
use autouse
use base
use blib
use bytes
use charnames
use constant
use diagnostics
use fields
use filetest
use integer
use less
use lib
use locale
use open
use overload
use re
use sigtrap
use strict
use subs
use vars
use warnings
Chapter 32. Standard Modules
Listings by Type
Benchmark
Carp
CGI
CGI::Carp
Class::Struct
Config
CPAN
Cwd
Data::Dumper
DB_File
Dumpvalue
English
Errno
Exporter
Fatal
Fcntl
File::Basename
File::Compare
File::Copy
File::Find
File::Glob
File::Spec
File::stat
File::Temp
FileHandle
Getopt::Long
Getopt::Std
IO::Socket
IPC::Open2
IPC::Open3
Math::BigInt
Math::Complex
Math::Trig
Net::hostent
POSIX
Safe
Socket
Symbol
Sys::Hostname
Sys::Syslog
Term::Cap
Text::Wrap
Time::Local
Time::localtime
User::grent
User::pwent
Chapter 33. Diagnostic Messages
Glossary
Index
Read an Excerpt
Chapter 18: Compiling
If you came here looking for a Perl compiler, you may be surprised to discover that you already have one--your perl program (typically /usr/bin/perl) already contains a Perl compiler. That might not be what you were thinking, and if it wasn't, you may be pleased to know that we do also provide code generators (which some well-meaning folks call "compilers"), and we'll discuss those toward the end of this chapter. But first we want to talk about what we think of as The Compiler. Inevitably there's going to be a certain amount of low-level detail in this chapter that some people will be interested in, and some people will not. If you find that you're not, think of it as an opportunity to practice your speed-reading skills.
Imagine that you're a conductor who's ordered the score for a large orchestral work. When the box of music arrives, you find several dozen booklets, one for each member of the orchestra with just their part in it. But curiously, your master copy with all the parts is missing. Even more curiously, the parts you do have are written out using plain English instead of musical notation. Before you can put together a program for performance, or even give the music to your orchestra to play, you'll first have to translate the prose descriptions into the normal system of notes and bars. Then you'll need to compile the individual parts into one giant score so that you can get an idea of the overall program.
Similarly, when you hand the source code of your Perl script over to perl to execute, it is no more useful to the computer than the English description of the symphony was to the musicians. Before your program can run, Perl needs to compile these English-looking directions into a special symbolic representation. Your program still isn't running, though, because the compiler only compiles. Like the conductor's score, even after your program has been converted to an instruction format suitable for interpretation, it still needs an active agent to interpret those instructions.
The Life Cycle of a Perl Program
You can break up the life cycle of a Perl program into four distinct phases, each with separate stages of its own. The first and the last are the most interesting ones, and the middle two are optional. The stages are depicted in Figure 18.1.
The Compilation Phase
During phase 1, the compile phase, the Perl compiler converts your program into a data structure called a parse tree. Along with the standard parsing techniques, Perl employs a much more powerful one: it uses BEGIN blocks to guide further compilation. BEGIN blocks are handed off to the interpreter to be run as soon as they are parsed, which effectively runs them in FIFO order (first in, first out). This includes any use and no declarations; these are really just BEGIN blocks in disguise. Any CHECK, INIT, and END blocks are scheduled by the compiler for delayed execution.
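To make the "BEGIN blocks in disguise" remark concrete, here is a small sketch that writes out by hand roughly what a use declaration does. List::Util and its sum function are chosen purely for illustration; any module with an import list would serve.

```perl
use strict;
use warnings;

# Spelled-out form of:  use List::Util 'sum';
BEGIN {
    require List::Util;
    List::Util->import('sum');
}

# Calling sum without parentheses only parses because the import
# already happened while this file was still being compiled.
my $total = sum 1, 2, 3;
print "total: $total\n";
```

If the require and import were moved out of the BEGIN block, the compiler would reach the unparenthesized call before sum had been imported and would reject it.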
Lexical declarations are noted, but assignments to them are not executed. All eval BLOCKs, s///e constructs, and noninterpolated regular expressions are compiled here, and constant expressions are pre-evaluated. The compiler is now done, unless it gets called back into service later. At the end of this phase, the interpreter is again called up to execute any scheduled CHECK blocks in LIFO order (last in, first out). The presence or absence of a CHECK block determines whether we next go to phase 2 or skip over to phase 4.
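The difference between noting a declaration and executing its assignment can be observed from inside a BEGIN block. The following is a minimal sketch; the variable names are made up for the example.

```perl
use strict;
use warnings;

our $state_at_compile_time;

my $greeting = "hello";    # declaration noted now, assignment runs later

BEGIN {
    # Runs as soon as it is parsed: $greeting exists but holds no value yet.
    $state_at_compile_time = defined $greeting ? "defined" : "undefined";
}

print "during compilation: $state_at_compile_time\n";   # undefined
print "at run time: $greeting\n";                       # hello
```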
The Code Generation Phase (optional)
CHECK blocks are installed by code generators, so this optional phase occurs when you explicitly use one of the code generators (described later in "Code Generators"). These convert the compiled (but not yet run) program into either C source code or serialized Perl bytecodes--a sequence of values expressing internal Perl instructions. If you choose to generate C source code, it can eventually produce a file called an executable image in native machine language.[2]
At this point, your program goes into suspended animation. If you made an executable image, you can go directly to phase 4; otherwise, you need to reconstitute the freeze-dried bytecodes in phase 3.
The Parse Tree Reconstruction Phase (optional)
To reanimate the program, its parse tree must be reconstructed. This phase exists only if code generation occurred and you chose to generate bytecode. Perl must first reconstitute its parse trees from that bytecode sequence before the program can run. Perl does not run directly from the bytecodes; that would be slow.
The Execution Phase
Finally, what you've all been waiting for: running your program. Hence, this is also called the run phase. The interpreter takes the parse tree (which it got either directly from the compiler or indirectly from code generation and subsequent parse tree reconstruction) and executes it. (Or, if you generated an executable image file, it can be run as a standalone program since it contains an embedded Perl interpreter.)
At the start of this phase, before your main program gets to run, all scheduled INIT blocks are executed in FIFO order. Then your main program is run. The interpreter can call back into the compiler as needed upon encountering an eval STRING, a do FILE or require statement, an s///ee construct, or a pattern match with an interpolated variable that is found to contain a legal code assertion.
When your main program finishes, any delayed END blocks are finally executed, this time in LIFO order. The very first one seen will execute last, and then you're done. (END blocks are skipped only if you exec or your process is blown away by an uncaught catastrophic error. Ordinary exceptions are not considered catastrophic.)
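The ordering rules for the four kinds of special blocks are easier to remember after watching them run. This sketch records each block as it fires; the labels are arbitrary, and the behavior shown assumes the file is run directly as a script, since CHECK and INIT blocks are not run for code that is compiled later through eval STRING or require.

```perl
use strict;
use warnings;

our @order;

# The first END block seen runs last, so it is the one that reports.
END   { push @order, 'END 1'; print join(', ', @order), "\n" }
INIT  { push @order, 'INIT 1' }
CHECK { push @order, 'CHECK 1' }
BEGIN { push @order, 'BEGIN 1' }

END   { push @order, 'END 2' }
INIT  { push @order, 'INIT 2' }
CHECK { push @order, 'CHECK 2' }
BEGIN { push @order, 'BEGIN 2' }

push @order, 'main';

# Run as a script, this prints:
# BEGIN 1, BEGIN 2, CHECK 2, CHECK 1, INIT 1, INIT 2, main, END 2, END 1
```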
Now we'll discuss these phases in greater detail, and in a different order.
Compiling Your Code
Perl is always in one of two modes of operation: either it is compiling your program, or it is executing it--never both at the same time. Throughout this book, we refer to certain events as happening at compile time, or we say that "the Perl compiler does this and that". At other points, we mention that something else occurs at run time, or that "the Perl interpreter does this and that". Although you can get by with thinking of both the compiler and interpreter as simply "Perl", understanding which of these two roles Perl is playing at any given point is essential to understanding why many things happen as they do. The perl executable implements both roles: first the compiler, then the interpreter. (Other roles are possible, too; perl is also an optimizer and a code generator. Occasionally, it's even a trickster--but all in good fun.)
It's also important to understand the distinction between compile phase and compile time, and between run phase and run time. A typical Perl program gets one compile phase, and then one run phase. A "phase" is a large-scale concept. But compile time and run time are small-scale concepts. A given compile phase does mostly compile-time stuff, but it also does some run-time stuff via BEGIN blocks. A given run phase does mostly run-time stuff, but it can do compile-time stuff through operators like eval STRING.
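One way to see the difference between a phase and a time is to ask Perl which phase it is in at various moments. The sketch below relies on the ${^GLOBAL_PHASE} variable, which postdates the Perl release this chapter was written for (it assumes Perl 5.14 or later); the log messages are invented for the example.

```perl
use strict;
use warnings;

our @log;

# Run-time work during the compile phase.
BEGIN { push @log, 'BEGIN in file, phase ' . ${^GLOBAL_PHASE} }

# Ordinary run-time work during the run phase.
push @log, 'main code, phase ' . ${^GLOBAL_PHASE};

# Compile-time work during the run phase: the string is compiled only now,
# and its BEGIN block fires the moment it is parsed.
eval q{
    BEGIN { push @log, 'BEGIN in eval STRING, phase ' . ${^GLOBAL_PHASE} }
    1;
} or die $@;

print "$_\n" for @log;
```

When the file is run as a script, the first entry reports the START phase (Perl's name for the main compile phase) and the other two report RUN, even though the last one came from a BEGIN block.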
In the typical course of events, the Perl compiler reads through your entire program source before execution starts. This is when Perl parses the declarations, statements, and expressions to make sure they're syntactically legal. [3] If it finds a syntax error, the compiler attempts to recover from the error so it can report any other errors later in the source. Sometimes this works, and sometimes it doesn't; syntax errors have a noisy tendency to trigger a cascade of false alarms. Perl bails out in frustration after about 10 errors.
In addition to the interpreter that processes the BEGIN blocks, the compiler processes your program with the connivance of three notional agents. The lexer scans for each minimal unit of meaning in your program. These are sometimes called "lexemes", but you'll more often hear them referred to as tokens in texts about programming languages. The lexer is sometimes called a tokener or a scanner, and what it does is sometimes called lexing or tokenizing. The parser then tries to make sense out of groups of these tokens by assembling them into larger constructs, such as expressions and statements, based on the grammar of the Perl language. The optimizer rearranges and reduces these larger groupings into more efficient sequences. It picks its optimizations carefully, not wasting time on marginal optimizations, because the Perl compiler has to be blazing fast when used as a load-and-go compiler.
This doesn't happen in independent stages, but all at once with a lot of cross talk between the agents. The lexer occasionally needs hints from the parser to know which of several possible token types it's looking at. (Oddly, lexical scope is one of the things the lexical analyzer doesn't understand, because that's the other meaning of "lexical".) The optimizer also needs to keep track of what the parser is doing, because some optimizations can't happen until the parse has reached a certain point, like finishing an expression, statement, block, or subroutine.
You may think it odd that the Perl compiler does all these things at once instead of one after another, but it's really just the same messy process you go through to understand natural language on the fly, while you're listening to it or reading it. You don't wait till the end of a chapter to figure out what the first sentence meant. You could think of the following correspondences:
Assuming the parse goes well, the compiler deems your input a valid story, er, program. If you use the -c switch when running your program, it prints out a "syntax OK" message and exits. Otherwise, the compiler passes the fruits of its efforts on to other agents. These "fruits" come in the form of a parse tree. Each fruit on the tree--or node, as it's called--represents one of Perl's internal opcodes, and the branches on the tree represent that tree's historical growth pattern. Eventually, the nodes will be strung together linearly, one after another, to indicate the execution order in which the run-time system will visit those nodes.
Each opcode is the smallest unit of executable instruction that Perl can think about. You might see an expression like $a = -($b + $c) as one statement, but Perl thinks of it as six separate opcodes. Laid out in a simplified format, the parse tree for that expression would look like Figure 18.2. The numbers represent the visitation order that the Perl run-time system will eventually follow.
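If you would like to look at the opcodes for that expression yourself, the B::Concise module shipped with later versions of Perl can list them. This is only a sketch: the subroutine name is hypothetical, and the exact listing varies between Perl versions and includes bookkeeping opcodes beyond the ones counted above.

```perl
use strict;
use warnings;
use B::Concise ();

our ($a, $b, $c) = (0, 2, 3);

# Wrap the expression in a sub so there is a code reference to inspect.
sub negate_sum { $a = -($b + $c) }

# Capture the listing in a string rather than printing straight to STDOUT.
B::Concise::walk_output(\my $listing);

# -exec lists the opcodes in execution order instead of as a tree.
B::Concise::compile('-exec', \&negate_sum)->();

print $listing;
```

Among the lines printed you should find the add and negate opcodes, with the variable fetches appearing before the operations that consume them.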
Perl isn't a one-pass compiler as some might imagine. (One-pass compilers are great at making things easy for the computer and hard for the programmer.) It's really a multipass, optimizing compiler consisting of at least three different logical passes that are interleaved in practice. Passes 1 and 2 run alternately as the compiler repeatedly scurries up and down the parse tree during its construction; pass 3 happens whenever a subroutine or file is completely parsed. Here are those passes:
- Pass 1: Bottom-Up Parsing
During this pass, the parse tree is built up by the yacc(1) parser using the tokens it's fed from the underlying lexer (which could be considered another logical pass in its own right). Bottom-up just means that the parser knows about the leaves of the tree before it knows about its branches and root. It really does figure things out from bottom to top in Figure 18.2, since we drew the root at the top, in the idiosyncratic fashion of computer scientists. (And linguists.)
As each opcode node is constructed, per-opcode sanity checks verify correct semantics, such as the correct number and types of arguments used to call built-in functions. As each subsection of the tree takes shape, the optimizer considers what transformations it can apply to the entire subtree now beneath it. For instance, once it knows that a list of values is being fed to a function that takes a specific number of arguments, it can throw away the opcode that records the number of arguments for functions that take a varying number of arguments. A more important optimization, known as constant folding, is described later in this section.
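The per-opcode argument checks can be seen at work by compiling a subroutine that is never called. In this sketch (the subroutine name is made up) the code lives in a string so that the compilation failure can be trapped and reported.

```perl
use strict;
use warnings;

# The sub is never called; the mistake is caught while the string is compiled.
my $compiled = eval q{
    sub never_called { return ord("a", "b") }   # ord takes a single argument
    1;
};

print "compiled: ", ($compiled ? "yes" : "no"), "\n";
print "reason: $@" unless $compiled;
```

The reported reason is a "Too many arguments for ord" error, raised without any of the offending code having been executed.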
This pass also constructs the node visitation order used later for execution, which is a really neat trick because the first place to visit is almost never the top node. The compiler makes a temporary loop of opcodes, with the top node pointing to the first opcode to visit. When the top-level opcode is incorporated into something bigger, that loop of opcodes is broken, only to make a bigger loop with the new top node. Eventually the loop is broken for good when the start opcode gets poked into some other structure such as a subroutine descriptor. The subroutine caller can still find that first opcode despite its being way down at the bottom of the tree, as it is in Figure 18.2. There's no need for the interpreter to recurse back down the parse tree to figure out where to start.
- Pass 2: Top-Down Optimizer
A person reading a snippet of Perl code (or of English code, for that matter) cannot determine the context without examining the surrounding lexical elements. Sometimes you can't decide what's really going on until you have more information. Don't feel bad, though, because you're not alone: neither can the compiler. In this pass, the compiler descends back down the subtree it's just built to apply local optimizations, the most notable of which is context propagation. The compiler marks subjacent nodes with the appropriate contexts (void, scalar, list, reference, or lvalue) imposed by the current node. Unwanted opcodes are nulled out but not deleted, because it's now too late to reconstruct the execution order. We'll rely on the third pass to remove them from the provisional execution order determined by the first pass.
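The context that gets imposed on a subroutine call is visible to the called code through wantarray. The following sketch, with a hypothetical report_context subroutine, shows the same call receiving three different contexts depending only on its surroundings.

```perl
use strict;
use warnings;

my @seen;

sub report_context {
    push @seen, wantarray          ? 'list'
              : defined(wantarray) ? 'scalar'
              :                      'void';
    return;
}

my @list   = report_context();   # list context
my $scalar = report_context();   # scalar context
report_context();                # void context

print "@seen\n";   # list scalar void
```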
- Pass 3: Peephole Optimizer
Certain units of code have their own storage space in which they keep lexically scoped variables. (Such a space is called a scratchpad in Perl lingo.) These units include eval STRINGs, subroutines, and entire files. More importantly from the standpoint of the optimizer, they each have their own entry point, which means that while we know the execution order from here on, we can't know what happened before, because the construct could have been called from anywhere. So when one of these units is done being parsed, Perl runs a peephole optimizer on that code. Unlike the previous two passes, which walked the branch structure of the parse tree, this pass traverses the code in linear execution order, since this is basically the last opportunity to do so before we cut the opcode list off from the parser. Most optimizations were already performed in the first two passes, but some can't be.
Assorted late-term optimizations happen here, including stitching together the final execution order by skipping over nulled out opcodes, and recognizing when various opcode juxtapositions can be reduced to something simpler. The recognition of chained string concatenations is one important optimization, since you'd really like to avoid copying a string back and forth each time you add a little bit to the end. This pass doesn't just optimize; it also does a great deal of "real" work: trapping barewords, generating warnings on questionable constructs, checking for code unlikely to be reached, resolving pseudohash keys, and looking for subroutines called before their prototypes had been compiled.
- Pass 4: Code Generation
This pass is optional; it isn't used in the normal scheme of things. But if any of the three code generators -- B::Bytecode, B::C, and B::CC -- are invoked, the parse tree is accessed one final time. The code generators emit either serialized Perl bytecodes used to reconstruct the parse tree later or literal C code representing the state of the compile-time parse tree.
Generation of C code comes in two different flavors. B::C simply reconstructs the parse tree and runs it using the usual runops() loop that Perl itself uses during execution. B::CC produces a linearized and optimized C equivalent of the run-time code path (which resembles a giant jump table) and executes that instead.
During compilation, Perl optimizes your code in many, many ways. It rearranges code to make it more efficient at execution time. It deletes code that can never be reached during execution, like an if (0) block, or the elsifs and the else in an if (1) block. If you use lexically typed variables declared with my ClassName $var or our ClassName $var, and the ClassName package was set up with the use fields pragma, accesses to constant fields from the underlying pseudohash are typo-checked at compile time and converted into array accesses instead. If you supply the sort operator with a simple enough comparison routine, such as {$a <=> $b} or {$b cmp $a}, this is replaced by a call to compiled C code.
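One way to check that unreachable code really is deleted is to turn a compiled subroutine back into source text with the B::Deparse module that ships with later versions of Perl. This is a sketch with a made-up subroutine; the exact text that B::Deparse prints in place of the dead statement differs between Perl versions.

```perl
use strict;
use warnings;
use B::Deparse;

sub answer {
    if (0) { print "never printed\n" }
    return 42;
}

# Turn the compiled sub back into source text.
my $source = B::Deparse->new->coderef2text(\&answer);
print $source, "\n";
```

The regenerated source still returns 42, but the print statement from the if (0) block is nowhere to be found.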
Perl's most dramatic optimization is probably the way it resolves constant expressions as soon as possible. For example, consider the parse tree shown in Figure 18.2. If nodes 1 and 2 had both been literals or constant functions, nodes 1 through 4 would have been replaced by the result of that computation, something like Figure 18.3....
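Constant folding can be observed the same way, by deparsing a compiled subroutine. In this sketch the WIDTH constant and the padded_area subroutine are invented for illustration; the constant is declared with use constant, which makes it one of the constant functions mentioned above.

```perl
use strict;
use warnings;
use B::Deparse;
use constant WIDTH => 4;

sub padded_area { return WIDTH * 3 + 1 }

# The arithmetic has already been done by the time the sub is compiled.
my $folded = B::Deparse->new->coderef2text(\&padded_area);
print $folded, "\n";
```

The regenerated source contains a bare return of 13, with no trace of WIDTH or of the multiplication and addition.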
[2] Your original script is an executable file too, but it's not machine language, so we don't call it an image. An image file is called that because it's a verbatim copy of the machine codes your CPU knows how to execute directly.
[3] No, there's no formal syntax diagram like a BNF, but you're welcome to peruse the perly.y file in the Perl source tree, which contains the yacc(1) grammar Perl uses. We recommend that you stay out of the lexer, which has been known to induce eating disorders in lab rats.