Random number generator

From Citizendium
Revision as of 22:51, 26 November 2010 by imported>Sandy Harris (→‎Random devices)
Jump to navigation Jump to search
This article is developing and not approved.
Main Article
Discussion
Related Articles  [?]
Bibliography  [?]
External Links  [?]
Citable Version  [?]
 
This editable Main Article is under development and subject to a disclaimer.

While defining true randomness can drift from computer science and mathematics into philosophy, Bruce Schneier has a reasonable criterion for a source of random numbers: "It cannot be reliably reproduced."[1] This article focuses first on the idea of randomness, rather than on how the random or pseudo-random sequence is produced. For properly selected applications, it may be possible to be adequately random with a technique that does not depend on a true random physical process. In other cases, it may be practical to use a combination of physical and computational methods.

In any of these cases, wise implementers do not assume randomness or even adequate pseudo-randomness. Even when the generator is an apparently random physical process, the motto "trust, but verify" still holds because some physical phenomena may indeed be random in the short- or long-term, but the nature of the physical resource (e.g., the declining radioactivity of a source) may affect its properties. It may be possible to compensate for a weak area, but only if it is known. See #Testing for Randomness.

Choosing random quantities to foil a resourceful and motivated adversary is surprisingly difficult. — IETF Best Current Practice 106[2]

A truly random one-time pad may be generated with a combination of measurement of a random physical phenomenon (e.g., thermal noise or radioactive disintegrations), which is then captured by a computer and put into high-density storage. An expert in the physical phenomenon being measured needs to be consulted to determine if postprocessing is needed.

There are a wide range of applications for random or pseudo-random numbers, with various degrees of randomness. A computer game can give an apparently different scenario whenever played, with some simple fairly random physical inputs, such as selected bits from the computer's time of day clock, and, perhaps, the time between the last several keystrokes or mouse movements. In other cases, such as extremely critical cryptography, only the best uncorrelated physical random sequences will do. See related articles for examples of applications.

Manual generation

In the past, one-time pads were generated by typists who were told to type randomly, but, in practice, had patterns. Limited to manual methods, a not-unreasonable way is to put numbered balls in a bowl, having a blindfolded person take it out, the number recorded, the ball returned to the bowl, and the bowl shaken thoroughly before taking the next ball.

This is a point mostly of historical interest, but typist patterns did cause weakness in some one-time pads. It is hard to imagine why manual generation would be useful today.

One vendor offers dice with barcodes on them so that results can easily be scanned into a computer.

Random sequences from physical phenomena

Real-world physical phenomena often exhibit randomness, although some it is possible to be random in the short term, yet to exhibit some unexpected patterns or self-similarity. The number of cosmic rays striking an object at a given second is as random a process as is known. The number of disintegration events, in a radioactive material, is often considered random, but consider that if the amount of radioisotope is finite, as more and more half-lives pass, there will be fewer disintegrations. There is no simple answer to the question of whether a downward trend in disintegrations makes that sequence nonrandom.

Many natural sources of random bits may be defective in that they produce biased output bits (so that the probability of a one is different than the probability of a zero), or bits which are correlated with each other.

— Goldwasser & Bellare

[3]

There are, however, methods that can be used to postprocess such a source to remove bias and correlation.

One widespread technique is due to John von Neumann. Take the input bits in pairs, discard 11 or 00 pairs, output one if the input is 10 and zero if the input is 01. As long as the input bits are independent, this completely eliminates bias. For example, suppose the input is 90% ones. The chance of a 10 sequence is .9*.1 while that of a 01 sequence is .1*.9; the two are exactly equal so the output is unbiased. If the inputs are unbiased, then the four input combinations 00 01 10 and 11 are equally likely and on average you get two bits out for eight in. Any bias reduces this ratio; with large bias the technique can be quite inefficient. In our example with 90% bias, 81% of pairs are 11 and 1% are 00; these are thrown away. On average, 200 bits of input (100 pairs) give only 18 output bits. More problematic in some applications, there is some chance of getting a long run of 11 inputs and generating no output at all for an appreciable time. Also, if the input bits are correlated rather than independent, the technique works less well.

Another common technique is to apply a cryptographic hash to the input. Suppose we estimate that there is about 1 bit of randomness per input byte and our hash is SHA-1 with 160-bit output. Hash 160 input bytes and we should have about 160 bits of randomness; the whole hash output should be unbiased and adequately random. In practice, one would hash more than 160 bytes to give a safety margin, and might use only part of the hash output as random output to avoid giving an enemy enough information to determine the internal state of the hash.

In theory, any high-grade compression algorithm could be substituted for the hash provided any non-random headers are thrown away; well-compressed data is very close to random. However, this technique is rare in practice. To be confident that it was secure, it would need thorough cryptographic analysis of the compression methods; the general practice is therefore to just use a cryptographic hash that has already undergone analysis.

Various sources may be used to provide (partly) random inputs to the process. A geiger counter or a radio tuned so it only gets static are possible sources. A digital camera can be used, either pointed at some random physical process or pointed at a plain background to use its circuit noise. Silicon Graphics built one called lavarand using a video camera pointed at a group of lava lamps; the descendant lavarnd uses a digital camera with the lens cap on. A microphone can be used in much the same two ways.

The Turbid generator [4] is a sophisticated design using a sound card, without a microphone, for input. Other designs attempt to estimate or measure the input entropy, then include safety factors to allow for estimation errors. Turbid takes a different approach, proving a lower bound on the input entropy from the physical properties of the devices involved. From that and some relatively weak assumptions about properties of the hash function, it is possible to derive lower bounds on the output entropy. Parameters are chosen to make the bound 159.something bits of entropy per 160 output bits. The documentation talks of "smashing it up against the asymptote".

Hardware generators can be designed with internal sources of random bits. Thermal noise in diodes is often used. An alternative is to run two oscillators with different frequencies and without good frequency stability on either. The slower one acts as a clock, determining when to take the faster one's state as an output bit. The frequency ratio is typically around 100:1, so quite minor variations in the slow one have large effects on choice of data from the fast one. There is some research work on laser-based sources with very high data rates.

A real generator will often combine several of these techniques. For example, the RNGs built into some Intel chipsets [1] use two oscillators, and modulate the frequency of the slower one with thermal noise. The hardware includes a de-skewing operation based on Von Neumann's technique and the software driver uses SHA-1 hashing.

Random devices

Many operating systems provide a random device as a source of random numbers for applications. A program can just read the device when it needs random numbers. Typically input comes from events within the operating system kernel — keystroke timings, mouse data, disk access delays, and so on. If the machine has a hardware generator, its output is often mixed in to the random device as well so programs need only read the random device.

Generally a buffer is used, so the sequence is something like:

  1. mix some input into the buffer
  2. hash the buffer to get a substantial chunk of hash output
  3. use part of that as random output
  4. mix the remainder back into the buffer

A complication is that the process is asynchronous, with input data mixed in as it arrives and output produced as required; care must be taken to ensure that there is enough randomness to support the output. Generally two devices are implemented; one provides high quality random numbers for critical applications and will block (make the client program wait) if there has not been enough input entropy to justify the output. The other device never blocks but does not provide a strong guarantee of output quality.

All such devices use a hash to mix the data, and some use a cipher for output processing as well — the Yarrow [5] generator uses a block cipher in counter mode while the Open BSD device uses a stream cipher. The Linux device uses a second buffer and a second hash.

For a typical desktop system, these devices are generally adequate — requirements for random numbers are not heavy, and there is plenty of mouse, keyboard and disk activity. However, on other systems the random device may be over-stressed without hardware assistance. Consider a web server that supports many SSL connections or an IPsec gateway running many tunnels. These applications demand considerable quantities of high-grade random numbers but available inputs are limited — a server may not even have a keyboard or mouse and it might use a solid state disk or an intelligent RAID controller so no randomness from disk delay is available. Information from things like timing of network interrupts may be partly known to an attacker who monitors the network, so it can be used only with great caution. Such systems usually require hardware assistance to meet their randomness requirements.

Current chips from Intel, Via [2] and others provide a hardware RNG. On a machine without that, Turbid [4] may be an alternative; a server has little other use for a sound card but there is often one built into the motherboard or it may be easy to add one.

Pseudorandom number generators

In practical computing, it is convenient to use a pseudorandom number generator (PRNG), which produces a sufficiently varying sequence that, for example, a computer game driven with that generator appears to produce different characteristics each time that it is played. Nevertheless, most PRNGs that are operating system utilities, such as UNIX random(), will eventually repeat the sequence over a sufficiently long period of time. In addition, the programmer may have the choice of giving an explicit initialization value, which may be called a seed or a nonce, to the PRNG.

Most general-purpose PRNGs will produce the same sequence as long as they are given the same seed, which can even be useful for some software development purposes (see pitfalls in computer simulation, or for such things as being sure that a series of psychological research volunteers all see the same set of events. A given PRNG, however, may be told to use some reasonably random physical event in the computer as the seed, such that the seed is unpredictable.

Donald Knuth, an authority on PRNGs, quoted John von Neumann in a tongue-in-cheek but realistic view of the limits of PRNGs[6]:

Anyone who considers arithmetical method of producing random digits is, of course, in a state of sin — John von Neumann (1951)

It is possible to write very good and very bad pseudorandom number generators. Known pitfalls need to be avoided, both in initialization and in computing the next number.

Known bad ideas

Any PRNG must be initialised with a seed. If you have any concern about someone attacking your generator, such seeds must be truly random. Using something like your computer's clock or your process ID, or even a combination of such things, makes it dangerously easy for an enemy to guess the value and break your generator. In a well-known example, Ian Goldberg and David Wagner broke an early version of Netscape's SSL server in exactly this way. [3] Using another PRNG to provide a seed only moves the problem without solving it, since that PRNG must also be intialised.

Several common types of PRNG — such as linear congruential generators or linear feedback shift registers — cannot be used directly in adversarial applications. With well-chosen parameters, they have reasonable statistical properties and are useful for such things as controlling Monte Carlo simulations. However, in the presence of an adversary — someone who wants to break your cryptosystem or cheat at your game — they are inadequate, easily broken. They might be used as components in some more complex generator, but they cannot safely be used alone in such applications.

"Cryptanalytic Attacks on Pseudorandom Number Generators" [4] by Kelsey, Schneier, Wagner, and Hall enumerates attacks and assesses vulnerabilities in widely used generators.

Some PRNGs

Linear congruential

Knuth illustrates the most commonly used PRNG, which he attributes to D.H. Lehmer in 1949[7]. This is an easy generator to implement in software and is very widely used. For example, the rand() function in most C libraries works this way.

A linear congruential generator creates successive outputs xn for n = 1,2, 3... using the formula:

xi = a * xi-1 + b  modulo m

The seed is the initial value x0.

Choice of the constants a, b and the modulus m is critical; bad choices can give very weak generators.

The period of such a generator is at most m-1. a, b and m are normally chosen so that this maximum is achieved. Schneier [8] provides a table with recommended choices for a, b, m.

Shift registers

Many random number generators based on shift registers with feedback are possible. The commonest type uses linear feedback shift registers. These are very easy to implement in hardware and reasonable in software. They are commonly used in stream ciphers; we describe some in the stream cipher article.

OFB and CTR Sequences

A block cipher can provide a good pseudo-random sequence; the modes of operation for this are Output Feedback (OFB) or Counter (CTR) mode.

Encryption devices can be used as practical and strong PRNGs, if the output applies some type of masking so an adversary cannot know the full internal state of the generator.[9]

Blum Blum Shub

This is the strongest known generator, with the advantage of relative simplicity, but the disadvantage of being computationally intensive. If it is used to generate numbers that are not needed frequently, it may be useful.[10] [11]

The possibility also exists that one or more processors could be dedicated to BBS computation.

The possibility of pseudo-random methods that are adequately random

Some research indicates that in carefully selected situations, a pseudo-random number generator, which meets a number of cryptographic criteria, may be adequately random for security. The proofs are complex and not universally accepted, but may be true for a useful number of situations.[12] [3]

Random numbers that follow particular statistical distributions

In many applications, the exact sequence is desired to be as random as possible, but the sequence may be under additional constraints to meet specific statistical requirements.

See generating values for normal random variables for an example of generating random numbers that fit a desired distribution.

Testing for Randomness

Knuth presents an extensive range of tests, of which a few are mentioned here.

A more recent reference is a web page where the US National Institute of Standards and Technology describe their work on "a battery of statistical tests suitable in the evaluation of random number generators and pseudo-random number generators used in cryptographic applications."

Testing for runs

It is entirely reasonable to have individual numbers recur in a random sequence — a random sequence may even have a "run" of a number such as 100 354 972 972 972 155 579. Indeed, when typists have been asked to type randomly, they may intuitively avoid runs and produce a less random sequence. However, one statistical test for randomness is to look at the frequency of runs.

Frequency test

The simplest test is to chop the output up into conveniently-sized chunks and look at frequency of possible chunk values over time. For example, if bytes are the unit, all 256 possible values should occur and their frequencies should be approximately equal. Using larger units gives more stringent tests; for example sampling 16-bit chunks and testing that all values occur equally often tells more than just looking at frequency of byte values.

Chi-squared statistic

The Chi-squared goodness of fit test provides a more sophisticated and powerful method of analyzing frequency data.

Consider an algorithm that looks at pairs of bytes and creates a 256x256 matrix for frequencies of byte sequences, indexed with (arbitrarily) rows for the first byte and columns for the second. This need not be equivalent to breaking the data into 16-bit chunks; one can look at sequences ab, bc, cd, de, ... rather than just ab, cd, ef ... If the data is random, one expects all cells of the matrix to have approximately the same frequency values. However, many types of non-random bias will show up as deviations in that pattern.

Chi-squared is the standard technique for drawing inferences from such a matrix.

Serial test

Gap test

Partition test

Applications

Computer simulation

If it is desired to do repeated runs of a simulation, either for software testing or to evaluate variable components under repeatable condition, a PRNG that will reuse the same seed to produce the same sequence is useful.

Cryptography

Cryptographic one-time pads preferably are generated from physical random phenomena. For high-volume ciphers, however, reproducibility is needed, so, while the initialization variables tend to be truly random, a key generator for a stream or block cipher uses a demonstrably strong pseudo-random number generator.

References

  1. Schneier, Bruce (Second Edition, 1996), Applied Cryptography: Protocols, Algorithms, and Source Code in C, John Wiley & Sons p. 46
  2. Eastlake, D. 3rd; J. Schiller & S. Crocker (June 2005), Randomness Requirements for Security, RFC 4086; IETF Best Current Practice 106 p. 1
  3. 3.0 3.1 Goldwasser, Shafi & Mihir Bellare (July 2008), Chapter 3, Pseudo-random bit generators, Lecture Notes on Cryptography p. 39 Cite error: Invalid <ref> tag; name "Goldwasser2008" defined multiple times with different content
  4. 4.0 4.1 John S. Denker, High-Entropy Randomness Generator
  5. J. Kelsey, B. Schneier, and N. Ferguson (1999). Yarrow-160: Notes on the Design and Analysis of the Yarrow Cryptographic Pseudorandom Number Generator.
  6. Knuth, Donald (3rd Edition, 1998), Chapter III, Random Numbers, The Art of Computer Programming, Volume II: Seminumerical Algorithms, Addison-Wesley
  7. Knuth, p. 10
  8. Schneier p370
  9. Eastlake et al. p. 24
  10. Eastlake, p. 25
  11. Goldwasser & Ballare, p. 47
  12. Eastlake et al., pp. 23-27