Prime number module

Andrew Dalke

Adding another thought to the discussion. Instead of pregenerating
all primes using a sieve, store the cumulative counts for, say,
every 10,000 terms (so 10,000 numbers are needed for the 1E8
range requested by the poster), then use a fast prime tester, like the
Rabin-Miller strong pseudoprime test (see
http://mathworld.wolfram.com/Rabin-MillerStrongPseudoprimeTest.html )
to fill in any missing range. Any hits could be cached in a bitstring
along with the counts, like

counts = [(None, 9999),   # however many primes are between 0 and 10,000
          (None, 8888),   # between 10,000 and 20,000 (or 0 to 20,000)
          ...
          ]

where the None field stores the bitstring. You may want to do every
1,000 if the tests are too slow.

pycrypto has Rabin-Miller and here's a copy I found in pure Python:
http://senderek.de/pcp/release/tools/number.py

It uses the first 7 primes for tests, which is good for all numbers
up to 341550071728321 ~ 3.4E14.
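
For reference, here is a minimal sketch of such a tester (my own code,
not pycrypto's or number.py's): a Rabin-Miller strong-pseudoprime check
that is deterministic below that bound when run against the first 7
primes as bases.

def is_strong_probable_prime(n, a):
    # Write n-1 as d * 2**s with d odd.
    d, s = n - 1, 0
    while d % 2 == 0:
        d, s = d // 2, s + 1
    x = pow(a, d, n)  # a**d mod n
    if x == 1 or x == n - 1:
        return True
    for _ in range(s - 1):
        x = (x * x) % n
        if x == n - 1:
            return True
    return False

def is_prime(n):
    # Deterministic for n < 341550071728321 with these seven bases.
    bases = (2, 3, 5, 7, 11, 13, 17)
    if n < 2:
        return False
    for p in bases:
        if n % p == 0:
            return n == p
    for a in bases:
        if not is_strong_probable_prime(n, a):
            return False
    return True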

Andrew
(e-mail address removed)
 
Juha Autero

Lulu of the Lotus-Eaters said:
I was disappointed that gzip only reduces the size by 41%. I would have
guessed better, given the predominance of 0 bits.

I think it is not a surprise. (But maybe that's because I tried it
before.) You need redundancy to compress data. If the bitmap of prime
numbers contained a lot of redundancy, you could use it to generate
prime numbers. Of course a bitmap is not the most efficient way to store
a list of prime numbers, but neither is gzip the most efficient way to
find redundancy in it. IIRC gzip basically encodes repeating bit
patterns, and it's hardly a surprise that a bitmap of prime numbers
doesn't have lots of repeating bit patterns.
 
Lulu of the Lotus-Eaters

|use a fast prime tester, like the Rabin-Miller strong pseudoprime test

Well sure, you can cheat. But aren't you worried about undiscovered
small pseudoprimes? *wink* (I guess for Rabin-Miller they are not
called Carmichael numbers, just for Fermat pseudoprimes[*])

Yours, Lulu...

[*] Does anyone know whether there are Carmichael-like numbers for
Rabin-Miller tests? Last I knew, it was unproven either way, but that
was a while ago. That is, can any composite pass R-M for every base?
 
Lulu of the Lotus-Eaters

|> I was disappointed that gzip only reduces the size by 41%. I would have
|> guessed better, given the predominance of 0 bits.

|patterns, and it's hardly a surprise that a bitmap of prime numbers
|doesn't have lots of repeating bit patterns.

Sure it does. The bit pattern "0 0 0 0 0 0 0" occurs a WHOLE LOT of
times :). Certainly much more than any other pattern. After all, my
10^8/2 bits only contain 5,761,453 ones among them[*]--there's got to be
a lot of runs of zeros in there :).

The problem, I think, is that these bit patterns don't fall on byte
borders as neatly as I would like. I am pretty sure that I could write
a custom LZ-style compressor that did better by being bit-oriented
rather than byte-oriented. But I'm not going to do that work, 3.8 megs
is small enough for me.

FWIW, bzip2 doesn't do much better either. I'd be kinda surprised at
this point if another representation of the primes under 10^8 actually
did better[**]. I proposed listing the gaps as a strategy; but given
that there are 5.7 million gaps to list, it's hard to beat 3.8 megs.
Maybe encoding gaps in nibbles-with-escaping would work. But now I'm
fond of the speed of the bit array.

Yours, Lulu...

[*] I'm missing two primes versus some other claims. My odd numbers
don't include 2; and I called 1 a non-prime. I don't feel strongly
about the second matter though.

[**] Well, a MUCH shorter representation is a Python-coded algorithm for
generating the primes in the first place. But somehow that seems like
cheating :).
 
Andrew Dalke

Lulu of the Lotus-Eaters:
FWIW, bzip2 doesn't do much better either. I'd be kinda surprised at
this point if another representation of the primes under 10^8 actually
did better[**].

As I recall, the Burrows-Wheeler transform used in bzip2 uses
a clever way to rearrange the data in a block into a form which is
more compressible, yes?

Well, what about rearranging the bits so that every multiple of
3 is before every non-multiple of 3, then every multiple of 5, etc.?
There's a simple algorithm to unravel the ordering, and you'll
end up with a lot of 0s at the start of the block. It's very much like
the ordering+compression you did when you moved all the terms
divisible by 2 to the front of the group, then realized they could
all be represented as part of the decompression algorithm.

Extend my idea far enough and you'll have a prime sieve.
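
In code, the reordering might look something like this (a rough sketch
of my own, assuming the odds-only bitmap; the permutation depends only
on the length, so the decompressor can reconstruct and invert it):

def reorder_indices(n, primes=(3, 5, 7)):
    # Odd numbers below n, with multiples of 3 moved to the front,
    # then multiples of 5 among the rest, then multiples of 7, ...
    head, tail = [], range(1, n, 2)
    for p in primes:
        head += [i for i in tail if i % p == 0]
        tail = [i for i in tail if i % p != 0]
    return head + tail

def reorder_bits(bits, primes=(3, 5, 7)):
    # bits[i] is the primality bit for the odd number 2*i+1.  After
    # reordering, every moved bit (apart from those for 3, 5, 7
    # themselves) is a guaranteed 0, clustered at the start.
    order = reorder_indices(2 * len(bits), primes)
    return [bits[i // 2] for i in order]
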
[**] Well, a MUCH shorter representation is a Python-coded algorithm for
generating the primes in the first place. But somehow that seems like
cheating :).

Why doesn't bzip2 feel like cheating? It's good at compressing
text because it's designed for the sort of patterns found in text.
Why doesn't mp3 feel like cheating? It's designed to handle the
sounds we normally hear, as compared to the output of a
random noise generator.

In essence, if you know the full corpus beforehand, it's really
easy to compress - just store the identifier for each text.
There's only one text here, so the result is trivially compressed
to nothing.

The problem is, your algorithm tries to compress an ordered
list of odd numbers, and so fails to compress as well as a
compression algorithm tuned for just compressing the first
N primes.
---[ to our friends at TLAs (spread the word) ]--------------------------
Echelon North Korea Nazi cracking spy smuggle Columbia fissionable Stego
White Water strategic Clinton Delta Force militia TEMPEST Libya Mossad

Hmm... Seems dated. Terms like Whitewater and Clinton
(Bill, not Hillary) don't hit many hot buttons these days, the
number of people in various militias has gone way down [*]
and Libya is working towards normalization.

What about these words?

Iran nuclear neocon expose POTUS patriot Pakistan armed weaponized
enriched uranium UN smallpox Gitmo invasion Castro Tikrit revolution sarin

Andrew
(e-mail address removed)

[*] Actually, US law states
http://www4.law.cornell.edu/uscode/10/311.html

(a) The militia of the United States consists of all able-bodied males at
least 17 years of age and, except as provided in section 313 of title 32,
under 45 years of age who are, or who have made a declaration of intention
to become, citizens of the United States and of female citizens of the
United States who are members of the National Guard.

(b) The classes of the militia are -
(1) the organized militia, which consists of the National Guard and
the Naval Militia; and

(2) the unorganized militia, which consists of the members of the
militia who are not members of the National Guard or the Naval Militia.

Therefore, the number of people in the militia hasn't changed that
drastically -- and I'm a member of the unorganized militia of the US.
Maybe I should get a patch saying that, and wear it next time
I visit Europe? :)
 
Andrew Dalke

Lulu of the Lotus-Eaters:
Well sure, you can cheat. But aren't you worried about undiscovered
small pseudoprimes?

I am. I'm doing an extensive study of the small pseudoprimes between
0 and 1. I'm working on the distributed computing framework now,
and hope to start the actual mathematical analysis in about 3 months.

Andrew
(e-mail address removed)
 
Alex Martelli

Lulu said:
|Lulu of the Lotus-Eaters wrote:
|> Along those lines, what's the quickest way to count bits in Python?

|Probably gmpy's popcount function.

What about fast AND portable?

(I'm sure gmpy can be installed lots of places, but not necessarily by
receivers of Python code).

Why not? gmpy's downloads include a self-contained .pyd for Windows
users (the ones who typically lack easy access to a C compiler) -- don't
most users of Linux, MacOS and other BSDs, etc, typically have gcc
available to install extensions if they want...? Or are you worrying
about the operations' availability to users of Jython...?

I see you guys are having a great time concocting pure-Python ways to
count bits, etc -- but GMP, the library underlying gmpy, has carefully
tuned machine-code accelerations for many important operations and
platforms... if one really needs to perform advanced multi-precision
arithmetic, as opposed to just playing with it, GMP is just too
precious a resource to ignore. Given how widespread Java is, I would
not be surprised if libraries of similar quality were available there
(with native methods and all) -- I just don't know of any because I
never needed them, so didn't research the issue.

I do acknowledge the usefulness of pure Python (AND readable, NOT
distorted for fun & optimization-profit:) code in *teaching* even
the lowest-level concepts of multi-precision arithmetic & bit-fiddling.
And Python is ideal for playing with higher-level architectures, such
as you guys have been doing about various potential ways to store
compressed pre-computed representations of primes. Finally, one day,
if and when pypy triumphs, one will no doubt need to start with pure
Python code (most likely SIMPLE, NON-TRICKY pure Python code) in
order to obtain maximum speed.

But from a pragmatic point of view, I'll keep using gmpy for the
foreseeable future, thanks;-).


Alex
 
Alex Martelli

Alex said:
Lulu of the Lotus-Eaters wrote:
...
Along those lines, what's the quickest way to count bits in Python?

Probably gmpy's popcount function.
Maybe someone has a clever idea... something that would run a whole
bunch faster than my bit mask style above for counting all the "1"s in a
multibyte string.

I would try gmpy.mpz(multibytestring, 256).popcount(). E.g.:

>>> import gmpy
>>> x = "gmpy IS pretty fast for many kinds of things"
>>> gmpy.mpz(x, 256).popcount()
163

and, on my trusty 30-months-old Athlon box...:

[alex@lancelot python2.3]$ python timeit.py -s'import gmpy' \
-s'x="gmpy IS pretty fast for many kinds of things"' \
'gmpy.mpz(x,256).popcount()'
100000 loops, best of 3: 8.13 usec per loop
[alex@lancelot python2.3]$

Silly me, for not including a "pure Python" alternative run on the
same machine for comparison purposes -- sorry. Given the following
bitcount.py module (only the initialization is coded with gmpy,
and just because I'm lazy -- that doesn't affect timings!):

import gmpy

byte_bit_counts = dict([(chr(b), gmpy.popcount(b)) for b in range(256)])

def popcount(s, bbc=byte_bit_counts):
    return sum([bbc[c] for c in s])

we get the same result:

>>> import bitcount
>>> bitcount.popcount(x)
163

and, on the same box:

[alex@lancelot python2.3]$ python timeit.py -s'import bitcount' \
-s'x="gmpy IS pretty fast for many kinds of things"' \
'bitcount.popcount(x)'
10000 loops, best of 3: 82.2 usec per loop

So, the speed advantage of gmpy is just about one order of magnitude
for this trivial operation.


Alex
 
Michael Hudson

Alex Martelli said:
Why not? gmpy's downloads include a self-contained .pyd for Windows
users (the ones who typically lack easy access to a C compiler) -- don't
most users of Linux, MacOS and other BSDs, etc, typically have gcc
available to install extensions if they want...? Or are you worrying
about the operations' availability to users of Jython...?

I'm not sure "most" OS X installations have the dev tools installed.

I'm not sure GMP has been ported to, e.g. VMS.
But from a pragmatic point of view, I'll keep using gmpy for the
foreseeable future, thanks;-).

Well, sure.

Cheers,
mwh
 
Tim Hochberg

Lulu of the Lotus-Eaters wrote:
[SNIP]
FWIW, bzip2 doesn't do much better either. I'd be kinda surprised at
this point if another representation of the primes under 10^8 actually
did better[**]. I proposed listing the gaps as a strategy; but given
that there are 5.7 million gaps to list, it's hard to beat 3.8 megs.
Maybe encoding gaps in nibbles-with-escaping would work. But now I'm
fond of the speed of the bit array.
[SNIP]

FWIW, using escaped nibbles plus compression put me down around 3.4
megs. This includes storing enough metadata to do fast "how many primes
below n" queries. It's much slower for straight primality testing than
the bitmap scheme, though.
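
For concreteness, one possible shape for the escaped-nibble encoding (a
sketch of the general idea, not Tim's actual code): gaps between odd
primes are even, so store gap/2 per nibble, with the value 15 as an
escape meaning "add 15 and keep reading".

def encode_gaps(primes):
    # primes: ascending odd primes, e.g. [3, 5, 7, 11, ...]
    nibbles = []
    for a, b in zip(primes, primes[1:]):
        v = (b - a) // 2
        while v >= 15:
            nibbles.append(15)  # escape
            v -= 15
        nibbles.append(v)
    if len(nibbles) % 2:
        nibbles.append(0)  # pad to a whole byte
    return ''.join([chr(nibbles[i] * 16 + nibbles[i + 1])
                    for i in range(0, len(nibbles), 2)])

def decode_gaps(packed, start=3):
    primes, p, v = [start], start, 0
    for ch in packed:
        byte = ord(ch)
        for nib in (byte >> 4, byte & 15):
            if nib == 15:
                v += 15
            elif nib or v:  # a 0 nibble with no escape pending is padding
                p += 2 * (v + nib)
                primes.append(p)
                v = 0
    return primes

A round trip checks out:

>>> ps = [3, 5, 7, 11, 13, 17, 19, 23, 29, 31]
>>> decode_gaps(encode_gaps(ps)) == ps
True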

-tim
 
Lulu of the Lotus-Eaters

|In essence, if you know the full corpus beforehand, it's really
|easy to compress - just store the identifier for each text.

I've invented an excellent compression algorithm. It represents the
12.5 meg bit sequence for prime/no-prime (<10^8) as the bit "1". To
encode other sequences, you use the escape bit "0" as the first bit...

Pick the best algorithm for the job at hand...

Yours, Lulu...

|Iran nuclear neocon expose POTUS patriot Pakistan armed weaponized
|enriched uranium UN smallpox Gitmo invasion Castro Tikrit revolution sarin

Yeah... I like those. I'll add a new .signature to my rotation list.
I can troll both the spooks of the mid-1990s and those of the mid-2000s.

--
mertz@ | The specter of free information is haunting the `Net! All the
gnosis | powers of IP- and crypto-tyranny have entered into an unholy
..cx | alliance...ideas have nothing to lose but their chains. Unite
| against "intellectual property" and anti-privacy regimes!
-------------------------------------------------------------------------
 
Andrew Dalke

Stephen Horne:
I think bitsets are the way to go for compression. The idea of only
storing every nth prime sounds good until you consider how you implement
the sieve for the missing primes.

Well, my idea was to store the counts for every, say, 10,000th prime
then do a fast prime test to fill in missing ranges, and cache that list
of primes for later use.
Basically, you only need to store bits where (value % 6) is one of the
values that gets a dot in the bottom line. Then to determine which bit
you need, you set up something like...

You could also get better compression by testing for %5, %7, %11, etc.
Note that by doing so you convert your lookup system into a simple
prime tester, and if you extend it along 2, 3, 5, ... then it is the sieve
of Eratosthenes and you don't need to store any bit patterns at all.

Andrew
(e-mail address removed)
 
Stephen Horne

You could also get better compression by testing for %5, %7, %11, etc.

That was the point of the modulo 2*3*5 == 30 and modulo 2*3*5*7 ==
210. As I said, working with the pattern needs a rapidly increasing
modulo size and lookup table, and the advantages rapidly diminish as
you add more and more primes.
Note that by doing so you convert your lookup system into a simple
prime tester, and if you extend it along 2, 3, 5, ... then it is the sieve
of Eratosthenes and you don't need to store any bit patterns at all.

No, you don't need the bitset - but the whole thing becomes a waste of
time and space. If you extend it sufficiently to not need any bit
patterns at all, what you do need is a huge modulo-based lookup table
that serves no real purpose at all (it only has benefits for the
second and so-on repetitions of the repeating pattern) and a complete
list of the 'special case' primes (ie all primes up to 10^8).

Storing the list of special primes up to 10^8 would take much more
space than the original bitset, and the lookup table would take even
more space. The whole exercise would be far worse than pointless as
you could get more efficient results by discarding the method and the
lookup table, and just searching that list of 'special case' primes.

Optimisation problems often require a balance. In this case, I was
trying to balance prior knowledge of the repeating pattern of definite
non-primes in the program with externally stored data for the bulk of
the primes (the bitset on disk).
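
For reference, a minimal sketch of the kind of modulo-30 lookup being
balanced here (my own illustration, not Stephen's code). Only 8 of
every 30 residues are coprime to 2, 3 and 5, so a wheel-compressed
bitset needs bits only at those positions:

WHEEL = 30
SPOKES = [r for r in range(1, WHEEL) if r % 2 and r % 3 and r % 5]
# SPOKES == [1, 7, 11, 13, 17, 19, 23, 29]
SLOT = dict([(r, i) for i, r in enumerate(SPOKES)])

def bit_index(n):
    # Bit position of n in the wheel-compressed bitset, or None if n
    # falls on a multiple of 2, 3 or 5 (composite, apart from 2, 3, 5).
    q, r = divmod(n, WHEEL)
    if r not in SLOT:
        return None
    return q * len(SPOKES) + SLOT[r]

Note the diminishing return: odds-only stores 2 numbers per bit, the
30-wheel 30/8 = 3.75, the 210-wheel 210/48 = 4.375 -- each extra prime
buys less while the lookup table multiplies in size.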
 
Emile van Sebille

FWIW, using array pops the speed up significantly for long strings:

import string, time, array

nibbles = [(w+x+y+z, w*8+x*4+y*2+z)
           for w in (0,1) for x in (0,1)
           for y in (0,1) for z in (0,1)]

bitcounts = [
    (chr(hinibblevalue*16 + lonibblevalue),
     hibitcount+lobitcount)
    for hibitcount, hinibblevalue in nibbles
    for lobitcount, lonibblevalue in nibbles]

bytes = dict(bitcounts)

xlater = "".join([chr(bitcount) for byte, bitcount in bitcounts])

def bitcount1(bytestr):
    return sum([bytes[ii] for ii in bytestr])

def bitcount2(bytestr):
    return sum(map(ord, string.translate(bytestr, xlater)))

def bitcount4(bytestr):
    return sum(array.array("B", string.translate(bytestr, xlater)))

refstr = 'a' * 1000000

t = time.time()
bitcount1(refstr)
print time.time() - t

t = time.time()
bitcount2(refstr)
print time.time() - t

t = time.time()
bitcount4(refstr)
print time.time() - t
 
Bengt Richter

|> I was disappointed that gzip only reduces the size by 41%. I would have
|> guessed better, given the predominance of 0 bits.

|patterns, and it's hardly a surprise that a bitmap of prime numbers
|doesn't have lots of repeating bit patterns.

Sure it does. The bit pattern "0 0 0 0 0 0 0" occurs a WHOLE LOT of
times :). Certainly much more than any other pattern. After all, my
10^8/2 bits only contain 5,761,453 ones among them[*]--there's got to be
a lot of runs of zeros in there :).

The problem, I think, is that these bit patterns don't fall on byte
borders as neatly as I would like. I am pretty sure that I could write
a custom LZ-style compressor that did better by being bit-oriented
rather than byte-oriented. But I'm not going to do that work, 3.8 megs
is small enough for me.

Hm, a bit sequence 001010010010000100100001001000010000001001 ... could be
encoded in a series of symbols '001', '01', '001', '001', '00001', '001', ...
which are essentially just different string representations for prime first differences,
e.g., (faking in the starting 0)
>>> prime = [0,2,3,5,7,11,13,17,19,23,29,31]
>>> ['0'*(prime[i+1]-prime[i])+'1' for i in range(len(prime)-1)]

['001', '01', '001', '001', '00001', '001', '00001', '001', '00001', '0000001', '001']

Or, using different concrete symbols,
>>> [chr(prime[i+1]-prime[i]+ord('a')) for i in range(len(prime)-1)]

['c', 'b', 'c', 'c', 'e', 'c', 'e', 'c', 'e', 'g', 'c']

Maybe it would be best to analyze the frequencies, assign Huffman codes,
and pack those, for maximal compression? I guess that is about what zip
would do to the latter (byte-oriented, as you mentioned) symbols.
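
One plausible shape for that (a sketch of my own using the standard
heapq-based Huffman construction, with the gap letters above as
symbols; nothing tuned):

import heapq

def huffman_codes(freq):
    # freq maps symbol -> count; returns symbol -> bit-string code.
    # Heap entries are [total_count, tie_breaker, sym1, sym2, ...].
    heap = [[n, i, sym] for i, (sym, n) in enumerate(freq.items())]
    heapq.heapify(heap)
    code = dict([(sym, '') for sym in freq])
    tie = len(heap)
    while len(heap) > 1:
        lo = heapq.heappop(heap)
        hi = heapq.heappop(heap)
        for sym in lo[2:]:  # lighter subtree gets a leading 0
            code[sym] = '0' + code[sym]
        for sym in hi[2:]:  # heavier subtree gets a leading 1
            code[sym] = '1' + code[sym]
        heapq.heappush(heap, [lo[0] + hi[0], tie] + lo[2:] + hi[2:])
        tie += 1
    return code

With the counts from the short run above ('c': 6, 'e': 3, 'b': 1,
'g': 1) the common gap 'c' gets a 1-bit code, 'e' 2 bits, and the
rare 'b' and 'g' 3 bits each.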

But actually, I was looking at a modulo 2*3*5*7... mask as a pre-compression step, which
might pack things even tighter, because it eliminates whole structured patterns
from the whole. One could do a bigger modulo too. If I get ambitious, I might
try it. I have an inkling of a multi-step modulo, multi-bitmap check that could go
fast in smaller space, but I have to meet someone now...

[Back next day, didn't post that]

The idea alluded to above is to make a mask using a selected part of the front
of the big bitmap, noting that its zeroes are guaranteed to repeat in forward
repeats of that mask, so those zeroes don't need actual representation going forward.

I.e., if pfact(p) is the "factorial" product of all primes from p on down, then
if mlen=pfact(p), [bitmap[i:i+mlen] for i in range(1, len(bitmap), mlen)] will be
(untested) a list of bitmap chunks with guaranteed zeroes everywhere bitmap[1:mlen+1]
had them. I downloaded your bitmap, so it should be easy to confirm...

(Ok, but I reconstituted the theoretical bitmap by including factors of 2 and the
primes 1 & 2 in the following)

===< primask.py >====================================================
def getbitmasks(fpath, nb):
    chunksize = (nb/2+7)//8  # nb/2 since even bits are not in bitfile
    f = file(fpath, 'rb')
    bits = []
    def bufbits():
        if len(bits) >= nb: return
        bitchunk = f.read(chunksize)  # bigendian bytes w/ bits for odd no's 1,3,5...
        for byte in bitchunk:
            ibyte = ord(byte)
            for b in xrange(8):
                bits.append('.1'[(0x80>>b)&ibyte>0])
                bits.append('.')  # add in evens

    bufbits()
    bits[0:2] = ['1','1']  # account for primes 1 & 2 not in bit file at front
    while 1:
        bufbits()  # make sure
        if not bits: break
        yield ''.join(bits[:nb])
        bits = bits[nb:]

def primask(p, nlines=5):
    pf = {2:2, 3:6, 5:30, 7:210, 11:2310, 13:30030, 17:510510}[p]  # should suffice
    bgen = getbitmasks('primes.bitarray', pf)
    print 'Prime for mask = %s, mask length = %s' % (p, pf)
    for n in xrange(nlines):
        s = bgen.next()
        nz = s.count('.')
        if not n: print 'Precompression possible: (%s-%s zeroes)/%s => %.1f %% of orig' % (
            pf, nz, pf, 100.*(pf-nz)/pf)
        print '%4s: %s' % (nz, len(s) < 80 and s or s[:80]+' etc.')

if __name__ == '__main__':
    import sys
    try:
        args = map(int, sys.argv[1:])
        p = args.pop(0)
        nl = args and args.pop(0) or 5
        primask(p, nl)
    except:
        raise SystemExit("""
Usage: primask prime [number_of_lines]
    where prime determines size of mask as product of primes at and below this
    and number of lines is number to print.""")
=====================================================================

This will print segments of the bit map (reconstituted to include all numbers starting with 1,2,3,...).
The segments are the length of a possible mask taken from the front whose non-prime slots are bound
to repeat through the rest, since corresponding slots will be multiples of the primes in the first set.
Since those zeroes repeat, one could remove them from the rest of the bit file, as you did with
even numbers. I print the possible compression based on the number of zeroes in the first segment.
So unless I goofed, ...

(To read primes from masks start with 1,2,3,(.),5,(.) etc)

[11:00] C:\pywk\clp\prime>primask.py 3
Prime for mask = 3, mask length = 6
Precompression possible: (6-2 zeroes)/6 => 66.7 % of orig
2: 111.1.
123 5 <<-- (primes represented, etc)
4: 1...1.
7 +--11
4: 1...1.
| +--17
+------13
4: 1...1.
5: ....1.

[11:00] C:\pywk\clp\prime>primask.py 5
Prime for mask = 5, mask length = 30
Precompression possible: (30-19 zeroes)/30 => 36.7 % of orig
19: 111.1.1...1.1...1.1...1.....1.
23: 1.....1...1.1...1.....1.....1.
23: 1.....1...1.1.....1...1.....1.
24: ......1...1.1...1.1...1.......
25: ......1...1.....1.1.........1.

[11:00] C:\pywk\clp\prime>primask.py 7
Prime for mask = 7, mask length = 210
Precompression possible: (210-163 zeroes)/210 => 22.4 % of orig
163: 111.1.1...1.1...1.1...1.....1.1.....1...1.1...1.....1.....1.1.....1...1.1.....1. etc.
175: 1...........1...1.1...1.....1.1.........1.....1.....1.....1.1.....1...1.1....... etc.
177: 1.........1.1.....1...1.....1.......1...1.1...1...........1.......1...1.......1. etc.
178: 1.........1.1...1.....1.....1.1...........1...1.....1.......1.........1.......1. etc.
180: ............1...1.1...1.............1...1.1...1...................1...1.......1. etc.

[11:00] C:\pywk\clp\prime>primask.py 11
Prime for mask = 11, mask length = 2310
Precompression possible: (2310-1966 zeroes)/2310 => 14.9 % of orig
1966: 111.1.1...1.1...1.1...1.....1.1.....1...1.1...1.....1.....1.1.....1...1.1.....1. etc.
2030: 1.....................1.....1.1.....1...1.....1.............1.....1...1.1.....1. etc.
2043: 1...............1.1...1.....1.1.....1.....1.........1.....1...........1......... etc.
2055: ................1.1.........1.1.....1...1.....1.....1.......1.....1...1......... etc.
2064: 1...............1...................1...1.1.........1.................1.......1. etc.

[11:00] C:\pywk\clp\prime>primask.py 13
Prime for mask = 13, mask length = 30030
Precompression possible: (30030-26781 zeroes)/30030 => 10.8 % of orig
26781: 111.1.1...1.1...1.1...1.....1.1.....1...1.1...1.....1.....1.1.....1...1.1.....1. etc.
27216: ................1...........1...........1.................1.1.....1.....1.....1. etc.
27366: ................1.....1.....1.1.........1.1...1...................1.....1.....1. etc.
27444: ................1.............1.....1.....................1.............1....... etc.
27481: 1...................................1.....1...1.............1...........1.....1. etc.

[11:00] C:\pywk\clp\prime>primask.py 17
Prime for mask = 17, mask length = 510510
Precompression possible: (510510-468178 zeroes)/510510 => 8.3 % of orig
468178: 111.1.1...1.1...1.1...1.....1.1.....1...1.1...1.....1.....1.1.....1...1.1.....1. etc.
472790: ..................1.....................1.1...............1...........1.1.....1. etc.
474186: ......................1.......................1.....1.......1.....1...1.1....... etc.
475041: ..................1...........1.....1.......................1................... etc.
475727: ..................1.................1.....1.......................1...1......... etc.

So it appears we could compress that file (imagining it as containing evens also) quite a bit,
by using a (510510+7)//8 => 63814 byte mask as the pattern for squishing zeroes out of the rest.

BTW, I thought primes ran about 10% of all numbers, so 8.3 % seems a little suspicious.
Did I goof, or what is the right percentage? I guess I'll make a dict of all the gap sizes
with their respective counts in the bit map and see, but have to leave this for now ...
|FWIW, bzip2 doesn't do much better either. I'd be kinda surprised at
|this point if another representation of the primes under 10^8 actually
|did better[**]. I proposed listing the gaps as a strategy; but given
|that there are 5.7 million gaps to list, it's hard to beat 3.8 megs.
|Maybe encoding gaps in nibbles-with-escaping would work. But now I'm
|fond of the speed of the bit array.
|
|Yours, Lulu...
|
|[*] I'm missing two primes versus some other claims. My odd numbers
|don't include 2; and I called 1 a non-prime. I don't feel strongly
|about the second matter though.
|
|[**] Well, a MUCH shorter representation is a Python-coded algorithm for
|generating the primes in the first place. But somehow that seems like
|cheating :).

Oops, better post this and go...

Regards,
Bengt Richter
 
Lulu of the Lotus-Eaters

(e-mail address removed) (Bengt Richter) wrote previously:
|I thought primes ran about 10% of all numbers, so 8.3 % seems a little
|suspicious.

The primes become sparser over larger sizes. Thinking about the Sieve
of Eratosthenes makes that apparent, I think: the larger you get, the
more values get filtered out.

However, your ratios, I believe, *must* be wrong:

|Prime for mask = 3, mask length = 6
|Precompression possible: (6-2 zeroes)/6 => 66.7 % of orig
|Prime for mask = 5, mask length = 30
|Precompression possible: (30-19 zeroes)/30 => 36.7 % of orig
|Prime for mask = 7, mask length = 210
|Precompression possible: (210-163 zeroes)/210 => 22.4 % of orig
[...]

Bracket the Python implementation, just think of the numbers. If you
have a sequence of N bits, the 2 mask takes out half of them. What's
left is odd numbers: exactly 1/3 of which are divisible by 3. Of the
filtered list, exactly 1/5 of what's left is divisible by 5. And so
on.[*]

[*] Actually, exact divisibilty can be off-by-one, since the original
sequence might not divide evenly by all those primes.

So, in other words, the 2 mask gets you to 50%; the 2+3 mask gets
1/2*2/3==33%; the 2+3+5 mask gets 1/2*2/3*4/5==27%; the 2+3+5+7 mask
gets 1/2*2/3*4/5*6/7==22.8%; and so on.

Verifying this "experimentally":
19
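
For instance, the 2+3+5+7 figure can be checked over one full period
of 210 (a quick sketch):

>>> len([n for n in range(210) if n%2 and n%3 and n%5 and n%7])
48
>>> 48 / 210.
0.22857142857142856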

Anything that is left over after this filtering is--by definition--of
unknown primality. You need to encode that by putting a 1 or 0 in the
bit position corresponding to the number remaining. IOW:

toEncode = filter(lambda n: n%2 and n%3 and n%5 and n%7, range(10**8))
for n in toEncode:
    if isPrime(n):
        bitfield.append(1)
    else:
        bitfield.append(0)

You might want to run that first line on a fast machine with plenty of
memory. :)

Yours, Lulu...
 
Anton Vredegoor

Lulu of the Lotus-Eaters said:
Anything that is left over after this filtering is--by definition--of
unknown primality. You need to encode that by putting a 1 or 0 in the
bit position corresponding to the number remaining. IOW:

toEncode = filter(lambda n: n%2 and n%3 and n%5 and n%7, range(10**8))
for n in toEncode:
    if isPrime(n):
        bitfield.append(1)
    else:
        bitfield.append(0)

You might want to run that first line on a fast machine with plenty of
memory. :)

I've been keeping some code to myself, because there seemed to be no
direct connection to this thread. However, now that you mention
needing a lot of resources there's a reason to post because my script
creates an infinite generator function that can produce the numbers
for your toEncode list more efficiently.

Anton

def period(L):
    """suppose we enumerate all integers >= 2 which are
    not a multiple of any of the numbers in L (L is a list
    of prime numbers) and take the first order differences.
    This list repeats itself after some time. The period length
    is the product of [x-1 for x in L]."""
    return reduce(lambda i, j: i*(j-1), L, 1)

def primedeltas(L):
    """return one full period of differences"""
    diffs, p = [], period(L)
    prev = i = 1
    while len(diffs) < p:
        i += 2
        for j in L:
            if i % j == 0: break
        else:
            diffs.append(i - prev)
            prev = i
    return diffs

def excludemultiples(L):
    """return each number in L and then return numbers
    that are not multiples of any number in L. L is a list
    of primes"""
    i, pp = 1, primedeltas(L)
    for p in L: yield p
    while 1:
        for p in pp:
            i += p
            yield i

def test():
    L = [2, 3, 5, 7]
    em = excludemultiples(L)
    for i, m in enumerate(em):
        print m,
        if i == 14:
            print
            break
    #print [x/2 for x in primedeltas(L)]

if __name__ == '__main__':
    test()
 
