Saturday, January 30, 2016

Race results are in

Some people wanted to compare machines, so here is the exact code I used and some sample values I got from running it on my laptop. I'm curious what values other people get.
;;; -*- Mode: scheme; coding:us-ascii -*-

;;; This is code for MIT/GNU Scheme.  We attempt to measure the speed of ASSQ.
;;; The code avoids obvious abstractions because we don't want to
;;; measure the speed of function calling.

;; Inline the standard scheme primitives.
(declare (usual-integrations))

;; We change the size and alignment of the heap by consing a small
;; data structure to this list.
(define *change-heap-size-and-alignment* '())

;;; Creates a test alist that looks like this:
;;; ((13 . "13")
;;;  (23 . "23")
;;;  (25 . "25")
;;;  (18 . "18")
;;;  (0 . "0")
;;;  (19 . "19")
;;;  (5 . "5")
;;;  (4 . "4")
;;;  (6 . "6")
;;;    ...
;;; 
(define (make-test-alist n-elements)
  (let ((alist 
         (fold-left (lambda (alist number)
                      (alist-cons number (number->string number)
                                  alist))
                    '()
                    ;; shuffle the numbers from 0 to n-1
                    (map car
                         (sort   
                          (map (lambda (n)
                                 (cons n (random 1.0)))
                               (iota n-elements))
                          (lambda (l r) (< (cdr l) (cdr r))))))))
    (gc-flip)
    (set! *change-heap-size-and-alignment* 
          (cons (vector #f) *change-heap-size-and-alignment*))
    alist))

;;; Creates an alist of <size> entries and then measures the time to
;;; perform n-lookups on it.  Specialized to fixnum-only arithmetic.

(define (time-alist size n-lookups)
  (let ((test-alist (make-test-alist size)))
    (show-time
     (lambda ()
       (do ((i   0 (fix:+ i 1))
            (idx 0 (if (fix:> idx size)
                       0
                       (fix:+ idx 1)))
            (answer '() (assq idx test-alist)))
           ((fix:>= i n-lookups) answer))))))
Here's a sample run on my laptop. We make an alist with 10 elements and call ASSQ 100,000,000 times on it, fetching each entry about the same number of times.
1 ]=> (do ((i 0 (+ i 1))) ((not (< i 10))) (time-alist 10 100000000))

;process time: 2260 (2260 RUN + 0 GC); real time: 2259
;process time: 2260 (2260 RUN + 0 GC); real time: 2265
;process time: 2290 (2290 RUN + 0 GC); real time: 2291
;process time: 2250 (2250 RUN + 0 GC); real time: 2247
;process time: 2260 (2260 RUN + 0 GC); real time: 2259
;process time: 2240 (2240 RUN + 0 GC); real time: 2240
;process time: 2240 (2240 RUN + 0 GC); real time: 2243
;process time: 2250 (2250 RUN + 0 GC); real time: 2258
;process time: 2240 (2240 RUN + 0 GC); real time: 2247
;process time: 2250 (2250 RUN + 0 GC); real time: 2250
Process time is reported in milliseconds, so it took about 2.26 seconds to do 100,000,000 lookups. This divides out to .0000000226 seconds = 22.6 nanoseconds per lookup.
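If you want to check the division, here is a trivial helper (my own, not part of the benchmark) that converts a show-time report into nanoseconds per operation:

;; SHOW-TIME reports process time in milliseconds.
(define (ns-per-op process-time-in-ms n-ops)
  (/ (* process-time-in-ms 1000000.) n-ops))

(ns-per-op 2260 100000000)  ;=> 22.6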

Lookup time should be about linear in the size of the list, so we'd expect a 100-element list to take somewhere around 226 nanoseconds per lookup.

1 ]=> (do ((i 0 (+ i 1))) ((not (< i 10))) (time-alist 100 100000000))

;process time: 20720 (20720 RUN + 0 GC); real time: 20753
;process time: 20700 (20700 RUN + 0 GC); real time: 20733
;process time: 20640 (20640 RUN + 0 GC); real time: 20671
;process time: 20690 (20690 RUN + 0 GC); real time: 20695
;process time: 20670 (20670 RUN + 0 GC); real time: 20690
;process time: 21010 (21010 RUN + 0 GC); real time: 21026
;process time: 20800 (20800 RUN + 0 GC); real time: 20832
;process time: 20760 (20760 RUN + 0 GC); real time: 20747
;process time: 20710 (20710 RUN + 0 GC); real time: 20702
;process time: 20690 (20690 RUN + 0 GC); real time: 20700
;Value: #t
Testing a hash table:
(define (make-test-hash-table n-entries)
  (alist->hash-table (make-test-alist n-entries)))

;;; Creates a hash-table of <size> entries and then measures the time to
;;; perform n-lookups on it.  Specialized to fixnum-only arithmetic.

(define (time-hash-table size n-lookups)
  (let ((test-hash-table (make-test-hash-table size)))
    (show-time
     (lambda ()
       (do ((i   0 (fix:+ i 1))
            (idx 0 (if (fix:> idx size)
                       0
                       (fix:+ idx 1)))
            (answer '() (hash-table/get test-hash-table idx #f)))
           ((fix:>= i n-lookups) answer))))))
Whether you put 10 elements or a thousand in a hash table, it takes a roughly constant amount of time to look things up:
1 ]=> (do ((i 0 (+ i 1))) ((not (< i 10))) (time-hash-table 10 100000000))

;process time: 8320 (8320 RUN + 0 GC); real time: 8321
;process time: 8300 (8300 RUN + 0 GC); real time: 8304
;process time: 8420 (8420 RUN + 0 GC); real time: 8419
;process time: 8280 (8280 RUN + 0 GC); real time: 8304
;process time: 8380 (8380 RUN + 0 GC); real time: 8387
;process time: 8280 (8280 RUN + 0 GC); real time: 8288
;process time: 8320 (8320 RUN + 0 GC); real time: 8311
;process time: 8330 (8330 RUN + 0 GC); real time: 8327
;process time: 8290 (8290 RUN + 0 GC); real time: 8290
;process time: 8310 (8310 RUN + 0 GC); real time: 8307
;Value: #t

1 ]=> (do ((i 0 (+ i 1))) ((not (< i 10))) (time-hash-table 1000 100000000))

;process time: 8400 (8400 RUN + 0 GC); real time: 8403
;process time: 8550 (8550 RUN + 0 GC); real time: 8553
;process time: 8620 (8620 RUN + 0 GC); real time: 8639
;process time: 8420 (8420 RUN + 0 GC); real time: 8435
;process time: 8400 (8400 RUN + 0 GC); real time: 8425
;process time: 8460 (8460 RUN + 0 GC); real time: 8455
;process time: 8460 (8460 RUN + 0 GC); real time: 8459
;process time: 8480 (8480 RUN + 0 GC); real time: 8486
;process time: 8500 (8500 RUN + 0 GC); real time: 8502
;process time: 8520 (8520 RUN + 0 GC); real time: 8518
;Value: #t

Testing an rb-tree:
(define (make-test-rb-tree n-entries)
  (alist->rb-tree (make-test-alist n-entries) fix:= fix:<))

;;; Creates an rb-tree of <size> entries and then measures the time to
;;; perform n-lookups on it.  Specialized to fixnum-only arithmetic.

(define (time-rb-tree size n-lookups)
  (let ((test-rb-tree (make-test-rb-tree size)))
    (show-time
     (lambda ()
       (do ((i   0 (fix:+ i 1))
            (idx 0 (if (fix:> idx size)
                       0
                       (fix:+ idx 1)))
            (answer '() (rb-tree/lookup test-rb-tree idx #f)))
           ((fix:>= i n-lookups) answer))))))

1 ]=> (do ((i 0 (+ i 1))) ((not (< i 10))) (time-rb-tree 10 100000000))

;process time: 3910 (3910 RUN + 0 GC); real time: 3908
;process time: 3810 (3810 RUN + 0 GC); real time: 3805
;process time: 4090 (4090 RUN + 0 GC); real time: 4090
;process time: 3970 (3970 RUN + 0 GC); real time: 3967
;process time: 4060 (4060 RUN + 0 GC); real time: 4051
;process time: 3980 (3980 RUN + 0 GC); real time: 3979
;process time: 4040 (4040 RUN + 0 GC); real time: 4040
;process time: 4090 (4090 RUN + 0 GC); real time: 4094
;process time: 3810 (3810 RUN + 0 GC); real time: 3810
;process time: 4090 (4090 RUN + 0 GC); real time: 4092
;Value: #t

1 ]=> (do ((i 0 (+ i 1))) ((not (< i 10))) (time-rb-tree 100 100000000))

;process time: 7700 (7700 RUN + 0 GC); real time: 7720
;process time: 7760 (7760 RUN + 0 GC); real time: 7767
;process time: 7700 (7700 RUN + 0 GC); real time: 7710
;process time: 7890 (7890 RUN + 0 GC); real time: 7893
;process time: 7920 (7920 RUN + 0 GC); real time: 7914
;process time: 7650 (7650 RUN + 0 GC); real time: 7646
;process time: 7740 (7740 RUN + 0 GC); real time: 7738
;process time: 7760 (7760 RUN + 0 GC); real time: 7761
;process time: 7670 (7670 RUN + 0 GC); real time: 7671
;process time: 8140 (8140 RUN + 0 GC); real time: 8136
;Value: #t

1 ]=> (do ((i 0 (+ i 1))) ((not (< i 10))) (time-rb-tree 1000 100000000))

;process time: 13130 (13130 RUN + 0 GC); real time: 13124
;process time: 13160 (13160 RUN + 0 GC); real time: 13153
;process time: 13150 (13150 RUN + 0 GC); real time: 13149
;process time: 13140 (13140 RUN + 0 GC); real time: 13140
;process time: 13310 (13310 RUN + 0 GC); real time: 13304
;process time: 13170 (13170 RUN + 0 GC); real time: 13172
;process time: 13140 (13140 RUN + 0 GC); real time: 13167
;process time: 13250 (13250 RUN + 0 GC); real time: 13238
;process time: 13300 (13300 RUN + 0 GC); real time: 13318
;process time: 13420 (13420 RUN + 0 GC); real time: 13416
;Value: #t
And wt-trees while we're at it:
(define (make-test-wt-tree n-elements)
  (alist->wt-tree number-wt-type (make-test-alist n-elements)))

(define (time-wt-tree size n-lookups)
  (let ((test-wt-tree (make-test-wt-tree size)))
    (show-time
     (lambda ()
       (do ((i   0 (fix:+ i 1))
            (idx 0 (if (fix:> idx size)
                       0
                       (fix:+ idx 1)))
            (answer '() (wt-tree/lookup test-wt-tree idx #f)))
           ((fix:>= i n-lookups) answer))))))
1 ]=> (do ((i 0 (+ i 1))) ((not (< i 10))) (time-wt-tree 100 100000000))

;process time: 6400 (6400 RUN + 0 GC); real time: 6397
;process time: 6740 (6740 RUN + 0 GC); real time: 6736
;process time: 6760 (6760 RUN + 0 GC); real time: 6763
;process time: 6070 (6070 RUN + 0 GC); real time: 6068
;process time: 6450 (6450 RUN + 0 GC); real time: 6461
;process time: 6800 (6800 RUN + 0 GC); real time: 6812
;process time: 6330 (6330 RUN + 0 GC); real time: 6346
;process time: 6060 (6060 RUN + 0 GC); real time: 6066
;process time: 6050 (6050 RUN + 0 GC); real time: 6039
;process time: 6300 (6300 RUN + 0 GC); real time: 6303
;Value: #t

Friday, January 29, 2016

Alist vs. hash-table

An alist is a simple data structure that holds key-value pairs in a linked list. When a key is looked up, the list is searched to find it. The time it takes is proportional to the length of the list, or the number of entries.
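For example, ASSQ just walks the spine of the list, comparing keys as it goes:

;; ASSQ searches front to back and returns the first entry whose key
;; is EQ? to the one sought, or #F if there is none.
(assq 'b '((a . 1) (b . 2) (c . 3)))  ;=> (b . 2)
(assq 'z '((a . 1) (b . 2) (c . 3)))  ;=> #f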

A hash-table is a more complex data structure that holds key-value pairs in a set of "hash buckets". When a key is looked up, it is first "hashed" to find the correct bucket, then that bucket is searched for the entry. The time it takes depends on a number of things: the hash algorithm, the number of buckets, the number of entries in the bucket, etc. A hash-table can be faster than an alist because the hashing step is quick and the subsequent search step will have very few entries to search.
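Here is a minimal sketch of the idea (a toy model of my own, not MIT Scheme's actual implementation), where buckets is a vector of alists and hash is any procedure mapping keys to nonnegative integers:

;; Hash the key to pick a bucket, then search that (short) bucket alist.
(define (toy-hash-table/get buckets hash key default)
  (let* ((index (modulo (hash key) (vector-length buckets)))
         (entry (assq key (vector-ref buckets index))))
    (if entry (cdr entry) default)))

With a good hash function and enough buckets, each bucket holds only a handful of entries, so the final search is cheap no matter how many entries the table holds.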

In theory, an alist takes time proportional to the number of entries, but a hash-table takes constant time independent of the number of entries. Let's find out if this is true for MIT/GNU Scheme.

I wrote a little program that measures how long it takes to look things up in an alist vs. a hash table (the exact code and sample timings are in the post above). It does indeed seem that alists are linear and hash tables are constant in lookup time. But the hashing step of a hash table does take time, so short alists end up being faster than hash tables. The breakeven point looks like a tad over 25 elements. So if you expect 25 or fewer entries, an alist will perform better than a hash table. (Of course, different implementations will have different breakeven points.)
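If you wanted to act on this, a small abstraction could pick the representation from the expected size. Here's a sketch (make-lookup-table and lookup-table/get are my own names, built on MIT Scheme's make-strong-eqv-hash-table and hash-table/put!; the 25-entry threshold is the breakeven measured here):

;; Below the breakeven point, keep the alist; above it, build a hash table.
(define *alist-breakeven* 25)

(define (make-lookup-table alist)
  (if (<= (length alist) *alist-breakeven*)
      (cons 'alist alist)
      (let ((table (make-strong-eqv-hash-table)))
        (for-each (lambda (entry)
                    (hash-table/put! table (car entry) (cdr entry)))
                  alist)
        (cons 'hash-table table))))

(define (lookup-table/get table key default)
  (case (car table)
    ((alist)
     (let ((entry (assq key (cdr table))))
       (if entry (cdr entry) default)))
    ((hash-table)
     (hash-table/get (cdr table) key default))))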

A tree data structure is slightly more complex than an alist, but simpler than a hash table. Looking up an entry in a tree takes time proportional to the logarithm of the number of entries. The logarithm function grows quite slowly, so a tree performs pretty well over a very large range of entries. A tree is slower than an alist until you have about 15 entries. At this point, the linear search of an alist cannot compete with the logarithmic search of a tree. The time it takes to search a tree grows, but quite slowly. It takes more than 100 entries before a tree becomes as slow as a hash table.
With a big enough tree, the growth is so slow that you can pretend it is constant.
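A quick back-of-the-envelope calculation shows just how slow the growth is:

(define (log2 n) (/ (log n) (log 2)))

;; A balanced tree of a million entries needs only about 20 comparisons
;; per lookup; ten entries need about 3.
(map (lambda (n) (round (log2 n))) '(10 100 1000 1000000))
;; => roughly (3 7 10 20)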

Wednesday, January 27, 2016

Latest amusement

Here's a procedure that CDRs down a list until it hits the end:
(define (last-cdr list)
  (let loop ((tail list))
    (if (pair? tail)
        (loop (cdr tail))
        tail)))
The inner loop of the procedure compiles to this:
loop-3:
 (cmp q (r 7) (@r 6))         ;; Interrupt check
 (jge (@pcr interrupt-12))

 (mov q (r 0) (@r 4))         ;; Fetch 'tail'

 (mov q (r 1) (r 0))          ;; Extract the tag
 (shr q (r 1) (&u #x3a))

 (cmp b (r 1) (&u 1))         ;; Check for pair?, jump if not
 (jne (@pcr label-14))

 (and q (r 0) (r 5))          ;; Mask the tag

 (mov q (r 1) (@ro 0 8))      ;; Fetch the CDR

 (mov q (@r 4) (r 1))         ;; Assign 'tail'

 (jmp (@pcr loop-3))          ;; Do it again.
On my laptop, each iteration of this loop takes a few nanoseconds.

I noticed that there seems to be a lot more going on than CDRing down a list. The interrupt check is responsible for a lot of the overhead. MIT/GNU Scheme polls for interrupts: the compiler inserts an interrupt check at every procedure entry and backwards branch, which ensures that only a small, bounded number of instructions can run between interrupt polls.

The interrupt poll and the heap check are simultaneous because of a hack. To interrupt Scheme, you save the "Top of Heap" pointer and set it to zero. This causes the heap check to fail as if there were an out-of-memory condition. The out-of-memory handler first checks whether this was actually an interrupt masquerading as an out-of-memory condition, and restores the heap pointer if so.
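In pseudo-Scheme, the trick looks something like this (hypothetical names and stubs of my own; the real mechanism lives in the microcode):

(define *memtop* 100000000)               ; hypothetical top-of-heap value
(define *saved-memtop* 0)
(define *interrupt-pending?* #f)
(define (process-interrupts!) 'handled)   ; stub
(define (garbage-collect!) 'gc)           ; stub

(define (request-interrupt!)
  ;; Save the real MEMTOP and smash it: the next heap check must fail.
  (set! *interrupt-pending?* #t)
  (set! *saved-memtop* *memtop*)
  (set! *memtop* 0))

(define (handle-heap-check-failure)
  ;; Was this an interrupt masquerading as an out-of-memory condition?
  (if *interrupt-pending?*
      (begin (set! *memtop* *saved-memtop*)
             (set! *interrupt-pending?* #f)
             (process-interrupts!))
      (garbage-collect!)))                ; no, we really are out of heap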

The two instructions,

(cmp q (r 7) (@r 6))         ;; Interrupt check
 (jge (@pcr interrupt-12))
perform the check. The first compares register 7 with the contents of memory that register 6 points at. The compiler's convention is that register 7 contains the free pointer and register 6 holds the address of MEMTOP, so the interrupt check costs a memory cycle.

The reason you poll for interrupts is that the virtual machine needs to be in a coherent state when they are handled, because the interrupt handler could attempt to run arbitrary code. The garbage collector in particular has to be able to parse the machine state at an interrupt. Interrupts are therefore polled at apply time: before an application, pending interrupts are checked, and if one is to be processed, a continuation is pushed that will do the original application, and an interrupt handler is applied instead.
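In outline (same hypothetical names as in the sketch above; the real code pushes a continuation rather than using direct style):

;; Before applying a procedure, poll for interrupts.  Conceptually, a
;; continuation that redoes the original application is pushed and the
;; interrupt handler is applied instead; in direct style that is:
(define (apply-with-interrupt-poll procedure arguments)
  (if *interrupt-pending?*
      (begin (handle-heap-check-failure)   ; runs the interrupt handler
             (apply procedure arguments))
      (apply procedure arguments)))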

In MIT-Scheme, the virtual machine is a stack machine, and at apply time the entire state of the virtual machine is pointed to by the stack pointer. This is a good state to be in when you handle interrupts or garbage collect. But this means that arguments are passed on the stack. This instruction:

(mov q (r 0) (@r 4))
loads register 0 with the contents of memory at register 4. (The compiler convention is that register 4 is the stack pointer.) In other words, it is fetching the value of the argument tail from the stack. This is the second memory cycle.

Between interrupt polls, the compiled code can freely manipulate Scheme objects without worrying about being embarrassed by the garbage collector. But at apply time, when an interrupt poll could happen, the compiled code must put the virtual machine back into a coherent state. In particular, modified values are stored back on the stack. This instruction just before the jump puts the value of tail back on the stack before jumping to the top of the loop:

(mov q (@r 4) (r 1)) 
That's three memory cycles in addition to the actual fetching of the CDR in this instruction:
(mov q (r 1) (@ro 0 8))
Of course these are quite inexpensive memory cycles because the top of the stack is cached, but there is at least the time it takes to validate the cache entry.

The interrupt poll occurs every time around the loop, so we copy the arguments back and forth between the stack and registers on each iteration. If we unroll the loop, we can avoid some of the copying:

(define (last-cdr list)
  (let loop ((tail list))
    (if (pair? tail)
        (loop (cdr tail))
        tail)))

(define (last-cdr2 list)
  (let loop ((tail list))
    (if (not (pair? tail))
        tail
        (let ((tail (cdr tail)))
          (if (not (pair? tail))
              tail
              (loop (cdr tail)))))))
The loop in last-cdr2 compiles to this:
loop-6:
 (cmp q (r 7) (@r 6))           ;; Interrupt check
 (jge (@pcr interrupt-15))  

 (mov q (r 0) (@r 4))           ;; Fetch 'tail'

 (mov q (r 1) (r 0))            ;; Extract the tag
 (shr q (r 1) (&u #x3a))

 (cmp b (r 1) (&u 1))           ;; Check for pair?
 (jne (@pcr label-17))

 (and q (r 0) (r 5))            ;; Mask the tag

 (mov q (r 1) (@ro 0 8))        ;; Fetch the CDR

 (mov q (@r 4) (r 1))           ;; Assign 'tail'

 (mov q (r 0) (r 1))
 (shr q (r 0) (&u #x3a))        ;; Extract the tag

 (cmp b (r 0) (&u 1))           ;; Check for pair?
 (jne (@pcr label-19))

 (and q (r 1) (r 5))            ;; Mask the tag

 (mov q (r 0) (@ro 1 8))        ;; Fetch the CDR

 (mov q (@r 4) (r 0))           ;; Assign 'tail'

 (jmp (@pcr loop-6))            ;; Do it again
If I count correctly, there are six memory cycles per iteration of the unrolled loop, two of which are CDRs. We also save a jmp: there is now one jump per two CDRs instead of one per CDR.

Here's the timing on my machine:

(test-last-cdr)
3.643 nanoseconds per cdr
3.643 nanoseconds per cdr
3.711 nanoseconds per cdr
3.643 nanoseconds per cdr
3.576 nanoseconds per cdr

(test-last-cdr2)
2.766 nanoseconds per cdr
2.699 nanoseconds per cdr
2.834 nanoseconds per cdr
2.699 nanoseconds per cdr


Sunday, January 24, 2016

Locality and GC

I was curious about how long it took to CDR down a list. I wrote a little program that adds up the elements in a list and tried it out.

(define test-list (iota (expt 10 6)))

(define (sum-list list)
  (let loop ((total 0)
	     (tail  list))
    (if (pair? tail)
	(loop (+ total (car tail)) (cdr tail))
	(begin (if (not (null? tail))
		   (error:not-list list 'sum-list))
	       total))))

(define (test-sum-list n)
   (do ((i 0 (+ i 1))
        (answer '() (sum-list test-list)))
       ((not (< i n)) answer)))

1 ]=> (show-time (lambda () (test-sum-list 1000)))

;process time: 4280 (4280 RUN + 0 GC); real time: 4281
;Value: 499999500000
But after garbage collection, the performance dropped a bit.
1 ]=> (gc-flip)

;GC #18 15:18:59: took:   0.13   (0%) CPU,   0.13   (0%) real; free: 246114893
;Value: 246114893

1 ]=> (show-time (lambda () (test-sum-list 1000)))

;process time: 4490 (4490 RUN + 0 GC); real time: 4483
;Value: 499999500000
Now each call to sum-list takes about 4.49 ms instead of 4.28 ms; with a million cons cells in the list, that's roughly 4.5 nanoseconds per CDR, up from about 4.3.

The garbage collector in MIT Scheme is a very simple, stop-the-world, copying collector. The GC algorithm traverses the heap in breadth-first order, and an undesired effect of this is that data structures get spread around the heap. When I first built test-list, the cons cells were close together. Here are the first few addresses:
1 ]=> (do ((tail test-list (cdr tail))
                (i 0 (+ i 1)))
               ((not (< i 10)) #f)
             (display (object-datum tail))
             (newline))
253936080
253936064
253936048
253936032
253936016
253936000
253935984
253935968
253935952
253935936
;Value: #f
But after garbage collection...
1 ]=> (gc-flip)

;GC #19 15:33:28: took:   0.15   (1%) CPU,   0.16   (0%) real; free: 246114593
;Value: 246114593

1 ]=> (do ((tail test-list (cdr tail))
              (i 0 (+ i 1)))
             ((not (< i 10)) #f)
           (display (object-datum tail))
           (newline))
194190168
194326376
194416512
194499936
194555336
194584992
194609544
194631760
194652360
194669208
;Value: #f
All the cons cells are on different pages.
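A toy model shows why (my own illustration, not the real collector): a breadth-first copy pushes an object's children on a queue instead of following them immediately, so consecutive cells of a cdr chain come out separated by everything else reachable at the same depth.

;; Toy model of Cheney-style breadth-first copy order: visit the front
;; of the queue, append its children to the back.  The result is the
;; order in which objects would be transported to to-space.
(define (bfs-copy-order roots object-children)
  (let loop ((queue roots) (order '()))
    (if (null? queue)
        (reverse order)
        (loop (append (cdr queue) (object-children (car queue)))
              (cons (car queue) order)))))

(define (children object)
  (cond ((pair? object) (list (car object) (cdr object)))
        ((vector? object) (vector->list object))
        (else '())))

;; The cells of the three-element list come out interleaved with the
;; vector and its elements rather than adjacent to one another.
(bfs-copy-order (list (list 1 2 3) (vector 'a 'b)) children)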

This is annoying, so I hacked the GC so that it transports chains of cons-cells eagerly.
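Roughly, the hack does something like this (a runnable toy model with names of my own; the real change is in the collector's C code): when a pair is transported, the run of pairs hanging off its cdr is transported immediately, so the chain stays contiguous in to-space.

(define *to-space* '())   ; toy to-space: just records transport order
(define *transported* (make-strong-eqv-hash-table))

(define (already-transported? object)
  (hash-table/get *transported* object #f))

(define (copy-to-tospace! object)
  (hash-table/put! *transported* object #t)
  (set! *to-space* (cons object *to-space*)))

(define (transport-pair! pair)
  ;; Eagerly transport the chain of pairs reachable through successive
  ;; CDRs, keeping them adjacent in to-space.
  (let loop ((p pair))
    (copy-to-tospace! p)
    (let ((tail (cdr p)))
      (if (and (pair? tail) (not (already-transported? tail)))
          (loop tail)))))

If we do the same experiment, here is the timing before the flip: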

1 ]=> (show-time (lambda () (test-sum-list 1000)))

;process time: 4280 (4280 RUN + 0 GC); real time: 4280
;Value: 499999500000
And after the flip:
1 ]=> (gc-flip)

;GC #5 15:38:53: took:   0.23   (2%) CPU,   0.24   (0%) real; free: 249429521
;Value: 249429521

1 ]=> (show-time (lambda () (test-sum-list 1000)))

;process time: 4100 (4100 RUN + 0 GC); real time: 4100
;Value: 499999500000
Now we see a performance increase. If we look at the test list,
1 ]=> (do ((tail test-list (cdr tail))
           (i 0 (+ i 1)))
          ((not (< i 10)) #f)
        (display (object-datum tail))
        (newline))
193516792
193516808
193516824
193516840
193516856
193516872
193516888
193516904
193516920
193516936
;Value: #f

You can see that the cons-cell addresses are now adjacent: each cell sits 16 bytes after the one before it.