optical computers

I propose a bit transposition capability to exploit a processor's ability to perform logic operations on 32, 64 or more bits at a time.

Some applications, especially databases, must compute bitlogic operations on many items, but the homologue information bits are often spread one bit per item (example: an item for sale is new with tag, new without box, refurbished, or used) while bitlogic needs them grouped in one word. So the instruction would perform


Destination [word i, bit j] = Source [word j, bit i]

on 64 words of 64 bits for instance, prior to the sequence of bitlogic operations that filters the items through a set of conditions, and possibly after it.
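A software reference of the proposed semantics, in plain C; the function name and loop form are mine, and a hardware unit would of course shuffle all the bits in one pass rather than loop:

```c
#include <stdint.h>

/* Software reference for the proposed instruction:
   Destination [word i, bit j] = Source [word j, bit i],
   over 64 words of 64 bits. */
void bit_transpose_64(uint64_t dst[64], const uint64_t src[64])
{
    for (int i = 0; i < 64; i++) {
        uint64_t w = 0;
        for (int j = 0; j < 64; j++)
            w |= ((src[j] >> i) & 1) << j;  /* bit i of word j -> bit j of word i */
        dst[i] = w;
    }
}
```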

The bit transposition would nicely combine with the "scaled copy" I already described
http://www.scienceforums.net/topic/78854-optical-computers/page-2#entry778467
The scaled copy extracts the bit fields of a group of items and packs them in consecutive words; the bit transposition then puts homologue bits extracted from varied items in one word, ready for parallel bitlogic processing. Afterwards, a bit transposition and a scaled copy can write the processed bits back to the items when necessary. One operation could combine the scaled copy with the bit-reversed copy, but I don't feel it vital.

The execution unit could hardly shuffle so much data, but the Dram or the data cache can. It could then be a transposed copy instruction, similar to the already described "scaled copy" and "bit-reversed copy", done between address areas corresponding to variables named by the source program; the program would then take care of data coherency, as it has the necessary information for it while the hardware has not. Here too, the cache or Dram must receive and execute more detailed instructions than read and write, which is easier when they're linked at manufacture.

Transposing 32 words * 32 bits or 64 words * 64 bits means 128 or 512 bytes, an efficient size for data transfer between the Dram and the cache. For words of 256 or 512 bits, do it as you can.

Data areas don't always start and stop at multiples of 64 words, and memory page protection will meddle in too. I happily leave these details to other people. Solutions exist already for vector computers.

Marc Schaefer, aka Enthalpy

 

=================================================================


I suggest a wide cascaded logic operation that operates on many independent bits in parallel, for instance the word width of the execution unit, and computes at once a small chain of logic operations over a few registers.

It can be similar and complementary to the cascaded logic operation I already described:
http://www.scienceforums.net/topic/78854-optical-computers/page-3#entry854751 and following

  • The reverse Polish notation and an implicit stack simplify the execution;
  • Four bits can encode the operations And, Or, Eq, Gt, Lt and complements, plus Push, Swap, Nop...;
  • Three or four bits can designate the successive source registers (rather than the source bits, as for the previous cascaded logic).

The logic expression to parse a database isn't known at compilation time, so the operation follows computed data rather than a fixed set of opcodes. For healthy context switching, some general registers can hold this information about the source registers and the chain of operations.

For instance, one 64b register can indicate 7 source registers among 32, coded on 5 bits each, plus 7 operations coded on 4 bits each. This seems enough for a simple low-power processor that specializes in databases. Or one 64b register can indicate 16 source registers among 16, coded on 4 bits, and another register 16 operations coded on 4 bits; this fits the consumption and complexity of a number cruncher better. The throughput of the Dram and cache can be a limit.
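A sketch of how the first encoding could be interpreted, with 7 register numbers of 5 bits in the low 35 bits and 7 opcodes of 4 bits above them. The opcode values and names are hypothetical illustrations, not a proposed encoding; a descriptor is assumed well-formed, i.e. its first opcode pushes:

```c
#include <stdint.h>

/* Hypothetical opcode values, for illustration only. */
enum { OP_PUSH = 0, OP_AND, OP_OR, OP_XOR, OP_ANDN, OP_END };

/* Interpret one 64b descriptor: 7 source registers among 32 (5 bits each,
   low 35 bits) and 7 operations (4 bits each, above), reverse Polish with
   an implicit stack, all 64 bit lanes processed in parallel. */
uint64_t wide_cascade(const uint64_t regs[32], uint64_t descr)
{
    uint64_t stack[8];
    int sp = 0;
    for (int k = 0; k < 7; k++) {
        unsigned r  = (descr >> (5 * k)) & 31;       /* source register */
        unsigned op = (descr >> (35 + 4 * k)) & 15;  /* operation       */
        uint64_t s = regs[r];
        switch (op) {
        case OP_PUSH: stack[sp++] = s;   break;
        case OP_AND:  stack[sp-1] &= s;  break;
        case OP_OR:   stack[sp-1] |= s;  break;
        case OP_XOR:  stack[sp-1] ^= s;  break;
        case OP_ANDN: stack[sp-1] &= ~s; break;
        case OP_END:  return stack[sp-1];
        }
    }
    return stack[sp-1];
}
```

In hardware the seven steps would be a short combinational cascade, not a loop, so the whole chain fits in one or few cycles.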

The "wide cascaded logic" operation combines nicely with the "transposed copy" and "scaled copy" operations.

As the resulting bitfield can indicate which database items to process further, and the wide cascaded logic screens the items in less than one cycle each, one more instruction could usefully indicate where the first 1 bit is in the result and discard it. Some Cpu have that already.
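The software equivalent of that instruction, looping over the selected items; the function name is mine. GCC and Clang provide __builtin_ctzll, and x86 already offers TZCNT and BLSR for the two halves:

```c
#include <stdint.h>

/* Report the index of the lowest 1 bit in the result word and clear it,
   or -1 when no selected item remains. */
int next_item(uint64_t *bits)
{
    if (*bits == 0)
        return -1;                  /* no more selected items     */
    int i = __builtin_ctzll(*bits); /* index of the lowest 1 bit  */
    *bits &= *bits - 1;             /* clear it (x86 BLSR)        */
    return i;
}
```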

Marc Schaefer, aka Enthalpy


These are the technology developments for the latest architectures I described as of today, with the date of the first suggestion and, where possible, a search keyword.

---------- Processors

  • Explicit macro-op instructions, 01 January 2016, search "macro-op".
    Meaningful for any Cpu, most useful for the simplest ones with lowest consumption. The compiler must be adapted.
    This needs thoughts and paperwork first, possibly at a university or elsewhere.
  • Cpu optimized for integer add and compare at low power.
    31 January 2015. Hints on 26 February 2015, 12:44 AM. Description on 25 October 2015.
    Useful for databases, web servers, web search machines, artificial intelligence as noted on 31 January 2015 and meanwhile.
    This can begin as paperwork before switching to a semiconductor company.
  • Cascaded logic instruction, 26 February 2015 two messages, search "cascade", and
    Wide cascaded logic instruction, Jul 06, 2016 6:34 pm, search "wide cascade".
    Mostly for databases, web servers, web search machines, hence belongs to any Cpu.
    Small hardware development immediately at the silicon design company. Extensions to the optimizing compiler by the editor, may benefit from theoretical thoughts at a university.
  • Find string instruction, 26 February 2015, search "string".
    For any Cpu. Some existing Cpu must already offer it.
    Hardware development at the silicon design company. Extensions to the optimizing compiler by the editor.
  • Complex 32b mul-acc, 18 October 2015 - 01:25 PM, search "complex".
  • Multiple flow control units, 26 October 2015, search "flow control".
    Subtract-one-and-branch instruction, 26 October 2015, search "subtract one".
    Desired everywhere outside linear algebra. May well exist already.
    This can begin as paperwork before switching to a semiconductor company.

---------- Memory

Common point: at most one Dram chip per computing node, with enough throughput to feed the processor.

  • Dram chip with many accesses, 19 November 2013.
  • Full-speed scaled indexed access, 19 November 2013, search "scaled".
    For any Cpu, Gpu, gamestation, database, web server, number cruncher.
    Needs clear thoughts before, possibly at a university. Then, easy silicon.
  • Full-speed butterfly access, 19 November 2013, search "butterfly".
    For number crunchers, especially signal processing.
    Needs clear thoughts before, possibly at a university. Then, easy silicon.
  • Bit transposed copy, Jul 06, 2016 5:05 pm, search "transposed copy".
    Mostly for databases.
    Needs thoughts before. Then, easy silicon.
  • Flash chips with big throughput, 01 May 2015.
    A need, not a solution. Flash chips close to the computing nodes, with ports faster than Usb and Sata.
    For server machines and also number crunchers.
    A task for the Flash company.

---------- Stacked chips

  • Adaptive connections accept misalignment, 08 and 23 February 2015, search "adaptive" (not only chips).
    For any Cpu, Gpu, gamestation, database, web server, number cruncher.
    Hardware development at the silicon design company. Optional upsized proof-of-concept at any electronics lab.
  • Capacitive connections, Nov 12, 2015 2:47 am, search "capacit" (not only chips)
    For any Cpu, Gpu, gamestation, database, web server, number cruncher. PhD thesis exists already.
    No fine lithography, hence by any semiconductor lab, including equipped university.
  • Connections by reflow accept misalignment, search "reflow".
    Nov 12, 2015 2:47 am, details 27 December 2015.
    No fine lithography, hence by any semiconductor lab, including equipped university.

---------- Software

  • OS subset for small unit Dram, 13 September 2015.
    A need, not a solution. Database machines can have a unit Dram even smaller than present supercomputers.
    Usually done by the computer manufacturer, could be a collective effort too.
  • Lisp, Prolog interpreters, inference engine for small unit Dram, 07 December 2013.
    A need, not a solution.

---------- Boards

  • Data vessels, 08 February 2015, search "vessel".
    By any electronics lab, or the Pcb company.
  • Pcb with more signals: denser or more layers (or bigger).
    A task for a Pcb company.

---------- Machines

  • Crossboards, 01 February 2015.
    For most machines with several boards.
    Electrical engineers, possibly with a (opto-) chip company, shall bring throughput.
  • Optical board-to-board connectors, 02 February 2015 and 22 February 2015 - 12:47 AM, search "optical link".
    For most machines with several boards - but consider capacitive connectors.
    Needs some optics and some fast electronics, possibly together with an optochip company.
  • Capacitive board-to-board connectors, 08 February 2015, search "capacitive".
    For most machines with several boards.
    Proof-of-concept by a skilled and equipped electronics lab, then chips by a semiconductor designer and manufacturer.
  • Flexible cables with repeaters for many fast signals, 02 December 2013.
    For every machine except the smallest ones.
    By a skilled and equipped electronics lab together with a manufacturer of flexible printed circuits.
  • Cable to board connectors by contact, 02 December 2013.
    For most machines with several boards - but consider capacitive connectors.
    Needs skills for mechanical engineering and fast electronics.
  • Wider so-dimm-like connector, 18 October 2015 - 01:25 PM.
    For Pci-E boards, maybe others.
    Development by the connector manufacturer, needs skills for fast electronics.
  • Capacitive cable to board connectors, Nov 12, 2015 1:49 am, search "capacitive".
    For most machines with several boards.
    Almost the same development as the capacitive connections between chips.
  • Superconducting data cables, 02 December 2013, search "superconduct".
    For many-cabinet machines.
    Needs skills for superconductors, (printed?) cables and fast electronics.
  • Insulating coolant, 15 November 2015.
    Chemistry lab.


Some chips progressed from mid-2016 to end 2018.

Processors have stalled: 14nm finfet then, 12nm finfet now. Intel's Xeon Phi got only a minor update since the Knights Landing. nVidia's Titan V provides about as many 64b GFlops using about as many watts as the Knights Landing. Waiting for 7nm processors.

Dram chips still have 1GB capacity. Their throughput increased thanks to parallelism and new bus standards, not to faster access to the cells.

But Flash memory did improve:

Imft (Intel-Micron) made a chip, nonvolatile but not Flash, that is fast but draws too much power for my goal here.

Samsung brought its Z-Nand (Slc V-Nand) to around 20µs read and write latency. The chips are not documented, but as deduced from the Ssd, they could each hold 25GB and read or write 200MB/s. This is fantastic news for a database machine, and also for a number cruncher with good virtual memory. Better: this throughput seems to come from one single 4kB page, so each processor could have its own 200MB/s access to its part of the Flash chip. 50 times the Dram capacity, reading or writing the complete Dram to virtual memory in 0.3s: that looks sensible again. How much does a chip cost? How much does it consume?


I had suggested an instruction copied from the Vax-11's Subtract One and Branch on
26 October 2015
and the x86 family already has one.

==========

One computation I have quite often in my programmes is
  if (|x-ref| < epsilon)

The operation on floating numbers is lighter and faster than a multiplication, hence easily done in one cycle. It's often in inner loops of heavy computations. Processors that don't provide this operation should.

Depending on hardware timing, the instruction set could provide variants, preferably the most integrated one:
 

|x-ref|
Compare |x-ref| with epsilon
Branch if |x-ref| < epsilon

Simd processors (Sse, Avx...) could compute on each component and group the comparisons by a logical operation, possibly with a mask, in the same instruction or a following one. Inevitably with Simd, it makes many combinations.
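The scalar pattern the most integrated variant would fuse into one instruction, spelled out in C (the function name is mine):

```c
#include <math.h>

/* The whole test is a subtraction, a sign-bit clear (fabs), and a
   comparison: lighter than a multiplication, hence a one-cycle
   candidate when fused. */
int within(double x, double ref, double epsilon)
{
    return fabs(x - ref) < epsilon;
}
```

On Simd machines the same pattern maps to a vector subtraction, a sign-bit mask and a vector compare, with a movemask or a logical reduction to group the per-component results.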

==========

Another computation frequent in scientific programmes is
  if (|x-ref|² < epsilon)

It looks about as heavy as a multiply-accumulate, but denormalizations take more time. If it fits in one cycle, fine; with the comparison and the branch, better; but it's obviously not worth a slower cycle.

Here too, Simd machines could group the comparisons by a logical operation.

I feel the square is less urgent than the previous absolute value, which can often replace it. Also, a test is often done on the sum of the squared components of the difference instead, and such a sum is also useful alone, without a test or a branch.
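That common variant, testing the squared Euclidean distance of a short vector against a precomputed epsilon², is one multiply-accumulate per component, which Simd units already pipeline well. A sketch (names are mine):

```c
#include <stddef.h>

/* Compare the squared distance between two n-component vectors
   against eps2 = epsilon squared, one mul-acc per component. */
int within_sq(const double *x, const double *ref, size_t n, double eps2)
{
    double acc = 0.0;
    for (size_t i = 0; i < n; i++) {
        double d = x[i] - ref[i];
        acc += d * d;
    }
    return acc < eps2;
}
```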

Marc Schaefer, aka Enthalpy


Small corporate file servers, web servers, database servers and number crunchers are commonly built of a few dozen blades, each holding a pair of big, recent, expensive processors, with a rather loose network between the blades and a Dram throughput that lags the Cpu needs more and more. I propose to use the better networks already described here and assemble more of the used, outfashioned, cheap processors instead. This shall cumulate more throughput from the Dram, the network, and optionally the disks.

==========

Here I compare old and recent processors, all from Intel as they made most servers recently, but I have nothing against Amd, Arm and the others. I checked only a few among the 1001 variants and subjectively picked some for the table.

North- and southbridges are soldered on existing mobos, hence neither available second-hand nor usable, and new ones would remain expensive once outfashioned. So I checked only Cpu that directly access the Dram and create many Pci-E links to make the network. Chips on the Pci-E links shall provide the ports for Ethernet and the disks, like on add-on cards; I like Pci-E disks, but they rob too much mobo area here. I didn't check what the Bios and monitoring need. One special card shall connect a screen, keyboard etc.

I excluded Pci-E 2 Cpu for throughput and Pci-E 4 Cpu for price in 2019. Pci-E 3 offers 8GT/s per lane and direction, so a good x16 network provides ~16GB/s to each Cpu.

Line e is a desktop Cpu, less cheap. Lines f and g are modern big Cpu that make a server expensive for a medium company. Lines a, b, c, d are candidates for my proposal; in 2019 these have Avx-256 and Ddr3. None integrates a Gpu that would draw 15W.


 

  |  #  GHz  64b | #  MT/s |  W  |   W/GHz   |    Cy/T    |
===========================================================
a |  4  2.8  (4) | 3  1333 |  80 | 7.1 (1.8) | 2.8 (11)   |
b |  6  2.9  (4) | 4  1600 | 130 | 7.5 (1.9) | 2.7 (11)   |
c |  8  2.4  (4) | 4  1600 |  95 | 4.9 (1.2) | 3.0 (12)   |
d |  6  2.4  (4) | 4  1600 |  60 | 4.2 (1.0) | 2.3  (9.0) | <<==
===========================================================
e |  6  3.3  (4) | 4  2133 | 140 | 7.1 (1.8) | 2.3  (9.3) |
===========================================================
f | 72  1.5  (8) | 6  2400 | 245 | 2.3 (0.3) | 7.5 (60)   |
g | 24  2.7  (8) | 6  2933 | 205 | 3.2 (0.4) | 3.7 (30)   |
===================================================================
a = Sandy Bridge-EN Xeon E5-1410.   LGA1356, DDR3, 24 lanes,   20€.
b = Sandy Bridge-EP Xeon E5-2667.   LGA2011, DDR3, 40 lanes,   40€.
c = Sandy Bridge-EP Xeon E5-4640.   LGA2011, DDR3, 40 lanes,   50€.
d = Ivy Bridge-EP Xeon E5-2630L v2. LGA2011, DDR3, 40 lanes,   40€.
===================================================================
e = Haswell-E Core i7-5820K.      LGA2011-3, DDR4, 28 lanes,  130€.
===================================================================
f = Knights Landing Xeon Phi 7290.  LGA3647, DDR4, 36 lanes, ++++€
g = Cascade Lake-W Xeon W-3265.     LGA3647, DDR4, 64 lanes, 3000€
===================================================================

 

  • First # is the number of cores, GHz the base clock, 64b is 4 for Avx-256 and 8 for Avx-512.
  • Next # is the number of Dram channels, MT/s how many 64b words a channel transfers in a µs.
  • W is the maximum design consumption of a socket, called TDP by Intel.
  • W/GHz deduces an energy per scalar or (Simd) computation. It neglects the small gains in cycle efficiency of newer Core architectures.
  • Cy/T compares the Cpu and Dram throughputs in scalar and (Simd) mode, ouch! It's the number of cycles the cores shall wait to obtain one 64b or (Avx) word from the Dram running ideally.
  • The price is for second hand, observed on eBay for reasonable bargains on small amounts in 2019.
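How the Cy/T column is derived, reproducing line d of the table (the function name is mine): 6 cores at 2.4GHz against 4 channels at 1600MT/s gives 2.25, rounded to 2.3 in the table, and four times that in Avx-256 mode.

```c
/* Cycles the cores wait per 64b word from the Dram running ideally:
   total core cycles per second over 64b words per second. */
double cy_per_word(int cores, double ghz, int channels, int mts)
{
    double cycles_per_s = cores * ghz * 1e9;          /* scalar 64b ops/s */
    double words_per_s  = channels * (double)mts * 1e6; /* 64b words/s    */
    return cycles_per_s / words_per_s;
}
```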

While an OS runs well from cache memory, most scientific programmes demand Dram throughput or are difficult to write for a cache. For databases, the Dram latency decides, and it hasn't improved for years. A Dram easy to use would read two data and write one per core and cycle, or Cy/T=0.33, which no present computer achieves. I favoured this ratio when picking the list. More cores make recent Cpu worse, wider Simd even worse. Assembling many old Cpu cumulates more Dram channels.

Process shrinks improve the consumption per computation. But if an oldfashioned Cpu draws 60W and a recent one saves half of that by finishing faster, then over 1/3 activity and 5 years the gain is 438kWh, less than 100€, which doesn't buy a fashionable Cpu. And if the Dram stalls the recent Cpu more often, the gain vanishes.
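The savings arithmetic spelled out: 30W saved (half of 60W) at one-third duty over five years. The electricity price behind the "less than 100€" claim, around 0.20€/kWh, is my assumption.

```c
/* Energy saved over a given period: average power saved times hours. */
double saved_kwh(double watts_saved, double duty, double years)
{
    return watts_saved * duty * years * 8760.0 / 1000.0; /* 8760 h/year */
}
```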

==========

Each Cpu with its Dram, Ethernet and disk ports shall fit on a daughter board plugged by Pci-E x16 (or two x16 if at all possible) into a big mobo that makes the network. And if Pci-E 3 signals can cross a Ddr4 connector, then such a connector can carry 32+ lanes.

The network comprises 16 independent planes where chips make a full crossbar, or if needed a multidimensional crossbar.

  • Can a Cpu on the mobo make a 40*40 crossbar? It takes many Cpu there, and software makes slow communications.
  • At least one crossbar Asic exists for Pci-E connections among Cpu. If that Asic isn't available, make an other.
  • A 32*32 full matrix chip fits in a Dram chip package and can connect 1024 Cpu in a 2D crossbar. 15*16=240 Cpu with 8+8 lanes each take (15+16)*8=248 matrix chips. A chip can serve several smaller planes.
  • The routing density needs a mobo with many layers. Repeaters may be necessary.
  • A big machine with 480 Cpu connects any two Cpu in two hops and transfers 2TB/s through any equator in any direction. Better than a few fibres as a hypertorus. Many small Cpu outperform again few big ones.

==========

Liquid cooling takes few mm over the Cpu. Some alkanes are insulators, good coolants, and hard to ignite, as described in the Low-freezing rocket fuels thread.

New Dram chips soldered directly on the daughter boards, like on graphics cards, would enable 12.7mm spacing; a few tens of euros buy 16GB presently. Used Dram modules would be bigger but more flexible; tilted connectors exist for Ddr4 at least (au.rs-online.com), and horizontal connectors for So-dimm. Or add a minimum Pcb to hold a second Dimm connector making the angle.

Daughter boards need local regulators for the Cpu, Dram etc. Like graphics cards do, they can receive 12V from (many) Pc power supplies with minimum recabling. As the cabinet's sides are usefully reserved to Ethernet, the Sata disks and power supplies could reside in the doors.

Using an Ivy Bridge-EP Xeon E5-2630L v2 or similar, each daughter board might sell for 250€.

  • A small cabinet with 30 daughter boards would sell for 10k€, cumulate 3.5TFlops on doubles, Dram 1.5TB/s, network 240GB/s through the Equator.
  • A big cabinet with 480 daughter boards would sell for 160k€, cumulate 56TFlops, Dram 24TB/s, network 1.9TB/s. 400MB/s disks, one per board, would cumulate 190GB/s.

Drawings may come. Perhaps.

==========

While not a competitor for the clean-sheet architectures I proposed previously, such machines assemble existing hardware. As Pci-E is fully compatible, the number and nature of the daughter boards can evolve, the size of the mobo too, and the boards can serve in successive machines at different customers. As they depend on available second-hand processors, the daughter boards would be diverse within a machine, and the software must cope with small variations in the instruction set.

With reasonable capital, a startup company can buy used Cpu on eBay-Alibaba-etc, or rather complete old-fashioned servers with Dram and Ssd, and reorganize more components better around the superior network.

Marc Schaefer, aka Enthalpy

