08:49PM EDT - One of the big pieces of news today is Cerebras announcing its wafer-scale, 1.2-trillion-transistor solution for deep learning. The talk goes into detail about the technology.

08:51PM EDT - Wafer-scale chip: 46,225 mm², 1.2 trillion transistors, 400,000 AI cores, fed by 18 GB of on-chip SRAM

08:51PM EDT - TSMC 16nm

08:51PM EDT - 215mm x 215mm - 8.5 inches per side

08:51PM EDT - 56 times larger than the largest GPU today
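
(ed: quick sanity check on that multiplier, assuming NVIDIA's ~815 mm² V100 is the "largest GPU" being referenced:)

```python
wafer_area_mm2 = 215 * 215     # 46,225 mm^2, as stated in the talk
v100_area_mm2 = 815            # assumption: NVIDIA V100 die area
print(wafer_area_mm2 / v100_area_mm2)   # ~56.7, which matches the "56 times larger" claim
```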

08:52PM EDT - Built for Deep Learning

08:52PM EDT - DL training is hard (ed: this is an understatement)

08:52PM EDT - Peta-to-exa scale compute range

08:53PM EDT - The shape of the problem is difficult to scale

08:53PM EDT - Fine grain has a lot of parallelism

08:53PM EDT - Coarse grain is inherently serial

08:53PM EDT - Training is the process of applying small changes, serially

08:53PM EDT - Size and shape of the problem makes training NN really hard

08:53PM EDT - Today we have dense vector compute

08:54PM EDT - For coarse-grained scaling, a high-speed interconnect is required to run multiple instances. Still limited

08:54PM EDT - Scaling is limited and costly

08:54PM EDT - Specialized accelerators are the answer

08:55PM EDT - NN: what is the right architecture

08:55PM EDT - Need a core to be optimized for NN primitives

08:55PM EDT - Need a programmable NN core

08:55PM EDT - Needs to do sparse compute fast

08:55PM EDT - Needs fast local memory

08:55PM EDT - All of the cores should be connected with a fast interconnect

08:56PM EDT - Cerebras uses flexible cores. Flexible general ops for control processing

08:56PM EDT - Core should handle tensor operations very efficiently

08:56PM EDT - Forms the bulk of the compute in any neural network

08:56PM EDT - Tensors as first class operands

08:57PM EDT - fmac native op
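
(ed: the fused multiply-accumulate they mean is just acc += a * b performed as one operation per element; a minimal NumPy illustration, not Cerebras code:)

```python
import numpy as np

a = np.random.rand(4, 4).astype(np.float32)
b = np.random.rand(4, 4).astype(np.float32)
acc = np.zeros((4, 4), dtype=np.float32)
acc += a * b   # elementwise fused multiply-accumulate; a single native op per element on the core
```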

08:57PM EDT - NN naturally creates sparse networks

08:58PM EDT - The core has native sparse processing in the hardware with dataflow scheduling

08:58PM EDT - All the compute is triggered by the data

08:58PM EDT - Filters out the sparse zeros, and with them the wasted work

08:58PM EDT - Saves power and energy, and gains performance by moving on to the next piece of useful work

08:58PM EDT - Enabled because arch has fine-grained execution datapaths

08:58PM EDT - Many small cores with independent instructions

08:59PM EDT - Allows for very non-uniform work
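
(ed: a software analogy for this data-triggered, zero-skipping behaviour - only nonzero activations launch work; this is a conceptual sketch, not how the hardware is actually implemented:)

```python
import numpy as np

def sparse_dataflow_mac(weights, activations):
    """Accumulate weights * activations, doing work only where the activation is nonzero."""
    out = np.zeros(weights.shape[0], dtype=np.float32)
    for j, a in enumerate(activations):
        if a == 0.0:
            continue                    # zero operand: no compute is triggered
        out += weights[:, j] * a        # only useful work is scheduled
    return out

acts = np.array([0.0, 1.5, 0.0, 0.0, 2.0], dtype=np.float32)   # ReLU-style sparse activations
w = np.random.rand(3, 5).astype(np.float32)
print(sparse_dataflow_mac(w, acts))
```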

08:59PM EDT - Next is memory

08:59PM EDT - Traditional memory architectures are not optimized for DL

08:59PM EDT - Traditional memory requires high data reuse for performance

09:00PM EDT - The natural matrix-vector multiply in neural networks has low data reuse

09:00PM EDT - The usual workaround is translating Mat*Vec into Mat*Mat (batching), but that changes the training dynamics
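
(ed: the data-reuse point in code form - with a single sample each weight is fetched for one multiply, while batching N samples reuses each weight N times; the sizes below are mine, not from the talk:)

```python
import numpy as np

d_out, d_in, batch = 1024, 1024, 64
W = np.random.rand(d_out, d_in).astype(np.float32)

# Mat*Vec: each weight is loaded for a single multiply -> data reuse of 1
x = np.random.rand(d_in).astype(np.float32)
y = W @ x

# Mat*Mat: the same weight is applied to all 64 samples -> data reuse of 64,
# which is what cache/DRAM hierarchies want, but it forces a larger batch size
X = np.random.rand(d_in, batch).astype(np.float32)
Y = W @ X
```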

09:00PM EDT - Cerebras has high-perf, fully distributed on-chip SRAM next to the cores

09:01PM EDT - Getting orders of magnitude higher bandwidth

09:01PM EDT - ML can be done the way it wants to be done

09:01PM EDT - High bandwidth, low latency interconnect

09:01PM EDT - fast and fully configurable fabric

09:01PM EDT - All communication is hardware based, avoiding software overhead

09:02PM EDT - 2D mesh topology

09:02PM EDT - Higher utilization and efficiency than global topologies
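
(ed: a 2D mesh only links each router to its nearest neighbours, which keeps wiring local; a toy neighbour-lookup to illustrate the topology:)

```python
def mesh_neighbours(x, y, width, height):
    """Return the nearest-neighbour coordinates of the core at (x, y) in a 2D mesh."""
    candidates = [(x - 1, y), (x + 1, y), (x, y - 1), (x, y + 1)]
    return [(i, j) for i, j in candidates if 0 <= i < width and 0 <= j < height]

print(mesh_neighbours(0, 0, 4, 4))   # corner core: [(1, 0), (0, 1)]
print(mesh_neighbours(2, 2, 4, 4))   # interior core: [(1, 2), (3, 2), (2, 1), (2, 3)]
```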

09:02PM EDT - Need more than a single die

09:02PM EDT - Solution is wafer scale

09:03PM EDT - Build Big chips

09:03PM EDT - Cluster scale perf on a single chip

09:03PM EDT - GB of fast memory (SRAM) 1 clock cycle from the core

09:03PM EDT - That's impossible with off-chip memory

09:03PM EDT - Full on-chip interconnect fabric

09:03PM EDT - Model parallel, linear performance scaling

09:04PM EDT - Map the entire neural network onto the chip at once

09:04PM EDT - One instance of NN, don't have to increase batch size to get cluster scale perf
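
(ed: the contrast is with data parallelism, where cluster-scale throughput comes from replicating the model and growing the global batch; here the layers themselves are spread across the fabric, so one instance at the original batch size uses all the silicon. A schematic picture with hypothetical layer/region names:)

```python
# Data parallel (GPU cluster): N replicas of the whole model, global batch = N * b
# Model parallel (wafer): one replica, layers pinned to different regions of the fabric
placement = {
    "conv1":   "fabric_region_0",   # hypothetical names, for illustration only
    "conv2":   "fabric_region_1",
    "fc1":     "fabric_region_2",
    "softmax": "fabric_region_3",
}
# Activations stream region-to-region over the on-chip mesh; the batch size stays at b
```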

09:04PM EDT - Vastly lower power and less space

09:04PM EDT - Can use TensorFlow and PyTorch

09:05PM EDT - Performs place-and-route to map neural network layers onto the fabric

09:05PM EDT - Entire wafer operates on the single neural network
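
(ed: the software stack itself wasn't shown, but the input is an ordinary framework graph; a standard PyTorch model like this is the kind of description the place-and-route step would map onto the fabric - plain PyTorch here, nothing Cerebras-specific:)

```python
import torch
import torch.nn as nn

# An ordinary framework graph; per the talk, the compiler places each layer onto a
# region of cores and routes the activations between regions over the fabric.
model = nn.Sequential(
    nn.Linear(784, 4096),
    nn.ReLU(),
    nn.Linear(4096, 4096),
    nn.ReLU(),
    nn.Linear(4096, 10),
)
x = torch.randn(32, 784)
logits = model(x)
```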

09:05PM EDT - Challenges of wafer scale

09:05PM EDT - Need to solve cross-die connectivity, yield, and thermal expansion

09:06PM EDT - Scribe line separates the die. On top of the scribe line, create wires

09:07PM EDT - Extends 2D mesh fabric across all die

09:07PM EDT - Same connectivity between cores and between die

09:07PM EDT - More efficient than off-chip

09:07PM EDT - Full BW at the die level

09:08PM EDT - Redundancy helps yield

09:08PM EDT - Redundant cores and redundant fabric links

09:08PM EDT - Reconnect the fabric with links

09:08PM EDT - Drive yields high

09:09PM EDT - Transparent to software
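
(ed: conceptually the redundancy scheme is a map from logical core positions to physical ones that steers around cores marked bad, which is what keeps it invisible to software; a toy sketch of that remapping idea, not the actual mechanism:)

```python
def build_logical_map(grid_width, grid_height, spares_per_row, defective):
    """Map logical core columns to physical columns, skipping cores marked defective.

    Each row carries a few spare physical columns; a bad core is bypassed by
    shifting the logical columns onto the next good physical column.
    """
    logical_cols = grid_width - spares_per_row
    mapping = {}
    for row in range(grid_height):
        phys = 0
        for logical in range(logical_cols):
            while (phys, row) in defective:   # skip physical columns flagged as bad
                phys += 1
            mapping[(logical, row)] = (phys, row)
            phys += 1
    return mapping

# One defective core at physical (1, 0): logical column 1 of row 0 shifts to physical column 2
print(build_logical_map(6, 2, 1, defective={(1, 0)}))
```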

09:09PM EDT - Thermal expansion

09:09PM EDT - With normal tech, there is too much mechanical stress from thermal expansion

09:09PM EDT - Custom connector developed

09:09PM EDT - Connector absorbs the variation in thermal expansion

09:10PM EDT - All components need to be held with precise alignment - custom packaging tools

09:10PM EDT - Power and Cooling

09:11PM EDT - Power planes don't work - there isn't enough copper in the PCB to do it that way

09:11PM EDT - Heat density too high for direct air cooling

09:12PM EDT - Current is brought in perpendicular to the wafer. Water cooling is perpendicular too

09:14PM EDT - Q&A Time

09:14PM EDT - Already in use? Yes

09:15PM EDT - Can you make a round chip? Square is more convenient

09:15PM EDT - Yield? Mature processes are quite good and uniform

09:16PM EDT - Does it cost less than a house? Everything is amortised across the wafer

09:17PM EDT - Regular processor for housekeeping? They can all do it

09:17PM EDT - Is it fully synchronous? No

09:20PM EDT - Clock rate? Not disclosed

09:20PM EDT - That's a wrap. Next is Habana

Comments

  • Andy Chow - Monday, August 19, 2019 - link

    Incredible. I didn't believe it when I first saw an article about this, because of the cross-die connectivity - something I know nVidia paid a lot to avoid by building one larger individual die. But the full explanation makes sense. This is truly game-changing. I wonder about the clock rate. Is this 1+ GHz, or sub-200 MHz?
  • anonymous5 - Monday, August 19, 2019 - link

    Sub-200 MHz probably would not even have thermal expansion or water cooling problems. It has to be 1 GHz+
  • brakdoo - Tuesday, August 20, 2019 - link

    You should have asked how this approach holds up against Panel-Level Packaging (PLP).

    Interconnect between chips would be slightly slower, but all chips would be 100% working, the maximum possible size would be much bigger than this ~210x210mm, integration of DRAM/3D XPoint would be possible, and everything would be much cheaper.
  • Duncan Macdonald - Tuesday, August 20, 2019 - link

    Is the faulty core bypass fixed or changeable (i.e. if a core fails sometime in the future, can it be bypassed)?
    How are defective inter-die links handled?
  • Ian Cutress - Friday, August 23, 2019 - link

    It looks like there are about 1-1.5% spare cores onboard, and defective cores are bypassed with one of the spare cores being enabled in its place.
  • Achtung_BG - Tuesday, August 20, 2019 - link

    46225 mm2 OMG!!! One chip per wafer unbelievable....
  • PixyMisa - Tuesday, August 20, 2019 - link

    I anticipate a moderate degree of expense in acquiring this device.
  • PeachNCream - Wednesday, August 21, 2019 - link

    And the understatement of the year award goes to PixyMisa!
  • willis936 - Tuesday, August 20, 2019 - link

    The Q to A ratio in that Q and A session is quite low.
  • eastcoast_pete - Tuesday, August 20, 2019 - link

    Thanks Ian! Amazing chip, but my question is (would have been if there): How fault-tolerant is this design? We all know that as the transistor count goes up, so does the chance of having too many manufacturing defects that render an IC useless. With this enormous number of transistors, the accumulated number of faults must be substantial. I know they addressed fault-tolerance, but only sort-of. Do they have a (even just one) functional chip they can demonstrate? Showing silicon but not silicon in action raises more questions for me than it answers.
