
Many-Core Operating Systems



  1. Many-Core Operating Systems
Burton Smith, Technical Fellow, Advanced Strategies and Policy

  2. The von Neumann Premise
• Simply put: “there is exactly one program counter”
• It has led to some artifacts:
  • Synchronous coprocessor coroutining (e.g. the 8087)
  • Interrupts for asynchronous concurrency
  • Demand paging
    • to make memory allocation incremental
    • to let |virtual| > |physical|
• And some serious problems:
  • The memory wall (insufficient memory concurrency)
  • The ILP wall (diminished improvement in instruction-level parallelism)
  • The power wall (the cost of run-time ILP exploitation)
• Given multiple program counters, what should we change?
  • Scheduling?
  • Synchronization?

  3. Computing is at a Crossroads
• Continual performance improvement is our lifeblood
  • It encourages people to buy new hardware
  • It opens up new software possibilities
• Single-thread performance is nearing the end of the line
  • But Moore’s Law will continue for some time to come
  • What can we do with all those transistors?
• Computation needs to become as parallel as possible
  • Henceforth, serial means slow
  • Systems must support general-purpose parallel computing
  • The alternative to all this is commoditization
• New many-core chips will need new system software
  • And vice versa!
• This talk is about the interplay between OS and hardware

  4. Many-Core OS Challenges
• Architecture of the parallel virtual machine
• Processor management
  • Multiple processors
  • A mix of in-order and out-of-order CPUs
  • GPUs and other performance accelerators
  • I/O processors and devices
• Memory management
  • Performance problems due to paging
  • TLB pressure from larger working sets
  • Bandwidth resources
• Quality of service (time management)
  • For media applications, games, real-time apps, etc.
  • For deadlines

  5. The Parallel Virtual Machine
• What should the interface that the OS presents to parallel application software look like? (a sketch follows)
  • Stable, negotiated resource allocation
  • Isolation among protection domains
  • Freedom from bottlenecks in OS services
• The key objective is fine-grain application parallelism
  • We need the whole tree, not just the low-hanging fruit
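To make the slide concrete, here is a minimal C sketch of what a negotiated-resource interface could look like. Every name here (resource_request, negotiate_resources, and so on) is hypothetical, invented for illustration rather than taken from any real OS; the stub simply caps the grant at an arbitrary limit.

    #include <stddef.h>
    #include <stdio.h>

    /* Hypothetical request/grant types; no real OS exports these. */
    typedef struct {
        int    processors;     /* cores wanted */
        size_t memory_bytes;   /* physical memory wanted */
    } resource_request, resource_grant;

    /* The application states its needs; the OS replies with a grant it
     * can honor (possibly smaller).  The grant is stable until it is
     * renegotiated, so the user-level runtime can schedule onto it
     * without fearing silent preemption.  Stubbed for illustration. */
    static int negotiate_resources(const resource_request *want,
                                   resource_grant *got) {
        got->processors   = want->processors > 8 ? 8 : want->processors;
        got->memory_bytes = want->memory_bytes;
        return 0;
    }

    int main(void) {
        resource_request want = { 16, 1 << 30 };
        resource_grant   got;
        negotiate_resources(&want, &got);
        printf("asked for %d cores, granted %d\n",
               want.processors, got.processors);
        return 0;
    }

The essential design point is that the grant is explicit and stable: the runtime builds its schedule on resources it actually holds, rather than on kernel threads that may vanish at any moment.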

  6. Fine-grain Parallelism
• Exploitable parallelism grows as task granularity shrinks
  • But dependences among tasks become more numerous
• Inter-task dependence enforcement demands scheduling
  • A task needing a value from elsewhere must wait for it
• User-level work scheduling is called for
  • No privilege change is needed to stop or restart a task
  • Locality (e.g. cache content) can be better preserved
• Today’s OS and hardware don’t encourage waiting
  • OS thread scheduling makes blocking dangerous
  • Instruction sets encourage non-blocking approaches
  • Busy-waiting wastes instruction issue opportunities
• Impact:
  • Better instruction-set support for blocking synchronization
  • Changes to OS processor and memory resource management
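The blocking-at-user-level idea can be sketched in a few lines of C: a task that needs a value parks itself on a future, and the user-level scheduler runs something else, with no kernel transition and no spinning. This is a single-threaded illustration with invented names, not any real runtime.

    #include <stdio.h>

    typedef struct task { void (*run)(struct task *); struct task *next; } task;

    static task *run_q;                                  /* ready tasks */
    static void push(task **q, task *t) { t->next = *q; *q = t; }
    static task *pop(task **q) { task *t = *q; if (t) *q = t->next; return t; }

    typedef struct { int full; int value; task *waiters; } future;
    static future f;

    /* Fill the future and move its waiters onto the run queue. */
    static void future_put(future *fu, int v) {
        fu->value = v;
        fu->full  = 1;
        for (task *t; (t = pop(&fu->waiters)); )
            push(&run_q, t);
    }

    static void consumer(task *self) {
        if (!f.full) { push(&f.waiters, self); return; } /* park; don't spin */
        printf("consumer got %d\n", f.value);
    }

    static void producer(task *self) {
        (void)self;
        future_put(&f, 42);                              /* wakes the consumer */
    }

    int main(void) {
        static task c = { consumer, NULL }, p = { producer, NULL };
        push(&run_q, &p);           /* LIFO queue: consumer runs first */
        push(&run_q, &c);
        for (task *t; (t = pop(&run_q)); )               /* the user-level scheduler */
            t->run(t);
        return 0;
    }

Stopping and restarting a task here is just a queue operation, which is the slide's point: no privilege change, and the cache the task warmed up is still nearby when it resumes.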

  7. Multithreading and Synchronization
• Fine-grain multithreading can use TLP to tolerate latency
  • Memory latency
  • Other operation latency, e.g. branch latency
  • Synchronization latency
• In the latter case, some architectural support is helpful
  • To stop issuing from a context while it is waiting
  • To resume issuing when the wait is over
  • To free up the context if and when a wait becomes long
• The benefits:
  • Waiting does not consume issue slots
  • Overhead is automatically amortized
• I talked about this stuff in my 1996 FCRC keynote
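The architectural support described here is what Smith's Tera/Cray MTA provided natively via full/empty bits on every memory word. Below is a rough software emulation of those semantics only, using C11 atomics; note that it spins, which is exactly what the hardware support avoids by stopping issue from the waiting context.

    #include <stdatomic.h>

    typedef struct { atomic_int full; int value; } sync_word;

    /* writeef: wait until the word is empty, write it, mark it full. */
    static void writeef(sync_word *w, int v) {
        int expected = 0;
        while (!atomic_compare_exchange_weak(&w->full, &expected, -1))
            expected = 0;        /* spinning here; MTA hardware instead
                                    stops issuing from this context    */
        w->value = v;
        atomic_store(&w->full, 1);
    }

    /* readfe: wait until the word is full, read it, mark it empty. */
    static int readfe(sync_word *w) {
        int expected = 1;
        while (!atomic_compare_exchange_weak(&w->full, &expected, -1))
            expected = 1;
        int v = w->value;
        atomic_store(&w->full, 0);
        return v;
    }

    int main(void) {
        static sync_word w;      /* zero-initialized: starts empty */
        writeef(&w, 7);
        return readfe(&w) == 7 ? 0 : 1;
    }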

  8. Resource Scheduling Consequences
• Since the user runtime is scheduling work on processors, the OS should not attempt to do the same
  • An asynchronous OS API is a necessary corollary
  • Scheduling memory via demand paging is also problematic
• Instead, the two schedulers should negotiate
  • The application tells the OS its resource needs and desires
  • The OS makes decisions based on the big picture:
    • Availability of resources
    • Appropriateness of the power consumption level
    • Requirements for quality of service
• The OS can preempt resources to reclaim them
  • But with notification, so the application can rearrange things
• Resources should be time- and space-shared in chunks
  • Scheduling turns into a bin-packing problem
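Preemption with notification might look like the sketch below: rather than silently descheduling, the OS upcalls the runtime, which shrinks its worker pool before the cores disappear. All names (os_request_cores, os_on_preempt) are invented for illustration, with the OS side stubbed out.

    #include <stdio.h>

    typedef void (*preempt_cb)(int cores_lost);
    static preempt_cb on_preempt;

    static int os_request_cores(int wanted) {
        return wanted < 8 ? wanted : 8;         /* stub: grant at most 8 */
    }
    static void os_on_preempt(preempt_cb cb) { on_preempt = cb; }
    static void os_simulate_pressure(void) {    /* stand-in for a real event */
        if (on_preempt) on_preempt(2);
    }

    static int workers;
    static void repack(int cores_lost) {        /* the upcall: rearrange work */
        workers -= cores_lost;                  /* shrink the worker pool and */
        printf("repacked onto %d cores\n", workers);  /* migrate tasks here   */
    }

    int main(void) {
        workers = os_request_cores(16);         /* ask big, accept the grant */
        os_on_preempt(repack);
        printf("granted %d cores\n", workers);
        os_simulate_pressure();                 /* OS reclaims 2, with notice */
        return 0;
    }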

  9. Bin Packing
• The more resources allocated, the more swapping overhead
  • It would be nice to amortize it…
  • The more resources you get, the longer you may keep them
• Roughly, this means scheduling = packing squarish blocks
  • QoS applications might need long rectangles instead
• When the blocks don’t fit, the OS can morph them a little
  • Or cut corners when absolutely necessary
[Figure: blocks packed against axes of resource quantity versus time]
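A toy version of this packing problem, just to show the shape of it: first-fit placement of (cores × time) blocks on a small occupancy grid. A real scheduler would also morph blocks and reserve long rectangles for QoS, as the slide says; the sizes below are arbitrary.

    #include <stdio.h>

    #define CORES 8
    #define SLOTS 16
    static int used[SLOTS];   /* cores already committed per time slot */

    /* Place a job needing `cores` cores for `len` consecutive slots at
     * the earliest start where it fits; return the start slot or -1. */
    static int place(int cores, int len) {
        for (int s = 0; s + len <= SLOTS; s++) {
            int ok = 1;
            for (int t = s; t < s + len; t++)
                if (used[t] + cores > CORES) { ok = 0; break; }
            if (!ok) continue;
            for (int t = s; t < s + len; t++) used[t] += cores;
            return s;
        }
        return -1;
    }

    int main(void) {
        int jobs[][2] = { {4, 4}, {4, 4}, {8, 2}, {2, 8} }; /* {cores, slots} */
        for (int i = 0; i < 4; i++)
            printf("job %d starts at slot %d\n",
                   i, place(jobs[i][0], jobs[i][1]));
        return 0;
    }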

  10. What About Priority Scheduling?
• Priorities are appropriate for some kinds of scheduling
  • Especially when some things to be scheduled are optional
• If it all has to be done, how do the priorities get set?
  • The answer is usually “ad hoc, and often!”
  • Fairness is seldom maintained in the process
• Quality of service needs a different approach
  • “How much work must be done before the next deadline?”
  • Even highly interactive tasks can benefit
• Deadlines are harder to implement than priorities
  • Then again, so is bin packing compared to fixed quanta
• Fairness can also be based on quality-of-service concepts
  • Relative work rates rather than absolute ones
  • “In the next 16 milliseconds, give level i activities r times as many processor-seconds as level i-1 activities” (worked out below)
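The quoted rule makes each level's share of a frame a geometric split: with ratio r and L levels, level i receives r^i / (1 + r + … + r^(L-1)) of the frame. A small computation of those shares, with r = 2 and three levels as arbitrary example values:

    #include <stdio.h>

    int main(void) {
        const double frame_ms = 16.0, r = 2.0;   /* the slide's 16 ms frame */
        const int levels = 3;
        double weight = 1.0, total = 0.0, w[8];
        for (int i = 0; i < levels; i++) {       /* weights 1, r, r^2, ... */
            w[i] = weight;
            total += weight;
            weight *= r;
        }
        for (int i = 0; i < levels; i++)         /* geometric split of frame */
            printf("level %d: %.2f ms of each frame\n",
                   i, frame_ms * w[i] / total);
        return 0;
    }

With r = 2 the three levels get roughly 2.29, 4.57, and 9.14 ms of each 16 ms frame: relative rates are enforced, yet every level makes guaranteed progress, which is the fairness property priorities lack.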

  11. Heterogeneous Processors
• There are two kinds of heterogeneity:
  • In architecture (HA), i.e. different instruction sets
  • In implementation (HI), i.e. different performance characteristics
• Both are likely to be important
  • A single application might ask for a heterogeneous mix
  • The HA case might need multiple code versions or JIT compilation
  • In the HI case, scheduling might be based on instrumentation
• A key question is whether a processor is time-sharable
  • If not, the OS has to dedicate it to one application at a time
  • With user-level scheduling and some support for preemption, application state save and restore can be done at user level
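One way to picture the "multiple code versions" answer to the HA case: keep a per-ISA table of compiled kernels and dispatch by where the scheduler places the task. The enum, table, and kernels below are invented; a real system would populate the table from per-ISA binaries or a JIT.

    #include <stdio.h>

    typedef enum { ISA_CPU, ISA_ACCEL } isa_t;
    typedef void (*kernel_fn)(const float *in, float *out, int n);

    /* Two versions of the same kernel, one per instruction set. */
    static void kernel_cpu(const float *in, float *out, int n)
    { for (int i = 0; i < n; i++) out[i] = in[i] * 2.0f; }
    static void kernel_accel(const float *in, float *out, int n)
    { for (int i = 0; i < n; i++) out[i] = in[i] * 2.0f; /* accel codegen */ }

    static kernel_fn versions[] = { [ISA_CPU]   = kernel_cpu,
                                    [ISA_ACCEL] = kernel_accel };

    int main(void) {
        float in[4] = {1, 2, 3, 4}, out[4];
        isa_t where = ISA_CPU;            /* chosen by the scheduler */
        versions[where](in, out, 4);      /* same call, per-ISA code */
        printf("%.0f\n", out[3]);
        return 0;
    }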

  12. Virtual Memory Design Alternatives
• Swapping instead of demand paging
• Address-space names/identifiers
  • TLB shootdown becomes a rarer event
• Hardware TLB coherence
• Two-dimensional addressing (segmentation without segment registers)
  • To assist with variable-granularity memory allocation
  • To help mitigate upward pressure on TLB size
  • To leverage persistent memory via segment sharing
    • A variation of mmap() might suffice for this purpose (see the sketch below)
  • To accommodate variations in memory bank architecture
    • Local versus global, for example
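The mmap() remark can be grounded in what POSIX mmap() already provides: mapping a file MAP_SHARED yields a named, persistent, shareable segment accessed with ordinary loads and stores. A minimal POSIX sketch of that baseline (error handling omitted for brevity):

    #include <fcntl.h>
    #include <string.h>
    #include <sys/mman.h>
    #include <unistd.h>

    int main(void) {
        int fd = open("segment.dat", O_RDWR | O_CREAT, 0600);
        ftruncate(fd, 4096);                    /* size the segment */
        char *seg = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                         MAP_SHARED, fd, 0);    /* shareable by name */
        strcpy(seg, "persists across runs");    /* ordinary stores */
        munmap(seg, 4096);
        close(fd);
        return 0;
    }

The slide's "variation" would presumably expose such segments as first-class, variable-granularity objects rather than page-grained file mappings, easing the TLB pressure mentioned above.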

  13. Physical Memory Bank Architecture
• Consider this example:
  • An application is using 31 cores, about half of them
  • 50% of its cache misses are stack references
  • The stacks are all allocated in a compact virtual region
  • How many of the 128 memory banks are available?
• Interleaving addresses across the banks is a solution
  • Page granularity is the standard choice (see the arithmetic below)
  • If memory access is non-uniform, this is not the best idea
• Stacks should be allocated near their processors
  • So should compiler-allocated temporary arrays on the heap
• Does one bank-architecture scheme fit all, or not?
  • If not, how do we manage the virtual address space?
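Working the slide's example, under the added assumptions of 4 KB pages and one page of hot stack per core: page-granular interleaving maps address a to bank (a / 4096) mod 128, so 31 compact one-page stacks land on only 31 of the 128 banks, and by the slide's numbers those banks carry half the miss traffic.

    #include <stdio.h>

    #define PAGE   4096u
    #define NBANKS 128u

    /* Page-granularity interleaving across the banks. */
    static unsigned bank_of(unsigned long addr) {
        return (addr / PAGE) % NBANKS;
    }

    int main(void) {
        unsigned long stack_base = 0x10000000ul; /* compact stack region */
        int hit[NBANKS] = {0}, banks = 0;
        for (int core = 0; core < 31; core++)    /* one 4 KB stack per core */
            hit[bank_of(stack_base + core * PAGE)] = 1;
        for (unsigned b = 0; b < NBANKS; b++)
            banks += hit[b];
        printf("stack traffic touches %d of %u banks\n", banks, NBANKS);
        return 0;
    }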

  14. “Hot Spots”
• When processors share memory, they can interfere
  • Not only data races, but also bandwidth oversubscription
• Within an application, this creates performance problems
  • Hardware help is needed to discover “where” these are
• Between applications, interference is even more serious:
  • Performance unpredictability
  • Denial of service
  • Covert-channel signaling
• Bandwidth is a resource like any other
  • We need to be able to partition and isolate it
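The slide does not prescribe a mechanism for partitioning bandwidth; one illustrative choice, borrowed from networking, is a per-domain token bucket. Real enforcement would sit in the memory controller rather than in software like this sketch.

    #include <stdio.h>

    typedef struct { double tokens, rate, cap; } bucket; /* bytes, bytes/tick */

    /* Admit a request only if the domain is within its budget. */
    static int admit(bucket *b, double bytes) {
        if (b->tokens < bytes) return 0;   /* over budget: defer the request */
        b->tokens -= bytes;
        return 1;
    }

    /* Refill each tick, capped to bound bursts. */
    static void tick(bucket *b) {
        b->tokens += b->rate;
        if (b->tokens > b->cap) b->tokens = b->cap;
    }

    int main(void) {
        bucket app = { 0, 64, 256 };       /* 64 B/tick share, 256 B burst */
        for (int t = 0; t < 4; t++) {
            tick(&app);
            printf("tick %d: 128-byte request %s\n",
                   t, admit(&app, 128) ? "granted" : "deferred");
        }
        return 0;
    }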

  15. I/O Architecture
• Direct memory access is usually a good way to do I/O
  • Today’s DMA mostly demands “wired-down” pages
  • This leads to lots of data copying and other OS warts
• But I/O devices are getting smarter all the time
  • Transistors are cheaper than almost anything else
• Why not treat I/O devices like heterogeneous processors?
  • Teach them to do virtual address translation
  • Allocate them to real-time or sensor-intensive applications
  • Allocate them to a not-very-trusted “driver application”
  • Address space sharing can be partial, as it is now
• There is a problem, though: inter-domain signaling (IPC)
  • This is what interrupts do
  • I have some issues with interrupts

  16. Interrupts
• Interrupts are OK when there is only one processor
  • Some people avoid them to make systems more predictable
• If there are many processors, which one do you interrupt?
  • The usual solution: “just pick one and leave it to software”
• A better idea is to signal via an address space you already share (perhaps only in part) with the intended recipient
  • “The DMA at address <a> is ready/done”
  • This is kinda like doing programmed I/O via device CSRs
  • It’s also the way the CDC 6600 and 7600 did things
• You may not want to have the signal recipient busy-wait
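A two-thread simulation of the shared-word idea, using C11 atomics and pthreads: the "device" writes a completion flag with release ordering, and the recipient observes it with acquire ordering, no interrupt involved. The poll loop is only for brevity; per the last bullet, a real runtime would park the waiting task (as in the future sketch under slide 6) instead of spinning.

    #include <pthread.h>
    #include <stdatomic.h>
    #include <stdio.h>

    static atomic_int dma_done;      /* the shared word the "device" writes */
    static char buffer[64];

    static void *device(void *arg) { /* stand-in for a smart I/O device */
        (void)arg;
        snprintf(buffer, sizeof buffer, "payload");
        atomic_store_explicit(&dma_done, 1, memory_order_release);
        return NULL;
    }

    int main(void) {
        pthread_t dev;
        pthread_create(&dev, NULL, device, NULL);
        /* A real runtime would park this task; we poll for brevity. */
        while (!atomic_load_explicit(&dma_done, memory_order_acquire))
            ;
        printf("DMA done: %s\n", buffer);
        pthread_join(dev, NULL);
        return 0;
    }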

  17. Conclusions
• It is time to rethink some of the basics of computing
• There is lots of work for everyone to do
  • e.g. I’ve left out compilers, debuggers, and applications
• We need basic research as well as industrial development
  • Research in computer systems is deprecated these days
  • In the USA, NSF and DOD need to take the initiative
