I've been thinking a lot about the shift in direction in CPU design lately, and I've come to the conclusion that things will pretty much stay the same, even given an explosion in the number of cores available.
First, the average desktop computer is hardly ever running the CPU at 100% unless there's something wrong. Very few applications that I use are CPU bound, and my guess is that this is also true for server apps. For server apps, the problems that are CPU bound are either trivially spread among multiple servers using a load-balancing scheme, or they're long-running scientific calculations.
There are easily 50 processes running on this WinXP computer right now, and all I have open are Firefox and Outlook. Even with 50 cores, the OS can spread the load at the process boundary to get speed/latency gains without anyone ever having to change programming styles.
I can think of one or two apps I use that actually peg the CPU for any significant time, and those are video editing and encoding programs. These would benefit from multiple cores, but the rest of the software I use happily runs with a small percentage of single-core CPU right now.
So what's the future? Will email ever develop into something so complex and slow that it needs more than an entire 3 GHz CPU to run smoothly? I'm not discounting this possibility, but it seems unlikely.
I think the future of computing will belong to an architecture that uses the increased power for reliability and ease of development rather than performance. I think it's safe to assume that the application you write five to ten years from now will have more than enough CPU power to run. The trick is to make it dead-simple to create reliable, responsive programs.
With that in mind, I'm working on a new programming language (starting with a virtual machine) that combines a number of features I've been interested in for a while:
1) Compiler-enforced cooperative threading: The big idea behind this language is that, like Erlang, you'll handle most tasks by spawning an extremely lightweight thread. The threads (called tasks) are non-native, and the compiler inserts time checks at function calls and loop iterations to run the scheduler. There is no need to call a yield() function; the compiler does that for you at the right time.
All I/O will be asynchronous so one task doesn't block progress of the others. Each task has its own heap and stack; the lack of shared data means that garbage collection can be run on a per-task basis and thread synchronization and locking are non-issues. Tasks will communicate via message passing.
The advantages of doing this are numerous. Cooperatively threaded programs won't ever context-switch at inopportune times. In my case, the only points at which a task may give up control are between function calls or loop iterations. Things proceed in a very deterministic way (especially with the simple round-robin scheduler that I have now).
Also, creating a task is extremely fast, requiring only that a small stack and heap be allocated (I'm using malloc for portability reasons).
The disadvantage of this design is that it takes a little more work to ensure that one task can't hang the entire process, and there is no portable hardware or operating-system support for automatically catching stack/heap overflow, so those limits have to be checked manually. The stack check is done at the beginning of every function call, since the compiler can tell in advance how much stack space a function will use.
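To make that concrete, here's a rough C sketch of the kind of check the compiler could emit at the top of every generated function. Every name and the "fuel" scheme below are made up for illustration; this is not the actual VM's code.

    #include <stddef.h>

    /* Hypothetical sketch of the per-function checks; all names invented. */

    #define TASK_FUEL_QUANTUM 1000   /* calls/iterations allowed per time slice */

    typedef struct {
        size_t stack_size;   /* bytes allocated for this task's private stack */
        size_t stack_used;   /* bytes currently in use                        */
        long   fuel;         /* work remaining before the task must yield     */
    } Task;

    /* Stand-in for the round-robin scheduler: switch to the next runnable task. */
    static void scheduler_yield(Task *t) { (void)t; }

    /* The compiler would emit a call like this at the start of every function;
     * frame_bytes is known statically from the function's stack layout. */
    static int task_prologue(Task *t, size_t frame_bytes)
    {
        /* Manual overflow check: no portable hardware/OS guard to rely on. */
        if (t->stack_used + frame_bytes > t->stack_size)
            return -1;                        /* caller aborts the task */

        /* Inserted time check: yield once the task's budget is spent. */
        if (--t->fuel <= 0) {
            t->fuel = TASK_FUEL_QUANTUM;
            scheduler_yield(t);
        }
        return 0;
    }

Because the check only fires at function calls and loop iterations, a task can never be interrupted in the middle of an expression.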
2) Message-passing architecture: Tasks communicate via messages, with complex data being serialized and copied. The advantage is that sharing no data means that read/write race conditions are a non-issue. The other advantage is that the serialized messages can just as easily be sent over a network.
The disadvantage here is performance, which is a cost I'm willing to pay.
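As a rough illustration (again with made-up names, not the VM's real code), a copy-based send might look something like this in C. Because the payload is a flat byte string, the same bytes could just as easily be written to a socket instead of a local mailbox.

    #include <stdlib.h>
    #include <string.h>

    /* Hypothetical sketch of copy-based message passing between tasks. */

    typedef struct Message {
        struct Message *next;
        size_t          len;
        unsigned char   data[];   /* serialized payload, owned by the receiver */
    } Message;

    typedef struct {
        Message *mailbox_head;    /* simple FIFO queue of pending messages */
        Message *mailbox_tail;
    } Task;

    /* Sending copies the serialized bytes, so sender and receiver never
     * share memory and there is nothing to lock. */
    static int task_send(Task *to, const void *bytes, size_t len)
    {
        Message *m = malloc(sizeof *m + len);
        if (!m)
            return -1;
        m->next = NULL;
        m->len  = len;
        memcpy(m->data, bytes, len);

        if (to->mailbox_tail)
            to->mailbox_tail->next = m;
        else
            to->mailbox_head = m;
        to->mailbox_tail = m;
        return 0;
    }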
3) Mixed bytecode and compiled-code runtime: The virtual machine I'm writing is stack-based and can call compiled functions from bytecoded functions and vice versa. There is no overhead for doing this. The drawback is that for this to work effectively, a bytecoded function must be compiled along with all of the functions it calls.
4) Stack-based, bytecode virtual machine: This ensures portability and ease of implementation. The virtual machine is coded almost entirely in C.
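To give a feel for points 3 and 4 together, here's a toy stack-based dispatch loop with one opcode that jumps straight into compiled C code. The opcodes, names, and native-function table are all invented for this sketch; they aren't the VM's actual design.

    #include <stdio.h>

    /* Toy stack machine with a "call compiled code" opcode; all invented. */

    typedef long Value;
    typedef Value (*NativeFn)(Value a, Value b);

    enum { OP_PUSH, OP_ADD, OP_CALL_NATIVE, OP_PRINT, OP_HALT };

    static Value native_mul(Value a, Value b) { return a * b; }  /* compiled code */

    static void run(const long *code)
    {
        Value stack[256];
        int sp = 0;                               /* next free stack slot */
        NativeFn natives[] = { native_mul };      /* table of compiled functions */

        for (int pc = 0; ; ) {
            switch (code[pc++]) {
            case OP_PUSH:  stack[sp++] = code[pc++];          break;
            case OP_ADD:   sp--; stack[sp - 1] += stack[sp];  break;
            case OP_CALL_NATIVE: {   /* bytecode calls compiled code directly */
                NativeFn fn = natives[code[pc++]];
                sp--;
                stack[sp - 1] = fn(stack[sp - 1], stack[sp]);
                break;
            }
            case OP_PRINT: printf("%ld\n", stack[--sp]);      break;
            case OP_HALT:  return;
            }
        }
    }

    int main(void)
    {
        /* Computes (2 + 3) * 4, mixing bytecoded arithmetic with a call
         * into a "compiled" C function. */
        const long prog[] = { OP_PUSH, 2, OP_PUSH, 3, OP_ADD,
                              OP_PUSH, 4, OP_CALL_NATIVE, 0,
                              OP_PRINT, OP_HALT };
        run(prog);
        return 0;
    }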
5) Distributed VMs: Multiple VM instances can be run on the same machine and/or on multiple machines in a network. This offers a simple way to take advantage of multiple cores (one process per core) or multiple nodes in a network, as necessary.
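A minimal POSIX sketch of the one-process-per-core arrangement might look like this (not code from the project, and sysconf's processor count isn't strictly portable):

    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/wait.h>
    #include <unistd.h>

    /* Hypothetical stand-in for the real VM's main loop. */
    static void vm_main(long instance)
    {
        printf("VM instance %ld running as pid %d\n", instance, (int)getpid());
        exit(0);
    }

    int main(void)
    {
        long cores = sysconf(_SC_NPROCESSORS_ONLN);   /* not strictly portable */
        if (cores < 1)
            cores = 1;

        /* One OS process per core; the instances would then exchange the
         * same serialized messages that tasks use inside a single VM. */
        for (long i = 0; i < cores; i++) {
            pid_t pid = fork();
            if (pid < 0) {
                perror("fork");
                break;
            }
            if (pid == 0)
                vm_main(i);               /* child never returns */
        }
        while (wait(NULL) > 0)            /* parent waits for all instances */
            ;
        return 0;
    }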
6) Language-agnostic VM: The VM is designed for dynamic, functional languages, but it will support most language features you can throw at it. Apparently people who want first-class functions, continuations, tail-call optimization, weak typing, etc., aren't well served by the CLR or the Java virtual machine. I'd like to support those kinds of languages first.
7) Easy integration with native code: self-explanatory.
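Since I haven't spelled out the foreign-function interface, here's one hypothetical shape it could take: the host registers a C function under a name, and the loader resolves bytecode call sites against that table. None of these names are real.

    /* Hypothetical registration-style FFI; all names invented. */

    typedef long Value;
    typedef Value (*NativeFn)(const Value *args, int nargs);

    typedef struct {
        const char *name;   /* name the bytecode refers to */
        NativeFn    fn;     /* compiled C implementation   */
    } NativeEntry;

    #define MAX_NATIVES 64
    static NativeEntry natives[MAX_NATIVES];
    static int native_count;

    /* Host program exposes a C function to bytecode under a name;
     * returns the slot index, or -1 if the table is full. */
    static int vm_register_native(const char *name, NativeFn fn)
    {
        if (native_count >= MAX_NATIVES)
            return -1;
        natives[native_count].name = name;
        natives[native_count].fn   = fn;
        return native_count++;
    }

    /* Example native function: sums its arguments. */
    static Value native_sum(const Value *args, int nargs)
    {
        Value total = 0;
        for (int i = 0; i < nargs; i++)
            total += args[i];
        return total;
    }

The host would call something like vm_register_native("sum", native_sum) before loading any bytecode that refers to it.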
The VM I'm working on now has no JIT compiler; it's purely bytecode. That said, it can create tasks and schedule between them, pass messages between tasks, support tail calls and iterative loops, and it's slightly faster than Perl running similar basic test programs. I consider the VM's performance to be "good enough to ignore for now."
I haven't begun on the syntax of the language itself, but there will definitely be a syntax that looks a lot like C and another that looks a lot like Lisp. These will be compiled to an intermediate bytecode assembly language, and from there to bytecode. The bytecode assembler is written and working.
Other than the compiler-enforced cooperative threads, I don't know if there's anything original here (and the cooperative-threading idea must exist somewhere), but I'm trying to take the best ideas from the more esoteric languages and bring them under the C-like syntax umbrella.