Most applications in distributed computing center around a set of common subproblems. Distributed Systems: An Algorithmic Approach presents the algorithmic issues and necessary background theory that are needed to properly understand these challenges. Achieving a balance between theory and practice, this book bridges the gap between theoreticians and practitioners. With a set of exercises featured in each chapter, the book begins with background information that contains various interprocess communication techniques and middleware services, followed by foundational topics that cover system models, correctness criteria, and proof techniques. The book also presents numerous important paradigms in distributed systems, including logical clocks, distributed snapshots, deadlock detection, termination detection, election, and several graph algorithms. The author then addresses failures and fault-tolerance techniques in diverse applications, such as consensus, transactions, group communication, replicated data management, and self-stabilization. He concludes with an exploration of real-world issues, including distributed discrete-event simulation and security, sensor networks, and peer-to-peer networks. By covering foundational matters of distributed systems and their relationships to real-world applications, Distributed Systems provides insight into common distributed computing subproblems,
I've been working on a project, which is a combination of an application server and an object database, and is currently running on a single machine only. Some time ago I read a paper which describes a distributed relational database, and got some ideas on how to apply the ideas in that paper to my project, so that I could make a high-availability version of it running on a cluster using a shared-nothing architecture.
My problem is, that I don't have experience on designing distributed systems and their protocols - I did not take the advanced CS courses about distributed systems at university. So I'm worried about being able to design a protocol, which does not cause deadlock, starvation, split brain and other problems.
Question: Where can I find good material about designing distributed systems? What methods there are for verifying that a distributed protocol works right? Recommendations of books, academic articles and others are welcome.
Where can I find good material about designing distributed systems?
I have never been able to finish the famous book from Nancy Lynch. However, I find that the book from Sukumar Ghosh Distributed Systems: An Algorithmic Approach is much easier to read, and it points to the original papers if needed.
What methods there are for verifying that a distributed protocol works right?
In order to survey the possibilities (and also in order to understand the question), I think that it is useful to get an overview of the possible tools from the book Software Specification Methods.
My final decision was to learn TLA+. Why? Even if the language and tools seem better, I really decided to try TLA+ because the guy behind it is Leslie Lamport. That is, not just a prominent figure on distributed systems, but also the author of Latex! You can get the TLA+ book and several examples for free.
Learning distributed computing isn't easy. Its really a very vast field covering areas on communication, security, reliability, concurrency etc., each of which would take years to master. Understanding will eventually come through a lot of reading and practical experience. You seem to have a challenging project to start with, so heres your chance :)
The two most popular books on distributed computing are, I believe:
1) Distributed Systems: Concepts and Design - George Coulouris et al.
2) Distributed Systems: Principles and Paradigms - A. S. Tanenbaum and M. Van Steen
Both these books give a very good introduction to current approaches (including communication protocols) that are being used to build successful distributed systems. I've personally used the latter mostly and I've found it to be an excellent text. If you think the reviews on Amazon aren't very good, its because most readers compare this book to other books written by A.S. Tanenbaum (who IMO is one of the best authors in the field of Computer Science) which are quite frankly better written.
PS: I really question your need to design and verify a new protocol. If you are working with application servers and databases, what you need is probably already available.
It appears that a lot of software, including low-level system calls, etc., relies on obtaining and being properly synchronized with the correct time in order to properly function. I was wondering if a better method could be implemented in order to avoid this dependency?.. e.g. rely on hardware-genrated fixed cycles (ticks), etc.
It's possible to write software without depending on global synchronized time. If you want to learn more, read one of the books on distributed programming, for example this one: http://www.amazon.com/Distributed-Systems-Algorithmic-Approach-Information/dp/1584885645/ref=sr_1_12?ie=UTF8&qid=1341166029&sr=8-12&keywords=distributed+systems
If you don't have time for books, you can take a look at Vector Clocks: http://en.wikipedia.org/wiki/Vector_clock