Core competences

What are we doing?

Our research interests mostly focus on:

Cluster Computing (RESH)

Communication Software

We run a 16-node cluster with Myrinet interconnects at our institute, using the ParaStation communication and administration software. With ParaStation (now sold by ParTec AG, a spin-off from the Department of Computer Science), we achieve high communication throughput between the processes of a parallel application. Because the communication layer had so far focused only on MPI programs, we added a network driver in 2002 that simulates Ethernet connections. With this new module, applications can establish standard TCP/IP connections over high-speed communication hardware without recompilation. The module was evaluated in cooperation with the group for elementary particles and computer-aided physics at the University of Wuppertal. On their cluster ALiCE, they use the parallel file system PVFS for highly I/O-intensive quantum chromodynamics simulations. By employing our new module for all PVFS communication, they achieved very good performance. The results will be published later this year.

Parallel Programming Environments

In the project "parallel and distributed programming of clusters in Java", we are exploring the advantages of writing parallel applications in Java, for example with respect to the usage of cluster resources. We developed JavaParty, a domain-specific language for programming cluster applications. JavaParty extends Java with transparent remote objects. It is used by our partners in the RESH project sponsored by the DFG, as well as by many external users.

JavaParty realizes a distributed object space through remote method invocation (RMI). Multiple threads execute in parallel to jointly solve a problem. In a remote call, a thread's point of execution moves to a different node, thereby creating a distributed thread. With standard libraries for remote method invocation, synchronization operations can deadlock, because monitors are no longer reentrant and remote monitor acquisition is impossible. We solved both problems by adding support for transparent distributed threads to KaRMI, a fast implementation of RMI for clusters. With transparent distributed threads, Java monitors are reentrant even in recursive remote invocations. Remote monitor acquisition is realized through a combination of an enhanced KaRMI API and a program transformation. Additionally, the application gains full control over its threads, because signals are forwarded to the current point of execution.
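
The reentrancy problem can be illustrated with a minimal sketch. It is written in Python rather than Java for brevity and does not reflect the KaRMI API: the class DistributedMonitor, the function recursive_remote_call, and the string thread IDs are hypothetical names introduced here. The idea shown is that a monitor stays reentrant across remote calls if ownership is tracked by a logical (distributed) thread ID rather than by the local thread that happens to serve the call.

```python
import threading


class DistributedMonitor:
    """Reentrant monitor whose owner is a logical (distributed) thread ID.

    Hypothetical illustration only; not KaRMI's actual API.
    """

    def __init__(self):
        self._cond = threading.Condition()
        self._owner = None  # logical ID of the owning distributed thread
        self._depth = 0     # how many times the owner has entered

    def enter(self, dthread_id):
        with self._cond:
            # A different logical thread has to wait; the same one re-enters.
            while self._owner is not None and self._owner != dthread_id:
                self._cond.wait()
            self._owner = dthread_id
            self._depth += 1

    def exit(self, dthread_id):
        with self._cond:
            if self._owner != dthread_id:
                raise RuntimeError("monitor not owned by this distributed thread")
            self._depth -= 1
            if self._depth == 0:
                self._owner = None
                self._cond.notify_all()


def recursive_remote_call(monitor, dthread_id, hops):
    """Simulate a recursive remote invocation chain.

    Each hop is served by a new local thread (as a remote call would be),
    but carries the same logical ID, so re-entering the monitor succeeds.
    """
    monitor.enter(dthread_id)
    try:
        if hops == 0:
            return 0
        result = []
        worker = threading.Thread(
            target=lambda: result.append(
                recursive_remote_call(monitor, dthread_id, hops - 1)))
        worker.start()
        worker.join()
        return 1 + result[0]
    finally:
        monitor.exit(dthread_id)
```

Replacing DistributedMonitor with a lock that identifies its owner by local thread (such as threading.RLock) would make the same call chain block forever at the first hop, which is the deadlock described above.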

We further designed a concept for checkpointing parallel and distributed systems. By regularly saving the state of a program to a persistent storage medium, the need to recalculate data obtained so far can be avoided in the case of a system crash. This is especially important for long-running simulation programs. The checkpointing process should be transparent to the application and cost-effective with respect to cluster resources (computing time, main memory and storage, communication bandwidth). Checkpointing a distributed application is more complex than checkpointing a single process because of dependencies induced by communication operations. Existing strategies focus on message passing systems. A JavaParty extension for distributed checkpointing, however, could address all these problems at the language level, thereby making further optimizations feasible.
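
The basic save-and-resume cycle can be sketched as follows for a single process. This is only an illustration of the principle, not the JavaParty concept: it is written in Python, the function names and the crash_at parameter are hypothetical, and it ignores the communication dependencies that make the distributed case harder.

```python
import os
import pickle
import tempfile


def save_checkpoint(path, state):
    """Write the state atomically: dump to a temp file, then rename it."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "wb") as f:
        pickle.dump(state, f)
        f.flush()
        os.fsync(f.fileno())
    # os.replace overwrites the old checkpoint in a single step, so a crash
    # during saving never leaves a half-written checkpoint behind.
    os.replace(tmp_path, path)


def load_checkpoint(path, default):
    """Return the last saved state, or the default if none exists."""
    if not os.path.exists(path):
        return default
    with open(path, "rb") as f:
        return pickle.load(f)


def run_simulation(path, total_steps, checkpoint_every, crash_at=None):
    """Sum the integers below total_steps, standing in for a long simulation.

    crash_at is a hypothetical test hook that raises an error at that step.
    """
    state = load_checkpoint(path, {"step": 0, "acc": 0})
    while state["step"] < total_steps:
        if crash_at is not None and state["step"] == crash_at:
            raise RuntimeError("simulated crash")
        state["acc"] += state["step"]
        state["step"] += 1
        if state["step"] % checkpoint_every == 0:
            save_checkpoint(path, state)
    return state["acc"]
```

After a crash, calling run_simulation again with the same path resumes from the last saved step; only the work done since that checkpoint is repeated.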

Additional information about JavaParty and KaRMI can be found on this website: http://www.ipd.uka.de/JavaParty/.

Parallel Filesystems

Clusterfile is a parallel file system for clusters of computers. In 2002, we focused on broadening its application area. The goal of the early design was to exploit the internal parallelism of applications efficiently. Internal parallelism emerges from the I/O access of multiple processes of the same application. External parallelism, by contrast, arises from concurrent access by different applications. Our extensions address external parallelism by introducing not only application-specific, but also system-wide optimizations.

We implemented the file system partly at user level and partly in the Linux kernel. Based on a kernel module supporting the VFS (Virtual Filesystem Switch) interface, Clusterfile may be mounted in the local directory tree of any cluster node. Metadata is managed through the cooperation of the kernel module with a central server. We introduced collective I/O operations in order to optimize simultaneous access from many nodes to the same file. Furthermore, we implemented an MPI I/O interface for Clusterfile, which we are currently comparing with other MPI I/O implementations.
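
One common ingredient of collective I/O is to gather the requests of all participating processes and merge adjacent or overlapping byte ranges, so that the file system sees a few large accesses instead of many small ones. The following Python sketch shows only this merging step; it is a generic illustration under that assumption, not Clusterfile's implementation, and the function name coalesce_requests is hypothetical.

```python
def coalesce_requests(requests):
    """Merge (offset, length) requests from all processes into larger ranges.

    Ranges that overlap or touch each other are combined into one.
    Returns a sorted list of (offset, length) tuples.
    """
    merged = []  # list of [start, end) pairs, end exclusive
    for offset, length in sorted(requests):
        end = offset + length
        if merged and offset <= merged[-1][1]:
            # Touches or overlaps the previous range: extend it.
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([offset, end])
    return [(start, end - start) for start, end in merged]
```

For example, four processes requesting (0, 4), (8, 4), (4, 4) and (20, 4) would result in the two accesses (0, 12) and (20, 4).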

In the future, we plan to increase application performance and scalability by introducing cooperative caching. This is joint work with the subproject Scalable Servers on Cluster of Computers. Furthermore, we will improve scalability of metadata management by decentralization. Currently, we are studying various policies, such as metadata distribution or replication over the cluster nodes.

Scheduling Policies on Clusters

Gang Scheduling coordinates process task switches on multi-processor systems to enhance performance. It runs groups of processes that communicate intensively with each other in parallel, in order to avoid task switches that result from one process waiting to communicate with another (process thrashing).
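
The core idea can be sketched as a schedule table in which each row is a time slot and each column a CPU, and all processes of one group (gang) are placed in the same row. The Python sketch below uses simple first-fit placement, which is an assumption of this illustration and not a policy described here; the function name build_gang_schedule is hypothetical.

```python
def build_gang_schedule(gangs, num_cpus):
    """Assign each gang to a time slot so that all its processes run together.

    gangs: dict mapping gang name -> number of processes.
    Returns a list of time slots; each slot is a list of length num_cpus
    holding a gang name or None for an idle CPU.
    """
    slots = []
    free = []  # number of free CPUs remaining in each slot
    for name, size in gangs.items():
        if size > num_cpus:
            raise ValueError(f"gang {name!r} needs more CPUs than available")
        # First fit: take the first slot with enough free CPUs.
        for i, remaining in enumerate(free):
            if remaining >= size:
                break
        else:
            # No existing slot fits: open a new time slot.
            slots.append([None] * num_cpus)
            free.append(num_cpus)
            i = len(slots) - 1
        start = num_cpus - free[i]
        for cpu in range(start, start + size):
            slots[i][cpu] = name
        free[i] -= size
    return slots
```

At every task switch, all CPUs move to the next row at the same time, so communicating processes of one gang are always running simultaneously.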

While Gang Scheduling is state of the art on classical parallel computers, it creates new challenges on cluster computers. Applications couple the operating system kernels only loosely, so the kernels run almost independently of each other. Process coordination across host boundaries may be disrupted by high-priority tasks inside the kernel (such as hardware interrupts or swapping). Communication latency is high compared to integrated parallel computers, which limits the precision of coordination.

During the reporting period, we implemented a mechanism in the Linux kernel, based on ICMP packets, for remotely triggering a process-group-oriented task switch. For validation, we instrumented the kernel and developed analysis tools, which confirmed that the mechanism has the expected effect. Based on this work, we will develop, evaluate, and optimize various scheduling policies.

Scalable Servers on COTS Clusters

This project aims at using clusters of computers as a powerful platform for developing scalable servers. Our work focuses on two aspects. First, we try to develop efficient mechanisms for load balancing and cooperative caching among the cluster nodes. Second, we are interested in achieving a good trade-off between the hard-to-reconcile goals of load balancing and data reference locality.

In this regard, we developed Cluster Aware Remote Disks (CARD). CARDs are disk drivers in the kernel that are built on cooperative caching algorithms. We designed and developed one such algorithm, Home-Based Serverless Cooperative Caching (HSCC). For further information on CARDs, HSCC and their evaluation, see http://www.ipd.uka.de/RESH/publ.html.

Furthermore, we designed and developed Home-Based Locality-Aware Request Distribution (HLARD), a request distribution policy that combines HSCC with TCP connection endpoint migration. By migrating a TCP connection endpoint, a server machine physically moves the endpoint to another server in the cluster. The client is entirely oblivious to the procedure. HLARD distributes incoming requests according to the locality of the requested data as advertised by HSCC.
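
The dispatch decision can be pictured with a small sketch. It is a simplification under stated assumptions and not the HLARD implementation: the class LocalityAwareDispatcher, its methods, and the hash-based choice of a home node are hypothetical, and the TCP connection endpoint migration itself is not modelled.

```python
import zlib


class LocalityAwareDispatcher:
    """Route each request to the node that is expected to hold the data.

    Hypothetical illustration: a node that advertised a cached copy is
    preferred; otherwise the request goes to the item's home node.
    """

    def __init__(self, nodes):
        self.nodes = list(nodes)
        self.cached_at = {}                      # item -> advertising node
        self.load = {n: 0 for n in self.nodes}   # requests sent per node

    def home_of(self, item):
        # Assumption of this sketch: the home node is chosen by a stable hash.
        return self.nodes[zlib.crc32(item.encode("utf-8")) % len(self.nodes)]

    def advertise(self, item, node):
        """Record that a node announced it caches the item."""
        self.cached_at[item] = node

    def dispatch(self, item):
        """Return the node that should serve the request for the item."""
        node = self.cached_at.get(item, self.home_of(item))
        self.load[node] += 1
        return node
```

In a real server, the dispatch step would be followed by migrating the client's connection endpoint to the chosen node; here it only returns the node name.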

Multicore Software Engineering

With the emergence of multicore chips (containing multiple processors), parallel programming will enter the mainstream. Since clock frequencies are no longer increasing regularly, performance-critical applications of all sorts will need to run in parallel. We develop software engineering concepts, methods, and tools for developing reliable, parallel software of all kinds. In particular, we focus on:

  • Architectures/design patterns/frameworks/libraries for general-purpose parallel programs
  • Autotuning
  • Programming models and language extensions for multicore
  • Testing and debugging of parallel programs
  • Reengineering sequential programs for parallelism
  • Tools and development environments for multicore software
Contact: PD Dr. Victor Pankratius, Prof. Dr. Walter Tichy

Activities:

Young Investigator Group "Multicore Software Engineering" (PD Dr. Victor Pankratius)
International SEPARS Working Group

Software Engineering

Empirical Software Engineering

In 2002, we conducted a controlled experiment to compare pair programming with single developers. The single developers were assisted by an additional review of their program code. The main motivation of the study was to find a development technique that incurs only 20% of the cost of developer pairs but delivers 80% of their quality. Inspections are an accepted technique for quality assurance. Thus, the reviews used during the preparation of an inspection meeting seemed to be a reasonable candidate. The study was conducted with 20 participants of the Extreme Programming course.

Two preliminary results were observed. First, single programmers are nearly as expensive as developer pairs if they have to produce the same code quality as the pairs do. Second, if equal code quality is not required, developer pairs produce on average 7 to 13% more reliable programs at about 24% higher cost than single developers. However, both results are far from being statistically significant.

Scheduling of Software Projects

To cut development costs and meet tight deadlines in short-staffed software projects, it is essential that software project managers optimize the project plan and schedule. In practice, though, good software project scheduling is extremely hard. The time needed to complete a software development activity is difficult to estimate, since it depends not only on technical factors, but also on human factors such as the experience of the developers. Even worse, the completion of tasks in software projects is typically delayed by unanticipated rework. Such rework is caused by feedback in the development process.

To support software project managers in scheduling, we have developed a simulator which represents key factors in the dynamics of software projects: rework caused by design changes, varying staff skill levels, component coupling, and changing task assignments. The simulator takes as input statistical data collected during past projects and high-level design data about the current project. In addition, the simulator explicitly takes a scheduling strategy as input. Using the simulator, a manager can compare different strategies and choose the one that appears best for the next project setting.

As a first application, we have systematically studied the performance of the so-called list policies for a sample project. Our computations clearly show that the choice of the scheduling strategy has a strong impact on the progress and completion time of a project. Using the simulation traces, we have also provided a detailed analysis of why the list policies perform as observed.
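
A list policy keeps the tasks in a fixed priority order and, whenever a developer becomes free, assigns the first task in the list whose prerequisites are complete. The Python sketch below shows this in a deterministic setting with fixed durations. It is only an illustration of the policy class, not our simulator: it omits rework, skill levels and other stochastic factors, and the function name and task names are hypothetical.

```python
import heapq


def list_schedule(durations, deps, priority, num_developers):
    """Schedule tasks with a list policy and return (makespan, start_times).

    durations: task -> duration
    deps: task -> set of prerequisite tasks (missing key means none)
    priority: list of all tasks, highest priority first
    """
    start = {}
    running = []            # heap of (finish_time, task)
    done = set()
    time = 0
    free = num_developers
    pending = list(priority)
    while pending or running:
        # Start ready tasks in priority order while developers are free.
        for task in list(pending):
            if free == 0:
                break
            if deps.get(task, set()) <= done:
                start[task] = time
                heapq.heappush(running, (time + durations[task], task))
                pending.remove(task)
                free -= 1
        if not running:
            raise ValueError("cyclic or unsatisfiable dependencies")
        # Advance the clock to the next completion.
        time, task = heapq.heappop(running)
        done.add(task)
        free += 1
        # Also complete every other task that finishes at the same time.
        while running and running[0][0] == time:
            _, other = heapq.heappop(running)
            done.add(other)
            free += 1
    return time, start
```

With two developers, durations {"A": 4, "B": 1, "C": 1, "D": 4} and D depending on B, the priority list ["A", "B", "C", "D"] finishes at time 6, whereas ["B", "D", "A", "C"] finishes at time 5. Even in this small made-up example, the choice of list changes the completion time.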

Our research is funded by the Deutsche Forschungsgemeinschaft (DFG).

Lightweight Software Processes

Lightweight development processes such as Extreme Programming (XP) are still widely discussed. However, there is a lack of models to evaluate the costs and benefits of agile methods. To start a comparison, the individual techniques of lightweight processes, such as Pair Programming, have to be analyzed. In Pair Programming, two developers work on the same task, so the personnel cost is almost doubled compared to conventional single developers. However, a developer pair is quicker than a single programmer and, most often, pairs develop higher quality code.

The main question is whether the benefit of programmer pairs outweighs their additional personnel cost. We developed a cost-benefit model for Pair Programming to answer this question. It turns out that the economic context of a project is the decisive factor: if the market pressure is high, Pair Programming can be profitable. However, if the market pressure is low, the cost of Pair Programming will outweigh its benefit.

Methods of Software Reliability

Inspections are a successful technique to detect defects in software documents. Inspections can be applied to all kinds of documents, such as designs, specifications, or code. Usually, not all the defects contained in a document are detected during an inspection. Thus, management must decide whether to re-inspect the document to find additional defects before passing the document on to the next development phase. As a basis for this decision, management must have a reliable estimate for the number of defects which have escaped the inspection.

Empirical studies show that the existing methods for estimating the defect content after an inspection are much too unreliable to be used in practice. The methods show extreme outliers and a high variation in the estimation error. The reason is that the existing methods do not take into account the experience gained in past inspections; their only input is the results of the inspection to be estimated.

We have developed a novel method which differs substantially from the existing approaches. Besides data about the inspection to be estimated, we also use data from past inspections as input. Our method uses the empirical data to compute a relation between the number of defects detected and the number of defects which escaped. For the standard benchmark in the field, our new method outperforms the existing methods by a factor of 4 to 7.
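
To illustrate what learning such a relation from past inspections can look like, the following Python sketch fits a straight line by least squares to pairs of (defects detected, defects escaped) and applies it to a new inspection. The linear form is an assumption made for this illustration only, and the function names are hypothetical; the sketch is not a description of our actual method.

```python
def fit_line(detected, escaped):
    """Least-squares fit of escaped = slope * detected + intercept.

    detected, escaped: equally long lists with data from past inspections.
    Requires at least two different values in detected.
    """
    n = len(detected)
    mean_x = sum(detected) / n
    mean_y = sum(escaped) / n
    sxx = sum((x - mean_x) ** 2 for x in detected)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(detected, escaped))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def estimate_escaped(detected_now, slope, intercept):
    """Estimate the defects that escaped the current inspection (never negative)."""
    return max(0.0, slope * detected_now + intercept)
```

With made-up past data in which inspections that detected 10, 20 and 30 defects missed 5, 10 and 15, the fitted relation estimates 12 escaped defects for a new inspection that detected 24.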

