My name is Florian Schoppmann.

I am a consulting software engineer based in San Francisco. I have also been an architect and manager, a data scientist, and a researcher (my Ph.D. is in theoretical computer science). This site shows some of my open-source, independent, and academic projects. It also serves as personal archive.

See my LinkedIn profile if interested in my resume.

GitHub LinkedIn Email


Open Source

The following two projects I have worked on as part of my daytime jobs. For other smaller-scale projects, please see my GitHub repositories.

CloudKeeper (2012–current)

CloudKeeper is a domain-specific language and runtime system for implementing and running dataflows on the Java Virtual Machine. Designed to facilitate “programming in the large”, CloudKeeper abstracts away concerns such as data transfer, serialization, scheduling, checkpointing, and package/dependency management. It is written in Java 8 and uses Akka Actors for the distributed runtime system.

I originally designed and implemented CloudKeeper while at Lifecode, a molecular-diagnostics startup that helped oncologists make treatment decisions for cancer patients. CloudKeeper was the core infrastructure piece at Lifecode, used by computational biologists to write genome-analysis workflows in a high-level, modular, and portable fashion.

CloudKeeper has been open source since January 2016. Currently, work on the project is geared towards the first public release, version 2.0. See the web site for more information, including slide decks and examples (also note manuscript S16 below, which gives a formal description of CloudKeeper’s check-pointing algorithm).

Web Site GitHub

MADlib (2010–2012)

MADlib is an open-source library for running machine-learning algorithms at scale within relational databases such as PostgreSQL and Greenplum. I co-started the project after joining Greenplum in 2010 (now Pivotal) and was the MADlib lead engineer. MADlib is written in C, C++, Python, and SQL.

One of my main contributions to MADlib included designing and writing a C++ layer that eases the burden of writing high-performance user-defined function (UDFs) and that encapsulates DBMS-specific logic inside the abstraction layer. In brief, the MADlib C++ abstraction provides three classes of functionality: type bridging, resource management shims, and math library integration. See the MADlib design document (publication M12 below) and chapters 3 and 4 in “The MADlib Analytics Library” (publication HRSW+12 below) for more information.

Using the abstraction layer, I implemented core algorithms such as linear/logistic regression, k-means, random sampling, statistical tests, etc. I also owned other essential parts of the project, including the build system, continuous integration, release management, and documentation (using a custom SQL parser for Doxygen, which I wrote in flex and bison). In order to formally document the machine-learning algorithms implemented in MADlib, I started the design document (manuscript M12 below).

After leaving Greenplum, I no longer had time to contribute much to MADlib. Luckily, the project is in good shape and is currently in the Apache Incubator.

Apache Incubator Site GitHub (official) GitHub (my old fork)

Proprietary Software as Independent Developer (1998–2005)

Latin-German Dictionary

The Latin-German dictionary is indispensable when searching for words and when translating Latin writings into German. It allows to instantly look up entries in all conjugated/declined forms, in order to obtain the base form and its translation. In total, the software can recognize almost one million distinct word forms stemming (pun intended) from over 11,000 dictionary entries. For each base form, it can also show overviews of all inflected forms.

I started developing the Latin-Germany while in 10th grade in high school and released the first version in 1998. Over the years, the Latin-German dictionary found more than 2,000 paying customers. Today, it is available as freeware.

I implemented the Latin-German dictionary in C++, using initially the PowerPlant framework for the Mac version (those glorious 1990’s!) and the Microsoft Foundation Classes for the Windows version. After the advent of Mac OS X, I rewrote the user interface code of the Mac version in Objective-C.

Web Site


I developed an inventory control system called “WinCOWAS” (German for “computergestütztes Warenwirtschaftssystem”), tailored and simplified for educational use. From 1999 to the mid-2000’s, I sold this software to vocational schools. It was later acquired by the Europa-Lehrmittel Verlag and became part of a text book (when following the link, see the table of contents). I also developed a likewise simplified accounting software for education purposes, called “WinCOBUS”. I discontinued distribution and development of this software.




The following articles contain original and novel work but are so far unpublished in this form.


Florian Schoppmann: The CloudKeeper dataflow language and runtime system, January 2016.

This document (currently) gives a description of the interpreter component, including a formal proof of the check-pointing algorithm. To be extended into a full article about all aspects of CloudKeeper.



The MADlib project (multiple authors): MADlib Design Document, my contributions from 2012.

The design document gives technical details about architecture decisions as well as mathematical details about implemented machine-learning algorithms. While leading the MADlib project (see above), I started the design document in order to ensure formal correctness in mathematical code.



Florian Schoppmann: Generalizing the Smoothness framework, June 2010.

The purpose of this note is to further generalize the smoothness framework introduced in “Intrinsic robustness of the price of anarchy” (Roughgarden, 2009), in order to provide a unified proof for many existing price-of-anarchy results.



Yvonne Bleischwitz, Burkhard Monien, and Florian Schoppmann: To Be or Not to Be (Served): Cost-Sharing Without Indifferences, March 2010.

Supersedes conference version (BMS07). Roughly equivalent to chapter 4 of my Ph.D. thesis (S09).



Yvonne Bleischwitz, Burkhard Monien, Florian Schoppmann, and Karsten Tiemann: The Power of Lexicographic Maximization: Beyond Cross-Monotonicity, August 2009.

Supersedes conference version (BMST07). Roughly equivalent to chapter 3 of my Ph.D. thesis (S09).


Technical Notes and Informal Talks

The following notes and slides provide introductions, summaries, simplified proofs, and opinions.

Academic Research (2006–2010)

Present-day information and communication technologies crucially rely on large-scale networks that are created and maintained by a huge number of autonomous players with heterogeneous goals. Evidently, this includes the Internet as the most prominent example, but no less also peer-to-peer, logistics, and social networks. Research on (distributed) algorithms has responded to the rapid emergence of these new decentralised networks by a corresponding paradigm shift: Starting shortly before the new millennium, computer scientists have been increasingly focusing on optimization in networks where the self-interested players’ behaviour cannot be directly controlled (Nisan and Ronen, 2001; Koutsoupias and Papadimitriou, 2009; Papadimitriou, 2001). The lack of central coordination ensuing from players’ self-interested behavior is instead regarded as an inevitable constraint, similar to the lack of computing power when devising approximation algorithms or the lack of information about the future when designing online algorithms.

Located at the intersection of computer science with mathematics and economics, this field of research is known today as algorithmic game theory. Bringing together algorithms and game theory has given rise to several important questions that were not rigorously considered before: Can the loss due to selfish behaviour be quantified? In a more catchy term: What is the “price of anarchy”? What is the complexity to compute stable states (“equilibria”) or approximations of them? How should new systems be designed in order to give players an incentive to act in a particular fashion? How can the “right” incentives be reconciled with computational tractability? The following sections briefly outline three topics from my research in algorithmic game theory.

Cost Sharing

In a cost-sharing problem, finitely many players have an unknown valuation for some non-rivalrous but excludable service (e.g., network connectivity). The challenge is to design mechanisms that elicit truthful reports of the players’ valuations, determine which set of players Q to serve, and decide how to distribute the incurred service cost C(Q). Providing players with an incentive to reveal truthful information is key, because it removes the possibility to game the system. Further constraints for cost-sharing problems include budget balance (i.e., recovery of the service cost with the prices charged) and economic efficiency (i.e., a reasonable trade-off between the service cost and the excluded players’ valuations). Practical applications moreover require that the outcome of cost-sharing mechanisms be computable in polynomial time.

Cost-sharing problems are fundamental in economics and have a broad area of applications; e.g., distributing volume discounts in electronic commerce, sharing the cost of public infrastructure projects, allocating development costs of low-volume built-to-order products, etc. Despite this fundamental nature, general techniques for solving cost-sharing problems used to be rare. When requiring group-strategyproofness – meaning that not even coordinated manipulation must ever be profitable – essentially only one technique was known, due to Moulin (1999). Unfortunately, there are several natural cost-sharing problems for which any Moulin mechanism inevitably suffers poor budget balance and economic efficiency (Immorlica et al., 2008).

My research in this area therefore focused on finding alternative general techniques that perform better than Moulin mechanisms (BMST07, BMS07), on characterising the inherent limitations of group-strategyproofness with respect to the other desirable properties of cost-sharing mechanisms (BMST07, BS08b, S08), on investigating new incentive-compatibility desiderata that arise particularly in large anonymous settings (PSSW09), and on extending the cost-sharing model to a more general fault-tolerant setting where players may have demand for multiple levels of service (BS08b).

Self-Interested Resource Usage

So-called congestion games play a central role for modeling self-interested resource usage. There is a ground set of resources, and each player selects a subset of them. For concreteness, we assume here that resources are edges in a network, and each player needs to route traffic from some source to a destination node. A natural behavioral assumption is that each player simply chooses the fastest route when taking into account the current network congestion – while being oblivious to the effect her own action has on the system. In fact, similar models have already been considered in the 1950’s in the context of road traffic. Today, selfish routing models gain further motivation from the necessity of efficient routing protocols in the Internet, in logistics networks, etc.

My research in this area primarily focused on quantifying the loss due to selfishness. In two works (ADGMS06, GS07), we considered a model with players who each have a non-negligible effect on the system. For the case that the travel time over each edge is described by a polynomial of the total traffic on this edge, we gave exact values for the price of anarchy. In a different work (MMST06), we considered a selfish routing scenario where not the total travel time (e.g., of data packets) is an issue but much more the throughput. The latter is clearly a question of the bottleneck, i.e., the most congested edge on a path. We gave results on the minimum loss any equilibrium inevitably incurs, together with a characterisation of special network topologies (series-parallel graphs).

We also precisely determined the price of anarchy for the case when players can arbitrarily split their traffic over several routes (RS11) – a problem that had been open for several years (see, e.g., “Algorithmic Game Theory”, chapter 18, 2007).

Competitive Location

Competitive location games provide simple scenarios of competitive sellers: The market is modeled as some metric measurable space, and a seller’s market share is assumed to be proportional to the measure of the points that are closer to it than to any other seller. How will the sellers behave in order to maximise their market shares? There is a huge body of literature on location theory (Eiselt et al., 1993), with contributions also from operations researchers, geographers, political scientist etc. However, articles dealing with discrete location models – such as graphs – are rare.

Motivated by applications in computer networks, my research on competitive location made some first steps towards understanding discrete models: We considered competitive location games on cycle graphs and give a comprehensive examination of their equilibria. This then allowed us to also determine the exact price of anarchy (MMPS08).


From 2006–2009, I was a Ph.D. student at the International Graduate School Dynamic Intelligent Systems. Before that, from 2001–2005, I studied computer science, with mathematics as minor subject, at the University of Paderborn.



Collusion-Resistant Cost-Sharing Mechanisms: Design Techniques, Analyses, Trade-Offs. Ph.D. thesis. University of Paderborn, Germany, June 2009.

Thesis Slides URN:NBN

Price of Anarchy for Congestion Games with Polynomial Latency Functions. Diploma thesis (Diplomarbeit). University of Paderborn, Germany, December 2005.

I was awarded the prize of the computer-science department for the best Diplom (first among 76 graduates during the academic year 10/2005–09/2006).

Thesis Slides


Online Occlusion Culling. Bachelor’s thesis. University of Paderborn, Germany, January 2005.

Thesis Slides


Included are many of my summaries, term papers, and slide decks for talks. The computer-science department at the University of Paderborn distinguished five areas of study. Undergraduate/Diplom classes are tagged accordingly.

  1. Software engineering and information systems
  2. Models and algorithms
  3. Embedded systems and system software
  4. Human-machine interaction
  5. Mathematics

Please note: While the summaries are for specific classes, and while they may even incorporate official lecture material, they are otherwise entirely my own work and have not been reviewed. Beware of mistakes. Most of the documents below are in German.