Characterizing the Open Source Software Process: a Horizontal Study

A. Capiluppi, P. Lago, M. Morisio

Outline • Rationale behind the current study • Methodology • Conclusions • Current and future work

Rationale • Most Open Source analyses focus on a single, flagship project (Linux, Apache, GNOME) • Limitation: their conclusions are based on a ‘vertical’ study • There is a lack of ‘horizontal’ studies, which cover: • a pool of projects • a wider area of interest

Methodology • Choice of projects • Attributes definition • Coding • Analysis

Choice of projects: repository • We selected the FreshMeat repository • FreshMeat (http://freshmeat.net) has focused on Open Source development since 1996 • It gathers thousands of projects, either also listed on SourceForge (http://sourceforge.net) or hosted on FreshMeat only • FreshMeat lists more than 24000 projects (many inactive)

Choice of projects: sampling I • From 24000 to 406: how? • FreshMeat organizes projects by filters and categories • Example filter = “Topic” (i.e. application domain) • Its categories = {“Internet”, “Database”, “Multimedia”,…} • Other filters: Programming language, Status of Evolution, etc.

Choice of projects: sampling II • We randomly picked a number of projects through the “Status” filter • Rationale: it has a limited number of associated categories {“Planning”, “PreAlpha”, “Alpha”, “Beta”, “Stable”, “Mature”} • The overall count is 406 projects
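
The sampling step above can be sketched in Python. This is only an illustration, not the authors' procedure: the slides do not say how many projects were drawn per status, so the `per_status` parameter, the `(name, status)` input format and the function name `sample_by_status` are all assumptions.

```python
import random

# Status categories as listed on the slide.
STATUSES = ["Planning", "PreAlpha", "Alpha", "Beta", "Stable", "Mature"]

def sample_by_status(projects, per_status, seed=0):
    """Draw up to `per_status` projects at random from each status category.

    `projects` is a list of (name, status) pairs; this input format is
    hypothetical and does not mirror any FreshMeat export.
    """
    rng = random.Random(seed)  # fixed seed so the draw is repeatable
    sample = []
    for status in STATUSES:
        pool = [name for name, s in projects if s == status]
        k = min(per_status, len(pool))  # a category may hold fewer projects
        sample.extend((name, status) for name in rng.sample(pool, k))
    return sample
```

Drawing within each status category, rather than from the full list, keeps small categories such as “Mature” represented in the sample.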

Attribute definition • Age • Application domain • Programming language • Size [KB] • Number of developers • Stable and transient developers • Number of users • Modularity level • Documentation level • Popularity • Status • Success of project • Vitality • Red: defined by FreshMeat • Black: defined by us

Coding • Each attribute was coded twice, to capture evolutionary trends • First observation: January 2002 • Second observation: July 2002

Analysis • Here we discuss: • Application domain issues • Developers [stable & transient] issues • Subscribers (as users) issues • Code size issues

Application domain distribution

Attributes: project’s developers • We evaluate how many people write code for an application • External contributions are always credited in special-purpose files, or in the ChangeLog • We distinguish between • Stable developers • Transient developers • Core team: more than one stable developer • Manual inspections and pattern-recognition scripts
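
A pattern-recognition script of the kind mentioned above might look like the following sketch. It assumes GNU-style ChangeLog headers (`date  name  <email>`), and it assumes a purely hypothetical rule for telling stable from transient developers (`min_entries`); the slides give neither the file format nor the rule actually used.

```python
import re
from collections import Counter

# Assumed GNU-style entry header: "2002-01-15  Jane Roe  <jane@example.org>"
HEADER = re.compile(r"^(\d{4}-\d{2}-\d{2})\s+(.+?)\s+<[^>]+>\s*$")

def count_entries(changelog_text):
    """Count ChangeLog entries per credited author."""
    counts = Counter()
    for line in changelog_text.splitlines():
        match = HEADER.match(line)
        if match:
            counts[match.group(2)] += 1
    return counts

def split_developers(counts, min_entries=3):
    """Split authors into (stable, transient) lists.

    The threshold is an illustrative assumption, not the study's definition.
    """
    stable = sorted(name for name, c in counts.items() if c >= min_entries)
    transient = sorted(name for name, c in counts.items() if c < min_entries)
    return stable, transient
```

Any such script would still need the manual inspection the slide mentions, since credits are often written in free-form text.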

Developers over projects • We observe: • 72% of projects have a single stable developer • 80% of projects have at most 10 developers

Developers distribution over projects

Definition: clusters of developers • Cluster 1: 1 to 3 developers (64.5%) • Cluster 2: 4 to 10 developers (20%) • Cluster 3: 11 to 20 developers (9.5%) • “Average” nr. of stable dev: 2 • “Average” nr. of transient dev: 3 • Cluster 4: more than 20 developers (6%) • “Average” nr. of stable dev: 6 • “Average” nr. of transient dev: 19
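
The cluster boundaries above translate directly into a small lookup. The sketch below is illustrative; the function names `developer_cluster` and `cluster_shares` are not from the study.

```python
from collections import Counter

def developer_cluster(n_developers):
    """Map a project's developer count to cluster 1..4 as defined on the slide."""
    if n_developers < 1:
        raise ValueError("a project needs at least one developer")
    if n_developers <= 3:
        return 1
    if n_developers <= 10:
        return 2
    if n_developers <= 20:
        return 3
    return 4

def cluster_shares(developer_counts):
    """Return the percentage of projects falling into each cluster."""
    tally = Counter(developer_cluster(n) for n in developer_counts)
    total = len(developer_counts)
    return {c: 100.0 * tally.get(c, 0) / total for c in (1, 2, 3, 4)}
```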

Productivity vs. ‘global’ developers

Productivity vs. ‘stable’ developers

Code variation over clusters

Attributes: subscribers • We use publicly available data as a proxy for the number of users • Users ~ mailing list subscribers (public datum) • It is not a monotonic measure: subscribers can leave as well as join • We have a measure of users at two different observations

Distribution of subscribers over projects • Around 42% of projects have at most 1 subscriber-user

Users evolution

Attributes: project’s size • We evaluate the code of each project twice • Code evaluated is contained in packages. We exclude from the count: • Auxiliary files: documentation, configuration files, GIF files, etc. • Legacy code: inherited libraries (e.g. Gnome macros), internationalization code
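
The exclusion rules above can be sketched as a filter over a package's file listing. The suffix and directory sets below are illustrative assumptions, not the lists used in the study, and the `(path, size_in_bytes)` input format is hypothetical.

```python
from pathlib import PurePosixPath

# Assumed examples of auxiliary files (documentation, configuration, images).
AUXILIARY_SUFFIXES = {".txt", ".html", ".gif", ".png", ".conf", ".cfg", ".po"}
# Assumed examples of directories holding inherited or internationalization code.
LEGACY_DIRS = {"macros", "intl", "po"}

def code_size_kb(files):
    """Sum the size in KB of files that count as project code.

    `files` is an iterable of (path, size_in_bytes) pairs.
    """
    total_bytes = 0
    for path, size in files:
        p = PurePosixPath(path)
        if p.suffix.lower() in AUXILIARY_SUFFIXES:
            continue  # skip auxiliary files
        if LEGACY_DIRS & set(p.parts[:-1]):
            continue  # skip anything under a legacy directory
        total_bytes += size
    return total_bytes / 1024
```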

Distribution of code size over projects

Size changes between the two observations

Conclusions I • The vast majority of projects are developed by only one developer • Adding people to a project has little effect on productivity (i.e. code per developer) • Open Source software is made by experts for experts (72% of horizontal projects have more than 10 developers) • 58% of projects didn’t change their size • 63% of projects had a change within 1%
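
The size-change and productivity figures above rest on simple ratios, sketched below under the assumption that each project has one size measurement in KB per observation. The function names are illustrative, not taken from the study.

```python
def relative_change(first_kb, second_kb):
    """Relative size change between the first and second observation."""
    if first_kb <= 0:
        raise ValueError("first observation must be positive")
    return (second_kb - first_kb) / first_kb

def share_within(sizes, tolerance):
    """Fraction of projects whose absolute relative change is <= tolerance.

    `sizes` is a list of (first_kb, second_kb) pairs.
    """
    changes = [abs(relative_change(a, b)) for a, b in sizes]
    return sum(c <= tolerance for c in changes) / len(changes)

def code_per_developer(size_kb, n_developers):
    """Productivity proxy named on the slide: code size per developer."""
    return size_kb / n_developers
```

With a tolerance of 0 the second function gives the share of projects that did not change size; with 0.01 it gives the share that changed by at most 1%.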

Conclusions II • Java is relevant for 8% of the projects, C/C++ for 56%, Perl and Python together for 20% • Observations from flagship projects (Apache, Linux, GNOME) are not confirmed for an average Open Source project • Several projects are white noise and should be filtered out • Public repositories hold a huge amount of data: an invaluable resource of software data for empirical researchers

Current and future work • Eliminating white noise: only projects in clusters 3 and 4 have been selected • Deeper analysis: the whole history of each project is being studied • What can we say with respect to the conclusions drawn on bigger OS projects? • What can be said about OSS evolution compared with traditional software evolution?
