Characterizing the Open Source Software Process: a Horizontal Study A. Capiluppi, P. Lago, M. Morisio
Outline • Rationale behind the current study • Methodology • Conclusions • Current and future work
Rationale • Most Open Source analyses focus on a single, flagship project (Linux, Apache, GNOME) • Limitation: the conclusions are based on a ‘vertical’ study • there is a lack of ‘horizontal’ studies • a pool of projects • a wider area of interest
Methodology • Choice of projects • Attribute definition • Coding • Analysis
Choice of projects: repository • We selected the FreshMeat repository • FreshMeat (http://freshmeat.net) has focused on Open Source development since 1996 • It gathers thousands of projects, either also listed on SourceForge (http://sourceforge.net) or hosted on FreshMeat only • FreshMeat lists more than 24,000 projects (many inactive)
Choice of projects: sampling I • From 24,000 to 406: how? • FreshMeat organizes projects by filters and categories • Example filter = “Topic” (i.e. application domain) • Its categories = {“Internet”, “Database”, “Multimedia”, …} • Other filters: Programming language, Status of Evolution, etc.
Choice of projects: sampling II • We randomly picked a number of projects through the “Status” filter • Rationale: this filter has a limited number of associated categories {“Planning”, “PreAlpha”, “Alpha”, “Beta”, “Stable”, “Mature”} • The overall count is 406 projects
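The random pick through the “Status” filter can be illustrated with a minimal sketch. The slides do not say how many projects were drawn per category, so the equal-size draw, the `(name, status)` input shape, and the function name `sample_by_status` are all assumptions made for illustration, not the authors' procedure.

```python
import random
from collections import defaultdict

# Status categories as listed on the slide.
STATUSES = ["Planning", "PreAlpha", "Alpha", "Beta", "Stable", "Mature"]


def sample_by_status(projects, per_status, seed=0):
    """Draw up to `per_status` random projects from each Status category.

    `projects` is an iterable of (name, status) pairs. The equal-size draw
    per category is an assumption of this sketch.
    """
    rng = random.Random(seed)  # fixed seed so the draw is reproducible
    by_status = defaultdict(list)
    for name, status in projects:
        by_status[status].append(name)

    sample = {}
    for status in STATUSES:
        pool = by_status.get(status, [])
        k = min(per_status, len(pool))  # never ask for more than available
        sample[status] = rng.sample(pool, k)
    return sample
```

Grouping by status first keeps every category represented in the sample, even when some categories hold far fewer projects than others.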
Attribute definition • Age • Application domain • Programming language • Size [KB] • Number of developers • Stable and transient developers • Number of users • Modularity level • Documentation level • Popularity • Status • Success of project • Vitality • Color legend on the original slides: red = defined by FreshMeat, black = defined by us
Coding • Each attribute was coded twice, to capture evolutionary trends • First observation: January 2002 • Second observation: July 2002
Analysis • Here we discuss: • Application domain issues • Developers [stable & transient] issues • Subscribers (as users) issues • Code size issues
Attributes: project’s developers • We evaluate how many people write code for an application • External contributions are always credited in special-purpose files, or in the ChangeLog • We distinguish between • Stable developers • Transient developers • Core team: more than one stable developer • Data collected through manual inspections and pattern-recognition scripts
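A pattern-recognition script of the kind mentioned above might look like the following minimal sketch. It assumes GNU-style ChangeLog entry headers (`YYYY-MM-DD  Name  <email>`); the regular expression and the function name `count_changelog_authors` are hypothetical, and the slides do not give the criteria that separate stable from transient developers, so the sketch only counts entries per contributor.

```python
import re
from collections import Counter

# Assumed GNU-style entry header: "2002-01-15  Jane Doe  <jane@example.org>"
HEADER = re.compile(r"^(\d{4}-\d{2}-\d{2})\s+(.+?)\s+<([^>]+)>\s*$")


def count_changelog_authors(text):
    """Return a Counter mapping contributor name to number of ChangeLog entries."""
    counts = Counter()
    for line in text.splitlines():
        match = HEADER.match(line)
        if match:
            counts[match.group(2).strip()] += 1
    return counts
```

Per-contributor entry counts like these could feed a manual stable/transient classification, which is where the manual inspection step would come in.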
Developers over projects • We observe: • 72% of projects have a single stable developer • 80% of projects have at most 10 developers
Definition: clusters of developers • Cluster 1: 1 to 3 developers (64.5%) • Cluster 2: 4 to 10 developers (20%) • Cluster 3: 11 to 20 developers (9.5%) • “Average” nr. of stable dev: 2 • “Average” nr. of transient dev: 3 • Cluster 4: more than 20 developers (6%) • “Average” nr. of stable dev: 6 • “Average” nr. of transient dev: 19
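The cluster boundaries above translate directly into a small binning function. This is a sketch for illustration; the function names `developer_cluster` and `cluster_shares` are hypothetical, and only the four ranges come from the slide.

```python
def developer_cluster(n_developers):
    """Map a project's developer count to cluster 1..4 as defined on the slide."""
    if n_developers < 1:
        raise ValueError("a project needs at least one developer")
    if n_developers <= 3:
        return 1
    if n_developers <= 10:
        return 2
    if n_developers <= 20:
        return 3
    return 4


def cluster_shares(developer_counts):
    """Return the percentage of projects falling in each cluster."""
    if not developer_counts:
        raise ValueError("no projects given")
    totals = {1: 0, 2: 0, 3: 0, 4: 0}
    for n in developer_counts:
        totals[developer_cluster(n)] += 1
    return {c: 100.0 * k / len(developer_counts) for c, k in totals.items()}
```

For example, `cluster_shares([1, 2, 5, 25])` puts half of the projects in cluster 1 and a quarter each in clusters 2 and 4.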
Attributes: subscribers • We use publicly available data as a proxy for the number of users • Users ~ Mailing List subscribers (public datum) • It’s not a monotonic measure: subscribers can leave as well as join • We measure users at two different observations
Distribution of subscribers over projects • Around 42% of projects have at most 1 subscriber-user
Attributes: project’s size • We measure the code size of each project twice, once per observation • The code measured is the code contained in the packages. We exclude from the count: • Auxiliary files: documentation, configuration files, GIF files, etc. • Legacy code: inherited libraries (e.g. Gnome macros), internationalization code
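A size count with these exclusions could be sketched as follows. The slides name only the kinds of files excluded, so the concrete extension and directory lists below, as well as the function name `code_size_kb`, are assumptions chosen for illustration.

```python
import os

# Hypothetical exclusion lists; the slides name only the kinds of files.
EXCLUDED_EXTENSIONS = {".gif", ".png", ".txt", ".html", ".conf", ".cfg", ".po"}
EXCLUDED_DIRS = {"doc", "docs", "po", "intl", "macros"}


def code_size_kb(root):
    """Sum the size in KB of files under `root`, skipping excluded items."""
    total_bytes = 0
    for dirpath, dirnames, filenames in os.walk(root):
        # Prune excluded directories in place so os.walk does not descend.
        dirnames[:] = [d for d in dirnames if d.lower() not in EXCLUDED_DIRS]
        for name in filenames:
            extension = os.path.splitext(name)[1].lower()
            if extension in EXCLUDED_EXTENSIONS:
                continue
            total_bytes += os.path.getsize(os.path.join(dirpath, name))
    return total_bytes / 1024
```

Running the same function on the unpacked package at each observation gives two comparable size figures per project.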
Conclusions I • The vast majority of projects are developed by only one developer • Adding people to a project has little effect on productivity (i.e. code per developer) • Open Source software is made by experts for experts (72% of horizontal projects have more than 10 developers) • 58% of projects didn’t change their size between the two observations • 63% of projects had a size change within 1%
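The two size figures above (unchanged size, change within 1%) amount to a relative-change computation over the two observations. The sketch below assumes one `(first_kb, second_kb)` pair per project; the function names and the input shape are hypothetical, and no data from the study is reproduced.

```python
def relative_change(size_first, size_second):
    """Relative size change between the first and second observation."""
    if size_first <= 0:
        raise ValueError("first observation must have a positive size")
    return (size_second - size_first) / size_first


def share_within(sizes, tolerance):
    """Percentage of projects whose absolute relative change is <= tolerance.

    `sizes` is a list of (first_kb, second_kb) pairs, one per project.
    """
    if not sizes:
        raise ValueError("no projects given")
    hits = sum(1 for first, second in sizes
               if abs(relative_change(first, second)) <= tolerance)
    return 100.0 * hits / len(sizes)
```

Calling `share_within(sizes, 0.0)` gives the share of projects with unchanged size, and `share_within(sizes, 0.01)` the share that changed by at most 1%.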
Conclusions II • Java is relevant for 8% of the projects, C/C++ for 56%, Perl and Python together for 20% • Observations from flagship projects (Apache, Linux, Gnome) are not confirmed for an average Open Source project • Several projects are white noise, to be filtered out • Public repositories hold a huge amount of data: empirical researchers have an invaluable resource of software data
Current and future work • Eliminating white noise: only projects in clusters 3 and 4 have been selected • Deeper analysis: the whole history of a project is being studied • What can we say with respect to the conclusions drawn on bigger OS projects? • What can be said about OSS evolution compared with traditional software evolution?