I am frequently asked this question, so I figured it merits a detailed post for others to read. What does it mean to scale? For build servers, the primary measure of scale is the number of concurrent builds running across a number of worker nodes.
To handle a large number of concurrent builds and worker nodes, there are multiple pieces of the system that need to scale, including the server, build queue, database and workers. We can address each of these individually.
So what is a large number of concurrent builds and worker nodes? This is a difficult question to answer because most large corporations with high build volume do not publish this information. I do know companies like Mozilla and HP have a peak volume of 1000+ concurrent builds, so let's use this as our target level.
We should also estimate how many builds are going to get executed over time. Let's assume constant build concurrency and an average build duration of 30 minutes. Each of the 1000 build slots then completes two builds per hour, which means approximately 2000 builds per hour. Assuming a 40-hour work week and 52 weeks per year, that is around 4 million builds per year.
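To make the arithmetic explicit, here is a minimal Go sketch of the estimate. The constants mirror the assumptions in this post; the 52-week year and the function names are purely illustrative.

```go
package main

import "fmt"

// Assumed inputs, taken from the estimate in this post.
const (
	concurrentBuilds = 1000 // target peak concurrency
	buildMinutes     = 30   // assumed average build duration
	hoursPerWeek     = 40   // assumed work week
	weeksPerYear     = 52   // assumption: no weeks excluded
)

// buildsPerHour returns how many builds finish per hour when `concurrent`
// slots are always busy and each build takes `minutes` minutes.
func buildsPerHour(concurrent, minutes int) int {
	return concurrent * 60 / minutes
}

// buildsPerYear scales an hourly rate to a year of working hours.
func buildsPerYear(perHour, hoursPerWeek, weeksPerYear int) int {
	return perHour * hoursPerWeek * weeksPerYear
}

func main() {
	perHour := buildsPerHour(concurrentBuilds, buildMinutes)
	perYear := buildsPerYear(perHour, hoursPerWeek, weeksPerYear)

	fmt.Println("builds/hour:  ", perHour)    // 2000
	fmt.Println("builds/minute:", perHour/60) // 33 (integer division)
	fmt.Println("builds/year:  ", perYear)    // 4160000
}
```

The yearly figure comes out at 4,160,000, which is where the "around 4 million" number comes from.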
The MySQL or Postgres database will likely be the first piece of our infrastructure that we need to scale. The database has been performance tested with 5 million repositories and 60 million builds on a single-core machine with 1 GB of RAM, and served approximately 200-1000 requests per second for the most intensive query, which is used to display the build feed.
I do think one issue we'll need to address is that build logs are stored in the database. They are better suited to something like S3 or Swift. In fact, the Drone service uses S3, so hopefully this code will get ported to the open source edition.
Based on initial testing, and if we enable S3 or Swift storage, I believe a modestly sized database server will handle our target scale of 1000 concurrent builds and approximately 4 million yearly builds. Over time you will need to increase the size of this server, and perhaps implement master/slave or other configurations to manage growth.
Some details are provided at https://github.com/drone/drone/issues/1234#issuecomment-148874156
The server is responsible for processing build requests (received from GitHub hooks), managing the build queue, and rendering the results via the website. At our estimated levels, the server would receive approximately 30 webhooks per minute (2000 builds per hour is about 33 per minute), which Drone will handle fine. The build queue will be addressed in the next section, so let's focus on web traffic next ...
Build servers generally see low web traffic. Users are typically notified of build success and failure via email or Slack, and only visit the website to view the logs when a build fails. The server itself has relatively low overhead, running at 15 MB of RAM for small test installations, so I would expect a server with 4 GB of RAM to be adequate for large installations.
I will run some tests and post the results after the holidays.
I should also mention that you will likely need to increase the open file descriptor limit to handle larger workloads and more pageviews.
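As a rough illustration, a Go process on Linux can inspect and raise its own soft open-file limit up to the hard limit using the standard `syscall` package. This is only a sketch and not something Drone is known to do; the limit is more commonly raised outside the process, for example with `ulimit -n` or in the service manager configuration. The function name `raiseFileLimit` is hypothetical.

```go
package main

import (
	"fmt"
	"syscall"
)

// raiseFileLimit raises the soft open-file limit to the hard limit and
// returns the resulting soft limit. Linux-only sketch: other platforms may
// use different field types or reject this value.
func raiseFileLimit() (uint64, error) {
	var lim syscall.Rlimit
	if err := syscall.Getrlimit(syscall.RLIMIT_NOFILE, &lim); err != nil {
		return 0, err
	}
	// An unprivileged process may raise its soft limit up to the hard limit.
	lim.Cur = lim.Max
	if err := syscall.Setrlimit(syscall.RLIMIT_NOFILE, &lim); err != nil {
		return 0, err
	}
	return lim.Cur, nil
}

func main() {
	n, err := raiseFileLimit()
	if err != nil {
		fmt.Println("could not raise open file limit:", err)
		return
	}
	fmt.Println("open file limit is now", n)
}
```

Going beyond the hard limit requires elevated privileges or a change to the system configuration.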
The build queue is embedded to avoid an extra piece of infrastructure. The queue uses a Go channel, a primitive Go structure, which can easily hold 1000 queued builds while consuming just a few megabytes of RAM.
I've tested executing 10,000 builds in a 30 minute span on a single-core machine with 4 GB of RAM on Google Compute Engine. This test was successfully executed multiple times without issue.
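To show what a channel-backed queue looks like, here is a minimal sketch using a buffered channel. It is not Drone's actual queue code; the `Build` and `Queue` types and their methods are hypothetical names chosen for illustration.

```go
package main

import "fmt"

// Build is a hypothetical stand-in for a queued build request.
type Build struct {
	ID   int
	Repo string
}

// Queue is a minimal in-memory build queue backed by a buffered channel.
type Queue struct {
	items chan *Build
}

// NewQueue creates a queue that can hold up to `capacity` pending builds.
func NewQueue(capacity int) *Queue {
	return &Queue{items: make(chan *Build, capacity)}
}

// Publish adds a build without blocking. It reports false when the queue
// is already full.
func (q *Queue) Publish(b *Build) bool {
	select {
	case q.items <- b:
		return true
	default:
		return false
	}
}

// Pull blocks until a build is available and returns it (FIFO order).
func (q *Queue) Pull() *Build {
	return <-q.items
}

// Len reports how many builds are currently waiting.
func (q *Queue) Len() int {
	return len(q.items)
}

func main() {
	q := NewQueue(1000)
	for i := 1; i <= 1000; i++ {
		q.Publish(&Build{ID: i, Repo: "octocat/hello-world"})
	}
	fmt.Println("queued:", q.Len())                                // queued: 1000
	fmt.Println("accepted when full:", q.Publish(&Build{ID: 1001})) // false
	fmt.Println("first out:", q.Pull().ID)                          // first out: 1
}
```

The channel only holds pointers, so 1000 queued entries cost very little memory. Because everything lives in process memory, the contents are gone once the process exits.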
The biggest problem with our embedded queue is that restarting the server flushes the queue. This will be mitigated in the next release. See https://github.com/drone/drone/issues/1195
Drone stores a pool of Docker daemon URLs and manages scheduling and execution. The number of workers is limited only by the size of the pool Drone can manage. Based on our estimates this amounts to a map of 1,000 URLs and a map of 1,000 Go channels. We need to test the upper bound, but I would not expect any issue managing a pool of 1000 workers.
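A pool like this can be sketched in a few lines of Go. The following is a simplified illustration, not Drone's scheduler: it uses one map of daemon URLs plus a single buffered channel of idle workers, and the `Pool` type, its methods, and the example hostnames are all hypothetical.

```go
package main

import (
	"fmt"
	"sync"
)

// Pool tracks Docker daemon URLs and hands out idle ones.
type Pool struct {
	mu   sync.Mutex
	busy map[string]bool // registered daemon URL -> currently reserved?
	idle chan string     // URLs that are free to take a build
}

// NewPool creates a pool that can manage up to `capacity` workers.
func NewPool(capacity int) *Pool {
	return &Pool{
		busy: make(map[string]bool, capacity),
		idle: make(chan string, capacity),
	}
}

// Add registers a daemon URL. It reports false for duplicates or when the
// pool is already at capacity.
func (p *Pool) Add(url string) bool {
	p.mu.Lock()
	defer p.mu.Unlock()
	if _, ok := p.busy[url]; ok || len(p.busy) == cap(p.idle) {
		return false
	}
	p.busy[url] = false
	p.idle <- url // cannot block: idle holds at most one entry per worker
	return true
}

// Reserve blocks until a worker is idle and returns its URL.
func (p *Pool) Reserve() string {
	url := <-p.idle
	p.mu.Lock()
	p.busy[url] = true
	p.mu.Unlock()
	return url
}

// Release returns a reserved worker to the idle set. Releasing a worker
// that is not reserved is ignored.
func (p *Pool) Release(url string) {
	p.mu.Lock()
	defer p.mu.Unlock()
	if p.busy[url] {
		p.busy[url] = false
		p.idle <- url
	}
}

// Idle reports how many workers are currently free.
func (p *Pool) Idle() int {
	return len(p.idle)
}

func main() {
	pool := NewPool(1000)
	for i := 0; i < 1000; i++ {
		pool.Add(fmt.Sprintf("tcp://worker-%d.example.com:2376", i))
	}

	// Simulate 5000 builds competing for the 1000 workers.
	var wg sync.WaitGroup
	for b := 0; b < 5000; b++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			url := pool.Reserve()
			defer pool.Release(url)
			// A real scheduler would run the build against the daemon at url.
		}()
	}
	wg.Wait()
	fmt.Println("idle workers after run:", pool.Idle()) // 1000
}
```

Holding 1000 entries in a map and a channel is trivial for the Go runtime, which matches the expectation that pool bookkeeping itself should not be the bottleneck.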
So Does Drone Scale to 1000 workers?
Parts of it may, but I doubt the overall solution would scale to 1000 concurrent builds simply because we haven't tested this much load yet. I do think it is absolutely possible that with testing, measurement and refinement a single large Drone server can scale to thousands of nodes and thousands of concurrent builds.
My guess is that setting up the tests and measurements will be more work than performance tuning the system to meet our target load.
If anyone is willing to dedicate 1000 servers to run such a test please let me know. I think it would be a really fun exercise, and I think this is a worthwhile, attainable goal for the project.