Atlassian has admitted that the tools it developed to move Jira users into the cloud were actually slower than older code that did the same job, and that its efforts to speed things up also had speed problems.
The Australian collaborationware company last year decided to discontinue its datacenter products and shift users to cloudy equivalents. That decision came five years after Atlassian killed its server products.
As explained in a Tuesday post by senior software engineer Priyansh Jain, Atlassian operates a migration platform team that built a migration pipeline on an API-driven architecture.
“This architecture, however, proved to be blocking and less scalable, and the customers in our migration pipeline were too large to migrate using this approach,” Jain wrote.
So Atlassian built a new migration architecture that Jain said operates “in a streamlined fashion, and gracefully avoids the bottlenecks and scalability issues that existed in the API-driven architecture.” The company released it to clients who needed to move Jira implementations of up to 20,000 seats.
But when Atlassian tested it, the company found the new migration pipeline took about 34 percent longer than its previous tools, and overall work item throughput dropped by roughly 60 percent on synthetic tests.
“For customers with tens of thousands of users and massive project portfolios, fixing this became non-negotiable,” Jain wrote.
The fix involved many tweaks.
“We benchmarked different worker node sizes and configurations. The original setup ran on small nodes; scaling them up yielded significantly better throughput, balancing cost vs performance,” Jain’s post states. He says Atlassian also tightened autoscaling rules so that worker nodes spun up quickly whenever CPU usage spiked, maintaining high throughput from the start.
Atlassian also “uncovered misconfigurations in the polling timeout. Our work item processing often took 60-120 seconds, but the consumer timeout was set to 40 seconds. That meant batches were being retried mid-flight, wasting work and slashing throughput by 30–40 percent. Setting a realistic timeout – 300 seconds – resolved the issue immediately.”
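To see why a 40-second timeout hurts when work takes 60-120 seconds, here is a minimal sketch of the mismatch. It assumes a queue that hands a batch out again every time the timeout lapses without an acknowledgement, and consumers that acknowledge only when they finish. Jain's post does not name the queue Atlassian uses, so the function name and the redelivery model are illustrative assumptions, not the company's code.

```python
import math


def deliveries_before_completion(processing_s: float, timeout_s: float) -> int:
    """Count how often a batch is handed out before the first attempt finishes.

    Assumed model (hypothetical, not Atlassian's implementation): the queue
    redelivers an unacknowledged batch every `timeout_s` seconds, and the
    first consumer acknowledges only after `processing_s` seconds of work.
    """
    if processing_s <= 0 or timeout_s <= 0:
        raise ValueError("durations must be positive")
    # Deliveries happen at t = 0, timeout_s, 2 * timeout_s, ... while t < processing_s.
    return max(1, math.ceil(processing_s / timeout_s))


if __name__ == "__main__":
    for processing in (60, 120):
        for timeout in (40, 300):
            count = deliveries_before_completion(processing, timeout)
            print(f"work={processing}s timeout={timeout}s -> delivered {count} time(s)")
```

Under this model, every batch is delivered at least twice with the 40-second setting and exactly once with the 300-second setting. The sketch only counts redeliveries; it does not try to reproduce the 30–40 percent throughput loss Jain reports.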
Many bug fixes later, Atlassian had a tool that saw median throughput for large, multi-project migrations improve by roughly 6x.
Jain’s post doesn’t say how long it took Atlassian to turn things around, but does say the company also started work on tools to migrate even larger Jira instances of up to 50,000 seats.
“The bar was clear: Migrate thousands of projects (roughly 6,500) and up to ~7.5 million work items in under 36 hours, with the import phase fitting within ~24 hours,” Jain revealed, adding “We couldn’t just ‘turn the knobs to 11’ and hope for the best.”
Such wisdom is surely rare.
But we digress: Jain details how efforts to create the new tool hit issues such as migrations starting slowly and only gaining speed after 45–60 minutes, because Atlassian tied autoscaling logic to CPU load. The company fixed that with what Jain described as “a mechanism to proactively ensure a minimum number of worker nodes were running whenever a significant migration kicked off. Imports started at full strength, shaving 30–60 minutes off overall migration time.”
That change paid off for Atlassian, too, because it allowed the company to maintain a smaller steady-state footprint and scale up only when needed – reducing monthly infrastructure costs by up to $65,000.
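The idea of a proactive floor on top of CPU-driven scaling can be sketched in a few lines. This is an illustration under stated assumptions rather than Atlassian's mechanism: the function, its parameters, and the worker counts in the usage lines are all hypothetical, and the post does not say how the minimum is chosen or enforced.

```python
def desired_workers(
    cpu_based_target: int,
    significant_migration_active: bool,
    migration_floor: int,
) -> int:
    """Return how many worker nodes to run.

    cpu_based_target: what a reactive, CPU-driven autoscaler would ask for.
    migration_floor: hypothetical minimum enforced while a large migration runs.
    """
    if cpu_based_target < 0 or migration_floor < 0:
        raise ValueError("worker counts must be non-negative")
    if significant_migration_active:
        # Start at full strength instead of waiting for CPU load to climb.
        return max(cpu_based_target, migration_floor)
    # No big migration: fall back to the smaller steady-state footprint.
    return cpu_based_target


if __name__ == "__main__":
    print(desired_workers(2, True, 20))   # migration just kicked off, CPU still low
    print(desired_workers(2, False, 20))  # idle: no floor applied
    print(desired_workers(30, True, 20))  # CPU demand already exceeds the floor
```

Because the floor applies only while a significant migration is running, the fleet can stay small the rest of the time, which matches the cost saving described above.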
Another issue saw large migrations cause API errors, “because the read replicas of our database cluster struggled to keep up with the heavy write load.”
Atlassian fixed them all, and this time validated the tool’s readiness to serve 50,000-seat customers.
“End-to-end, the system migrated 6k+ projects in a single day. We’re now ready to serve 50K-scale Jira customers,” Jain enthused. And this time, without degrading performance. ®
Source: The Register