Draft changes to a key research and development (R&D) tax credit program are being lauded by players in the Canadian tech ecosystem, as some argue they could address Canada’s productivity woes by further de-risking research spending.
The Department of Finance introduced draft reforms to the Scientific Research and Experimental Development (SR&ED) program, Canada’s largest federal program for business R&D, last Friday.
The proposed SR&ED reforms address longstanding pain points, experts told BetaKit, by reintroducing capital expenditures as claimable under SR&ED and making public companies eligible for the preferred tax credit rate. The changes would also increase the total amounts companies can claim, and allow them to stay within the eligibility threshold for longer.
First introduced in the feds’ Fall Economic Statement last December, the updates were put into legislative limbo after Finance Minister Chrystia Freeland resigned and Parliament was prorogued in January.
Administered by the Canada Revenue Agency, SR&ED was first created in 1948 and provides billions in tax incentives annually to encourage Canadian businesses, including tech companies, to engage in research activities.
Canadian tech and business interest groups have been debating the merits of a SR&ED overhaul since the Trudeau government pledged to review the program in 2022. In early 2024, the government launched consultations on how to improve its approach, with a focus on improving the creation and retention of intellectual property (IP).
Benjamin Bergen, president of tech scaleup lobby group the Council of Canadian Innovators (CCI), welcomed the drafted reforms but acknowledged that the “economic and geopolitical environment” is different than when they were introduced.
“What we need is policies that help to protect and commercialize key intellectual property and ensure that Canadians benefit from homegrown innovation,” Bergen said in a statement.
The organization has advocated for the program to incentivize domestic IP creation through a patent-box regime. Such a system would give tax breaks to profits generated from domestic IP.
Though the new draft legislation does not mention IP, the government said it would explore a patent-box regime in last year’s Fall Economic Statement, the details of which would be revealed in Budget 2025.
The draft legislation is open for public feedback until Sept. 12, according to the finance department. The House of Commons is set to resume sessions on Sept. 15, which is the earliest point at which this legislation could be tabled. If set into motion, the changes could apply retroactively to companies with fiscal years ending in 2025.
The return of capital expenditures
The draft legislation seeks to reintroduce capital expenditures as claimable expenses, covering spending on equipment used for research and testing, not production. For example, companies would be allowed to claim up to 40 percent of their spending on machinery such as microscopes.
Business and tax experts told BetaKit that this change, if implemented, would positively impact Canadian tech companies’ research efforts and could make a dent in national productivity, which is typically measured as gross domestic product output per hour worked. Canada’s labour productivity growth has declined over the past 30 years, according to the Bank of Canada, with the decrease accelerating after 2014.
Martha Breithaupt, a tax partner for credits and incentives at accounting firm BDO Canada, told BetaKit she sees a relationship between the removal of this provision in 2013 and Canada’s flagging productivity today, which is among the lowest of G7 countries.
“That coincided with a massive innovation and productivity downturn when it was taken out,” Breithaupt said.
Incentivizing companies to invest in hard assets could create more jobs through new research pursuits and bolster investment opportunities, she said. Experts have argued that boosting IP protections, investing in artificial intelligence (AI), and increasing competition could also increase productivity.
Bryan Watson, managing partner of CleanTech North and longtime SR&ED expert and commentator, argued to BetaKit that the reforms are “absolutely a step in the right direction” for hardtech, medtech, and engineering companies. “Anything that’s not just coding only” could benefit from the capital expenditure claims, he said.
The potential legislation comes as Canadian hardware companies deal with the uncertainty of an ongoing trade war as well as a gloomy early-stage venture capital (VC) funding landscape.
For the year ending in March 2025, 40 percent of SR&ED tax credits were issued for software development, while roughly another 40 percent went toward electrical, mechanical, and medical engineering, according to official program statistics.
Public companies could reap benefits
The drafted changes would allow publicly listed Canadian companies to also benefit from the enhanced SR&ED tax credit rate of 35 percent, which was previously only accessible to private Canadian companies.
That barrier had contributed to challenges for companies that sought to raise capital from the public markets but stood to lose SR&ED benefits to fund R&D efforts, such as life sciences companies with long-term research needs. At the BetaKit Town Hall: Vancouver last year, AbCellera vice-president of business development Anne Stevens said that companies who need to go public earlier lose out on tax credits under the current system.
Andrew White, CEO of TSX Venture-listed cleantech company CHAR Technologies, said that losing SR&ED eligibility was a downside to going public in 2016. CHAR converts wood waste into biocarbon to replace metallurgical coal, among other renewables.
White told BetaKit his company plans to reinvest in R&D for more intellectual property and chase new research avenues if the changes take effect.
Dani Lipkin, managing director of the global innovation sector at TMX Group, told BetaKit that it’s currently “harmful” for private companies to go public if they are reliant on SR&ED. The new changes could potentially level the playing field for publicly listed companies, he said.
“This will enable growth-stage companies to invest more and scale faster,” Lipkin added.
Feature image courtesy François-Philippe Champagne via LinkedIn.
Google published Zanzibar: Google’s Consistent, Global Authorization System in 2019. It describes a system for authorization – enforcing who can do what – which maxes out both flexibility and scalability. Google has lots of different apps that rely on Zanzibar, and bigger scale than practically any other company, so it needed Zanzibar.
The Zanzibar paper made quite a stir. There are at least four companies that advertise products as being inspired by or based on Zanzibar. It says a lot for everyone to loudly reference this paper on homepages and marketing materials: companies aren’t advertising their own innovation as much as simply saying they’re following the gospel.
I read the paper, and have a few notes, but the Google Zanzibar Paper, annotated by AuthZed, is the same thing from a real domain expert (albeit one who works for one of these companies), so read that too, or instead.
Features
My brief summary is that the Zanzibar paper describes the features of the system succinctly, and those features are really appealing. They’ve figured out a few primitives from which developers can build really flexible authorization rules for almost any kind of application. They avoid making assumptions about ID formats, or any particular relations, or how groups are set up. It’s abstract and beautiful.
The gist of the system is:
Objects: things in your data model, like documents
Users: needs no explanation
Namespaces: for isolating applications
Usersets: groups of users
Userset rewrite rules: allow usersets to inherit from each other or have other kinds of set relationships
Tuples: written like (object)#(relation)@(user), sort of the core ‘rule’ construct for saying who can access what
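To make the tuple construct concrete, here is a minimal sketch in Go of how tuples could be modeled and checked. The type names and the direct-lookup check are my own illustration rather than anything from the paper; real Zanzibar also expands usersets and applies rewrite rules during checks.

package zanzibarsketch

// Tuple records that a user (or userset) holds a relation on an object,
// mirroring the (object)#(relation)@(user) shape from the paper.
type Tuple struct {
	Object   string // e.g. "doc:readme"
	Relation string // e.g. "viewer"
	User     string // e.g. "user:10", or a userset like "group:eng#member"
}

// CheckDirect answers "does user hold relation on object?" by scanning for
// an exact tuple match. Real Zanzibar also follows userset rewrite rules
// and group memberships; this only covers the direct case.
func CheckDirect(tuples []Tuple, object, relation, user string) bool {
	for _, t := range tuples {
		if t.Object == object && t.Relation == relation && t.User == user {
			return true
		}
	}
	return false
}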
There’s then a neat configuration language for defining namespaces, relations, and userset rewrite rules; the paper includes a short example.
It’s pretty neat. At this point in the paper I was sold on Zanzibar: I could see this as being a much nicer way to represent authorization than burying it in a bunch of queries.
Specifications & Implementation details
And then the paper discusses specifications: how much scale it can handle, and how it manages consistency. This is where it becomes much more noticeably Googley.
So, with Google’s scale and international footprint, all of its services need to be globally distributed. So Zanzibar is a distributed system, and it also needs strong consistency guarantees so that it avoids the “new enemy” problem, nobody is able to access resources they shouldn’t, and applications relying on Zanzibar get a consistent view of its data.
Pages 5-11 are about this challenge, and it is a big one with a complex, high-end solution, and a lot of details that are very specific to Google. Most noticeably, Zanzibar is built with Spanner, Google’s distributed database, and Spanner has the ability to order timestamps using TrueTime, which relies on atomic clocks and GPS antennae: this is not standard equipment for a server. Even CockroachDB, which is explicitly modeled off of Spanner, can’t rely on having GPS & atomic clocks around, so it has to take a very different approach. But this time-accuracy idea is pretty central to Zanzibar’s idea of zookies, which are sort of like tokens that get sent around in its API and indicate what time reference the client expects, so that a follow-up response doesn’t accidentally include stale data.
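To give a rough sense of what a zookie looks like in practice, here is a sketch of a check request and response carrying one. The field and type names are hypothetical, my own reading of the paper rather than Zanzibar’s actual API definitions.

package zanzibarsketch

// Zookie is an opaque consistency token that encodes a timestamp. Clients
// store the zookie returned from a content write and send it back on later
// checks so that the ACL data evaluated is at least that fresh.
type Zookie string

// CheckRequest asks whether User holds Relation on Object, evaluated
// against a snapshot no older than the one AtLeastAsFreshAs encodes.
type CheckRequest struct {
	Object           string
	Relation         string
	User             string
	AtLeastAsFreshAs Zookie
}

// CheckResponse reports the decision along with a zookie for the snapshot
// that was actually used, which the client can persist for future calls.
type CheckResponse struct {
	Allowed bool
	Zookie  Zookie
}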
To achieve scalability, Zanzibar is also a multi-server architecture: there are aclservers, watchservers, and a Leopard indexing system that creates compressed, skip-list-based representations of usersets. There’s also a clever solution to the caching & hot-spot problem, in which certain objects or tuples get lots of requests all at once and their database shard gets overwhelmed.
Conclusions
Zanzibar is two things:
A flexible, relationship-based access control model
A system to provide that model to applications at enormous scale and with consistency guarantees
My impressions of these things match with AuthZed’s writeup so I’ll just quote & link them:
There seems to be a lot of confusion about Zanzibar. Some people think all relationship-based access control is “Zanzibar”. This section really brings to light that the ReBAC concepts have already been explored in depth, and that Zanzibar is really the scaling achievement of bringing those concepts to Google’s scale needs. link
And
Zookies are very clearly important to Google. They get a significant amount of attention in the paper and are called out as a critical component in the conclusion. Why then do so many of the Zanzibar-like solutions that are cropping up give them essentially no thought? link
I finished the paper having absorbed a lot of tricky ideas about how to solve the distributed-consistency problems, and if I were to describe Zanzibar, those would be a big part of the story. But maybe that’s not what people mean when they say Zanzibar, and it’s more a description of features?
For my own needs, zookies and distributed consistency to the degree described in the Zanzibar paper are overkill. There’s no way that we’d deploy a sharded five-server system for authorization when the main application is doing just fine with single-instance Postgres. I want the API surface that Zanzibar describes, but would trade some scalability for simplicity. Or use a third-party service for authorization. Ideally, I wish there was something like these products but smaller, or delivered as a library rather than a server.
This week, Heroku made Router 2.0 generally available, bringing features like HTTP/2, performance improvements and reliability enhancements out of the beta program!
Throughout the Router 2.0 beta, our engineering team has addressed several bugs, all fairly straightforward, with one exception involving Puma-based applications. A small subset of Puma applications would experience increased response times upon enabling the Router 2.0 flag, reflected in customers’ Heroku dashboards and router logs. After a thorough router investigation and some peeling back of Puma’s server code, we realized what we had stumbled upon was not actually a Router 2.0 performance issue. The root cause was a bug in Puma! This blog takes a deep dive into that investigation, including some tips for avoiding the bug on the Heroku platform while a fix in Puma is being developed. If you’d like a shorter ride (a.k.a. the TL;DR), skip to The Solution section of this blog. For the full story and all the technical nitty-gritty, read on.
The long response times issue first surfaced through a customer support ticket for an application running a Puma + Rails web server. As the customer reported, in high load scenarios, the performance differences between Router 2.0 and the legacy router were disturbingly stark. An application scaled to 2 Standard-1X dynos would handle 30 requests per second just fine through the legacy router. Through Router 2.0, the same traffic would produce very long tail response times (95th and 99th percentiles). Under enough load, throughput would drop and requests would fail with H12: Request Timeout. The impact was immediate upon enabling the http-routing-2-dot-0 feature flag:
At first, our team of engineers had difficulty reproducing the above, despite running a similarly configured Puma + Rails app on the same framework and language versions. We consistently saw good response times from our app.
Then we tried varying the Rails application’s internal response time. We injected some artificial server lag of 200 milliseconds and that’s when things really took off:
This was quite the realization! In staging environments, Router 2.0 is subject to automatic load tests that run continuously, at varied request rates, body sizes, protocol versions, etc. These request rates routinely reach much higher levels than 30 requests per second. However, the target applications of these load tests did not include a Heroku app running Puma + Rails with any significant server-side lag.
With a reproduction in-hand, we were now in a position to investigate the high response times. We spun up our test app in a staging environment and started injecting a steady load of 30 requests per second.
Our first thought was that perhaps the legacy router is faster at forwarding requests to the dyno because its underlying TCP client manages connections in a way that plays nicer with the Puma server. We hopped on a router instance and began dumping netstat connection states for one of our Puma app's web dynos:
In the legacy router case, it seemed like there were fewer connections sitting in TIME_WAIT. This TCP state is a normal stop point along the lifecycle of a connection. It means the remote host (dyno) has sent a FIN indicating the connection should be closed. The local host (router) has sent back an ACK, acknowledging the connection is closed.
The connection hangs out for some time in TIME_WAIT, with the value varying among operating systems. The Linux default is 2 minutes. Once that timeout is hit, the socket is reclaimed and the router is free to re-use the address + port combination for a new connection.
With this understanding, we formed a hypothesis that the Router 2.0 HTTP client was churning through connections really quickly. Perhaps the new router was opening connections and forwarding requests at a faster rate than the legacy router, thus overwhelming the dyno.
Router 2.0 is written in Go and relies upon the language’s standard HTTP package. Some research turned up various tips for configuring Go’s http.Transport to avoid connection churn. The main recommendation involved tuning MaxIdleConnsPerHost. Without explicitly setting this configuration, the default value of 2 is used.
type Transport struct {
// MaxIdleConnsPerHost, if non-zero, controls the maximum idle
// (keep-alive) connections to keep per-host. If zero,
// DefaultMaxIdleConnsPerHost is used.
MaxIdleConnsPerHost int
...
}
const DefaultMaxIdleConnsPerHost = 2
The problem with a low cap on idle connections per host is that it forces Go to close connections more often. For example, if this value is set higher, like 10, our HTTP transport will keep up to 10 idle connections for this dyno in the pool. Only when the 11th connection goes idle does the transport start closing connections. With the number limited to 2, the transport closes more connections, which also means opening more connections to our dyno. This could put strain on the dyno, as it requires Puma to spend more time handling connections and less time answering requests.
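As a rough sketch, raising that cap looks something like this on a Go http.Transport. The value of 100 mirrors the experiment described next, but this is an illustration, not Router 2.0’s actual configuration code.

package routersketch

import (
	"net/http"
	"time"
)

// newBackendClient builds an HTTP client that keeps up to 100 idle
// (keep-alive) connections per backend host instead of Go's default of 2,
// so fewer connections are closed and reopened under load.
func newBackendClient() *http.Client {
	transport := &http.Transport{
		MaxIdleConnsPerHost: 100,
		MaxIdleConns:        1000,
		IdleConnTimeout:     90 * time.Second,
	}
	return &http.Client{Transport: transport, Timeout: 30 * time.Second}
}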
We wanted to test our hypothesis, so we set MaxIdleConnsPerHost: 100 on the Router 2.0 transport in staging. The connection distribution did change, and now Router 2.0 connections were more stable than before:
root@router.1020195 | # netstat | grep 'ip-10-1-2-62.ec2.:37183'
tcp 0 0 ip-10-1-34-185.ec:36350 ip-10-1-2-62.ec2.:37183 ESTABLISHED
tcp 0 0 ip-10-1-34-185.ec:11956 ip-10-1-2-62.ec2.:37183 ESTABLISHED
tcp 0 0 ip-10-1-34-185.ec:51088 ip-10-1-2-62.ec2.:37183 ESTABLISHED
tcp 0 0 ip-10-1-34-185.ec:60876 ip-10-1-2-62.ec2.:37183 ESTABLISHED
To our dismay, this had zero positive effect on our tail response times. We were still seeing the 99th percentile at well over 2 seconds for a Rails endpoint that should only take about 200 milliseconds to respond.
We tried changing some other configurations on the Go HTTP transport, but saw no improvement. After several rounds of updating a config, waiting for the router artifact to build, and then waiting for the deployment to our staging environment, we began to wonder—can we reproduce this issue locally?
Fortunately, we already had a local integration test set-up for running requests through Router 2.0 to a dyno. We typically utilize this set-up for verifying features and fixes, rarely for assessing performance. We subbed out our locally running “dyno” for a Puma server with a built-in 200ms lag on the /fixed endpoint. We then fired off 200 requests over 10 different connections with hey:
The results showed the 95th percentile of response times at over 2 seconds, just as we had seen while running this experiment on the platform. We were now starting to worry that the router itself was inflating the response times. We tried targeting Puma directly at localhost:3000, bypassing the router altogether:
Wow! These results suggested the issue is reproducible with any ol’ Go HTTP client and a Puma server. We next wanted to test out a different client. The load injection tool hey is also written in Go, just like Router 2.0, so we tried ab, which is written in C:
❯ ab -c 10 -n 200 http://127.0.0.1:3000/fixed
This is ApacheBench, Version 2.3 <$Revision: 1913912 $>
Copyright 1996 Adam Twiss, Zeus Technology Ltd, http://www.zeustech.net/
Licensed to The Apache Software Foundation, http://www.apache.org/
Benchmarking 127.0.0.1 (be patient)
Completed 100 requests
Completed 200 requests
Finished 200 requests
Server Software:
Server Hostname: 127.0.0.1
Server Port: 3000
Document Path: /fixed
Document Length: 3 bytes
Concurrency Level: 10
Time taken for tests: 8.538 seconds
Complete requests: 200
Failed requests: 0
Total transferred: 35000 bytes
HTML transferred: 600 bytes
Requests per second: 23.42 [#/sec] (mean)
Time per request: 426.911 [ms] (mean)
Time per request: 42.691 [ms] (mean, across all concurrent requests)
Transfer rate: 4.00 [Kbytes/sec] received
Connection Times (ms)
min mean[+/-sd] median max
Connect: 0 0 0.2 0 2
Processing: 204 409 34.6 415 434
Waiting: 204 409 34.7 415 434
Total: 205 410 34.5 415 435
Percentage of the requests served within a certain time (ms)
50% 415
66% 416
75% 416
80% 417
90% 417
95% 418
98% 420
99% 429
100% 435 (longest request)
Another wow! The longest request took about 400 milliseconds, much lower than the 2 seconds above. Had we just stumbled upon some fundamental incompatibility between Go’s standard HTTP client and Puma? Not so fast.
A deeper dive into the ab documentation surfaced this option:
❯ ab -h
Usage: ab [options] [http[s]://]hostname[:port]/path
Options are:
...
-k Use HTTP KeepAlive feature
That’s different from hey, which enables keepalives by default. Could that be significant? We re-ran ab with -k:
❯ ab -k -c 10 -n 200 http://127.0.0.1:3000/fixed
This is ApacheBench, Version 2.3 <$Revision: 1913912 $>
Copyright 1996 Adam Twiss, Zeus Technology Ltd, http://www.zeustech.net/
Licensed to The Apache Software Foundation, http://www.apache.org/
Benchmarking 127.0.0.1 (be patient)
Completed 100 requests
Completed 200 requests
Finished 200 requests
Server Software:
Server Hostname: 127.0.0.1
Server Port: 3000
Document Path: /fixed
Document Length: 3 bytes
Concurrency Level: 10
Time taken for tests: 8.564 seconds
Complete requests: 200
Failed requests: 0
Keep-Alive requests: 184
Total transferred: 39416 bytes
HTML transferred: 600 bytes
Requests per second: 23.35 [#/sec] (mean)
Time per request: 428.184 [ms] (mean)
Time per request: 42.818 [ms] (mean, across all concurrent requests)
Transfer rate: 4.49 [Kbytes/sec] received
Connection Times (ms)
min mean[+/-sd] median max
Connect: 0 0 0.5 0 6
Processing: 201 405 609.0 202 2453
Waiting: 201 405 609.0 202 2453
Total: 201 406 609.2 202 2453
Percentage of the requests served within a certain time (ms)
50% 202
66% 203
75% 203
80% 204
90% 2030
95% 2242
98% 2267
99% 2451
100% 2453 (longest request)
Now the output looked just like the hey output. Next, we ran hey with keepalives disabled:
Again, there were no long tail response times, and the median values were comparable to the first run with ab.
Even better, this neatly explained the performance difference between Router 2.0 and the legacy router. Router 2.0 adds support for HTTP keepalives by default, in line with the HTTP/1.1 spec. In contrast, the legacy router closes connections to dynos after each request. Keepalives usually improve performance, reducing time spent in TCP operations for both the router and the dyno. Yet the opposite was true for a dyno running Puma.
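In Go terms, the legacy router’s per-request connection behavior corresponds roughly to a transport with keepalives turned off, which is also roughly what the http-disable-keepalive-to-dyno labs flag described later gives you between Router 2.0 and the dyno. This is an illustrative sketch, not Heroku’s router code.

package routersketch

import "net/http"

// legacyStyleTransport closes the connection to the backend after every
// request instead of returning it to an idle pool for reuse, which is the
// legacy router's behavior expressed as a Go http.Transport setting.
func legacyStyleTransport() *http.Transport {
	return &http.Transport{
		DisableKeepAlives: true,
	}
}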
Note that we suggest reviewing this brief Puma architecture document if you’re unfamiliar with the framework and want to get the most out of this section. To skip the code review, you may fast-forward to The Solution.
This finding was enough of a smoking gun to send us deep into the Puma server code, where we homed in on the process_client method. Let’s take a look at that code with a few details in mind:
Each Puma thread can only handle a single connection at a time. A client is a wrapper around a connection.
The handle_request method handles exactly 1 request. It returns false when the connection should be closed and true when it should be kept open. A client with keepalive enabled will end up in the true branch.
fast_check is only false once we’ve processed @max_fast_inline requests serially off the connection and when there are more connections waiting to be handled.
For some reason, even when the number of connections exceeds the max number of threads, @thread_pool.backlog > 0 is oftentimes false.
Altogether, this means the request-handling loop in process_client usually keeps executing until we’re able to bail out when handle_request returns false.
When does handle_request actually return false? That is also based on a bunch of conditional logic, the core of which is in the prepare_response method. Basically, if force_keep_alive is false, handle_request will return false. (This is not exactly true. It’s more complicated, but that’s not important for this discussion.)
The last thing to put the puzzle together: max_fast_inline defaults to 10. That means Puma will process at least 10 requests serially off a single connection before handing the connection back to the reactor class. Requests that may have come in a full second ago are just sitting in the queue, waiting for their turn. This directly explains our 10*200ms = 2 seconds of added response time for our longest requests!
We figured setting max_fast_inline=1 might fix this issue, and it does sometimes. However, under sufficient load, even with this setting, response times will climb. The problem is the other two OR’ed conditions in that check. Sometimes the number of busy threads is less than the max, and sometimes there are no new connections to accept on the socket. However, these decisions are made at a point in time while the state of the server is constantly changing. They are subject to race conditions, since other threads are concurrently accessing these variables and taking actions that modify their values.
After reviewing the Puma server code, we came to the conclusion that the simplest and safest way to bail out of processing requests serially would be to flat-out disable keepalives. Explicitly disabling keepalives in the Puma server means handing the client back to the reactor after each request. This is how we ensure requests are served in order.
After confirming these results with the Heroku Ruby language owners, we opened a GitHub issue on the Puma project and a pull request to add an enable_keep_alives option to the Puma DSL. When set to false, keepalives are completely disabled. The option will be released soon, likely in Puma 6.5.0.
We then re-ran our load tests with keepalives disabled in Puma (enable_keep_alives false) and Router 2.0 enabled on the app:
# config/puma.rb
...
enable_keep_alives false
The response times and throughput improved, as expected. Additionally, after disabling Router 2.0, the response times stayed the same.
Keeping connections alive reduces time spent in TCP operations. Under sufficient load and scale, avoiding this overhead cost can positively impact apps’ response times. Additionally, keepalives are the de facto standard in HTTP/1.1 and HTTP/2. Because of this, Heroku has chosen to move forward with keepalives as the default behavior for Router 2.0.
Raising this issue on the Puma project has already prompted movement to fix the bad keepalive behavior in the Puma server. Heroku engineers remain active participants in discussions around these efforts and are committed to solving this problem. Once a full fix is available, customers will be able to upgrade their Puma versions and use keepalives safely, without risk of long response times.
In the meantime, we have provided another option for disabling keepalives when using Router 2.0. The following labs flag may be used in conjunction with Router 2.0 to disable keepalives between the router and your web dynos:
heroku labs:enable http-disable-keepalive-to-dyno -a my-app
You may find that your Puma app does not need keepalives disabled in order to perform well while using Router 2.0. We recommend testing and tuning other configuration options, so that your app can still benefit from persistent connections between the new router and your dyno:
Increase the number of threads. More threads means Puma is better able to handle concurrent connections.
Increase the number of workers. This is similar to increasing the number of threads.
Decrease the max_fast_inline number. This will limit the number of requests served serially off a connection before handling queued requests.
Our team also wanted to see if this same issue would present in other languages or frameworks, so we ran load tests injecting 200 milliseconds of server-side lag across the top languages and frameworks on the Heroku platform.
We were initially surprised by this keepalive behavior in the Puma server. Funnily enough, we believe Heroku’s significance in the Puma/Rails world and the fact that the legacy router does not support keepalives may have been factors in this bug persisting for so long. Reports of it had popped up in the past (see Issue 3443, Issue 2625 and Issue 2331), but none of these prompted a foolproof fix. Setting enable_keep_alives false does completely eliminate the problem, but it is not the default option. Now, Puma maintainers are taking a closer look at the problem and benchmarking potential fixes in a fork of the project. The intention is to fix the balancing of requests without closing TCP connections to the Puma server.
Our Heroku team is thrilled that we were able to contribute in this way and help move the Puma/Rails community forward. We’re also excited to release Router 2.0 as GA, unlocking new features like HTTP/2 and keepalives to your dynos. We encourage our users to try out this new router! For advice on how to go about that, see Tips & Tricks for Migrating to Router 2.0.