ABSTRACT

We present GarbledCPU, the first framework that realizes a hardware-based general purpose sequential processor for secure computation. Our MIPS-based implementation enables development of applications (functions) in a high-level language while performing secure function evaluation (SFE) using Yao’s garbled circuit protocol in hardware. GarbledCPU provides three degrees of freedom for SFE which allow leveraging the trade-off between privacy and performance: public functions, private functions, and semi-private functions. We synthesize GarbledCPU on a Virtex-7 FPGA as a proof-of-concept implementation and evaluate it on various benchmarks including Hamming distance, private set intersection and AES. Our results indicate that our pipelined hardware framework outperforms the fastest available software implementation.

Keywords

Garbled Circuit; Secure Function Evaluation

## 1. INTRODUCTION

Secure Function Evaluation (SFE) allows two (or more) mistrusting parties to jointly compute an arbitrary function on their private inputs without revealing any information other than the result. The seminal work of Yao [20] introduced the concept of two-party SFE using the Garbled Circuits (GC) protocol, which requires that the function be represented as a Boolean circuit. While the GC protocol was originally thought to be of theoretical interest only, algorithmic and implementation optimizations have significantly improved its efficiency during the last decades. In addition to the advances in computing platforms, the key enablers for this progress include newer cryptographic constructs, logic-level transformations, and software techniques.

Compilers for SFE have been continually evolving. A number of compilers [2, 4, 12, 13] translate a functionality written in a domain-specific input language into a Boolean circuit, also described in an intermediate language, which is then evaluated with Yao’s GC protocol. Other compilers [5, 11] use a subset of the C language as input. However, these methods imply building software-to-Boolean circuit compilers from scratch and often put limitations on the functionality. Moreover, verifying the correctness of these compilers is challenging [14].

Recently, it was shown that the long-established and verified hardware synthesis compilers can be used for generation of Boolean circuits for SFE, eliminating the need for building ad-hoc logic compilers or tedious handcrafting of Boolean circuits. Another key advantage of conventional logic synthesis is allowing a sequential logic description which can be adapted to map general functionalities to Boolean circuits optimized for Yao’s GC protocol. The approach was introduced in [17] and shown to yield great improvements in terms of memory and communication. A fully-automated toolchain which utilizes existing logic synthesis compilers and can be generalized for other SFE protocols was presented in [3]. This latter work takes advantage of the built-in intellectual property (IP) and custom design libraries which can be readily adapted during circuit synthesis to realize a broad suite of applications optimized for SFE.

The authors in [17] leverage the capability of synthesizing a sequential circuit and introduce the idea of a general-purpose sequential processor for private function SFE (PF-SFE) by GC, where both input data and function are private. PF-SFE is useful for scenarios where the function is proprietary or classified, e.g., credit checking or private database queries. Their so-called garbled processor allows existing software compilers to be used for describing the function and generates compatible machine code which is also garbled. A partial implementation of a MIPS processor is provided in [17] which only considers PF-SFE, where the entire processor circuit and Instruction Set (IS) have to be garbled in each instruction step during SFE in order to hide the executed instructions in the private function. This results in a tremendous cost compared with SFE for public functions. Thus, their MIPS processor incurs an unnecessary overhead for numerous applications in which a private function is not required. (The only benchmark presented in [17] is the Hamming distance function.)

We propose GarbledCPU, the first configurable hardware-based general purpose sequential CPU for SFE. The FPGA realization of GarbledCPU is based on the MIPS instruction set. GarbledCPU provides generalized support for SFE with varying flavors of privacy, beyond PF-SFE, to allow for more relaxed privacy demands and hence improved performance. More explicitly, with GarbledCPU the parties can evaluate a private, semi-private, or public function by revealing none, partial, or all information about the function respectively, while still benefiting from the simplicity of programming a processor. Both parties first agree on the subset of the IS they are willing to use, which determines the level of privacy ensured. The function is compiled from a high-level language, e.g., C/C++, into assembly code of the agreed-upon IS. Next, the garbled processor is securely evaluated given the users’ garbled input and the compiled function instructions (also garbled) to compute the output.

A recent technical report [19] also suggests a secure computation framework using MIPS code. The approach relies on garbled universal circuits to emulate the execution of each instruction of the MIPS program and on Oblivious RAM (ORAM) for memory access. They propose using static analysis of functions to reduce the set of instructions to be garbled. However, [19] only presents a software SFE implementation, while we present the first practical hardware sequential processor for both SFE and PF-SFE. An earlier hardware implementation of GC was reported in [7], but the approach only addresses SFE with no support for function hiding and is limited to combinational Boolean circuits as it predated [17]. A combinational description limits usability and is impractical for control-intensive functions such as CPUs that need to be expressed sequentially.

**Contributions.** In brief, our contributions are as follows:

- We propose the first hardware-only solution for 2-party GC-based secure sequential function computation with different SFE flavors that allows leveraging the trade-off between privacy and performance: application-specific IS for SFE (§3.1), restricted IS (§3.2) for semi-private SFE, and full IS (§3.3) for PF-SFE.
- We realize a proof-of-concept FPGA implementation which demonstrates the feasibility of the sequential garbled processor in hardware, and motivates further research in this direction. GarbledCPU achieves efficiency and performance by leveraging the most recent optimizations for GC [1, 9, 17, 21], along with a high-throughput pipelined GC evaluation on FPGA. It outperforms the fastest software implementation in the literature which relies on the Intel AES-NI [1].
- We extensively benchmark more complex functions such as AES, Private Set Intersection (PSI), and Hamming distance and evaluate them under our different privacy settings using our framework and when applicable, compare our performance with prior work.

## 2. PRELIMINARIES

In this section, we present an introduction to secure computation and SFE in §2.1 and GC optimizations in §2.2.

### 2.1 Secure Computation and Garbled Circuit

Yao’s GC protocol [20] allows two parties, Alice and Bob, to jointly compute a function \( f(x_{Alice}, x_{Bob}) \) on their private inputs \((x_{Alice}, x_{Bob})\). Alice garbles the function \( f \), where \( f \) is represented as a Boolean circuit. To do this, Alice maps the plain binary values of inputs and intermediate gates’ outputs to random labels (keys). For each gate in the circuit, an encrypted truth table is generated that allows computation of the gate’s output label based on its input labels. Alice sends the encrypted truth tables of all gates, along with her corresponding encrypted input labels to Bob. To compute \( f \), Bob needs to know the labels corresponding to his inputs without revealing them to Alice. For this, Bob obtains his labels obliviously through a 1-out-of-2 Oblivious Transfer (OT) protocol [15] and uses them to evaluate the garbled circuit gate by gate. Finally, Alice provides a mapping from the encrypted output label to the plain output.
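To make the garbling and evaluation steps concrete, the following Python sketch garbles and evaluates a single AND gate in the classical four-row style. It is an illustration only and rests on several assumptions: SHA-256 stands in for the encryption of truth-table rows, an all-zero tag is used so the evaluator can recognize the correct row by trial decryption, and the OT step for Bob's labels is omitted. It does not reproduce the fixed-key AES garbling or the other optimizations used in this work, and all function names are hypothetical.

```python
import hashlib
import random
import secrets

LABEL_BYTES = 16          # 128-bit wire labels (assumed size)
TAG = bytes(16)           # all-zero tag marks a correctly decrypted row

def xor(x, y):
    """Bytewise XOR of two equal-length byte strings."""
    return bytes(p ^ q for p, q in zip(x, y))

def pad(label_a, label_b):
    """Derive a 32-byte one-time pad from the two input labels."""
    return hashlib.sha256(label_a + label_b).digest()

def new_wire():
    """Return the pair (label for bit 0, label for bit 1) of a wire."""
    return (secrets.token_bytes(LABEL_BYTES), secrets.token_bytes(LABEL_BYTES))

def garble_and(wire_a, wire_b, wire_out):
    """Garbler side: build the shuffled encrypted truth table of an AND gate."""
    rows = []
    for bit_a in (0, 1):
        for bit_b in (0, 1):
            plaintext = wire_out[bit_a & bit_b] + TAG
            rows.append(xor(pad(wire_a[bit_a], wire_b[bit_b]), plaintext))
    random.SystemRandom().shuffle(rows)  # hide which row encodes which inputs
    return rows

def evaluate(rows, label_a, label_b):
    """Evaluator side: recover the output label from one label per input."""
    p = pad(label_a, label_b)
    for row in rows:
        candidate = xor(p, row)
        if candidate[LABEL_BYTES:] == TAG:
            return candidate[:LABEL_BYTES]
    raise ValueError("no row decrypted correctly")
```

The evaluator only ever sees one label per wire, so it learns the output label without learning the plain bits; the final output map supplied by the garbler translates that label back to 0 or 1.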

Two-party Private Function SFE (PF-SFE) allows secure computation of a function \( f_{Alice}(\cdot) \) held by Alice on Bob’s data \( x_{Bob} \) while both the data and the function are kept private, i.e., Bob learns \( f_{Alice}(x_{Bob}) \) but nothing else about \( f_{Alice} \). This is in contrast to the usual setting of SFE where the function is known to both parties. PF-SFE is especially useful when the function is proprietary or classified. Traditionally, PF-SFE is realized by running SFE of a Universal Circuit (UC) [8, 10, 18]. A UC is similar to a Universal Turing Machine that receives a Turing machine description \( f(\cdot) \) and applies it to the input data on its tape. Usage of a UC results in a complexity of at least \( \mathcal{O}(n \log n) \) for a circuit with \( n \) gates [18].

### 2.2 GC Optimizations

Our GC evaluator architecture is based on fixed-key block cipher garbling [1] and utilizes garbling of sequential circuits [17]. We also consider the most recent optimizations on GC: the half-gates technique [21] requires only two ciphertexts for each non-XOR gate (instead of three) while remaining compatible with the free-XOR technique [9].
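The free-XOR idea can be illustrated in a few lines of Python. The sketch below assumes 128-bit labels and a secret global offset `R` chosen by the garbler, so that the two labels of every wire differ by `R`; the helper names are hypothetical and the half-gates construction for non-XOR gates is not shown.

```python
import secrets

LABEL_BYTES = 16  # 128-bit labels (assumed size)

def xor(x, y):
    """Bytewise XOR of two equal-length byte strings."""
    return bytes(p ^ q for p, q in zip(x, y))

def new_wire(offset):
    """Labels for bit 0 and bit 1 always differ by the global offset."""
    zero = secrets.token_bytes(LABEL_BYTES)
    return (zero, xor(zero, offset))

def garble_xor(wire_a, wire_b, offset):
    """Garbler side: output labels are derived, no ciphertext is produced."""
    zero = xor(wire_a[0], wire_b[0])
    return (zero, xor(zero, offset))

def evaluate_xor(label_a, label_b):
    """Evaluator side: XOR the two input labels, no decryption needed."""
    return xor(label_a, label_b)

R = secrets.token_bytes(LABEL_BYTES)  # known only to the garbler
wire_a, wire_b = new_wire(R), new_wire(R)
wire_out = garble_xor(wire_a, wire_b, R)
```

Because the offset cancels or survives exactly according to the XOR of the plain bits, XOR gates need neither garbled tables nor cipher calls, which is why gate counts in this paper are reported in non-XOR gates.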

## 3. GARBLED PROCESSOR

The idea of garbling a processor was first introduced in [17] as a solution for hiding the function in PF-SFE. Besides enabling PF-SFE, another advantage of a garbled processor is usability for non-expert users since it can be programmed using high-level languages, whereas other frameworks for the GC protocol require tedious Boolean circuit construction. However, garbling and evaluating the entire processor incurs a tremendous cost compared to SFE solutions due to stronger privacy requirements in PF-SFE.

**Adversary Model.** We assume an honest-but-curious (i.e., semi-honest or passive) adversary which is sufficient for most practical scenarios to enable efficient protocols. This establishes the first step towards protocols with stronger security guarantees against malicious or covert adversaries.

In this work, we propose GarbledCPU as a hardware-only garbled processor framework for secure computation that provides scalable support for generalized SFE with a relaxed privacy setting but improved performance, besides the more security-demanding PF-SFE, as well as a flavor in-between. To avoid information leakage about the function (i.e., PF-SFE), we use GarbledCPU with its full Instruction Set (IS), which incurs a large overhead due to garbling and evaluating of the entire IS. We can also compile the function using only a subset of the IS: restricted IS (i.e., semi-private function). A third alternative is public function mode in which the function is compiled using only an application-specific subset of the IS that is required for executing the function. In the following, we discuss these modes of function evaluation and the trade-off between privacy and performance further.

Figure 1 shows the overview of GarbledCPU for 2-party computation between Alice (garbler) and Bob (evaluator). Alice generates the garbled instructions and tables by garbling the processor circuit for the selected IS mode and sends them to Bob. He also receives his garbled input data through OT from Alice without revealing his input to her. Bob evaluates GarbledCPU and produces the garbled output. Eventually, Alice reveals the output map to Bob and he “ungarbles” and learns the output data.

### 3.1 Garbled Processor for Public Functions

Using a general-purpose processor with its entire IS in SFE results in garbling a large processor which is very costly and unnecessary since both parties know the function instructions being executed but not their results. Hence, garbling a limited application-specific IS for executing each instruction is sufficient to achieve privacy. In §5.3 we show three examples of GarbledCPU with application-specific IS. To further reduce the IS, assuming for example, a function that consists of 10 instructions, we could theoretically generate $2^{10} - 1$ netlists (netlists of IS with different combinations of the 10 instructions, excluding the netlist with zero instructions). At run-time, one of these netlists is plugged in (garbled and evaluated) at each instruction step depending on the expected instructions. However, to make it more reasonable (generate fewer netlists), for functions with control flow independent of private data, we know in advance which instruction will be executed at each step. Thus, we need only the netlist of the processor implementing IS with that specific instruction, restricting the required netlists in this case to 10. For functions with control flow dependent on private data, a simple static analysis can be used to specify the combination of possible instructions at each step, and hence the required IS netlist as proposed in [19].
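The netlist count in the example above can be checked with a short Python sketch. The instruction names are placeholders, and the count simply enumerates every non-empty subset of a 10-instruction function.

```python
from itertools import combinations

def count_nonempty_subsets(n):
    """Number of non-empty instruction subsets of an n-instruction function."""
    return 2 ** n - 1

# Hypothetical placeholder names for the 10 instructions of the function.
instructions = [f"I{i}" for i in range(10)]

# Brute-force enumeration of all candidate IS netlists (one per subset).
enumerated = sum(
    1
    for size in range(1, len(instructions) + 1)
    for _ in combinations(instructions, size)
)

# With data-independent control flow, each step needs only the netlist
# implementing that step's single instruction.
single_instruction_netlists = len(instructions)
```

Under these assumptions the enumeration agrees with the closed form $2^{10} - 1 = 1023$, against only 10 netlists when the instruction executed at each step is known in advance.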

### 3.2 Garbled Processor for Semi-Private Functions

The main cost of garbling a processor with its entire IS results from garbling circuits for expensive instructions like multiplication and division. Most compilers are able to avoid these costly instructions and replace them with cheaper loops of shift, addition, and subtraction instructions. On the one hand, this eliminates the need for the Mult/Div unit in the processor and reduces the cost of garbling per instruction. On the other hand, one expensive instruction is replaced with multiple cheap instructions, which increases the total number of instructions. For example, multiplying two 32-bit numbers with the MULT instruction in MIPS requires 15 cycles and a circuit of 13,257 non-XOR gates\(^1\), while it requires at least 31 cycles and a circuit of 9,676 non-XOR gates when using a conditional loop over an ADD instruction. We call this mode “semi-private” since it only reveals partial information about the instructions used in the program (namely, that the program does not use division/multiplication) and increases the probability of guessing an instruction by reducing the set of possible instructions (restricted IS).
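The trade-off can be sketched in Python. The first function shows the kind of shift-and-add loop a compiler may emit in place of a multiply (restricted here to the low 32 bits of the product, and assuming shifts are available). The second part multiplies the cycle counts and per-cycle gate counts quoted above, under the assumption that the whole processor netlist is garbled once per cycle; 31 cycles is the stated lower bound, so the second total is a lower bound as well.

```python
MASK32 = 0xFFFFFFFF

def shift_add_multiply(a, b):
    """Low 32 bits of a * b using only shift, AND, and ADD operations."""
    a &= MASK32
    b &= MASK32
    product = 0
    while b:
        if b & 1:                      # conditional add of the shifted operand
            product = (product + a) & MASK32
        a = (a << 1) & MASK32
        b >>= 1
    return product

def total_non_xor_gates(cycles, gates_per_cycle):
    """Garbling cost if the whole netlist is garbled once per cycle."""
    return cycles * gates_per_cycle

cost_with_mult = total_non_xor_gates(15, 13_257)   # MULT instruction
cost_with_loop = total_non_xor_gates(31, 9_676)    # ADD loop, lower bound
```

With these figures the per-cycle circuit shrinks by about 27%, while the total for a single multiplication grows from 198,855 to at least 299,956 non-XOR gates, so the restricted IS pays off only when multiplications are rare relative to other instructions.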

### 3.3 Garbled Processor for Private Functions

In the standard 2-party PF-SFE, Alice provides the function $f_{Alice}(\cdot)$, Bob provides the input data $x_{Bob}$, and the output is $f_{Alice}(x_{Bob})$. In this work, GarbledCPU receives a list of instructions as $f_{Alice}(\cdot)$ and applies them to the input data $x_{Bob}$ in memory; the output is written back to the memory. To avoid information leakage about the private function, we use a general-purpose processor with its entire IS (full IS).

## 4. GarbledCPU IMPLEMENTATION

We present our garbled processor in §4.1, a high performance hardware architecture for GC to evaluate GarbledCPU in §4.2, and implementation challenges in §4.3.

### 4.1 MIPS

To implement GarbledCPU, we use the MIPS architecture from the Plasma project in Opencores [16]. We chose the single-cycle implementation of the 32-bit MIPS I instruction set, which follows the Reduced Instruction Set Computing (RISC) design, making its Boolean representation among the simplest of modern processors. Note that the gates should be garbled/evaluated one after another in the GC protocol, and it is challenging to benefit from the physical-level parallelism that exists inherently in hardware. Thus, using a multi-cycle, pipelined, or more sophisticated architecture not only complicates the implementation but also, counter-intuitively, increases the overall cost of garbling for the same functionality. The time required for garbling a circuit depends only on the total number of gates and not on the critical path. In §5.3 we present the garbling cost for MIPS with various memory sizes.

### 4.2 Our Hardware Architecture

To the best of our knowledge, the fastest implementation of GC is JustGarble [1]. JustGarble is a software GC realization using fixed-key AES garbling which benefits from the AES-NI instruction set in modern Intel processors. Its performance reaches about 20 clock cycles per gate for GC evaluation. Prior to JustGarble, Järvinen et al. introduced two hardware realizations for the GC protocol in [7]. However, their implementations are much slower than JustGarble because JustGarble utilizes a more efficient fixed-key AES for garbling instead of an expensive hash function. Thus, it is possible that a hardware implementation leveraging the latest GC optimizations including fixed-key AES garbling would outperform JustGarble. Furthermore, a processor is essentially a sequential circuit and its evaluation requires sequential GC which none of these works supports.

\(^1\)XOR gates are evaluated freely in GC according to the free-XOR optimization of [9].

Our GC evaluator is based on the most recent optimizations listed in §2.2. Its architecture is shown in Figure 2 and consists of: (1) Simple Circuit Description (SCD) memory: read-only memory that stores the information about gates in the MIPS circuit in SCD format [1, 17]. (2) GC Label memory: read-write random-access memory that stores GC ciphertext labels of all wires in the corresponding MIPS circuit. (3) Garbled Tables (GT) memory: read-write random-access memory that stores the ciphertext garbled tables of each non-XOR gate in the MIPS circuit that are generated by Alice (garbler). (4) Sequential Handler: controller that supports evaluation of the sequential circuits with the GC protocol. (5) Evaluator Engine: main functionality of GC evaluation according to Yao’s GC protocol and its most recent optimizations [1, 9, 17, 21].

As shown in Figure 2, Bob’s input labels in the Label memory are initialized by the OT protocol with Alice. The rest of the labels in the Label memory and the contents of the Garbled Tables memory are received in clear-text from Alice.

**Pipelined Evaluator Engine and Gate Dependency.** To maximize the performance of the GC evaluator, we use a 20-stage pipelined AES implementation [6] inside our Evaluator Engine module. It increases the throughput of the module by increasing the maximum operating clock frequency of the engine. We also add one stage for the rest of the GC evaluation functionality. Due to the free-XOR technique [9], evaluating an XOR gate requires only XOR-ing the input labels, while evaluating a non-XOR gate requires two AES encryptions (with the half-gates technique presented in §2.2; previously one encryption sufficed). Therefore, evaluation of an XOR gate can be done in only one stage of the pipeline. The different timing of XOR and non-XOR gates introduces a challenge for handling dependencies between gates’ inputs and outputs. A gate cannot enter the evaluation pipeline if one of its inputs is the output of another gate which is not yet evaluated. This results in pipeline stalls which degrade the overall performance. To mitigate this, we push XOR gates to the latest empty stage of the pipeline such that the subsequent dependent gates can enter the pipeline as soon as possible.
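The effect of gate dependencies on pipeline stalls can be illustrated with a toy scheduling model in Python. This is a simplified sketch, not a description of the actual Evaluator Engine: it assumes in-order issue of at most one gate per cycle, a fixed 21-cycle latency for non-XOR gates (20 AES stages plus one extra stage), and it models "entering at the latest stage" as a one-cycle XOR latency. The gate list and wire names are hypothetical.

```python
# Toy in-order issue model; latencies are modelling assumptions.
NON_XOR_LATENCY = 21  # 20 AES stages plus one extra stage

def schedule(gates, primary_inputs, xor_latency):
    """Return (finish_cycle, stall_cycles) for gates issued in list order.

    Each gate is (kind, input_wires, output_wire), kind is "XOR" or "AND".
    """
    ready = {wire: 0 for wire in primary_inputs}  # cycle a label is available
    cycle = 0
    stalls = 0
    for kind, inputs, output in gates:
        earliest = max(ready[w] for w in inputs)
        if earliest > cycle:           # wait until all input labels exist
            stalls += earliest - cycle
            cycle = earliest
        latency = xor_latency if kind == "XOR" else NON_XOR_LATENCY
        ready[output] = cycle + latency
        cycle += 1                     # next gate can issue one cycle later
    return max(ready.values()), stalls

# A short dependency chain: x = a ^ b, y = x ^ c, z = y & d.
gates = [
    ("XOR", ("a", "b"), "x"),
    ("XOR", ("x", "c"), "y"),
    ("AND", ("y", "d"), "z"),
]
inputs = ("a", "b", "c", "d")

full_length = schedule(gates, inputs, xor_latency=NON_XOR_LATENCY)
late_entry = schedule(gates, inputs, xor_latency=1)
```

In this model, letting XOR gates traverse the whole pipeline makes every dependent gate wait for the full pipeline depth, whereas the one-cycle XOR latency removes the stalls for the chain above.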

### 4.3 Hardware Prototype Challenges

We only use on-chip memory for our proof-of-concept in this work. However, this prototype can be extended to support interfacing with off-chip memory which would store garbled tables and labels of larger garbled processor circuits and functions. It can also interface with another FPGA emulator of the garbler which generates the garbled tables and labels and streams them to our evaluator. A wide range of scenarios are now feasible owing to our current hardware platform and state-of-the-art optimized GC evaluator. Such extensions would incur additional area and performance overheads, but would allow upscaling of our implementation to support garbled processor circuits and benchmarks in the Gigabytes range. We emphasize that we provide in this work a proof-of-concept prototype to motivate further research in this direction to bring garbled processors some steps closer to the realm of feasible practical implementations.

## 5. EVALUATION

We give our evaluation setup in §5.1, benchmarks in §5.2, synthesis results in §5.3, and performance evaluation in §5.4.

### 5.1 Evaluation Setup

We create different instances of a single-cycle MIPS architecture with specific, restricted, and full IS to support a trade-off between efficiency and privacy. (Full IS was proposed in [17] and is reported for comparison.) The different MIPS instances are synthesized using Synopsys Design Compiler DC H-2013.00-SP4 to generate optimized sequential Boolean circuits. These circuits are then evaluated on our hardware GC evaluator implemented using Vivado 2014.4.1 on a Xilinx Virtex-7 FPGA.

### 5.2 Benchmarks

As benchmarks we used Hamming distance, private set intersection, and AES. We compile these benchmarks from high-level C to MIPS binary using a MIPS cross-compiler. For some benchmarks, manipulating the assembly code reduces the number of clock cycles required. To assure correctness of both the benchmarks and the IS under test, we simulate the resulting binary file using the Modelsim simulator and calculate the number of cycles required to compute each of the benchmarks, reported in Table 1, for accurate performance measurements. For Hamming distance, the number of cycles depends on the size of the input strings. In the PSI benchmark, we compute a variant of PSI called PSI cardinality (PSI-CA) where only the number of common elements is revealed. The sets can have different sizes, where each element is 32 bits wide. For AES, we assume that one party holds a 128-bit message and the other party holds eleven round keys, each of 128-bit length, to avoid unnecessary garbling and evaluation of the round-key generation function.
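For reference, the functionality of the first two benchmarks can be stated in the clear with the following Python sketch. These are plain, non-secure reference definitions with hypothetical names, intended only to pin down what the garbled MIPS programs compute; they are not the C sources used for the measurements.

```python
def hamming_distance(bits_x, bits_y):
    """Number of positions in which two equal-length bit strings differ."""
    if len(bits_x) != len(bits_y):
        raise ValueError("inputs must have the same length")
    return sum(a ^ b for a, b in zip(bits_x, bits_y))

def psi_cardinality(set_a, set_b):
    """PSI-CA: only the number of common elements is revealed."""
    return len(set(set_a) & set(set_b))
```

In the secure setting each party supplies one of the two arguments as garbled input, and only the returned integer is ungarbled.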

<table>
<thead>
<tr>
<th>Benchmark</th>
<th>Input Size</th>
<th># of required cycles</th>
</tr>
</thead>
<tbody>
<tr>
<td>Hamming Distance</td>
<td>10</td>
<td>218</td>
</tr>
<tr>
<td>Hamming Distance</td>
<td>32</td>
<td>426</td>
</tr>
<tr>
<td>Hamming Distance</td>
<td>64</td>
<td>842</td>
</tr>
<tr>
<td>Hamming Distance</td>
<td>128</td>
<td>1,074</td>
</tr>
<tr>
<td>PSI</td>
<td>64</td>
<td>7,267</td>
</tr>
<tr>
<td>PSI</td>
<td>128</td>
<td>14,267</td>
</tr>
<tr>
<td>AES (no key expansion)</td>
<td>128</td>
<td>6,178</td>
</tr>
</tbody>
</table>

### 5.3 Synthesis of the GarbledCPU IS

We synthesize the MIPS architecture, shown in Figure 3, with Synopsys DC for different ISs and memory sizes: 32 to 512 32-bit words for instruction and data memories. Generating these Boolean circuits is a one-time process and the circuits can be re-used without incurring further compilation costs. Table 2 shows the synthesis time and number of non-XOR gates of the ISs with different sizes of memories.

Table 2: Synthesis results for different variants of the Instruction Set (IS).

<table>
<thead>
<tr>
<th>Memory Size (words)</th>
<th>Hamming Distance-specific IS</th>
<th>PSI-specific IS</th>
<th>AES-specific IS</th>
<th>ALU/Shift IS</th>
<th>ALU-only IS</th>
<th>Full IS [17]</th>
</tr>
</thead>
<tbody>
<tr>
<td>DM, IM = 32</td>
<td>19.648 s</td>
<td>6.715</td>
<td>2.021</td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>DM, IM = 64</td>
<td>31.092 s</td>
<td>9.380</td>
<td>3.046</td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>DM, IM = 128</td>
<td>62.217 s</td>
<td>16.062</td>
<td>5.095</td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>DM, IM = 512</td>
<td>358.186 s</td>
<td>63.974</td>
<td>17.385</td>
<td></td>
<td></td>
<td></td>
</tr>
</tbody>
</table>

- **Restricted IS for semi-private functions**: We synthesized two variants of the restricted IS: one without the Mult/Div unit and another without the Mult/Div and Shift units. The two variants differ mainly in reduced control logic and multiplexer select lines, so their numbers of non-XOR gates differ while the number of flip-flops is the same.

- **Full IS [17] for private functions**: We show full IS synthesis results with different memory sizes in Table 2.

### 5.4 Performance Evaluation

**Area.** Table 3 shows the resource allocation and utilization of our GC Evaluator on a Xilinx Virtex-7 FPGA. Note that the FPGA utilization does not vary for different memory sizes and instances of the MIPS processor since the evaluator logic remains unaltered. For different memory sizes and IS instances, only the non-XOR gate count varies. This only impacts the garbled labels and tables memory which significantly affects the off-chip memory utilized for storing the garbled tables, and the Block Random-Access Memory (BRAM) resources utilization only to a small extent.

Table 3: Resource allocation and utilization of GarbledCPU GC Evaluator on a Xilinx Virtex-7 FPGA.

<table>
<thead>
<tr>
<th>Resource</th>
<th>Estimation</th>
<th>Utilization %</th>
</tr>
</thead>
<tbody>
<tr>
<td>Flip-Flop (FF)</td>
<td>22,035</td>
<td>2.54</td>
</tr>
<tr>
<td>Slice LookUp Table (LUT)</td>
<td>21,229</td>
<td>4.90</td>
</tr>
<tr>
<td>BRAM</td>
<td>354</td>
<td>24</td>
</tr>
<tr>
<td>BUFG</td>
<td>2</td>
<td>6.25</td>
</tr>
</tbody>
</table>

**Performance.** Table 4 presents the runtime required to evaluate GarbledCPU for one instruction in terms of clock cycles and $\mu$s. Our GC evaluator operates at 100MHz on the FPGA. This is used to compute an average evaluation runtime of 1.1 clock cycles per gate for our pipelined GC evaluator which translates to an average of 11ns per gate in our FPGA implementation. The reported runtime can be further improved by providing tighter timing constraints.

**Comparison with Other Work.** Table 5 shows a comparison with other GC evaluator implementations. In fairness to [7], we note that we leverage GC optimizations that were not available at the time of that work. We report two of our implementations, the 21-stage pipelined evaluator and an un-pipelined variant, to show that pipelining improves our performance by a factor of 7.8. Table 5 compares our results with interpolated results estimated for other works. The results indicate that our pipelined GC evaluator FPGA implementation takes $51\times$ fewer clock cycles than the fastest software implementation, JustGarble [1]. Although the CPU clock frequency (3.0GHz) is $30\times$ higher than that of our Virtex-7 FPGA (100MHz), our pipelined implementation would still be almost $2\times$ faster than JustGarble in terms of absolute time. Note that our implementation is just a prototype on a reconfigurable FPGA as opposed to the custom-designed Intel AES-NI in a CPU. Implementing GarbledCPU on an ASIC would improve its performance in terms of absolute time even further. Moreover, our implementation is two orders of magnitude faster than the previously fastest hardware implementation of [7].
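The unit conversions behind these figures can be reproduced with a short Python sketch. It uses only numbers quoted in the text and in Table 4 (the Hamming distance IS with 32-word memories is taken as the example row); the function names are hypothetical.

```python
CLOCK_HZ = 100_000_000  # 100 MHz evaluator clock, as stated in the text

def cycles_to_us(cycles, clock_hz=CLOCK_HZ):
    """Convert a clock-cycle count into microseconds."""
    return cycles * 1_000_000 / clock_hz

def absolute_speedup(cycle_ratio, clock_ratio):
    """Wall-clock speedup given fewer cycles but a slower clock."""
    return cycle_ratio / clock_ratio

per_inst_us = cycles_to_us(7_118)          # one instruction, 32-word memories
per_gate_cc = 7_118 / 6_715                # cycles per non-XOR gate
per_gate_ns = 1.1 * 1e9 / CLOCK_HZ         # 1.1 cycles per gate at 100 MHz
speedup_vs_software = absolute_speedup(51, 30)
```

Under these inputs one instruction takes about 71.18 µs, the per-gate cost is about 1.06 cycles for this row and 11 ns at the quoted average of 1.1 cycles, and the ratio $51/30 = 1.7$ is the basis for the "almost $2\times$" statement above.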

---

\(^2\)Instructions required for Hamming distance are: LW, SW, ADD, SUB, XOR, NOP, SLL, and BREQ.

\(^3\)Instructions required for PSI are: LW, SW, ADD, SUB, NOP, SLL, BREQ, BNE, JAL, JR, and SLT.

\(^4\)Instructions required for AES are: LW, LB, SW, SB, ADD, SUB, AND, XOR, OR, NOP, SLL, BREQ, BNE, JAL, JR, and SLT.

Table 4: Performance of GarbledCPU for different ISs with different memory sizes at 100MHz clock frequency.

<table>
<thead>
<tr>
<th>Memory Size (words)</th>
<th>32</th>
<th>64</th>
<th>128</th>
<th>256</th>
<th>512</th>
</tr>
</thead>
<tbody>
<tr>
<td>Hamming Distance-IS</td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td># of non-XOR gates</td>
<td>6,715</td>
<td>9,580</td>
<td>16,062</td>
<td>25,483</td>
<td>53,343</td>
</tr>
<tr>
<td>Time per inst. (cc)</td>
<td>7,118</td>
<td>10,813</td>
<td>17,829</td>
<td>30,773</td>
<td>57,644</td>
</tr>
<tr>
<td>Time per inst. (µs)</td>
<td>71.18</td>
<td>108.13</td>
<td>178.29</td>
<td>307.72</td>
<td>576.44</td>
</tr>
<tr>
<td>Avg. Time per gate (cc)</td>
<td>1.06</td>
<td>1.10</td>
<td>1.11</td>
<td>1.08</td>
<td>1.08</td>
</tr>
<tr>
<td>AES-specific IS</td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td># of non-XOR gates</td>
<td>6,751</td>
<td>9,866</td>
<td>16,079</td>
<td>25,529</td>
<td>53,410</td>
</tr>
<tr>
<td>Time per inst. (cc)</td>
<td>7,428</td>
<td>10,953</td>
<td>18,029</td>
<td>30,814</td>
<td>57,149</td>
</tr>
<tr>
<td>Time per inst. (µs)</td>
<td>74.26</td>
<td>109.52</td>
<td>180.29</td>
<td>308.14</td>
<td>571.49</td>
</tr>
<tr>
<td>Avg. Time per gate (cc)</td>
<td>1.10</td>
<td>1.12</td>
<td>1.12</td>
<td>1.12</td>
<td>1.12</td>
</tr>
<tr>
<td>ALU-only IS</td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td># of non-XOR gates</td>
<td>9,676</td>
<td>12,703</td>
<td>19,094</td>
<td>34,071</td>
<td>66,238</td>
</tr>
<tr>
<td>Time per inst. (cc)</td>
<td>10,644</td>
<td>13,974</td>
<td>22,007</td>
<td>36,319</td>
<td>71,537</td>
</tr>
<tr>
<td>Time per inst. (µs)</td>
<td>106.44</td>
<td>139.74</td>
<td>220.07</td>
<td>363.19</td>
<td>715.37</td>
</tr>
<tr>
<td>Avg. Time per gate (cc)</td>
<td>1.10</td>
<td>1.12</td>
<td>1.12</td>
<td>1.12</td>
<td>1.12</td>
</tr>
</tbody>
</table>

## 6. CONCLUSION

We introduce GarbledCPU, the first hardware realization of a sequential CPU for secure evaluation of MIPS code. GarbledCPU enables the evaluation of public, semi-private, or private functions on secret inputs by trading off privacy against performance. GarbledCPU is synthesized on a Virtex-7 FPGA as a proof-of-concept implementation, and we evaluated our framework on three benchmarks: Hamming distance, private set intersection, and AES.

Acknowledgements. We would like to thank the anonymous reviewers for their helpful comments. This work is partially supported by an Office of Naval Research grant (ONR-R17460), a National Science Foundation grant (CNS-1059416), and a Multidisciplinary University Research Initiative grant (FA9550-14-1-0351/ Rice 14-0538) to the ACES lab at Rice University. The work of the authors at TU Darmstadt is supported by the European Union’s Seventh Framework Program (FP7/2007-2013) grant agreement n. 609611 (PRACTICE), the German Science Foundation (DFG) as part of project E3 within the CRC 1119 CROSSING, the German Federal Ministry of Education and Research (BMBF) within CRISP, and the Hessian LOEWE excellence initiative within CASED.

## 7. REFERENCES


