Intel(R) QuickAssist Technology Software Readme
===============================================

Intel(R) QuickAssist Technology Software Package Version: QATSWPkgVersion
Intel(R) QuickAssist Technology Driver Version: 2.4.0.7


Contents
========
- License
- Details/Limitations of this Release
- Software Installation
- Intel QuickAssist Technology Compression library - QATzip
    - QATzip - Introduction
    - QATzip - Features
    - QATzip - Hardware Requirements
    - QATzip - Software Requirements
    - QATzip - API Manual
    - QATzip - Additional Information
    - QATzip - Limitations
- QATzip software Application (Parcomp)
- Cryptography performance micro-benchmark Tool (CNGTest)
- Rate Limiting
- Rate Limiting - API Programmer's Guide
- Troubleshooting


License
=======
Refer to license.txt in this package for the Intel software license agreement before using this software.
In addition to Intel software, this package includes the following components:
1) Built-in sample code from Microsoft samples, licensed under MS-LPL (license.rtf)
   * License file included in this package and detailed at:
     http://code.msdn.microsoft.com/windowshardware/Windows-8-Driver-Samples-5e1aa62e
2) This software package uses the Intel(R) Storage Acceleration library (isa-l) to perform data
   compression using software.
   * The isa-l library is used under the terms of the license listed at:
     https://github.com/intel/isa-l/blob/master/LICENSE
   * This license file is included in this package as a file named isa-l_LICENSE.txt
3) This software package also uses the LZ4 software library to perform data compression using the
   LZ4 algorithm.
   * The LZ4 software library is used under the terms of the license listed at:
     https://github.com/lz4/lz4/blob/dev/lib/LICENSE
   * This license file is included in this package as a file named lz4_LICENSE.txt


Details/Limitations of this Release
===================================
* This software is only supported on Windows Server 2022.
* 32-bit applications are not supported on Windows Server 2022 when using the
  Intel QuickAssist Technology CNG provider.
* This software package supports virtualization (SR-IOV) using Hyper-V for Intel QuickAssist Technology devices.
  - Virtualization is only supported with Linux VMs running Ubuntu v18.04 or Ubuntu v20.04, over Hyper-V.
  - Virtualization is also supported on Windows VMs (Windows Server 2019 or newer) running over Hyper-V.
* Windows Remote Desktop is not supported if Intel QuickAssist Technology CNG providers are registered
  as default providers for cryptographic algorithms.


Software Installation
=====================
To install the Intel(R) QuickAssist Accelerator software:
  - Navigate to the QuickAssist\Setup sub-folder (within the folder where the package was extracted)
  - Run QatSetup.exe
  - Follow all instructions as displayed by the installation program.
  - For virtualization (SR-IOV) support without QAT Host services, select the option to install as
    a "virtualization host" in the installation program. A system restart is required at the end of
    the installation in order to fully enable virtualization support.
  - When the driver is installed, check the Device Manager for four devices under 'Security Accelerator'.
    - Ensure that the devices are in the 'Enabled' state and that 'Hardware Ids' in the 'Details' tab
      shows 4940 or 4942
  - In a Windows Virtual Machine (VM), after the driver is installed, check the Device Manager for
    accelerator devices under 'Security Accelerator'.
    - Ensure that the devices are in the 'Enabled' state and that 'Hardware Ids' in the 'Details' tab
      shows 4941 or 4943

To uninstall the Intel(R) QuickAssist Accelerator software:
  - Open "Programs and Features" from the Control Panel application
  - Click on the installed application "Intel(R) QuickAssist Technology 2.4.0.0010"
  - Choose Uninstall
  - Reboot


Release Notes
=============
For the latest information about this release, download the "Release Notes" from the same location
where you downloaded this software package.


Getting Started Guide
=====================
For general information on how to use this software package, download the "Getting Started Guide"
from the same location where you downloaded this software package.


Intel(R) QuickAssist Technology Compression library - QATzip
============================================================
* Other names and brands may be claimed as the property of others

The Intel(R) QuickAssist Technology Compression library called QATzip and its associated header file
can be found in the "<Program Files>\Intel\Intel(R) QuickAssist Technology\Compression\Library" folder.

The components of the library are:
1) qatzip.h - Header file describing the QATzip API.
2) libqatzip.lib - Static library containing the implementation of the QATzip API
3) qatzip.lib - Import library to interface with qatzip.dll
4) qatzip.dll - DLL containing the implementation of the QATzip API - installed into the <system32> folder

The library and header file can be compiled and linked into any software component that requires
Intel(R) QuickAssist Technology compression and/or decompression services. 

More information on the QATzip API and other details about the library are available upon request.


QATzip - Introduction
=====================
QATzip is a user space library which builds on top of the Intel(R) QuickAssist
Technology user space library, to provide extended accelerated compression and
decompression services by offloading the actual compression and decompression
request(s) to the Intel(R) QuickAssist Accelerator. QATzip produces data using
the standard Gzip* format (RFC 1952) with extended headers encapsulated with an
additional 4 bytes to accelerate data decompression. QATzip is designed to take
full advantage of the performance provided by Intel(R) QuickAssist Technology.

The currently supported formats include:

 * Formats based on algorithms:

| Data Format           | Parcomp Provider   | Description                                                    |
| :---------------:     |  :---------------: | :------------------------------------------------------------: |
| `QZ_DEFLATE_4B`       | qat                |Data is in DEFLATE* with a 4 byte header|
| `QZ_DEFLATE_GZIP`     | qatgzip            |Data is in DEFLATE* wrapped by Gzip* header and footer|
| `QZ_DEFLATE_GZIP_EXT` | qatgzipext         |Data is in DEFLATE* wrapped by Intel(R) QAT Gzip* extension header and footer|
| `QZ_DEFLATE_RAW`      | N/A                |Data is in raw DEFLATE* without any additional header. (Not supported since release 1.4.)|

 * Available compression algorithms:

| Compression Algorithm | Parcomp Provider          | Description                                                    |
| :---------------:     | :-----------------------: | :------------------------------------------------------------: |
| `QZ_DEFLATE`          | qat, qatgzip, qatgzipext  |Data is in DEFLATE*|
| `QZ_MSZIP_COMPATIBLE` | qatms                     |MSZIP* format wrapped with a 4 byte header. Deprecated in release 1.4|
| `QZ_ZLIB_COMPATIBLE`  | qatzlib                   |zlib* format wrapped with a 4 byte header|
| `QZ_SW_XPRESS`        | xpress                    |Software Compression using Xpress* algorithm wrapped with a 4 byte header|
| `QZ_SW_IGZIP`         | igzip                     |Software Compression using DEFLATE* algorithm wrapped with a 4 byte header|
| `QZ_LZ4`              | qatlz4                    |Compression using LZ4* algorithm|
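
Several of the formats above wrap each compressed block in a 4 byte header which,
as described under Features, is an unsigned integer giving the length of the
compressed block. The following Python sketch walks such a buffer block by block.
It assumes that the length field is little-endian, that it counts only the bytes
that follow it, and that blocks are stored back to back. None of these details is
stated in this readme, and the function names are illustrative only.

```python
import struct

# Assumption: the 4 byte length is an unsigned little-endian integer.
LENGTH_FIELD = struct.Struct("<I")


def frame_blocks(blocks):
    """Prefix each already-compressed block with its 4 byte length."""
    return b"".join(LENGTH_FIELD.pack(len(block)) + block for block in blocks)


def iter_blocks(buf):
    """Yield the payload of every length-prefixed block found in buf."""
    offset = 0
    while offset < len(buf):
        if offset + LENGTH_FIELD.size > len(buf):
            raise ValueError("truncated length field")
        (length,) = LENGTH_FIELD.unpack_from(buf, offset)
        offset += LENGTH_FIELD.size
        if offset + length > len(buf):
            raise ValueError("truncated block")
        yield buf[offset:offset + length]
        offset += length
```

The payloads here are opaque bytes. Decoding them still requires the algorithm
that matches the provider used for compression.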

QATzip - Features
=================
* Acceleration of compression and decompression utilizing Intel(R) QuickAssist
  Technology, including a utility to compress and decompress files.
* Instance over-subscription, allowing a number of threads in the same process
  to seamlessly share a smaller number of hardware instances.
* Optional software fallback for both compression and decompression services.
  QATzip for Microsoft(R) Windows(TM) may switch to software if there are
  insufficient system resources, including acceleration instances or memory.
  This feature allows for a common software stack between server platforms that
  have acceleration devices and non-accelerated platforms.
* Intel(R) QATzip 4 byte header: This header is composed of an
  unsigned integer [4 bytes] indicating the length of the compressed block
  followed by the standard header for the data format used.
* Introduction of QATzip Gzip* format. This uses the standard 10 byte Gzip*
  header, which is structured as follows:

  `| ID1(0x1F) 1B | ID2(0x8B) 1B | Compression Method (8 = DEFLATE*) 1B |
  Flags 1B | Modification Time 4B | Extra Flags 1B | Operating
  System 1B |`

* Introduction of QATzip Gzip* extended format. This consists of the standard
  10 byte Gzip* header and follows RFC 1952 to extend the header by an
  additional 14 bytes. Below is an outline of the extended header's structure:

  `| Length of ext. header 2B | ID1('Q') 1B | ID2('Z') 1B | Length of
  subheader 2B | Intel(R) defined field 'Chunksize' 4B | Intel(R) defined
  field 'Blocksize' 4B |`

  Chunksize and Blocksize are unsigned integers, which store the original
  size of the data and the size of the compressed data block respectively.

* Introduction of a Dynamic-Link Library (DLL) for QATzip on Microsoft(R)
  Windows(TM).
  
* Introduction of QATzip LZ4* format. This format compresses using multiple blocks,
  which are organized into a frame that is structured as follows:

  `| MagicNb(0x184D2204) 4B | FLG(0x60) 1B | BD(0x60) 1B | HC(0x51) 1B | Block 1 size 4B | Block 1 | ... | Block N size 4B | Block N | EndMark(0x00000000) 4B |`

* Introduction of Programmable Cyclic Redundancy Check (CRC). This will allow
  for custom CRC64 configurations to be set for a provided session with a user
  defined set of parameters as follows.

  | Parameter         | Description               |
  | :---------------: | :-----------------------: |
  | `polynomial`      | Polynomial used for CRC64 calculation. Default 0x42F0E1EBA9EA3693 |
  | `initial_value`   | Defaults to 0x0000000000000000 |
  | `reflect_in`      | Reflect bit order before CRC calculation. Default 0 |
  | `reflect_out`     | Reflect bit order after CRC calculation. Default 0 |
  | `xor_out`         | Defaults to 0x0000000000000000 |

  A custom programmable CRC64 configuration can only be set on a session after setup.
  A state machine tracks the state of the session to only allow programmable CRC64 configurations
  to be set in the `setup` state. The states of a session are defined as follows.

                | Updating |
                   ^   |
                   |   ⌄
  | Created | -> | Setup | -> | Active | -> | Closing |

  In order to propagate a new CRC64 configuration the session must be restarted.
  In a multithreaded application, requests received while the session is in the
  'Updating' state are rejected.

  The CRC64 value is calculated for the src buffer in compression and dst buffer
  in decompression. The completed compression or decompression blocks are placed
  in the output buffer and the CRC64 checksum will be in the user provided buffer *crc.
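
To make the programmable CRC64 parameters listed above concrete, the following
Python sketch is a bit-by-bit software model of a CRC-64 driven by the same five
values. It is an illustration of how the parameters interact, not the QATzip
implementation. The function names are hypothetical, and the defaults are the
ones given in the parameter table.

```python
MASK64 = (1 << 64) - 1


def reflect(value, width):
    """Reverse the order of the lowest `width` bits of value."""
    result = 0
    for _ in range(width):
        result = (result << 1) | (value & 1)
        value >>= 1
    return result


def crc64(data, polynomial=0x42F0E1EBA9EA3693, initial_value=0,
          reflect_in=False, reflect_out=False, xor_out=0):
    """Compute a CRC-64 over data using the five table parameters."""
    crc = initial_value & MASK64
    for byte in data:
        if reflect_in:
            byte = reflect(byte, 8)      # reflect bit order of each input byte
        crc ^= byte << 56                # feed the byte into the top of the register
        for _ in range(8):
            if crc & (1 << 63):
                crc = ((crc << 1) ^ polynomial) & MASK64
            else:
                crc = (crc << 1) & MASK64
    if reflect_out:
        crc = reflect(crc, 64)           # reflect the final register value
    return crc ^ xor_out
```

With every parameter left at its default, this model computes the CRC-64 variant
commonly catalogued as ECMA-182. Changing `initial_value`, `reflect_in`,
`reflect_out`, or `xor_out` selects other variants built on the same polynomial.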


QATzip - Hardware Requirements
==============================
This QATzip library supports compression and decompression offload to the
following acceleration devices:

* Intel(R) 4XXX Accelerator


QATzip - API Manual
===================
Please refer to file `QATzip-man.pdf` found at this link
https://github.com/intel/QATzip/blob/master/docs/QATzip-man.pdf


QATzip - Limitations
====================
* When passing data for compression into the library, pass in the complete
  payload rather than subdividing it, because the "last" bit is set on the
  final compressed block.
* This software is only supported on Microsoft(R) Windows(TM) Server 2022.
* The largest compressible file size is 999MB.
* Since release 1.4, "RAW" DEFLATE* (QZ_DEFLATE_RAW) is not a supported data
  format.
* Software fallback in QATzip is not applicable for the following formats:
    * `QZ_ZLIB_COMPATIBLE`
    * `QZ_MSZIP_COMPATIBLE`
    * `QZ_SW_XPRESS`
    * `QZ_SW_IGZIP`
* Gzip* decompression is currently only supported using software offload as
  Gzip* does not contain a blocksize value.
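
The last limitation can be illustrated by comparing the two header layouts given
under Features. The Python sketch below builds and parses the QATzip Gzip*
extended header and returns nothing for a plain Gzip* header, which carries no
blocksize. The field order follows the layout in this readme. RFC 1952 defines
the length fields as little-endian, and the sketch assumes that Chunksize and
Blocksize are also little-endian. The function names and the values chosen for
the Flags and Operating System bytes are illustrative assumptions.

```python
import struct

GZIP_FIXED = struct.Struct("<BBBBIBB")   # standard 10 byte Gzip header
QZ_EXTRA = struct.Struct("<HBBHII")      # 14 byte extension described above
FEXTRA = 0x04                            # RFC 1952 flag: extra field present


def build_ext_header(chunk_size, block_size, mtime=0):
    """Return a 24 byte header carrying Chunksize and Blocksize."""
    fixed = GZIP_FIXED.pack(0x1F, 0x8B, 8, FEXTRA, mtime, 0, 255)
    # 12 = bytes following the 2 byte length; 8 = bytes in the 'QZ' subfield.
    extra = QZ_EXTRA.pack(12, ord("Q"), ord("Z"), 8, chunk_size, block_size)
    return fixed + extra


def parse_ext_header(buf):
    """Return chunk and block sizes, or None if the header has no 'QZ' field."""
    id1, id2, method, flags, _mtime, _xfl, _os = GZIP_FIXED.unpack_from(buf, 0)
    if (id1, id2, method) != (0x1F, 0x8B, 8):
        raise ValueError("not a Gzip DEFLATE header")
    if not flags & FEXTRA:
        return None                      # plain Gzip: no blocksize available
    _xlen, si1, si2, _sublen, chunk, block = QZ_EXTRA.unpack_from(
        buf, GZIP_FIXED.size)
    if (si1, si2) != (ord("Q"), ord("Z")):
        return None
    return {"chunk_size": chunk, "block_size": block}
```

Without the blocksize, a reader cannot tell where one compressed block ends
without decoding it, which is consistent with the limitation stated above.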


QATzip - Known Issues
=====================
* When decompressing a file, it is important to match the chunk size/hw_buff_sz
  to the value that was specified to compress the data. If no value was
  specified, 64KB is the default.

  This value is used during decompression to provision appropriate space
  between the inflated blocks to minimise the number of buffer copies during
  parallel driver decompression. Using an incorrect value will result in
  inflated blocks which overlap or blocks spaced too far apart.

  This issue applies to all formats with the exception of QZ_DEFLATE_GZIP
  compressed format. The primary mitigation for this issue is to record
  the chunk size/hw_buff_sz used during compression.
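
The effect of a mismatched chunk size can be sketched with simple arithmetic.
The Python example below uses a simplified model, which is an assumption and not
a description of the driver: inflated block number i is placed at an offset of
i times the chunk size. The function names are hypothetical.

```python
def chunk_spans(total_bytes, chunk_kb=64):
    """Return (offset, length) for each chunk; the last one may be shorter."""
    chunk = chunk_kb * 1024
    return [(offset, min(chunk, total_bytes - offset))
            for offset in range(0, total_bytes, chunk)]


def placement_drift(compress_kb, decompress_kb, n_blocks):
    """Return the signed error, in bytes, of each planned block start.

    Negative values mean the slot starts inside the previous block's
    output (overlap); positive values mean the blocks are spaced too far
    apart.
    """
    step = (decompress_kb - compress_kb) * 1024
    return [index * step for index in range(n_blocks)]
```

For example, `placement_drift(64, 32, 3)` gives increasingly negative values
(overlap), `placement_drift(64, 128, 3)` gives increasingly positive values
(gaps), and matching sizes give zero for every block.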


QATzip software Application (Parcomp)
=====================================
This package comes with a tool called 'parcomp' to test the performance of the Intel(R) QuickAssist
Technology compression accelerator.
Parcomp has been built using the QATzip API and library, and can be found in the
"<Program Files>\Intel\Intel(R) QuickAssist Technology\Compression" folder. It can be used to measure
and report the rate at which compression and decompression operations are performed using the
accelerator as well as those operations performed by the default services offered by the system OS.

Note:-
When the Parcomp application is running in Windows Virtual Machines (VMs), the memory (RAM) allocated to
the VM should be of sufficient size. For example, when using Parcomp to compress a file of 999MB with
2 threads, the memory allocated to the VM should be at least 4GB.

To run Parcomp:
---------------
1) Launch Command Prompt (cmd.exe) as Administrator.

2) Navigate to the following sub-folder where the software package was installed:
   "<Program Files>\Intel\Intel(R) QuickAssist Technology\Compression"

3) Run Parcomp, for example:
   For compression:    parcomp -p qat -i <input_file> -o <output_file>
   For decompression:  parcomp -p qat -d -i <input_file> -o <output_file>

4) To see a list of supported command-line options, simply run the application without any
   command-line options (or use the -h option)
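
When Parcomp is driven from a script, building the argument list in code avoids
typing mistakes and keeps the chunk size identical in both directions. The
following Python sketch only assembles argument lists from options documented in
this readme (-p, -d, -c, -i and -o). The helper names are hypothetical, and the
sketch does not launch anything itself.

```python
def parcomp_args(provider, src, dst, decompress=False, chunk_kb=None):
    """Build an argument list for parcomp.exe using plain ASCII hyphens."""
    args = ["parcomp.exe", "-p", provider]
    if decompress:
        args.append("-d")
    if chunk_kb is not None:
        args += ["-c", str(chunk_kb)]
    args += ["-i", str(src), "-o", str(dst)]
    return args


def round_trip_args(provider, original, compressed, restored, chunk_kb=64):
    """Return the compress and decompress commands with a shared chunk size."""
    return (
        parcomp_args(provider, original, compressed, chunk_kb=chunk_kb),
        parcomp_args(provider, compressed, restored, decompress=True,
                     chunk_kb=chunk_kb),
    )
```

Each returned list can be passed to `subprocess.run` from an Administrator
prompt in the Compression folder. Comparing the restored file with the original,
for example with `filecmp.cmp(original, restored, shallow=False)`, confirms the
round trip.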

Command-line options:
---------------------
Usage: parcomp.exe -i <srcFilename> -o <dstFilename> [options]

Required options:

    -i srcFilename        Input (source) filename
    -o dstFilename        Output (destination) filename

Optional options can be:

    -b [cold|warm]        Use cold buffer or not.
    -p providerName       Specifies the provider (implementation).
                          Options include:
                           'qat' - QuickAssist accelerated DEFLATE* algorithm
                           'qatzlib' - QuickAssist accelerated DEFLATE*
                                       algorithm with zlib* header.
                           'qatms' - Deprecated QuickAssist accelerated
                                     DEFLATE* algorithm with MSZIP*
                                     header. Only decompression supported.
                           'qatgzip' - QuickAssist accelerated DEFLATE*
                                       algorithm with Gzip* header.
                                       Note: With qatgzip provider, options
                                       -k -t -Q cannot be used in
                                       decompression direction.
                                       The operation will be executed in a
                                       single thread with SW implementation.
                                       Parcomp application cannot process
                                       Gzip* source buffers bigger than 1Gb
                                       and the number of iterations is 1.
                           'qatgzipext' - QuickAssist accelerated DEFLATE*
                                          algorithm with Gzip* extended header.
                           'xpress' - Software based Xpress* algorithm.
                           'igzip' - Software based DEFLATE* algorithm using
                                     igzip dll.
                           'qatlz4' - QuickAssist accelerated algorithm
                                      with lz4* header.
    -c chunkSizeInKB      Chunk size, in KB.  Default is 64.
                          Files will be divided into chunks of this size
                          (the last chunk may be smaller), and each chunk
                          is compressed separately. State is not maintained
                          between chunks. The chunk size used for decompression
                          must match the value used for compression.
    -crc bitWidth         CRC bitwidth. Valid value is 64.
    -pcrc index           CRC64 configuration to use. Valid values are 0 (ECMA-182)
                          or 1 (Rocksoft). Default is 0.
    -l compressionLevel   Compression level.  Default is 1.
                          compressionLevel can range from 1 - 4. Lower values
                          imply less compressibility in less time.
    -d                    Decompress the input file. Default is to compress.
    -v (or -g)            Verbose (or debug)
    -x numLines           Print a summary of the inputs and outputs in a
                          comma-separated variable (CSV) format for easy
                          importing into a spreadsheet. Specify numLines as 1
                          for data only, or 2 to also include a header summary
    -t numThreads         Creates specified number of threads, splits input
                          file into numThreads (near-)equal chunks, and
                          performs the operation in each thread. If specifying
                          multiple threads, -Q is also required.
    -f cpuFreqInMHz       Specifies the CPU frequency in MHz. If not specified,
                          this will be measured (takes approx. 1 second)
    -n numIterations      Specifies the number of iterations (allows you to run
                          the same operation numIterations times) Default is 1.
    -Q                    Test independent threads writing one process using
                          one session. Uses multiple copies of same input file,
                          outputs one output file per thread.
                          Must be used with -t and threadcount of 1 or more.
    -k blockSizeInKB      Separate the source data into several blocks of size
                          specified by blockSizeInKB. -k uses -Q by default.
    -h                    Print this help message.

The following are applicable for the providers qat (all options) and qatlz4 (-FB, -SW and -FT) only:

    -j maxOutstandingJobs Maximum number of outstanding jobs (requests) that
                          may be outstanding at any one time.  Default is 30.
    -s                    Static compression. Default is dynamic.
    -D                    Dynamic compression. This is default.
    -FB                   Enable igzip and LZ4 software fallback.
    -FT thresSizeInKB     Threshold value for fallback. If the offload size
                          is less than the threshold, software provider is used.


Sample Results
--------------
Here are a few examples of the results obtained by running Parcomp for compression and decompression:
Note: These examples are for illustrative purposes only.

1) Using Intel(R) QuickAssist Technology accelerator for compression and decompression:

   parcomp -p qat -i largetext -o largetext.compressed
   ---------------------------------------------------
   Warning: The hw_buff_sz parameter value used for decompression must match the value used for compression.
   Default hw_buff_sz value: 65536 bytes.
   hw_buff_sz value used in current execution: 65536 bytes.
   Parcomp: Tool to test compression & decompression
   (c) 2018, Intel(R) Corporation

   Reading input file: C:\CompressionFiles\largetext (1000000000 Bytes)
   Writing output file: C:\CompressionFiles\largetext.compressed (489318805 Bytes)

   Deflation Ratio (%age)     : 48.9
   Thruput (uncompressed Mbps): 12312.576
   Time (ms)                  : 571.054

   Note:- All times exclude file I/O and are measured around the call to the qzCompress() API only.

   parcomp -p qat -d -i largetext.compressed -o largetext.original
   ---------------------------------------------------------------
   Warning: The hw_buff_sz parameter value used for decompression must match the value used for compression.
   Default hw_buff_sz value: 65536 bytes.
   hw_buff_sz value used in current execution: 65536 bytes.
   Parcomp: Tool to test compression & decompression
   (c) 2018, Intel(R) Corporation

   Reading input file: C:\CompressionFiles\largetext.compressed (489318806 Bytes)
   Writing output file: C:\CompressionFiles\largetext.original (1000000000 Bytes)

   Inflation Ratio (%age)     : 204.4
   Thruput (uncompressed Mbps): 23407.884
   Time (ms)                  : 280.976

   Note:- All times exclude file I/O and are measured around the call to the qzDecompress() API only.


2) Using Gzip* extended header with Intel(R) QuickAssist Technology accelerator for compression and decompression:

   parcomp -p qatgzipext -i largetext -o largetext.compressed
   ----------------------------------------------------------
   Warning: The hw_buff_sz parameter value used for decompression must match the value used for compression.
   Default hw_buff_sz value: 65536 bytes.
   hw_buff_sz value used in current execution: 65536 bytes.
   Parcomp: Tool to test compression & decompression
   (c) 2020, Intel(R) Corporation

   Reading input file: largetext.txt (1000000000 Bytes)
   Writing output file: largetext.compressed (396497299 Bytes)

   Deflation Ratio (%age)     : 39.6
   Thruput (uncompressed Mbps): 14677.658
   Time (ms)                  : 356.475

   Note:- All times exclude file I/O and are measured around the call to the qzCompress() API only.

   parcomp -p qatgzipext -d -i largetext.compressed -o largetext.original
   ----------------------------------------------------------------------
   Warning: The hw_buff_sz parameter value used for decompression must match the value used for compression.
   Default hw_buff_sz value: 65536 bytes.
   hw_buff_sz value used in current execution: 65536 bytes.
   Parcomp: Tool to test compression & decompression
   (c) 2020, Intel(R) Corporation

   Reading input file: largetext.compressed (396497299 Bytes)
   Writing output file: largetext.original (1000000000 Bytes)

   Inflation Ratio (%age)     : 252.2
   Thruput (uncompressed Mbps): 24304.489
   Time (ms)                  : 188.323

   Note:- All times exclude file I/O and are measured around the call to the qzDecompress() API only.

3) Using Windows(TM) API (Xpress*) for compression and decompression:

   parcomp -p xpress -i largetext -o largetext.compressed
   ------------------------------------------------------
   Parcomp: Tool to test compression & decompression
   (c) 2018, Intel(R) Corporation

   Reading input file: C:\CompressionFiles\largetext (1000000000 Bytes)
   Writing output file: C:\CompressionFiles\largetext.compressed (573799425 Bytes)

   Deflation Ratio (%age)     : 57.4
   Thruput (uncompressed Mbps): 1070.143
   Time (ms)                  : 7283.892

   Note:- All times exclude file I/O and are measured around the call to the qzCompress() API only.

   parcomp -p xpress -d -i largetext.compressed -o largetext.original
   -----------------------------------------------------------------
   Parcomp: Tool to test compression & decompression
   (c) 2018, Intel(R) Corporation

   Reading input file: C:\CompressionFiles\largetext.compressed (573799425 Bytes)
   Writing output file: C:\CompressionFiles\largetext.original (1000000000 Bytes)

   Inflation Ratio (%age)     : 174.3
   Thruput (uncompressed Mbps): 3129.774
   Time (ms)                  : 2408.823

   Note:- All times exclude file I/O and are measured around the call to the qzDecompress() API only.


4) For performance test of QAT compression and decompression:

   **NOTE**: When -Q is used as a parameter in the compression command, a number of identical
   output files are produced, each with an appended numeric suffix starting with 0.
   This suffix is also required in the input filename for the decompression command.
   Using -k implicitly enables the -Q option.

   parcomp.exe -p qat -Q -t 6 -c 64 -k 4096 -j 60 -n 200 -i largetext -o largetext.compressed
   ------------------------------------------------------------------------------------------
   Warning: The hw_buff_sz parameter value used for decompression must match the value used for compression.
   Default hw_buff_sz value: 65536 bytes.
   hw_buff_sz value used in current execution: 65536 bytes.
   Parcomp: Tool to test compression & decompression
   (c) 2018, Intel(R) Corporation

   Reading input file: C:\CompressionFiles\largetext (1000000000 Bytes)
   All threads completed as Expected.

   Deflation Ratio (%age)     : 47.9
   Thruput (uncompressed Mbps): 63967.443
   Processing Block size      : 4096 KB
   Block count                : 239
   Time/block (ms)            : 3.140

   Note:- All times exclude file I/O and are measured around the call to the qzCompress() API only.

   parcomp.exe -p qat -d -Q -t 6 -c 64 -k 4096 -j 60 -n 200 -i largetext.compressed0 -o largetext.original
   ------------------------------------------------------------------------------------------------------
   Warning: The hw_buff_sz parameter value used for decompression must match the value used for compression.
   Default hw_buff_sz value: 65536 bytes.
   hw_buff_sz value used in current execution: 65536 bytes.
   Parcomp: Tool to test compression & decompression
   (c) 2018, Intel(R) Corporation

   Reading input file: C:\CompressionFiles\largetext.compressed0 (489318782 Bytes)
   All threads completed as Expected.

   Inflation Ratio (%age)     : 204.4
   Thruput (uncompressed Mbps): 131133.883
   Processing Block size      : 4096 KB
   Block count                : 477
   Time/block (ms)            : 1.535

   Note:- All times exclude file I/O and are measured around the call to the qzDecompress() API only.
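
The ratio lines in samples 1 to 3 above are consistent with a simple percentage
of output size over input size. The Python sketch below reproduces them from the
byte counts printed in those samples. The formulas are inferred from the sample
numbers rather than taken from the Parcomp source, and the function names are
illustrative.

```python
def deflation_ratio(original_bytes, compressed_bytes):
    """Compressed size as a percentage of the original size."""
    return 100.0 * compressed_bytes / original_bytes


def inflation_ratio(compressed_bytes, original_bytes):
    """Restored size as a percentage of the compressed size."""
    return 100.0 * original_bytes / compressed_bytes


# Byte counts copied from sample 1 (qat provider).
print(round(deflation_ratio(1000000000, 489318805), 1))
print(round(inflation_ratio(489318806, 1000000000), 1))
```

A lower deflation ratio means better compression. The inflation ratio is its
reciprocal expressed as a percentage.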


Cryptography performance micro-benchmark Tool (cngtest)
=======================================================
The software driver and providers come with a micro-benchmark tool (cngtest) to test the
performance of various cryptography algorithms. This tool can be used to measure and report the
rate at which crypto algorithm operations (like encrypt, decrypt, signhash, verifysignature,
finalizekeypair, secretagreement, etc.) are performed using the Windows(TM) Cryptography Next-Generation
(CNG) framework, with either of two providers:
- the default software provider (provided by Microsoft and which is part of the OS)
- the provider based on Intel(R) QuickAssist Technology.

Using the cngtest tool, it is possible to quickly see the substantial CPU savings that can be gained
by offloading public key cryptography – for example, the RSA 2048 decrypt operations used by a web server
during SSL handshakes – from the CPU to the hardware accelerator.

You can use the batch file Perf_User.bat installed in the following location:
<Program Files>\Intel\Intel(R) QuickAssist Technology\Crypto\Samples\bin
to obtain results using cngtest for user mode tests. Kernel-mode tests are no longer supported.
The batch file contains cngtest commands to perform various cryptographic operations using different
algorithms and parameters.

To use the batch file, you will need to open a command-prompt window with Administrator privileges.
1. Navigate to the <Program Files>\Intel\Intel(R) QuickAssist Technology\Crypto\Samples\bin folder
2. For user mode performance of RSA, DSA, ECDSA, DH, ECDH algorithms, run:
   Perf_User.bat

How to run cngtest independently
--------------------------------
The cngtest application (cngtest.exe) is located in the
"<Program Files>\Intel\Intel(R) QuickAssist Technology\Crypto\Samples\bin" folder.
  - Launch the Command Prompt (cmd.exe) window as Administrator.
  - Navigate to the following sub-folder:
    <Program Files>\Intel\Intel(R) QuickAssist Technology\Crypto\Samples\bin

This micro-benchmark tool is a command-line utility which allows you to specify, via command line
parameters, the provider (HW vs. SW), the algorithm, the number of operations to perform, the number
of software threads across which to spread the requests, the CPU cores to which those software threads
should be affinitized, and the key length.
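
As an illustration of how these parameters combine, the Python sketch below
assembles a cngtest argument list from the flags documented under CNGTest Flags
(-provider, -algo, -keyLength, -numThreads and -affinityMask). The helper names
are hypothetical. The key-length tables and the fallback to 2048 mirror the flag
descriptions in this readme, and the sketch does not run cngtest itself.

```python
LEGAL_KEY_LENGTHS = {
    "rsa": {512, 1024, 1536, 2048, 3072, 4096},
    "dsa": {1024, 2048, 3072},
    "dh": {768, 1024, 1536, 2048, 3072, 4096},
}
DEFAULT_KEY_LENGTH = 2048


def effective_key_length(algo, requested):
    """Mirror the documented fallback: unsupported sizes become 2048."""
    if requested in LEGAL_KEY_LENGTHS[algo]:
        return requested
    return DEFAULT_KEY_LENGTH


def affinity_mask(cores):
    """Core c contributes bit 2**c; the mask is written in hex with 0x."""
    mask = 0
    for core in cores:
        mask |= 1 << core
    return "0x%02X" % mask


def cngtest_args(provider="sw", algo="rsa", key_length=None,
                 threads=None, cores=None):
    """Build an argument list for cngtest.exe."""
    args = ["cngtest.exe", "-provider=" + provider, "-algo=" + algo]
    if key_length is not None and algo in LEGAL_KEY_LENGTHS:
        args.append("-keyLength=%d" % key_length)
    if threads is not None:
        args.append("-numThreads=%d" % threads)
    if cores:
        args.append("-affinityMask=" + affinity_mask(cores))
    return args
```

For cores 2 and 3, `affinity_mask([2, 3])` yields the same 0x0C mask used as the
example in the -affinityMask description.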

CNGTest Flags
-------------
The micro-benchmark tool includes some brief usage help, which can be seen by running: cngtest -help

cngtest is a "microbenchmark" which measures and reports the rate (measured
        in ops/second) at which encrypt and decrypt operations are performed
        using the CNG framework, with one of two providers: the default
        (software) provider, or a provider based on Intel(R) QuickAssist
        Technology (this requires the presence on the platform of a hardware
        accelerator).

Usage: cngtest [flags]

The flags are as follows:

        -provider={sw|qa}
            Specifies the provider to use. The default value is "sw",
            meaning use the default (software) provider. The only other legal
            value is "qa", which means to use the QuickAssist provider.

Note that the remaining parameters have different defaults depending on the
provider, as indicated below:

        -algo=<algo>
            Specifies the algorithm to test. The values of <algo> supported
            are:

              rsa: This measures the rate at which RSA decrypt operations
              (using CRT), and encrypt operations, are performed.  Additional
              flags that can be provided in this case include:
                -keyLength=<num>
                    Specifies the key size (modulus size) for the
                    operation. The default is 2048.  Other legal values
                    include 512, 1024, 1536, 3072 and 4096. Any other value
                    will result in the default 2048 value being used.

              ecdsa: This measures the performance of the ECDSA algorithm.
              Additional flags that can be provided in this case include:
                -ecccurve=<curvename>
                    Specifies the ECC curve name used for the ECDSA and ECDH
                    algorithms. The default curve name is nistP256; for other
                    legal values, refer to CNG Named Elliptic Curves on MSDN.

              ecdh: This measures the performance of the ECDH algorithm.
              Additional flags that can be provided in this case include:
                -ecccurve=<curvename>
                    Specifies the ECC curve name used for the ECDSA and ECDH
                    algorithms. The default curve name is nistP256; for other
                    legal values, refer to CNG Named Elliptic Curves on MSDN.

              dsa: This measures the performance of DSA algorithm (DSA
              algorithm is not supported in kernel mode, if kernel mode is
              specified, will route to running user mode test). Additional
              flags that can be provided in this case include:
                -keyLength=<num>
                    Specifies the key size (modulus size) for the
                    operation. The default is 2048.  Other legal values
                    include 1024 and 3072. Any other value will
                    result in the default 2048 value being used.

              dh: This measures the performance of DH algorithm. Additional
              flags that can be provided in this case include:
                -keyLength=<num>
                    Specifies the key size (modulus size) for the
                    operation. The default is 2048.  Other legal values
                    include 768, 1024, 1536, 3072 and 4096. Any other value
                    will result in the default 2048 value being used.
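
The -keyLength fallback described above can be sketched as follows. This is a minimal
illustration only, not the tool's implementation: the function name and the table are
hypothetical, and only the dsa and dh values listed in this section are included.

```python
DEFAULT_KEY_LENGTH = 2048

# Legal -keyLength values per algorithm, copied from the option text above.
# (Hypothetical table; only dsa and dh are shown here.)
LEGAL_KEY_LENGTHS = {
    "dsa": {1024, 2048, 3072},
    "dh": {768, 1024, 1536, 2048, 3072, 4096},
}


def effective_key_length(algo, requested):
    """Return the requested key length if legal, otherwise the default 2048."""
    if requested in LEGAL_KEY_LENGTHS[algo]:
        return requested
    return DEFAULT_KEY_LENGTH


print(effective_key_length("dsa", 3072))  # legal value is kept
print(effective_key_length("dsa", 4096))  # not legal for dsa, falls back to 2048
print(effective_key_length("dh", 768))    # legal value is kept
```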

        -padding=<paddingmode>
            RSA algorithm only, ignored for other algorithms.
            pkcs1: PKCS1 padding mode.
            oaep : OAEP padding mode.
            pss  : PSS padding mode.

        -numThreads=<num>
            Specifies the number of software threads to spawn. The default
            value is 2 (sw) or 150 (qa). Note that the number of outstanding
            requests required to "max out" the hardware is approximately 150.
            Note too that specifying numThreads=n is equivalent to specifying
            -minThreads=n and -maxThreads=n.

        -numIter=<numIterations>
            Specifies the number of iterations to perform. The default value
            is 10000 (sw) or 100000 (qa).

        -affinityMask=<mask>
            Specifies the CPU/core affinity mask for the threads. The
            affinity mask is interpreted as a bitmask, with each bit
            indicating a CPU core to which a thread should be affinitized,
            where core number c is represented as 2^c. Note that the mask
            is specified as a hexadecimal number, and must begin with the
            prefix "0x".
            For example, to run the software threads on cores 2 and 3,
            specify -affinityMask=0x0C (binary 00001100). Software threads
            are assigned to the cores in a round-robin fashion: the first
            software thread is assigned to the lowest numbered core, the
            second to the next lowest, and so on.
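
The mask arithmetic and round-robin assignment described above can be worked through with a
short sketch. The helper names are hypothetical and this is not the tool's own code; it only
reproduces the 2^c bit encoding and the lowest-core-first ordering stated in the text.

```python
def affinity_mask(cores):
    """Build a bitmask in which core number c contributes the value 2^c."""
    mask = 0
    for c in cores:
        mask |= 1 << c
    return mask


def cores_in_mask(mask):
    """List the core numbers whose bits are set, lowest first."""
    return [c for c in range(mask.bit_length()) if (mask >> c) & 1]


def assign_threads(mask, num_threads):
    """Assign thread i to a core round-robin, starting at the lowest core."""
    cores = cores_in_mask(mask)
    return [cores[i % len(cores)] for i in range(num_threads)]


# Cores 2 and 3 give 2^2 + 2^3 = 12, written with the required "0x" prefix.
print(f"-affinityMask=0x{affinity_mask([2, 3]):02X}")  # -affinityMask=0x0C

# Five threads on mask 0x0C alternate between cores 2 and 3.
print(assign_threads(0x0C, 5))  # [2, 3, 2, 3, 2]
```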

        -minThreads=<num>
            Specifies the minimum number of software threads. The benchmark
            will be performed using each number of software threads from
            minThreads to maxThreads. In this case, the value of -numThreads
            is ignored.

        -maxThreads=<num>
            Specifies the maximum number of software threads. The benchmark
            will be performed using each number of software threads from
            minThreads to maxThreads. In this case, the value of -numThreads
            is ignored. If minThreads is larger than maxThreads, both
            minThreads and maxThreads are set to the default value of 1.
            The maximum value allowed for maxThreads is 150.
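
The thread-count sweep described for -minThreads and -maxThreads can be sketched as below.
The function name is hypothetical. The sketch assumes both options are given, and it does not
model the 150 limit, because the text does not say how larger values are handled.

```python
def thread_counts(min_threads, max_threads):
    """Return each thread count the benchmark would run, in order."""
    if min_threads > max_threads:
        # Stated rule: an inverted range resets both values to 1.
        min_threads = max_threads = 1
    return list(range(min_threads, max_threads + 1))


print(thread_counts(2, 5))  # [2, 3, 4, 5]
print(thread_counts(8, 4))  # [1]

# -numThreads=n is described as equivalent to -minThreads=n and -maxThreads=n.
print(thread_counts(150, 150))  # [150]
```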

        -check
            Specifies that the software and hardware providers should both be
            executed exactly once each, and the results compared. This is a
            purely functional check. All other parameters are ignored in this
            case.

        -encrypt
            Measure performance of only the encryption operation for the
            specified algorithm test. By default, performance is measured
            over both encryption and decryption operations.

        -decrypt
            Measure performance of only the decryption operation for the
            specified algorithm test.

        -derivekey
            Measure performance of only the derive key operation for the
            specified algorithm (ECDH or DH) test. This option is not
            supported when using the software provider in kernel mode.

        -secretderive
            Measure performance of the secretagreement + derive key operation
            for the specified algorithm (ECDH or DH) test. This option is not
            supported when using the software provider in kernel mode.

        -finalizesecret
            Measure performance of the generate key + secretagreement
            operation for the specified algorithm (ECDH or DH) test.

        -finalizekey
            Measure performance of only the finalize key operation for the
            specified algorithm (ECDH or DH) test.

        -secretagreement
            Measure performance of only the secretagreement operation for the
            specified algorithm (ECDH or DH) test.

        -sign
            Measure performance of only the sign hash operation for the
            specified algorithm (ECDSA or DSA) test.

        -verify
            Measure performance of only the verify signature operation for the
            specified algorithm (ECDSA or DSA) test.

        -generatekey
            Measure performance of the generate key operation. This parameter
            is ignored if algo=DSA is specified.

        -debug
            Print debug messages.

Cngtest Sample Results
----------------------
The following is example cngtest output for asymmetric cryptography operations. It was produced by
the Intel(R) QuickAssist Accelerator Software script (Perf_user.bat), which tests user-mode
performance.

This script must be run in a Command Prompt or shell window with Administrator privileges.

(1) Test 100000 RSA decrypt operations using a key size of 2048:
Running in user mode...
    Time [ms]                         : 1029
    Number of iterations              : 100000
    RSA Decrypt Ops/s                 : 97181.73
    CPU core utilization percentage   : 9%
    CPU overall utilizaion percentage : 4%

(2) Test 100000 DSA signhash/verifysignature operations using a key size of 2048:
Running in user mode...
    Time [ms]                         : 563
    Number of iterations              : 100000
    DSA Sign Hash Ops/s               : 177619.89
    CPU core utilization percentage   : 0%
    CPU overall utilizaion percentage : 12%
    Time [ms]                         : 1044
    Number of iterations              : 100000
    DSA Verify Signature Ops/s        : 95785.44
    CPU core utilization percentage   : 0%
    CPU overall utilizaion percentage : 3%

(3) Test 100000 ECDSA signhash/verifysignature operations using the P-256 curve:
Running in user mode...
    Time [ms]                         : 673
    Number of iterations              : 100000
    ECDSA Sign Hash Ops/s             : 148588.41
    CPU core utilization percentage   : 1%
    CPU overall utilizaion percentage : 8%
    Time [ms]                         : 1371
    Number of iterations              : 100000
    ECDSA Verify Signature Ops/s      : 72939.46
    CPU core utilization percentage   : 0%
    CPU overall utilizaion percentage : 2%

(4) Test 100000 DH finalizekeypair/secretagreement operations using a key size of 2048:
Running in user mode...
    Time [ms]                         : 2950
    Number of iterations              : 100000
    DH Stage 1 Ops/s                  : 33898.31
    CPU core utilization percentage   : 0%
    CPU overall utilizaion percentage : 6%
    Time [ms]                         : 3708
    Number of iterations              : 100000
    DH Stage 2 Ops/s                  : 26968.72
    CPU core utilization percentage   : 0%
    CPU overall utilizaion percentage : 1%

(5) Test 100000 ECDH finalizekeypair/secretagreement operations using the P-256 curve:
Running in user mode...
    Time [ms]                         : 1724
    Number of iterations              : 100000
    ECDH Stage 1 Ops/s                : 58004.64
    CPU core utilization percentage   : 0%
    CPU overall utilizaion percentage : 11%
    Time [ms]                         : 914
    Number of iterations              : 100000
    ECDH Stage 2 Ops/s                : 109409.19
    CPU core utilization percentage   : 0%
    CPU overall utilizaion percentage : 11%
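
The Ops/s figures in the results above are consistent with dividing the iteration count by the
elapsed time in seconds. The following sketch (with a hypothetical helper name) shows that
arithmetic for checking or comparing runs; it is not taken from the tool's source.

```python
def ops_per_second(iterations, elapsed_ms):
    """Throughput in operations per second for a run timed in milliseconds."""
    return iterations * 1000 / elapsed_ms


# Test (1): 100000 RSA decrypt operations in 1029 ms.
print(f"{ops_per_second(100000, 1029):.2f}")  # 97181.73

# Test (2): 100000 DSA sign hash operations in 563 ms.
print(f"{ops_per_second(100000, 563):.2f}")  # 177619.89
```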

Rate Limiting
=============
Rate limiting is a feature of the Intel(R) QuickAssist Technology accelerator software that
enforces Service Level Agreements (SLAs). An SLA allocates a specified amount of acceleration
capacity to a specified service, such as symmetric cryptography (SYM), PKE (ASYM) or
compression (DC), at ring-pair or queue-pair (QP) granularity.

The rate limiting solution provides the following features:
 1. An SLA management API that is agnostic to the virtualization technology in use
 2. Ability to query rate limiting information for QAT instances in the guest or host
 3. Ability to query rate limiting information for QAT devices in the host
 4. Ability to configure rate limiting SLAs for service instances on a per-QP basis

Rate Limiting - API Programmer's Guide
======================================
To configure rate limits on your Intel(R) QuickAssist Technology device,
please refer to the Rate Limiting API Guide on Intel's QuickAssist Technology website.
