Releases: intel/neural-compressor

Intel Neural Compressor Release 3.4

23 May 12:09
  • Highlights
  • Features
  • Improvements
  • Bug Fixes
  • Validated Hardware
  • Validated Configurations

Highlights

  • Aligned with Gaudi SW Release 1.21, with improvements to FP8 and INT4 quantization for the Intel® Gaudi® AI accelerator
  • INT4 quantization enhancements for Intel CPU/GPU

Features

  • Support expert parallelism for the Mixtral model on Gaudi
  • Enhance multi-card FP8 model save and load on Gaudi
  • Enable static FP8 quantization of the DeepSeek V3/R1 model on Gaudi (see the sketch after this list)
  • Support W4A8 mixed precision on Gaudi (experimental)
  • Improve compile time on Gaudi when using FP8 (experimental)
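
For context, a minimal sketch of how static FP8 quantization is driven through the 3.x PyTorch API on Gaudi, following the documented prepare/calibrate/convert flow; the model id, calibration prompt, and device handling are illustrative assumptions, not taken from this release.

```python
# Hedged sketch of the 3.x FP8 flow on Gaudi: prepare -> calibrate -> convert.
# Assumes a Gaudi host with the SynapseAI PyTorch bridge installed; the model
# id and calibration prompt below are placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from neural_compressor.torch.quantization import FP8Config, prepare, convert

model_id = "mistralai/Mixtral-8x7B-v0.1"  # placeholder checkpoint
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = model.to("hpu").eval()

config = FP8Config(fp8_config="E4M3")  # E4M3 is the common FP8 format on Gaudi
model = prepare(model, config)         # insert observers for measurement

inputs = tokenizer("a short calibration prompt", return_tensors="pt").to("hpu")
with torch.no_grad():
    model(**inputs)                    # one or more calibration passes

model = convert(model)                 # swap measured modules for FP8 kernels
```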

Improvements

  • Remove numpy version limit for 3.x PyTorch package

Bug Fixes

  • Fix graph compile error when quantizing Llama3.2 11B/90B vision model on Gaudi
  • Fix segmentation fault issue in Llama2-70B INT4 model on Intel GPU
  • Fix accuracy issue caused by duplicated g_idx update for INT4 model on Intel GPU

Validated Hardware

  • Intel Gaudi AI Accelerators (Gaudi 2 and 3)
  • Intel Xeon Scalable processor (4th, 5th, 6th Gen)
  • Intel Core Ultra Processors (Series 1 and 2)
  • Intel Data Center GPU Max Series (1100)
  • Intel® Arc™ B-Series Graphics GPU (B580)

Validated Configurations

  • CentOS 8.4 & Ubuntu 24.04 & Windows 11
  • Python 3.9, 3.10, 3.11, 3.12
  • PyTorch/IPEX 2.4, 2.5, 2.6

Intel Neural Compressor Release 3.3

04 Mar 08:55
679def0
  • Highlights
  • Features
  • Improvements
  • Bug Fixes
  • Validated Hardware
  • Validated Configurations

Highlights

  • Aligned with Gaudi SW Release 1.20, with improvements to FP8 and INT4 quantization for the Intel® Gaudi® AI accelerator
  • VLM INT4 weight-only quantization support in the Transformers-like API on Intel CPU/GPU

Features

  • Saving vLLM-compatible FP8 models on Gaudi
  • FP8 per-channel Q/DQ and GC integration on Gaudi
  • FP8 quantization for mixture-of-experts (MoE) modules on Gaudi
  • Saving Hugging Face-compatible weight-only INT4 format on Gaudi
  • VLM quantization with AutoRound in the Transformers-like API on Intel CPU/GPU (see the sketch after this list)
  • Accuracy-aware tuning on PT2E, including mixed-precision support
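
As referenced above, a hedged sketch of VLM INT4 quantization through the Transformers-like API; the checkpoint id and the `AutoRoundConfig` argument values are illustrative assumptions.

```python
# Hedged sketch: INT4 weight-only quantization of a VLM with AutoRound via the
# Transformers-like API. The checkpoint id and knob values are placeholders;
# some VLMs may need a model class other than AutoModelForCausalLM.
from neural_compressor.transformers import AutoModelForCausalLM, AutoRoundConfig

quant_config = AutoRoundConfig(bits=4, group_size=128)  # illustrative values
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen-VL-Chat",              # placeholder VLM checkpoint
    quantization_config=quant_config,
    trust_remote_code=True,
)
model.save_pretrained("Qwen-VL-Chat-INT4")  # assumes HF-style saving is supported
```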

Improvements

  • FP8 multi-device (Gaudi & GPU) infrastructure support
  • Support saving scalar scales on Gaudi

Bug Fixes

  • Fix incorrect hf_device_map setting for Transformers-like API
  • Fix missing IPEX CPU dependency in Transformers-like API example
  • Fix device mapping issue found in GPTQ on Llama model
  • Fix saving issue in weight-only per-channel quantization

Validated Hardware

  • Intel Gaudi AI Accelerators (Gaudi 2 and 3)
  • Intel Xeon Scalable processor (4th, 5th, 6th Gen)
  • Intel Core Ultra Processors (Series 1 and 2)
  • Intel Data Center GPU Max Series (1100)
  • Intel® Arc™ B-Series Graphics GPU (B580)

Validated Configurations

  • CentOS 8.4 & Ubuntu 24.04 & Windows 11
  • Python 3.9, 3.10, 3.11, 3.12
  • PyTorch/IPEX 2.3, 2.4, 2.5

Intel Neural Compressor Release 3.2

28 Dec 13:17
  • Highlights
  • Features
  • Improvements
  • Bug Fixes
  • Validated Hardware
  • Validated Configurations

Highlights

  • Aligned with the Habana 1.19 release, with improvements to FP8 and INT4 quantization for the Intel® Gaudi® AI accelerator
  • INT4 weight-only quantization on Intel® Arc™ B-Series Graphics GPU (code-named BattleMage)

Features

  • Saving and loading FP8 checkpoints on Gaudi
  • Loading vLLM/llm-compressor-compatible FP8 checkpoints on Gaudi (see the sketch after this list)
  • Arbitrary scale method support on Gaudi
  • AutoRound INT4 weight-only quantization on Gaudi
  • Block-wise calibration for LLMs on Gaudi
  • INT4 weight-only quantization on BattleMage
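
As noted in the list above, loading an externally produced FP8 checkpoint is a one-call flow in the 3.x PyTorch API; the checkpoint id below is a placeholder, and the exact `load()` keywords are an assumption based on the INC documentation.

```python
# Hedged sketch: load a vLLM/llm-compressor-style FP8 checkpoint for Gaudi.
# The checkpoint id is hypothetical; format="huggingface" asks INC to parse an
# HF-hosted quantized checkpoint rather than an INC-native one.
from neural_compressor.torch.quantization import load

model = load(
    model_name_or_path="some-org/llama-3-8b-fp8",  # placeholder checkpoint
    format="huggingface",
    device="hpu",
)
```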

Improvements

  • Improve FP8 performance by setting scale as scalar tensor on Gaudi
  • Integrate AutoRound 0.4.2 with VLM quantization improvements
  • Improve safetensors loading for layer-wise quantization in the Transformers-like API
  • Improve non-contiguous weight saving in the Transformers-like API

Bug Fixes

  • Fix layer-wise quantization issue in GPTQ on client GPU
  • Fix glm-4-9b model out-of-memory issue on BattleMage

Validated Hardware

  • Intel Gaudi AI Accelerators (Gaudi 2 and 3)
  • Intel Xeon Scalable processor (4th, 5th, 6th Gen)
  • Intel Core Ultra Processors (Series 1 and 2)
  • Intel Data Center GPU Max Series (1100)
  • Intel® Arc™ B-Series Graphics GPU (B580)

Validated Configurations

  • CentOS 8.4 & Ubuntu 22.04 & Windows 11
  • Python 3.9, 3.10, 3.11, 3.12
  • PyTorch/IPEX 2.3, 2.4, 2.5

Intel Neural Compressor Release 3.1

25 Oct 08:18
  • Highlights
  • Features
  • Improvements
  • Validated Hardware
  • Validated Configurations

Highlights

  • Aligned with the Habana 1.18 release, with improvements to FP8 and INT4 quantization for the Intel® Gaudi® AI accelerator
  • Provided a Transformers-like quantization API for weight-only quantization of LLMs, offering Transformers users a one-stop experience for quantization and inference with IPEX on Intel GPU and CPU

Features

  • Add a Transformers-like quantization API for weight-only quantization of LLMs (see the sketch after this list)
  • Support fast quantization with a lightweight recipe and layer-wise approach on Intel AI PC
  • Support INT4 quantization of Visual Language Models (VLMs), such as Llava, Phi-3-vision, and Qwen-VL, with the AutoRound algorithm
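
A minimal sketch of the new Transformers-like API using the documented `RtnConfig`; the checkpoint id and knob values are illustrative.

```python
# Hedged sketch: INT4 weight-only quantization plus generation through the
# Transformers-like API. Checkpoint id and knob values are placeholders.
from transformers import AutoTokenizer
from neural_compressor.transformers import AutoModelForCausalLM, RtnConfig

model_id = "facebook/opt-125m"                   # placeholder checkpoint
quant_config = RtnConfig(bits=4, group_size=32)  # illustrative values
model = AutoModelForCausalLM.from_pretrained(
    model_id, quantization_config=quant_config
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
inputs = tokenizer("Hello, my name is", return_tensors="pt")
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=16)[0]))
```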

Improvements

  • Support AWQ-format INT4 model loading and conversion for IPEX inference in the Transformers-like API
  • Enable AutoRound-format export for INT4 models
  • Support per-channel INT8 post-training quantization for PT2E (see the sketch after this list)
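
The PT2E path builds on PyTorch's torch.export-based quantization. The sketch below shows the underlying prepare/calibrate/convert steps with PyTorch's own PT2E API and the x86 Inductor quantizer (whose default config quantizes weights per-channel), not INC's wrapper around it.

```python
# Hedged sketch of the PT2E flow INC builds on, using PyTorch's own API
# (PyTorch 2.2-2.4 style capture; newer releases use torch.export.export_for_training).
import torch
from torch._export import capture_pre_autograd_graph
from torch.ao.quantization.quantize_pt2e import prepare_pt2e, convert_pt2e
from torch.ao.quantization.quantizer.x86_inductor_quantizer import (
    X86InductorQuantizer,
    get_default_x86_inductor_quantization_config,
)

model = torch.nn.Sequential(torch.nn.Linear(16, 16), torch.nn.ReLU()).eval()
example_inputs = (torch.randn(1, 16),)

captured = capture_pre_autograd_graph(model, example_inputs)
quantizer = X86InductorQuantizer()
quantizer.set_global(
    get_default_x86_inductor_quantization_config()  # per-channel weights by default
)
prepared = prepare_pt2e(captured, quantizer)
prepared(*example_inputs)            # calibration pass records min/max stats
quantized = convert_pt2e(prepared)   # lower to a statically quantized graph
```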

Validated Hardware

  • Intel Gaudi AI Accelerators (Gaudi 2 and 3)
  • Intel Xeon Scalable processor (4th, 5th, 6th Gen)
  • Intel Core Ultra Processors (Series 1 and 2)
  • Intel Data Center GPU Max Series (1100)

Validated Configurations

  • CentOS 8.4 & Ubuntu 22.04 & Windows 11
  • Python 3.9, 3.10, 3.11, 3.12
  • PyTorch/IPEX 2.2, 2.3, 2.4

Intel® Neural Compressor v3.0 Release

12 Aug 04:09
7056720
  • Highlights
  • Features
  • Improvements
  • Examples
  • Bug Fixes
  • Documentations
  • Validated Configurations

Highlights

  • FP8 quantization and INT4 model loading support on Intel® Gaudi® AI accelerator
  • Framework extension API for quantization, mixed precision, and benchmarking (see the sketch after this list)
  • Accuracy-aware FP16 mixed precision support on Intel® Xeon® 6 Processors
  • Performance optimizations and usability improvements on client-side quantization
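
As flagged above, a minimal sketch of the framework extension API's config-driven flow, shown here with RTN weight-only quantization on a toy module; the knob values are illustrative.

```python
# Hedged sketch of the 3.x framework extension API: a config object plus
# prepare/convert entry points. RTN needs no calibration data, so convert can
# follow prepare directly. The toy module and knob values are placeholders.
import torch
from neural_compressor.torch.quantization import RTNConfig, prepare, convert

model = torch.nn.Sequential(torch.nn.Linear(64, 64)).eval()  # toy stand-in
quant_config = RTNConfig(bits=4, group_size=32)
model = prepare(model, quant_config)
model = convert(model)  # Linear weights are rounded to INT4
```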

Features

Improvements

  • [Quantization] Integrate AutoRound v0.3 (bfa27e, [fd9...

Intel® Neural Compressor v2.6 Release

14 Jun 13:55
2928d85
  • Highlights
  • Features
  • Improvements
  • Examples
  • Bug Fixes
  • External Contributions
  • Validated Configurations

Highlights

  • Integrated recent AutoRound with lm-head quantization support and calibration process optimizations
  • Migrated ONNX model quantization capability into the ONNX project's Neural Compressor

Features

  • [Quantization] Integrate recent AutoRound with lm-head quantization support and calibration process optimizations (4728fd)
  • [Quantization] Support the true_sequential option in GPTQ (92c942) (see the sketch after this list)
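
A hedged 2.x-API sketch of the true-sequential option; passing it through a `gptq_args` recipe mirrors how the other GPTQ knobs are exposed in the 2.x API and is an assumption here, as are the toy model and calibration data.

```python
# Hedged sketch: enabling GPTQ's true_sequential through the 2.x API.
# The gptq_args recipe key and the toy model/dataloader are assumptions.
import torch
from torch.utils.data import DataLoader, TensorDataset
from neural_compressor import PostTrainingQuantConfig, quantization

model = torch.nn.Sequential(torch.nn.Linear(32, 32)).eval()  # toy stand-in
calib = DataLoader(TensorDataset(torch.randn(8, 32), torch.zeros(8)), batch_size=1)

conf = PostTrainingQuantConfig(
    approach="weight_only",
    op_type_dict={".*": {"weight": {"bits": 4, "group_size": 128,
                                    "algorithm": "GPTQ"}}},
    recipes={"gptq_args": {"true_sequential": True}},
)
q_model = quantization.fit(model, conf, calib_dataloader=calib)
```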

Improvements

  • [Quantization] Improve WOQ Linear pack/unpack speed with a numpy implementation (daa143)
  • [Quantization] Auto-detect the available device when exporting (7be355)
  • [Quantization] Refine AutoRound export to support Intel GPU (409231)
  • [Benchmarking] Detect the number of sockets when needed (e54b93)

Examples

  • Upgrade lm_eval to 0.4.2 in PT and ORT LLM examples (fdb509) (54f039)
  • Add diffusers/dreambooth example with IPEX (ba4798)

Bug Fixes

  • Fix incorrect dtype of unpacked tensor issue in PT (29fdec)
  • Fix TF LLM SQ legacy Keras environment variable issue (276449)
  • Fix TF estimator issue by adding version check on TF2.16 (855b98)
  • Fix missing tokenizer issue in run_clm_no_trainer.py after using lm-eval 0.4.2 (d64029)
  • Fix AWQ padding issue in ORT (903da4)
  • Fix recover function issue in ORT (ee24db)
  • Update model ckpt download url in prepare_model.py (0ba573)
  • Fix case where pad_max_length set to None (960bd2)
  • Fix a failure for GPU backend (71a9f3)
  • Fix numpy versions for rnnt and 3d-unet examples (12b8f4)
  • Fix CVEs (5b5579) (25c71a) (47d73b) (41da74)

External Contributions

  • Update model ckpt download url in prepare_model.py (0ba573)
  • Fix case where pad_max_length set to None (960bd2)
  • Add diffusers/dreambooth example with IPEX (ba4798)

Validated Configurations

  • CentOS 8.4 & Ubuntu 22.04 & Windows 11 & macOS Ventura 13.5
  • Python 3.8, 3.9, 3.10, 3.11
  • PyTorch/IPEX 2.1, 2.2, 2.3
  • TensorFlow 2.14, 2.15, 2.16
  • ITEX 2.13.0, 2.14.0, 2.15.0
  • ONNX Runtime 1.16, 1.17, 1.18

Intel® Neural Compressor v2.5.1 Release

03 Apr 14:03
  • Improvement
  • Bug Fixes
  • Validated Configurations

Improvement

  • Improve WOQ AutoRound export (409231, 7ee721)
  • Adapt example evaluation to the ITREX v1.4 release (9d7a05)
  • Update more supported LLM recipes (ce9b16)

Bug Fixes

  • Fix WOQ RTN supported layer checking condition (079177)
  • Fix in-place processing error in quant_weight function (92533a)

Validated Configurations

  • CentOS 8.4 & Ubuntu 22.04
  • Python 3.10
  • TensorFlow 2.15
  • ITEX 2.14.0
  • PyTorch/IPEX 2.2
  • ONNX Runtime 1.17

Intel® Neural Compressor v2.5 Release

26 Mar 10:21
24419c9
  • Highlights
  • Features
  • Improvement
  • Productivity
  • Bug Fixes
  • External Contributions
  • Validated Configurations

Highlights

  • Integrated the Weight-Only Quantization algorithm AutoRound and verified it on Gaudi2, Intel CPU, and NVIDIA GPU
  • Applied SmoothQuant & Weight-Only Quantization algorithms to 15+ popular LLMs for INT8 & INT4 quantization and published the recipes

Features

  • [Quantization] Integrate the Weight-Only Quantization algorithm AutoRound (5c7f33, dfd083, 9a7ddd, cf1de7) (see the sketch after this list)
  • [Quantization] Quantize weights with in-place mode in Weight-Only Quantization (deb1ed)
  • [Pruning] Enable SNIP on multiple cards using DeepSpeed ZeRO-3 (49ab28)
  • [Pruning] Support new pruning approaches Wanda and DSnoT for PyTorch LLMs (7a3671)
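
A hedged sketch of enabling AutoRound through the 2.x weight-only config; the "AUTOROUND" algorithm string follows the documented weight-only pattern but is an assumption here, as are the toy model and calibration data.

```python
# Hedged sketch: AutoRound weight-only quantization via the 2.x API. The
# "AUTOROUND" algorithm value and toy model/dataloader are assumptions.
import torch
from torch.utils.data import DataLoader, TensorDataset
from neural_compressor import PostTrainingQuantConfig, quantization

model = torch.nn.Sequential(torch.nn.Linear(32, 32)).eval()  # toy stand-in
calib = DataLoader(TensorDataset(torch.randn(8, 32), torch.zeros(8)), batch_size=1)

conf = PostTrainingQuantConfig(
    approach="weight_only",
    op_type_dict={".*": {"weight": {"bits": 4, "group_size": 128,
                                    "algorithm": "AUTOROUND"}}},
)
q_model = quantization.fit(model, conf, calib_dataloader=calib)
```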

Improvement

  • [Quantization] SmoothQuant code structure refactor (a8d81c)
  • [Quantization] Optimize the workflow of parsing Keras model (b816d7)
  • [Quantization] Support static_groups options in GPTQ API (1c426a)
  • [Quantization] Update TEQ train dataloader (d1e994)
  • [Quantization] WeightOnlyLinear keeps self.weight after recover (2835bd)
  • [Quantization] Add version condition for IPEX prepare init (d96e14)
  • [Quantization] Enhance the ORT node name checking (f1597a)
  • [Pruning] Stop the tuning process early when enabling smooth quant (844a03)

Productivity

  • ORT LLM examples support latest optimum version (26b260)
  • Add coding style docs and recommended VS Code setting (c1f23c)
  • Adapt transformers 4.37 loading (6133f4)
  • Upgrade pre-commit checker for black/blacken-docs/ruff (7763ed)
  • Support CI summary in PR comments (d4bcdd)
  • Notebook example update to install latest INC & TF, add metric in fit (4239d3)

Bug Fixes

  • Fix QA IPEX example FP32 input issue (c4de19)
  • Update conditions for getting min-max during TF MatMul requantize (d07175)
  • Fix TF saved_model issues (d8e60b)
  • Fix comparison of module_type and MulLinear (ba3aba)
  • Fix ORT calibration issue (cd6d24)
  • Fix ORT example bart export failure (b0dc0d)
  • Fix TF example accuracy diff during benchmark and quantization (5943ea)
  • Fix bugs for GPTQ exporting with static_groups (b4e37b)
  • Fix ORT quant issue caused by tensors having same name (0a20f3)
  • Fix Neural Solution SQL/CMD injection (14b7b0)
  • Fix the best qmodel recovery issue (f2d9b7)
  • Fix logger issue (83bc77)
  • Store token in protected file (c6f9cc)
  • Define the default SSL context (b08725)
  • Fix IPEX stats bug (5af383)
  • Fix ORT calibration for Dml EP (c58aea)
  • Fix wrong socket number retrieval for non-english system (5b2a88)
  • Fix trust remote for llm examples (2f2c9a)

External Contributions

  • Intel Mac support (21cfeb)
  • Add PTQ example for PyTorch CV Segment Anything Model (bd5e69)

Validated Configurations

  • CentOS 8.4 & Ubuntu 22.04 & Windows 11 & macOS Ventura 13.5
  • Python 3.8, 3.9, 3.10, 3.11
  • TensorFlow 2.13, 2.14, 2.15
  • ITEX 2.13.0, 2.14.0
  • PyTorch/IPEX 2.0, 2.1, 2.2
  • ONNX Runtime 1.15, 1.16, 1.17

Intel® Neural Compressor v2.4.1 Release

29 Dec 13:12
b8c7f1a
  • Improvement
  • Bug Fixes
  • Examples
  • Validated Configurations

Improvement

  • Narrow down the tuning space of SmoothQuant auto-tune (9600e1)
  • Support ONNXRT Weight-Only Quantization with different dtypes (5119fc)
  • Add progress bar for ONNXRT Weight-Only Quantization and SmoothQuant (4d26e3)

Bug Fixes

  • Fix SmoothQuant alpha-space generation (33ece9)
  • Fix inputs error for SmoothQuant example_inputs (39f63a)
  • Fix LLMs accuracy regression with IPEX 2.1.100 (3cb6d3)
  • Fix quantizable add ops detection on IPEX backend (4c004d)
  • Fix range step bug in ORTSmoothQuant (40275c)
  • Fix unit test bugs and update CI versions (6c78df, 835805)
  • Fix notebook issues (08221e)

Examples

  • Add verified LLMs list and recipes for SmoothQuant and Weight-Only Quantization (f19cc9)
  • Add code-generation evaluation for Weight-Only Quantization GPTQ (763440)

Validated Configurations

  • CentOS 8.4 & Ubuntu 22.04
  • Python 3.10
  • TensorFlow 2.14
  • ITEX 2.14.0.1
  • PyTorch/IPEX 2.1.0
  • ONNX Runtime 1.16.3

Intel® Neural Compressor v2.4 Release

17 Dec 03:26
111b3ce
  • Highlights
  • Features
  • Improvement
  • Productivity
  • Bug Fixes
  • Examples
  • Validated Configurations

Highlights

  • Supported layer-wise quantization for PyTorch RTN/GPTQ Weight-Only Quantization and ONNX Runtime W8A8 quantization
  • Supported Weight-Only Quantization tuning for the ONNX Runtime backend
  • Supported GGML double quant on RTN/GPTQ Weight-Only Quantization with the FW extension API
  • Supported SmoothQuant of big SavedModel for the TensorFlow backend

Features

  • [Quantization] Support GGML double quant in Weight-Only Quantization for RTN and GPTQ (05c15a)
  • [Quantization] Support Weight-Only Quantization tuning for the ONNX Runtime backend (6d4ea5, 934ba0, 4fcfdf)
  • [Quantization] Support SmoothQuant block-wise alpha-tuning (ee6bc2) (see the sketch after this list)
  • [Quantization] Support SmoothQuant of big SavedModel for the TensorFlow backend (3b2925, 4f2c35)
  • [Quantization] Support PyTorch layer-wise quantization for GPTQ (ee5450)
  • [Quantization] Support PyTorch layer-wise quantization for RTN (ebd1e2)
  • [Quantization] Support ONNX Runtime layer-wise W8A8 quantization (6142e4, 5d33a5)
  • [Common] [Experimental] FW extension API implementation (76b8b3, 8447d7, 258236)
  • [Quantization] [Experimental] FW extension API for the PT backend to support Weight-Only Quantization (915018, dc9328)
  • [Quantization] [Experimental] FW extension API for the TF backend to support Keras quantization (2627d3)
  • [Quantization] IPEX 2.1 XPU (CPU+GPU) support (af0b50, cf847c)
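
As referenced in the alpha-tuning item above, SmoothQuant is switched on through recipes in the 2.x API; setting alpha to "auto" engages the automatic alpha search (which this release extends block-wise). The toy model and calibration data are placeholders.

```python
# Hedged sketch: SmoothQuant via 2.x recipes; alpha="auto" triggers the
# automatic alpha search. Toy model and calibration data are placeholders.
import torch
from torch.utils.data import DataLoader, TensorDataset
from neural_compressor import PostTrainingQuantConfig, quantization

model = torch.nn.Sequential(torch.nn.Linear(32, 32)).eval()  # toy stand-in
calib = DataLoader(TensorDataset(torch.randn(8, 32), torch.zeros(8)), batch_size=1)

conf = PostTrainingQuantConfig(
    recipes={"smooth_quant": True, "smooth_quant_args": {"alpha": "auto"}},
)
q_model = quantization.fit(model, conf, calib_dataloader=calib)
```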

Improvement

  • [Quantization] Add use_optimum_format for export_compressed_model in Weight-Only Quantization (5179da, 0a0644)
  • [Quantization] Enhance ONNX Runtime quantization with the DirectML EP (db0fef, d13183, 098401, 6cad50)
  • [Quantization] Support restoring IPEX models from JSON (c3214c)
  • [Quantization] Add attr to MatMulNBits in ONNX Runtime (7057e3)
  • [Quantization] Increase SmoothQuant auto-alpha running speed (173c18)
  • [Quantization] Add SmoothQuant alpha search space as a config argument (f9663d)
  • [Quantization] Add SmoothQuant weight_clipping as a default-on option (1f4aec)
  • [Quantization] Support SmoothQuant with MinMaxObserver (45b496)
  • [Quantization] Support Weight-Only Quantization with FP16 for the PyTorch backend (d5cb56)
  • [Quantization] Support trace with dictionary-type example_inputs (afe315)
  • [Quantization] Support Falcon Weight-Only Quantization (595d3a)
  • [Common] Add deprecation decorator in the experimental fold (aeb3ed)
  • [Common] Remove 1.x API dependency (ee617a)
  • [Mixed Precision] Support PyTorch eager-mode BF16 mixed precision (3bfb76) (see the sketch after this list)
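
For the eager-mode BF16 item above, a minimal sketch of the documented 2.x mixed-precision entry point; the toy module is a placeholder, and the default config targets BF16 where hardware and framework support it.

```python
# Hedged sketch: 2.x mixed-precision conversion to BF16 in eager mode.
# The toy module is a placeholder; MixedPrecisionConfig defaults to BF16.
import torch
from neural_compressor import mix_precision
from neural_compressor.config import MixedPrecisionConfig

model = torch.nn.Sequential(torch.nn.Linear(16, 16)).eval()  # toy stand-in
conf = MixedPrecisionConfig()
converted = mix_precision.fit(model, conf=conf)
```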

Productivity

  • Support quantization and benchmark on macOS (16d6a0)
  • Support ONNX Runtime 1.16.0 (d81732, 299af9, 753783)
  • Support TensorFlow new API for gnr-base (8160c7)

Bug Fixes

  • Fix "GraphModule object has no attribute bias" error (7f53d1)
  • Fix ONNX model export issue (af0aea, eaa57f)
  • Add clip for ONNX Runtime SmoothQuant (cbb69b)
  • Fix SmoothQuant minmax observer init (b1db1c)
  • Fix SmoothQuant issue in get/set_module (dffcfe)
  • Align sparsity with block-wise masks in progressive pruning (fcdc29)

Examples

  • Support PEFT models with SmoothQuant (5e21b7)
  • Enable two ONNX Runtime examples: table-transformer-detection (550cee) and BEiT (7265df)

Validated Configurations

  • CentOS 8.4 & Ubuntu 22.04 & Windows 10 & macOS Ventura 13.5
  • Python 3.8, 3.9, 3.10, 3.11
  • TensorFlow 2.13, 2.14, 2.15
  • ITEX 1.2.0, 2.13.0.0, 2.14.0.1
  • PyTorch/IPEX 1.13.0+cpu, 2.0.1+cpu, 2.1.0
  • ONNX Runtime 1.14.1, 1.15.1, 1.16.3
  • MXNet 1.9.1