  1. SonarQube
  2. SONAR-6100

Improve support for files whose encoding differs from the default encoding

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 6.4
    • Component/s: Scanner
    • Labels:
      None

      Description

      The Roslyn and Microsoft C++ compilers each have their own way of detecting a file's encoding, even when there is no BOM:
      https://blogs.msdn.microsoft.com/vcblog/2016/02/22/new-options-for-managing-character-sets-in-the-microsoft-cc-compiler/

      The Roslyn compiler also treats a file as binary when it finds two consecutive NUL (U+0000) characters. See:
      https://github.com/dotnet/roslyn/blob/32c80e871735c86ab02262f6704205ab11b43f57/src/Compilers/Core/Portable/Text/SourceText.cs#L325
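
The check described above could be sketched as follows. This is only an illustration: `looks_binary` is a hypothetical name, and the sketch assumes the test runs on already-decoded characters rather than on raw bytes.

```python
def looks_binary(text: str) -> bool:
    """Return True if the decoded text contains two consecutive NUL characters."""
    # A single NUL is tolerated; only a pair of adjacent U+0000 characters
    # marks the content as binary in this sketch.
    return "\x00\x00" in text
```

For example, `looks_binary("a\x00\x00b")` is true, while a string with isolated NULs such as `"a\x00b\x00"` is not flagged.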

      Regarding encoding, when there is no BOM, we could use the following algorithm:
      https://www.autoitconsulting.com/site/development/utf-8-utf-16-text-encoding-detection-library/

      Use the following strategy, in order:

      • Check whether the file starts with a BOM;
      • Check whether the first 4 KB are valid UTF-8 (this seems to be safe);
      • Check whether the first 4 KB are valid UTF-16 and look like UTF-16 according to heuristics;
      • Otherwise, use the user configuration (sonar.sourceEncoding) if present; if not, fall back to the default platform encoding (with character replacement when possible, and a warning).
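
The four steps above could be sketched roughly as below. This is an illustration under stated assumptions, not the scanner's implementation: the function names, the NUL-ratio thresholds used as the UTF-16 heuristic, and the choice to let any NUL byte disqualify UTF-8 (so that ASCII-range UTF-16 text reaches step 3) are all hypothetical. `user_encoding` stands in for the value of sonar.sourceEncoding.

```python
import codecs
import locale
import warnings

SAMPLE_SIZE = 4096  # the "first 4 KB" inspected by steps 2 and 3

# UTF-32 BOMs are tested before UTF-16 ones because FF FE is a prefix of FF FE 00 00.
BOMS = [
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]


def _decodes(sample, encoding, complete):
    """True if `sample` decodes cleanly; a truncated tail is tolerated
    when the sample is only a prefix of the file (complete=False)."""
    decoder = codecs.getincrementaldecoder(encoding)()
    try:
        decoder.decode(sample, final=complete)
    except UnicodeDecodeError:
        return False
    return True


def _guess_utf16(sample):
    """Assumed heuristic: mostly-Latin UTF-16 text has NUL bytes on one
    side of each byte pair and almost none on the other."""
    even, odd = sample[0::2], sample[1::2]
    if not even or not odd:
        return None
    even_nul = even.count(0) / len(even)
    odd_nul = odd.count(0) / len(odd)
    if odd_nul > 0.4 and even_nul < 0.1:
        return "utf-16-le"
    if even_nul > 0.4 and odd_nul < 0.1:
        return "utf-16-be"
    return None


def detect_encoding(data, user_encoding=None):
    """Return a Python codec name for the bytes in `data`."""
    # Step 1: BOM.
    for bom, name in BOMS:
        if data.startswith(bom):
            return name
    sample = data[:SAMPLE_SIZE]
    complete = len(data) <= SAMPLE_SIZE
    # Step 2: valid UTF-8 (assumption: NUL bytes rule it out).
    if b"\x00" not in sample and _decodes(sample, "utf-8", complete):
        return "utf-8"
    # Step 3: looks like UTF-16 and decodes as such.
    guess = _guess_utf16(sample)
    if guess and _decodes(sample, guess, complete):
        return guess
    # Step 4: user configuration, else platform default with a warning.
    if user_encoding:
        return user_encoding
    default = locale.getpreferredencoding(False)
    warnings.warn(f"Encoding not detected, falling back to {default}")
    return default


def decode_source(data, user_encoding=None):
    """Decode with the detected encoding, replacing undecodable bytes."""
    return data.decode(detect_encoding(data, user_encoding), errors="replace")
```

With this sketch, `detect_encoding("hello world".encode("utf-16-le"))` yields `"utf-16-le"`, and bytes such as `b"caf\xe9"` that are neither valid UTF-8 nor UTF-16-like fall through to the configured encoding. Using an incremental decoder keeps a multi-byte character cut off at the 4 KB boundary from being reported as invalid.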

              People

              Assignee:
              duarte.meneses Duarte Meneses
              Reporter:
              julien.henry Julien Henry
              Votes:
              3
              Watchers:
              5