  1. SonarQube
  2. SONAR-6100

Improve support for files whose encoding differs from the default encoding

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 6.4
    • Component/s: Scanner
    • Labels:
      None

      Description

      The Roslyn and Microsoft C++ compilers each have their own way of detecting a file's encoding, even when there is no BOM:
      https://blogs.msdn.microsoft.com/vcblog/2016/02/22/new-options-for-managing-character-sets-in-the-microsoft-cc-compiler/

      The Roslyn compiler also detects binary files: a file is treated as binary when two consecutive NUL (U+0000) characters are found. See:
      https://github.com/dotnet/roslyn/blob/32c80e871735c86ab02262f6704205ab11b43f57/src/Compilers/Core/Portable/Text/SourceText.cs#L325
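      As a rough illustration of that check (a minimal sketch, not Roslyn's actual code; the function name `looks_binary` is hypothetical), the test can be run on the already-decoded text:

```python
def looks_binary(text: str) -> bool:
    """Return True if the decoded text contains two consecutive NUL (U+0000) characters.

    Hypothetical helper for illustration; it only mirrors the rule described
    in the issue, not any real compiler API.
    """
    return "\x00\x00" in text
```

      Isolated NUL characters do not trigger the check in this sketch; only an adjacent pair does.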

      Regarding encoding, when there is no BOM, we could use the following algorithm:
      https://www.autoitconsulting.com/site/development/utf-8-utf-16-text-encoding-detection-library/

      Use the following strategy:

      • Check whether there is a BOM;
      • Check whether the first 4k bytes are valid UTF-8 (this seems to be safe);
      • Check whether the first 4k bytes are valid UTF-16 and look like UTF-16 based on heuristics;
      • Use the user configuration (sonar.sourceEncoding) if present; otherwise use the default platform encoding (with character replacement when possible, and a warning).
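      The strategy above could be sketched roughly as follows. This is only an illustration under stated assumptions, not the scanner's implementation: the names `detect_encoding`, `_decodes` and `_guess_utf16` are hypothetical, the NUL-position thresholds for UTF-16 are arbitrary, and rejecting NUL bytes in the UTF-8 step is an added assumption so that BOM-less UTF-16 text made of ASCII characters is not accepted as UTF-8.

```python
import codecs
import locale
import warnings
from typing import Optional

SAMPLE_SIZE = 4096  # the "first 4k bytes" from the strategy

# UTF-32 BOMs are tested before UTF-16 because the UTF-32 LE BOM
# starts with the same two bytes as the UTF-16 LE BOM.
_BOMS = [
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]


def _decodes(sample: bytes, encoding: str, complete: bool) -> bool:
    """True if the sample decodes cleanly. When the sample is a truncated
    prefix (complete=False), an unfinished trailing sequence is tolerated."""
    decoder = codecs.getincrementaldecoder(encoding)()
    try:
        decoder.decode(sample, final=complete)
        return True
    except UnicodeDecodeError:
        return False


def _guess_utf16(sample: bytes) -> Optional[str]:
    """Heuristic (assumed thresholds): mostly-ASCII UTF-16 text has NUL
    bytes concentrated at odd offsets (LE) or even offsets (BE)."""
    pairs = len(sample) // 2
    if pairs == 0:
        return None
    even_nuls = sample[0:pairs * 2:2].count(b"\x00")
    odd_nuls = sample[1:pairs * 2:2].count(b"\x00")
    if odd_nuls > 0.4 * pairs and even_nuls < 0.1 * pairs:
        return "utf-16-le"
    if even_nuls > 0.4 * pairs and odd_nuls < 0.1 * pairs:
        return "utf-16-be"
    return None


def detect_encoding(data: bytes, configured: Optional[str] = None) -> str:
    # Step 1: BOM.
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name
    sample = data[:SAMPLE_SIZE]
    complete = len(data) <= SAMPLE_SIZE
    # Step 2: valid UTF-8 (NUL bytes rejected; assumption, see lead-in).
    if b"\x00" not in sample and _decodes(sample, "utf-8", complete):
        return "utf-8"
    # Step 3: looks like UTF-16 and decodes as UTF-16.
    candidate = _guess_utf16(sample)
    if candidate and _decodes(sample, candidate, complete):
        return candidate
    # Step 4: configured encoding, else the platform default, with a warning
    # when the caller will need character replacement to decode the file.
    fallback = configured or locale.getpreferredencoding(False)
    if not _decodes(data, fallback, True):
        warnings.warn(f"Invalid characters for {fallback}; they will be replaced")
    return fallback
```

      A caller would then decode with `data.decode(detect_encoding(data, configured), errors="replace")`, where `configured` stands for the value of sonar.sourceEncoding when it is set.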

              People

              • Assignee:
                duarte.meneses Duarte Meneses
                Reporter:
                julien.henry Julien Henry
              • Votes:
                3
                Watchers:
                5
