menu scancode-results-analyzer documentation

JSON Output Format

scancode-results-analyzer is meant to be used as a post-scan Plugin for Scancode, where after running a scan, the scan results are then analyzed for scan errors, and that information is added to the scancode JSON results.

Command Line Argument to use scancode-results-analyzer: --analyze-license-results

Here’s how example result-JSONs from scancode-results-analyzer could look like, post-analysis.

Structure and Location in Scancode JSON

The scancode JSON format has a list of files/directories (resources), and for each of those resources there is a list of license matches as dictionaries, with each license match in a dictionary, with relevant attributes and corresponding values.

Now similarly the license detection analysis results will also be a resource-level list, for each resource in the codebase this list of dictionary will be added, where each dictionary is for each corresponding file-region File Region, having the results of the analysis for all the match(es) in that file-region.

Note

[WIP] There would also be a codebase-level dictionary added, 1. With statistics on the license_detection issues. 2. All the unique license detection issues and their occurrences. 3. Header information.

{
  "headers": [
    {
      "tool_name": "scancode-toolkit",
      "tool_version": "3.1.1.post351.850399bc3",
      "options": {
        "input": [
          "/path_to/downloaded_licenses/"
        ],
        "--analyze-license-results": true
      }
    }
  ],
  "files": [
      {
          "path": "path/to/file",
          "licenses": [
              {
                  "path": "file1.cpp"
              }
          ],
          "license_detection_issues": [
              {
                "issue_category": "correct-detection"
              }
          ]
      },
      {
          "path": "path/to/file2",
          "licenses": [
              {
                  "path": "file2.cpp"
              }
          ],
          "license_detection_issues": [
              {
                  "issue_category": "false-positive"
              }
          ]
      }
  ]
}

Scan errors

The attribute license_detection_issues is a list of dictionaries, each dictionary representing a file-region, and containing analysis results for all the license matches in a file-region.

{
    "path": "path/to/file",
    "licenses": [
        {
            "path": "file1.cpp"
        }
    ],
    "license_detection_issues": [
        {
            "file_regions": [
                {
                    "start_line": 3,
                    "end_line": 3
                }
            ],
            "issue_category": "imperfect-match-coverage",
            "issue_description": "The license detection is incorrect, a large variation is present from the matched rule(s) and is matched to only one part of the whole text",
            "issue_type": {
                "classification_id": "notice-single-key-notice",
                "classification_description":  "a notice that notifies a single license",
                "is_license_text": false,
                "is_license_notice": true,
                "is_license_tag": false,
                "is_license_reference": false,
                "analysis_confidence": "high",
                "is_suggested_matched_text_complete": true
            },
            "suggested_license": {
                "license_expression": "gpl-2.0",
                "matched_text": "/* Published under the GNU General Public License V.2, see file COPYING */"
            },
            "original_licenses": [
                {
                    "key": "mit"
                }
            ]
        }
    ]
}

The attributes containing the analysis results are:

These 3 attributes in the analysis results has information on which file-region the matches are in.

  1. start_line and end_line marking the file-region.

  2. issue_category and issue_description categorizes and describes the issue.

  3. issue_type has further types of issues and their related attributes, listed below.

  4. original_license having the license matches with issues.

The issue type has these attributes:

  1. classification_id and classification_description

  2. 4 boolean fields is_license_text, is_license_notice, is_license_tag, and is_license_reference.

  3. is_suggested_matched_text_complete and analysis_confidence

Scan Errors per File-Region

This is a dict for every file-region, which has one or more matches in them, grouped together by location.

 {
    "files": [
        {
            "path": "scan-files/genshell.c",
            "licenses": [
              {
                "key": "lgpl-2.0"
              }
            ],
            "licence_detection_issues": [
                {
                    "start_line": 14,
                    "end_line": 34,
                    "issue_category": "imperfect-match-coverage",
                    "issue_description": "The license detection is inconclusive with high confidence, because only a small part of the rule text is matched.",
                    "issue_type": {
                        "classification_id": "notice-has-unknown-match",
                        "classification_description": "License notices with unknown licenses detected.",
                        "is_license_text": false,
                        "is_license_notice": true,
                        "is_license_tag": false,
                        "is_license_reference": false,
                        "analysis_confidence": "medium",
                        "is_suggested_matched_text_complete": true
                    },
                    "suggested_license": {
                        "license_expression": "lgpl-2.0-plus",
                        "matched_text": " *  licensed under the terms of the LGPL.... "
                    }
                },
                {
                    "start_line": 54,
                    "end_line": 62,
                    "issue_category": "extra-words",
                    "issue_description": "The license detection is conclusive with high confidence because all the rule text is matched, but some unknown extra words have been inserted in the text.",
                    "issue_type": {
                        "classification_id": "notice-single-key-notice",
                        "classification_description":  "A notice with a single license.",
                        "is_license_text": false,
                        "is_license_notice": true,
                        "is_license_tag": false,
                        "is_license_reference": false,
                        "analysis_confidence": "high",
                        "is_suggested_matched_text_complete": true
                    },
                    "suggested_license": {
                        "license_expression": "gpl-3.0-plus",
                        "matched_text": "\"genshellopt is free software: you can redistribute it and/or modify it under \\\nthe terms of the GNU General Public License as published by the Free Software \\\nFoundation, either version 3 of the License, or (at your option) any later \\\nversion."
                    },
                    "original_licenses": []
                }
            ]
        }
    ]
}

Generated Rules

There are 3 cases of file-regions and corresponding different outputs for each:

  1. Correct Detection. 1. Correct Detection

  2. Incorrect Detection but only one match in a file-region. 2. Incorrect Detection (one match)

  3. Incorrect Detection but multiple matches in a file-region. 3. Incorrect Detection (multiple matches)

1. Correct Detection

In case of a correct license detection the issue has no corresponding dictionary in license_detection_issues, and if all the licenses in a resource are correctly detected, it is an empty list.

{
    "license_detection_issues": []
}

2. Incorrect Detection (one match)

{
    "license_detection_analysis": [
        {
            "file_regions": [
                {
                    "start_line": 14,
                    "end_line": 34,
                }
            ],
            "issue_category": "imperfect-match-coverage",
            "issue_description": "The license detection is inconclusive with high confidence, because only a small part of the rule text is matched.",
            "issue_type": {
                "classification_id": "notice-has-unknown-match",
                "classification_description": "License notices with unknown licenses detected.",
                "is_license_text": false,
                "is_license_notice": true,
                "is_license_tag": false,
                "is_license_reference": false,
                "analysis_confidence": "medium",
                "is_suggested_matched_text_complete": true
            },
            "suggested_license": {
                "license_expression": "lgpl-2.0-plus",
                "matched_text": " *  licensed under the terms of the LGPL...."
            },
            "original_licenses": [
                {
                    "key": "unknown"
                },
                {
                    "key": "lgpl-2.0-plus"
                }
            ]
        }
    ]
}

3. Incorrect Detection (multiple matches)

{
    "license_detection_analysis": [
        {
            "file_regions": [
                {
                    "start_line": 14,
                    "end_line": 34,
                }
            ],
            "issue_category": "imperfect-match-coverage",
            "issue_description": "The license detection is inconclusive with high confidence, because only a small part of the rule text is matched.",
            "issue_type": {
                "classification_id": "notice-has-unknown-match",
                "classification_description": "License notices with unknown licenses detected.",
                "is_license_text": false,
                "is_license_notice": true,
                "is_license_tag": false,
                "is_license_reference": false,
                "analysis_confidence": "medium",
                "is_suggested_matched_text_complete": true
            },
            "suggested_license": {
                "license_expression": "lgpl-2.0-plus",
                "matched_text": " *  licensed under the terms of the LGPL. "
            }
        }
    ]
}

License Detection Issues Summary

Along with a resource level attribute with the respectice license detection issues for a file (file-wise approach), there’s also a codebase level attribute having the summary, which contains two main parts:

  1. All unique license detection issues (and their occurances)

  2. Statistics on license detection and their issues

This provides a different view, i.e. issue wise view, which is easier to review and if applicable, resolve.

All Unique License Detection Issues

"unique_license_detection_issues": [
    {
        "unique_identifier": 1,
        "files": [
            {
                "path": "1921-socat-2.0.0-error.h",
                "start_line": 3,
                "end_line": 3
            }
        ],
        "license_detection_issue": {
            "issue_category": "imperfect-match-coverage",
            "issue_description": "The license detection is inconclusive with high confidence, because only a small part of the rule text is matched."
        }
    }
]

Basic Statistics

{
"statistics": {
    "total_files_with_license": 1,
    "total_files_with_license_detection_issues": 1,
    "total_unique_license_detection_issues": 1,
    "issue_category_counts": {
        "imperfect-match-coverage": 1
    },
    "issue_classification_id_counts": {
        "notice-single-key-notice": 1
    },
    "analysis_confidence_counts": {
        "high": 1
    },
    "license_info_type_counts": {
        "license_notice": 1
    }
}

Header Text

This could be an optional, codebase-level header dict, which has details on the analyzer and BERT model versions used.

Note

This is Work In Progress.

{
    "header": {
        "tool_name": "scancode-results-analyzer",
        "version": 0.1,
        "cases_version": 0.1,
        "ml_models": [
            {
                "name": "lic-class-scancode-bert-base-cased-L32-1",
                "type": "sentence-classifier-bert",
                "link": "https://huggingface.co/ayansinha/lic-class-scancode-bert-base-cased-L32-1",
                "model": "BertBaseCased",
                "Sentence Length": 32,
                "Labels": 4,
                "Label Names": {
                  "License Text": 1,
                  "License Notice": 2,
                  "License Tag": 3,
                  "License Reference": 4
                }
            },
            {
                "name": "false-positives-scancode-bert-base-uncased-L8-1",
                "type": "sentence-classifier-bert",
                "link": "https://huggingface.co/ayansinha/false-positives-scancode-bert-base-uncased-L8-1",
                "model": "BertBaseUncased",
                "Sentence Length": 8,
                "Labels": 2,
                "Label_Names": {
                  "License Tag": 1,
                  "False Positive": 2
                }
            }
        ],
        "low_score_threshold": 95,
        "group_location_lines_threshold": 4
    }
}